So far we have encountered two ways of thinking about regular languages: (i) through finite state machines; (ii) through regular expressions. These approaches have their roots in the study of machines rather than, as you had probably been expecting, the study of natural languages. The third approach, you will be relieved to hear, is one that has its roots in the study of natural languages after all.
Many years ago (when I was at school) children were taught parsing. We were told things like the following. Sentences break down into a subject followed by a verb followed by an object; or perhaps they break down into a noun phrase followed by a verb followed by a noun phrase. These constituents break down in turn: noun phrases are composed of a determiner followed by an adjective followed by a noun.
This breaking-down process is important. The idea is that the way in which we assemble a sentence from the parts into which we have broken it down will tell us how to recover the meaning of the whole sentence from the meanings of its constituents.
If we start off with an alphabet Σ = {dog, cat, the, a, some, many, walk, swim, all, . . . } then
Rules like
Sentence → Subject Verb Object
and
NounPhrase → Determiner Adjective Noun
have the potential, once equipped with further rules like
Determiner → the | a | some | many
Verb → swim
Verb → walk
Noun → dog
Noun → cat
to generate words in a language L ⊆ Σ*. This time 'language' really does mean something like 'language' in the ordinary sense, and 'word' now means something like 'sentence': the 'words' that we generate are things that in an ordinary context would be called sentences. And this time we think of 'dog' not as a string of three characters but as a single character from our alphabet.
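To see the generation process in action, here is a minimal sketch in Python. The dict representation, the function name derive, and the adjectives 'small' and 'wet' are illustrative choices of mine rather than anything fixed by the grammar above; I have also used the noun-phrase version of the sentence rule, since that is the one that bottoms out in the terminals we listed.

    import random

    # A toy grammar: each nonterminal maps to a list of right-hand sides.
    # A symbol counts as a nonterminal iff it has an entry in this dict;
    # anything else ('the', 'dog', ...) is a terminal, i.e. a character of Σ.
    grammar = {
        "Sentence":   [["NounPhrase", "Verb", "NounPhrase"]],
        "NounPhrase": [["Determiner", "Adjective", "Noun"]],
        "Determiner": [["the"], ["a"], ["some"], ["many"]],
        "Adjective":  [["small"], ["wet"]],      # invented for illustration
        "Verb":       [["swim"], ["walk"]],
        "Noun":       [["dog"], ["cat"]],
    }

    def derive(symbol):
        """Rewrite `symbol` with randomly chosen productions until only
        terminals remain, returning the generated word as a list of terminals."""
        if symbol not in grammar:               # a terminal: nothing to rewrite
            return [symbol]
        rhs = random.choice(grammar[symbol])    # pick one production for it
        return [t for s in rhs for t in derive(s)]

    print(" ".join(derive("Sentence")))   # e.g. 'the small dog walk many wet cat'

Notice that the output is grammatically crude ('the small dog walk . . . '): the machinery cares only about which strings are derivable, not about agreement between the parts.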
The languages that we have seen earlier in this course can be generated by rules like this. For example

S → aS
S → ε

generates every string in the language L(a*), and a set of rules like

S → aSa
S → bSb
S → a
S → b
S → ε

generates the language of palindromes over the alphabet Σ = {a, b}. (The productions S → a and S → b are needed for the palindromes of odd length; without them we would get only the even-length ones.)
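As a sanity check on that claim, here is a sketch, using the same dict representation and some invented helper names, that enumerates every string this grammar derives up to a given length and verifies that each one reads the same backwards:

    from collections import deque

    def strings_up_to(grammar, start, max_len):
        """All terminal strings derivable from `start` with length <= max_len,
        found by breadth-first search over sentential forms."""
        results, seen = set(), set()
        queue = deque([(start,)])
        while queue:
            form = queue.popleft()
            # position of the first nonterminal in this sentential form, if any
            i = next((k for k, s in enumerate(form) if s in grammar), None)
            if i is None:                        # all terminals: a finished word
                results.add("".join(form))
                continue
            for rhs in grammar[form[i]]:
                new = form[:i] + tuple(rhs) + form[i + 1:]
                # prune forms whose terminals already exceed the length bound
                if sum(1 for s in new if s not in grammar) <= max_len and new not in seen:
                    seen.add(new)
                    queue.append(new)
        return results

    # S → aSa | bSb | a | b | ε   (the grammar above; [] encodes ε)
    palindromes = {"S": [["a", "S", "a"], ["b", "S", "b"], ["a"], ["b"], []]}
    for w in sorted(strings_up_to(palindromes, "S", 4), key=lambda w: (len(w), w)):
        assert w == w[::-1]                      # each generated string is a palindrome
        print(repr(w))                           # '', 'a', 'b', 'aa', 'bb', 'aaa', ...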
These bundles of rules are called grammars, and the rules in each bundle are called productions.
There are two sorts of characters that appear in productions. Nonterminals appear on the left-hand side of productions (and sometimes on the right); these are things like 'Noun' and 'Verb'. Terminals are the characters from the alphabet of the language we are trying to build, and they only ever appear on the right-hand side of a production; examples here are 'swim', 'walk', 'cat' and 'dog'.
Notice that there is a grammar that generates the language of palindromes over Σ = {a, b} even though this language is not regular. Grammars that generate regular languages have special features that mark them off. One could ascertain what these features are by reverse-engineering the definition of regular language from regular expressions or finite state machines, but we might as well just give the definition straight off.
A regular grammar is one in which all productions are of the form

N → TN'

or

N → T

where N and N' are nonterminals and T is a string of terminals.
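This restriction is easy to test mechanically. Below is a sketch, in the same dict representation as before (where a symbol counts as a nonterminal exactly when it has productions of its own, and is_regular is a name of my own invention), that checks each right-hand side is a string of terminals optionally ending in a single nonterminal:

    def is_regular(grammar):
        """True iff every production has the form N → TN' or N → T: each
        right-hand side is a string of terminals, optionally followed by
        one nonterminal in the final position."""
        for rhs_list in grammar.values():
            for rhs in rhs_list:
                # every symbol before the last must be a terminal; the last
                # symbol may be either a terminal or a nonterminal
                if any(s in grammar for s in rhs[:-1]):
                    return False
        return True

    print(is_regular({"S": [["a", "S"], []]}))                        # True:  L(a*)
    print(is_regular({"S": [["a", "S", "a"], ["b", "S", "b"], []]}))  # False

The palindrome grammar fails because S → aSa has the nonterminal sandwiched between terminals, which is exactly what the definition forbids.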
The other illustrations I have given are of grammars not having this restriction; they are context-free. The reason for this nomenclature is that every production has a single nonterminal on its left-hand side, so a nonterminal can be rewritten in the same way wherever it occurs, regardless of the symbols surrounding it. (Grammars whose rules are allowed to depend on that surrounding context are called context-sensitive.)