Theoretical Computer Science III
COS3701
Context-Free Grammars
SYNTAX AS A METHOD FOR DEFINING LANGUAGES
Several of the adjustments that had to be made were already in use in the scientific literature for various other
reasons. For example, the use of the slash as a divide sign was already accepted by the mathematical public. Most
publishers had special symbols for the popular fractions such as ½ and ¼, but eight-elevenths was customarily
written as 8/11 .
Still, before the days of the computer no one would ever have dreamed of writing a complicated compound
fraction such as
in the parentheses-laden one-line notation
(( 1 /2) + 9)/(4 + (8/2 1 ) + (5/(3 + ( 1 /2))))
How can a computer scan over this one-line string of typewriter characters and figure out what is going on? That
is, how can a computer convert this string into its personal language of LOAD this, STORE that, and so on?
The conversion from a "high-level" language into a machine-executable language is done by a program called the
compiler. This is a superprogram. Its input data are other programs. It processes them and prints out an equivalent
program written in machine or assembler language. It cannot just look at the expression and understand it. Rules
must be given by which this string can be processed-rules.
Before we try to build a compiling machine, let us return to the discussion of what is and what is not a valid
arithmetic expression as defined in Chapter 3 by recursive definition (p. 25).
Rule 1 Any number is in the set AE.
Rule 2 If x and y are in AE, then so are
(x) - (x) (x + y) (x - y) (X * y) (x/y) (x**y)
This time we have included parentheses around every component factor. This avoids the ambiguity of expressions
like 3 + 4 * 5 and 8/4/2 by making them illegal. First, we must design a machine that can figure out how a given
input string was built up from these basic rules. Then we should be able to translate this sequence of rules into an
assembler language program, because all these rules are pure assembler language instructions.
For example, if we present the input string
((3 + 4) * (6 + 7))
and the machine discovers that the way this can be produced from the rules is by the sequence
3 is in AE
4 is in AE
(3 + 4) is in AE
6 is in AE
7 is in AE
(6 + 7) is in AE
((3 + 4) * (6 + 7)) is in AE
we can therefore algorithmically convert this into
, LOAD 3 in register 1
LOAD 4 in register 2
ADD the contents of register 2 into register 1
LOAD 6 in register 3
LOAD 7 in register 4
ADD the contents of register 3 into register 4
MULTIPLY register I by register 4
or some such sequence of instructions depending on the architecture of the particular machine
The hard part of the problem is to figure out by mechanical means how the input string can be produced from the
rules. The second part- given the sequence of rules that create the expression, to convert it into a computer
program to evaluate the expression - is easy.
A grammar is the set of rules by which the valid sentences in a language are constructed. The rules by which
sentences are made are an example of an organically evolved recursive definition. Our ability to understand what a
sentence means is based on our ability to understand how it could be formed from the rules of grammar.
Determining how a sentence can be formed from the rules of grammar is called parsing the sentence.
Because the English word "grammar" can mean the study of grammar as well as the set of rules themselves, we
sometimes refer to the set of rules as forming a generative grammar. This emphasizes the point that from them
and a dictionary (the alphabet) we can generate all the sentences (words) in the language.
Rules that involve the meaning of words we call semantics and rules that do not involve the meaning of words we
call syntax. In English, the meaning of words can be relevant, but in arithmetic the meaning of numbers is rarely
cataclysmic. In the high-level computer languages, one number is as good as another. If
X=B+9
is a valid formulation, then so are
X=B+8 X = B + 473 X=B+9999
So long as the constants do not become so large that they are out of range, we do not try to divide by 0, take the
square root of a negative number, and we do not mix fixed-point numbers with floating-point numbers in bad
ways, one number is as good as another. In general , the rules of computer language grammar are all syntactic and
not semantic, which makes the task of interpretation much easier. There is another way in which the parsing of
arithmetic expressions is easier than the parsing of English sentences
Let us go back to the analogy between computer languages and English. Some of the rules of English grammar are
these :
1. A sentence can be a subject followed by a predicate.
2. A subject can be a noun-phrase.
3. A noun-phrase can be an adjective followed by a noun-phrase.
4. A noun-phrase can be an article followed by a noun-phrase.
5. A noun-phrase can be a noun.
6. A predicate can be a verb followed by a noun-phrase.
7. A noun can be
• apple bear cat dog
8. A verb can be
• eats follows gets hugs
9. An adjective can be
• itchy jumpy
10. An article can be
• a an the
,Within this small model of English, there are hundreds of sentences we can form - for example,
The itchy hear hugs the jumpy dog.
The method by which this sentence can be generated is outlined here:
➢ subject predicate Rule I
➢ noun-phrase predicate Rule 2
➢ noun-phrase verb noun-phrase Rule 6
➢ article noun-phrase verb noun-phrase Rule 4
➢ article adjective noun-phrase verb noun-phrase Rule 3
➢ article adjective noun verb noun-phrase Rule 5
➢ article adjective noun verb article noun-phrase Rule 4
➢ article adjective noun verb article adjective noun-phrase Rule 2
➢ article adjective noun verb article adjective noun Rule 5
➢ the adjective noun verb article adjective noun Rule 10
➢ the itchy noun verb article adjective noun Rule 9
➢ the itchy bear verb article adjective noun Rule 7
➢ the itchy bear hugs article adjective noun Rule 8
➢ the itchy bear hugs the adjective noun Rule 10
➢ the itchy bear hugs the jumpy noun Rule 9
➢ the itchy bear hugs the jumpy dog Rule 7
A law of grammar is in reality a suggestion for possible substitutions. The arrow (=>) indicates that a substitution
was made according to the preceding rules of grammar. What happened above is that we started out with the
initial symbol sentence. We then applied the rules for producing sentences listed in the generative grammar.
Once we have put in the word "bear," we are stuck with it. No rule of grammar says that a bear can be replaced by
anything else. The words that cannot be replaced by anything are called terminals. Words that must be replaced
by other things we call nonterminals.
Midway through the production procedure, we developed the sentence into as many nonterminals as it was going
to become.
article adjective noun verb article adjective noun
From this point on, the procedure was only one of selecting which terminals were to be inserted in place of the
nonterminals. This middle stage in which all the terminals are identified by their nonterminal names is the
"grammatical parse" of the sentence.
We have allowed a noun-phrase to be a n adjective followed b y a noun-phrase. This could lead to
noun-phrase
➢ adjective noun-phrase
➢ adjective adjective noun-phrase
➢ adjective adjective adjective noun-phrase
➢ adjective adjective adjective noun
➢ itchy adjective adjective noun
➢ itchy itchy adjective noun
➢ itchy itchy itchy noun
➢ itchy itchy itchy bear
If we so desired, we could produce 50 itchy 's. Using the Kleene closure operator, we could write something like
noun-phrase => adjective* noun
But now, we are getting ahead of ourselves.
, The rules we have given for this simplified version of English allow for many dumb sentences, such as
Itchy the apple eats a jumpy jumpy jumpy bear.
Because we are not considering the limitations of semantics, diction, or good sense, we must consider this string of
terminals as a legitimate sentence. This is what we mean by the phrase "formal language," which we used in Part I.
It is a funny phrase because it sounds as if we mean the stuffy language used in aristocratic or diplomatic circles. In
our case, it means only that any string of symbols satisfying the rules of grammar (syntax alone) is as good as any
other. The word "formal" here means "strictly formed by the rules," not "highly proper."
We can follow the same model for defining arithmetic expressions. We can write the whole system of rules of
formation as the list of possible substitutions shown below :
Start → (AE)
AE → (AE + AE)
AE → (AE - AE)
AE → (AE * AE)
AE → AE I AE)
AE → (AE ** AE)
AE → (AE)
AE → - (AE)
AE → ANY-NUMBER
Here, we have used the word "Start" to begin the process, as we used the symbol "sentence" in the sample of
English. Aside from Start, the only other nonterminal is AE. The terminals are the phrase "ANY-NUMBER" and the
symbols
+ - * / ** ()
Either we could be satisfied that we know what is meant by the words "any number," or else we could define this
phrase by a set of rules, thus converting it from a terminal into a nonterminal.
Rule I ANY-NUMBER → FIRST-DIGIT
Rule 2 FIRST-DIGIT → FIRST-DIGIT OTHER-DIGIT
Rule 3 FIRST-DIGIT → 1 2 3 4 5 6 7 8 9
Rule 4 OTHER-DIGIT → O 1 2 3 4 5 6 7 8 9
Rules 3 and 4 offer choices of terminals. We put spaces between them to indicate "choose one," but we soon shall
introduce another disjunctive symbol. We can produce the number 1066 as follows:
ANY-NUMBER => FIRST-DIGIT Rule 1
=> FIRST-DIGIT OTHER-DIGIT Rule 2
=> FIRST-DIGIT OTHER-DIGIT OTHER-DIGIT Rule 2
=> FIRST-DIGIT OTHER-DIGIT OTHER-DIGIT OTHER-DIGIT Rule 2
=> 1066 Rule 3 and 4
Here, we have made all our substitutions of terminals for nonterminals in one fell swoop, but without any possible
confusion. One thing we should note about the definition of AE is that some of the grammatical rules involve both
terminals and nonterminals together.
In English, the rules were either of the form
One Nonterminal -+ string of Nonterminals
or
One Nonterminal -+ choice of terminals
In our present study, we shall see that the form of the rules in the grammar has great significance.