Languages, compilers and (context-free) grammars
What is a compiler and a parser?
A compiler translates one language into another (which is possibly the same), roughly:
Get hold of the structure of the input program attach semantics to a sequence of symbols check whether a program
makes sense optimize generate good machine code.
A parser takes a string of characters and tries to recognize a structure in the form of a tree.
Alphabet, languages and words
An alphabet is a (finite) set of symbols that can be used to form sentences, e.g.
The set of all Latin letters
Given such a set, we can consider (finite) sequences of elements of that set (in which the empty sequence is often written
as ). The set of sequences of a given set is written as and defined as:
the empty sequence is in .
if and , then is in .
(We typically use letters from the beginning of the (Latin) alphabet to represent single symbols, and letters from the end of
the alphabet to represent sequences.)
A word is a sequence of symbols from the alphabet. A language is a set of "correct" sentences of words. Thus, a language
is a subset of .
Grammars
Languages can be defined using inductive definitions, represented formally by making use of deduction rules.
Rule Meaning
If … and are true, then is true
is true (= axiom)
Grammars are a shorthand way to define a language inductively, by means of rewrite rules called productions (instead of
deduction rules). A grammar consists of multiple productions (e.g. in the grammar of palindromes).
Symbols from the alphabet are also called terminals in a grammar.
, The grammar makes use use of auxiliary symbols called nonterminals, that are not part of the alphabet and hence
cannot be part of the final word/sentence. (e.g. in the example production above.)
Starting from a nonterminal, we can rewrite successively until we reach a string of terminals. Such a sequence
is called a derivation.
A nonterminal could be a start symbol, often denoted as . (Context-free grammars have only one start
symbol.)
Note that not all languages can be generated/described by a grammar. On the other hand, multiple grammars may describe
the same language in which case these grammars are equivalent.
Context-free grammars
Grammars where the left hand side of a production always consists of a single nonterminal are called context-free
grammars. Languages that can be described by context-free grammar are called context-free languages.
A context-free grammar is a four-tuple , where:
is a finite set of terminal symbols.
is a finite set of nonterminal symbols.
is a finite set of production rules, each of form , where is a nonterminal and is a sequence of terminals
and nonterminals.
is the start symbol, .
Because context-free languages are relatively easy to deal with algorithmically, most programming languages are context-
free languages.
BNF (Backus Naur Form) and EBNF
Instead of writing every production on a single line, we may rewrite any number of rewrite rules for one nonterminal using a
shorthand notation called BNF, by using the symbol. e.g.:
can be combined to: .
Extended BNF (EBNF) is introduced to help abbreviate a number of standard constructions that usually occur often in the
syntax of a programming language:
and/or , means one or zero occurrences of nonterminal (i.e. optional).
, means one or more occurrences of nonterminal .
and/or , means zero or more occurrences of nonterminal .
These notations can be used for languages, grammars, nonterminals and sequences of terminal and nonterminal symbols.
, The language of a grammar
The language of a grammar , usually denoted , is defined as
Parse trees and ambiguity
We can visualize a derivation as a parse tree (a.k.a. abstract syntax tree (AST)). For example, for the grammar
the following derivation has the following parse tree:
The set of all equivalent derivations can be represented by selecting a so-called canonical element. A good candidate for
this element is the leftmost derivation. In a leftmost derivation, the leftmost nonterminal is rewritten in each step. (This is
not the case in the example above.)