A context-free grammar (CFG), also known as Backus-Naur Form (BNF), is a common notation for specifying the syntax of a language.
For example, an "IF-ELSE" statement in the C language has the form
IF (Expr) stmt
ELSE stmt
In other words, it is the concatenation of: the keyword IF, an opening parenthesis, an expression, a closing parenthesis, a statement, the keyword ELSE, and a second statement.
The syntax of an 'IF-ELSE' statement can be specified by the following
'production rule' in the CFG.
stmt → IF (Expr) stmt ELSE stmt
The arrow ( → ) is read as "can have the form".
A context-free grammar (CFG) has four components:
1. A set of tokens, known as terminal symbols.
2. A set of nonterminals.
3. A set of productions, each consisting of a nonterminal (the left side), an arrow, and a sequence of tokens and/or nonterminals (the right side).
4. A designation of one of the nonterminals as the start symbol.
Multiple productions with the same nonterminal on the left, like:
list → list + digit
list → list - digit
list → digit
may be grouped together separated by vertical bars, like:
list → list + digit | list - digit | digit
Parsing
Parsing is the process of finding a parse tree for a string of tokens. Equivalently, it is the process of determining whether a string of tokens can be generated by a grammar.
The worst-case time of parsing algorithms is O(n^3), but the typical case for programming languages is O(n) time.
For example, the production rules of grammar G are:
list → list + digit | list - digit | digit
digit → 0 | 1 | . . . | 9
Given the token string 9-5+2, the parse tree is:
diagram
Each node in the parse tree is labeled by a grammar symbol.
An interior node corresponds to the left side of a production;
the children of the interior node correspond to the right side of that production.
The language defined by a grammar is the set of all token strings that can be derived from its start symbol.
The language defined by the grammar:
list → list + digit | list - digit | digit
digit → 0 | 1 | 2 | . . . | 9
contains all lists of digits separated by plus and minus signs.
An epsilon, ε, on the right side of a production denotes the empty string.
Ambiguity
A grammar is ambiguous if two or more different parse trees can derive the same token string. Equivalently, an ambiguous grammar allows two different derivations for a token string.
A grammar for a compiler should be unambiguous, since different parse trees would give a token string different meanings.
Consider the following grammar
string → string + string
| string - string
| 0 | 1 | . . . | 9
To show that a grammar is ambiguous, all we need is to find a single token string that has more than one parse tree.
Figure:23 --- pg.31
The above figure shows two different parse trees for the token string 9 - 5 + 2 that correspond to two different ways of parenthesizing the expression:
(9 - 5) + 2 and 9 - (5 + 2).
The first parenthesization evaluates to 6; the second evaluates to 2.
Perhaps, the most famous example of ambiguity in a programming language is the dangling 'ELSE'.
Consider the grammar G with the production:
S → IF b THEN S ELSE S
| IF b THEN S
| a
G is ambiguous since the sentence
IF b THEN IF b THEN a ELSE a
has two different parse trees or derivation trees.
Parse tree I
figure
This parse tree imposes the interpretation
IF b THEN (IF b THEN a ) ELSE a
Parse Tree II
Figure
This parse tree imposes the interpretation
IF b THEN (IF b THEN a ELSE a)
The reason that the grammar G is ambiguous is that an 'ELSE' can be associated with two different THENs. For this reason, programming languages that allow both IF-THEN-ELSE and IF-THEN constructs can be ambiguous.
Removal of Ambiguity
For compiling applications we need to design unambiguous grammar, or to use
ambiguous grammar with additional rules to resolve the ambiguity.
1. Associativity of Operators:
If an operand has operators on both sides, then by convention the
operand should be associated with the operator on the left.
In most programming languages, arithmetic operators like addition, subtraction, multiplication, and division are left associative.
figure 24 on pg. 31
In the C programming language the assignment operator, =, is right associative. That is, token string a = b = c should be treated as a = (b = c).
Figure
2. Precedence of Operators:
An expression 9 + 5 * 2 has two possible interpretations:
(9 + 5) * 2 and 9 + (5 * 2)
The associativity of '+' and '*' does not resolve this ambiguity. For this
reason, we need to know the relative precedence of operators.
The convention is to give multiplication and division higher precedence than
addition and subtraction.
Only when we have operators of equal precedence do we apply the rules of
associativity.
So, in the example expression 9 + 5 * 2, we perform the operation of higher
precedence, i.e., *, before the operation of lower precedence, i.e., +.
Therefore, the correct interpretation is 9 + (5 * 2).
3. Separate Rule:
Consider the following grammar and language again.
S → IF b THEN S ELSE S
| IF b THEN S
| a
The ambiguity can be removed if we arbitrarily decide that an ELSE should be attached to the last preceding unmatched THEN, like:
Figure
We can revise the grammar to have two nonterminals, S_{1} and S_{2}. We insist that S_{2} generates only matched IF-THEN-ELSE statements, while S_{1} is free to generate either kind of statement.
The rules of the new grammar are:
S_{1} → IF b THEN S_{1}
| IF b THEN S_{2} ELSE S_{1}
| a
S_{2} → IF b THEN S_{2} ELSE S_{2}
| a
Although there is no general algorithm that can be used to determine whether a given grammar is ambiguous, it is certainly possible to isolate rules that lead to ambiguity.
A grammar containing the productions
A → AA | α
is ambiguous, because the string AAA has more than one parse tree.
Figure
This ambiguity disappears if we use the productions
A → AB | B
B → α
or
A → BA | B
B → α
Syntax of Expressions:
A grammar of arithmetic expressions looks like:
expr → expr + term | expr - term | term
term → term * factor | term / factor | factor
factor → id | num | (expr)
That is, an expr is a string of terms separated by '+' and '-', a term is a
string of factors separated by '*' and '/', and a factor is a single
operand or an expression wrapped inside parentheses.
Modern compilers use syntax-directed translation to interleave the actions of the compiler phases.
The syntax analyzer directs the whole process during the parsing of the source code.
The actions of the semantic analyzer and the intermediate code generator require the passage of information up and/or down the parse tree.
We think of this information as attributes attached to the nodes of the parse tree and the parser moving this information between parent nodes and children nodes as it performs the productions of the grammar.
Postfix Notation:
Postfix notation, also called reverse Polish notation or RPN, places each binary
arithmetic operator after its two operands instead of between them.
Infix expression: (9 - 5) + 2 = (9 5 -) + 2 = (9 5 -) 2 + = 9 5 - 2 + : postfix notation
Infix expression: 9 - (5 + 2) = 9 - (5 2 +) = 9 (5 2 +) - = 9 5 2 + - : postfix notation
Why postfix notation?
There are two reasons: no parentheses are needed, since the position and number of operands of each operator allow only one decoding; and the expression can be evaluated in a single left-to-right scan.
Syntax-Directed Definitions:
For example, let the grammar contain the production:
X → Y Z
And let the nodes X, Y and Z have associated attributes X.a, Y.a and Z.a respectively.
The annotated parse tree looks like:
diagram
If the semantic rule
{X.a := Y.a + Z.a}
is associated with the production
X → Y Z
then the parser should add attribute 'a' of node Y and attribute 'a' of node Z
together and set attribute 'a' of node X to their sum.
Synthesized Attributes:
An attribute is synthesized if its value at a parent node can be determined from
attributes of its children.
diagram
Since, in this example, the value of attribute 'a' at node X can be determined
from attribute 'a' of nodes Y and Z, attribute 'a' is a synthesized attribute.
Synthesized attributes can be evaluated by a single bottom-up traversal of the
parse tree.
Example 2.6: The following figure shows the syntax-directed definition of an infix-to-postfix translator.
Figure 2.5 Pg. 34
PRODUCTION                SEMANTIC RULE
expr → expr_{1} + term    expr.t := expr_{1}.t || term.t || '+'
expr → expr_{1} - term    expr.t := expr_{1}.t || term.t || '-'
expr → term               expr.t := term.t
term → 0                  term.t := '0'
term → 1                  term.t := '1'
  :                         :
term → 9                  term.t := '9'
Parse tree corresponding to the productions
Diagram
Annotated parse tree corresponding to the semantic rules.
Diagram
The above annotated parse tree shows how the input infix expression 9 - 5 + 2 is translated to the postfix expression 9 5 - 2 + at the root.
Depth-First Traversals
A depth-first traversal of a parse tree is one way of evaluating attributes.
Note that a syntax-directed definition does not impose any particular evaluation order, as
long as the order computes the attributes of a parent after all of its children's attributes.
PROCEDURE visit (n: node)
BEGIN
FOR each child m of n, from left to right
DO visit (m);
Evaluate semantic rules at node n
END
Diagram
Translation Schemes
A translation scheme is another way of specifying a syntax-directed translation.
This scheme is a CFG in which program fragments called semantic actions are
embedded within the right sides of productions.
For example,
rest → + term {print( ' + ' )} rest
indicates that a '+' sign should be printed between:
Diagram
Ex. 2.8
REVISION: SYNTAX-DIRECTED TRANSLATION
Step 1: Syntax-directed definition for translating infix expressions to postfix form.
PRODUCTION                SEMANTIC RULE
expr → expr_{1} + term    expr.t := expr_{1}.t || term.t || '+'
expr → expr_{1} - term    expr.t := expr_{1}.t || term.t || '-'
expr → term               expr.t := term.t
term → 0                  term.t := '0'
term → 1                  term.t := '1'
  :                         :
term → 9                  term.t := '9'
Step 2: A translation scheme derived from syntax-direction definition is :
Figure 2.15 on pg. 39
expr → expr + term    {print( ' + ' )}
expr → expr - term    {print( ' - ' )}
expr → term
term → 0              {print( ' 0 ' )}
term → 1              {print( ' 1 ' )}
  :                      :
term → 9              {print( ' 9 ' )}
Step 3: A parse tree with actions translating 9 - 5 + 2 into 95 - 2 +
Figure 2.14 on pg. 40
Note that it is not necessary to actually construct the parse tree.
Parsing is the process of determining whether a string of tokens can be generated by a grammar. A parser must be capable of constructing the parse tree, or else the translation cannot be guaranteed correct. For any language that can be described by a CFG, parsing requires at most O(n^3) time to parse a string of n tokens. However, most programming languages are so simple that a parser requires just O(n) time with a single left-to-right scan over the input string of n tokens.
There are two types of parsing:
1. Top-down parsing
- Easy to generate by hand.
- Examples are: recursive-descent, predictive.
2. Bottom-up parsing
- Not easy to construct by hand; usually compiler-generating software generates bottom-up parsers.
- But handles a larger class of grammars.
- Example is the LR parser.
Top-down Parsing
Consider the CFG with productions:
expr → term rest
rest → + term rest | - term rest | ε
term → 0 | 1 | . . . | 9
In the example above, the grammar made it easy for the top-down parser to
pick the correct production in each step.
This is not true in general; see the example of the dangling 'else'.
Predictive Parsing:
Recursive-descent parsing is a top-down method of syntax analysis that executes
a set of recursive procedures to process the input. A procedure is associated
with each nonterminal of a grammar.
Predictive parsing is a special form of recursive-descent parsing, in which the current input token unambiguously determines the production to be applied at each step.
Let the grammar be:
expr → term rest
rest → + term rest | - term rest | ε
term → 0 | 1 | . . . | 9
In recursive-descent parsing, we write code for each nonterminal of a grammar. In the case of the above grammar, we should have three procedures, corresponding to the nonterminals expr, rest, and term.
Since there is only one production for nonterminal expr, the procedure expr is:
expr ( )
{
term ( );
rest ( );
return
}
Since there are three productions for rest, procedure rest uses a global variable, 'lookahead', to select the correct production, or simply selects "no action", i.e., the
ε-production, when the lookahead variable is neither + nor -.
rest ( )
{
IF (lookahead = = '+') {
match ( ' + ' );
term ( );
rest ( );
return
}
ELSE IF ( lookahead = = '-') {
match (' - ');
term ( );
rest ( );
return
}
ELSE {
return;
}
}
The procedure term checks whether global variable lookahead is a digit.
term ( )
{
IF ( isdigit (lookahead)) {
match (lookahead);
return;
}
ELSE {
ReportError ( );
}
}
After loading the first input token into the variable 'lookahead', the predictive parser
is started by calling the procedure for the start symbol, expr.
If the input is error free, the parser conducts a depth-first traversal of the
parse tree and returns to the caller routine through expr.
Problem with Predictive Parsing: left recursion
Left Recursion:
A production is left-recursive if the leftmost symbol on the right side is the
same as the nonterminal on the left side. For example,
expr → expr + term.
If one were to code this production in a recursive-descent parser, the parser would go into an infinite loop.
diagram
We can eliminate the left recursion by introducing new nonterminals and new production rules.
For example, the left-recursive grammar is:
E → E + T | T
T → T * F | F
F → (E) | id
We can redefine E and T without left recursion as:
E → TE`
E` → + TE` | ε
T → FT`
T` → * FT` | ε
F → (E) | id
In general, a left-recursive pair of productions A → Aα | β can be replaced by the non-left-recursive productions A → βA` and A` → αA` | ε.
Getting rid of such immediate left recursion is not enough. One must also get rid
of indirect left recursion, where two or more nonterminals are mutually
left-recursive.