Chapter-2

Chapter 2 Syntax Definition

A contex free grammar, CFG, (synonyms: Backus-Naur Firm of BNF) is a common notation for specifying the syntax of a languages

For example, an "IF-ELSE" statement in c-language has the form
IF (Expr) stmt ELSE stmt

In other words, it is the concatenation of:

the keyword IF ;
an opening parenthesis ( ;
an expression Expr ;
a closing parenthesis ) ;
a statement stmt ;
a keyword ELSE ;
Finally, another statement stmt.

The syntax of an 'IF-ELSE' statement can be specified by the following 'production rule' in the CFG.
stmt → IF (Expr) stmt ELSE stmt

The arrow ( → ) is read as "can have the form".

A context-free grammar (CFG) has four components:

A set of tokens called terminals.
A set of variable called nonterminals.
A set of production rules.
A designation of one of the nonterminals as the start symbol.

Multiple production with the same nonterminal on the left like:

list → + digit
list → - digit
list →

may be grouped together separated by vertical bars, like:

list → list + digit | list - digit | digit

Parsing
The parsing is a process of finding a parse tree for a string of tokens. Equivalently, it is a process of determining whether a string of tokens can be generated by a grammar.
The worst-case time pf parsing algorithms are O(nⁿ³) but typical is : O(n) time.

For example. The production rules of grammar G is:

list → list + digit | list - digit | digit
digit → 0 | 1 | . . . | 9

Given token string is 9-5+2.

Parse tree is:

diagram

Each node in the parse tree is labeled by a grammar symbol.
the interior node corresponds to the left side of the production.
the children of the interior node corresponds to the right side of production.

The language defined by a grammar is the set of all token strings can be derived from its start symbol.

The language defined by the grammar:

list → list + digit | list - digit | digit
digit → 0 | 1 | 2 | . . . | 9

contains all lists of digits separated by plus and minus signs.

The Epsilon, E, on the right side of the production denotes the empty string.

Ambiguity
A grammar is ambiguous if two or more different parse trees can be desire the same token string. Equivalently, an ambiguous grammar allows two different derivations for a token string.
Grammar for complier should be unambiguous since different parse trees will give a token string different meaning.

Consider the following grammar

string → string + string
| string - string
| 0 | 2 | . . . | 9

To show that a grammar is ambiguous all we need to find a "single" stringthat has more than one perse tree.

Figure:23 --- pg.31

Above figure show two different parse trees for the token string 9 - 5 + 2 that corresponds to two different way of parenthesizing the expression:

( - 5) + 2 and 9 -(5 + 2).

The first parenthesization evaluates to 2.

Perhaps, the most famous example of ambiguity in a programming language is the dangling 'ELSE'.

Consider the grammar G with the production:

S → IF b THEN S ELSE S
| IF b THEN S
| a

G is ambiguous since the sentence

IF b THEN IF b THEN a ELSE a

has two different parse trees or derivation trees.

Parse tree I

figure

This parse tree imposes the interpretation

IF b THEN (IF b THEN a ) ELSE a

Parse Tree II

Figure

This parse tree imposes the interpretation

IF b THEN (IF b THEN a ELSE a)

The reason that the grammar G is ambiguous is that an 'ELSE' can be associated with two different THENs. For this reason, programming languages which allows both IF-THEN-ELSE and IF-THEN constant can be ambiguous.

Removal of Ambiguity
For compiling applications we need to design unambiguous grammar, or to use ambiguous grammar with additional rules to resolve the ambiguity.

Associativity of operators.
Precedence of operators.
Separate rules or Productions.

1. Associativity of Operators:
If operand has operators on both side then by connection, operand should be associated with the operator on the left.

In most programming languages arithmetic operators like addition, subtraction, multiplication, and division are left associative.

Token string: 9 - 5 + 2
Production rules
list → list - digit | digit
digit → 0 | 1 | 2 | . . . | 9
Parse tree for left-associative operator is

figure 24 on pg. 31

In the C programming language the assignment operator, =, is right associative. That is, token string a = b = c should be treated as a = (b = c).

Token string: a = b =c.
Production rules:
right → letter = right | letter
letter → a | b | . . . | z
Parse tree for right-associative operator is:

Figure

2. Precedence of Operators:
An expression 9 + 5 * 2 has two possible interpretation:

(9 + 5) * 2 and 9 + (5 * L)

The associativity of '+' and '*' do not resolve this ambiguity. For this reason, we need to know the relative precedence of operators.
The convention is to give multiplication and division higher precedence than addition and subtraction.
Only when we have the operations of equal precedence, we apply the rules of associative.
So, in the example expression: 9 + 5 * 2.
We perform operation of higher precedence i.e., * before operations of lower precedence i.e., +. Therefore, the correct interpretation is 9 + (5 *).

3. Separate Rule:
Consider the following grammar and language again.

S → IF b THEN S ELSE S
| IF b THEN S
| a

An ambiguity can be removed if we arbitrary decide that an ELSE should be attached to the last preceding THEN, like:

Figure

We can revise the grammar to have two nonterminals S₁ and S₂. We insist that S₂ generates IF-THEN-ELSE, while S₁ is free to generate either kind of statements.

The rules of the new grammar are:

S₁ →   IF b THEN S₁
         |      IF b THEN S₂ THEN S₁
         |     a
S₂ → IF b THEN S₂ ELSE S₂
         |     a

Although there is no general algorithm that can be used to determine if a given grammar is ambiguous, it is certainly possible to isolate rules which leads to ambiguity or ambiguous grammar.

A grammar containing the productions.

A → AA | Alpha

is ambiguous because the substring AAA has different parse tree.

Figure

This ambiguity disappears if we use the productions

A → AB | B
B → α

A → BA | B
B → α

3. Syntax of Expressions:
A grammar of arithmetic expressions looks like:

Expr → expr + term | expr - term | term
term → term * factor | term/factor | factor
factor → id | num | (expr)

That is, expr is a string of terms separated by '+' and '-'.
A term is a string of factors separated by '*' and '/' and a factor is a single operand or an expression wrapped inside of parenthesis.

2.3 Syntax-Directed Translation:

Modern compilers use syntax-directed translation to interleaves the actions of the compiler phases.

The syntax analyzer directs the whole process during the parsing of the source code.

Calls the lexical analyzer whenever syntax analyzer wants another token.
Perform the actions of semantic analyzer.
Perform the actions of the intermediate code generator.

The actions of the semantic analyzer and the intermediate code generator require the passage of information up and/or down the parse tree.

We think of this information as attributes attached to the nodes of the parse tree and the parser moving this information between parent nodes and children nodes as it performs the productions of the grammar.

Postfix Notation:
Postfix notation also called reverse polish notation or RPN places each binary arithmetic operator after its two operands instead of between them.

Infix Expression : (9 - 5) + 2

= (95 -) + 2

= (95-) 2 +

= 95 - 2 + : Postfix Notation

Infix Expression : 9 - (5 + 2)

= 9 - (52+)

= 9 (52+) -

= 9 5 2 + - : Postfix Notation

Why postfix notation?

There are two reasons

There is only one interpretation
We do not need parenthesis to disambignate the grammar.

Syntax-Directed Definitions:

A syntax-directed definition uses a CFG to specify the syntatic structure of the input.
A syntax-directed definition associates a set of attributes with each grammar symbol.
A syntax-directed definition associates a set of semantic rules with each production rule.

For example, let the grammar contains the production:

X → Y Z

And also let that nodes X, Y and Z have associated attributes X.a, Y.a and Z.a respectively.

The annotated parse tree looks like:

diagram

If the semantic rule
{X.a := Y.a + Z.a}
is associated with the production
X → Y Z
then parser should add the attribute 'a' of node Y and attribute 'a' of node Z together and set the attribute 'a' of node X to their sum.

Synthesized Attributes:
An attribute is synthesized if its value at a parent node can be determined from attributes of its children.

diagram

Since in this example, the value of a node X can be determined from 'a' attribute of Y and Z nodes attribute 'a' in a synthesized attribute.
Synthesized attributes can be evaluated by a single bottom-up traversal of the parse tree.

Example 2.6: Following figure shows the syntax-directed definition of an infix-to-postfix translator.

Figure 2.5 Pg. 34

PRODUCTION

SEMANTIC RULE

expr → expr₁ + term expr.t : = expr₁.t + | | term.t | | '+'

expr → expr₁ - term expr.t : = expr₁.t + | | term.t | | '-'

expr → term expr.t : = term.t

term → 0 term.t : = '0'

term → 1 term.t : = '1'

           :           :

           :           :

term → 9 term.t : = '9'

Parse tree corresponds to Productions

Diagram

Annotated parse tree corresponds to semantic rules.

Diagram

The above annotated parse tree shows how the input infix expression 9 - 5 + 2 is translated to the prefix expression 95 - 2 + at the root.

Depth-First Traversals
A depth-first traversal of a parse tree is one way of evaluating attributes.
Note that a syntax-directed definition does not impose any particular order as long as order computes attribute of parent after all its children's attributes.

PROCEDURE visit (n: node)
BEGIN
    FOR each child m of n, from left to right
        Do visist (m);
    Evaluate semantic rules at node n
END

Diagram

Translation Schemes
A translation scheme is another way of specifying a syntax-directed translation. This scheme is a CFG in which program fragments called semantic actions are embedded within the right sides of productions.

For example,
rest → + term {primt ( ' + ' )} rest,
indicates that a '+' sign should be printed between:

depth-first traversal of the term node, and
depth first traversal of the rest, node.

Diagram

Ex. 2.8

REVISION: SYNTAX-DIRECTED TRANSLATION

Step1: Syntax-directed definition for translating infix expression to postfix form.

PRODUCTION

SEMANTIC RULE

expr → expr₁ + term expr.t : = expr₁.t + | | term.t | | '+'

expr → expr₁ - term expr.t : = expr₁.t + | | term.t | | '-'

expr → term expr.t : = term.t

term → 0 term.t : = '0'

term → 1 term.t : = '1'

           :           :

           :           :

term → 9 term.t : = '9'

Step 2: A translation scheme derived from syntax-direction definition is :

Figure 2.15 on pg. 39

expr → expr + term {print( ' + ' )}

expr → expr - term {print( ' - ')}

expr → term

term → 0 {print( ' 0 ' )}

term → 1 {print( ' 1 ' )}

: :

: :

term → 9 {print( ' 9 ' )}

Step 3: A parse tree with actions translating 9 - 5 + 2 into 95 - 2 +

Figure 2.14 on pg. 40

Note that it is not necessary to actually construct the parse tree.

Parsing

Parsing is the process of determining if a string of tokens can be generated by a grammar. A parser must be capable of constructing the tree, or else the translation cannot be guaranteed correct. For any language that can be described by CFG, the parsing requires O(n3) time to parse string of n token. However, most programming languages are so simple that a parser requires just O(n) time with a single left-to-right scan over the iput string of n tokens.

There are two types of Parsing

Top-down Parsing (start from start symbol and derive string)
A Top-down parser builds a parse tree by starting at the root and working down towards the leaves.

Easy to generate by hand.

Examples are : Recursive-descent, Predictive.

Bottom-up Parsing (start from string and reduce to start symbol)
A bottom-up parser builds a parser tree by starting at the leaves and working up towards the root.

Not easy to handle by hands, usually compiler-generating software generate bottom up parser

But handles larger class of grammar

Example is LR parser.

Top-down Parsing
Consider the CFG with productions:

expr → term rest
rest → + term rest | - term rest
term → 0 | 1 | . . . | 9

Step 0:: Initialization: Root must be starting symbol
Step 1:: expr → term rest
Step 2:: term → 9
Step 3: rest → term rest
Step 4:: term → 5
Step 5:: rest → term rest
Step 6:: term → 2
Step 7:: rest → E

In the example above, the grammar made it easy for the top-down parser to pick the correct production in each step.
This is not true in general, see example of dangling 'else'.

Pridictive Parsing:
Recursive-descent parsing is a top-down method of syntax analysis that executes a set of recursive procedure to process the input. A procedure is associated with each nonterminal of a grammar.

A predictive parsing is a special form of recursive-descent parsing, in which the current input token unambiguously determines the production to be applied at each step.

Let the grammar be:

expr → term rest
rest → + term rest | - term rest | 6
term → 0 | 1 | . . . | 9

In a recursive-descent parsing, we write code for each nonterminal of a grammar. In the case of above grammar, we should have three procedure, correspond to nonterminals expr, rest, and term.

Since there is only one production for nonterminal expr, the procedure expr is:

expr ( )
{
    term ( );
    rest ( );
    return
}

Since there are three (3) productions for rest, procedure rest uses a global variable, 'lookahead', to select the correct production or simply selects "no action" i.e.,

E - production, indicating that lookahead variable is neither + nor -

rest ( )
{
    IF (lookahead = = '+') {
            match ( ' + ' );
            term ( );
            rest ( );
            return
    }
    ELSE IF ( lookahead = = '-') {
                match (' - ');
                term ( );
                rest ( );
                return
              {
    ELSE {
            return;
    }
}

The procedure term checks whether global variable lookahead is a digit.

term ( ) {
        IF ( isdigit (lookahead)) {
                match (lookahead);
                return;
        }
        else{
               ReportError ( );
        }

After loading first input token into variable 'lookahead' pridictive parser is stared by calling starting symbol, 'expr'.
If the input is error free, the parser conducts a depth-first traversal of the parse tree and return to caller routine through expr.

Problem with Predictive Parsing: left recursion

Left Recursion:
The production is left-recursive if the leftmost symbol on the right side is the same as the non terminal on the left side. For example,
expr → expr + term.

If one were to code this production in a recursive-descent parser, the parser would go in an infinite loop.

diagram

We can eliminate the left-recursion by introducing new nonterminals and new productions rules.

For example, the left-recursive grammar is:

E → E + T | T
E → T * F | F
F → (E) | id.

We can redefine E and T without left-recursion as:

E → TE`
E`→ + TE` | E
T → FT`
T → * FT` | E
F → (E) | id

Getting rid of such immediate left recursion is not enough. One must get rid of indirect left recursion too, where two or more nonterminals are mutually left-recursive.

Infix Expression	: (9 - 5) + 2
	= (95 -) + 2
	= (95-) 2 +
	= 95 - 2 +	: Postfix Notation

Infix Expression	: 9 - (5 + 2)
	= 9 - (52+)
	= 9 (52+) -
	= 9 5 2 + -	: Postfix Notation

PRODUCTION	SEMANTIC RULE
expr → expr₁ + term	expr.t : = expr₁.t + \| \| term.t \| \| '+'
expr → expr₁ - term	expr.t : = expr₁.t + \| \| term.t \| \| '-'
expr → term	expr.t : = term.t
term → 0	term.t : = '0'
term → 1	term.t : = '1'
:	:
:	:
term → 9	term.t : = '9'

expr → expr + term	{print( ' + ' )}
expr → expr - term	{print( ' - ')}
expr → term
term → 0	{print( ' 0 ' )}
term → 1	{print( ' 1 ' )}
:	:
:	:
term → 9	{print( ' 9 ' )}