Chapter 2

Syntax Definition

 

A contex free grammar, CFG, (synonyms: Backus-Naur Firm of BNF) is a common notation for specifying the syntax of a languages

For example, an "IF-ELSE" statement in c-language has the form
        IF         (Expr)         stmt         ELSE         stmt

In other words, it is the concatenation of:

The syntax of an 'IF-ELSE' statement can be specified by the following 'production rule' in the CFG.
        stmt          IF     (Expr)     stmt     ELSE     stmt

The arrow (  ) is read as "can have the form".

A context-free grammar (CFG) has four components:

  1. A set of tokens called terminals.
  2. A set of variable called nonterminals.
  3. A set of production rules.
  4. A designation of one of the nonterminals as the start symbol.

Multiple production with the same nonterminal on the left like:

list   + digit
list    - digit
list   

may be grouped together separated by vertical bars, like:

list     list + digit | list - digit | digit

Parsing
The parsing is a process of finding a parse tree for a string of tokens. Equivalently, it is a process of determining whether a string of tokens can be generated by a grammar.
The worst-case time pf parsing algorithms are O(nn3) but typical is : O(n) time.

For example. The production rules of grammar G is:

  list    list + digit | list - digit | digit
digit    0 | 1 | . . . | 9

Given token string is 9-5+2.

Parse tree is:

diagram

Each node in the parse tree is labeled by a grammar symbol.
    the interior node corresponds to the left side of the production.
    the children of the interior node corresponds to the right side of production.

The language defined by a grammar is the set of all token strings can be derived from its start symbol.

The language defined by the grammar:

 list     list + digit | list - digit | digit
digit   0 | 1 | 2 | . . . | 9

contains all lists of digits separated by plus and minus signs.

The Epsilon, E, on the right side of the production denotes the empty string.

Ambiguity
A grammar is ambiguous if  two or more different parse trees can be desire the same token string. Equivalently, an ambiguous grammar allows two different derivations for a token string.
Grammar for complier should be unambiguous since different parse trees will give a token string different meaning.

Consider the following grammar

string    string + string
            | string - string
            | 0 | 2 | . . . | 9

To show that a grammar is ambiguous all we need to find a "single" stringthat has more than one perse tree.

 

Figure:23  --- pg.31

Above figure show two different parse trees for the token string 9 - 5 + 2 that corresponds to two different way of parenthesizing the expression:

        ( - 5) + 2 and 9 -(5 + 2).

The first parenthesization evaluates to 2.

Perhaps, the most famous example of ambiguity in a programming language is the dangling 'ELSE'.

Consider the grammar G with the production:

  IF b THEN S ELSE S
      |    IF b THEN S
      |     a

G is ambiguous since the sentence

IF b THEN IF b THEN a ELSE a

has two different parse trees or derivation trees.

Parse tree I

figure

This parse tree imposes the interpretation

IF b THEN (IF b THEN a ) ELSE a

Parse Tree II

Figure

This parse tree imposes the interpretation

IF b THEN (IF b THEN a ELSE a)

The reason that the grammar G is ambiguous is that an 'ELSE' can be associated with two different THENs. For this reason, programming languages which allows both IF-THEN-ELSE and IF-THEN constant can be ambiguous.

Removal of Ambiguity
For compiling applications we need to design unambiguous grammar, or to use ambiguous grammar with additional rules to resolve the ambiguity.

  1. Associativity of operators.
  2. Precedence of operators.
  3. Separate rules or Productions.

1. Associativity of Operators:
    If operand has operators on both side then by connection, operand should be associated with the operator on the left.

In most programming languages arithmetic operators like addition, subtraction, multiplication, and division are left associative.

figure 24 on pg. 31

In the C programming language the assignment operator, =, is right associative. That is, token string a = b = c should be treated as a = (b = c).

Figure

2. Precedence of Operators:
    An expression 9 + 5 * 2 has two possible interpretation:

(9 + 5) * 2 and 9 + (5 * L)

The associativity of '+' and '*' do not resolve this ambiguity. For this reason, we need to know the relative precedence of operators.
The convention is to give multiplication and division higher precedence than addition and subtraction.
Only when we have the operations of equal precedence, we apply the rules of associative.
So, in the example expression: 9 + 5 * 2.
We perform operation of higher precedence i.e., * before operations of lower precedence i.e., +. Therefore, the correct interpretation is 9 + (5 *).

3. Separate Rule:
    Consider the following grammar and language again.

  IF b THEN S ELSE S
               |    IF b THEN S
               |    a

An ambiguity can be removed if we arbitrary decide that an ELSE should be attached to the last preceding THEN, like:

Figure

 

We can revise the grammar to have two nonterminals S1 and S2. We insist that S2 generates IF-THEN-ELSE, while S1 is free to generate either kind of statements.

The rules of the new grammar are:

S1     IF b THEN S1
         |      IF b THEN S2 THEN S1
         |     a
S2  IF b THEN S2 ELSE S2
         |     a

Although there is no general algorithm that can be used to determine if a given grammar is ambiguous, it is certainly possible to isolate rules which leads to ambiguity or ambiguous grammar.

A grammar containing the productions.

A AA | Alpha

is ambiguous because the substring AAA has different parse tree.

Figure

This ambiguity disappears if we use the productions

A AB | B
B α

or

A BA | B
B α

3. Syntax of Expressions:
   
A grammar of arithmetic expressions looks like:

Expr   expr + term | expr - term | term
term   term * factor | term/factor | factor
factor id | num | (expr)

That is, expr is a string of terms separated by '+' and '-'.
A term is a string of factors separated by '*' and '/' and a factor is a single operand or an expression wrapped inside of parenthesis.

2.3 Syntax-Directed Translation:

Modern compilers use syntax-directed translation to interleaves the actions of the compiler phases.

The syntax analyzer directs the whole process during the parsing of the source code.

The actions of the semantic analyzer and the intermediate code generator require the passage of information up and/or down the parse tree.

We think of this information as attributes attached to the nodes of the parse tree and the parser moving this information between parent nodes and children nodes as it performs the productions of the grammar.

Postfix Notation:
Postfix notation also called reverse polish notation or RPN places each binary arithmetic operator after its two operands instead of between them.

Infix Expression  : (9 - 5) + 2  
  = (95 -) + 2  
  = (95-) 2 +  
  = 95 - 2 + : Postfix Notation
     
Infix Expression  : 9 - (5 + 2)  
  = 9 - (52+)  
  = 9 (52+) -  
  = 9 5 2 + - : Postfix Notation

Why postfix notation?

There are two reasons

Syntax-Directed Definitions:

For example, let the grammar contains the production:

X Y Z

And also let that nodes X, Y and Z have associated attributes X.a, Y.a and Z.a respectively.

The annotated parse tree looks like:

diagram

If the semantic rule
        {X.a := Y.a + Z.a}
is associated with the production
        X Y Z
then parser should add the attribute 'a' of node Y and attribute 'a' of node Z together and set the attribute 'a' of node X to their sum.

Synthesized Attributes:
An attribute is synthesized if its value at a parent node can be determined from attributes of its children.

diagram

Since in this example, the value of a node X can be determined from 'a' attribute of Y and Z nodes attribute 'a' in a synthesized attribute.
Synthesized attributes can be evaluated by a single bottom-up traversal of the parse tree.

Example 2.6: Following figure shows the syntax-directed definition of an infix-to-postfix translator.

Figure 2.5 Pg. 34

PRODUCTION     

SEMANTIC RULE

expr    expr1 + term expr.t : = expr1.t + | | term.t | | '+'
expr    expr1 - term expr.t : = expr1.t + | | term.t | | '-'
expr    term expr.t : =  term.t
term    0 term.t : = '0'
term    1 term.t : = '1'
           :           :
           :           :
term    9 term.t : = '9'


Parse tree corresponds to Productions

Diagram

Annotated parse tree corresponds to semantic rules.

Diagram

The above annotated parse tree shows how the input infix expression 9 - 5 + 2 is translated to the prefix expression 95 - 2 + at the root.

Depth-First Traversals
A depth-first traversal of a parse tree is one way of evaluating attributes.
Note that a syntax-directed definition does not impose any particular order as long as order computes attribute of parent after all its children's attributes.

PROCEDURE visit (n: node)
BEGIN
    FOR each child m of n, from left to right
        Do visist (m);
    Evaluate semantic rules at node n
END

Diagram

Translation Schemes
A translation scheme is another way of specifying a syntax-directed translation. This scheme is a CFG in which program fragments called semantic actions are embedded within the right sides of productions.

For example,
        rest + term {primt ( ' + ' )} rest,
indicates that a '+' sign should be printed between:

Diagram

Ex. 2.8

REVISION: SYNTAX-DIRECTED TRANSLATION

Step1: Syntax-directed definition for translating infix expression to postfix form.

PRODUCTION     

SEMANTIC RULE

expr    expr1 + term expr.t : = expr1.t + | | term.t | | '+'
expr    expr1 - term expr.t : = expr1.t + | | term.t | | '-'
expr    term expr.t : =  term.t
term    0 term.t : = '0'
term    1 term.t : = '1'
           :           :
           :           :
term    9 term.t : = '9'

Step 2: A translation scheme derived from syntax-direction definition is :

Figure 2.15 on pg. 39

expr    expr + term {print( ' + ' )}
expr    expr - term {print( ' - ')}
expr    term  
term    0 {print( ' 0 ' )}
term    1 {print( ' 1 ' )}
           :           :
           :           :
term    9 {print( ' 9 ' )}

Step 3: A parse tree with actions translating 9 - 5 + 2 into 95 - 2 +

 

Figure 2.14 on pg. 40

Note that it is not necessary to actually construct the parse tree.

Parsing

Parsing is the process of determining if a string of tokens can be generated by a grammar. A parser must be capable of constructing the tree, or else the translation cannot be guaranteed correct. For any language that can be described by CFG, the parsing requires O(n3) time to parse string of n token. However, most programming languages are so simple that a parser requires just O(n) time with a single left-to-right scan over the iput string of n tokens.

There are two types of Parsing

  1. Top-down Parsing (start from start symbol and derive string)
    A Top-down parser builds a parse tree by starting at the root and working down towards the leaves.
  2. Bottom-up Parsing (start from string and reduce to start symbol)
    A bottom-up parser builds a parser tree by starting at the leaves and working up towards the root.

Top-down Parsing
Consider the CFG with productions:

expr term rest
rest  → + term rest | - term rest
term 0 | 1 | . . . | 9

Step 0:
Initialization: Root must be starting symbol
Step 1:
expr term rest
Step 2:
term 9
Step 3
rest term rest
Step 4:
term 5
Step 5:
rest term rest
Step 6:
term 2
Step 7:
rest E

In the example above, the grammar made it easy for the top-down parser to pick the correct production in each step.
This is not true in general, see example of dangling 'else'.

Pridictive Parsing:
Recursive-descent parsing is a top-down method of syntax analysis that executes a set of recursive procedure to process the input. A procedure is associated with each nonterminal of a grammar.

A predictive parsing is a special form of recursive-descent parsing, in which the current input token unambiguously determines the production to be applied at each step.

Let the grammar be:

expr term rest
rest  + term rest | - term rest | 6
term 0 | 1 | . . . | 9

In a recursive-descent parsing, we write code for each nonterminal of a grammar. In the case of above grammar, we should have three procedure, correspond to nonterminals expr, rest, and term.

Since there is only one production for nonterminal expr, the procedure expr is:

expr ( )
{
    term ( );
    rest ( );
    return
}

Since there are three (3) productions for rest, procedure rest uses a global variable, 'lookahead', to select the correct production or simply selects "no action" i.e.,

E - production, indicating that lookahead variable is neither + nor -

rest ( )
{
    IF (lookahead = = '+') {
            match ( ' + ' );
            term ( );
            rest ( );
            return
    }
    ELSE IF ( lookahead = = '-') {
                match (' - ');
                term ( );
                rest ( );
                return
              {
    ELSE {
            return;
    }
}

The procedure term checks whether global variable lookahead is a digit.

term ( ) {
        IF ( isdigit (lookahead)) {
                match (lookahead);
                return;
        }
        else{
               ReportError ( );
        }
   

After loading first input token into variable 'lookahead' pridictive parser is stared by calling starting symbol, 'expr'.
If the input is error free, the parser conducts a depth-first traversal of the parse tree and return to caller routine through expr.

Problem with Predictive Parsing: left recursion

Left Recursion:
The production is left-recursive if the leftmost symbol on the right side is the same as the non terminal on the left side. For example,
            expr expr + term.

If one were to code this production in a recursive-descent parser, the parser would go in an infinite loop.

diagram

We can eliminate the left-recursion by introducing new nonterminals and new productions rules.

For example, the left-recursive grammar is:

E + T | T
T * F | F
(E) | id.

We can redefine E and T without left-recursion as:

TE`
E` + TE` | E
T FT`
T * FT` | E
F (E) | id

Getting rid of such immediate left recursion is not enough. One must get rid of indirect left recursion too, where two or more nonterminals are mutually left-recursive.