Compiler Notes

Chapter 3 Lexical Analyzer

The main task of lexical Analyzer is to read a stream of characters as an input and produce a sequence of tokens such as names, keywords, punctuation marks etc.. for syntax analyzer.

It discards the white spaces and comments between the tokens and also keep track of line numbers.

<fig: 3.1 pp. 84>

Tokens, Patterns, Lexemes
Specification of Tokens
- Regular Expressions
- Notational Shorthand
Finite Automata
- Nondeterministic Finite Automata (NFA).
- Deterministic Finite Automata (DFA).
- Conversion of an NFA into a DFA.
- From a Regular Expression to an NFA.

Tokens, Patterns, Lexemes

Token

A lexical token is a sequence of characters that can be treated as a unit in the grammar of the programming languages.

Example of tokens:

Type token (id, num, real, . . . )
Punctuation tokens (IF, void, return, . . . )
Alphabetic tokens (keywords)

Example of non-tokens:

Comments, preprocessor directive, macros, blanks, tabs, newline, . . .

Patterns

There is a set of strings in the input for which the same token is produced as output. This set of strings is described by a rule called a pattern associated with the token.

Regular expressions are an important notation for specifying patterns.

For example, the pattern for the Pascal identifier token, id, is: id → letter (letter | digit)*.

Lexeme

A lexeme is a sequence of characters in the source program that is matched by the pattern for a token.

For example, the pattern for the RELOP token contains six lexemes ( =, < >, <, < =, >, >=) so the lexical analyzer should return a RELOP token to parser whenever it sees any one of the six.

3.3 Specification of Tokens

An alphabet or a character class is a finite set of symbols. Typical examples of symbols are letters and characters.

The set {0, 1} is the binary alphabet. ASCII and EBCDIC are two examples of computer alphabets.

Strings
A string over some alphabet is a finite sequence of symbol taken from that alphabet.

For example, banana is a sequence of six symbols (i.e., string of length six) taken from ASCII computer alphabet. The empty string denoted by , is a special string with zero symbols (i.e., string length is 0).

If x and y are two strings, then the concatenation of x and y, written xy, is the string formed by appending y to x.
For example, If x = dog and y = house, then xy = doghouse. For empty string, , we have S = S = S.

String exponentiation concatenates a string with itself a given number of times:

S²= SS or S.S
S³ = SSS or S.S.S
S⁴ = SSSS or S.S.S.S and so on

By definition S⁰ is an empty string, , and S` = S. For example, if x =ba and na then xy² = banana.

Languages
A language is a set of strings over some fixed alphabet. The language may contain a finite or an infinite number of strings.

Let L and M be two languages where L = {dog, ba, na} and M = {house, ba} then

Union: LUM = {dog, ba, na, house}
Concatenation: LM = {doghouse, dogba, bahouse, baba, nahouse, naba}
Expontentiation: L2 = LL
By definition: L⁰ ={} and L` = L

The kleene closure of language L, denoted by L*, is "zero or more Concatenation of" L.

L* = L⁰ U L` U L² U L³ . . . U Lⁿ. . .

For example, If L = {a, b}, then

L* = {, a, b, aa, ab, ab, ba, bb, aaa, aba, baa, . . . }

The positive closure of Language L, denoted by L⁺, is "one or more Concatenation of" L.

L⁺ = L` U L² U L³ . . . U Lⁿ . . .

For example, If L = {a, b}, then

L⁺ = {a, b, aa, ba, bb, aaa, aba, . . . }

Regular Expressions

The regular expressions over alphabet specifies a language according to the following rules. is a regular expression that denotes {}, that is, the set containing the empty string.
If a is a symbol in alphabet, then a is a regular expression that denotes {a}, that is, the set containing the string a.
Suppose r and s are regular expression denoting the languages L(r) and L(s). Then
1. (r)|(s) is a regular expression denoting L(r) U L(s).
2. (r)(s) is a regular expression denoting L(r) L(s).
3. (r)* is a regular expression denoting (L(r))*.
4. (r) is a regular expression denoting L(r), that is, extra pairs of parentheses may be used around regular expressions.

Unnecessary parenthesis can be avoided in regular expressions using the following conventions:

The unary operator * (kleene closure) has the highest precedence and is left associative.
Concatenation has a second highest precedence and is left associative.
Union has lowest precedence and is left associative.

Regular Definitions

A regular definition gives names to certain regular expressions and uses those names in other regular expressions.

Here is a regular definition for the set of Pascal identifiers that is define as the set of strings of letter and digits beginning with a letters.

letter → A | B | . . . | Z | a | b | . . . | z
digit → 0 | 1 | 2 | . . . | 9
id → letter (letter | digit)*

The regular expression id is the pattern for the Pascal identifier token and defines letter and digit.
Where letter is a regular expression for the set of all upper-case and lower case letters in the alphabet and digit is the regular for the set of all decimal digits.

The pattern for the Pascal unsigned token can be specified as follows:

digit → 0 | 1 | 2 | . . . | 9
digit → digit digit*

    Optimal-fraction    → . digits | ε
    Optimal-exponent → (E (+ | - | ) digits) | ε
                         num → digits optimal-fraction optimal-exponent.

This regular definition says that

An optimal-fraction is either a decimal point followed by one or more digits or it is missing (i.e., an empty string).
An optimal-exponent is either an empty string or it is the letter E followed by an ' optimal + or - sign, followed by one or more digits.

Notational Shorthand

The unary postfix operator + means "one of more instances of "

(r)⁺ = rr^*

The unary postfix operator? means "zero or one instance of"

r? = (r | ε)

Using these shorthand notation, Pascal unsigned number token can be written as:

                  digit   → 0 | 1 | 2 | . . . | 9
                   digits →digit⁺
   optimal-fraction → (. digits)?
optimal-exponent → (E (+ | -)?digits)?
                     num → digits optimal-fraction optimal-exponent

3.6 Finite Automata

A recognizer for a language is a program that takes a string x as an input and answers "yes" if x is a sentence of the language and "no" otherwise.

One can compile any regular expression into a recognizer by constructing a generalized transition diagram called a finite automation.

A finite automation can be deterministic means that more than one transition out of a state may be possible on a same input symbol.

Both automata are capable of recognizing what regular expression can denote.

Nondeterministic Finite Automata (NFA)

A nondeterministic finite automation is a mathematical model consists of

a set of states S;
a set of input symbol, ∑, called the input symbols alphabet.
a transition function move that maps state-symbol pairs to sets of states.
a state so called the initial or the start state.
a set of states F called the accepting or final state.

An NFA can be described by a transition graph (labeled graph) where the nodes are states and the edges shows the transition function.

The labeled on each edge is either a symbol in the set of alphabet, ∑, or denoting empty string.

Following figure shows an NFA that recognizes the language: (a | b)* a bb.

FIGURE 3.19 - pp 114

This automation is nondeterministic because when it is in state-0 and the input symbol is a, it can either go to state-1 or stay in state-0.

The transition is

FIGURE 115 pp. 115

The advantage of transition table is that it provides fast access to the transitions of states and the disadvantage is that it can take up a lot of soace.

The following diagram shows the move made in accepting the input strings abb, aabb and ba bb.

abb :

In general, more than one sequence of moves can lead to an accepting state. If at least one such move ended up in a final state. For instance

The language defined by an NFA is the set of input strings that particular NFA accepts.

Following figure shows an NFA that recognize aa* | bb*.

Note that 's disappear in a cancatenation.

FIGURE 3.21 pp. 116

The transition table is:

Deterministic Finite Automata (DFA)

A deterministic finite automation is a special case of a non-deterministic finite automation (NFA) in which

no state has an -transition
for each state s and input symbol a, there is at most one edge labeled a leaving s.

A DFA has st most one transition from each state on any input. It means that each entry on any input. It means that each entry in the transition table is a single state (as oppose to set of states in NFA).

Because of single transition attached to each state, it is vary to determine whether a DFA accepts a given input string.

Algorithm for Simulating a DFA

INPUT:

string x
a DFA with start state, so . . .
a set of accepting state's F.

OUTPUT:

The answer 'yes' if D accepts x; 'no' otherwise.

The function move (S, C) gives a new state from state s on input character C.

The function 'nextchar' returns the next character in the string.

Initialization:
        S := S₀C := nextchar;

while not end-of-file do
        S := move (S, C)
        C := nextchar;

If S is in F then
        return "yes"

else
    return "No".

Following figure shows a DFA that recognizes the language (a|b)*abb.

FIGURE

The transition table is

state	a	b
0	1	0
1	1	2
2	1	3
3	1	0

With this DFA and the input string "ababb", above algorithm follows the sequence of states:

FIGURE

Conversion of an NFA into a DFA

It is hard for a computer program to simulate an NFA because the transition function is multivalued. Fortunately, an algorithm, called the subset construction will convert an NFA for any language into a DFA that recognizes the same languages. Note that this algorithm is closely related to an algorithm for constructing LR parser.

In the transition table of an NFA, entry is a set of states;
In the transition table of a DFA, each entry is a single state;

The general idea behind the NFA-to-DFA construction is that the each DFA state corresponds to a set of NFA states.

For example, let T be the set of all states that an NFA could reach after reading input: a₁, a₂, . . . , a_n - then the state that the DFA reaches after reading a₁, a₂, . . . , a_ncorresponds to set T.

Theoretically, the number of states of the DFA can be exponential in the number of states of the NFA, i.e., θ(2ⁿ), but in practice this worst case rarely occurs.

Algorithm: Subset construction.

INPUT: An NFA N
OUTPUT: A DFA D accepting the same language.

METHOD: Construct a transition table DTrans. Each DFA state is a set of NFA states. DTran simulates in parallel all possible moves N can make on a given string.

Operations to keep track of sets of NFA states:

ε Closure (S): Set of states reachable from state S via epsilon.
ε Closure (T): Set of states reachable from any state in set T via epsilon.
move (T, a): Set of states to which there is an NFA transition from states in T on a symbol a.

Algorithm:

    initially, ε-Closure (S₀) in DTrans.
    While unmarked state T in DTrans
    mark T
    for each input symbol 'a'
        do u = ε Closure (T, a)
            If u is not in DTrans
                then add u to DTrans
            DTrans [T, a] = U

Following algorithm shows a computation of ε -Closure function.

    Push all states in T onto stack.
    initialize ε-Closure (T) to T
    while stack is not empty
        do pop top element t
            for each state u with ε-edge
                        t to u
                do If u is not in ε-Closure(T)
                    do add u ε Closure (T)
                        push u onto stack

Following example illustrates the method by constructing a DFA for the NFA.

From a Regular Expression to an NFA

Thompson's construction is an NFA from a regular expression.

The Thompson's construction is guided by the syntax of the regular expression with cases following the cases in the definition of regular expression.

ε is a regular expression that denotes {ε}, the set containing just the empty string.
diagram
where i is a new start state and f is a new accepting state. This NFA recognizes {ε}.
If a is a symbol in the alphabet, a ∑, then regular expression 'a' denotes {a} and the set containing just 'a' symbol.
diagram
This NFA recognizes {a}.
Suppose, s and t are regular expressions denoting L{s} and L(t) respectively, then
1. s/r is a regular expression denoting L(s) L(t)
2. st is a regular expression denoting L(s) L(t)
  diagram
3. s* is a regular expression denoting L(s)*
  diagram
4. (s) is a regular expression denoting L(s) and can be used for putting parenthesis around regular expression

Example: Use above algorithm, Thompson's construction, to construct NFA for the regular expression r = (a|b)* abb.

First constant the parse tree for r = (a|b)* abb.

figure

For r₁ - use case 2.

figure

For r₂ - use case 2.

figure

For r₃- use case 3a

figure

For r₅ - use case 3c

figure

We have r₅ = (a|b)*

For r₆ - use case 2

figure

and for r₇ - use case 3b

figure

We get r₇=(a|b)* a

Similarly for r₈ and r₁₀ - use case 2

figure

And get r₁₁ by case 3b

figure

We have r = (a|b)*abb.