Chapter 3

Lexical Analyzer


The main task of lexical Analyzer is to read a stream of characters as an input and produce a sequence of tokens such as names, keywords, punctuation marks etc.. for syntax analyzer.

It discards the white spaces and comments between the tokens and also keep track of line numbers.

<fig: 3.1 pp. 84>


Tokens, Patterns, Lexemes


A lexical token is a sequence of characters that can be treated as a unit in the grammar of the programming languages.

Example of tokens:

Example of non-tokens:


There is a set of strings in the input for which the same token is produced as output. This set of strings is described by a rule called a pattern associated with the token.

Regular expressions are an important notation for specifying patterns.

For example, the pattern for the Pascal identifier token, id, is: id letter (letter | digit)*.


A lexeme is a sequence of characters in the source program that is matched by the pattern for  a token.

For example, the pattern for the RELOP token contains six lexemes ( =, < >, <, < =, >, >=) so the lexical analyzer should return a RELOP token to parser whenever it sees any one of the six.

3.3 Specification of Tokens

An alphabet or a character class is a finite set of symbols. Typical examples of symbols are letters and characters.

The set {0, 1} is the binary alphabet. ASCII and EBCDIC are two examples of computer alphabets.

A string over some alphabet is a finite sequence of symbol taken from that alphabet.

For example, banana is a sequence of six symbols (i.e., string of length six) taken from ASCII  computer alphabet. The empty string denoted by , is a special string with zero symbols (i.e., string length is 0).

If x and y are two strings, then the concatenation of x and y, written xy, is the string formed by appending y to x.
For example, If x = dog and y = house, then xy = doghouse. For empty string, , we have S = S = S.

String exponentiation concatenates a string with itself a given number of times:

S2 = SS or S.S
S3 = SSS or S.S.S
S4 = SSSS or S.S.S.S and so on

By definition S0 is an empty string, , and S` = S. For example, if x =ba and na then xy2 = banana.

A language is a set of strings over some fixed alphabet. The language may contain a finite or an infinite number of strings.

Let L and M be two languages  where L = {dog, ba, na} and M = {house, ba} then

The kleene closure of language L, denoted by L*, is "zero or more Concatenation of" L.

L* = L0 U L` U L2 U L3 . . . U Ln . . .

For example, If  L = {a, b}, then

L* = {, a, b, aa, ab, ab, ba, bb, aaa, aba, baa, . . . }

The positive closure of Language L, denoted by L+, is "one or more Concatenation of" L.

L+  = L` U L2 U L3 . . . U Ln  . . .

For example, If L = {a, b}, then

L+  = {a, b, aa, ba, bb, aaa, aba, . . . }


Regular Expressions

  1. The regular expressions over alphabet specifies a language according to the following rules. is a regular expression that denotes {}, that is, the set containing the empty string.
  2. If a is a symbol in alphabet, then a is a regular expression that denotes {a}, that is, the set containing the string a.
  3. Suppose r and s are regular expression denoting the languages L(r) and L(s). Then
    1. (r)|(s) is a regular expression denoting L(r) U L(s).
    2. (r)(s) is a regular expression denoting L(r) L(s).
    3. (r)* is a regular expression denoting (L(r))*.
    4. (r) is a regular expression denoting L(r), that is, extra pairs of parentheses may be used around regular expressions.

Unnecessary parenthesis can be avoided in regular expressions using the following conventions:

Regular Definitions

A regular definition gives names to certain regular expressions and uses those names in other regular expressions.

Here is a regular definition for the set of Pascal identifiers that is define as the set of strings of letter and digits beginning with a letters.

letter A | B | . . . | Z | a | b | . . . | z
digit  0 | 1 | 2 | . . . | 9
   id  letter (letter | digit)*

The regular expression id is the pattern for the Pascal identifier token and defines letter and digit.
Where letter is a regular expression for the set of all upper-case and lower case letters in the alphabet and digit is the regular for the set of all decimal digits.

The pattern for the Pascal unsigned token can be specified as follows:

digit 0 | 1 | 2 | . . . | 9
digit digit digit*

    Optimal-fraction     → . digits | ε
    Optimal-exponent  (E (+ | - | ) digits) | ε
                         num  digits optimal-fraction optimal-exponent.

This regular definition says that


Notational Shorthand

The unary postfix operator + means "one of more instances of "

(r)+ = rr*

The unary postfix operator? means "zero or one instance of"

    r? = (r | ε)

Using these shorthand notation, Pascal unsigned number token can be written as:

                  digit   0 | 1 | 2 | . . . | 9
                   digits digit+
   optimal-fraction (. digits)?
optimal-exponent  (E (+ | -)?digits)?
                     num digits optimal-fraction optimal-exponent


3.6 Finite Automata

A recognizer for a language is a program that takes a string x as an input and answers "yes" if x is a sentence of the language and "no" otherwise.

One can compile any regular expression into a recognizer by constructing a generalized transition diagram called a finite automation.

A finite automation can be deterministic means that more than one transition out of a state may be possible on a same input symbol.

Both automata are capable of recognizing what regular expression can denote.


Nondeterministic Finite Automata (NFA)

A nondeterministic finite automation is a mathematical model consists of

  1. a set of states S;
  2. a set of input symbol, , called the input symbols alphabet.
  3. a transition function move that maps state-symbol pairs to sets of states.
  4. a state so called the initial or the start state.
  5. a set of states F called the accepting or final state.

An NFA can be described by a transition graph (labeled graph) where the nodes are states and the edges shows the transition function.

The labeled on each edge is either a symbol in the set of  alphabet, , or denoting empty string.

Following figure shows an NFA that recognizes the language: (a | b)* a bb.


FIGURE 3.19 - pp 114

This automation is nondeterministic because when it is in state-0 and the input symbol is a, it can either go to state-1 or stay in state-0.

The transition is

FIGURE 115 pp. 115

The advantage of transition table is that it provides fast access to the transitions of states and the disadvantage is that it can take up a lot of soace.

The following diagram shows the move made in accepting the input strings abb, aabb and ba bb.

abb :



In general, more than one sequence of moves can lead to an accepting state. If at least one such move ended up in a final state. For instance


The language defined by an NFA is the set of input strings that particular NFA accepts.

Following figure shows an NFA that recognize aa* | bb*.

Note that 's disappear in a cancatenation.

FIGURE 3.21 pp. 116

The transition table is:


Deterministic Finite Automata (DFA)

A deterministic finite automation is a special case of a non-deterministic finite automation (NFA) in which

  1. no state has an -transition
  2. for each state s and input symbol a, there is at most one edge labeled a leaving s.

A DFA has st most one transition from each state on any input. It means that each entry on any input. It means that each entry in the transition table is a single state (as oppose to set of states in NFA).

Because of single transition attached to each state, it is vary to determine whether a DFA accepts a given input string.

Algorithm for Simulating a DFA



The function move (S, C) gives a new state from state s on input character C.

The function 'nextchar' returns the next character in the string.

        S := S0
C := nextchar;

while not end-of-file do
        S := move (S, C)
        C := nextchar;

If S is in F then                    
        return "yes"

    return "No".

Following figure shows a DFA that recognizes the language (a|b)*abb.


The transition table is

state a b
0 1 0
1 1 2
2 1 3
3 1 0

With this DFA and the input string "ababb", above algorithm follows the sequence of states:



Conversion of an NFA into a DFA

It is hard for a computer program to simulate an NFA because the transition function is multivalued. Fortunately, an algorithm, called the subset construction will convert an NFA for any language into a DFA that recognizes the same languages. Note that this algorithm is closely related to an algorithm for constructing LR parser.

The general idea behind the NFA-to-DFA construction is that the each DFA state corresponds to a set of NFA states.

For example, let T be the set of all states that an NFA could reach after reading input: a1, a2, . . . , an - then the state that the DFA reaches after reading  a1, a2, . . . , an corresponds to set T.

Theoretically, the number of states of the DFA can be exponential in the number of states of the NFA, i.e., θ(2n), but in practice this worst case rarely occurs.

Algorithm: Subset construction.

OUTPUT: A DFA D accepting the same language.

METHOD: Construct a transition table DTrans. Each DFA state is a set of NFA states. DTran simulates in parallel all possible moves N can make on a given string.

Operations to keep track of sets of NFA states:

ε Closure (S)
Set of states reachable from state S via epsilon.
ε Closure (T)
Set of states reachable from any state in set T via epsilon.
move (T, a)
Set of states to which there is an NFA transition from states in T on a symbol a.


    initially, ε-Closure (S0) in DTrans.
    While unmarked state T in DTrans
    mark T
    for each input symbol 'a'
        do u = ε Closure (T, a)
            If u is not in DTrans
                then add u to DTrans
            DTrans [T, a] = U

Following algorithm shows a computation of ε -Closure function.

    Push all states in T onto stack.
    initialize ε-Closure (T) to T
    while stack is not empty
        do pop top element t
            for each state u with ε-edge
                        t to u
                do If u is not in ε-Closure(T)
                    do add u ε Closure (T)
                        push u onto stack

Following example illustrates the method by constructing a DFA for the NFA.

From a Regular Expression to an NFA

Thompson's construction is an NFA from a regular expression.

The Thompson's construction is guided by the syntax of the regular expression with cases following the cases in the definition of regular expression.

  1. ε is a regular expression that denotes {ε}, the set containing just the empty string.
    where i is a new start state and f is a new accepting state. This NFA recognizes {ε}.
  2. If a is a symbol in the alphabet, a , then regular expression 'a' denotes {a} and the set containing just 'a' symbol.
    This NFA recognizes {a}.
  3. Suppose, s and t are regular expressions denoting L{s} and L(t) respectively, then
    1. s/r is a regular expression denoting L(s) L(t)
    2. st is a regular expression denoting L(s) L(t)
    3. s* is a regular expression denoting L(s)*
    4. (s) is a regular expression denoting L(s) and can be used for putting parenthesis around regular expression

Example: Use above algorithm, Thompson's construction, to construct NFA for the regular expression r = (a|b)* abb.

First constant the parse tree for r = (a|b)* abb.






We have r5 = (a|b)*



We get r7 =(a|b)* a

Similarly for r8 and r10 - use case 2



And get r11 by case 3b


We have r = (a|b)*abb.