The main task of lexical Analyzer is to read a stream of characters as an input and produce a sequence of tokens such as names, keywords, punctuation marks etc.. for syntax analyzer.
It discards the white spaces and comments between the tokens and also keep track of line numbers.
<fig: 3.1 pp. 84>
A lexical token is a sequence of characters that can be treated as a unit in the grammar of the programming languages.
Example of tokens:
Example of non-tokens:
There is a set of strings in the input for which the same token is produced as output. This set of strings is described by a rule called a pattern associated with the token.
Regular expressions are an important notation for specifying patterns.
For example, the pattern for the Pascal identifier token, id, is: id → letter (letter | digit)*.
A lexeme is a sequence of characters in the source program that is matched by the pattern for a token.
For example, the pattern for the RELOP token contains six lexemes ( =, < >, <, < =, >, >=) so the lexical analyzer should return a RELOP token to parser whenever it sees any one of the six.
An alphabet or a character class is a finite set of symbols. Typical examples of symbols are letters and characters.
The set {0, 1} is the binary alphabet. ASCII and EBCDIC are two examples of computer alphabets.
Strings
A string over some alphabet is a finite sequence of symbol taken from that
alphabet.
For example, banana is a sequence of six symbols (i.e., string of length six) taken from ASCII computer alphabet. The empty string denoted by , is a special string with zero symbols (i.e., string length is 0).
If x and y are two strings, then the concatenation of x and y, written xy, is
the string formed by appending y to x.
For example, If x = dog and y = house, then xy = doghouse. For empty string,
, we
have S
= S
= S.
String exponentiation concatenates a string with itself a given number of times:
S^{2 }= SS or S.S
S^{3} = SSS or S.S.S
S^{4} = SSSS or S.S.S.S and so on
By definition S^{0} is an empty string, , and S` = S. For example, if x =ba and na then xy^{2} = banana.
Languages
A language is a set of strings over some fixed alphabet. The language may
contain a finite or an infinite number of strings.
Let L and M be two languages where L = {dog, ba, na} and M = {house, ba} then
The kleene closure of language L, denoted by L*, is "zero or more Concatenation of" L.
L* = L^{0} U L` U L^{2} U L^{3} . . . U L^{n }. . .
For example, If L = {a, b}, then
L* = {, a, b, aa, ab, ab, ba, bb, aaa, aba, baa, . . . }
The positive closure of Language L, denoted by L^{+}, is "one or more Concatenation of" L.
L^{+ } = L` U L^{2} U L^{3} . . . U L^{n } . . .
For example, If L = {a, b}, then
L^{+ } = {a, b, aa, ba, bb, aaa, aba, . . . }
Unnecessary parenthesis can be avoided in regular expressions using the following conventions:
A regular definition gives names to certain regular expressions and uses those names in other regular expressions.
Here is a regular definition for the set of Pascal identifiers that is define as the set of strings of letter and digits beginning with a letters.
letter → A | B | . . . | Z | a | b | . . . | z
digit → 0 | 1 | 2 | . . . | 9
id → letter (letter | digit)*
The regular expression id is the pattern for the Pascal identifier token and
defines letter and digit.
Where letter is a regular expression for the set of all upper-case and
lower case letters in the alphabet and digit is the regular for the set
of all decimal digits.
The pattern for the Pascal unsigned token can be specified as follows:
digit → 0 | 1 | 2 | . . . | 9
digit → digit digit*
Optimal-fraction
→
. digits |
ε
Optimal-exponent → (E (+
| - |
)
digits) | ε
num
→
digits optimal-fraction optimal-exponent.
This regular definition says that
The unary postfix operator + means "one of more instances of "
(r)^{+} = rr^{*}
The unary postfix operator? means "zero or one instance of"
r? = (r | ε)
Using these shorthand notation, Pascal unsigned number token can be written as:
digit → 0 | 1 | 2 | . . . | 9
digits →digit^{+}
optimal-fraction → (. digits)?
optimal-exponent → (E (+ | -)?digits)?
num → digits optimal-fraction optimal-exponent
A recognizer for a language is a program that takes a string x as an input and answers "yes" if x is a sentence of the language and "no" otherwise.
One can compile any regular expression into a recognizer by constructing a generalized transition diagram called a finite automation.
A finite automation can be deterministic means that more than one transition out of a state may be possible on a same input symbol.
Both automata are capable of recognizing what regular expression can denote.
A nondeterministic finite automation is a mathematical model consists of
An NFA can be described by a transition graph (labeled graph) where the nodes are states and the edges shows the transition function.
The labeled on each edge is either a symbol in the set of alphabet, ∑, or denoting empty string.
Following figure shows an NFA that recognizes the language: (a | b)* a bb.
FIGURE 3.19 - pp 114
This automation is nondeterministic because when it is in state-0 and the input symbol is a, it can either go to state-1 or stay in state-0.
The transition is
FIGURE 115 pp. 115
The advantage of transition table is that it provides fast access to the transitions of states and the disadvantage is that it can take up a lot of soace.
The following diagram shows the move made in accepting the input strings abb, aabb and ba bb.
abb :
In general, more than one sequence of moves can lead to an accepting state. If at least one such move ended up in a final state. For instance
The language defined by an NFA is the set of input strings that particular NFA accepts.
Following figure shows an NFA that recognize aa* | bb*.
Note that 's disappear in a cancatenation.
FIGURE 3.21 pp. 116
The transition table is:
A deterministic finite automation is a special case of a non-deterministic finite automation (NFA) in which
A DFA has st most one transition from each state on any input. It means that each entry on any input. It means that each entry in the transition table is a single state (as oppose to set of states in NFA).
Because of single transition attached to each state, it is vary to determine whether a DFA accepts a given input string.
Algorithm for Simulating a DFA
INPUT:
OUTPUT:
The function move (S, C) gives a new state from state s on input character C.
The function 'nextchar' returns the next character in the string.
Initialization:
S := S_{0 }C := nextchar;while not end-of-file do
S := move (S, C)
C := nextchar;If S is in F then
return "yes"else
return "No".
Following figure shows a DFA that recognizes the language (a|b)*abb.
FIGURE
The transition table is
state | a | b |
0 | 1 | 0 |
1 | 1 | 2 |
2 | 1 | 3 |
3 | 1 | 0 |
With this DFA and the input string "ababb", above algorithm follows the sequence of states:
FIGURE
It is hard for a computer program to simulate an NFA because the transition function is multivalued. Fortunately, an algorithm, called the subset construction will convert an NFA for any language into a DFA that recognizes the same languages. Note that this algorithm is closely related to an algorithm for constructing LR parser.
The general idea behind the NFA-to-DFA construction is that the each DFA state corresponds to a set of NFA states.
For example, let T be the set of all states that an NFA could reach after reading input: a_{1}, a_{2}, . . . , a_{n} - then the state that the DFA reaches after reading a_{1}, a_{2}, . . . , a_{n }corresponds to set T.
Theoretically, the number of states of the DFA can be exponential in the number of states of the NFA, i.e., θ(2^{n}), but in practice this worst case rarely occurs.
Algorithm: Subset construction.
INPUT: An NFA N
OUTPUT: A DFA D accepting the same language.
METHOD: Construct a transition table DTrans. Each DFA state is a set of NFA states. DTran simulates in parallel all possible moves N can make on a given string.
Operations to keep track of sets of NFA states:
Algorithm:
initially,
ε-Closure (S_{0})
in DTrans.
While unmarked state T in DTrans
mark T
for each input symbol 'a'
do u =
ε Closure
(T, a)
If u is not in DTrans
then add u to DTrans
DTrans [T, a] = U
Following algorithm shows a computation of ε -Closure function.
Push all states in T onto stack.
initialize
ε-Closure
(T) to T
while stack is not empty
do pop top element t
for each state u with
ε-edge
t to u
do If u is not in
ε-Closure(T)
do add u
ε Closure
(T)
push u onto stack
Following example illustrates the method by constructing a DFA for the NFA.
Thompson's construction is an NFA from a regular expression.
The Thompson's construction is guided by the syntax of the regular expression with cases following the cases in the definition of regular expression.
Example: Use above algorithm, Thompson's construction, to construct NFA for the regular expression r = (a|b)* abb.
First constant the parse tree for r = (a|b)* abb.
figure
figure
figure
figure
figure
We have r_{5} = (a|b)*
figure
figure
We get r_{7 }=(a|b)* a
Similarly for r_{8} and r_{10} - use case 2
figure
figure
And get r_{11} by case 3b
figure
We have r = (a|b)*abb.