Lexical Analyzer

The main task of lexical Analyzer is to read a stream of characters as an input and produce a sequence of tokens such as names, keywords, punctuation marks etc.. for syntax analyzer.

It discards the white spaces and comments between the tokens and also keep track of line numbers.

<fig: 3.1 pp. 84>


Tokens, Patterns, Lexemes


A lexical token is a sequence of characters that can be treated as a unit in the grammar of the programming languages.

Example of tokens:

Example of non-tokens:


There is a set of strings in the input for which the same token is produced as output. This set of strings is described by a rule called a pattern associated with the token.

Regular expressions are an important notation for specifying patterns.

For example, the pattern for the Pascal identifier token, id, is: id letter (letter | digit)*.


A lexeme is a sequence of characters in the source program that is matched by the pattern for  a token.

For example, the pattern for the RELOP token contains six lexemes ( =, < >, <, < =, >, >=) so the lexical analyzer should return a RELOP token to parser whenever it sees any one of the six.

3.3 Specification of Tokens

An alphabet or a character class is a finite set of symbols. Typical examples of symbols are letters and characters.

The set {0, 1} is the binary alphabet. ASCII and EBCDIC are two examples of computer alphabets.

A string over some alphabet is a finite sequence of symbol taken from that alphabet.

For example, banana is a sequence of six symbols (i.e., string of length six) taken from ASCII  computer alphabet. The empty string denoted by , is a special string with zero symbols (i.e., string length is 0).

If x and y are two strings, then the concatenation of x and y, written xy, is the string formed by appending y to x.
For example, If x = dog and y = house, then xy = doghouse. For empty string, , we have S = S = S.

String exponentiation concatenates a string with itself a given number of times:

S2 = SS or S.S
S3 = SSS or S.S.S
S4 = SSSS or S.S.S.S and so on

By definition S0 is an empty string, , and S` = S. For example, if x =ba and na then xy2 = banana.

A language is a set of strings over some fixed alphabet. The language may contain a finite or an infinite number of strings.

Let L and M be two languages  where L = {dog, ba, na} and M = {house, ba} then

The kleene closure of language L, denoted by L*, is "zero or more Concatenation of" L.

L* = L0 U L` U L2 U L3 . . . U Ln . . .

For example, If  L = {a, b}, then

L* = {, a, b, aa, ab, ab, ba, bb, aaa, aba, baa, . . . }

The positive closure of Language L, denoted by L+, is "one or more Concatenation of" L.

L+  = L` U L2 U L3 . . . U Ln  . . .

For example, If L = {a, b}, then

L+  = {a, b, aa, ba, bb, aaa, aba, . . . }