The main task of lexical Analyzer is to read a stream of characters as an input and produce a sequence of tokens such as names, keywords, punctuation marks etc.. for syntax analyzer.

It discards the white spaces and comments between the tokens and also keep track of line numbers.

<fig: 3.1 pp. 84>

- Tokens, Patterns, Lexemes
- Specification of Tokens
- Regular Expressions
- Notational Shorthand

- Finite Automata
- Nondeterministic Finite Automata (NFA).
- Deterministic Finite Automata (DFA).
- Conversion of an NFA into a DFA.
- From a Regular Expression to an NFA.

A lexical token is a sequence of characters that can be treated as a unit in the grammar of the programming languages.

**Example of tokens:**

- Type token (id, num, real, . . . )
- Punctuation tokens (IF, void, return, . . . )
- Alphabetic tokens (keywords)

**Example of non-tokens:**

- Comments, preprocessor directive, macros, blanks, tabs, newline, . . .

There is a set of strings in the input for which the same token is produced as output. This set of strings is described by a rule called a pattern associated with the token.

Regular expressions are an important notation for specifying patterns.

For example, the pattern for the Pascal identifier token, id, is: id → letter (letter | digit)*.

A lexeme is a sequence of characters in the source program that is matched by the pattern for a token.

For example, the pattern for the RELOP token contains six lexemes ( =, < >, <, < =, >, >=) so the lexical analyzer should return a RELOP token to parser whenever it sees any one of the six.

An alphabet or a character class is a finite set of symbols. Typical examples of symbols are letters and characters.

The set {0, 1} is the binary alphabet. ASCII and EBCDIC are two examples of computer alphabets.

**Strings**

A string over some alphabet is a finite sequence of symbol taken from that
alphabet.

For example, banana is a sequence of six symbols (i.e., string of length six) taken from ASCII computer alphabet. The empty string denoted by , is a special string with zero symbols (i.e., string length is 0).

If x and y are two strings, then the concatenation of x and y, written xy, is
the string formed by appending y to x.

For example, If x = dog and y = house, then xy = doghouse. For empty string,
, we
have S
= S
= S.

String exponentiation concatenates a string with itself a given number of times:

S

^{2 }= SS or S.S

S^{3}= SSS or S.S.S

S^{4}= SSSS or S.S.S.S and so on

By definition S^{0} is an empty string,
,
and S` = S. For example, if x =ba and na then xy^{2} = banana.

**Languages**

A language is a set of strings over some fixed alphabet. The language may
contain a finite or an infinite number of strings.

Let L and M be two languages where L = {dog, ba, na} and M = {house, ba} then

- Union: LUM = {dog, ba, na, house}
- Concatenation: LM = {doghouse, dogba, bahouse, baba, nahouse, naba}
- Expontentiation: L2 = LL
- By definition: L
^{0}={} and L` = L

The kleene closure of language L, denoted by L*, is "zero or more Concatenation of" L.

L* = L

^{0}U L` U L^{2}U L^{3}. . . U L^{n }. . .

For example, If L = {a, b}, then

L* = {, a, b, aa, ab, ab, ba, bb, aaa, aba, baa, . . . }

The positive closure of Language L, denoted by L^{+}, is "one or more
Concatenation of" L.

L

^{+ }= L` U L^{2}U L^{3}. . . U L^{n }. . .

For example, If L = {a, b}, then

L

^{+ }= {a, b, aa, ba, bb, aaa, aba, . . . }