Operating Systems Notes

Lexical Analyzer

The main task of lexical Analyzer is to read a stream of characters as an input and produce a sequence of tokens such as names, keywords, punctuation marks etc.. for syntax analyzer.

It discards the white spaces and comments between the tokens and also keep track of line numbers.

<fig: 3.1 pp. 84>

Tokens, Patterns, Lexemes
Specification of Tokens
- Regular Expressions
- Notational Shorthand
Finite Automata
- Nondeterministic Finite Automata (NFA).
- Deterministic Finite Automata (DFA).
- Conversion of an NFA into a DFA.
- From a Regular Expression to an NFA.

Tokens, Patterns, Lexemes

Token

A lexical token is a sequence of characters that can be treated as a unit in the grammar of the programming languages.

Example of tokens:

Type token (id, num, real, . . . )
Punctuation tokens (IF, void, return, . . . )
Alphabetic tokens (keywords)

Example of non-tokens:

Comments, preprocessor directive, macros, blanks, tabs, newline, . . .

Patterns

There is a set of strings in the input for which the same token is produced as output. This set of strings is described by a rule called a pattern associated with the token.

Regular expressions are an important notation for specifying patterns.

For example, the pattern for the Pascal identifier token, id, is: id → letter (letter | digit)*.

Lexeme

A lexeme is a sequence of characters in the source program that is matched by the pattern for a token.

For example, the pattern for the RELOP token contains six lexemes ( =, < >, <, < =, >, >=) so the lexical analyzer should return a RELOP token to parser whenever it sees any one of the six.

3.3 Specification of Tokens

An alphabet or a character class is a finite set of symbols. Typical examples of symbols are letters and characters.

The set {0, 1} is the binary alphabet. ASCII and EBCDIC are two examples of computer alphabets.

Strings
A string over some alphabet is a finite sequence of symbol taken from that alphabet.

For example, banana is a sequence of six symbols (i.e., string of length six) taken from ASCII computer alphabet. The empty string denoted by , is a special string with zero symbols (i.e., string length is 0).

If x and y are two strings, then the concatenation of x and y, written xy, is the string formed by appending y to x.
For example, If x = dog and y = house, then xy = doghouse. For empty string, , we have S = S = S.

String exponentiation concatenates a string with itself a given number of times:

S²= SS or S.S
S³ = SSS or S.S.S
S⁴ = SSSS or S.S.S.S and so on

By definition S⁰ is an empty string, , and S` = S. For example, if x =ba and na then xy² = banana.

Languages
A language is a set of strings over some fixed alphabet. The language may contain a finite or an infinite number of strings.

Let L and M be two languages where L = {dog, ba, na} and M = {house, ba} then

Union: LUM = {dog, ba, na, house}
Concatenation: LM = {doghouse, dogba, bahouse, baba, nahouse, naba}
Expontentiation: L2 = LL
By definition: L⁰ ={} and L` = L

The kleene closure of language L, denoted by L*, is "zero or more Concatenation of" L.

L* = L⁰ U L` U L² U L³ . . . U Lⁿ. . .

For example, If L = {a, b}, then

L* = {, a, b, aa, ab, ab, ba, bb, aaa, aba, baa, . . . }

The positive closure of Language L, denoted by L⁺, is "one or more Concatenation of" L.

L⁺ = L` U L² U L³ . . . U Lⁿ . . .

For example, If L = {a, b}, then

L⁺ = {a, b, aa, ba, bb, aaa, aba, . . . }