Compiler Construction Lecture Notes

Why study compilers?

You may never write a commercial compiler, but that's not why we study compilers. We study compiler construction because the concepts and techniques involved (grammars, finite automata, symbol tables, tree manipulation, code generation) apply to many other software problems, and because understanding what the compiler does with your code makes you a better programmer.

Compilers - a brief overview

Purpose: translate a program written in one language (the source language) into an equivalent program in a lower-level language (the target language).

Phases:

Lexical Analysis:
Converts a sequence of characters into words, or tokens
Syntax Analysis:
Converts a sequence of tokens into a parse tree
Semantic Analysis:
Manipulates parse tree to verify symbol and type information
Intermediate Code Generation:
Converts parse tree into a sequence of intermediate code instructions
Optimization:
Manipulates intermediate code to produce a more efficient program
Final Code Generation:
Translates intermediate code into final (machine/assembly) code

Example of the Compilation Process

Consider in detail the example from Figure 1.10 on p. 13 of the book:
position = initial + rate * 60
The 30 or so characters of this single line of source code are first transformed by lexical analysis into a sequence of 7 tokens. Those tokens are then used to build a tree of height 4 during syntax analysis. Semantic analysis may transform the tree into one of height 5, which includes the type conversion necessary to use the integer constant 60 in real-valued arithmetic (the constant must be converted before it can be multiplied by the real-valued rate). Intermediate code generation uses a simple traversal algorithm to linearize the tree back into a sequence of machine-independent three-address-code instructions:
t1 = inttoreal(60)
t2 = id3 * t1
t3 = id2 + t2
id1 = t3
Optimization of the intermediate code allows the four instructions to be reduced to two machine-independent instructions. Final code generation might implement these two instructions using 5 machine instructions, in which the actual registers and addressing modes of the CPU are utilized.
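The two optimized instructions, following the book's example, are shown below; the integer-to-real conversion of 60 is performed at compile time and the extra copy through t3 is eliminated.

t1 = id3 * 60.0
id1 = id2 + t1

One possible final code sequence implementing them is: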
MOVF	id3, R2
MULF	#60.0, R2
MOVF	id2, R1
ADDF	R2, R1
MOVF	R1, id1

Overview of Lexical Analysis

The lexical analyzer is also known as a scanner. The term "token" usually refers to an object (or struct) containing complete information about a single lexical entity, but it is often also used to refer to the category ("class" if you prefer) of that entity. The term "lexeme" denotes the actual string of characters that comprises a particular occurrence ("instance" if you like) of a token.

Regular Expressions

The notation we use to precisely capture all the variations that a given category of token may take is called a "regular expression" (or, less formally, a "pattern"; the word "pattern" is vague, and there are many other notations for patterns besides regular expressions). Regular expressions are a shorthand notation for sets of strings. Before we can even talk about strings, we must first define an alphabet: the set of characters that can appear.
  1. Epsilon is a regular expression denoting the set containing the empty string
  2. Any letter in the alphabet is also a regular expression denoting the set containing a one-letter string consisting of that letter.
  3. For regular expressions r and s,
             r | s
    is a regular expression denoting the union of r and s
  4. For regular expressions r and s,
             r s
    is a regular expression denoting the set of strings consisting of a member of r followed by a member of s
  5. For regular expression r,
             r*
    is a regular expression denoting the set of strings consisting of zero or more occurrences of r.
  6. You can parenthesize a regular expression to specify operator precedence (otherwise, alternation is like plus, concatenation is like times, and closure is like exponentiation)
Although these operators are sufficient to describe all regular languages, in practice everybody uses extensions, such as character sets ([a-zA-Z]), r+ (one or more occurrences of r), and r? (an optional r); these are covered with the lex notation below.
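For example, over the alphabet {a, b}, the regular expression (a|b)*abb denotes the set of all strings of a's and b's that end in abb, such as abb, aabb, babb, and abababb. Similarly, writing letter for any letter and digit for any digit, letter (letter|digit)* describes typical programming language identifiers: a letter followed by zero or more letters or digits.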

Finite Automata

A finite automaton is an abstract, mathematical machine, also known as a finite state machine, with the following components:
  1. A set of states S
  2. A set of input symbols E (the alphabet)
  3. A transition function move(state, symbol) : new state(s)
  4. A start state S0
  5. A set of final states F
For a deterministic finite automaton (DFA), the function move(state, symbol) goes to at most one state, and symbol is never epsilon.

Finite automata correspond in a 1:1 relationship to transition diagrams; from any transition diagram one can write down the formal automaton in terms of items #1-#5 above, and vice versa. To draw the transition diagram for a finite automaton: draw a labeled circle for each state in S, mark the start state S0 with an incoming arrow, mark each final state in F with a double circle, and for each state s and input symbol a, draw an edge labeled a from s to each state in move(s, a).

DFA Implementation

The nice part about DFA's is that they can be implemented very efficiently on computers. What DFA does the following code correspond to? What is the corresponding regular expression? You can speed this code fragment up even further if you are willing to use goto's or write it in assembler.
int state = 0, input = getchar();
for (;;)
   switch (state) {
   case 0:
      switch (input) {
      case 'a': state = 1; input = getchar(); break;
      case 'b': input = getchar(); break;
      default: printf("dfa error\n"); exit(1);
      }
      break;
   case 1:
      switch (input) {
      case EOF: printf("accept\n"); exit(0);
      default: printf("dfa error\n"); exit(1);
      }
   }
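The same machine can also be implemented with a transition table instead of nested switches, which scales better as the number of states and symbols grows. The following is a minimal sketch, not code from the text; the dtran array name (echoing the notation used below for NFA-to-DFA conversion) and the use of -1 as a reject state are assumptions.

#include <stdio.h>
#include <stdlib.h>

static int dtran[2][256];	/* dtran[state][input]; -1 means reject */

int main(void)
{
   int s, c, state, input;

   /* build the table for the machine above: 0 --a--> 1, 0 --b--> 0 */
   for (s = 0; s < 2; s++)
      for (c = 0; c < 256; c++)
         dtran[s][c] = -1;
   dtran[0]['a'] = 1;
   dtran[0]['b'] = 0;

   state = 0;
   while ((input = getchar()) != EOF) {
      state = dtran[state][input];
      if (state < 0) { printf("dfa error\n"); exit(1); }
   }
   if (state == 1) { printf("accept\n"); return 0; }
   printf("dfa error\n"); exit(1);
}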

Nondeterministic Finite Automata (NFA's)

Notational convenience motivates more flexible machines in which the function move() can go to more than one state on a given input symbol, and in which some states can move to other states even without consuming an input symbol (an epsilon transition).

Fortunately, one can prove that for any NFA, there is an equivalent DFA. They are just a notational convenience. So, finite automata help us get from a set of regular expressions to a computer program that recognizes them efficiently.

Regular expressions can be converted automatically to NFA's

Each rule in the definition of regular expressions has a corresponding NFA; NFA's are composed using epsilon transitions. This is cited in the text as "Thompson's construction" (Algorithm 3.3). We will work examples such as (a|b)*abb in class and during lab; a sample transition listing for that NFA is given after the rules below.
  1. For epsilon, draw two states with a single epsilon transition.
  2. For any letter in the alphabet, draw two states with a single transition labeled with that letter.
  3. For regular expressions r and s, draw r | s by adding a new start state with epsilon transitions to the start states of r and s, and a new final state with epsilon transitions from each final state in r and s.
  4. For regular expressions r and s, draw rs by adding epsilon transitions from the final states of r to the start state of s.
  5. For regular expression r, draw r* by adding new start and final states, and epsilon transitions (a) from the new start state to the new final state (allowing zero occurrences), (b) from r's final state back to r's start state (allowing repetition), and (c) from the new start state to r's start state, and from r's final state(s) to the new final state.
  6. For a parenthesized regular expression (r), use the NFA for r unchanged.
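As promised above, here is a transition listing for the NFA that these rules (slightly simplified) produce for (a|b)*abb, numbered as in the text's Figure 3.27; any consistent numbering works. State 0 is the start state and state 10 is the only final state.

state 0: epsilon to 1 and 7	(enter the closure, or skip it)
state 1: epsilon to 2 and 4	(choose a or b)
state 2: a to 3
state 3: epsilon to 6
state 4: b to 5
state 5: epsilon to 6
state 6: epsilon to 1 and 7	(repeat the closure, or leave it)
state 7: a to 8
state 8: b to 9
state 9: b to 10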

NFA's can be converted automatically to DFA's

In: NFA N
Out: DFA D
Method: Construct transition table Dtran (a.k.a. the "move function"). Each DFA state is a set of NFA states. Dtran simulates in parallel all possible moves N can make on a given string.

Operations to keep track of sets of NFA states:

e_closure(s)
set of states reachable from state s via epsilon
e_closure(T)
set of states reachable from any state in set T via epsilon
move(T,a)
set of states to which there is an NFA transition from states in T on symbol a

Algorithm:

Dstates := {e_closure(start_state)}
while T := unmarked_member(Dstates) do {
	mark(T)
	for each input symbol a do {
		U := e_closure(move(T,a))
		if not member(Dstates, U) then
			insert(Dstates, U)
		Dtran[T,a] := U
	}
}
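Run on the (a|b)*abb NFA listed earlier, the algorithm produces a five-state DFA. Using the NFA state numbering from that listing, the DFA states are:

A = e_closure(0) = {0, 1, 2, 4, 7}
B = {1, 2, 3, 4, 6, 7, 8}
C = {1, 2, 4, 5, 6, 7}
D = {1, 2, 4, 5, 6, 7, 9}
E = {1, 2, 4, 5, 6, 7, 10}

with transition table Dtran:

	a	b
A	B	C
B	B	D
C	B	C
D	B	E
E	B	C

A is the start state, and E is the only final state because it is the only set containing NFA state 10.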

lex(1) and flex(1)

These programs generally take a lexical specification given in a .l file and create a corresponding C language lexical analyzer in a file named lex.yy.c. The lexical analyzer is then linked with the rest of your compiler.

The C code generated by lex has the following public interface. Note the use of global variables instead of parameters, and the use of the prefix yy to distinguish scanner names from your program names. This prefix is also used in the YACC parser generator.

FILE *yyin;	/* set this variable prior to calling yylex() */
int yylex();	/* call this function once for each token */
char yytext[];	/* yylex() writes the token's lexeme to this array */
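As a quick illustration of this interface, a standalone driver might look like the following sketch; the filename handling and the decision to print each token are choices made here, while yylex() returning 0 at end of input is the lex convention.

#include <stdio.h>

extern FILE *yyin;
extern int yylex(void);
extern char yytext[];

int main(int argc, char *argv[])
{
   int category;

   if (argc < 2 || (yyin = fopen(argv[1], "r")) == NULL) {
      fprintf(stderr, "usage: %s file\n", argv[0]);
      return 1;
   }
   while ((category = yylex()) != 0)	/* 0 means end of input */
      printf("%d\t%s\n", category, yytext);
   return 0;
}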

The .l file format consists of a mixture of lex syntax and C code fragments. The percent sign (%) is used to signify lex elements. The whole file is divided into three sections separated by %%:

   header
%%
   body
%%
   helper functions

The header consists of C code fragments enclosed in %{ and %}, as well as macro definitions, each consisting of a name and the regular expression that the name will denote. Lex macros are invoked explicitly by enclosing the macro name in curly braces. Following are some example lex macros.

letter		[a-zA-Z]
digit		[0-9]
ident		{letter}({letter}|{digit})*

The body consists of a sequence of regular expressions for different token categories and other lexical entities. Each regular expression can have a C code fragment enclosed in curly braces that executes when that regular expression is matched. For most of the regular expressions, this code fragment (also called a semantic action) consists of returning an integer that identifies the token category to the rest of the compiler, particularly for use by the parser to check syntax. Some typical regular expressions and semantic actions might include:

" "		{ /* no-op, discard whitespace */ }
{ident}		{ return IDENTIFIER; }
"*"		{ return ASTERISK; }
You also need regular expressions for lexical errors such as unterminated character constants, or illegal characters.

The helper functions in a lex file typically compute lexical attributes, such as the actual integer or string values denoted by literals.
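Putting the three sections together, a minimal .l file might look like the following sketch. The category codes IDENTIFIER and NUMBER and the decision to exit on an illegal character are assumptions made for this example; a real compiler would share its category codes with the parser.

%{
#include <stdio.h>
#include <stdlib.h>
#define IDENTIFIER 257
#define NUMBER     258
%}

letter		[a-zA-Z]
digit		[0-9]

%%
[ \t\n]+			{ /* discard whitespace */ }
{letter}({letter}|{digit})*	{ return IDENTIFIER; }
{digit}+			{ return NUMBER; }
.				{ fprintf(stderr, "lexical error: '%s'\n", yytext); exit(1); }
%%
int yywrap() { return 1; }	/* helper: signal that there is no more input */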

Lex extended regular expressions

Lex further extends the regular expressions with several helpful operators. Lex's regular expressions include:
c
normal characters mean themselves
\c
backslash escapes remove the meaning from most operator characters. Inside character sets and quotes, backslash performs C-style escapes.
"s"
Double quotes match the enclosed string literally, character for character. This is particularly useful for multi-character operators and may be more readable than using backslash multiple times.
[s]
This character set operator matches any one character among those in s.
[^s]
A negated-set matches any one character not among those in s.
.
The dot operator matches any one character except newline: [^\n]
r*
match r 0 or more times.
r+
match r 1 or more times.
r?
match r 0 or 1 time.
r{m,n}
match r between m and n times.
r1r2
concatenation. match r1 followed by r2
r1|r2
alternation. match r1 or r2
(r)
parentheses override precedence; (r) matches exactly what r matches
r1/r2
lookahead. match r1 when r2 follows, without consuming r2
^r
match r only when it occurs at the beginning of a line
r$
match r only when it occurs at the end of a line
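A few examples combining these operators (sketches, not from the text):

[0-9]+					an integer literal
[0-9]+"."[0-9]+([eE][+-]?[0-9]+)?	a real literal with an optional exponent
0[xX][0-9a-fA-F]+			a hexadecimal literal
"=="|"!="|"<="|">="			two-character relational operators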

Lexical Attributes and Token Objects

Besides the token's category, the rest of the compiler may need several pieces of information about a token in order to perform semantic analysis, code generation, and error handling. These are stored in an object instance of class Token, or in C, a struct. The fields are generally something like:
struct token {
   int category;
   char *text;
   int linenumber;
   int column;
   char *filename;
   union literal value;
};
The union literal will hold computed values of integers, real numbers, and strings.
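A sketch of such a union follows; the member names are illustrative assumptions. It must be declared before struct token so that the value field above compiles.

union literal {
   int ival;		/* computed value of an integer literal */
   double dval;		/* computed value of a real literal */
   char *sval;		/* string literal contents, after escapes are processed */
};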

Lexical Analysis and the Symbol Table

In many compilers, the symbol table and memory management components of the compiler interact with several phases of compilation, starting with lexical analysis.

Every occurrence of an identifier would otherwise force the scanner to allocate a fresh copy of its lexeme; a hash table or other efficient data structure can avoid this duplication by storing each distinct lexeme exactly once and handing out pointers to the shared copy. The software engineering design pattern to use is called the "flyweight". Example: Figure 3.18, p. 109. Use "install_id()" instead of "strdup()" to avoid duplication in the lexical data.
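The following is a minimal sketch of the flyweight idea in C, assuming a fixed-size chained hash table; install_id() is the name used in the text, but the hash function and table layout here are assumptions.

#include <stdlib.h>
#include <string.h>

#define TABLE_SIZE 211

struct entry { char *text; struct entry *next; };
static struct entry *table[TABLE_SIZE];

static unsigned hash(const char *s)
{
   unsigned h = 0;
   while (*s)
      h = h * 65599 + (unsigned char)*s++;
   return h % TABLE_SIZE;
}

/* return the canonical shared copy of lexeme, allocating on first sight */
char *install_id(const char *lexeme)
{
   unsigned h = hash(lexeme);
   struct entry *e;

   for (e = table[h]; e != NULL; e = e->next)
      if (strcmp(e->text, lexeme) == 0)
         return e->text;		/* already installed: share it */
   e = malloc(sizeof *e);
   e->text = strdup(lexeme);		/* the only copy ever made */
   e->next = table[h];
   table[h] = e;
   return e->text;
}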

Syntax Analysis

Parsing is the act of performing syntax analysis to verify an input program's compliance with the source language. A by-product of this process is typically a tree that represents the structure of the program.

Context Free Grammars

A context free grammar G has four components:
  1. A set of terminal symbols, the token categories produced by the lexical analyzer
  2. A set of nonterminal symbols (syntactic variables), disjoint from the terminals
  3. A designated start symbol, one of the nonterminals
  4. A set of productions, each of which rewrites a nonterminal as a (possibly empty) sequence of terminals and nonterminals
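For example, the standard textbook grammar for arithmetic expressions has nonterminals {E, T, F}, terminals {+, *, (, ), id}, start symbol E, and productions:

E -> E + T | T
T -> T * F | F
F -> ( E ) | id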