Compiler Design Lecture Notes
Syllabus
UNIT – I
(15 Periods)
Introduction to compiling: Compilers, The Phases of a compiler.
Simple one-pass compiler: Overview, syntax definition, syntax-directed translation, parsing, a translator for
simple expressions.
Lexical Analysis: The role of the lexical analyzer, input buffering, specification of tokens, Recognition
of tokens, implementing transition diagrams, a language for specifying lexical analyzers.
Syntax analysis: Top down parsing - Recursive descent parsing, Predictive parsers.
UNIT – II
(15 Periods)
Syntax Analysis: Bottom up parsing - Shift Reduce parsing, LR Parsers – Construction of SLR,
Canonical LR and LALR parsing techniques, Parser generators – Yacc Tool.
Syntax – Directed Translation: Syntax Directed definition, construction of syntax trees, Bottom-up
evaluation of S – attributed definitions.
UNIT – III
(14 Periods)
Runtime Environment: Source language issues, Storage organization, Storage-allocation strategies,
Access to nonlocal names, Parameter passing.
Symbol Tables: Symbol table entries, Data structures for symbol tables, representing scope information.
UNIT – IV
(16 Periods)
Intermediate code Generation: Intermediate languages, Declarations, Assignment statements, Boolean
expressions, Backpatching.
Code Generation: Issues in the design of a code generator, the target machines, Basic blocks and flow
graphs, Next use information, A simple code generator.
TEXT BOOK:
1. Alfred V. Aho, Ravi Sethi, Jeffrey D. Ullman, “Compilers: Principles, Techniques and Tools”, Pearson
Education, 2007.
REFERENCE BOOKS:
1. Alfred V. Aho, Jeffrey D. Ullman, “Principles of Compiler Design”, Narosa Publishing.
2. John R. Levine, Tony Mason, Doug Brown, “lex & yacc”, O’Reilly.
3. Andrew W. Appel, “Modern Compiler Implementation in C”, Cambridge University Press.
4. Keith Cooper, Linda Torczon, “Engineering a Compiler”, Elsevier.
5. Kenneth C. Louden, “Compiler Construction”, Thomson.
UNIT-I
1. Introduction to Compiling
We have learnt that any computer system is made of hardware and software. The hardware understands a
language that humans cannot use directly, so we write programs in a high-level language, which is easier
for us to understand and remember. These programs are then fed into a series of tools and OS components
to obtain the desired code that can be used by the machine. This is known as the Language Processing System.
Preprocessor:
A preprocessor produces input to compilers. It may perform the following functions:
1. Macro processing: A preprocessor may allow a user to define macros that are shorthands for longer
constructs.
2. File inclusion: A preprocessor may include header files into the program text.
3. Rational preprocessors: these preprocessors augment older languages with more modern flow-of-control
and data-structuring facilities.
4. Language extensions: these preprocessors attempt to add capabilities to the language by means of
built-in macros.
Compiler:
A compiler is a translator program that reads a program written in a high-level language (HLL), the source
program, and translates it into an equivalent program in machine-level language (MLL), the target program.
An important part of a compiler's job is reporting errors in the source program to the programmer.
Executing a program written in an HLL is basically a two-step process. The source program must first be
compiled, i.e., translated into an object program. Then the resulting object program is loaded into
memory and executed.
Assembler:
Programmers found it difficult to write or read programs in machine language. They began to use
mnemonics (symbols) for each machine instruction, which they would subsequently translate into machine
language. Such a mnemonic machine language is now called an assembly language. The input to an
assembler program is called the source program; the output is a machine language translation (object
program).
Interpreter:
An interpreter is a program that appears to execute a source program as if it were machine language.
Languages such as BASIC, SNOBOL and LISP can be translated using interpreters; JAVA also uses an
interpreter. The process of interpretation can be carried out in the following phases:
1. Lexical analysis
2. Syntax analysis
3. Semantic analysis
4. Direct Execution
Advantages:
Modification of the user program can easily be made and implemented as execution proceeds.
The type of object that a variable denotes may change dynamically.
Debugging a program and finding errors is a simplified task for a program used for interpretation.
The interpreter for the language makes it machine independent.
Disadvantages:
The execution of the program is slower.
Memory consumption is more.
Loader and Link Editor:
Once the assembler produces an object program, that program must be placed into memory and
executed. The assembler could place the object program directly in memory and transfer control to it,
thereby causing the machine language program to be executed. However, this would waste memory by leaving
the assembler in core while the user's program was being executed. Also, the programmer would have to
retranslate his program with each execution, thus wasting translation time. To overcome these problems of
wasted translation time and memory, system programmers developed another component called a loader.
"A loader is a program that places programs into memory and prepares them for execution." It would be
more efficient if subroutines could be translated into object form which the loader could "relocate" directly
behind the user's program. The task of adjusting programs so they may be placed in arbitrary core
locations is called relocation. Relocating loaders perform four functions: allocation, linking, relocation
and loading.
Translator:
A translator is a program that takes as input a program written in one language and produces as output a
program in another language. Besides program translation, the translator performs another very important
role: error detection. Any violation of the HLL specification is detected and reported to the
programmer. The important roles of a translator are:
1. Translating the HLL program input into an equivalent machine language (ML) program.
2. Providing diagnostic messages wherever the programmer violates the specification of the HLL.
List of Compilers:
1. Ada compilers
2. ALGOL compilers
3. BASIC compilers
4. C# compilers
5. C compilers
6. C++ compilers
7. COBOL compilers
8. Common Lisp compilers
9. ECMAScript interpreters
10. Fortran compilers
11. Java compilers
12. Pascal compilers
13. PL/I compilers
14. Python compilers
15. Smalltalk compilers
THE PHASES OF A COMPILER
A compiler can broadly be divided into two phases based on the way it compiles.
Analysis Phase:
Known as the front end of the compiler, the analysis phase reads the source program, divides it into core
parts, and checks for lexical, grammar and syntax errors. The analysis phase generates an intermediate
representation of the source program and a symbol table, which are fed to the synthesis phase as input.
Synthesis Phase:
Known as the back end of the compiler, the synthesis phase generates the target program with the help of
the intermediate source code representation and the symbol table.
A compiler can have many phases and passes.
Pass : A pass refers to the traversal of a compiler through the entire program.
Phase : A phase of a compiler is a distinguishable stage, which takes input from the previous stage,
processes and yields output that can be used as input for the next stage. A pass can have more than one
phase.
The compilation process is a sequence of various phases. Each phase takes input from its previous stage,
has its own representation of the source program, and feeds its output to the next phase of the compiler.
Let us understand the phases of a compiler.
Lexical Analysis:
The LA or scanner reads the source program one character at a time and separates it into a
sequence of atomic units called tokens. The usual tokens are keywords such as WHILE, FOR, DO or IF,
identifiers such as X or NUM, operator symbols such as <, <=, +, >, >=, and punctuation symbols such as
parentheses or commas. The output of the lexical analyzer is a stream of tokens, which is passed to the
next phase.
Syntax Analysis:
The second phase is called syntax analysis or parsing. In this phase expressions, statements, declarations,
etc. are identified by using the results of lexical analysis. The parser takes the tokens produced by lexical
analysis as input and generates a parse tree (or syntax tree). In this phase, token arrangements are checked
against the source code grammar, i.e., the parser checks if the expression made by the tokens is syntactically
correct.
Semantic Analysis:
Semantic analysis checks whether the parse tree constructed follows the rules of the language. For example,
it checks that values are assigned between compatible data types and flags errors such as adding a string
to an integer. The semantic analyzer also keeps track of identifiers, their types and expressions, and whether
identifiers are declared before use. It produces an annotated syntax tree as output.
Intermediate Code Generation:
After semantic analysis, the compiler generates an intermediate code of the source code for the target
machine. It represents a program for some abstract machine. It is in between the high-level language and
the machine language. This intermediate code should be generated in such a way that it makes it easier to
be translated into the target machine code. This phase bridges the analysis and synthesis phases of
translation.
Code Optimization:
The next phase performs code optimization on the intermediate code. Optimization can be viewed as
removing unnecessary code lines and arranging the sequence of statements in order to speed
up program execution without wasting resources (CPU, memory). This optional phase improves the
intermediate code so that the output runs faster and takes less space.
Code Generation:
The last phase of translation is code generation. A number of optimizations to reduce the length of the
machine language program are carried out during this phase. The output of the code generator is the
machine language program for the specified computer.
Table Management (or) Book-keeping:
This portion of the compiler keeps track of the names used by the program and records essential information
about each. The data structure used to record this information is called a ‘Symbol Table’. It is a data
structure maintained throughout all the phases of a compiler. All the identifiers’ names along with their
types are stored here. The symbol table makes it easier for the compiler to quickly search for an identifier
record and retrieve it. The symbol table is also used for scope management.
Error Handlers:
It is invoked when a flaw (error) in the source program is detected. The output of the LA is a stream of
tokens, which is passed to the next phase, the syntax analyzer or parser. The SA groups the tokens together
into syntactic structures called expressions. Expressions may further be combined to form statements. The
syntactic structure can be regarded as a tree whose leaves are the tokens; such trees are called parse trees.
Example:
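The figure for this example is not reproduced here; the sketch below follows the classic textbook walk-through of the statement position := initial + rate * 60 through the phases (the register names and the inttoreal conversion are as in that example, not fixed by these notes).
Lexical analysis produces the token stream id1 := id2 + id3 * 60 and enters position, initial and rate into the symbol table.
Syntax analysis builds a tree with := at the root: id1 on the left and, on the right, a + node whose children are id2 and a * node over id3 and 60.
Semantic analysis inserts a conversion of the integer 60 to a real: id1 := id2 + id3 * inttoreal(60).
Intermediate code generation emits three-address code:
    temp1 := inttoreal(60)
    temp2 := id3 * temp1
    temp3 := id2 + temp2
    id1 := temp3
Code optimization shortens this to:
    temp1 := id3 * 60.0
    id1 := id2 + temp1
Code generation then produces target code such as:
    MOVF id3, R2
    MULF #60.0, R2
    MOVF id2, R1
    ADDF R2, R1
    MOVF R1, id1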
Lexical Analysis
To identify the tokens we need some method of describing the possible tokens that can appear in
the input stream. For this purpose we introduce regular expressions, a notation that can be used to
describe essentially all the tokens of a programming language.
Secondly, having decided what the tokens are, we need some mechanism to recognize these in the
input stream. This is done by the token recognizers, which are designed using transition diagrams
and finite automata.
ROLE OF LEXICAL ANALYZER
The LA is the first phase of a compiler. Its main task is to read the input characters and produce as output a
sequence of tokens that the parser uses for syntax analysis.
Upon receiving a ‘get next token’ command from the parser, the lexical analyzer reads input characters
until it can identify the next token. The LA returns to the parser a representation for the token it has found.
The representation is an integer code if the token is a simple construct such as a parenthesis, comma
or colon.
The LA may also perform certain secondary tasks at the user interface. One such task is stripping out from
the source program comments and white space in the form of blank, tab and newline characters.
Another is correlating error messages from the compiler with the source program.
INPUT BUFFERING
The LA scans the characters of the source program one at a time to discover tokens. Because a large
amount of time can be consumed scanning characters, specialized buffering techniques have been
developed to reduce the amount of overhead required to process an input character.
Buffering techniques:
1. Buffer pairs
2. Sentinels
The lexical analyzer scans the characters of the source program one at a time to discover tokens. Often,
however, many characters beyond the next token may have to be examined before the next token itself
can be determined. For this and other reasons, it is desirable for the lexical analyzer to read its input from
an input buffer. The figure shows a buffer divided into two halves of, say, 100 characters each. One
pointer marks the beginning of the token being discovered. A lookahead pointer scans ahead of the
beginning point until the token is discovered. We view the position of each pointer as being between the
character last read and the character next to be read. In practice each buffering scheme adopts one
convention: either a pointer is at the symbol last read or at the symbol it is ready to read.
The distance the lookahead pointer may have to travel past the actual token may be large. For example,
in a PL/I program we may see DECLARE (ARG1, ARG2, ..., ARGn) without knowing whether DECLARE is a
keyword or an array name until we see the character that follows the right parenthesis. In either case, the
token itself ends at the second E. If the lookahead pointer travels beyond the buffer half in which it began,
the other half must be loaded with the next characters from the source file. Since the buffer shown in the
figure above is of limited size, there is an implied constraint on how much lookahead can be used before
the next token is discovered. In the above example, if the lookahead traveled to the left half and all the way
through the left half to the middle, we could not reload the right half, because we would lose characters
that had not yet been grouped into tokens. While we can make the buffer larger if we choose, or use
another buffering scheme, we cannot ignore the fact that lookahead is limited.
if forward at end of first half then begin
Reload second half;
forward := forward+1
end
else if forward at end of second half then begin
Reload first half;
Move forward to beginning of first half
end
else forward := forward+1;
Code to advance forward pointer
The above code requires two tests for each advance of the forward pointer. We can reduce the two tests to
one if we extend each buffer half to hold a sentinel character at the end. The sentinel is a special character
that cannot be part of the source program; a natural choice is eof.
[Figure: buffer halves holding the text E = M * C * * 2, with an eof sentinel at the end of each half and a second eof marking the end of the input.]
Most of the time the code performs only one test to see whether forward points to an eof.
forward := forward + 1;
if forward↑ = eof then begin
    if forward at end of first half then begin
        reload second half;
        forward := forward + 1
    end
    else if forward at end of second half then begin
        reload first half;
        move forward to beginning of first half
    end
    else /* eof within a buffer signifies the end of input */
        terminate lexical analysis
end
Lookahead code with sentinels
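A minimal C sketch of this buffer-pair-with-sentinels scheme follows (an illustrative program, not the book's exact code): N is the half size, '\0' stands in for the eof sentinel, and the names fill_half and next_char are assumptions introduced here.
#include <stdio.h>

#define N 4096                      /* size of each buffer half */
#define SENTINEL '\0'               /* stands in for the eof sentinel */

static char buf[2 * N + 2];         /* two halves plus one sentinel slot each */
static char *forward;               /* the lookahead pointer */
static FILE *src;                   /* source file, opened by the caller */

/* Load one half with up to N characters and terminate it with the sentinel. */
static void fill_half(char *half)
{
    size_t n = fread(half, 1, N, src);
    half[n] = SENTINEL;             /* n < N means we hit the real end of input */
}

static void init(FILE *f)
{
    src = f;
    fill_half(buf);                 /* first half is buf[0..N-1], sentinel at buf[N] */
    forward = buf;
}

/* Return the next character, or EOF; only one test in the common case. */
static int next_char(void)
{
    if (*forward == SENTINEL) {
        if (forward == buf + N) {                /* end of first half */
            fill_half(buf + N + 1);              /* second half is buf[N+1..2N] */
            forward = buf + N + 1;
        } else if (forward == buf + 2 * N + 1) { /* end of second half */
            fill_half(buf);
            forward = buf;
        } else {
            return EOF;                          /* sentinel inside a half: end of input */
        }
        if (*forward == SENTINEL) return EOF;    /* reloaded half is empty */
    }
    return (unsigned char)*forward++;
}

int main(void)
{
    long count = 0;
    init(stdin);
    while (next_char() != EOF)
        count++;
    printf("%ld characters read\n", count);
    return 0;
}
Using '\0' as the sentinel assumes the source text contains no NUL characters, which is why the book prefers a dedicated eof character.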
Token: Token is a sequence of characters that can be treated as a single logical entity.
Typical tokens are: 1) Identifiers 2) keywords 3) operators 4) special symbols 5) constants
Pattern: A set of strings in the input for which the same token is produced as output. This set of strings is
described by a rule called a pattern associated with the token.
Lexeme: A lexeme is a sequence of characters in the source program that is matched by the pattern for a
token.
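For example (following the standard textbook table):
Token    Sample Lexemes          Informal Description of Pattern
if       if                      the characters i, f
relop    <, <=, =, <>, >, >=     < or <= or = or <> or > or >=
id       pi, count, D2           letter followed by letters and digits
num      3.1416, 0, 6.02E23      any numeric constant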
LEXICAL ERRORS
Lexical errors are the errors thrown by the lexer when it is unable to continue, i.e., when there is no
way to recognize a lexeme as a valid token. Syntax errors, on the other hand, are thrown by the parser
when a given set of already recognized valid tokens does not match any of the right sides of the
grammar rules. Simple panic-mode error handling requires that we return to a high-level
parsing function when a parsing or lexical error is detected.
Error-recovery actions are:
i. Delete one character from the remaining input.
ii. Insert a missing character in to the remaining input.
iii. Replace a character by another character.
iv. Transpose two adjacent characters.
REGULAR EXPRESSIONS
A regular expression is a formula that describes a possible set of strings. The components of a regular expression are:
x       the character x
.       any character, usually except newline
[xyz]   any one of the characters x, y, z
R?      an R or nothing (i.e., an optional R)
R*      zero or more occurrences of R
R+      one or more occurrences of R
R1R2    an R1 followed by an R2
R1|R2   either an R1 or an R2
A token is either a single string or one of a collection of strings of a certain type. If we view the set of
strings in each token class as a language, we can use the regular-expression notation to describe tokens.
Consider an identifier, which is defined to be a letter followed by zero or more letters or digits.
In regular expression notation we would write.
Identifier = letter (letter | digit)*
Here are the rules that define the regular expressions over an alphabet Σ:
• ε is a regular expression denoting { ε }, that is, the language containing only the empty string.
• For each ‘a’ in Σ, a is a regular expression denoting { a }, the language with only one string,
consisting of the single symbol ‘a’.
• If R and S are regular expressions denoting the languages L(R) and L(S), then (R)|(S), (R)(S) and (R)*
are regular expressions denoting L(R) ∪ L(S), L(R)L(S) and (L(R))* respectively.
REGULAR DEFINITIONS
For notational convenience, we may wish to give names to regular expressions and to define regular
expressions using these names as if they were symbols.
Identifiers are the set or string of letters and digits beginning with a letter. The following regular
definition provides a precise specification for this class of string.
Example 1:
ab*|cd? is equivalent to (a(b*)) | (c(d?))
Pascal identifier:
letter → A | B | …… | Z | a | b | …… | z
digit → 0 | 1 | 2 | …. | 9
id → letter (letter | digit)*
RECOGNITION OF TOKENS
We have learned how to express patterns using regular expressions. Now, we must study how to take the patterns
for all the needed tokens and build a piece of code that examines the input string and finds a prefix that is
a lexeme matching one of the patterns.
stmt → if expr then stmt
     | if expr then stmt else stmt
     | ε
expr → term relop term
     | term
term → id
     | number
For relop, we use the comparison operators of languages like Pascal or SQL, where = is “equals” and <>
is “not equals”, because it presents an interesting structure of lexemes.
The terminals of the grammar, which are if, then, else, relop, id and number, are the names of tokens as far
as the lexical analyzer is concerned. The patterns for the tokens are described using regular definitions:
digit → [0-9]
digits → digit+
number → digits(.digits)?(E[+-]?digits)?
letter → [A-Za-z]
id → letter(letter|digit)*
if → if
then → then
else → else
relop → < | <= | = | <> | > | >=
In addition, we assign the lexical analyzer the job of stripping out white space, by recognizing the “token”
ws defined by:
ws → (blank | tab | newline)+
Here, blank, tab and newline are abstract symbols that we use to express the ASCII characters of the same
names. Token ws is different from the other tokens in that, when we recognize it, we do not return it to the
parser, but rather restart the lexical analysis from the character that follows the white space. It is the
following token that gets returned to the parser.
Lexeme    Token Name    Attribute Value
ws        --            --
if        if            --
then      then          --
else      else          --
id        id            pointer to table entry
num       num           pointer to table entry
<         relop         LT
<=        relop         LE
=         relop         EQ
<>        relop         NE
>         relop         GT
>=        relop         GE
TRANSITION DIAGRAM:
A transition diagram has a collection of nodes or circles, called states. Each state represents a condition
that could occur during the process of scanning the input looking for a lexeme that matches one of several
patterns.
Edges are directed from one state of the transition diagram to another. Each edge is labeled by a symbol or
a set of symbols.
If we are in a state s and the next input symbol is a, we look for an edge out of state s labeled a. If
we find such an edge, we advance the forward pointer and enter the state of the transition diagram to
which that edge leads.
Some important conventions about transition diagrams are:
1. Certain states are said to be accepting or final. These states indicate that a lexeme has been found,
although the actual lexeme may not consist of all positions between the lexemeBegin and forward pointers.
We always indicate an accepting state by a double circle.
2. In addition, if it is necessary to retract the forward pointer one position, then we shall additionally place
a * near that accepting state.
3. One state is designated the start, or initial, state; it is indicated by an edge labeled “start” entering from
nowhere. The transition diagram always begins in the start state, before any input symbols have been read.
As an intermediate step in the construction of a LA, we first produce a stylized flowchart, called a
transition diagram. Positions in a transition diagram are drawn as circles and are called states.
The above TD is for an identifier, defined to be a letter followed by any number of letters or digits. A
sequence of transition diagrams can be converted into a program to look for the tokens specified by the
diagrams. Each state gets a segment of code.
case 9:  c = nextchar( );
         if (isletter(c)) state = 10;
         else state = fail( );
         break;
case 10: c = nextchar( );
         if (isletter(c)) state = 10;
         else if (isdigit(c)) state = 10;
         else state = 11;
         break;
case 11: retract(1);
         install_id( );
         return (gettoken( ));
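These segments can be assembled into a small, self-contained recognizer. The C sketch below is illustrative, not the book's exact routines: nextchar( ) and retract( ) work on an in-memory string, and a simple printout stands in for install_id( ) and gettoken( ).
#include <stdio.h>
#include <ctype.h>

static const char *input = "count1 + 37";
static int pos = 0;                     /* the forward pointer */

static int nextchar(void) { return (unsigned char)input[pos++]; }
static void retract(int n) { pos -= n; }

int main(void)
{
    int state = 9, start = pos, c;
    for (;;) {
        switch (state) {
        case 9:                         /* start state: must see a letter */
            c = nextchar();
            if (isalpha(c)) state = 10;
            else { printf("not an identifier\n"); return 1; }
            break;
        case 10:                        /* loop on letters and digits */
            c = nextchar();
            state = isalnum(c) ? 10 : 11;
            break;
        case 11:                        /* accepting state marked *: retract one */
            retract(1);
            printf("identifier lexeme: %.*s\n", pos - start, input + start);
            return 0;
        }
    }
}
Run on the sample string, the program prints "identifier lexeme: count1", having retracted past the blank that ended the token.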
A LANGUAGE FOR SPECIFYING LEXICAL ANLYZERS (OR) LEXICAL ANALYZER
GENERATOR (OR) LEX TOOL
We describe a particular tool, called Lex, that has been widely used to specify lexical analyzers for a
variety of languages. We refer to the tool as the Lex compiler, and to its specification as the Lex language.
First, a specification of a lexical analyzer is prepared by creating a program lex.l in the Lex language.
Then, lex.l is run through the Lex compiler to produce a C program lex.yy.c. The program lex.yy.c
consists of a tabular representation of a transition diagram constructed from the regular expressions of
lex.l, together with a standard routine that uses the table to recognize lexemes. Finally, lex.yy.c is run
through the C compiler to produce an object program a.out, which is the lexical analyzer that transforms
an input stream into a sequence of tokens.
Lex specifications:
A Lex program consists of three parts, separated by %%:
declarations
%%
translation rules
%%
auxiliary procedures
1. The declarations section includes declarations of variables, manifest constants (a manifest constant is
an identifier declared to represent a constant, e.g. #define PIE 3.14), and regular definitions.
2. The translation rules section gives a pattern for each token, together with the action (a fragment of C
code) to be executed when a lexeme matching that pattern is found.
3. The third section holds whatever auxiliary procedures are needed by the actions. Alternatively these
procedures can be compiled separately and loaded with the lexical analyzer.
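A minimal sketch of such a lex.l specification, covering some of the tokens defined earlier in this unit, is shown below; the token codes and their values are assumptions introduced here, not fixed by Lex itself.
%{
/* token codes are illustrative; a real parser would supply these */
#define IF 256
#define THEN 257
#define ELSE 258
#define ID 259
#define NUM 260
#define RELOP 261
%}
delim   [ \t\n]
ws      {delim}+
letter  [A-Za-z]
digit   [0-9]
id      {letter}({letter}|{digit})*
number  {digit}+(\.{digit}+)?(E[+-]?{digit}+)?
%%
{ws}       { /* no action: white space is stripped, nothing is returned */ }
if         { return IF; }
then       { return THEN; }
else       { return ELSE; }
{id}       { return ID; }
{number}   { return NUM; }
"<"|"<="|"="|"<>"|">"|">="   { return RELOP; }
%%
int yywrap(void) { return 1; }
Running lex (or flex) on this file produces lex.yy.c; compiling it together with a driver that repeatedly calls yylex( ) (or linking with -ll for a default main) yields the scanner a.out described above.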
SYNTAX ANALYSIS
ROLE OF THE PARSER
A parser for a grammar is a program that takes as input a string w (a sequence of tokens obtained from the
lexical analyzer) and produces as output either a parse tree for w, if w is a valid sentence of the grammar, or
an error message indicating that w is not a valid sentence of the given grammar. The goal of the parser is to
determine the syntactic validity of a source string; if the string is valid, a tree is built for use by the
subsequent phases of the compiler. The tree reflects the sequence of derivations or reductions used during
parsing; hence, it is called a parse tree. If the string is invalid, the parser has to issue diagnostic messages
identifying the nature and cause of the errors in the string. Every elementary subtree in the parse tree
corresponds to a production of the grammar.
There are two ways of identifying an elementary subtree:
1. By deriving a string from a non-terminal, or
2. By reducing a string of symbols to a non-terminal.
The two types of parsers employed are:
a. Top-down parsers, which build parse trees from the top (root) to the bottom (leaves).
b. Bottom-up parsers, which build parse trees from the leaves and work up to the root.
If we always choose the leftmost non-terminal in each derivation step, the derivation is called a leftmost
derivation.
Example:
E → E + E | E – E | E * E | E / E | – E
E → ( E )
E → id
Leftmost derivation:
E ⇒ E + E ⇒ E * E + E ⇒ id * E + E ⇒ id * id + E ⇒ id * id + id
The string w = id*id+id, consisting entirely of terminal symbols, is thus derived from the grammar.
Rightmost derivation:
E ⇒ E + E ⇒ E + E * E ⇒ E + E * id ⇒ E + id * id ⇒ id + id * id
Ambiguity:
A grammar that produces more than one parse tree for some sentence is said to be an ambiguous grammar.
Example : Given grammar G : E → E+E | E*E | ( E ) | - E | id
The sentence id+id*id has the following two distinct leftmost derivations:
E ⇒ E + E                    E ⇒ E * E
E ⇒ id + E                   E ⇒ E + E * E
E ⇒ id + E * E               E ⇒ id + E * E
E ⇒ id + id * E              E ⇒ id + id * E
E ⇒ id + id * id             E ⇒ id + id * id
[Figure: the two corresponding parse trees — one with + at the root, one with * at the root.]
Example:
To disambiguate the grammar E → E+E | E*E | E^E | id | (E), we can use the precedence and associativity
of the operators, highest precedence first:
^ (right to left)
/, * (left to right)
-, + (left to right)
Eliminating Left Recursion:
A grammar is said to be left-recursive if it has a non-terminal A such that there is a derivation A ⇒ Aα for
some string α. Top-down parsing methods cannot handle left-recursive grammars.
Hence, left recursion must be eliminated, as follows:
If there is a production A → Aα | β, it can be replaced with the two productions
A → βA’    A’ → αA’ | ε
without changing the set of strings derivable from A.
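For instance, applying this transformation to the left-recursive expression grammar
E → E + T | T
T → T * F | F
F → ( E ) | id
yields the equivalent right-recursive grammar
E → T E’
E’ → + T E’ | ε
T → F T’
T’ → * F T’ | ε
F → ( E ) | id
This right-recursive form is the grammar assumed by the recursive-descent procedures shown later in this unit.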
TOP-DOWN PARSING
It can be viewed as an attempt to find a leftmost derivation for an input string, or as an attempt to construct
a parse tree for the input starting from the root and proceeding down to the leaves.
Types of top-down parsing:
1. Recursive descent parsing
2. Predictive parsing
Example (recursive descent parsing with backtracking): consider the grammar S → cAd, A → ab | a, and the
input string w = cad.
Step 1: Begin with a tree consisting of the single node S, and expand S using the production S → cAd.
Step 2:
The leftmost leaf ‘c’ matches the first symbol of w, so advance the input pointer to the second symbol of
w, ‘a’, and consider the next leaf, ‘A’. Expand A using the first alternative, A → ab.
Step3:
The second symbol ‘a’ of w also matches with second leaf of tree. So advance the input pointer to third
symbol of w‘d’. But the third leaf of tree is b which does not match with the input symbol d.
Hence discard the chosen production and reset the pointer to second position. This is called backtracking.
Step4:
Now try the second alternative for A.
These procedures belong to the recursive-descent parser for the grammar E → TE’, E’ → +TE’ | ε,
T → FT’, T’ → *FT’ | ε, F → (E) | id, with one procedure per non-terminal; TPRIME and F are shown here.
Procedure TPRIME( )
begin
    if input_symbol = ’*’ then begin
        ADVANCE( );
        F( );
        TPRIME( )
    end
end
Procedure F( )
begin
    if input_symbol = ’id’ then
        ADVANCE( )
    else if input_symbol = ’(’ then begin
        ADVANCE( );
        E( );
        if input_symbol = ’)’ then
            ADVANCE( )
        else ERROR( )
    end
    else ERROR( )
end
Stack implementation (trace of the procedure calls on the input id+id*id; the second column shows the
input remaining after each call):
PROCEDURE        REMAINING INPUT
E( )             id+id*id
T( )             id+id*id
F( )             id+id*id
ADVANCE( )       +id*id      (id matched)
TPRIME( )        +id*id
EPRIME( )        +id*id
ADVANCE( )       id*id       (+ matched)
T( )             id*id
F( )             id*id
ADVANCE( )       *id         (id matched)
TPRIME( )        *id
ADVANCE( )       id          (* matched)
F( )             id
ADVANCE( )       (empty)     (id matched)
TPRIME( )        (empty)
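Putting the pieces together, below is a complete, runnable C sketch of this recursive-descent parser (an illustrative program, not the book's code); the input alphabet is simplified so that the single character 'i' stands for the token id.
#include <stdio.h>
#include <stdlib.h>

static const char *ip;                     /* input pointer */

static void ADVANCE(void) { ip++; }
static void ERROR(void)   { printf("error at '%c'\n", *ip); exit(1); }

static void E(void);                       /* forward declaration */

static void F(void)                        /* F -> ( E ) | id */
{
    if (*ip == 'i') ADVANCE();
    else if (*ip == '(') {
        ADVANCE();
        E();
        if (*ip == ')') ADVANCE(); else ERROR();
    }
    else ERROR();
}

static void TPRIME(void)                   /* T' -> * F T' | e */
{
    if (*ip == '*') { ADVANCE(); F(); TPRIME(); }
    /* otherwise T' -> e: return without consuming input */
}

static void T(void) { F(); TPRIME(); }     /* T -> F T' */

static void EPRIME(void)                   /* E' -> + T E' | e */
{
    if (*ip == '+') { ADVANCE(); T(); EPRIME(); }
}

static void E(void) { T(); EPRIME(); }     /* E -> T E' */

int main(void)
{
    ip = "i+i*i";                          /* i.e. id+id*id */
    E();
    if (*ip == '\0') printf("accepted\n");
    else ERROR();
    return 0;
}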
PREDICTIVE PARSING
Predictive parsing is a special case of recursive-descent parsing where no backtracking is required.
The key problem of predictive parsing is to determine the production to be applied for a non-terminal
when there are alternatives.
Non-recursive predictive parser:
The table-driven predictive parser has an input buffer, stack, a parsing table and an output stream.
Input buffer:
It consists of strings to be parsed, followed by $ to indicate the end of the input string.
Stack:
It contains a sequence of grammar symbols preceded by $ to indicate the bottom of the stack. Initially, the
stack contains the start symbol on top of $.
Parsing table:
It is a two-dimensional array M[A, a], where ‘A’ is a non-terminal and ‘a’ is a terminal.
Predictive parsing program:
The parser is controlled by a program that considers X, the symbol on top of stack, and a, the current
input symbol. These two symbols determine the parser action. There are three possibilities:
1. If X = a = $, the parser halts and announces successful completion of parsing.
2. If X = a ≠ $, the parser pops X off the stack and advances the input pointer to the next input symbol.
3. If X is a non-terminal , the program consults entry M[X, a] of the parsing table M. This entry will either
be an X-production of the grammar or an error entry.
If M[X, a] = {X → UVW},the parser replaces X on top of the stack by UVW
If M[X, a] = error, the parser calls an error recovery routine.
Algorithm for nonrecursive predictive parsing:
Input : A string w and a parsing table M for grammar G.
Output : If w is in L(G), a leftmost derivation of w; otherwise, an error indication.
Method : Initially, the parser has $S on the stack with S, the start symbol of G on top, and w$ in the input
buffer. The program that utilizes the predictive parsing table M to produce a parse for the input is as
follows:
set ip to point to the first symbol of w$;
repeat
let X be the top stack symbol and a the symbol pointed to by ip;
if X is a terminal or $ then
if X = a then
pop X from the stack and advance ip
else error()
else /* X is a non-terminal */
if M[X, a] = X →Y1Y2 … Yk then begin
pop X from the stack;
push Yk, Yk-1, … ,Y1 onto the stack, with Y1 on top;
output the production X → Y1 Y2 . . . Yk
end
else error()
until X = $
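As a worked example (a sketch assuming the left-factored expression grammar E → TE’, E’ → +TE’ | ε, T → FT’, T’ → *FT’ | ε, F → (E) | id and its standard predictive parsing table), the parse of id+id*id proceeds as follows, with the stack top at the right:
STACK           INPUT           OUTPUT
$E              id+id*id$
$E’T            id+id*id$       E → TE’
$E’T’F          id+id*id$       T → FT’
$E’T’id         id+id*id$       F → id
$E’T’           +id*id$
$E’             +id*id$         T’ → ε
$E’T+           +id*id$         E’ → +TE’
$E’T            id*id$
$E’T’F          id*id$          T → FT’
$E’T’id         id*id$          F → id
$E’T’           *id$
$E’T’F*         *id$            T’ → *FT’
$E’T’F          id$
$E’T’id         id$             F → id
$E’T’           $
$E’             $               T’ → ε
$               $               E’ → ε (accept)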
LL(1) grammar:
If the parsing table entries are all single entries, i.e., each location holds at most one production, the
grammar is called an LL(1) grammar.
Consider the following grammar:
S → iEtS | iEtSeS | a
E→b
After left factoring, we have
S → iEtSS’ | a
S’→ eS | ε
E→b
To construct a parsing table, we need FIRST() and FOLLOW() for all the non-terminals.
FIRST(S) = { i, a }
FIRST(S’) = {e, ε }
FIRST(E) = { b}
FOLLOW(S) = { $ ,e }
FOLLOW(S’) = { $ ,e }
FOLLOW(E) = {t}
Since the entry M[S’, e] contains more than one production (both S’ → eS and S’ → ε), the parsing table is
multiply defined and the grammar is not an LL(1) grammar.
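The conflict shows up directly in the parsing table; a sketch of the table, with blank entries denoting errors, is:
       a        b        e                    i             t       $
S      S → a                                  S → iEtSS’
S’                       S’ → eS, S’ → ε                            S’ → ε
E               E → b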
Actions performed in predictive parsing:
1. Shift
2. Reduce
3. Accept
4. Error
Implementation of predictive parser:
1. Elimination of left recursion, left factoring and ambiguous grammar.
2. Construct FIRST() and FOLLOW() for all non-terminals.
3. Construct predictive parsing table.
4. Parse the given input string using stack and parsing table.
LL Parser
An LL parser accepts an LL grammar. LL grammars are a subset of the context-free grammars, with some
restrictions imposed to obtain a simplified form that allows an easy implementation. An LL grammar can be
implemented by means of either algorithm: recursive descent or table-driven.
An LL parser is denoted LL(k). The first L in LL(k) means the input is parsed from left to right, the second L
stands for leftmost derivation, and k represents the number of lookahead symbols. Generally k = 1,
so LL(k) is commonly written LL(1).
UNIT-II
SYNTAX ANALYSIS
BOTTOM-UP PARSING
Constructing a parse tree for an input string beginning at the leaves and going towards the root is called
bottom-up parsing. Bottom-up parsing starts from the leaf nodes of a tree and works upward until it
reaches the root node. Here, we start from a sentence and then apply production rules in reverse in order
to reach the start symbol.
SHIFT-REDUCE PARSING
Shift-reduce parsing is a type of bottom-up parsing that attempts to construct a parse tree for an input
string beginning at the leaves (the bottom) and working up towards the root (the top).
Example:
Consider the grammar:
S → aABe
A → Abc | b
B→d
The sentence to be recognized is abbcde.
Reduction                      Production used
abbcde   ⇒   aAbcde            A → b
aAbcde   ⇒   aAde              A → Abc
aAde     ⇒   aABe              B → d
aABe     ⇒   S                 S → aABe
The reductions trace out, in reverse, the rightmost derivation
S ⇒ aABe ⇒ aAde ⇒ aAbcde ⇒ abbcde
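The corresponding shift-reduce parser actions, with the stack top at the right, are:
STACK      INPUT       ACTION
$          abbcde$     shift
$a         bbcde$      shift
$ab        bcde$       reduce by A → b
$aA        bcde$       shift
$aAb       cde$        shift
$aAbc      de$         reduce by A → Abc
$aA        de$         shift
$aAd       e$          reduce by B → d
$aAB       e$          shift
$aABe      $           reduce by S → aABe
$S         $           accept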
Handles:
A handle of a string is a substring that matches the right side of a production, and whose reduction to the
non-terminal on the left side of that production represents one step along the reverse of a rightmost derivation.
2. Reduce-Reduce Conflict:
Consider the grammar
M → R+R | R+c | R
R → c
With R+c on top of the stack, the parser cannot decide whether to reduce by M → R+c or to reduce the c
by R → c; both reductions are possible, giving a reduce-reduce conflict.
Precedence Relations
Bottom-up parsers for a large class of context-free grammars can be easily developed using operator
grammars. Operator grammars have the property that no production right side is empty or has two
adjacent non terminals. This property enables the implementation of efficient operator-precedence
parsers. This parser relies on the following three precedence relations:
Relation Meaning
a <· b a yields precedence to b
a =· b a has the same precedence as b
a ·> b a takes precedence over b
These operator precedence relations allow us to delimit the handles in the right sentential forms: <· marks
the left end, =· appears in the interior of the handle, and ·> marks the right end.
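For example, for the ambiguous expression grammar E → E+E | E*E | id, with * given higher precedence than + and both operators left-associative, the precedence relations are (row: symbol on the stack; column: incoming symbol; blank entries are errors):
        +      *      id     $
+       ·>     <·     <·     ·>
*       ·>     ·>     <·     ·>
id      ·>     ·>            ·>
$       <·     <·     <·
Inserting these relations into id+id*id$ gives $ <· id ·> + <· id ·> * <· id ·> $, so the leftmost id is the first handle to be reduced.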
LR Parser
The LR parser is a non-recursive, shift-reduce, bottom-up parser. It handles a wide class of context-free
grammars, which makes it the most efficient syntax analysis technique. LR parsers are also known as LR(k)
parsers, where L stands for left-to-right scanning of the input stream, R stands for the construction of a
rightmost derivation in reverse, and k denotes the number of lookahead symbols used to make decisions.
There are three widely used algorithms available for constructing an LR parser:
SLR(1) – Simple LR Parser:
Works on the smallest class of grammars
Few states, hence a very small table
Simple and fast construction
LR(1) – Canonical LR Parser:
Works on the complete set of LR(1) grammars
Generates a large table and a large number of states
Slow construction
LALR(1) – Look-Ahead LR Parser:
Works on an intermediate-sized class of grammars
Has the same number of states as SLR(1)
CANONICAL LR PARSING
By splitting states when necessary, we can arrange to have each state of an LR parser indicate exactly
which input symbols can follow a handle a for which there is a possible reduction to A. As the text points
out, sometimes the FOLLOW sets give too much information and doesn't (can't) discriminate between
different reductions.
The general form of an LR(k) item becomes [A -> a.b, s] where A -> ab is a production and s is a string of
terminals. The first part (A -> a.b) is called the core and the second part is the look ahead. In LR(1) |s| is
1, so s is a single terminal.
A → αβ is the usual right-hand side with a marker; any a in s is an incoming token in which we are
interested. A completed item used to be reduced on every incoming token in FOLLOW(A), but now we
reduce only if the next input token is in the lookahead set s. If we get two productions A → α and
B → α, we can tell them apart when α is a handle on the stack, provided the corresponding completed items
have different lookahead parts. Furthermore, note that the lookahead has no effect for an item of the form
[A → α·β, a] if β is not ε. Recall that our problem occurs for completed items, so what we have done is to
say that an item of the form [A → α·, a] calls for a reduction by A → α only if the next input symbol is a.
More formally, an LR(1) item [A → α·β, a] is valid for a viable prefix γ if there is a derivation
S ⇒* δAw ⇒ δαβw, where γ = δα, and either a is the first symbol of w, or w is ε and a is $.
Example: consider the augmented grammar S’ → S, S → CC, C → cC | d. The canonical collection of LR(1)
item sets begins:
I0: S’->.S,$
    S->.CC,$
    C->.cC,c/d
    C->.d,c/d
I1: S’->S.,$
I2: S->C.C,$
    C->.cC,$
    C->.d,$
I3: C->c.C,c/d
    C->.cC,c/d
    C->.d,c/d
I4: C->d.,c/d
I5: S->CC.,$
I6: C->c.C,$
C->.cC,$
C->.d,$
I7: C->d.,$
I8: C->cC.,c/d
I9: C->cC.,$
[Figure: the GOTO graph (DFA) over the LR(1) item sets I0–I9.]
The canonical LR(1) parsing table is constructed as follows:
1. Construct C = {I0, I1, ..., In}, the collection of sets of LR(1) items for G'. State i is constructed from Ii.
2. If [A → α·aβ, b] is in Ii and goto(Ii, a) = Ij, then set action[i, a] to "shift j". Here a must be a terminal.
3. If [A → α·, a] is in Ii and A ≠ S', then set action[i, a] to "reduce A → α". Note that, unlike in SLR parsing,
the reduction is made only on the lookahead symbol a, not on every symbol in FOLLOW(A).
4. If [S' → S·, $] is in Ii, then set action[i, $] to "accept".
5. If any conflicting actions are generated by these rules, the grammar is not LR(1) and the algorithm fails
to produce a parser.
6. The goto transitions for state i are constructed for all non-terminals A using the rule: if goto(Ii, A) = Ij,
then goto[i, A] = j.
7. All entries not defined by rules 2 through 4 are made "error".
8. The initial state of the parser is the one constructed from the set of items containing [S' → .S, $].
LALR PARSER:
We begin with two observations. First, some of the states generated for LR(1) parsing have the same set
of core (or first) components and differ only in their second component, the lookahead symbol. Our
intuition is that we should be able to merge these states and reduce the number of states, getting
close to the number of states that would be generated for LR(0) parsing. This observation suggests a
hybrid approach: we can construct the canonical LR(1) sets of items and then look for sets of items
having the same core. We merge the sets with common cores into one set of items. The merging of
states with common cores can never produce a shift-reduce conflict that was not present in one of the
original states, because shift actions depend only on the core, not the lookahead. But it is possible for the
merger to produce a reduce-reduce conflict.
Our second observation is that we are really only interested in the lookahead symbol in places where
there is a problem. So our next thought is to take the LR(0) sets of items and add lookaheads only where
they are needed. This leads to a more efficient, but much more complicated, method.
The LALR parsing table is constructed from the canonical LR(1) collection as follows:
1. Construct C = {I0, I1, ..., In}, the collection of sets of LR(1) items.
2. For each core present among the sets of items, find all sets having that core and replace these sets by
their union.
3. Let C’ = {J0, J1, ..., Jm} be the resulting sets of LR(1) items. The parsing actions for state i are
constructed from Ji as for the canonical LR parser; if there is a parsing-action conflict, the grammar is
not LALR(1).
4. If J is the union of one or more sets of LR(1) items, say J = I1 ∪ I2 ∪ ... ∪ Ik, then the cores of
goto(I1, X), goto(I2, X), ..., goto(Ik, X) are the same. Let K be the union of all sets of items having the
same core as goto(I1, X). Then goto(J, X) = K.
Merging the sets with common cores in the example above gives:
I36: C->c.C,c/d/$
     C->.cC,c/d/$
     C->.d,c/d/$
I47: C->d.,c/d/$
I89: C->cC.,c/d/$
Parsing Table (numbering the productions 1: S → CC, 2: C → cC, 3: C → d):
State      ACTION                    GOTO
           c       d       $         S     C
0          s36     s47               1     2
1                          acc
2          s36     s47                     5
36         s36     s47                     89
47         r3      r3      r3
5                          r1
89         r2      r2      r2
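As a worked example (using the production numbering above), the table drives the following parse of the input cdd:
STACK              INPUT     ACTION
0                  cdd$      shift 36
0 c 36             dd$       shift 47
0 c 36 d 47        d$        reduce by C → d; goto(36, C) = 89
0 c 36 C 89        d$        reduce by C → cC; goto(0, C) = 2
0 C 2              d$        shift 47
0 C 2 d 47         $         reduce by C → d; goto(2, C) = 5
0 C 2 C 5          $         reduce by S → CC; goto(0, S) = 1
0 S 1              $         accept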
SEMANTIC ANALYSIS
Semantic analysis computes additional information related to the meaning of the program once the
syntactic structure is known.
In typed languages such as C, semantic analysis involves adding information to the symbol table and
performing type checking.
The information to be computed is beyond the capabilities of standard parsing techniques, therefore it
is not regarded as syntax.
As with lexical and syntax analysis, for semantic analysis we need both a representation formalism and an
implementation mechanism.
As representation formalism this lecture illustrates what are called Syntax Directed Translations.
The Principle of Syntax Directed Translation states that the meaning of an input sentence is related
to its syntactic structure, i.e., to its Parse-Tree.
By Syntax Directed Translations we indicate those formalisms for specifying translations for
programming language constructs guided by context-free grammars.
We associate attributes with the grammar symbols representing the language constructs.
Values for the attributes are computed by semantic rules associated with the grammar productions.
Evaluation of the semantic rules may:
generate code;
insert information into the symbol table;
perform semantic checks;
issue error messages.
There are two notations for associating semantic rules with productions:
1. Syntax-Directed Definitions: high-level specifications hiding many implementation details (also called
attribute grammars).
2. Translation Schemes: more implementation-oriented, indicating the order in which the semantic rules
are to be evaluated.
Syntax Directed Definitions
• Syntax Directed Definitions are a generalization of context-free grammars in which:
1. Grammar symbols have an associated set of Attributes;
2. Productions are associated with Semantic Rules for computing the values of attributes.
Such a formalism generates annotated parse trees, where each node of the tree is a record with a
field for each attribute (e.g., X.a indicates the attribute a of the grammar symbol X).
The value of an attribute of a grammar symbol at a given parse-tree node is defined by a semantic
rule associated with the production used at that node.
We distinguish between two kinds of attributes:
1. Synthesized Attributes. They are computed from the values of the attributes of the children nodes.
2. Inherited Attributes. They are computed from the values of the attributes of both the siblings and the
parent nodes
Syntax Directed Definitions: An Example
• Example. Let us consider a grammar for arithmetic expressions. The Syntax Directed Definition
associates with each non-terminal a synthesized attribute called val.
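The definition itself is not reproduced in these notes; a standard version (following the classic textbook treatment, in which n denotes the end-of-line token) is:
Production        Semantic Rule
L → E n           print(E.val)
E → E1 + T        E.val := E1.val + T.val
E → T             E.val := T.val
T → T1 * F        T.val := T1.val * F.val
T → F             T.val := F.val
F → ( E )         F.val := E.val
F → digit         F.val := digit.lexval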
S-ATTRIBUTED DEFINITIONS
Definition. An S-Attributed Definition is a Syntax Directed Definition that uses only synthesized
attributes.
• Evaluation Order. Semantic rules in a S-Attributed Definition can be evaluated by a bottom-up, or
PostOrder, traversal of the parse-tree.
• Example. The above arithmetic grammar is an example of an S-Attributed Definition. The annotated
parse-tree for the input 3*5+4n is:
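The annotated tree itself is omitted here; a sketch of the bottom-up evaluation it depicts:
digit.lexval = 3 gives F.val = 3 and T.val = 3;
digit.lexval = 5 gives F.val = 5, so T.val = 3 * 5 = 15 and E.val = 15;
digit.lexval = 4 gives F.val = 4 and T.val = 4;
finally E.val = 15 + 4 = 19, and the rule for L → E n prints 19.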