Lab Manual - CL1


Computer Laboratory-I

Laboratory Manual

Computer Engineering
Laboratory Assignments
Group A (Mandatory Six Assignments)
1. Using Divide and Conquer Strategies design a function for Binary Search using C.
2. Using Divide and Conquer Strategies design a class for Concurrent Quick Sort using C++.
3. Lexical analyzer for sample language using LEX.
4. Parser for sample language using YACC.
5. Int code generation for sample language using LEX and YACC.
6. Implement a simple approach for k-means/ k-medoids clustering using C++.
Group B (Any Six Assignments: at least 3 from the selected Elective)
1. 8-Queens Matrix is Stored using JSON/XML having first Queen placed, use back-tracking to place
remaining Queens to generate final 8-queen's Matrix using Python.
2. Implementation of 0-1 knapsack problem using branch and bound approach.
3. Code optimization using DAG.
4. Code generation using DAG / labeled tree.
5. Generating abstract syntax tree using LEX and YACC.
6. Implementing recursive descent parser for sample language.
7. Implement the Apriori approach for data mining to organize the data items on a shelf, using the
following table of items purchased in a mall
Transaction ID  Item1   Item2       Item3      Item4      Item5       Item6
T1              Mango   Onion       Jar        Key-chain  Eggs        Chocolates
T2              Nuts    Onion       Jar        Key-chain  Eggs        Chocolates
T3              Mango   Apple       Key-chain  Eggs       -           -
T4              Mango   Toothbrush  Corn       Key-chain  Chocolates  -
T5              Corn    Onion       Onion      Key-chain  Knife       Eggs
8. Implement Decision trees on Digital Library Data to mirror more titles (PDF) in the library application;
compare it with the Naive Bayes algorithm.
9. Implement Naive Bayes for Concurrent/Distributed application. Approach should handle categorical
and continuous data.
10. Implementation of the K-NN approach; take a suitable example.
Group C (Any One Assignment)
1. Code generation using "iburg" tool.
GROUP A
ASSIGNMENT NO:1

TITLE: Using Divide and Conquer Strategies design a function for Binary Search using C++.
OBJECTIVES:
To learn the Divide and Conquer strategy by designing a function for Binary Search in C++.
PROBLEM STATEMENT: Using the Divide and Conquer strategy, design and implement a
function for Binary Search in C++.
SOFTWARE REQUIRED: Latest version of a 64-bit operating system: open-source Fedora 19 or
Windows 8, with a multicore CPU equivalent to Intel i5/i7
MATHEMATICAL MODEL:

Let S be the system such that,


S = {I, O, F, success, failure}
Where,
I = Set of Inputs
O = Set of Outputs
F = Set of Functions
INPUT:
I = Array of numbers.
Function:
F1 = Sorting Function (sorts the input list)
F2 = Splitting Function (splits the sorted list)
F3 = Binary Search (applies binary search to the sorted list)
OUTPUT:
O1 = Success Case (all inputs given to the system are entered correctly)
O2 = Failure Case (the input does not match the validation criteria)
Computational Model
THEORY:
Divide-and-conquer is a top-down technique for designing algorithms that consists of dividing
the problem into smaller subproblems hoping that the solutions of the subproblems are easier to
find and then composing the partial solutions into the solution of the original problem.

A little more formally, the divide-and-conquer paradigm consists of the following major phases:

 Breaking the problem into several sub-problems that are similar to the original problem
but smaller in size,
 Solving the sub-problems recursively (successively and independently), and then
 Combining the solutions to the sub-problems into a solution to the original problem.
Binary Search (simplest application of divide-and-conquer) :-
Binary Search is an extremely well-known instance of the divide-and-conquer paradigm. Given an
ordered array of n elements, the basic idea of binary search is that for a given element we
"probe" the middle element of the array. We then continue in either the lower or upper segment of the
array, depending on the outcome of the probe, until we reach the required (given) element.
Problem: Let A[1 . . . n] be an array sorted in non-decreasing order; that is, A[i] ≤ A[j]
whenever 1 ≤ i ≤ j ≤ n. Let 'q' be the query point. The problem consists of finding 'q' in the
array A. If q is not in A, then find the position where 'q' might be inserted.
Formally, find the index i such that 1 ≤ i ≤ n+1 and A[i-1] < q ≤ A[i].
Sequential Search
Look sequentially at each element of A until either we reach the end of the array or find an
item no smaller than 'q'. (Sequential search for 'q' in array A.)
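The sequential scan described above can be sketched in C++ (a minimal sketch; the function name and the choice of returning the insertion position are our own):

```cpp
#include <vector>

// Sequential search: scan until the end of A or an item no smaller than q.
// Returns the index of the first such item (the position where q could be
// inserted), which is A.size() when every element is smaller than q.
int sequentialSearch(const std::vector<int>& A, int q) {
    int r = 0;
    while (r < static_cast<int>(A.size()) && A[r] < q)
        ++r;                            // the loop body runs r times: θ(r)
    return r;
}
```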

Analysis:-
This algorithm clearly takes θ(r) time, where r is the index returned. This is Ω(n) in the worst case
and O(1) in the best case.
If the elements of array A are distinct and the query point q is indeed in the array, then the loop
executes (n + 1) / 2 times on average. On average (as well as in the worst case), sequential
search takes θ(n) time.
Binary Search:-
Look for 'q' either in the first half or in the second half of the array A. Compare 'q' to the element
in the middle, n/2, of the array. Let k = n/2. If q ≤ A[k], then search in A[1 . . k];
otherwise search A[k+1 . . n] for 'q'. Binary search for q in subarray A[i . . j] with the promise
that A[i-1] < q ≤ A[j+1].

Analysis:-
Binary Search can be accomplished in logarithmic time in the worst case, i.e., T(n) = θ(log n).
This version of the binary search takes logarithmic time even in the best case.
Iterative Version of Binary Search:- Iterative binary search for q, in array A[1 . . n]
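The iterative version can be sketched in C++ as follows (a minimal sketch; it returns the index of q in the array, or -1 if q is absent):

```cpp
#include <vector>

// Iterative binary search on a sorted vector.
// Returns the index of q in a, or -1 if q is not present.
int binarySearch(const std::vector<int>& a, int q) {
    int lo = 0, hi = static_cast<int>(a.size()) - 1;
    while (lo <= hi) {
        int mid = lo + (hi - lo) / 2;   // middle probe; avoids overflow of lo + hi
        if (a[mid] == q)
            return mid;                 // probe hit the query point
        else if (a[mid] < q)
            lo = mid + 1;               // continue in the upper segment
        else
            hi = mid - 1;               // continue in the lower segment
    }
    return -1;                          // q is not in a
}
```

Each iteration halves the segment still under consideration, giving the θ(log n) bound stated above.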

CONCLUSION

Thus we have studied binary search using divide and conquer strategy.
ASSIGNMENT NO:2

TITLE: Using Divide and Conquer Strategies design a class for Concurrent Quick Sort using C++.
OBJECTIVES: To compare the performance of the concurrent Quick Sort algorithm designed
using the Divide and Conquer strategy.

PROBLEM STATEMENT: Implementation of Concurrent Quick Sort.


SOFTWARE REQUIRED: Latest version of a 64-bit operating system: open-source Fedora 19
or Windows 8; C++
MATHEMATICAL MODEL:

S = {I, F, O, Success, Failure}


I is the set of input elements.
O is the set of required output elements.
F is the set of functions required for Quick Sort.
INPUT:
I1= Array of numbers
Function:
F = Quick Sort function (chooses a pivot, scans up and down, and performs quick sort)
OUTPUT:
O = {Sorted list of elements}
Success – Sorted list of elements.
Computational Model

THEORY: The quick sort algorithm is amenable to a concurrent implementation. Each step in
the recursion – each time the partitioning routine is called – operates on an independent sub-
array. In Quick sort one can sort an array by partitioning the elements around a pivot element
and then doing the same to each of the partitions. The partitioning step consists of picking a
pivot element p in the array and moving the elements around so that all the elements less than
p have indices less than that of p, and vice versa. Partitioning will usually move several array
elements, including p, around. The partitioning puts p into the place it belongs in the array –
no element will ever move from the set that is smaller than p to the set that is larger;
once a partitioning step is complete, its pivot element is in its sorted position.

Procedure: Consider one coordinator and several workers (the coordinator distributes work
among the workers). The coordinator sends a message to an idle worker telling it to sort
the array and waits to receive messages from the workers about the progress of the algorithm. A
worker partitions a sub-array, and every time that worker gets ready to call the partition routine
on a smaller array, it checks to see if there is an idle worker to assign the work to. If so, it sends
a message to that worker to start working on the sub-problem; if not, the current worker calls
the partition routine itself. After each partitioning, two recursive calls are (usually) made,
so there are plenty of chances to start other workers. The diagram below shows two workers
sorting the same 5-element array. Each blue line represents the flow of control of a worker
thread, and the red arrow represents the message sent from one worker to start the other. (Since
the workers proceed concurrently, it is no longer guaranteed that the smaller elements in
the array will be ordered before the larger; what is certain is that the two workers will never try
to manipulate the same elements.)

Two workers sort an array


A worker can finish either because it has directly completed all the work of sorting the
sub-array it was initially called on, or because it has ordered a subset of that array but has passed
some or all of the remaining work to other workers. In either case, it reports the number of
elements it has ordered back to the coordinator. (The number of elements a worker has ordered
is the number of partitions of sub-arrays that have one or more members.) When the coordinator
hears that all the elements in the array have been ordered, it tells the workers that there is nothing
left to do, and the workers exit.
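The coordinator/worker scheme above is message-based; a minimal shared-memory sketch of the same idea in C++ can hand one partition to a new task (std::async standing in for "starting an idle worker") while the current thread sorts the other. The cutoff parameter, which stops spawning tasks for small partitions, is our own addition:

```cpp
#include <algorithm>
#include <functional>
#include <future>
#include <vector>

// Concurrent quicksort: each recursive call operates on an independent
// sub-array, so one partition can be handed to a new task while the current
// thread sorts the other. Below `cutoff`, partitions are sorted in the
// current thread, since task start-up would cost more than the sort.
void concurrentQuickSort(std::vector<int>& a, int lo, int hi, int cutoff) {
    if (lo >= hi) return;
    int pivot = a[lo + (hi - lo) / 2];
    int i = lo, j = hi;
    while (i <= j) {                    // partition around the pivot
        while (a[i] < pivot) ++i;
        while (a[j] > pivot) --j;
        if (i <= j) std::swap(a[i++], a[j--]);
    }
    if (hi - lo > cutoff) {
        // "Start an idle worker" on the lower partition, sort the upper here.
        auto worker = std::async(std::launch::async, concurrentQuickSort,
                                 std::ref(a), lo, j, cutoff);
        concurrentQuickSort(a, i, hi, cutoff);
        worker.get();                   // wait for the worker to report back
    } else {
        concurrentQuickSort(a, lo, j, cutoff);
        concurrentQuickSort(a, i, hi, cutoff);
    }
}
```

As in the text, the two tasks never touch the same elements, because each recursive call owns a disjoint index range.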

CONCLUSION:

Thus we have studied Concurrent Quick Sort using divide and conquer strategy.
ASSIGNMENT NO:3

TITLE: Lexical analyzer for sample language using LEX


OBJECTIVES:
Understand the importance and usage of LEX automated tool.
PROBLEM STATEMENT:
Implement a lexical analyzer for a sample language using LEX.
SOFTWARE REQUIRED: Linux Operating Systems, GCC
INPUT: Input data as Sample language.
OUTPUT: It will generate tokens for sample language.
MATHEMATICAL MODEL:

Let S be the solution perspective of the system such that
S={s, e, i, o, f, success, failure}
s=Start of program
e = the end of program
i=Sample language statement.
o=Result of statement
Success-token is generated.
Failure-token is not generated or forced exit due to system error.
Computational Model
S --e1--> A --e2--> R

Where,
S={Start state}
A={generate_token()}
R={Final Result}
THEORY:
Lex is a program generator designed for lexical processing of character input streams. It accepts
a high-level, problem-oriented specification for character string matching, and produces a
program in a general-purpose language which recognizes regular expressions. The regular
expressions are specified by the user in the source specifications given to Lex.

The Lex-written code recognizes these expressions in an input stream and partitions the input
stream into strings matching the expressions. At the boundaries between strings, program sections
provided by the user are executed. The Lex source file associates the regular expressions and the
program fragments. As each expression appears in the input to the program written by Lex, the
corresponding fragment is executed.

1) LEX Specifications :- Structure of the LEX Program is as follows

-----------------------------
Declaration Part
-----------------------------
%%
-----------------------------
Translation Rule
-----------------------------
%%
-----------------------------
Auxiliary Procedures
-----------------------------
2) Declaration part :- Contains the declaration for variables required for LEX program and C
program.

Translation rules:- Contains the rules like

Reg. Expression { action1 }


Reg. Expression { action2 }
Reg. Expression { action3 }
-------------------------------------------
Reg. Expression { action-n }
3) Auxiliary Procedures :-

Contains all the procedures used in your C code.


 Built-in Functions i.e. yylex() , yyerror() , yywrap() etc.
1) yylex() :- This function is used for calling the lexical analyzer for the given translation rules.

2) yyerror() :- This function is used for displaying any error message.

3) yywrap() :- This function is called at end of input; it is used for taking input from more than one file.

 Built-in Variables i.e. yylval, yytext, yyin, yyout etc.


1) yylval :- This is a global variable used to store the value of any token.

2) yytext :- This is a global variable which stores the text of the current token.

3) yyin :- This is input file pointer used to change value of input file pointer. Default file
pointer is pointing to stdin i.e. keyboard.

4) yyout :- This is output file pointer used to change value of output file pointer. Default
output file pointer is pointing to stdout i.e. Monitor.

 How to execute a LEX program :- Follow these steps:
 Compile the *.l file with the lex command
# lex *.l It will generate the lex.yy.c file for your lexical analyzer.
 Compile the lex.yy.c file with the cc command
# cc lex.yy.c It will generate an object file with the name a.out.
 Execute the a.out file to see the output
# ./a.out
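To picture what the generated lex.yy.c does, here is a small hand-written C++ scanner (not produced by LEX; the token names are our own) that mirrors the pattern/action pairing of a rules section:

```cpp
#include <cctype>
#include <string>
#include <vector>

struct Token { std::string type, text; };

// A minimal hand-written scanner mirroring lex behaviour: whitespace is
// ignored, each matched pattern emits one token, and unmatched single
// characters fall through to a catch-all rule (like the "." pattern).
std::vector<Token> tokenize(const std::string& src) {
    std::vector<Token> tokens;
    size_t i = 0;
    while (i < src.size()) {
        if (std::isspace((unsigned char)src[i])) { ++i; continue; }  // [ \t\n] ;
        size_t start = i;
        if (std::isdigit((unsigned char)src[i])) {                   // [0-9]+
            while (i < src.size() && std::isdigit((unsigned char)src[i])) ++i;
            tokens.push_back({"NUMBER", src.substr(start, i - start)});
        } else if (std::isalpha((unsigned char)src[i])) {            // [a-zA-Z][a-zA-Z0-9]*
            while (i < src.size() && std::isalnum((unsigned char)src[i])) ++i;
            tokens.push_back({"IDENTIFIER", src.substr(start, i - start)});
        } else {                                                     // . (any other char)
            tokens.push_back({"OPERATOR", std::string(1, src[i++])});
        }
    }
    return tokens;
}
```

The matched text kept in each Token plays the role of yytext, and the token type is what yylex() would return to the parser.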
CONCLUSION

Thus we have studied Lexical Analyzer for sample language.


ASSIGNMENT NO:4

TITLE: Parser for sample language using YACC.


OBJECTIVES:
1. To understand Second phase of compiler: Syntax Analysis.
2. To learn and use compiler writing tools.
3. Understand the importance and usage of YACC automated tool.
PROBLEM STATEMENT:
Write an ambiguous CFG to recognize an infix expression and implement a parser that
recognizes the infix expression using YACC.
SOFTWARE REQUIRED: Linux Operating Systems, GCC
INPUT: Input data as Sample language.
OUTPUT: It will generate parser for sample language.
MATHEMATICAL MODEL:
Let S be the solution perspective of the system such that
S={s, e, i, o, f, DD, NDD, success, failure}

s=Start of program

e = the end of program

i=Arithmetic expression.

o=Result of arithmetic expression

Success-parser is generated.

Failure-parser is not generated or forced exit due to system error.

Computational Model

S --e1--> A --e2--> B --e3--> R

Where,
S={Start state}
A={Generate_token()}
B={Parse_token()}
R={Final Result}
THEORY:
1) YACC Specifications :- A parser generator facilitates the construction of the front end of a
compiler. YACC is an LALR parser generator. It has been used to implement hundreds of compilers.
YACC is a command (utility) of the UNIX system. YACC stands for "Yet Another Compiler
Compiler". The file in which the parser is specified has a .y extension, e.g. parser.y, which
contains the YACC specification of the translator. After the specification is complete, the UNIX
command yacc transforms parser.y into a C program called y.tab.c using an LR parser. The
program y.tab.c is generated automatically. We can use the command with the -d option, as yacc -d
parser.y. Using the -d option, two files get generated, namely y.tab.c and y.tab.h. The header
file y.tab.h stores all the token information, so you need not create y.tab.h
explicitly. The program y.tab.c is a representation of an LALR parser written in C, along with
other C routines that the user may have prepared. By compiling y.tab.c with the -ly library that
contains the LR parsing program, using the command cc y.tab.c -ly, we obtain the desired
object program a.out that performs the translation specified by the original program. If other
procedures are needed, they can be compiled or loaded with y.tab.c, just as with any C program.
LEX recognizes regular expressions, whereas YACC recognizes entire grammars. LEX divides
the input stream into tokens, while YACC uses these tokens and groups them together
logically. LEX and YACC work together to analyze the program syntactically. YACC can
report conflicts or ambiguities (if any) in the form of error messages.

Structure of the YACC Program is as follows

-----------------------------
Declaration Section
-----------------------------
%%
-----------------------------
Translation Rule Section
-----------------------------
%%
-----------------------------
Auxiliary Procedures Section
-----------------------------
Declaration Section :-
The definition section can include a literal block, C code copied verbatim to the beginning of the
generated C file, usually containing declaration and #include lines. There may be %union,
%start, %token, %type, %left, %right, and %nonassoc declarations. (See "%union Declaration,"
"Start Declaration," "Tokens," "%type Declarations," and "Precedence and Operator
Declarations.") It can also contain comments in the usual C format, surrounded by "/*" and "*/".
All of these are optional, so in a very simple parser the definition section may be completely
empty.
Translation rule Section :-

Contains the rules / grammars


Production { action1 }
Production { action2 }
Production { action3 }
---------------------------------------
Production { action-n }

Auxiliary Procedure Section :-


Contains all the procedures used in your C – Code.

2) Built-in Functions i.e. yyparse() , yyerror() , yywrap() etc.


1) yyparse() :-
This is a standard parse routine used for calling syntax analyzer for given translation
rules.
2) yyerror() :-
This is a standard error routine used for displaying any error message.
3) yywrap() :-
This function is called at end of input; it is used for taking input from more than one file.
3) Built-in Types i.e. %token , %start , %prec , %nonassoc etc.
1) %token
Used to declare the tokens used in the grammar. The tokens that are declared in the
declaration section will be identified by the parser.
Eg. :- %token NAME NUMBER
2) %start :-
Used to declare the start symbol of the grammar.
Eg.:- %start STMT
3) %left
Used to assign associativity to operators.
Eg.:- %left '+' '-' - Assigns left associativity to + & - with lowest precedence.
%left '*' '/' - Assigns left associativity to * & / with highest precedence.
4) %right :-
Used to assign associativity to operators.
Eg.:- 1) %right '+' '-'
- Assigns right associativity to + & - with lowest precedence
2) %right '*' '/'
- Assigns right associativity to * & / with highest precedence.
5) %nonassoc :-
Used to declare an operator non-associative.
Eg.:- %nonassoc UMINUS
6) %prec :-
Used to tell the parser to use the precedence of the given token for this rule.
Eg. :- %prec UMINUS
7) %type :-
Used to define the type of a token or a non-terminal of the production written in the
rules section of the .Y file.
Eg.:- %type <name of any variable> exp
Let us see both LEX and YACC specification for writing a program for calculator.
LEX Specification : (lex.l file)
1) Declaration Section :

%{
#include "y.tab.h"
#include<math.h>
extern int yylval;
%}
Here, we include the header file that is generated while executing the .y file. We also include
math.h for any mathematical functions. The function atoi (which converts a string to an integer) is
used on the matched number strings.
Lastly, when a lexical analyzer passes a token to the parser, it can also pass a value for the token.
In order to pass the value that our parser can use (for the passed token), the lexical analyser has
to store it in the variable yylval.
Before storing the value in yylval we have to specify its data type. In our program we want to
perform mathematical operations on the input, hence we declare the variable yylval as integer.
1) Rules Section :
[0-9]+ { yylval = atoi(yytext); return NUMBER; }
[ \t] ; /* ignore white space */
\n return 0; /* logical EOF */
. return yytext[0];
In rules section, we match the pattern for numbers and pass the token NUMBER to the
parser. As we know the matched string is stored in yytext which is a string, hence we
type cast the string value to integer. We ignore spaces, and for all other input characters
we just pass them to the parser.
YACC Specification : (yacc.y file)
1) Declaration Section:
%token NUMBER
%left '+' '-'
%left '/' '*'
%right '^'
%nonassoc UMINUS
In the declaration section we declare all the variables that we will be using throughout the
program, and we include all the necessary files. Apart from that, we also declare the tokens that
are recognized by the parser. As we are writing a parser specification for a calculator, we have only
one token, NUMBER. To deal with the ambiguous grammar, we have to specify the
associativity and precedence of the operators. As seen above, +, -, * and / are left associative,
the power symbol is right associative, and unary minus is non-associative. The precedence of the
operators increases as we come down the declaration. Hence the lowest precedence is of + and -
and the highest is of unary minus.
2) Rules Section: The rules section consists of all the productions that perform the operations. One
example is given as follows:
expression: expression '+' expression { $$ = $1 + $3; }
| NUMBER { $$ = $1; }
;
When a NUMBER token is returned by the lexical analyzer, it is reduced to an expression and its
value is assigned to the expression non-terminal. When addition happens, the values of both
expressions are added and assigned to the expression that results from the reduction.
3) Auxiliary Function Section: In main we just call the function yyparse().
We also have to define the function yyerror(). This function is called when there is a syntactic
error.
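To picture what the generated yyparse() accomplishes for the calculator grammar, here is a small hand-written C++ sketch (not produced by YACC; function names are our own). The two recursion levels encode the precedence of * and / over + and -, just as the %left declarations do:

```cpp
#include <cctype>
#include <string>

// expr: term (('+'|'-') term)*   -- '+' and '-' bind loosest
// term: NUMBER (('*'|'/') NUMBER)* -- '*' and '/' bind tightest
long parseTerm(const std::string& s, size_t& i);

long parseExpr(const std::string& s, size_t& i) {
    long v = parseTerm(s, i);
    while (i < s.size() && (s[i] == '+' || s[i] == '-')) {
        char op = s[i++];
        long r = parseTerm(s, i);
        v = (op == '+') ? v + r : v - r;   // the action: $$ = $1 op $3
    }
    return v;
}

long parseTerm(const std::string& s, size_t& i) {
    auto number = [&]() {                  // NUMBER: [0-9]+ (yylval = atoi(yytext))
        long n = 0;
        while (i < s.size() && std::isdigit((unsigned char)s[i]))
            n = n * 10 + (s[i++] - '0');
        return n;
    };
    long v = number();
    while (i < s.size() && (s[i] == '*' || s[i] == '/')) {
        char op = s[i++];
        long r = number();
        v = (op == '*') ? v * r : v / r;
    }
    return v;
}
```

Each while loop corresponds to repeatedly applying a production and running its action, which is exactly what the LALR parser's reduce steps do.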
CONCLUSION:
Thus we have studied Parser for sample language using YACC.
ASSIGNMENT NO:5

TITLE: Int. code generation for sample language using LEX and YACC.

OBJECTIVES:
1. To understand fourth phase of compiler: Intermediate code generation.
2. To learn and use compiler writing tools.
3. To learn how to write three address code for given statement.
PROBLEM STATEMENT:
Write an attributed translation grammar to recognize declarations of simple variables, "for",
assignment, if, if-else statements as per syntax of C and generate equivalent three address code
for the given input made up of constructs mentioned above using LEX and YACC .
SOFTWARE REQUIRED: Linux Operating Systems, GCC
INPUT: Input data as Sample language.
OUTPUT: It will generate Intermediate language for sample language.

MATHEMATICAL MODEL:
Let S be the solution perspective of the system such that
S={s, e, i, o, f, success, failure}
s=initial state of grammar
e = the end state of grammar.
i=Sample language Statement.
o=Intermediate code for language statement
Success-Intermediate code is generated.
Failure-Intermediate code is not generated or forced exit due to system error.

Mathematical model for above system.

THEORY:
In the analysis-synthesis model of a compiler, the front end analyzes a source program and
creates an intermediate representation, from which the back end generates target code. This
facilitates retargeting: enables attaching a back end for the new machine to an existing front end.

(The intermediate representation sits between the two ends: close to the source language at the front end, close to the machine language at the back end.)

A compiler front end is organized so that parsing, static checking, and intermediate-code
generation are done sequentially; sometimes they can be combined and folded into parsing. All
schemes can be implemented by creating a syntax tree and then walking the tree.
Static Checking
This includes type checking which ensures that operators are applied to compatible operands. It
also includes any syntactic checks that remain after parsing like
• flow–of-control checks
– Ex: Break statement within a loop construct
• Uniqueness checks
– Labels in case statements
• Name-related checks
Intermediate Representations
We could translate the source program directly into the target language. However, there are
benefits to having an intermediate, machine-independent representation.
• A clear distinction between the machine-independent and machine-dependent parts of the
compiler
• Retargeting is facilitated; the implementation of language processors for new machines will
require replacing only the back-end
• We could apply machine-independent code optimization techniques
Intermediate representations span the gap between the source and target languages.
• High Level Representations
– closer to the source language
– easy to generate from an input program
– code optimizations may not be straightforward
• Low Level Representations
– closer to the target machine
– Suitable for register allocation and instruction selection
Intermediate Languages:Three ways of intermediate code representation:
1. Syntax tree
2. Postfix notation
3. Three address code
The semantic rules for generating three-address code from common programming language
constructs are similar to those for constructing syntax trees or for generating postfix notation.
Graphical Representations
Syntax tree
A syntax tree depicts the natural hierarchical structure of a source program. A dag (Directed
Acyclic Graph) gives the same information but in a more compact way because common
subexpressions are identified. A syntax tree and dag for the assignment statement
a:=b*-c+b*-c
are as follows:
assign (:=)
    a
    +
        *
            b
            uminus
                c
        *
            b
            uminus
                c
(Each node's children are indented below it; in the dag the two identical subtrees for b * -c would be shared.)
Postfix notation
Postfix notation is a linearized representation of a syntax tree; it is a list of the nodes of the tree
in which a node appears immediately after its children. The postfix notation for the syntax tree
given above is a b c uminus * b c uminus * + assign
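Postfix notation can be evaluated with a single stack, which is why it is a convenient linearized form: operands are pushed, and each operator pops its arguments and pushes the result. A minimal C++ sketch (numeric operands stand in for the variables a, b, c above):

```cpp
#include <sstream>
#include <stack>
#include <string>

// Stack-based evaluation of postfix notation. "uminus" is the unary minus
// operator used in the text; only + and * are handled, as an illustration.
double evalPostfix(const std::string& expr) {
    std::stack<double> st;
    std::istringstream in(expr);
    std::string tok;
    while (in >> tok) {
        if (tok == "uminus") {                 // unary: pop one, push -value
            double v = st.top(); st.pop();
            st.push(-v);
        } else if (tok == "+" || tok == "*") { // binary: pop two, push result
            double r = st.top(); st.pop();
            double l = st.top(); st.pop();
            st.push(tok == "+" ? l + r : l * r);
        } else {
            st.push(std::stod(tok));           // an operand
        }
    }
    return st.top();
}
```

For instance, evaluating the postfix form of b * -c + b * -c with b = 2 and c = 3 walks the node list exactly in the child-before-parent order described above.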
Three-Address Code
Three-address code is a sequence of statements of the general form x := y op z, where x, y and z
are names, constants, or compiler-generated temporaries; op stands for any operator, such as a
fixed- or floating-point arithmetic operator, or a logical operator on Boolean-valued data. Thus a
source language expression like x + y*z might be translated into the sequence
t1 : = y * z
t2 : = x + t1
where t1 and t2 are compiler-generated temporary names. The reason for the term “three-
address code” is that each statement usually contains three addresses, two for the operands and
one for the result.
Implementation of Three-Address Statements: A three-address statement is an abstract form
of intermediate code. In a compiler, these statements can be implemented as records with fields
for the operator and the operands. Three such representations are: Quadruples, Triples, Indirect
triples.
A. Quadruples
A quadruple is a record structure with four fields, which are, op, arg1, arg2 and result. The op
field contains an internal code for the operator. The 3 address statement x = y op z is represented
by placing y in arg1, z in arg2 and x in result. The contents of fields arg1, arg2 and result are
normally pointers to the symbol-table entries for the names represented by these fields. If so,
temporary names must be entered into the symbol table as they are created. Fig. a) shows
quadruples for the assignment a := b * -c + b * -c.
B. Triples:
To avoid entering temporary names into the symbol table, we might refer to a temporary value
by the position of the statement that computes it. If we do so, three-address statements can be
represented by records with only three fields: op, arg1 and arg2. The fields arg1 and arg2, for
the arguments of op, are either pointers to the symbol table or pointers into the triple structure
( for temporary values ). Since three fields are used, this intermediate code format is known as
triples. Fig. b) shows the triples for the assignment statement a := b * -c + b * -c.

        op      arg1    arg2    result

(0)     uminus  c               t1
(1)     *       b       t1      t2
(2)     uminus  c               t3
(3)     *       b       t3      t4
(4)     +       t2      t4      t5
(5)     :=      t5              a
Fig. a. Quadruples

        op      arg1    arg2
(0)     uminus  c
(1)     *       b       (0)
(2)     uminus  c
(3)     *       b       (2)
(4)     +       (1)     (3)
(5)     :=      a       (4)
Fig. b. Triples
Quadruples & Triple representation of three-address statement
C. Indirect triples:
Indirect triple representation lists pointers to triples rather than the triples
themselves. Let us use an array, statement, to list pointers to triples in the desired order. Fig. c)
shows the indirect triple representation.
statement
(0)     (14)
(1)     (15)
(2)     (16)
(3)     (17)
(4)     (18)
(5)     (19)

        op      arg1    arg2
(14)    uminus  c
(15)    *       b       (14)
(16)    uminus  c
(17)    *       b       (16)
(18)    +       (15)    (17)
(19)    :=      a       (18)
Fig. c): Indirect triples representation of three-address statements
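The quadruple records above map directly onto a structure with four fields. A minimal C++ sketch (the emit helper and its temporary-naming scheme are our own) that rebuilds the table for a := b * -c + b * -c:

```cpp
#include <string>
#include <vector>

// One three-address statement stored as a quadruple record, with the four
// fields described in the text: op, arg1, arg2 and result.
struct Quad {
    std::string op, arg1, arg2, result;
};

// Emit a new quadruple, inventing a temporary name t1, t2, ... for the result.
std::string emit(std::vector<Quad>& table, const std::string& op,
                 const std::string& arg1, const std::string& arg2) {
    std::string temp = "t" + std::to_string(table.size() + 1);
    table.push_back({op, arg1, arg2, temp});
    return temp;
}

// Build the quadruple table for  a := b * -c + b * -c
std::vector<Quad> buildExample() {
    std::vector<Quad> table;
    std::string t1 = emit(table, "uminus", "c", "");
    std::string t2 = emit(table, "*", "b", t1);
    std::string t3 = emit(table, "uminus", "c", "");
    std::string t4 = emit(table, "*", "b", t3);
    std::string t5 = emit(table, "+", t2, t4);
    table.push_back({":=", t5, "", "a"});   // final assignment has no temporary
    return table;
}
```

In a real compiler the string fields would instead be pointers to symbol-table entries, as the text notes.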
Steps to execute the program
$ lex filename.l (eg: comp.l)
$ yacc -d filename.y (eg: comp.y)
$ cc lex.yy.c y.tab.c -ll -ly -lm
$ ./a.out
ALGORITHM:
Write a LEX and YACC program to generate Intermediate Code for arithmetic expression
LEX program
1. Declare header files, especially y.tab.h, which contains declarations for Letter, Digit, expr.
2. End the declaration section with %%.
3. Match the regular expression.
4. If a match is found then convert it into char and store it in yylval.p, where p is a pointer declared in
YACC.
5. Return the token.
6. If the input contains a newline character (\n) then return 0.
7. If the input contains '.' then return yytext[0].
8. End the rule-action section with %%.
9. Declare the main function:
a. open the file given at the command line
b. if any error occurs then print the error and exit
c. assign the file pointer fp to yyin
d. call the function yylex until the file ends
10. End
YACC program
1. Declare header files.
2. Declare a structure for three-address-code representation having fields argument1, argument2,
operator, result.
3. Declare a pointer of char type in the union.
4. Declare the token expr of type pointer p.
5. Give precedence to '*', '/'.
6. Give precedence to '+', '-'.
7. End the declaration section with %%.
8. If the final expression evaluates then add it to the table of three-address code.
9. If the input is an expression of the form:
a. exp '+' exp then add to the table argument1, argument2, operator.
b. exp '-' exp then add to the table argument1, argument2, operator.
c. exp '*' exp then add to the table argument1, argument2, operator.
d. exp '/' exp then add to the table argument1, argument2, operator.
e. '(' exp ')' then assign $2 to $$.
f. Digit OR Letter then assign $1 to $$.
10. End the section with %%.
11. Declare the file pointer *yyin externally.
12. Declare the main function and call the yyparse function until yyin ends.
13. Declare yyerror for when any error occurs.
14. Declare a char pointer s to print the error.
15. Print the error message.
16. End of the program.
In short:
Addtotable function: It adds argument1, argument2, operator and a temporary variable to
the structure array of three-address code.
Threeaddresscode function: It prints the values from the structure in the form: temporary
variable first, then argument1, operator, argument2.
Quadruple function: It prints the values from the structure in the form: operator first, then
argument1, argument2, and the result field.
Triple function: It prints the values from the structure in the form: argument1,
argument2, and operator. The temporary variables in this form are integer indices instead of
variables.
CONCLUSION:
Hence, we have successfully studied the concept of intermediate code generation for a sample
language.

ASSIGNMENT NO:6(D)
TITLE: Implement a simple approach for k-means / k-medoids clustering using C++.

OBJECTIVES:
1. Students should be able to implement the k-means program.
2. Students should be able to identify the mean of each cluster using C++ programming.
PROBLEM STATEMENT:
Implement a simple approach for k-means / k-medoids clustering using C++.

SOFTWARE REQUIREMENT: Latest version of a 64-bit operating system: open-source
Fedora 20

INPUT: Network traffic in a LAN


OUTPUT: Analysis and classification of network traffic in a LAN.

MATHEMATICAL MODEL:
Let S be the solution perspective of the system such that
S={s, e, i, o, f, DD, NDD, success, failure}
s = the initial state
e = the end state
i = input of the system, here the value of k and the data
o = output of the system, here k clusters
DD = deterministic data; it helps in identifying the load store functions or assignment functions.
NDD = non-deterministic data of the system S to be solved.
Success - desired outcome generated.
Failure - desired outcome not generated, or forced exit due to system error.

THEORY:

Introduction
The popular k-means algorithm for clustering has been around since the late 1950s, and the
standard algorithm was proposed by Stuart Lloyd in 1957. Given a set of n points, k-means
clustering aims to partition the points into k clusters (where k, the number of clusters, is a
parameter). The partitioning is done to minimize the objective function J = Σ_i Σ_{x ∈ S_i} ||x − c_i||²,
where c_i is the centroid of cluster S_i. The standard algorithm is a two-step algorithm:
 Assignment step. Each point is assigned to the cluster whose centroid it is
closest to.
 Update step. Using the new cluster assignments, the centroids of each cluster are
recalculated.
The algorithm has converged when no more assignment changes are happening with each
iteration. However, this algorithm can get stuck in local minima of the objective function and is
particularly sensitive to the initial cluster assignments. Also, situations can arise where the
algorithm will never converge but reaches steady state -- for instance, one point may be changing
between two cluster assignments.
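The two-step iteration described above can be sketched from scratch. The following is a minimal illustrative Python version, independent of the mlpack interfaces discussed below; the function and variable names are our own:

```python
import random

def kmeans(points, k, max_iters=1000):
    # Initialize centroids from k distinct points chosen at random.
    centroids = random.sample(points, k)
    assignments = [None] * len(points)
    for _ in range(max_iters):
        # Assignment step: each point joins the cluster with the nearest centroid.
        new_assignments = [
            min(range(k),
                key=lambda c: sum((x - m) ** 2 for x, m in zip(p, centroids[c])))
            for p in points
        ]
        if new_assignments == assignments:  # converged: no assignment changed
            break
        assignments = new_assignments
        # Update step: recompute each centroid as the mean of its cluster's points.
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:  # leave an empty cluster's centroid where it was
                centroids[c] = tuple(sum(xs) / len(members)
                                     for xs in zip(*members))
    return assignments, centroids
```

Note that this sketch simply leaves an empty cluster's centroid in place; mlpack's handling of empty clusters, described below, is more sophisticated.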
There is vast literature on the k-means algorithm and its uses, as well as strategies for choosing
initial points effectively and keeping the algorithm from converging in local minima. mlpack
does implement some of these, notably the Bradley-Fayyad algorithm (see the reference below)
for choosing refined initial points. Importantly, the C++ KMeans class makes it very easy to
improve the k-means algorithm in a modular way.
@inproceedings{bradley1998refining,
title={Refining initial points for k-means clustering},
author={Bradley, Paul S. and Fayyad, Usama M.},
booktitle={Proceedings of the Fifteenth International Conference on Machine
Learning (ICML 1998)},
volume={66},
year={1998}
}
mlpack provides:
 a simple command-line executable to run k-means
 a simple C++ interface to run k-means
Command-Line 'kmeans'
mlpack provides a command-line executable, kmeans, to allow easy execution of the k-means
algorithm on data. Complete documentation of the executable can be found by typing
$ kmeans --help

Below are several examples demonstrating simple use of the kmeans executable.

Simple k-means clustering


We want to find 5 clusters using the points in the file dataset.csv. By default, if any of the
clusters end up empty, that cluster will be reinitialized to contain the point furthest from the
cluster with maximum variance. The cluster assignments of each point will be stored in
assignments.csv; row i of assignments.csv holds the cluster of the point in row i of dataset.csv.
$ kmeans -c 5 -i dataset.csv -v -o assignments.csv
Allowing empty clusters
If you would like to allow empty clusters to exist, instead of reinitializing them, simply specify
the -e (--allow_empty_clusters) option. Note that when you save your clusters, some of the
clusters may be filled with NaNs. This is expected behavior -- if a cluster has no points, the
concept of a centroid makes no sense.
$ kmeans -c 5 -e -i dataset.csv -v -o assignments.csv -C centroids.csv
Limiting the maximum number of iterations
As mentioned earlier, the k-means algorithm can often fail to converge. In such a situation, it
may be useful to stop the algorithm by way of limiting the maximum number of iterations. This
can be done with the -m (--max_iterations) parameter, which is set to 1000 by default. If the
maximum number of iterations is 0, the algorithm will run until convergence -- or potentially
forever. The example below sets a maximum of 250 iterations.
$ kmeans -c 5 -i dataset.csv -v -o assignments.csv -m 250
Setting the overclustering factor
The mlpack k-means implementation allows "overclustering", which is when the k-means
algorithm is run with more than the requested number of clusters. Upon convergence, the clusters
with the nearest centroids are merged until only the requested number of centroids remain. This
can provide better clustering results. The overclustering factor, specified with -O or --
overclustering, determines how many more clusters are found than were requested. For instance,
with k set to 5 and an overclustering factor of 2, 10 clusters will be found. Note that the
overclustering factor does not need to be an integer.
The following code snippet finds 5 clusters, but with an overclustering factor of 2.4 (so 12
clusters are found and then merged together to produce 5 final clusters).
$ kmeans -c 5 -O 2.4 -i dataset.csv -v -o assignments.csv
The 'KMeans' class
The KMeans<> class (with default template parameters) provides a simple way to run k-means
clustering using mlpack in C++. The default template parameters for KMeans<> will initialize
cluster assignments randomly and disallow empty clusters. When an empty cluster is
encountered, the point furthest from the cluster with maximum variance is set to the centroid of
the empty cluster.
Running k-means and getting cluster assignments
The simplest way to use the KMeans<> class is to pass in a dataset and a number of clusters, and
receive the cluster assignments in return. Note that the dataset must be column-major -- that is,
one column corresponds to one point. See the matrices guide for more information.
#include <mlpack/methods/kmeans/kmeans.hpp>
using namespace mlpack::kmeans;
// The dataset we are clustering.
extern arma::mat data;
// The number of clusters we are getting.
extern size_t clusters;
// The assignments will be stored in this vector.
arma::Col<size_t> assignments;
// Initialize with the default arguments.
KMeans<> k;
k.Cluster(data, clusters, assignments);
Now the vector assignments holds the cluster assignments of each point in the dataset.
Running k-means and getting centroids of clusters
Often it is useful to not only have the cluster assignments, but the centroids of each cluster.
Another overload of Cluster() makes this easily possible:
#include <mlpack/methods/kmeans/kmeans.hpp>
using namespace mlpack::kmeans;
// The dataset we are clustering.
extern arma::mat data;
// The number of clusters we are getting.
extern size_t clusters;
// The assignments will be stored in this vector.
arma::Col<size_t> assignments;
// The centroids will be stored in this matrix.
arma::mat centroids;
// Initialize with the default arguments.
KMeans<> k;
k.Cluster(data, clusters, assignments, centroids);
Note that the centroids matrix has columns equal to the number of clusters and rows equal to the
dimensionality of the dataset. Each column represents the centroid of the according cluster --
centroids.col(0) represents the centroid of the first cluster.
Limiting the maximum number of iterations
The first argument to the constructor allows specification of the maximum number of iterations.
This is useful because often, the k-means algorithm does not converge, and is terminated after a
number of iterations. Setting this parameter to 0 indicates that the algorithm will run until
convergence -- note that in some cases, convergence may never happen. The default maximum
number of iterations is 1000.
// The first argument is the maximum number of iterations. Here we set it to
// 500 iterations.
KMeans<> k(500);
Then you can run Cluster() as normal.
Setting initial cluster centroids
An equally important option to being able to make initial cluster assignment guesses is to make
initial cluster centroid guesses without having to assign each point in the dataset to an initial
cluster. This is similar to the previous section, but now you must pass two extra booleans -- the
first (initialAssignmentGuess) as false, indicating that there are not initial cluster assignment
guesses, and the second (initialCentroidGuess) as true, indicating that the centroids matrix is
filled with initial centroid guesses.
This, of course, only works with the overload of Cluster() that takes a matrix to put the resulting
centroids in. Below is an example.
#include <mlpack/methods/kmeans/kmeans.hpp>
using namespace mlpack::kmeans;
// The dataset we are clustering on.
extern arma::mat dataset;
// The number of clusters we are obtaining.
extern size_t clusters;
// A matrix pre-filled with guesses for the initial cluster centroids.
extern arma::mat centroids;
// This will be filled with the final cluster assignments for each point.
arma::Col<size_t> assignments;
KMeans<> k;
// Remember, the first boolean indicates that we are not giving initial
// assignment guesses, and the second boolean indicates that we are giving
// initial centroid guesses.
k.Cluster(dataset, clusters, assignments, centroids, false, true);
If you have a heuristic or algorithm which makes initial guesses, a more elegant solution is to
create a new class fulfilling the InitialPartitionPolicy template policy. If you set the
InitialPartitionPolicy parameter to something other than the default but give an initial cluster
centroid guess, the InitialPartitionPolicy will not be used to initialize the algorithm.
Running sparse k-means
The Cluster() function can work on both sparse and dense matrices, so all of the above examples
can be used with sparse matrices instead. Below is a simple example. Note that the centroids are
returned as a sparse matrix also.
// The sparse dataset.
extern arma::sp_mat sparseDataset;
// The number of clusters.
extern size_t clusters;
// The assignments will be stored in this vector.
arma::Col<size_t> assignments;
// The centroids of each cluster will be stored in this sparse matrix.
arma::sp_mat sparseCentroids;
// No template parameter modification is necessary.
KMeans<> k;
k.Cluster(sparseDataset, clusters, assignments, sparseCentroids);
Template parameters for the 'KMeans' class
The KMeans<> class also takes three template parameters, which can be modified to change the
behavior of the k-means algorithm. There are three template parameters:

 MetricType: controls the distance metric used for clustering (by default, the squared
Euclidean distance is used)
 InitialPartitionPolicy: the method by which initial clusters are set; by default,
RandomPartition is used
 EmptyClusterPolicy: the action taken when an empty cluster is encountered; by default,
MaxVarianceNewCluster is used

The class is defined like below:


template<
typename MetricType = mlpack::metric::SquaredEuclideanDistance,
typename InitialPartitionPolicy = RandomPartition,
typename EmptyClusterPolicy = MaxVarianceNewCluster >
class KMeans;
In the following sections, each policy is described further, with examples of how to modify them.
CONCLUSION:

Hence, we have successfully studied the concept of k-means clustering and its implementation.


GROUP B

ASSIGNMENT NO:1

TITLE: 8-Queens Matrix is Stored using JSON/XML having first Queen placed, use back-
tracking to place remaining Queens to generate final 8-queen's Matrix using Python.
OBJECTIVES: To learn the 8-queens problem and solve it using backtracking
PROBLEM STATEMENT: Write a program : 8-Queens Matrix is Stored using JSON/XML
having first Queen placed, use back- tracking to place remaining Queens to generate final 8-
queen's Matrix using Python.
SOFTWARE REQUIRED: Latest version of 64-bit Open Source Operating System Fedora-19, Eclipse
INPUT: First Queen placed matrix
OUTPUT: Final 8-queens matrix
MATHEMATICAL MODEL:

INPUT:
I = set of 8 queens in an 8*8 matrix.
Function:
F1 = place the first queen in the given position.
F2 = for each subsequent column, try the rows from top to bottom and place a queen in the first safe row.
F3 = safe(row, col): check that no previously placed queen attacks [row, col] (same row or same diagonal).
F4 = backtrack: if no safe row exists in a column, remove the most recently placed queen and try its next row.
OUTPUT:
O1 = Success Case: all the inputs given to the system are entered correctly and the 8-queens problem is solved.
O2 = Failure Case: the input does not match the validation criteria.

Computational Model

THEORY
Backtracking is a problem-solving approach close to the brute-force method. We explore each
path that may lead to a solution, taking one decision at a time; as soon as we find that the
selected path cannot lead to a solution, we go back to the place where we took the most recent
decision. There we explore the other options, if any, to follow a different path. If no options are
available, we go back further. We go back until we either find an alternative path to follow or
reach the start. If we reach the start without finding any path leading to a solution, then no
solution exists; otherwise we would have found it along one of the paths.
Backtracking is a depth-first traversal of paths in a graph whose nodes are states of the solution,
with an edge between two states only if one state can be reached from the other.
Typical examples of backtracking are: the N-queens problem, Sudoku, the knight's tour problem, and crosswords.

The 8-queens problem is a case of a more general set of problems, namely the "n-queens
problem". The basic idea: how to place n queens on an n-by-n board so that they don't attack each
other. As we can expect, the complexity of solving the problem increases with n. We will briefly
introduce the solution by backtracking.
The board should be regarded as a set of constraints, and the solution is simply satisfying
all constraints. For example: Q1 attacks some positions, therefore Q2 has to comply with these
constraints and take a place not directly attacked by Q1. Placing Q3 is harder, since we have to
satisfy the constraints of Q1 and Q2. Going the same way, we may reach a point where the
constraints make the placement of the next queen impossible. Then we need to relax the
constraints and find a new solution. To do this we go backwards and find a new admissible
solution. To keep everything in order we keep a simple rule: last placed, first displaced. In other
words, if we successfully place a queen in the ith column but cannot find a solution for the
(i+1)th queen, then, going backwards, we first try to find another admissible solution for the ith
queen. This process is called backtracking.
Let's discuss this with an example.
Algorithm:
-Start with one queen at the first column first row
-Continue with second queen from the second column first row
-Go up until find a permissible situation
-Continue with next queen
How do we implement backtracking in code? Remember that we backtrack when we cannot find
an admissible position for a queen in a column; otherwise we go on to the next column
until we place a queen in the last column. The code must therefore contain a fragment like:
int PlaceQueen(int board[8], int row)
If (can place queen on ith column)
PlaceQueen(newboard, 0)
Else
PlaceQueen(oldboard, oldplace+1)
End
If you can place a queen on the ith column, try to place a queen on the next one; otherwise
backtrack and try to place a queen in a position above the solution found for the (i-1)th column.
ALGORITHM :

1) Start in the leftmost column

2) If all queens are placed return true

3) Try all rows in the current column. Do the following for every tried row.

a) If the queen can be placed safely in this row, then mark this [row, column] as part of the
solution and recursively check if placing the queen here leads to a solution.

b) If placing the queen in [row, column] leads to a solution, then return true.

c) If placing the queen doesn't lead to a solution, then unmark this [row, column] (backtrack) and go
to step (a) to try other rows.

4) If all rows have been tried and nothing worked, return false to trigger backtracking.
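The steps above can be sketched in Python (the language this assignment specifies). This is an illustrative sketch with our own names: board[c] holds the row of the queen in column c, and the first queen is pre-placed as the problem statement requires:

```python
def safe(board, row, col):
    # A position is safe if no earlier queen shares its row or a diagonal.
    for c in range(col):
        if board[c] == row or abs(board[c] - row) == col - c:
            return False
    return True

def place(board, col, n=8):
    # Try every row in this column; recurse, and backtrack on failure.
    if col == n:
        return True
    for row in range(n):
        if safe(board, row, col):
            board[col] = row
            if place(board, col + 1, n):
                return True
            board[col] = -1  # unmark [row, column] (backtrack)
    return False

def solve(first_row=0, n=8):
    # The first queen is pre-placed in column 0, as the assignment requires.
    board = [-1] * n
    board[0] = first_row
    return board if place(board, 1, n) else None
```

Calling solve(0) yields one complete placement with the first queen fixed at row 0; reading or writing the board as JSON/XML is left as part of the assignment.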

ANALYSIS :

The complexity of the backtracking algorithm for the 8-queens problem is O(n!).

General structure of a Solution using Backtracking

public ... backtrackSolve("problem N")
{
    if ( "problem N" is a solved problem )
    {
        print "(solved) problem N";   // It's a solution !
        return;
    }

    for ( each possible step S you can make in "problem N" )
    {
        if ( step S is a legal move )
        {
            Make step S;
            backtrackSolve("problem N-1");
            Take step S back;
        }
    }
}
CONCLUSION

Thus we have studied and implemented 8-Queens problem using backtracking.

ASSIGNMENT NO:2

TITLE: Implementation of 0-1 knapsack problem using branch and bound approach.
OBJECTIVES: To implement and apply 0-1 knapsack using branch and bound
PROBLEM STATEMENT: Implementation of 0-1 knapsack problem using branch and bound
approach
SOFTWARE REQUIREMENT: Latest version of 64-bit Open Source Operating System Fedora-20

INPUT: Number of objects n, their weights w[1..n] and profits p[1..n], and the knapsack capacity m.
OUTPUT: A selection x_j = 0 or 1 of objects with maximum total profit whose total weight does not exceed m.

MATHEMATICAL MODEL:
Let S be the solution perspective of the problem

S={s, e, i, o, f, DD, NDD, success, failure}

s = initial state: the given set of objects and the knapsack capacity

e = the end state: the optimal selection of objects

i = input of the system.

o = output of the system.

THEORY:

Knapsack Problem

There are many different knapsack problems. The first and classical one is the binary knapsack
problem. It has the following story. A tourist is planning a tour in the mountains. He has a lot of
objects which may be useful during the tour. For example, an ice pick and a can opener can be among
the objects. We suppose that the following conditions are satisfied.
• Each object has a positive value and a positive weight. (E.g. a balloon filled with helium has a
negative weight.) The value is the degree of contribution of the object to the success of the tour.
• The objects are independent from each other. (E.g. a can and a can opener are not independent, as
either of them without the other one has limited value.)
• The knapsack of the tourist is strong and large enough to contain all possible objects.
• The strength of the tourist makes it possible to bring only a limited total weight.
• But within this weight limit the tourist wants to achieve the maximal total value.
The following notations are used in the mathematical formulation of the problem:
n   the number of objects;
j   the index of the objects;
w_j the weight of object j;
v_j the value of object j;
b   the maximal weight the tourist can bring.
For each object j a so-called binary or zero-one decision variable, say x_j, is
introduced:
x_j = 1 if object j is present on the tour,
x_j = 0 if object j isn't present on the tour.
Notice that
w_j x_j = w_j if object j is present on the tour, and 0 if it isn't,
i.e. w_j x_j is the weight the object contributes to the knapsack.
Similarly, v_j x_j is the value of the object on the tour. The total weight in the
knapsack is

    sum_{j=1}^{n} w_j x_j

which may not exceed the weight limit. Hence the mathematical form of the problem is

    max sum_{j=1}^{n} v_j x_j                    (24.1)
    subject to sum_{j=1}^{n} w_j x_j <= b        (24.2)
    x_j = 0 or 1,  j = 1, ..., n.                (24.3)

The difficulty of the problem is caused by the integrality requirement (24.3). If that constraint
is substituted by the relaxed constraint 0 <= x_j <= 1, the problem becomes a linear program
whose optimum gives an upper bound for the knapsack problem.

Algorithm (upper-bound computation in the Horowitz–Sahni branch-and-bound formulation,
where profits are negated so that maximization becomes minimization):

Algorithm UBound(cp, cw, k, m)
// cp, cw: (negated) profit and weight accumulated so far;
// k: index of the last considered object; m: knapsack capacity.
{
    b := cp; c := cw;
    for i := k+1 to n do
    {
        if (c + w[i] <= m) then
        {
            c := c + w[i]; b := b - p[i];
        }
    }
    return b;
}
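For comparison with the pseudocode above, here is a hedged sketch of best-first branch and bound in Python. It uses the LP-relaxation (fractional) bound on positive profits rather than the textbook's negated-profit convention; all names are our own:

```python
import heapq

def knapsack_bb(weights, profits, capacity):
    # Order items by profit density; the fractional bound assumes this order.
    order = sorted(range(len(weights)),
                   key=lambda i: profits[i] / weights[i], reverse=True)
    w = [weights[i] for i in order]
    p = [profits[i] for i in order]
    n = len(w)

    def bound(level, cur_w, cur_p):
        # Relaxed upper bound: greedily add whole items, then a fraction.
        b, c = cur_p, cur_w
        for i in range(level, n):
            if c + w[i] <= capacity:
                c += w[i]
                b += p[i]
            else:
                return b + (capacity - c) * p[i] / w[i]
        return b

    best = 0
    heap = [(-bound(0, 0, 0), 0, 0, 0)]  # max-heap via negated bound
    while heap:
        neg_b, level, cur_w, cur_p = heapq.heappop(heap)
        if -neg_b <= best or level == n:
            continue  # prune: this subtree cannot beat the incumbent
        # Branch 1: include item `level`, if it fits.
        if cur_w + w[level] <= capacity:
            nw, np_ = cur_w + w[level], cur_p + p[level]
            best = max(best, np_)
            heapq.heappush(heap, (-bound(level + 1, nw, np_), level + 1, nw, np_))
        # Branch 2: exclude item `level`.
        heapq.heappush(heap, (-bound(level + 1, cur_w, cur_p),
                              level + 1, cur_w, cur_p))
    return best
```

Nodes whose relaxed bound cannot beat the best complete solution found so far are pruned, which is what distinguishes branch and bound from exhaustive search.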
CONCLUSION: Thus we have studied 0-1 knapsack problem using branch and bound
approach.

ASSIGNMENT NO:3

TITLE: Code optimization using DAG


OBJECTIVES:
1. To express concept of code optimization and directed acyclic graph
2. To implement code optimization using directed acyclic graphs.
PROBLEM STATEMENT:
Implement code optimization using directed acyclic graph.
SOFTWARE REQUIRED: Linux Operating Systems, GCC.
INPUT: Input data as intermediate code.
OUTPUT: It will create an optimized code or optimized intermediate code.
MATHEMATICAL MODEL:
Let S be the solution perspective of optimized code

S={s, e, i, o, f, DD, NDD, success, failure}

s=initial state that is intermediate or un-optimized code

e = the end state or optimized code.

i= input of the system.

o=output of the system.

DD = deterministic data; it helps identify the load/store or assignment functions.

NDD = non-deterministic data of the system S to be solved.

Computational Model

THEORY:
Optimization is the process of transforming a piece of code to make it more efficient (either in
terms of time or space) without changing its output or side effects. In the code optimization phase
the intermediate code is improved so that the output runs faster and occupies less space.
The output of this phase is another intermediate code with improved efficiency. The basic
requirement that optimization methods should comply with is that an optimized program must
have the same output and side effects as its non-optimized version. This requirement, however,
may be ignored in case the benefit from optimization is estimated to be more important than
probable consequences of a change in the program behavior.
Optimization can be performed by automatic optimizers or programmers. An optimizer is either
a specialized software tool or a built-in unit of a compiler (the so-called optimizing
compiler).Optimizations are classified into high-level and low-level optimizations. High-level
optimizations are usually performed by the programmer who handles abstract entities (functions,
procedures, classes, etc.) and keeps in mind the general framework of the task to optimize the
design of a system.
Control-Flow Analysis:- In control-flow analysis, the compiler figures out even more
information about how the program does its work, only now it can assume that there are no
syntactic or semantic errors in the code.
Control-flow analysis begins by constructing a control-flow graph , which is a graph of the
different possible paths program flow could take through a function. To build the graph, we first
divide the code into basic blocks.
Constant Propagation:- If a variable is assigned a constant value, then subsequent uses of that
variable can be replaced by the constant as long as no intervening assignment has changed the
value of the variable.
Code Motion :- Code motion (also called code hoisting ) unifies sequences of code common to
one or more basic blocks to reduce code size and potentially avoid expensive re-evaluation.
Peephole Optimizations :- Peephole optimization is a pass that operates on the target assembly
and only considers a few instructions at a time (through a "peephole") and attempts to do simple,
machine-dependent code improvements
Redundant instruction elimination:- At source code level, the following can be done by the user
(four successively simpler versions of the same function):

int add_ten(int x) { int y, z; y = 10; z = x + y; return z; }
int add_ten(int x) { int y; y = 10; y = x + y; return y; }
int add_ten(int x) { int y = 10; return x + y; }
int add_ten(int x) { return x + 10; }

At compilation level, the compiler searches for instructions redundant in nature. Multiple loading
and storing of instructions may carry the same meaning even if some of them are removed. For
example:

 MOV x, R0
 MOV R0, R1

We can delete the first instruction and rewrite the pair as: MOV x, R1
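A toy version of this elimination can be sketched as a peephole pass. This is illustrative only: instructions are modeled as ("MOV", src, dst) tuples of our own design rather than real assembler text, and operands of any instruction are assumed to be the elements after the opcode:

```python
def peephole_moves(code):
    # code: list of instruction tuples, e.g. ("MOV", "x", "R0").
    # Collapse MOV a,t ; MOV t,b into MOV a,b when t is not used afterwards
    # (a very crude liveness check over the remaining instructions).
    out = []
    i = 0
    while i < len(code):
        if (i + 1 < len(code)
                and code[i][0] == "MOV" and code[i + 1][0] == "MOV"
                and code[i][2] == code[i + 1][1]
                and not any(code[i][2] in ins[1:] for ins in code[i + 2:])):
            out.append(("MOV", code[i][1], code[i + 1][2]))
            i += 2  # both original moves are replaced by one
        else:
            out.append(code[i])
            i += 1
    return out
```

On the example above, the pair MOV x, R0 / MOV R0, R1 collapses to MOV x, R1; if R0 were read later, the pass would leave the pair alone.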
Unreachable code:- Unreachable code is a part of the program code that is never accessed
because of programming constructs. Programmers may have accidentally written a piece of code
that can never be reached.

Example:
int add_ten(int x)
{
return x + 10; printf("value of x is %d", x); }
In this code segment, the printf statement will never be executed as the program control returns
before it can execute; hence the printf can be removed.
Flow of control optimization:-There are instances in a code where the program control jumps
back and forth without performing any significant task. These jumps can be removed. Consider
the following chunk of code:...
MOV R1, R2
GOTO L1
...
L1 : GOTO L2
L2 : INC R1

In this code, label L1 can be removed as it merely passes the control to L2. So instead of jumping
to L1 and then to L2, the control can directly reach L2, as shown below:
...

MOV R1, R2
GOTO L2
...
L2 : INC R1
Algebraic expression simplification:- There are occasions where algebraic expressions can be
made simple. For example, the expression a = a + 0 can be replaced by a itself and the expression
a = a + 1 can simply be replaced by INC a.
Strength reduction:- There are operations that consume more time and space. Their 'strength'
can be reduced by replacing them with other operations that consume less time and space, but
produce the same result. For example, x * 2 can be replaced by x << 1, which involves only one
left shift. Though a * a and a² produce the same output, computing a² as a * a is much more
efficient than calling a power routine.
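These two rewrites can be sketched as one simplification function. This is an illustrative fragment with our own representation: an expression is an (operand, operator, operand) triple, and operands are variable names or integer constants:

```python
def simplify(a, op, b):
    # Algebraic simplification: identities that remove the operation entirely.
    if op == "+" and b == 0:
        return a                 # a + 0  ->  a
    if op == "*" and b == 1:
        return a                 # a * 1  ->  a
    # Strength reduction: replace a costly operation with a cheaper one.
    if op == "*" and b == 2:
        return (a, "<<", 1)      # a * 2  ->  a << 1
    if op == "**" and b == 2:
        return (a, "*", a)       # a**2   ->  a * a
    return (a, op, b)            # nothing applies
```

A real optimizer applies such rules repeatedly over the intermediate representation until no rule fires.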
Optimization Phases
The phase which represents the pertinent, possible flow of control is often called control flow
analysis. If this representation is graphical, then a flow graph depicts all possible execution
paths. Control flow analysis simplifies the data flow analysis. Data flow analysis is the process of
collecting information about the modification, preservation, and use of program "quantities" --
usually variables and expressions.Once control flow analysis and data flow analysis have been
done, the next phase, the improvement phase, improves the program code so that it runs faster or
uses less space. This phase is sometimes termed optimization. Thus, the term optimization is
used for this final code improvement, as well as for the entire process which includes control
flow analysis and data flow analysis. Optimization algorithms attempt to remove useless code,
eliminate redundant expressions, move invariant computations out of loops, etc.
Basic Blocks
A basic block is a sequence of intermediate representation constructs (quadruples, abstract
syntax trees, whatever) which allow no flow of control in or out of the block except at the top or
bottom. Figure 3 shows the structure of a basic block.

We will use the term statement for the intermediate representation and show it in quadruple form
because quadruples are easier to read than IR in tree form.

Leaders:A basic block consists of a leader and all code before the next leader. We define a
leader to be (1) the first statement in the program (or procedure), (2) the target of a branch,
identified most easily because it has a label, and (3) the statement after a "diverging flow " : the
statement after a conditional or unconditional branch.

Basic blocks can be built during parsing if it is assumed that all labels are referenced, or after
parsing without that assumption. The following example shows the outline of a FOR loop and its
basic blocks. Since a basic block consists of straight-line code, it computes a set of expressions.
Many optimizations are really transformations applied to basic blocks and to sequences of basic
blocks. Basic blocks for the FOR loop are shown in the figure below. Once we have computed
basic blocks, we can create the control flow graph.

Building a Flow Graph: A flow graph shows all possible execution paths. We will use this
information to perform optimizations. Formally, a flow graph is a directed graph G, with N
nodes and E edges. Example: Building the Control Flow Graph from basic blocks

Directed Acyclic Graphs (DAGs)

The previous sections looked at the control flow graph, whose nodes are basic blocks. DAGs, on the other
hand, create a useful structure for the intermediate representation within the basic blocks. A
directed graph with no cycles, called a DAG (Directed Acyclic Graph), is used to represent a
basic block. Thus, the nodes of a flow graph are themselves graphs! We can create a DAG
instead of an abstract syntax tree by modifying the operations for constructing the nodes. If the
node already exists, the operation returns a pointer to the existing node. Example 4 shows this for
the two-line assignment statement example.
Example: A DAG for two-line assignment statement

In the above example there are two references to a and two references to the quantity bb * 12.

Data Structure for a DAG: As above example shows, each node may have more than one
pointer to it. We can represent a DAG internally by a process known as value-numbering . Each
node is numbered and this number is entered whenever this node is reused. This is shown
following Example.
EXAMPLE: Value-numbering
Node    Left Child  Right Child
1.  X1
2.  a
3.  bb
4.  12
5.  *       (3)     (4)
6.  +       (2)     (5)
7.  :=      (1)     (6)
8.  X2
9.  2
10. /       (2)     (9)
11. +       (10)    (5)
12. :=      (8)     (11)
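The reuse that value-numbering captures can be sketched programmatically: node creation becomes a table lookup. This is an illustrative class with our own naming, not a production IR:

```python
class DagBuilder:
    def __init__(self):
        self.table = {}   # (op, left, right) -> value number
        self.nodes = []   # value number - 1  -> (op, left, right)

    def node(self, op, left=None, right=None):
        # Return the existing node for this (op, left, right) if one exists;
        # reusing nodes is exactly what turns the syntax tree into a DAG.
        key = (op, left, right)
        if key not in self.table:
            self.nodes.append(key)
            self.table[key] = len(self.nodes)  # 1-based value numbers
        return self.table[key]
```

Building bb * 12 a second time returns the same value number as the first, so the second assignment shares the subexpression instead of recomputing it.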

CONCLUSION:

Hence, we have successfully implemented code optimization using directed acyclic graph.
ASSIGNMENT NO: 4

TITLE: Code Generation using DAG/Labeled Tree

OBJECTIVES:
1. To express concept of DAG and Labeled Tree.
2. To apply the code generation algorithm to generate target code.
PROBLEM STATEMENT:
Accept Postfix expression. Create a DAG from that expression. Apply Labeling algorithm to
DAG and then apply code generation algorithm to generate target code from DAG.

SOFTWARE REQUIRED: 64-bit Fedora or equivalent OS, LEX, YACC.

INPUT: Input data as DAG/Labeled Tree which is generated from postfix expression.
OUTPUT: It will create a target code.

MATHEMATICAL MODEL:
Let S be the solution perspective of the code generation such that

S={s, e, i, o, f, DD, NDD, success, failure}


s=initial state that is start of program

e = the end state or end of program

i= postfix expression

o=target code

f= {ComputeLabel ( ), PostTraverse( ), Gencode ( )}

ComputeLabel ( )={Applies Labeling Algorithm to tree generated from postfix expression}

PostTraverse( )={ displays tree in postorder}

Gencode() = { Handles different cases of tree & accordingly generate target code}

COMPUTATIONAL MODEL:

S --e1--> A --e3--> B    (e2 is a self-loop at A)

Where S : Initial State

A : DAG / Labeled Tree

B : Target Code

Edge e1 : postfix expression

e2 : Compute Label (self-loop at A)

e3 : root node of labeled tree

The system accepts a postfix expression and enters state A. In that state it generates the tree and,
at the same time, calculates the label of each node. It then passes the root node of the labeled
tree to the Gencode function and enters state B. In state B it applies the code generation
algorithm and generates the target code.
Success- desired output is generated as target code in assembly language form
Failure- desired output is not generated as target code in assembly language form
THEORY:

The Labeling Algorithm: The labeling can be done by visiting nodes in bottom-up order, so that
a node is not visited until all its children are labeled.

Fig.1.1 gives the algorithm for computing the label at node n.


Fig.1.1 The Labeling Algorithm

In the important special case that n is a binary node and its children have labels l1 and l2, the
above formula reduces to

label(n) = max(l1, l2) if l1 ≠ l2, and label(n) = l1 + 1 if l1 = l2.
Example:

Fig. 1.2 Labeled Tree

Node a is labeled 1 since it is a left leaf. Node b is labeled 0 since it is a right leaf. Node t1 is
labeled 1 because the labels of its children are unequal and the maximum label of a child is
1. Fig.1.2 shows the labeled tree that results.
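The labeling rule can be sketched directly from the formula. This is an illustrative Python fragment; Node and compute_label are our own names, not part of any assigned interface:

```python
class Node:
    def __init__(self, op, left=None, right=None):
        self.op, self.left, self.right = op, left, right
        self.label = None

def compute_label(n, is_left=True):
    # Bottom-up: label the children first, then the node itself.
    if n.left is None and n.right is None:
        # A left leaf needs one register (label 1); a right leaf none (label 0).
        n.label = 1 if is_left else 0
    else:
        l1 = compute_label(n.left, True)
        l2 = compute_label(n.right, False)
        # Binary node: max of the children's labels if they differ, else one more.
        n.label = max(l1, l2) if l1 != l2 else l1 + 1
    return n.label
```

On the tree of Fig.1.2, leaf a gets label 1, leaf b gets label 0, and t1 gets label 1, matching the worked example above.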
Code generation from a Labeled Tree
Procedure GENCODE(n)
 RSTACK –stack of registers, R0,...,R(r-1)
 TSTACK –stack of temporaries, T0,T1,...
 A call to Gencode(n) generates code to evaluate a tree T, rooted at node n, into the
register top(RSTACK) ,and
o the rest of RSTACK remains in the same state as the one before the call

A swap of the top two registers of RSTACK is needed at some points in the algorithm to
ensure that a node is evaluated into the same register as its left child
CONCLUSION:

Hence, we have successfully studied DAG, Labeling algorithm and code generation from
Labeled Tree.

ASSIGNMENT NO:5

TITLE: Generating abstract syntax tree using Lex and YACC.

OBJECTIVES: To express and apply the concept of abstract syntax tree.

PROBLEM STATEMENT: Generate an abstract syntax tree using Lex and YACC

SOFTWARE REQUIRED: Lex and YACC

INPUT: Input data as default values and user defined values.

OUTPUT: Abstract syntax tree.


MATHEMATICAL MODEL
Let S be the solution perspective of the abstract syntax tree such that

S={s, e, i, o, f, success, failure}

s=initial state

e = the end state


i= input of the system.

o=output of the system.

Success-desired outcome generated.

Failure-Desired outcome not generated or forced exit due to system error.

s --e1--> B --e2--> R

Where,
s is the initial state,
e1 is the input,
B is the token generator, and
R is the result.

THEORY:
What is abstract syntax tree?
In computer science, an abstract syntax tree (AST), or just syntax tree, is a tree representation of
the abstract syntactic structure of source code written in a programming language. Each node of
the tree denotes a construct occurring in the source code. The syntax is "abstract" in not
representing every detail appearing in the real syntax. For instance, grouping parentheses are
implicit in the tree structure, and a syntactic construct like an if-condition-then expression may
be denoted by means of a single node with three branches.
This distinguishes abstract syntax trees from concrete syntax trees, traditionally designated parse
trees, which are often built by a parser during the source code translation and compiling process.
Once built, additional information is added to the AST by means of subsequent processing, e.g.,
contextual analysis. Abstract syntax trees are also used in program analysis and program
transformation systems.
The representation of source code as a tree of nodes representing constants or variables (leaves)
and operators or statements (inner nodes) is also called a "parse tree". An abstract syntax tree is
often the output of a parser (or the "parse stage" of a compiler), and forms the input to semantic
analysis and code generation (this assumes a phased compiler; many compilers interleave the
phases in order to conserve memory).
They are widely used in compilers, due to their property of representing the structure of program
code. An AST is usually the result of the syntax analysis phase of a compiler. It often serves as
an intermediate representation of the program through several stages that the compiler requires,
and has a strong impact on the final output of the compiler.
Generating abstract syntax tree using Lex and YACC
Yacc actions appear to the right of each rule, much like lex actions. We can associate pretty
much any C code that we like with each rule. Consider the yacc file. Tokens are allowed to have
values, which we can refer to in our yacc actions. The $1 in printf("%s",$1) (the 5th rule for exp)
refers to the value of the IDENTIFIER (notice that IDENTIFIER tokens are specified to have
string values). The $1 in printf("%d",$1) (the 6th rule for exp) refers to the value of the
INTEGER LITERAL. We use $1 in each of these cases because the tokens are the first items in
the right-hand side of the rule.
Example
4*5+6
First, we shift INTEGER LITERAL(4) onto the stack. We then reduce by the rule exp:
INTEGER LITERAL, and execute the printf statement, printing out a 4. We then shift a TIMES,
then shift an INTEGER LITERAL(5). Next, we reduce by the rule exp: INTEGER LITERAL
and print out a 5. Next, we reduce by the rule exp: exp TIMES exp (again, remember those
precedence directives!) and print out a *. Next, we shift a PLUS and an INTEGER LITERAL(6).
We reduce by exp: INTEGER LITERAL (printing out a 6), then we reduce by the rule exp: exp
+ exp (printing out a +), giving an output of:
45*6+
So what does this parser do? It converts infix expressions into postfix expressions.
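The conversion traced above can be sketched directly in Python (a hypothetical illustration, not part of the lab code), applying the same operator precedences that the yacc %left directives encode:

```python
# Minimal infix-to-postfix sketch (shunting-yard style) for single-digit
# operands; it reproduces the parser trace above. PREC maps precedence.
PREC = {'+': 1, '-': 1, '*': 2, '/': 2}

def to_postfix(expr):
    output, ops = [], []
    for tok in expr:
        if tok.isdigit():
            output.append(tok)                 # operands go straight out
        elif tok in PREC:
            # pop operators of equal or higher precedence (left-assoc)
            while ops and PREC[ops[-1]] >= PREC[tok]:
                output.append(ops.pop())
            ops.append(tok)
    while ops:
        output.append(ops.pop())
    return ''.join(output)

print(to_postfix('4*5+6'))  # prints 45*6+
```

Running it on the example input 4*5+6 yields the same postfix string 45*6+ as the yacc trace.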
Creating an Abstract Syntax Tree
C Definitions Between %{ and %} in the yacc file are the C definitions. The #include is
necessary for the yacc actions to have access to the tree types and constructors, defined in
treeDefinitions.h. The global variable root is where yacc will store the finished abstract syntax
tree.
• %union: The %union{ } command defines every value that could occur on the stack – not only
token values, but non-terminal values as well. This %union tells yacc that the types of values that
we could push on the stack are integers, strings (char *), and expressionTrees. Each element on
the yacc stack contains two items – a state, and a union (as defined by %union).
• %token: For tokens with values, %token <field> tells yacc the type of the token. For instance,
in the rule exp : INTEGER LITERAL { $$ = ConstantExpression($1)}, yacc looks at the union
variable on the stack to get the value of $1. The command %token <integer value> INTEGER
LITERAL tells yacc to use the integer value field of the union on the stack when looking for the
value of an INTEGER LITERAL.
• %type: Just like for tokens, these commands tell yacc what values are legal for non-terminals.
• %left: Precedence declarations for operators.
• Grammar rules: The rest of the yacc file, after the %%, are the grammar rules, of the form
<non-terminal> : <rhs> { /* C code */ }
where <non-terminal> is a non-terminal and <rhs> is a sequence of tokens and non-terminals.
Let’s look at an example. For clarity, we will represent the token INTEGER LITERAL(3) as just
3. Pointers will be represented graphically with arrows. The stack will grow down in our
diagrams. Consider the input string 3+4*5. First, the stack is empty, and the input is 3+4*5.
Example: Creating an abstract syntax tree for simple expressions
%{
#include "treeDefinitions.h"
expressionTree root;
%}
%union{
int integer_value;
char *string_value;
expressionTree expression_tree;
}
%token <integer_value> INTEGER_LITERAL
%token <string_value> IDENTIFIER
%token PLUS MINUS TIMES DIVIDE
%type <expression_tree> exp
%left PLUS MINUS
%left TIMES DIVIDE
%%
prog : exp { root = $1; }
     ;
exp : exp PLUS exp { $$ = OperatorExpression(PlusOp,$1,$3); }
    | exp MINUS exp { $$ = OperatorExpression(MinusOp,$1,$3); }
    | exp TIMES exp { $$ = OperatorExpression(TimesOp,$1,$3); }
    | exp DIVIDE exp { $$ = OperatorExpression(DivideOp,$1,$3); }
    | IDENTIFIER { $$ = IdentifierExpression($1); }
    | INTEGER_LITERAL { $$ = ConstantExpression($1); }
    ;
/* File treeDefinitions.c */
#include "treeDefinitions.h"
#include <stdio.h>
#include <stdlib.h> /* for malloc */
expressionTree OperatorExpression(optype op, expressionTree left,
expressionTree right) {
expressionTree retval = (expressionTree) malloc(sizeof(struct expression));
retval->kind = operatorExp;
retval->u.oper.op = op;
retval->u.oper.left = left;
retval->u.oper.right = right;
return retval;
}
expressionTree IdentifierExpression(char *variable) {
expressionTree retval = (expressionTree) malloc(sizeof(struct expression));
retval->kind = variableExp;
retval->u.variable = variable;
return retval;
}
expressionTree ConstantExpression(int constantval) {
expressionTree retval = (expressionTree) malloc(sizeof(struct expression));
retval->kind = constantExp;
retval->u.constantval = constantval;
return retval;
}
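A rough Python analogue of these constructors may make the tree shape clearer; the class and field names below simply mirror the C code and are not prescribed by the assignment:

```python
# Sketch of the same expression-tree constructors in Python. The three node
# classes mirror operatorExp / variableExp / constantExp in the C code.
class Expression:
    pass

class OperatorExpression(Expression):
    def __init__(self, op, left, right):
        self.op, self.left, self.right = op, left, right

class IdentifierExpression(Expression):
    def __init__(self, variable):
        self.variable = variable

class ConstantExpression(Expression):
    def __init__(self, constantval):
        self.constantval = constantval

# A post-order walk prints the tree in postfix, as in the earlier example.
def postfix(node):
    if isinstance(node, OperatorExpression):
        return postfix(node.left) + postfix(node.right) + node.op
    if isinstance(node, IdentifierExpression):
        return node.variable
    return str(node.constantval)

tree = OperatorExpression('+',
                          OperatorExpression('*', ConstantExpression(4),
                                             ConstantExpression(5)),
                          ConstantExpression(6))
print(postfix(tree))  # prints 45*6+
```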
CONCLUSION
Hence, we have successfully studied the concept of the abstract syntax tree.

ASSIGNMENT NO:6

TITLE: Implementing RDP for sample language


OBJECTIVES:
1. To understand the concept of a recursive descent parser (RDP).
2. To implement an RDP.
PROBLEM STATEMENT:
Accept a sample language expression and generate a recursive descent parser for the same.
SOFTWARE REQUIRED: 64-bit Fedora or equivalent OS, TurboC.
INPUT: Sample language expression.
OUTPUT: It will create parser for language expression
MATHEMATICAL MODEL:
Let S be the solution perspective of the recursive descent parser such that
S={s, e, i, o, f, success, failure}
s=initial state that is start of program
e = the end state or end of program
i= input expression
o=parsed output
f= {IsTerminal(),Match ( ), Error( )}
IsTerminal()={Check whether input Character is terminal or non-terminal}
Match ( )={Matches the current input token against a predicted token }
Error( )={ Generate Error message}
COMPUTATIONAL MODEL:

S --e1--> T --e2--> P
          T --e3--> B

Where S : Initial state
T : Function checking terminals
P : Procedure for a terminal
B : Match for a non-terminal
Edge e1 : Scan input character
e2 : Call procedure for a terminal
e3 : Match non-terminal and print it.
The system accepts an input expression and enters state T. In that state, it checks whether the
current input symbol is a terminal. If it is a terminal, the procedure for that terminal is created.
If the current symbol is a non-terminal, it is matched against the defined grammar.
Success - given input is successfully parsed.
Failure - parser is unable to process the input language.
THEORY:
Tokenizing
Tokenization is the process of converting input program text into a sequence of tokens.
LL(1) Grammars
A context-free grammar whose Predict sets are always disjoint (for the same non-
terminal) is said to be LL(1). LL(1) grammars are ideally suited for top-down parsing because it
is always possible to correctly predict the expansion of any non-terminal. No backup is ever
needed. LL(1) grammars are easy to parse in a top-down manner since predictions are always
correct.
Recursive Descent Parsing:Recursive descent parsing is a method of writing a compiler as a
collection of recursive functions. This is usually done by converting a BNF grammar
specification directly into recursive functions.
Recursive Descent Parsers
A recursive descent parser is a top-down parser, so called because it builds a parse tree
from the top (the start symbol) down, and from left to right, using an input sentence as a target as
it is scanned from left to right. (The actual tree is not constructed but is implicit in a sequence of
function calls.) This type of parser was very popular for real compilers in the past, but is not as
popular now. The parser is usually written entirely by hand and does not require any
sophisticated tools. It is a simple and effective technique, but is not as powerful as some of the
shift-reduce parsers -- not the one presented in class, but fancier similar ones called LR parsers.
Perhaps the hardest part of a recursive descent parser is the scanning: repeatedly fetching the
next token from the scanner. It is tricky to decide when to scan, and the parser doesn't work at all
if there is an extra scan or a missing scan.
Although parsers can be generated by parser generators, it is still sometimes convenient to write
a parser by hand. However, LALR(1) grammars are not easy to use to manually construct
parsers. Instead, we want an LL(1) grammar if we are going to manually construct a parser. An
LL(1) grammar can be used to construct a top-down or recursive descent parser where an
LALR(1) grammar is typically used to construct a bottom-up parser. A top-down parser
constructs (or at least traverses) the parse tree starting at the root of the tree and proceeding
downward. A bottom-up parser constructs or traverses the parse tree in a bottom-up fashion.
In a recursive descent parser, each non-terminal in the grammar becomes a function in a
program. The right hand side of the productions becomes the bodies of the functions. An
LALR(1) grammar is not appropriate for constructing a recursive descent parser. To create a
recursive-descent parser (the topic of this page) we must convert the LALR(1) grammar above to
an LL(1) grammar. Typically, there are two steps involved.

• Eliminate left recursion
• Perform left factorization where appropriate
Eliminate Left Recursion
Eliminating left recursion means eliminating rules like Expr -> Expr + Term. Rules like this are
left recursive because the Expr function would first call the Expr function in a recursive descent
parser. Without a base case first, we are stuck in infinite recursion (a bad thing). To eliminate left
recursion we look to see what Expr can be rewritten as. In this case, Expr can be only be
replaced by a Term so we replace Expr with Term in the productions. The usual way to eliminate
left recursion is to introduce a new non-terminal to handle all but the first part of the production.
So we get
Expr -> Term RestExpr
RestExpr -> + Term RestExpr | - Term RestExpr | <null>
We must also eliminate left recursion in the Term -> Term * Factor | Term / Factor productions in
the same way. We end up with an LL(1) grammar that looks like this:
Prog -> Expr EOF
Expr -> Term RestExpr
RestExpr -> + Term RestExpr | - Term RestExpr | <null>
Term -> Storable RestTerm
RestTerm -> * Storable RestTerm | / Storable RestTerm | <null>
Storable -> Factor S | Factor
Factor -> number | R | ( Expr )
Perform Left Factorization
Left factorization isn't needed on this grammar so this step is skipped. Left factorization is
needed when the first part of two or more productions is the same and the rest of the similar
productions are different. Left factorization is important in languages like Prolog because
without it the parser is inefficient. However, it isn't needed and does not improve efficiency
when writing a parser in an imperative language like Java, for instance.
Building A Recursive Descent Parser
We start with a procedure Match, that matches the current input token against a predicted token:
void Match(Terminal a)
{
    if (a == currentToken)
        currentToken = Scanner();
    else
        SyntaxError();
}
To build a parsing procedure for a non-terminal A, we look at all productions with A on the
left-hand side:
A → X1...Xn
A → Y1...Ym
...
We use predict sets to decide which production to match (LL(1) grammars always have disjoint
predict sets). We match a production’s right-hand side by calling Match to match terminals, and
calling parsing procedures to match non-terminals.
The general form of a parsing procedure for
A→X1...Xn | A→Y1...Ym | ... is
void A()
{
    if (currentToken in Predict(A→X1...Xn)) {
        for (i = 1; i <= n; i++)
            if (X[i] is a terminal)
                Match(X[i]);
            else
                X[i]();
    }
    else if (currentToken in Predict(A→Y1...Ym)) {
        for (i = 1; i <= m; i++)
            if (Y[i] is a terminal)
                Match(Y[i]);
            else
                Y[i]();
    }
    // ... handle other A→... productions ...
    else // no production predicted
        SyntaxError();
}
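Putting the pieces together, the following is a minimal Python sketch of a recursive descent parser for the arithmetic core of the LL(1) grammar above. It omits the Storable/S/R memory-key productions, and RestExpr/RestTerm are realized as loops, the usual iterative encoding of the tail-recursive productions:

```python
import re

def tokenize(text):
    # numbers and the operator/parenthesis terminals of the grammar
    return re.findall(r'\d+|[-+*/()]', text)

def parse(text):
    tokens = tokenize(text) + ['EOF']
    pos = [0]                        # current-token index (mutable cell)

    def match(expected):             # Match(): consume a predicted token
        if tokens[pos[0]] == expected:
            pos[0] += 1
        else:
            raise SyntaxError('expected %s, got %s' % (expected, tokens[pos[0]]))

    def expr():                      # Expr -> Term RestExpr
        value = term()
        while tokens[pos[0]] in ('+', '-'):
            op = tokens[pos[0]]; match(op)
            value = value + term() if op == '+' else value - term()
        return value

    def term():                      # Term -> Factor RestTerm
        value = factor()
        while tokens[pos[0]] in ('*', '/'):
            op = tokens[pos[0]]; match(op)
            value = value * factor() if op == '*' else value / factor()
        return value

    def factor():                    # Factor -> number | ( Expr )
        if tokens[pos[0]] == '(':
            match('('); value = expr(); match(')')
            return value
        value = int(tokens[pos[0]])
        pos[0] += 1
        return value

    value = expr()
    match('EOF')                     # Prog -> Expr EOF
    return value

print(parse('3+4*5'))  # prints 23
```

Here each non-terminal becomes a function and match() plays the role of the Match procedure above; evaluating instead of building a tree keeps the sketch short.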

CONCLUSION:
Hence, we have successfully studied to eliminate Left Recursion and generate a
Recursive Descent Parser.

ASSIGNMENT NO:7

TITLE

Implement Apriori approach for data mining to organize the data items

OBJECTIVES:
1. Students should be able to implement Apriori approach for data mining.
2. Student should be able to identify frequent item sets using association rule.
PROBLEM STATEMENT: Implement Apriori approach for data mining to organize the data
items on a shelf using following table of items purchased in a Mall

Transaction ID   Item1   Item2        Item3       Item4       Item5        Item6
T1               Mango   Onion        Jar         Key-chain   Eggs         Chocolates
T2               Nuts    Onion        Jar         Key-chain   Eggs         Chocolates
T3               Mango   Apple        Key-chain   Eggs        -            -
T4               Mango   Toothbrush   Corn        Key-chain   Chocolates   -
T5               Corn    Onion        Onion       Key-chain   Knife        Eggs

SOFTWARE REQUIRED: Latest version of a 64-bit open source operating system (Fedora 20).
INPUT: Data items (items purchased in a mall, e.g. Mango, Onion, etc.)

OUTPUT: Frequent itemsets purchased

MATHEMATICAL MODEL:

Let S be the solution perspective of the class


S={s, e, i, o, f, DD, NDD, success, failure}
s=initial state
e = end state
i= input of the system here is value of k and data items
o=output of the system. Here is k frequent itemset
DD - deterministic data; it helps identify the load store functions or assignment functions.
NDD- Non deterministic data of the system S to be solved.
Success- desired outcome generated.
Failure-Desired outcome not generated or forced exit due to system error

THEORY:
Apriori is a classic algorithm for frequent item set mining and association rule learning over
transactional databases. It proceeds by identifying the frequent individual items in the database
and extending them to larger and larger item sets as long as those item sets appear sufficiently
often in the database. The frequent item sets determined by Apriori can be used to determine
association rules which highlight general trends in the database. This has applications in domains
such as market basket analysis. Apriori uses a "bottom up" approach, where frequent subsets are
extended one item at a time (a step known as candidate generation), and groups of candidates are
tested against the data. The algorithm terminates when no further successful extensions are
found. Apriori is designed to operate on databases containing transactions (for example,
collections of items bought by customers, or details of website page visits).
Pseudo Code For Apriori Algorithm:
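The original manual shows the pseudo code as a figure; in its place, here is a compact Python sketch of the same level-wise algorithm. The function and variable names are our own, and min_support is taken as an absolute transaction count:

```python
from itertools import combinations

def apriori(transactions, min_support):
    """Return every frequent itemset with its support count."""
    items = {i for t in transactions for i in t}
    candidates = [frozenset([i]) for i in items]   # C1: single items
    frequent, k = {}, 1
    while candidates:
        # count support: in how many transactions each candidate occurs
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        survivors = {c: n for c, n in counts.items() if n >= min_support}
        frequent.update(survivors)                 # Lk joins the result
        # candidate generation (self-join) plus subset pruning
        candidates = []
        for a, b in combinations(survivors, 2):
            cand = a | b
            if (len(cand) == k + 1 and cand not in candidates
                    and all(frozenset(s) in survivors
                            for s in combinations(cand, k))):
                candidates.append(cand)
        k += 1
    return frequent

# T1..T5 from the worked example later in this assignment
transactions = [set('MONKEY'), set('DONKEY'), set('MAKE'),
                set('MUCKY'), set('COOKIE')]
freq = apriori(transactions, min_support=3)
print(sorted(''.join(sorted(s)) for s in freq if len(s) == 3))  # ['EKO']
```

On the worked example below, the only frequent 3-itemset it reports is {O, K, E}, matching the hand calculation in steps 1-7.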

Association rule learning: It is a popular and well-researched method for discovering interesting
relations between variables in large databases. It is intended to identify strong rules discovered in
databases using different measures of interestingness. For example, the rule
{onions, potatoes} => {hamburger meat}
found in the sales data of a supermarket would indicate that if a customer buys onions and
potatoes together, they are likely to also buy hamburger meat. Such information can be used as
the basis for decisions about marketing activities such as, e.g., promotional pricing or product
placements. In addition to the above example from market basket analysis association rules are
employed today in many application areas including Web usage mining, intrusion detection,
Continuous production, and bioinformatics. In contrast with sequence mining, association rule
learning typically does not consider the order of items either within a transaction or across
transactions.
Association rule mining is defined as: Let I = {i1, i2, ..., in} be a set of binary attributes
called items. Let D = {t1, t2, ..., tm} be a set of transactions called the database. Each
transaction in D has a unique transaction ID and contains a subset of the items in I. A rule is
defined as an implication of the form X => Y where X, Y ⊆ I and X ∩ Y = ∅. The sets of
items (for short, itemsets) X and Y are called the antecedent (left-hand side or LHS) and consequent
(right-hand side or RHS) of the rule, respectively.
Suppose you have records of a large number of transactions at a shopping center as follows:

Transactions   Items bought
T1             Item1, item2, item3
T2 Item1, item2
T3 Item2, item5
T4 Item1, item2, item5

Learning association rules basically means finding the items that are purchased together more
frequently than others.
For example in the above table you can see Item1 and item2 are bought together frequently.

What is the use of learning association rules?


• Shopping centers use association rules to place the items next to each other so that users buy
more items. If you are familiar with data mining you would know about the famous beer-diapers-Wal-Mart
story. Basically Wal-Mart studied their data and found that on Friday afternoon young
American males who buy diapers also tend to buy beer. So Wal-Mart placed beer next to diapers
and the beer sales went up. This is famous because no one would have predicted such a result
and that’s the power of data mining. You can Google for this if you are interested in further
details.
• Also if you are familiar with Amazon, they use association mining to recommend you the
items based on the current item you are browsing/buying.
• Another application is the Google auto-complete, where after you type in a word it searches
frequently associated words that users type after that particular word.

So as I said Apriori is the classic and probably the most basic algorithm to do it. Now if you
search online you can easily find the pseudo-code and mathematical equations and stuff. I would
like to make it more intuitive and easy, if I can.
I would like a 10th or 12th grader to be able to understand this without any problem, so I will
try not to use any terminology or jargon.
Let’s start with a non-simple example,

Transaction ID   Items Bought
T1               {Mango, Onion, Nintendo, Key-chain, Eggs, Yo-yo}
T2               {Doll, Onion, Nintendo, Key-chain, Eggs, Yo-yo}
T3               {Mango, Apple, Key-chain, Eggs}
T4               {Mango, Umbrella, Corn, Key-chain, Yo-yo}
T5               {Corn, Onion, Onion, Key-chain, Ice-cream, Eggs}
Now, we follow a simple golden rule: we say an item/itemset is frequently bought if it is bought
at least 60% of times. So for here it should be bought at least 3 times.
For simplicity
M = Mango
O = Onion
And so on……
So the table becomes
Original table:
Transaction ID   Items Bought
T1               {M, O, N, K, E, Y}
T2               {D, O, N, K, E, Y}
T3               {M, A, K, E}
T4               {M, U, C, K, Y}
T5               {C, O, O, K, I, E}

Step 1: Count the number of transactions in which each item occurs. Note ‘O=Onion’ is bought
4 times in total, but it occurs in just 3 transactions.
Item   No. of transactions
M 3
O 3
N 2
K 5
E 4
Y 3
D 1
A 1
U 1
C 2
I 1

Step 2: Now remember we said an item is frequently bought if it is bought at least 3 times.
So in this step we remove all the items that are bought less than 3 times from the above table and
we are left with

Item   Number of transactions
M 3
O 3
K 5
E 4
Y 3

These are the single items that are bought frequently. Now let’s say we want to find a pair of items
that are bought frequently. We continue from the above table (table in step 2).
Step 3: We start making pairs from the first item, like MO,MK,ME,MY and then we start with
the second item like OK,OE,OY. We did not do OM because we already did MO when we were
making pairs with M and buying a Mango and Onion together is same as buying Onion and
Mango together. After making all the pairs we get,

Item pairs
MO
MK
ME
MY
OK
OE
OY
KE
KY
EY
Step 4: Now we count how many times each pair is bought together. For example, M and O are
bought together only once, in {M,O,N,K,E,Y},
while M and K are bought together 3 times: in {M,O,N,K,E,Y}, {M,A,K,E} and {M,U,C,K,Y}.
After doing that for all the pairs we get

Item Pairs   Number of transactions
MO 1
MK 3
ME 2
MY 2
OK 3
OE 3
OY 2
KE 4
KY 3
EY 2
Step 5: Golden rule to the rescue. Remove all the item pairs with number of transactions less
than three and we are left with
Item Pairs   Number of transactions
MK 3
OK 3
OE 3
KE 4
KY 3
These are the pairs of items frequently bought together. Now let’s say we want to find a set of
three items that are bought together. We use the above table (table in step 5) and make a set of 3
items.
Step 6: To make the set of three items we need one more rule (it’s termed self-join).
It simply means, from the item pairs in the above table, we find two pairs with the same first
alphabet, so we get
• OK and OE, this gives OKE
• KE and KY, this gives KEY
Then we find how many times O,K,E are bought together in the original table and same for
K,E,Y and we get the following table

Item Set   Number of transactions
OKE 3
KEY 2

While we are on this, suppose you have sets of 3 items, say ABC, ABD, ACD, ACE, BCD, and
you want to generate item sets of 4 items; you look for two sets having the same first two
alphabets.
• ABC and ABD -> ABCD
• ACD and ACE -> ACDE

And so on … In general you have to look for sets having just the last alphabet/item different.
Step 7: So we again apply the golden rule, that is, the item set must be bought together at least 3
times, which leaves us with just OKE, since K, E and Y are bought together just two times.
Thus the set of three items that are bought together most frequently are O,K,E.
CONCLUSION:

Hence, we have successfully studied the concept of association rules and the Apriori algorithm.
ASSIGNMENT NO: 8

TITLE: Implement Naive Bayes for Concurrent/Distributed application. Approach should


handle categorical and continuous data.
OBJECTIVES:
1. Students should be able to understand the Naive Bayes theorem.
2. Students should be able to implement Naive Bayes for a Concurrent/Distributed application.
PROBLEM STATEMENT:
Implement Naive Bayes for a Concurrent/Distributed application. The approach should handle
categorical and continuous data.
SOFTWARE REQUIRED: Latest version of a 64-bit open source operating system (Fedora 20).

INPUT: Data set (pima-indians-diabetes.data)


OUTPUT: Analysis and classification of data

MATHEMATICAL MODEL:
Let S be the solution perspective of the class

S={s, e, i, o, f, DD, NDD, success, failure}

s=initial state that

e = the end state

i= input of the system here is set of data

o=output of the system.

DD - deterministic data; it helps identify the load store functions or assignment functions.

NDD- Non deterministic data of the system S to be solved.

Success- desired outcome generated.

Failure-Desired outcome not generated or forced exit due to system error

THEORY:
The Naive Bayes algorithm is an intuitive method that uses the probabilities of each attribute
belonging to each class to make a prediction. It is the supervised learning approach you would
come up with if you wanted to model a predictive modeling problem probabilistically.

Naive Bayes simplifies the calculation of probabilities by assuming that the probability of each
attribute belonging to a given class value is independent of all other attributes. This is a strong
assumption but results in a fast and effective method.The probability of a class value given a
value of an attribute is called the conditional probability. By multiplying the conditional
probabilities together for each attribute for a given class value, we have a probability of a data
instance belonging to that class.To make a prediction we can calculate probabilities of the
instance belonging to each class and select the class value with the highest probability. Naive
Bayes is often described using categorical data because it is easy to describe and calculate using
ratios. A more useful version of the algorithm for our purposes supports numeric attributes and
assumes the values of each numerical attribute are normally distributed (fall somewhere on a bell
curve). Again, this is a strong assumption, but still gives robust results.
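As a sketch of how the two kinds of attributes can be scored under the independence assumption (the helper names here are our own, not part of the tutorial code that follows):

```python
import math

# Continuous attribute: Gaussian likelihood from the class's mean and stdev.
def gaussian_probability(x, mean, stdev):
    exponent = math.exp(-((x - mean) ** 2) / (2 * stdev ** 2))
    return (1 / (math.sqrt(2 * math.pi) * stdev)) * exponent

# Categorical attribute: relative frequency within the class, with Laplace
# smoothing so an unseen value does not zero out the whole product.
def categorical_probability(value, class_values, alpha=1):
    distinct = len(set(class_values))
    return (class_values.count(value) + alpha) / (len(class_values) + alpha * distinct)

# Naive Bayes multiplies such per-attribute likelihoods with the class prior.
print(round(gaussian_probability(1.0, mean=1.0, stdev=1.0), 4))  # 0.3989
```

The tutorial below uses only the Gaussian form, since all attributes in the Pima Indians dataset are numeric.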

Predict the Onset of Diabetes: The test problem we will use in this tutorial is the Pima Indians
Diabetes problem.
This problem is comprised of 768 observations of medical details for Pima Indian patients. The
records describe instantaneous measurements taken from the patient such as their age, the
number of times pregnant and blood workup. All patients are women aged 21 or older. All
attributes are numeric, and their units vary from attribute to attribute.
Each record has a class value that indicates whether the patient suffered an onset of diabetes
within 5 years of when the measurements were taken (1) or not (0).
This is a standard dataset that has been studied a lot in machine learning literature. A good
prediction accuracy is 70%-76%.
Below is a sample from the pima-indians-diabetes.data.csv file to get a sense of the data we will
be working with. NOTE: Download this file and save it with a .csv extension (e.g. pima-indians-
diabetes.data.csv). See this file for a description of all the attributes.
6,148,72,35,0,33.6,0.627,50,1
1,85,66,29,0,26.6,0.351,31,0
8,183,64,0,0,23.3,0.672,32,1
1,89,66,23,94,28.1,0.167,21,0
0,137,40,35,168,43.1,2.288,33,1

Naive Bayes Algorithm Tutorial: This tutorial is broken down into the following steps:

1. Handle Data: Load the data from CSV file and split it into training and test datasets.
2. Summarize Data: summarize the properties in the training dataset so that we can
calculate probabilities and make predictions.
3. Make a Prediction: Use the summaries of the dataset to generate a single prediction.
4. Make Predictions: Generate predictions given a test dataset and a summarized training
dataset.
5. Evaluate Accuracy: Evaluate the accuracy of predictions made for a test dataset as the
percentage correct out of all predictions made.
6. Tie it Together: Use all of the code elements to present a complete and standalone
implementation of the Naive Bayes algorithm.

1. Handle Data: The first thing we need to do is load our data file. The data is in CSV format
without a header line or any quotes. We can open the file with the open function and read the
data lines using the reader function in the csv module.
We also need to convert the attributes that were loaded as strings into numbers so that we can
work with them. Below is the loadCsv() function for loading the Pima Indians dataset.
import csv

def loadCsv(filename):
    lines = csv.reader(open(filename, "r"))
    dataset = list(lines)
    for i in range(len(dataset)):
        dataset[i] = [float(x) for x in dataset[i]]
    return dataset

We can test this function by loading the pima indians dataset and printing the number of data
instances that were loaded.

filename = 'pima-indians-diabetes.data.csv'
dataset = loadCsv(filename)
print('Loaded data file {0} with {1} rows'.format(filename, len(dataset)))

Running this test, you should see something like:

Loaded data file pima-indians-diabetes.data.csv with 768 rows


Next we need to split the data into a training dataset that Naive Bayes can use to make
predictions and a test dataset that we can use to evaluate the accuracy of the model. We need to
split the data set randomly into train and test datasets with a ratio of 67% train and 33% test (this
is a common ratio for testing an algorithm on a dataset).
Below is the splitDataset() function that will split a given dataset using a given split ratio.
import random

def splitDataset(dataset, splitRatio):
    trainSize = int(len(dataset) * splitRatio)
    trainSet = []
    copy = list(dataset)
    while len(trainSet) < trainSize:
        index = random.randrange(len(copy))
        trainSet.append(copy.pop(index))
    return [trainSet, copy]

We can test this out by defining a mock dataset with 5 instances, split it into training and testing
datasets and print them out to see which data instances ended up where.

dataset = [[1], [2], [3], [4], [5]]
splitRatio = 0.67
train, test = splitDataset(dataset, splitRatio)
print('Split {0} rows into train with {1} and test with {2}'.format(len(dataset), train, test))

Running this test, you should see something like:

Split 5 rows into train with [[4], [3], [5]] and test with [[1], [2]]
2. Summarize Data
The Naive Bayes model is comprised of a summary of the data in the training dataset. This
summary is then used when making predictions.
The summary of the training data collected involves the mean and the standard deviation for
each attribute, by class value. For example, if there are two class values and 7 numerical
attributes, then we need a mean and standard deviation for each attribute (7) and class value (2)
combination, that is 14 attribute summaries.
These are required when making predictions to calculate the probability of specific attribute
values belonging to each class value.
We can break the preparation of this summary data down into the following sub-tasks:

1. Separate Data By Class
2. Calculate Mean
3. Calculate Standard Deviation
4. Summarize Dataset
5. Summarize Attributes By Class

Separate Data By Class


The first task is to separate the training dataset instances by class value so that we can calculate
statistics for each class. We can do that by creating a map of each class value to a list of
instances that belong to that class and sort the entire dataset of instances into the appropriate
lists.
The separateByClass() function below does just this.
def separateByClass(dataset):
    separated = {}
    for i in range(len(dataset)):
        vector = dataset[i]
        if (vector[-1] not in separated):
            separated[vector[-1]] = []
        separated[vector[-1]].append(vector)
    return separated
You can see that the function assumes that the last attribute (-1) is the class value. The function
returns a map of class values to lists of data instances.
We can test this function with some sample data, as follows:
dataset = [[1,20,1], [2,21,0], [3,22,1]]
separated = separateByClass(dataset)
print('Separated instances: {0}'.format(separated))

Running this test, you should see something like:

Separated instances: {0: [[2, 21, 0]], 1: [[1, 20, 1], [3, 22, 1]]}
Calculate Mean
We need to calculate the mean of each attribute for a class value. The mean is the central
tendency of the data, and we will use it as the middle of our Gaussian distribution when
calculating probabilities.
We also need to calculate the standard deviation of each attribute for a class value. The standard
deviation describes the variation or spread of the data, and we will use it to characterize the
expected spread of each attribute in our Gaussian distribution when calculating probabilities. The
standard deviation is calculated as the square root of the variance. The variance is calculated as
the average of the squared differences for each attribute value from the mean. Note we are using
the N-1 method, which subtracts 1 from the number of attribute values when calculating the
variance.
import math

def mean(numbers):
    return sum(numbers)/float(len(numbers))

def stdev(numbers):
    avg = mean(numbers)
    variance = sum([pow(x-avg,2) for x in numbers])/float(len(numbers)-1)
    return math.sqrt(variance)

We can test this by taking the mean of the numbers from 1 to 5.

numbers = [1,2,3,4,5]
print('Summary of {0}: mean={1}, stdev={2}'.format(numbers, mean(numbers), stdev(numbers)))
Running this test, you should see something like:

Summary of [1, 2, 3, 4, 5]: mean=3.0, stdev=1.58113883008


Summarize Dataset
Now we have the tools to summarize a dataset. For a given list of instances (for a class value) we
can calculate the mean and the standard deviation for each attribute.
The zip function groups the values for each attribute across our data instances into their own lists
so that we can compute the mean and standard deviation values for the attribute.
def summarize(dataset):
    summaries = [(mean(attribute), stdev(attribute)) for attribute in zip(*dataset)]
    del summaries[-1]  # drop the summary of the class value (last column)
    return summaries

We can test this summarize() function with some test data that shows markedly different mean
and standard deviation values for the first and second data attributes.

dataset = [[1,20,0], [2,21,1], [3,22,0]]
summary = summarize(dataset)
print('Attribute summaries: {0}'.format(summary))

Running this test, you should see something like:

Attribute summaries: [(2.0, 1.0), (21.0, 1.0)]


Summarize Attributes By Class
We can pull it all together by first separating our training dataset into instances grouped by class,
then calculating the summaries for each attribute.

def summarizeByClass(dataset):
    separated = separateByClass(dataset)
    summaries = {}
    for classValue, instances in separated.items():
        summaries[classValue] = summarize(instances)
    return summaries
We can test this summarizeByClass() function with a small test dataset.

dataset = [[1,20,1], [2,21,0], [3,22,1], [4,22,0]]
summary = summarizeByClass(dataset)
print('Summary by class value: {0}'.format(summary))

Running this test, you should see something like:

Summary by class value:
{0: [(3.0, 1.4142135623730951), (21.5, 0.7071067811865476)],
1: [(2.0, 1.4142135623730951), (21.0, 1.4142135623730951)]}
3. Make Prediction
We are now ready to make predictions using the summaries prepared from our training data.
Making predictions involves calculating the probability that a given data instance belongs to
each class, then selecting the class with the largest probability as the prediction.
We can divide this part into the following tasks:

1. Calculate Gaussian Probability Density Function


2. Calculate Class Probabilities
3. Make a Prediction
4. Estimate Accuracy

Calculate Gaussian Probability Density Function


We can use a Gaussian function to estimate the probability of a given attribute value, given the
known mean and standard deviation for the attribute estimated from the training data.
Given that the attribute summaries were prepared for each attribute and class value, the result is
the conditional probability of a given attribute value given a class value.
See the references for the details of the Gaussian probability density function; written out, it is

    f(x) = (1 / (sqrt(2*pi) * stdev)) * exp(-((x - mean)^2) / (2 * stdev^2))

In summary, we are plugging our known details into the Gaussian (attribute value, mean and
standard deviation) and reading off the likelihood that our attribute value belongs to the class.
In the calculateProbability() function we calculate the exponent first, then calculate the main
division. This lets us fit the equation nicely on two lines.
import math

def calculateProbability(x, mean, stdev):
    exponent = math.exp(-(math.pow(x-mean,2)/(2*math.pow(stdev,2))))
    return (1 / (math.sqrt(2*math.pi) * stdev)) * exponent
We can test this with some sample data, as follows.

x = 71.5
mean = 73
stdev = 6.2
probability = calculateProbability(x, mean, stdev)
print('Probability of belonging to this class: {0}'.format(probability))

Running this test, you should see something like:

Probability of belonging to this class: 0.0624896575937


Calculate Class Probabilities
Now that we can calculate the probability of an attribute belonging to a class, we can combine
the probabilities of all of the attribute values for a data instance and come up with a probability
of the entire data instance belonging to the class.
We combine probabilities together by multiplying them. In the calculateClassProbabilities()
function below, the probability of a given data instance is calculated by multiplying together the
attribute probabilities for each class. The result is a map of class values to probabilities.

def calculateClassProbabilities(summaries, inputVector):
    probabilities = {}
    for classValue, classSummaries in summaries.items():
        probabilities[classValue] = 1
        for i in range(len(classSummaries)):
            mean, stdev = classSummaries[i]
            x = inputVector[i]
            probabilities[classValue] *= calculateProbability(x, mean, stdev)
    return probabilities

We can test the calculateClassProbabilities() function.

summaries = {0:[(1, 0.5)], 1:[(20, 5.0)]}
inputVector = [1.1, '?']
probabilities = calculateClassProbabilities(summaries, inputVector)
print('Probabilities for each class: {0}'.format(probabilities))
Running this test, you should see something like:

Probabilities for each class: {0: 0.7820853879509118, 1: 6.298736258150442e-05}


Make a Prediction
Now that we can calculate the probability of a data instance belonging to each class value, we can
look for the largest probability and return the associated class.

The predict() function below does just that.

def predict(summaries, inputVector):
    probabilities = calculateClassProbabilities(summaries, inputVector)
    bestLabel, bestProb = None, -1
    for classValue, probability in probabilities.items():
        if bestLabel is None or probability > bestProb:
            bestProb = probability
            bestLabel = classValue
    return bestLabel

We can test the predict() function as follows:

summaries = {'A':[(1, 0.5)], 'B':[(20, 5.0)]}
inputVector = [1.1, '?']
result = predict(summaries, inputVector)
print('Prediction: {0}'.format(result))

Running this test, you should see something like:

Prediction: A
4. Make Predictions
Finally, we can estimate the accuracy of the model by making predictions for each data instance
in our test dataset. The getPredictions() function will do this and return a list of predictions for
each test instance.
def getPredictions(summaries, testSet):
    predictions = []
    for i in range(len(testSet)):
        result = predict(summaries, testSet[i])
        predictions.append(result)
    return predictions

We can test the getPredictions() function.

summaries = {'A':[(1, 0.5)], 'B':[(20, 5.0)]}
testSet = [[1.1, '?'], [19.1, '?']]
predictions = getPredictions(summaries, testSet)
print('Predictions: {0}'.format(predictions))

Running this test, you should see something like:

Predictions: ['A', 'B']


5. Get Accuracy
The predictions can be compared to the class values in the test dataset, and a classification
accuracy can be calculated as an accuracy ratio between 0% and 100%. The getAccuracy()
function will calculate this accuracy ratio.
def getAccuracy(testSet, predictions):
    correct = 0
    for x in range(len(testSet)):
        if testSet[x][-1] == predictions[x]:
            correct += 1
    return (correct/float(len(testSet))) * 100.0

We can test the getAccuracy() function using the sample code below.

testSet = [[1,1,1,'a'], [2,2,2,'a'], [3,3,3,'b']]
predictions = ['a', 'a', 'a']
accuracy = getAccuracy(testSet, predictions)
print('Accuracy: {0}'.format(accuracy))

Running this test, you should see something like:


Accuracy: 66.6666666667
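All of the helper functions are now in place. As a compact, self-contained sketch of how they fit together end-to-end, the pipeline can be exercised on a tiny made-up dataset (the values below are illustrative only, chosen so the two classes are well separated; the last column of each row is the class):

```python
import math

def mean(numbers):
    return sum(numbers) / float(len(numbers))

def stdev(numbers):
    avg = mean(numbers)
    variance = sum(pow(x - avg, 2) for x in numbers) / float(len(numbers) - 1)
    return math.sqrt(variance)

def summarizeByClass(dataset):
    # group rows by class, then summarize each attribute column per class,
    # dropping the summary of the class column itself
    separated = {}
    for row in dataset:
        separated.setdefault(row[-1], []).append(row)
    return {cls: [(mean(col), stdev(col)) for col in zip(*rows)][:-1]
            for cls, rows in separated.items()}

def calculateProbability(x, m, s):
    exponent = math.exp(-((x - m) ** 2) / (2 * s ** 2))
    return (1 / (math.sqrt(2 * math.pi) * s)) * exponent

def predict(summaries, inputVector):
    bestLabel, bestProb = None, -1
    for cls, classSummaries in summaries.items():
        prob = 1
        for i, (m, s) in enumerate(classSummaries):
            prob *= calculateProbability(inputVector[i], m, s)
        if bestLabel is None or prob > bestProb:
            bestLabel, bestProb = cls, prob
    return bestLabel

trainSet = [[1, 20, 0], [2, 21, 0], [7, 28, 1], [8, 29, 1]]
testSet = [[1.5, 20.5, 0], [7.5, 28.5, 1]]
summaries = summarizeByClass(trainSet)
predictions = [predict(summaries, row) for row in testSet]
correct = sum(1 for row, p in zip(testSet, predictions) if row[-1] == p)
print('Predictions: {0}, Accuracy: {1}%'.format(
    predictions, correct / float(len(testSet)) * 100.0))
```

This mirrors the structure of the full listing in the next step, minus file loading and the train/test split.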
6. Tie it Together
Finally, we need to tie it all together. Below is the full code listing for Naive Bayes
implemented from scratch in Python.
# Example of Naive Bayes implemented from scratch in Python
import csv
import random
import math
...

CONCLUSION:

Hence, we have implemented the Naive Bayes concept for a Concurrent/Distributed application.

ASSIGNMENT NO: 9

TITLE: Implementation of the k-NN approach with a suitable example.

OBJECTIVES:
1. Students should be able to implement the k-NN algorithm.
2. Students should be able to apply the k-NN algorithm in a program.
PROBLEM STATEMENT:
Implement a simple approach for the k-NN algorithm using C++.

SOFTWARE REQUIRED: Latest version of a 64-bit open-source operating system, Fedora 20

INPUT: Data file with cluster number


OUTPUT: Clusters
MATHEMATICAL MODEL:
Let S be the solution perspective of the class:

S = {s, e, i, o, f, DD, NDD, success, failure}

s = initial state, that is, the start of the program

e = the end state

i = input of the system, here the value of k and the data

o = output of the system, here the k clusters

DD - deterministic data; it helps identify the load/store or assignment functions.

NDD - non-deterministic data of the system S to be solved.

Success - desired outcome generated.

Failure - desired outcome not generated, or forced exit due to system error.

THEORY:
The k-means algorithm
The k-means algorithm is a simple iterative method to partition a given dataset into a user-
specified number of clusters, k. This algorithm has been discovered by several researchers across
different disciplines, most notably Lloyd (1957, 1982), Forgy (1965), Friedman and Rubin
(1967), and MacQueen (1967). A detailed history of k-means along with descriptions of several
variations are given in the literature. Gray and Neuhoff provide a nice historical background for
k-means placed in the larger context of hill-climbing algorithms.

The algorithm operates on a set of d-dimensional vectors, D = {x_i | i = 1, . . . , N}, where x_i ∈
R^d denotes the i-th data point. The algorithm is initialized by picking k points in R^d as the initial
k cluster representatives or "centroids". Techniques for selecting these initial seeds include
sampling at random from the dataset, setting them as the solution of clustering a small subset of
the data, or perturbing the global mean of the data k times. Then the algorithm iterates between
two steps till convergence:
Step 1: Data Assignment. Each data point is assigned to its closest centroid, with ties broken
arbitrarily. This results in a partitioning of the data.
Step 2: Relocation of “means”. Each cluster representative is relocated to the center (mean) of all
data points assigned to it. If the data points come with a probability measure (weights), then the
relocation is to the expectations (weighted mean) of the data partitions.
The algorithm converges when the assignments (and hence the c_j values) no longer change. The
algorithm execution is visually depicted in Fig. 1. Note that each iteration needs N × k
comparisons, which determines the time complexity of one iteration. The number of iterations
required for convergence varies and may depend on N, but as a first cut, this algorithm can be
considered linear in the dataset size.
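The two steps above can be sketched in Python. This is a minimal Lloyd-style sketch; the sample points, the choice of k, and the fixed iteration count are illustrative assumptions, not part of the description above:

```python
import random

def kmeans(points, k, iterations=20):
    """Minimal Lloyd-style k-means: assign points to nearest centroid,
    then relocate each centroid to the mean of its cluster."""
    random.seed(1)                        # fixed seed for reproducibility
    centroids = random.sample(points, k)  # seed centroids from the data
    clusters = [[] for _ in range(k)]
    for _ in range(iterations):
        # Step 1: data assignment - each point goes to its closest centroid
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda idx: sum(
                (a - b) ** 2 for a, b in zip(p, centroids[idx])))
            clusters[nearest].append(p)
        # Step 2: relocation of "means" - move each centroid to its cluster mean
        for idx, cluster in enumerate(clusters):
            if cluster:
                centroids[idx] = tuple(
                    sum(coord) / float(len(cluster)) for coord in zip(*cluster))
    return centroids, clusters

points = [(1, 1), (1.5, 2), (8, 8), (8.5, 9)]  # two well-separated groups
centroids, clusters = kmeans(points, 2)
print(sorted(len(c) for c in clusters))
```

With these well-separated points the algorithm recovers the two obvious groups regardless of which points are picked as seeds.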
One issue to resolve is how to quantify "closest" in the assignment step. The default measure of
closeness is the Euclidean distance, in which case one can readily show that the non-negative
cost function

    sum_{i=1}^{N} ( min_j ||x_i - c_j||^2 )    (1)

will decrease whenever there is a change in the assignment or the relocation steps, and hence
convergence is guaranteed in a finite number of iterations. The greedy-descent nature of k-means
on a non-convex cost also implies that the convergence is only to a local optimum, and indeed
the algorithm is typically quite sensitive to the initial centroid locations. Figure 2 illustrates
how a poorer result is obtained for the same dataset as in Fig. 1 for a different choice of the three
initial centroids. The local minima problem can be countered to some extent by running the
algorithm multiple times with different initial centroids, or by doing limited local search about
the converged solution.

Fig. 1 [figure: a k-means clustering result]

Fig. 2 Effect of an inferior initialization on the k-means results [figure]
 Generalizations and connections

As mentioned earlier, k-means is closely related to fitting a mixture of k isotropic Gaussians to


the data. Moreover, the generalization of the distance measure to all Bregman divergences is
related to fitting the data with a mixture of k components from the exponential family of
distributions. Another broad generalization is to view the “means” as probabilistic models
instead of points in Rd . Here, in the assignment step, each data point is assigned to the most
likely model to have generated it. In the “relocation” step, the model parameters are updated to
best fit the assigned datasets. Such model-based k-means allow one to cater to more complex
data, e.g. sequences described by Hidden Markov models.
One can also "kernelize" k-means. Though boundaries between clusters are still linear in the
implicit high-dimensional space, they can become non-linear when projected back to the original
space, thus allowing kernel k-means to deal with more complex clusters. Dhillon et al. [19] have
shown a close connection between kernel k-means and spectral clustering. The k-medoid
algorithm is similar to k-means except that the centroids have to belong to the data set being
clustered. Fuzzy c-means is also similar, except that it computes fuzzy membership functions for
each cluster rather than a hard one.
Despite its drawbacks, k-means remains the most widely used partitional clustering algorithm in
practice. The algorithm is simple, easily understandable and reasonably scalable, and can be
easily modified to deal with streaming data. To deal with very large datasets, substantial effort
has also gone into further speeding up k-means, most notably by using kd-trees or exploiting the
triangular inequality to avoid comparing each data point with all the centroids during the
assignment step. Continual improvements and generalizations of the
basic algorithm have ensured its continued relevance and gradually increased its effectiveness as
well.
 How to implement k-Nearest Neighbors in Python

1. Handle Data

The first thing we need to do is load our data file. The data is in CSV format without a header
line or any quotes. We can open the file with the open function and read the data lines using the
reader function in the csv module.

import csv

with open('iris.data', 'r') as csvfile:
    lines = csv.reader(csvfile)
    for row in lines:
        print(', '.join(row))
Next we need to split the data into a training dataset that kNN can use to make predictions and a
test dataset that we can use to evaluate the accuracy of the model. We first need to convert the
flower measures that were loaded as strings into numbers that we can work with. Next we need
to split the data set randomly into train and test datasets. A ratio of 67/33 for train/test is a
standard ratio used. Pulling it all together, we can define a function called loadDataset that loads
a CSV with the provided filename and splits it randomly into train and test datasets using the
provided split ratio.

import csv
import random

def loadDataset(filename, split, trainingSet=[], testSet=[]):
    with open(filename, 'r') as csvfile:
        lines = csv.reader(csvfile)
        dataset = list(lines)
        for x in range(len(dataset)-1):
            for y in range(4):
                dataset[x][y] = float(dataset[x][y])
            if random.random() < split:
                trainingSet.append(dataset[x])
            else:
                testSet.append(dataset[x])
Download the iris flowers dataset CSV file to the local directory. We can test this function out
with our iris dataset, as follows:
trainingSet = []
testSet = []
loadDataset('iris.data', 0.66, trainingSet, testSet)
print('Train: ' + repr(len(trainingSet)))
print('Test: ' + repr(len(testSet)))

2. Similarity
In order to make predictions we need to calculate the similarity between any two given data
instances. This is needed so that we can locate the k most similar data instances in the training
dataset for a given member of the test dataset and in turn make a prediction.
Given that all four flower measurements are numeric and have the same units, we can directly
use the Euclidean distance measure. This is defined as the square root of the sum of the squared
differences between the two arrays of numbers (read that again a few times and let it sink in).
Additionally, we want to control which fields to include in the distance calculation. Specifically,
we only want to include the first 4 attributes. One approach is to limit the euclidean distance to a
fixed length, ignoring the final dimension.
Putting all of this together we can define the euclideanDistance function as follows:

import math

def euclideanDistance(instance1, instance2, length):
    distance = 0
    for x in range(length):
        distance += pow((instance1[x] - instance2[x]), 2)
    return math.sqrt(distance)
We can test this function with some sample data, as follows:

data1 = [2, 2, 2, 'a']
data2 = [4, 4, 4, 'b']
distance = euclideanDistance(data1, data2, 3)
print('Distance: ' + repr(distance))

3. Neighbors

Now that we have a similarity measure, we can use it to collect the k most similar instances for a
given unseen instance. This is a straightforward process of calculating the distance for all
instances and selecting a subset with the smallest distance values. Below is the getNeighbors
function that returns the k most similar neighbors from the training set for a given test instance
(using the already defined euclideanDistance function).

import operator

def getNeighbors(trainingSet, testInstance, k):
    distances = []
    length = len(testInstance)-1
    for x in range(len(trainingSet)):
        dist = euclideanDistance(testInstance, trainingSet[x], length)
        distances.append((trainingSet[x], dist))
    distances.sort(key=operator.itemgetter(1))
    neighbors = []
    for x in range(k):
        neighbors.append(distances[x][0])
    return neighbors

We can test out this function as follows:

trainSet = [[2, 2, 2, 'a'], [4, 4, 4, 'b']]
testInstance = [5, 5, 5]
k = 1
neighbors = getNeighbors(trainSet, testInstance, k)
print(neighbors)

4. Response
Once we have located the most similar neighbors for a test instance, the next task is to devise a
predicted response based on those neighbors. We can do this by allowing each neighbor to vote
for their class attribute, and take the majority vote as the prediction. Below provides a function
for getting the majority voted response from a number of neighbors. It assumes the class is the
last attribute for each neighbor.

import operator

def getResponse(neighbors):
    classVotes = {}
    for x in range(len(neighbors)):
        response = neighbors[x][-1]
        if response in classVotes:
            classVotes[response] += 1
        else:
            classVotes[response] = 1
    sortedVotes = sorted(classVotes.items(), key=operator.itemgetter(1), reverse=True)
    return sortedVotes[0][0]

We can test out this function with some test neighbors, as follows:

neighbors = [[1,1,1,'a'], [2,2,2,'a'], [3,3,3,'b']]
response = getResponse(neighbors)
print(response)
This approach returns one response in the case of a draw, but you could handle such cases in a
specific way, such as returning no response or selecting an unbiased random response.
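For example, a tie-aware variant could select uniformly at random among the tied classes (getResponseWithTies is a hypothetical helper name, not part of the tutorial code):

```python
import random

def getResponseWithTies(neighbors):
    # count votes for the class label (last attribute of each neighbor)
    classVotes = {}
    for neighbor in neighbors:
        response = neighbor[-1]
        classVotes[response] = classVotes.get(response, 0) + 1
    # collect every class that reached the top vote count
    topCount = max(classVotes.values())
    tied = [label for label, count in classVotes.items() if count == topCount]
    # break ties with an unbiased random choice
    return random.choice(tied)

neighbors = [[1, 1, 1, 'a'], [2, 2, 2, 'a'], [3, 3, 3, 'b']]
print(getResponseWithTies(neighbors))  # 'a' wins outright here, no tie to break
```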

5. Accuracy

We have all of the pieces of the kNN algorithm in place. An important remaining concern is how
to evaluate the accuracy of predictions. An easy way to evaluate the accuracy of the model is to
calculate a ratio of the total correct predictions out of all predictions made, called the
classification accuracy. Below is the getAccuracy function that sums the total correct predictions
and returns the accuracy as a percentage of correct classifications.

def getAccuracy(testSet, predictions):
    correct = 0
    for x in range(len(testSet)):
        if testSet[x][-1] == predictions[x]:
            correct += 1
    return (correct/float(len(testSet))) * 100.0
We can test this function with a test dataset and predictions, as follows:

testSet = [[1,1,1,'a'], [2,2,2,'a'], [3,3,3,'b']]
predictions = ['a', 'a', 'a']
accuracy = getAccuracy(testSet, predictions)
print(accuracy)
6. Main
We now have all the elements of the algorithm and we can tie them together with a main
function. Below is the complete example of implementing the kNN algorithm from scratch in
Python.

# Example of kNN implemented from scratch in Python
import csv
import random
...

Running the example, you will see the results of each prediction compared to the actual class
value in the test set. At the end of the run, you will see the accuracy of the model. In this case, a
little over 98%.

...
> predicted='Iris-virginica', actual='Iris-virginica'
> predicted='Iris-virginica', actual='Iris-virginica'
> predicted='Iris-virginica', actual='Iris-virginica'
> predicted='Iris-virginica', actual='Iris-virginica'
> predicted='Iris-virginica', actual='Iris-virginica'
Accuracy: 98.0392156862745%

 Ideas For Extensions

This section provides you with ideas for extensions that you could apply and investigate with the
Python code you have implemented as part of this tutorial.

 Regression: You could adapt the implementation to work for regression problems
(predicting a real-valued attribute). The summarization of the closest instances could
involve taking the mean or the median of the predicted attribute.
 Normalization: When the units of measure differ between attributes, it is possible for
attributes to dominate in their contribution to the distance measure. For these types of
problems, you will want to rescale all data attributes into the range 0-1 (called
normalization) before calculating similarity. Update the model to support data
normalization.
 Alternative Distance Measure: There are many distance measures available, and you
can even develop your own domain-specific distance measures if you like. Implement an
alternative distance measure, such as Manhattan distance or the vector dot product.
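The normalization idea can be sketched as min-max rescaling (rescale is a hypothetical helper; it assumes the numeric attributes come first and the class label is the last column):

```python
def rescale(dataset, numAttributes):
    # min-max normalize the first numAttributes columns into [0, 1],
    # leaving the class label (last column) untouched
    cols = list(zip(*dataset))
    mins = [min(cols[i]) for i in range(numAttributes)]
    maxs = [max(cols[i]) for i in range(numAttributes)]
    rescaled = []
    for row in dataset:
        newRow = [(row[i] - mins[i]) / float(maxs[i] - mins[i])
                  for i in range(numAttributes)]
        rescaled.append(newRow + list(row[numAttributes:]))
    return rescaled

data = [[1, 200, 'a'], [2, 400, 'a'], [3, 600, 'b']]
print(rescale(data, 2))
# [[0.0, 0.0, 'a'], [0.5, 0.5, 'a'], [1.0, 1.0, 'b']]
```

After rescaling, the second attribute no longer dominates the Euclidean distance simply because its raw values are hundreds of times larger.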

There are many more extensions to this algorithm you might like to explore. Two
additional ideas include support for distance-weighted contribution for the k-most similar
instances to the prediction and more advanced data tree-based structures for searching for
similar instances.

CONCLUSION: Thus, we have studied the implementation of the k-NN approach with a suitable
example.
GROUP C
ASSIGNMENT NO:1

TITLE: Code Generation using iburg tool

OBJECTIVES:
1. How to use iburg tool ?

2. To apply the iburg tool to generate target code.

PROBLEM STATEMENT:
Study of iburg tool & generate target code using that tool

SOFTWARE REQUIRED: 64-bit Fedora or equivalent OS, iburg.

INPUT: Tree grammar

OUTPUT: C code for the parse tree


MATHEMATICAL MODEL:

S = {s, e, i, o, f, success, failure}

s = initial state, that is, the start of the program

e = the end state, or end of the program

i = grammar rules in Backus-Naur form

o = C source code for grammar-specific tree parsers

Success - desired output is generated

Failure - desired output is not generated

Function Mapping

tree grammar rules in Backus-Naur form --> [iburg program] --> C source code for a
grammar-specific tree parser

THEORY:
iburg is a program that generates a fast tree parser. It is compatible with burg. Both programs
accept a cost-augmented tree grammar and emit a C program that discovers an optimal parse of
trees in the language described by the grammar. They have been used to construct fast optimal
instruction selectors for use in code generation. burg uses BURS; iburg's matchers do dynamic
programming at compile time.
iburg reads a burg specification and writes a matcher that does dynamic programming at compile
time. The matcher is hard-coded.

iburg was built to test early versions of what evolved into burg's specification language and
interface, but it is useful in its own right because it is simpler and thus easier for novices to
understand, because it allows dynamic cost computation, and because it admits a larger class of
tree grammars.
The iburg tree parser generator was mainly developed at Princeton University. It accepts
Backus-Naur form specifications of tree grammars and generates C source code for grammar-
specific tree parsers.

Figure 1 shows an extended BNF grammar for burg and iburg specifications. Grammar symbols
are displayed in slanted type and terminal symbols are displayed in typewriter type. {X}
denotes zero or more instances of X, and [X] denotes an optional X. Specifications consist of
declarations, a %% separator, and rules. The declarations declare terminals (the operators in
subject trees) and associate a unique, positive external symbol number with each one. Non-
terminals are declared by their presence on the left side of rules. The %start declaration
optionally declares a non-terminal as the start symbol. In Figure 1, term and nonterm denote
identifiers that are terminals and non-terminals, respectively.
Rules define tree patterns in a fully parenthesized prefix form. Every non-terminal denotes a tree.
Each operator has a fixed arity, which is inferred from the rules in which it is used. A chain rule
is a rule whose pattern is another non-terminal. If no start symbol is declared, the non-terminal
defined by the first rule is used. Each rule has a unique, positive external rule number, which
comes after the pattern and is preceded by an "=". As described below, external rule numbers are
used to report the matching rule to a user-supplied semantic action routine. Rules end with an
optional non-negative integer cost; omitted costs default to zero.
Figure 2 shows a fragment of a burg specification for the VAX. This example uses upper-case
for terminals and lower-case for non-terminals. Lines 1-2 declare the operators and their external
symbol numbers, and lines 4-15 give the rules.
The external rule numbers correspond to the line numbers to simplify interpreting subsequent
figures. In practice, these numbers are usually generated by a preprocessor that accepts a richer
form of specification (e.g., including YACC-style semantic actions) and emits a burg
specification. Only the rules on lines 4, 6, 7, and 9 have non-zero costs. The rules on lines 5, 9,
12, and 13 are chain rules.
The operators in Figure 2 are some of the operators in lcc's intermediate language. The operator
names are formed by concatenating a generic operator name with a one-character type suffix like
C, I, or P, which denote character, integer, and pointer operations, respectively. The operators
used in Figure 2 denote integer addition (ADDI), forming the address of a local variable
(ADDRLP), integer assignment (ASGNI), an integer constant (CNSTI), "widening" a character
to an integer (CVCI), the integer 0 (I0I), and fetching a character (INDIRC). The rules show that
ADDI and ASGNI are binary, CVCI and INDIRC are unary, and ADDRLP, CNSTI, and I0I are
leaves.

MATCHING
Both versions of burg generate functions that the client calls to label and reduce subject trees.
The labeling function, label(p), makes a bottom-up, left-to-right pass over the subject tree p
computing the rules that cover the tree with the minimum cost, if there is such a cover. Each
node is labeled with (M;C) to indicate that "the pattern associated with external rule M matches
the node with cost C."

Figure 3 shows the intermediate language tree for the assignment expression in the C fragment
{ int i; char c; i = c + 4; }
The left child of the ASGNI node computes the address of i. The right child computes the
address of c, fetches the character, widens it to an integer, and adds 4 to the widened value,
which the ASGNI assigns to i.
The other annotations in Figure 3 show the results of labeling. (M;C) denote labels from matches
and [M;C] denote labels from chain rules. The rule from Figure 2 denoted by each M is also
shown. Each C sums the costs of the non-terminals on the right-hand side and the cost of the relevant
pattern or chain rule. For example, the pattern in line 11 of Figure 2 matches the node ADDRLP
i with cost 0, so the node is labeled with (11; 0). Since this pattern denotes a disp, the chain rule
in line 9 applies with a cost of 0 for matching a disp plus 1 for the chain rule itself. Likewise, the
chain rules in lines 5 and 13 apply because the chain rule in line 9 denotes a reg.
Nodes are annotated with (M;C) only if C is less than all previous matches for the non-terminal
on the left-hand side of rule M. For example, the ADDI node matches the disp pattern in line 10
of Figure 2, which means it also matches all rules with disp alone on the right-hand side, namely
line 9. By transitivity, it also matches the chain rules in lines 5 and 13. But all three of these
chain rules yield cost 2, which isn't better than previous matches for those non-terminals.
Once labeled, a subject tree is reduced by traversing it from the top down and performing
appropriate semantic actions, such as generating and emitting code. Reducers are supplied by
clients, but burg generates functions that assist in these traversals, e.g., one function that returns
M and another that identifies subtrees for recursive visits. iburg does the dynamic programming
at compile time and annotates nodes with data equivalent to (M;C). Its "state numbers" are really
pointers to records that hold these data.
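The labeling pass can be illustrated with a small Python sketch. This is not iburg's generated C code; the grammar, rule numbers, and costs below are made up for illustration, but the mechanism is the same: a bottom-up pass records the cheapest (rule, cost) per non-terminal at each node, closing over chain rules after every pattern match.

```python
# rules: (non-terminal, pattern, external rule number, cost)
# a pattern is an operator followed by child non-terminals, or a single
# non-terminal (a chain rule)
RULES = [
    ('con',  ('CNSTI',),             1, 0),
    ('reg',  ('con',),               2, 1),  # chain rule: reg <- con
    ('reg',  ('ADDI', 'reg', 'reg'), 3, 1),
    ('stmt', ('reg',),               4, 0),  # chain rule: stmt <- reg
]

class Node:
    def __init__(self, op, *kids):
        self.op, self.kids = op, kids
        self.labels = {}  # non-terminal -> (rule number, cost)

def record(node, nt, rule, cost):
    # keep only the cheapest match per non-terminal, then close chain rules
    if nt not in node.labels or cost < node.labels[nt][1]:
        node.labels[nt] = (rule, cost)
        for lhs, pat, r, c in RULES:
            if len(pat) == 1 and pat[0] == nt:  # chain rule lhs <- nt
                record(node, lhs, r, cost + c)

def label(node):
    for kid in node.kids:
        label(kid)  # bottom-up, left-to-right pass
    for lhs, pat, r, c in RULES:
        if pat[0] == node.op and len(pat) - 1 == len(node.kids):
            if all(nt in kid.labels for nt, kid in zip(pat[1:], node.kids)):
                cost = c + sum(kid.labels[nt][1]
                               for nt, kid in zip(pat[1:], node.kids))
                record(node, lhs, r, cost)

tree = Node('ADDI', Node('CNSTI'), Node('CNSTI'))
label(tree)
print(tree.labels['stmt'])  # cheapest (rule, cost) covering the start non-terminal
```

Each leaf matches rule 1 at cost 0, the chain rules then label it as a reg (cost 1) and a stmt (cost 1), and the ADDI root sums its children's reg costs plus its own cost of 1.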
IMPLEMENTATION
iburg generates a state function that uses a straightforward implementation of tree pattern
matching. It generates hard code instead of tables. Its "state numbers" are pointers to state
records, which hold vectors of the (M;C) values for successful matches. The state record for the
specification in Figure 2 is
struct state {
int op;
struct state *left, *right;
short cost[6];
short rule[6];
};
iburg also generates integer codes for the non-terminals, which index the cost and rule vectors:
#define stmt_NT 1
#define disp_NT 2
#define rc_NT 3
#define reg_NT 4
#define con_NT 5
By convention, the start non-terminal has value 1.
State records are cleared when allocated, and external rule numbers are positive. Thus, a non-zero
value for p->rule[X] indicates that p's node matched a rule that defines non-terminal X.

CONCLUSION:
Hence, we have successfully studied iburg tool & how to generate target code using that tool.

REFERENCES
1. A. V. Aho, R. Sethi, J. D. Ullman, "Compilers: Principles, Techniques, and Tools", Pearson
Education, ISBN 81-7758-590-8.
2. J. R. Levine, T. Mason, D. Brown, "Lex and Yacc", O'Reilly, 2000, ISBN 81-7366-061-X.
3. A. J. Dos Reis, "Compiler Construction Using Java, JavaCC and Yacc", Wiley,
ISBN 978-0-470-94959-7.
