Lab Manual - CL1
Laboratory Manual
Computer Engineering
Laboratory Assignments
Group A (Mandatory Six Assignments)
1. Using Divide and Conquer Strategies design a function for Binary Search using C.
2. Using Divide and Conquer Strategies design a class for Concurrent Quick Sort using C++.
3. Lexical analyzer for sample language using LEX.
4. Parser for sample language using YACC.
5. Intermediate code generation for sample language using LEX and YACC.
6. Implement a simple approach for k-means/ k-medoids clustering using C++.
Group B (Any Six Assignments: at least 3 from the selected Elective)
1. 8-Queens Matrix is Stored using JSON/XML having first Queen placed, use back-tracking to place
remaining Queens to generate final 8-queen's Matrix using Python.
2. Implementation of 0-1 knapsack problem using branch and bound approach.
3. Code optimization using DAG.
4. Code generation using DAG / labeled tree.
5. Generating abstract syntax tree using LEX and YACC.
6. Implementing recursive descent parser for sample language.
7. Implement Apriori approach for data mining to organize the data items on a shelf using the following table of items purchased in a Mall
Transaction ID Item1 Item2 Item3 Item4 Item5 Item6
T1 Mango Onion Jar Key-chain Eggs Chocolates
T2 Nuts Onion Jar Key-chain Eggs Chocolates
T3 Mango Apple Key-chain Eggs - -
T4 Mango Toothbrush Corn Key-chain Chocolates -
T5 Corn Onion Onion Key-chain Knife Eggs
8. Implement Decision trees on Digital Library Data to mirror more titles (PDF) in the library application, and compare it with the Naive Bayes algorithm.
9. Implement Naive Bayes for Concurrent/Distributed application. Approach should handle categorical
and continuous data.
10. Implementation of K-NN approach; take a suitable example.
Group C (Any One Assignment)
1. Code generation using "iburg" tool.
GROUP A
ASSIGNMENT NO:1
TITLE: Using Divide and Conquer Strategies design a function for Binary Search using C++.
OBJECTIVES:
To learn the divide and conquer strategy by designing a function for binary search using C++.
PROBLEM STATEMENT: Write a program that uses the divide and conquer strategy to design a function for binary search using C++.
SOFTWARE REQUIRED: Latest version of a 64-bit open source operating system (Fedora 19) or Windows 8, with a multicore CPU equivalent to Intel i5/i7.
MATHEMATICAL MODEL:
Divide: break the problem into several sub-problems that are similar to the original problem but smaller in size.
Conquer: solve the sub-problems recursively (successively and independently).
Combine: combine the solutions of the sub-problems to create a solution to the original problem.
Binary Search (simplest application of divide-and-conquer) :-
Binary search is an extremely well-known instance of the divide-and-conquer paradigm. Given an ordered array of n elements, the basic idea of binary search is that for a given element we "probe" the middle element of the array. We continue in either the lower or the upper segment of the array, depending on the outcome of the probe, until we reach the required (given) element.
Problem: Let A[1 . . n] be an array sorted in non-decreasing order; that is, A[i] ≤ A[j] whenever 1 ≤ i ≤ j ≤ n. Let 'q' be the query point. The problem consists of finding 'q' in the array A. If q is not in A, then find the position where 'q' might be inserted.
Formally, find the index i such that 1 ≤ i ≤ n+1 and A[i-1] < q ≤ A[i].
Sequential Search:-
Look sequentially at each element of A until we either reach the end of the array A or find an item no smaller than 'q'. This is the sequential search for 'q' in array A.
Analysis:-
This algorithm clearly takes θ(r) time, where r is the index returned. This is Ω(n) in the worst case and O(1) in the best case.
If the elements of the array A are distinct and the query point q is indeed in the array, then the loop executes (n + 1) / 2 times on average. On average (as well as in the worst case), sequential search takes θ(n) time.
Binary Search:-
Look for 'q' either in the first half or in the second half of the array A. Compare 'q' to the element in the middle position, n/2, of the array. Let k = n/2. If q ≤ A[k], then search in A[1 . . k]; otherwise search A[k+1 . . n] for 'q'. Binary search for q in a subarray A[i . . j] proceeds with the promise that A[i-1] < q ≤ A[j].
Analysis:-
Binary Search can be accomplished in logarithmic time in the worst case , i.e., T(n) = θ(log n).
This version of the binary search takes logarithmic time in the best case.
Iterative Version of Binary Search:- Iterative binary search for q in array A[1 . . n].
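A minimal C++ sketch of the iterative version described above is shown below. It uses 0-based indexing (unlike the 1-based A[1 . . n] of the discussion) and returns -1 when q is absent; the function and variable names are ours.

#include <cstdio>

// Iterative binary search: returns an index i with A[i] == q, or -1 if q is absent.
int binarySearch(const int A[], int n, int q) {
    int low = 0, high = n - 1;
    while (low <= high) {
        int mid = low + (high - low) / 2;   // probe the middle element
        if (A[mid] == q)
            return mid;
        else if (A[mid] < q)
            low = mid + 1;                  // continue in the upper segment
        else
            high = mid - 1;                 // continue in the lower segment
    }
    return -1;                              // q is not in A
}

int main() {
    int A[] = {2, 5, 8, 12, 16, 23, 38};
    std::printf("index of 23 = %d\n", binarySearch(A, 7, 23));
    return 0;
}

Each iteration halves the subarray still under consideration, which is why the worst-case running time is θ(log n).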
CONCLUSION
Thus we have studied binary search using divide and conquer strategy.
ASSIGNMENT NO:2
TITLE: Using Divide and Conquer Strategies design a class for Concurrent Quick Sort using C++.
OBJECTIVES: To compare the performance of the concurrent quick sort algorithm designed using the divide and conquer strategy.
THEORY: The quick sort algorithm is amenable to a concurrent implementation. Each step in the recursion – each time the partitioning routine is called – operates on an independent sub-array. In quick sort one sorts an array by partitioning the elements around a pivot element and then doing the same to each of the partitions. The partitioning step consists of picking a pivot element p in the array and moving the elements around so that all the elements less than p end up at indices lower than p's, and all the elements greater than p end up at indices higher than p's. Partitioning will usually move several array elements, including p, around. The partitioning puts p into the place it belongs in the array – no element will ever move from the set that's smaller than p to the set that's larger; once a partitioning step is complete, its pivot element is in its sorted position.
Procedure: Consider one coordinator and several workers (the coordinator distributes work among the workers). The coordinator sends a message to an idle worker telling it to sort the array and waits to receive messages from the workers about the progress of the algorithm. A worker partitions a sub-array, and every time that worker gets ready to call the partition routine on a smaller array, it checks to see if there is an idle worker to assign the work to. If so, it sends a message to that worker to start working on the sub-problem; if not, the current worker calls the partition routine itself. After each partitioning, two recursive calls are (usually) made, so there are plenty of chances to start other workers. The accompanying diagram shows two workers sorting the same 5-element array. Each blue line represents the flow of control of a worker thread, and the red arrow represents the message sent from one worker to start the other. (Since the workers proceed concurrently, it is no longer guaranteed that the smaller elements in the array will be ordered before the larger; what is certain is that the two workers will never try to manipulate the same elements.)
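The coordinator/worker scheme above is message based; purely as a shared-memory illustration of the same idea, the sketch below uses C++11 std::async so that one partition of each recursive step is handed to another thread (an "idle worker") while the current thread sorts the other partition. All names (partitionRange, concurrentQuickSort, the depth limit) are our assumptions, not part of the manual.

#include <algorithm>
#include <cstdio>
#include <future>
#include <vector>

// Partition a[lo..hi] around the last element; return the pivot's final index.
static int partitionRange(std::vector<int>& a, int lo, int hi) {
    int pivot = a[hi];
    int i = lo - 1;
    for (int j = lo; j < hi; ++j)
        if (a[j] < pivot) std::swap(a[++i], a[j]);
    std::swap(a[i + 1], a[hi]);              // the pivot is now in its sorted position
    return i + 1;
}

// Sort a[lo..hi]; while depth > 0, one half is handed to a new asynchronous task.
static void concurrentQuickSort(std::vector<int>& a, int lo, int hi, int depth) {
    if (lo >= hi) return;
    int p = partitionRange(a, lo, hi);
    if (depth > 0) {
        auto left = std::async(std::launch::async,
                               concurrentQuickSort, std::ref(a), lo, p - 1, depth - 1);
        concurrentQuickSort(a, p + 1, hi, depth - 1);  // current "worker" keeps one half
        left.wait();                                    // wait for the helper to finish
    } else {
        concurrentQuickSort(a, lo, p - 1, 0);           // no idle worker: do both halves
        concurrentQuickSort(a, p + 1, hi, 0);
    }
}

int main() {
    std::vector<int> v = {9, 3, 7, 1, 8};
    concurrentQuickSort(v, 0, (int)v.size() - 1, 2);
    for (int x : v) std::printf("%d ", x);
    std::printf("\n");
    return 0;
}

The two recursive calls operate on disjoint index ranges, so the threads never touch the same elements, mirroring the guarantee stated above.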
CONCLUSION:
Thus we have studied Concurrent Quick Sort using divide and conquer strategy.
ASSIGNMENT NO:3
TITLE: Lexical analyzer for sample language using LEX.
MATHEMATICAL MODEL:
Let S be the solution perspective of the lexical analyzer such that
S={s, e, i, o, f, success, failure}
s = start of program
e = the end of program
i = sample language statement
o = result of the statement (the stream of tokens)
Success - token is generated.
Failure - token is not generated, or forced exit due to system error.
Computational Model:
S --e1--> A --e2--> R
Where,
S = {Start state}
A = {Generate_token()}
R = {Final Result}
THEORY:
Lex is a program generator designed for lexical processing of character input streams. It accepts a high-level, problem-oriented specification for character string matching, and produces a program in a general purpose language which recognizes regular expressions. The regular expressions are specified by the user in the source specifications given to Lex.
The code written by Lex recognizes these expressions in an input stream and partitions the input stream into strings matching the expressions. At the boundaries between strings, program sections provided by the user are executed. The Lex source file associates the regular expressions and the program fragments. As each expression appears in the input to the program written by Lex, the corresponding fragment is executed.
1) Structure of a LEX program :-
-----------------------------
Declaration Part
-----------------------------
%%
-----------------------------
Translation Rule
-----------------------------
%%
-----------------------------
Auxiliary Procedures
-----------------------------
2) Declaration part :- Contains the declarations for variables required by the LEX program and the C program.
3) yywrap() :- This function is used for taking input from more than one file.
4) yyin :- This is the input file pointer, used to change the value of the input file pointer. The default file pointer points to stdin, i.e. the keyboard.
5) yyout :- This is the output file pointer, used to change the value of the output file pointer. The default output file pointer points to stdout, i.e. the monitor.
How to execute a LEX program :- To execute a LEX program, follow these steps.
Compile the *.l file with the lex command
# lex *.l It will generate the lex.yy.c file for your lexical analyzer.
Compile the lex.yy.c file with the cc command
# cc lex.yy.c It will generate an object file with the name a.out.
Execute the a.out file to see the output
# ./a.out
CONCLUSION:
Thus we have studied the lexical analyzer for a sample language using LEX.
ASSIGNMENT NO:4
TITLE: Parser for sample language using YACC.
MATHEMATICAL MODEL:
s=Start of program
i=Arithmetic expression.
Success-parser is generated.
Computational Model:
S --e1--> A --e2--> B --e3--> R
Where,
S={Start state}
A={Generate_token()}
B={Parse_token()}
R={Final Result}
THEORY:
1) YACC Specifications :- A parser generator facilitates the construction of the front end of a compiler. YACC is an LALR parser generator; it has been used to implement hundreds of compilers. YACC is a command (utility) of the UNIX system. YACC stands for "Yet Another Compiler Compiler". The file containing the YACC specification of the translator has a .y extension, e.g. parser.y. After the specification is complete, the UNIX command yacc transforms parser.y into a C program called y.tab.c using the LR parsing method; the program y.tab.c is generated automatically. We can also use the command with the -d option, as yacc -d parser.y. With the -d option two files get generated, namely y.tab.c and y.tab.h. The header file y.tab.h stores all the token information, so you need not create y.tab.h explicitly. The program y.tab.c is a representation of an LALR parser written in C, along with other C routines that the user may have prepared. By compiling y.tab.c with the ly library that contains the LR parsing program, using the command cc y.tab.c -ly, we obtain the desired object program a.out that performs the translation specified by the original program. If other procedures are needed, they can be compiled or loaded with y.tab.c, just as with any C program.
LEX recognizes regular expressions, whereas YACC recognizes an entire grammar. LEX divides the input stream into tokens, while YACC uses these tokens and groups them together logically. LEX and YACC work together to analyze the program syntactically. YACC can report conflicts or ambiguities (if any) in the form of error messages.
-----------------------------
Declaration Section
-----------------------------
%%
-----------------------------
Translation Rule Section
-----------------------------
%%
-----------------------------
Auxiliary Procedures Section
-----------------------------
Declaration Section :-
The definition section can include a literal block, C code copied verbatim to the beginning of the
generated C file, usually containing declaration and #include lines. There may be %union,
%start, %token, %type, %left, %right, and %nonassoc declarations. (See "%union Declaration,"
"Start Declaration," "Tokens," "%type Declarations," and "Precedence and Operator
Declarations.") It can also contain comments in the usual C format, surrounded by "/*" and "*/".
All of these are optional, so in a very simple parser the definition section may be completely
empty.
LEX Specification : (the .l file)
1) Declaration Section :-
%{
#include "y.tab.h"
#include<math.h>
extern int yylval;
%}
Here, we include the header file y.tab.h that is generated while processing the .y file, along with math.h. We also use the function atoi (declared in stdlib.h), which converts the matched string to an integer.
Lastly, when a lexical analyzer passes a token to the parser, it can also pass a value for the token.
In order to pass the value that our parser can use (for the passed token), the lexical analyser has
to store it in the variable yylval.
Before storing the value in yylval we have to specify its data type. In our program we want to
perform mathematical operations on the input, hence we declare the variable yylval as integer.
2) Rules Section :
[0-9]+ { yylval = atoi(yytext); return NUMBER; }
[ \t] ; /* ignore white space */
\n return 0; /* logical EOF */
. return yytext[0];
In rules section, we match the pattern for numbers and pass the token NUMBER to the
parser. As we know the matched string is stored in yytext which is a string, hence we
type cast the string value to integer. We ignore spaces, and for all other input characters
we just pass them to the parser.
YACC Specification : (yacc.y file)
1) Declaration Section:
%token NUMBER
%left '+' '-'
%left '/' '*'
%right '^'
%nonassoc UMINUS
In the declaration section we declare all the variables that we will be using throughout the program, and we include all the necessary files. Apart from that, we also declare the tokens that are recognized by the parser. As we are writing a parser specification for a calculator, we have only one token, NUMBER. To deal with an ambiguous grammar, we have to specify the associativity and precedence of the operators. As seen above, +, -, * and / are left associative, the power symbol (^) is right associative, and the unary minus is non-associative. The precedence of the operators increases as we go down the declarations. Hence the lowest precedence is that of + and -, and the highest is that of the unary minus.
2) Rules Section: The rules section consists of all the productions that perform the operations. One example is given as follows:
expression: expression '+' expression { $$ = $1 + $3; }
| NUMBER { $$ = $1; }
;
When a NUMBER token is returned by the lexical analyser, the parser reduces it to an expression and its value is assigned to the expression (non-terminal). When addition happens, the values of both expressions are added and assigned to the expression that results from the reduction.
3) Auxiliary Function Section: In main we just call the function yyparse(). We also have to define the function yyerror(), which is called when there is a syntactic error.
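As a minimal sketch (not prescribed by the manual), the auxiliary section after the second %% might look like the following; yyparse() is generated by YACC and repeatedly calls the yylex() generated by LEX, and stdio.h is assumed to be included in the declaration section:

%%
int main(void)
{
    yyparse();                   /* start the parser; it calls yylex() for tokens */
    return 0;
}

int yyerror(const char *s)
{
    fprintf(stderr, "%s\n", s);  /* called by yyparse() on a syntactic error */
    return 1;
}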
CONCLUSION:
Thus we have studied Parser for sample language using YACC.
ASSIGNMENT NO:5
TITLE: Intermediate code generation for sample language using LEX and YACC.
OBJECTIVES:
1. To understand fourth phase of compiler: Intermediate code generation.
2. To learn and use compiler writing tools.
3. To learn how to write three address code for given statement.
PROBLEM STATEMENT:
Write an attributed translation grammar to recognize declarations of simple variables, "for",
assignment, if, if-else statements as per syntax of C and generate equivalent three address code
for the given input made up of constructs mentioned above using LEX and YACC .
SOFTWARE REQUIRED: Linux Operating Systems, GCC
INPUT: Input data as Sample language.
OUTPUT: It will generate Intermediate language for sample language.
MATHEMATICAL MODEL:
Let S be the solution perspective of the intermediate code generator such that
S={s, e, i, o, f, success, failure}
s=initial state of grammar
e = the end state of grammar.
i=Sample language Statement.
o=Intermediate code for language statement
Success-Intermediate code is generated.
Failure-Intermediate code is not generated or forced exit due to system error.
THEORY:
In the analysis-synthesis model of a compiler, the front end analyzes a source program and
creates an intermediate representation, from which the back end generates target code. This
facilitates retargeting: enables attaching a back end for the new machine to an existing front end.
A compiler front end can be organized so that parsing, static checking, and intermediate-code generation are done sequentially; sometimes they can be combined and folded into parsing. All of these schemes can be implemented by creating a syntax tree and then walking the tree.
Static Checking
This includes type checking which ensures that operators are applied to compatible operands. It
also includes any syntactic checks that remain after parsing like
• flow–of-control checks
– Ex: Break statement within a loop construct
• Uniqueness checks
– Labels in case statements
• Name-related checks
Intermediate Representations
We could translate the source program directly into the target language. However, there are
benefits to having an intermediate, machine-independent representation.
• A clear distinction between the machine-independent and machine-dependent parts of the
compiler
• Retargeting is facilitated; the implementation of language processors for new machines will
require replacing only the back-end
• We could apply machine independent code optimisation techniques
Intermediate representations span the gap between the source and target languages.
• High Level Representations
– closer to the source language
– easy to generate from an input program
– code optimizations may not be straightforward
• Low Level Representations
– closer to the target machine
– Suitable for register allocation and instruction selection
Intermediate Languages:Three ways of intermediate code representation:
1. Syntax tree
2. Postfix notation
3. Three address code
The semantic rules for generating three-address code from common programming language
constructs are similar to those for constructing syntax trees or for generating postfix notation.
Graphical Representations
Syntax tree
A syntax tree depicts the natural hierarchical structure of a source program. A dag (Directed
Acyclic Graph) gives the same information but in a more compact way because common
subexpressions are identified. A syntax tree and dag for the assignment statement
a:=b*-c+b*-c
are as follows:
assign (:=)
    a
    +
        *
            b
            uminus c
        *
            b
            uminus c
Postfix notation
Postfix notation is a linearized representation of a syntax tree; it is a list of the nodes of the tree
in which a node appears immediately after its children. The postfix notation for the syntax tree
given above is a b c uminus * b c uminus * + assign
Three-Address Code
Three-address code is a sequence of statements of the general form x : = y op z where x, y and z
are names, constants, or compiler-generated temporaries; op stands for any operator, such as a
fixed- or floating-point arithmetic operator, or a logical operator on Boolean valued data. Thus a
source language expression like x+ y*z might be translated into a sequence
t1 : = y * z
t2 : = x + t1
where t1 and t2 are compiler-generated temporary names. The reason for the term “three-
address code” is that each statement usually contains three addresses, two for the operands and
one for the result.
Implementation of Three-Address Statements: A three-address statement is an abstract form
of intermediate code. In a compiler, these statements can be implemented as records with fields
for the operator and the operands. Three such representations are: Quadruples, Triples, Indirect
triples.
A. Quadruples
A quadruple is a record structure with four fields, which are, op, arg1, arg2 and result. The op
field contains an internal code for the operator. The 3 address statement x = y op z is represented
by placing y in arg1, z in arg2 and x in result. The contents of fields arg1, arg2 and result are
normally pointers to the symbol-table entries for the names represented by these fields. If so,
temporary names must be entered into the symbol table as they are created. Fig a) shows the quadruples for the assignment a := b * -c + b * -c.
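Fig a) is not reproduced in this copy of the manual; based on the triples shown in Fig b) below, the corresponding quadruples would be (t1 . . t5 are compiler-generated temporaries):

    op      arg1   arg2   result
(0) uminus  c             t1
(1) *       b      t1     t2
(2) uminus  c             t3
(3) *       b      t3     t4
(4) +       t2     t4     t5
(5) :=      t5            a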
B. Triples:
To avoid entering temporary names into the symbol table, we might refer to a temporary value
by the position of the statement that computes it. If we do so, three-address statements can be
represented by records with only three fields: op, arg1 and arg2. The fields arg1 and arg2, for
the arguments of op, are either pointers to the symbol table or pointers into the triple structure
( for temporary values ). Since three fields are used, this intermediate code format is known as
triples. Fig b) shows the triples for the assignment statement a := b * -c + b * -c.
op arg1 arg2
(0) uminus c
(1) * b (0)
(2) uminus c
(3) * b (2)
(4) + (1) (3)
(5) := a (4)
Fig.b. Triples
Quadruples & Triple representation of three-address statement
C. Indirect triples:
Indirect triple representation lists pointers to triples rather than the triples themselves. Let us use an array 'statement' to list pointers to triples in the desired order. Fig c)
shows the indirect triple representation.
Statement  Pointer to triple
(1)   (14)
(2)   (15)
(3)   (16)
(4)   (17)
(5)   (18)
(6)   (19)
op arg1 arg2
(14) uminus c
(15) * b (14)
(16) uminus c
(17) * b (16)
(18) + (15) (17)
(19) := a (18)
Fig c): Indirect triples representation of three address statements
Steps to execute the program
$ lex filename.l (eg: comp.l)
$ yacc -d filename.y (eg: comp.y)
$cc lex.yy.c y.tab.c –ll –ly –lm
$ ./a.out
ALGORITHM:
Write a LEX and YACC program to generate Intermediate Code for arithmetic expression
LEX program
1. Declaration of header files, especially y.tab.h, which contains the declarations for Letter, Digit, expr.
2. End the declaration section with %%.
3. Match the regular expressions.
4. If a match is found, then convert it into a character and store it in yylval.p, where p is the pointer declared in YACC.
5. Return the token.
6. If the input contains a newline character (\n), then return 0.
7. If the input contains '.', then return yytext[0].
8. End the rule-action section with %%.
9. Declare the main function:
a. open the file given at the command line
b. if any error occurs, then print the error and exit
c. assign the file pointer fp to yyin
d. call the function yylex until the file ends
10. End
YACC program
1. Declaration of header files
2. Declare structure for threeaddresscode representation having fields of argument1, argument2,
operator, result.
3. Declare pointer of char type in union.
4. Declare token expr of type pointer p.
5. Give precedence to '*', '/'.
6. Give precedence to '+', '-'.
7. End the declaration section with %%.
8. If the final expression evaluates, then add it to the table of three address code.
9. If the input is an expression of the form:
a. exp '+' exp, then add to the table the argument1, argument2, operator.
b. exp '-' exp, then add to the table the argument1, argument2, operator.
c. exp '*' exp, then add to the table the argument1, argument2, operator.
d. exp '/' exp, then add to the table the argument1, argument2, operator.
e. '(' exp ')', then assign $2 to $$.
f. Digit OR Letter, then assign $1 to $$.
10. End the section by %%.
11. Declare file *yyin externally.
12. Declare the main function and call the yyparse function until yyin ends.
13. Declare yyerror, which is called if any error occurs.
14. Declare a char pointer s to print the error.
15. Print error message.
16. End of the program.
In short:
Addtotable function: It adds the argument1, argument2, operator and a temporary variable to the structure array of three-address code.
Threeaddresscode function: It prints the values from the structure in the form: temporary variable, argument1, operator, argument2.
Quadruple function: It prints the values from the structure in the form: operator, argument1, argument2, result field.
Triple function: It prints the values from the structure in the form: argument1, argument2, and operator. The temporary variables in this form are integer indices instead of variables.
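As a rough C++ sketch of the structure array and the Addtotable/Threeaddresscode routines described above (the type name, field names and fixed table size here are our own assumptions, not the ones required by the assignment):

#include <cstdio>
#include <cstring>

// One entry of the three-address-code table: result = arg1 op arg2
struct ThreeAddr {
    char op;
    char arg1[16];
    char arg2[16];
    char result[16];
};

ThreeAddr table[100];
int entryCount = 0;
int tempCount = 0;

// addtotable: record one statement and return the temporary holding its result.
const char* addtotable(const char* arg1, const char* arg2, char op) {
    ThreeAddr& q = table[entryCount++];
    q.op = op;
    std::snprintf(q.result, sizeof q.result, "t%d", ++tempCount);
    std::strncpy(q.arg1, arg1, sizeof q.arg1 - 1);
    std::strncpy(q.arg2, arg2, sizeof q.arg2 - 1);
    return q.result;
}

// threeaddresscode: print every entry as  tN = arg1 op arg2
void threeaddresscode() {
    for (int i = 0; i < entryCount; ++i)
        std::printf("%s = %s %c %s\n", table[i].result,
                    table[i].arg1, table[i].op, table[i].arg2);
}

int main() {
    // Three-address code for the expression a + b * c
    const char* t1 = addtotable("b", "c", '*');
    addtotable("a", t1, '+');
    threeaddresscode();
    return 0;
}

Printing the same array column-by-column in the orders listed above gives the quadruple and triple forms.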
CONCLUSION:
Hence, we have successfully studied concept of Intermediate code generation of sample
language
ASSIGNMENT NO:6(D)
TITLE: Implement a simple approach for k-means / k-medoids clustering using C++.
OBJECTIVES:
1. Students should be able to implement k-means program.
2. Students should be able to identify the mean of each cluster using C++ programming.
PROBLEM STATEMENT:
Implement a simple approach for k-means / k-medoids clustering using C++.
MATHEMATICAL MODEL:
Let S be the solution perspective of the class
S={s, e, i, o, f, DD, NDD, success, failure}
s=initial state that
e = the end state
i= input of the system here is value of k and data
o=output of the system. Here is k number of clusters
DD = deterministic data; it helps in identifying the load, store, or assignment functions.
NDD- Non deterministic data of the system S to be solved.
Success- desired outcome generated.
Failure-Desired outcome not generated or forced exit due to system error.
THEORY:
Introduction
The popular k-means algorithm for clustering has been around since the late 1950s, and the standard algorithm was proposed by Stuart Lloyd in 1957. Given a set of n points, k-means clustering aims to partition the points into k clusters (where k, the number of clusters, is a parameter). The partitioning is done to minimize the objective function
J = sum over clusters c of the sum over points x in c of || x - mu_c ||^2,
where mu_c is the centroid of cluster c. The standard algorithm is a two-step algorithm:
Assignment step: each point is assigned to the cluster whose centroid it is closest to.
Update step: using the new cluster assignments, the centroid of each cluster is recalculated.
The algorithm has converged when no more assignment changes are happening with each
iteration. However, this algorithm can get stuck in local minima of the objective function and is
particularly sensitive to the initial cluster assignments. Also, situations can arise where the
algorithm will never converge but reaches steady state -- for instance, one point may be changing
between two cluster assignments.
There is vast literature on the k-means algorithm and its uses, as well as strategies for choosing
initial points effectively and keeping the algorithm from converging in local minima. mlpack
does implement some of these, notably the Bradley-Fayyad algorithm (see the reference below)
for choosing refined initial points. Importantly, the C++ KMeans class makes it very easy to
improve the k-means algorithm in a modular way.
@inproceedings{bradley1998refining,
title={Refining initial points for k-means clustering},
author={Bradley, Paul S. and Fayyad, Usama M.},
booktitle={Proceedings of the Fifteenth International Conference on Machine
Learning (ICML 1998)},
volume={66},
year={1998}
}
mlpack provides:
a simple command-line executable to run k-means
a simple C++ interface to run k-means
Command-Line 'kmeans'
mlpack provides a command-line executable, kmeans, to allow easy execution of the k-means
algorithm on data. Complete documentation of the executable can be found by typing
$ kmeans --help
Below are several examples demonstrating simple use of the kmeans executable.
MetricType: controls the distance metric used for clustering (by default, the squared
Euclidean distance is used)
InitialPartitionPolicy: the method by which initial clusters are set; by default,
RandomPartition is used
EmptyClusterPolicy: the action taken when an empty cluster is encountered; by default,
MaxVarianceNewCluster is used
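For the assignment itself, a plain C++ sketch of the two-step (Lloyd's) algorithm on one-dimensional data, independent of mlpack, might look as follows; the data values, initial centroids and variable names are ours:

#include <cmath>
#include <cstdio>
#include <vector>

int main() {
    std::vector<double> data = {1.0, 1.2, 0.8, 5.0, 5.3, 4.9, 9.1, 8.8};
    const int k = 3;
    std::vector<double> centroid = {data[0], data[3], data[6]};  // naive initial centroids
    std::vector<int> assign(data.size(), -1);

    bool changed = true;
    while (changed) {                       // converged when no assignment changes
        changed = false;
        // Assignment step: each point joins the cluster with the closest centroid.
        for (size_t i = 0; i < data.size(); ++i) {
            int best = 0;
            for (int c = 1; c < k; ++c)
                if (std::fabs(data[i] - centroid[c]) < std::fabs(data[i] - centroid[best]))
                    best = c;
            if (best != assign[i]) { assign[i] = best; changed = true; }
        }
        // Update step: recompute each centroid as the mean of its members.
        for (int c = 0; c < k; ++c) {
            double sum = 0; int count = 0;
            for (size_t i = 0; i < data.size(); ++i)
                if (assign[i] == c) { sum += data[i]; ++count; }
            if (count) centroid[c] = sum / count;
        }
    }
    for (size_t i = 0; i < data.size(); ++i)
        std::printf("%.1f -> cluster %d\n", data[i], assign[i]);
    return 0;
}

As noted above, the result depends on the initial cluster assignments; mlpack's refined initial-point strategies address exactly this sensitivity.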
GROUP B
ASSIGNMENT NO:1
TITLE: 8-Queens Matrix is Stored using JSON/XML having first Queen placed, use back-
tracking to place remaining Queens to generate final 8-queen's Matrix using Python.
OBJECTIVES: To learn the 8-queens problem and solve it using backtracking.
PROBLEM STATEMENT: Write a program : 8-Queens Matrix is Stored using JSON/XML
having first Queen placed, use back- tracking to place remaining Queens to generate final 8-
queen's Matrix using Python.
SOFTWARE REQUIRED: Latest version of 64 Bit Operating Systems Open Source Fedora-
19, Eclipse
INPUT: First Queen placed matrix
OUTPUT: Final 8-queens matrix
MATHEMATICAL MODEL:
INPUT:
I = Set of 8 queens in an 8*8 matrix.
Function:
F1=In the second iteration, the value of b will be added to the first position in our openList.
F2= openList contains b first, then a: [b, a].
F3=In the third iteration, our openList will look like [c, b, a].
F4=The end result is that the data is reversed, add the items to our OPENLIST in reverse.
First c gets inserted, then b, then a.
OUTPUT:
O1=Success Case: It is the case when all the inputs are given by system are entered correctly and
8-Queen problem is solved.
O2=Failure Case: It is the case when the input does not match the validation Criteria.
Computational Model
THEORY
Backtracking is a problem solving approach which is closest to the brute force method. In it, we explore each path which may lead to a solution, taking one decision at a time, and as soon as we find that the path we selected does not lead to a solution, we go back to the place where we took the most recent decision. At that place, we explore other opportunities to go down a different path, if available. If there are no options available, we go back further. We go back until we find an alternate path to follow or we reach the start. If we reach the start without finding any path that reaches a solution, then there is no solution present; otherwise we would have found it by following one of the paths.
Backtracking is a depth first traversal of paths in a graph where the nodes are states of the solution and there is an edge between two states only if one state can be reached from the other.
Typical examples of backtracking are: the N queens problem, Sudoku, the Knight's tour problem, and crosswords.
The 8 queens problem is a case of the more general set of problems, namely the "n queens problem". The basic idea: how to place n queens on an n by n board so that they don't attack each other. As we can expect, the complexity of solving the problem increases with n. We will briefly introduce the solution by backtracking.
The board should be regarded as a set of constraints and the solution is simply satisfying all constraints. For example: Q1 attacks some positions, therefore Q2 has to comply with these constraints and take a place not directly attacked by Q1. Placing Q3 is harder, since we have to satisfy the constraints of Q1 and Q2. Going the same way we may reach a point where the constraints make the placement of the next queen impossible. Therefore we need to relax the constraints and find a new solution. To do this we go backwards and find a new admissible solution. To keep everything in order we keep a simple rule: last placed, first displaced. In other words, if we successfully place a queen in the ith column but cannot find a solution for the (i+1)th queen, then going backwards we will try to find another admissible solution for the ith queen first. This process is called backtracking.
Let's discuss this with an example.
Algorithm:
- Start with one queen at the first column, first row
- Continue with the second queen from the second column, first row
- Go up until a permissible position is found
- Continue with the next queen
How do we implement backtracking in code? Remember that we backtrack when we cannot find an admissible position for a queen in a column. Otherwise we go further with the next column until we place a queen in the last column. Therefore your code must have a fragment like:
int PlaceQueen(int board[8], int row)
If (we can place a queen in this row of the current column)
    PlaceQueen(newboard, 0)          /* proceed to the next column, starting from row 0 */
Else
    PlaceQueen(oldboard, oldplace+1) /* backtrack: retry the previous column from the next row */
End
If you can place a queen in the ith column, try to place a queen in the next one; otherwise backtrack and try to place a queen in a position above the solution found for the (i-1)th column.
ALGORITHM :
1) Start in the leftmost column.
2) If all queens are placed, return true.
3) Try all rows in the current column. Do the following for every tried row.
a) If the queen can be placed safely in this row, then mark this [row, column] as part of the solution and recursively check if placing the queen here leads to a solution.
b) If placing the queen in [row, column] leads to a solution, return true.
c) If placing the queen doesn't lead to a solution, then unmark this [row, column] (backtrack) and go to step (a) to try other rows.
4) If all rows have been tried and nothing worked, return false to trigger backtracking.
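The assignment asks for Python with the first queen read from JSON/XML; purely as an illustration of the backtracking step above, here is a compact C++ sketch that fixes the first queen and back-tracks over the remaining rows (all names, and the hard-coded starting column, are our assumptions):

#include <cstdio>
#include <cstdlib>

const int N = 8;
int col[N];                         // col[r] = column of the queen placed in row r

// No earlier queen may share a column or a diagonal with (row, c).
bool safe(int row, int c) {
    for (int r = 0; r < row; ++r)
        if (col[r] == c || std::abs(col[r] - c) == row - r) return false;
    return true;
}

// Try to place queens in rows row..N-1; returns true when a full placement exists.
bool place(int row) {
    if (row == N) return true;      // all queens placed
    for (int c = 0; c < N; ++c)
        if (safe(row, c)) {
            col[row] = c;           // tentative placement
            if (place(row + 1)) return true;
            // otherwise fall through: backtrack and try the next column in this row
        }
    return false;
}

int main() {
    col[0] = 3;                     // first queen assumed already placed (e.g. read from JSON/XML)
    if (place(1))
        for (int r = 0; r < N; ++r) std::printf("row %d -> column %d\n", r, col[r]);
    else
        std::printf("no solution with the first queen fixed\n");
    return 0;
}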
ANALYSIS :
ASSIGNMENT NO:2
TITLE: Implementation of 0-1 knapsack problem using branch and bound approach.
OBJECTIVES: To implement and apply 0-1 knapsack using branch and bound
PROBLEM STATEMENT: Implementation of 0-1 knapsack problem using branch and bound
approach
SOFTWARE REQUIREMENT: Latest version of 64 Bit Operating Systems Open Source
Fedora-20
INPUT:
OUTPUT:
MATHEMATICAL MODEL:
Let S be the solution perspective of the 0-1 knapsack problem.
THEORY:
Knapsack Problem
There are many different knapsack problems. The first and classical one is the binary knapsack problem. It has the following story. A tourist is planning a tour in the mountains. He has a lot of objects which may be useful during the tour; for example, an ice pick and a can opener can be among the objects. We suppose that the following conditions are satisfied.
• Each object has a positive value and a positive weight. (E.g. a balloon filled with helium would have a negative weight.) The value is the degree of contribution of the object to the success of the tour.
• The objects are independent of each other. (E.g. a can and a can opener are not independent, as either of them without the other one has limited value.)
• The knapsack of the tourist is strong and large enough to contain all possible objects.
• The strength of the tourist makes it possible to bring only a limited total weight.
• But within this weight limit the tourist wants to achieve the maximal total value.
The following notations are used to the mathematical formulation of the problem:
n the number of objects;
j the index of the objects;
wj the weight of object j;
vj the value of object j;
b the maximal weight that the tourist can bring.
For each object j a so-called binary or zero-one decision variable, say xj, is introduced:
xj = 1 if object j is taken on the tour, and xj = 0 if object j is not taken on the tour.
Notice that wj*xj equals wj if object j is taken on the tour and 0 otherwise, i.e. it is the weight of the object in the knapsack. Similarly, vj*xj is the value of the object on the tour. The total weight in the knapsack is the sum of wj*xj for j = 1, ..., n, which may not exceed the weight limit. Hence the mathematical form of the problem is
max   v1*x1 + v2*x2 + ... + vn*xn                      (24.1)
s.t.  w1*x1 + w2*x2 + ... + wn*xn <= b                 (24.2)
      xj = 0 or 1,  j = 1, . . . , n.                  (24.3)
The difficulty of the problem is caused by the integrality requirement. If constraint (24.3) is substituted by the relaxed constraint 0 <= xj <= 1, we obtain the linear programming relaxation of the problem; its optimal value is an upper bound on the knapsack optimum, and this bound is what the branch and bound method uses for pruning.
Algorithm:
Algorithm UBound(cp, cw, k, m)
// cp = current profit, cw = current weight, k = index of the last considered object, m = capacity.
// (In the Horowitz–Sahni least-cost branch and bound formulation, profits are handled as
// negative quantities, which is why the bound b decreases as further objects are added.)
{
    b := cp; c := cw;
    for i := k+1 to n do
    {
        if (c + w[i] <= m) then
        { c := c + w[i]; b := b - p[i]; }
    }
    return b;
}
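The sketch below shows the same bounding idea in C++ using the more common positive-profit formulation: a greedy (fractional) upper bound plus a simple depth-first branch and bound. The item data and all names are our assumptions.

#include <algorithm>
#include <cstdio>
#include <vector>

struct Item { int profit, weight; };

int n, capacity, best;               // best = highest profit found so far
std::vector<Item> items;             // sorted by profit/weight ratio, descending

// Fractional (LP-relaxation) upper bound for the subtree rooted at 'level'.
double bound(int level, int profit, int weight) {
    double b = profit;
    int w = weight;
    for (int i = level; i < n; ++i) {
        if (w + items[i].weight <= capacity) { w += items[i].weight; b += items[i].profit; }
        else {                                   // take a fraction of the next item
            b += items[i].profit * double(capacity - w) / items[i].weight;
            break;
        }
    }
    return b;
}

void branch(int level, int profit, int weight) {
    if (weight > capacity) return;               // infeasible node
    if (level == n) { best = std::max(best, profit); return; }
    if (bound(level, profit, weight) <= best) return;   // prune: cannot beat current best
    branch(level + 1, profit + items[level].profit, weight + items[level].weight); // take item
    branch(level + 1, profit, weight);                                             // skip item
}

int main() {
    items = {{10, 2}, {10, 4}, {12, 6}, {18, 9}};
    capacity = 15; n = (int)items.size(); best = 0;
    std::sort(items.begin(), items.end(), [](const Item& a, const Item& b) {
        return double(a.profit) / a.weight > double(b.profit) / b.weight;
    });
    branch(0, 0, 0);
    std::printf("maximum profit = %d\n", best);
    return 0;
}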
CONCLUSION: Thus we have studied 0-1 knapsack problem using branch and bound
approach.
ASSIGNMENT NO:3
TITLE: Code optimization using DAG.
DD = deterministic data; it helps in identifying the load, store, or assignment functions.
Computational Model
THEORY:
Optimization is the process of transforming a piece of code to make it more efficient (either in terms of time or space) without changing its output or side-effects. In the code optimization phase the intermediate code is improved so that the final output runs faster and occupies less space. The output of this phase is another, more efficient, intermediate code. The basic requirement that optimization methods should comply with is that an optimized program must have the same output and side effects as its non-optimized version. This requirement, however, may be ignored in case the benefit from optimization is estimated to be more important than probable consequences of a change in the program behavior.
Optimization can be performed by automatic optimizers or programmers. An optimizer is either
a specialized software tool or a built-in unit of a compiler (the so-called optimizing
compiler).Optimizations are classified into high-level and low-level optimizations. High-level
optimizations are usually performed by the programmer who handles abstract entities (functions,
procedures, classes, etc.) and keeps in mind the general framework of the task to optimize the
design of a system.
Control-Flow Analysis:- In control-flow analysis, the compiler figures out even more information about how the program does its work, only now it can assume that there are no syntactic or semantic errors in the code.
Control-flow analysis begins by constructing a control-flow graph , which is a graph of the
different possible paths program flow could take through a function. To build the graph, we first
divide the code into basic blocks.
Constant Propagation:- If a variable is assigned a constant value, then subsequent uses of that
variable can be replaced by the constant as long as no intervening assignment has changed the
value of the variable.
Code Motion :- Code motion (also called code hoisting ) unifies sequences of code common to
one or more basic blocks to reduce code size and potentially avoid expensive re-evaluation.
Peephole Optimizations :- Peephole optimization is a pass that operates on the target assembly
and only considers a few instructions at a time (through a "peephole") and attempts to do simple,
machine-dependent code improvements
Redundant instruction elimination:- At source code level, the following can be done by the user:
int add_ten(int x) { int y, z; y = 10; z = x + y; return z; }
int add_ten(int x) { int y; y = 10; y = x + y; return y; }
int add_ten(int x) { int y = 10; return x + y; }
int add_ten(int x) { return x + 10; }
At compilation level, the compiler searches for instructions redundant in nature. Multiple loading
and storing of instructions may carry the same meaning even if some of them are removed. For
example:
MOV x, R0
MOV R0, R1
We can delete the first instruction and re-write the sentence as: MOV x, R1
Unreachable code:-Unreachable code is a part of the program code that is never accessed
because of programming constructs. Programmers may have accidently written a piece of code
that can never be reached.
Example:
void add_ten(int x)
{
return x + 10; printf(“value of x is %d”, x); }
In this code segment, the printf statement will never be executed as the program control returns
back before it can execute, hence printf can be removed.
Flow of control optimization:-There are instances in a code where the program control jumps
back and forth without performing any significant task. These jumps can be removed. Consider
the following chunk of code:...
MOV R1, R2
GOTO L1
...
L1 : GOTO L2
L2 : INC R1
In this code, label L1 can be removed as it passes the control to L2. So instead of jumping to L1
and then to L2, the control can directly reach L2, as shown below:...
MOV R1, R2
GOTO L2
...
L2 : INC R1
Algebraic expression simplification:- There are occasions where algebraic expressions can be
made simple. For example, the expression a = a + 0 can be replaced by a itself and the expression
a = a + 1 can simply be replaced by INC a.
Strength reduction:- There are operations that consume more time and space. Their 'strength' can be reduced by replacing them with other operations that consume less time and space, but produce the same result. For example, x * 2 can be replaced by x << 1, which involves only one left shift. Similarly, though a * a and a^2 produce the same output, computing a * a directly is much more efficient than calling a general power routine.
Optimization Phases
The phase which represents the pertinent, possible flow of control is often called control flow
analysis. If this representation is graphical, then a flow graph depicts all possible execution
paths. Control flow analysis simplifies the data flow analysis. Data flow analysis is the process of collecting information about the modification, preservation, and use of program "quantities" -- usually variables and expressions. Once control flow analysis and data flow analysis have been
done, the next phase, the improvement phase, improves the program code so that it runs faster or
uses less space. This phase is sometimes termed optimization. Thus, the term optimization is
used for this final code improvement, as well as for the entire process which includes control
flow analysis and data flow analysis. Optimization algorithms attempt to remove useless code,
eliminate redundant expressions, move invariant computations out of loops, etc.
Basic Blocks
A basic block is a sequence of intermediate representation constructs (quadruples, abstract
syntax trees, whatever) which allow no flow of control in or out of the block except at the top or
bottom. Figure 3 shows the structure of a basic block.
We will use the term statement for the intermediate representation and show it in quadruple form
because quadruples are easier to read than IR in tree form.
Leaders:A basic block consists of a leader and all code before the next leader. We define a
leader to be (1) the first statement in the program (or procedure), (2) the target of a branch,
identified most easily because it has a label, and (3) the statement after a "diverging flow " : the
statement after a conditional or unconditional branch.
Basic blocks can be built during parsing if it is assumed that all labels are referenced or after
parsing without that assumption. The following example shows the outline of a FOR loop and its basic blocks. Since a basic block consists of straight-line code, it computes a set of expressions. Many optimizations are really transformations applied to basic blocks and to sequences of basic blocks. The basic blocks for the FOR loop are shown in the figure below. Once we have computed basic blocks, we can create the control flow graph.
Building a Flow Graph: A flow graph shows all possible execution paths. We will use this information to perform optimizations. Formally, a flow graph is a directed graph G, with N nodes and E edges. Example: building the control flow graph from basic blocks.
The previous sections looked at the control flow graph, whose nodes are basic blocks. DAGs, on the other hand, create a useful structure for the intermediate representation within the basic blocks. A
directed graph with no cycles, called a DAG (Directed Acyclic Graph), is used to represent a
basic block. Thus, the nodes of a flow graph are themselves graphs! We can create a DAG
instead of an abstract syntax tree by modifying the operations for constructing the nodes. If the
node already exists, the operation returns a pointer to the existing node. Example 4 shows this for
the two-line assignment statement example.
Example: A DAG for two-line assignment statement
In above example there are two references to a and two references to the quantity bb * 12 .
Data Structure for a DAG: As above example shows, each node may have more than one
pointer to it. We can represent a DAG internally by a process known as value-numbering . Each
node is numbered and this number is entered whenever this node is reused. This is shown
following Example.
EXAMPLE: Value-numbering
Node  Op/Value  Left Child  Right Child
1.    X1
2.    a
3.    bb
4.    12
5.    *         (3)         (4)
6.    +         (2)         (5)
7.    :=        (1)         (6)
8.    X2
9.    2
10.   /         (2)         (9)
11.   +         (10)        (5)
12.   :=        (8)         (11)
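A minimal C++ sketch of value numbering while building a DAG: identical (op, left, right) triples are mapped to the same node number, so the second occurrence of a common subexpression reuses the existing node. All names are ours.

#include <cstdio>
#include <map>
#include <string>
#include <tuple>
#include <vector>

struct Node { std::string op; int left, right; };           // child index -1 means "leaf"

std::vector<Node> dag;
std::map<std::tuple<std::string, int, int>, int> seen;      // the value-numbering table

// Return the existing node for (op, l, r) if any, otherwise create a new one.
int node(const std::string& op, int l = -1, int r = -1) {
    auto key = std::make_tuple(op, l, r);
    auto it = seen.find(key);
    if (it != seen.end()) return it->second;                // common subexpression reused
    dag.push_back({op, l, r});
    return seen[key] = (int)dag.size() - 1;
}

int main() {
    // The subexpression bb * 12 occurs twice but is represented by a single DAG node.
    int bb = node("bb"), twelve = node("12");
    int mul1 = node("*", bb, twelve);
    int mul2 = node("*", bb, twelve);                       // same value number as mul1
    std::printf("bb*12 first built as node %d, reused as node %d\n", mul1, mul2);
    return 0;
}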
CONCLUSION:
Hence, we have successfully implemented code optimization using directed acyclic graph.
ASSIGNMENT NO: 4
TITLE: Code generation using DAG / labeled tree.
OBJECTIVES:
1. To express concept of DAG and Labeled Tree.
2. To apply the code generation algorithm to generate target code.
PROBLEM STATEMENT:
Accept Postfix expression. Create a DAG from that expression. Apply Labeling algorithm to
DAG and then apply code generation algorithm to generate target code from DAG.
INPUT: Input data as DAG/Labeled Tree which is generated from postfix expression.
OUTPUT: It will create a target code.
MATHEMATICAL MODEL:
Let S be the solution perspective of the code generation such that
i= postfix expression
o=target code
Gencode() = { Handles different cases of tree & accordingly generate target code}
COMPUTATIONAL MODEL:
e2
e1 e3
S A B
Where S : Initial State
B: Target Code
e2 : Compute Label
System Accepts postfix expression & enters into A state . In that state, it generates tree, at the
same time, calculates label of tree. After that, it passes root node of Label tree to gencode
function & enters into B state. In B state it applies code generation algorithm & generate target
code.
Success- desired output is generated as target code in assembly language form
Failure- desired output is not generated as target code in assembly language form
THEORY:
The Labeling Algorithm: The labeling can be done by visiting nodes in a bottom-up order, so that a node is not visited until all its children are labeled.
In the important special case that n is a binary node and its children have labels l1 and l2, the above formula reduces to:
label(n) = max(l1, l2)  if l1 ≠ l2
label(n) = l1 + 1       if l1 = l2
Example:
Node a is labeled 1 since it is a leftmost leaf. Node b is labeled 0 since it is a right leaf. Node t1 is labeled 1 because the labels of its children are unequal and the maximum label of a child is 1. Fig. 1.2 shows the labeled tree that results.
Code generation from a Labeled Tree
Procedure GENCODE(n)
RSTACK –stack of registers, R0,...,R(r-1)
TSTACK –stack of temporaries, T0,T1,...
A call to Gencode(n) generates code to evaluate a tree T, rooted at node n, into the register top(RSTACK), and the rest of RSTACK remains in the same state as before the call.
A swap of the top two registers of RSTACK is needed at some points in the algorithm to ensure that a node is evaluated into the same register as its left child.
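A short C++ sketch of the labeling rule stated above (leftmost leaves get label 1, other leaves 0; a binary interior node gets the maximum of its children's labels, plus one when they are equal). The struct and function names are ours, and the example reproduces the a + b case from Fig. 1.2.

#include <algorithm>
#include <cstdio>

struct TreeNode {
    char op;
    TreeNode *left, *right;
    int label;
};

// Label the subtree rooted at n; 'leftmost' tells whether n is a leftmost child.
void labelTree(TreeNode* n, bool leftmost = true) {
    if (!n) return;
    if (!n->left && !n->right) {                 // leaf
        n->label = leftmost ? 1 : 0;
        return;
    }
    labelTree(n->left, true);
    labelTree(n->right, false);
    int l1 = n->left ? n->left->label : 0;
    int l2 = n->right ? n->right->label : 0;
    n->label = (l1 == l2) ? l1 + 1 : std::max(l1, l2);   // binary-node special case
}

int main() {
    TreeNode a{'a', nullptr, nullptr, 0}, b{'b', nullptr, nullptr, 0};
    TreeNode t1{'+', &a, &b, 0};                 // t1 = a + b
    labelTree(&t1);
    std::printf("label(a)=%d label(b)=%d label(t1)=%d\n", a.label, b.label, t1.label);
    return 0;
}

The label of a node is the minimum number of registers needed to evaluate it without spilling, which is exactly what GENCODE consults when deciding evaluation order.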
CONCLUSION:
Hence, we have successfully studied DAG, Labeling algorithm and code generation from
Labeled Tree.
ASSIGNMENT NO:5
TITLE: Generating abstract syntax tree using LEX and YACC.
PROBLEM STATEMENT: Generate an abstract syntax tree using LEX and YACC.
MATHEMATICAL MODEL / Computational Model:
s --e1--> B --e2--> R
Where,
s is the initial state,
e1 is the input,
B is the token generator, and
R is the result.
THEORY:
What is abstract syntax tree?
In computer science, an abstract syntax tree (AST), or just syntax tree, is a tree representation of
the abstract syntactic structure of source code written in a programming language. Each node of
the tree denotes a construct occurring in the source code. The syntax is "abstract" in not
representing every detail appearing in the real syntax. For instance, grouping parentheses are
implicit in the tree structure, and a syntactic construct like an if-condition-then expression may
be denoted by means of a single node with three branches.
This distinguishes abstract syntax trees from concrete syntax trees, traditionally designated parse
trees, which are often built by a parser during the source code translation and compiling process.
Once built, additional information is added to the AST by means of subsequent processing, e.g.,
contextual analysis. Abstract syntax trees are also used in program analysis and program
transformation systems.
The representation of SourceCode as a tree of nodes representing constants or variables (leaves)
and operators or statements (inner nodes). Also called a "parse tree". An AbstractSyntaxTree is
often the output of a parser (or the "parse stage" of a compiler), and forms the input to semantic
analysis and code generation (this assumes a phased compiler; many compilers interleave the
phases in order to conserve memory).
They are widely used in compilers, due to their property of representing the structure of program
code. An AST is usually the result of the syntax analysis phase of a compiler. It often serves as
an intermediate representation of the program through several stages that the compiler requires,
and has a strong impact on the final output of the compiler.
Generating abstract syntax tree using Lex and YACC
Yacc actions appear to the right of each rule, much like lex actions. We can associate pretty
much any C code that we like with each rule. Consider the yacc file. Tokens are allowed to have
values, which we can refer to in our yacc actions. The $1 in printf("%s",$1) (the 5th rule for exp)
refers to the value of the IDENTIFIER (notice that IDENTIFIER tokens are specified to have
string values). The $1 in printf("%d",$1) (the 6th rule for exp) refers to the value of the
INTEGER LITERAL. We use $1 in each of these cases because the tokens are the first items in
the right-hand side of the rule.
Example
4*5+6
First, we shift INTEGER LITERAL(4) onto the stack. We then reduce by the rule exp:
INTEGER LITERAL, and execute the printf statement, printing out a 4. We then shift a TIMES,
then shift an INTEGER LITERAL(5). Next, we reduce by the rule exp: INTEGER LITERAL
and print out a 5. Next, we reduce by the rule exp: exp TIMES exp (again, remember those
precedence directives!) and print out a *. Next, we shift a PLUS and an INTEGER LITERAL(6).
We reduce by exp: INTEGER LITERAL (printing out a 6), then we reduce by the rule exp: exp
+ exp (printing out a +), giving an output of:
45*6+
So what does this parser do? It converts infix expressions into postfix expressions.
Creating an Abstract Syntax Tree
C Definitions Between %{ and %} in the yacc file are the C definitions. The #include is
necessary for the yacc actions to have access to the tree types and constructors, defined in
treeDefinitions.h. The global variable root is where yacc will store the finished abstract syntax
tree.
• %union: The %union{ } command defines every value that could occur on the stack – not only token values, but non-terminal values as well. This %union tells yacc that the types of values that we could push on the stack are integers, strings (char *), and expressionTrees. Each element on the yacc stack contains two items – a state, and a union (which is defined by %union{ }).
• %token: For tokens with values, %token <field> tells yacc the type of the token. For instance, in the rule exp : INTEGER_LITERAL { $$ = ConstantExpression($1); }, yacc looks at the union variable on the stack to get the value of $1. The command %token <integer_value> INTEGER_LITERAL tells yacc to use the integer_value field of the union on the stack when looking for the value of an INTEGER_LITERAL.
• %type: Just like for tokens, these commands tell yacc what values are legal for non-terminals.
• %left, %right, %nonassoc: Precedence and associativity of operators.
• Grammar rules: The rest of the yacc file, after the %%, contains the grammar rules, of the form
<non-terminal> : <rhs> { /* C code */ }
where <non-terminal> is a non-terminal and <rhs> is a sequence of tokens and non-terminals.
Let's look at an example. For clarity, we will represent the token INTEGER_LITERAL(3) as just 3. Pointers will be represented graphically with arrows. The stack will grow down in our diagrams. Consider the input string 3+4*5. First, the stack is empty, and the input is 3+4*5.
Example: Creating an abstract syntax tree for simple expressions
%{
#include "treeDefinitions.h"
expressionTree root;
%}
%union{
int integer_value;
char *string_value;
expressionTree expression_tree;
}
%token <integer_value> INTEGER_LITERAL
%token <string_value> IDENTIFIER
%token PLUS MINUS TIMES DIVIDE
%type <expression_tree> exp
%left PLUS MINUS
%left TIMES DIVIDE
%%
prog : exp { root = $$; }
exp : exp PLUS exp { $$ = OperatorExpression(PlusOp,$1,$3); }
| exp MINUS exp { $$ = OperatorExpression(MinusOp,$1,$3); }
| exp TIMES exp { $$ = OperatorExpression(TimesOp,$1,$3); }
| exp DIVIDE exp { $$ = OperatorExpression(DivideOp,$1,$3); }
| IDENTIFIER { $$ = IdentifierExpression($1); }
| INTEGER_LITERAL { $$ = ConstantExpression($1); }
/* File treeDefinitions.c */
#include "treeDefinitions.h"
#include <stdio.h>
expressionTree operatorExpression(optype op, expressionTree left,
expressionTree right) {
expressionTree retval = (expressionTree) malloc(sizeof(struct expression));
retval->kind = operatorExp;
retval->u.oper.op = op;
retval->u.oper.left = left;
retval->u.oper.right = right;
return retval;
}
expressionTree IdentifierExpression(char *variable) {
expressionTree retval = (expressionTree) malloc(sizeof(struct expression));
retval->kind = variableExp;
retval->u.variable = variable;
return retval;
}
expressionTree ConstantExpression(int constantval) {
expressionTree retval = (expressionTree) malloc(sizeof(struct expression));
retval->kind = constantExp;
retval->u.constantval = constantval;
return retval;
}
CONCLUSION
Hence, we have successfully studied concept of abstract syntax tree.
ASSIGNMENT NO:6
TITLE: Implementing recursive descent parser for sample language.
Computational Model: state-transition diagram with states S and T and transitions e1, e2, e3 (not reproduced here).
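The detailed theory for this experiment is not included in this copy of the manual. As an illustration only, here is a minimal C++ sketch of a recursive descent parser for arithmetic expressions over the left-recursion-free grammar E -> T {(+|-) T}, T -> F {(*|/) F}, F -> digit | ( E ); all names are ours.

#include <cctype>
#include <cstdio>
#include <string>

std::string input;
size_t pos = 0;

bool expr();                                     // forward declaration

bool factor() {                                  // F -> digit | ( E )
    if (pos < input.size() && std::isdigit((unsigned char)input[pos])) { ++pos; return true; }
    if (pos < input.size() && input[pos] == '(') {
        ++pos;
        if (!expr()) return false;
        if (pos < input.size() && input[pos] == ')') { ++pos; return true; }
    }
    return false;
}

bool term() {                                    // T -> F { (*|/) F }
    if (!factor()) return false;
    while (pos < input.size() && (input[pos] == '*' || input[pos] == '/')) {
        ++pos;
        if (!factor()) return false;
    }
    return true;
}

bool expr() {                                    // E -> T { (+|-) T }
    if (!term()) return false;
    while (pos < input.size() && (input[pos] == '+' || input[pos] == '-')) {
        ++pos;
        if (!term()) return false;
    }
    return true;
}

int main() {
    input = "3+4*(2-1)";
    bool ok = expr() && pos == input.size();
    std::printf("%s is %s\n", input.c_str(), ok ? "accepted" : "rejected");
    return 0;
}

Note that the grammar is written without left recursion (the repetition is expressed as a loop), which is the elimination step mentioned in the conclusion below.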
CONCLUSION:
Hence, we have successfully studied to eliminate Left Recursion and generate a
Recursive Descent Parser.
ASSIGNMENT NO:7
TITLE: Implement Apriori approach for data mining to organize the data items
OBJECTIVES:
1. Students should be able to implement Apriori approach for data mining.
2. Student should be able to identify frequent item sets using association rule.
PROBLEM STATEMENT: Implement Apriori approach for data mining to organize the data
items on a shelf using following table of items purchased in a Mall
SOFTWARE REQUIRED: Latest version of 64 Bit Operating Systems Open Source Fedora-
20
INPUT: Data items (item purchase in mall e.g. Mango, Onion etc)
MATHEMATICAL MODEL:
THEORY:
Apriori is a classic algorithm for frequent item set mining and association rule learning over
transactional databases. It proceeds by identifying the frequent individual items in the database
and extending them to larger and larger item sets as long as those item sets appear sufficiently
often in the database. The frequent item sets determined by Apriori can be used to determine
association rules which highlight general trends in the database. This has applications in domains
such as market basket analysis. Apriori uses a "bottom up" approach, where frequent subsets are extended one item at a time (a step known as candidate generation), and groups of candidates are tested against the data. The algorithm terminates when no further successful extensions are found. Apriori is designed to operate on databases containing transactions (for example, collections of items bought by customers, or details of website visits).
Pseudo Code For Apriori Algorithm:
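The pseudo code itself is not reproduced in this copy of the manual; a standard formulation of the algorithm, in the same style as the other algorithm listings here, is roughly as follows, where L(k) denotes the set of frequent k-itemsets and C(k) the candidate k-itemsets:

L(1) = { frequent 1-itemsets found by scanning the database }
for ( k = 2; L(k-1) is not empty; k = k + 1 )
{
    C(k) = candidates formed by joining L(k-1) with itself and pruning
           every candidate that has an infrequent (k-1)-subset;
    for each transaction t in the database
        increment the count of every candidate in C(k) contained in t;
    L(k) = { candidates in C(k) whose count >= minimum support };
}
return the union of all L(k);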
Association rule learning:It is a popular and well researched method for discovering interesting
relations between variables in large databases. It is intended to identify strong rules discovered in
databases using different measures of interestingness. For example, the rule {onions, potatoes} => {hamburger meat} found in the sales data of a supermarket would indicate that if a customer buys onions and potatoes together, they are likely to also buy hamburger meat. Such information can be used as
the basis for decisions about marketing activities such as, e.g., promotional pricing or product
placements. In addition to the above example from market basket analysis association rules are
employed today in many application areas including Web usage mining, intrusion detection,
Continuous production, and bioinformatics. In contrast with sequence mining, association rule
learning typically does not consider the order of items either within a transaction or across
transactions.
Association rule mining is defined as follows: Let I = {i1, i2, ..., in} be a set of binary attributes called items. Let D = {t1, t2, ..., tm} be a set of transactions called the database. Each transaction in D has a unique transaction ID and contains a subset of the items in I. A rule is defined as an implication of the form X => Y, where X and Y are subsets of I and X and Y are disjoint. The sets of items (for short, itemsets) X and Y are called the antecedent (left-hand side or LHS) and consequent (right-hand side or RHS) of the rule respectively.
Suppose you have records of large number of transactions at a shopping center as follows:
Learning association rules basically means finding the items that are purchased together more
frequently than others.
For example in the above table you can see Item1 and item2 are bought together frequently.
So as I said Apriori is the classic and probably the most basic algorithm to do it. Now if you
search online you can easily find the pseudo-code and mathematical equations and stuff. I would
like to make it more intuitive and easy, if I can.
I would like if a 10th or a 12th grader can understand this without any problem. So I will try and
not use any terminologies or jargons.
Let's start with a concrete example.
Step 1: Count the number of transactions in which each item occurs, Note ‘O=Onion’ is bought
4 times in total, but, it occurs in just 3 transactions.
Item    No. of transactions
M 3
O 3
N 2
K 5
E 4
Y 3
D 1
A 1
U 1
C 2
I 1
Step 2: Now remember we said the item is said frequently bought if it is bought at least 3 times.
So in this step we remove all the items that are bought less than 3 times from the above table and
we are left with
Item    Number of transactions
M 3
O 3
K 5
E 4
Y 3
These are the single items that are bought frequently. Now let's say we want to find pairs of items that are bought frequently. We continue from the above table (the table in step 2).
Step 3: We start making pairs from the first item, like MO,MK,ME,MY and then we start with
the second item like OK,OE,OY. We did not do OM because we already did MO when we were
making pairs with M and buying a Mango and Onion together is same as buying Onion and
Mango together. After making all the pairs we get,
Item pairs
MO
MK
ME
MY
OK
OE
OY
KE
KY
EY
Step 4: Now we count how many times each pair is bought together. For example, M and O are bought together only once, in {M,O,N,K,E,Y}, while M and K are bought together 3 times, in {M,O,N,K,E,Y}, {M,A,K,E} and {M,U,C,K,Y}. After doing that for all the pairs we get: MO 1, MK 3, ME 2, MY 2, OK 3, OE 3, OY 2, KE 4, KY 3, EY 2.
Step 5: We again keep only the pairs bought together at least 3 times, which leaves MK, OK, OE, KE and KY.
Step 6: From these frequent pairs we form candidate sets of three items, such as OKE and KEY.
While we are on this, suppose you have sets of 3 items say ABC, ABD, ACD, ACE, BCD and
you want to generate item sets of 4 items you look for two sets having the same first two
alphabets.
ABC and ABD -> ABCD
ACD and ACE -> ACDE
And so on … In general you have to look for sets having just the last alphabet/item different.
Step 7: So we again apply the golden rule, that is, the itemset must be bought together at least 3 times, which leaves us with just OKE, since K, E, Y are bought together only two times.
Thus the set of three items that are bought together most frequently is {O, K, E}.
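The whole procedure above can be written compactly in Python. The following is only a minimal sketch (not part of the original assignment text), assuming the five transactions implied by the counts in Step 1 (MONKEY, DONKEY, MAKE, MUCKY and COOKIE, each treated as a set of single-letter items) and a minimum support of 3 transactions:

from itertools import combinations

transactions = [set("MONKEY"), set("DONKEY"), set("MAKE"), set("MUCKY"), set("COOKIE")]
MIN_SUPPORT = 3  # the "golden rule": an itemset must occur in at least 3 transactions

def frequent(candidates):
    # Count in how many transactions each candidate itemset occurs, keep the frequent ones.
    counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
    return {c: n for c, n in counts.items() if n >= MIN_SUPPORT}

# Steps 1-2: frequent single items
singles = frequent({frozenset([item]) for t in transactions for item in t})
# Steps 3-4: candidate pairs built from frequent single items, then filtered by support
pairs = frequent({a | b for a, b in combinations(singles, 2)})
# Steps 5-7: candidate triples built from frequent pairs sharing an item, then filtered
triples = frequent({a | b for a, b in combinations(pairs, 2) if len(a | b) == 3})

for itemset, support in triples.items():
    print(''.join(sorted(itemset)), support)   # expected: EKO 3, i.e. {O, K, E}

Running this reproduces the walkthrough: the surviving pairs are MK, OK, OE, KE and KY, and the only frequent triple is {O, K, E} with support 3.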
CONCLUSION:
Hence, we have successfully studied the concept of Association Rules and the Apriori Algorithm.
ASSIGNMENT NO: 8
MATHEMATICAL MODEL:
Let S be the solution perspective of the class.
DD = deterministic data; it helps in identifying the load/store functions or assignment functions.
THEORY:
The Naive Bayes algorithm is an intuitive method that uses the probabilities of each attribute belonging to each class to make a prediction. It is the supervised learning approach you would come up with if you wanted to model a predictive modeling problem probabilistically.
Naive Bayes simplifies the calculation of probabilities by assuming that the probability of each attribute belonging to a given class value is independent of all other attributes. This is a strong assumption but results in a fast and effective method. The probability of a class value given a value of an attribute is called the conditional probability. By multiplying the conditional probabilities together for each attribute for a given class value, we have a probability of a data instance belonging to that class. To make a prediction we can calculate the probabilities of the instance belonging to each class and select the class value with the highest probability. Naive Bayes is often described using categorical data because it is easy to describe and calculate using ratios. A more useful version of the algorithm for our purposes supports numeric attributes and assumes the values of each numerical attribute are normally distributed (fall somewhere on a bell curve). Again, this is a strong assumption, but it still gives robust results.
Predict the Onset of Diabetes: The test problem we will use in this tutorial is the Pima Indians Diabetes problem.
This problem comprises 768 observations of medical details for Pima Indian patients. The records describe instantaneous measurements taken from the patient, such as their age, the number of times pregnant and blood workup. All patients are women aged 21 or older. All attributes are numeric, and their units vary from attribute to attribute.
Each record has a class value that indicates whether the patient suffered an onset of diabetes within 5 years of when the measurements were taken (1) or not (0).
This is a standard dataset that has been studied extensively in the machine learning literature. A good prediction accuracy is 70%-76%.
Below is a sample from the dataset file to give a sense of the data we will be working with. NOTE: Download the dataset and save it with a .csv extension (e.g. pima-indians-diabetes.data.csv). A description of all the attributes is provided with the dataset.
6,148,72,35,0,33.6,0.627,50,1
1,85,66,29,0,26.6,0.351,31,0
8,183,64,0,0,23.3,0.672,32,1
1,89,66,23,94,28.1,0.167,21,0
0,137,40,35,168,43.1,2.288,33,1
Naive Bayes Algorithm Tutorial: This tutorial is broken down into the following steps:
1. Handle Data: Load the data from CSV file and split it into training and test datasets.
2. Summarize Data: summarize the properties in the training dataset so that we can
calculate probabilities and make predictions.
3. Make a Prediction: Use the summaries of the dataset to generate a single prediction.
4. Make Predictions: Generate predictions given a test dataset and a summarized training
dataset.
5. Evaluate Accuracy: Evaluate the accuracy of predictions made for a test dataset as the
percentage correct out of all predictions made.
6. Tie it Together: Use all of the code elements to present a complete and standalone
implementation of the Naive Bayes algorithm.
1. Handle Data: The first thing we need to do is load our data file. The data is in CSV format without a header line or any quotes. We can open the file with the open function and read the data lines using the reader function in the csv module.
We also need to convert the attributes that were loaded as strings into numbers so that we can work with them. Below is the loadCsv() function for loading the Pima Indians dataset.
import csv

def loadCsv(filename):
    lines = csv.reader(open(filename, "rb"))
    dataset = list(lines)
    for i in range(len(dataset)):
        dataset[i] = [float(x) for x in dataset[i]]
    return dataset
We can test this function by loading the Pima Indians dataset and printing the number of data instances that were loaded.
filename = 'pima-indians-diabetes.data.csv'
dataset = loadCsv(filename)
print('Loaded data file {0} with {1} rows'.format(filename, len(dataset)))
We can test this out by defining a mock dataset with 5 instances, splitting it into training and test datasets, and printing them out to see which data instances ended up where.
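The split function itself is not reproduced in this excerpt. A minimal sketch, assuming a function (here called splitDataset, a hypothetical name) that randomly moves a fraction of the rows into the training set, which produces output of the form shown below:

import random

def splitDataset(dataset, splitRatio):
    # Randomly move splitRatio of the rows into the training set; the remainder is the test set.
    trainSize = int(len(dataset) * splitRatio)
    trainSet = []
    copy = list(dataset)
    while len(trainSet) < trainSize:
        index = random.randrange(len(copy))
        trainSet.append(copy.pop(index))
    return [trainSet, copy]

dataset = [[1], [2], [3], [4], [5]]
train, test = splitDataset(dataset, 0.67)
print('Split {0} rows into train with {1} and test with {2}'.format(len(dataset), train, test))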
Split 5 rows into train with [[4], [3], [5]] and test with [[1], [2]]
2. Summarize Data
The Naive Bayes model is comprised of a summary of the data in the training dataset. This summary is then used when making predictions.
The summary of the training data collected involves the mean and the standard deviation for
each attribute, by class value. For example, if there are two class values and 7 numerical
attributes, then we need a mean and standard deviation for each attribute (7) and class value (2)
combination, that is 14 attribute summaries.
These are required when making predictions to calculate the probability of specific attribute
values belonging to each class value.
We can break the preparation of this summary data down into sub-tasks: separating the training instances by class value, and then calculating the mean and standard deviation of each attribute for each class. Separating a small mock dataset by class value gives, for example:
Separated instances: {0: [[2, 21, 0]], 1: [[1, 20, 1], [3, 22, 1]]}
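The separation step itself is not shown in this excerpt. A minimal sketch that groups instances by their class value (assumed to be the last attribute of each instance) and reproduces the separation above (key order may differ):

def separateByClass(dataset):
    # Group data instances by their class value, assumed to be the last attribute.
    separated = {}
    for instance in dataset:
        separated.setdefault(instance[-1], []).append(instance)
    return separated

dataset = [[1, 20, 1], [2, 21, 0], [3, 22, 1]]
print('Separated instances: {0}'.format(separateByClass(dataset)))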
Calculate Mean
We need to calculate the mean of each attribute for a class value. The mean is the central tendency of the data, and we will use it as the middle of our Gaussian distribution when calculating probabilities.
We also need to calculate the standard deviation of each attribute for a class value. The standard deviation describes the variation or spread of the data, and we will use it to characterize the expected spread of each attribute in our Gaussian distribution when calculating probabilities. The standard deviation is calculated as the square root of the variance. The variance is calculated as the average of the squared differences of each attribute value from the mean. Note that we are using the N-1 method, which subtracts 1 from the number of attribute values when calculating the variance.
import math

def mean(numbers):
    return sum(numbers) / float(len(numbers))

def stdev(numbers):
    avg = mean(numbers)
    variance = sum([pow(x - avg, 2) for x in numbers]) / float(len(numbers) - 1)
    return math.sqrt(variance)
numbers = [1, 2, 3, 4, 5]
print('Summary of {0}: mean={1}, stdev={2}'.format(numbers, mean(numbers), stdev(numbers)))
Running this test, you should see something like:
Summary of [1, 2, 3, 4, 5]: mean=3.0, stdev=1.5811388300841898
We can test this summarize() function with some test data that shows markedly different mean
and standard deviation values for the first and second data attributes.
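The summarize() function itself is not included in this excerpt. A minimal sketch, assuming it pairs the mean and standard deviation of every attribute column and drops the class column, reusing the mean() and stdev() functions above:

def summarize(dataset):
    # Compute (mean, stdev) for each attribute column; the last column is the class value.
    summaries = [(mean(column), stdev(column)) for column in zip(*dataset)]
    del summaries[-1]
    return summaries

dataset = [[1, 20, 0], [2, 21, 1], [3, 22, 0]]
print('Attribute summaries: {0}'.format(summarize(dataset)))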
x = 71.5
mean = 73
stdev = 6.2
probability = calculateProbability(x, mean, stdev)
print('Probability of belonging to this class: {0}'.format(probability))
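The calculateProbability() function used above does not appear in this excerpt. A minimal sketch of the Gaussian probability density function it is assumed to compute:

import math

def calculateProbability(x, mean, stdev):
    # Gaussian probability density of x for an attribute with the given mean and standard deviation.
    exponent = math.exp(-((x - mean) ** 2) / (2 * stdev ** 2))
    return (1 / (math.sqrt(2 * math.pi) * stdev)) * exponent

With x = 71.5, mean = 73 and stdev = 6.2 this evaluates to roughly 0.0625, i.e. the likelihood of that attribute value under the class's Gaussian.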
Prediction: A
4. Make Predictions
Finally, we can estimate the accuracy of the model by making predictions for each data instance in our test dataset. The getPredictions() function will do this and return a list of predictions, one for each test instance.
def getPredictions(summaries, testSet):
    predictions = []
    for i in range(len(testSet)):
        result = predict(summaries, testSet[i])
        predictions.append(result)
    return predictions
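getPredictions() relies on a predict() helper that is not shown in this excerpt. A minimal sketch, assuming the summaries are a dictionary mapping each class value to a list of (mean, stdev) pairs and reusing the calculateProbability() sketch given earlier:

def calculateClassProbabilities(summaries, inputVector):
    # Multiply together the Gaussian probabilities of each attribute value for every class.
    probabilities = {}
    for classValue, classSummaries in summaries.items():
        probabilities[classValue] = 1
        for i, (m, s) in enumerate(classSummaries):
            probabilities[classValue] *= calculateProbability(inputVector[i], m, s)
    return probabilities

def predict(summaries, inputVector):
    # Choose the class with the highest combined probability.
    probabilities = calculateClassProbabilities(summaries, inputVector)
    return max(probabilities, key=probabilities.get)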
We can test the getAccuracy() function with a small set of test instances and predictions; such a test produces:
Accuracy: 66.6666666667
6. Tie it Together
Finally, we need to tie it all together. The full code listing for Naive Bayes implemented from scratch in Python combines the functions described above; it begins as follows:
# Example of Naive Bayes implemented from scratch in Python
import csv
import random
import math
CONCLUSION:
Hence, we have successfully studied and implemented the Naive Bayes algorithm.
ASSIGNMENT NO: 9
OBJECTIVES:
1. Students should be able to implement the k-NN algorithm.
2. Students should be able to apply the k-NN algorithm in a program.
PROBLEM STATEMENT:
Implement a simple approach for the k-NN algorithm, taking a suitable example.
SOFTWARE REQUIRED: Latest version of 64 Bit Operating Systems Open Source Fedora-20
MATHEMATICAL MODEL:
DD = deterministic data; it helps in identifying the load/store functions or assignment functions.
THEORY:
The k-means algorithm
The k-means algorithm is a simple iterative method to partition a given dataset into a user-specified number of clusters, k. This algorithm has been discovered by several researchers across different disciplines, most notably Lloyd (1957, 1982), Forgy (1965), Friedman and Rubin (1967), and MacQueen (1967). A detailed history of k-means, along with descriptions of several variations, is given in the literature. Gray and Neuhoff provide a nice historical background for k-means placed in the larger context of hill-climbing algorithms.
The objective function will decrease whenever there is a change in the assignment or the relocation steps, and hence convergence is guaranteed in a finite number of iterations. The greedy-descent nature of k-means on a non-convex cost also implies that the convergence is only to a local optimum, and indeed the algorithm is typically quite sensitive to the initial centroid locations. Figure 2 illustrates how a poorer result is obtained for the same dataset as in Fig. 1 for a different choice of the three initial centroids. The local minima problem can be countered to some extent by running the algorithm multiple times with different initial centroids, or by doing limited local search about the converged solution.
(Fig. 1: k-means clustering result. Fig. 2: Effect of an inferior initialization on the k-means results.)
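A minimal sketch of the iterative procedure described above, for points in the plane, assuming squared Euclidean distance for the assignment step and mean-based centroid relocation (illustrative only, not the assignment's required implementation):

import random

def kmeans(points, k, iterations=100):
    centroids = random.sample(points, k)          # initial centroids: k random points
    clusters = [[] for _ in range(k)]
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            d = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centroids]
            clusters[d.index(min(d))].append(p)
        # Relocation step: move each centroid to the mean of its cluster.
        new_centroids = [tuple(sum(col) / len(cl) for col in zip(*cl)) if cl else centroids[i]
                         for i, cl in enumerate(clusters)]
        if new_centroids == centroids:            # no change => converged (to a local optimum)
            break
        centroids = new_centroids
    return centroids, clusters

points = [(1, 1), (1.5, 2), (1, 0.6), (5, 8), (8, 8), (9, 11)]
print(kmeans(points, 2)[0])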
1. Handle Data
The first thing we need to do is load our data file. The data is in CSV format without a header
line or any quotes. We can open the file with the open function and read the data lines using the
reader function in the csv module.
import csv
with open('iris.data', 'rb') as csvfile:
    lines = csv.reader(csvfile)
    for row in lines:
        print(', '.join(row))
import csv
import random

def loadDataset(filename, split, trainingSet=[], testSet=[]):
    with open(filename, 'rb') as csvfile:
        lines = csv.reader(csvfile)
        dataset = list(lines)
        for x in range(len(dataset)-1):
            for y in range(4):
                dataset[x][y] = float(dataset[x][y])
            if random.random() < split:
                trainingSet.append(dataset[x])
            else:
                testSet.append(dataset[x])
Download the iris flowers dataset CSV file to the local directory. We can test this function out with our iris dataset, as follows:
trainingSet = []
testSet = []
loadDataset('iris.data', 0.66, trainingSet, testSet)
print('Train: ' + repr(len(trainingSet)))
print('Test: ' + repr(len(testSet)))
2. Similarity
In order to make predictions we need to calculate the similarity between any two given data
instances. This is needed so that we can locate the k most similar data instances in the training
dataset for a given member of the test dataset and in turn make a prediction.
Given that all four flower measurements are numeric and have the same units, we can directly
use the Euclidean distance measure. This is defined as the square root of the sum of the squared
differences between the two arrays of numbers (read that again a few times and let it sink in).
Additionally, we want to control which fields to include in the distance calculation. Specifically,
we only want to include the first 4 attributes. One approach is to limit the euclidean distance to a
fixed length, ignoring the final dimension.
Putting all of this together we can define the euclideanDistance function as follows:
import math

def euclideanDistance(instance1, instance2, length):
    distance = 0
    for x in range(length):
        distance += pow((instance1[x] - instance2[x]), 2)
    return math.sqrt(distance)
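A short test of this function with two made-up instances (the values here are only illustrative):

data1 = [2, 2, 2, 'a']
data2 = [4, 4, 4, 'b']
distance = euclideanDistance(data1, data2, 3)
print('Distance: ' + repr(distance))   # sqrt(4 + 4 + 4), roughly 3.464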
3. Neighbors
Now that we have a similarity measure, we can use it to collect the k most similar instances for a given unseen instance. This is a straightforward process of calculating the distance for all instances and selecting a subset with the smallest distance values. Below is the getNeighbors function that returns the k most similar neighbors from the training set for a given test instance (using the already defined euclideanDistance function).
import operator

def getNeighbors(trainingSet, testInstance, k):
    distances = []
    length = len(testInstance) - 1
    for x in range(len(trainingSet)):
        dist = euclideanDistance(testInstance, trainingSet[x], length)
        distances.append((trainingSet[x], dist))
    distances.sort(key=operator.itemgetter(1))
    neighbors = []
    for x in range(k):
        neighbors.append(distances[x][0])
    return neighbors
4. Response
Once we have located the most similar neighbors for a test instance, the next task is to devise a predicted response based on those neighbors. We can do this by allowing each neighbor to vote for its class attribute, and taking the majority vote as the prediction. Below is a function for getting the majority-voted response from a number of neighbors. It assumes the class is the last attribute for each neighbor.
import operator

def getResponse(neighbors):
    classVotes = {}
    for x in range(len(neighbors)):
        response = neighbors[x][-1]
        if response in classVotes:
            classVotes[response] += 1
        else:
            classVotes[response] = 1
    sortedVotes = sorted(classVotes.iteritems(), key=operator.itemgetter(1), reverse=True)
    return sortedVotes[0][0]
We can test out this function with some test neighbors, as follows:
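The test itself is missing from this excerpt; a minimal example, assuming each neighbor's class is its last attribute:

neighbors = [[1, 1, 1, 'a'], [2, 2, 2, 'a'], [3, 3, 3, 'b']]
response = getResponse(neighbors)
print(response)   # 'a', since it receives two of the three votes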
5. Accuracy
We have all of the pieces of the kNN algorithm in place. An important remaining concern is how
to evaluate the accuracy of predictions. An easy way to evaluate the accuracy of the model is to
calculate a ratio of the total correct predictions out of all predictions made, called the
classification accuracy. Below is the getAccuracy function that sums the total correct predictions
and returns the accuracy as a percentage of correct classifications.
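The function's listing is not reproduced in this excerpt; a minimal sketch that compares each prediction with the actual class value (the last attribute) of the corresponding test instance:

def getAccuracy(testSet, predictions):
    # Percentage of test instances whose predicted class matches the actual class.
    correct = 0
    for x in range(len(testSet)):
        if testSet[x][-1] == predictions[x]:
            correct += 1
    return (correct / float(len(testSet))) * 100.0

testSet = [[1, 1, 1, 'a'], [2, 2, 2, 'a'], [3, 3, 3, 'b']]
predictions = ['a', 'a', 'a']
print('Accuracy: ' + repr(getAccuracy(testSet, predictions)))   # 66.66...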
Program:
import csv
import random
Running the example, you will see the results of each prediction compared to the actual class
value in the test set. At the end of the run, you will see the accuracy of the model. In this case, a
little over 98%.
...
> predicted='Iris-virginica', actual='Iris-virginica'
> predicted='Iris-virginica', actual='Iris-virginica'
...
This section provides you with ideas for extensions that you could apply and investigate with the
Python code you have implemented as part of this tutorial.
Regression: You could adapt the implementation to work for regression problems
(predicting a real-valued attribute). The summarization of the closest instances could
involve taking the mean or the median of the predicted attribute.
Normalization: When the units of measure differ between attributes, it is possible for
attributes to dominate in their contribution to the distance measure. For these types of
problems, you will want to rescale all data attributes into the range 0-1 (called
normalization) before calculating similarity. Update the model to support data
normalization.
Alternative Distance Measure: There are many distance measures available, and you can even develop your own domain-specific distance measures if you like. Implement an alternative distance measure, such as Manhattan distance or the vector dot product (a small Manhattan distance sketch follows this list).
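As a starting point for the alternative distance measure extension, a minimal Manhattan distance sketch (a drop-in alternative to the euclideanDistance function above):

def manhattanDistance(instance1, instance2, length):
    # Sum of absolute differences over the first `length` attributes.
    return sum(abs(instance1[x] - instance2[x]) for x in range(length))

print(manhattanDistance([2, 2, 2, 'a'], [4, 4, 4, 'b'], 3))   # 6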
There are many more extensions to this algorithm you might like to explore. Two additional ideas include support for distance-weighted contributions of the k most similar instances to the prediction, and more advanced tree-based data structures for searching for similar instances.
OBJECTIVES:
1. To study how to use the iburg tool.
PROBLEM STATEMENT:
Study of the iburg tool and generation of target code using that tool.
THEORY:
Iburg is a program that generates a fast tree parser. It is compatible with Burg. Both programs accept a cost-augmented tree grammar and emit a C program that discovers an optimal parse of trees in the language described by the grammar. They have been used to construct fast optimal instruction selectors for use in code generation. Burg uses BURS; Iburg's matchers do dynamic programming at compile time.
Iburg reads a burg specification and writes a matcher that does dynamic programming (DP) at compile time; the matcher is hard-coded.
Iburg was built to test early versions of what evolved into burg's specification language and interface, but it is useful in its own right because it is simpler and thus easier for novices to understand, because it allows dynamic cost computation, and because it admits a larger class of tree grammars.
The Iburg tree parser generator was mainly developed at Princeton University. It accepts Backus-Naur form specifications of tree grammars and generates C source code for grammar-specific tree parsers.
Figure 1 shows an extended BNF grammar for burg and iburg specifications. Grammar symbols are displayed in slanted type and terminal symbols are displayed in typewriter type. { X } denotes zero or more instances of X, and [X] denotes an optional X. Specifications consist of declarations, a %% separator, and rules. The declarations declare terminals (the operators in subject trees) and associate a unique, positive external symbol number with each one. Non-terminals are declared by their presence on the left side of rules. The %start declaration optionally declares a non-terminal as the start symbol. In Figure 1, term and nonterm denote identifiers that are terminals and non-terminals, respectively.
Rules define tree patterns in a fully parenthesized prefix form. Every non-terminal denotes a tree. Each operator has a fixed arity, which is inferred from the rules in which it is used. A chain rule is a rule whose pattern is another non-terminal. If no start symbol is declared, the non-terminal defined by the first rule is used. Each rule has a unique, positive external rule number, which comes after the pattern and is preceded by an "=". As described below, external rule numbers are used to report the matching rule to a user-supplied semantic action routine. Rules end with an optional non-negative integer cost; omitted costs default to zero.
Figure 2 shows a fragment of a burg specification for the VAX. This example uses upper-case for terminals and lower-case for non-terminals. Lines 1-2 declare the operators and their external symbol numbers, and lines 4-15 give the rules.
The external rule numbers correspond to the line numbers to simplify interpreting subsequent figures. In practice, these numbers are usually generated by a preprocessor that accepts a richer form of specification (e.g., including YACC-style semantic actions) and emits a burg specification. Only the rules on lines 4, 6, 7, and 9 have non-zero costs. The rules on lines 5, 9, 12, and 13 are chain rules.
The operators in Figure 2 are some of the operators in lcc's intermediate language. The operator names are formed by concatenating a generic operator name with a one-character type suffix like C, I, or P, which denote character, integer, and pointer operations, respectively. The operators used in Figure 2 denote integer addition (ADDI), forming the address of a local variable (ADDRLP), integer assignment (ASGNI), an integer constant (CNSTI), "widening" a character to an integer (CVCI), the integer 0 (I0I), and fetching a character (INDIRC). The rules show that ADDI and ASGNI are binary, CVCI and INDIRC are unary, and ADDRLP, CNSTI, and I0I are leaves.
MATCHING
Both versions of burg generate functions that the client calls to label and reduce subject trees. The labeling function, label(p), makes a bottom-up, left-to-right pass over the subject tree p, computing the rules that cover the tree with the minimum cost, if there is such a cover. Each node is labeled with (M, C) to indicate that "the pattern associated with external rule M matches the node with cost C."
Figure 3 shows the intermediate language tree for the assignment expression in the C fragment
{ int i; char c; i = c + 4; }
The left child of the ASGNI node computes the address of i. The right child computes the
address of c, fetches the character, widens it to an integer, and adds 4 to the widened value,
which the ASGNI assigns to i.
The other annotations in Figure 3 show the results of labeling. (M, C) denotes labels from matches and [M, C] denotes labels from chain rules. The rule from Figure 2 denoted by each M is also shown. Each C sums the costs of the non-terminals on the right-hand side and the cost of the relevant pattern or chain rule. For example, the pattern in line 11 of Figure 2 matches the node ADDRLP i with cost 0, so the node is labeled with (11, 0). Since this pattern denotes a disp, the chain rule in line 9 applies with a cost of 0 for matching a disp plus 1 for the chain rule itself. Likewise, the chain rules in lines 5 and 13 apply because the chain rule in line 9 denotes a reg.
Nodes are annotated with (M, C) only if C is less than all previous matches for the non-terminal on the left-hand side of rule M. For example, the ADDI node matches the disp pattern in line 10 of Figure 2, which means it also matches all rules with disp alone on the right-hand side, namely line 9. By transitivity, it also matches the chain rules in lines 5 and 13. But all three of these chain rules yield cost 2, which isn't better than previous matches for those non-terminals.
Once labeled, a subject tree is reduced by traversing it from the top down and performing appropriate semantic actions, such as generating and emitting code. Reducers are supplied by clients, but burg generates functions that assist in these traversals, e.g., one function that returns M and another that identifies subtrees for recursive visits. iburg does the dynamic programming at compile time and annotates nodes with data equivalent to (M, C). Its "state numbers" are really pointers to records that hold these data.
IMPLEMENTATION
iburg generates a state function that uses a straightforward implementation of tree pattern matching. It generates hard code instead of tables. Its "state numbers" are pointers to state records, which hold vectors of the (M, C) values for successful matches. The state record for the specification in Figure 2 is
struct state {
int op;
struct state *left, *right;
short cost[6];
short rule[6];
};
iburg also generates integer codes for the non-terminals, which index the cost and rule vectors:
#define stmt_NT 1
#define disp_NT 2
#define rc_NT 3
#define reg_NT 4
#define con_NT 5
By convention, the start non-terminal has value 1.
State records are cleared when allocated, and external rule numbers are positive. Thus, a non-zero value for p->rule[X] indicates that p's node matched a rule that defines non-terminal X.
CONCLUSION:
Hence, we have successfully studied iburg tool & how to generate target code using that tool.
REFERENCES
1. A. V. Aho, R. Sethi, J. D. Ullman, "Compilers: Principles, Techniques, and Tools", Pearson Education, ISBN 81-7758-590-8.
2. J. R. Levine, T. Mason, D. Brown, "Lex and Yacc", O'Reilly, 2000, ISBN 81-7366-061-X.
3. Anthony J. Dos Reis, "Compiler Construction Using Java, JavaCC and Yacc", Wiley, ISBN 978-0-470-94959-7.