THANDALAM – 602105
FACULTY PREPARATION PROGRAM – REVIEW REPORT (Rev 0.1 wef 01.11.2019)
PART -1
Name of the mentor: ___________________________ Subject Title & Code: ___________________
Names of the members handling the subject : ___________________________ Branch: ___________________
: ___________________________ Sem/Year: ___________________
1. Lesson Plan
Not prepared – to be prepared and submitted
Prepared as per the format and no changes recommended
Prepared but the following changes are recommended
Remarks & Recommendations:
2. Notes On Lesson
Number of units covered in the NOL
Percentage of coverage
Unit 1 Unit 2 Unit 3 Unit 4 Unit 5
Mission
To impart quality technical education imbibed with proficiency and humane values. To provide
the right ambience and opportunities for the students to develop into creative, talented and
globally competent professionals. To promote research and development in technology and
management for the benefit of the society.
Rajalakshmi Engineering College
Department of Information Technology
Vision
To be a Department of Excellence in Information Technology Education, Research and
Development.
Mission
To train the students to become highly knowledgeable in the field of Information Technology.
To promote continuous learning and research in core and emerging areas.
To develop globally competent students with strong foundations, who will be able to adapt to
changing technologies.
PROGRAMME EDUCATIONAL OBJECTIVES
PEO I
To provide essential background in Science, basic Electronics and applied Mathematics.
PEO II
To prepare the students with fundamental knowledge in programming languages and to
develop applications.
PEO III
To engage the students in life-long learning, enabling them to remain current in their
profession and to obtain additional qualifications that enhance their career positions in IT industries.
PEO IV
To enable the students to implement computing solutions for real world problems and carry out
basic and applied research leading to new innovations in Information Technology (IT) and
related interdisciplinary areas.
PEO V
To familiarize the students with the ethical issues in engineering profession, issues related to the
worldwide economy, nurturing of current job related skills and emerging technologies.
PROGRAMME OUTCOMES
PO1: Engineering knowledge: Apply the knowledge of mathematics, science,
engineering fundamentals, and an engineering specialization to the solution of complex
engineering problems.
PO2: Problem analysis: Identify, formulate, review research literature, and analyze
complex engineering problems reaching substantiated conclusions using first principles
of mathematics, natural sciences, and engineering sciences.
PO3: Design/development of solutions: Design solutions for complex engineering
problems and design system components or processes that meet the specified needs
with appropriate consideration for the public health and safety, and the cultural,
societal, and environmental considerations.
PO4: Conduct investigations of complex problems: Use research-based knowledge and
research methods including design of experiments, analysis and interpretation of data,
and synthesis of the information to provide valid conclusions.
PO5: Modern tool usage: Create, select, and apply appropriate techniques, resources,
and modern engineering and IT tools including prediction and modeling to complex
engineering activities with an understanding of the limitations.
PO6: The engineer and society: Apply reasoning informed by the contextual knowledge
to assess societal, health, safety, legal and cultural issues and the consequent
responsibilities relevant to the professional engineering practice.
PO7: Environment and sustainability: Understand the impact of the professional
engineering solutions in societal and environmental contexts, and demonstrate the
knowledge of, and need for sustainable development.
PO8: Ethics: Apply ethical principles and commit to professional ethics and
responsibilities and norms of the engineering practice.
PO9: Individual and team work: Function effectively as an individual, and as a member
or leader in diverse teams, and in multidisciplinary settings.
PO10: Communication: Communicate effectively on complex engineering activities with
the engineering community and with society at large, such as, being able to
comprehend and write effective reports and design documentation, make effective
presentations, and give and receive clear instructions.
PO11: Project management and finance: Demonstrate knowledge and understanding of
the engineering and management principles and apply these to one’s own work, as a
member and leader in a team, to manage projects and in multidisciplinary
environments.
PO12: Life-long learning: Recognize the need for, and have the preparation and ability to
engage in independent and life-long learning in the broadest context of technological
change.
LIST OF EXPERIMENTS
1 Study of Prolog.
2 Write simple facts for statements using PROLOG.
3 Write two predicates: one converts Centigrade temperatures to Fahrenheit, and the other checks
whether a temperature is below freezing.
4 Write a program to solve the Monkey Banana problem.
5 Write a program in Turbo Prolog for medical diagnosis and show the advantages and disadvantages
of green and red cuts.
6 Write a program to solve 4-Queen problem.
7 Write a program to solve the travelling salesman problem.
8 Write a program to solve water jug problem.
9 Write a python program to implement linear regression.
10 Write a python program for ML classification algorithms
a. Logistic Regression
b. Decision Tree
11 Write a python program to implement
a. K-Nearest Neighbor algorithm
b. SVM
12 Write a python program to implement a simple Neural Network.
Contact Hours : 30
Course Outcomes:
LESSON PLAN
COURSE OBJECTIVES
Session No | Proposed Date/Period | Actual Date/Period | Topics to be covered | Ref | Teaching Aids
Introduction to AI T1 PPT+Class Notes
Problem formulation T1 Class Notes
Problem Definition T1 Class Notes
Production systems T1 Class Notes
Control strategies T1 Class Notes
Search strategies T1 Class Notes
Problem characteristics T1 Class Notes
Production system characteristics T1 Class Notes
Specialized productions system T1 Class Notes
Problem solving methods T1 Class Notes
Problem graphs T1 Class Notes
Matching T1 Class Notes
Indexing T1 Class Notes
Heuristic functions T1 Class Notes
Hill Climbing T1 Class Notes
Depth first and Breadth first T1 Class Notes
Constraints satisfaction T1 Class Notes
Related algorithms, measure of performance and analysis of search algorithms T1 Class Notes
UNIT – 2 KNOWLEDGE REPRESENTATION AND INFERENCE
Session No | Proposed Date/Period | Actual Date/Period | Topics to be covered | Ref | Teaching Aids
Game playing T1 PPT+Class Notes
Knowledge representation using Predicate logic and calculus T1 PPT+Class Notes
Structured representation of knowledge T1 PPT+Class Notes
Production based system T1 PPT+Class Notes
Frame based system T1 PPT+Class Notes
Inference – Backward chaining T1 Class Notes
Forward chaining T1 Class Notes
Rule value approach T1 Class Notes
Fuzzy reasoning T1 Class Notes
Certainty factors T1 Class Notes
Bayesian Theory – Bayesian Networks T1 Class Notes
Dempster–Shafer theory T1 Class Notes
UNIT – 3 MACHINE LEARNING BASICS
Session No | Proposed Date/Period | Actual Date/Period | Topics to be covered | Ref | Teaching Aids
Learning T2 Class Notes
Designing a learning system T2 Class Notes
Perspectives and issues in machine learning T2 Class Notes
Concept Learning – as task T2 Class Notes
Concept Learning – as search T2 Class Notes
Types of Machine Learning T2 Class Notes
Supervised Learning T2 Class Notes
Regression T2 Class Notes
Classification. T2 Class Notes
Testing Machine Learning Algorithms – Overfitting T2 Class Notes
Training, Testing and Validation Sets, T2 Class Notes
The confusion Matrix, Accuracy Metrics, T2 Class Notes
ROC Curve T2 Class Notes
Unbalanced Datasets, Measurement Precision T2 Class Notes
UNIT – 4 NEURAL NETWORKS
Session No | Proposed Date/Period | Actual Date/Period | Topics to be covered | Ref | Teaching Aids
The Brain and the Neuron T2 Class Notes
Neural Networks T2 Class Notes
The Perceptron T2 Class Notes
Linear Separability T2 Class Notes
Linear Regression-Examples. T2 Class Notes
Unsupervised Learning T2 Class Notes
The K-means algorithm T2 Class Notes
Vector Quantization T2 Class Notes
The Self organizing feature map. T2 Class Notes
UNIT – 5 EXPERT SYSTEMS
Session No | Proposed Date/Period | Actual Date/Period | Topics to be covered | Ref | Teaching Aids
Expert systems T1 PPT+Class Notes
Architecture of expert systems T1 PPT+Class Notes
Roles of expert systems – T1 PPT+Class Notes
Knowledge Acquisition – T1 PPT+Class Notes
Meta knowledge, T1 PPT+Class Notes
Heuristics. T1 PPT+Class Notes
Typical expert systems – MYCIN T1 PPT+Class Notes
DART T1 PPT+Class Notes
XCON T1 PPT+Class Notes
Expert systems shells. T1 PPT+Class Notes
IT19643 - ARTIFICIAL INTELLIGENCE AND MACHINE LEARNING
ARTIFICIAL INTELLIGENCE
Definition: AI
Artificial Intelligence is the study of how to make computers do things, which, at the
moment, people do better.
According to the father of Artificial Intelligence, John McCarthy, it is “The science and
engineering of making intelligent machines, especially intelligent computer programs”.
From a business perspective AI is a set of very powerful tools, and methodologies for
using those tools to solve business problems.
AI Vocabulary
Intelligence relates to tasks involving higher mental processes, e.g. creativity, solving
problems, pattern recognition, classification, learning, induction, deduction, building
analogies, optimization, language processing, knowledge and many more. Intelligence is
the computational part of the ability to achieve goals.
AI Techniques depict how we represent, manipulate and reason with knowledge in order
to solve problems. Knowledge is a collection of ‘facts’. To manipulate these facts by a
program, a suitable representation is required. A good representation facilitates problem
solving.
Learning means that programs improve from the facts or behaviours they can represent. Learning
denotes changes in systems that are adaptive; in other words, it enables a system to do the
same task(s) more efficiently the next time.
Problems of AI:
Intelligence does not imply perfect understanding; every intelligent being has limited
perception, memory and computation. Many points on the spectrum of intelligence versus cost are
viable, from insects to humans. AI seeks to understand the computations required for intelligent
behaviour and to produce computer systems that exhibit intelligence. Aspects of intelligence
studied by AI include perception, communication using human languages, reasoning, planning,
learning and memory.
Task Domains of AI
Mundane Tasks:
Perception
-Vision
-Speech
Natural Languages
-Understanding
-Generation
-Translation
Common sense reasoning
Robot Control
Formal Tasks
Games: chess, checkers, etc.
Mathematics: Geometry, logic, Proving properties of programs
Expert Tasks:
Engineering (Design, Fault finding, Manufacturing planning)
Scientific Analysis
Medical Diagnosis
Financial Analysis
AI Technique:
Artificial Intelligence research during the last three decades has concluded that
intelligence requires knowledge. Although indispensable, knowledge possesses some less
desirable properties:
It is huge.
It is difficult to characterize correctly.
It is constantly varying.
It differs from data by being organized in a way that corresponds to its application.
It is complicated.
Knowledge captures generalizations: situations that share properties are grouped together,
rather than each being represented separately.
It can be understood by the people who must provide it. Even though for many programs the bulk of
the data comes automatically from readings, in many AI domains most of the knowledge a program
has must ultimately be supplied by people, in terms they understand.
It can be easily modified to correct errors and reflect changes in real conditions.
It can be widely used even if it is incomplete or inaccurate.
It can be used to help overcome its own sheer bulk, by helping to narrow the range of
possibilities that must usually be considered.
In order to characterize an AI technique let us consider two different problems and a series of
approaches for solving each of them.
1. Tic-Tac-Toe
2. Question Answering
Tic-Tac-Toe
The series of programs increases in complexity, in the use of generalizations, in the clarity of
their knowledge, and in the extensibility of their approach. In this way they move towards being
representations of AI techniques.
The board is stored as a nine-element vector BOARD, with 2 marking a blank square, 3 an X and 5 an O.
MAKE2 returns 5 if the centre square is blank; otherwise it returns any blank non-corner
square, i.e. 2, 4, 6 or 8.
POSSWIN(p) returns 0 if player p cannot win on the next move, and otherwise returns the
number of the square that gives a winning move. It checks each line by multiplying the values of
its three squares: a product of 3*3*2 = 18 means X can win, a product of 5*5*2 = 50 means O can
win, and the winning move is the blank square in that line.
GO (n) makes a move to square n setting BOARD[n] to 3 or 5.
This algorithm is more involved and takes longer, but it is more efficient in storage, which
compensates for the longer time. Its quality depends on the programmer's skill.
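As a small illustration, here is a Python sketch of the POSSWIN check just described (a hypothetical reconstruction, assuming the 2/3/5 encoding of blank/X/O noted above):

LINES = [(1, 2, 3), (4, 5, 6), (7, 8, 9),   # rows
         (1, 4, 7), (2, 5, 8), (3, 6, 9),   # columns
         (1, 5, 9), (3, 5, 7)]              # diagonals

def posswin(board, p):
    """Return the square where player p (3 for X, 5 for O) can win on the
    next move, or 0 if there is none. board is a 9-element list holding
    2 (blank), 3 (X) or 5 (O) for squares 1..9."""
    target = p * p * 2                     # 18 for X, 50 for O
    for line in LINES:
        values = [board[sq - 1] for sq in line]
        if values[0] * values[1] * values[2] == target:
            for sq in line:                # the winning move fills the blank
                if board[sq - 1] == 2:
                    return sq
    return 0

# Example: X (3) occupies squares 1 and 2, so X can win at square 3.
print(posswin([3, 3, 2, 2, 5, 2, 2, 5, 2], 3))   # -> 3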
In the third program, the data structure consists of BOARD (the nine-element vector), a list of
board positions that could result from the next move, and a number representing an estimate of
how likely the board position is to lead to an ultimate win for the player to move.
This algorithm looks ahead to make a decision on the next move by deciding which the most
promising move or the most suitable move at any stage would be and selects the same.
Consider all possible moves and replies that the program can make. Continue this process for
as long as time permits until a winner emerges, and then choose the move that leads to the
computer program winning, if possible in the shortest time.
This is actually the most difficult of the three approaches to program well, but the technique
extends as far as is needed for any game. The method places relatively few demands on the
programmer in terms of game-specific technique, but the overall game strategy must still be
known to the adviser.
Question Answering
Let us consider Question Answering systems that accept input in English and provide
answers also in English. This problem is harder than the previous one as it is more difficult to
specify the problem properly. Another area of difficulty concerns deciding whether the answer
obtained is correct, or not, and further what is meant by ‘correct’.
For example, consider the following situation:
Rani went shopping for a new coat. She found a red one she really liked.
When she got home, she found that it went perfectly with her favourite dress.
Questions
1. What did Rani go shopping for?
2. What did Rani find that she liked?
3. Did Rani buy anything?
Program 1:
A set of templates that match common questions and produce patterns used to match
against inputs. Templates and patterns are used so that a template that matches a given question is
associated with the corresponding pattern to find the answer in the input text. For example, the
template who did x y generates x y z if a match occurs and z is the answer to the question. The
given text and the question are both stored as strings.
Algorithm
Answering a question requires the following four steps to be followed:
Compare the template against the questions and store all successful matches to produce a
set of text patterns.
Pass these text patterns through a substitution process to change the person or voice and
produce an expanded set of text patterns.
Apply each of these patterns to the text; collect all the answers and then print the answers.
Reply with the set of answers just collected.
In Question 1 we use the template WHAT DID X Y, which generates Rani go shopping for z;
after substitution we get Rani goes shopping for z and Rani went shopping for z, giving
z ≡ a new coat.
In Question 2 we need a very large number of templates, and also a scheme to allow the
insertion of 'find' before 'that she liked', the insertion of 'really' in the text, and the substitution
of 'she' for 'Rani'; this gives the answer 'a red one'.
Question 3 cannot be answered.
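The following Python toy illustrates Program 1's template-and-pattern idea on this story. The template, substitution rule and regular expressions are hypothetical simplifications, far cruder than a real system:

import re

TEXT = "Rani went shopping for a new coat. She found a red one she really liked."

def answer_what_did(question, text):
    """Match 'What did X Y ...?' and search the text for 'X Y ... Z'."""
    m = re.match(r"What did (\w+) (\w+)\s*(.*)\?", question, re.IGNORECASE)
    if not m:
        return None
    x, y, rest = m.groups()
    # crude substitution step: allow simple tense changes of the verb
    verb = {"go": "(?:go|goes|went)"}.get(y, y)
    pattern = rf"{x} {verb} {rest}\s*(.+?)\."
    hit = re.search(pattern, text, re.IGNORECASE)
    return hit.group(1) if hit else None

print(answer_what_did("What did Rani go shopping for?", TEXT))  # -> a new coat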
Program 2:
A structure called English consists of a dictionary, grammar and some semantics about the
vocabulary we are likely to come across. This data structure provides the knowledge to convert
English text into a storable internal form and also to convert the response back into English. The
structured representation of the text is a processed form and defines the context of the input text
by making explicit all references such as pronouns.
Program 3:
World model contains knowledge about objects, actions and situations that are described
in the input text. This structure is used to create integrated text from input text. The diagram
shows how the system’s knowledge of shopping might be represented and stored. This
information is known as a script and in this case is a shopping script.
Algorithm
Convert the question to a structured form using both the knowledge contained in Program 2
and the World model, generating even more possible structures, since even more knowledge is
being used. Sometimes filters are introduced to prune the possible answers.
To answer a question, the scheme followed is:
Convert the question to a structured form as before but use the world model to resolve any
ambiguities that may occur.
The structured form is matched against the text and the requested segments of the question
are returned.
Are we trying to produce programs that do the tasks the same way that people do?
OR
Are we trying to produce programs that simply do the tasks the easiest way that is
possible?
Programs in the first class attempt to solve problems that a computer can easily solve and
do not usually use AI techniques. AI techniques usually include a search, since no direct method is
available; the use of knowledge about the objects involved in the problem area; and abstraction,
which allows an element of pruning to occur and enables a solution to be found in real time,
since otherwise the data could explode in size. An example of these trivial problems in the first
class, which are now of interest only to psychologists, is EPAM (Elementary Perceiver and Memorizer),
which memorized nonsense syllables.
The second class of problems attempts to solve problems that are non-trivial for a
computer and use AI techniques. We wish to model human performance on these:
1. To test psychological theories of human performance. Ex. PARRY a program to simulate
the conversational behavior of a paranoid person.
2. To enable computers to understand human reasoning – for example, programs that answer
questions based upon newspaper articles indicating human behavior.
3. To enable people to understand computer reasoning. Some people are reluctant to accept
computer results unless they understand the mechanisms involved in arriving at the results.
4. To exploit the knowledge gained by people who are best at gathering information. This
persuaded the earlier workers to simulate human behaviour in the SB (simulated behaviour) part
of AISB. Examples of this type of approach led to GPS (General Problem Solver).
PROBLEM FORMULATION
To solve the problem of playing a game, we require the rules of the game and targets for
winning as well as representing positions in the game. The opening position can be defined as the
initial state and a winning position as a goal state. Moves from initial state to other states leading
to the goal state are made legally. However, writing one rule for every possible position makes
the rules far too numerous in most games; in chess, the number of possible move sequences exceeds
the number of particles in the universe. Thus the rules cannot be supplied accurately, and computer
programs cannot handle them easily. Storage also
presents a problem, but searching can be helped by hashing.
The number of rules used must therefore be minimized, and the set can be created by
expressing each rule in as general a form as possible. The representation of games leads to a state space
representation and it is common for well-organized games with some structure. This
representation allows for the formal definition of a problem that needs the movement from a set of
initial positions to one of a set of target positions. It means that the solution involves using known
techniques and a systematic search. This is quite a common method in Artificial Intelligence.
Example: Play Chess
Fig: One legal Chess Move Fig: Another way to Describe Chess Move
Example: The Water Jug Problem
In this problem we use two jugs, called four and three; four holds a maximum of four
gallons of water and three a maximum of three gallons. How can we get exactly two gallons of
water into the four jug?
The state space is the set of ordered pairs giving the number of gallons of water in the
pair of jugs at any time, i.e. (four, three) where four = 0, 1, 2, 3 or 4 and three = 0, 1, 2 or 3.
The start state is (0, 0) and the goal state is (2, n), where n may be any value from 0 to 3,
since the three jug may hold any amount of water or be empty. Here 'three' and 'four' name
the jugs, and the numbers show the amount of water each jug holds.
The major production rules for solving this problem are shown below:
     Initial condition                      Goal                 Comment
1.   (four, three) if four < 4              (4, three)           fill four from tap
2.   (four, three) if three < 3             (four, 3)            fill three from tap
3.   (four, three) if four > 0              (0, three)           empty four into drain
4.   (four, three) if three > 0             (four, 0)            empty three into drain
5.   (four, three) if four + three <= 4     (four + three, 0)    empty three into four
6.   (four, three) if four + three <= 3     (0, four + three)    empty four into three
7.   (0, three) if three > 0                (three, 0)           empty three into four
8.   (four, 0) if four > 0                  (0, four)            empty four into three
9.   (0, 2)                                 (2, 0)               empty three into four
10.  (2, 0)                                 (0, 2)               empty four into three
11.  (four, three) if four < 4              (4, three - diff)    pour diff, 4 - four, into four from three
12.  (four, three) if three < 3             (four - diff, 3)     pour diff, 3 - three, into three from four
One solution, as a sequence of (four, three) states with the rule applied, is:
(0, 0) -> (0, 3) by rule 2 -> (3, 0) by rule 7 -> (3, 3) by rule 2 -> (4, 2) by rule 11 -> (0, 2) by rule 3 -> (2, 0) by rule 9.
Fig: Production Rules for the Water Jug Problem
The problem is solved by using the production rules in combination with an appropriate
control strategy, moving through the problem space until a path from an initial state to a goal state
is found. In this problem-solving process, search is the fundamental concept. For simple problems
it is easy to achieve this by hand, but there will be cases where this is far too difficult.
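A compact Python sketch of such a control strategy is breadth-first search over the (four, three) state space. The pour rules are folded into a successor function, so this is a minimal illustration rather than the full twelve-rule system above:

from collections import deque

def solve_water_jug(goal_four=2):
    """Breadth-first search from (0, 0) to any state with goal_four
    gallons in the four jug. States are (four, three) pairs."""
    start = (0, 0)
    parent = {start: None}
    queue = deque([start])
    while queue:
        four, three = queue.popleft()
        if four == goal_four:                     # goal test
            path, s = [], (four, three)
            while s is not None:                  # walk back to the start
                path.append(s)
                s = parent[s]
            return path[::-1]
        successors = [
            (4, three), (four, 3),                # fill a jug
            (0, three), (four, 0),                # empty a jug
            (min(4, four + three), max(0, three - (4 - four))),  # pour 3 -> 4
            (max(0, four - (3 - three)), min(3, four + three)),  # pour 4 -> 3
        ]
        for s in successors:
            if s not in parent:                   # first visit = shortest path
                parent[s] = (four, three)
                queue.append(s)
    return None

print(solve_water_jug())   # a shortest sequence of states ending with four == 2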
PRODUCTION SYSTEMS
Production systems provide search structures that form the core of many intelligent
processes. Hence it is useful to structure AI programs in a way that facilitates describing and
performing the search process. Do not be confused by other uses of the word production, such as
to describe what is done in factories. A production system consists of:
1. A set of rules, each consisting of a left-hand side and a right-hand side. The left-hand side (the
pattern) determines the applicability of the rule, and the right-hand side describes the operation to be
performed if the rule is applied.
2. One or more knowledge/databases that contain whatever information is appropriate for the
particular task. Some parts of the database may be permanent, while other parts of it may pertain
only to the solution of the current problem. The information in these databases may be structured
in any appropriate way.
3. A control strategy that specifies the order in which the rules will be compared to the database
and a way of resolving the conflicts that arise when several rules match at once.
4. A rule applier.
A simple maze example of this kind can be solved by the operator sequence UP, RIGHT, UP, LEFT, DOWN.
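A minimal Python skeleton of these four components (illustrative only): rules are (condition, action) pairs over a working database, and the control strategy here simply takes the first matching rule.

def run_production_system(database, rules, is_goal, max_steps=100):
    """rules: list of (condition, action) pairs, where condition(db) -> bool
    and action(db) -> new db. Conflict resolution: first match wins."""
    for _ in range(max_steps):
        if is_goal(database):
            return database
        applicable = [act for cond, act in rules if cond(database)]
        if not applicable:
            return None              # no rule applies: the system halts
        database = applicable[0](database)   # the rule applier
    return None                      # step limit reached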
Control Strategies
Heuristic Searches:
A heuristic is a method that
might not always find the best solution
but is guaranteed to find a good solution in reasonable time.
By sacrificing completeness it increases efficiency.
Heuristics are useful in solving tough problems which
could not be solved any other way, or whose
exact solutions would take an infinite time or a very long time to compute.
The classic example of heuristic search methods is the travelling salesman problem.
PROBLEM CHARACTERISTICS
A problem may have different aspects of representation and explanation. In order to
choose the most appropriate method for a particular problem, it is necessary to analyze the
problem along several key dimensions.
Heuristic search is a very general method applicable to a large class of problems. It
includes a variety of techniques. In order to choose an appropriate method, it is necessary to
analyze the problem with respect to the following considerations.
To use the heuristic search for problem solving, we suggest analysis of the problem for the
following considerations:
Decomposability of the problem into a set of independent smaller sub problems
Possibility of undoing solution steps, if they are found to be unwise
Predictability of the problem universe
Possibility of obtaining an obvious solution to a problem without comparison of all other
possible solutions
Type of the solution: whether it is a state or a path to the goal state
Role of knowledge in problem solving
Nature of solution process: with or without interacting with the user
Problem Decomposition
Suppose the problem is to evaluate a large, complicated expression.
This problem can be solved by breaking it into smaller problems, each of which we can
solve by using a small collection of specific rules. Using this technique of problem
decomposition, we can solve very large problems very easily. This can be considered as an
intelligent behaviour.
Method I:
1. Siva was a man.
2. Siva was born in 1905.
3. All men are mortal.
4. Now it is 2008, so Siva’s age is 103 years.
5. No mortal lives longer than 100 years.
Method II:
1. Siva is a worker in the company.
2. All workers in the company died in 1952.
Answer: Siva is not alive. Either method leads to this answer.
We are interested only in answering the question; it does not matter which path we follow. If we
follow one path successfully to the correct answer, then there is no reason to go back and check
another path to the solution.
Consider the sentence 'The bank president ate a dish of pasta salad with the fork.' Each component
of this sentence may have more than one interpretation. Some of the sources of ambiguity are the
following:
1. Bank may refer to a financial institution or to the side of a river.
2. Pasta salad is a salad containing pasta; by contrast, dog food does not contain dog.
We need to produce only the interpretation itself. No record of the processing by which
the interpretation was found is necessary.
Consider the water jug problem; the statement of solution to this problem must be a
sequence of operations that produce the final state.
What is the Role of Knowledge?
B: Boat
T: Tiger
G: Goat
Gr:Grass
Step 1:
According to the problem, this step will be (B, T, G, Gr), as the Boatman, the Tiger, the Goat and the
Grass are all on one side of the bank of the river. The states reachable from this state can be
enumerated as follows.
The states (B, T, O, O) and (B, O, O, Gr) are not allowed: if the Boatman takes the Tiger, the
Goat is left alone with the Grass, and if he takes the Grass, the Tiger is left alone with the Goat.
Step 2:
Now consider the current state of Step 1, i.e. the state (B, O, G, O).
This state is on the right side of the river, so the state on the left side is (O, T, O, Gr):
(B, O, G, O)        (O, T, O, Gr)
(Right)             (Left)
Step 3:
Now proceed according to the left and right sides of the river such that the condition of the
problem must be satisfied.
Step 4:
First, consider the first current state on the right side of step 3 i.e.
Now consider the second current state on the right side of step-3 i.e.
Step 5:
Step 6:
Step 7:
Hence the final state will be (B, T, G, Gr), all on the right side of the river.
Comments:
This problem requires a lot of space for its state implementation.
It takes a lot of time to search the goal node.
The production rules at each level of state are very strict.
Consider the chess game playing. Here the set of valid moves is very large. To reduce the size
of this set only useful moves are identified.
At the time of playing the game, the next move depends very much on the current
move. As the game goes on, there will be only a few moves applicable as the next
move.
Hence it will be a wasteful effort to check the applicability of all moves. Rather the important
and valid legal moves are directly stored as rules and through indexing the applicable rules
are found. Here the indexing will store the current board position.
The indexing makes the matching process easy, at the cost of a lack of generality in the
statement of rules.
Practically there is a trade-off between the ease of writing rules and simplicity of matching
process.
The indexing techniques are not very well suited for rule bases where the rules are written in
high-level predicates.
In PROLOG and many theorem proving systems, rules are indexed by predicates they
contain. Hence all the applicable rules can be indexed quickly.
2. Matching with variables:
In the rule base if the preconditions are not stated as exact descriptions of particular situation,
the indexing technique does not work well.
Rather, the preconditions describe properties that the situation must have. In situations
where a single condition is matched against a single element in the state description,
the unification procedure can be used. However, in practical situations it is required to find the
complete set of rules that match the current state.
In forward and backward chaining system, the depth first search technique is used to select
the individual rule. In the situations where multiple rules are applicable, conflict resolution
technique is used to choose appropriate rule.
In case of the situations requiring multiple matches, the unification can be applied
recursively, but a more efficient method is to use RETE matching algorithm.
3. Complex matching:
A more complex matching process is required when preconditions of a rule specify required
properties that are not stated explicitly in the description of current state.
However the real world is full of uncertainties and sometimes practically it is not possible to
define the rule in exact fashion.
The matching process becomes more complicated in the situation where preconditions
approximately match the current situations
E.g. a speech-understanding program must contain rules that map from a description of a
physical waveform to phones.
Because of the presence of noise the signal is so variable that there can be only an
approximate match between a rule that describes an ideal sound and the input that describes
an actual one.
Approximate matching is particularly difficult to deal with because, as we increase the
tolerance allowed in the match, new rules need to be written, which increases the number of
rules and the size of the main search process.
But approximate matching is nevertheless superior to exact matching in situations such as
speech understanding, where exact matching may result in no rule being matched and the search
process coming to a grinding halt.
SEARCH STRATEGIES
What is Search?
Search is the systematic examination of states to find path from the start/root state to the
goal state.
The set of possible states, together with operators defining their connectivity constitute the
search space.
The output of a search algorithm is a solution, that is, a path from the initial state to a state
that satisfies the goal test.
Search Tree
Having formulated some problems, we now need to solve them. This is done by a search through
the state space. A search tree is generated by the initial state and the successor function that together define
the state space. In general, we may have a search graph rather than a search tree, when the same state can
be reached from multiple paths.
Types of Search
There are three broad classes of search processes:
1) Uninformed- Blind Search
There is no specific reason to prefer one part of the search space to any other, in
finding a path from initial state to goal state.
Systematic, exhaustive Search
• Depth-first-search
• Breadth-first-search
2) Informed – Heuristic search - there is specific information to focus the search.
Hill climbing
Branch and bound
Best first
A*
3) Game playing – there are at least two partners opposing each other.
Minimax (α–β pruning)
Means-ends analysis
UNINFORMED SEARCH STRATEGIES
Uninformed Search Strategies have no additional information about states beyond that
provided in the problem definition.
Strategies that know whether one non-goal state is "more promising" than another are called
informed search or heuristic search strategies.
There are five uninformed search strategies as given below.
Breadth-first search
Uniform-cost search
Depth-first search
Depth-limited search
Iterative deepening search
BREADTH-FIRST SEARCH
Breadth-first search is a simple strategy in which the root node is expanded first, then all successors
of the root node are expanded next, then their successors, and so on. In general, all the nodes are
expanded at a given depth in the search tree before any nodes at the next level are expanded.
Breadth-first search is implemented by calling TREE-SEARCH with an empty fringe that is a first-in-
first-out (FIFO) queue, ensuring that the nodes that are visited first will be expanded first.
In other words, calling TREE-SEARCH (problem, FIFO-QUEUE()) results in breadth-first search.
The FIFO queue puts all newly generated successors at the end of the queue, which means that
shallow nodes are expanded before deeper nodes.
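In Python this amounts to a few lines. The sketch below is a schematic TREE-SEARCH: successors() and goal_test() are supplied by the problem, and, as in TREE-SEARCH, repeated states are not detected.

from collections import deque

def breadth_first_search(initial, successors, goal_test):
    fringe = deque([initial])             # FIFO queue: shallowest node first
    while fringe:
        node = fringe.popleft()
        if goal_test(node):
            return node
        fringe.extend(successors(node))   # new successors go to the back
    return None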
Fig: Breadth-first search on a simple binary tree. At each stage, the node to be expanded next is
indicated by a marker.
Properties of breadth-first-search
Assume every state has b successors. The root of the search tree generates b nodes at the first level,
each of which generates b more nodes, for a total of b^2 at the second level. Each of these generates b more
nodes, yielding b^3 nodes at the third level, and so on. Now suppose that the solution is at depth d. In the
worst case, we would expand all but the last node at level d, generating b^(d+1) - b nodes at level d+1. Then
the total number of nodes generated is
b + b^2 + b^3 + ... + b^d + (b^(d+1) - b) = O(b^(d+1)).
Every node that is generated must remain in memory, because it is either part of the fringe or is an
ancestor of a fringe node. The space complexity is, therefore, the same as the time complexity
Uniform-Cost Search
Instead of expanding the shallowest node, uniform-cost search expands the node n with the
lowest path cost.
Uniform-cost search does not care about the number of steps a path has, but only about their
total cost.
DEPTH-FIRST SEARCH
Depth-first search always expands the deepest node in the current fringe of the search tree.
Fig: Nodes that have been expanded and have no descendants in the fringe can be removed
from memory; these are shown in black. Nodes at depth 3 are assumed to have no
successors and M is the only goal node.
This strategy can be implemented by TREE-SEARCH with a last-in-first-out (LIFO)
queue, also known as a stack.
Depth-first-search has very modest memory requirements. It needs to store only a single
path from the root to a leaf node, along with the remaining unexpanded sibling nodes for
each node on the path. Once the node has been expanded, it can be removed from the
memory, as soon as its descendants have been fully explored (Refer Figure 1.13).
For a state space with a branching factor b and maximum depth m, depth-first search
requires storage of only bm + 1 nodes.
Drawback of Depth-first-search
The drawback of depth-first search is that it can make a wrong choice and get stuck going
down a very long (or even infinite) path when a different choice would lead to a solution near
the root of the search tree.
For example, depth-first-search will explore the entire left sub tree even if node C is a goal
node.
HEURISTIC SEARCH TECHNIQUES
The heuristic function is a way to inform the search about the direction to a goal. It provides an
informed way to guess which neighbour of a node will lead to a goal.
There is nothing magical about a heuristic function. It must use only information that can be
readily obtained about a node. Typically a trade-off exists between the amount of work it takes to
derive a heuristic value for a node and how accurately the heuristic value of a node measures the
actual path cost from the node to a goal.
A heuristic function, h(n), provides an estimate of the cost of the path from a given node to the
closest goal state; h(n) must be zero if n represents a goal state.
Example: Straight-line distance from current location to the goal location in a road navigation
problem.
A standard way to derive a heuristic function is to solve a simpler problem and to use the actual
cost in the simplified problem as the heuristic function of the original problem.
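For instance, the straight-line-distance heuristic for road navigation can be written as follows (a small sketch, assuming each node carries (x, y) coordinates in a dict):

import math

def straight_line_h(coords, goal):
    """coords: dict mapping node -> (x, y). Returns h, where h(n) is the
    Euclidean distance from n to the goal; it is zero when n is the goal."""
    gx, gy = coords[goal]
    return lambda n: math.hypot(coords[n][0] - gx, coords[n][1] - gy)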
Heuristic Search
Direct techniques (blind search) are not always feasible; they require too much time or memory.
Weak methods can be effective if applied correctly to the right kinds of tasks, and typically
require domain-specific information.
Generate-and-Test Strategy
The generate-and-test search algorithm is a very simple algorithm that is guaranteed to find
a solution if it is
done systematically and there exists a solution.
Algorithm: Generate-And-Test
1. Generate a possible solution.
2. Test to see if this is actually a solution, by comparing the chosen point (or the end point of the
chosen path) against the set of acceptable goal states.
3. If a solution has been found, quit; otherwise, return to step 1.
Local heuristic (for the blocks-world problem):
+1 for each block that is resting on the thing it is supposed to be resting on.
-1 for each block that is resting on the wrong thing.
Global heuristic:
For each block that has the correct support structure:
+1 to every block in the support structure.
For each block that has a wrong support structure:
-1 to every block in the support structure.
Simulated Annealing
The problem of local maxima is overcome in simulated annealing search. In ordinary hill-
climbing search, moves downhill are never made, so the search
may get stuck at a local maximum. Thus this search cannot guarantee complete solutions.
In contrast, a random search (or random movement) towards a successor chosen randomly from the
set of successors will be complete, but extremely inefficient. The combination of hill climbing
and random search, which yields both efficiency and completeness, is called simulated annealing.
The method was originally developed by analogy with the physical process of annealing;
hence the name simulated annealing.
In the simulated annealing search algorithm, instead of always picking the best move, a random move is
picked. The standard simulated annealing formulation uses the term objective function instead of heuristic
function. If the move improves the situation it is accepted; otherwise the algorithm accepts the
move with some probability less than 1.
This probability is
P = e^(-ΔE/kT)
where ΔE is the (positive) worsening of the energy level, T is the temperature and k is Boltzmann's
constant. As the equation indicates, the probability decreases with the badness of the move (the
amount ΔE by which the evaluation worsens). The schedule by which T is lowered is called the
annealing schedule, and a proper annealing schedule is maintained to control T.
This process has the following differences from hill climbing search:
• The annealing schedule is maintained.
• Moves to worse states are also accepted.
• In addition to current state, the best state record is maintained. The algorithm of simulated
annealing is presented as follows:
Algorithm: Simulated Annealing
1. Evaluate the initial state and mark it as the current state. If it is a goal state, return it and quit;
otherwise initialize the best state to the current state.
2. Initialize T according to annealing schedule.
3. Repeat the following until a solution is obtained or operators are not left:
a. Apply a yet-unapplied operator to produce a new state.
b. For the new state compute ΔE = value of current state - value of new state. If the new state is
the goal state then stop; if it is better than the current state, make it the current state and record
it as the best state.
c. If it is not better than the current state, then make it current state with probability P.
d. Revise T according to annealing schedule
4. Return best state as answer.
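A bare-bones Python rendering of this loop (value() and random_neighbor() are assumed problem-specific helpers; higher value is better, matching ΔE = value(current) - value(new)):

import math, random

def simulated_annealing(start, value, random_neighbor,
                        t0=1.0, cooling=0.95, steps=1000):
    current = best = start
    t = t0
    for _ in range(steps):
        nxt = random_neighbor(current)
        delta_e = value(current) - value(nxt)    # positive for a worse move
        if delta_e <= 0 or random.random() < math.exp(-delta_e / t):
            current = nxt                        # accept the move
            if value(current) > value(best):
                best = current                   # record the best state seen
        t *= cooling                             # revise T per the schedule
    return best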
Best-First Search:
It is a way of combining the advantages of both depth-first and breadth-first search into a
single method.
1. OR Graphs
Depth-first search is good because it allows a solution to be found without all competing
branches having to be expanded. Breadth-first search is good because it does not get trapped
on dead-end paths.
One way of combining the two is to follow a single path at a time, but switch paths whenever
some competing path looks more promising than the current one does.
At each step of the best-first search process, we select the most promising of the nodes we
have generated so far. This is done by applying an appropriate heuristic function to each of
them. We then expand the chosen node by using the rules to generate its successors.
But eventually, if a solution is not found, that branch will start to look less promising than
one of the top-level branches that had been ignored.
At that point, the now more promising, previously ignored branch will be explored.
But the old branch is not forgotten. Its last node remains in the set of generated but
unexpanded nodes.
Since node D is the most promising, it is expanded next, producing two successor nodes, E
and F. The heuristic function is then applied to them.
Now another path, that going through node B, looks more promising, so it is pursued,
generating nodes G and H.
But again when these new nodes are evaluated they look less promising than another path, so
attention is returned to the path through D to E. E is then expanded, yielding nodes I and J.
At the next step, J will be expanded, since it is the most promising. This process can continue
until a solution is found (see the figure below).
Best-first search is a general heuristic-based search technique. In the graph of the problem
representation, an evaluation function (which corresponds to a heuristic function) is attached to
every node. The value of the evaluation function may depend on the cost or distance of the current
node from the goal node. The decision of which node to expand next depends on the value of this
evaluation function. Best-first search can be understood from the following tree, in which the value
attached to a node indicates its utility value. The expansion of nodes according to best-first search
is illustrated in the following figure.
Fig: Tree getting expansion according to best first search
Here, at any step, the most promising node having least value of utility function is chosen for
expansion.
In the tree shown above, best first search technique is applied; however it is beneficial
sometimes to search a graph instead of tree to avoid the searching of duplicate paths. In the
process to do so, searching is done in a directed graph in which each node represents a point
in the problem space. This graph is known as OR-graph. Each of the branches of an OR
graph represents an alternative problem solving path.
Two lists of nodes are used to implement a graph search procedure discussed above. These are
1. OPEN: nodes that have been generated and have had the heuristic function
applied to them, but have not yet been examined.
2. CLOSED: these are the nodes that have already been examined. These nodes are kept in
the memory if we want to search a graph rather than a tree because whenever a node will
be generated, we will have to check whether it has been generated earlier.
Best-first search is a way of combining the advantages of both depth-first and breadth-first
search. Depth-first search is good because it allows a solution to be found without all
competing branches having to be expanded.
Breadth-first search is good because it does not get trapped on dead-end paths. Best-first search
combines the two by following a single path at a time, but switching between paths whenever
some competing path looks more promising than the current one does.
Hence at each step of the best-first search process, we select the most promising of the
successor nodes that have been generated so far.
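A minimal Python sketch of this process (graph maps a node to a list of (neighbor, cost) pairs and h is the heuristic; OPEN is a priority queue ordered by h alone, CLOSED a set):

import heapq

def best_first_search(graph, h, start, goal):
    open_list = [(h(start), start, [start])]       # OPEN, ordered by h
    closed = set()                                 # CLOSED: examined nodes
    while open_list:
        _, node, path = heapq.heappop(open_list)   # most promising OPEN node
        if node == goal:
            return path
        closed.add(node)
        for nbr, _cost in graph.get(node, []):
            if nbr not in closed:
                heapq.heappush(open_list, (h(nbr), nbr, path + [nbr]))
    return None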
f(n) = g(n) + h(n)
where f(n) = evaluation function,
g(n) = cost (or distance) of the current node from the start node,
h(n) = estimated cost (or distance) of the current node from the goal node.
In the A* algorithm the most promising node is chosen for expansion. The promising node is
decided based on the value of the heuristic function.
Normally the node having lowest value of f (n) is chosen for expansion. We must note that the
goodness of a move depends upon the nature of problem, in some problems the node having least
value of heuristic function would be most promising node, where in some situation, the node
having maximum value of heuristic function is chosen for expansion.
The A* algorithm maintains two lists. One stores the open nodes and the other maintains the list of
already expanded nodes. A* is an example of an optimal search algorithm.
A search algorithm is optimal if it has an admissible heuristic. An algorithm has an admissible
heuristic if its heuristic function h(n) never overestimates the cost to reach the goal. Admissible
heuristics are always optimistic: they estimate the cost of solving the problem to be less than it
actually is. The A* algorithm works as follows:
A* algorithm:
1. Place the starting node s on the OPEN list.
2. If OPEN is empty, stop and return failure.
3. Remove from OPEN the node n that has the smallest value of f*(n). If node n is a goal
node, return success and stop; otherwise continue.
4. Expand n, generating all of its successors n', and place n on CLOSED. For every
successor n', if n' is not already on OPEN, attach a back pointer to n, compute f*(n') and
place it on OPEN.
5. Each n' that is already on OPEN or CLOSED should have its back pointer adjusted to
reflect the lowest-f*(n') path. If an n' on CLOSED has its pointer changed, remove it from
CLOSED and place it on OPEN.
6. Return to step 2.
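The same machinery with f(n) = g(n) + h(n) gives a compact A* sketch in Python. This version assumes a consistent heuristic, so a node never needs to move back off CLOSED and step 5's reopening is omitted:

import heapq

def a_star(graph, h, start, goal):
    """graph: node -> [(neighbor, step_cost), ...]; h: admissible heuristic."""
    open_list = [(h(start), 0, start, [start])]    # entries: (f, g, node, path)
    closed = set()
    while open_list:
        f, g, node, path = heapq.heappop(open_list)
        if node == goal:
            return path, g
        if node in closed:
            continue                               # stale duplicate entry
        closed.add(node)
        for nbr, cost in graph.get(node, []):
            if nbr not in closed:
                g2 = g + cost
                heapq.heappush(open_list, (g2 + h(nbr), g2, nbr, path + [nbr]))
    return None, float("inf")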
Problem Reduction
Problem Reduction with AO* Algorithm
• When a problem can be divided into a set of sub problems, where each sub problem can be
solved separately and a combination of these will be a solution, AND-OR graphs or AND –
OR trees are used for representing the solution.
• The decomposition of the problem (problem reduction) generates AND arcs. One AND arc may
point to any number of successor nodes, all of which must be solved for the arc to point to a
solution. Several arcs may emerge from a single node, indicating several possible ways of
solving the problem.
• Hence the graph is known as an AND-OR graph rather than simply an AND graph. The figure
shows an AND-OR graph.
AO* Algorithm:
1. Let G consist only of the node representing the initial state; call this node INIT.
Compute h'(INIT).
2. Until INIT is labelled SOLVED or h'(INIT) becomes greater than FUTILITY,
repeat the following procedure:
(I) Trace the marked arcs from INIT and select an unexpanded node NODE.
(II) Generate the successors of NODE. If there are no successors, then assign
FUTILITY as h'(NODE); this means that NODE is not solvable. If there are
successors, then for each one, called SUCCESSOR, that is not also an ancestor of
NODE, do the following:
(a) Add SUCCESSOR to graph G
(b) If SUCCESSOR is a terminal node, mark it SOLVED and assign zero to its h' value.
(c) If SUCCESSOR is not a terminal node, compute its h' value.
(III) Propagate the newly discovered information up the graph as follows.
Let S be the set of nodes that have been marked SOLVED or whose h' values have changed.
Initialize S to NODE. Until S is empty, repeat the following procedure:
(a) Select a node from S, call it CURRENT, and remove it from S.
(b) Compute the cost of each of the arcs emerging from CURRENT. Assign the minimum
cost as CURRENT's new h' value.
(c) Mark the minimum-cost path as the best path out of CURRENT.
(d) Mark CURRENT SOLVED if all of the nodes connected to it through the
newly marked arc have been labelled SOLVED.
(e) If CURRENT has been marked SOLVED or its h' has just changed, its new
status must be propagated back up the graph. Hence all the ancestors of
CURRENT are added to S.
AO* Search Procedure
Constraint Satisfaction
The general problem is to find a solution that satisfies a set of constraints. Here the heuristics
are used to decide which node to expand next, not to estimate the distance to the
goal.
Examples of this technique are design problems, graph labelling, robot path planning and
cryptarithmetic puzzles.
In constraint satisfaction problems a set of constraints is available; this defines the search space.
The initial state is the set of constraints given originally in the problem description. A goal state
is any state that has been constrained 'enough'.
Constraint satisfaction is a two-step process:
1. First, constraints are discovered and propagated throughout the system.
2. Then, if there is still no solution, search begins: a guess is made and added as a new constraint,
and propagation then occurs with this new constraint.
Algorithm
1. Propagate available constraints:
Open all objects that must be assigned values in a complete solution.
Repeat until inconsistency or all objects are assigned valid values:
Select an object and strengthen as much as possible the set of constraints that apply to that object.
If this set of constraints differs from the previous set, then reopen all objects that share any of
these constraints. Remove the selected object.
If union of constraints discovered above defines a solution return solution.
If union of constraints discovered above defines a contradiction return failure.
Make a guess in order to proceed.
Repeat until a solution is found or all possible solutions exhausted:
Select an object with no assigned value and try to strengthen its constraints.
Recursively invoke constraint satisfaction with the current set of constraints plus the selected
strengthening constraint.
Cryptarithmetic puzzles are examples of constraint satisfaction problems, in which the goal is to
discover some problem state that satisfies a given set of constraints. Some cryptarithmetic
problems are shown below.
Here each decimal digit is to be assigned to each of the letters in such a way that the answer
to the problem is correct. If the same letter occurs more than once it must be assigned the
same digit each time. No two different letters may be assigned the same digit.
The puzzle SEND + MORE = MONEY, after solving, will appear like this:
Ans. 9567 + 1085 = 10652 (S=9, E=5, N=6, D=7, M=1, O=0, R=8, Y=2)
The heuristics and production rules are specific to the following example:
Heuristic Rules
1. If the sum of two n-digit operands yields an (n+1)-digit result, then the (n+1)th digit has to
be 1.
2. The sum of two digits may or may not generate a carry.
3. Whatever the operands, the carry can only be 0 or 1.
4. No two distinct letters may be assigned the same numeric code.
5. Whenever more than one solution appears to exist, the choice is governed by the
fact that no two letters can have the same number code.
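For a puzzle of this size the constraints can even be checked by brute force in Python (illustrative only; constraint propagation as described above is far more efficient):

from itertools import permutations

def solve_send_more_money():
    letters = "SENDMORY"                     # the eight distinct letters
    for digits in permutations(range(10), len(letters)):
        d = dict(zip(letters, digits))
        if d["S"] == 0 or d["M"] == 0:       # leading digits cannot be zero
            continue
        send = int("".join(str(d[c]) for c in "SEND"))
        more = int("".join(str(d[c]) for c in "MORE"))
        money = int("".join(str(d[c]) for c in "MONEY"))
        if send + more == money:
            return send, more, money
    return None

print(solve_send_more_money())               # -> (9567, 1085, 10652)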
Means – end Analysis
Means-ends analysis allows both backward and forward searching. This means we can solve the
major parts of a problem first and then return to the smaller problems when assembling the final
solution.
The means-ends analysis algorithm can be stated as follows:
1. Until the goal is reached or no more procedures are available:
Describe the current state, the goal state and the differences between the two.
Use the difference to select a procedure that will hopefully bring the state nearer to the goal.
Apply the procedure and update the current state.
2. If the goal is reached, announce success; otherwise announce failure.
For using means-ends analysis to solve a given problem, a mixture of the two directions,
forward and backward, is appropriate. Such a mixed strategy solves the major parts of a
problem first and then goes back and solves the small problems that arise by putting the big
pieces together.
The means-end analysis process detects the differences between the current state and goal
state. Once such difference is isolated an operator that can reduce the difference has to be
found.
The operator may or may not be applicable to the current state, so a sub-problem is set up of
getting to a state in which this operator can be applied.
Operator subgoaling is a kind of backward chaining in which operators are selected first and
then subgoals are set up to establish the preconditions of the
operators.
If the operator does not produce the goal state we want, then we have second sub problem of
getting from the state it does produce to the goal. The two sub problems could be easier to
solve than the original problem, if the difference was chosen correctly and if the operator
applied is really effective at reducing the difference. The means-end analysis process can
then be applied recursively.
This method depends on a set of rules that can transform one problem state into another.
These rules are usually not represented with complete state descriptions on each side.
Instead they are represented as a left side that describes the conditions that must be met for
the rules to be applicable and a right side that describes those aspects of the problem state
that will be changed by the application of the rule. A separate data structure called a
difference table indexes the rules by the differences that they can be used to reduce.
Means-Ends Analysis (MEA)
We have presented collection of strategies that can reason either forward or backward, but
for a given problem, one direction or the other must be chosen.
A mixture of the two directions is appropriate. Such a mixed strategy would make it possible
to solve the major parts of a problem first and then go back and solve the small problems
that arise in “gluing” the big pieces together.
The technique of Means-Ends Analysis allows us to do that.
Algorithm: Means-Ends Analysis
1. Compare CURRENT to GOAL. If there is no difference between them then return.
2. Otherwise, select the most important difference and reduce it by doing the following until
success or failure is signalled:
a) Select an as yet untried operator O that is applicable to the current difference. If there are no
such operators, then signal failure.
b) Attempt to apply O to CURRENT. Generate descriptions of two states: O-START, a state in
which O's preconditions are satisfied, and O-RESULT, the state that would result if O were
applied in O-START.
c) If (FIRST-PART <- MEA(CURRENT, O-START)) and (LAST-PART <- MEA(O-
RESULT, GOAL)) are successful, then signal success and return the result of concatenating
FIRST-PART, O, and LAST-PART.
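A schematic Python recursion for this algorithm. The problem supplies difference(), ops_for(), pre() and result(); there is no loop detection, so this is a sketch, not a robust planner:

def mea(current, goal, difference, ops_for, pre, result):
    """Return a list of operators transforming current into goal, or None."""
    diff = difference(current, goal)
    if diff is None:
        return []                              # states match: empty plan
    for op in ops_for(diff):                   # operators relevant to diff
        first = mea(current, pre(op), difference, ops_for, pre, result)
        if first is None:
            continue                           # cannot reach O-START
        last = mea(result(op), goal, difference, ops_for, pre, result)
        if last is None:
            continue                           # cannot bridge O-RESULT to goal
        return first + [op] + last             # FIRST-PART + O + LAST-PART
    return None                                # signal failure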
MEA: Operator Subgoaling
The MEA process centers around the detection of differences between the current state and the
goal state.
Once such a difference is isolated, an operator that can reduce the difference must be found.
If the operator cannot be applied to the current state, we set up a sub problem of getting to a
state in which it can be applied.
The kind of backward chaining in which operators are selected and then sub goals are set up
to establish the preconditions of the operators.
MEA: Household Robot Application
Operator          Preconditions                                 Results
PUSH(obj, loc)    At(robot, obj) ^ Large(obj) ^                 At(obj, loc) ^ At(robot, loc)
                  Clear(obj) ^ ArmEmpty
CARRY(obj, loc)   At(robot, obj) ^ Small(obj)                   At(obj, loc) ^ At(robot, loc)
GAME PLAYING
Introduction
Game playing is one of the oldest sub-fields of AI. It involves an abstract and pure
form of competition that seems to require intelligence. It is easy to represent the states and
actions, and very little world knowledge is required to implement game playing.
The most commonly used AI technique in games is search. Game-playing research has
contributed ideas on how to make the best use of time to reach good decisions.
Game playing is a search problem defined by:
Initial state of the game
Operators defining legal moves
Successor function
Terminal test defining end of game states
Goal test
Path cost/utility/payoff function
Most popular games are too complex to solve exactly, requiring the program to take its best guess.
For example, in chess the search tree has about 10^40 nodes (with a branching factor of about 35).
Uncertainty arises because of the opponent.
Characteristics of game playing
1. There are always “unpredictable” opponents:
The opponent introduces uncertainty
The opponent also wants to win
The solution for this problem is a strategy, which specifies a move for every possible opponent
reply.
2. Time limits:
Games are often played under strict time constraints (e.g. chess), and time must therefore be
handled very effectively.
There are special games where two players have exactly opposite goals. There are also
perfect information games (such as chess and go) where both the players have access to
the same information about the game in progress (e.g. tic-tac-toe).
In imperfect-information games (such as bridge, certain other card games, and games where
dice are used), not all information is available to each player. Given sufficient time and space, an
optimum solution can usually be obtained for the former by exhaustive search, though not for the latter.
Types of games
There are basically two types of games
Deterministic games
Chance games
Games like chess and checkers are perfect-information deterministic games, whereas games like
Scrabble and bridge are imperfect-information games. We will consider only two-player,
discrete, perfect-information games, such as tic-tac-toe, chess and checkers.
Two-player games are easier to imagine and reason about, and more common to play.
Minimax search procedure
Typical characteristic of the games is to look ahead at future position in order to succeed. There is a
natural correspondence between such games and state space problems.
In a game like tic-tac-toe
States-legal board positions
Operators-legal moves
Goal-winning position
The game starts from a specified initial state and ends in a position that can be declared a win
for one player and a loss for the other, or possibly a draw. A game tree is an explicit
representation of all possible plays of the game. We start with a 3-by-3 grid.
Then the two players take it in turns to place their marker on the board (one player uses the
'X' marker, the other uses the 'O' marker). The winner is the player who first gets three of
their markers in a row, e.g. if X wins.
Two-ply search
To play an entire game we need to combine search-oriented and non-search-oriented
techniques. The ideal way to use a search procedure to find a solution to a problem is to
generate moves through the problem space until a goal state is reached.
Unfortunately, for games like chess, even with a good plausible-move generator it is not
possible to search until a goal state is reached: in the amount of time available it is possible
to generate a tree at most 10 to 20 ply deep.
Then in order to choose the best move, the resulting board positions must be compared to
discover which is most advantageous. This is done using the static evaluation function.
The static evaluation function evaluates individual board positions by estimating how likely
they are eventually to lead to a win.
The minimax procedure is a depth-first, depth-limited search procedure.
If the limit of search has been reached, compute the static value of the current position
relative to the appropriate player (maximizing or minimizing), and report the result
(value and path).
If the level is a minimizing level (the minimizer's turn): generate the successors of the
current position, apply MINIMAX to each of them, and return the minimum of the results.
If the level is a maximizing level: generate the successors of the current position, apply
MINIMAX to each of them, and return the maximum of the results. The MINIMAX
algorithm uses the following procedures:
1. MOVEGEN(Pos)
It is the plausible-move generator; it returns a list of successors of 'Pos'.
2. STATIC(Pos, Depth)
The static evaluation function, which returns a number representing the goodness of 'Pos' from
the current point of view.
3. DEEP-ENOUGH
It returns TRUE if the search should be stopped at the current level; otherwise it returns FALSE.
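A minimal Python sketch of this procedure follows. MOVEGEN, STATIC and DEEP-ENOUGH are passed
in as functions (movegen, static, deep_enough), since their definitions are game-specific
assumptions of the sketch.

# Depth-first, depth-limited MINIMAX (illustrative sketch).
def minimax(pos, depth, maximizing, movegen, static, deep_enough):
    if deep_enough(pos, depth):              # search limit reached
        return static(pos, depth), [pos]     # report value and path
    successors = movegen(pos)
    if not successors:                       # no moves: treat as a leaf
        return static(pos, depth), [pos]
    best_value, best_path = None, None
    for succ in successors:
        value, path = minimax(succ, depth + 1, not maximizing,
                              movegen, static, deep_enough)
        better = (best_value is None
                  or (maximizing and value > best_value)
                  or (not maximizing and value < best_value))
        if better:
            best_value, best_path = value, [pos] + path
    return best_value, best_path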
A MINIMAX example
For example, the value for D is 6, which is the maximum value of its children, while the value for
C is 4 which is the minimum value of F and G. In this example the best sequence of moves found by
the maximizing/minimizing procedure is the path through nodes A, B, D and H, which is called the
principal continuation. The nodes on the path are denoted as PC (principal continuation) nodes.
For simplicity we can modify the game-tree values slightly and use only maximization
operations. The trick is to maximize the scores by negating the returned values from the
children instead of searching for minimum scores, and to estimate the values at the leaves from
the player's own viewpoint.
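A sketch of the same search in this negated, maximization-only (negamax) form, under the same
assumptions as the minimax sketch above, with static now evaluating from the point of view of
the player to move:

# Negamax: one maximization with child values negated (illustrative sketch).
def negamax(pos, depth, movegen, static, deep_enough):
    if deep_enough(pos, depth):
        return static(pos, depth)            # value for the player to move
    successors = movegen(pos)
    if not successors:
        return static(pos, depth)
    return max(-negamax(s, depth + 1, movegen, static, deep_enough)
               for s in successors)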
Alpha-beta cut-offs
The basic idea of alpha-beta cut-offs is that it is possible to compute the correct minimax
decision without looking at every node in the search tree. This is called pruning (allowing us
to ignore portions of the search tree that make no difference to the final choice).
The general principle of alpha-beta pruning is
Consider a node n somewhere in the tree, such that a player has a chance to move to that node.
If the player has a better choice m either at the parent node of n (or at any choice point
further up), then n will never be reached in actual play.
When we are doing a search with alpha-beta cut-offs, if a node's value is too high, the minimizer
will make sure it is never reached (by steering play down a path that yields a lower value).
Conversely, if a node's value is too low, the maximizer will make sure it is never reached.
This gives us the following definitions
o Alpha: the highest value that the maximizer can guarantee himself by making some
move at the current node OR at some node earlier on the path to this node.
o Beta: the lowest value that the minimizer can guarantee by making some move at the
current node OR at some node earlier on the path to this node.
The maximizer is constantly trying to push the alpha value up by finding better moves; the
minimizer is trying to push the beta value down. If a node’s value is between alpha and beta,
then the players might reach it.
At the beginning, at the root of the tree, we don’t have any guarantees yet about what values
the maximizer and minimizer can achieve. So we set beta to ∞ and alpha to -∞. Then as we
move down the tree, each node starts with beta and alpha values passed down from its parent.
Consider a situation in which the MIN children of a MAX node have been partially inspected.
Alpha-beta for a max node
At this point the "tentative" value backed up so far for F is 8. MAX is not interested in any
move that has a value of less than 8, since it is already known that 8 is the worst that MAX can
do so far. Thus node D and all its descendants can be pruned, i.e. excluded from further
exploration, since MIN would certainly go for a value of 3 rather than 8.
Fig: Partial Inspections of MIN Children
MIN is trying to minimize the game value. So far, the value 2 is the best available from
MIN's point of view. MIN will immediately reject node D, which can therefore be stopped from
further exploration.
In a game tree, each node represents a board position where one of the players gets to
choose a move. For example, in the figure below look at node C. As soon as we look at its
left child, we realize that if the players reach node C, the minimizer can hold the maximizer
below the utility 6 that the maximizer can already get by going to node B instead, so the
maximizer would never let the game reach C. Therefore we don't even have to look at C's
other children.
At a maximizer node, alpha is increased whenever a child's value is greater than the current
alpha value. Similarly, at a minimizer node, beta may be decreased. This is shown in the fig.
Fig: Tree with alpha-beta cut-offs
At each node, the alpha and beta values may be updated as we iterate over the node’s children.
At node E, when alpha is updated to a value of 8, it ends up exceeding beta. This is a point
where alpha-beta pruning applies: we know the minimizer would never let the game reach
this node, so we do not have to look at its remaining children. In fact, pruning happens exactly
when the alpha and beta bounds meet at the node's value.
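A minimal Python sketch of minimax with alpha-beta cut-offs, matching the description above;
movegen and static are assumed game-specific helpers, and the root call uses alpha = -infinity
and beta = +infinity:

# Minimax with alpha-beta pruning (illustrative sketch).
def alphabeta(pos, depth, alpha, beta, maximizing, movegen, static, limit):
    successors = movegen(pos)
    if depth == limit or not successors:
        return static(pos)
    if maximizing:
        value = float("-inf")
        for succ in successors:
            value = max(value, alphabeta(succ, depth + 1, alpha, beta,
                                         False, movegen, static, limit))
            alpha = max(alpha, value)    # maximizer pushes alpha up
            if alpha >= beta:            # cut-off: MIN avoids this node
                break
        return value
    value = float("inf")
    for succ in successors:
        value = min(value, alphabeta(succ, depth + 1, alpha, beta,
                                     True, movegen, static, limit))
        beta = min(beta, value)          # minimizer pushes beta down
        if beta <= alpha:                # cut-off: MAX avoids this node
            break
    return value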
Problem solving requires a large amount of knowledge and some mechanism for manipulating
that knowledge.
Knowledge and representation are distinct; they play central but distinguishable roles
in an intelligent system.
Knowledge is a description of the world; it determines a system's competence
by what it knows.
Representation is the way knowledge is encoded; it defines the
system's performance in doing something.
Facts are truths about the real world, i.e. what we represent. This can be regarded as
the knowledge level.
In simple words, we:
Need to know about things we want to represent, and
Need some means by which we can manipulate them.
The Mutilated Checkerboard Problem: “Consider a normal checker board from which two
squares, in opposite corners, have been removed. The task is to cover all the remaining
squares exactly with dominoes, each of which covers two squares. No overlapping, either of
dominoes on top of each other or of dominoes over the boundary of the mutilated board is
allowed. Can this task be done?”
Fig: Three representations of a Mutilated Checker Board
The dotted line on top indicates the abstract reasoning process that a program is
intended to model.
The solid lines on bottom indicate the concrete reasoning process that the program
performs.
Forward and Backward Representation
The forward and backward representations are elaborated below
KR System Requirements
A good knowledge representation enables fast and accurate access to knowledge and
understanding of the content.
A knowledge representation system should have the following properties.
Representational Adequacy – the ability to represent all kinds of knowledge that are
needed in that domain.
Inferential Adequacy – the ability to manipulate the representational structures to derive
new structures corresponding to new knowledge inferred from old.
Inferential Efficiency – the ability to incorporate additional information into the
knowledge structure that can be used to focus the attention of the inference mechanisms in
the most promising direction.
Acquisitional Efficiency – the ability to acquire new knowledge using automatic methods
wherever possible rather than relying on human intervention.
Note: To date, no single system optimizes all of the above properties.
2.2 KNOWLEDGE REPRESENTATION SCHEMES
There are four types of Knowledge representation:
Relational, Inheritable, Inferential, and Declarative/Procedural.
Relational Knowledge:
Provides a framework to compare two objects based on equivalent attributes.
Any instance in which two different objects are compared is a relational type of knowledge.
Inheritable Knowledge
Obtained from associated objects.
It prescribes a structure in which new objects are created which may inherit all or a subset
of attributes from existing objects.
Inferential Knowledge
Is inferred from objects through relations among objects.
e.g., a word alone is simple syntax, but with the help of the other words in a phrase the reader
may infer more from the word; this inference within linguistics is called semantics.
Declarative Knowledge
Statement in which knowledge is specified, but the use to which that knowledge is to be put
is not given.
e.g. laws, people's names; these are facts which can stand alone, not dependent on other
knowledge.
Procedural Knowledge
A representation in which the control information, to use the knowledge, is embedded in
the knowledge itself.
e.g. Computer programs, directions, and recipes; these indicate specific use or
implementation;
Relational Knowledge
This knowledge associates elements of one domain with another domain.
Relational knowledge is made up of objects consisting of attributes and their corresponding
associated values.
The results of this knowledge type are a mapping of elements among different domains.
The facts about a set of objects are put systematically in columns.
This representation provides little opportunity for inference.
Given only the facts, it is not possible to answer a simple question such as
"Who is the heaviest player?".
If a procedure for finding the heaviest player is provided, then these facts will enable that
procedure to compute an answer (a minimal sketch follows the table below).
We can ask things like who "bats left" and "throws right".
Player     Height   Weight   Bats – Throws
Aaron      6-0      180      Right – Right
Mays       5-10     170      Right – Right
Ruth       6-2      215      Left – Left
Williams   6-3      205      Left – Right
Table: Simple Relational Knowledge
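A minimal Python sketch of such a procedure over the relational facts in the table; the dict
encoding of the table is an assumption of this sketch:

# Relational facts from the table, as attribute-value pairs per player.
players = {
    "Aaron":    {"height": "6-0",  "weight": 180, "bats": "Right", "throws": "Right"},
    "Mays":     {"height": "5-10", "weight": 170, "bats": "Right", "throws": "Right"},
    "Ruth":     {"height": "6-2",  "weight": 215, "bats": "Left",  "throws": "Left"},
    "Williams": {"height": "6-3",  "weight": 205, "bats": "Left",  "throws": "Right"},
}

# With a procedure supplied, the facts can answer the question.
print(max(players, key=lambda p: players[p]["weight"]))   # Ruth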
Inheritable Knowledge
Here the knowledge elements inherit attributes from their parents.
The knowledge is embodied in the design hierarchies found in the functional, physical and
process domains. Within the hierarchy, elements inherit attributes from their parents, but in
many cases not all attributes of the parent elements need be prescribed to the child elements.
Inheritance is a powerful form of inference, but it is not adequate on its own; the basic KR
needs to be augmented with an inference mechanism.
KR in a hierarchical structure, shown below, is called a "semantic network" or a
collection of "frames" or a "slot-and-filler structure". The structure shows property
inheritance and a way for the insertion of additional knowledge.
Property inheritance: The objects or elements of specific classes inherit attributes
and values from more general classes. The classes are organized in a generalized
hierarchy.
Fig: Inferential Knowledge
From these three statements we can infer that:
“Wonder lives either on land or on water."
Note: If more information is made available about these objects and their relations, then more knowledge
can be inferred.
Declarative/Procedural Knowledge
The difference between declarative and procedural knowledge is not always clear-cut.
Declarative knowledge:
Here, the knowledge is based on declarative facts about axioms and domains.
Axioms are assumed to be true unless a counter example is found to invalidate them.
Domains represent the physical world and the perceived functionality.
Axioms and domains thus simply exist and serve as declarative statements that can stand
alone.
Procedural knowledge:
Here, the knowledge is a mapping process between domains that specifies "what to do when",
and the representation is of "how to make it" rather than "what it is". It may have inferential
efficiency, but no inferential adequacy or acquisitional efficiency.
Example: A parser for a natural language has the knowledge that a noun phrase may contain
articles, adjectives and nouns. It accordingly calls routines that know how to process
articles, adjectives and nouns.
Issues in Knowledge Representation
The fundamental goal of knowledge representation is to facilitate inference
(drawing conclusions) from knowledge.
The issues that arise while using KR techniques are many.
Some of these are explained below.
Important Attributes:
Are there any attributes of objects so basic that they occur in almost every problem
domain?
Relationship among attributes:
Are there any important relationships that exist among object attributes?
Choosing Granularity:
At what level of detail should the knowledge be represented? Suppose, for example, that we
represented the fact "Socrates is a man" by the single symbol SOCRATESMAN and "Plato is a man"
by PLATOMAN. Each would be a totally separate assertion, and we would not be able to draw any
conclusions about similarities between Socrates and Plato. It would be much better to
represent these facts as:
MAN(SOCRATES) MAN(PLATO)
since now the structure of the representation reflects the structure of the knowledge itself.
But to do that, we need to be able to use predicates applied to arguments. We are in even
more difficulty if we try to represent the equally classic sentence
All men are mortal.
We could represent this as:
MORTALMAN
But that fails to capture the relationship between any individual being a man and that
individual being a mortal. To do that, we really need variables and quantification unless we
are willing to write separate statements about the mortality of every known man.
Let's now explore the use of predicate logic as a way of representing knowledge by looking
at a specific example. Consider the following set of sentences:
1. Marcus was a man.
2. Marcus was a Pompeian.
3. All Pompeians were Romans.
4. Caesar was a ruler.
5. All Romans were either loyal to Caesar or hated him.
6. Everyone is loyal to someone.
7. People only try to assassinate rulers they are not loyal to.
8. Marcus tried to assassinate Caesar.
The facts described by these sentences can be represented as a set of wff's in predicate
logic as follows:
1. Marcus was a man.
man(Marcus)
Although this representation fails to represent the notion of past tense (which is clear in the
English sentence), it captures the critical fact of Marcus being a man. Whether this omission
is acceptable or not depends on the use to which we intend to put the knowledge.
2. Marcus was a Pompeian.
Pompeian(Marcus)
3. All Pompeians were Romans.
∀x: Pompeian(x) → Roman(x)
4. Caesar was a ruler.
ruler(Caesar)
Since many people share the same name, proper names are often not references to unique
individuals; this difficulty is overlooked here. Occasionally, deciding which of several
people of the same name is being referred to in a particular statement may require a fair
amount of additional knowledge and reasoning.
5. All Romans were either loyal to Caesar or hated him.
∀x: Roman(x) → loyalto(x, Caesar) ∨ hate(x, Caesar)
Here we have used the inclusive-or interpretation of the two types of or supported by the
English language. Some people will argue, however, that this English sentence is really
stating an exclusive-or. To express that we would have to write:
∀x: Roman(x) → [(loyalto(x, Caesar) ∨ hate(x, Caesar)) ∧
¬(loyalto(x, Caesar) ∧ hate(x, Caesar))]
6. Everyone is loyal to someone.
∀x: ∃y: loyalto(x, y)
The scope of quantifiers is a major problem that arises when trying to convert English sentences
into logical statements. Does this sentence say, as we have assumed in writing the logical formula
above, that for each person there exists someone to whom he or she is loyal, possibly a different
someone for everyone? Or does it say that there is someone to whom everyone is loyal?
Fig: An attempt to prove ¬loyalto(Marcus, Caesar)
Thus since each clause is a separate conjunct and since all the variables are universally
quantified, there need be no relationship between the variables of two clauses, even if they were
generated from the same wff.
Performing this final step of standardization is important because during the resolution
procedure it is sometimes necessary to instantiate a universally quantified variable (i.e., substitute
for it a particular value). But, in general, we want to keep clauses in their most general form as long
as possible. So when a variable is instantiated, we want to know the minimum number of
substitutions that must be made to preserve the truth value of the system.
After applying this entire procedure to a set of wff's, we will have a set of clauses, each of which is
a disjunction of literals. These clauses can now be exploited by the resolution procedure to generate
proofs
One way of viewing the resolution process is that it takes a set of clauses that are all
assumed to be true and based on information provided by the others, it generates new
clauses
that represent restrictions on the way each of those original clauses can be made true. A
contradiction occurs when a clause becomes so restricted that there is no way it can be true.
This is indicated by the generation of the empty clause.
Unification Algorithm
In propositional logic it is easy to determine that two literals cannot both be true at the same
time. Simply look for L and ~L. In predicate logic, this matching process is more complicated,
since bindings of variables must be considered.
For example, man(John) and ¬man(John) is a contradiction, while man(John) and ¬man(Himalayas)
is not. Thus in order to detect contradictions we need a matching procedure that compares
two literals and discovers whether there exists a set of substitutions that makes them identical.
There is a recursive procedure that does this matching. It is called the Unification algorithm.
In the Unification algorithm each literal is represented as a list, where the first element is
the name of the predicate and the remaining elements are arguments. An argument may be a single
element (atom) or may be another list. For example we can have literals such as
( tryassassinate Marcus Caesar)
( tryassassinate Marcus (ruler of Rome))
To unify two literals, first check whether their first elements are the same. If so, proceed;
otherwise they cannot be unified. For example, the literals
(tryassassinate Marcus Caesar)
(hate Marcus Caesar)
cannot be unified. The unification algorithm recursively matches pairs of elements, one pair at
a time. The matching rules are:
1. Different constants, functions or predicates cannot match, whereas identical ones can.
2. A variable can match another variable, any constant, or a function or predicate expression,
subject to the condition that the function or predicate expression must not contain any
instance of the variable being matched (otherwise it will lead to infinite recursion).
3. The substitution must be consistent. Substituting y for x now and then z for x later is
inconsistent. (A substitution of y for x is written y/x.)
The Unification algorithm is listed below as a procedure UNIFY(L1, L2). It returns a list
representing the composition of the substitutions that were performed during the match.
The empty list, NIL, indicates that a match was found without any substitutions; a list
consisting of the single value FAIL indicates that the unification procedure failed.
Algorithm: Unify(L1, L2)
1. If L1 or L2 are variables or constants, then:
(a) If L1 and L2 are identical, then return NIL.
(b) Else if L1 is a variable, then if L1 occurs in L2 then return {FAIL}, else return (L2/L1).
(c) Else if L2 is a variable, then if L2 occurs in L1 then return {FAIL}, else return (L1/L2).
(d) Else return {FAIL}.
2. If the initial predicate symbols in L1 and L2 are not identical, then return {FAIL}.
3. If L1 and L2 have a different number of arguments, then return {FAIL}.
4. Set SUBST to NIL. (At the end of this procedure, SUBST will contain all the substitutions used
to unify L1 and L2.)
5. For i ← 1 to number of arguments in L1:
(a) Call Unify with the ith argument of L1 and the ith argument of L2, putting the result in S.
(b) If S contains FAIL then return {FAIL}.
(c) If S is not equal to NIL then:
(i) Apply S to the remainder of both L1 and L2.
(ii) SUBST := APPEND(S, SUBST).
6. Return SUBST.
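A minimal Python rendering of this algorithm, with literals as lists whose first element is the
predicate name. The convention that variables are lower-case strings and constants are
capitalised is an assumption of this sketch, as is representing a substitution y/x as the
tuple (y, "x").

FAIL = ["FAIL"]                    # sentinel for a failed unification

def is_variable(term):
    return isinstance(term, str) and term[:1].islower()

def occurs_in(var, term):
    if var == term:
        return True
    return isinstance(term, list) and any(occurs_in(var, t) for t in term)

def substitute(term, binding):     # binding is a (value, variable) pair
    value, var = binding
    if term == var:
        return value
    if isinstance(term, list):
        return [substitute(t, binding) for t in term]
    return term

def unify(l1, l2):
    # Step 1: either argument is a variable or a constant.
    if not isinstance(l1, list) or not isinstance(l2, list):
        if l1 == l2:
            return []                                         # NIL
        if is_variable(l1):
            return FAIL if occurs_in(l1, l2) else [(l2, l1)]  # L2/L1
        if is_variable(l2):
            return FAIL if occurs_in(l2, l1) else [(l1, l2)]  # L1/L2
        return FAIL
    if l1[0] != l2[0]:             # Step 2: predicate symbols differ.
        return FAIL
    if len(l1) != len(l2):         # Step 3: arity differs.
        return FAIL
    args1, args2 = list(l1[1:]), list(l2[1:])
    subst = []                     # Step 4.
    while args1:                   # Step 5: match argument pairs.
        s = unify(args1.pop(0), args2.pop(0))
        if s is FAIL:
            return FAIL
        for b in s:                # apply S to the remainder of both
            args1 = [substitute(t, b) for t in args1]
            args2 = [substitute(t, b) for t in args2]
        subst = s + subst          # SUBST := APPEND(S, SUBST)
    return subst                   # Step 6.

For example, unify(["hate", "x", "Caesar"], ["hate", "Marcus", "y"]) returns
[("Caesar", "y"), ("Marcus", "x")].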
9. ∀x, y: persecute(x, y) → hate(y, x)
10. ∀x, y: hate(x, y) → persecute(y, x)
Converting to clause form, we get:
12. ¬persecute(x5, y2) ∨ hate(y2, x5)
13. ¬hate(x6, y3) ∨ persecute(y3, x6)
Procedural v/s Declarative Knowledge
A declarative representation is one in which knowledge is specified, but the use to which
that knowledge is to be put is not given.
A Procedural representation is one in which the control information that is necessary to
use the knowledge is considered to be embedded in the knowledge itself.
To use a procedural representation, we need to augment it with an interpreter that follows
the instructions given in the knowledge.
The difference between the declarative and the procedural views of knowledge lies in where
control information resides.
man(Marcus)
man(Caesar)
person(Cleopatra)
∀x: man(x) → person(x)
Now we want to extract from this knowledge base the answer to the question:
∃y: person(y)
Marcus, Caesar and Cleopatra can all be the answers.
Fig: Unsuccessful attempts at resolution
As there is more than one value that satisfies the predicate, but only one value is needed,
the answer depends on the order in which the assertions are examined during the search for
a response.
If we view the assertions as declarative, then we cannot say how they will be examined; if
we view them as procedural, then we can.
If we view these assertions as a non-deterministic program whose output is simply not
defined, then there is no difference between procedural and declarative statements. But most
machines do not behave this way; they stick to whatever method they have, whether sequential
or parallel.
The focus is on working on the control model.
man(Marcus)
man(Caesar)
∀x: man(x) → person(x)
person(Cleopatra)
If we view this as declarative then there is no difference from the previous statement. But
viewed procedurally, using the same control model that previously gave Cleopatra as the answer,
the answer is now Marcus.
The answer can thus vary by changing the way the interpreter works.
The distinction between the two forms is often very fuzzy. Rather than trying to prove which
technique is better, we should figure out the ways in which rule formalisms and interpreters
can be combined to solve problems.
Logic Programming
Logic programming is a programming-language paradigm in which logical assertions are
viewed as programs, e.g. PROLOG.
A PROLOG program is described as a series of logical assertions, each of which is a Horn Clause.
A Horn Clause is a clause that has at most one positive literal.
E.g. p and ¬p ∨ q are Horn Clauses.
The fact that PROLOG programs are composed only of Horn Clauses and not of arbitrary
logical expressions has two important consequences.
Because of uniform representation a simple & effective interpreter can be written.
The logic of Horn Clause systems is decidable.
PROLOG works by backward reasoning.
The program is read top to bottom, left to right and search is performed depth-first with
backtracking.
There are some syntactic differences between the logic and the PROLOG representations.
The key difference between the logic and PROLOG representations is that the PROLOG
interpreter has a fixed control strategy, so the assertions in a PROLOG program define a
particular search path to the answer to any question, whereas logical assertions define only
the set of answers that they justify; there can be more than one answer, and reasoning may
run forward or backward.
Control Strategy for PROLOG states that we begin with a problem statement, which is
viewed as a goal to be proved.
Look for the assertions that can prove the goal.
To decide whether a fact or a rule can be applied to the current problem,
invoke a standard unification procedure.
Reason backward from that goal until a path is found that terminates with assertions in the
program.
Consider paths using a depth-first search strategy and use backtracking.
Propagate to the answer by satisfying the conditions.
Forward v/s Backward Reasoning
The objective of any search is to find a path through a problem space from the initial state
to the final state.
There are two directions in which to search for the answer:
Forward
Backward
8-square problem
Reason forward from the initial states: Begin building a tree of move sequences that might
be solutions by starting with the initial configuration(s) at the root of the tree. Generate
the next level of the tree by finding all the rules whose left sides match the root node and
using the right sides to create the new configurations. Generate each node of the following
level by taking each node generated at the previous level and applying to it all of the rules
whose left sides match it. Continue.
Reason backward from the goal states: Begin building a tree of move sequences that might
be solutions by starting with the goal configuration(s) at the root of the tree. Generate the
next level of the tree by finding all the rules whose right sides match the root node and
using the left sides to create the new configurations. Generate each node of the following
level by taking each node generated at the previous level and applying to it all of the rules
whose right sides match it. Continue. This is also called Goal-Directed Reasoning.
To summarize, to reason forward, the left sides (pre-conditions) are matched against the
current state and the right sides (the results) are used to generate new nodes until the goal is
reached.
To reason backwards, the right sides are matched against the current node and the
left sides are used to generate new nodes.
A good system for the representation of structured knowledge in a particular domain should
possess the following four properties:
(i) Representational Adequacy:- The ability to represent all kinds of knowledge that are needed in that
domain.
(ii) Inferential Adequacy :- The ability to manipulate the represented structure and infer new
structures.
(iii) Inferential Efficiency:- The ability to incorporate additional information into the knowledge
structure that will aid the inference mechanisms.
(iv) Acquisitional Efficiency :- The ability to acquire new information easily, either by direct insertion or
by program control.
The techniques that have been developed in AI systems to accomplish these objectives fall under
two categories: declarative methods and procedural methods.
In practice most knowledge representations employ a combination of both. Most knowledge
representation structures have been developed to support programs that handle natural
language input. One of the reasons that knowledge structures are so important is that they
provide a way to represent information about commonly occurring patterns of things. Such
descriptions are sometimes called schemas. One definition of schema is:
“Schema refers to an active organization of the past reactions, or of past experience, which
must always be supposed to be operating in any well adapted organic response”.
By using schemas, people as well as programs can exploit the fact that the real world is not
random. There are several types of schemas that have proved useful in AI programs.
They include
(i) Frames:- Used to describe a collection of attributes that a given object possesses (eg:
description of a chair).
(ii) Scripts:- Used to describe common sequences of events (e.g. a restaurant scene).
(iv) Rule models:- Used to describe common features shared among a set of rules in a
production system.
Frames and scripts are used very extensively in a variety of AI programs. Before selecting
any specific knowledge representation structure, the following issues have to be considered.
(i) The basic properties of objects, if any, which are common to every problem domain must be
identified and handled appropriately.
A frame is a collection of attributes and associated values that describe some entity in the
world. Frames are general record like structures which consist of a collection of slots and
slot values. The slots may be of any size and type.
Slots typically have names and values or subfields called facets. Facets may also have names
and any number of values. A frame may have any number of slots; a slot may have any
number of facets, each with any number of values.
A slot contains information such as attribute value pairs, default values, condition for filling
a slot, pointers to other related frames and procedures that are activated when needed for
different purposes.
Sometimes a frame describes an entity in some absolute sense, sometimes it represents the
entity from a particular point of view. A single frame taken alone is rarely useful.
We build frame systems out of collections of frames that are connected to each other by
virtue of the fact that the value of an attribute of one frame may be another frame. Each
frame should start with an open parenthesis and end with a close parenthesis.
The object of a knowledge representation is to express knowledge in a computer tractable form, so that
it can be used to enable our AI agents to perform well.
1. Syntax The syntax of a language defines which configurations of the components of the language
constitute valid sentences.
2. Semantics The semantics defines which facts in the world the sentences refer to, and hence the
statement about the world that each sentence makes.
A good knowledge representation system for any particular domain should possess the following
properties:
1. Representational Adequacy – the ability to represent all the different kinds of knowledge that might
be needed in that domain.
2. Inferential Adequacy – the ability to manipulate the representational structures to derive new
structures (corresponding to new knowledge) from existing structures.
3. Inferential Efficiency – the ability to incorporate additional information into the knowledge structure
which can be used to focus the attention of the inference mechanisms in the most promising directions.
4. Acquisitional Efficiency – the ability to acquire new information easily. Ideally the agent
should be able to control its own knowledge acquisition, but direct insertion of information
by a 'knowledge engineer' would be acceptable.
In practice, the theoretical requirements for good knowledge representations can usually be achieved
by dealing appropriately with a number of practical requirements:
1. The representations need to be complete – so that everything that could possibly need to be
represented, can easily be represented.
3. They should make the important objects and relations explicit and accessible – so that it is easy to see
what is going on, and how the various components interact.
4. They should suppress irrelevant detail – so that rarely used details don't introduce
unnecessary complications, but are still available when needed.
5. They should expose any natural constraints – so that it is easy to express how one object or relation
influences another.
6. They should be transparent – so you can easily understand what is being said.
7. The implementation needs to be concise and fast – so that information can be stored, retrieved and
manipulated rapidly.
1. A set of rules of the form Ci → Ai, where Ci is the condition part and Ai is the action part.
The condition determines when a given rule is applied, and the action determines what happens
when it is applied.
2. One or more knowledge databases that contain whatever information is relevant for the given
problem. Some parts of the database may be permanent, while others may be temporary and only
exist during the solution of the current problem. The information in the databases may be
structured in any appropriate manner.
3. A control strategy that determines the order in which the rules are applied to the database, and
provides a way of resolving any conflicts that can arise when several rules match at once.
4. A rule applier which is the computational system that implements the control strategy and applies the
rules.
2. Production Systems are highly modular because the individual rules can be added, removed or
modified independently.
3. The production rules are expressed in a natural form, so the statements contained in the
knowledge base should be a recording of an expert thinking out loud.
One important disadvantage is the fact that it may be very difficult to analyse the flow of control
within a production system because the individual rules don’t call each other.
Production systems describe the operations that can be performed in a search for a solution to the
problem. They can be classified as follows.
A monotonic production system is a system in which the application of a rule never prevents the
later application of another rule that could also have been applied at the time the first rule
was selected.
A partially commutative production system is one in which, if the application of a particular
sequence of rules transforms state X into state Y, then any allowable permutation of those
rules also transforms state X into state Y.
Theorem proving falls under monotonic, partially commutative systems. Blocks world and 8-puzzle
problems are non-monotonic, partially commutative systems, while problems like chemical
analysis and synthesis come under monotonic, not partially commutative systems. Playing the
game of bridge comes under non-monotonic, not partially commutative systems.
For any problem, several production systems exist. Some will be more efficient than others.
Though it may seem that there is no relationship between kinds of problems and kinds of
production systems, in practice there is a definite relationship.
Partially commutative, monotonic production systems are useful for solving ignorable problems.
These systems are important from an implementation standpoint because they can be implemented
without the ability to backtrack to previous states when it is discovered that an incorrect
path has been followed. Such systems increase efficiency, since it is not necessary to keep
track of the changes made in the search process.
Non-monotonic, partially commutative systems are useful for problems in which changes occur but
can be reversed and in which the order of operations is not critical (e.g. the 8-puzzle).
Production systems that are not partially commutative are useful for many problems in which
irreversible changes occur, such as chemical analysis. When dealing with such systems, the
order in which operations are performed is very important, and hence correct decisions have to
be made the first time.
A frame is a data structure with typical knowledge about a particular object or concept. Frames
were first proposed by Marvin Minsky in the 1970s.
Each frame has its own name and a set of attributes associated with it. Name, weight, height
and age are slots in the frame Person. Model, processor, memory and price are slots in the
frame Computer. Each attribute or slot has a value attached to it.
Frames provide a natural way for the structured and concise representation of knowledge.
A frame provides a means of organising knowledge in slots to describe various attributes and
characteristics of the object.
Frames are an application of object-oriented programming for expert systems.
Object-oriented programming is a programming method that uses objects as a basis for analysis,
design and implementation.
In object-oriented programming, an object is defined as a concept, abstraction or thing with
crisp boundaries and meaning for the problem at hand. All objects have identity and are clearly
distinguishable. Michael Black, Audi 5000 Turbo, IBM Aptiva S35 are examples of objects.
An object combines both data structure and its behavior in a single entity. This is in sharp
contrast to conventional programming, in which data structure and the program behavior have
concealed or vague connections.
When an object is created in an object-oriented programming language, we first assign a name
to the object, then determine a set of attributes to describe the object’s characteristics, and at
last write procedures to specify the object’s behavior.
A knowledge engineer refers to such an object as a frame (a term which has become AI
jargon).
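A minimal Python sketch of a frame system: frames as dicts of slots, with an is_a slot providing
property inheritance. The particular frames and slot names are illustrative assumptions:

frames = {
    "Person": {"is_a": None,     "legs": 2},
    "Player": {"is_a": "Person", "team": None},
    "Aaron":  {"is_a": "Player", "height": "6-0", "weight": 180,
               "bats": "Right", "throws": "Right"},
}

def get_slot(name, slot):
    # Look the slot up in the frame itself, then follow is_a links upward.
    while name is not None:
        frame = frames[name]
        if frame.get(slot) is not None:
            return frame[slot]
        name = frame["is_a"]
    return None

print(get_slot("Aaron", "weight"))   # 180, stored on the frame itself
print(get_slot("Aaron", "legs"))     # 2, inherited from Person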
Inference
Two control strategies: forward chaining and backward chaining
Forward chaining:
Working from the facts to a conclusion. Sometimes called the data driven approach. To chain
forward, match data in working memory against 'conditions' of rules in the rule-base. When one of them
fires, this is liable to produce more data. So the cycle continues
Backward chaining:
Working from the conclusion to the facts. Sometimes called the goal-driven approach.
To chain backward, match a goal in working memory against 'conclusions' of rules in the rule-
base. When one of them fires, this is liable to produce more goals. So the cycle continues.
The choice of strategy depends on the nature of the problem. Assume the problem is to get
from facts to a goal (e.g. symptoms to a diagnosis).
Backward chaining is the best choice if:
The goal is given in the problem statement, or can sensibly be guessed at the beginning of the
consultation; or:
The system has been built so that it sometimes asks for pieces of data (e.g. "please now do the
gram test on the patient's blood, and tell me the result"), rather than expecting all the facts to
be presented to it.
This is because (especially in the medical domain) the test may be expensive, or unpleasant, or
dangerous for the human participant so one would want to avoid doing such a test unless there
was a good reason for it.
Forward chaining is the best choice if:
All the facts are provided with the problem statement; or:
There are many possible goals, and a smaller number of patterns of data; or:
There isn't any sensible way to guess what the goal is at the beginning of the consultation.
Note also that a backward-chaining system tends to produce a sequence of questions which seems
focused and logical to the user, while a forward-chaining system tends to produce a sequence
which seems random and unconnected. If it is important that the system should seem to behave
like a human expert, backward chaining is probably the best choice.
■ Given: A rule base contains the following rule set:
Rule 1: If A and C Then F
Rule 2: If A and E Then G
Rule 3: If B Then E
Rule 4: If G Then D
■ Problem: Prove that if A and B are true, then D is true.
Solution:
(i) Start with the inputs given: A and B are true. Begin at Rule 1 and go forward/down until a
rule that fires is found.
First iteration:
(ii) Rule 3 fires: conclusion E is true; new knowledge found.
(iii) No other rule fires; end of first iteration.
(iv) Goal not found; new knowledge found at (ii); go for a second iteration.
Second iteration:
(v) Rule 2 fires: conclusion G is true; new knowledge found.
(vi) Rule 4 fires: conclusion D is true; goal found.
Proved.
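A minimal Python sketch of this forward-chaining walk-through over the given rule set; the
tuple encoding of the rules is an assumption of the sketch:

rules = [
    ({"A", "C"}, "F"),    # Rule 1: If A and C Then F
    ({"A", "E"}, "G"),    # Rule 2: If A and E Then G
    ({"B"}, "E"),         # Rule 3: If B Then E
    ({"G"}, "D"),         # Rule 4: If G Then D
]

def forward_chain(facts, goal):
    facts = set(facts)
    while goal not in facts:
        fired = False
        for conditions, conclusion in rules:
            if conditions <= facts and conclusion not in facts:
                facts.add(conclusion)    # rule fires: new knowledge found
                fired = True
        if not fired:                    # no rule fired: goal unprovable
            return False
    return True

print(forward_chain({"A", "B"}, "D"))    # True, as derived above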
Backward chaining is a technique for drawing inferences from a rule base. Backward-chaining
inference is often called goal driven.
The algorithm proceeds from the desired goal, adding new assertions as they are found.
A backward-chaining system looks for the action in the THEN clause of the rules that matches
the specified goal.
Goal Driven
Example: Backward Chaining (using the same rule base as above, e.g. Rule 4: If G Then D)
■ Problem: Prove that if A and B are true, then D is true.
(i) Start with the goal, i.e. D is true, and go backward/up until a rule that concludes it is
found.
First iteration:
(ii) Rule 4 fires: new sub-goal, prove G is true; go backward.
(iii) Rule 2 fires: A is true (1st input ascertained); new sub-goal, prove E is true; go
backward.
(iv) No other rule fires; end of first iteration; new sub-goal found at (iii); go for a second
iteration.
Second iteration:
(v) Rule 3 fires: conclusion B is true (2nd input ascertained); both inputs A and B ascertained.
Proved.
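The same rule base queried by backward chaining can be sketched as below, reusing the rules
list from the forward-chaining sketch; note this naive version does no loop checking, so it
assumes the rule base is acyclic:

def backward_chain(goal, facts):
    if goal in facts:                    # the goal is a known input
        return True
    for conditions, conclusion in rules:
        if conclusion == goal and all(
                backward_chain(c, facts) for c in conditions):
            return True
    return False

print(backward_chain("D", {"A", "B"}))   # True: D <- G <- (A, E) <- B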
FUZZY LOGIC
Fuzzy Logic (FL) is a method of reasoning that resembles human reasoning. The approach of FL
imitates the way of decision making in humans that involves all intermediate possibilities between
digital values YES and NO.
The conventional logic block that a computer can understand takes precise input and produces a
definite output as TRUE or FALSE, which is equivalent to a human's YES or NO.
The inventor of fuzzy logic, Lotfi Zadeh, observed that unlike computers, the human decision
making includes a range of possibilities between YES and NO, such as –
The fuzzy logic works on the levels of possibilities of input to achieve the definite output.
Implementation
It can be implemented in systems with various sizes and capabilities ranging from small
micro-controllers to large, networked, workstation-based control systems.
It has four main parts, as shown below.
Fuzzification Module − It transforms the system inputs, which are crisp numbers, into fuzzy
sets. It splits the input signal into five levels:
LP – x is Large Positive
MP – x is Medium Positive
S – x is Small
MN – x is Medium Negative
LN – x is Large Negative
Knowledge Base − It stores IF-THEN rules provided by experts.
Inference Engine − It simulates the human reasoning process by making fuzzy inference on the
inputs and IF-THEN rules.
Defuzzification Module − It transforms the fuzzy set obtained by the inference engine into a crisp
value.
The triangular membership function shapes are most common among various other
membership function shapes such as trapezoidal, singleton, and Gaussian.
Here, the input to 5-level fuzzifier varies from -10 volts to +10 volts. Hence the corresponding
output also changes.
Example of a Fuzzy Logic System
Let us consider an air conditioning system with a 5-level fuzzy logic system. This system
adjusts the temperature of the air conditioner by comparing the room temperature and the target
temperature value.
Algorithm
Define linguistic variables and terms.
Construct membership functions for them.
Construct knowledge base of rules.
Convert crisp data into fuzzy data sets using membership functions. (fuzzification)
Evaluate rules in the rule base. (inference engine)
Combine results from each rule. (inference engine)
Convert output data into non-fuzzy values. (defuzzification)
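A minimal Python sketch of the fuzzification step with triangular membership functions for the
five levels above; the break-points chosen for the -10 V to +10 V input range are illustrative
assumptions:

def triangle(x, a, b, c):
    # Triangular membership: 0 at a and c, rising to 1 at the peak b.
    if x <= a or x >= c:
        return 0.0
    return (x - a) / (b - a) if x <= b else (c - x) / (c - b)

levels = {
    "LN": (-15, -10, -5),   # Large Negative
    "MN": (-10, -5, 0),     # Medium Negative
    "S":  (-5, 0, 5),       # Small
    "MP": (0, 5, 10),       # Medium Positive
    "LP": (5, 10, 15),      # Large Positive
}

def fuzzify(x):
    return {name: triangle(x, *abc) for name, abc in levels.items()}

print(fuzzify(3.0))   # S = 0.4 and MP = 0.6; all other levels 0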
Logic Development
Build a set of rules into the knowledge base in the form of IF-THEN-ELSE structures.
The key application areas of fuzzy logic are as given below.
Automotive Systems
Automatic Gearboxes
Four-Wheel Steering
Vehicle environment control
Consumer Electronic Goods
Hi-Fi Systems
Photocopiers
Still and Video Cameras
Television
Domestic Goods
Microwave Ovens
Refrigerators
Toasters
Vacuum Cleaners
Washing Machines
Environment Control
Air Conditioners/Dryers/Heaters
Humidifiers
Advantages of FLSs
Disadvantages of FLSs
Certainty Factor
A certainty factor (CF) is a numerical value that expresses a degree of subjective belief that a
particular item is true. The item may be a fact or a rule. When probabilities are used attention must be
paid to the underlying assumptions and probability distributions in order to show validity.
Bayes' rule can
be used to combine probability measures.
Suppose that a certainty is defined to be a real number between -1.0 and +1.0, where 1.0
represents complete certainty that an item is true and -1.0 represents complete certainty that an item is
false. Here a CF of 0.0 indicates that no information is available about either the truth or the falsity of an
item. Hence positive values indicate a degree of belief or evidence that an item is true, and negative
values indicate the opposite belief. Moreover it is common to select a positive number that represents a
minimum threshold of belief in the truth of an item. For example, 0.2 is a commonly chosen threshold
value.
Form of certainty factors in an ES:
IF <evidence>
THEN <hypothesis> {cf}
cf represents belief in hypothesis H given that evidence E has occurred. It is based on two
functions:
i) Measure of belief, MB(H, E)
ii) Measure of disbelief, MD(H, E)
These indicate the degree to which belief or disbelief in hypothesis H is increased if evidence
E is observed.
Total strength of belief and disbelief in a hypothesis:
CF(H, E) = MB(H, E) − MD(H, E)
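A minimal Python sketch of certainty-factor bookkeeping: the net certainty CF = MB − MD, plus
the commonly used MYCIN-style rule for combining two independent certainty factors for the
same hypothesis (the combination rule is a standard one, not stated in the text above):

def certainty(mb, md):
    # Total strength of belief and disbelief: CF(H, E) = MB(H, E) - MD(H, E).
    return mb - md

def combine(cf1, cf2):
    # Combine two independent CFs for the same hypothesis.
    if cf1 >= 0 and cf2 >= 0:
        return cf1 + cf2 * (1 - cf1)
    if cf1 < 0 and cf2 < 0:
        return cf1 + cf2 * (1 + cf1)
    return (cf1 + cf2) / (1 - min(abs(cf1), abs(cf2)))

print(combine(0.6, 0.4))   # 0.76: two supporting pieces of evidence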
Bayesian networks
Represent dependencies among random variables
Give a short specification of conditional probability distribution
Many random variables are conditionally independent
Simplifies computations
Graphical representation
DAG – causal relationships among random variables
Allows inferences based on the network structure
Definition of Bayesian networks
A BN is a DAG in which each node is annotated with quantitative probability information,
namely:
Nodes represent random variables (discrete or continuous)
Directed links X → Y: X has a direct influence on Y; X is said to be a parent of Y
Each node Xi has an associated conditional probability table, P(Xi | Parents(Xi)), that
quantifies the effects of the parents on the node
We can see that P(Xi | Xi−1, …, X1) = P(Xi | Parents(Xi)) provided Parents(Xi) ⊆ {X1, …, Xi−1}
The condition may be satisfied by labeling the nodes in an order consistent with the DAG
Intuitively, the parents of a node Xi must be all those nodes among X1, …, Xi−1 which have a
direct influence on Xi.
Pick a set of random variables that describe the problem
Pick an ordering of those variables
while there are still variables repeat
(a) choose a variable Xi and add a node associated to X i
(b) assign Parents(Xi) a minimal set of nodes that already exists in the network such that the
conditional independence property is satisfied
(c) define the conditional probability table for Xi
Because each node is linked only to previous nodes, the resulting network is a DAG.
P(MaryCalls | JohnCalls, Alarm, Burglary, Earthquake) = P(MaryCalls | Alarm)
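A minimal Python sketch of how this factorization answers joint-probability queries for the
burglary network: each node contributes its CPT entry given its parents. The numbers are the
usual textbook values for this example, quoted here only for illustration:

P_B, P_E = 0.001, 0.002                      # P(Burglary), P(Earthquake)
P_A = {(True, True): 0.95, (True, False): 0.94,
       (False, True): 0.29, (False, False): 0.001}   # P(Alarm | B, E)
P_J = {True: 0.90, False: 0.05}              # P(JohnCalls | Alarm)
P_M = {True: 0.70, False: 0.01}              # P(MaryCalls | Alarm)

def joint(b, e, a, j, m):
    # P(B, E, A, J, M) as a product of local conditional probabilities.
    p = (P_B if b else 1 - P_B) * (P_E if e else 1 - P_E)
    p *= P_A[(b, e)] if a else 1 - P_A[(b, e)]
    p *= P_J[a] if j else 1 - P_J[a]
    p *= P_M[a] if m else 1 - P_M[a]
    return p

print(joint(False, False, True, True, True))   # about 0.00063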
Dempster-Shafer Theory
Dempster-Shafer theory is an approach to combining evidence
Dempster (1967) developed means for combining degrees of belief derived from independent
items of evidence.
His student, Glenn Shafer (1976), developed a method for obtaining degrees of belief for one
question from subjective probabilities for a related question.
People working on Expert Systems in the 1980s saw this approach as ideally suited to such
systems.
Each fact has a degree of support, between 0 and 1:
0 – no support for the fact
1 – full support for the fact
Differs from Bayesian approach in that:
Belief in a fact and its negation need not sum to 1.
Both values can be 0 (meaning no evidence for or against the fact)
Belief in A:
The belief in an element A of the Power set is the sum of the masses of elements which are subsets of A
(including A itself). Given A={q1, q2, q3}
Bel(A) = m(q1)+m(q2)+m(q3)+ m({q1, q2})+m({q2, q3})+m({q1, q3})+m({q1, q2, q3})
Example
Given the mass assignments as assigned by the detectives:
Result:
Plausibility of A: pl(A)
The plausibility of an element A, pl(A), is the sum of all the masses of the sets that intersect
with the set A:
E.g. pl({B, J}) = m(B) + m(J) + m(B,J) + m(B,S) + m(J,S) + m(B,J,S) = 0.9
All results:
pl(A) = 1 − dis(A)
Belief Interval of A:
The certainty associated with a given subset A is defined by the belief interval
[bel(A), pl(A)].
E.g. the belief interval of {B, S} is [0.1, 0.8].
Belief Intervals:
Belief intervals allow Dempster-Shafer theory to reason about the degree of certainty or
uncertainty of our beliefs.
A small difference between belief and plausibility shows that we are certain about our
belief.
A large difference shows that we are uncertain about our belief.
However, even a zero-width interval does not mean we know which conclusion is right, just
how probable it is!
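A minimal Python sketch of belief and plausibility over the frame {B, J, S}. The mass values
below are illustrative assumptions (they reproduce pl({B, J}) = 0.9 but are not the detectives'
actual assignments, which are not reproduced in the text above):

masses = {
    frozenset("B"): 0.1, frozenset("J"): 0.2, frozenset("S"): 0.1,
    frozenset("BJ"): 0.1, frozenset("BS"): 0.1, frozenset("JS"): 0.3,
    frozenset("BJS"): 0.1,
}

def bel(a):
    # Sum the masses of all subsets of A (including A itself).
    a = frozenset(a)
    return sum(m for s, m in masses.items() if s <= a)

def pl(a):
    # Sum the masses of all sets that intersect A.
    a = frozenset(a)
    return sum(m for s, m in masses.items() if s & a)

print(bel("BJ"), pl("BJ"))   # belief interval of {B, J}: [0.4, 0.9] here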
UNIT –III MACHINE LEARNING BASICS
LEARNING
Learning is what gives us flexibility in our life; the fact that we can adjust and adapt to
new circumstances, and learn new tricks.
The important parts of animal learning are remembering, adapting, and generalising:
recognising that last time we were in this situation (saw this data) we tried out some particular
action (gave this output) and it worked (was correct), so we’ll try it again, or it didn’t work, so
we’ll try something different.
The last word, generalising, is about recognising similarity between different situations,
so that things that applied in one place can be used in another. This is what makes learning
useful, because we can use our knowledge in lots of different places.
MACHINE LEARNING
Machine learning is about making computers modify or adapt their actions (whether these
actions are making predictions, or controlling a robot) so that these actions get more accurate,
where accuracy is measured by how well the chosen actions reflect the correct ones.
Imagine that you are playing a game against a computer. You might beat it every time in
the beginning, but after lots of games it starts beating you, until finally you never win. Either you
are getting worse, or the computer is learning how to win
Several algorithms are known for finding the weights of a linear function that minimize E
defined in this way. In our case, we require an algorithm that will incrementally refine the
weights as new training examples become available and that will be robust to errors in these
estimated training values. One such algorithm is called the least mean squares, or LMS, training
rule. The LMS algorithm adjusts the weights a small amount for each observed training example
⟨b, Vtrain(b)⟩ in the direction that reduces the error on that example:
Use the current weights to calculate Ṽ(b).
For each weight wi, update it as wi ← wi + η (Vtrain(b) − Ṽ(b)) xi
where η is a small constant that moderates the size of the weight update and xi is the value of
the ith board feature.
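A minimal Python sketch of one LMS update, assuming each board state is encoded as a feature
vector x with x[0] = 1 for the constant weight w0, and eta is a small learning rate:

def lms_update(weights, x, v_train, eta=0.1):
    v_hat = sum(w * xi for w, xi in zip(weights, x))     # current estimate
    error = v_train - v_hat
    return [w + eta * error * xi for w, xi in zip(weights, x)]

weights = [0.0] * 7                                      # w0 .. w6
weights = lms_update(weights, [1, 3, 0, 1, 0, 0, 0], v_train=100)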
The Final Design
The final design of our checkers learning system can be naturally described by four distinct
program modules that represent the central components in many learning systems. These four
modules are as follows:
Fig: Summary of choices in designing the checkers learning program
The Performance System is the module that must solve the given performance task, in
this case playing checkers, by using the learned target function(s). It takes an instance of a new
problem (new game) as input and produces a trace of its solution (game history) as output. In our
case, the strategy used by the Performance System to select its next move at each step is
determined by the learned Ṽ evaluation function. Therefore, we expect its performance to
improve as this evaluation function becomes increasingly accurate.
The Critic takes as input the history or trace of the game and produces as output a set of
training examples of the target function. As shown in the diagram, each training example in this
case corresponds to some game state in the trace, along with an estimate Vtrain of the target
function value for this example.
The Generalizer takes as input the training examples and produces an output hypothesis
that is its estimate of the target function. It generalizes from the specific training examples,
hypothesizing a general function that covers these examples and other cases beyond the training
examples. In our example, the Generalizer corresponds to the LMS algorithm, and the output
hypothesis is the function Ṽ described by the learned weights w0, …, w6.
The Experiment Generator takes as input the current hypothesis (currently learned
function) and outputs a new problem (i.e., initial board state) for the Performance System to
explore. Its role is to pick new practice problems that will maximize the learning rate of the
overall system. In our example, the Experiment Generator follows a very simple strategy: It
always proposes the same initial game board to begin a new game. More sophisticated strategies
could involve creating board positions designed to explore particular regions of the state space.
The sequence of design choices made for the checkers program is summarized in the
following Figure. These design choices have constrained the learning task in a number of ways.
We have restricted the type of knowledge that can be acquired to a single linear evaluation
function.
We have constrained this evaluation function to depend on only the six specific board
features provided. If the true target function V can indeed be represented by a linear combination
of these particular features, then our program has a good chance to learn it. If not, then the best
we can hope for is that it will learn a good approximation, since a program can certainly never
learn anything that it cannot at least represent.
The problem of automatically inferring the general definition of some concept, given examples
labeled as members or nonmembers of the concept, is commonly referred to as concept learning,
or approximating a boolean-valued function from examples.
Table: Positive and negative training examples for the target concept Enjoy Sport.
The most general hypothesis-that every day is a positive example-is represented by
(?, ?, ?, ?, ?, ?)
and the most specific possible hypothesis-that no day is a positive example-is represented by
(Ø,Ø,Ø,Ø,Ø,Ø)
To summarize, the Enjoy Sport concept learning task requires learning the set of days for which
Enjoy Sport = yes, describing this set by a conjunction of constraints over the instance attributes.
Notation
The set of items over which the concept is defined is called the set of instances, which we
denote by X. In the current example, X is the set of all possible days, each represented by the
attributes Sky, AirTemp, Humidity, Wind, Water, and Forecast. The concept or function to be
learned is called the target concept, which we denote by c. In general, c can be any
boolean-valued function defined over the instances X; that is, c : X → {0, 1}. In the current
example, the target concept corresponds to the value of the attribute EnjoySport (i.e., c(x) = 1
if EnjoySport = Yes, and c(x) = 0 if EnjoySport = No).
Now consider the sets of instances that are classified positive by h1 and by h2. Because
h2 imposes fewer constraints on the instance, it classifies more instances as positive. In fact,
any instance classified positive by h1 will also be classified positive by h2. Therefore, we say
that h2 is more general than h1. This intuitive "more general than" relationship between
hypotheses can be defined more precisely as follows.
First, for any instance x in X and hypothesis h in H, we say that x satisfies h if and only if
h(x) = 1. We now define the more_general_than_or_equal_to relation in terms of the sets of
instances that satisfy the two hypotheses: given hypotheses hj and hk, hj is
more_general_than_or_equal_to hk if and only if any instance that satisfies hk also satisfies hj.
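A minimal Python sketch of this relation for attribute-vector hypotheses like those of the
EnjoySport task, where "?" matches anything and "0" (the empty constraint) matches nothing; the
tuple encoding is an assumption of the sketch:

def satisfies(x, h):
    # x satisfies h iff every constraint in h is "?" or equals x's value.
    return all(hc == "?" or hc == xc for xc, hc in zip(x, h))

def more_general_or_equal(hj, hk):
    # hj >=g hk: any instance that satisfies hk also satisfies hj.
    return all(cj == "?" or cj == ck or ck == "0"
               for cj, ck in zip(hj, hk))

h1 = ("Sunny", "?", "?", "Strong", "?", "?")
h2 = ("Sunny", "?", "?", "?", "?", "?")
print(more_general_or_equal(h2, h1))   # True: h2 imposes fewer constraints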
Supervised learning
A training set of examples with the correct responses (targets) is provided and, based on
this training set, the algorithm generalises to respond correctly to all possible inputs. This is also
called learning from exemplars.
Unsupervised learning
Correct responses are not provided, but instead the algorithm tries to identify similarities
between the inputs so that inputs that have something in common are categorised together. The
statistical approach to unsupervised learning is known as density estimation.
Reinforcement learning
This is somewhere between supervised and unsupervised learning. The algorithm gets
told when the answer is wrong, but does not get told how to correct it. It has to explore and try
out different possibilities until it works out how to get the answer right. Reinforcement learning
is sometimes called learning with a critic because of this monitor that scores the answer, but does
not suggest improvements.
Evolutionary learning
Biological evolution can be seen as a learning process: biological organisms adapt to
improve their survival rates and chance of having offspring in their environment. We’ll look at
how we can model this in a computer, using an idea of fitness, which corresponds to a score for
how good the current solution is.
SUPERVISED LEARNING
The webpage example is a typical problem for supervised learning. There is a set of data (the
training data) that consists of a set of input data that has target data, which is the answer that the
algorithm should produce, attached. This is usually written as a set of data (xi, ti), where the
inputs are xi, the targets are ti, and the i index suggests that we have lots of pieces of data,
indexed by i running from 1 to some upper limit N.
REGRESSION
Suppose that you were given the following datapoints and asked to tell me the value of the
output (which we will call y, since it is not a target datapoint) when x = 0.44.
Fig :Top left: A few datapoints from a sample problem. Bottom left: Two possible ways
to predict the values between the known datapoints: connecting the points with straight lines, or
using a cubic approximation (which in this case misses all of the points). Top and bottom right:
Two more complex approximators (see the text for details) that pass through the points, although
the lower one is rather better than the top.
Since the value x = 0.44 isn’t in the examples given, you need to find some way to predict
what value it has. You assume that the values come from some sort of function, and try to find
out what the function is. Then you’ll be able to give the output value y for any given value of x.
This is known as a regression problem in statistics: fit a mathematical function describing a
curve, so that the curve passes as close as possible to all of the datapoints. It is generally a
problem of function approximation or interpolation, working out the value between values that
we know.
The top-left plot shows a plot of the 7 values of x and y in the table, while the other plots
show different attempts to fit a curve through the datapoints. The bottom-left plot shows two
possible answers found by using straight lines to connect up the points, and also what happens if
we try to use a cubic function (something that can be written as y = ax^3 + bx^2 + cx + d). The top-
right plot shows what happens when we try to match the function using a different polynomial,
this time of the form
and finally the bottom-right plot shows the function y = 3 sin(5x). Which of these
functions would you choose? What our machine learning algorithms do is interpolate between
datapoints. This might not seem to be intelligent behaviour, or even very difficult in two
dimensions, but it is rather harder in higher dimensional spaces.
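To make the interpolation idea concrete, here is a minimal sketch (not the author's code) that fits
a cubic to a handful of datapoints with NumPy and predicts y at x = 0.44; the datapoints are
invented for illustration, since the original table is not reproduced:

    import numpy as np

    # Hypothetical sample datapoints standing in for the table of 7 values.
    x = np.array([0.0, 0.15, 0.3, 0.5, 0.65, 0.8, 1.0])
    y = np.array([0.0, 0.95, 1.0, -0.6, -1.0, -0.3, 0.9])

    # Fit a cubic polynomial y = ax^3 + bx^2 + cx + d by least squares.
    coeffs = np.polyfit(x, y, deg=3)

    # Evaluate the fitted curve at the unseen point x = 0.44.
    prediction = np.polyval(coeffs, 0.44)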
CLASSIFICATION
The classification problem consists of taking input vectors and deciding which of N
classes they belong to, based on training from exemplars of each class. The most important point
about the classification problem is that it is discrete—each example belongs to precisely one
class, and the set of classes covers the whole possible output space.
Example: Coin classifier
When the coin is pushed into the slot, the machine takes a few measurements of it. These
could include the diameter, the weight, and possibly the shape, and are the features that will
generate our input vector.
Our input vector will have three elements, each of which will be a number showing
the measurement of that feature (choosing a number to represent the shape would involve
an encoding, for example that 1=circle, 2=hexagon, etc.).
There are many other features that we could measure. If our vending machine included an
atomic absorption spectroscope, then we could estimate the density of the material and its
composition, or if it had a camera, we could take a photograph of the coin and feed that image
into the classifier.
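As a sketch of how such measurements might be turned into an input vector (the feature values
and shape encoding below are assumptions for illustration, not from the original):

    import numpy as np

    SHAPE_CODES = {'circle': 1, 'hexagon': 2}  # hypothetical encoding of the shape feature

    def coin_features(diameter_mm, weight_g, shape):
        # The three-element input vector described in the text.
        return np.array([diameter_mm, weight_g, SHAPE_CODES[shape]])

    x = coin_features(24.75, 5.0, 'circle')  # one input vector for the classifier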
Fig: The New Zealand coins.
Fig: Left: A set of straight line decision boundaries for a classification problem. Right: An
alternative set of decision boundaries that separate the plusses from the lightning strikes better,
but requires a line that isn’t straight.
For example,
If we tried to separate coins based only on colour, we wouldn’t get very far, because the
20 ¢ and 50 ¢ coins are both silver and the $1 and $2 coins both bronze. If we use colour and
diameter, we can do a pretty good job of the coin classification problem for NZ coins. There are
some features that are entirely useless.
For example,
Knowing that the coin is circular doesn’t tell us anything about NZ coins, which are all
circular. The above Figure shows a set of 2D inputs with three different classes shown, and two
different decision boundaries; on the left they are straight lines, and are therefore simple, but
don’t categorise as well as the non-linear curve on the right.
Overfitting
Training, Testing and Validation Sets
The Confusion Matrix
Accuracy Metrics
ROC Curve
Unbalanced Datasets
Measurement Precision
Overfitting
The number of degrees of variability in most machine learning algorithms is huge — for
a neural network there are lots of weights, and each of them can vary. This is undoubtedly more
variation than there is in the function we are learning, so we need to be careful: if we train for too
long, then we will overfit the data, which means that we have learnt about the noise and
inaccuracies in the data as well as the actual function. The following figure shows this by
plotting the predictions of some algorithm (as the curve) at two different points in the learning
process.
Fig: The effect of overfitting is that rather than finding the generating function (as shown on
the left), the neural network matches the inputs perfectly, including the noise in them (on the
right). This reduces the generalisation capabilities of the network.
On the left of the figure the curve fits the overall trend of the data well (it has generalised to
the underlying general function), but the training error would still not be that close to zero since
it passes near, but not through, the training data.
As the network continues to learn, it will eventually produce a much more complex model
that has a lower training error (close to zero), meaning that it has memorised the training
examples, including any noise component of them, so that it has overfitted the training data.
We want to stop the learning process before the algorithm overfits, which means that we
need to know how well it is generalising at each timestep. We can’t use the training data for this,
because we wouldn’t detect overfitting, but we can’t use the testing data either, because we’re
saving that for the final tests.
So we need a third set of data to use for this purpose, which is called the validation set because
we’re using it to validate the learning so far. This is known as cross-validation in statistics. It is
part of model selection: choosing the right parameters for the model so that it generalises as well
as possible.
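A minimal sketch of such a split (the 50:25:25 ratio and array names are assumptions for
illustration):

    import numpy as np

    def split_data(inputs, targets, seed=0):
        # Shuffle the datapoints so that the ordering in the file doesn't matter.
        rng = np.random.default_rng(seed)
        order = rng.permutation(len(inputs))
        inputs, targets = inputs[order], targets[order]

        # Half for training, a quarter for validation, a quarter for testing.
        n = len(inputs)
        train, valid = int(0.5 * n), int(0.75 * n)
        return (inputs[:train], targets[:train],
                inputs[train:valid], targets[train:valid],
                inputs[valid:], targets[valid:])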
Training, Testing, and Validation Sets
The area of semi-supervised learning attempts to deal with this need for large amounts of
labelled data.
Fig: The dataset is split into different sets, some for training, some for validation, and some for
testing.
If you are really short of training data, so that holding out a separate validation set raises
the worry that the algorithm won't be sufficiently trained, then it is possible to perform leave-
some-out, multi-fold cross-validation.
The idea is shown in following figure The dataset is randomly partitioned into K subsets,
and one subset is used as a validation set, while the algorithm is trained on all of the others.
Fig: Leave-some-out, multi-fold cross-validation gets around the problem of data shortage by
training many models. It works by splitting the data into sets, training a model on most sets and
holding one out for validation (and another for testing). Different models are trained with different
sets being held out.
A different subset is then left out and a new model is trained on that subset, repeating the
same process for all of the different subsets. Finally, the model that produced the lowest
validation error is tested and used. We’ve traded off data for computation time, since we’ve had
to train K different models instead of just one. In the most extreme case of this there is leave-
one-out cross-validation, where the algorithm is validated on just one piece of data, training on
all of the rest.
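A minimal sketch of leave-some-out, multi-fold cross-validation (the train_fn and error_fn
callables and the K value are placeholders, not from the original):

    import numpy as np

    def kfold_indices(n_data, k, seed=0):
        # Randomly partition the datapoint indices into K roughly equal subsets.
        rng = np.random.default_rng(seed)
        return np.array_split(rng.permutation(n_data), k)

    def cross_validate(inputs, targets, k, train_fn, error_fn):
        folds = kfold_indices(len(inputs), k)
        best_model, best_error = None, np.inf
        for i, valid_idx in enumerate(folds):
            # Train on all folds except the held-out validation fold.
            train_idx = np.concatenate([f for j, f in enumerate(folds) if j != i])
            model = train_fn(inputs[train_idx], targets[train_idx])
            error = error_fn(model, inputs[valid_idx], targets[valid_idx])
            if error < best_error:  # keep the model with the lowest validation error
                best_model, best_error = model, error
        return best_model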
The confusion matrix is a nice simple idea: make a square matrix that contains all the
possible classes in both the horizontal and vertical directions and list the classes along the top of
a table as the predicted outputs, and then down the left-hand side as the targets.
For example, the element of the matrix at (i, j) tells us how many input patterns were put
into class i in the targets, but class j by the algorithm. Anything on the leading diagonal (the
diagonal that starts at the top left of the matrix and runs down to the bottom right) is a correct
answer. Suppose that we have three classes: C1,C2, and C3. Now we count the number of times
that the output was class C1 when the target was C1, then when the target was C2, and so on
until we’ve filled in the table:
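An illustrative table consistent with the description (these counts are invented for illustration,
with the targets down the side and the predicted classes along the top):

            C1   C2   C3
    C1       5    1    0
    C2       1    4    1
    C3       2    0    4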
This table tells us that, for the three classes, most examples were classified correctly, but
two examples of class C3 were misclassified as C1, and so on. For a small number of classes this
is a nice way to look at the outputs. If you just want one number, then it is possible to divide the
sum of the elements on the leading diagonal by the sum of all of the elements in the matrix,
which gives the fraction of correct responses. This is known as the accuracy, and we are about to
see that it is not the last word in evaluating the results of a machine learning algorithm.
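A minimal sketch of building a confusion matrix and the accuracy from predicted and target
class labels (the array names and values are assumptions):

    import numpy as np

    def confusion_matrix(targets, outputs, n_classes):
        # cm[i, j] counts patterns whose target is class i but output is class j.
        cm = np.zeros((n_classes, n_classes), dtype=int)
        for t, o in zip(targets, outputs):
            cm[t, o] += 1
        return cm

    targets = np.array([0, 0, 1, 1, 2, 2, 2])
    outputs = np.array([0, 0, 1, 2, 2, 2, 0])
    cm = confusion_matrix(targets, outputs, 3)

    # Accuracy: sum of the leading diagonal over the sum of all elements.
    accuracy = np.trace(cm) / np.sum(cm)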
Accuracy Metrics
We can do more to analyse the results than just measuring the accuracy. If you consider
the possible outputs of the classes, then they can be arranged in a simple chart like this (where a
true positive is an observation correctly put into class 1, while a false positive is an observation
incorrectly put into class 1, while negative examples (both true and false) are those put into class
2):
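The chart has the standard form (reconstructed here, with the true class along the top and the
predicted class down the side):

                          Actually class 1      Actually class 2
    Classified class 1    True Positive (TP)    False Positive (FP)
    Classified class 2    False Negative (FN)   True Negative (TN)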
The entries on the leading diagonal of this chart are correct and those off the diagonal are
wrong, just as with the confusion matrix. Note, however, that this chart and the concepts of false
positives, etc., are based on binary classification.
Accuracy is then defined as the sum of the number of true positives and true negatives divided
by the total number of examples (where # means ‘number of’, and TP stands for True Positive,
etc.):

Accuracy = (#TP + #TN) / (#TP + #FP + #TN + #FN)
The problem with accuracy is that it doesn’t tell us everything about the results, since it
turns four numbers into just one. There are two complementary pairs of measurements that can
help us to interpret the performance of a classifier, namely sensitivity and specificity, and
precision and recall. Their definitions are shown next, followed by some explanation.
Sensitivity (also known as the true positive rate) is the ratio of the number of correctly
classified positive examples to the total number of actual positive examples, while specificity is
the same ratio for negative examples. Precision is the ratio of correctly classified positive
examples to the total number of examples classified as positive, while recall is the ratio of
correctly classified positive examples to the number of actual positive examples, which is the
same as sensitivity:

Sensitivity = #TP / (#TP + #FN)
Specificity = #TN / (#TN + #FP)
Precision = #TP / (#TP + #FP)
Recall = #TP / (#TP + #FN) = Sensitivity
Sensitivity and specificity sum the columns for the denominator, while precision and
recall sum the first row and the first column, and so miss out some information about how well
the learner does on the negative examples.
If you consider precision and recall, then you can see that they are to some extent
inversely related, in that if the number of false positives increases (meaning that the algorithm is
using a broader definition of that class), then the number of false negatives often decreases, and
vice versa. They can be combined to give a single measure, the F1 measure, which can be
written in terms of precision and recall as:

F1 = 2 × (Precision × Recall) / (Precision + Recall)

and in terms of the numbers of false positives, etc. (from which it can be seen that it computes
the mean of the false examples) as:

F1 = #TP / (#TP + (#FN + #FP)/2)
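A sketch computing these measures from the four counts (a minimal illustration, not the
author's code):

    def classification_metrics(tp, fp, tn, fn):
        sensitivity = tp / (tp + fn)   # true positive rate; identical to recall
        specificity = tn / (tn + fp)   # the same ratio for the negative examples
        precision = tp / (tp + fp)
        f1 = tp / (tp + (fn + fp) / 2)
        return sensitivity, specificity, precision, f1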
We can also compare classifiers – either the same classifier with different learning
parameters, or completely different classifiers. For this, the Receiver Operator Characteristic
curve (almost always known just as the ROC curve) is useful. This is a plot of the percentage of
true positives on the y axis against false positives on the x axis.
Fig: An example of an ROC curve. The diagonal line represents exactly chance, so
anything above the line is better than chance, and the further from the line, the better. Of
the two curves shown, the one that is further away from the diagonal line would represent
a more accurate method.
An example is shown in the above figure. A single run of a classifier produces a single
point on the ROC plot, and a perfect classifier would be a point at (0, 1) (100% true positives,
0% false positives), while the anti-classifier that got everything wrong would be at (1,0); so the
closer to the top-left-hand corner the result of a classifier is, the better the classifier has
performed. Any classifier that sits on the diagonal line from (0,0) to (1,1) behaves exactly at the
chance level (assuming that the positive and negative classes are equally common) and so
presumably a lot of learning effort is wasted since a fair coin would do just as well.
In order to compare classifiers, or choices of parameter settings for the same classifier,
you could just compute the point that is furthest from the ‘chance’ line along the diagonal.
However, it is normal to compute the area under the curve (AUC) instead. If you only have one
point for each classifier, the curve is the trapezoid that runs from (0,0) up to the point and then
from there to (1,1). If there are more points (based on more runs of the classifier, such as trained
and/or tested on different datasets), then they are just included in order along the diagonal line.
The key to getting a curve rather than a point on the ROC curve is to use cross validation.
If you use 10-fold cross-validation, then you have 10 classifiers, with 10 different test sets, and
you also have the ‘ground truth’ labels. The true labels can be used to produce a ranked list of the
different cross-validation-trained results, which can be used to specify a curve through the 10
data points on the ROC curve that correspond to the results of this classifier. By producing an
ROC curve for each classifier it is possible to compare their results.
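A sketch of the AUC computation from a set of ROC points using the trapezoid rule (the points
below are invented for illustration):

    import numpy as np

    # (false positive rate, true positive rate) points, ordered along the
    # x axis, with the endpoints (0,0) and (1,1) included.
    fpr = np.array([0.0, 0.1, 0.3, 0.5, 1.0])
    tpr = np.array([0.0, 0.5, 0.75, 0.9, 1.0])

    auc = np.trapz(tpr, fpr)   # 0.5 is chance level, 1.0 is a perfect classifier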
Unbalanced Datasets
For the accuracy we have implicitly assumed that there are the same number of positive
and negative examples in the dataset (which is known as a balanced dataset).
When the dataset is unbalanced, we can compute the balanced accuracy as the sum of
sensitivity and specificity divided by 2. A more correct measure is Matthews’ Correlation
Coefficient, which is computed as:

MCC = (#TP × #TN − #FP × #FN) / sqrt((#TP + #FP)(#TP + #FN)(#TN + #FP)(#TN + #FN))

If any of the brackets in the denominator are 0, then the whole of the denominator is set to 1.
This provides a balanced accuracy computation.
Measurement Precision
The concept here is to treat the machine learning algorithm as a measurement system. We
feed in inputs and look at the outputs that we get. Even before comparing them to the target
values, we can measure something about the algorithm: if we feed in a set of similar inputs, then
we would expect to get similar outputs for them. This measure of the variability of the algorithm
is also known as precision.
The point is that just because an algorithm is precise it does not mean that it is accurate –
it can be precisely wrong if it always gives the wrong prediction. One measure of how well the
algorithm’s predictions match reality is known as trueness.
It can be defined as the average distance between the correct output and the prediction.
Trueness doesn’t usually make much sense for classification problems unless there is some
concept of certain classes being similar to each other.
Fig: Assuming that the player was aiming for the highest-scoring triple 20 in darts (the segments
each score the number they are labelled with, the narrow band on the outside of the circle scores
double and the narrow band halfway in scores triple; the outer and inner ‘bullseye’ at the centre
score 25 and 50, respectively), these four pictures show different outcomes. Top left: very
accurate: high precision and trueness, top right: low precision, but good trueness, bottom left:
high precision, but low trueness, and bottom right: reasonable trueness and precision, but the
actual outputs are not very good.
The above figure illustrates the idea of trueness and precision in the traditional way: as a
darts game, with four examples with varying trueness and precision for the three darts thrown by
a player.
UNIT – IV NEURAL NETWORKS
The processing units of the brain are nerve cells called neurons. There are lots of
them (100 billion = 10^11 is the figure that is often given) and they come in lots of different types,
depending upon their particular task.
Their general operation is similar in all cases: transmitter chemicals within the fluid of
the brain raise or lower the electrical potential inside the body of the neuron. If this membrane
potential reaches some threshold, the neuron spikes or fires, and a pulse of fixed strength and
duration is sent down the axon. The axons divide (arborise) into connections to many other
neurons, connecting to each of these neurons in a synapse.
Each neuron is typically connected to thousands of other neurons, so that it is estimated
that there are about 100 trillion (= 10^14) synapses within the brain. After firing, the neuron must
wait for some time to recover its energy (the refractory period) before it can fire again.
Each neuron can be viewed as a separate processor, performing a very simple
computation: deciding whether or not to fire. This makes the brain a massively parallel computer
made up of 10^11 processing elements. If that is all there is to the brain, then we should be able to
model it inside a computer and end up with animal or human intelligence inside a computer.
Hebb’s Rule
Hebb’s rule says that the changes in the strength of synaptic connections are proportional
to the correlation in the firing of the two connecting neurons.
So if two neurons consistently fire simultaneously, then any connection between them will
change in strength, becoming stronger.
If the two neurons never fire simultaneously, the connection between them will die away.
The idea is that if two neurons both respond to something, then they should be connected.
Example:
Suppose that you have a neuron somewhere that recognises your grandmother (this will
probably get input from lots of visual processing neurons, but don’t worry about that). Now if
your grandmother always gives you a chocolate bar when she comes to visit, then some neurons,
which are happy because you like the taste of chocolate, will also be stimulated.
Since these neurons fire at the same time, they will be connected together, and the
connection will get stronger over time. So eventually, the sight of your grandmother, even in a
photo, will be enough to make you think of chocolate. Sound familiar? Pavlov used this idea,
called classical conditioning, to train his dogs: when food was shown to the dogs and the bell
was rung at the same time, the neurons for salivating over the food and hearing the bell fired
simultaneously, and
so became strongly connected. Over time, the strength of the synapse between the neurons that
responded to hearing the bell and those that caused the salivation reflex was enough that just
hearing the bell caused the salivation neurons to fire in sympathy.
There are other names for this idea that synaptic connections between neurons and
assemblies of neurons can be formed when they fire together and can become stronger. It is also
known as long-term potentiation and neural plasticity, and it does appear to have correlates in
real brains.
Fig: A picture of McCulloch and Pitts’ mathematical model of a neuron. The inputs xi are
multiplied by the weights wi, and the neurons sum their values. If this sum is greater than
the threshold θ then the neuron fires; otherwise it does not.
We will use the picture to write down a mathematical description. On the left of the
picture are a set of input nodes (labeled x1, x2, . . . xm). These are given some values, and as an
example we’ll assume that there are three inputs, with x1 = 1, x2 = 0, x3 = 0.5. In real neurons
those inputs come from the outputs of other neurons. So the 0 means that a neuron didn’t fire, the
1 means it did, and the 0.5 has no biological meaning, but never mind. (Actually, this isn’t quite
fair, but it’s a long story and not very relevant.) Each of these other neuronal firings flowed
along a synapse to arrive at our neuron, and those synapses have strengths, called weights. The
strength of the synapse affects the strength of the signal, so we multiply the input by the weight
of the synapse (so we get x1 × w1 and x2 × w2, etc.). Now when all of these signals arrive into
our neuron, it adds them up to see if there is enough strength to make it fire. We’ll write that as

h = Σ (i = 1 to m) wi xi,

which just means sum (add up) all the inputs multiplied by their synaptic weights. I have
assumed that there are m of them, where m = 3 in the example. If the synaptic weights are w1 =
1,w2 = −0.5,w3 = −1, then the inputs to our model neuron are h = 1 × 1 + 0 × −0.5 + 0.5 × −1 = 1
+ 0 + −0.5 = 0.5. Now the neuron needs to decide if it is going to fire.
For a real neuron, this is a question of whether the membrane potential is above some
threshold. We’ll pick a threshold value (labelled θ), say θ = 0 as an example. Now, does our
neuron fire? Well, h = 0.5 in the example, and 0.5 > 0, so the neuron does fire, and produces
output 1. If the neuron did not fire, it would produce output 0.
The McCulloch and Pitts neuron is a binary threshold device. It sums up the inputs
(multiplied by the synaptic strengths or weights) and either fires (produces output 1) or does not
fire (produces output 0) depending on whether the input is above some threshold. We can write
the second half of the work of the neuron, the decision about whether or not to fire (which is
known as an activation function), as:

o = g(h) = 1 if h > θ, and o = g(h) = 0 if h ≤ θ.
This is a very simple model, but we are going to use these neurons, or very simple
variations on them using slightly different activation functions (that is, we’ll replace the
threshold function with something else) for most of our study of neural networks.
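A minimal sketch of the McCulloch and Pitts neuron as described above (the function name is
an assumption):

    import numpy as np

    def mcculloch_pitts(inputs, weights, theta=0.0):
        # Weighted sum of the inputs, h = sum_i w_i * x_i.
        h = np.sum(weights * inputs)
        # Threshold activation: fire (output 1) if h is above theta, otherwise 0.
        return 1 if h > theta else 0

    # The worked example from the text: h = 0.5 > 0, so the neuron fires.
    print(mcculloch_pitts(np.array([1, 0, 0.5]), np.array([1, -0.5, -1])))  # -> 1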
NEURAL NETWORKS
One thing that is probably fairly obvious is that one neuron isn’t that interesting. It
doesn’t do very much, except fire or not fire when we give it inputs. In fact, it doesn’t even learn.
If we feed in the same set of inputs over and over again, the output of the neuron never varies—it
either fires or does not. So to make the neuron a little more interesting we need to work out how
to make it learn, and then we need to put sets of neurons together into neural networks so that
they can do something useful.
In order to make a neuron learn, the question that we need to ask is:
How should we change the weights and thresholds of the neurons so that the network gets the
right answer more often?
We now introduce our very first neural network, the space-age-sounding Perceptron, and
see how we can use it to solve the problem. Once we have worked out the algorithm and how it
works, we’ll look at what it can and cannot do, and then see how statistics can give us insights
into learning as well.
THE PERCEPTRON
The Perceptron is nothing more than a collection of McCulloch and Pitts neurons
together with a set of inputs and some weights to fasten the inputs to the neurons. The network is
shown in following figure. On the left of the figure, shaded in light grey, are the input nodes.
Fig: The Perceptron network, consisting of a set of input nodes (left) connected
to McCulloch and Pitts neurons using weighted connections.
These are not neurons, they are just a nice schematic way of showing how values are fed
into the network, and how many of these input values there are (which is the dimension (number
of elements) in the input vector). They are almost always drawn as circles, just like neurons,
which is rather confusing, so I’ve shaded them a different colour. The neurons are shown on the
right, and you can see both the additive part (shown as a circle) and the thresholder. In practice
nobody bothers to draw the thresholder separately; you just need to remember that it is part of the
neuron.
The neurons in the Perceptron are completely independent of each other: it doesn’t matter
to any neuron what the others are doing, it works out whether or not to fire by multiplying
together its own weights and the input, adding them together, and comparing the result to its own
threshold, regardless of what the other neurons are doing.
Even the weights that go into each neuron are separate for each one, so the only thing
they share is the inputs, since every neuron sees all of the inputs to the network.
In the above figure the number of inputs is the same as the number of neurons, but this
does not have to be the case — in general there will be m inputs and n neurons. The number of
inputs is determined for us by the data, and so is the number of outputs, since we are doing
supervised learning, so we want the Perceptron to learn to reproduce a particular target, that is, a
pattern of firing and non-firing neurons for the given input. We set the values of the input nodes
to match the elements of an input vector and then use the equations above to decide whether or
not each neuron fires.
We can do this for all of the neurons, and the result is a pattern of firing and non-firing
neurons, which looks like a vector of 0s and 1s; so if there are 5 neurons, then a typical output
pattern could be (0, 1, 0, 0, 1), which means that the second and fifth neurons fired and the others
did not. We compare that pattern to the target, which is our known correct answer for this input,
to identify which neurons got the answer right, and which did not.
There are m weights that are connected to that neuron, one for each of the input nodes. If
we label the neuron that is wrong as k, then the weights that we are interested in are wik, where i
runs from 1 to m. So we know which weights to change, but we still need to work out how to
change the values of those weights.
We compute yk − tk (the difference between the output yk, which is what the neuron did,
and the target for that neuron, tk, which is what the neuron should have done; this is a possible
error function). If it is negative then the neuron should have fired and didn’t, so we make the
weights bigger, and vice versa if it is positive, which we can do by subtracting the error value.
That element of the input could be negative, which would switch the values over; so if we
wanted the neuron to fire we’d need to make the value of the weight negative as well. To get
around this we’ll multiply those two things together to see how we should change the weight:
∆wik = −(yk − tk) × xi, and the new value of the weight is the old value plus this value.
We need to decide how much to change the weight by. This is done by multiplying the
value above by a parameter called the learning rate, usually labelled as η. The value of the
learning rate decides how fast the network learns. The rule for updating a weight wij is:

wij ← wij − η(yj − tj) × xi
Computing the computational complexity of this algorithm is very easy. The recall phase
loops over the neurons, and within that loops over the inputs, so its complexity is O(mn). The
training part does this same thing, but does it for T iterations, so costs O(Tmn).
There are two input nodes (plus the bias input) and there will be one output. The inputs
and the target are given in the table on the left. The right of the figure shows a plot of the
function with the circles as the true outputs, and a cross as the false one. The corresponding
neural network is shown in above Figure.
There are three weights. The algorithm tells us to initialize the weights to small random
numbers, so we’ll pick w0 = −0.05,w1 = −0.02,w2 = 0.02.
Now we feed in the first input, where both inputs are 0: (0, 0). Remember that the input
to the bias weight is always −1, so the value that reaches the neuron is −0.05 × −1 +−0.02 × 0 +
0.02 × 0 = 0.05. This value is above 0, so the neuron fires and the output is 1, which is incorrect
according to the target. The update rule tells us that we need to apply the weight update
to each of the weights separately (we’ll pick a value of η = 0.25 for the example):
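Filling in that arithmetic (a worked sketch applying wij ← wij − η(yj − tj) × xi with y = 1 and
t = 0; recall that the bias input is −1):

w0: −0.05 − 0.25 × (1 − 0) × (−1) = −0.05 + 0.25 = 0.2
w1: −0.02 − 0.25 × (1 − 0) × 0 = −0.02
w2: 0.02 − 0.25 × (1 − 0) × 0 = 0.02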
Now we feed in the next input (0, 1) and compute the output (check that you agree that
the neuron does not fire, but that it should) and then apply the learning rule again:
For the (1, 0) input the answer is already correct (you should check that you agree with
this), so we don’t have to update the weights at all, and the same is true for the (1, 1) input. So
now we’ve been through all of the inputs once. Unfortunately, that doesn’t mean we’ve
finished—not all the answers are correct yet.
Implementation
Written this way in Python syntax, the recall code that is used after training, for a set of
nData datapoints arranged in the array inputs, is:
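A plain-loop sketch of that computation (reconstructed, since the original listing is not
reproduced; it assumes the inputs array already includes the bias column):

    import numpy as np

    def pcnfwd_loops(inputs, weights):
        # inputs is (nData, m+1) including the bias inputs; weights is (m+1, n).
        nData, nInputs = np.shape(inputs)
        nNeurons = np.shape(weights)[1]
        activations = np.zeros((nData, nNeurons))
        for data in range(nData):
            for n in range(nNeurons):
                # Sum the inputs multiplied by their synaptic weights.
                for m in range(nInputs):
                    activations[data][n] += weights[m][n] * inputs[data][m]
                # Threshold activation function: fire if above 0.
                activations[data][n] = 1 if activations[data][n] > 0 else 0
        return activations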
Python’s numerical library NumPy provides an alternative method, because it can easily
multiply arrays and matrices together.
In computer terms, matrices are just two-dimensional arrays. We can write the set of
weights for the network in a matrix by making an np.array that has m + 1 rows (the number of
input nodes + 1 for the bias) and n columns (the number of neurons). Now, the element of the
matrix at location (i, j) contains the weight connecting input i to neuron j, which is what we had
in the code above.
If we have matrices A and B where A is size m × n, then the size of B needs to be n×p,
where p can be any number. The n is called the inner dimension since when we write out the size
of the matrices in the multiplication we get (m × n) × (n × p).
NumPy can do this multiplication for us, using the np.dot() function. So to reproduce the
calculation above, we use (where >>> denotes the Python command line, and so this is code to
be typed in, with the answers provided by the Python interpreter shown afterwards):
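For instance (an illustrative pair of matrices, since the original values are not reproduced):

>>> import numpy as np
>>> a = np.array([[1, 2], [3, 4]])
>>> b = np.array([[5, 6], [7, 8]])
>>> np.dot(a, b)
array([[19, 22],
       [43, 50]])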
The np.array() function makes the NumPy array, which is actually a matrix here, made up
of an array of arrays: each row is a separate array, as you can see from the square brackets within
square brackets.
The entire section of code for the recall function of the Perceptron can be rewritten in two lines
of code as:
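A sketch of those two lines (reconstructed; it assumes the weights matrix and bias-augmented
inputs array described above):

    # Matrix multiply inputs (N x m+1) by weights (m+1 x n), then threshold.
    activations = np.dot(inputs, weights)
    outputs = np.where(activations > 0, 1, 0)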
Using the np.transpose() function, which swaps the rows and columns over (so using matrix a
above again) we get:
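Continuing the illustrative matrix a from above:

>>> np.transpose(a)
array([[1, 3],
       [2, 4]])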
The weight update for the entire network can be done in one line (where eta is the learning
rate, η):
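A sketch of that line (reconstructed under the same assumptions):

    # Move each weight against the error (outputs - targets), scaled by eta.
    weights -= eta * np.dot(np.transpose(inputs), outputs - targets)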
The np.shape() function tells you the number of elements in each dimension of the
array. The only things that are needed are to add those extra −1s onto the input vectors for the
bias node, and to decide what values we should put into the weights to start with. The first of
these can be done using the np.concatenate() function, making a one-dimensional array that
contains −1 as all of its elements, and adding it on to the inputs array (note that nData in the code
is equivalent to N in the text):
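A sketch of that call (reconstructed under the same assumptions):

    inputs = np.concatenate((inputs, -np.ones((nData, 1))), axis=1)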
We now return to the OR example that was used in the hand-worked demonstration. Making the
OR data is easy, and then running the code requires importing it using its filename (pcn) and then
calling the pcntrain function. The print-out below shows the instructions to set up the arrays and
call the function, and the output of the weights for 5 iterations of a particular run of the program,
starting from random initial points (note that the weights stop changing after the 1st iteration in
this case, and that different runs will produce different values).
The following Figure shows the decision boundary, which shows when the decision
about which class to categorise the input as changes from crosses to circles.
Before returning the weights, the Perceptron algorithm above prints out the outputs for
the trained inputs. You can also use the network to predict the outputs for other values by using
the pcnfwd function. However, you need to manually add the −1s on in this case, using:
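For example (same assumption about the bias inputs; testin is a placeholder name for the test
array, and the pcnfwd call signature is an assumption):

    testin = np.concatenate((testin, -np.ones((np.shape(testin)[0], 1))), axis=1)
    outputs = pcn.pcnfwd(testin, weights)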
The results on this test data are what you can use in order to compute the accuracy of the
training algorithm, using the methods described earlier.
LINEAR SEPARABILITY
What the Perceptron does: it tries to find a straight line (in 2D, a plane in 3D, and a
hyperplane in higher dimensions) where the neuron fires on one side of the line, and doesn’t on
the other. This line is called the decision boundary or discriminant function. An example of one
is given in the following Figure.
Consider the matrix notation we used in the implementation, but with just one input vector x.
The neuron fires if x · wT ≥ 0 (where w is the row of W that connects the inputs to one particular
neuron; they are the same for the OR example, since there is only one neuron, and wT denotes the
transpose of w and is used to make both of the vectors into column vectors). The a · b notation
describes the inner or scalar product between two vectors. It is computed by multiplying each
element of the first vector by the matching element of the second and adding them all together.
As you might remember from high school, a · b = ||a|| ||b|| cos θ, where θ is the angle between a
and b and ||a|| is the length of the vector a. So the inner product computes a function of the angle
between the two vectors, scaled by their lengths. It can be computed in NumPy using the
np.inner() function. Getting back to the Perceptron, the boundary case is where we find an input
vector x1 that has x1 · wT = 0. Now suppose that we find another input vector x2 that satisfies
x2 · wT = 0. Putting these two equations together we get:

x1 · wT = x2 · wT = 0  ⇒  (x1 − x2) · wT = 0
What does this last equation mean? In order for the inner product to be 0, either ||a|| or ||b||
or cos θ needs to be zero. There is no reason to believe that ||a|| or ||b|| should be 0, so cos θ = 0.
This means that θ = π/2 (or −π/2), which means that the two vectors are at right angles to each
other. Now x1 − x2 is a straight line between two points that lie on the decision boundary, and
the weight vector wT must be perpendicular to it.
Given the input vectors and the associated target outputs, the Perceptron simply tries to find a
straight line that divides the examples where each neuron fires from those where it does not.
This is great if that straight line exists, but is a bit of a problem otherwise. The cases where there
is a straight line are called linearly separable cases.
The following Figure shows an example of decision boundaries computed by a
Perceptron with four neurons; by putting them together we can get good separation of the
classes.
A Useful Insight
Writing the problem in 3D means including a third input dimension that does not change
the data when it is looked at in the (x, y) plane, but moves the point at (0, 0) along a third
dimension. So the truth table for the function is the one shown on the left side of the following
Figure (where ‘In3’ has been added, and only affects the point at (0, 0)).
Fig: A decision boundary (the shaded plane) solving the XOR problem in 3D with
the crosses below the surface and the circles above it.
To demonstrate this, the following listing uses the same Perceptron code:
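A sketch of that listing (reconstructed; the third input is 1 only for the (0, 0) point, as described
above, and the iteration count is arbitrary):

>>> inputs = np.array([[0, 0, 1], [0, 1, 0], [1, 0, 0], [1, 1, 0]])
>>> targets = np.array([[0], [1], [1], [0]])
>>> weights = pcn.pcntrain(inputs, targets, 0.25, 15)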
The following Figure shows two versions of the same dataset. On the left side, the
coordinates are x1 and x2, while on the right side the coordinates are x1, x2 and x1 ×x2. It is now
easy to fit a plane (the 2D equivalent of a straight line) that separates the data.
Fig: Left: Non-separable 2D dataset. Right: The same dataset with third coordinate
x1 × x2, which makes it separable.
Statistics has been dealing with problems of classification and regression for a long time,
before we had computers in order to do difficult arithmetic for us, and so straight line methods
have been around in statistics for many years. They provide a different (and useful) way to
understand what is happening in learning, and by using both statistical and computer science
methods we can get a good understanding of the whole area.
LINEAR REGRESSION
For regression we are making a prediction about an unknown value y (such as the
indicator variable for classes or a future value of some data) by computing some function of
known values xi. We are thinking about straight lines, so the output y is going to be a sum of the
xi values, each multiplied by a constant parameter:

y = Σi βi xi
The βi define a straight line (plane in 3D, hyperplane in higher dimensions) that goes
through (or at least near) the datapoints. The following Figure shows this in two and three
dimensions.
The most common solution is to try to minimise the distance between each datapoint and
the line that we fit. We can measure the distance between a point and a line by defining another
line that goes through the point and hits the line.
We can use Pythagoras’ theorem to compute the distance. Now, we can try to minimise an
error function that measures the sum of all these distances. If we ignore the square roots and just
minimise the sum-of-squares of the errors, then we get the most common minimisation, which is
known as least-squares optimisation.
We want to minimise the squared difference between the prediction and the actual data value,
summed over all of the datapoints. That is, we have:

minimise over β:  Σj (tj − Σi βi xij)²
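This minimisation has the well-known closed-form solution β = (XᵀX)⁻¹Xᵀt, which can be
sketched in NumPy as follows (X holds one datapoint per row and t the targets; the names are
assumptions):

    beta = np.dot(np.linalg.inv(np.dot(np.transpose(X), X)), np.dot(np.transpose(X), t))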
It might not be clear what this means, but if we threshold the outputs by setting every
value less than 0.5 to 0 and every value above 0.5 to 1, then we get the correct answer. Using it
on the XOR function shows that this is still a linear method.
The linear regressor can’t do much with the names of the cars either, but since they
appear in quotes (") we will tell np.loadtxt that they are comments, using:
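A sketch of that call (the filename is a placeholder):

    auto = np.loadtxt('auto-mpg.data', comments='"')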
Separate the data into training and testing sets, and then use the training set to recover the β
vector. Then you use that to get the predicted values on the test set. However, the confusion
matrix isn’t much use now, since there are no classes to enable us to analyse the results. Instead,
we will use the sum-of-squares error, which consists of computing the difference between the
prediction and the true value, squaring them so that they are all positive, and then adding them
up, as is used in the definition of the linear regressor. Obviously, small values of this measure are
good. It can be computed using:
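A sketch of that computation (the array names are assumptions):

    error = np.sum((testtargets - testout)**2)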
UNSUPERVISED LEARNING
The aim of unsupervised learning is to find clusters of similar inputs in the data without being
explicitly told that some datapoints belong to one class and others to a different class.
A distance measure: in order to talk about distances between points, we need some way
to measure distances. It is often the normal Euclidean distance, but there are other alternatives.
The mean average: we can compute the central point of a set of datapoints, which is the
mean average (the mean of two numbers is the point halfway along the line between them).
Actually, this is only true in Euclidean space, which is the one you are used to, where everything
is nice and flat.
A suitable way of positioning the cluster centres: we compute the mean point of each
cluster, μc(i), and put the cluster centre there. This is equivalent to minimising the Euclidean
distance (which is the sum-of-squares error again) from each datapoint to its cluster centre.
For all of the points that are assigned to a cluster, we then compute the mean of them, and
move the cluster centre to that place. We iterate the algorithm until the cluster centres stop
moving.
The NumPy implementation follows these steps almost exactly, and we can take
advantage of the np.argmin() function, which returns the index of the minimum value, to find the
closest cluster. The code that computes the distances, finds the nearest cluster centre, and updates
them can then be written as:
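A sketch of that step (reconstructed, not the author's listing; data is (nData, nDim) and centres
is (k, nDim)):

    import numpy as np

    def kmeans_step(data, centres):
        # Squared Euclidean distance from every datapoint to every centre.
        distances = np.sum((data[:, np.newaxis, :] - centres[np.newaxis, :, :])**2, axis=2)
        # np.argmin finds the index of the closest centre for each datapoint.
        cluster = np.argmin(distances, axis=1)
        # Move each centre to the mean of the datapoints assigned to it.
        for c in range(centres.shape[0]):
            if np.any(cluster == c):
                centres[c, :] = np.mean(data[cluster == c, :], axis=0)
        return centres, cluster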
The following Figures show some data and some different ways to cluster that data
computed by the k-means algorithm.
The above figure shows examples of what happens when you choose the number of
centres wrongly. There are certainly cases where we don’t know in advance how many clusters
we will see in the data, but the k-means algorithm doesn’t deal with this at all well.
To find a good local optimum (or even the global one) we use many different initial
centre locations, and the solution that minimises the overall sum-of-squares error is likely to be
the best one.
By running the algorithm with lots of different values of k, we can see which values give
us the best solution.
If we still just measure the sum-of-squares error between each datapoint and its nearest
cluster centre, then when we set k to be equal to the number of datapoints, we can position one
centre on every datapoint, and the sum-of-squares error will be zero. There is no generalisation
in this solution: it is a case of serious overfitting.
By computing the error on a validation set and multiplying the error by k we can see
something about the benefit of adding each extra cluster centre.
We will choose k neurons (for hopefully obvious reasons) and fully connect the inputs to
the neurons, as usual. We will use neurons with a linear transfer function, computing the
activation of the neurons as simply the product of the weights and inputs:

hi = Σj wij xj
Normalisation
Computing this normalisation in NumPy takes a little bit of care because we are
normalizing the total Euclidean distance from the origin, and the sum and division are row-wise
rather than column-wise, which means that the matrix has to be transposed before and after the
division:
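A sketch of that normalisation (reconstructed; each row of data is divided by its Euclidean
length):

    normalisers = np.sqrt(np.sum(data**2, axis=1))
    data = np.transpose(np.transpose(data) / normalisers)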
A Better Weight Update Rule
If we normalise the inputs as well, which certainly seems reasonable, then we can use the
following weight update rule, which moves the weights of the winning neuron directly towards
the current input:

Δwij = η(xj − wij)
VECTOR QUANTISATION
What happens when I want to send a datapoint and it isn’t in the codebook? In that case
we need to accept that our data will not look exactly the same, and I send you the index of the
prototype vector that is closest to it (this is known as vector quantisation, and is the way that
lossy compression works).
The dots at the centre of each cell are the prototype vectors, and any datapoint that lies
within a cell is represented by the dot. The name for each cell is the Voronoi set of a particular
prototype. Together, they produce the Voronoi tessellation of the space. If you connect together
every pair of points that share an edge, as is shown by the dotted lines, then you get the Delaunay
triangulation, which is the optimal way to organise the space to perform function approximation.
We need to choose prototype vectors that are as close as possible to all of the possible
inputs that we might see. This application is called learning vector quantization because we are
learning an efficient vector quantisation. The k-means algorithm can be used to solve the
problem if we know how large we want our codebook to be. Another algorithm turns out to be
more useful, the Self-Organising Feature Map.
The most commonly used competitive learning algorithm is the Self-Organising Feature
Map (often abbreviated to SOM). It was developed by Teuvo Kohonen, who was considering the
question of how sensory signals get mapped into the cerebral cortex of the brain with an order.
For example, in the auditory cortex, which deals with the sounds that we hear, neurons
that are excited (i.e., that are caused to fire) by similar sounds are positioned closely together,
whereas two neurons that are excited by very different sounds will be far apart.
There are two novel departures in the SOM:
1. The relative locations of the neurons in the network matters (this property is known as
feature mapping—nearby neurons correspond to similar input patterns)
2. The neurons are arranged in a grid with connections between the neurons, rather than
in layers with connections only between the different layers.
In the auditory cortex there appear to be sheets of neurons arranged in 2D, and that is
the typical arrangement of neurons for the SOM: a grid of neurons arranged in 2D, as can
be seen in the following figure.
Neighbourhood Connections
The size of the neighbourhood is thus another parameter that we need to control. Once
the network has been learning for a while, the rough ordering has already been created, and the
algorithm starts to fine-tune the individual local regions of the network. At this stage, the
neighbourhoods should be small, as is shown in the following Figure.
It therefore makes sense to reduce the size of the neighbourhood as the network adapts.
These two phases of learning are also known as ordering and convergence. Typically, we reduce
the neighbourhood size by a small amount at each iteration of the algorithm. We control the
learning rate η in exactly the same way, so that it starts off large and decreases over time.
Self-Organisation
A particularly interesting aspect of feature mapping is that we get a global ordering of the
neurons in the network, despite the fact that the interactions are all local, since neurons that are
very far apart do not interact with each other. We thus get a global ordering of the space using
only a set of local interactions, which is amazing. This is known as self-organisation, and it
appears everywhere.
Example: a flock of birds flying in formation. The birds cannot possibly know exactly
where each other are, so how do they keep in formation?
If each bird just tries to stay diagonally behind the bird to its right, and fly at the same
speed, then they form perfect flocks, no matter how they start off and what objects are placed in
their way. So the global ordering of the whole flock can arise from the local interactions of each
bird looking to the one on its right (or left).
We have been applying the SOM algorithm to a 2D rectangular array of neurons, but there is
nothing in the algorithm to force this. There are cases where a line of neurons (1D) works better,
or where three dimensions are needed. It depends on the dimensionality of the inputs (actually
on the intrinsic dimensionality, the number of dimensions that you actually need to represent the
data), not the number that it is embedded in.
Example:
Consider a set of inputs spread through the room you are in, but all on the plane that
connects the bottom of the wall to your left with the top of the wall to your right. These points
have intrinsic dimensionality two since they are all on the plane, but they are embedded in your
three-dimensional room. Noise and other inaccuracies in data often lead to it being represented in
more dimensions than are actually required, and so finding the intrinsic dimensionality can help
to reduce the noise.
We also need to consider the boundaries of the network. Sometimes it makes sense for the
edges of the map of neurons to be strictly defined.
Example:
If we are arranging sounds from low pitch to high pitch, then the lowest and highest pitches we
can hear are obvious endpoints. However, it is not always the case that such boundaries are
clearly defined. In this case we might want to remove the boundary conditions. We can do this
by removing the boundary by tying the ends together. In 1D this means that we turn a line into a
circle, while in 2D we turn a rectangle into a torus. To see this, try taking a piece of paper and
bend it so that the top and bottom edges line up. You’ve now got a tube. If you bend the tube
round so that the two open ends meet up you have a circle of tube known as a torus. Pictures of
these effects are shown in the following Figure
The map distances get more complicated to calculate, since we now need to calculate the
distances allowing for the wrap around. This can be done using modulo arithmetic, but it is easier
to think about taking copies of the map and putting them around the map, so that the original
map has copies of itself all around: one above, one below, to the right and left, and also
diagonally above and below, as is shown in above Figure
Now we keep one of the points in the original map, and the distance to the second node is
the smallest of the distances between the first node and the copies of the second node in the
different maps (including the original). By treating the distances in x and y separately, the
number of distances that has to be computed can be reduced.
As with the competitive learning algorithm that we considered earlier, the size of the SOM is
defined before we start learning. The size of the network (that is, the number of neurons that we
put into it) decides how fine-grained the learning is. If there are very few neurons, then the best
that the network can do is to find gross generalisations that link the data. However, if there are
very large numbers of neurons, then the network can represent every input without ever needing
to generalise at all.
This is yet another example of overfitting. Clearly, then, choosing the correct size of
network is important. The common approach is to test out several different sizes of network,
such as 5 × 5 and 10 × 10 and see how well the network learns.
UNIT V
EXPERT SYSTEMS
• An expert system is a computer program that is designed to solve complex problems and to provide decision-making ability like a human expert.
• It performs this by extracting knowledge from its knowledge base, using reasoning and inference rules according to the user queries.
• The expert system is a part of AI; the first ES was developed in 1970 and was one of the first successful applications of artificial intelligence.
• It solves the most complex issues as an expert would, by extracting the knowledge stored in its knowledge base.
• The system helps in decision making for complex problems using both facts and heuristics, like a human expert.
• It is called an expert system because it contains the expert knowledge of a specific domain and can solve any complex problem of that particular domain.
• These systems are designed for a specific domain, such as medicine, science, etc.
MYCIN
XCON
PART-B (5 × 13 = 65 Marks)
11. a) Explain the following types of Hill Climbing search techniques:
i) Simple Hill Climbing (4)
ii) Steepest-Ascent Hill Climbing (5)
iii) Simulated Annealing (4)
(OR)
b) Trace the operation of the unification algorithm on each of the following pairs of literals: (13)
i) f(Marcus) and f(Caesar)
ii) f(x) and f(g(y))
iii) f(Marcus, g(x, y)) and f(x, g(Caesar, Marcus))
13. a) Explain the production based knowledge representation technique. (13)
(OR)
ON(A, B, S0) ∧
ONTABLE(B, S0) ∧
CLEAR(A, S0)
ii) Write short notes on Nonlinear Planning using Constraint Posting. (5)
i) MYCIN. (7)
ii) DART. (6)
(OR)
16. a) Design an expert system for Travel recommendation and discuss its roles.
(OR)
Unit -1
PART - A
1. What is AI?
Artificial Intelligence is the branch of computer science concerned with making
computers behave like humans.
Systems that think like humans
Systems that act like humans
Systems that think rationally
Systems that act rationally
2. Define an agent.
An agent is anything that can be viewed as perceiving its environment through sensors
and acting upon that environment through actuators.
6. List the properties of task environments.
Fully observable vs. partially observable.
Deterministic vs. stochastic.
Episodic vs. sequential.
Static vs. dynamic.
Discrete vs. continuous.
Single agent vs. multiagent.
7. What are the four different kinds of agent programs?
Simple reflex agents;
Model-based reflex agents;
Goal-based agents; and
Utility-based agents.
8. Explain the goal-based agent.
Knowing about the current state of the environment is not always enough to
decide what to do. For example, at a road junction, the taxi can turn left, turn
right, or go straight on. The correct decision depends on where the taxi is trying
to get to.
In other words, as well as a current state description, the agent needs some sort
of goal information that describes situations that are desirable – for example,
being at the passenger's destination.
9. What are utility based agents?
Goals alone are not really enough to generate high-quality behavior in most
environments.
For example, there are many action sequences that will get the taxi to its
destination (thereby achieving the goal) but some are quicker, safer, more
reliable, or cheaper than others.
A utility function maps a state (or a sequence of states) onto a real number,
which describes the associated degree of happiness.
10. What are learning agents?
A learning agent can be divided into four conceptual components. The most
important distinction is between the learning element, which is responsible
for making improvements, and the performance element, which is
responsible for selecting external actions.
The performance element is what we have previously considered to be the entire
agent: it takes in percepts and decides on actions. The learning element uses
feedback from the critic on how the agent is doing and determines how
the performance element should be modified to do better in the future.
11. Define the problem solving agent.
A problem-solving agent is a goal-based agent. It decides what to do by finding
sequences of actions that lead to desirable states.
The agent can adopt a goal and aim at satisfying it. Goal formulation is the first
step in problem solving.
12. List the steps involved in simple problem solving agent.
Goal formulation
Problem formulation
Search
Execution phase
13. Define search and search algorithm.
The process of looking for a sequence of actions from the current state to reach the
goal state is called search. The search algorithm takes a problem as input
and returns a solution in the form of an action sequence.
Once a solution is found, the execution phase consists of carrying out the
recommended actions.
14. What are the components of well-defined problems?
The initial state that the agent starts in. The initial state for our agent in the
example problem is described by In(Arad).
A successor function returns the possible actions available to the agent.
Given a state x, SUCCESSOR-FN(x) returns a set of {action, successor} ordered
pairs, where each action is one of the legal actions in state x, and each successor
is a state that can be reached from x by applying the action.
For example, from the state In(Arad), the successor function for the Romania
problem would return
{ [Go(Sibiu), In(Sibiu)], [Go(Timisoara), In(Timisoara)], [Go(Zerind), In(Zerind)] }
The goal test determines whether the given state is a goal state.
A path cost function assigns a numeric cost to each action. For the Romania
problem the cost of a path might be its length in kilometers.
15. Give examples of real world problems.
a) Touring problems
b) Travelling Salesperson Problem(TSP)
c) VLSI layout
d) Robot navigation
e) Automatic assembly sequencing
f) Internet searching
16. List the criteria to measure the performance of different search strategies.
Completeness: Is the algorithm guaranteed to find a solution when there is one?
Optimality: Does the strategy find the optimal solution?
Time complexity: How long does it take to find a solution?
Space complexity: How much memory is needed to perform the search?
17. Define Best-first-search.
Best-first search is an instance of the general TREE-SEARCH or GRAPH-SEARCH
algorithm in which a node is selected for expansion based on an evaluation
function f(n). Traditionally, the node with the lowest evaluation function value is
selected for expansion.
18. What is a heuristic function? (Nov/Dec 2016)
A heuristic function, or simply a heuristic, is a function that ranks alternatives
at each branching step of a search algorithm, based on the available information,
in order to decide which branch to follow.
For example, for shortest-path problems, a heuristic is a function h(n) defined
on the nodes of a search tree, which serves as an estimate of the cost of the
cheapest path from that node to the goal node. Heuristics are used by informed
search algorithms such as greedy best-first search and A* to choose the best
node to explore.
19. What are relaxed problems?
A problem with fewer restrictions on the actions is called a relaxed problem.
The cost of an optimal solution to a relaxed problem is an admissible heuristic
for the original problem.
If the rules of the 8-puzzle are relaxed so that a tile can move anywhere, then
h_oop(n), the number of out-of-place (misplaced) tiles, gives the length of the
shortest solution.
If the rules are relaxed so that a tile can move to any adjacent square, then
h_md(n), the total Manhattan distance of the tiles from their goal positions,
gives the length of the shortest solution. Both heuristics are sketched below.
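A minimal sketch of the two relaxed-problem heuristics for the 8-puzzle, assuming states are 9-tuples read row by row with 0 as the blank (this representation is an illustrative choice):

GOAL = (1, 2, 3, 4, 5, 6, 7, 8, 0)  # 0 is the blank

def h_oop(state, goal=GOAL):
    """Out-of-place tile count: admissible because each misplaced tile needs >= 1 move."""
    return sum(1 for s, g in zip(state, goal) if s != 0 and s != g)

def h_md(state, goal=GOAL):
    """Total Manhattan distance: admissible because a tile moves one square per step."""
    pos = {tile: divmod(i, 3) for i, tile in enumerate(goal)}
    total = 0
    for i, tile in enumerate(state):
        if tile == 0:
            continue
        r, c = divmod(i, 3)
        gr, gc = pos[tile]
        total += abs(r - gr) + abs(c - gc)
    return total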
20. What are categories of production system? (Nov/Dec 2016)
Characteristic                 Monotonic             Non-monotonic
Partially commutative          Theorem proving       Robot navigation
Not partially commutative      Chemical synthesis    Bridge game
21. What is A* search?
A* search is the most widely known form of best-first search. It evaluates nodes
by combining g(n), the cost to reach the node, and h(n), the cost to get from the
node to the goal:
f(n) = g(n) + h(n)
where f(n) is the estimated cost of the cheapest solution through n, g(n) is the
path cost from the start node to node n, and h(n) is the heuristic estimate of
the cost from n to the goal.
A* search is both complete and optimal, provided h(n) is an admissible heuristic
(and consistent, for graph search).
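A minimal A* sketch on an explicit weighted graph, assuming graph[u] maps each neighbor to its step cost and h is an admissible heuristic (all names are illustrative):

import heapq

def a_star(graph, start, goal, h):
    """A*: repeatedly expand the node with lowest f(n) = g(n) + h(n)."""
    frontier = [(h(start), 0, start, [start])]  # (f, g, state, path)
    best_g = {start: 0}
    while frontier:
        f, g, state, path = heapq.heappop(frontier)
        if state == goal:
            return path, g
        for succ, step_cost in graph.get(state, {}).items():
            g2 = g + step_cost
            if g2 < best_g.get(succ, float("inf")):  # keep only the cheapest route found
                best_g[succ] = g2
                heapq.heappush(frontier, (g2 + h(succ), g2, succ, path + [succ]))
    return None, float("inf")  # no path exists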
22. What is Recursive best-first search?
Recursive best-first search is a simple recursive algorithm that attempts to mimic
the operation of standard best-first search, but using only linear space.
23. What are local search algorithms?
Local search algorithms operate using a single current state (rather than
multiple paths) and generally move only to neighbors of that state. Local
search algorithms are not systematic.
Their two key advantages are: (i) they use very little memory, usually a constant
amount, and (ii) they can often find reasonable solutions in large or infinite
(continuous) state spaces for which systematic algorithms are unsuitable.
24. What are the advantages of local search?
They use very little memory, usually a constant amount.
They can often find reasonable solutions in large or infinite (e.g., continuous)
state spaces for which systematic search is unsuitable.
They are useful for pure optimization problems, where the aim is to find the best
state according to an objective function (e.g., the traveling salesman problem).
25. What are optimization problems?
In optimization problems, the aim is to find the best state according to an
objective function. The optimization problem is then: find values of the variables
that minimize or maximize the objective function while satisfying the constraints.
26. What is Hill-climbing search?
The hill-climbing algorithm is simply a loop that continually moves in the
direction of increasing value, that is, uphill. It terminates when it reaches a
"peak" where no neighbor has a higher value.
The algorithm does not maintain a search tree, so the current-node data structure
need only record the state and its objective function value. Hill climbing does
not look ahead beyond the immediate neighbors of the current state.
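A minimal hill-climbing loop, assuming neighbors(state) enumerates successor states and value is the objective function to maximize (both names are illustrative):

def hill_climbing(start, neighbors, value):
    """Steepest-ascent hill climbing: move to the best neighbor until none is better."""
    current = start
    while True:
        candidates = list(neighbors(current))
        if not candidates:
            return current
        best = max(candidates, key=value)
        if value(best) <= value(current):
            return current  # a peak, possibly only a local maximum
        current = best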
27. What is the problem faced by hill-climbing search? (May 2010)
Hill climbing often gets stuck for the following reasons:
Local maxima – a local maximum is a peak that is higher than each of its
neighboring states but lower than the global maximum. Hill-climbing algorithms
that reach the vicinity of a local maximum will be drawn upwards towards the
peak, but will then be stuck with nowhere else to go.
Ridges – ridges result in a sequence of local maxima that is very difficult
for greedy algorithms to navigate.
Plateaux – a plateau is an area of the state-space landscape where the
evaluation function is flat. A hill-climbing search might be unable to find its
way off the plateau.
28. What is local beam search?
The local beam search algorithm keeps track of k states rather than just one. It
begins with k randomly generated states. At each step, all the successors of all k
states are generated. If any one of them is a goal, the algorithm halts. Otherwise,
it selects the k best successors from the complete list and repeats.
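A minimal local beam search sketch, assuming successors, value (to maximize), and goal_test are supplied by the caller (all names are illustrative):

def local_beam_search(k, initial_states, successors, value, goal_test, iters=100):
    """Keep the k best states each step, pooling the successors of all current states."""
    states = list(initial_states)[:k]   # begin with k (randomly generated) states
    for _ in range(iters):
        pool = [s2 for s in states for s2 in successors(s)]
        for s in pool:
            if goal_test(s):
                return s                # halt as soon as any successor is a goal
        if not pool:
            break
        states = sorted(pool, key=value, reverse=True)[:k]
    return max(states, key=value)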
29. What are the variants of hill-climbing?
Stochastic hill climbing: random selection among the uphill moves; the selection
probability can vary with the steepness of the uphill move.
First-choice hill climbing: implements stochastic hill climbing by generating
successors randomly until one better than the current state is found.
Random-restart hill climbing: conducts a series of hill-climbing searches from
randomly generated initial states, and thereby tries to avoid getting stuck in
local maxima.
30. Define constraint satisfaction problem. (Nov/Dec 2015)
A constraint satisfaction problem (or CSP) is defined by a set of variables
X1, X2, ..., Xn and a set of constraints C1, C2, ..., Cm. Each variable Xi has a
nonempty domain Di of possible values.
Each constraint Ci involves some subset of the variables and specifies the
allowable combinations of values for that subset. A state of the problem is
defined by an assignment of values to some or all of the variables,
{Xi = vi, Xj = vj, ...}. A solution to a CSP is a complete assignment that
satisfies all the constraints.
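A minimal CSP sketch using map coloring as the example, solved by backtracking over one variable at a time; the problem instance and function names are illustrative:

# Map-coloring CSP: variables are regions, domains are colors,
# constraints require neighboring regions to receive different colors.
NEIGHBORS = {"WA": ["NT", "SA"], "NT": ["WA", "SA", "Q"],
             "SA": ["WA", "NT", "Q"], "Q": ["NT", "SA"]}
COLORS = ["red", "green", "blue"]

def consistent(var, val, assignment):
    """A value is allowed if it differs from every already-assigned neighbor."""
    return all(assignment.get(n) != val for n in NEIGHBORS[var])

def backtrack(assignment=None):
    assignment = assignment or {}
    if len(assignment) == len(NEIGHBORS):
        return assignment          # complete assignment satisfying all constraints
    var = next(v for v in NEIGHBORS if v not in assignment)
    for val in COLORS:
        if consistent(var, val, assignment):
            result = backtrack({**assignment, var: val})
            if result:
                return result
    return None                    # dead end: triggers backtracking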
31. What is a constraint graph?
It is helpful to visualize a constraint satisfaction problem as a constraint
graph: a graph whose nodes correspond to the variables of the problem and whose
arcs correspond to the constraints.
For example, in a cryptarithmetic problem each letter stands for a distinct
digit, and the aim is to find a substitution of digits for letters such that the
resulting sum is arithmetically correct, with the added restriction that no
leading zeroes are allowed. The constraint hypergraph for such a problem shows
the Alldiff constraint as well as the column addition constraints; each
constraint is drawn as a square box connected to the variables it constrains.
33. Define a game.
Formal Definition of Game
We will consider games with two players, whom we will call MAX and MIN. MAX
moves first, and then they take turns moving until the game is over. At the end
of the game, points are awarded to the winning player and penalties are given to
the loser. A game can be formally defined as a search problem with the following
components:
The initial state includes the board position and identifies the player to move.
A successor function returns a list of (move, state) pairs, each indicating a
legal move and the resulting state.
A terminal test describes when the game is over. States where the game has
ended are called terminal states.
A utility function (also called an objective function or payoff function) gives
a numeric value for the terminal states. In chess, the outcome is a win, loss, or
draw, with values +1, -1, or 0. The payoffs in backgammon range from +192 to -192.
34. Explain briefly the minimax algorithm.
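A minimal sketch of minimax for the two-player formulation above, assuming successors, terminal_test, and utility follow the game components just defined (all names are illustrative):

def minimax(state, max_turn, successors, terminal_test, utility):
    """Return the backed-up minimax value of `state`."""
    if terminal_test(state):
        return utility(state)
    values = [minimax(s, not max_turn, successors, terminal_test, utility)
              for _, s in successors(state)]
    return max(values) if max_turn else min(values)

def minimax_decision(state, successors, terminal_test, utility):
    """MAX chooses the move whose resulting state has the highest minimax value."""
    return max(successors(state),
               key=lambda ms: minimax(ms[1], False, successors,
                                      terminal_test, utility))[0]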
UNIT II
PART – A
1. What is game playing?
The term game means a sort of conflict in which n individuals or groups (known as players)
participate. Game theory denotes games of strategy: it allows decision-makers (players)
to cope with other decision-makers (players) who have different purposes in mind. In other
words, players determine their own strategies in terms of the strategies and goals of their
opponents.
8. Specify the syntax of First-order logic in BNF form
27. Give the Bayes' rule equation. [APRIL/MAY 2017, APR/MAY 2018]
We know that
P(A ^ B) = P(A|B) P(B)   ...(1)
P(A ^ B) = P(B|A) P(A)   ...(2)
Equating (1) and (2) and dividing both sides by P(A), we get
P(B|A) = P(A|B) P(B) / P(A)
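A small worked example of the rule (the numbers are chosen purely for illustration): suppose a disease B has prior probability P(B) = 0.01, a test A detects it with P(A|B) = 0.9, and the test comes out positive overall with P(A) = 0.05. Then P(B|A) = (0.9 x 0.01) / 0.05 = 0.18, i.e., a positive test raises the probability of the disease from 1% to 18%.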
35. Define Fuzzy reasoning. [NOV/DEC 2017].
Human reasoning means the action of thinking about something in a logical, sensible way.
Fuzzy logic (FL) is a method of reasoning that resembles human reasoning: it imitates the
way humans make decisions, which involves all the intermediate possibilities between the
digital values YES and NO.
36. Compare production based system with frame based system. [NOV/DEC 2017]
PART – B & C
UNIT – IV
PART – A
1. What is learning?
Learning covers a wide range of phenomena. At one end of the spectrum is skill refinement:
people get better at many tasks simply by practicing. At the other end of the spectrum lies
knowledge acquisition: knowledge is generally acquired through experience.
5. What is planning?
Planning refers to the process of computing several steps of a problem solving procedure before
executing any of them.
9. List out successful applications of machine learning?
Adaptable software systems
Bioinformatics
Natural language processing
Speech recognition
Pattern recognition
Intelligent control
Trend prediction
PART – B & C
UNIT – IV
PART – A
1. What is neuron?
The processing units of the brain are nerve cells called neurons. There are lots of them
(100 billion, i.e. 10^11, is the figure that is often given) and they come in many different
types, depending upon their particular task. Each neuron can be viewed as a separate
processor, performing a very simple computation: deciding whether or not to fire. This
makes the brain a massively parallel computer made up of 10^11 processing elements.
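A minimal sketch of that "decide whether or not to fire" computation as an artificial neuron, i.e. a weighted sum followed by a threshold (the weights and threshold below are illustrative):

def neuron(inputs, weights, threshold):
    """Fire (output 1) iff the weighted sum of the inputs reaches the threshold."""
    activation = sum(x * w for x, w in zip(inputs, weights))
    return 1 if activation >= threshold else 0

# Example: a 2-input neuron wired to behave like logical AND.
print(neuron([1, 1], [0.6, 0.6], threshold=1.0))  # prints 1
print(neuron([1, 0], [0.6, 0.6], threshold=1.0))  # prints 0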
Choosing manually.
Being dependent on initial values.
9. What are Neural Networks? What are the types of Neural networks?
In simple words, a neural network is a connection of many very tiny processing elements
called neurons. There are two types of neural network:
Biological neural networks – these are made of real neurons, the tiny processing units
inside the brain; in fact, neurons make up the whole nervous system, not only the brain.
Artificial neural networks – an imitation of biological neural networks, built from
artificially designed small processing elements instead of digital computing systems that
use only binary digits. Artificial neural networks are designed to give machines
human-quality performance on their tasks.
10. Why use Artificial Neural Networks? What are its advantages?
Artificial neural networks are designed to give machines human-quality thinking, so that
they can weigh "what if" and "what if not" decisions with precision. Some of the other
advantages are:
Adaptive learning: the ability to learn how to do tasks based on the data given for
training or initial experience.
Self-organization: an artificial neural network can create its own organization or
representation of the information it receives during learning time.
Real-time operation: artificial neural network computations may be carried out in parallel,
and special hardware devices are being designed and manufactured to take advantage of
this capability.
13. What are the disadvantages of Artificial Neural Networks?
The major disadvantage is that they require a large diversity of training data to
work in a real environment. Moreover, they are often not robust enough for
real-world use.
17. Overfitting is one of the most common problems every Machine Learning practitioner
faces. Explain some methods to avoid overfitting in Neural Networks.
Dropout: a regularization technique that prevents the neural network from overfitting. It
randomly drops neurons from the network during training, which is equivalent to training
many different neural networks. The different networks overfit in different ways, so the
net effect of dropout is to reduce overfitting and leave a model that generalizes better
for predictive analysis.
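A minimal sketch of dropout as a layer in a network, here using the Keras API for illustration (the layer sizes and the 0.5 rate are arbitrary choices, not recommendations):

import tensorflow as tf

# Dropout randomly zeroes activations during training only;
# at inference time the full network is used.
model = tf.keras.Sequential([
    tf.keras.layers.Dense(128, activation="relu", input_shape=(784,)),
    tf.keras.layers.Dropout(0.5),   # drop half the activations each training step
    tf.keras.layers.Dense(10, activation="softmax"),
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")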
PART – B & C
UNIT - V
PART – A
11. What are the players in expert system?
The players in an expert system are: the domain expert, the knowledge engineer, and the user.
12. What are the advantages of Expert system? [ MAY / JUNE 2016 ]
– Availability: Expert systems are easily available due to the mass production of software.
– Cheaper: The cost of providing expertise is relatively low.
– Reduced danger: They can be used in risky environments where humans cannot work.
– Permanence: The knowledge lasts indefinitely.
– Multiple expertise: They can be designed to hold the knowledge of many experts.
13. List out the limitations of expert systems.
Not widely used or tested
Limited to relatively narrow problems
Cannot readily deal with “mixed” knowledge
Possibility of error
Cannot refine own knowledge base
Difficult to maintain
May have high development costs
Raise legal and ethical concerns
27. What are the properties of Expert system? [ MAY / JUNE 2016 ]
– Availability: Expert systems are easily available due to the mass production of software.
– Cheaper: The cost of providing expertise is relatively low.
– Reduced danger: They can be used in risky environments where humans cannot work.
– Permanence: The knowledge lasts indefinitely.
– Multiple expertise: They can be designed to hold the knowledge of many experts.
– Explanation: They are capable of explaining in detail the reasoning that led to a
conclusion.
– Fast response: They can respond at great speed due to the inherent advantages of
computers over humans.
– Unemotional, steady response at all times: Unlike humans, they do not get tense,
fatigued, or panicked, and they work steadily during emergency situations.
PART – B & C
1. With neat sketch explain the architecture, characteristic features and roles of expert system.
[ MAY / JUNE 2016 , APR/MAY 2018] Refer Page 422 in Kevin Knight
2. Discuss about the Knowledge Acquisition process in expert systems [ MAY / JUNE 2016 ]
Refer Page 427 in Kevin Knight
3. Write notes on Meta Knowledge and Heuristics in Knowledge Acquisition Refer Page 427 in
Kevin Knight
4. Explain in detail about the expert system shell.[ NOV/DEC 2018 ] Refer Page 424 in Kevin Knight
5. Write notes on the expert systems MYCIN, DART and XCON and explain how they work. [NOV/DEC
2017, APR/MAY 2018] Refer Page 422 in Kevin Knight
6. Explain the basic components and applications of expert system. [ MAY / JUNE 2016 ]
Refer Page 424 in Kevin Knight
7. Define Expert system. Explain the architecture of an expert system in detail with a neat diagram
and an example. [APRIL/MAY 2017] Refer Page 422 in Kevin Knight
8. Write the applications of expert systems. [MAY / JUNE 2016 ] Refer Page 425 in Kevin Knight
9. Explain the need, significance and evolution of XCON expert system. [APRIL/MAY 2017]
Refer Page 425 in Kevin Knight
10. Explain the expert system architectures: [NOV/DEC 2017]
1. Rule-based system architecture 2. Associative or semantic network architecture 3.
Network architecture 4. Blackboard system architecture. Refer Page 422 in Kevin Knight
11. Design an expert system for travel recommendation and discuss its roles. [NOV/DEC 2017]
Refer Page 422 in Kevin Knight
12. Explain the architecture of an expert system in detail with a neat diagram and an example.
[APRIL/MAY 2017] Refer Page 422 in Kevin Knight
13. Explain the XCON expert system. [APRIL/MAY 2017] Refer Page 425 in Kevin Knight
14. Explain the applications of expert system. [ MAY / JUNE 2016 ] Refer Page 424 in Kevin Knight
15. Explain the architecture of expert system. [ MAY / JUNE 2016 , APR/MAY 2018] Refer Page
422 in Kevin Knight