Course 395: Machine Learning – Lecture 7-8: Instance Based Learning (M. Pantic)


Course 395: Machine Learning – Lectures

• Lecture 1-2: Concept Learning (M. Pantic)

• Lecture 3-4: Decision Trees & CBC Intro (M. Pantic)

• Lecture 5-6: Artificial Neural Networks (S. Zafeiriou)


• Lecture 7-8: Instance Based Learning (M. Pantic)

• Lecture 9-10: Genetic Algorithms (M. Pantic)

• Lecture 11-12: Evaluating Hypotheses (THs)

• Lecture 13-14: Bayesian Learning (S. Zafeiriou)

• Lecture 15-16: Dynamic Bayesian Networks (S. Zafeiriou)

• Lecture 17-18: Inductive Logic Programming (S. Muggleton)


Maja Pantic Machine Learning (course 395)
Instance Based Learning – Lecture Overview

• Lazy learning

• K-Nearest Neighbour learning

• Locally weighted regression

• Case-based reasoning (CBR)

• Advantages and disadvantages of lazy learning

• (Example: CBR-based system for facial expression interpretation)

Maja Pantic Machine Learning (course 395)


Eager vs. Lazy Learning

• Eager learning methods construct a general, explicit description of the target function
based on the provided training examples.

(≡ one-fits-all ≡ input independent)

Maja Pantic Machine Learning (course 395)


Eager vs. Lazy Learning

• Eager learning methods construct a general, explicit description of the target function
based on the provided training examples.

• Lazy learning methods simply store the data; generalizing beyond these data is
postponed until an explicit request is made.

[Diagram: problem space → solution space]
1. Search the memory for similar instances
2. Retrieve the related solutions
3. Adapt the solutions to the current instance
4. Assign the value of the target function estimated for the current instance

Maja Pantic Machine Learning (course 395)




Eager vs. Lazy Learning

• Eager learning methods construct a general, explicit description of the target function
based on the provided training examples.

• Lazy learning methods simply store the data; generalizing beyond these data is
postponed until an explicit request is made.

• Lazy learning methods can construct a different approximation to the target function
for each encountered query instance.

• Eager learning methods use the same approximation to the target function, which
must be learned based on training examples and before input queries are observed.

• Lazy learning is very suitable for complex and incomplete problem domains, where a
complex target function can be represented by a collection of less complex local
approximations.

Maja Pantic Machine Learning (course 395)


k-Nearest Neighbour Learning

The main idea behind k-NN learning is so-called majority voting.

Maja Pantic Machine Learning (course 395)




k-Nearest Neighbour Learning

• Given the target function V: X → C and a set of n already observed instances (xi, cj),
where xi ∈ X, i = [1..n], cj ∈ C, j = [1..m], V(xi) = cj, the k-NN algorithm will decide the
class of the new query instance xq based on its k nearest neighbours (previously
observed instances) xr, r = [1..k], in the following way:
V(xq) ← cl ∈ C ↔ (∀ j ≠ l) ∑r E(cl, V(xr)) > ∑r E(cj, V(xr)) where
E(a, b) = 1 if a = b and E(a, b) = 0 if a ≠ b
(see the code sketch below)

• The nearest neighbours of a query instance xq are usually defined in terms of standard
Euclidean distance:

de (xi, xq) = √{∑g (ag(xi) – ag(xq))²}

where the instances xi, xq ∈ X are described with a set of g = [1..p] arguments ag

Maja Pantic Machine Learning (course 395)
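A minimal Python sketch of the majority-voting k-NN classifier defined above; the function names (euclidean, knn_classify) and the toy data are illustrative assumptions, not part of the lecture:

from collections import Counter
from math import sqrt

def euclidean(a, b):
    # d(x_i, x_q) = sqrt( sum_g (a_g(x_i) - a_g(x_q))^2 )
    return sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

def knn_classify(query, examples, k=3):
    """Majority vote among the k training examples nearest to `query`.
    `examples` is a list of (attribute_vector, class_label) pairs."""
    neighbours = sorted(examples, key=lambda xc: euclidean(xc[0], query))[:k]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]

# toy usage: two classes in a 2-D attribute space
examples = [((1.0, 1.0), 'A'), ((1.2, 0.8), 'A'), ((4.0, 4.2), 'B'), ((3.8, 4.0), 'B')]
print(knn_classify((1.1, 1.0), examples, k=3))   # -> 'A'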


k-Nearest Neighbour Learning
• Distance between two instances xi, xq ∈ X, described with a set of g = [1..p]
arguments ag, can be calculated as:

– City-block (Manhattan) distance (L1-norm):
d1 (xi, xq) = ∑g |ag(xi) – ag(xq)|

– Euclidean distance (L2-norm):
d2 (xi, xq) = √{∑g (ag(xi) – ag(xq))²}

– Chebyshev distance (L∞-norm):
d∞ (xi, xq) = maxg |ag(xi) – ag(xq)|

(see the code sketch below)

Maja Pantic Machine Learning (course 395)
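The three distance measures written out as small Python functions (a sketch; the function names are mine):

def manhattan(a, b):   # L1-norm: sum of absolute coordinate differences
    return sum(abs(ai - bi) for ai, bi in zip(a, b))

def euclidean(a, b):   # L2-norm: square root of the sum of squared differences
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b)) ** 0.5

def chebyshev(a, b):   # L∞-norm: largest single coordinate difference
    return max(abs(ai - bi) for ai, bi in zip(a, b))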


k-Nearest Neighbour Learning
• For k = 1, the decision surface is a set of polygons (Voronoi diagram),
completely defined by previously observed instances (training examples).

Maja Pantic Machine Learning (course 395)


k-Nearest Neighbour Learning

• The nearest neighbours (previously observed instances) xr, r = [1..k], of a query
instance xq are defined based on a distance d(xr, xq) such as the Euclidean distance.

• A refinement of the k-NN algorithm: assign a weight wr to each neighbour xr of the
query instance xq based on the distance d(xr, xq), such that d(xr, xq)↓ ↔ wr↑.

• Distance-weighted k-NN algorithm: Given the target function V: X → C and a set of
n already observed instances (xi, cj), where xi ∈ X, i = [1..n], cj ∈ C, j = [1..m], V(xi)
= cj, the distance-weighted k-NN algorithm will decide the class of the query instance xq
based on its k nearest neighbours xr, r = [1..k], in the following way:
V(xq) ← cl ∈ C ↔ (∀ j ≠ l) ∑r wr · E(cl, V(xr)) > ∑r wr · E(cj, V(xr)) where
E(a, b) = 1 if a = b, E(a, b) = 0 if a ≠ b, and
wr = 1 / d(xr, xq)²
(Any other measure favouring the votes of nearby neighbours will do, e.g. a Gaussian
kernel; see the code sketch below.)

Maja Pantic Machine Learning (course 395)
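A sketch of the distance-weighted variant, assuming instances are (attribute_vector, class_label) pairs and using wr = 1 / d(xr, xq)², with an exact match short-circuited to avoid division by zero:

from collections import defaultdict

def euclidean(a, b):
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b)) ** 0.5

def weighted_knn_classify(query, examples, k=5, distance=euclidean):
    # examples: list of (attribute_vector, class_label) pairs
    neighbours = sorted(examples, key=lambda xc: distance(xc[0], query))[:k]
    votes = defaultdict(float)
    for attrs, label in neighbours:
        d = distance(attrs, query)
        if d == 0.0:                  # query coincides with a stored instance
            return label
        votes[label] += 1.0 / d ** 2  # w_r = 1 / d(x_r, x_q)^2
    return max(votes, key=votes.get)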


k-Nearest Neighbour Learning: Remarks

• With the distance-weighted k-NN algorithm, the value of k is of minor importance, as
distant examples will have very small weights and will not greatly affect the value of
V(xq).

• If k = n, where n is the total number of previously observed instances, we call the
algorithm a global method. Otherwise, if k < n, the algorithm is called a local method.

• Advantage – The distance-weighted k-NN algorithm is robust to noisy training data: it
calculates V(xq) based on the weighted V(xr) values of all k nearest neighbours xr,
effectively smoothing out the impact of isolated noisy training data.

• Disadvantage – All k-NN algorithms calculate the distance between instances based
on all attributes → if there are many irrelevant attributes, instances that belong
together may still be distant from one another.

• Remedy – Weight each attribute differently when calculating the distance between two
instances (see the sketch below).

Maja Pantic Machine Learning (course 395)
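The remedy above could be implemented by scaling each attribute inside the distance; a minimal sketch, assuming the per-attribute weights are supplied (e.g. chosen by cross-validation or domain knowledge):

def weighted_euclidean(a, b, attr_weights):
    # attributes with weight 0 are effectively ignored in the distance
    return sum(w * (ai - bi) ** 2 for w, ai, bi in zip(attr_weights, a, b)) ** 0.5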


Locally Weighted Regression

• Locally weighted regression is the most general form of k-NN learning.
It constructs an explicit approximation to the target function V that fits the training
examples in the local neighbourhood of the query instance xq.

• Local – V is approximated based only on the data (neighbours) near xq.
Weighted – the contribution of each datum is weighted by its distance from xq.
Regression – refers to the problem of approximating a real-valued target function.

• Locally weighted regression:
target function: V: X → C (here real-valued),
target function approximation near xq: V’(xq) = w0 + w1 a1(xq) + … + wn an(xq),
where xq ∈ X is described with a set of j = [1..n] arguments aj
training examples: set of k nearest neighbours xr, r = [1..k], of the query instance xq,
learning problem: learn the optimal weights w given the set of training examples
learning algorithm (distance-weighted gradient descent training rule):
∆wj = η · {∑r K(d(xr, xq)) · (V(xr) – V’(xr)) · aj(xr)}
where K(d(xr, xq)) is a kernel function of the distance that determines the weight of xr
(see the sketch below)
Maja Pantic Machine Learning (course 395)
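A small sketch of locally weighted regression following the slide's update rule, applied incrementally per neighbour rather than as a batch sum; the Gaussian kernel, learning rate and epoch count are illustrative assumptions:

import math

def euclidean(a, b):
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

def gaussian_kernel(d, width=1.0):
    # K(d): closer neighbours get exponentially larger weight
    return math.exp(-(d ** 2) / (2 * width ** 2))

def locally_weighted_fit(query, neighbours, eta=0.01, epochs=200):
    """Fit V'(x) = w0 + w1*a1(x) + ... + wn*an(x) on the neighbours of `query`
    and return the prediction V'(query).
    `neighbours` is a list of (attribute_vector, target_value) pairs."""
    n = len(query)
    w = [0.0] * (n + 1)  # w[0] is the bias term w0

    def predict(x):
        return w[0] + sum(wj * xj for wj, xj in zip(w[1:], x))

    for _ in range(epochs):
        for x, v in neighbours:
            k_d = gaussian_kernel(euclidean(x, query))  # K(d(x_r, x_q))
            err = v - predict(x)                        # V(x_r) - V'(x_r)
            w[0] += eta * k_d * err                     # bias update, a_0(x) = 1
            for j in range(n):
                w[j + 1] += eta * k_d * err * x[j]      # delta-w_j rule from the slide
    return predict(query)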


Case Based Reasoning (CBR) – Schank’s Theory

• The work of Roger Schank, inspired by findings in cognitive sciences on human


reasoning and memory organization, is held to be the origin of CBR.

• Human knowledge about the world is organized in memory packets holding similar
concepts and/or episodes that one experienced.

• If a memory packet contains a situation when a problem was successfully solved and
the person experiences a similar situation, the previous experience is recollected and
the same steps are followed to reach a solution.

• Rather than following a general set of rules, reapplying previously successful solution
schemes in a new but similar context solves the newly encountered problems.

(following a general set of rules ≡ general approximation to the target function;
reapplying previous solution schemes ≡ local approximation to the target function)

Maja Pantic Machine Learning (course 395)


Case Based Reasoning (CBR) – Schank’s Theory


• Lazy learning is much closer to the human reasoning model than eager learning is.

Maja Pantic Machine Learning (course 395)


Case Based Reasoning (CBR) – Schank’s Theory

Schank’s memory-based reasoning model (based on similarity of cases):

– The memory of experiences is derived from enumeration of the observed cases,
which are stored further in memory organization packets.

– If problems occur to which no specific case can match exactly, reason from more
general similarities to come up with solutions (exact match → 1-NN; otherwise → k-NN,
with a distance measure defining similarity).
Note: the retrieval is almost never full breadth (exhaustive).

– The basis of the memory-based model is automatic (online) learning: the memory of
experiences is augmented by each novel experience (case), i.e., the process of
learning never ceases.
(This is the opposite of offline learning, typical for eager learning methods, where the
process of learning ceases when the training is completed.)

Maja Pantic Machine Learning (course 395)


Case Based Reasoning (CBR)

• Schank’s memory-based reasoning model is the underlying reasoning model of CBR.

• CBR is reasoning by remembering: previously solved cases are used to suggest
solutions for novel but similar problems.

[Diagram: problem space → solution space]
1. Search the memory for similar instances
2. Retrieve the related solutions (1- / k-NN)
3. Adapt the solutions to the current instance
4. Store the new case in the memory of experiences

Maja Pantic Machine Learning (course 395)




Case Based Reasoning (CBR) – Working Cycle

CBR working cycle:


1. RETRIEVE the most similar case(s).
2. REUSE the case(s) to suggest the
solution for the current case.
3. REVISE the suggested solution.
4. RETAIN the case by storing it in the
memory of experiences.

[Diagram: the four steps cycling around the case base; see the sketch below]

Maja Pantic Machine Learning (course 395)
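A bare-bones sketch of the RETRIEVE–REUSE–REVISE–RETAIN cycle as a Python class; the class and method names are mine, and cases are assumed to be (problem, solution) pairs retrieved with 1-NN:

class CaseBase:
    def __init__(self, distance):
        self.cases = []            # memory of experiences: (problem, solution) pairs
        self.distance = distance   # similarity is defined via a distance measure

    def retrieve(self, problem):                          # 1. RETRIEVE
        return min(self.cases, key=lambda c: self.distance(c[0], problem))

    def retain(self, problem, solution):                  # 4. RETAIN
        self.cases.append((problem, solution))

    def solve(self, problem, adapt=lambda sol, prob: sol, revise=lambda sol: sol):
        _, solution = self.retrieve(problem)
        solution = adapt(solution, problem)                # 2. REUSE (with adaptation)
        solution = revise(solution)                        # 3. REVISE
        self.retain(problem, solution)                     # 4. RETAIN
        return solution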


Case Based Reasoning (CBR) – System Design

• How should the cases be represented?

• How should the case base be organised?

• How should the indexing (assigning indexes to cases to facilitate their retrieval) be
defined?

• Which retrieval algorithm is to be used?

• Which (case base) adaptation algorithm is to be used?

Maja Pantic Machine Learning (course 395)


Case Based Reasoning (CBR) – Cases

• Cases contain knowledge about previous experiences (solved problems).


• A case is typically composed of the problem description and the problem solution.
• The classic guideline ‘the more information it stores, the more useful the case is’
should be applied cautiously.
• Problem description should contain enough data for an accurate and efficient case
retrieval. Useful info: retrieval statistics.
• Problem solution can be either atomic (e.g., an action) or compound (e.g., a sequence
of actions).
• Cases can be either monolithic (e.g., observation → action) or compound (e.g., a set
of observations → a sequence of actions; Note: parts can be processed separately).
• Cases can be represented in various ways: feature vectors, semantic nets, objects,
frames, rules...
Cases should be such that an accurate and efficient retrieval is facilitated.

Maja Pantic Machine Learning (course 395)


Case Based Reasoning (CBR) – Organisation

• Flat Case Base Organisation


The simplest case base organisation without any specific structure.
Case retrieval is based on case-by-case search.

• Clustered Case Base Organisation


Cases are stored in clusters of similar cases (as originally proposed by Schank).
Case retrieval includes finding the appropriate cluster(s) and searching through
it for similar cases.
Case addition / deletion algorithms are more complex than with flat organisation.

• Hierarchical Case Base Organisation


Cases that share features are grouped together.
A semantic network containing interlinked features and categories is used.
Cases are associated with categories.
Case retrieval is feature based. It is fast and accurate.
Reorganisation of the case base may be very complex and difficult.

Maja Pantic Machine Learning (course 395)


Case Based Reasoning (CBR) – Indexing

• Case indexing: assigning indexes to cases to facilitate efficient and accurate retrieval
of cases from the case base.

• Indexes are defined in terms of features / attributes of cases.

• Indexes should be:
– predictive of the case relevance (i.e., based on the most informative features)
– recognisable – it should be clear why they are used
– abstract enough to allow for widening of the case base
– discriminative enough to facilitate efficient and accurate case retrieval

(There is a trade-off between the generality and specificity of the hypotheses
(set of features) to be used for indexing.)

Maja Pantic Machine Learning (course 395)


Case Based Reasoning (CBR) – Retrieval

• The retrieval algorithm should retrieve the case(s) most similar to the currently
presented problem / situation.

• 1-NN (k-NN) search
A case-by-case search. Search is accurate but highly time consuming.

• 1-NN (k-NN) search through preselected cases
Uses the indexing structure of the case base to preselect the cases.
Then applies 1-NN or k-NN search. Faster than a simple case-by-case search,
and preferred as it results in faster retrieval and more accurate solutions.
It can happen that the best match is not among the preselected cases.

• 1-NN (k-NN) search through (preselected and) ranked cases
Uses the retrieval statistics to rank the cases.
Then applies the 1-NN or k-NN search (through preselected cases). Search is
faster than in the above-mentioned cases but not necessarily more accurate.

• A good retrieval algorithm strikes the best compromise between accuracy and
efficiency (see the sketch below).

Maja Pantic Machine Learning (course 395)
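A sketch of the second retrieval strategy (preselection through the indexing structure, then 1-NN among the preselected cases); the data layout, a feature-to-case index over (feature_set, solution) pairs, is an assumption for illustration:

def retrieve_with_index(query_features, index, cases, distance):
    """`index` maps a feature value to the indices of cases containing it;
    `cases` is a list of (feature_set, solution) pairs."""
    preselected = set()
    for f in query_features:
        preselected.update(index.get(f, ()))       # preselect via the indexing structure
    candidates = preselected if preselected else range(len(cases))  # fall back to full search
    best = min(candidates, key=lambda i: distance(cases[i][0], query_features))
    return cases[best]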


Case Based Reasoning (CBR) – Adaptation
• Adaptation algorithm adapts the solutions associated with the retrieved cases to the
currently presented problem / situation.

• Structural Adaptation
Applies a set of adaptation rules directly to the retrieved solutions.
Adaptation rules can include, e.g., modifying certain attributes through
interpolating between relevant attributes of the retrieved cases (see the sketch below).

• Derivational Adaptation
Uses the algorithms / rules that have been used to generate the original solution.
Can be used only for problem domains that are completely transparent.
Not used very often.

• Manual (User-driven) Adaptation
If no exact match is found, asks the user for feedback and adapts the solutions
accordingly. Faulty adaptations are thereby avoided. Used very often.

Maja Pantic Machine Learning (course 395)
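As an illustration of a structural adaptation rule, the sketch below interpolates a numeric solution attribute across the retrieved cases, weighting each case by its similarity to the query; the rule itself is hypothetical:

def interpolate_attribute(query, retrieved, attr, distance):
    """Weighted average of the numeric solution attribute `attr` over the
    retrieved cases; `retrieved` is a list of (problem, solution_dict) pairs."""
    weights = [1.0 / (1e-9 + distance(problem, query)) for problem, _ in retrieved]
    values = [solution[attr] for _, solution in retrieved]
    return sum(w * v for w, v in zip(weights, values)) / sum(weights)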


Lazy Learning – Advantages

• Incremental (online) learning: The problem-solving ability is increased with each
newly presented case.

• Suitability for complex and incomplete problem domains: A complex target function
can be described as a collection of less complex local approximations and unknown
classes can be learned.

• Suitability for simultaneous application to multiple problems: Examples are simply
stored and can be used for multiple problem-solving purposes.

• Ease of maintenance: A lazy learner adapts automatically to changes in the problem
domain.

Maja Pantic Machine Learning (course 395)


Lazy Learning – Disadvantages

• Handling very large problem domains: This implies high memory / storage
requirements and time-consuming search for similar examples.

• Handling highly dynamic problem domains: In CBR, this involves continuous
reorganisation of the case base, which may introduce errors in the case base.
Overall, the set of previously encountered examples may become outdated if a sudden
large shift in the problem domain occurs.

• Handling overly noisy data: Such data may result in storing the same problem numerous
times because of the differences in cases due to noise. In turn, this implies high
memory / storage requirements and a time-consuming search for similar examples.

• Achieving fully automatic operation: Fully automatic operation of a lazy learner can be
expected only for complete problem domains. Otherwise, user feedback is needed for
situations for which the learner has no solution.

Maja Pantic Machine Learning (course 395)


Instance Based Learning – Exam Questions

• Tom Mitchell’s book – chapter 8

• Relevant exercises from chapter 8: 8.3

• Case-Based Reasoning Syllabus

• To prepare assignment 3 of the CBC read:


Pantic & Rothkrantz (2004):
“CBR for user-profiled recognition of emotions from face images”

Maja Pantic Machine Learning (course 395)




Automatic Facial Expression Analysis

[Images of prototypic facial expressions: Anger, Surprise, Sadness, Disgust, Fear, Happiness]

Maja Pantic Machine Learning (course 395)




User-profiled Facial Expression Interpretation

“Could you please display an angry expression?”

“How would you interpret this? Happy? Angry? Teasing?”

Maja Pantic Machine Learning (course 395)


User-profiled Facial Expression Interpretation

cheeks raised (AU6)


smile (AU12)
lips parted (AU25)

Happy

AU6+AU12+AU25 = Happy
Maja Pantic Machine Learning (course 395)
Case Base Initialisation

AUs – Case explanation
1 – raised inner eyebrow
2 – raised outer eyebrow
1+2 – from “surprise”
4 – furrowed eyebrows
5 – raised upper eyelid
7 – raised lower eyelid
1+4+5+7 – from “fear”
…
1+4 – from “sadness”
…
9 – wrinkled nose
9+17 – from “disgust”
…
6+12 – from “happiness”
…

Maja Pantic Machine Learning (course 395)


Case Base Initialisation

AUs – User’s interpretation
1 – disappointed
2 – angry
1+2 – surprised
4 – angry
5 – please don’t
7 – angry
1+4+5+7 – please don’t
…
1+4 – disappointed
…
9 – slimy (yak!)
9+17 – slimy (yak!)
…
6+13 – happy
…

Maja Pantic Machine Learning (course 395)


Case Base Organisation

AUs – User’s interpretation
1 – disappointed
2 – angry
1+2 – surprised
4 – angry
5 – please don’t
7 – angry
1+4+5+7 – please don’t
…
1+4 – disappointed
…
9 – slimy (yak!)
9+17 – slimy (yak!)
…
6+13 – happy
…

Clusters:
label ‹angry›
cases ‹(2,4,0); (4,0); (7,0); … ; (24,0); (24,17,0)›
index ‹4, 7, 24, …›

Maja Pantic Machine Learning (course 395)


Retrieval

Clusters:
label ‹angry›
cases ‹(2,4,0); (4,0); (7,0); … ; (24,0); (24,17,0)›
index ‹4, 7, 24, …›
(see the sketch below)

Maja Pantic Machine Learning (course 395)
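A sketch of how such a clustered case base could be queried: the AU index preselects clusters, and the stored AU combination sharing the most AUs with the observation supplies the interpretation. The data structures loosely mirror the ‹label / cases / index› clusters on the slide (the trailing 0 of the slide notation is omitted); the helper names are mine:

clusters = {
    'angry': {'cases': [(2, 4), (4,), (7,), (24,), (24, 17)],   # stored AU combinations
              'index': {4, 7, 24}},
    # ... one entry per interpretation label
}

def interpret(observed_aus, clusters):
    observed = set(observed_aus)
    best_label, best_overlap = None, -1
    for label, cluster in clusters.items():
        if not (observed & cluster['index']):        # preselect clusters via the AU index
            continue
        for case in cluster['cases']:
            overlap = len(observed & set(case))      # simple similarity: shared AUs
            if overlap > best_overlap:
                best_label, best_overlap = label, overlap
    return best_label

print(interpret([4, 7], clusters))   # -> 'angry'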


Adaptation

User-profiled AU interpretation
[Diagram: problem space → solution space]
1. Search the Case Base for similar cases, retrieve them, and interpret the input set of
AUs using the interpretation labels suggested by the retrieved cases.
2. If the user is satisfied with the output, store the new case in the Case Base. Otherwise,
adapt the Case Base (i.e., store the new interpretation that the user associates with
the input facial expression).

Pantic and Rothkrantz, Proc. IEEE ICME’04


Maja Pantic Machine Learning (course 395)
