Instance Based Learning: 09s1: COMP9417 Machine Learning and Data Mining
Aims

This lecture will enable you to describe and reproduce machine learning
approaches within the framework of instance based learning. Following it
you should be able to:

• define instance based learning
• reproduce the basic k-nearest neighbour method
• describe locally-weighted regression
• describe radial basis function networks
• define case-based reasoning
• describe lazy versus eager learning

Introduction

• Simplest form of learning: rote learning
  – Training instances are searched for the instance that most closely
    resembles the new instance
  – The instances themselves represent the knowledge
  – Also called instance-based learning
• The similarity function defines what is “learned”
• Instance-based learning is lazy learning
• Methods: nearest-neighbour, k-nearest-neighbour, IBk, . . .
Distance function

• Simplest case: one numeric (continuous) attribute
  – Distance is the difference between the two attribute values involved
    (or a function thereof)
• Several numeric attributes: normally, Euclidean distance is used and
  attributes are normalized
• Nominal (discrete) attributes: distance is set to 1 if values are different,
  0 if they are equal
• Generalised distance functions: can use discrete and continuous
  attributes
• Are all attributes equally important?
  – Weighting the attributes might be necessary

Nearest Neighbour

Stores all training examples ⟨xi, f(xi)⟩.

Nearest neighbour:

• Given query instance xq, first locate nearest training example xn, then
  estimate fˆ(xq) ← f(xn)

k-Nearest neighbour:

• Given xq, take vote among its k nearest neighbours (if discrete-valued
  target function)
• take mean of f values of k nearest neighbours (if real-valued)

    \hat{f}(x_q) \leftarrow \frac{\sum_{i=1}^{k} f(x_i)}{k}
k-Nearest Neighbour algorithm

Training algorithm:

• For each training example ⟨xi, f(xi)⟩, add the example to the list
  training_examples.

Classification algorithm:

• Given a query instance xq to be classified,
  – let x1, . . . , xk denote the k instances from training_examples that
    are nearest to xq
  – return

    \hat{f}(x_q) \leftarrow \arg\max_{v \in V} \sum_{i=1}^{k} \delta(v, f(x_i))

  where δ(a, b) = 1 if a = b and 0 otherwise.

[Figure: positive (+) and negative (−) training instances in a
two-dimensional instance space, with query point xq.]
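To make the algorithm above concrete, here is a minimal Python sketch of
unweighted k-NN (not part of the original slides); the function names
knn_classify and knn_regress and the choice of Euclidean distance over
numeric attributes are illustrative assumptions.

from collections import Counter
import math

def euclidean(a, b):
    # distance between two numeric feature vectors
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

def knn_classify(training_examples, x_q, k=3):
    """Unweighted k-NN: majority vote among the k nearest neighbours.

    training_examples is a list of (x_i, f_x_i) pairs, x_q the query instance.
    """
    # sort stored examples by distance to the query and keep the k nearest
    neighbours = sorted(training_examples,
                        key=lambda ex: euclidean(ex[0], x_q))[:k]
    # return the most common target value among the neighbours (argmax over v)
    votes = Counter(f_x for _, f_x in neighbours)
    return votes.most_common(1)[0][0]

def knn_regress(training_examples, x_q, k=3):
    # real-valued target: mean of the f values of the k nearest neighbours
    neighbours = sorted(training_examples,
                        key=lambda ex: euclidean(ex[0], x_q))[:k]
    return sum(f_x for _, f_x in neighbours) / k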
Distance function again

The distance function defines what is learned . . .

• Most commonly used is Euclidean distance
  – instance x described by a feature vector (list of attribute-value pairs)
• Many other distances, e.g. Manhattan or city-block (sum of absolute
  values of differences)

The idea of distance functions will appear again in kernel methods.
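For reference, the two distances mentioned above written out for instances
described by n numeric attributes a1(x), . . . , an(x); the explicit formulas
are not on the original slide, the notation follows the later slides.

    d_{\mathrm{Euclidean}}(x_i, x_j) = \sqrt{\sum_{r=1}^{n} \bigl(a_r(x_i) - a_r(x_j)\bigr)^2}

    d_{\mathrm{Manhattan}}(x_i, x_j) = \sum_{r=1}^{n} \bigl|a_r(x_i) - a_r(x_j)\bigr|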
When To Consider Nearest Neighbour

• Slow at query time: basic algorithm scans entire training data to derive
  a prediction
• Assumes all attributes are equally important, so easily fooled by
  irrelevant attributes
  – Remedy: attribute selection or weights
• Problem of noisy instances:
  – Remedy: remove from data set (not easy)

The method rests on an assumption that the classification of query instance
xq will be most similar to the classification of other instances that are
nearby according to the distance function.

k-NN uses terminology from statistical pattern recognition (see below)

• Regression: approximating a real-valued target function
• Residual: the error fˆ(x) − f(x) in approximating the target function
• Kernel function: function of distance used to determine the weight of
  each training example, i.e. the kernel function is the function K s.t.
  wi = K(d(xi, xq))
Distance-Weighted kNN

• Might want to weight nearer neighbours more heavily . . .
• Use distance function to construct a weight wi
• Replace the final line of the classification algorithm by:

    \hat{f}(x_q) \leftarrow \arg\max_{v \in V} \sum_{i=1}^{k} w_i \, \delta(v, f(x_i))

  where

    w_i \equiv \frac{1}{d(x_q, x_i)^2}

  and d(xq, xi) is the distance between xq and xi

For real-valued target functions replace the final line of the algorithm by:

    \hat{f}(x_q) \leftarrow \frac{\sum_{i=1}^{k} w_i f(x_i)}{\sum_{i=1}^{k} w_i}

Now we can consider using all the training examples instead of just k

→ using all examples (i.e. when k = n) with the rule above is called
Shepard’s method
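A minimal Python sketch of the real-valued distance-weighted rule above (not
from the original slides); the function name distance_weighted_predict and the
convention that k=None means "use all examples" (Shepard’s method) are
illustrative assumptions.

import math

def euclidean(a, b):
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

def distance_weighted_predict(training_examples, x_q, k=None):
    """Distance-weighted kNN for a real-valued target.

    With k=None all training examples are used, i.e. Shepard's method.
    Weights are w_i = 1 / d(x_q, x_i)^2; an exact match returns f(x_i) directly.
    """
    ranked = sorted(training_examples, key=lambda ex: euclidean(ex[0], x_q))
    if k is not None:
        ranked = ranked[:k]
    num, den = 0.0, 0.0
    for x_i, f_x in ranked:
        d = euclidean(x_i, x_q)
        if d == 0.0:            # query coincides with a stored instance
            return f_x
        w = 1.0 / d ** 2
        num += w * f_x
        den += w
    return num / den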
Curse of Dimensionality

Bellman (1960) coined this term in the context of dynamic programming

Imagine instances described by 20 attributes, but only 2 are relevant to
the target function

Curse of dimensionality: nearest neighbour is easily misled when X is
high-dimensional – the problem of irrelevant attributes

One approach:

• Stretch the jth axis by weight zj, where z1, . . . , zn are chosen to
  minimize prediction error
• Use cross-validation to automatically choose the weights z1, . . . , zn
• Note setting zj to zero eliminates this dimension altogether

See Moore and Lee (1994) “Efficient Algorithms for Minimizing Cross
Validation Error”

Instance-based methods (IBk)

• attribute weighting: class-specific weights may be used (can result in
  unclassified instances and multiple classifications)
• get Euclidean distance with weights (see the sketch after this list):

    \sqrt{\sum_r w_r \bigl(a_r(x_q) - a_r(x)\bigr)^2}

• Updating of weights based on nearest neighbour
  – Class correct/incorrect: weight increased/decreased
  – |ar(xq) − ar(x)| small/large: amount of change large/small
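A sketch of the weighted distance and weight update just described (not from
the original slides); the update rule below is an illustrative reading of the
bullet points, not the exact IBk scheme, and it assumes attributes scaled to
[0, 1].

import math

def weighted_euclidean(x_q, x, weights):
    # per-attribute weights w_r scale each squared difference
    return math.sqrt(sum(w * (aq - a) ** 2
                         for w, aq, a in zip(weights, x_q, x)))

def update_weights(weights, x_q, x_nearest, correct, eta=0.1):
    """Attribute-weight update sketch after classifying x_q via x_nearest.

    Weights are increased when the prediction was correct and decreased
    otherwise; the amount of change is larger when |a_r(x_q) - a_r(x)| is small.
    """
    new_weights = []
    for w, aq, a in zip(weights, x_q, x_nearest):
        similarity = 1.0 - abs(aq - a)      # assumes attributes scaled to [0, 1]
        delta = eta * similarity
        new_weights.append(w + delta if correct else max(0.0, w - delta))
    return new_weights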
Locally Weighted Regression

Use kNN to form a local approximation to f for each query point xq using
a linear function of the form

    \hat{f}(x) = w_0 + w_1 a_1(x) + \cdots + w_n a_n(x)

where ar(x) denotes the value of the rth attribute of instance x

Recall a global method of learning a function . . .

Minimizing squared error summed over the set D of training examples:

    E \equiv \frac{1}{2} \sum_{x \in D} \bigl(f(x) - \hat{f}(x)\bigr)^2
Going from a global to a local approximation there are several choices of
error to minimize:

1. Squared error over k nearest neighbours

    E_1(x_q) \equiv \frac{1}{2} \sum_{x \in \text{k nearest nbrs of } x_q} \bigl(f(x) - \hat{f}(x)\bigr)^2

2. Distance-weighted squared error over all neighbours

    E_2(x_q) \equiv \frac{1}{2} \sum_{x \in D} \bigl(f(x) - \hat{f}(x)\bigr)^2 \, K(d(x_q, x))

3. Combine 1 and 2

    E_3(x_q) \equiv \frac{1}{2} \sum_{x \in \text{k nearest nbrs of } x_q} \bigl(f(x) - \hat{f}(x)\bigr)^2 \, K(d(x_q, x))

Gives local training rule:

    \Delta w_r = \eta \sum_{x \in \text{k nearest nbrs of } x_q} K(d(x_q, x)) \bigl(f(x) - \hat{f}(x)\bigr) a_r(x)

Note: use more efficient training methods to find the weights.

Atkeson, Moore, and Schaal (1997) “Locally Weighted Learning.”
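As a concrete illustration (not from the slides), here is a minimal sketch of
locally weighted linear regression that minimizes a criterion like E2 in
closed form using weighted least squares, rather than the gradient rule above;
the Gaussian kernel, the function names and the NumPy implementation are
assumptions.

import numpy as np

def gaussian_kernel(d, sigma=1.0):
    # K(d) decreases as distance d increases
    return np.exp(-d ** 2 / (2 * sigma ** 2))

def lwr_predict(X, y, x_q, sigma=1.0):
    """Locally weighted linear regression prediction at query point x_q.

    X is an (m, n) array of training instances, y the (m,) array of targets.
    Solves the distance-weighted least-squares problem in closed form.
    """
    m = X.shape[0]
    # augment with a constant attribute for the intercept w_0
    Xa = np.hstack([np.ones((m, 1)), X])
    xqa = np.concatenate([[1.0], x_q])
    # kernel weights from distance to the query point
    d = np.linalg.norm(X - x_q, axis=1)
    K = np.diag(gaussian_kernel(d, sigma))
    # weighted normal equations: (Xa^T K Xa) w = Xa^T K y
    w = np.linalg.solve(Xa.T @ K @ Xa, Xa.T @ K @ y)
    return xqa @ w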
Radial Basis Function Networks

• Global approximation to target function, in terms of a linear combination
  of local approximations
• Used, e.g., for image classification
• A different kind of neural network
• Closely related to distance-weighted regression, but “eager” instead of
  “lazy”

[Figure: a radial basis function network — inputs a1(x), a2(x), . . . , an(x),
a layer of kernel units, and a linear output f(x) with weights w0, w1, . . . , wk.]
In the diagram ar(x) are the attributes describing instance x.

The learned hypothesis has the form:

    f(x) = w_0 + \sum_{u=1}^{k} w_u K_u(d(x_u, x))

where each xu is an instance from X and the kernel function Ku(d(xu, x))
decreases as distance d(xu, x) increases.

One common choice for Ku(d(xu, x)) is

    K_u(d(x_u, x)) = e^{-\frac{1}{2\sigma_u^2} d^2(x_u, x)}

i.e. a Gaussian function.

Training Radial Basis Function Networks

Q1: What xu (subsets) to use for each kernel function Ku(d(xu, x))

• Scatter uniformly throughout instance space
• Or use training instances (reflects instance distribution)
• Or prototypes (found by clustering)

Q2: How to train weights (assume here Gaussian Ku)

• First choose variance (and perhaps mean) for each Ku
  – e.g., use EM
• Then hold Ku fixed, and train linear output layer
  – efficient methods to fit linear function
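A minimal sketch (not from the slides) of the two-stage procedure just
described: pick kernel centres from the training instances, fix Gaussian
widths, then fit the linear output layer by least squares. The function names,
the random choice of centres standing in for a proper clustering step, and the
single shared sigma are assumptions.

import numpy as np

def _design_matrix(X, centres, sigma):
    # column of ones for w_0, then one Gaussian kernel column per centre x_u
    d = np.linalg.norm(X[:, None, :] - centres[None, :, :], axis=2)
    K = np.exp(-d ** 2 / (2 * sigma ** 2))
    return np.hstack([np.ones((len(X), 1)), K])

def fit_rbf(X, y, k=10, sigma=1.0, seed=0):
    """Train a simple RBF network: Gaussian kernels + linear output layer.

    Centres x_u are a random subset of the training instances; sigma is fixed
    for all kernels. Returns (centres, weights) with weights = [w_0, ..., w_k].
    """
    rng = np.random.default_rng(seed)
    centres = X[rng.choice(len(X), size=k, replace=False)]
    Phi = _design_matrix(X, centres, sigma)
    # hold the kernels fixed and fit the linear output layer by least squares
    weights, *_ = np.linalg.lstsq(Phi, y, rcond=None)
    return centres, weights

def predict_rbf(X, centres, weights, sigma=1.0):
    return _design_matrix(X, centres, sigma) @ weights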
Dealing with noise

• Excellent way: cross-validation-based k-NN classifier (but slow)
• Different approach: discarding instances that don’t perform well by
  keeping success records (IB3)
  – Computes confidence intervals for the instance’s success rate and for
    the default accuracy of its class
  – If the lower limit of the first interval is above the upper limit of the
    second one, the instance is accepted (IB3: 5%-level)
  – If the upper limit of the first interval is below the lower limit of the
    second one, the instance is rejected (IB3: 12.5%-level)

Case-Based Reasoning

Can apply instance-based learning even when X ≠ ℝⁿ

→ need different “distance” metric

Case-Based Reasoning is instance-based learning applied to instances with
symbolic logic descriptions

((user-complaint error53-on-shutdown)
 (cpu-model PowerPC)
 (operating-system Windows)
 (network-connection PCIA)
 (memory 48meg)
 (installed-applications Excel Netscape VirusScan)
 (disk 1gig)
 (likely-cause ???))
[Figure: CADET example — a stored case represented as a qualitative function
graph relating the flows Qc, Qh, Qm and temperatures Tc, Th, Tm, with + and −
labels on the arcs.]
Case-Based Reasoning in CADET

• Instances represented by rich structural descriptions
• Multiple cases retrieved (and combined) to form solution to new problem
• Tight coupling between case retrieval and problem solving

Bottom line:

• Simple matching of cases useful for tasks such as answering help-desk
  queries
• Area of ongoing research

Summary: Lazy vs Eager Learning

Lazy: wait for query before generalizing

• k-Nearest Neighbour, Case-based reasoning

Eager: generalize before seeing query

• Radial basis function networks, ID3, Backpropagation, NaiveBayes, . . .

Does it matter?

• Eager learner must create global approximation
• Lazy learner can create many local approximations
• if they use the same H, lazy can represent more complex functions
  (e.g., consider H = linear functions)