Foundations of Machine Learning
Ajay Nagesh
Contents
1 Basic notions and Version Space
1.1 ML: Definition
1.2 Structure of hypotheses space: Version Space
1.3 Exercise
1.4 Version Space (cont.)
1.5 Bias
2 Course Outline
3 Decision Trees
3.1 Exercise
3.2 DTree Construction: splitting attribute
3.3 Splitting attribute: Impurity
3.4 Exercise: Illustration of impurity
4 Probability Primer
4.1 Basic Definitions
4.2 The three axioms of probability
4.3 Bayes' Theorem
4.4 Independent events
4.5 Probability Mass Function (pmf) and Probability Density Function (pdf)
4.6 Expectation
4.7 Exercise
6 Probability Distributions
6.1 Bernoulli Random Variable
6.2 Binomial Random Variable
6.3 Central Limit Theorem
7 Hypothesis Testing
7.1 Basic ideas
7.2 Example and some notation
7.3 Statistical hypothesis testing in decision trees
8 Estimation
8.1 Maximum Likelihood Estimation (MLE)
9 Optimization
9.1 Optimization for multiple variables
9.2 Linear Algebra
9.3 KKT conditions
9.4 Necessary and Sufficient Conditions: Summary
9.5 Duality
12 Unsupervised Learning
12.1 Objective function (Likelihood)
12.2 Expectation-Maximization Algorithm
13 Clustering
13.1 Similarity Measures for Clustering
13.2 Objective Function
13.3 k-means Algorithm
13.4 Variants of k-means
13.5 MDL principle and clustering
13.6 Hierarchical clustering
13.7 Miscellaneous topics
16 Discriminative Classification
16.1 Maximum Entropy models
17 Graphical Models
19 Regression
19.1 Motivating Example: Curve Fitting
19.2 Linear regression and method of least squares error
19.3 Regularised solution to regression
19.4 Linear Regression: Drawbacks
19.5 Possible solutions
Lecture 2: Introduction
Instructor: Ganesh Ramakrishnan Date: 26/07/2011
Computer Science & Engineering Indian Institute of Technology, Bombay
Consider the sample dataset iris.arff. Given a set of observations of Iris flowers (like sepal length, width), our goal is to predict whether a given flower is Iris-Setosa or Iris-Versicolor ...
Mapping the definition to the Iris flower-type prediction problem:
E: Observations on Iris flowers (sepal length, width, ...)
T: Identify the type of Iris flower
P: Accuracy
Consider the cooked-up dataset shown in Table 1. Our goal is to find a hypothesis for class C1.
If our hypothesis language is only a conjunction of atomic statements (i.e. conjunctions of statements of the form x.attr = value or x.attr = ?), then the version space for this example is empty. In other words, we cannot find a hypothesis belonging to the hypothesis language we have defined such that all the positive examples are covered and none of the negative examples are covered. (However, note that if we decide to include negation (¬), then we can find a satisfying hypothesis for class C1: ¬(Y, N, ?) ⇒ C1.)
Now consider the hypothesis h1 = (?, ?, N) ⇒ C1. What is P(h1)? Suppose our performance measure is defined as follows:

P = |{x | h(x) = cl(x) = C1} ∪ {x | h(x) ≠ C1 & cl(x) ≠ C1}| / |{x}| = 3/4

In short, this performance measure counts all the instances of C1 that are correctly classified by the hypothesis, together with all the instances that are not C1 and are not classified as C1 by the hypothesis.
Table 1:
      F1   F2   F3   Class
D1    Y    Y    N    C1
D2    N    N    N    C1
D3    Y    Y    Y    C1
D4    Y    N    Y    C2
Ideally, we search for the hypothesis that maximizes P(h), i.e. arg max_{h ∈ H} P(h).
We can observe from diagram 1 that the version space becomes empty after seeing D4. If we expand our hypothesis language (for instance, to also include disjunctions of conjunctions), we can construct version spaces that are not empty.
In the lecture we saw three main types of extensions to the hypothesis language:
1. (Y ∧ Y ∧ ?) ∨ (N ∧ N ∧ ?) ⇒ C1. This, in the class, was termed a lazy learner. It can also be called a disjunctive normal form (DNF) learner.
2. (? ∧ ? ∧ (N ∨ O)) ⇒ C1 (if we change (D3, F3) to O).
3. (attribute1 = attribute2) ⇒ C1
Some Observations
1. The version space may be empty. Generally, we cannot always find one hypothesis that will explain everything. So, as an approximation, we try to maximize our performance measure P.
2. Hypothesis language: there could be a bias in h. In our previous example, h1 consisted only of a conjunction of atomic statements and we were avoiding disjunctions.
3. Active Learning: this is a learning technique where the machine prompts the user (an oracle who can give the class label given the features) to label an unlabeled example. The goal here is to gather as differentiating (diverse) an experience as possible. In a way, the machine should pick those examples which challenge the existing hypothesis learnt by it.
For instance, if the machine has seen 2 examples, {Y, Y, Y} ⇒ C1 and {Y, N, N} ⇒ C1, then it should ask the user about {N, Y, Y} ⇒ ?.
1.3 Exercise
Construct version spaces for the 3 different hypothesis language extensions listed above along the lines of diagram 3, for the same dataset (Table 1).
Further Reading
1. Machine Learning, Tom Mitchell, McGraw Hill, 1997. (http://www.cs.cmu.edu/~tom/
mlbook.html). Chapters 1 and 2.
2. Datasets: http://storm.cis.fordham.edu/~gweiss/data-mining/datasets.html (iris.arff,
soybean.arff)
CS 725 : Foundations of Machine Learning Autumn 2011
Notation
A small change in notation, to ensure conformity with the material to be covered in the future and ease of understanding. Previously we had denoted the hypothesis by h, features by x.feature_name (where x was the example data) and class labels by c1, c2, .... From here onwards, we will denote features by φ(xi), the class label by y(xi) and the hypothesis by f. So our data with its features will be as shown in Table 2.
Our objective is to maximize P(f): we search the hypothesis space H for the hypothesis f that maximizes the performance P. In other words,

arg max_{f ∈ H} P(f)

The version space for the fourth hypothesis language is as shown in Figure 4. A version space is usually represented by a lattice (a partially ordered set in which the greatest lower bound and least upper bound are defined for every pair of elements).
1.5 Bias
Bias B is the set of assumptions we make about the target function f. Some of the effects of bias are as follows:
... ambiguity). However, in the larger version space (Figure 4), it is covered by 9 out of 13 hypotheses and not covered by the remaining 4 (there is more ambiguity in this case).
This ambiguity can actually be captured through the concept of variance, which is what we will look at in greater detail when we talk about the bias-variance dilemma (in our discussions on probabilistic models).
2 Course Outline
Given a bias B and the resulting version space V.S(B, D), the central question in machine learning is: which f to pick? Depending on how we do this, there are a host of techniques. Some of the classification techniques that we cover in the course are shown in Figure 5.
3 Decision Trees
The bias in a decision tree is as shown in Figure 6.
Some characteristics / considerations:
1. Each example will take a definite path. (There is no ambiguity.)
2. Which φi to pick?
For a binary classification problem, we have p(⊕|x) + p(⊖|x) = 1. For the vote.arff example (⊕ = republican, ⊖ = democrat), p(⊕|x) = 169/(268+169) and p(⊖|x) = 268/(268+169).
We need to "code up" the information needed to classify an example. Using information theory, we get the following equation:

−p(⊕|x) log2 p(⊕|x) − p(⊖|x) log2 p(⊖|x) = Ep(x)

This is the amount of uncertainty associated with the class (also known as the entropy Ep(x)). What we are interested in is the relative change in entropy given a feature's value, E_{p_f}(x), as shown in Figure 7.
In other words, use the attribute that gives the maximum decrease in uncertainty. This measure is also called Information Gain.
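As a quick illustration (not from the lecture), the following minimal Python sketch computes the entropy of a binary class distribution and the information gain of a hypothetical split; all counts are made up.

import math

def entropy(pos, neg):
    # -p log2 p - q log2 q for a binary class distribution given by counts
    total = pos + neg
    h = 0.0
    for count in (pos, neg):
        if count > 0:
            p = count / total
            h -= p * math.log2(p)
    return h

def information_gain(parent, children):
    # parent, children: (pos, neg) counts before and after a candidate split
    n = sum(parent)
    before = entropy(*parent)
    after = sum((sum(c) / n) * entropy(*c) for c in children)
    return before - after

# Toy numbers: 9 positive / 5 negative examples, split by a hypothetical
# binary feature into branches with counts (6,1) and (3,4).
print(entropy(9, 5))                               # about 0.94 bits
print(information_gain((9, 5), [(6, 1), (3, 4)]))  # about 0.15 bits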
3.1 Exercise
Is the version space in Figure 4 complete? If not, complete it.
Further Reading
1. Lattice: http://www.cse.iitb.ac.in/~cs717/notes/classNotes/completeNotes.pdf (Use
the same username and passwd). Pages 6-11.
2. Decision Trees (applets with illustrative examples): http://webdocs.cs.ualberta.ca/~aixplore/
learning/DecisionTrees/
CS 725 : Foundations of Machine Learning Autumn 2011
Hypothesis space
It is a disjunction of conjunctions. With respect to the spect_train.arff dataset, we observed the decision tree learnt as shown in Figure 8. We can serialize the decision tree as

((F18 = 0) ∧ (F21 = 0) ⇒ class 0) ∨ ((F18 = 0) ∧ (F21 = 1) ∧ (OD = 0) ⇒ class 0) ∨ ...
4 Probability Primer
A review of some basics of probability theory.
Pr(Bi | A) = Pr(Bi ∩ A) / (Pr(B1 ∩ A) + Pr(B2 ∩ A) + · · · + Pr(Bn ∩ A))    (1)

Using the relation Pr(Bi ∩ A) = Pr(Bi) · Pr(A | Bi),

Pr(Bi | A) = Pr(Bi) · Pr(A | Bi) / Σ_{j=1}^{n} Pr(Bj) · Pr(A | Bj)    (2)
pX(a) = Pr(X = a)

f(a) = dF(x)/dx |_{x=a}

For the discrete case, i.e. p(x, y) is a joint pmf:

F(a, b) = Σ_{x ≤ a} Σ_{y ≤ b} p(x, y)
Marginalization
The marginal probability is the unconditional probability P(A) of the event A; that is, the probability of A regardless of whether event B did or did not occur. If B can be thought of as the event of a random variable X having a given outcome, the marginal probability of A can be obtained by summing (or, more generally, integrating) the joint probabilities over all outcomes of X. For example, if there are two possible outcomes for X with corresponding events B and B′, this means that P(A) = P(A ∩ B) + P(A ∩ B′). This is called marginalization.
Discrete case: P(X = a) = Σ_y p(a, y)
Continuous case: PX(a) = ∫_{−∞}^{∞} p(a, y) dy
4.6 Expectation
Discrete case: expectation is a probability-weighted sum of the possible values.
E(X) = Σ_i xi Pr(xi), where X is a random variable
Continuous case:
E(X) = ∫_{−∞}^{∞} x p(x) dx
Properties of E(X):
E[X + Y] = E[X] + E[Y]
E[cX] = c E[X]
4.7 Exercise
Example 1. A lab test is 99% effective in detecting a disease when it is in fact present. However, the test also yields a false positive for 0.5% of the healthy patients tested. If 1% of the population has the disease, what is the probability that a person has the disease given that his/her test is positive?
Soln. Let H be the event that a tested person is actually healthy,
D be the event that a tested person does have the disease,
T be the event that the test comes out positive for a person.
We want to find Pr(D|T).
H and D are disjoint events. Together they form the sample space.
Using Bayes' theorem,
there is a 0.5% chance that the test will give a false positive for a healthy person; hence Pr(T|H) = 0.005.
Plugging these values into equation (5) we get

Pr(D|T) = (0.01 × 0.99) / (0.01 × 0.99 + 0.99 × 0.005) = 2/3

What does this mean? It means that there is a 66.66% chance that a person with a positive test result actually has the disease. For a test to be good we would have expected higher certainty. So, despite the fact that the test is 99% effective for a person actually having the disease, the false positives reduce the overall usefulness of the test.
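As a quick numeric check of the calculation above (a minimal Python sketch; the probabilities are the ones given in the example):

p_disease = 0.01             # Pr(D): prevalence
p_pos_given_disease = 0.99   # Pr(T|D): sensitivity
p_pos_given_healthy = 0.005  # Pr(T|H): false-positive rate
p_healthy = 1 - p_disease

numerator = p_pos_given_disease * p_disease
denominator = numerator + p_pos_given_healthy * p_healthy
print(numerator / denominator)   # 0.666..., i.e. 2/3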
CS 725 : Foundations of Machine Learning Autumn 2011
The second term in the above expression is the expected new impurity. V is a function which returns the split values for an attribute φi, so V(φi) can be varied for any φi: it could have many values or a range of values. An example is V(φi) = {1, 3, 5}, which translates to the split points φi < 1, 1 ≤ φi < 3, 3 ≤ φi < 5, φi ≥ 5.
The second term in the above expression is called ΔImp(D). The intuition for using this term is that, if the skew is greater, the denominator is lower, which is better at countering the lowered impurity. In other words, it prefers a less skewed tree, as shown in Figure 10.
5.3 Pruning
Simpler trees are preferred over their complex counterparts for the following reasons:
Trees can get unnecessarily deep and complex, so there are various strategies/heuristics to decrease the complexity of the learnt tree. Some of the options are as follows:
1. Early termination: stop if ΔImp(D) < θ, where θ is some threshold.
2. Majority class ≥ α%, for some value of α.
3. Pruning: the idea is to build complex trees and prune them. This is a good option, since the construction procedure can be greedy and does not have to look ahead. Some concepts of hypothesis testing are used to achieve this (for instance, the χ²-test).
4. Use an objective function like max_{φi} ΔImp(D, i) − Complexity(tree).
5.4 Exercise
Become familiar with Chebyshev's inequality, the law of large numbers, the central limit theorem and some concepts from linear algebra such as vector spaces.
CS 725 : Foundations of Machine Learning Autumn 2011
6 Probability Distributions
6.1 Bernoulli Random Variable
A Bernoulli random variable is a discrete random variable taking values 0, 1.
Say Pr[Xi = 0] = 1 − q, where q ∈ [0, 1].
Then Pr[Xi = 1] = q
E[X] = (1 − q) · 0 + q · 1 = q
Var[X] = q − q² = q(1 − q)
6.2 Binomial Random Variable
An example of the Binomial distribution is the distribution of the number of heads when a coin is tossed n times.
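As a quick sanity check (not part of the original notes), the Python sketch below simulates Bernoulli(q) draws and verifies E[X] = q and Var[X] = q(1 − q), and that the number of heads in n tosses (a sum of Bernoullis) has mean nq; the values of q and n are arbitrary.

import random

q, n, trials = 0.3, 10, 100_000
random.seed(0)

# Simulate Bernoulli(q) draws and check mean ~ q, variance ~ q(1-q).
draws = [1 if random.random() < q else 0 for _ in range(trials)]
mean = sum(draws) / trials
var = sum((d - mean) ** 2 for d in draws) / trials
print(mean, var)            # close to 0.3 and 0.21

# Number of heads in n tosses (sum of n Bernoullis) is Binomial(n, q).
heads = [sum(1 if random.random() < q else 0 for _ in range(n)) for _ in range(trials)]
print(sum(heads) / trials)  # close to n*q = 3.0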
Figure 11: Figure showing curve where Information is not distributed all along.
Question: Consider the two graphs above. Say you know the probability function p(x). When is knowing the value of X more useful (that is, carries more information)?
Ans: It is more useful in case (2), because more information is conveyed in Figure 11 than in Figure 12.
I(p(x)) < I(p(y))
There is only one function which satisfies the above two properties.
The entropy in the case of a discrete random variable can be defined as:

E_P[I(p(x))] = −c Σ_x p(x) log[p(x)]    (9)
Observations:
For a discrete random variable (with countable domain), the information is maximum for the uniform distribution.
For a continuous random variable (with finite mean and finite variance), the information is maximum for the Gaussian distribution.
∫ x p(x) dx = µ
and
∫ (x − µ)² p(x) dx = σ²
The solution would be

p(x) = (1/(σ√(2π))) e^{−(x−µ)²/(2σ²)}
Recall (in terms of the moment generating function φ):
E(X) = dφ(t)/dt |_{t=0}
E(X²) = d²φ(t)/dt² |_{t=0}, hence Var(X) = E(X²) − (E(X))²

E_{N(µ,σ²)}[e^{t(w1 x + w0)}] = e^{(w1 µ + w0)t + (σt)² w1²/2}, i.e. w1 X + w0 ∼ N(w1 µ + w0, w1² σ²)

The sum of i.i.d. X1, X2, ..., Xn ∼ N(µ, σ²) is also normal (Gaussian):
X1 + X2 + · · · + Xn ∼ N(nµ, nσ²)
In general, if Xi ∼ N(µi, σi²) then Σ_{i=1}^{n} Xi ∼ N(Σ µi, Σ σi²)
(take w1 = 1/σ and w0 = −µ/σ to standardize)
Note: if X1, X2, ..., Xm ∼ N(0, 1) (i.i.d.), then
1. y = Σ_i Xi² ∼ χ²_m, i.e. y follows the chi-square distribution with m degrees of freedom.
2. y = z / √(Σ_i Xi²/m) ∼ t_m (where z ∼ N(0, 1)), i.e. y follows the Student's t distribution.
Figure 6.4: Figure showing the nature of the chi-square distribution with 5 degrees of freedom.
µ̂_MLE = argmax_µ ∏_{i=1}^{m} [ (1/(σ√(2π))) e^{−(Xi−µ)²/(2σ²)} ]
       = argmax_µ (1/(σ√(2π)))^m e^{−Σ_i (Xi−µ)²/(2σ²)}

µ̂_MLE = (Σ_{i=1}^{m} Xi)/m = sample mean

(This holds without relying on the central limit theorem, Properties (2) and (1).)
Similarly,

σ̂²_MLE = Σ_{i=1}^{m} (Xi − µ̂_MLE)²/m, which (up to scaling) is χ²-distributed: m σ̂²_MLE/σ² ∼ χ²_{m−1}
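As a small numerical illustration (not from the lecture), the sketch below draws a toy Gaussian sample with numpy, computes the MLEs above, and checks that the log-likelihood at the sample mean is no worse than at a nearby value; the data and parameters are made up.

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(loc=2.0, scale=1.5, size=1000)   # toy sample, true mu=2, sigma=1.5

mu_mle = X.mean()                        # sample mean
sigma2_mle = ((X - mu_mle) ** 2).mean()  # MLE of the variance (divides by n)

def log_likelihood(mu, sigma2):
    return -0.5 * np.sum((X - mu) ** 2) / sigma2 - 0.5 * len(X) * np.log(2 * np.pi * sigma2)

print(mu_mle, sigma2_mle)
print(log_likelihood(mu_mle, sigma2_mle) >= log_likelihood(mu_mle + 0.1, sigma2_mle))  # True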
⇒ µ ∼ N(µ0, σ0²)
⇒ σ² ∼ Γ
7 Hypothesis Testing
7.1 Basic ideas
Given a random sample of a random variable X = (X1, X2, ..., Xn), we define a function S : (X1, X2, ..., Xn) → R. This function is called a statistic (or sample statistic).
For instance, the sample mean is Σ_i Xi / n and the sample variance is Σ_i (Xi − Σ_j Xj/n)² / (n − 1).
Given X and two hypotheses H0 and H1 defined by
H0 : (X1, X2, ..., Xn) ∈ C
H1 : (X1, X2, ..., Xn) ∉ C
where C is some tolerance limit, also called the confidence region. It is generally defined in terms of some statistic.
The following types of errors are defined as a consequence of the above hypotheses. They are:
Given a significance level α (some bound on the Type I error), we want to find a C such that

Pr_{H0}({X1, X2, ..., Xn} ∉ C) ≤ α
Dirac-delta function
Here we represent the statistic Yi as

Yi = Σ_{j=1}^{n} δ(Xj, i)
Random vectors
In this representation, each value of X is represented by a one-hot random vector: value x1 corresponds to (1, 0, ..., 0), ..., and value xk corresponds to (0, 0, ..., 1) (the rows of the matrix below).

M = [ 1 0 ... 0
      ...
      0 0 ... 1 ]

Then [i] is the i-th element of the vector obtained by summing the individual random vectors.
A goodness of fit test
Let the hypothesis H0 be defined as: the distance between the sample average and the expected average is within some tolerance. Let µi be the probability that Xj = i, i.e. µi = pi. Then the expectation E(Yi) = nµi and H0 is expressed mathematically as:

H0 : t = Σ_{i=1}^{k} (Yi − nµi)²/(nµi) ≤ c
If the ratio of the separation of tuples after the split remains the same as (or similar to) that before the split, then we might not gain much by splitting that node. To quantify this idea, we use the concept of significance testing. The idea is to compare 2 probability distributions.
Let us consider a 2-class classification problem. If p is the probability of taking the left branch, then the probability of taking the right branch is 1 − p. Then we obtain the following:
n11 = p·n1
n21 = p·n2
n12 = (1 − p)·n1
n22 = (1 − p)·n2
Consider the original ratio (also called the reference distribution) of positive tuples to the total number of tuples. If the same ratio is obtained after the split, we will not be interested in such splits, i.e.

n1/(n1 + n2) = n11/(n11 + n21)
or in general

nj/(n1 + n2) = nji/(n1i + n2i)    ∀j over the classes and i = 1, 2
Such splits only add to the complexity of the tree and do not convey any meaningful information not already present. Suppose we are interested not only in equal distributions but also in approximately equal distributions, i.e.

n1/(n1 + n2) ≈ n11/(n11 + n21)

The idea is to compare two probability distributions. It is here that the concept of hypothesis testing is used.
The ratio n1/(n1 + n2) is called the reference distribution. In general, it is

p(cj) = µj = nj / Σ_i ni    for a given class j (pre-splitting distribution)
Hypothesis testing: problem
Let X1, ..., Xn be i.i.d. random samples; for our example, the class labels of the instances that have gone into the left branch. The representation of these random variables can be either of the 2 types discussed in the last class. As per the random vector representation, the statistic is Yj = Σ_{i=1}^{n} Xi[j]. As per the Dirac-delta representation, the statistic is Yj = Σ_{i=1}^{n} δ(Xi, j). The statistic for the example is the number of instances in the left branch that have class j.
The hypotheses:
H0 : X1, ..., Xn ∈ C
H1 : X1, ..., Xn ∉ C
Under H0, the distribution of samples in the left branch is the same as before splitting, i.e. µ1, ..., µk.
Given a random sample, we need to test our hypothesis: given an α ∈ [0, 1], we want to determine a C such that

Pr_{H0}({X1, ..., Xn} ∉ C) ≤ α    (Type I error)

That is, given an α ∈ [0, 1], the probability that we decide that the pre-split distribution and the left-branch distribution are different, when in fact they are similar, is less than or equal to α.
[Currently we are not very much interested in the Type II error, i.e. Pr_{H1}({X1, ..., Xn} ∈ C).]
Here, C is the set of all possible "interesting" random samples. Also,

Pr_{H0}({X1, ..., Xn} ∉ C′) ≤ Pr_{H0}({X1, ..., Xn} ∉ C)    ∀C′ ⊇ C

We are interested in the "smallest" / "tightest" C. This is called the critical region Cα. Consequently,

Pr_{H0}({X1, ..., Xn} ∉ Cα) = α
C = { (X1, ..., Xn) | Σ_{j=1}^{k} (Yj − E_{H0}(Yj))² / E_{H0}(Yj) ≤ c },  where c is some constant
  = { (X1, ..., Xn) | Σ_{j=1}^{k} (Yj − nµj)² / (nµj) ≤ c }

As we have seen before, the above statistic ∼ χ²_{k−1}. We then use the chi-square tables to find c given the value of α.
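As an illustration of this test (not from the lecture), the sketch below computes the statistic for made-up left-branch counts Yj against a made-up reference distribution µj, and compares it with the chi-square critical value; it assumes numpy and scipy are available.

import numpy as np
from scipy.stats import chi2

mu = np.array([0.6, 0.4])   # reference (pre-split) distribution mu_j, j = 1..k
Y = np.array([50, 50])      # observed class counts Y_j in the left branch
n = Y.sum()

t = np.sum((Y - n * mu) ** 2 / (n * mu))   # the statistic from the text
c = chi2.ppf(1 - 0.05, df=len(mu) - 1)     # critical value c for alpha = 0.05, chi^2_{k-1}

# Reject H0 (distributions are the same) when t exceeds c.
print(t, c, t > c)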
8 Estimation
In estimation, we try to determine the probability values (until now we used to take the ratio as the probability). Essentially, our µj should actually be denoted by µ̂j, since it is an estimate of the actual probability µj (which we do not know).
The question posed in estimation is: how do we determine µ̂j?
If {X1, ..., Xn} are the class labels from which we want to estimate µ, then

(µ̂1, ..., µ̂k) = arg max_{µ1,...,µk} ∏_{i=1}^{n} ∏_{j=1}^{k} µj^{δ(Xi,j)}    s.t. Σ_{i=1}^{k} µi = 1
This is also known as the maximum likelihood estimator (MLE) for the multivariate Bernoulli random variable. For k = 2,

= arg max_{µ1,µ2} ∏_{i=1}^{n} µ1^{δ(Xi,1)} µ2^{δ(Xi,2)}
= arg max_{µ1} µ1^{Σ_i δ(Xi,1)} (1 − µ1)^{Σ_i δ(Xi,2)}
Further Reading
Read section 4.1.3 from the convex optimization notes
(http://www.cse.iitb.ac.in/~cs725/notes/classNotes/BasicsOfConvexOptimization.pdf )
CS 725 : Foundations of Machine Learning Autumn 2011
such that Σ_i µi = 1 ; 0 ≤ µi ≤ 1, for i = 1 ... k
9 Optimization
Refer to the optimization theory notes (http://www.cse.iitb.ac.in/~cs725/notes/classNotes/BasicsOfConvexOptimization.pdf). All references to sections and theorems will be with respect to these notes, denoted for instance as CoOpt, 4.1.1.
For some basic definitions of optimization, see sections CoOpt, 4.1.1 and CoOpt, 4.1.2. To find the maximum/minimum values of an objective function, see section CoOpt, 4.1.3 from Theorem 38 to Theorem 56. A set of systematic procedures to find the optimal value is listed in Procedures 1, 2, 3 and 4 (CoOpt, pages 10, 16, 17).
For our optimization problem, the optimum value occurs when f′(µ_ML) = 0, which implies

Σ_{i=1}^{n} δ(Xi, V1)/µ_ML − Σ_{i=1}^{n} δ(Xi, V2)/(1 − µ_ML) = 0
⇒ µ_ML = Σ_{i=1}^{n} δ(Xi, V1) / (Σ_{i=1}^{n} δ(Xi, V1) + Σ_{i=1}^{n} δ(Xi, V2))

However, f(0) = −∞ and f(1) = −∞. From this we can conclude that the maximizer is indeed

µ_ML = Σ_{i=1}^{n} δ(Xi, V1) / n
We can come to the same conclusion through another path, that of CoOpt, Theorem 55. The reasoning is as follows. Let the log-likelihood objective function be denoted by LL(µ). Then

LL″(µ) = −Σ_{i=1}^{n} δ(Xi, V1)/µ² − Σ_{i=1}^{n} δ(Xi, V2)/(1 − µ)²  < 0,  µ ∈ (0, 1)
So the minimum value is at 0 or 1, since LL″(µ) < 0 ∀µ ∈ [0, 1], which implies that if LL′(µ̃) = 0 for some µ̃ ∈ [0, 1], then µ̃ is argmax LL.
The more general LL objective function has the constraints Σ_i µi = 1, 0 ≤ µi ≤ 1, for i = 1 ... k; this constraint set geometrically forms what is known as a simplex (http://en.wikipedia.org/wiki/Simplex).
CS 725 : Foundations of Machine Learning Autumn 2011
arg max_{µ1...µk} LL(µ1 ... µk | x1 ... xn) = arg max_{µ1...µk} Σ_{i=1}^{n} Σ_{j=1}^{k} δ(Xi, Vj) log µj
                                            = arg max_{µ1...µk} Σ_{j=1}^{k} nj log µj,  where nj = Σ_{i=1}^{n} δ(Xi, Vj)

such that 0 ≤ µi ≤ 1 and Σ_i µi = 1
n1 loge µ1 + n2 loge µ2 = c
⇒ µ2 = e^{(c − n1 loge µ1)/n2}
The directional derivative along the standard axes for the sample objective function is

∇LL = [ ∂LL/∂µ1 , ∂LL/∂µ2 ]ᵀ = [ n1/µ1 , n2/µ2 ]ᵀ
CoOpt, Theorem 61, defines the notion of the Hessian matrix, given as follows:

∇²LL = [ ∂²LL/∂µ1²     ∂²LL/∂µ2∂µ1   ...   ∂²LL/∂µk∂µ1
         ∂²LL/∂µ1∂µ2   ∂²LL/∂µ2²     ...
         ...                         ...   ∂²LL/∂µk²   ]

     = diag( −n1/µ1², −n2/µ2², ..., −nk/µk² )
CoOpt, Theorem 62 and Corollary 63 list the necessary and sufficient conditions for local maxima (resp. minima) as follows (reproduced here for convenience, please refer to the CoOpt notes for details and proof):
Necessary Conditions
∇f(x*) = 0
∇²f(x*) ⪯ 0    (∇²f(x*) ⪰ 0 for minima, resp.)
Sufficient Conditions
∇f(x*) = 0
∇²f(x*) ≺ 0    (∇²f(x*) ≻ 0 for minima, resp.)
Some points to note here: if ∇²f(x*) is neither ⪰ 0 nor ⪯ 0, then x* is called a saddle point. Also, to visualize, negative definiteness implies upward (dome-like) curvature of the function surface.
However, for our given function the gradient does not vanish. Then how do we find the maximum value of our objective function? Let us take a relook at the method for finding the maximum value. Some considerations:
Example
Let k = 4. The objective function is
n1 ln µ1 + n2 ln µ2 + n3 ln µ3 + n4 ln µ4
such that
µ1 + µ2 + µ3 + µ4 = 1 and µi ∈ [0, 1]
In step 1 we calculate ∇f. In step 2, we find the boundary relative to R³, where some µx = 1 or µx = 0. So we get the following equations:
µ1 + µ2 + µ3 = 1
µ4 = 1
µ1 + µ2 + µ4 = 1
µ3 = 1
...
In step 3, we find boundaries relative to R² and so on. As an exercise, complete this example.
Consider level curves of f(x). If x* ∈ local arg max f(x) subject to the constraints, we have the following observations:
If ∇f has a component perpendicular to the direction of ∇g, then we can move along g(x) = 0 while increasing the value of f above f(x*).
This is in violation of the assumption that x* is a local argmax.
From this we can conclude that if x* is a local argmax, then ∇f has no component perpendicular to ∇g. Therefore, at x*,

g(x*) = 0
∇f(x*) = λ ∇g(x*)

The components of ∇f(x*) perpendicular to the space spanned by {∇gi(x*)}_{i=1,...,m} should be 0, i.e.

∇f(x*) = Σ_{i=1}^{m} λi ∇gi(x*)    (linear combination)
gi(x*) = 0 , ∀i = 1, ..., m
P1 (constraint active):
g(x*) = 0
∇f(x*) = λ ∇g(x*),  λ ≥ 0
P2 (constraint inactive):
g(x*) < 0
∇f(x*) = 0
Both cases can be summarized as:
∇f(x*) − λ ∇g(x*) = 0
λ ≥ 0
λ g(x*) = 0
Exercise
For the log-likelihood function (our objective function), compute the above form of the equation
(Lagrangian)
CS 725 : Foundations of Machine Learning Autumn 2011
For the log-likelihood function f = Σ_{j=1}^{k} nj log µj, we get the following necessary condition for optimality (stationarity of the Lagrangian):

[ n1/µ1, ..., nk/µk ]ᵀ − λ ∇(Σ_i µi − 1) − Σ_i αi ∇(−µi) − Σ_i βi ∇(µi − 1) = 0

with the constraints and associated multipliers
gi ≤ 0 :  −µi ≤ 0   ~ αi
g_{k+j} ≤ 0 :  µj − 1 ≤ 0   ~ βj
h = 0 :  Σ_{j=1}^{k} µj − 1 = 0   ~ λ
s.t. gi(µ) ≤ 0
     h(µ) = 0
We will have the necessary conditions for optimality as follows (also known as the Karush-Kuhn-Tucker or KKT conditions):

1.  −nj/µj − αj + βj + λ = 0,   for each j = 1, ..., k
2.  −µ̂j ≤ 0,  µ̂j − 1 ≤ 0
3.  αj ≥ 0,  βj ≥ 0
4.  −αj µ̂j = 0,  βj(µ̂j − 1) = 0
5.  Σ_{j=1}^{k} µ̂j − 1 = 0

From 4., we have µ̂j ≠ 0 ⇒ αj = 0. Also from 4., if no nj = 0 then βj = 0. Substituting into 1. gives µ̂j = nj/λ, and from 5., Σ_j nj/λ = 1 ⇒ λ = Σ_j nj, so

µ̂j = nj / Σ_j nj

We still need to verify that µ̂j = nj / Σ_j nj is indeed the globally optimal value. For this we have to use CoOpt, Theorem 82, which provides the sufficient conditions for global optimality.
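As a small numeric sanity check of the closed form (not part of the original notes), the sketch below uses made-up counts nj, computes µ̂j = nj/Σj nj, and confirms the log-likelihood there is no worse than at a nearby feasible point on the simplex.

import numpy as np

n = np.array([5.0, 3.0, 2.0])   # made-up counts n_j
mu_hat = n / n.sum()            # closed-form estimate n_j / sum_j n_j

def ll(mu):
    return np.sum(n * np.log(mu))

perturbed = np.array([0.45, 0.35, 0.20])      # another point on the simplex
print(mu_hat, ll(mu_hat) >= ll(perturbed))    # [0.5 0.3 0.2], True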
Convex Analysis
For the 1-dim case and α ∈ [0, 1], the condition for convexity is f(αx + (1 − α)y) ≤ α f(x) + (1 − α) f(y).
For the n-dim case, x, y and ∇f are vectors (so denoted in bold). The condition for convexity is

f(αx + (1 − α)y) ≤ α f(x) + (1 − α) f(y)

For detailed definitions of convex sets, affine sets, convex functions and examples of the same, refer to CoOpt, Section 4.2.3 onwards.
9.5 Duality
Dual function
Let L be the primal (Lagrangian) function given by:

L(µ, α, β, λ) = −Σ_{j=1}^{k} nj log µj + λ(Σ_i µi − 1) − Σ_i αi µi + Σ_i βi (µi − 1)

Here, µ is called the primal variable and α, β, λ are called the dual variables. The dual function of L is given by:

L*(α, β, λ) = min_µ L(µ, α, β, λ)

This dual objective function is always concave. Also note the following:
For a concave problem: local max = global max
For a convex problem: local min = global min
Duality Gap
Consider the primal problem:

p* = min −Σ_j nj log µj    s.t. µi ∈ [0, 1],  Σ_i µi = 1

Let d* be the optimal value of the corresponding dual problem (maximizing L* over α ≥ 0, β ≥ 0, λ). We can show that p* ≥ d*, and the expression p* − d* is called the duality gap.
Exercise
Prove the functions in CoOpt, Table 4 are convex.
CS 725 : Foundations of Machine Learning Autumn 2011
Xi : data tuple that belongs to class cj with probability Pr(cj) (class prior)
φ1(Xi), φ2(Xi), ..., φm(Xi) : set of m features on the data tuple
[V^1_1 ... V^1_{k1}] [V^2_1 ... V^2_{k2}] ... [V^m_1 ... V^m_{km}] : set of values taken by each feature function
[µ^1_{1,j} ... µ^1_{k1,j}] [µ^2_{1,j} ... µ^2_{k2,j}] ... [µ^m_{1,j} ... µ^m_{km,j}] : set of parameters of the distribution that characterizes the values taken by a feature for a particular class

Xi : a particular document
cj : the document is categorized as a sports document
φ1 : some word form of hike; the corresponding values V^1_i can be hikes, hiking, ...
φk : presence or absence of the word race; here the values are binary (km = 2)
By the naive Bayes assumption, which states that the features are conditionally independent given the class, we have

Pr(φ1(Xi), ..., φm(Xi) | cj) = ∏_{l=1}^{m} Pr(φl(Xi) | cj)

As the name suggests, this is a naive assumption, especially if there is lots of training data. However, it works very well in practice. If we use a lookup table over all the feature values together, we have [k1 ∗ k2 ∗ · · · ∗ km ∗ (#classes)] parameters to estimate. In contrast, in naive Bayes we only need on the order of [(k1 + k2 + · · · + km) ∗ (#classes)] parameters.
Misclassification Risk
A risk is a function which we try to minimize in order to assign the best class label to a datapoint. The treatment of risk minimization is related to decision theory. This function R takes as arguments a policy pol and the datapoint Xi. For the misclassification risk, the policy is that the point came from ci but the algorithm assigned it class cj. We have the following expression:

R(pol, Xi) = Σ_{j=1}^{|c|} pol(cj) (1 − Pr(cj | Xi))
The maximum likelihood estimators are denoted by [µ̂_ML, P̂r_ML(cj)]. They are calculated as follows:

[µ̂_ML, P̂r(cj)] = arg max_{µ, Pr(c)} ∏_{i=1}^{n} Pr(c(Xi)) ∗ ∏_{l=1}^{m} Pr(φl(Xi) | c(Xi))
                = arg max_{µ, Pr(c)} ∏_{j=1}^{|c|} [ Pr(cj)^{#cj} ∗ ∏_{l=1}^{m} ∏_{p=1}^{kl} (µ^l_{p,j})^{n^l_{p,j}} ]

Taking logs, this becomes

arg max_{µ, Pr(c)} Σ_{j=1}^{|c|} [ (#cj) log Pr(cj) + Σ_{l=1}^{m} Σ_{p=1}^{kl} n^l_{p,j} log(µ^l_{p,j}) ]    (11)

Pr(cj) ∈ [0, 1] ∀j
µ^l_{p,j} ∈ [0, 1] ∀p, l, j

Intuitively, working out the KKT conditions on the above objective function, we get the estimators as follows:

µ̂^l_{p,j} = n^l_{p,j} / Σ_{p′=1}^{kl} n^l_{p′,j}
P̂r(cj) = #cj / Σ_k #ck
However, there is a problem with the above estimators. To illustrate it, let us take the example of a pair of coins being tossed. Let coin1 be tossed 5 times and suppose we get heads all 5 times. Then, if the feature φ = heads, µ^1_{h,1} = 1 and µ^1_{t,1} = 0. But we know that ...
From the above parameters, the probability Pr(coin1 | φ(X) = t) = 0. This might be fine if we had made a million observations; however, in this case it is not correct.
We can modify Equation 11 and get a slightly modified objective function that can alleviate this problem in a hacky manner by considering:

arg max_{µ, Pr(c)} Σ_{j=1}^{|c|} [ (#cj) log Pr(cj) + Σ_{l=1}^{m} Σ_{p=1}^{kl} n^l_{p,j} log(µ^l_{p,j}) ] − Σ_{p,j,l} (µ^l_{p,j})²    (12)

The new term introduced in Equation 12 is called the regularization term. It is added to introduce some bias into the model.
A small note: suppose φl(Xi) ∼ Ber(µ^l_{1,c(Xi)}, ..., µ^l_{kl,c(Xi)}); then µ̂^l_{p,j} is a random variable, since it is a function of the Xi's, and it approaches a normal distribution as n → ∞.
Exercise
1. Work out the KKT conditions for Equation 11
2. Work out the KKT conditions for Equation 12
Further Reading
1. http://www.cs.cmu.edu/~tom/mlbook/NBayesLogReg.pdf
2. http://www.cse.iitb.ac.in/~cs725/notes/classNotes/misc/CaseStudyWithProbabilisticModels.
pdf, Sections 7.6.1, 7.7.1, 7.7.2, 7.7.3, 7.4, 7.7.4
3. http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.63.2111&rep=rep1&type=
pdf
4. Christopher Bishop, PRML (Pattern Recognition and Machine Learning), Sections 2.1.1,
2.2.1, 2.4.2 and 2.4.3
If Pr(µ) has a form such that Pr(µ|D) has the same form, we say that Pr(µ) is the conjugate prior to the distribution defining Pr(D|µ).
Some of the conjugate priors that we will see in the next class
Dirichlet and Multivariate Bernoulli
Beta and Bernoulli
Gaussian and Gaussian
CS 725 : Foundations of Machine Learning Autumn 2011
P(X_{n+1}, ..., X_{n+t} | µ) = ∏_{i=1}^{t} P(X_{n+i} | µ)

P(X_{n+1} | X1..Xn) = ∫ P(X_{n+1} | µ) P(µ | X1..Xn) dµ = ∫ µ^{X_{n+1}} (1 − µ)^{1−X_{n+1}} P(µ | X1..Xn) dµ
 = E[µ]      if X_{n+1} = 1
 = 1 − E[µ]  if X_{n+1} = 0

Beta(µ1, µ2 | a, b) = [(a + b − 1)! / ((a − 1)!(b − 1)!)] µ1^{a−1} µ2^{b−1}

Dir(µ1, µ2, ..., µk | a1, a2, ..., ak) = [Γ(Σ_j aj) / ∏_j Γ(aj)] ∏_i µi^{ai−1}
2 lecture scribed by Amrita and Kedhar with inputs from Ajay Nair
Assume X is the outcome of a coin toss. Let X1 = 0 (TAILS, say), X2 = 1, X3 = 0, X4 = 1, X5 = 1. We are interested in predicting the event X6 = 1 given the above. This can be calculated by different approaches. The ML, MAP and Bayes estimators are called pseudo-Bayes, and the fully Bayesian estimator is called pure Bayes.
Maximum likelihood
µ̂_ML is the estimated probability of X = 1 from the data.
P(X6 | X1..X5) = µ̂_ML^{X6} (1 − µ̂_ML)^{1−X6}
MAP
µ̂_MAP is the estimated probability of X = 1 from the data.
P(X6 | X1..X5) = µ̂_MAP^{X6} (1 − µ̂_MAP)^{1−X6}
Bayes Estimator
µ̂_Bayes is the estimated probability of X = 1 from the data.
P(X6 | X1..X5) = µ̂_Bayes^{X6} (1 − µ̂_Bayes)^{1−X6}
Bayesian method

P(X6 | X1..X5) = ∫_0^1 µ^{X6} (1 − µ)^{1−X6} P(µ | X1..X5) dµ

The explanation for this equation is as follows:

P(X6 | X1..X5) = ∫ P(X6 | µ, X1..X5) P(µ | X1..X5) P(X1..X5) dµ / P(X1..X5)

Thus

P(X6 | X1..X5) = ∫_0^1 µ^{X6} (1 − µ)^{1−X6} P(µ | X1..X5) dµ
CS 725 : Foundations of Machine Learning Autumn 2011
The setting for Naive Bayes with a Dirichlet prior on the multivariate Bernoulli distribution is as follows:
For each data point Xi which belongs to class cj there is a set of m features given by φ1(Xi), ..., φl(Xi), ..., φm(Xi) | cj.
Each of these features has a probability distribution with parameters µ^1_j, ..., µ^l_j, ..., µ^m_j and priors p(µ^1_j), ..., p(µ^m_j); the parameters are to be determined from the data.
Final Estimators
For the problem under consideration, the final estimators are as follows:

[µ̂^l_{p,j}]_MLE = n^l_{p,j} / Σ_{q=1}^{kl} n^l_{q,j}

[µ̂^l_{p,j}]_Bayes = E(µ | X1 ... Xn) = (n^l_{p,j} + a^l_{p,j}) / Σ_{q=1}^{kl} (n^l_{q,j} + a^l_{q,j})    (also called Laplace smoothing)

[µ̂^l_{p,j}]_MAP = arg max_µ p(µ | X1 ... Xn) = (n^l_{p,j} + a^l_{p,j} − 1) / Σ_{q=1}^{kl} (n^l_{q,j} + a^l_{q,j} − 1)
Essentially, we are trying to find p(x|D), but we approximate it by p(x|µ̂), and we determine various types of estimators µ̂ for this, like MLE, MAP and Bayes.
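As a minimal illustration of the MLE and smoothed (Laplace/Bayes) estimates for one discrete feature (not from the lecture; the tiny labelled dataset and the pseudo-count a are made up):

from collections import Counter, defaultdict

# Toy labelled data: (feature value, class).
data = [("h", "c1"), ("h", "c1"), ("t", "c1"), ("h", "c2"), ("t", "c2"), ("t", "c2")]
values = ["h", "t"]
a = 1  # Dirichlet pseudo-count a_{p,j} (Laplace smoothing)

counts = defaultdict(Counter)
for v, c in data:
    counts[c][v] += 1

for c, cnt in counts.items():
    n_c = sum(cnt.values())
    mle = {v: cnt[v] / n_c for v in values}
    bayes = {v: (cnt[v] + a) / (n_c + a * len(values)) for v in values}
    print(c, mle, bayes)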
So finally we have

p(x|D) = ∫_µ ∏_{i=1}^{k} µi^{δ(x,Vi)} Dir(µ | n1 + a1, ..., nk + ak) dµ1 ... dµk    (13)

µ̂_MLE is the loosest approximation from a Bayesian perspective (since there is no prior in MLE). For instance, in the coin toss example, if we don't see tails then p(tails) = 0 under MLE. In practice, µ̂_MAP and µ̂_Bayes are the most important estimators.
Gaussian
Let us consider the first case, of the Xi's following a normal distribution. The pdf of a normal distribution is

p(x | µ, σ²) = (1/(σ√(2π))) e^{−(x−µ)²/(2σ²)}

Given D = {X1, ..., Xn}, we have

LL(µ, σ² | D) = log ∏_{i=1}^{n} [ (1/(σ√(2π))) e^{−(Xi−µ)²/(2σ²)} ]    (14)
= −Σ_{i=1}^{n} (Xi − µ)²/(2σ²) − n log(σ√(2π))

{µ̂_ML, σ̂²_ML} = arg max_{µ,σ²} [ −Σ_{i=1}^{n} (Xi − µ)²/(2σ²) − n log σ − n log √(2π) ]

The critical question is whether the objective function in Equation 14 is concave. As an exercise, compute the Hessian of Eq. 14 and check whether it is negative semi-definite. Once this is ascertained, we can apply the KKT conditions for optimality:

∇LL = [ Σ_{i=1}^{n} (Xi − µ̂)/σ̂² ,  Σ_{i=1}^{n} (Xi − µ̂)²/σ̂³ − n/σ̂ ]ᵀ = [0, 0]ᵀ

⇒ Σ_{i=1}^{n} Xi = n µ̂_MLE, so

µ̂_MLE = Σ_i Xi / n ∼ N(µ, σ²/n)

Also

⇒ Σ_{i=1}^{n} (Xi − µ̂)²/σ̂³ = n/σ̂, so

σ̂²_MLE = Σ_{i=1}^{n} (Xi − µ̂)²/n ∼ α χ²_{n−1}    (for a scaling constant α)
p(µ | X1..Xn) ∝ [ ∏_{i=1}^{n} e^{−(Xi−µ)²/(2σ²)} ] e^{−(µ−µ0)²/(2σ0²)}
             ∝ e^{−(µ−µn)²/(2σn²)}    (this can be proved)

where  µn = [σ²/(nσ0² + σ²)] µ0 + [nσ0²/(nσ0² + σ²)] µ̂_MLE
and    1/σn² = 1/σ0² + n/σ²
Exercise
1. Compute the form of Equation 13
2. Compute the Hessian of Equation 14. Prove that the function in Eq. 14 is concave.
3. Derive the expression for (σ̂jl )M AP in the Naive Bayes with independent Gaussians’ setting.
CS 725 : Foundations of Machine Learning Autumn 2011
p(c_{j1}|x) = p(c_{j2}|x) for some two classes c_{j1} and c_{j2}
⇒ p(x|c_{j1}) p(c_{j1}) = p(x|c_{j2}) p(c_{j2})
⇒ ∏_{l=1}^{m} ∏_{p=1}^{kl} (µ^l_{p,j1})^{δ(φl(x),V^l_p)} p(c_{j1}) = ∏_{l=1}^{m} ∏_{p=1}^{kl} (µ^l_{p,j2})^{δ(φl(x),V^l_p)} p(c_{j2})
⇒ Σ_{l=1}^{m} Σ_{p=1}^{kl} δ(φl(x), V^l_p) [ log µ^l_{p,j1} − log µ^l_{p,j2} ] + log(p(c_{j1})/p(c_{j2})) = 0    (taking logs on both sides)

If φl ∈ {0, 1},

Σ_{l=1}^{m} Σ_{p=1}^{kl} φl(x) [ log µ^l_{p,j1} − log µ^l_{p,j2} ] + log(p(c_{j1})/p(c_{j2})) = 0
l=1
p(c_{j1}|x) = p(c_{j2}|x) for some two classes c_{j1} and c_{j2}
⇒ p(x|c_{j1}) p(c_{j1}) = p(x|c_{j2}) p(c_{j2})
⇒ ∏_{l=1}^{m} (1/(σ^l_{j1}√(2π))) e^{−(φl(x)−µ^l_{j1})²/(2(σ^l_{j1})²)} p(c_{j1}) = ∏_{l=1}^{m} (1/(σ^l_{j2}√(2π))) e^{−(φl(x)−µ^l_{j2})²/(2(σ^l_{j2})²)} p(c_{j2})
⇒ Σ_{l=1}^{m} [ −(φl(x)−µ^l_{j1})²/(2(σ^l_{j1})²) + (φl(x)−µ^l_{j2})²/(2(σ^l_{j2})²) ] + log(p(c_{j1})/p(c_{j2})) + Σ_{l=1}^{m} log(σ^l_{j2}/σ^l_{j1}) = 0
The decision surface is quadratic in φ’s. For the same set of points as in Figure 14, the decision
surface for Naive Bayes with Gaussians is as shown in Figure 15. A very clear decision surface does
not emerge for this case unlike the case for decision trees. For another set of points the decision
surface looks as shown in Figure 16.
N(x | µ, Σ) = (1/((2π)^{m/2} |Σ|^{1/2})) exp( −(x−µ)ᵀ Σ⁻¹ (x−µ)/2 )
In general, if Σ = diag(σ1², σ2², ..., σm²), then

N(x | µ, Σ) = ∏_{i=1}^{m} N(xi | µi, σi²)
Figure 16: Decision Surface of Naive Bayes with Gaussians (another set of points)
So, φ(Xi) represents a vector. The MLE for µ and Σ, assuming the data came from the same multivariate Gaussian N(µ, Σ), is

{µ̂, Σ̂} = arg max_{µ,Σ} ∏_{i=1}^{n} (1/((2π)^{m/2} |Σ|^{1/2})) exp( −(φ(xi)−µ)ᵀ Σ⁻¹ (φ(xi)−µ)/2 )
        = arg max_{µ,Σ} Σ_{i=1}^{n} [ −(φ(xi)−µ)ᵀ Σ⁻¹ (φ(xi)−µ)/2 ] − (n/2) log |Σ| − (mn/2) log(2π)    (15)
Exercise
1. Find ∇LL for equation 15 w.r.t [µ1 . . . µm Σ11 . . . Σ1m Σ21 . . . Σmm ]T
2. Were θ̂_ML, the parameters of the Bernoulli, multivariate Bernoulli and Gaussian, unbiased?
Further Reading
1. Equivalence of symmetric and non-symmetric matrices: Chapter 2, PRML, Christopher
Bishop
2. Unbiased Estimator: http://en.wikipedia.org/wiki/Bias_of_an_estimator
The nature of the separating surface is linear. When we equate the following terms, xᵀΣ⁻¹x gets cancelled.
2. Consider a ring of radius r = 1 as shown in Figure 19b. Now fix a small value ε and consider a second ring of radius 1 − ε. In 2 dimensions, the ratio of the area of the annulus to the entire area is (1² − (1 − ε)²)/1² = 2ε − ε². In m dimensions it is (1^m − (1 − ε)^m)/1^m. As m grows, the volume of the annulus grows more than in lower dimensions.
12 Unsupervised Learning
So far we have been studying supervised learning, wherein during training we are also provided with the class label of each example: the data is of the form D = (xi, class(xi)). We also assumed the generative model of classification, i.e. the class labels generate the examples. This is depicted in Figure 20.
Now, we consider another type of learning in the same generative-model setting. Here the class labels are not known during training. This is called unsupervised learning, so the data is of the form D = (xi). We will look at clustering, a type of unsupervised learning.
Here we will learn from a set of data points x1, x2, ..., xn whose class labels are not known.
In the above equation, we are tempted to move the log inside the summation. To achieve this, we have to find a bound on the log-likelihood. We try to find a lower bound, since our original intention is to maximize LL; if we maximize a lower bound on LL we are better off.
Let us rewrite the LL function in terms of some reference distribution q ≠ 0 as follows:

LL = Σ_{i=1}^{n} log Σ_{j=1}^{|C|} q(cj|xi) [ p(xi, cj)/q(cj|xi) ] ≥ Σ_{i=1}^{n} Σ_{j=1}^{|C|} q(cj|xi) log [ p(xi, cj)/q(cj|xi) ]    (16)

LL ≥ Σ_{i=1}^{n} Σ_{j=1}^{|C|} q(cj|xi) log [ p(xi, cj)/q(cj|xi) ] = Σ_{i=1}^{n} Σ_{j=1}^{|C|} q(cj|xi) log p(xi, cj) − Σ_{i=1}^{n} Σ_{j=1}^{|C|} q(cj|xi) log q(cj|xi)    (18)

The first term is the expected log p(xi, cj) under the q distribution and is written as E_{q(cj|xi)}[LL]. The second term is the entropy and is written as H_q.
Some observations on the log-likelihood and its lower bound:
1. LL has params θj = {µj, Σj} and p(cj). LL is not convex in either of them.
2. The lower bound of LL has params q(cj|xi), p(cj) and the params of p(xi|cj). It is not convex in any of these parameters.
So neither the lower bound of LL nor LL is convex, and we can only find a local optimum. We next look at the procedure for finding a local optimum of this objective.
1. E-step (Expectation Step): here we obtain the distribution q for which the expectation is maximum:

q̂(cj|xi) = arg max_{q(cj|xi)} [ E_q(LL) + H_q ] = p(cj|xi)    (by Eq. 17)

2. M-step (Maximization Step): with q fixed, re-estimate the parameters:

θ̂j, p̂(cj) = arg max_{θj, p(cj)} Σ_{i=1}^{n} Σ_{j=1}^{|C|} q(cj|xi) log p(xi, cj)

For the mixture of Gaussians this becomes

p̂(cj), µ̂j, Σ̂j = arg max Σ_{i=1}^{n} Σ_{j=1}^{|C|} q(cj|xi) log ( p(cj) (1/((2π)^{d/2}|Σj|^{1/2})) e^{−(xi−µj)ᵀ Σj⁻¹ (xi−µj)/2} )
Exercise
Prove the following parameter estimates obtained from the steps of the EM algorithm for the mixture of Gaussians:
1. µ̂j = Σ_{i=1}^{n} q(cj|xi) xi / Σ_{i=1}^{n} q(cj|xi)
2. Σ̂j = Σ_{i=1}^{n} q(cj|xi) (xi − µ̂j)ᵀ(xi − µ̂j) / Σ_{i=1}^{n} q(cj|xi)
3. p̂(cj) = (1/n) Σ_{i=1}^{n} q(cj|xi)
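To make the E-step and M-step concrete, here is a minimal EM sketch for a 1-D mixture of two Gaussians (not the lecture's code; the data, initial values and number of iterations are all made up). q(cj|xi) plays the role of the E-step responsibilities, and the M-step uses the update formulas above.

import numpy as np

rng = np.random.default_rng(0)
# Toy 1-D data from two Gaussians (generating parameters are made up).
X = np.concatenate([rng.normal(-2.0, 1.0, 150), rng.normal(3.0, 0.7, 100)])
n, k = len(X), 2

mu = np.array([-1.0, 1.0])     # arbitrary initial means
var = np.array([1.0, 1.0])     # arbitrary initial variances
pi = np.array([0.5, 0.5])      # p(c_j)

def gauss(x, m, v):
    return np.exp(-(x - m) ** 2 / (2 * v)) / np.sqrt(2 * np.pi * v)

for _ in range(50):
    # E-step: q(c_j|x_i) = p(c_j) N(x_i|mu_j, var_j) / sum_j' p(c_j') N(x_i|mu_j', var_j')
    unnorm = np.stack([pi[j] * gauss(X, mu[j], var[j]) for j in range(k)], axis=1)
    q = unnorm / unnorm.sum(axis=1, keepdims=True)
    # M-step: re-estimate mixture weights, means and variances.
    Nj = q.sum(axis=0)
    pi = Nj / n
    mu = (q * X[:, None]).sum(axis=0) / Nj
    var = (q * (X[:, None] - mu) ** 2).sum(axis=0) / Nj

print(pi, mu, var)   # should land near the generating parameters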
CS 725 : Foundations of Machine Learning Autumn 2011
13 Clustering
Now we will study various types of clustering algorithms such as k-means, k-medoids and so on. These are special variants of mixtures of Gaussians; they also come under the generative model framework (Pr(x|C)).
The notes follow the material from Jiawei Han's textbook, Data Mining: Concepts and Techniques, Chapter 7. Read this in conjunction with the slides at http://www.cse.iitb.ac.in/~cs725/notes/classNotes/JaiweiHanDataMining.ppt.
The broad steps in clustering algorithms are as follows:
1. Feature pre-processing on the datapoints
2. Distance/similarity measure formulation
3. Formulation of the objective function
4. Solving the objective function for optimality
(thereby deriving the clusters that the datapoints belong to.)
Let there be n such datapoints denoting n customers. There can be 3 different types of features:
1. Numeric features: features such as age and salary which can take a whole range of values.
2. Discrete features: features such as gender = {male, female} and marital status = {married, single}.
3. Ordinal features: features where there is some notion of partial/total order among the values. For instance, designation can take values {ceo, manager, superintendent, engineer, worker}, car driven can take values {luxury, sports, sedan, compact, mini}, or customer class can be {gold, silver, bronze, none}.
Feature preprocessing
Some kind of preprocessing is done on the features so that we don't have features which drastically vary in the values they take. Usually some kind of normalization is performed across all the values of a particular feature (column normalization). The other type of normalization, across all the features of a particular datapoint (row normalization), is also useful sometimes, but is rare.
For numeric features, some normalizations are as follows:
1. (φi − φi^min)/(φi^max − φi^min) ∈ [0, 1]
2. φi / φi^max
3. (φi − φi^mean)/φi^std   (where φi^mean is the mean of the values and φi^std = Σ_j |φi(xj) − φi^mean| / n)
For discrete features not much preprocessing is necessary. Ordinal features are mapped to numeric values and then preprocessing is applied to them.
A thing to note here is that feature preprocessing is not unique to clustering and is widely used in other machine learning techniques. It significantly helps in classification as well.
Distance measures
The distance (or similarity) metric is denoted by dij (or sij respectively). It is the distance between any two datapoints i and j. Some examples of distance metrics are as follows:
1. Mahalanobis distance: given by the expression (φ(x) − µ)ᵀ Σ⁻¹ (φ(x) − µ) (a generalization of ||φ(x) − µ||₂²). The EM algorithm has this in some sense.
2. If the φ(x) are numeric / ordinal, a measure defined (after applying column normalization) is

( Σ_{l=1}^{m} |φl(xi) − φl(xj)|^p )^{1/p}
Similarity measures are usually metric measures, so Sij = 1 − dij. Also, sometimes they could be non-metric measures, for instance the cosine similarity given by:

φᵀ(xi) φ(xj) / ( ||φ(xi)||₂ ||φ(xj)||₂ )
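A small Python sketch of these two measures (not from the lecture; the feature vectors are made up, and the p-norm distance is the Minkowski form given above):

import numpy as np

def minkowski(x, y, p=2):
    # (sum_l |phi_l(x) - phi_l(y)|^p)^(1/p); p=2 gives Euclidean, p=1 Manhattan
    return np.sum(np.abs(x - y) ** p) ** (1.0 / p)

def cosine_similarity(x, y):
    return float(x @ y / (np.linalg.norm(x) * np.linalg.norm(y)))

a, b = np.array([1.0, 0.2, 3.0]), np.array([0.5, 0.1, 2.0])   # toy feature vectors
print(minkowski(a, b, p=2), minkowski(a, b, p=1), cosine_similarity(a, b))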
Further Reading
1. Jiawei Han, Micheline Kamber: Data Mining: Concepts and Techniques , Chapter 7
2. Hastie, Tibshirani, Friedman: The elements of Statistical Learning Springer Verlag
3. Ian H. Witten, Eibe Frank, Mark A. Hall: Data Mining: Practical Machine Learning Tools and Techniques: for reference on data structures for efficient clustering / classification.
CS 725 : Foundations of Machine Learning Autumn 2011
s.t. Pij = 0 or 1
     Σ_{j=1}^{k} Pij = 1

In this optimization program, the region specified by the constraints is non-convex, so the entire program is not convex.
Suppose we have the objective function:

O′* = max_{Pij, µj} − Σ_{j=1}^{k} Σ_{i=1}^{n} Pij ||φ(xi) − µj||²

We can show that O* is a lower bound of O′*, i.e. O* ≤ O′*. We can approximate the new objective (O′*) function to the old one (O*) by ...
We will start with an arbitrary assignment of points to clusters. The steps of one iteration of the algorithm are:
Step 1:

µj = Σ_i Pij φ(xi) / Σ_i Pij

Step 2:

P*ij ∈ arg max_{Pij} − Σ_{j=1}^{k} Σ_{i=1}^{n} Pij ||φ(xi) − µj||²

for each i:
  if Pij = 1 and j ∈ arg max_{j′} −||φ(xi) − µ_{j′}||², then leave Pij unchanged for that i
  else Pij = 1 if j = arg max_{j′} −||φ(xi) − µ_{j′}||², and 0 otherwise

Check after every iteration whether any new Pij is different from the old Pij. If yes, continue; if not, halt.
Note: an iteration of the k-means algorithm is of order O(nk). The k-means algorithm is also called the k-centroids algorithm.
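A compact Python sketch of these two alternating steps (Lloyd's algorithm); it is only an illustration, with made-up 2-D data and random initialization, not the lecture's implementation.

import numpy as np

def kmeans(X, k, iters=100, seed=0):
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        # Step 2 in the text: assign each point to its nearest centroid (P_ij).
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        assign = dists.argmin(axis=1)
        # Step 1 in the text: recompute each centroid as the mean of its points.
        new_centroids = np.array([X[assign == j].mean(axis=0) if np.any(assign == j)
                                  else centroids[j] for j in range(k)])
        if np.allclose(new_centroids, centroids):   # halt when assignments stabilize
            break
        centroids = new_centroids
    return centroids, assign

X = np.vstack([np.random.default_rng(1).normal(0, 1, (50, 2)),
               np.random.default_rng(2).normal(5, 1, (50, 2))])
centroids, assign = kmeans(X, k=2)
print(centroids)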
Disadvantages of k-means
1. Fixed value of k: So determining the right value of k is very critical to the success of the
approach.
2. Sometimes there is a problem owing to the wrong initialization of µj ’s.
3. Mean in “no-man’s land”: Lack of robustness to outliers.
k-medoids
Here the assumption is µj = φ(xi) for some value of i. In other words, the cluster's centroid coincides with one of the actual points in the data. Each step of the k-medoids algorithm is of order k(n − 1)n ∼ O(kn²).
k-modes
For discrete-valued attributes, in the k-modes algorithm we have:

µj^l = arg max_{v ∈ {1,...,Vl}} Σ_{xi ∈ Cj} δ(φl(xi), v)    ∀l = 1 ... m

For continuous-valued attributes, we may need to fit some distribution and find the mode for each feature over the cluster.
Exercise
1. Write the steps of the k-medoids algorithm.
However, letting the algorithm choose the value of k will lead to the case k = n, where each point is a cluster of its own. This kind of trivial clustering is definitely not desirable.
Now let us consider the problem of communicating the locations of points to another person over an SMS. The more information we pass, the higher the cost. So we want to minimize the amount of information sent, yet be precise enough that the other side can reconstruct the same set of points from the information.
In the extreme case we send the actual co-ordinates of the points. This is a very costly information transfer but allows the other person to accurately reconstruct the points.
Another option would be to communicate cluster information. In this case we send across the cluster centroids, the difference of each point from its corresponding centroid and the magnitude of the difference. The amount of information to be sent across is then much less, given that the magnitudes of the points may be large but their differences will be small. Hence, the information to be encoded is less. This is akin to the MDL principle.
Let us denote the data by D and a theory about the data by T (cluster information and so on). Let the information that is to be encoded be denoted by I bits. We can approximate I(D) as I(D, D), given by the expression:

I(D) ≈ I(D, D) > I(D|T) + I(T)

where I(D) represents the co-ordinates of every point. I(D|T) is the cluster information of a point, such as the membership of the point in a cluster and the relative co-ordinate (difference) of the point w.r.t. the corresponding cluster. I(T) is the overall clustering information, such as the number of clusters, the order of magnitude of points in each cluster and the mean of every cluster.
P(D, T) is similar in spirit to the Bayes rule. We have P(D, T) = P(D|T) P(T), or
log P(D, T) = log P(D|T) + log P(T)
So we can modify the original objective as follows:

arg max_{Pij, µj, k} − Σ_{j=1}^{k} Σ_{i=1}^{n} Pij ||φ(xi) − µj||² − γk
The first term of the objective is I(D|T) and the second term is −I(T). The second term is also called the regularizer.
In essence, according to the MDL principle, we need to define I(D|T) and I(T) and choose T such that it minimizes I(D|T) + I(T). This is also aligned with the Occam's Razor principle.
2. Which two clusters to merge (single-link, complete-link, average distance): merge the clusters that have the least mutual distance. Alternatively, which clusters to break.
3. When to stop merging clusters (closely linked to the distance measure): stop when the distance between two clusters is > θ (some threshold). Alternatively, when to stop splitting the clusters.
Grid-based clustering
An algorithm called STING was discussed in class. Some of the salient steps of the algorithm are in Han's book (see it for more details).
To find the type of distribution in step 2, we use the concept of goodness-of-fit. Step 3 of the algorithm is of order O(grid hierarchy size).
Exercise
1. Construct a case where DBScan will discover non-convex clusters.
2. Construct a case where DBScan will discover one cluster inside another.
Further Reading
Read more about goodness-of-fit, MDL, hypothesis testing and cross-validation.
The final set of all interesting sets is S* = ⋃_{i=1}^{m} Si
Exercise
Write the generalized expression for Option 1 above.
CS 725 : Foundations of Machine Learning Autumn 2011
φ_{i1} ∧ φ_{i2} ∧ ... ∧ φ_{ik} ⇒ φ_{ij}

such that Sup(R) > s, i.e. Support(φ_{i1} ∧ φ_{i2} ∧ ... ∧ φ_{ik} ⇒ φ_{ij}) > s, and Conf(R) > c, where
1. Constraint-based mining: in this type of rule mining, constraints can be placed on what should be discovered and what should not.
2. Handling quantitative attributes:
3. Multi-level association rules: this type of association rule mining obtains rules by placing a higher threshold on smaller sets (ensuring concrete rules) and then decreasing the threshold at the next level to get rules for that level, and so on.
3 Lecture scribed by Kedhar
4 http://en.wikipedia.org/wiki/Association_rule_learning
Pr(C|x) = Pr(x|C) Pr(C) / Pr(x)

Using the histogram computation, we can build a histogram separately for every class to compute Pr(x|C). Extending the concept to instances with m dimensions, the number of bins would be t^m, which increases exponentially with the dimension of the instances. Hence, this method does not seem to be a good one.
If Pr(x) ~ the true density (Pr(x) can be replaced by Pr(x|C)), consider a region A with probability P_A:

P_A = P(x ∈ A) = ∫_A Pr(x) dx = P_x · |A|    (22)

P_x ≈ k / (n · |A|)    (23)

In (23), if we fix the value of k and determine the region A in which these k points lie, we get the k-nearest-neighbour classifier. Otherwise, if we fix the region A and then determine k from the training sample, we get the Parzen window (kernel density) classifier.
From the above equation, it can be understood that standard k-NN takes a uniform Pr(Cj).

C* = argmax_{Cj} (kj/k) · Pr(Cj)    (26)

Note: the k-nearest-neighbour classifier is also called a memory-based classifier. It is a lazy and non-smooth classifier.
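A minimal Python sketch of the standard (uniform-prior) k-NN rule: classify a query point by majority vote among its k nearest training points. The tiny training set is made up for illustration.

import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x, k=3):
    # Majority vote among the k nearest training points (Euclidean distance).
    dists = np.linalg.norm(X_train - x, axis=1)
    nearest = np.argsort(dists)[:k]
    votes = Counter(y_train[i] for i in nearest)
    return votes.most_common(1)[0][0]

X_train = np.array([[0.0, 0.0], [0.1, 0.2], [1.0, 1.0], [0.9, 1.1]])
y_train = ["c1", "c1", "c2", "c2"]
print(knn_predict(X_train, y_train, np.array([0.2, 0.1]), k=3))   # "c1"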
CS 725 : Foundations of Machine Learning Autumn 2011
p(cj|x) = p(x|cj) p(cj) / p(x)    (by Bayes' rule)

arg max_j p(cj|x) = arg max_j p(cj) × (1/nj) Σ_{xi ∈ cj} K(x, xi)
                  = arg max_j (1/n) Σ_{xi ∈ cj} K(x, xi)    (since p(cj) = nj/n)
where αi ’s are the parameters to be discovered and are in some range. Also most αi ’s are 0. This
gives the sparsity that we discussed before.
16 Discriminative Classification
Most models seen so far try to model the likelihood of the data, i.e. p(x|cj). So we indirectly obtain p(cj|x) using Bayes' rule:

p(cj|x) = p(x|cj) p(cj) / p(x)

What if we model p(cj|x) directly? This is called a probabilistic discriminative classification model.
The above formulation is a concave optimization program. Optimality occurs at (see practice h/w 2, problem 3):

−1 − log p(cj|xi) + Σ_{l=1}^{m} λl φl(xi, cj) + ηi + θij − θ′ij = 0

⇒ p(cj|xi) = e^{Σ_{l=1}^{m} λl φl(xi,cj)} / Σ_{j′=1}^{|c|} e^{Σ_{l=1}^{m} λl φl(xi,cj′)}
where the λl parameters are obtained from the 1st constraint and η and θ from the 2nd constraint. We can write the equivalent dual problem (using p(cj|xi) ∝ e^{Σ_{l=1}^{m} λl φl(xi,cj)}) as:

max_{λl} LL = max_{λl} Σ_{i=1}^{n} log p(c(xi)|xi)    (27)

This form of p(cj|xi) is an important element of the exponential family of distributions (see further reading below for a detailed description of the exponential family).
This is also called the logistic regression classifier. The logistic regression classifier's decision surface is linear in the φ space.
1. Gradient descent: Updates are given by (for the dual problem in Eq. 27):
n |c|
n ∑
∑ ∑
φl (xi , c(xi )) − pold (cj |xi )φl (xi , cj )
i=1 i=1 j=1
. ∇2 LL is given by: ∇2 LL = φT M φ
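A minimal sketch of option 1 for the binary case (gradient ascent on the log-likelihood of logistic regression), with made-up synthetic data; the gradient is exactly "empirical feature counts minus expected feature counts" as in the update above.

import numpy as np

rng = np.random.default_rng(0)
n = 200
X = np.hstack([rng.normal(size=(n, 2)), np.ones((n, 1))])   # phi(x) plus a bias feature
true_w = np.array([1.5, -2.0, 0.3])
y = (1 / (1 + np.exp(-X @ true_w)) > rng.random(n)).astype(float)

w = np.zeros(3)
lr = 0.1
for _ in range(500):
    p = 1 / (1 + np.exp(-X @ w))   # p(c=1 | x_i) under the current lambdas (w)
    grad = X.T @ (y - p)           # empirical counts minus expected counts
    w += lr * grad / n             # gradient ascent on the log-likelihood
print(w)                           # roughly recovers true_w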
Further Reading
Exponential family : Read Section 7.4 of http://www.cse.iitb.ac.in/~cs725/notes/classNotes/
misc/CaseStudyWithProbabilisticModels.pdf
Exercise
Write the expression for M in the Newton updates expression of ∇2 LL.
See Convex optimization notes section 4.5.2.(http://www.cse.iitb.ac.in/~cs725/notes/classNotes/
BasicsOfConvexOptimization.pdf)
CS 725 : Foundations of Machine Learning Autumn 2011
17 Graphical Models
In this lecture a general introduction to graphical models was presented. The two broad categories of graphical models, namely undirected and directed models, were discussed. Properties of conditional independence and how a graph factorizes based on them were also discussed. It was stressed that the absence of an edge is more important than the presence of an edge.
Some ways of performing inference in graphical models were briefly touched upon.
Further Reading
1. Graphical Models slides presented in class : http://www.cse.iitb.ac.in/~cs725/notes/
classNotes/graphicalModels.ppt
2. Detailed graphical models notes : http://www.cse.iitb.ac.in/~cs725/notes/classNotes/
misc/CaseStudyWithProbabilisticModels.pdf
CS 725 : Foundations of Machine Learning Autumn 2011
2. The second method is to model P (x|C+ ) and P (x|C− ) together with the prior probabilities
P (Ck ) for the classes, from which we can compute the posterior probabilities using Bayes’
theorem
\[ P(C_k|x) = \frac{P(x|C_k)\,P(C_k)}{P(x)} \]
These types of models are called generative models.
3. The third method is to model P (C+ |x) and P (C− |x) directly. These types of models are called
discriminative models. In this case P (C+ |x) = P (C− |x) gives the required decision boundary.
Examples
An example of a generative model is as follows:
\[ P(x|C_+) = \mathcal{N}(\mu_+, \Sigma), \qquad P(x|C_-) = \mathcal{N}(\mu_-, \Sigma) \]
With the prior probabilities $P(C_+)$ and $P(C_-)$ known, we can derive $P(C_+|x)$ and $P(C_-|x)$. In this case it can be shown that the decision boundary $P(C_+|x) = P(C_-|x)$ is a hyperplane.
An example of a discriminative model is
\[ P(C_+|x) = \frac{e^{w^T\phi(x)}}{1 + e^{w^T\phi(x)}}, \qquad P(C_-|x) = \frac{1}{1 + e^{w^T\phi(x)}} \]
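A minimal sketch of the generative example above: fit class means with a shared covariance and compute the posterior by Bayes' theorem. The estimator details (pooled covariance, function names) are illustrative assumptions.

```python
import numpy as np

def fit_shared_gaussian(X, y):
    """Fit N(mu_c, Sigma) per class with a shared (pooled) covariance and class priors."""
    classes = np.unique(y)
    mus = {c: X[y == c].mean(axis=0) for c in classes}
    priors = {c: np.mean(y == c) for c in classes}
    # pooled covariance: weighted average of per-class covariances
    Sigma = sum(np.cov(X[y == c].T, bias=True) * np.sum(y == c) for c in classes) / len(X)
    return mus, Sigma, priors

def posterior(x, mus, Sigma, priors):
    """P(C_k | x) via Bayes' theorem; with a shared Sigma the boundary
    P(C_+|x) = P(C_-|x) is a hyperplane."""
    Sinv = np.linalg.inv(Sigma)
    scores = {}
    for c, mu in mus.items():
        d = x - mu
        scores[c] = priors[c] * np.exp(-0.5 * d @ Sinv @ d)   # common normalizer cancels
    Z = sum(scores.values())
    return {c: s / Z for c, s in scores.items()}
```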
Examples of the first model (which directly construct the classifier) include
Linear Regression
Perceptron
Fisher's Discriminant
Support Vector Machines
Attempting to construct a K-class discriminant from a set of two-class discriminants leads to ambiguous regions. The problems with the first two approaches are illustrated in Figures 23 and 24, where the ambiguous regions are marked with '?'.
Avoiding ambiguities
We can avoid the above-mentioned difficulties by considering a single K-class discriminant comprising K functions $g_{C_k}(x)$. Then x is assigned to the class $C_k$ that has the maximum value for $g_{C_k}(x)$.
If $g_{C_k}(x) = w_{C_k}^T\phi(x)$, the decision boundary between class $C_j$ and class $C_k$ is given by $g_{C_k}(x) = g_{C_j}(x)$ and hence corresponds to
\[ (w_{C_k}^T - w_{C_j}^T)\,\phi(x) = 0 \]
where $k \in \{1, \ldots, K\}$. We can conveniently group these together using vector notation so that
\[ y(x) = W^T\phi(x) \]
where W is a matrix whose $k$-th column comprises the unknown parameters $w_k$ and $\phi(x)$ is the vector of basis function values evaluated at the input vector x. The procedure for classification is then to assign a new input vector x to the class for which the output $y_k = w_k^T\phi(x)$ is largest.
We now determine the parameter matrix W by minimizing a sum-of-squares error function. Consider a training data set $\{x_n, t_n\}$ where $n \in \{1, \ldots, N\}$, $x_n$ is the input and $t_n$ is the corresponding target vector. We now define a matrix $\Phi$ whose $n$-th row is given by $\phi(x_n)$:
\[ \Phi = \begin{pmatrix} \phi_0(x_1) & \phi_1(x_1) & \ldots & \phi_{K-1}(x_1) \\ \phi_0(x_2) & \phi_1(x_2) & \ldots & \phi_{K-1}(x_2) \\ \vdots & \vdots & & \vdots \\ \phi_0(x_N) & \phi_1(x_N) & \ldots & \phi_{K-1}(x_N) \end{pmatrix} \]
Figure 25: Data from two classes classified by least squares (magenta) and logistic (green)
We further define a matrix T whose $n$-th row is given by the vector $t_n^T$. The sum-of-squares error function can then be written as
\[ err(W) = \frac{1}{2}\,Tr\{(\Phi W - T)^T(\Phi W - T)\} \]
We can now minimize the error by setting the derivative with respect to W to zero. The solution we obtain for W is then of the form
\[ W = (\Phi^T\Phi)^{-1}\Phi^T T \]
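A small sketch of this least-squares classifier, assuming $\Phi$ and the 1-of-K target matrix T are already built; np.linalg.lstsq is used instead of forming the inverse explicitly, which is a numerical convenience rather than part of the derivation.

```python
import numpy as np

def least_squares_classifier(Phi, T):
    """Solve W = (Phi^T Phi)^{-1} Phi^T T; the k-th column of W is the weight vector w_k.

    Phi : (N, K) design matrix whose n-th row is phi(x_n).
    T   : (N, C) matrix whose n-th row is the 1-of-C target vector t_n.
    """
    # lstsq solves the same normal equations without explicitly inverting Phi^T Phi
    W, *_ = np.linalg.lstsq(Phi, T, rcond=None)
    return W

def classify(W, phi_x):
    """Assign x to the class with the largest output y_k = w_k^T phi(x)."""
    return int(np.argmax(W.T @ phi_x))
```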
19 Regression
Suppose there are two sets of variables $x \in \Re^n$ and $y \in \Re^k$ such that x is independent and y is dependent. The regression problem is concerned with determining y in terms of x. Let us assume that we are given m data points $D = \langle x_1, y_1\rangle, \langle x_2, y_2\rangle, \ldots, \langle x_m, y_m\rangle$. Then the problem is to determine a function $f^*$ such that $f^*(x)$ is the best predictor for y, with respect to D. Suppose $\varepsilon(f, D)$ is an error function, designed to reflect the discrepancy between the predicted value $f(x')$ of $y'$ and the actual value $y'$ for any $\langle x', y'\rangle \in D$, then
\[ f^* = \arg\min_{f \in F}\ \varepsilon(f, D) \]
where F denotes the class of functions over which the optimization is performed. As an example, consider the class of polynomial functions
\[ f(x) = w_0 + w_1 x + w_2 x^2 + \ldots + w_t x^t \]
If there are m data points, then a polynomial of degree $m-1$ can exactly fit the data, since the polynomial has m degrees of freedom (where degrees of freedom = number of coefficients).
As the degree of the polynomial increases beyond m, the curve becomes more and more wobbly, while still passing through the points. Contrast the degree-10 fit in Figure 28 against the degree-5 fit in Figure 27. This is due to the problem of overfitting (overspecification).
Now E is a convex function. To optimize it, we need to set $\nabla_w E = 0$; the $\nabla$ operator is also called the gradient. The solution is given by
\[ w = (\phi^T\phi)^{-1}\phi^T Y \]
If $m \ll t$ then $\phi^T\phi$ becomes singular and the solution cannot be found, OR
Figure 28: Fit for a degree-10 polynomial. Note how wobbly this fit is.
\[ \varepsilon(f, D) = \sum_{\langle x_i, y_i\rangle \in D} (f(x_i) - y_i)^2 \]
The minimum value of the squared loss is zero. Is it possible to achieve this value? In other words, is $\forall j,\ \sum_{i=1}^{p} w_i\phi_i(x_j) = y_j$ possible?
Figure 29: The least-squares solution $\hat{y}$ is the orthogonal projection of y onto the column space $C(\phi)$ of $\phi$.
\[ \hat{y}^T\phi = y^T\phi \]
i.e., $(\phi w)^T\phi = y^T\phi$
i.e., $w^T\phi^T\phi = y^T\phi$
i.e., $\phi^T\phi w = \phi^T y$
\[ \therefore\ w = (\phi^T\phi)^{-1}\phi^T y \]
In the last step, note that $\phi^T\phi$ is invertible only if $\phi$ has full column rank.
Theorem: If $\phi$ has full column rank, then $\phi^T\phi$ is invertible. A matrix is said to have full column rank if all its column vectors are linearly independent. A set of vectors $v_i$ is said to be linearly independent if $\sum_i \alpha_i v_i = 0 \Rightarrow \alpha_i = 0\ \forall i$.
Proof: Given that $\phi$ has full column rank, and hence its columns are linearly independent, we have that $\phi x = 0 \Rightarrow x = 0$.
Assume on the contrary that $\phi^T\phi$ is not invertible. Then $\exists x \neq 0$ such that $\phi^T\phi x = 0$.
$\Rightarrow x^T\phi^T\phi x = 0$
$\Rightarrow (\phi x)^T\phi x = \|\phi x\|^2 = 0$
$\Rightarrow \phi x = 0$. This is a contradiction. Hence the theorem is proved.
Problem formulation
In order to prevent coefficients from becoming too large in magnitude, we may modify the problem to be a constrained optimization problem. Intuitively, to achieve this we may impose a constraint on the magnitude of the coefficients. Any norm would give a reasonable working solution; however, for mathematical convenience, we start with the Euclidean ($L_2$) norm. The overall problem, with objective function and constraint, is as follows:
\[ \min_w\ (\Phi w - Y)^T(\Phi w - Y) \qquad \text{s.t. } \|w\|_2^2 \le \xi \]
As observed in the last lecture, the objective function, namely $f(w) = (\Phi w - Y)^T(\Phi w - Y)$, is strictly convex. Further, the constraint function $g(w) = \|w\|_2^2 - \xi$ is also a convex function. For convex $g(w)$, the set $S = \{w \mid g(w) \le 0\}$ can be proved to be a convex set by taking two elements $w_1 \in S$ and $w_2 \in S$ such that $g(w_1) \le 0$ and $g(w_2) \le 0$. Since $g(w)$ is a convex function, we have the following inequality:
\[ g(\theta w_1 + (1-\theta)w_2) \le \theta g(w_1) + (1-\theta)g(w_2) \le 0; \quad \forall\theta \in [0,1],\ w_1, w_2 \in S \quad (30) \]
i.e.,
\[ \|w^*\|^2 \le \xi \quad (33) \]
\[ \lambda \ge 0 \quad (34) \]
\[ \lambda\|w^*\|^2 = \lambda\xi \quad (35) \]
Thus, values of $w^*$ and $\lambda$ which satisfy all these equations would yield an optimum solution. Consider equation (32),
\[ w^* = (\Phi^T\Phi + \lambda I)^{-1}\Phi^T y \]
Premultiplying with $(\Phi^T\Phi + \lambda I)$ on both sides we have
\[ (\Phi^T\Phi + \lambda I)\,w^* = \Phi^T y \]
Taking norms on both sides, with $\alpha$ denoting the largest eigenvalue of $\Phi^T\Phi$, we get
\[ (\alpha + \lambda)\,\|w^*\| \ge \|\Phi^T y\| \]
i.e.,
\[ \lambda \ge \frac{\|\Phi^T y\|}{\|w^*\|} - \alpha \quad (37) \]
Note that when $\|w^*\| \to 0$, $\lambda \to \infty$. This is expected, as a higher value of $\lambda$ focuses more on reducing $\|w^*\|$ than on minimizing the error function. Since $\|w^*\|^2 \le \xi$,
\[ \therefore\ \lambda \ge \frac{\|\Phi^T y\|}{\sqrt{\xi}} - \alpha \quad (38) \]
This is not the exact solution for $\lambda$, but the bound (38) proves the existence of $\lambda$ for some $\xi$ and $\Phi$.
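A brief sketch of the resulting ridge-regression estimator $w^* = (\Phi^T\Phi + \lambda I)^{-1}\Phi^T y$; the polynomial feature map and the synthetic data are illustrative assumptions.

```python
import numpy as np

def ridge_fit(Phi, y, lam):
    """w* = (Phi^T Phi + lambda I)^{-1} Phi^T y, solved without forming the inverse."""
    n_features = Phi.shape[1]
    return np.linalg.solve(Phi.T @ Phi + lam * np.eye(n_features), Phi.T @ y)

def poly_features(x, degree=6):
    """Polynomial design matrix [1, x, x^2, ..., x^degree] for 1-D inputs."""
    return np.vander(x, N=degree + 1, increasing=True)

# Illustrative usage on noisy synthetic data
x = np.linspace(0, 1, 20)
y = np.sin(2 * np.pi * x) + 0.1 * np.random.randn(20)
w = ridge_fit(poly_features(x), y, lam=1e-3)
```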
Figure 30: RMS error vs. degree of polynomial for test and train data
Figure 31: RMS error vs. 10λ for test and train data (at Polynomial degree = 6)
Least-squares classification suffers from two problems:
1. Sensitivity to outliers
2. Masking
Sensitivity to outliers
Outliers: these are points which are noisy and adversely affect the classification. In the right-hand figure, the separating hyperplane has changed because of the outliers.
Masking
It is seen empirically that a linear regression classifier may mask a given class. This is shown in the left-hand figure. We had 3 classes, one in between the other two; the points of the middle class are not classified correctly.
The equation of the classifier between class C1 (red dots) and class C2 (green dots) is
\[ (\omega_1 - \omega_2)^T\phi(x) = 0 \]
and the equation of the classifier between class C2 (green dots) and class C3 (blue dots) is
\[ (\omega_2 - \omega_3)^T\phi(x) = 0 \]
We will transform the original dimensions to new dimensions, where the new dimensions are functions of the original dimensions. This is a workaround solution.
\[ \phi_1'(x) = \sigma_1(\phi_1, \phi_2), \qquad \phi_2'(x) = \sigma_2(\phi_1, \phi_2) \]
Here we try to determine the transformations $\phi_1'$ and $\phi_2'$ such that we can get a linear classifier in this new space. When we map back to the original dimensions, the separators may not remain linear.
Problem: exponential blow-up of the number of parameters ($w$'s), of order $O(n^{k-1})$.
Figure 34: When mapped back to the original dimensions, the class separator is not linear.
The decision surface is the perpendicular bisector of the line joining the mean of class $C_1$ ($m_1$) and the mean of class $C_2$ ($m_2$):
\[ m_1 = \frac{1}{N_1}\sum_{n \in C_1} x_n, \qquad m_2 = \frac{1}{N_2}\sum_{n \in C_2} x_n \]
where $N_1$ and $N_2$ are the numbers of points in classes $C_1$ and $C_2$ respectively.
Comment: This solves the masking problem but not the sensitivity problem, as it does not capture the orientation (e.g., the spread of the data points) of the classes.
3. Fisher Discriminant Analysis.
Here we consider the means of the classes, the within-class covariance and the global covariance.
Aim: to increase the separation between the class means and to minimize the within-class variance. Considering two classes, let $S_B$ be the inter-class covariance and $S_W$ the intra-class covariance, with $m_1$, $m_2$, $N_1$, $N_2$ as defined above:
\[ m_1 = \frac{1}{N_1}\sum_{n \in C_1} x_n, \qquad m_2 = \frac{1}{N_2}\sum_{n \in C_2} x_n \]
\[ S_B = (m_2 - m_1)(m_2 - m_1)^T \]
\[ S_W = \sum_{n \in C_1}(x_n - m_1)(x_n - m_1)^T + \sum_{n \in C_2}(x_n - m_2)(x_n - m_2)^T \]
\[ w \propto S_W^{-1}(m_2 - m_1) \]
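A minimal sketch of computing the Fisher direction $w \propto S_W^{-1}(m_2 - m_1)$ for two classes; the normalization and function name are illustrative choices.

```python
import numpy as np

def fisher_direction(X1, X2):
    """Return a unit vector proportional to S_W^{-1} (m2 - m1) for two classes.

    X1, X2 : arrays of shape (N1, d) and (N2, d) holding the points of each class.
    """
    m1, m2 = X1.mean(axis=0), X2.mean(axis=0)
    # within-class (intra-class) covariance S_W
    Sw = (X1 - m1).T @ (X1 - m1) + (X2 - m2).T @ (X2 - m2)
    w = np.linalg.solve(Sw, m2 - m1)
    return w / np.linalg.norm(w)
```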
Summary
20 Perceptron Algorithm
We saw a class of classifiers that model $w^T\phi(x)$ directly. Among them were least-squared-error classification, the perpendicular bisector of the line connecting the class means, and Fisher discriminant analysis. All these models have the problem that they are not robust to outliers. They are extremely sensitive to them, in the sense that a few outlier points can drastically change the position and orientation of the decision surface.
20.1 Introduction
Desirables for avoiding sensitivity to outliers:
1. A few points that are properly classified and far away from the separating surface (decision boundary) should not influence the decision boundary much.
2. (Possibly) a few misclassified points far away from the separating surface should also not influence the decision boundary.
In the Perceptron algorithm the main idea is to learn w (for $w^T\phi(x) = 0$) only from misclassified examples, weighing them by their distance from the separating hyperplane. A misclassified example is defined as one for which
\[ y_i\,w^T\phi(x_i) < 0 \]
Consider the hyperplane $w^T\phi(x) = 0$, a point $\phi(x)$, and its projection $\phi(x_0)$ on the hyperplane, with D the distance between them:
\[ D = w^T(\phi(x) - \phi(x_0)) \]
Since $w^T\phi(x_0) = 0$, we get distance $= w^T\phi(x)$. Note: we study the perceptron (and later the SVM) for 2-class classification problems only. We label the classes as $y = 1$ and $y = -1$. A point is misclassified if
\[ y_i\,w^T\phi(x_i) < 0 \]
Intuition
\[ y\,w_{k+1}^T\phi(x) = y\,(w_k + y\,\phi(x))^T\phi(x) = y\,w_k^T\phi(x) + y^2\|\phi(x)\|^2 > y\,w_k^T\phi(x) \]
Note: we applied the update for this point (since $y\,w_k^T\phi(x) \le 0$), and we have $y\,w_{k+1}^T\phi(x) > y\,w_k^T\phi(x)$. So we have more hope that this point is classified correctly now. More formally, the perceptron tries to minimize the error function
\[ E = -\sum_{x \in M} y\,\phi^T(x)\,w \]
where M is the set of misclassified points.
\[ w_{k+1} = w_k - \eta\nabla E = w_k + \eta\sum_{x \in M} y\,\phi(x) \quad \text{(this takes all misclassified points at a time)} \]
But what we do in the standard Perceptron Algorithm is basically stochastic gradient descent:
\[ \nabla E = -\sum_{x \in M} y\,\phi(x) = \sum_{x \in M}\nabla E(x), \quad \text{where } E(x) = -y\,\phi^T(x)\,w \]
\[ w_{k+1} = w_k - \eta\,\nabla E(x) = w_k + \eta\,y\,\phi(x) \quad \text{(for any } x \in M\text{)} \]
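A small sketch of this stochastic-gradient perceptron update; the epoch limit and the convention of updating on $y_i\,w^T\phi(x_i) \le 0$ (including boundary points) are illustrative assumptions.

```python
import numpy as np

def perceptron(Phi, y, eta=1.0, max_epochs=100):
    """Stochastic-gradient perceptron: update w only on misclassified points.

    Phi : (n, d) matrix whose rows are phi(x_i), y : labels in {+1, -1}.
    """
    w = np.zeros(Phi.shape[1])
    for _ in range(max_epochs):
        mistakes = 0
        for phi_x, yi in zip(Phi, y):
            if yi * (w @ phi_x) <= 0:         # misclassified (or on the boundary)
                w = w + eta * yi * phi_x      # w_{k+1} = w_k + eta * y * phi(x)
                mistakes += 1
        if mistakes == 0:                     # all points classified correctly
            break
    return w
```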
Assume the data is linearly separable, i.e., there exists a $w^*$ defining a separating hyperplane $\phi^T(x)\,w^* = 0$.
Proof: we want to show
\[ \lim_{k\to\infty} \|w_{k+1} - \rho w^*\|^2 = 0 \quad \text{(if this happens for some constant } \rho\text{, we are fine)} \]
\[ \|w_{k+1} - \rho w^*\|^2 = \|w_k - \rho w^*\|^2 + \|y\,\phi(x)\|^2 + 2y\,(w_k - \rho w^*)^T\phi(x) \]
Now, we want the left-hand side to be smaller than $\|w_k - \rho w^*\|^2$ at every step, even if only by some small value, so that the perceptron converges over time. So, if we can obtain an expression of the form
\[ \|w_{k+1} - \rho w^*\|^2 \le \|w_k - \rho w^*\|^2 - \theta^2, \]
then $\|w_{k+1} - \rho w^*\|^2$ is reducing by at least $\theta^2$ at every iteration. So, from the above expressions, we need to find $\theta$ such that
\[ \|y\,\phi(x)\|^2 + 2y\,(w_k - \rho w^*)^T\phi(x) \le -\theta^2. \]
Some observations:
1. $y\,w_k^T\phi(x) \le 0$ (since x was misclassified)
2. $\Gamma^2 = \max_{x \in D} \|\phi(x)\|^2$
3. $\delta = \max_{x \in D} -2y\,{w^*}^T\phi(x)$
Here, the margin ${w^*}^T\phi(x)$ is the distance of the closest point from the hyperplane, and D is the set of all points, not just the misclassified ones.
Since $y\,{w^*}^T\phi(x) \ge 0$ for every point, we have $\delta \le 0$. So what we are interested in is the 'least negative' value of $\delta$.
From the observations and eq. (2), we have:
Taking $\rho = \dfrac{2\Gamma^2}{-\delta}$, we get
\[ 0 \le \|w_{k+1} - \rho w^*\|^2 \le \|w_k - \rho w^*\|^2 - \Gamma^2 \]
Hence we have $\Gamma^2 = \theta^2$, which is what we were looking for in eq. (3). Therefore $\|w_{k+1} - \rho w^*\|^2$ decreases by at least $\Gamma^2$ at every iteration.
Here the notion of convergence is that $w_k$ converges to $\rho w^*$ by making at least some decrement at each step. Thus, for $k \to \infty$, $\|w_k - \rho w^*\| \to 0$. Hence the proof of convergence.
However, the catch is: if $(w^*, w_0^*)$ is a solution, then $(\lambda w^*, \lambda w_0^*)$ is also a solution.
\[ \min_{w, w_0}\ \|w\|^2 + c\sum_i \xi_i \quad (44) \]
\[ \text{s.t. } \forall i: \quad y_i\,(\phi^T(x_i)\,w + w_0) \ge 1 - \xi_i \quad (46) \]
\[ \forall i: \quad \xi_i \ge 0 \quad (48) \]
In the soft margin formulation we account for the errors. The above formulation is one of the many formulations of soft SVMs. In this formulation, a large value of c means overfitting.
Dual Formulation
\[ d^* = \max_{\lambda \in \Re^m}\ \min_{x \in D}\left( f(x) + \sum_{i=1}^{m}\lambda_i g_i(x) \right) \quad (53) \]
\[ \text{s.t. } \lambda_i \ge 0 \quad (54) \]
Equation 53 is a convex optimization problem. Also, $d^* \le p^*$, and $(p^* - d^*)$ is called the duality gap.
If for some $(x^*, \lambda^*)$, where $x^*$ is primal feasible and $\lambda^*$ is dual feasible, the KKT conditions are satisfied and f and all the $g_i$ are convex, then $x^*$ is an optimal solution to the primal and $\lambda^*$ to the dual.
Also, the dual optimization problem becomes
\[ d^* = \max_{\lambda \in \Re^m} L(x^*, \lambda) \quad (55) \]
\[ \text{s.t. } \lambda_i \ge 0\ \forall i \quad (56) \]
\[ \text{where } L(x, \lambda) = f(x) + \sum_{i=1}^{m}\lambda_i g_i(x) \quad (57) \]
\[ L^*(\lambda) = \min_{x \in D} L(x, \lambda) \quad (58) \]
\[ = \min_{x \in KKT} L(x, \lambda) \quad (59) \]
\[ \lambda_i \ge 0\ \forall i \quad (60) \]
It happens to be that
\[ p^* = d^* \quad (61) \]
\[ L(\bar{w}, \bar{\xi}, w_0, \bar{\alpha}, \bar{\lambda}) = \frac{1}{2}\|w\|^2 + c\sum_{i=1}^{m}\xi_i + \sum_{i=1}^{m}\alpha_i\left(1 - \xi_i - y_i\,(\phi^T(x_i)\,w + w_0)\right) - \sum_{i=1}^{m}\lambda_i\xi_i \quad (62) \]
KKT 1.a:
\[ \nabla_w L = 0 \quad (63) \qquad \Longrightarrow \qquad w - \sum_{j=1}^{n}\alpha_j y_j\,\phi(x_j) = 0 \quad (64) \]
KKT 1.b:
\[ \nabla_{\xi_i} L = 0 \quad (65) \qquad \Longrightarrow \qquad c - \alpha_i - \lambda_i = 0 \quad (66) \]
KKT 1.c:
\[ \nabla_{w_0} L = 0 \quad (67) \qquad \Longrightarrow \qquad \sum_{i=1}^{n}\alpha_i y_i = 0 \quad (68) \]
KKT 2:
\[ \forall i: \quad y_i\,(\phi^T(x_i)\,w + w_0) \ge 1 - \xi_i, \qquad \xi_i \ge 0 \quad (69\text{--}71) \]
KKT 3:
\[ \alpha_j \ge 0 \ \text{ and } \ \lambda_k \ge 0, \qquad \forall j, k = 1, \ldots, n \quad (72, 73) \]
KKT 4:
\[ \alpha_j\left(y_j\,(\phi^T(x_j)\,w + w_0) - 1 + \xi_j\right) = 0, \qquad \lambda_k\xi_k = 0 \quad (74, 75) \]
(a)
\[ w^* = \sum_{j=1}^{m}\alpha_j y_j\,\phi(x_j) \quad (76) \]
(b)
subject to the constraint
\[ \forall i: \quad y_i\,(\phi^T(x_i)\,w + w_0) \ge 1 - \xi_i \]
The dual of the SVM optimization problem can be stated as
\[ \max_\alpha\ \left\{ -\frac{1}{2}\sum_{i=1}^{m}\sum_{j=1}^{m} y_i y_j\alpha_i\alpha_j\,\phi^T(x_i)\phi(x_j) + \sum_{j=1}^{m}\alpha_j \right\} \]
subject to the constraints
\[ \sum_i \alpha_i y_i = 0, \qquad \forall i:\ 0 \le \alpha_i \le c \]
The duality gap $= f(x^*) - L^*(\lambda^*) = 0$, as shown in the last lecture. Thus, as is evident from the solution of the dual problem,
\[ w^* = \sum_{i=1}^{m}\alpha_i^* y_i\,\phi(x_i) \]
To obtain $w_0^*$, we can use the fact (as shown in the last lecture) that if $\alpha_i \in (0, C)$, then $y_i\,(\phi^T(x_i)\,w + w_0) = 1$. Thus, for any point $x_i$ such that $\alpha_i \in (0, C)$, that is, $x_i$ is a point on the margin,
\[ w_0^* = \frac{1 - y_i\,\phi^T(x_i)\,w^*}{y_i} = y_i - \phi^T(x_i)\,w^* = y_i - \sum_{j=1}^{m}\alpha_j^* y_j\,\phi^T(x_i)\phi(x_j) = y_i - \sum_{j=1}^{m}\alpha_j^* y_j K_{ij} \]
Generation of the φ space
For a given $x = [x_1, x_2, \ldots, x_n]$, $\phi(x) = [x_1^d, x_2^d, x_3^d, \ldots, x_1^{d-1}x_2, \ldots]$ (all monomials of degree d).
For $n = 2$, $d = 2$, $\phi(x) = [x_1^2, x_1x_2, x_2x_1, x_2^2]$; thus,
\[ \phi^T(x)\cdot\phi(\bar{x}) = \sum_{i=1}^{n}\sum_{j=1}^{n} x_i x_j\,\bar{x}_i\bar{x}_j = \left(\sum_{i=1}^{n} x_i\bar{x}_i\right)\left(\sum_{j=1}^{n} x_j\bar{x}_j\right) = \left(\sum_{i=1}^{n} x_i\bar{x}_i\right)^2 = (x^T\bar{x})^2 \]
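A quick numerical check of this identity for n = 2, d = 2: the explicit feature map and the kernel $(x^T\bar{x})^2$ give the same inner product.

```python
import numpy as np

def phi_2d_deg2(x):
    """Explicit feature map for n = 2, d = 2: [x1^2, x1 x2, x2 x1, x2^2]."""
    x1, x2 = x
    return np.array([x1 * x1, x1 * x2, x2 * x1, x2 * x2])

x, xb = np.array([1.0, 2.0]), np.array([3.0, -1.0])
lhs = phi_2d_deg2(x) @ phi_2d_deg2(xb)   # phi(x) . phi(x_bar) in the explicit phi space
rhs = (x @ xb) ** 2                      # (x^T x_bar)^2, computed without forming phi
print(np.isclose(lhs, rhs))              # True
```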
3. Positivity of the diagonal
\[ K = V\Lambda V^T \]
where V is the eigenvector matrix (an orthogonal matrix), and $\Lambda$ is the diagonal matrix of eigenvalues.
Hence K must be:
1. Symmetric.
2. Positive semi-definite.
3. Having non-negative diagonal entries.
Examples of Kernels
1. $K_{ij} = (x_i^T x_j)^d$
2. $K_{ij} = (x_i^T x_j + 1)^d$
3. Gaussian or Radial Basis Function (RBF): $K_{ij} = e^{-\frac{\|x_i - x_j\|^2}{2\sigma^2}}$ $(\sigma \in \Re,\ \sigma \ne 0)$
4. The hyperbolic tangent function: $K_{ij} = \tanh(\sigma x_i^T x_j + c)$
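A short sketch that builds an RBF kernel matrix (using the squared-norm form assumed above) and checks the three properties listed earlier: symmetry, positive semi-definiteness, and a non-negative diagonal.

```python
import numpy as np

def rbf_kernel_matrix(X, sigma=1.0):
    """K_ij = exp(-||x_i - x_j||^2 / (2 sigma^2))."""
    sq = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    return np.exp(-sq / (2.0 * sigma ** 2))

X = np.random.randn(10, 3)
K = rbf_kernel_matrix(X)
print(np.allclose(K, K.T))                       # symmetric
print(np.all(np.linalg.eigvalsh(K) >= -1e-10))   # positive semi-definite (up to round-off)
print(np.all(np.diag(K) >= 0))                   # non-negative diagonal
```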
Define $\phi(x_i) = \phi'^T(x_i')\,\phi''^T(x_i'')$. Thus, $K_{ij} = \phi(x_i)\,\phi(x_j)$. Hence, K is a valid kernel.
\[ \min_{\alpha_1, \alpha_2}\ -\alpha_1 - \alpha_2 - \sum_{i \ne 1,2}\alpha_i + \frac{1}{2}\left(\alpha_1^2 K_{11} + \alpha_2^2 K_{22}\right) + \alpha_1\alpha_2 K_{12}\,y_1 y_2 + \alpha_1 y_1\sum_{i \ne 1,2} K_{1i}\alpha_i y_i + \alpha_2 y_2\sum_{i \ne 1,2} K_{2i}\alpha_i y_i \quad (77) \]
\[ \text{s.t. } \alpha_1 y_1 + \alpha_2 y_2 = -\sum_{j \ne 1,2}\alpha_j y_j = \alpha_1^{old} y_1 + \alpha_2^{old} y_2, \qquad \alpha_1, \alpha_2 \in [0, c] \]
Then the objective is just a function of $\alpha_2$; let the objective be $-D(\alpha_2)$. Now the program reduces to
\[ \min_{\alpha_2}\ -D(\alpha_2) \qquad \text{s.t. } \alpha_2 \in [0, c] \]
Find $\alpha_2^*$ such that $\frac{\partial D(\alpha_2)}{\partial \alpha_2} = 0$. We have to ensure that $\alpha_1 \in [0, c]$; based on that we will have to clip $\alpha_2$, i.e., shift it to a certain interval. The condition is as follows:
\[ 0 \le -\frac{y_2}{y_1}\alpha_2 + \alpha_1^{old} + \frac{y_2}{y_1}\alpha_2^{old} \le c \]
5. Two cases arise: case 1: $y_1 = y_2$, and case 2: $y_1 = -y_2$. If $\alpha_2$ is already in the resulting interval then there is no problem. If it is more than the upper limit, reset it to the upper limit; this ensures the optimum value of the objective subject to this condition. Similarly, if $\alpha_2$ goes below the lower limit, reset it to the lower limit. A small sketch of this clipping step follows below.
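The clipping step can be written compactly as below; this follows the usual SMO-style bounds implied by $0 \le \alpha_1 \le c$ and $\alpha_2 \in [0, c]$, and the function name is an illustrative assumption.

```python
def clip_alpha2(alpha2_new, alpha1_old, alpha2_old, y1, y2, c):
    """Clip the unconstrained alpha2 to the interval implied by 0 <= alpha1 <= c.

    Uses alpha1 = alpha1_old + (y2/y1) * (alpha2_old - alpha2), so the box on alpha1
    together with alpha2 in [0, c] gives the bounds below.
    """
    if y1 == y2:                                    # case 1: labels equal
        low = max(0.0, alpha1_old + alpha2_old - c)
        high = min(c, alpha1_old + alpha2_old)
    else:                                           # case 2: labels opposite
        low = max(0.0, alpha2_old - alpha1_old)
        high = min(c, c + alpha2_old - alpha1_old)
    return min(max(alpha2_new, low), high)
```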
\[ \text{Dual:} \quad \min_\alpha\ -\sum_i\alpha_i + \frac{1}{2}\sum_i\sum_j\alpha_i\alpha_j y_i y_j K_{ij} \quad (78) \]
\[ \text{s.t. } \sum_i\alpha_i y_i = 0, \qquad \alpha_i \in [0, c] \]
The above program is a quadratic program. Any quadratic programming solver can be used to solve (78), but a generic solver will not exploit the special structure of the solution and may not be efficient. One way to solve (78) is by using projection methods (also called the kernel adatron). It can also be solved in two ways: chunking methods and decomposition methods.
The chunking method is as follows:
1. Initialize the $\alpha_i$'s arbitrarily.
2. Choose points (i.e., the components $\alpha_i$) that violate the KKT conditions.
3. Consider only the working set of chosen points and solve the dual for the variables in the working set:
\[ \min_{\alpha_i,\ i \in WS}\ -\sum_{i \in WS}\alpha_i + \frac{1}{2}\sum_{i \in WS}\sum_{j \in WS}\alpha_i\alpha_j y_i y_j K_{ij} \quad (79) \]
\[ \text{s.t. } \sum_{i \in WS}\alpha_i y_i = -\sum_{j \notin WS}\alpha_j y_j, \qquad \alpha_i \in [0, c] \]
Decomposition methods follow almost the same procedure, except that in step 2 we always take a fixed number of points which violate the KKT conditions the most. A sketch of solving the dual (78) with a generic QP solver is given below.
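As an illustration of handing (78) to a generic solver, here is a sketch using the cvxopt QP interface; cvxopt is an assumed external dependency, and in practice a structure-aware method (SMO, chunking, decomposition) is preferred, as noted above.

```python
import numpy as np
from cvxopt import matrix, solvers

def solve_svm_dual(K, y, c):
    """Solve the dual (78): min_a  -sum_i a_i + 1/2 sum_ij a_i a_j y_i y_j K_ij
    s.t. sum_i a_i y_i = 0 and 0 <= a_i <= c, via a generic QP solver."""
    n = len(y)
    y = y.astype(float)
    P = matrix(np.outer(y, y) * K)                    # quadratic term y_i y_j K_ij
    q = matrix(-np.ones(n))                           # linear term -sum_i a_i
    G = matrix(np.vstack([-np.eye(n), np.eye(n)]))    # encodes 0 <= a_i <= c
    h = matrix(np.hstack([np.zeros(n), c * np.ones(n)]))
    A = matrix(y.reshape(1, n))                       # equality constraint y^T a = 0
    b = matrix(0.0)
    sol = solvers.qp(P, q, G, h, A, b)
    return np.ravel(sol['x'])
```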
Further Reading
For SVMs in general and kernel methods in particular, read the SVM book An Introduction to Support Vector Machines and Other Kernel-based Learning Methods by Nello Cristianini and John Shawe-Taylor, uploaded on Moodle.