Speeding Up Kernel Methods, and Intro To Unsupervised Learning
Piyush Rai
Intro to Machine Learning (CS771A)
Recap: Nonlinear Mappings
Idea: Use a nonlinear mapping φ : R^D → R^M to map the original data to a high-dimensional space
Learn a linear model in the new space using the mapped inputs φ(x_1), ..., φ(x_N)
Equivalent to learning a nonlinear model on the original data x_1, ..., x_N
The mappings can be explicitly defined, or implicitly defined via a kernel function k, s.t. k(x_n, x_m) = φ(x_n)^T φ(x_m)
Benefit of using kernels: Don't need to explicitly compute the mappings (M can be very large)
Many ML algos only have data appearing as inner products. Such algos can be kernelized.
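As an illustration (my own example, not from the slides): a minimal sketch of the idea with an explicit quadratic-ish feature map on 2-D inputs, where a linear classifier fails in the original space but works after the map. The map φ and the use of scikit-learn's LogisticRegression are choices made for this example.

```python
# A minimal sketch (assumed example): a linear classifier fails on concentric
# circles in the original 2-D space, but succeeds after an explicit nonlinear
# feature map phi: R^2 -> R^5.
import numpy as np
from sklearn.datasets import make_circles
from sklearn.linear_model import LogisticRegression

X, y = make_circles(n_samples=400, noise=0.05, factor=0.5, random_state=0)

def phi(X):
    """Explicit feature map phi(x) = [x1, x2, x1^2, x2^2, x1*x2]."""
    x1, x2 = X[:, 0], X[:, 1]
    return np.column_stack([x1, x2, x1**2, x2**2, x1 * x2])

linear_on_raw = LogisticRegression().fit(X, y)
linear_on_phi = LogisticRegression().fit(phi(X), y)

print("accuracy in original space:", linear_on_raw.score(X, y))       # roughly chance level
print("accuracy in mapped space  :", linear_on_phi.score(phi(X), y))  # close to 1.0
```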
Recap: Nonlinear Mappings and Kernels
A kernel function k(x_n, x_m) = φ(x_n)^T φ(x_m) defines an inner-product similarity between two inputs
This is a Euclidean similarity in the φ space but a "nonlinear" similarity in the original space
k(x_n, x_m) = x_n^T x_m (Linear kernel)
k(x_n, x_m) = (1 + x_n^T x_m)^2 (Quadratic kernel)
k(x_n, x_m) = (1 + x_n^T x_m)^d (Polynomial kernel of degree d)
k(x_n, x_m) = exp[−γ ||x_n − x_m||^2] (RBF/Gaussian kernel)
Again, remember that when using kernels, we don't have to compute φ explicitly
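A small sketch (my own illustration, not from the slides) of these kernels as plain NumPy functions, plus a numerical check that the quadratic kernel on 2-D inputs really is an ordinary inner product under an explicit feature map φ(x) = [1, √2 x1, √2 x2, x1², x2², √2 x1 x2].

```python
# The four kernels from the slide as NumPy functions, with a check that the
# quadratic kernel equals phi(x)^T phi(z) for an explicit map. (Illustrative.)
import numpy as np

def linear_kernel(x, z):
    return x @ z

def quadratic_kernel(x, z):
    return (1 + x @ z) ** 2

def polynomial_kernel(x, z, d=3):
    return (1 + x @ z) ** d

def rbf_kernel(x, z, gamma=1.0):
    return np.exp(-gamma * np.sum((x - z) ** 2))

def phi_quadratic(x):
    """Explicit map for the quadratic kernel on 2-D inputs:
    phi(x) = [1, sqrt(2)x1, sqrt(2)x2, x1^2, x2^2, sqrt(2)x1x2]."""
    x1, x2 = x
    s = np.sqrt(2)
    return np.array([1.0, s * x1, s * x2, x1**2, x2**2, s * x1 * x2])

x, z = np.array([0.3, -1.2]), np.array([2.0, 0.5])
print(quadratic_kernel(x, z), phi_quadratic(x) @ phi_quadratic(z))  # same value
```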
Recap: Nonlinear Mappings and Kernels
[Figure: a nonlinear map helps (the data become linearly separable), whereas a linear map does not help.]
Kernel Methods can be Slow
Training phase can be slow (if N is very large): the N × N kernel matrix must be computed and stored, and the implicit feature space may be very high-dimensional
Testing (prediction) phase can be slow (scales in N, or at least in the number of support vectors)
Speeding Up Kernel Methods
Instead of a high-dim φ(x), can we get a good set of low-dim features ψ(x) ∈ R^L using the kernel?
If ψ(x) is a good approximation of φ(x), then we can just use ψ(x) in a linear model
Using Kernels to “Extract” Good Features: Landmarks
Suppose we choose a small set of L “landmark” inputs z_1, ..., z_L from the training data
For each input x_n, using a kernel k, define an L-dimensional feature vector as follows
ψ(x_n) = [k(z_1, x_n), k(z_2, x_n), ..., k(z_L, x_n)]
ψ(x_n) ∈ R^L is such that k(x_n, x_m) = φ(x_n)^T φ(x_m) ≈ ψ(x_n)^T ψ(x_m)
Can now apply a linear model on the ψ representation (L-dimensional now) of the inputs
This will be fast both at training and at test time if L is small
No need to kernelize the linear model and work with kernels (but we still reap their benefits :-))
Note: The landmarks need not be actual inputs. They can even be learned from data.
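A minimal sketch (my own example, not from the slides) of landmark features with an RBF kernel, followed by an ordinary linear classifier. Choosing random training points as landmarks and using LogisticRegression are assumptions made for illustration.

```python
# Landmark features: pick L landmarks, represent each input by its kernel
# similarity to the landmarks, then train a plain linear model on that.
import numpy as np
from sklearn.datasets import make_moons
from sklearn.linear_model import LogisticRegression

def rbf(A, B, gamma=1.0):
    """Pairwise RBF kernel matrix between rows of A and rows of B."""
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * sq)

X, y = make_moons(n_samples=500, noise=0.1, random_state=0)

L = 20                                            # number of landmarks
rng = np.random.default_rng(0)
Z = X[rng.choice(len(X), size=L, replace=False)]  # landmarks = random training points

Psi = rbf(X, Z, gamma=2.0)                        # N x L feature matrix, rows are psi(x_n)
clf = LogisticRegression(max_iter=1000).fit(Psi, y)
print("training accuracy with L=20 landmark features:", clf.score(Psi, y))
```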
Using Kernels to “Extract” Good Features: Random Features
Idea: use randomization to construct an L-dimensional feature map ψ(x) such that ψ(x_n)^T ψ(x_m) ≈ k(x_n, x_m)
Can apply a linear model on this L-dim representation of the data (no need to kernelize)
Such techniques exist for several kernels (RBF, polynomial, etc.)
† Random Features for Large-Scale Kernel Machines (Rahimi and Recht, NIPS 2007). Note: This paper won the test-of-time award at NIPS 2017.
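As a concrete sketch (my own construction following the Rahimi–Recht recipe, not copied from the slides): random Fourier features for the RBF kernel k(x_n, x_m) = exp[−γ ||x_n − x_m||²], whose inner products approximate the kernel.

```python
# Random Fourier features for the RBF kernel (sketch).
# psi(x) = sqrt(2/L) * cos(W^T x + b), with columns of W ~ N(0, 2*gamma*I) and
# b ~ Unif[0, 2*pi], gives psi(x)^T psi(z) ~= exp(-gamma * ||x - z||^2).
import numpy as np

def random_fourier_features(X, L=200, gamma=1.0, seed=0):
    rng = np.random.default_rng(seed)
    D = X.shape[1]
    W = rng.normal(scale=np.sqrt(2 * gamma), size=(D, L))  # spectral samples
    b = rng.uniform(0, 2 * np.pi, size=L)                  # random phases
    return np.sqrt(2.0 / L) * np.cos(X @ W + b)

rng = np.random.default_rng(1)
X = rng.normal(size=(5, 3))
Psi = random_fourier_features(X, L=5000, gamma=0.5)

exact = np.exp(-0.5 * ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1))  # true RBF Gram matrix
approx = Psi @ Psi.T                                                   # random-feature approximation
print("max abs error:", np.abs(exact - approx).max())                  # small for large L
```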
Other Techniques for Speeding Up Kernel Methods
Reducing the number of support vectors (for SVM-based models). For example:
Learn the kernelized SVM and identify the support vectors
Cluster the support vectors
Pick one SV from each cluster, and retrain the SVM using only the chosen SVs
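A rough sketch (my own illustration, one of several ways to realize this idea): train a kernel SVM, cluster its support vectors per class, keep one representative SV per cluster, and retrain on that reduced set. The dataset, kernel width, and number of clusters are all assumptions for the example.

```python
# Reduce the number of support vectors by clustering them and retraining.
import numpy as np
from sklearn.datasets import make_moons
from sklearn.svm import SVC
from sklearn.cluster import KMeans

X, y = make_moons(n_samples=1000, noise=0.2, random_state=0)

full_svm = SVC(kernel="rbf", gamma=2.0).fit(X, y)
sv_idx = full_svm.support_                     # indices of the support vectors
print("original number of SVs:", len(sv_idx))

keep = []
for c in np.unique(y):
    idx_c = sv_idx[y[sv_idx] == c]             # SVs belonging to class c
    km = KMeans(n_clusters=10, n_init=10, random_state=0).fit(X[idx_c])
    for center in km.cluster_centers_:         # nearest SV to each cluster center
        keep.append(idx_c[np.argmin(((X[idx_c] - center) ** 2).sum(1))])
keep = np.array(keep)

small_svm = SVC(kernel="rbf", gamma=2.0).fit(X[keep], y[keep])
print("reduced SVM trained on", len(keep), "points; accuracy:", small_svm.score(X, y))
```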
Kernel Methods: Some Final Comments
Sometimes, even linear models can be trained via kernelization (but with a linear kernel)
Benefit? This may sometimes be preferable for computational reasons
For example, ridge regression requires solving
w = (X^T X + λ I_D)^{-1} X^T y
... where we learn w by inverting a D × D matrix
Instead, the dual version of ridge regression, as we saw earlier, requires solving
w = X^T (X X^T + λ I_N)^{-1} y = X^T α = Σ_{n=1}^N α_n x_n
... where we learn w in terms of α by inverting an N × N matrix
Even when working with a linear model, if D > N, the latter way may be preferable
Similar considerations apply to other kernelizable models too (e.g., SVM)
If a linear model is what you want, it still makes sense to look at the relative values of N and D to decide whether to go for the dual (kernelized) formulation of the problem with a linear kernel
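A tiny numerical check (my own, using the slide's notation and an assumed λ) that the primal and dual ridge solutions coincide, so the choice between the D × D and N × N inverse is purely computational.

```python
# Primal vs dual ridge regression: both give the same w; choose based on
# whether D x D or N x N is the smaller system to solve. (Assumed lambda.)
import numpy as np

rng = np.random.default_rng(0)
N, D, lam = 50, 200, 0.1                 # D > N: the dual (N x N) solve is cheaper
X = rng.normal(size=(N, D))
y = rng.normal(size=N)

# Primal: solve a D x D system
w_primal = np.linalg.solve(X.T @ X + lam * np.eye(D), X.T @ y)

# Dual: solve an N x N system, then w = X^T alpha = sum_n alpha_n x_n
alpha = np.linalg.solve(X @ X.T + lam * np.eye(N), y)
w_dual = X.T @ alpha

print("max |w_primal - w_dual|:", np.abs(w_primal - w_dual).max())  # ~0 up to numerical precision
```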
Unsupervised Learning
Unsupervised Learning
Most unsupervised learning algos can also be seen as learning a new representation of the data
Typically a compressed representation, e.g., clustering can be used to get a one-hot representation
Example: with K = 6 clusters, an input assigned to the third cluster gets the one-hot (quantized) representation [0 0 1 0 0 0]
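A small snippet (illustrative) turning cluster assignments into the one-hot representation mentioned above, here with K = 6.

```python
# Cluster assignments as a compressed, one-hot ("quantized") representation.
import numpy as np

K = 6
assignments = np.array([2, 0, 5, 2])          # cluster index of each of 4 inputs
one_hot = np.eye(K, dtype=int)[assignments]   # shape (4, 6)
print(one_hot[0])                             # [0 0 1 0 0 0]: input 0 is in the third cluster
```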
Clustering
Given: N unlabeled examples {x_1, ..., x_N}; the number of desired partitions K
Goal: Group the examples into K “homogeneous” partitions
Picture courtesy: “Data Clustering: 50 Years Beyond K-Means”, A.K. Jain (2008)
Similarity can be Subjective
Clustering only looks at similarities; no labels are given
Without labels, similarity can be hard to define
Clustering: Some Examples
Document/Image/Webpage Clustering
Types of Clustering
1 Flat or Partitional clustering
Partitions are independent of each other
2 Hierarchical clustering
Partitions can be visualized using a tree structure (a dendrogram)
Possible to view partitions at different levels of granularity by “cutting” the tree at some level
Flat Clustering: K-means algorithm (Lloyd, 1957)
Input: N examples {x_1, ..., x_N}, x_n ∈ R^D; the number of partitions K
Desired Output: Cluster assignments of these N examples and K cluster means µ_1, ..., µ_K
Initialize: K cluster means µ_1, ..., µ_K, each µ_k ∈ R^D
Usually initialized randomly, but good initialization is crucial; many smarter initialization heuristics exist (e.g., K-means++, Arthur & Vassilvitskii, 2007)
Iterate:
(Re)-Assign each example x_n to its closest cluster center (based on the smallest Euclidean distance)
(Re)-Compute each cluster mean µ_k as the average of the examples currently assigned to cluster k
Repeat until convergence (e.g., the cluster assignments stop changing)
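A minimal NumPy sketch of Lloyd's algorithm as described above (my own implementation for illustration; random initialization and a fixed iteration cap are assumptions).

```python
# Lloyd's K-means: alternate between assigning points to the nearest mean and
# recomputing each mean, until the assignments stop changing.
import numpy as np

def kmeans(X, K, max_iters=100, seed=0):
    rng = np.random.default_rng(seed)
    mu = X[rng.choice(len(X), size=K, replace=False)]            # random initialization
    z = np.full(len(X), -1, dtype=int)
    for _ in range(max_iters):
        # Assignment step: closest mean by Euclidean distance
        dists = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(-1)  # N x K squared distances
        z_new = dists.argmin(axis=1)
        if np.array_equal(z_new, z):                             # converged
            break
        z = z_new
        # Mean-update step: average of the points assigned to each cluster
        for k in range(K):
            if np.any(z == k):
                mu[k] = X[z == k].mean(axis=0)
    return z, mu

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(loc=c, scale=0.3, size=(50, 2)) for c in ([0, 0], [3, 3], [0, 3])])
z, mu = kmeans(X, K=3)
print("cluster sizes:", np.bincount(z), "\nmeans:\n", mu)
```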
[Demo figures: K-means with K = 2, starting from a random initialization and alternating between “assigning points” and “recomputing the centers” over iterations 1–4.]
What Loss Function is K-means Optimizing?
Let µ_1, ..., µ_K be the K cluster centroids (means)
Let z_nk ∈ {0, 1} be s.t. z_nk = 1 if x_n belongs to cluster k, and 0 otherwise
Note: z_n = [z_n1 z_n2 ... z_nK] represents a length-K one-hot encoding of x_n
With this notation, K-means minimizes the within-cluster sum of squared distances
L({µ_k}, {z_n}) = Σ_{n=1}^N Σ_{k=1}^K z_nk ||x_n − µ_k||²
(the assignment step minimizes over the z_n's with the means fixed; the mean-update step minimizes over the µ_k's with the assignments fixed)
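A short snippet (illustrative) computing this objective for a given clustering, with z stored as an integer cluster index per point rather than the one-hot vector.

```python
# Within-cluster sum of squares: sum_n sum_k z_nk * ||x_n - mu_k||^2.
import numpy as np

def kmeans_objective(X, z, mu):
    """z[n] is the cluster index of x_n (equivalent to the one-hot z_n above)."""
    return float(((X - mu[z]) ** 2).sum())

X = np.array([[0.0, 0.0], [0.0, 1.0], [5.0, 5.0]])
mu = np.array([[0.0, 0.5], [5.0, 5.0]])
z = np.array([0, 0, 1])            # first two points in cluster 1, last point in cluster 2
print(kmeans_objective(X, z, mu))  # 0.25 + 0.25 + 0.0 = 0.5
```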