K-Means Algorithm
The Algorithm
K-means (MacQueen, 1967) is one of the simplest unsupervised learning algorithms that solve the well-known clustering problem. The procedure follows a simple and easy way to classify a given data set through a certain number of clusters (assume k clusters) fixed a priori. The main idea is to define k centroids, one for each cluster. These centroids should be placed carefully, because different locations lead to different results; the best choice is to place them as far away from each other as possible. The next step is to take each point belonging to the data set and associate it with the nearest centroid. When no point is pending, the first step is completed and an early grouping is done. At this point we re-calculate k new centroids as barycenters of the clusters resulting from the previous step. After we have these k new centroids, a new binding has to be done between the same data set points and the nearest new centroid. A loop has been generated. As a result of this loop we may notice that the k centroids change their location step by step until no more changes occur; in other words, the centroids no longer move. Finally, this algorithm aims at minimizing an objective function, in this case a squared error function.
The objective function

    J = Σ_{j=1}^{k} Σ_{i=1}^{n} || x_i^(j) - c_j ||^2,

where || x_i^(j) - c_j ||^2 is a chosen distance measure between a data point x_i^(j) and the cluster centre c_j, is an indicator of the distance of the n data points from their respective cluster centres. The algorithm is composed of the following steps:

1. Place K points into the space represented by the objects that are being clustered. These points represent the initial group centroids.
2. Assign each object to the group that has the closest centroid.
3. When all objects have been assigned, recalculate the positions of the K centroids.
4. Repeat steps 2 and 3 until the centroids no longer move. This produces a separation of the objects into groups from which the metric to be minimized can be calculated.

Although it can be proved that the procedure will always terminate, the k-means algorithm does not necessarily find the optimal configuration corresponding to the global minimum of the objective function. The algorithm is also significantly sensitive to the initial, randomly selected cluster centres; it can be run multiple times to reduce this effect. K-means is a simple algorithm that has been adapted to many problem domains. As we are going to see, it is a good candidate for extension to work with fuzzy feature vectors.
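As a minimal illustration of this squared-error objective, the following NumPy sketch computes J for a given assignment of points to centroids (the array names X, centroids, and labels are illustrative assumptions, not part of the original text):

import numpy as np

def kmeans_objective(X, centroids, labels):
    """Sum of squared distances from each data point to its assigned cluster centre.

    X:         (n, d) array of data points
    centroids: (k, d) array of cluster centres
    labels:    (n,) array giving each point's cluster index
    """
    diffs = X - centroids[labels]        # vector from each point to its own centre
    return float((diffs ** 2).sum())     # squared Euclidean error, summed over all points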
An example
Suppose that we have n sample feature vectors x1, x2, ..., xn all from the same class, and we know that they fall into k compact clusters, k < n. Let mi be the mean of the vectors in cluster i. If the clusters are well separated, we can use a minimum-distance classifier to
separate them. That is, we can say that x is in cluster i if || x - mi || is the minimum of all the k distances. This suggests the following procedure for finding the k means:
Make initial guesses for the means m1, m2, ..., mk
Until there are no changes in any mean
    Use the estimated means to classify the samples into clusters
    For i from 1 to k
        Replace mi with the mean of all of the samples for cluster i
    end_for
end_until
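A runnable sketch of this procedure in Python/NumPy follows; it also carries out the minimum-distance classification described above. The random-sample initialization, iteration cap, and convergence test are assumptions made for illustration, not prescribed by the text:

import numpy as np

def kmeans(X, k, max_iter=100, seed=0):
    """Plain k-means: guess k means, then alternate classification and mean updates."""
    rng = np.random.default_rng(seed)
    means = X[rng.choice(len(X), size=k, replace=False)].astype(float)  # initial guesses
    for _ in range(max_iter):
        # classify each sample by the nearest mean (minimum-distance classifier)
        d2 = ((X[:, None, :] - means[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        # replace m_i with the mean of all samples assigned to cluster i
        new_means = np.array([X[labels == i].mean(axis=0) if np.any(labels == i)
                              else means[i]          # leave an empty cluster's mean in place
                              for i in range(k)])
        if np.allclose(new_means, means):            # no mean changed: stop
            break
        means = new_means
    return means, labels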
Here is an example showing how the means m1 and m2 move into the centers of two clusters.
Remarks

This is a simple version of the k-means procedure. It can be viewed as a greedy algorithm for partitioning the n samples into k clusters so as to minimize the sum of the squared distances to the cluster centers. It does have some weaknesses:
- The way to initialize the means was not specified. One popular way to start is to randomly choose k of the samples.
- The results produced depend on the initial values for the means, and it frequently happens that suboptimal partitions are found. The standard solution is to try a number of different starting points.
- It can happen that the set of samples closest to mi is empty, so that mi cannot be updated. This is an annoyance that must be handled in an implementation, but that we shall ignore.
- The results depend on the metric used to measure || x - mi ||. A popular solution is to normalize each variable by its standard deviation, though this is not always desirable.
- The results depend on the value of k.
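As a sketch of two of the standard remedies above (per-variable standardization and multiple random restarts), here is one way to do both with scikit-learn, used purely as an off-the-shelf stand-in for the procedure in this text; the data and parameter values are placeholders:

import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

X = np.random.default_rng(0).normal(size=(300, 2))   # placeholder data

X_scaled = StandardScaler().fit_transform(X)          # normalize each variable by its std dev
km = KMeans(n_clusters=3, n_init=20, random_state=0).fit(X_scaled)  # 20 random restarts
print(km.inertia_)   # the run with the smallest sum of squared distances is kept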
The last of these weaknesses is particularly troublesome, since we often have no way of knowing how many clusters exist. In the example shown above, the same algorithm applied to the same data produces the following 3-means clustering. Is it better or worse than the 2-means clustering?
Unfortunately there is no general theoretical solution for finding the optimal number of clusters for any given data set. A simple approach is to compare the results of multiple runs with different values of k and choose the best one according to a given criterion (for instance the Schwarz Criterion - see Moore's slides), but we need to be careful: increasing k yields smaller error-function values by definition, but also an increasing risk of overfitting.
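A minimal sketch of this comparison, again using scikit-learn as a stand-in and placeholder data, runs the algorithm for several values of k and records the error-function value of each run:

import numpy as np
from sklearn.cluster import KMeans

X = np.random.default_rng(1).normal(size=(300, 2))   # placeholder data

for k in range(2, 8):
    sse = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_
    print(k, sse)   # SSE always shrinks as k grows; a selection criterion must penalize k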
Example: Let us apply the k-means clustering algorithm to the same example as in the previous page and obtain four clusters:

Food item     Protein content, P   Fat content, F
Food item #1  1.1                  60
Food item #2  8.2                  20
Food item #3  4.2                  35
Food item #4  1.5                  21
Food item #5  7.6                  15
Food item #6  2.0                  55
Food item #7  3.9                  39

Let us plot these points so that we can have a better understanding of the problem. Also, we can select the four points which are farthest apart.
We see from the graph that the distances between points 1 and 2, 1 and 3, 1 and 4, 1 and 5, 2 and 3, 2 and 4, and 3 and 4 are the largest. Thus, the four clusters chosen are:

Cluster number       C1    C2    C3    C4
Protein content, P   1.1   8.2   4.2   1.5
Fat content, F       60    20    35    21
Also, we observe that point 6 is close to point 1, so the two can be taken as one cluster, called the C16 cluster. The value of P for the C16 centroid is (1.1 + 2.0)/2 = 1.55 and the value of F is (60 + 55)/2 = 57.50. Similarly, point 5 can be merged with point 2; the resulting cluster is called the C25 cluster. The value of P for the C25 centroid is (8.2 + 7.6)/2 = 7.9 and the value of F is (20 + 15)/2 = 17.50. Point 7 is close to point 3, and they can be merged into the C37 cluster. The value of P for the C37 centroid is (4.2 + 3.9)/2 = 4.05 and the value of F is (35 + 39)/2 = 37. Point 4 is not close to any other point, so it stays on its own as cluster C4, with P = 1.5 and F = 21 for its centroid. Finally, four clusters with four centroids have been obtained:

Cluster number   Protein content, P   Fat content, F
C16              1.55                 57.50
C25              7.90                 17.50
C37              4.05                 37.00
C4               1.50                 21.00
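One assignment-and-update pass of the algorithm reproduces this result; the following NumPy sketch (variable names are illustrative) uses items 1-4 as the four initial centroids:

import numpy as np

# (protein P, fat F) for food items 1..7, from the table above
X = np.array([[1.1, 60], [8.2, 20], [4.2, 35], [1.5, 21],
              [7.6, 15], [2.0, 55], [3.9, 39]])
centroids = X[[0, 1, 2, 3]].copy()   # items 1-4 as the four initial centroids

# assignment step: each item joins the nearest centroid (Euclidean distance)
labels = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2).argmin(axis=1)

# update step: each centroid moves to the mean of its members
centroids = np.array([X[labels == j].mean(axis=0) for j in range(4)])
print(labels)      # [0 1 2 3 1 0 2] -> clusters {1,6}, {2,5}, {3,7}, {4}
print(centroids)   # [[1.55 57.5 ] [7.9  17.5 ] [4.05 37.  ] [1.5  21.  ]]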
In the above example it was quite easy to judge the distances between the points by inspection. In cases where this is more difficult, one has to use the Euclidean metric to measure the distance between two points in order to assign each point to a cluster.
Dilation and Erosion
The following figure illustrates the dilation of a binary image. Note how the structuring element defines the neighborhood of the pixel of interest, which is circled. (See Understanding Structuring Elements for more information.) The dilation function applies the appropriate rule to the pixels in the neighborhood and assigns a value to the corresponding pixel in the output image. In the figure, the morphological dilation function sets the value of the output pixel to 1 because one of the elements in the neighborhood defined by the structuring element is on.
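The binary rule can be sketched directly in a few lines of NumPy; this is only an illustration of the rule as stated above (it assumes an odd-sized structuring element with its origin at the center), not the toolbox's implementation:

import numpy as np

def binary_dilate(image, se):
    """Set an output pixel to 1 if any input pixel in the neighborhood
    selected by the structuring element se is on."""
    r, c = se.shape[0] // 2, se.shape[1] // 2
    padded = np.pad(image, ((r, r), (c, c)), constant_values=0)   # off pixels beyond the border
    out = np.zeros_like(image)
    for i in range(image.shape[0]):
        for j in range(image.shape[1]):
            neighborhood = padded[i:i + se.shape[0], j:j + se.shape[1]]
            out[i, j] = np.any(neighborhood[se == 1])
    return out

img = np.zeros((3, 5), dtype=np.uint8)
img[1, 2] = 1                                            # a single on pixel
print(binary_dilate(img, np.ones((3, 3), np.uint8)))     # the on pixel grows into a 3x3 block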
The following figure illustrates this processing for a grayscale image. The figure shows the processing of a particular pixel in the input image. Note how the function applies the rule to the input pixel's neighborhood and uses the highest value of all the pixels in the neighborhood as the value of the corresponding pixel in the output image.

[Figure: Morphological Dilation of a Grayscale Image]
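Because the grayscale rule is a neighborhood maximum, an equivalent operation is available off the shelf, for example scipy.ndimage.grey_dilation (a different library from the toolbox described here, shown only for illustration):

import numpy as np
from scipy import ndimage

img = np.array([[10, 10, 10],
                [10, 50, 10],
                [10, 10, 10]], dtype=np.uint8)

# each output pixel becomes the maximum over its 3x3 neighborhood
print(ndimage.grey_dilation(img, size=(3, 3)))   # every pixel becomes 50 in this small example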
Processing Pixels at Image Borders (Padding Behavior)

Morphological functions position the origin of the structuring element, its center element, over the pixel of interest in the input image. For pixels at the edge of an image, parts of the neighborhood defined by the structuring element can extend past the border of the image. To process border pixels, the morphological functions assign a value to these undefined pixels, as if the functions had padded the image with additional rows and columns. The value of these padding pixels varies for dilation and erosion operations. The following table describes the padding rules for dilation and erosion for both binary and grayscale images.

Rules for Padding Images

Operation   Rule
Dilation    Pixels beyond the image border are assigned the minimum value afforded by the data type. For binary images, these pixels are assumed to be set to 0. For grayscale images, the minimum value for uint8 images is 0.
Erosion     Pixels beyond the image border are assigned the maximum value afforded by the data type. For binary images, these pixels are assumed to be set to 1. For grayscale images, the maximum value for uint8 images is 255.

Note: By using the minimum value for dilation operations and the maximum value for erosion operations, the toolbox avoids border effects, where regions near the borders of the output image do not appear to be homogeneous with the rest of the image. For example, if erosion padded with a minimum value, eroding an image would result in a black border around the edge of the output image.
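A small NumPy sketch of the erosion padding rule shows why the maximum value is used; the 3x3 neighborhood and the example image are assumptions for illustration, not the toolbox's code:

import numpy as np

def gray_erode(image, pad_value):
    """Grayscale erosion with a 3x3 neighborhood; pad_value fills pixels beyond the border."""
    padded = np.pad(image, 1, constant_values=pad_value)
    out = np.empty_like(image)
    for i in range(image.shape[0]):
        for j in range(image.shape[1]):
            out[i, j] = padded[i:i + 3, j:j + 3].min()   # erosion keeps the neighborhood minimum
    return out

img = np.full((5, 5), 200, dtype=np.uint8)   # a uniform mid-gray image
print(gray_erode(img, 255))   # padding with the maximum: the output stays uniformly 200
print(gray_erode(img, 0))     # padding with the minimum: a black border artifact appears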