DS - ML - 7 - 60019210046
Experiment 10
Batch: C32
Aim: Explore K-means clustering on the given datasets.
Theory:
The K-means clustering algorithm computes centroids and repeats the process until it finds the optimal centroids. It assumes that the number of clusters is known in advance, and it is also known as a flat clustering algorithm. The letter 'K' in K-means denotes the number of clusters the method finds in the data.
In this method, data points are assigned to clusters in such a way that the sum of the squared distances between the data points and their cluster centroid is as small as possible. It is essential to note that lower variance within a cluster means the data points in that cluster are more similar to one another.
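In other words, the algorithm minimizes the within-cluster sum of squared errors (SSE); writing \mu_k for the centroid of cluster C_k, this objective can be stated as

SSE = \sum_{k=1}^{K} \sum_{x_i \in C_k} \lVert x_i - \mu_k \rVert^2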
The following steps describe how the K-means clustering technique works (a short numerical sketch of one iteration is given after the steps):
Step 1: First, we need to specify the number of clusters, K, that the algorithm should generate.
Step 2: Next, choose K data points at random and assign each data point to a cluster; in other words, partition the data into K initial clusters.
Step 3: Repeat the steps below until we reach the ideal centroids, i.e., until the assignment of data points to clusters no longer changes.
3.1 Compute the sum of squared distances between the data points and the current centroids.
3.2 Assign each data point to the cluster whose centroid is closest to it.
3.3 Recompute the centroid of each cluster as the mean of all the data points assigned to it.
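As a rough illustration of steps 3.1–3.3, the numpy sketch below runs a single assignment-and-update iteration on a small toy array; the variable names (points, centroids) are purely illustrative and are not part of the experiment's dataset.

import numpy as np

# Toy data and two illustrative starting centroids
points = np.array([[1.0, 1.0], [1.5, 2.0], [8.0, 8.0], [9.0, 8.5]])
centroids = np.array([[1.0, 1.0], [8.0, 8.0]])

# Steps 3.1 and 3.2: squared distances to every centroid, then nearest-centroid assignment
sq_dists = ((points[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
assignments = sq_dists.argmin(axis=1)   # -> [0, 0, 1, 1]
sse = sq_dists[np.arange(len(points)), assignments].sum()

# Step 3.3: recompute each centroid as the mean of the points assigned to it
centroids = np.array([points[assignments == k].mean(axis=0) for k in range(len(centroids))])
print(assignments, sse, centroids, sep="\n")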
When using the K-means algorithm, we must keep the following points in mind (a brief sketch illustrating both practices follows this list):
It is advisable to normalize the data when working with clustering algorithms such as K-means, since such algorithms use distance-based measures to determine the similarity between data points.
Because of the iterative nature of K-means and the random initialization of centroids, K-means may get stuck in a local optimum and fail to converge to the global optimum. It is therefore advisable to try several different initializations of the centroids.
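A minimal sketch of both practices, assuming scikit-learn is available: the two features (chosen here only for illustration) are standardized first, and ten random centroid initializations are tried via the n_init parameter, with the lowest-SSE run kept automatically.

import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

# Illustrative data: two features on very different scales
data = np.array([[1.0, 200.0], [1.2, 180.0], [8.0, 900.0], [8.5, 950.0]])

# Normalize so that both features contribute comparably to the distance computation
data_scaled = StandardScaler().fit_transform(data)

# Try 10 different random centroid initializations; the run with the lowest SSE is kept
km = KMeans(n_clusters=2, init="random", n_init=10, random_state=0).fit(data_scaled)
print(km.labels_, km.inertia_)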
Colab Link: https://colab.research.google.com/drive/1TLEPYAg7LruvFjTPuLLKXrW9SlKDAX-O
Task 1: Perform K-means clustering on Dataset 1 (random initialisation, 10 variations of the initial means, 300 iterations). Find the lowest SSE value, the final locations of the centroids, and the number of iterations needed to converge. Show the predicted labels for the first 10 points.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs

# Generate dataset
X, y = make_blobs(n_samples=100, centers=3, n_features=2, random_state=42)

# KMeans class
class KMeans:
    def __init__(self, K=3, max_iters=100, plot_steps=False):
        self.K = K
        self.max_iters = max_iters
        self.plot_steps = plot_steps

    def predict(self, X):
        self.X = X
        self.n_samples, self.n_features = X.shape
        # Initialize centroids as K randomly chosen samples
        random_sample_idxs = np.random.choice(self.n_samples, self.K, replace=False)
        self.centroids = [self.X[idx] for idx in random_sample_idxs]
        # Optimize clusters
        for _ in range(self.max_iters):
            # Assign samples to closest centroids (create clusters)
            self.clusters = self._create_clusters(self.centroids)
            if self.plot_steps:
                self.plot()
            # Recompute centroids and stop once they no longer move
            old_centroids = self.centroids
            self.centroids = self._get_centroids(self.clusters)
            if np.allclose(old_centroids, self.centroids):
                break
        # Return the cluster index of each sample
        labels = np.empty(self.n_samples)
        for cluster_idx, cluster in enumerate(self.clusters):
            labels[cluster] = cluster_idx
        return labels

    def _create_clusters(self, centroids):
        # Group sample indices by their nearest centroid
        clusters = [[] for _ in range(self.K)]
        for idx, sample in enumerate(self.X):
            distances = [np.linalg.norm(sample - c) for c in centroids]
            clusters[int(np.argmin(distances))].append(idx)
        return clusters

    def _get_centroids(self, clusters):
        # New centroid = mean of all samples in the cluster
        centroids = np.zeros((self.K, self.n_features))
        for cluster_idx, cluster in enumerate(clusters):
            cluster_mean = np.mean(self.X[cluster], axis=0)
            centroids[cluster_idx] = cluster_mean
        return centroids

    def plot(self):
        fig, ax = plt.subplots(figsize=(12, 8))
        for cluster in self.clusters:
            ax.scatter(*self.X[cluster].T)
        for c in self.centroids:
            ax.scatter(*c, marker="x", color="black", linewidth=2)
        plt.show()
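The class above does not record the SSE or the number of initializations on its own, so the sketch below (a suggested way to proceed, not necessarily the notebook's exact code) first runs the from-scratch class for its labels and then uses scikit-learn's KMeans, aliased to avoid shadowing the class above, with the task's settings (init="random", n_init=10, max_iter=300) to report the lowest SSE, the final centroid locations, the iterations to converge, and the predicted labels for the first 10 points.

# Run the from-scratch implementation defined above
custom_labels = KMeans(K=3, max_iters=300).predict(X)
print("From-scratch labels for first 10 points:", custom_labels[:10])

# scikit-learn equivalent with the settings required by Task 1
from sklearn.cluster import KMeans as SKKMeans

km = SKKMeans(n_clusters=3, init="random", n_init=10, max_iter=300, random_state=42)
sk_labels = km.fit_predict(X)
print("Lowest SSE (inertia):", km.inertia_)
print("Final centroid locations:\n", km.cluster_centers_)
print("Iterations to converge:", km.n_iter_)
print("Predicted labels for first 10 points:", sk_labels[:10])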