K Means Clustering - Introduction
Table of Content
What is K-means Clustering?
What is the objective of k-means clustering?
How k-means clustering works?
Implementation of K-Means Clustering in Python
https://www.geeksforgeeks.org/k-means-clustering-introduction/ 1/19
2/10/24, 1:31 AM K means Clustering - Introduction - GeeksforGeeks
(It will help if you think of items as points in an n-dimensional space). The algorithm will
categorize the items into k groups or clusters of similarity. To calculate that similarity, we will
use the Euclidean distance as a measurement.
The cluster centers are called means because each one is the mean of the items assigned to it.
There are several ways to initialize these means. An intuitive method is to initialize them at
random items in the data set. Another is to initialize them at random values within the bounds
of the data set (if the items have values in [0,3] for a feature x, the initial means get
values for x in [0,3]).
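The two initialization options described above can be sketched as follows. This is a minimal illustration, not the article's own code: the helper names `init_from_items` and `init_within_bounds` and the toy data are assumptions made for the example.

```python
import numpy as np

def init_from_items(X, k, rng):
    """Pick k distinct items from the data set as the initial means."""
    idx = rng.choice(X.shape[0], size=k, replace=False)
    return X[idx].copy()

def init_within_bounds(X, k, rng):
    """Draw k means uniformly between the per-feature min and max."""
    low, high = X.min(axis=0), X.max(axis=0)
    return rng.uniform(low, high, size=(k, X.shape[1]))

rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 3, size=(20, 2))  # toy data with features in [0, 3)
means_a = init_from_items(X_demo, 3, rng)      # each row is an actual item
means_b = init_within_bounds(X_demo, 3, rng)   # each row lies inside the data bounds
```

Starting from actual items guarantees that every initial mean sits where data exists, while sampling within the bounds can place a mean in an empty region of the feature space.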
Example 1
Python3
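import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs

# Generate a toy 2-D dataset; the sample count, number of centers, and seed are illustrative
X, y = make_blobs(n_samples=500, n_features=2, centers=3, random_state=23)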
Python3
fig = plt.figure(0)
plt.grid(True)
plt.scatter(X[:,0],X[:,1])
plt.show()
Output:
Clustering dataset
Python3
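k = 3
clusters = {}
np.random.seed(23)

# Give each cluster a random initial center and an empty list of points
for idx in range(k):
    center = 2*(2*np.random.random((X.shape[1],))-1)
    cluster = {'center': center, 'points': []}
    clusters[idx] = cluster

clusters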
Output:
Python3
plt.scatter(X[:,0],X[:,1])
plt.grid(True)
for i in clusters:
    center = clusters[i]['center']
    plt.scatter(center[0],center[1],marker = '*',c = 'red')
plt.show()
Output:
The plot displays a scatter plot of data points (X[:,0], X[:,1]) with grid lines. It also marks the
initial cluster centers (red stars) generated for K-means clustering.
Python3
def distance(p1,p2):
    return np.sqrt(np.sum((p1-p2)**2))
Python3
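#Implementing E step
def assign_clusters(X, clusters):
    for idx in range(X.shape[0]):
        dist = []
        curr_x = X[idx]
        for i in range(k):
            dis = distance(curr_x,clusters[i]['center'])
            dist.append(dis)
        curr_cluster = np.argmin(dist)
        clusters[curr_cluster]['points'].append(curr_x)
    return clusters

#Implementing M step
def update_clusters(X, clusters):
    for i in range(k):
        points = np.array(clusters[i]['points'])
        # Move the center only if the cluster received points
        if points.shape[0] > 0:
            clusters[i]['center'] = points.mean(axis=0)
        # Clear the points so the next E step starts fresh
        clusters[i]['points'] = []
    return clusters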
Step 7: Create the function to Predict the cluster for the datapoints
Python3
def pred_cluster(X, clusters):
    pred = []
    for i in range(X.shape[0]):
        dist = []
        for j in range(k):
            dist.append(distance(X[i],clusters[j]['center']))
        pred.append(np.argmin(dist))
    return pred
Python3
clusters = assign_clusters(X,clusters)
clusters = update_clusters(X,clusters)
pred = pred_cluster(X,clusters)
Python3
plt.scatter(X[:,0],X[:,1],c = pred)
for i in clusters:
    center = clusters[i]['center']
    plt.scatter(center[0],center[1],marker = '^',c = 'red')
plt.show()
Output:
K-means Clustering
The plot shows data points colored by their predicted clusters. The red markers represent the
updated cluster centers after the E-M steps in the K-means clustering algorithm.
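Example 1 applies the E and M steps once. In practice the two steps are usually repeated until the centers stop moving. The following is a minimal vectorized sketch of that loop; the function name `kmeans_until_converged`, the tolerance `tol`, the iteration cap `max_iter`, and the toy data are assumptions made for illustration.

```python
import numpy as np

def kmeans_until_converged(X, centers, max_iter=100, tol=1e-6):
    """Repeat the E and M steps until the centers move less than tol."""
    centers = np.asarray(centers, dtype=float).copy()
    for _ in range(max_iter):
        # E step: squared distance from every point to every center
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        # M step: mean of each cluster; keep the old center if a cluster is empty
        new_centers = centers.copy()
        for j in range(len(centers)):
            members = X[labels == j]
            if len(members) > 0:
                new_centers[j] = members.mean(axis=0)
        shift = np.linalg.norm(new_centers - centers)
        centers = new_centers
        if shift < tol:
            break
    return centers, labels

# Toy data: two well-separated groups of three points each
X_toy = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
                  [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]])
centers, labels = kmeans_until_converged(X_toy, centers=[[0.0, 0.0], [10.0, 10.0]])
```

On this toy data the loop stops after the second pass, because the centers computed in the first pass no longer change.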
Example 2
Python3
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import matplotlib.cm as cm
from sklearn.datasets import load_iris
from sklearn.cluster import KMeans
Python3
X, y = load_iris(return_X_y=True)
Elbow Method
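Choosing the number of clusters is a basic step in any clustering task, because k must be fixed
before k-means runs. One of the most common techniques for picking k is the elbow method: fit
the model for a range of k values, plot the sum of squared errors against k, and choose the k
at which the curve stops dropping sharply.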
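The plot in this section draws a list named `sse` with one value for each k from 1 to 10. One way to build such a list, shown here as a sketch rather than the article's exact code, is to fit scikit-learn's `KMeans` for each k and record its `inertia_` attribute (the within-cluster sum of squared distances). The `n_init=10` setting and the loop variable name are assumptions.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris

X, y = load_iris(return_X_y=True)

sse = []  # sum of squared errors (inertia) for each candidate k
for n_clusters in range(1, 11):
    km = KMeans(n_clusters=n_clusters, random_state=2, n_init=10)
    km.fit(X)
    sse.append(km.inertia_)
```

The error always tends to fall as k grows, so the useful signal is where the drop flattens out, not where the error is smallest.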
Python3
Python3
sns.set_style("whitegrid")
g=sns.lineplot(x=range(1,11), y=sse)
plt.show()
Output:
Elbow Method
In the graph above, the curve bends like an elbow at k=2 and k=3. We choose k=3.
Python3
kmeans = KMeans(n_clusters = 3, random_state = 2)
kmeans.fit(X)
Output:
KMeans(n_clusters=3, random_state=2)
Python3
kmeans.cluster_centers_
Output:
Python3
pred = kmeans.fit_predict(X)
pred
Output:
array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 2, 2, 2, 2, 1, 2, 2, 2,
2, 2, 2, 1, 1, 2, 2, 2, 2, 1, 2, 1, 2, 1, 2, 2, 1, 1, 2, 2, 2, 2,
2, 1, 2, 2, 2, 2, 1, 2, 2, 2, 1, 2, 2, 2, 1, 2, 2, 1], dtype=int32)
Python3
plt.figure(figsize=(12,5))
# Columns 0 and 1 of the iris data are sepal length and sepal width
plt.subplot(1,2,1)
plt.scatter(X[:,0],X[:,1],c = pred, cmap=cm.Accent)
plt.grid(True)
for center in kmeans.cluster_centers_:
    center = center[:2]
    plt.scatter(center[0],center[1],marker = '^',c = 'red')
plt.xlabel("sepal length (cm)")
plt.ylabel("sepal width (cm)")

# Columns 2 and 3 are petal length and petal width
plt.subplot(1,2,2)
plt.scatter(X[:,2],X[:,3],c = pred, cmap=cm.Accent)
plt.grid(True)
for center in kmeans.cluster_centers_:
    center = center[2:4]
    plt.scatter(center[0],center[1],marker = '^',c = 'red')
plt.xlabel("petal length (cm)")
plt.ylabel("petal width (cm)")
plt.show()
Output:
K-means clustering
The subplot on the left displays sepal length vs. sepal width with data points colored by
cluster, and red markers indicate the K-means cluster centers. The subplot on the right shows
petal length vs. petal width in the same way.
Conclusion
In conclusion, K-means clustering is a powerful unsupervised machine learning algorithm for
grouping unlabeled datasets. Its objective is to divide data into clusters, making similar data
points part of the same group. The algorithm initializes cluster centroids and iteratively
assigns data points to the nearest centroid, updating centroids based on the mean of points in
each cluster.
K-means is a partitioning method that divides a dataset into ‘k’ distinct, non-overlapping
subsets (clusters) based on similarity, aiming to minimize the variance within each cluster.
K-means works well with numerical data, where the concept of distance between data
points is meaningful. It’s commonly applied to continuous variables.
K-means is primarily used for clustering and grouping similar data points. It does not
predict labels for new data; it assigns them to existing clusters based on similarity.
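Assigning new data to existing clusters amounts to finding the nearest center for each new point. The sketch below shows the idea with NumPy; the function name `assign_to_nearest` and the center coordinates are hypothetical values chosen for the example. scikit-learn's `KMeans.predict` performs the same kind of nearest-center lookup on a fitted model.

```python
import numpy as np

def assign_to_nearest(points, centers):
    """Return, for each point, the index of the closest center."""
    points = np.atleast_2d(points)
    d2 = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return d2.argmin(axis=1)

# Hypothetical centers standing in for the result of an earlier fit
centers_demo = np.array([[0.0, 0.0], [5.0, 5.0], [10.0, 0.0]])
new_points = np.array([[1.0, 1.0], [9.0, 1.0], [4.0, 6.0]])
new_labels = assign_to_nearest(new_points, centers_demo)  # array([0, 2, 1])
```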
The objective is to partition data into ‘k’ clusters, minimizing the intra-cluster variance.
It seeks to form groups where data points within each cluster are more similar to each
other than to those in other clusters.
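The quantity being minimized can be computed directly as the sum of squared distances from each point to the center of its cluster. The following sketch uses a hypothetical helper `within_cluster_ss` and a four-point toy data set to compare a sensible partition with a poor one.

```python
import numpy as np

def within_cluster_ss(X, labels, centers):
    """Sum of squared distances from each point to its assigned center."""
    diffs = X - centers[labels]
    return float((diffs ** 2).sum())

X_demo = np.array([[0.0, 0.0], [2.0, 0.0], [10.0, 0.0], [12.0, 0.0]])

# Grouping neighbors together: every point is 1 unit from its center
good = within_cluster_ss(X_demo, np.array([0, 0, 1, 1]),
                         np.array([[1.0, 0.0], [11.0, 0.0]]))   # 4.0

# Grouping distant points together: every point is 5 units from its center
bad = within_cluster_ss(X_demo, np.array([0, 1, 0, 1]),
                        np.array([[5.0, 0.0], [7.0, 0.0]]))     # 100.0
```

The lower value for the first partition reflects that its clusters are tighter, which is what the algorithm looks for.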