K Means Clustering - Introduction


2/10/24, 1:31 AM K means Clustering - Introduction - GeeksforGeeks


K means Clustering – Introduction



K-Means Clustering is an unsupervised machine learning algorithm that groups an unlabeled
dataset into different clusters. This article explores the fundamentals and working of
k-means clustering, along with an implementation.

Table of Contents
What is K-means Clustering?
What is the objective of k-means clustering?
How k-means clustering works?
Implementation of K-Means Clustering in Python

What is K-means Clustering?


Unsupervised machine learning is the process of teaching a computer to work with unlabeled,
unclassified data, letting the algorithm operate on that data without supervision. Without any
prior training on labeled examples, the machine’s job is to organize unsorted data according to
similarities, patterns, and differences. K-means clustering is one such algorithm: it groups
an unlabeled dataset into k clusters.

What is the objective of k-means clustering?


The goal of clustering is to divide the population or set of data points into a number of groups
so that the data points within each group are more similar to one another than to the data
points in the other groups. It is essentially a grouping of items based on how similar or
different they are.
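
One common way to put a number on "similar within a group" is the within-cluster sum of squared distances, which is the quantity scikit-learn reports as `inertia_`. The sketch below is only an illustration on toy data; the function name `within_cluster_sse` and the arrays are assumptions made for this example, not part of the article's code.

```python
import numpy as np

def within_cluster_sse(X, labels, centers):
    """Sum of squared Euclidean distances from each point to its assigned center."""
    diffs = X - centers[labels]          # vector from each point to its own center
    return float(np.sum(diffs ** 2))

# Four 2-D points forming two obvious groups (toy data for illustration).
X_toy = np.array([[0.0, 0.0], [0.0, 2.0], [10.0, 0.0], [10.0, 2.0]])
centers_toy = np.array([[0.0, 1.0], [10.0, 1.0]])

good_labels = np.array([0, 0, 1, 1])   # each point assigned to the nearby center
bad_labels = np.array([0, 1, 0, 1])    # the two groups mixed across the centers

print(within_cluster_sse(X_toy, good_labels, centers_toy))  # 4.0
print(within_cluster_sse(X_toy, bad_labels, centers_toy))   # 204.0
```

The grouping that keeps nearby points together gives a much smaller value, which is the sense in which a good clustering is "tight".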


https://www.geeksforgeeks.org/k-means-clustering-introduction/ 1/19

How k-means clustering works?


We are given a data set of items, with certain features, and values for these features (like a
vector). The task is to categorize those items into groups. To achieve this, we will use the K-
means algorithm, an unsupervised learning algorithm. ‘K’ in the name of the algorithm
represents the number of groups/clusters we want to classify our items into.

(It will help if you think of items as points in an n-dimensional space). The algorithm will
categorize the items into k groups or clusters of similarity. To calculate that similarity, we will
use the Euclidean distance as a measurement.
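
As a small illustration of this similarity measure, the following sketch computes the Euclidean distance from one item to several candidate centroids and picks the closest one. The point and centroid values are made up for the example.

```python
import numpy as np

point = np.array([1.0, 2.0])
centroids = np.array([[0.0, 0.0], [1.0, 3.0], [5.0, 5.0]])

# Euclidean distance from the point to every centroid (one distance per row).
distances = np.sqrt(np.sum((centroids - point) ** 2, axis=1))
nearest = int(np.argmin(distances))

print(distances)  # distances to the three centroids
print(nearest)    # 1
```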

The algorithm works as follows:

1. First, we randomly initialize k points, called means or cluster centroids.


2. We categorize each item to its closest mean, and we update the mean’s coordinates, which
are the averages of the items categorized in that cluster so far.
3. We repeat the process for a given number of iterations and at the end, we have our
clusters.

The “points” mentioned above are called means because they are the mean values of the
items categorized in them. To initialize these means, we have a lot of options. An intuitive
method is to initialize the means at random items in the data set. Another method is to
initialize the means at random values between the boundaries of the data set (if for a feature
x, the items have values in [0,3], we will initialize the means with values for x at [0,3]).
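
The two initialization options described above could be sketched as follows. This is a minimal example on randomly generated data; the names `X_demo`, `k_demo`, `means_from_items` and `means_in_bounds` are assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy data: 50 items with two features, the first in [0, 3), the second in [-1, 1).
X_demo = rng.uniform(low=[0.0, -1.0], high=[3.0, 1.0], size=(50, 2))
k_demo = 3

# Option 1: pick k distinct items from the data set as the initial means.
idx = rng.choice(X_demo.shape[0], size=k_demo, replace=False)
means_from_items = X_demo[idx]

# Option 2: draw each coordinate uniformly between that feature's min and max.
lo, hi = X_demo.min(axis=0), X_demo.max(axis=0)
means_in_bounds = rng.uniform(lo, hi, size=(k_demo, X_demo.shape[1]))

print(means_from_items)
print(means_in_bounds)
```

Option 1 guarantees that every initial mean sits on an actual data point, while option 2 only guarantees that it lies inside the bounding box of the data.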

The above algorithm in pseudocode is as follows:

Initialize k means with random values

--> For a given number of iterations:

    --> Iterate through items:

        --> Find the mean closest to the item by calculating
            the euclidean distance of the item with each of the means

        --> Assign item to mean

        --> Update mean by shifting it to the average of the items in that cluster
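
A compact NumPy version of this loop might look like the sketch below. It makes two assumptions that should be stated plainly: the means are initialized at random items of the data set, and each mean is updated once per pass over all items (the batch variant) rather than after every single item. The function name `kmeans_sketch` and its parameters `n_iters` and `seed` are hypothetical.

```python
import numpy as np

def kmeans_sketch(X, k, n_iters=10, seed=0):
    """Batch k-means for a fixed number of iterations (illustrative sketch)."""
    rng = np.random.default_rng(seed)
    # Initialize the k means at randomly chosen items of the data set.
    means = X[rng.choice(X.shape[0], size=k, replace=False)].astype(float)
    for _ in range(n_iters):
        # Distance from every item to every mean, shape (n_items, k).
        dists = np.linalg.norm(X[:, None, :] - means[None, :, :], axis=2)
        labels = np.argmin(dists, axis=1)    # assign each item to its closest mean
        for j in range(k):
            members = X[labels == j]
            if len(members) > 0:             # keep the old mean if a cluster is empty
                means[j] = members.mean(axis=0)
    # labels come from the last assignment step, made before the final update
    return means, labels

# Two well-separated groups of three points each (toy data).
X_two = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
                  [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]])
means, labels = kmeans_sketch(X_two, k=2)
print(means)
print(labels)
```

Which group ends up with label 0 depends on the random initialization, so only the grouping itself is meaningful, not the label values.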

Implementation of K-Means Clustering in Python

Example 1

Import the necessary Libraries


We import NumPy for numerical computations, Matplotlib to plot the graphs, and
make_blobs from sklearn.datasets to generate a synthetic dataset.

Python3

import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs

Create the custom dataset with make_blobs and plot it

Python3

X,y = make_blobs(n_samples = 500,n_features = 2,centers = 3,random_state = 23)

fig = plt.figure(0)
plt.grid(True)
plt.scatter(X[:,0],X[:,1])
plt.show()

Output:


Clustering dataset

Initialize the random centroids


The code initializes three clusters for K-means clustering. It sets a random seed, draws each
coordinate of every cluster center uniformly at random from the range [-2, 2), and creates an
empty list of points for each cluster.

Python3

k = 3

clusters = {}
np.random.seed(23)

for idx in range(k):
    # Random center: each coordinate is drawn uniformly from [-2, 2)
    center = 2*(2*np.random.random((X.shape[1],))-1)
    cluster = {
        'center' : center,
        'points' : []
    }

    clusters[idx] = cluster

clusters


Output:

{0: {'center': array([0.06919154, 1.78785042]), 'points': []},
 1: {'center': array([ 1.06183904, -0.87041662]), 'points': []},
 2: {'center': array([-1.11581855, 0.74488834]), 'points': []}}

Plot the randomly initialized centers with the data points

Python3

plt.scatter(X[:,0],X[:,1])
plt.grid(True)
for i in clusters:
    center = clusters[i]['center']
    plt.scatter(center[0],center[1],marker = '*',c = 'red')
plt.show()

Output:

Data points with random center

The plot displays a scatter plot of data points (X[:,0], X[:,1]) with grid lines. It also marks the
initial cluster centers (red stars) generated for K-means clustering.

Define Euclidean distance



Python3

def distance(p1,p2):
    return np.sqrt(np.sum((p1-p2)**2))

Create the function to Assign and Update the cluster center


K-means alternates between two steps. The E-step (expectation) assigns each data point to its
nearest cluster center, and the M-step (maximization) moves each cluster center to the mean of
the points assigned to it.

Python3

# Implementing the E-step: assign every point to its nearest center
def assign_clusters(X, clusters):
    for idx in range(X.shape[0]):
        dist = []

        curr_x = X[idx]

        for i in range(k):
            dis = distance(curr_x,clusters[i]['center'])
            dist.append(dis)
        curr_cluster = np.argmin(dist)
        clusters[curr_cluster]['points'].append(curr_x)
    return clusters

# Implementing the M-step: move each center to the mean of its points
def update_clusters(X, clusters):
    for i in range(k):
        points = np.array(clusters[i]['points'])
        if points.shape[0] > 0:
            new_center = points.mean(axis =0)
            clusters[i]['center'] = new_center

            clusters[i]['points'] = []
    return clusters

Create the function to predict the cluster for the data points

Python3

def pred_cluster(X, clusters):
    pred = []
    for i in range(X.shape[0]):
        dist = []
        for j in range(k):
            dist.append(distance(X[i],clusters[j]['center']))
        pred.append(np.argmin(dist))
    return pred

Assign, Update, and predict the cluster center

Python3

clusters = assign_clusters(X,clusters)
clusters = update_clusters(X,clusters)
pred = pred_cluster(X,clusters)
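
The cell above runs the assign and update steps only once. In practice the two steps are usually repeated until the centers stop moving. The sketch below shows one possible stopping rule on plain NumPy arrays instead of the `clusters` dictionary; the function name `run_until_stable` and the parameters `tol` and `max_iters` are assumptions for illustration.

```python
import numpy as np

def run_until_stable(X, centers, tol=1e-6, max_iters=100):
    """Repeat the assign/update steps until the centers move less than tol."""
    centers = np.asarray(centers, dtype=float).copy()
    for n_iter in range(1, max_iters + 1):
        # Assignment step: index of the nearest center for every point.
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = np.argmin(dists, axis=1)
        # Update step: mean of the assigned points (empty clusters keep their center).
        new_centers = centers.copy()
        for j in range(len(centers)):
            if np.any(labels == j):
                new_centers[j] = X[labels == j].mean(axis=0)
        shift = np.linalg.norm(new_centers - centers)   # total movement of the centers
        centers = new_centers
        if shift < tol:
            break
    return centers, labels, n_iter

# One-dimensional toy data with a deliberately poor starting guess.
X_line = np.array([[0.0], [1.0], [9.0], [10.0]])
final_centers, final_labels, n_iter = run_until_stable(X_line, [[0.0], [1.0]])
print(final_centers.ravel())  # [0.5 9.5]
print(final_labels)           # [0 0 1 1]
print(n_iter)                 # 3
```

Here the centers settle after three passes: the third pass leaves them unchanged, which triggers the stopping rule.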

Plot the data points with their predicted cluster center

Python3

plt.scatter(X[:,0],X[:,1],c = pred)
for i in clusters:
    center = clusters[i]['center']
    plt.scatter(center[0],center[1],marker = '^',c = 'red')
plt.show()

Output:


K-means Clustering

The plot shows data points colored by their predicted clusters. The red markers represent the
updated cluster centers after one pass of the E and M steps of the K-means clustering algorithm.

Example 2

Import the necessary libraries

Python3

import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import matplotlib.cm as cm
from sklearn.datasets import load_iris
from sklearn.cluster import KMeans

Load the Dataset

Python3

X, y = load_iris(return_X_y=True)


Elbow Method
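Finding a suitable number of groups to divide the data into is a basic step in any clustering
algorithm. One of the most common techniques for choosing this value of k is the elbow method:
fit the model for a range of k values, plot the sum of squared errors against k, and look for
the point where the curve bends and further increases in k bring only small improvements.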

Python3

# Find the optimum number of clusters
sse = []  # sum of squared errors (inertia) for each k
for k in range(1,11):
    km = KMeans(n_clusters=k, random_state=2)
    km.fit(X)
    sse.append(km.inertia_)

Plot the Elbow graph to find the optimum number of clusters

Python3

sns.set_style("whitegrid")
g=sns.lineplot(x=range(1,11), y=sse)

g.set(xlabel ="Number of cluster (k)",
      ylabel = "Sum Squared Error",
      title ='Elbow Method')

plt.show()

Output:


Elbow Method

From the above graph, we can observe an elbow-like bend at k=2 and k=3. We therefore
choose k=3.
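
Reading the elbow off a plot is a judgment call. One rough heuristic is to pick the k where the curve bends most sharply, measured by the second difference of the SSE values. The helper below is a hypothetical sketch, not a scikit-learn function, and the SSE numbers are illustrative values rather than the Iris results.

```python
import numpy as np

def elbow_by_second_difference(ks, sse):
    """Return the k at which the SSE curve bends most sharply.

    Hypothetical helper: picks the interior k with the largest second
    difference, i.e. SSE(k-1) - 2*SSE(k) + SSE(k+1).
    """
    sse = np.asarray(sse, dtype=float)
    second_diff = np.diff(sse, n=2)           # one value per interior k
    return ks[int(np.argmax(second_diff)) + 1]

# Illustrative SSE values only; these are NOT the Iris results.
ks_demo = list(range(1, 7))
sse_demo = [100.0, 60.0, 20.0, 17.0, 15.0, 14.0]
print(elbow_by_second_difference(ks_demo, sse_demo))  # 3
```

A heuristic like this can disagree with visual inspection when the curve has no clear bend, so it is best treated as a cross-check rather than a replacement for the plot.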

Build the K means clustering model

Python3

kmeans = KMeans(n_clusters = 3, random_state = 2)
kmeans.fit(X)

Output:

KMeans(n_clusters=3, random_state=2)

Find the cluster centers

Python3

kmeans.cluster_centers_


Output:

array([[5.006     , 3.428     , 1.462     , 0.246     ],
       [5.9016129 , 2.7483871 , 4.39354839, 1.43387097],
       [6.85      , 3.07368421, 5.74210526, 2.07105263]])

Predict the cluster group:

Python3

pred = kmeans.fit_predict(X)
pred

Output:

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 2, 2, 2, 2, 1, 2, 2, 2,
2, 2, 2, 1, 1, 2, 2, 2, 2, 1, 2, 1, 2, 1, 2, 2, 1, 1, 2, 2, 2, 2,
2, 1, 2, 2, 2, 2, 1, 2, 2, 2, 1, 2, 2, 2, 1, 2, 2, 1], dtype=int32)

Plot the cluster center with data points

Python3

plt.figure(figsize=(12,5))
plt.subplot(1,2,1)
# Columns 0 and 1 of the Iris data are sepal length and sepal width
plt.scatter(X[:,0],X[:,1],c = pred, cmap=cm.Accent)
plt.grid(True)
for center in kmeans.cluster_centers_:
    center = center[:2]
    plt.scatter(center[0],center[1],marker = '^',c = 'red')
plt.xlabel("sepal length (cm)")
plt.ylabel("sepal width (cm)")

plt.subplot(1,2,2)
# Columns 2 and 3 of the Iris data are petal length and petal width
plt.scatter(X[:,2],X[:,3],c = pred, cmap=cm.Accent)
plt.grid(True)
for center in kmeans.cluster_centers_:
    center = center[2:4]
    plt.scatter(center[0],center[1],marker = '^',c = 'red')
plt.xlabel("petal length (cm)")
plt.ylabel("petal width (cm)")
plt.show()

Output:

K-means clustering

The subplot on the left displays sepal length vs. sepal width with data points colored by
clusters, and red markers indicate the K-means cluster centers. The subplot on the right shows
petal length vs. petal width in the same way.

Conclusion
In conclusion, K-means clustering is a powerful unsupervised machine learning algorithm for
grouping unlabeled datasets. Its objective is to divide data into clusters, making similar data
points part of the same group. The algorithm initializes cluster centroids and iteratively
assigns data points to the nearest centroid, updating centroids based on the mean of points in
each cluster.

Frequently Asked Questions (FAQs)

1. What is k-means clustering for data analysis?

K-means is a partitioning method that divides a dataset into ‘k’ distinct, non-overlapping
subsets (clusters) based on similarity, aiming to minimize the variance within each cluster.

2. What is an example of k-means in real life?

Customer segmentation in marketing, where k-means groups customers based on purchasing
behavior, allowing businesses to tailor marketing strategies for different segments.

3. What type of data is k-means clustering suited to?

K-means works well with numerical data, where the concept of distance between data
points is meaningful. It’s commonly applied to continuous variables.

4. Is K-means used for prediction?

K-means is primarily used for clustering and grouping similar data points. It does not
predict labels in the supervised sense; a new data point is simply assigned to the existing
cluster whose centroid is nearest.

5. What is the objective of k-means clustering?

The objective is to partition data into ‘k’ clusters, minimizing the intra-cluster variance.
It seeks to form groups where data points within each cluster are more similar to each
other than to those in other clusters.

Last Updated: 21 Dec, 2023
