21BEC505 Exp2


Machine Learning 21BEC505

Experiment-2
Objective: Application of the k-means algorithm for clustering unsupervised data
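For reference, k-means alternates two steps until the centroids stop moving: assign each point to its nearest centroid, then recompute each centroid as the mean of its assigned points. The sketch below is a minimal NumPy illustration of that loop (a teaching sketch, not sklearn's optimized KMeans; the helper name kmeans_sketch is ours):

import numpy as np

def kmeans_sketch(X, k, n_iters=100, seed=42):
    rng = np.random.default_rng(seed)
    # Initialize centroids by sampling k distinct data points
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # Assignment step: label each point with its nearest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: move each centroid to the mean of its points
        # (assumes no cluster goes empty during iteration)
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_centroids, centroids):
            break  # converged: centroids no longer move
        centroids = new_centroids
    return labels, centroids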
Task #1
Code:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

# Define the 2D dataset
data = np.array([[1.0, 1.0],
                 [1.5, 2.0],
                 [3.0, 4.0],
                 [5.0, 7.0],
                 [3.5, 5.0],
                 [4.5, 5.0],
                 [3.5, 4.5]])

# Define the range of k values to try
k_values = range(1, 5)

# Iterate over different k values and perform k-means clustering
for k in k_values:
    kmeans = KMeans(n_clusters=k, random_state=42).fit(data)
    labels = kmeans.labels_
    centroids = kmeans.cluster_centers_

    # Visualize the clustering result for this value of k
    plt.figure()
    plt.scatter(data[:, 0], data[:, 1], c=labels, cmap='plasma')
    plt.scatter(centroids[:, 0], centroids[:, 1], marker='*', s=200, color='black', label='Centroids')
    plt.title('k = {}'.format(k))
    plt.legend()
    plt.show()

Output:

Task #2
Code:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

# Load the Mall Customers dataset
dataset = pd.read_csv(r'E:\Jay\NIRMA\Sem6\ML\Exp2\Mall_Customers.csv')

# Select Annual Income (k$) and Spending Score (1-100) as features
X = dataset.iloc[:, [3, 4]].values

# Using the elbow method to find the optimal number of clusters
from sklearn.cluster import KMeans

wcss = []
for i in range(1, 11):
    kmeans = KMeans(n_clusters=i, init='k-means++', random_state=42)
    kmeans.fit(X)
    wcss.append(kmeans.inertia_)  # inertia_ is the WCSS for this value of k
plt.plot(range(1, 11), wcss)
plt.title('The Elbow Method')
plt.xlabel('Number of clusters')
plt.ylabel('WCSS')
plt.show()

# Fitting k-means to the dataset with the chosen number of clusters
kmeans = KMeans(n_clusters=5, init='k-means++', random_state=42)
y_kmeans = kmeans.fit_predict(X)

# Visualising the clusters
plt.scatter(X[y_kmeans == 0, 0], X[y_kmeans == 0, 1], s = 100, c = 'red', label ='Cluster 1')
plt.scatter(X[y_kmeans == 1, 0], X[y_kmeans == 1, 1], s = 100, c = 'blue', label = 'Cluster 2')
plt.scatter(X[y_kmeans == 2, 0], X[y_kmeans == 2, 1], s = 100, c = 'green', label = 'Cluster 3')
plt.scatter(X[y_kmeans == 3, 0], X[y_kmeans == 3, 1], s = 100, c = 'cyan', label = 'Cluster 4')
plt.scatter(X[y_kmeans == 4, 0], X[y_kmeans == 4, 1], s = 100, c = 'magenta', label = 'Cluster 5')
plt.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1], s = 300, c = 'yellow', label = 'Centroids')
plt.title('Clusters of customers')
plt.xlabel('Annual Income (k$)')
plt.ylabel('Spending Score (1-100)')
plt.legend()
plt.show()

Output:

Exercise:
1. Take different features from 'Mall_Customers.csv' and repeat Task 2, or perform clustering on any other similar dataset.
import pandas as pd
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt

# Load the Social Network Ads dataset and select the Age and EstimatedSalary columns
data = pd.read_csv("Social_Network_Ads.csv")
X = data.iloc[:, [2, 3]].values

# Using the elbow method to find the optimal number of clusters
wcss = []
for i in range(1, 11):
    kmeans = KMeans(n_clusters=i, init='k-means++', random_state=42)
    kmeans.fit(X)
    wcss.append(kmeans.inertia_)
plt.plot(range(1, 11), wcss)
plt.title('Elbow Method')
plt.xlabel('Number of clusters')
plt.ylabel('WCSS')
plt.show()

# Apply the k-means algorithm with k = 4 (chosen from the elbow plot)
kmeans = KMeans(n_clusters=4, init='k-means++', random_state=42)
y_kmeans = kmeans.fit_predict(X)

# Visualize the clusters
plt.scatter(X[y_kmeans == 0, 0], X[y_kmeans == 0, 1], s=100, c='red', label='Cluster 1')
plt.scatter(X[y_kmeans == 1, 0], X[y_kmeans == 1, 1], s=100, c='blue', label='Cluster 2')
plt.scatter(X[y_kmeans == 2, 0], X[y_kmeans == 2, 1], s=100, c='green', label='Cluster 3')
plt.scatter(X[y_kmeans == 3, 0], X[y_kmeans == 3, 1], s=100, c='orange', label='Cluster 4')
plt.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1], s=300, c='yellow', label='Centroids')
plt.title('Clusters of customers')
plt.xlabel('Age')
plt.ylabel('Estimated Salary')
plt.legend()
plt.show()

Output:

2. What is the performance parameter for clustering?


The performance parameter for clustering depends on the evaluation metric chosen for the specific clustering algorithm and problem. Several metrics can be used to evaluate the performance of a clustering algorithm, including:

- Inertia or Within-Cluster-Sum-of-Squares (WCSS): the sum of squared distances of all points to their respective cluster centers (see the formula after this list). This metric is commonly used with k-means clustering.
- Silhouette Score: measures how similar data points are to their own cluster compared to other clusters. It gives a score between -1 and 1, where a higher score indicates better-defined clusters.
- Calinski-Harabasz Index: the ratio of between-cluster variance to within-cluster variance. A higher index indicates better-defined clusters.
- Davies-Bouldin Index: the average similarity between each cluster and its most similar cluster. A lower index indicates better-defined clusters.
- Adjusted Rand Index (ARI): measures the similarity between the true and the predicted cluster assignments. It gives a score between -1 and 1, where 1 indicates a perfect match.
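For reference, the WCSS (inertia) that k-means minimizes can be written as follows, where $C_j$ is the $j$-th cluster and $\mu_j$ its centroid:

$$\mathrm{WCSS} = \sum_{j=1}^{k} \sum_{x_i \in C_j} \lVert x_i - \mu_j \rVert^2$$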
The choice of performance parameter depends on the problem and on the specific clustering algorithm used. Some algorithms, such as k-means, optimize the WCSS directly, while others may be better evaluated with different metrics; a short sketch of computing these metrics with scikit-learn follows.
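As a concrete illustration, the sketch below computes these metrics with scikit-learn on a synthetic dataset. It assumes ground-truth labels are available so the ARI can be shown; in a genuinely unsupervised setting only the internal metrics (WCSS, silhouette, Calinski-Harabasz, Davies-Bouldin) apply.

from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
from sklearn.metrics import (silhouette_score, calinski_harabasz_score,
                             davies_bouldin_score, adjusted_rand_score)

# Synthetic data with known labels, so ARI can be demonstrated as well
X, y_true = make_blobs(n_samples=500, centers=4, random_state=42)

kmeans = KMeans(n_clusters=4, init='k-means++', random_state=42)
y_pred = kmeans.fit_predict(X)

print('WCSS (inertia):     ', kmeans.inertia_)
print('Silhouette Score:   ', silhouette_score(X, y_pred))
print('Calinski-Harabasz:  ', calinski_harabasz_score(X, y_pred))
print('Davies-Bouldin:     ', davies_bouldin_score(X, y_pred))
print('Adjusted Rand Index:', adjusted_rand_score(y_true, y_pred))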

3. Create your own dataset using the make_blobs function from the sklearn.datasets module, specifying the number of clusters (parameter 'centers') as 7. Use the elbow method to verify the number of clusters and visualize the result.
Code:
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt

# Generate dataset
X, y = make_blobs(n_samples=1000, centers=7, random_state=42)

# Find the optimal number of clusters using the elbow method
inertias = []
k_range = range(1, 11)
for k in k_range:
    kmeans = KMeans(n_clusters=k, random_state=42)
    kmeans.fit(X)
    inertias.append(kmeans.inertia_)

# Plot the elbow curve
plt.plot(k_range, inertias)
plt.xlabel('Number of clusters')
plt.ylabel('Inertia')
plt.title('Elbow Method')
plt.show()

# Fit k-means with the number of clusters verified by the elbow method
kmeans = KMeans(n_clusters=7, random_state=42)
labels = kmeans.fit_predict(X)

# Plot the cluster graph
plt.scatter(X[:, 0], X[:, 1], c=labels)
plt.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1], marker='*', color='r', s=100)
plt.title('Cluster Graph')
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.show()

Output:

Conclusion:
In this experiment we studied clustering, an unsupervised machine learning method for grouping similar data points together. We focused on the k-means algorithm, with hierarchical clustering and DBSCAN among the related algorithms we looked into. We also examined approaches for choosing the number of clusters, in particular the elbow method based on the Within-Cluster-Sum-of-Squares (WCSS), and discussed how crucial it is to choose the right number of clusters for a given problem. We then applied these ideas to several datasets: a small 2D example, mall customers, social network ads, and a synthetic dataset generated with make_blobs. For each dataset we pre-processed the data and applied clustering to group similar points, evaluating performance with metrics such as inertia and the silhouette score. In general, clustering is an effective method for uncovering hidden structures and patterns in data. The choice of clustering algorithm and performance metric should depend on the problem at hand and on the data; understanding the strengths and weaknesses of the various algorithms and evaluation metrics lets us make well-informed choices when applying clustering to real-world problems.
