21BEC505 Exp2


Machine Learning 21BEC505

Experiment-2
Objective: Application of the k-means algorithm for clustering unsupervised data
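For reference, k-means alternates two steps until the centroids stop moving: assign each point to its nearest centroid, then recompute each centroid as the mean of its assigned points. The sketch below is a minimal NumPy illustration of that loop (a teaching sketch, not sklearn's optimized KMeans; the helper name kmeans_sketch is ours):

import numpy as np

def kmeans_sketch(X, k, n_iters=100, seed=42):
    rng = np.random.default_rng(seed)
    # Initialize centroids by sampling k distinct data points
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # Assignment step: label each point with its nearest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: move each centroid to the mean of its points
        # (assumes no cluster goes empty during iteration)
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_centroids, centroids):
            break  # converged: centroids no longer move
        centroids = new_centroids
    return labels, centroids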
Task #1
Code:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

# Define the 2D dataset
data = np.array([[1.0, 1.0],
                 [1.5, 2.0],
                 [3.0, 4.0],
                 [5.0, 7.0],
                 [3.5, 5.0],
                 [4.5, 5.0],
                 [3.5, 4.5]])

# Define the range of k values to try
k_values = range(1, 5)

# Iterate over different k values and perform k-means clustering
for k in k_values:
    kmeans = KMeans(n_clusters=k, random_state=42).fit(data)
    labels = kmeans.labels_
    centroids = kmeans.cluster_centers_

    # Visualize the clustering result for this value of k
    plt.figure()
    plt.scatter(data[:, 0], data[:, 1], c=labels, cmap='plasma')
    plt.scatter(centroids[:, 0], centroids[:, 1], marker='*', s=200, color='black', label='Centroids')
    plt.title('k = {}'.format(k))
    plt.legend()
    plt.show()

Output:

Task #2
Code:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

# Load the Mall Customers dataset
dataset = pd.read_csv(r'E:\Jay\NIRMA\Sem6\ML\Exp2\Mall_Customers.csv')

# Select Annual Income (k$) and Spending Score (1-100) as features
X = dataset.iloc[:, [3, 4]].values

# Using the elbow method to find the optimal number of clusters
from sklearn.cluster import KMeans

wcss = []
for i in range(1, 11):
    kmeans = KMeans(n_clusters=i, init='k-means++', random_state=42)
    kmeans.fit(X)
    wcss.append(kmeans.inertia_)  # inertia_ is the WCSS for this value of k
plt.plot(range(1, 11), wcss)
plt.title('The Elbow Method')
plt.xlabel('Number of clusters')
plt.ylabel('WCSS')
plt.show()

# Fitting k-means to the dataset with the chosen number of clusters
kmeans = KMeans(n_clusters=5, init='k-means++', random_state=42)
y_kmeans = kmeans.fit_predict(X)

# Visualising the clusters
plt.scatter(X[y_kmeans == 0, 0], X[y_kmeans == 0, 1], s = 100, c = 'red', label ='Cluster 1')
plt.scatter(X[y_kmeans == 1, 0], X[y_kmeans == 1, 1], s = 100, c = 'blue', label = 'Cluster 2')
plt.scatter(X[y_kmeans == 2, 0], X[y_kmeans == 2, 1], s = 100, c = 'green', label = 'Cluster 3')
plt.scatter(X[y_kmeans == 3, 0], X[y_kmeans == 3, 1], s = 100, c = 'cyan', label = 'Cluster 4')
plt.scatter(X[y_kmeans == 4, 0], X[y_kmeans == 4, 1], s = 100, c = 'magenta', label = 'Cluster 5')
plt.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1], s = 300, c = 'yellow', label = 'Centroids')
plt.title('Clusters of customers')
plt.xlabel('Annual Income (k$)')
plt.ylabel('Spending Score (1-100)')
plt.legend()
plt.show()

Output:

Exercise:
1. Take different features from 'Mall_Customers.csv' and repeat Task 2, or perform clustering on any other similar dataset.
import pandas as pd
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt

# Load the Social Network Ads dataset and select the Age and EstimatedSalary columns
data = pd.read_csv("Social_Network_Ads.csv")
X = data.iloc[:, [2, 3]].values

# Using the elbow method to find the optimal number of clusters
wcss = []
for i in range(1, 11):
    kmeans = KMeans(n_clusters=i, init='k-means++', random_state=42)
    kmeans.fit(X)
    wcss.append(kmeans.inertia_)
plt.plot(range(1, 11), wcss)
plt.title('Elbow Method')
plt.xlabel('Number of clusters')
plt.ylabel('WCSS')
plt.show()

# Apply the k-means algorithm with k = 4 (chosen from the elbow plot)
kmeans = KMeans(n_clusters=4, init='k-means++', random_state=42)
y_kmeans = kmeans.fit_predict(X)

# Visualize the clusters
plt.scatter(X[y_kmeans == 0, 0], X[y_kmeans == 0, 1], s=100, c='red', label='Cluster 1')
plt.scatter(X[y_kmeans == 1, 0], X[y_kmeans == 1, 1], s=100, c='blue', label='Cluster 2')
plt.scatter(X[y_kmeans == 2, 0], X[y_kmeans == 2, 1], s=100, c='green', label='Cluster 3')
plt.scatter(X[y_kmeans == 3, 0], X[y_kmeans == 3, 1], s=100, c='orange', label='Cluster 4')
plt.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1], s=300, c='yellow', label='Centroids')
plt.title('Clusters of customers')
plt.xlabel('Age')
plt.ylabel('Estimated Salary')
plt.legend()
plt.show()

Output:

2. What is the performance parameter for clustering?


The performance parameter for clustering depends on the evaluation metric chosen for the specific clustering algorithm and problem. Several metrics can be used to evaluate the performance of a clustering algorithm, including:

- Inertia or Within-Cluster-Sum-of-Squares (WCSS): the sum of squared distances of all points to their respective cluster centers (see the formula after this list). This metric is commonly used with k-means clustering.
- Silhouette Score: measures how similar data points are to their own cluster compared to other clusters. It gives a score between -1 and 1, where a higher score indicates better-defined clusters.
- Calinski-Harabasz Index: the ratio of between-cluster variance to within-cluster variance. A higher index indicates better-defined clusters.
- Davies-Bouldin Index: the average similarity between each cluster and its most similar cluster. A lower index indicates better-defined clusters.
- Adjusted Rand Index (ARI): measures the similarity between the true and the predicted cluster assignments. It gives a score between -1 and 1, where 1 indicates a perfect match.
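For reference, the WCSS (inertia) that k-means minimizes can be written as follows, where $C_j$ is the $j$-th cluster and $\mu_j$ its centroid:

$$\mathrm{WCSS} = \sum_{j=1}^{k} \sum_{x_i \in C_j} \lVert x_i - \mu_j \rVert^2$$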
The choice of performance parameter depends on the problem and on the specific clustering algorithm used. Some algorithms, such as k-means, optimize the WCSS directly, while others may be better evaluated with different metrics; a short sketch of computing these metrics with scikit-learn follows.
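As a concrete illustration, the sketch below computes these metrics with scikit-learn on a synthetic dataset. It assumes ground-truth labels are available so the ARI can be shown; in a genuinely unsupervised setting only the internal metrics (WCSS, silhouette, Calinski-Harabasz, Davies-Bouldin) apply.

from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
from sklearn.metrics import (silhouette_score, calinski_harabasz_score,
                             davies_bouldin_score, adjusted_rand_score)

# Synthetic data with known labels, so ARI can be demonstrated as well
X, y_true = make_blobs(n_samples=500, centers=4, random_state=42)

kmeans = KMeans(n_clusters=4, init='k-means++', random_state=42)
y_pred = kmeans.fit_predict(X)

print('WCSS (inertia):     ', kmeans.inertia_)
print('Silhouette Score:   ', silhouette_score(X, y_pred))
print('Calinski-Harabasz:  ', calinski_harabasz_score(X, y_pred))
print('Davies-Bouldin:     ', davies_bouldin_score(X, y_pred))
print('Adjusted Rand Index:', adjusted_rand_score(y_true, y_pred))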

3. Create your own dataset using the make_blobs function from the sklearn.datasets module, specifying the number of clusters (parameter 'centers') as 7. Use the elbow method to verify the number of clusters and visualize the result.
Code:
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt

# Generate dataset
X, y = make_blobs(n_samples=1000, centers=7, random_state=42)

# Find the optimal number of clusters using the elbow method
inertias = []
k_range = range(1, 11)
for k in k_range:
    kmeans = KMeans(n_clusters=k, random_state=42)
    kmeans.fit(X)
    inertias.append(kmeans.inertia_)

# Plot the elbow curve
plt.plot(k_range, inertias)
plt.xlabel('Number of clusters')
plt.ylabel('Inertia')
plt.title('Elbow Method')
plt.show()

# Fit k-means with the number of clusters verified by the elbow method
kmeans = KMeans(n_clusters=7, random_state=42)
labels = kmeans.fit_predict(X)

# Plot the cluster graph
plt.scatter(X[:, 0], X[:, 1], c=labels)
plt.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1], marker='*', color='r', s=100)
plt.title('Cluster Graph')
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.show()

Output:

Conclusion:
In this experiment we studied clustering, an unsupervised machine learning method for grouping similar data points together. We focused on the k-means algorithm, with hierarchical clustering and DBSCAN among the related algorithms we looked into. We also examined approaches for choosing the number of clusters, in particular the elbow method based on the Within-Cluster-Sum-of-Squares (WCSS), and discussed how crucial it is to choose the right number of clusters for a given problem. We then applied these ideas to several datasets: a small 2D example, mall customers, social network ads, and a synthetic dataset generated with make_blobs. For each dataset we pre-processed the data and applied clustering to group similar points, evaluating performance with metrics such as inertia and the silhouette score. In general, clustering is an effective method for uncovering hidden structures and patterns in data. The choice of clustering algorithm and performance metric should depend on the problem at hand and on the data; understanding the strengths and weaknesses of the various algorithms and evaluation metrics lets us make well-informed choices when applying clustering to real-world problems.
