21BEC505 Exp2
Experiment-2
Objective: Application of the k-means algorithm for clustering of unsupervised data
Task #1
Code:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
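The listing above ends at the imports. A minimal sketch of a basic k-means run follows, assuming the task was to cluster a small 2-D array and plot the result; the data values and the choice of k = 2 here are illustrative only:

import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

# Illustrative 2-D points (assumed; the task's actual data is not shown)
X = np.array([[1, 2], [1.5, 1.8], [5, 8], [8, 8], [1, 0.6], [9, 11]])

# Fit k-means with an assumed k = 2 and plot points coloured by cluster
kmeans = KMeans(n_clusters=2, random_state=42)
labels = kmeans.fit_predict(X)
plt.scatter(X[:, 0], X[:, 1], c=labels)
plt.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1],
            marker='x', s=200, c='red')
plt.show()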
Output:
Task #2
Code:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.cluster import KMeans

# Load the dataset (Mall_Customers.csv and columns 3-4, Annual Income and
# Spending Score, are assumed here from the exercise text below)
dataset = pd.read_csv('Mall_Customers.csv')
X = dataset.iloc[:, [3, 4]].values

# Compute WCSS for k = 1..10 (elbow method)
wcss = []
for i in range(1, 11):
    kmeans = KMeans(n_clusters=i, init='k-means++', random_state=42)
    kmeans.fit(X)
    wcss.append(kmeans.inertia_)
plt.plot(range(1, 11), wcss)
plt.title('The Elbow Method')
plt.xlabel('Number of clusters')
plt.ylabel('WCSS')
plt.show()
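For this dataset the WCSS curve typically flattens around k = 5; assuming that choice, the final model can be fitted and visualized as follows, continuing from the variables above:

# Fit the final model with the k suggested by the elbow (k = 5 assumed)
kmeans = KMeans(n_clusters=5, init='k-means++', random_state=42)
y_kmeans = kmeans.fit_predict(X)
for i in range(5):
    plt.scatter(X[y_kmeans == i, 0], X[y_kmeans == i, 1], label=f'Cluster {i + 1}')
plt.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1],
            s=200, c='black', label='Centroids')
plt.legend()
plt.show()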
Output:
Exercise:
1. Take different features from 'Mall_Customers.csv' and repeat Task 2, OR perform
clustering on any other such dataset.
import pandas as pd
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt

data = pd.read_csv("Social_Network_Ads.csv")
# X = data[['Age', 'EstimatedSalary']]
X = data.iloc[:, [2, 3]].values

# Fit k-means (k = 5 is assumed here, as suggested by an elbow plot)
kmeans = KMeans(n_clusters=5, init='k-means++', random_state=42)
y_kmeans = kmeans.fit_predict(X)

# Plot each cluster and the centroids
for i in range(5):
    plt.scatter(X[y_kmeans == i, 0], X[y_kmeans == i, 1], label=f'Cluster {i + 1}')
plt.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1],
            s=200, c='black', label='Centroids')
plt.legend()
plt.show()
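One caveat: Age and EstimatedSalary sit on very different numeric scales, and k-means is distance-based, so standardizing the features first usually yields more meaningful clusters. A minimal sketch using scikit-learn's StandardScaler, continuing from the listing above:

from sklearn.preprocessing import StandardScaler

# Standardize features so Age and EstimatedSalary contribute equally
X_scaled = StandardScaler().fit_transform(X)
kmeans_scaled = KMeans(n_clusters=5, init='k-means++', random_state=42)
labels_scaled = kmeans_scaled.fit_predict(X_scaled)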
Output:
2. Performance parameters used to evaluate clustering:
Inertia or Within-Cluster-Sum-of-Squares (WCSS): Measures the sum of squared distances of all points to
their respective cluster centers. This metric is commonly used with k-means clustering.
Silhouette Score: Measures the similarity of data points within their own cluster compared to other clusters.
This metric provides a score between -1 and 1, where a higher score indicates better-defined clusters.
Calinski-Harabasz Index: Measures the ratio of between-cluster variance to within-cluster variance. A
higher index indicates better-defined clusters.
Davies-Bouldin Index: Measures the average similarity between each cluster and its most similar cluster. A
lower index indicates better-defined clusters.
Adjusted Rand Index (ARI): Measures the similarity between the true and predicted cluster assignments.
This metric provides a score between -1 and 1, where a score of 1 indicates a perfect match between true and
predicted cluster assignments.
The choice of performance parameter depends on the problem and the specific clustering algorithm used. Some
clustering algorithms, such as k-means, optimize the WCSS metric directly, while others may use different evaluation
metrics to optimize their objective function.
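All of these metrics are available in scikit-learn's sklearn.metrics module. The sketch below assumes a k-means fit on make_blobs data (as in exercise 3), so that ground-truth labels are available for the ARI:

from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import (silhouette_score, calinski_harabasz_score,
                             davies_bouldin_score, adjusted_rand_score)

X, y = make_blobs(n_samples=1000, centers=7, random_state=42)
kmeans = KMeans(n_clusters=7, random_state=42).fit(X)
labels = kmeans.labels_

print("WCSS (inertia):", kmeans.inertia_)
print("Silhouette score:", silhouette_score(X, labels))
print("Calinski-Harabasz index:", calinski_harabasz_score(X, labels))
print("Davies-Bouldin index:", davies_bouldin_score(X, labels))
# ARI needs ground-truth labels, which make_blobs provides as y
print("Adjusted Rand index:", adjusted_rand_score(y, labels))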
3. Create your own dataset using the make_blobs function from the sklearn.datasets
module, specifying the number of clusters (using the parameter 'centers') as 7. Use the elbow
method to verify the clusters and visualize them.
Code:
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt
# Generate dataset
X, y = make_blobs(n_samples=1000, centers=7, random_state=42)
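The remaining steps, continuing from the variables above: an elbow plot over k = 1..10 to verify that seven clusters is a reasonable choice, followed by a scatter plot of the fitted clusters.

# Elbow method: WCSS for k = 1..10
wcss = []
for i in range(1, 11):
    km = KMeans(n_clusters=i, random_state=42)
    km.fit(X)
    wcss.append(km.inertia_)
plt.plot(range(1, 11), wcss)
plt.title('The Elbow Method')
plt.xlabel('Number of clusters')
plt.ylabel('WCSS')
plt.show()

# Visualize the 7 clusters found by k-means
labels = KMeans(n_clusters=7, random_state=42).fit_predict(X)
plt.scatter(X[:, 0], X[:, 1], c=labels, s=10)
plt.show()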
Output:
Conclusion:
In this experiment, we studied clustering, an unsupervised machine-learning technique for grouping similar data
points together. We discussed several clustering algorithms, including k-means, hierarchical clustering, and DBSCAN,
and examined how to choose the right number of clusters for a given problem using the elbow method and the
Within-Cluster-Sum-of-Squares (WCSS) metric. We then applied these ideas to three datasets: mall customers,
social-network ads, and a synthetic dataset generated with make_blobs. For each dataset, we pre-processed the data,
applied k-means to group similar data points, and evaluated the resulting clusters with metrics such as inertia and
silhouette score. Overall, clustering is an effective method for uncovering hidden structures and patterns in data. The
choice of clustering algorithm and performance metric should depend on the problem at hand and the data;
understanding the strengths and weaknesses of the various algorithms and evaluation metrics lets us make
well-informed choices when applying clustering to real-world problems.