Cluster Analysis Clustering
Clustering:
The process of grouping a set of physical or abstract objects into classes of similar objects is
called clustering.
Cluster:
A cluster is a collection of data objects that are similar to one another within the same cluster and
are dissimilar to the objects in other clusters.
Cluster Analysis:
Cluster analysis is an important human activity. Automated clustering is used to identify dense
and sparse regions in object space and, therefore, to discover overall distribution patterns and
interesting correlations among data attributes.
It has been widely used in numerous applications, including market research, pattern
recognition, data analysis and image processing.
Typical requirements of clustering in data mining include:
Scalability
High dimensionality
Constraint-based clustering
The major clustering methods can be classified into the following categories.
Partitioning methods
Hierarchical methods
Density-based methods
Grid-based methods
Model-based methods
Partitioning methods:
Given a database of n objects or data tuples, a partitioning method constructs k partitions of the
data, where each partition represents a cluster and k<=n.
Hierarchical methods:
A hierarchical method can be classified as either agglomerative (bottom-up merging) or divisive
(top-down splitting), based on how the hierarchical decomposition is formed.
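The agglomerative direction can be illustrated with a minimal sketch. It assumes one-dimensional numeric objects and a single-link merge rule (the distance between two clusters is the distance between their closest members); neither assumption comes from these notes, and the function name `agglomerative` and its parameter `target_clusters` are hypothetical.

```python
def agglomerative(points, target_clusters):
    """Bottom-up merging sketch: start with one cluster per object and
    repeatedly merge the two closest clusters (single link, an assumption)."""
    clusters = [[p] for p in points]
    while len(clusters) > target_clusters:
        best = None  # (distance, index_a, index_b) of the closest pair so far
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = min(abs(x - y) for x in clusters[a] for y in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]  # merge cluster b into cluster a
        del clusters[b]
    return clusters


result = agglomerative([1.0, 2.0, 10.0, 11.0, 30.0], target_clusters=2)
```

A divisive method would run in the opposite direction, starting from one cluster that holds all objects and splitting it step by step.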
Density-based methods:
The general idea is to continue growing the given cluster as long as the density (number of
objects or data points) in the “neighborhood” exceeds some threshold.
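The growing step described above might look like the following sketch. The names `eps` (neighborhood radius) and `min_pts` (density threshold) are assumptions for illustration, and the neighborhood here counts the object itself.

```python
import math


def neighborhood(data, i, eps):
    """Indices of all objects within distance eps of object i (including i)."""
    return [j for j, q in enumerate(data) if math.dist(data[i], q) <= eps]


def grow_cluster(data, seed_index, eps, min_pts):
    """Grow one cluster from a seed object while the neighborhood stays dense."""
    cluster = {seed_index}
    frontier = [seed_index]
    while frontier:
        i = frontier.pop()
        near = neighborhood(data, i, eps)
        if len(near) >= min_pts:  # dense enough: absorb the neighbors
            for j in near:
                if j not in cluster:
                    cluster.add(j)
                    frontier.append(j)
    return cluster


data = [(0.0,), (1.0,), (2.0,), (3.0,), (50.0,), (100.0,), (101.0,)]
dense_group = grow_cluster(data, seed_index=0, eps=1.5, min_pts=2)
isolated = grow_cluster(data, seed_index=4, eps=1.5, min_pts=2)
```

In this sketch an object whose neighborhood is too sparse, such as the one at 50.0, stays alone and would typically be treated as noise.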
Grid-based methods:
These methods quantize the object space into a finite number of cells that form a grid structure.
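As a rough illustration of the quantization step, the sketch below maps each object to the grid cell that contains it and counts the objects per cell. The uniform `cell_size` and the function names are assumptions, not part of these notes.

```python
import math
from collections import Counter


def cell_of(point, cell_size):
    """Grid cell (one integer index per dimension) that contains the point."""
    return tuple(math.floor(x / cell_size) for x in point)


def grid_counts(data, cell_size):
    """Number of objects that fall into each occupied cell."""
    return Counter(cell_of(p, cell_size) for p in data)


counts = grid_counts([(0.2, 0.3), (0.8, 0.9), (1.5, 0.1), (-0.5, 2.2)], cell_size=1.0)
```

Later clustering operations can then work on the per-cell counts instead of on the individual objects.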
Model-based methods:
These methods hypothesize a model for each of the clusters and find the best fit of the data to
the given model.
Two further classes of clustering tasks are clustering high-dimensional data and
constraint-based clustering.
Given D, a set of n objects, and k, the number of clusters to form, a partitioning algorithm
organizes the objects into k partitions (k<=n), where each partition represents a cluster.
1. k-means
2. k-medoids
The k-means algorithm takes the input parameter, k, and partitions a set of n
objects into k clusters so that the resulting intracluster similarity is high but the
intercluster similarity is low.
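Algorithm: k-means.
Input:
k, the number of clusters; D, a data set containing n objects.
Method:
Arbitrarily choose k objects from D as the initial cluster centers;
Repeat
(re)assign each object to the cluster to which the object is the most similar,
based on the mean value of the objects in the cluster;
Update the cluster means, i.e., calculate the mean value of the objects for
each cluster;
Until no change;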
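The repeat loop above can be sketched in plain Python as follows. This is a minimal illustration, not a tuned implementation: it assumes objects are numeric tuples, measures similarity by squared Euclidean distance, picks the initial centers at random, and caps the loop with a hypothetical `max_iter` parameter.

```python
import random


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean_point(points):
    """Component-wise mean of a non-empty list of tuples."""
    return tuple(sum(coords) / len(points) for coords in zip(*points))


def k_means(data, k, seed=0, max_iter=100):
    rng = random.Random(seed)
    centers = rng.sample(data, k)  # assumption: random initial centers
    assignment = None
    for _ in range(max_iter):
        # (Re)assign each object to the cluster with the nearest mean.
        new_assignment = [
            min(range(k), key=lambda c: squared_distance(p, centers[c]))
            for p in data
        ]
        if new_assignment == assignment:  # "until no change"
            break
        assignment = new_assignment
        # Update the cluster means; an empty cluster keeps its old center.
        for c in range(k):
            members = [p for p, a in zip(data, assignment) if a == c]
            if members:
                centers[c] = mean_point(members)
    return assignment, centers


data = [(0.0,), (1.0,), (2.0,), (100.0,), (101.0,), (102.0,)]
labels, centers = k_means(data, k=2)
```

On this small, well-separated example the two groups are recovered regardless of which objects are drawn as initial centers; on real data the result can depend on the initialization.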
Strength:
This method is relatively scalable and efficient in processing large data sets.
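Weakness:
This method is sensitive to noise and outliers, because a few extreme values can
substantially distort the mean of a cluster.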
PAM (Partitioning Around Medoids) is one of the k-medoids algorithms. It starts from an
initial set of medoids and iteratively replaces one of the medoids by one of the
non-medoids if doing so improves the total distance of the resulting clustering.
It works effectively for small data sets, but does not scale well for large
data sets.
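Algorithm: k-medoids. PAM, a k-medoids algorithm for partitioning based on medoid or central
objects.
Input:
k, the number of clusters; D, a data set containing n objects.
Method:
1. Arbitrarily choose k objects in D as the initial representative objects;
2. Repeat
3. Assign each remaining object to the cluster with the nearest representative object;
4. Randomly select a non-representative object, O_random;
5. Compute the total cost, S, of swapping representative object O_j with O_random;
6. If S<0 then swap O_j with O_random to form the new set of k representative objects;
7. Until no change.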
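The swap test in step 6 can be sketched as below. This is a simplified variant under stated assumptions: instead of drawing a random candidate, it tries every (representative, non-representative) pair so that "until no change" is well defined, it uses the first k objects as initial medoids, and it measures cost with Manhattan distance. The function names are hypothetical.

```python
def manhattan(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))


def total_cost(data, medoids, dist):
    """Sum of distances from every object to its nearest medoid (by index)."""
    return sum(min(dist(p, data[m]) for m in medoids) for p in data)


def pam(data, k, dist=manhattan):
    medoids = list(range(k))  # assumption: first k objects as initial medoids
    current = total_cost(data, medoids, dist)
    improved = True
    while improved:  # repeat ... until no change
        improved = False
        for j in range(k):
            for h in range(len(data)):
                if h in medoids:
                    continue
                candidate = medoids[:j] + [h] + medoids[j + 1:]
                new_cost = total_cost(data, candidate, dist)
                s = new_cost - current  # swap cost S
                if s < 0:  # the swap lowers the total distance: keep it
                    medoids, current = candidate, new_cost
                    improved = True
    labels = [min(range(k), key=lambda c: dist(p, data[medoids[c]])) for p in data]
    return medoids, labels


data = [(0.0,), (1.0,), (2.0,), (100.0,), (101.0,), (102.0,)]
medoids, labels = pam(data, k=2)
```

Each full pass recomputes the total cost for every candidate swap, which hints at why this approach becomes expensive as the number of objects grows.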
PAM is more robust than k-means in the presence of noise and outliers
because a medoid is less influenced by outliers or other extreme values than a
mean.
PAM works efficiently for small data sets but does not scale well for large
data sets.
Sampling-based methods are used to deal with larger data sets.
Instead of taking the whole set of data into consideration, a small portion
of the actual data is chosen as the representative of the data.
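A minimal sketch of the sampling idea, under stated assumptions: objects are one-dimensional numbers, the medoids are chosen from a random sample by brute force over all k-subsets (a stand-in for running PAM on the sample), and the chosen medoids are then scored against the whole data set. The names `sample_medoids` and `sample_size` are hypothetical.

```python
import random
from itertools import combinations


def nearest_cost(points, medoids):
    """Sum of distances from every point to its nearest medoid."""
    return sum(min(abs(p - m) for m in medoids) for p in points)


def sample_medoids(data, k, sample_size, seed=0):
    rng = random.Random(seed)
    sample = rng.sample(data, sample_size)  # small portion of the data
    # Choose the k medoids that fit the sample best (brute force, small samples only).
    best = min(combinations(sample, k), key=lambda ms: nearest_cost(sample, ms))
    # Score the chosen medoids against the whole data set.
    return sorted(best), nearest_cost(data, best)


data = [float(v) for v in range(10)] + [float(v) for v in range(1000, 1010)]
medoids, cost_on_all = sample_medoids(data, k=2, sample_size=12)
```

The quality of the result depends on how well the sample represents the data: if the sample misses a cluster entirely, no medoid can be placed in it.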
Weakness: