
Cluster Analysis

Clustering:

The process of grouping a set of physical or abstract objects into classes of similar objects is
called clustering.

Cluster:

A cluster is a collection of data objects that are similar to one another within the same cluster and
are dissimilar to the objects in other clusters.

Cluster Analysis:

 Clustering is an important human activity. Automated clustering can identify dense and sparse
regions in object space and thereby discover overall distribution patterns and interesting
correlations among data attributes.

 It has been widely used in numerous applications, including market research, pattern
recognition, data analysis and image processing.

Requirements of Clustering in data mining:

 Scalability

 Ability to deal with different types of attributes

 Discovery of clusters with arbitrary shape

 Minimal requirements for domain knowledge to determine input parameters

 Ability to deal with noisy data

 Incremental clustering and insensitivity to the order of input records

 Ability to handle high-dimensional data

 Constraint-based clustering

 Interpretability and usability

The major clustering methods can be classified into the following categories.

 Partitioning methods

 Hierarchical methods

 Density-based methods

 Grid-based methods

 Model-based methods

Partitioning methods:

Given a database of n objects or data tuples, a partitioning method constructs k partitions of the
data, where each partition represents a cluster and k<=n.

Hierarchical methods:

 It creates a hierarchical decomposition of the given set of data objects.

 It can be classified as being either agglomerative or divisive, based on how the hierarchical
decomposition is formed.

Density-based methods:

 It is based on the notion of density.

 The general idea is to continue growing the given cluster as long as the density (number of
objects or data points) in the “neighborhood” exceeds some threshold.

Grid-based methods:

 It quantizes the object space into a finite number of cells that form a grid structure.

 All of the clustering operations are performed on the grid structure.

Model-based methods:

 It hypothesizes a model for each of the clusters and finds the best fit of the data to the given
model.

 There are two classes of clustering tasks: clustering high-dimensional data and constraint-based
clustering.

 Clustering high-dimensional data: It is an important task in cluster analysis because many
applications require the analysis of objects containing a large number of features or
dimensions.

 Constraint-based clustering: It performs clustering by incorporating user-specified or
application-oriented constraints.

Partitioning Methods

Given D, a set of n objects, and k, the number of clusters to form, a partitioning algorithm
organizes the objects into k partitions (k<=n), where each partition represents a cluster.

Classical Partitioning Methods:

1. k-means

2. k-medoids

1. Centroid-based techniques: The k-means method:

 The k-means algorithm takes the input parameter, k, and partitions a set of n
objects into k clusters so that the resulting intracluster similarity is high but the
intercluster similarity is low.

 Cluster similarity is measured with regard to the mean value of the objects in a
cluster, which can be viewed as the cluster’s centroid or center of gravity.

Algorithm:

Input:

 k: The number of clusters

 D: a data set containing n objects

Output: A set of k clusters

Method:

 Arbitrarily choose k objects from D as the initial cluster centers

 Repeat

 (Re)assign each object to the cluster to which the object is most similar,
based on the mean value of the objects in the cluster;

 Update the cluster means; i.e., calculate the mean value of the objects for
each cluster;

 Until no change;
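The steps above can be sketched in plain Python. The 2-D sample points, the squared Euclidean distance, the iteration cap, and the fixed random seed are illustrative assumptions, not part of the algorithm statement:

```python
import random

def kmeans(points, k, max_iters=100, seed=0):
    """Minimal k-means sketch on 2-D tuples (illustrative assumptions)."""
    rng = random.Random(seed)
    # Arbitrarily choose k objects from D as the initial cluster centers.
    centers = rng.sample(points, k)
    clusters = []
    for _ in range(max_iters):
        # (Re)assign each object to the cluster whose mean is nearest.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k),
                          key=lambda i: (p[0] - centers[i][0]) ** 2
                                        + (p[1] - centers[i][1]) ** 2)
            clusters[nearest].append(p)
        # Update the cluster means (keep the old center if a cluster empties).
        new_centers = [
            (sum(p[0] for p in c) / len(c), sum(p[1] for p in c) / len(c))
            if c else centers[i]
            for i, c in enumerate(clusters)
        ]
        if new_centers == centers:  # until no change
            break
        centers = new_centers
    return centers, clusters

# Two visually separated groups; k-means should recover them.
data = [(1.0, 1.0), (1.5, 2.0), (8.0, 8.0), (9.0, 9.5), (1.2, 0.8), (8.5, 9.0)]
centers, clusters = kmeans(data, k=2)
```

Note how the two stopping conditions from the method description appear here: the loop exits when the means stop changing, with the iteration cap only as a safety net.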

Strength:

This method is relatively scalable and efficient in processing large data sets.

Weakness:

 Applicable only when the mean is defined.

 Need to specify k, the number of clusters in advance.

 Unable to handle noisy data and outliers

 Not suitable to discover clusters with non-convex shapes.

Problems of k-means method:

 The k-means algorithm is sensitive to outliers, since an object with an
extremely large value may substantially distort the distribution of the data.

 k-medoids: Instead of taking the mean value of the objects in a cluster as a
reference point, a medoid can be used, which is the most centrally located object in
a cluster.

2. Representative object-based technique: The k-medoids method:

 Pick actual objects to represent the clusters, using one representative
object per cluster. Each remaining object is clustered with the representative
object to which it is the most similar.

 The partitioning is then performed based on the principle of minimizing
the sum of the dissimilarities between each object and its corresponding
reference point.

PAM (Partitioning Around Medoids):

 PAM was one of the first k-medoids algorithms. It starts from an initial set
of medoids and iteratively replaces one of the medoids with one of the non-medoids
if doing so improves the total distance of the resulting clustering.

 It works effectively for small data sets, but does not scale well for large
data sets.

 It is a typical k-medoids algorithm.

Algorithm: k-medoids. PAM, a k-medoids algorithm for partitioning based on medoid or central
objects.

Input:

1. k: the number of clusters

2. D: a data set containing n objects

Output: A set of k clusters

Method:

1. Arbitrarily choose k objects in D as the initial representative objects or seeds;

2. Repeat

3. Assign each remaining object to the cluster with the nearest representative object;

4. Randomly select a non-representative object, O_random;

5. Compute the total cost, S, of swapping representative object O_j with O_random;

6. If S < 0, then swap O_j with O_random to form the new set of k representative objects;

7. Until no change.
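The swap loop above can be sketched as follows. The 1-D data, the absolute-difference dissimilarity, and the deterministic swap-scan order (instead of random selection of O_random) are illustrative assumptions:

```python
def dissim(a, b):
    # Assumed dissimilarity: absolute difference on 1-D objects.
    return abs(a - b)

def total_cost(points, medoids):
    # Sum of dissimilarities between each object and its nearest medoid.
    return sum(min(dissim(p, m) for m in medoids) for p in points)

def pam(points, k):
    medoids = list(points[:k])  # arbitrarily choose k initial seeds
    improved = True
    while improved:             # until no change
        improved = False
        for m in list(medoids):
            for o in points:
                if o in medoids:
                    continue
                # Candidate set with representative m swapped for o.
                candidate = [o if x == m else x for x in medoids]
                # S < 0 means the swap lowers the total cost: accept it.
                if total_cost(points, candidate) < total_cost(points, medoids):
                    medoids = candidate
                    improved = True
    return sorted(medoids)

data = [1, 2, 3, 8, 9, 10, 25]  # 25 is an outlier
medoids = pam(data, 2)          # medoids are actual data objects
```

Unlike a k-means mean, each final medoid is one of the original objects, so the outlier 25 cannot drag a cluster representative toward itself.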

Properties of PAM:

 PAM is more robust than k-means in the presence of noise and outliers,
because a medoid is less influenced by outliers or other extreme values than a
mean.

 PAM works efficiently for small data sets but does not scale well for large
data sets.

Partitioning Methods in Large Databases:

Sampling based methods are used to deal with larger data sets.

CLARA (Clustering LARge Applications)

 Instead of taking the whole set of data into consideration, a small portion
of the actual data is chosen as the representative of the data.

 Medoids are then chosen from this sample using PAM.

Strength: Deals with larger data sets than PAM

Weakness:

 Efficiency depends on the sample size.

 A good clustering based on samples will not necessarily represent a good
clustering of the whole data set if the sample is biased.

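A rough, self-contained sketch of the CLARA idea, with illustrative assumptions throughout: tiny 1-D data, an absolute-difference dissimilarity, and an exhaustive medoid search standing in for PAM on each sample (feasible only because the samples are small):

```python
import random
from itertools import combinations

def total_cost(points, medoids):
    # Assumed dissimilarity: absolute difference on 1-D objects.
    return sum(min(abs(p - m) for m in medoids) for p in points)

def clara(points, k, sample_size=5, n_samples=5, seed=0):
    rng = random.Random(seed)
    best, best_cost = None, float("inf")
    for _ in range(n_samples):
        # A small portion of the actual data stands for the whole set.
        sample = rng.sample(points, min(sample_size, len(points)))
        # Choose medoids from the sample (exhaustive search stands in
        # for PAM here, feasible only because the sample is tiny).
        medoids = min(combinations(sample, k),
                      key=lambda ms: total_cost(sample, ms))
        # Judge the sample-derived medoids against ALL objects.
        cost = total_cost(points, medoids)
        if cost < best_cost:
            best, best_cost = list(medoids), cost
    return best, best_cost

data = [1, 2, 3, 8, 9, 10, 25]
medoids, cost = clara(data, 2)
```

Drawing several samples and keeping the cheapest result over the full data set is exactly the hedge against a biased sample described in the weakness above.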
CLARANS (Clustering Large Applications based upon RANdomized Search)

 It combines the sampling techniques with PAM.

 It draws a sample of neighbors dynamically. The clustering process can be
represented as searching a graph where every node is a potential solution, that is, a
set of k medoids.

 If a local optimum is found, CLARANS starts with a new randomly selected
node in search of a new local optimum.

 It is more efficient and scalable than both PAM and CLARA.
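The graph search described above can be sketched as follows, under illustrative assumptions (1-D data and an absolute-difference dissimilarity; the restart and neighbor-limit parameters are conventionally called numlocal and maxneighbor):

```python
import random

def total_cost(points, medoids):
    # Assumed dissimilarity: absolute difference on 1-D objects.
    return sum(min(abs(p - m) for m in medoids) for p in points)

def clarans(points, k, numlocal=3, maxneighbor=20, seed=0):
    rng = random.Random(seed)
    best, best_cost = None, float("inf")
    for _ in range(numlocal):
        # Start from a random node of the graph: a random set of k medoids.
        current = rng.sample(points, k)
        current_cost = total_cost(points, current)
        failures = 0
        while failures < maxneighbor:
            # A random neighbor differs from the current node in one medoid.
            i = rng.randrange(k)
            o = rng.choice([p for p in points if p not in current])
            neighbor = current[:i] + [o] + current[i + 1:]
            cost = total_cost(points, neighbor)
            if cost < current_cost:
                current, current_cost = neighbor, cost
                failures = 0  # moved to a better node; reset the counter
            else:
                failures += 1
        # current is declared a local optimum; keep the best one found.
        if current_cost < best_cost:
            best, best_cost = current, current_cost
    return sorted(best), best_cost

data = [1, 2, 3, 8, 9, 10, 25]
medoids, cost = clarans(data, 2)
```

Sampling neighbors dynamically (rather than examining all of them, as PAM does, or confining the search to a fixed sample, as CLARA does) is what makes CLARANS scale better than both.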
