
Cluster Analysis

Clustering:

The process of grouping a set of physical or abstract objects into classes of similar objects is
called clustering.

Cluster:

A cluster is a collection of data objects that are similar to one another within the same cluster and
are dissimilar to the objects in other clusters.

Cluster Analysis:

 Clustering is an important human activity. Automated clustering can identify dense and sparse
regions in object space and thereby discover overall distribution patterns and interesting
correlations among data attributes.

 It has been widely used in numerous applications, including market research, pattern
recognition, data analysis and image processing.

Requirements of Clustering in data mining:

 Scalability

 Ability to deal with different types of attributes

 Discovery of clusters with arbitrary shape

 Minimal requirements for domain knowledge to determine input parameters

 Ability to deal with noisy data

 Incremental clustering and insensitivity to the order of input records

 Ability to handle high-dimensional data

 Constraint-based clustering

 Interpretability and usability

The major clustering methods can be classified into the following categories.

 Partitioning methods

 Hierarchical methods

 Density-based methods

 Grid-based methods

 Model-based methods

Partitioning methods:

Given a database of n objects or data tuples, a partitioning method constructs k partitions of the
data, where each partition represents a cluster and k<=n.

Hierarchical methods:

 It creates a hierarchical decomposition of the given set of data objects.

 It can be classified as being either agglomerative or divisive, based on how the hierarchical
decomposition is formed.

Density-based methods:

 It is based on the notion of density.

 The general idea is to continue growing the given cluster as long as the density (number of
objects or data points) in the “neighborhood” exceeds some threshold.

Grid-based methods:

 It quantizes the object space into a finite number of cells that form a grid structure.

 All of the clustering operations are performed on the grid structure.

Model-based methods:

 It hypothesizes a model for each of the clusters and finds the best fit of the data to the given
model.

 There are two classes of clustering tasks: clustering high-dimensional data and constraint-based
clustering.

 Clustering high-dimensional data: It is an important task in cluster analysis because many
applications require the analysis of objects containing a large number of features or
dimensions.

 Constraint-based clustering: It performs clustering by incorporating user-specified or
application-oriented constraints.

Partitioning Methods

Given D, a set of n objects, and k, the number of clusters to form, a partitioning algorithm
organizes the objects into k partitions (k<=n), where each partition represents a cluster.

Classical Partitioning Methods:

1. k-means

2. k-medoids

1. Centroid-based techniques: The k-means method:

 The k-means algorithm takes the input parameter, k, and partitions a set of n
objects into k clusters so that the resulting intracluster similarity is high but the
intercluster similarity is low.

 Cluster similarity is measured with regard to the mean value of the objects in a
cluster, which can be viewed as the cluster’s centroid or center of gravity.

Algorithm:

Input:

 k: The number of clusters

 D: a data set containing n objects

Output: A set of k clusters

Method:

 Arbitrarily choose k objects from D as the initial cluster centers

 Repeat

 (Re)assign each object to the cluster to which the object is most similar,
based on the mean value of the objects in the cluster;

 Update the cluster means; i.e., calculate the mean value of the objects for
each cluster;

 Until no change;
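The steps above can be sketched in plain Python. The 2-D sample points, the squared Euclidean distance, the iteration cap, and the fixed random seed are illustrative assumptions, not part of the algorithm statement:

```python
import random

def kmeans(points, k, max_iters=100, seed=0):
    """Minimal k-means sketch on 2-D tuples (illustrative assumptions)."""
    rng = random.Random(seed)
    # Arbitrarily choose k objects from D as the initial cluster centers.
    centers = rng.sample(points, k)
    clusters = []
    for _ in range(max_iters):
        # (Re)assign each object to the cluster whose mean is nearest.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k),
                          key=lambda i: (p[0] - centers[i][0]) ** 2
                                        + (p[1] - centers[i][1]) ** 2)
            clusters[nearest].append(p)
        # Update the cluster means (keep the old center if a cluster empties).
        new_centers = [
            (sum(p[0] for p in c) / len(c), sum(p[1] for p in c) / len(c))
            if c else centers[i]
            for i, c in enumerate(clusters)
        ]
        if new_centers == centers:  # until no change
            break
        centers = new_centers
    return centers, clusters

# Two visually separated groups; k-means should recover them.
data = [(1.0, 1.0), (1.5, 2.0), (8.0, 8.0), (9.0, 9.5), (1.2, 0.8), (8.5, 9.0)]
centers, clusters = kmeans(data, k=2)
```

Note how the two stopping conditions from the method description appear here: the loop exits when the means stop changing, with the iteration cap only as a safety net.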

Strength:

This method is relatively scalable and efficient in processing large data sets.

Weakness:

 Applicable only when the mean is defined.

 Need to specify k, the number of clusters in advance.

 Unable to handle noisy data and outliers

 Not suitable to discover clusters with non-convex shapes.

Problems of k-means method:

 The k-means algorithm is sensitive to outliers, since an object with an
extremely large value may substantially distort the distribution of the data.

 k-medoids: Instead of taking the mean value of the objects in a cluster as a
reference point, a medoid can be used, which is the most centrally located object in
a cluster.

2. Representative object-based technique: The k-medoids method:

 Pick actual objects to represent the clusters, using one representative
object per cluster. Each remaining object is clustered with the representative
object to which it is the most similar.

 The partitioning is then performed based on the principle of minimizing
the sum of the dissimilarities between each object and its corresponding
reference point.

PAM (Partitioning Around Medoids):

 PAM was one of the first k-medoids algorithms. It starts from an initial set
of medoids and iteratively replaces one of the medoids with one of the non-medoids
if doing so improves the total distance of the resulting clustering.

 It works effectively for small data sets, but does not scale well for large
data sets.

 It is a typical k-medoids algorithm.

Algorithm: k-medoids. PAM, a k-medoids algorithm for partitioning based on medoid or central
objects.

Input:

1. k: the number of clusters

2. D: a data set containing n objects

Output: A set of k clusters

Method:

1. Arbitrarily choose k objects in D as the initial representative objects or seeds;

2. Repeat

3. Assign each remaining object to the cluster with the nearest representative object;

4. Randomly select a non-representative object, O_random;

5. Compute the total cost, S, of swapping representative object O_j with O_random;

6. If S < 0, then swap O_j with O_random to form the new set of k representative objects;

7. Until no change.
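The swap loop above can be sketched as follows. The 1-D data, the absolute-difference dissimilarity, and the deterministic swap-scan order (instead of random selection of O_random) are illustrative assumptions:

```python
def dissim(a, b):
    # Assumed dissimilarity: absolute difference on 1-D objects.
    return abs(a - b)

def total_cost(points, medoids):
    # Sum of dissimilarities between each object and its nearest medoid.
    return sum(min(dissim(p, m) for m in medoids) for p in points)

def pam(points, k):
    medoids = list(points[:k])  # arbitrarily choose k initial seeds
    improved = True
    while improved:             # until no change
        improved = False
        for m in list(medoids):
            for o in points:
                if o in medoids:
                    continue
                # Candidate set with representative m swapped for o.
                candidate = [o if x == m else x for x in medoids]
                # S < 0 means the swap lowers the total cost: accept it.
                if total_cost(points, candidate) < total_cost(points, medoids):
                    medoids = candidate
                    improved = True
    return sorted(medoids)

data = [1, 2, 3, 8, 9, 10, 25]  # 25 is an outlier
medoids = pam(data, 2)          # medoids are actual data objects
```

Unlike a k-means mean, each final medoid is one of the original objects, so the outlier 25 cannot drag a cluster representative toward itself.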

Properties of PAM:

 PAM is more robust than k-means in the presence of noise and outliers,
because a medoid is less influenced by outliers or other extreme values than a
mean.

 PAM works efficiently for small data sets but does not scale well for large
data sets.

Partitioning Methods in Large Databases:

Sampling based methods are used to deal with larger data sets.

CLARA (Clustering LARge Applications)

 Instead of taking the whole set of data into consideration, a small portion
of the actual data is chosen as the representative of the data.

 Medoids are then chosen from this sample using PAM.

Strength: Deals with larger data sets than PAM

Weakness:

 Efficiency depends on the sample size.

 A good clustering based on samples will not necessarily represent a good
clustering of the whole data set if the sample is biased.

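A rough, self-contained sketch of the CLARA idea, with illustrative assumptions throughout: tiny 1-D data, an absolute-difference dissimilarity, and an exhaustive medoid search standing in for PAM on each sample (feasible only because the samples are small):

```python
import random
from itertools import combinations

def total_cost(points, medoids):
    # Assumed dissimilarity: absolute difference on 1-D objects.
    return sum(min(abs(p - m) for m in medoids) for p in points)

def clara(points, k, sample_size=5, n_samples=5, seed=0):
    rng = random.Random(seed)
    best, best_cost = None, float("inf")
    for _ in range(n_samples):
        # A small portion of the actual data stands for the whole set.
        sample = rng.sample(points, min(sample_size, len(points)))
        # Choose medoids from the sample (exhaustive search stands in
        # for PAM here, feasible only because the sample is tiny).
        medoids = min(combinations(sample, k),
                      key=lambda ms: total_cost(sample, ms))
        # Judge the sample-derived medoids against ALL objects.
        cost = total_cost(points, medoids)
        if cost < best_cost:
            best, best_cost = list(medoids), cost
    return best, best_cost

data = [1, 2, 3, 8, 9, 10, 25]
medoids, cost = clara(data, 2)
```

Drawing several samples and keeping the cheapest result over the full data set is exactly the hedge against a biased sample described in the weakness above.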
CLARANS (Clustering Large Applications based upon RANdomized Search)

 It combines the sampling techniques with PAM.

 It draws a sample of neighbors dynamically. The clustering process can be
represented as searching a graph where every node is a potential solution, that is, a
set of k medoids.

 If a local optimum is found, CLARANS starts with a new randomly selected
node in search of a new local optimum.

 It is more efficient and scalable than both PAM and CLARA.
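The graph search described above can be sketched as follows, under illustrative assumptions (1-D data and an absolute-difference dissimilarity; the restart and neighbor-limit parameters are conventionally called numlocal and maxneighbor):

```python
import random

def total_cost(points, medoids):
    # Assumed dissimilarity: absolute difference on 1-D objects.
    return sum(min(abs(p - m) for m in medoids) for p in points)

def clarans(points, k, numlocal=3, maxneighbor=20, seed=0):
    rng = random.Random(seed)
    best, best_cost = None, float("inf")
    for _ in range(numlocal):
        # Start from a random node of the graph: a random set of k medoids.
        current = rng.sample(points, k)
        current_cost = total_cost(points, current)
        failures = 0
        while failures < maxneighbor:
            # A random neighbor differs from the current node in one medoid.
            i = rng.randrange(k)
            o = rng.choice([p for p in points if p not in current])
            neighbor = current[:i] + [o] + current[i + 1:]
            cost = total_cost(points, neighbor)
            if cost < current_cost:
                current, current_cost = neighbor, cost
                failures = 0  # moved to a better node; reset the counter
            else:
                failures += 1
        # current is declared a local optimum; keep the best one found.
        if current_cost < best_cost:
            best, best_cost = current, current_cost
    return sorted(best), best_cost

data = [1, 2, 3, 8, 9, 10, 25]
medoids, cost = clarans(data, 2)
```

Sampling neighbors dynamically (rather than examining all of them, as PAM does, or confining the search to a fixed sample, as CLARA does) is what makes CLARANS scale better than both.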
