Clustering Partition Hierarchy


Major Clustering Approaches

 Partitioning approach:
   Construct various partitions and then evaluate them by some criterion, e.g., minimizing the sum of squared errors
   Typical methods: k-means, k-medoids, CLARANS
 Hierarchical approach:
   Create a hierarchical decomposition of the set of data (or objects) using some criterion
   Typical methods: DIANA, AGNES, BIRCH, CHAMELEON
 Density-based approach:
   Based on connectivity and density functions
   Typical methods: DBSCAN, OPTICS, DenClue
 Grid-based approach:
   Based on a multiple-level granularity structure
   Typical methods: STING, WaveCluster, CLIQUE
Partitioning Algorithms: Basic Concept
 Partitioning method: Partitioning a database D of n objects into a set of
k clusters, such that the sum of squared distances is minimized (where
ci is the centroid or medoid of cluster Ci)

$E = \sum_{i=1}^{k} \sum_{p \in C_i} \mathrm{dist}(p, c_i)^2$

 Given k, find a partition of k clusters that optimizes the chosen
partitioning criterion
 Global optimal: exhaustively enumerate all partitions
 Heuristic methods: k-means and k-medoids algorithms
 k-means (MacQueen’67, Lloyd’57/’82): Each cluster is
represented by the center of the cluster
 k-medoids or PAM (Partition around medoids) (Kaufman &
Rousseeuw’87): Each cluster is represented by one of the
objects in the cluster
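
As a concrete reading of this criterion, here is an illustrative sketch (numpy, not from the slides) that computes E for any given assignment and set of centres:

```python
import numpy as np

def sse(points, labels, centers):
    """Criterion E: sum of squared distances of each point to its assigned centre."""
    return sum(np.sum((points[labels == i] - c) ** 2) for i, c in enumerate(centers))

# Toy check with two small clusters and their means as centres.
X = np.array([[1.0, 1.0], [2.0, 1.0], [4.0, 3.0], [5.0, 4.0]])
labels = np.array([0, 0, 1, 1])
centers = np.array([[1.5, 1.0], [4.5, 3.5]])
print(sse(X, labels, centers))  # 0.25 + 0.25 + 0.5 + 0.5 = 1.5
```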
K-means Algorithm
 Given the cluster number K, the K-means algorithm is carried out in three steps after initialization:

 Initialization: set seed points (randomly)

1) Assign each object to the cluster of the nearest seed point, measured with a specific distance metric
2) Compute new seed points as the centroids of the clusters of the current partition (the centroid is the centre, i.e., the mean point, of the cluster)
3) Go back to Step 1; stop when there are no new assignments (i.e., membership in each cluster no longer changes)

A minimal sketch of this loop is given below.
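The following is one illustrative rendering of the three steps (plain numpy, not the lecturer's code); it assumes Euclidean distance and that no cluster ever becomes empty:

```python
import numpy as np

def kmeans(X, k, seeds, max_iter=100):
    """Plain k-means: assign to the nearest centre, recompute centroids, repeat until stable."""
    centers = np.asarray(seeds, dtype=float)
    labels = None
    for _ in range(max_iter):
        # Step 1: assign each object to the nearest centre (Euclidean distance).
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)
        # Step 3: stop when no membership changes.
        if labels is not None and np.array_equal(new_labels, labels):
            break
        labels = new_labels
        # Step 2: recompute each centroid as the mean of its current members
        # (assumes every cluster keeps at least one member).
        centers = np.array([X[labels == j].mean(axis=0) for j in range(k)])
    return labels, centers
```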
K Means Example Sec. 16.4

(K=2)

[Animation of the iterations on a 2-D point set: pick seeds → reassign clusters → compute centroids → reassign clusters → compute centroids → reassign clusters → converged!]
Example

Suppose we have 4 types of medicines and each has two attributes (pH and weight index). Our goal is to group these objects into K=2 groups of medicines.

Medicine   Weight   pH-Index
A          1        1
B          2        1
C          4        3
D          5        4

[Scatter plot of A, B, C, D in the weight/pH plane]
Example
 Step 1: Use the initial seed points for partitioning:

$c_1 = A = (1, 1), \quad c_2 = B = (2, 1)$

Euclidean distance, e.g., for object D:
$d(D, c_1) = \sqrt{(5-1)^2 + (4-1)^2} = 5$
$d(D, c_2) = \sqrt{(5-2)^2 + (4-1)^2} \approx 4.24$

Assign each object to the cluster with the nearest seed point.
Example
 Step 2: Compute the new centroids of the current partition.

Knowing the members of each cluster, we compute the new centroid of each group based on these new memberships:

$c_1 = (1, 1)$
$c_2 = \left(\frac{2+4+5}{3}, \frac{1+3+4}{3}\right) = \left(\frac{11}{3}, \frac{8}{3}\right)$
Example
 Step 2: Renew the membership based on the new centroids.

Compute the distance of all objects to the new centroids, then reassign each object to its nearest centroid.
Example
 Step 3: Repeat the first two steps until convergence.

Knowing the members of each cluster, we compute the new centroid of each group based on these new memberships:

$c_1 = \left(\frac{1+2}{2}, \frac{1+1}{2}\right) = (1\tfrac{1}{2}, 1)$
$c_2 = \left(\frac{4+5}{2}, \frac{3+4}{2}\right) = (4\tfrac{1}{2}, 3\tfrac{1}{2})$
Example

 Step 3: Repeat the first two steps until convergence.

Compute the distance of all objects to the new centroids.

Stop: there is no new assignment; membership in each cluster no longer changes.
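A quick way to double-check the worked example (using scikit-learn rather than the hand computation on the slides) is to run k-means with the same seeds; the clusters {A, B} and {C, D} and the centroids (1.5, 1) and (4.5, 3.5) match the slides.

```python
import numpy as np
from sklearn.cluster import KMeans

X = np.array([[1, 1], [2, 1], [4, 3], [5, 4]], dtype=float)   # A, B, C, D
seeds = np.array([[1, 1], [2, 1]], dtype=float)               # c1 = A, c2 = B

km = KMeans(n_clusters=2, init=seeds, n_init=1).fit(X)
print(km.labels_)           # [0 0 1 1] -> clusters {A, B} and {C, D}
print(km.cluster_centers_)  # [[1.5 1. ] [4.5 3.5]]
```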
Exercise
For the medicine data set, use K-means with the Manhattan distance
metric for clustering analysis by setting K=2 and initialising seeds as
C1 = A and C2 = C. Answer two questions as follows:
1. What are memberships of two clusters after convergence?
2. What are centroids of two clusters after convergence?

Medicine   Weight   pH-Index
A          1        1
B          2        1
C          4        3
D          5        4

[Scatter plot of A, B, C, D in the weight/pH plane]
Sec. 16.4

Termination conditions
 Several possibilities, e.g.,
  A fixed number of iterations.
  Centroid positions don't change.
  RSS (Residual Sum of Squares) falls below a threshold.
  The decrease in RSS falls below a threshold.
Comments on the K-Means Method

 Strength: Efficient: O(tkn), where n is # objects, k is # clusters, and t is # iterations. Normally, k, t << n.
 Comment: Often terminates at a local optimum.
 Weakness
  Applicable only to objects in a continuous n-dimensional space
  Need to specify k, the number of clusters, in advance
  Sensitive to noisy data and outliers
  Not suitable for discovering clusters with non-convex shapes
Sec. 16.4

Seed Choice

 Results can vary based on random seed selection.
 Some seeds can result in a poor convergence rate, or convergence to sub-optimal clusterings.
  Select good seeds using a heuristic (e.g., a point least similar to any existing mean)
  Try out multiple starting points
  Initialize with the results of another method.

Example showing sensitivity to seeds: in the example figure (not shown here), starting with B and E as centroids you converge to {A,B,C} and {D,E,F}; starting with D and F you converge to {A,B,D,E} and {C,F}.
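
In practice the usual remedies are a couple of scikit-learn arguments (illustrative usage, not part of the slides): k-means++ seeding plus several random restarts, keeping the run with the lowest SSE.

```python
import numpy as np
from sklearn.cluster import KMeans

X = np.random.RandomState(0).rand(100, 2)

# k-means++ seeding and 10 restarts; the run with the lowest inertia (SSE) is kept.
km = KMeans(n_clusters=3, init="k-means++", n_init=10, random_state=0).fit(X)
print(km.inertia_)  # within-cluster sum of squares of the best run
```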
k-means++
 Instead of choosing the initial cluster centers randomly, choose them more carefully:
  1. Randomly choose one of the observations to be a cluster center
  2. For each observation x, determine d(x), where d(x) denotes the MINIMAL distance from x to a current cluster center
  3. Choose the next cluster center from the data points, with the probability of making an observation x a cluster center proportional to d(x)²
  4. Repeat steps 2 and 3 until you have chosen the right number of clusters
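A minimal sketch of the seeding step itself (illustrative numpy, not the slides' code):

```python
import numpy as np

def kmeans_pp_init(X, k, seed=0):
    """k-means++ seeding: first centre uniformly at random, later centres drawn with prob. proportional to d(x)^2."""
    rng = np.random.default_rng(seed)
    centers = [X[rng.integers(len(X))]]                        # step 1: one random observation
    for _ in range(k - 1):
        # step 2: squared distance of every point to its nearest current centre
        d2 = ((X[:, None, :] - np.array(centers)[None, :, :]) ** 2).sum(axis=2).min(axis=1)
        # step 3: sample the next centre with probability proportional to d(x)^2
        centers.append(X[rng.choice(len(X), p=d2 / d2.sum())])
    return np.array(centers)
```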
Example
 k-means++ seeding on a 2-D data set (figures omitted), choosing three centers one at a time:
Cluster centers: {(7,4)}
Cluster centers: {(7,4), (1,3)}
Cluster centers: {(7,4), (1,3), (5,9)}

k-means++
 Instead of choosing the initial cluster centers randomly, choose them more carefully:
  Randomly choose one of the observations to be a cluster center
  For each observation x, determine d(x), where d(x) denotes the MINIMAL distance from x to a current cluster center
  Choose the next cluster center from the data points, with the probability of making an observation x a cluster center proportional to d(x)²
  Repeat the last two steps until you have chosen the right number of clusters

This process has a setup cost, but convergence tends to be faster and better (lower heterogeneity).
What Is the Problem of the K-Means Method?

 The k-means algorithm is sensitive to outliers!
  An object with an extremely large value may substantially distort the distribution of the data.
 K-Medoids: instead of taking the mean value of the objects in a cluster as a reference point, a medoid can be used, i.e., the most centrally located object in the cluster.

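To make the contrast concrete, a small illustrative sketch (numpy, not from the slides): with one extreme value the mean is dragged far away from the bulk of the data, while the medoid (the object minimizing the total distance to all others) stays on an actual, central object.

```python
import numpy as np

pts = np.array([[1, 1], [2, 1], [2, 2], [1, 2], [25, 25]])   # the last point is an outlier

mean = pts.mean(axis=0)                                      # k-means representative
D = np.linalg.norm(pts[:, None] - pts[None, :], axis=2)      # pairwise distances
medoid = pts[D.sum(axis=1).argmin()]                         # k-medoids representative

print(mean)    # [6.2 6.2] -> pulled towards the outlier
print(medoid)  # [2 2]     -> stays inside the dense group
```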
Hierarchical Clustering
 Use the distance matrix as the clustering criterion. This method does not require the number of clusters k as an input, but it needs a termination condition.

[Diagram: agglomerative clustering (AGNES) proceeds from Step 0 to Step 4, merging {a}, {b}, {c}, {d}, {e} into {a,b} and {d,e}, then {c,d,e}, and finally {a,b,c,d,e}; divisive clustering (DIANA) proceeds in the reverse direction, from Step 4 back to Step 0.]
AGNES (Agglomerative Nesting)
 Introduced in Kaufmann and Rousseeuw (1990)
 Implemented in statistical packages, e.g., Splus
 Use the single-link method and the dissimilarity matrix
 Merge nodes that have the least dissimilarity
 Go on in a non-descending fashion
 Eventually all nodes belong to the same cluster

Example
 Five objects: a, b, c, d, e

Distance matrix:

      a    b    c    d    e
a     0
b    12    0
c     4   10    0
d    15    6   12    0
e     7    8    5   13    0

[Dendrogram under construction, leaf order a, c, b, d, e: a and c are merged first, since d(a,c) = 4 is the smallest entry]
Example
Distance matrix (single link):

      a    b    c    d    e
a     0
b    12    0
c     4   10    0
d    15    6   12    0
e     7    8    5   13    0

After merging a and c:

      ac   b    d    e
ac    0
b    10    0
d    12    6    0
e     5    8   13    0

[Dendrogram: a and c joined]
Example
After merging a and c:

      ac   b    d    e
ac    0
b    10    0
d    12    6    0
e     5    8   13    0

After merging ac and e:

      ace   b    d
ace    0
b      8    0
d     12    6    0

[Dendrogram: a, c, e joined; next b and d are merged (distance 6), and finally the two remaining clusters are merged (distance 8)]
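The same merge sequence can be reproduced with SciPy (an illustrative check, not part of the slides); linkage expects the condensed distance matrix, i.e., the upper triangle of the matrix as a flat vector.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage

# Condensed distances in the order (a,b), (a,c), (a,d), (a,e), (b,c), (b,d), (b,e), (c,d), (c,e), (d,e)
d = np.array([12, 4, 15, 7, 10, 6, 8, 12, 5, 13], dtype=float)

Z = linkage(d, method="single")
print(Z)
# Merge order: a+c at distance 4, (a,c)+e at 5, b+d at 6, and the two remaining clusters at 8.
```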
Exercise
 Five objects: a, b, c, d, e

Distance matrix:

      a    b    c    d    e
a     0
b    12    0
c     4   10    0
d    15    6   12    0
e     7    8    5   13    0

Use the complete-link method.
Dendrogram

 Decompose the data objects into several levels of nested partitioning, called a dendrogram.
 A dendrogram shows how the clusters are merged.
 A clustering of the data objects is obtained by cutting the dendrogram at the desired level; each connected component then forms a cluster.

[Dendrogram for the example, with leaf order a, c, e, b, d]
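Cutting the dendrogram at a given height is what SciPy's fcluster does (again an illustrative sketch): cutting the single-link tree from the example between heights 6 and 8 leaves exactly the two clusters {a, c, e} and {b, d}.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

d = np.array([12, 4, 15, 7, 10, 6, 8, 12, 5, 13], dtype=float)   # condensed matrix for a..e
Z = linkage(d, method="single")

labels = fcluster(Z, t=7, criterion="distance")   # cut the dendrogram at height 7
print(labels)   # a, c, e share one label; b, d share the other
```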
DIANA (Divisive Analysis)

 Introduced in Kaufmann and Rousseeuw (1990)


 Implemented in statistical analysis packages, e.g., Splus
 Inverse order of AGNES
 Eventually each node forms a cluster on its own

Hierarchical Clustering (Cont.)
 Most hierarchical clustering algorithms are variants of the single-link, complete-link, or average-link method.

 Of these, single-link and complete-link are the most popular.

 In the single-link method, the distance between two clusters is the minimum of the distances between all pairs of patterns drawn one from each cluster.

 In the complete-link algorithm, the distance between two clusters is the maximum of all pairwise distances between pairs of patterns drawn one from each cluster.

 In the average-link algorithm, the distance between two clusters is the average of all pairwise distances between pairs of patterns drawn one from each cluster.
Distance Between Clusters
 Single Link: smallest distance between any
pair of points from two clusters
 Complete Link: largest distance between any pair
of points from two clusters
Distance between Clusters (Cont.)
 Average Link: average distance between points from two clusters

 Centroid: distance between the centroids of the two clusters
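
The four definitions differ only in how the pairwise distances are aggregated; a small illustrative numpy sketch for two point sets:

```python
import numpy as np

A = np.array([[0.0, 0.0], [1.0, 0.0]])
B = np.array([[3.0, 4.0], [5.0, 4.0]])

D = np.linalg.norm(A[:, None, :] - B[None, :, :], axis=2)      # all pairwise distances

single   = D.min()                                             # single link: smallest pairwise distance
complete = D.max()                                             # complete link: largest pairwise distance
average  = D.mean()                                            # average link: mean pairwise distance
centroid = np.linalg.norm(A.mean(axis=0) - B.mean(axis=0))     # centroid: distance between the two means

print(single, complete, average, centroid)
```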
Single Link vs. Complete Link (Cont.)

[Two example data sets: on the first, single link works but complete link does not; on the second, complete link works but single link does not.]
Single Link vs. Complete Link (Cont.)

[Figure: a data set of points labelled 1 and 2 on which single link works but complete link doesn't.]


Single Link vs. Complete Link (Cont.)

[Figure: two clusters (labelled 1 and 2) with noise points between them. Single link doesn't work here, because the noise chains the two clusters together, while complete link does.]


Hierarchical vs. Partitional

 Hierarchical algorithms are more versatile than partitional algorithms.
  For example, the single-link clustering algorithm works well on data sets containing non-isotropic (non-roundish) clusters, including well-separated, chain-like, and concentric clusters, whereas a typical partitional algorithm such as the k-means algorithm works well only on data sets having isotropic clusters.
 On the other hand, the time and space complexities of partitional algorithms are typically lower than those of hierarchical algorithms.
More on Hierarchical Clustering Methods

 Major weaknesses of agglomerative clustering methods
  do not scale well: time complexity of at least O(n²), where n is the total number of objects
  can never undo what was done previously (greedy algorithm)
 Integration of hierarchical with distance-based clustering
  BIRCH (1996): uses a Clustering Feature tree (CF-tree) and incrementally adjusts the quality of sub-clusters
  CURE (1998): selects well-scattered points from the cluster and then shrinks them towards the center of the cluster by a specified fraction
  CHAMELEON (1999): hierarchical clustering using dynamic modeling
BIRCH (Balanced Iterative Reducing and
Clustering Using Hierarchies)
 Zhang, Ramakrishnan & Livny, SIGMOD’96
 Incrementally construct a CF (Clustering Feature) tree, a
hierarchical data structure for multiphase clustering
 Phase 1: scan DB to build an initial in-memory CF tree
(a multi-level compression of the data that tries to
preserve the inherent clustering structure of the data)
 Phase 2: use an arbitrary clustering algorithm to cluster
the leaf nodes of the CF-tree

The CF Tree Structure
[Diagram of a CF tree: the root holds entries CF1 … CF6, each with a pointer to a child node; non-leaf nodes hold entries CF1 … CF5 with child pointers; leaf nodes hold CF entries and are chained to their neighbouring leaves via prev/next pointers.]
Clustering Feature Vector in BIRCH

Clustering Feature (CF): CF = (N, LS, SS)

N: number of data points

LS: linear sum of the N points: $LS = \sum_{i=1}^{N} X_i$

SS: square sum of the N points: $SS = \sum_{i=1}^{N} X_i^2$

Example: for the 5 points (3,4), (2,6), (4,5), (4,7), (3,8):
CF = (5, (16,30), (54,190))

[Scatter plot of the five points]
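Checking the slide's numbers takes a few lines (illustrative numpy; SS is kept per dimension here, matching the slide's (54,190)):

```python
import numpy as np

pts = np.array([[3, 4], [2, 6], [4, 5], [4, 7], [3, 8]])

N  = len(pts)
LS = pts.sum(axis=0)            # linear sum per dimension
SS = (pts ** 2).sum(axis=0)     # square sum per dimension

print(N, LS, SS)                # 5 [16 30] [54 190]
```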
CF-Tree in BIRCH
 Clustering feature:
  Summary of the statistics for a given subcluster: the 0-th, 1st, and 2nd moments of the subcluster from the statistical point of view
  Registers crucial measurements for computing clusters and utilizes storage efficiently
CF-Tree in BIRCH

 A CF tree is a height-balanced tree that stores the clustering features for a hierarchical clustering
  A nonleaf node in the tree has descendants or "children"
  The nonleaf nodes store the sums of the CFs of their children
 A CF tree has two parameters
  Branching factor: max # of children
  Threshold: max diameter of the sub-clusters stored at the leaf nodes
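A hypothetical, minimal CF entry (the class and function names are mine, not BIRCH's) makes the two parameters concrete: entries merge additively, and a leaf entry only absorbs a new point if the merged diameter stays within the threshold T; here SS is kept as a scalar sum of squared norms.

```python
import numpy as np

class CFEntry:
    """Minimal clustering feature (N, LS, SS), with SS a scalar sum of squared norms."""
    def __init__(self, point):
        p = np.asarray(point, dtype=float)
        self.N, self.LS, self.SS = 1, p.copy(), float(p @ p)

    def merged(self, other):
        # CFs are additive (see the Property slides later in this deck).
        cf = CFEntry.__new__(CFEntry)
        cf.N, cf.LS, cf.SS = self.N + other.N, self.LS + other.LS, self.SS + other.SS
        return cf

    def diameter(self):
        # D = sqrt((2*N*SS - 2*|LS|^2) / (N*(N-1))), computable from the CF alone.
        if self.N < 2:
            return 0.0
        return float(np.sqrt((2 * self.N * self.SS - 2 * self.LS @ self.LS) / (self.N * (self.N - 1))))

def try_absorb(entry, point, T):
    """Absorb the point into the leaf entry only if the merged diameter stays within threshold T."""
    candidate = entry.merged(CFEntry(point))
    return candidate if candidate.diameter() <= T else None
```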
5. BIRCH algorithm

• An example of the CF Tree

Initially, all the data points are in one cluster, A (the root holds a single entry for A).

As data arrive, a check is made whether the size of cluster A exceeds the threshold T.

If the cluster grows too big, it is split into two clusters, A and B, and the points are redistributed between them.

At each node of the tree, the CF tree keeps information about the mean of the cluster and the mean of the sum of squares, so that the size of the clusters can be computed efficiently.

[Diagrams: a root whose entries grow from {A} to {A, B} once the threshold T is exceeded]
5. BIRCH algorithm

• Another example of CF Tree insertion

[Diagram: the root has entries for the leaf nodes LN1, LN2, LN3, which together hold the sub-clusters sc1 … sc7; a new sub-cluster sc8 arrives and is inserted into LN1, its closest leaf.]
5. BIRCH algorithm

• Another example of CF Tree insertion

If the branching factor of a leaf node cannot exceed 3, then LN1 (which would now hold four entries) is split into LN1' and LN1''.

[Diagram: the root now has entries LN1', LN1'', LN2, LN3; the sub-clusters sc8, sc1, sc2, sc3 are redistributed between LN1' and LN1''.]
5. BIRCH algorithm

• Another example of CF Tree insertion

If the branching factor of a non-leaf node cannot exceed 3, then the root (which would now have four children) is split, and the height of the CF Tree increases by one.

[Diagram: a new root with two non-leaf entries NLN1 and NLN2, which split the four children LN1', LN1'', LN2, LN3 between them.]
5. BIRCH algorithm

• Phase 1: Scan all data and build an initial in-memory CF tree, using the given amount of memory and recycling space on disk.

• Phase 2: Condense into a desirable size by building a smaller CF tree.

• Phase 3: Global clustering.

• Phase 4: Cluster refining – this is optional, and requires more passes over the data to refine the results.
The Birch Algorithm
 Cluster diameter: $D = \sqrt{\frac{1}{n(n-1)} \sum_{i} \sum_{j} (x_i - x_j)^2}$

 For each point in the input:
  Find the closest leaf entry
  Add the point to the leaf entry and update the CF
  If the entry diameter > max_diameter, then split the leaf, and possibly the parents
 The algorithm is O(n)
 Concerns
  Sensitive to the insertion order of the data points
  Since we fix the size of the leaf nodes, clusters may not be natural
  Clusters tend to be spherical given the radius and diameter measures
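scikit-learn's Birch exposes exactly these two CF-tree parameters; a short illustrative usage example (not from the slides):

```python
import numpy as np
from sklearn.cluster import Birch

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.3, (50, 2)), rng.normal(3, 0.3, (50, 2))])

# threshold bounds the size of each leaf sub-cluster; branching_factor bounds the children per node.
model = Birch(threshold=0.5, branching_factor=50, n_clusters=2).fit(X)
print(model.labels_[:5], model.subcluster_centers_.shape)
```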
Property

The diameter $D_m$ of a cluster can be computed from its clustering feature alone. With $M_1 = \sum_{j=1}^{N} t_j$ (linear sum) and $M_2 = \sum_{j=1}^{N} t_j^2$ (square sum):

$D_m = \sqrt{\frac{\sum_{j=1}^{N}\sum_{i=1}^{N} (t_i - t_j)^2}{N(N-1)}}$
$\;\;\; = \sqrt{\frac{\sum_{j=1}^{N}\sum_{i=1}^{N} (t_i^2 - 2 t_i t_j + t_j^2)}{N(N-1)}}$
$\;\;\; = \sqrt{\frac{\sum_{j=1}^{N} \left( \sum_{i=1}^{N} t_i^2 - 2 t_j \sum_{i=1}^{N} t_i + N t_j^2 \right)}{N(N-1)}}$
$\;\;\; = \sqrt{\frac{\sum_{j=1}^{N} (M_2 - 2 t_j M_1 + N t_j^2)}{N(N-1)}}$
$\;\;\; = \sqrt{\frac{N M_2 - 2 M_1 \sum_{j=1}^{N} t_j + N \sum_{j=1}^{N} t_j^2}{N(N-1)}}$
$\;\;\; = \sqrt{\frac{2 N M_2 - 2 M_1^2}{N(N-1)}}$

Moreover, clustering features are additive: when two clusters $C_a$ and $C_b$ with $CF_a = (n_a, M_{a1}, M_{a2})$ and $CF_b = (n_b, M_{b1}, M_{b2})$ are merged,

$N = n_a + n_b, \quad M_1 = M_{a1} + M_{b1}, \quad M_2 = M_{a2} + M_{b2}$
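
A quick numerical check of this identity (illustrative numpy): compute $D_m$ directly from pairwise distances and from (N, M1, M2) only.

```python
import numpy as np

pts = np.array([[3, 4], [2, 6], [4, 5], [4, 7], [3, 8]], dtype=float)
N = len(pts)

# Directly from pairwise squared distances (the i == j terms are zero).
diff = pts[:, None, :] - pts[None, :, :]
direct = np.sqrt((diff ** 2).sum() / (N * (N - 1)))

# From the clustering feature only: M1 = linear sum, M2 = sum of squared norms.
M1 = pts.sum(axis=0)
M2 = (pts ** 2).sum()
from_cf = np.sqrt((2 * N * M2 - 2 * M1 @ M1) / (N * (N - 1)))

print(direct, from_cf)   # the two values agree
```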


BIRCH (Balanced Iterative Reducing and
Clustering Using Hierarchies)
 Scales linearly: finds a good clustering with a single scan and improves the quality with a few additional scans

 Weakness: handles only numeric data, and is sensitive to the order of the data records
