Introduction to Data Mining, 2nd Edition
Tan, Steinbach, Karpatne, Kumar
Chapter 5
Association Analysis: Basic Concepts
Market-Basket transactions:

TID   Items
1     Bread, Milk
2     Bread, Diaper, Beer, Eggs
3     Milk, Diaper, Beer, Coke
4     Bread, Milk, Diaper, Beer
5     Bread, Milk, Diaper, Coke

Example of Association Rules:
{Diaper} → {Beer}
{Milk, Bread} → {Eggs, Coke}
{Beer, Bread} → {Milk}

Implication means co-occurrence, not causality!
Brute-force approach:
– List all possible association rules
– Compute the support and confidence for each rule
– Prune rules that fail the minsup and minconf thresholds
Computationally prohibitive!
Computational Complexity
Given d unique items:
– Total number of itemsets = $2^d$
– Total number of possible association rules:

$$R = \sum_{k=1}^{d-1} \binom{d}{k} \sum_{j=1}^{d-k} \binom{d-k}{j} = 3^d - 2^{d+1} + 1$$
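As a quick check, for d = 6 items the formula gives

$$R = 3^6 - 2^7 + 1 = 729 - 128 + 1 = 602$$

possible rules.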
Observations:
– Rules such as {Milk, Diaper} → {Beer}, {Diaper, Beer} → {Milk}, and {Milk} → {Diaper, Beer} are all binary partitions of the same itemset: {Milk, Diaper, Beer}
– Rules originating from the same itemset have identical support but can have different confidence
– Thus, we may decouple the support and confidence requirements (illustrated in the sketch below)
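The following minimal sketch (not from the text) computes support and confidence for rules derived from {Milk, Diaper, Beer} on the five transactions above, showing that the rules share one support value but differ in confidence:

```python
transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]

def support(itemset):
    """Fraction of transactions containing every item of `itemset`."""
    return sum(itemset <= t for t in transactions) / len(transactions)

itemset = {"Milk", "Diaper", "Beer"}
for antecedent in [{"Milk", "Diaper"}, {"Diaper", "Beer"}, {"Milk"}]:
    consequent = itemset - antecedent
    conf = support(itemset) / support(antecedent)
    print(f"{sorted(antecedent)} -> {sorted(consequent)}: "
          f"s={support(itemset):.1f}, c={conf:.2f}")
# All three rules have s = 0.4, but c = 0.67, 0.67, and 0.50.
```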
Mining Association Rules
Two-step approach:
1. Frequent Itemset Generation
– Generate all itemsets whose support ≥ minsup
2. Rule Generation
– Generate high-confidence rules from each frequent itemset, where each rule is a binary partitioning of a frequent itemset
[Figure: lattice of candidate itemsets over items {A, B, C, D, E}; given d items, there are 2^d possible candidate itemsets.]
Apriori principle:
– If an itemset is frequent, then all of its subsets must also
be frequent
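The principle holds due to the anti-monotone property of support:

$$\forall X, Y : \; (X \subseteq Y) \implies s(X) \geq s(Y)$$

i.e., the support of an itemset never exceeds the support of any of its subsets.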
[Figure: itemset lattice rooted at null over items {A, ..., E}. Once an itemset (here {A, B}) is found to be infrequent, its entire subgraph of supersets, up to ABCDE, is pruned.]
Illustrating Apriori Principle
TID   Items
1     Bread, Milk
2     Beer, Bread, Diaper, Eggs
3     Beer, Coke, Diaper, Milk
4     Beer, Bread, Diaper, Milk
5     Bread, Coke, Diaper, Milk

Minimum Support = 3

Items (1-itemsets):

Item     Count
Bread    4
Coke     2
Milk     4
Beer     3
Diaper   4
Eggs     1

Coke and Eggs fall below the minimum support, so by the Apriori principle no candidate containing them needs to be considered further.
Algorithm:
– Let k = 1
– Generate F1 = {frequent 1-itemsets}
– Repeat until Fk is empty:
  – Candidate Generation: generate Lk+1 from Fk
  – Candidate Pruning: prune candidate itemsets in Lk+1 containing subsets of length k that are infrequent
  – Support Counting: count the support of each candidate in Lk+1 by scanning the DB
  – Candidate Elimination: eliminate candidates in Lk+1 that are infrequent, leaving only those that are frequent ⇒ Fk+1
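A minimal Python sketch of this loop, assuming transactions are sets of items and minsup is an absolute count (illustrative only, not the book's reference implementation):

```python
from itertools import combinations

def apriori(transactions, minsup):
    items = {i for t in transactions for i in t}
    # F1: frequent 1-itemsets
    freq = [{frozenset([i]) for i in items
             if sum(i in t for t in transactions) >= minsup}]
    k = 1
    while freq[-1]:
        # Candidate generation: union pairs of frequent k-itemsets whose
        # union has k+1 items (a simple stand-in for the Fk-1 x Fk-1 merge)
        cands = {a | b for a in freq[-1] for b in freq[-1]
                 if len(a | b) == k + 1}
        # Candidate pruning: drop candidates with an infrequent k-subset
        cands = {c for c in cands
                 if all(frozenset(s) in freq[-1] for s in combinations(c, k))}
        # Support counting and candidate elimination
        freq.append({c for c in cands
                     if sum(c <= t for t in transactions) >= minsup})
        k += 1
    return [fk for fk in freq if fk]
```

On the five market-basket transactions above with minsup = 3, this returns four frequent 1-itemsets and four frequent 2-itemsets, and no larger ones.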
Candidate Generation: Fk-1 x Fk-1 Method
Let F3 = {ABC, ABD, ABE, ACD, BCD, BDE, CDE} be the set of frequent 3-itemsets.
Merge two frequent (k−1)-itemsets if their first k−2 items are identical:
– Merge(ABC, ABD) = ABCD
– Merge(ABC, ABE) = ABCE
– Merge(ABD, ABE) = ABDE
Candidate pruning:
– Prune ABCE because ACE and BCE are infrequent
– Prune ABDE because ADE is infrequent
After candidate pruning: L4 = {ABCD}
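A small sketch of this prefix-based merge (the helper name is mine, not the book's):

```python
def merge_prefix(freq_k):
    """freq_k: frequent k-itemsets as sorted tuples; merge pairs that
    agree on their first k-1 items."""
    cands = set()
    for a in freq_k:
        for b in freq_k:
            if a[:-1] == b[:-1] and a[-1] < b[-1]:
                cands.add(a + (b[-1],))
    return cands

F3 = [tuple(s) for s in ("ABC", "ABD", "ABE", "ACD", "BCD", "BDE", "CDE")]
print(sorted("".join(c) for c in merge_prefix(F3)))
# -> ['ABCD', 'ABCE', 'ABDE']; ABCE and ABDE are then removed by
#    candidate pruning, leaving L4 = {ABCD} as above.
```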
Alternate Fk-1 x Fk-1 Method
Let F3 = {ABC, ABD, ABE, ACD, BCD, BDE, CDE} be the set of frequent 3-itemsets.
Here two frequent (k−1)-itemsets are merged when the last k−2 items of the first match the first k−2 items of the second:
– Merge(ABC, BCD) = ABCD
– Merge(ABD, BDE) = ABDE
– Merge(ACD, CDE) = ACDE
– Merge(BCD, CDE) = BCDE
Support Counting of Candidate Itemsets:

TID   Items
1     Bread, Milk
2     Beer, Bread, Diaper, Eggs
3     Beer, Coke, Diaper, Milk
4     Beer, Bread, Diaper, Milk
5     Bread, Coke, Diaper, Milk

Candidate itemsets to be counted:
{Beer, Diaper, Milk}
{Beer, Bread, Diaper}
{Bread, Diaper, Milk}
{Beer, Bread, Milk}
Matching a transaction against the candidates requires enumerating the itemsets it contains.
[Figure: enumerating the 3-item subsets of transaction t = {1, 2, 3, 5, 6} level by level: level 1 fixes the first item (1, 2, or 3), level 2 the second, yielding the ten 3-itemsets 123, 125, 126, 135, 136, 156, 235, 236, 256, 356.]
Generate Hash Tree:
Suppose we have 15 candidate 3-itemsets:
{1 4 5}, {1 2 4}, {4 5 7}, {1 2 5}, {4 5 8}, {1 5 9}, {1 3 6}, {3 4 5}, {3 5 6}, {3 5 7}, {6 8 9}, {3 6 7}, {3 6 8}, {2 3 4}, {5 6 7}
[Figure: the hash tree, built by hashing on successive items of each candidate. Hash function: items 1, 4, 7 go to the left branch; 2, 5, 8 to the middle; 3, 6, 9 to the right.]
Subset Operation Using Hash Tree:
[Figure: matching transaction {1 2 3 5 6} against the hash tree. The root level is expanded as 1+ 2356, 2+ 356, 3+ 56; the next level as 12+ 356, 13+ 56, 15+ 6; and so on, following only branches consistent with the transaction.]
Match transaction against 11 out of 15 candidates
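The sketch below rebuilds this example, assuming the hash function h(item) = (item − 1) mod 3 (so items 1, 4, 7 / 2, 5, 8 / 3, 6, 9 share buckets) and leaves holding at most three candidates:

```python
def build_tree(cands, depth=0, max_leaf=3):
    # Leaf when few candidates remain or every position has been hashed.
    if len(cands) <= max_leaf or depth == len(cands[0]):
        return list(cands)
    buckets = {}
    for c in cands:
        buckets.setdefault((c[depth] - 1) % 3, []).append(c)
    return {h: build_tree(cs, depth + 1) for h, cs in buckets.items()}

def reached(node, t, start=0, seen=None):
    """Candidates in leaves reachable from the sorted transaction t."""
    seen = set() if seen is None else seen
    if isinstance(node, list):
        seen.update(node)
    else:
        for i in range(start, len(t)):
            h = (t[i] - 1) % 3
            if h in node:
                reached(node[h], t, i + 1, seen)
    return seen

cands = [(1,4,5), (1,2,4), (4,5,7), (1,2,5), (4,5,8), (1,5,9), (1,3,6),
         (3,4,5), (3,5,6), (3,5,7), (6,8,9), (3,6,7), (3,6,8), (2,3,4),
         (5,6,7)]
hit = reached(build_tree(cands), (1, 2, 3, 5, 6))
print(len(hit))   # -> 11, matching the "11 out of 15" on the slide
print(sorted(c for c in hit if set(c) <= {1, 2, 3, 5, 6}))
# -> [(1, 2, 5), (1, 3, 6), (3, 5, 6)] are contained in the transaction
```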
Rule Generation
Lattice of rules generated from the frequent itemset {A, B, C, D}:
[Figure: ABCD ⇒ { } at the top; one level down BCD ⇒ A, ACD ⇒ B, ABD ⇒ C, ABC ⇒ D; and so on down to single-item antecedents. If BCD ⇒ A turns out to be a low-confidence rule, all rules with A in the consequent can be pruned.]
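A sketch of this level-wise rule generation for a single frequent itemset; `support` is an assumed dict mapping frozensets to support values:

```python
def gen_rules(itemset, support, minconf):
    itemset = frozenset(itemset)
    rules, consequents = [], [frozenset([i]) for i in itemset]
    while consequents and len(consequents[0]) < len(itemset):
        kept = []
        for cons in consequents:
            ant = itemset - cons
            conf = support[itemset] / support[ant]
            if conf >= minconf:
                rules.append((ant, cons, conf))
                kept.append(cons)  # only high-confidence consequents grow
        # Merge surviving consequents to form the next level, so any rule
        # whose consequent contains a pruned consequent is never generated.
        consequents = list({a | b for a in kept for b in kept
                            if len(a | b) == len(a) + 1})
    return rules
```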
Compact Representation of Frequent Itemsets:
In the accompanying example data set (figure not reproduced), the number of frequent itemsets is

$$3 \times \sum_{k=1}^{10} \binom{10}{k}$$

which motivates a compact representation.
Maximal Frequent Itemset: an itemset is maximal frequent if it is frequent and none of its immediate supersets is frequent.
[Figure: itemset lattice over {A, B, C, D, E} with the border between frequent and infrequent itemsets; the maximal frequent itemsets lie just inside the border.]
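Given the full set of frequent itemsets, the maximal ones can be read off directly; a minimal assumed helper (not from the text):

```python
def maximal(frequent):
    """frequent: a set of frozensets containing all frequent itemsets."""
    # f is maximal iff no other frequent itemset strictly contains it.
    return {f for f in frequent if not any(f < g for g in frequent)}
```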
Example: a binary transaction matrix of 10 transactions over items A–J (figure not reproduced).
– Support threshold (by count) = 5 → Frequent itemsets: {F}; Maximal itemsets: {F}
– Support threshold (by count) = 4 → Frequent itemsets: {E}, {F}, {E,F}, {J}

A second 10-transaction matrix (figure not reproduced):
– Support threshold (by count) = 5 → Maximal itemsets: {A}, {B}, {C}
– Support threshold (by count) = 4 → Maximal itemsets: {A,B}, {A,C}, {B,C}
[Figure: itemset lattice over {A, B, C, D, E}, each node labeled with the IDs of the transactions that contain it; ABCDE is not supported by any transaction. In this example, # Closed = 9 and # Maximal = 4.]
Closed Itemset: an itemset X is closed if none of its immediate supersets has the same support as X.

Example (10 transactions over items A–J; matrix not reproduced):

Itemset    Support (count)   Closed?
{C}        3                 yes
{D}        2                 no ({C,D} has the same support)
{E}        2                 no ({C,E} has the same support)
{C,D}      2                 no ({C,D,E} has the same support)
{C,E}      2                 no ({C,D,E} has the same support)
{D,E}      2                 no ({C,D,E} has the same support)
{C,D,E}    2                 yes
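The table above can be checked mechanically; this sketch marks an itemset closed when no same-support superset exists (if any same-support superset exists, a same-support immediate superset does too, since support only decreases as itemsets grow):

```python
def closed(supports):
    """supports: dict mapping frozensets to support counts."""
    return {x for x, s in supports.items()
            if not any(x < y and supports[y] == s for y in supports)}

supports = {frozenset("C"): 3, frozenset("D"): 2, frozenset("E"): 2,
            frozenset("CD"): 2, frozenset("CE"): 2, frozenset("DE"): 2,
            frozenset("CDE"): 2}
print(sorted("".join(sorted(x)) for x in closed(supports)))
# -> ['C', 'CDE']
```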
Two further examples (matrices not reproduced):
– Closed itemsets: {C,D,E,F}, {C,F}
– Closed itemsets: {C,D,E,F}, {C}, {F}
Relationship among the three representations:
Maximal Frequent Itemsets ⊆ Closed Frequent Itemsets ⊆ Frequent Itemsets
Exercise (for each of several datasets; figures not reproduced):
a. What is the number of frequent itemsets for each dataset? Which dataset will produce the greatest number of frequent itemsets?
b. Which dataset will produce the longest frequent itemset?
c. Which dataset will produce frequent itemsets with the highest maximum support?
d. Which dataset will produce frequent itemsets containing items with widely varying support levels (i.e., itemsets containing items with mixed support, ranging from 20% to more than 70%)?
e. What is the number of maximal frequent itemsets for each dataset? Which dataset will produce the greatest number of maximal frequent itemsets?
f. What is the number of closed frequent itemsets for each dataset? Which dataset will produce the greatest number of closed frequent itemsets?
Drawback of confidence:

          Coffee   No Coffee   Total
Tea         15         5         20
No Tea      75         5         80
Total       90        10        100
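The rule {Tea} → {Coffee} looks strong on its own, yet coffee drinking is even more prevalent overall:

$$c(\{\text{Tea}\} \to \{\text{Coffee}\}) = \frac{15}{20} = 0.75, \qquad s(\{\text{Coffee}\}) = \frac{90}{100} = 0.9$$

Knowing that a person drinks tea actually lowers the estimated probability that they drink coffee.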
The criterion
confidence(X → Y) = support(Y)
is equivalent to:
– P(Y|X) = P(Y)
– P(X,Y) = P(X) P(Y)
$$\text{Lift} = \frac{P(Y \mid X)}{P(Y)}$$

(lift is used for rules, while interest is used for itemsets)

$$\text{Interest} = \frac{P(X, Y)}{P(X)\,P(Y)}$$

$$PS = P(X, Y) - P(X)\,P(Y)$$

$$\phi\text{-coefficient} = \frac{P(X, Y) - P(X)\,P(Y)}{\sqrt{P(X)\,[1 - P(X)]\;P(Y)\,[1 - P(Y)]}}$$
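These measures are easy to compute from a 2×2 contingency table; the sketch below assumes absolute counts f11, f10, f01, f00 with X on the rows and Y on the columns:

```python
from math import sqrt

def measures(f11, f10, f01, f00):
    n = f11 + f10 + f01 + f00
    pxy, px, py = f11 / n, (f11 + f10) / n, (f11 + f01) / n
    lift = (pxy / px) / py        # P(Y|X) / P(Y)
    interest = pxy / (px * py)    # numerically equal to lift on a 2x2 table
    ps = pxy - px * py
    phi = ps / sqrt(px * (1 - px) * py * (1 - py))
    return lift, interest, ps, phi

print(measures(15, 5, 75, 5))     # Tea/Coffee table: lift = interest ~ 0.83
```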
For the Tea/Coffee table above: Lift = 0.75 / 0.9 ≈ 0.83 < 1, so Tea and Coffee are (weakly) negatively associated.
        Y    ¬Y
X      10     0    10
¬X      0    90    90
       10    90   100

Lift = 0.1 / (0.1 × 0.1) = 10

        Y    ¬Y
X      90     0    90
¬X      0    10    10
       90    10   100

Lift = 0.9 / (0.9 × 0.9) ≈ 1.11
Statistical independence:
If P(X,Y)=P(X)P(Y) => Lift = 1
Property under variable permutation: does a measure keep its value when the roles of A and B are swapped (the contingency table is transposed)?

        B    ¬B               A    ¬A
A       p     q           B   p     r
¬A      r     s           ¬B  q     s

Symmetric measures:
support, lift, collective strength, cosine, Jaccard, etc.
Asymmetric measures:
confidence, conviction, Laplace, J-measure, etc.
Property under row/column scaling:
[Figure: grade/gender example, with one column scaled by 2x and the other by 10x.]
Mosteller: the underlying association should be independent of the relative number of male and female students in the samples.
[Figure: an N-transaction binary data matrix over items A–F.]
Property under null addition (adding transactions that contain neither A nor B, increasing s by k):

        B    ¬B               B    ¬B
A       p     q           A   p     q
¬A      r     s           ¬A  r     s + k

Invariant measures:
support, cosine, Jaccard, etc.
Non-invariant measures:
correlation, Gini, mutual information, odds ratio, etc.
Simpson's Paradox
In the aggregate data, customers who buy an HDTV appear more likely to buy an exercise machine. Stratifying by customer group reverses the association:

College students:
c({HDTV = Yes} → {Exercise Machine = Yes}) = 1/10 = 10%
c({HDTV = No} → {Exercise Machine = Yes}) = 4/34 = 11.8%

Working adults:
c({HDTV = Yes} → {Exercise Machine = Yes}) = 98/170 = 57.7%
c({HDTV = No} → {Exercise Machine = Yes}) = 50/86 = 58.1%

Within each group, HDTV buyers are slightly less likely to buy an exercise machine.
Effect of Support Distribution on Association Mining:
[Figure: support distribution of a retail data set; many items have very low support.]
How should the support threshold be set when item supports differ as widely as, say, caviar and milk?
Given an itemset X = {x1, x2, ..., xd} with d items, we can define a measure of cross support, r, for the itemset:

$$r(X) = \frac{\min\{s(x_1), s(x_2), \ldots, s(x_d)\}}{\max\{s(x_1), s(x_2), \ldots, s(x_d)\}}$$

where s(xi) is the support of item xi.
Observation:
conf(caviar → milk) is very high, but conf(milk → caviar) is very low.
Therefore, min( conf(caviar → milk), conf(milk → caviar) ) is also very low.
To avoid patterns whose items have very different support, define a new evaluation measure for itemsets
– Known as h-confidence or all-confidence

To motivate it, consider an itemset X = {x1, x2, ..., xd}:
– What is the lowest confidence rule you can obtain from X?
– Recall conf(X1 → X2) = s(X) / support(X1)
  The numerator is fixed: s(X1 → X2) = s(X1 ∪ X2) = s(X)
  Thus, to find the lowest confidence rule, we need to find the X1 with highest support
– Consider only rules where X1 is a single item, i.e.,
  {x1} → X − {x1}, {x2} → X − {x2}, …, or {xd} → X − {xd}

The minimum confidence over these rules defines h-confidence:

$$hconf(X) = \frac{s(X)}{\max\{s(x_1), s(x_2), \ldots, s(x_d)\}}$$
By the anti-monotone property of support,

$$s(X) \leq \min\{s(x_1), s(x_2), \ldots, s(x_d)\}$$

Thus,

$$hconf(X) = \frac{s(X)}{\max\{s(x_1), \ldots, s(x_d)\}} \leq \frac{\min\{s(x_1), \ldots, s(x_d)\}}{\max\{s(x_1), \ldots, s(x_d)\}} = r(X)$$

Since hconf(X) ≤ r(X), we can eliminate cross-support patterns by finding patterns with h-confidence < hc, a user-set threshold. An itemset X with hconf(X) ≥ hc is called a hyperclique.
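A small sketch of both quantities, with `s` an assumed dict mapping frozensets to support values:

```python
def r(X, s):
    """Cross-support ratio of itemset X."""
    singles = [s[frozenset([x])] for x in X]
    return min(singles) / max(singles)

def hconf(X, s):
    """h-confidence; by the bound above, hconf(X) <= r(X)."""
    return s[frozenset(X)] / max(s[frozenset([x])] for x in X)
```

Hence requiring hconf(X) ≥ hc automatically discards any pattern whose cross-support ratio is below hc.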
Hypercliques are itemsets, but not necessarily frequent itemsets
– Good for finding low-support patterns
H-confidence is anti-monotone
Hypercliques have the high-affinity property
– Think of the individual items as sparse binary vectors
– h-confidence gives us information about their pairwise Jaccard and cosine similarity
Assume x and y are any two items in an itemset X. h-confidence bounds their pairwise similarity:
– Jaccard(x, y) ≥ hconf(X) / 2
– cosine(x, y) ≥ hconf(X)
– Hypercliques that have a high h-confidence consist of very similar items, as measured by Jaccard and cosine
The items in a hyperclique cannot have widely different support
– Allows for more efficient pruning