Title | 06Association - Lecture notes 5 |
---|---|
Course | Data Mining Techniques |
Institution | Concordia University |
Pages | 17 |
File Size | 495.7 KB |
File Type | |
Total Downloads | 70 |
Total Views | 138 |
data mining lectures...
Introduction to Data Mining
Ch. 6: Mining Frequent Patterns, Association and Correlations Nizar Bouguila Concordia University
Set notations A
1, 2, 3, 2, 4
• • • •
A = {1, 2, 3, 2, 4} = {1, 2, 2, 3, 4} = {1, 2, 3, 4} 1 ∈"A 4 ∈"A 5 ∉"A {1, 2} ⊂ A {1, 2, 3, 4} ⊆ A {3, 5} ⊄ A A ⊃ {1, 2} {1, 2, 3, 4} ⊆ A
September 26, 2016
Data Mining: Concepts and Techniques
2
Set notations A
B
1
2
• A = {1, 2, 3, 4} • A ∪ B = {1, 2, 3, 4, 5, 6}
September 26, 2016
3
4
5
6
B = {3, 4, 5, 6} A ∩ B = {3, 4}
Data Mining: Concepts and Techniques
3
Transactions in Real Applications n
A large department store often carries more than 100 thousand different kinds of items n Amazon.com carries more than 2M books. n Walmart has more than 20 million transactions per day. n Mining large transaction databases of many items is a real demand
September 26, 2016
Data Mining: Concepts and Techniques
4
What Is Frequent Pattern Analysis? n
Frequent pattern: a pattern (a set of items, subsequences, substructures, etc.) that occurs frequently in a data set
n
First proposed by Agrawal, Imielinski, and Swami [AIS93] in the context of frequent itemsets and association rule mining
n
n
Motivation: Finding inherent regularities in data n
What products were often purchased together?— Beer and diapers?!
n
What are the subsequent purchases after buying a PC?
n
What kinds of DNA are sensitive to a particular new drug?
Applications n
Basket data analysis, cross-marketing, catalog design, sale campaign analysis, Web log (click stream) analysis, and DNA sequence analysis.
September 26, 2016
Data Mining: Concepts and Techniques
5
Why Is Freq. Pattern Mining Important? n n
Discloses an intrinsic and important property of data sets Forms the foundation for many essential data mining tasks n
Association, correlation, and causality analysis
n
Sequential, structural (e.g., sub-graph) patterns
n
Pattern analysis in spatiotemporal, multimedia, timeseries, and stream data
n
Classification: associative classification
n
Cluster analysis: frequent pattern-based clustering
September 26, 2016
Data Mining: Concepts and Techniques
6
Basic Concepts: Frequent Patterns Tid
Items bought
10
Beer, Nuts, Diaper
20
Beer, Coffee, Diaper
n
30
Beer, Diaper, Eggs
n
40
Nuts, Eggs, Milk
50
Nuts, Coffee, Diaper, Eggs, Milk Customer buys both
Customer buys diaper
n
n
n
Customer buys beer
itemset: A set of one or more items k-itemset X = {x1, …, xk} (absolute) support or support count of X: Frequency or occurrence of an itemset X (relative) support, sup, is the fraction of transactions that contains X (i.e., the probability that a transaction contains X) An itemset X is frequent if X’s support is no less than a minsup threshold 7
Basic Concepts: Association Rules Tid
Items bought
10
Beer, Nuts, Diaper
20
Beer, Coffee, Diaper
30
Beer, Diaper, Eggs
40 50
Nuts, Eggs, Milk
n
Nuts, Coffee, Diaper, Eggs, Milk Customer buys both
Customer buys beer
Customer buys diaper
Find all the rules X à Y with minimum support and confidence n support, sup , probability that a transaction contains X ∪ Y n confidence, conf, conditional probability that a transaction having X also contains Y
Let minsup = 50%, minconf = 50% Freq. Pat.: Beer:3, Nuts:3, Diaper:4, Eggs:3, {Beer, Diaper}:3 n
Association rules: Xà Y (sup, conf) n Beer à Diaper (60%, 100%) n Diaper à Beer (60%, 75%) 8
A Naïve Attempt n
n
n
n
Generate all possible itemsets, test their supports against the database A transaction of length 100 needs to update the support of 2100-1 = 1.27x1030 possible itemsets. How to hold a large number of itemsets into main memory? How to test the supports of a huge number of itemsets against a large database, say containing 100 million transactions? Tid Items bought
September 26, 2016
1
A, B, C
2 3
B, C A, C
… 100000000
C B, C
Data Mining: Concepts and Techniques
9
Chapter 5: Mining Frequent Patterns, Association and Correlations n Basic concepts and a road map n Efficient and scalable frequent itemset mining methods n Mining various kinds of association rules n From association mining to correlation analysis n Constraint-based association mining n Summary September 26, 2016
Data Mining: Concepts and Techniques
10
How to Get an Efficient Method? n
n n
Reduce the number of itemsets that need to be checked Check the supports of selected itemsets efficiently Scalable mining methods: Three major approaches n Apriori (Agrawal & Srikant@VLDB’94) n Freq. pattern growth (FPgrowth—Han, Pei & Yin @SIGMOD’00) n Vertical data format approach (Charm—Zaki & Hsiao @SDM’02)
September 26, 2016
Data Mining: Concepts and Techniques
11
The Downward Closure Property n
n
Any subset of a frequent itemset must be also frequent – downward closure (apriori) property n If {beer, diaper, nuts} is frequent, then {beer, diaper} must also be frequent. n A transaction containing {beer, diaper, nuts} also contains {beer, diaper}. Items bought In other words, any superset of an Tid 10 Beer, Nuts, Milk, Diaper infrequent itemset must also be 20 Beer, Nuts, Diaper infrequent 30 Beer, Nuts, Diaper n No superset of any infrequent 40 Nuts, Eggs, Milk itemset should be generated or Nuts, Coffee, Diaper, Eggs, Milk 50 tested. n Suppose sup(eggs)=2, sup(beer)=3. n Then sup(egg,beer)...