06Association - Lecture notes 5 PDF

Title	06Association - Lecture notes 5
Course	Data Mining Techniques
Institution	Concordia University
Pages	17
File Size	495.7 KB
File Type	PDF
Total Downloads	70
Total Views	138

Preview

CLICK TO PREVIEW PDF

Summary

data mining lectures...

Description

Introduction to Data Mining

Ch. 6: Mining Frequent Patterns, Association and Correlations Nizar Bouguila Concordia University

Set notations A

1, 2, 3, 2, 4

•  •  •  • 

A = {1, 2, 3, 2, 4} = {1, 2, 2, 3, 4} = {1, 2, 3, 4} 1 ∈"A 4 ∈"A 5 ∉"A {1, 2} ⊂ A {1, 2, 3, 4} ⊆ A {3, 5} ⊄ A A ⊃ {1, 2} {1, 2, 3, 4} ⊆ A

September 26, 2016

Data Mining: Concepts and Techniques

2

Set notations A

B

1

2

•  A = {1, 2, 3, 4} •  A ∪ B = {1, 2, 3, 4, 5, 6}

September 26, 2016

3

4

5

6

B = {3, 4, 5, 6} A ∩ B = {3, 4}

Data Mining: Concepts and Techniques

3

Transactions in Real Applications n฀ 

A large department store often carries more than 100 thousand different kinds of items n฀  Amazon.com carries more than 2M books. n฀  Walmart has more than 20 million transactions per day. n฀  Mining large transaction databases of many items is a real demand

September 26, 2016

Data Mining: Concepts and Techniques

4

What Is Frequent Pattern Analysis? n฀ 

Frequent pattern: a pattern (a set of items, subsequences, substructures, etc.) that occurs frequently in a data set

n฀ 

First proposed by Agrawal, Imielinski, and Swami [AIS93] in the context of frequent itemsets and association rule mining

n฀ 

n฀ 

Motivation: Finding inherent regularities in data n฀ 

What products were often purchased together?— Beer and diapers?!

n฀ 

What are the subsequent purchases after buying a PC?

n฀ 

What kinds of DNA are sensitive to a particular new drug?

Applications n฀ 

Basket data analysis, cross-marketing, catalog design, sale campaign analysis, Web log (click stream) analysis, and DNA sequence analysis.

September 26, 2016

Data Mining: Concepts and Techniques

5

Why Is Freq. Pattern Mining Important? n฀  n฀ 

Discloses an intrinsic and important property of data sets Forms the foundation for many essential data mining tasks n฀ 

Association, correlation, and causality analysis

n฀ 

Sequential, structural (e.g., sub-graph) patterns

n฀ 

Pattern analysis in spatiotemporal, multimedia, timeseries, and stream data

n฀ 

Classification: associative classification

n฀ 

Cluster analysis: frequent pattern-based clustering

September 26, 2016

Data Mining: Concepts and Techniques

6

Basic Concepts: Frequent Patterns Tid

Items bought

10

Beer, Nuts, Diaper

20

Beer, Coffee, Diaper

n฀ 

30

Beer, Diaper, Eggs

n฀ 

40

Nuts, Eggs, Milk

50

Nuts, Coffee, Diaper, Eggs, Milk Customer buys both

Customer buys diaper

n฀ 

n฀ 

n฀ 

Customer buys beer

itemset: A set of one or more items k-itemset X = {x1, …, xk} (absolute) support or support count of X: Frequency or occurrence of an itemset X (relative) support, sup, is the fraction of transactions that contains X (i.e., the probability that a transaction contains X) An itemset X is frequent if X’s support is no less than a minsup threshold 7

Basic Concepts: Association Rules Tid

Items bought

10

Beer, Nuts, Diaper

20

Beer, Coffee, Diaper

30

Beer, Diaper, Eggs

40 50

Nuts, Eggs, Milk

n฀ 

Nuts, Coffee, Diaper, Eggs, Milk Customer buys both

Customer buys beer

Customer buys diaper

Find all the rules X à฀ Y with minimum support and confidence n฀  support, sup , probability that a transaction contains X ∪ Y n฀  confidence, conf, conditional probability that a transaction having X also contains Y

Let minsup = 50%, minconf = 50% Freq. Pat.: Beer:3, Nuts:3, Diaper:4, Eggs:3, {Beer, Diaper}:3 n฀ 

Association rules: Xà฀ Y (sup, conf) n฀  Beer à฀ Diaper (60%, 100%) n฀  Diaper à฀ Beer (60%, 75%) 8

A Naïve Attempt n฀ 

n฀ 

n฀ 

n฀ 

Generate all possible itemsets, test their supports against the database A transaction of length 100 needs to update the support of 2100-1 = 1.27x1030 possible itemsets. How to hold a large number of itemsets into main memory? How to test the supports of a huge number of itemsets against a large database, say containing 100 million transactions? Tid Items bought

September 26, 2016

1

A, B, C

2 3

B, C A, C

… 100000000

C B, C

Data Mining: Concepts and Techniques

9

Chapter 5: Mining Frequent Patterns, Association and Correlations n฀  Basic concepts and a road map n฀  Efficient and scalable frequent itemset mining methods n฀  Mining various kinds of association rules n฀  From association mining to correlation analysis n฀  Constraint-based association mining n฀  Summary September 26, 2016

Data Mining: Concepts and Techniques

10

How to Get an Efficient Method? n฀ 

n฀  n฀ 

Reduce the number of itemsets that need to be checked Check the supports of selected itemsets efficiently Scalable mining methods: Three major approaches n฀  Apriori (Agrawal & Srikant@VLDB’94) n฀  Freq. pattern growth (FPgrowth—Han, Pei & Yin @SIGMOD’00) n฀  Vertical data format approach (Charm—Zaki & Hsiao @SDM’02)

September 26, 2016

Data Mining: Concepts and Techniques

11

The Downward Closure Property n฀ 

n฀ 

Any subset of a frequent itemset must be also frequent – downward closure (apriori) property n฀  If {beer, diaper, nuts} is frequent, then {beer, diaper} must also be frequent. n฀  A transaction containing {beer, diaper, nuts} also contains {beer, diaper}. Items bought In other words, any superset of an Tid 10 Beer, Nuts, Milk, Diaper infrequent itemset must also be 20 Beer, Nuts, Diaper infrequent 30 Beer, Nuts, Diaper n฀  No superset of any infrequent 40 Nuts, Eggs, Milk itemset should be generated or Nuts, Coffee, Diaper, Eggs, Milk 50 tested. n฀  Suppose sup(eggs)=2, sup(beer)=3. n฀  Then sup(egg,beer)...