Title | Exam 2012, Data Mining, questions and answers |
---|---|
Course | Data Mining |
Institution | University of Queensland |
Question 1

1.1) [1 mark] Briefly describe the general objective of Association Rules mining.

The objective of association rules mining is to discover interesting relations between items in large transactional databases.

1.2) [1 mark] Describe the general Apriori Principle.

If an itemset is frequent, then all of its subsets must also be frequent. Equivalently, the support of an itemset never exceeds the support of any of its subsets.

1.3) [1 mark] Explain the benefits of applying the Apriori Principle in the context of the Apriori Algorithm for Association Rules mining.

Apriori is an algorithm that discovers association rules by first identifying frequent individual items and then extending them to larger candidate itemsets, subject to a minimum support threshold. The Apriori principle lets the algorithm prune all supersets of any itemset whose support falls below the minimum threshold, which greatly reduces the number of candidates that must be counted in each database scan.
Question 2

Consider the market basket transactions shown in the following table. Assume that min_support = 40% and min_confidence = 70%. Further, assume that the Apriori algorithm is used to discover strong association rules among transaction items.

Transaction ID | Items Bought
---|---
1 | {Bread, Butter, Milk}
2 | {Bread, Butter}
3 | {Beer, Cookies, Diapers}
4 | {Milk, Diapers, Bread, Butter}
5 | {Beer, Diapers}
2.1) [5 marks] Show step by step the generated candidate itemsets (Ck) and the qualified frequent itemsets (Lk) until the largest frequent itemset(s) are generated. Use table C1 as a template. Make sure you clearly identify all the frequent itemsets.

min_support_count = min_support × number_of_transactions = 40% × 5 = 2

C1 (1st scan):

Itemset | Support_count
---|---
Bread | 3
Butter | 3
Milk | 2
Beer | 2
Cookies | 1
Diapers | 3
L1 (frequent 1-itemsets):

Itemset | Support_count
---|---
Bread | 3
Butter | 3
Diapers | 3
Milk | 2
Beer | 2

C2 (candidate 2-itemsets, generated from L1): {Bread, Butter}, {Bread, Diapers}, {Bread, Milk}, {Bread, Beer}, {Butter, Diapers}, {Butter, Milk}, {Butter, Beer}, {Diapers, Milk}, {Diapers, Beer}, {Milk, Beer}
C2 support counts (2nd scan):

Itemset | Support_count
---|---
{Bread, Butter} | 3
{Bread, Diapers} | 1
{Bread, Milk} | 2
{Bread, Beer} | 0
{Butter, Diapers} | 1
{Butter, Milk} | 2
{Butter, Beer} | 0
{Diapers, Milk} | 1
{Diapers, Beer} | 2
{Milk, Beer} | 0
L2 (frequent 2-itemsets):

Itemset | Support_count
---|---
{Bread, Butter} | 3
{Bread, Milk} | 2
{Butter, Milk} | 2
{Diapers, Beer} | 2

C3 (candidate 3-itemsets) and its support count from the 3rd scan:

Itemset | Support_count
---|---
{Bread, Butter, Milk} | 2

L3 = {Bread, Butter, Milk} (support count 2), which is the largest frequent itemset.
2.2) [5 marks] Generate all possible association rules from the frequent itemsets obtained in the previous question. Calculate the confidence of each rule and identify all the strong association rules.

Confidence(X→Y) = P(Y|X) = support(X∪Y) / support(X), min_confidence = 70%

- Bread → Milk = 2/3 = 67%
- Butter → Milk = 2/3 = 67%
- Diapers → Beer = 2/3 = 67%
- Butter → Bread = 3/3 = 100% (strong)
- Milk → Bread = 2/2 = 100% (strong)
- Milk → Butter = 2/2 = 100% (strong)
- Beer → Diapers = 2/2 = 100% (strong)
- Bread → Butter = 3/3 = 100% (strong)
- Bread, Butter → Milk = 2/3 = 67%
- Bread, Milk → Butter = 2/2 = 100% (strong)
- Milk, Butter → Bread = 2/2 = 100% (strong)
- Bread → Butter, Milk = 2/3 = 67%
- Butter → Bread, Milk = 2/3 = 67%
- Milk → Bread, Butter = 2/2 = 100% (strong)

8 strong rules overall.
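The whole of 2.1 and 2.2 can be checked mechanically. The sketch below (plain Python; all names are my own choices, not part of the exam) counts supports by brute force and generates the rules, confirming the 8 strong rules:

```python
# Brute-force support counting and rule generation for Question 2.
from itertools import combinations

transactions = [
    {"Bread", "Butter", "Milk"},
    {"Bread", "Butter"},
    {"Beer", "Cookies", "Diapers"},
    {"Milk", "Diapers", "Bread", "Butter"},
    {"Beer", "Diapers"},
]
min_support_count = 2   # 40% of 5 transactions
min_confidence = 0.7

def support(itemset):
    """Number of transactions containing every item of the itemset."""
    return sum(itemset <= t for t in transactions)

# Frequent itemsets of size 1..3, found by exhaustive counting.
items = sorted(set().union(*transactions))
frequent = {}
for k in (1, 2, 3):
    for cand in combinations(items, k):
        s = support(set(cand))
        if s >= min_support_count:
            frequent[frozenset(cand)] = s

# Generate rules X -> Y from every frequent itemset of size >= 2.
strong = []
for itemset, s in frequent.items():
    if len(itemset) < 2:
        continue
    for r in range(1, len(itemset)):
        for lhs in combinations(sorted(itemset), r):
            conf = s / support(set(lhs))
            if conf >= min_confidence:
                strong.append((set(lhs), itemset - frozenset(lhs), conf))

print(len(strong))   # -> 8 strong rules
```

A real Apriori implementation would prune candidates using the Apriori principle instead of enumerating all combinations; for 6 items the brute-force check is enough to verify the hand computation.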
Question 3

3.1) [1 mark] Briefly discuss the main differences between the data mining tasks of: 1) Classification, and 2) Clustering.

- Classification is a supervised task: it learns a model from labelled training records in order to assign new records to predefined categories.
- Clustering is an unsupervised task: it groups records into categories by similarity, without predefined labels.

3.2) [2 marks] Compare the Precision and Recall metrics for classifier evaluation. Illustrate your answer using the Confusion Matrix.

- Precision measures exactness: what percentage of the tuples that the classifier labelled as positive are actually positive?
  Precision = True Positive / (True Positive + False Positive)
- Recall measures completeness: what percentage of the positive tuples did the classifier label as positive?
  Recall = True Positive / (True Positive + False Negative)

 | Predicted Positive | Predicted Negative
---|---|---
Actual Positive | True Positive | False Negative
Actual Negative | False Positive | True Negative
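The two definitions above translate directly into code. A minimal sketch (the counts are invented for illustration only, not taken from the exam):

```python
# Precision and recall from confusion-matrix counts, as defined in 3.2.
def precision(tp, fp):
    # fraction of predicted positives that are actually positive
    return tp / (tp + fp)

def recall(tp, fn):
    # fraction of actual positives that were predicted positive
    return tp / (tp + fn)

tp, fp, fn = 70, 10, 30   # hypothetical counts
print(precision(tp, fp))  # -> 0.875
print(recall(tp, fn))     # -> 0.7
```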
3.3) [1 mark] What are the differences between lazy and eager classifiers?

- Eager classifiers build models explicitly. For example, decision tree induction creates a model that can then be applied cheaply to new, unseen records as they arrive.
- Lazy classifiers do not build models explicitly; they defer the work to classification time, so classifying an unknown record is relatively expensive.

3.4) [1 mark] What is the main limitation of the Naïve Bayesian Classifier?

It assumes that all attributes are independent given the class. In practice, dependencies often exist among attributes, so this class-conditional independence assumption does not hold and can hurt accuracy.
3.5) [2 marks] In the context of Decision Tree induction, what does overfitting mean? And how to avoid overfitting?

Overfitting in decision trees occurs when the induced tree fits the training data too closely: it grows too many branches, some of which reflect noise or outliers, so it achieves high training accuracy but poor accuracy on unseen samples, especially when the training set is small. To avoid overfitting, tree growth should be stopped before the fully grown tree is created (pre-pruning), for example by stopping at a node if all instances belong to the same class or if all the attribute values are the same; alternatively, branches can be trimmed from the fully grown tree (post-pruning).

Question 4

Consider the one-dimensional dataset shown below:

x | 0.5 | 3.0 | 4.5 | 4.6 | 4.9 | 5.2 | 5.3 | 5.5 | 7.0
---|---|---|---|---|---|---|---|---|---
y | − | − | + | + | + | − | − | + | −
[4 marks] Classify the data point x = 5.0 according to its 1, 3, 5, and 9 nearest neighbors (using majority vote). Clearly justify your classification decisions.

1 nearest neighbor:

x | 4.9
---|---
y | +

1 of 1 neighbors is +, therefore x = +.

3 nearest neighbors:

x | 4.9 | 5.2 | 5.3
---|---|---|---
y | + | − | −

1 of 3 neighbors is +, therefore x = −.

5 nearest neighbors:

x | 4.5 | 4.6 | 4.9 | 5.2 | 5.3
---|---|---|---|---|---
y | + | + | + | − | −

3 of 5 neighbors are +, therefore x = +.

9 nearest neighbors:

x | 0.5 | 3.0 | 4.5 | 4.6 | 4.9 | 5.2 | 5.3 | 5.5 | 7.0
---|---|---|---|---|---|---|---|---|---
y | − | − | + | + | + | − | − | + | −

4 of 9 neighbors are +, therefore x = −.
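The four votes can be reproduced with a short sketch. Note the class labels for 0.5, 3.0, 5.5, and 7.0 are reconstructed from the neighbor counts in the answer tables (the minus signs were lost in extraction), so treat them as an assumption:

```python
# k-NN majority vote for Question 4 on the nine labelled points.
points = [(0.5, "-"), (3.0, "-"), (4.5, "+"), (4.6, "+"), (4.9, "+"),
          (5.2, "-"), (5.3, "-"), (5.5, "+"), (7.0, "-")]

def knn_label(x, k):
    # take the k points closest to x, then vote by majority
    nearest = sorted(points, key=lambda p: abs(p[0] - x))[:k]
    plus = sum(1 for _, y in nearest if y == "+")
    return "+" if plus > k - plus else "-"

for k in (1, 3, 5, 9):
    print(k, knn_label(5.0, k))   # -> +, -, +, -
```

The alternation between + and − as k grows is the point of the exercise: the prediction depends strongly on the neighborhood size.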
Question 5

Consider the dataset shown below:

5.1) [3 marks] Estimate the conditional probabilities for P(A=1|+), P(B=1|+), P(C=1|+), P(A=1|−), P(B=1|−), and P(C=1|−).

- P(A=1|+) = 3/5 = 60%
- P(B=1|+) = 2/5 = 40%
- P(C=1|+) = 4/5 = 80%
- P(A=1|−) = 2/5 = 40%
- P(B=1|−) = 2/5 = 40%
- P(C=1|−) = 1/5 = 20%

5.2) [3 marks] Use the conditional probabilities estimated in the previous question to predict the class label for a test sample (A=1, B=1, C=1) using the naive Bayes approach.

Let X = (A=1, B=1, C=1). Then:
P(X|−) = P(A=1|−) × P(B=1|−) × P(C=1|−) = 0.4 × 0.4 × 0.2 = 0.032
P(X|+) = P(A=1|+) × P(B=1|+) × P(C=1|+) = 0.6 × 0.4 × 0.8 = 0.192
Next we would compare P(X|−) × P(−) and P(X|+) × P(+), but in this instance P(−) = P(+) = 0.5, so the priors have no impact on the outcome. The predicted class label for X is '+', since P(X|+) = 0.192 is higher than P(X|−) = 0.032.
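The 5.2 computation as a sketch, using the conditional probabilities estimated in 5.1 and the equal priors stated in the answer:

```python
# Naive Bayes scoring for the test sample X = (A=1, B=1, C=1).
p_given_pos = {"A": 0.6, "B": 0.4, "C": 0.8}   # P(attr=1 | +) from 5.1
p_given_neg = {"A": 0.4, "B": 0.4, "C": 0.2}   # P(attr=1 | -) from 5.1

def score(cond_probs, prior=0.5):
    p = prior
    for attr in ("A", "B", "C"):
        p *= cond_probs[attr]   # class-conditional independence assumption
    return p

pos = score(p_given_pos)   # 0.5 * 0.192 = 0.096
neg = score(p_given_neg)   # 0.5 * 0.032 = 0.016
print("+" if pos > neg else "-")   # -> +
```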
Question 6

6.1) [2 marks] Describe the problems involved in selecting the initial points of the K-means clustering method. Describe post-processing techniques to solve such problems.

Different choices of initial points can lead K-means to different final clusterings, some of them sub-optimal, because from a poor start the centroids do not always readjust in the 'right' way. Post-processing techniques to overcome this problem include: eliminating 'small' clusters that may represent outliers; splitting 'loose' clusters with relatively high sum of squared error; and merging 'close' clusters with relatively low sum of squared error.

6.2) [1 mark] In Density-based clustering, what are the necessary conditions for two data objects to be density-reachable?

An object p is density-reachable from an object q if there is a chain of objects p1, ..., pn with p1 = q and pn = p such that each pi+1 is directly density-reachable from pi; that is, pi+1 lies within the Eps-neighborhood of pi, and pi is a core object (its Eps-neighborhood contains at least MinPts objects).
Question 7

Consider the 5 data points shown below:
P1: (1,2,3), P2: (0,1,2), P3: (3,0,5), P4: (4,1,3), P5: (5,0,1)

7.1) [5 marks] Apply the K-means clustering algorithm to group those data points into 2 clusters, using the L1 distance measure. (L1 distance is just Manhattan distance: the sum of the absolute differences in each dimension.)

Suppose that the initial centroids are C1: (1,0,0) and C2: (0,1,1). Use the following table as a template to illustrate each of the K-means iterations.

Iteration 1:

Cluster # | Old Centroids | Cluster Elements | New Centroids
---|---|---|---
1 | (1, 0, 0) | P3, P5 | (4, 0, 3)
2 | (0, 1, 1) | P1, P2, P4 | (1.67, 1.33, 2.67)

Iteration 2:

Cluster # | Old Centroids | Cluster Elements | New Centroids
---|---|---|---
1 | (4, 0, 3) | P3, P4, P5 | (4, 0.33, 3)
2 | (1.67, 1.33, 2.67) | P1, P2 | (0.5, 1.5, 2.5)

Iteration 3:

Cluster # | Old Centroids | Cluster Elements | New Centroids
---|---|---|---
1 | (4, 0.33, 3) | P3, P4, P5 | (4, 0.33, 3)
2 | (0.5, 1.5, 2.5) | P1, P2 | (0.5, 1.5, 2.5)
7.2) [1 mark] Explain how the K-means termination condition has been reached.

After iteration 3 the new centroids and cluster elements are the same as the old centroids and cluster elements, so the algorithm has converged.

7.3) [2 marks] For your final clustering, calculate the corresponding SSE (Sum of Squared Error) value.

SSE = sum over clusters Ci of the sum over points x in Ci of dist(x, mi)^2, where mi is the centroid of cluster Ci. Since L1 distance is used here, each term squares the L1 distance from a point to its centroid.

SSE for cluster 1 = SSE(P3) + SSE(P4) + SSE(P5) = 3.33^2 + 0.67^2 + 3.33^2 = 11.0889 + 0.4489 + 11.0889 = 22.6267
SSE for cluster 2 = SSE(P1) + SSE(P2) = 1.5^2 + 1.5^2 = 2.25 + 2.25 = 4.5
Total SSE = 22.6267 + 4.5 = 27.1267
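The iterations and the SSE can be reproduced with a short sketch (names are my own). The hand calculation rounds the centroid coordinates to 3.33 and 0.67; with exact fractions the total comes out slightly higher, about 27.17:

```python
# K-means with L1 (Manhattan) distance for Question 7, plus the
# squared-L1 SSE used in 7.3. Centroids are component-wise means.
points = {"P1": (1, 2, 3), "P2": (0, 1, 2), "P3": (3, 0, 5),
          "P4": (4, 1, 3), "P5": (5, 0, 1)}

def l1(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))

def mean(pts):
    return tuple(sum(c) / len(pts) for c in zip(*pts))

centroids = [(1, 0, 0), (0, 1, 1)]   # initial centroids C1, C2
while True:
    clusters = [[], []]
    for name, p in points.items():
        i = min((0, 1), key=lambda j: l1(p, centroids[j]))
        clusters[i].append(name)
    new = [mean([points[n] for n in c]) for c in clusters]
    if new == centroids:             # termination: centroids unchanged
        break
    centroids = new

sse = sum(l1(points[n], centroids[i]) ** 2
          for i, c in enumerate(clusters) for n in c)
print(clusters, round(sse, 2))   # -> [['P3', 'P4', 'P5'], ['P1', 'P2']] 27.17
```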
Question 8

Consider the following set of 6 one-dimensional data points: 18, 22, 25, 42, 27, 43

[5 marks] Apply the agglomerative hierarchical clustering algorithm to build the hierarchical clustering dendrogram. Merge the clusters using Min distance and update the proximity matrix accordingly. Clearly show the proximity matrix corresponding to each iteration of the algorithm.

Step 1 (each point is its own cluster; the closest pair is {42, 43} at distance 1):

 | 18 | 22 | 25 | 27 | 42 | 43
---|---|---|---|---|---|---
18 | 0 | 4 | 7 | 9 | 24 | 25
22 | 4 | 0 | 3 | 5 | 20 | 21
25 | 7 | 3 | 0 | 2 | 17 | 18
27 | 9 | 5 | 2 | 0 | 15 | 16
42 | 24 | 20 | 17 | 15 | 0 | 1
43 | 25 | 21 | 18 | 16 | 1 | 0

Step 2 (after merging {42, 43}; the closest pair is now {25, 27} at distance 2):

 | 18 | 22 | 25 | 27 | 42,43
---|---|---|---|---|---
18 | 0 | 4 | 7 | 9 | 24
22 | 4 | 0 | 3 | 5 | 20
25 | 7 | 3 | 0 | 2 | 17
27 | 9 | 5 | 2 | 0 | 15
42,43 | 24 | 20 | 17 | 15 | 0

Step 3 (after merging {25, 27}; the closest pair is {22} and {25, 27} at distance 3):

 | 18 | 22 | 25,27 | 42,43
---|---|---|---|---
18 | 0 | 4 | 7 | 24
22 | 4 | 0 | 3 | 20
25,27 | 7 | 3 | 0 | 15
42,43 | 24 | 20 | 15 | 0

Step 4 (after merging {22} with {25, 27}; the closest pair is {18} and {22, 25, 27} at distance 4):

 | 18 | 22,25,27 | 42,43
---|---|---|---
18 | 0 | 4 | 24
22,25,27 | 4 | 0 | 15
42,43 | 24 | 15 | 0

Step 5 (after merging {18} with {22, 25, 27}; the two remaining clusters merge at distance 15):

 | 18,22,25,27 | 42,43
---|---|---
18,22,25,27 | 0 | 15
42,43 | 15 | 0

Step 6 (all points in a single cluster):

 | 18,22,25,27,42,43
---|---
18,22,25,27,42,43 | 0
Dendrogram: {42, 43} merge at height 1, {25, 27} at height 2, {22} joins {25, 27} at height 3, {18} joins {22, 25, 27} at height 4, and the two remaining clusters merge at height 15.
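The merge order and merge heights above can be reproduced with a small single-link sketch (plain Python; the function names are my own):

```python
# Single-link (Min distance) agglomerative clustering for Question 8:
# reproduce the merge order and merge distances of the proximity matrices.
points = [18, 22, 25, 27, 42, 43]
clusters = [[p] for p in points]

def single_link(a, b):
    # Min distance between two clusters of one-dimensional points
    return min(abs(x - y) for x in a for y in b)

merges = []
while len(clusters) > 1:
    # find the closest pair of clusters under Min distance
    i, j = min(((i, j) for i in range(len(clusters))
                for j in range(i + 1, len(clusters))),
               key=lambda ij: single_link(clusters[ij[0]], clusters[ij[1]]))
    merges.append((clusters[i], clusters[j],
                   single_link(clusters[i], clusters[j])))
    clusters[i] = clusters[i] + clusters[j]
    del clusters[j]

for a, b, d in merges:
    print(sorted(a + b), "merged at distance", d)
```

The printed merge distances 1, 2, 3, 4, 15 are exactly the heights at which the dendrogram branches join.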
Question 9

[4 marks] Handling unstructured data is one of the main challenges in Text Mining. Describe in detail the necessary steps that are needed to provide a structured representation of text documents.

To provide a structured representation of text documents, the Vector Space Model is used with Term Frequency-Inverse Document Frequency (TF-IDF) weights:

- Step 1: Extract the text. (Move all of the text into a database.)
- Step 2: Remove stop words. (Remove words like 'the' and 'a'.)
- Step 3: Convert all words to lowercase. (To ensure word comparison will work.)
- Step 4: Apply stemming to all words. (Words like 'stemming' become 'stem'.)
- Step 5: Count the term frequencies. (Get a count for each word in each document.)
- Step 6: Create an indexing file. (One entry per word, used to build the VSM.)
- Step 7: Create the vector space model. (Uses the index file to represent the term frequency in each document.)
- Step 8: Compute inverse document frequency. (IDF(word) = log(total documents / document frequency).)
- Step 9: Compute term weights. (Weight(word) = TF(word) × IDF(word), where TF(word) = number of times the word appears in the document.)
- Step 10: Normalize all documents to unit length. (weight(word) = weight(word) / sqrt(sum of each word's squared weight).)
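Steps 5 through 10 can be sketched on toy data. This is a minimal illustration, assuming the documents are already extracted, lowercased, stop-word-filtered, and stemmed (Steps 1-4), and using log base 10 for the IDF:

```python
# TF-IDF weighting and unit-length normalisation (Steps 5-10 of Q9).
import math

docs = [["data", "mining"], ["data", "rules"], ["text", "mining", "mining"]]
vocab = sorted({w for d in docs for w in d})
n_docs = len(docs)

# Step 8: document frequency and inverse document frequency per term
df = {w: sum(w in d for d in docs) for w in vocab}
idf = {w: math.log10(n_docs / df[w]) for w in vocab}

vectors = []
for d in docs:
    # Steps 5 and 9: term frequency times IDF
    weights = [d.count(w) * idf[w] for w in vocab]
    # Step 10: normalise the document vector to unit length
    norm = math.sqrt(sum(x * x for x in weights))
    vectors.append([x / norm for x in weights])

print(vocab)
print(vectors)
```

Each document ends up as a fixed-length numeric vector over the shared vocabulary, which is the structured representation the question asks for.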
...