Title | Exam 2012, Data Mining, questions and answers |
---|---|
Course | Data Mining |
Institution | University of Queensland |
Question 1

1.1) [1 mark] Briefly describe the general objective of Association Rules mining.

The objective of association rules mining is to discover interesting relations between items in large transactional databases.

1.2) [1 mark] Describe the general Apriori Principle.

If an itemset is frequent, then all of its subsets must also be frequent. Equivalently, the support of an itemset never exceeds the support of any of its subsets.

1.3) [1 mark] Explain the benefits of applying the Apriori Principle in the context of the Apriori Algorithm for Association Rules mining.

Apriori is an algorithm that discovers association rules by first identifying frequent individual items and then extending them to larger candidate itemsets, subject to a minimum support threshold. The Apriori principle lets the algorithm prune all supersets of any itemset whose support falls below the minimum threshold, which greatly reduces the number of candidates that must be counted in each database scan.
Question 2

Consider the market basket transactions shown in the following table. Assume that min_support = 40% and min_confidence = 70%. Further, assume that the Apriori algorithm is used to discover strong association rules among transaction items.

Transaction ID | Items Bought
---|---
1 | {Bread, Butter, Milk}
2 | {Bread, Butter}
3 | {Beer, Cookies, Diapers}
4 | {Milk, Diapers, Bread, Butter}
5 | {Beer, Diapers}
2.1) [5 marks] Show step by step the generated candidate itemsets (Ck) and the qualified frequent itemsets (Lk) until the largest frequent itemset(s) are generated. Use table C1 as a template. Make sure you clearly identify all the frequent itemsets.

min_support_count = min_support × number_of_transactions = 40% × 5 = 2

C1 (1st scan):

Itemset | Support_count
---|---
Bread | 3
Butter | 3
Milk | 2
Beer | 2
Cookies | 1
Diapers | 3
L1 (frequent 1-itemsets):

Itemset | Support_count
---|---
Bread | 3
Butter | 3
Diapers | 3
Milk | 2
Beer | 2

C2 (candidate 2-itemsets, generated from L1): {Bread, Butter}, {Bread, Diapers}, {Bread, Milk}, {Bread, Beer}, {Butter, Diapers}, {Butter, Milk}, {Butter, Beer}, {Diapers, Milk}, {Diapers, Beer}, {Milk, Beer}
C2 support counts (2nd scan):

Itemset | Support_count
---|---
{Bread, Butter} | 3
{Bread, Diapers} | 1
{Bread, Milk} | 2
{Bread, Beer} | 0
{Butter, Diapers} | 1
{Butter, Milk} | 2
{Butter, Beer} | 0
{Diapers, Milk} | 1
{Diapers, Beer} | 2
{Milk, Beer} | 0
L2 (frequent 2-itemsets):

Itemset | Support_count
---|---
{Bread, Butter} | 3
{Bread, Milk} | 2
{Butter, Milk} | 2
{Diapers, Beer} | 2

C3 (candidate 3-itemsets) and its support count from the 3rd scan:

Itemset | Support_count
---|---
{Bread, Butter, Milk} | 2

L3 = {Bread, Butter, Milk} (support count 2), which is the largest frequent itemset.
2.2) [5 marks] Generate all possible association rules from the frequent itemsets obtained in the previous question. Calculate the confidence of each rule and identify all the strong association rules.

Confidence(X→Y) = P(Y|X) = support(X∪Y) / support(X), min_confidence = 70%

- Bread → Milk = 2/3 = 67%
- Butter → Milk = 2/3 = 67%
- Diapers → Beer = 2/3 = 67%
- Butter → Bread = 3/3 = 100% (strong)
- Milk → Bread = 2/2 = 100% (strong)
- Milk → Butter = 2/2 = 100% (strong)
- Beer → Diapers = 2/2 = 100% (strong)
- Bread → Butter = 3/3 = 100% (strong)
- Bread, Butter → Milk = 2/3 = 67%
- Bread, Milk → Butter = 2/2 = 100% (strong)
- Milk, Butter → Bread = 2/2 = 100% (strong)
- Bread → Butter, Milk = 2/3 = 67%
- Butter → Bread, Milk = 2/3 = 67%
- Milk → Bread, Butter = 2/2 = 100% (strong)

8 strong rules overall.
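The whole of 2.1 and 2.2 can be checked mechanically. The sketch below (plain Python; all names are my own choices, not part of the exam) counts supports by brute force and generates the rules, confirming the 8 strong rules:

```python
# Brute-force support counting and rule generation for Question 2.
from itertools import combinations

transactions = [
    {"Bread", "Butter", "Milk"},
    {"Bread", "Butter"},
    {"Beer", "Cookies", "Diapers"},
    {"Milk", "Diapers", "Bread", "Butter"},
    {"Beer", "Diapers"},
]
min_support_count = 2   # 40% of 5 transactions
min_confidence = 0.7

def support(itemset):
    """Number of transactions containing every item of the itemset."""
    return sum(itemset <= t for t in transactions)

# Frequent itemsets of size 1..3, found by exhaustive counting.
items = sorted(set().union(*transactions))
frequent = {}
for k in (1, 2, 3):
    for cand in combinations(items, k):
        s = support(set(cand))
        if s >= min_support_count:
            frequent[frozenset(cand)] = s

# Generate rules X -> Y from every frequent itemset of size >= 2.
strong = []
for itemset, s in frequent.items():
    if len(itemset) < 2:
        continue
    for r in range(1, len(itemset)):
        for lhs in combinations(sorted(itemset), r):
            conf = s / support(set(lhs))
            if conf >= min_confidence:
                strong.append((set(lhs), itemset - frozenset(lhs), conf))

print(len(strong))   # -> 8 strong rules
```

A real Apriori implementation would prune candidates using the Apriori principle instead of enumerating all combinations; for 6 items the brute-force check is enough to verify the hand computation.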
Question 3

3.1) [1 mark] Briefly discuss the main differences between the data mining tasks of: 1) Classification, and 2) Clustering.

- Classification is a supervised task: it learns a model from labelled training records in order to assign new records to predefined categories.
- Clustering is an unsupervised task: it groups records into categories by similarity, without predefined labels.

3.2) [2 marks] Compare the Precision and Recall metrics for classifier evaluation. Illustrate your answer using the Confusion Matrix.

- Precision measures exactness: what percentage of the tuples that the classifier labelled as positive are actually positive?
  Precision = True Positive / (True Positive + False Positive)
- Recall measures completeness: what percentage of the positive tuples did the classifier label as positive?
  Recall = True Positive / (True Positive + False Negative)

 | Predicted Positive | Predicted Negative
---|---|---
Actual Positive | True Positive | False Negative
Actual Negative | False Positive | True Negative
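The two definitions above translate directly into code. A minimal sketch (the counts are invented for illustration only, not taken from the exam):

```python
# Precision and recall from confusion-matrix counts, as defined in 3.2.
def precision(tp, fp):
    # fraction of predicted positives that are actually positive
    return tp / (tp + fp)

def recall(tp, fn):
    # fraction of actual positives that were predicted positive
    return tp / (tp + fn)

tp, fp, fn = 70, 10, 30   # hypothetical counts
print(precision(tp, fp))  # -> 0.875
print(recall(tp, fn))     # -> 0.7
```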
3.3) [1 mark] What are the differences between lazy and eager classifiers?

- Eager classifiers build models explicitly. For example, decision tree induction creates a model that can then be applied cheaply to new, unseen records as they arrive.
- Lazy classifiers do not build models explicitly; they defer the work to classification time, so classifying an unknown record is relatively expensive.

3.4) [1 mark] What is the main limitation of the Naïve Bayesian Classifier?

It assumes that all attributes are independent given the class. In practice, dependencies often exist among attributes, so this class-conditional independence assumption does not hold and can hurt accuracy.
3.5) [2 marks] In the context of Decision Tree induction, what does overfitting mean? And how to avoid overfitting?

Overfitting in decision trees occurs when the induced tree fits the training data too closely: it grows too many branches, some of which reflect noise or outliers, so it achieves high training accuracy but poor accuracy on unseen samples, especially when the training set is small. To avoid overfitting, tree growth should be stopped before the fully grown tree is created (pre-pruning), for example by stopping at a node if all instances belong to the same class or if all the attribute values are the same; alternatively, branches can be trimmed from the fully grown tree (post-pruning).

Question 4

Consider the one-dimensional dataset shown below:

x | 0.5 | 3.0 | 4.5 | 4.6 | 4.9 | 5.2 | 5.3 | 5.5 | 7.0
---|---|---|---|---|---|---|---|---|---
y | − | − | + | + | + | − | − | + | −
[4 marks] Classify the data point x = 5.0 according to its 1, 3, 5, and 9 nearest neighbors (using majority vote). Clearly justify your classification decisions.

1 nearest neighbor:

x | 4.9
---|---
y | +

1 of 1 neighbors is +, therefore x = +.

3 nearest neighbors:

x | 4.9 | 5.2 | 5.3
---|---|---|---
y | + | − | −

1 of 3 neighbors is +, therefore x = −.

5 nearest neighbors:

x | 4.5 | 4.6 | 4.9 | 5.2 | 5.3
---|---|---|---|---|---
y | + | + | + | − | −

3 of 5 neighbors are +, therefore x = +.

9 nearest neighbors:

x | 0.5 | 3.0 | 4.5 | 4.6 | 4.9 | 5.2 | 5.3 | 5.5 | 7.0
---|---|---|---|---|---|---|---|---|---
y | − | − | + | + | + | − | − | + | −

4 of 9 neighbors are +, therefore x = −.
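The four votes can be reproduced with a short sketch. Note the class labels for 0.5, 3.0, 5.5, and 7.0 are reconstructed from the neighbor counts in the answer tables (the minus signs were lost in extraction), so treat them as an assumption:

```python
# k-NN majority vote for Question 4 on the nine labelled points.
points = [(0.5, "-"), (3.0, "-"), (4.5, "+"), (4.6, "+"), (4.9, "+"),
          (5.2, "-"), (5.3, "-"), (5.5, "+"), (7.0, "-")]

def knn_label(x, k):
    # take the k points closest to x, then vote by majority
    nearest = sorted(points, key=lambda p: abs(p[0] - x))[:k]
    plus = sum(1 for _, y in nearest if y == "+")
    return "+" if plus > k - plus else "-"

for k in (1, 3, 5, 9):
    print(k, knn_label(5.0, k))   # -> +, -, +, -
```

The alternation between + and − as k grows is the point of the exercise: the prediction depends strongly on the neighborhood size.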
Question 5

Consider the dataset shown below:

5.1) [3 marks] Estimate the conditional probabilities for P(A=1|+), P(B=1|+), P(C=1|+), P(A=1|−), P(B=1|−), and P(C=1|−).

- P(A=1|+) = 3/5 = 60%
- P(B=1|+) = 2/5 = 40%
- P(C=1|+) = 4/5 = 80%
- P(A=1|−) = 2/5 = 40%
- P(B=1|−) = 2/5 = 40%
- P(C=1|−) = 1/5 = 20%

5.2) [3 marks] Use the conditional probabilities estimated in the previous question to predict the class label for a test sample (A=1, B=1, C=1) using the naive Bayes approach.

Let X = (A=1, B=1, C=1). Then:
P(X|−) = P(A=1|−) × P(B=1|−) × P(C=1|−) = 0.4 × 0.4 × 0.2 = 0.032
P(X|+) = P(A=1|+) × P(B=1|+) × P(C=1|+) = 0.6 × 0.4 × 0.8 = 0.192
Next we would compare P(X|−) × P(−) and P(X|+) × P(+), but in this instance P(−) = P(+) = 0.5, so the priors have no impact on the outcome. The predicted class label for X is '+', since P(X|+) = 0.192 is higher than P(X|−) = 0.032.
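The 5.2 computation as a sketch, using the conditional probabilities estimated in 5.1 and the equal priors stated in the answer:

```python
# Naive Bayes scoring for the test sample X = (A=1, B=1, C=1).
p_given_pos = {"A": 0.6, "B": 0.4, "C": 0.8}   # P(attr=1 | +) from 5.1
p_given_neg = {"A": 0.4, "B": 0.4, "C": 0.2}   # P(attr=1 | -) from 5.1

def score(cond_probs, prior=0.5):
    p = prior
    for attr in ("A", "B", "C"):
        p *= cond_probs[attr]   # class-conditional independence assumption
    return p

pos = score(p_given_pos)   # 0.5 * 0.192 = 0.096
neg = score(p_given_neg)   # 0.5 * 0.032 = 0.016
print("+" if pos > neg else "-")   # -> +
```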
Question 6

6.1) [2 marks] Describe the problems involved in selecting the initial points of the K-means clustering method. Describe post-processing techniques to solve such problems.

Different choices of initial points can lead K-means to different final clusterings, some of them sub-optimal, because from a poor start the centroids do not always readjust in the 'right' way. Post-processing techniques to overcome this problem include: eliminating 'small' clusters that may represent outliers; splitting 'loose' clusters with relatively high sum of squared error; and merging 'close' clusters with relatively low sum of squared error.

6.2) [1 mark] In Density-based clustering, what are the necessary conditions for two data objects to be density-reachable?

An object p is density-reachable from an object q if there is a chain of objects p1, ..., pn with p1 = q and pn = p such that each pi+1 is directly density-reachable from pi; that is, pi+1 lies within the Eps-neighborhood of pi, and pi is a core object (its Eps-neighborhood contains at least MinPts objects).
Question 7

Consider the 5 data points shown below:
P1: (1,2,3), P2: (0,1,2), P3: (3,0,5), P4: (4,1,3), P5: (5,0,1)

7.1) [5 marks] Apply the K-means clustering algorithm to group those data points into 2 clusters, using the L1 distance measure. (L1 distance is just Manhattan distance: the sum of the absolute differences in each dimension.)

Suppose that the initial centroids are C1: (1,0,0) and C2: (0,1,1). Use the following table as a template to illustrate each of the K-means iterations.

Iteration 1:

Cluster # | Old Centroids | Cluster Elements | New Centroids
---|---|---|---
1 | (1, 0, 0) | P3, P5 | (4, 0, 3)
2 | (0, 1, 1) | P1, P2, P4 | (1.67, 1.33, 2.67)

Iteration 2:

Cluster # | Old Centroids | Cluster Elements | New Centroids
---|---|---|---
1 | (4, 0, 3) | P3, P4, P5 | (4, 0.33, 3)
2 | (1.67, 1.33, 2.67) | P1, P2 | (0.5, 1.5, 2.5)

Iteration 3:

Cluster # | Old Centroids | Cluster Elements | New Centroids
---|---|---|---
1 | (4, 0.33, 3) | P3, P4, P5 | (4, 0.33, 3)
2 | (0.5, 1.5, 2.5) | P1, P2 | (0.5, 1.5, 2.5)
7.2) [1 mark] Explain how the K-means termination condition has been reached.

After iteration 3 the new centroids and cluster elements are the same as the old centroids and cluster elements, so the algorithm has converged.

7.3) [2 marks] For your final clustering, calculate the corresponding SSE (Sum of Squared Error) value.

SSE = sum over clusters Ci of the sum over points x in Ci of dist(x, mi)^2, where mi is the centroid of cluster Ci. Since L1 distance is used here, each term squares the L1 distance from a point to its centroid.

SSE for cluster 1 = SSE(P3) + SSE(P4) + SSE(P5) = 3.33^2 + 0.67^2 + 3.33^2 = 11.0889 + 0.4489 + 11.0889 = 22.6267
SSE for cluster 2 = SSE(P1) + SSE(P2) = 1.5^2 + 1.5^2 = 2.25 + 2.25 = 4.5
Total SSE = 22.6267 + 4.5 = 27.1267
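The iterations and the SSE can be reproduced with a short sketch (names are my own). The hand calculation rounds the centroid coordinates to 3.33 and 0.67; with exact fractions the total comes out slightly higher, about 27.17:

```python
# K-means with L1 (Manhattan) distance for Question 7, plus the
# squared-L1 SSE used in 7.3. Centroids are component-wise means.
points = {"P1": (1, 2, 3), "P2": (0, 1, 2), "P3": (3, 0, 5),
          "P4": (4, 1, 3), "P5": (5, 0, 1)}

def l1(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))

def mean(pts):
    return tuple(sum(c) / len(pts) for c in zip(*pts))

centroids = [(1, 0, 0), (0, 1, 1)]   # initial centroids C1, C2
while True:
    clusters = [[], []]
    for name, p in points.items():
        i = min((0, 1), key=lambda j: l1(p, centroids[j]))
        clusters[i].append(name)
    new = [mean([points[n] for n in c]) for c in clusters]
    if new == centroids:             # termination: centroids unchanged
        break
    centroids = new

sse = sum(l1(points[n], centroids[i]) ** 2
          for i, c in enumerate(clusters) for n in c)
print(clusters, round(sse, 2))   # -> [['P3', 'P4', 'P5'], ['P1', 'P2']] 27.17
```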
Question 8

Consider the following set of 6 one-dimensional data points: 18, 22, 25, 42, 27, 43

[5 marks] Apply the agglomerative hierarchical clustering algorithm to build the hierarchical clustering dendrogram. Merge the clusters using Min distance and update the proximity matrix accordingly. Clearly show the proximity matrix corresponding to each iteration of the algorithm.

Step 1 (each point is its own cluster; the closest pair is {42, 43} at distance 1):

 | 18 | 22 | 25 | 27 | 42 | 43
---|---|---|---|---|---|---
18 | 0 | 4 | 7 | 9 | 24 | 25
22 | 4 | 0 | 3 | 5 | 20 | 21
25 | 7 | 3 | 0 | 2 | 17 | 18
27 | 9 | 5 | 2 | 0 | 15 | 16
42 | 24 | 20 | 17 | 15 | 0 | 1
43 | 25 | 21 | 18 | 16 | 1 | 0

Step 2 (after merging {42, 43}; the closest pair is now {25, 27} at distance 2):

 | 18 | 22 | 25 | 27 | 42,43
---|---|---|---|---|---
18 | 0 | 4 | 7 | 9 | 24
22 | 4 | 0 | 3 | 5 | 20
25 | 7 | 3 | 0 | 2 | 17
27 | 9 | 5 | 2 | 0 | 15
42,43 | 24 | 20 | 17 | 15 | 0

Step 3 (after merging {25, 27}; the closest pair is {22} and {25, 27} at distance 3):

 | 18 | 22 | 25,27 | 42,43
---|---|---|---|---
18 | 0 | 4 | 7 | 24
22 | 4 | 0 | 3 | 20
25,27 | 7 | 3 | 0 | 15
42,43 | 24 | 20 | 15 | 0

Step 4 (after merging {22} with {25, 27}; the closest pair is {18} and {22, 25, 27} at distance 4):

 | 18 | 22,25,27 | 42,43
---|---|---|---
18 | 0 | 4 | 24
22,25,27 | 4 | 0 | 15
42,43 | 24 | 15 | 0

Step 5 (after merging {18} with {22, 25, 27}; the two remaining clusters merge at distance 15):

 | 18,22,25,27 | 42,43
---|---|---
18,22,25,27 | 0 | 15
42,43 | 15 | 0

Step 6 (all points in a single cluster):

 | 18,22,25,27,42,43
---|---
18,22,25,27,42,43 | 0
Dendrogram: {42, 43} merge at height 1, {25, 27} at height 2, {22} joins {25, 27} at height 3, {18} joins {22, 25, 27} at height 4, and the two remaining clusters merge at height 15.
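The merge order and merge heights above can be reproduced with a small single-link sketch (plain Python; the function names are my own):

```python
# Single-link (Min distance) agglomerative clustering for Question 8:
# reproduce the merge order and merge distances of the proximity matrices.
points = [18, 22, 25, 27, 42, 43]
clusters = [[p] for p in points]

def single_link(a, b):
    # Min distance between two clusters of one-dimensional points
    return min(abs(x - y) for x in a for y in b)

merges = []
while len(clusters) > 1:
    # find the closest pair of clusters under Min distance
    i, j = min(((i, j) for i in range(len(clusters))
                for j in range(i + 1, len(clusters))),
               key=lambda ij: single_link(clusters[ij[0]], clusters[ij[1]]))
    merges.append((clusters[i], clusters[j],
                   single_link(clusters[i], clusters[j])))
    clusters[i] = clusters[i] + clusters[j]
    del clusters[j]

for a, b, d in merges:
    print(sorted(a + b), "merged at distance", d)
```

The printed merge distances 1, 2, 3, 4, 15 are exactly the heights at which the dendrogram branches join.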
Question 9

[4 marks] Handling unstructured data is one of the main challenges in Text Mining. Describe in detail the necessary steps that are needed to provide a structured representation of text documents.

To provide a structured representation of text documents, the Vector Space Model is used with Term Frequency-Inverse Document Frequency (TF-IDF) weights:

- Step 1: Extract the text. (Move all of the text into a database.)
- Step 2: Remove stop words. (Remove words like 'the' and 'a'.)
- Step 3: Convert all words to lowercase. (To ensure word comparison will work.)
- Step 4: Apply stemming to all words. (Words like 'stemming' become 'stem'.)
- Step 5: Count the term frequencies. (Get a count for each word in each document.)
- Step 6: Create an indexing file. (One entry per word, used to build the VSM.)
- Step 7: Create the vector space model. (Uses the index file to represent the term frequency in each document.)
- Step 8: Compute inverse document frequency. (IDF(word) = log(total documents / document frequency).)
- Step 9: Compute term weights. (Weight(word) = TF(word) × IDF(word), where TF(word) = number of times the word appears in the document.)
- Step 10: Normalize all documents to unit length. (weight(word) = weight(word) / sqrt(sum of each word's squared weight).)
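Steps 5 through 10 can be sketched on toy data. This is a minimal illustration, assuming the documents are already extracted, lowercased, stop-word-filtered, and stemmed (Steps 1-4), and using log base 10 for the IDF:

```python
# TF-IDF weighting and unit-length normalisation (Steps 5-10 of Q9).
import math

docs = [["data", "mining"], ["data", "rules"], ["text", "mining", "mining"]]
vocab = sorted({w for d in docs for w in d})
n_docs = len(docs)

# Step 8: document frequency and inverse document frequency per term
df = {w: sum(w in d for d in docs) for w in vocab}
idf = {w: math.log10(n_docs / df[w]) for w in vocab}

vectors = []
for d in docs:
    # Steps 5 and 9: term frequency times IDF
    weights = [d.count(w) * idf[w] for w in vocab]
    # Step 10: normalise the document vector to unit length
    norm = math.sqrt(sum(x * x for x in weights))
    vectors.append([x / norm for x in weights])

print(vocab)
print(vectors)
```

Each document ends up as a fixed-length numeric vector over the shared vocabulary, which is the structured representation the question asks for.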
...