Title | Practical - SS2015 - Data Preprocessing with RapidMiner |
---|---|
Course | Business Intelligence und Management-Unterstützungssysteme |
Institution | Humboldt-Universität zu Berlin |
Management Support Systems & Business Intelligence
Summer Semester 2015
Exercises: Data Preprocessing with RapidMiner
Ben Fabian & Stefan Lessmann
Agenda
Data Preprocessing with RapidMiner
- Introduction to the Loan Data Set
- Data Integration and Missing Values
- Data Transformation
  - Continuous variables
  - Categorical variables
- Data Reduction Using Sampling
- Managing Complex Processes
17‐Jun‐15
Management Support Systems & Business Intelligence, Data Understanding/Preprocessing, Fabian/Lessmann
Data Preprocessing with RapidMiner
Learning Objectives
After completing this chapter, students will be able to:
- Handle missing values through mean/mode replacement
- Create new, derived attributes using the Generate Attributes operator
- Perform data transformation operations in the form of scaling, discretization, and dummy coding
- Draw random and stratified samples from a data set
- Structure complex data mining processes using nested chains
Introduction to the Loan Data Set
- Data referring to a credit product
  - Binary prediction task: will a credit applicant default (1) or not (0)?
  - 1225 cases (applications) and 14 independent variables
  - Source: Thomas, L. C., Edelman, D. B., & Crook, J. N. (2002). Credit Scoring and its Applications. Philadelphia: SIAM.
- Files
  - Data set: public_us.csv
  - Data dictionary: publicdict.xls
  - Both available in Moodle
- The data accompanies the above book and is not meant to be shared and/or distributed in any way
Introduction to the Loan Data Set
Exercise 1
- Download the loan data from Moodle to your PC
- Import the CSV file
  - Assign column BAD the role of the dependent variable. Make sure its data type is set to binominal
  - Most other variables are of type real; variables AES and RES are of type polynominal (RapidMiner should detect that automatically); variable PHON should be binominal
  - Store the data in your local repository under the name LOAN
- Read the data dictionary and perform some exploratory analysis to familiarize yourself with the data
Data Integration and Missing Values
Exercise 2
- Correct the entries with value 99 in the YOB variable. Assign these entries the role of a missing value
  - To that end, you can make use of the Declare Missing Value operator
  - Check the maximal value of the YOB variable after the transformation to verify that the value 99 no longer exists in the data. You can also re-examine the histogram plot
- Create a new variable AGE
  - Use the operator Generate Attributes, which allows you to create new variables through mathematical expressions
  - Configure the Generate Attributes operator such that it calculates the age of an applicant from his/her year of birth
  - Assume the data was collected in 2002. Hence, a value YOB = 80 implies that the applicant is 22 years old
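The logic behind Exercise 2 can be sketched outside RapidMiner in a few lines of plain Python. This is only an illustration, not the RapidMiner configuration itself: the sample YOB values and the helper names `declare_missing` and `age_from_yob` are made up for the example, and the code assumes that two-digit YOB values refer to the years 19xx.

```python
# Illustrative sketch only; the exercise itself is solved with RapidMiner operators.
MISSING_CODE = 99        # placeholder code used in YOB (from the exercise)
COLLECTION_YEAR = 2002   # year of data collection assumed in the exercise

def declare_missing(yob_values, code=MISSING_CODE):
    """Turn the placeholder code into a real missing value (None)."""
    return [None if v == code else v for v in yob_values]

def age_from_yob(yob, collection_year=COLLECTION_YEAR):
    """Age at data collection; assumes births in 19xx. Missing stays missing."""
    if yob is None:
        return None
    return collection_year - (1900 + yob)

yob_raw = [80, 99, 45, 63]          # hypothetical sample values
yob = declare_missing(yob_raw)
age = [age_from_yob(v) for v in yob]

# Verification step from the exercise: 99 must no longer be the maximum.
max_yob = max(v for v in yob if v is not None)
print(yob, age, max_yob)
```

Under the 19xx assumption the same calculation can be written as AGE = 102 - YOB, which is the kind of expression one would enter in Generate Attributes.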
Data Integration and Missing Values
Missing Values
- RapidMiner features many approaches to handle missing values
- Operators are available in the operator window under the tree Data Transformation > Data Cleansing
- Replace operators implement mean or mode replacement of missing values
- Impute operators allow you to develop a model to predict missing values
Data Integration and Missing Values
Exercise 3
- Replace all missing values in the data set with their average
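Mean replacement itself is simple arithmetic. The following plain-Python sketch shows the idea for a single numeric column; the column values and the helper name `replace_missing_with_mean` are hypothetical, and missing values are represented as `None`. For nominal attributes the analogous step would use the mode instead of the mean.

```python
from statistics import mean

def replace_missing_with_mean(values):
    """Fill None entries with the mean of the observed (non-missing) entries."""
    observed = [v for v in values if v is not None]
    if not observed:
        # Nothing observed, so there is no mean to fill in; return a copy unchanged.
        return list(values)
    avg = mean(observed)
    return [avg if v is None else v for v in values]

income = [1000.0, None, 3000.0, 2000.0]   # hypothetical column with one gap
filled = replace_missing_with_mean(income)
print(filled)
```

The mean is computed from the observed values only, so the filled column keeps the same average as before the replacement.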
Data Transformation
- We covered different methods to transform the values of continuous and categorical attributes in the lecture
  - Continuous attributes
    - Scaling in the form of min/max scaling and the z-transformation
    - Discretization
  - Categorical attributes
    - Consecutive numbering (bad idea)
    - Dummy coding
- Subsequent exercises explore how RapidMiner supports these functions
Data Transformation: Continuous Variables
Exercise 4a)
- Many variable names contain either the text INC (income) or OUT (outgoings, i.e., expenditures). The values in these variables represent some amount of money. Scale the values in (only!) these variables such that every variable has a minimum of -1 and a maximum of +1.
- Use the operator Normalize for this purpose.
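Range scaling to [-1, +1] maps each value x to -1 + 2 (x - min) / (max - min). The plain-Python sketch below applies this only to columns whose names contain INC or OUT. It is an illustration of the arithmetic, not of the Normalize operator's internals; the column names, values, and helper names are assumptions made for the example.

```python
def scale_to_range(values, new_min=-1.0, new_max=1.0):
    """Min/max scaling of a list of numbers to [new_min, new_max]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # Constant column: the formula would divide by zero.
        # Mapping everything to new_min is an arbitrary choice for this sketch.
        return [new_min for _ in values]
    span = new_max - new_min
    return [new_min + span * (v - lo) / (hi - lo) for v in values]

def is_money_column(name):
    """Select attributes by name, as the exercise asks (INC or OUT)."""
    return "INC" in name or "OUT" in name

# Hypothetical example data; names only mimic the INC/OUT naming pattern.
data = {
    "DAINC": [0.0, 500.0, 1000.0],
    "DOUTM": [200.0, 300.0, 400.0],
    "AGE": [22.0, 57.0, 39.0],
}
scaled = {
    name: scale_to_range(col) if is_money_column(name) else col
    for name, col in data.items()
}
print(scaled)
```

Columns that do not match the name filter, such as AGE here, pass through unchanged, which mirrors the "only these variables" requirement of the exercise.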
Data Transformation: Continuous Variables
Exercise 4b)
- Examine the distribution of the variables DHVAL (value of the house) and DMORT (mortgage value). Their distribution is very odd. The explanation is available in the data dictionary. A value of zero has a special meaning. As you can see, such codes distort the distribution of the variable. To correct this:
  - For DHVAL, create a new dummy variable DHVAL_d. This variable should be zero if DHVAL is zero and one otherwise.
  - Repeat the previous step for DMORT
- After creating the dummies, discretize the variables DHVAL and DMORT using equal frequency binning with 3 bins.
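Both steps of Exercise 4b can be illustrated in plain Python. The sketch below builds the zero-flag dummy and then assigns each value to one of three bins of (nearly) equal size by rank. The sample values and helper names are hypothetical, and the rank-based rule is one simple way to realize equal frequency binning; RapidMiner's own operator may place bin boundaries differently, in particular when many values are tied.

```python
def zero_flag(values):
    """Dummy variable: 0 if the original value is zero, 1 otherwise."""
    return [0 if v == 0 else 1 for v in values]

def equal_frequency_bins(values, n_bins=3):
    """Assign bin indices 0..n_bins-1 so that each bin holds about the same
    number of cases. Ties are split by position, which can separate equal values."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    bins = [0] * len(values)
    for rank, idx in enumerate(order):
        bins[idx] = rank * n_bins // len(values)
    return bins

dhval = [0, 120, 0, 60, 300, 90]      # hypothetical house values, 0 = special code
dhval_d = zero_flag(dhval)            # corresponds to DHVAL_d in the exercise
dhval_bin = equal_frequency_bins(dhval)
print(dhval_d, dhval_bin)
```

In this toy example the two zero codes fall into the lowest bin together, while the dummy DHVAL_d keeps the information that they were special codes rather than genuine small values.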
Data Transformation: Categorical Variables
Exercise 5
- Your data set contains two (poly)nominal variables, AES (applicant's employment status) and RES (applicant's residential status)
- RapidMiner supports dummy coding to transform these variables
- Try to find the corresponding operator and convert AES and RES into dummy variables
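What dummy coding produces can be shown with a short plain-Python sketch: one 0/1 column per category. The category codes, the `NAME=value` column naming, and the helper name `dummy_code` are illustrative assumptions, not the output format of the RapidMiner operator.

```python
def dummy_code(values, prefix):
    """Return one 0/1 indicator column per distinct category."""
    categories = sorted(set(values))
    return {
        f"{prefix}={cat}": [1 if v == cat else 0 for v in values]
        for cat in categories
    }

res = ["O", "R", "O", "P"]            # hypothetical residential status codes
res_dummies = dummy_code(res, "RES")
for name, column in res_dummies.items():
    print(name, column)
```

Each case has exactly one 1 across the generated columns, so no artificial ordering of the categories is introduced, which is the problem with consecutive numbering.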
Data Reduction Using Sampling
Exercise 6
- Familiarize yourself with the options to sample data using RapidMiner
- Note that the data set contains far more good credit risks (BAD = 0) than bad credit risks (BAD = 1). Such imbalanced distributions may cause problems at times
- To balance the class distribution, draw a random sample such that you sample all cases of class 1 (minority) but only 50% of the cases of class 0 (majority)
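The sampling scheme of Exercise 6 keeps every minority case and a random half of the majority cases. A plain-Python sketch of that idea follows; the toy rows, the fixed seed, and the helper name `undersample_majority` are assumptions for illustration, and the sketch draws an exact number of majority cases rather than sampling each case with probability 0.5.

```python
import random

def undersample_majority(rows, label_key="BAD", majority=0, fraction=0.5, seed=42):
    """Keep all non-majority rows and a random fraction of the majority rows."""
    rng = random.Random(seed)           # fixed seed so the draw is repeatable
    minority_rows = [r for r in rows if r[label_key] != majority]
    majority_rows = [r for r in rows if r[label_key] == majority]
    k = round(len(majority_rows) * fraction)
    return minority_rows + rng.sample(majority_rows, k)

# Toy data: 3 bad credit risks (BAD = 1) and 10 good ones (BAD = 0).
rows = [{"id": i, "BAD": 1 if i < 3 else 0} for i in range(13)]
sample = undersample_majority(rows)
n_bad = sum(r["BAD"] for r in sample)
print(len(sample), n_bad)
```

After sampling, the share of class 1 rises from 3 of 13 to 3 of 8 in this toy example; the classes are more balanced but not necessarily equal in size.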
Managing Complex Processes
- By now your process must look quite complex...
- To increase comprehensibility, RapidMiner allows you to structure your process into multiple sub-steps
- The key operator to do so is the Subprocess operator, which you find under Utility in the operator tree
Managing Complex Processes
Nesting Processes Using Subprocess
- Subprocess is one of several operators that facilitate process nesting
- Double-click the operator to obtain an empty modeling window entitled Nested Chain
- In this new window, you model data analytic operations in the same way as in the Main Process window
- (Callout on the operator icon) This icon indicates that the operator is a nested operator that encompasses specific modeling (sub-)steps
Managing Complex Processes
Exercise 7
- Refactor your process using the Subprocess operator
- Your refactored process should have two subprocesses, one for handling missing values and one for the transformation of continuous and categorical attributes. Note that RapidMiner facilitates copy & paste of operators
- Operators that are not related to missing values/data transformation can remain in the main process
- Make sure the refactored process produces the same result as before
Thank you! Any Questions?
- The Loan Data Set
- Data Integration and Missing Values
  - Declare missing values
  - Calculate new, derived attributes
  - Create dummy variables to flag anomalies
  - Mean/mode replacement
- Data Transformation
  - Min/max scaling for continuous attributes
  - Equal frequency and equal width binning for continuous attributes
  - Dummy coding for categorical attributes
- Data Reduction Using Sampling
  - Random sampling
  - Stratified samples to balance skewed class distributions
- Structuring processes using Subprocess and nested chains