Practical - SS2015 - Data Preprocessing with RapidMiner PDF

Course Business Intelligence und Management-Unterstützungssysteme
Institution Humboldt-Universität zu Berlin


ManagementSupportSystems&BusinessIntelligence

SummerSemester2015

Exercises: DataPreprocessingwith RapidMiner BenFabian&StefanLessmann

Agenda

Data Preprocessing with RapidMiner
- Introduction to the Loan Data Set
- Data Integration and Missing Values
- Data Transformation
  - Continuous variables
  - Categorical variables
- Data Reduction Using Sampling
- Managing Complex Processes

17-Jun-15

Management Support Systems & Business Intelligence, Data Understanding/Preprocessing, Fabian/Lessmann


DataPreprocessing with RapidMiner

LearningObjectives Aftercompletingthischapter,studentswillbeableto:  Handlemissingvaluesthroughmean/mode replacement  Createnew,derivedattributesusingtheGenerate AttributeOperator  Performdatatransformationoperationsintheformof scaling,discretization,anddummycoding  Drawrandomandstratifiedsamplesfrom adataset  Structurecomplexdataminingprocesses usingnestedchains 17‐Jun‐15

ManagementSupportSystems&BusinessIntelligence,DataUnderstanding/Preprocessing,Fabian/Lessmann

3

IntroductiontotheLoanDataSet  Datareferringtoacreditproduct    

Binarypredictiontask Willcreditapplicantdefault(1)ornot(2) 1225cases(applications)and14independentvariables Source:Thomas,L.C.,Edelman,D.B.,&Crook,J.N.(2002).Credit ScoringanditsApplications.Philadelphia:Siam.

 Files   

Dataset:public_us.csv Datadictionary:publicdict.xls BothavailableinMoodle

 Dataaccompaniestheabovebookandisnotmeanttobe

sharedand/ordistributedinanyway 17‐Jun‐15

ManagementSupportSystems&BusinessIntelligence,DataUnderstanding/Preprocessing,Fabian/Lessmann

4

IntroductiontotheLoanDataSet

Exercise1  

DownloadtheloandatafromMoodletoyourPC ImporttheCSVfile  





AssigncolumnBADtheroleofthedependentvariable.Makesureitsdatatype issettobinominal Mostothervariableareoftypereal;variablesAESandRESareoftype polynomial(RapidMiner shoulddetectthatautomatically);variablePHON shouldbebinominal StorethedatainyourlocalrepositoryunderthenameLOAN

Readthedatadictionaryand performsomeexplanatory analysistofamiliarizeyourself withthedata


DataIntegrationandMissingValues

Exercise2 

Correcttheentrieswithvalue99intheYOBvariable.Assigntheseentriesthe rolemissingvalue  



Tothatend,youcanmakeuseoftheDeclareMissingValueOperator CheckthemaximalvalueoftheYOBvariableafterthetransformationtoverifythat thevalue99nolongerexistsinthedata.Youcanalsore‐examinethehistogramplot

CreateanewvariableAGE 





UsetheoperatorGenerateAttributes,which allowsyoutocreatenewvariablesthrough mathematicalexpressions ConfiguretheGenerateAttributesoperator suchthatitcalculatestheageofanapplicant fromhis/heryearofbirth Assumethedatawascollectedin2002. Hence,avalueYOB=80impliesthatthe applicationis22yearsold


DataIntegrationandMissingValues

MissingValues  RapidMiner featuresmanyapproaches

tohandlemissingvalues  Operatorsareavailableintheoperator

windowunderthetreeData Transformation  DataCleansing  Replaceoperatorsimplementmeanor

modereplacementofmissingvalues  Imputeoperatorsallowyoutodevelop

amodeltopredictmissingvalues


DataIntegrationandMissingValues

Exercise3  Replaceallmissingvaluesinthedatasetwiththeir

average
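As a rough illustration of what mean replacement does to a single numeric attribute, the following plain-Python sketch may help. The function name `replace_missing_with_mean` and the toy values are assumptions for this example; `None` again stands in for a missing value.

```python
def replace_missing_with_mean(values):
    """Fill None entries with the mean of the observed (non-missing) entries."""
    observed = [v for v in values if v is not None]
    if not observed:
        # Nothing to average over, so the column is returned unchanged.
        return list(values)
    mean = sum(observed) / len(observed)
    return [mean if v is None else v for v in values]


age = [22, None, 37, 31]  # made-up example values
print(replace_missing_with_mean(age))  # [22, 30.0, 37, 31]
```

The mean is computed from the observed values only; the missing entries do not take part in the average. For nominal attributes the analogous approach would use the mode instead of the mean.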


DataTransformation  Wecovereddifferentmethodstotransformthevalues

ofcontinuousandcategoricalattributesinthelecture  Continuousattribute Scalingintheformofmin/maxscalingandthez‐transformation  Discretization 

 Categoricalattributes

Consecutivenumbering(badidea)  Dummycoding



 SubsequentexercisesexplorehowRapidMiner

supportsthesefunctions 17‐Jun‐15

ManagementSupportSystems&BusinessIntelligence,DataUnderstanding/Preprocessing,Fabian/Lessmann

9

DataTransformation:ContinuousVariables

Exercise4a)  ManyvariablenamescontaineitherthetextINC(income)

orOUT(outgoings,i.e.,expenditures).Thevaluesinthese variablesrepresentsomeamountofmoney.Scalethe valuesin(only!)thesevariablessuchthateveryvariable hasaminimumof‐1andamaximumof+1.  UsetheoperatorNormalizeforthispurpose.


DataTransformation:ContinuousVariables

Exercise4b) 

ExaminethedistributionofthevariablesDHVAL(valueofthehouse) andDMORT(mortgagevalue).Theirdistributionisveryodd.The explanationisavailableinthedatadictionary.Avalueofzerohasa specialmeaning.Asyoucansee,suchcodesdistortthedistributionof thevariable.Tocorrectthis:  

ForDHVAL,createanewdummyvariableDHVAL_d.Thisvariableshouldbe zeroifDHVALiszeroandoneotherwise. RepeatthepreviousstepforDMORT

 Aftercreatingthedummies,

discretizethevariables DHVALandDMORTusing equalfrequencybinningwith 3bins. 17‐Jun‐15

ManagementSupportSystems&BusinessIntelligence,DataUnderstanding/Preprocessing,Fabian/Lessmann

11

DataTransformation:CategoricalVariables

Exercise5 Yourdatasetcontainstwo(poly)nominalvariables,AES (applicant’semploymentstatus)andRES(applicant’sresidential status)  RapidMiner supportsdummycodingtotransformthese variables  TrytofindthecorrespondingoperatorandconvertAESandRES intodummyvariables 


DataReductionUsingSampling

Exercise6  Familiarizeyourselfwiththeoptionstosampledata

usingRapidMiner  Notethatthedatasetcontainsfarmoregoodcredit risks(BAD=0)thanbadcreditrisks(BAD=1).Such imbalanceddistributionsmaycauseproblemsattimes  Tobalancetheclass distribution,drawarandom samplesuchthatyousample allcasesofclass1(minority) butonly50%ofthecasesof class0(majority) 17‐Jun‐15

ManagementSupportSystems&BusinessIntelligence,DataUnderstanding/Preprocessing,Fabian/Lessmann

13

ManagingComplexProcesses  Bynowyourprocessmustlookquitecomplex...

 Toincreasecomprehensibility,RapidMiner allowsyou

tostructureyourprocessintomultiplesub‐steps  ThekeyoperatortodosoistheSubprocess operator, whichyoufindunderUtilityintheoperatortree 17‐Jun‐15

ManagementSupportSystems&BusinessIntelligence,DataUnderstanding/Preprocessing,Fabian/Lessmann

14

ManagingComplexProcesses

NestingProcessUsingSubprocess  Subprocess isoneofseveraloperatorsthatfacilitates

processnesting  Double‐clicktheoperatortoobtainan Thisiconindicates emptymodelingwindowentitled thattheoperatorisa NestedChain nestedoperatorthat encompassesspecific  Inthisnewwindow,youmodeldata modeling(sub‐)steps analyticoperationsinthesamewayasintheMain Processwindow


ManagingComplexProcesses

Exercise7  RefactoryourprocessusingtheSubprocess operator  Yourrefactoredprocessshouldhavetwosubprocesses,

oneforhandlingmissingvaluesandoneforthe transformationofcontinuousandcategoricalattributes.  NotethatRapidMiner facilitatescopy&pasteofoperators  Operatorsthatarenotrelated

tomissingvalues/data transformationcanremainin themainprocess  Makesuretherefactored processproducesthesame resultasbefore


Thankyou!AnyQuestions? TheLoanDataSet  DataIntegrationandMissingValues



   



DataTransformation   



Min/maxscalingforcontinuousattributes Equalfrequencyandequalwidthbinningforcontinuousattributes Dummycodingforcategoricalattributes

DataReductionUsingSampling  



Declaremissingvalues Calculatenew,derivedattributes Createdummyvariablestoflaganomalies Mean/modereplacement

Randomsampling Stratifiedsamplestobalanceskewedclassdistributions

StructuringprocessesusingSubprocess andnestedchains
