Title | Assignment 1 W22 R Tutorial |
---|---|
Author | Anhad Chadha |
Course | Statistical Learning |
Institution | The University of British Columbia |
Pages | 10 |
R Studio tutorial for lecture 1. This tutorial was for Assignment 1....
STAT 231 Online Assignment 1 R Tutorial

In Assignment 1 you will be using R to analyze the variates Travel.method.to.school and Wrist.circumference in your data set. This tutorial will cover some of the R code needed to conduct these analyses. For more information you should consult the Introduction to R and RStudio found on LEARN, as well as the examples of R code in the end-of-chapter problems in the Course Notes.

This tutorial will analyze a different data set called dataset_goodbuy which you can download from the Assignment 1 folder on LEARN. Each row of the data set represents a purchase made from a company called "GoodBuy" in January 2020. For each purchase a number of variates have been recorded. You should think about how the code used in this tutorial may be adapted to analyze the variates Travel.method.to.school and Wrist.circumference in your data set.

The code used in this tutorial is provided on LEARN in the Assignment 1 folder. You may adapt this code for your assignment.
Importing the Data
You should follow the instructions to import the data as given in the Introduction to R and RStudio. For this tutorial the data set is dataset_goodbuy. You will need to adapt your code accordingly to use the name of your data set.

If we use the R command dim() on the data set we obtain:

    > dim(dataset_goodbuy)
    [1] 500  12
which indicates that the data set (called a data frame in R) has 500 rows and 12 columns.

If we use the R command head() we obtain:

    > head(dataset_goodbuy)
      Index Warehouse_block Mode_of_Shipment Customer_care_calls Customer_rating Cost_of_the_Product
    1  2353               D             Ship                  NA               5                 234
    2  5702               E             Ship                   4               4                 147
    3  1030               B             Ship                   3               1                 229
    4  3164               E           Flight                   5               2                 160
    5  8089               D           Flight                   3               2                 262
    6  2140               B             Ship                   6               3                 272
      Prior_purchases Product_importance Gender Discount_offered Weight_in_gms Reached_on_Time
    1               6             medium      F               60          2443               1
    2               4                low      F                8          5882               0
    3               2             medium      F               12            NA               1
    4               3               high      M                1          5834               0
    5               2             medium      M                4          5738               0
    6               3                low      F               39          2518               1
which displays the first 6 rows of the data set including the column headings.

Using the command head() on the column labeled Product_importance gives us the first 6 rows in that column.

    > head(dataset_goodbuy$Product_importance)
    [1] medium low    medium high   medium low
    Levels: high low medium
Syntax note: A $ sign is used to tell R to apply a function to a specific column in the data set.
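The three commands above can be tried without the course data. The following is a minimal sketch on a small made-up data frame; the name `toy` and all of its values are hypothetical stand-ins, not part of dataset_goodbuy.

```r
# Hypothetical toy data frame (made-up values, not the GoodBuy data)
toy <- data.frame(
  Index = c(101, 102, 103, 104),
  Product_importance = factor(c("low", "high", "medium", "low")),
  Cost = c(150, 230, NA, 199)
)

dim(toy)                      # number of rows, then number of columns
head(toy, 2)                  # first 2 rows, with the column headings
head(toy$Product_importance)  # $ picks out one column by name
```

The same pattern should carry over to your own data frame once the data frame name and column names are swapped in.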
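Relative Frequency Tables

For convenience we create a new data set or data frame called dataset which is equal to the data set dataset_goodbuy.

    > dataset <- dataset_goodbuy

A frequency table for the variate Product_importance is obtained with table():

    > table(dataset$Product_importance)
    Product_importance
             high    low medium
         9     43    259    189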
This tells us that there are 9 observations for which the value of Product_importance is blank or missing. This also tells us that there are 43 high, 259 low, and 189 medium observations.

You might notice that R has arranged the table in alphabetical order, starting with the blank observations, high, low, and then medium. We might want to display the results in the order low, medium, high, blank. We can do this using the factor() command and 'levels' option, as follows:

    > names <- c("low","medium","high","")
    > table(factor(dataset$Product_importance,levels=names))

       low medium   high
       259    189     43      9
If we want to display the results ignoring the missing values then we use:

    > names <- c("low","medium","high")
    > table(factor(dataset$Product_importance,levels=names))

       low medium   high
       259    189     43
If we use:

    > sum(table(factor(dataset$Product_importance,levels=names)))
    [1] 491
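The effect of the 'levels' option is easy to check on a short vector. This sketch uses made-up values, with the empty string "" standing in for a blank entry; the names `importance`, `ordered_levels` and `counts` are illustrative only.

```r
# Made-up values; "" plays the role of a blank entry
importance <- c("low", "high", "", "medium", "low", "medium", "low")

table(importance)  # default ordering: blank first, then alphabetical

# Choose the display order explicitly, keeping the blank category last
ordered_levels <- c("low", "medium", "high", "")
table(factor(importance, levels = ordered_levels))

# Leaving "" out of the levels turns blanks into NA,
# and table() leaves NA values out by default
counts <- table(factor(importance, levels = c("low", "medium", "high")))
counts
sum(counts)  # number of non-blank observations
```

Any value that is not listed in `levels` becomes NA, which is why dropping "" from the list is enough to ignore the blank entries.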
we can confirm that there are 500 - 9 = 491 observations on the variate Product_importance.

To obtain a table of relative frequencies which ignore the missing values we use:

    > [...]
    > [...]
    > [...]
    > [...]
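One way to build such a table is to divide the counts by the number of non-missing observations. The sketch below is an assumption about a reasonable approach on made-up data, not a reconstruction of the tutorial's own four commands; the names `lev`, `counts`, `n` and `rel_freq` are hypothetical.

```r
# Made-up values; "" plays the role of a blank entry
importance <- c("low", "high", "", "medium", "low", "medium", "low")

lev <- c("low", "medium", "high")                  # blank is left out on purpose
counts <- table(factor(importance, levels = lev))  # counts without the blanks
n <- sum(counts)                                   # non-missing observations
rel_freq <- counts / n                             # relative frequencies
round(rel_freq, 3)

# prop.table() performs the same division in one step
prop.table(counts)
```

Relative frequencies computed this way sum to 1 over the non-missing categories.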
    > names [...]
    > names [...] >=200])

      high    low medium
       222    109    149

Suppose we wanted to compare the relative frequencies for product importance for the cost categories less than $200 and greater than or equal to $200.

    > names [...]

    > mean(dataset$Cost_of_Product)
    [1] NA
Oh no! Why did we get the answer NA (meaning not available)? This occurred because there are missing values for the variate Cost_of_Product. Instead we must use:

    > mean(dataset$Cost_of_Product,na.rm=T)
    [1] 208.7102
The option na.rm=T or na.rm=TRUE means that the missing values are removed before calculating the sample mean. An easy way to check for missing values is to use the summary() command.

    > summary(dataset$Cost_of_Product)
       Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's
       97.0   166.0   213.0   208.7   251.8   307.0      10
This output indicates that there are 10 missing (NA) values for the variate Cost_of_Product. Note that the command summary() also provides the five number summary and the sample mean.

Suppose we assign this table a name:

    > summarycost <- summary(dataset$Cost_of_Product)

Individual entries can then be extracted by position:

    > summarycost[4]
        Mean
    208.7102
    > summarycost[3]
    Median
       213
Note that summary() has also determined the numerical summaries by ignoring the missing values.

To obtain the sample standard deviation we use

    > sd(dataset$Cost_of_Product,na.rm=T)
    [1] 48.72511
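The behaviour of missing values can be seen on a short vector. This is a minimal sketch with made-up numbers; `cost` and `s` are hypothetical names, not objects from the tutorial.

```r
# Made-up costs with one missing value
cost <- c(234, 147, NA, 160, 262, 272)

mean(cost)                # NA, because one value is missing
mean(cost, na.rm = TRUE)  # 215, the mean of the five observed values
sd(cost, na.rm = TRUE)    # sample standard deviation of the observed values

s <- summary(cost)        # five number summary, mean, and the NA count
s
s[["Median"]]             # extract one entry by name instead of by position
sum(is.na(cost))          # direct count of the missing values
```

Extracting by name, as in `s[["Median"]]`, gives the same value as extracting by position and is less likely to pick the wrong entry.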
We can also store these summaries for later use, for example:

    > cost.mean <- mean(dataset$Cost_of_Product,na.rm=T)
    > sd.mean <- sd(dataset$Cost_of_Product,na.rm=T)
    > round(cost.mean,3)
    [1] 208.71
    > round(sd.mean,2)
    [1] 48.73
Syntax note: The round() command rounds the result to the number of decimal places indicated.

Sometimes we may need to construct our own functions. For example, for skewness and kurtosis we can define new functions to calculate these:

    > skewness <- [...]
    > kurtosis <- [...]
    > skewness(na.omit(dataset$Cost_of_Product))
    [1] -0.1964894
    > kurtosis(na.omit(dataset$Cost_of_Product))
    [1] 1.970631

Note that for these self-defined functions we deal with missing values by using the function na.omit().
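The function definitions themselves are not shown here. As a sketch only, the following assumes the common moment-based definitions: sample skewness as the third central moment divided by the second central moment raised to the power 3/2, and sample kurtosis as the fourth central moment divided by the square of the second, with every moment using divisor n. Whether the tutorial uses exactly these formulas is an assumption.

```r
# Assumed moment-based definitions (divisor n); not confirmed by the tutorial
skewness <- function(x) {
  m <- mean(x)
  mean((x - m)^3) / mean((x - m)^2)^(3 / 2)
}

kurtosis <- function(x) {
  m <- mean(x)
  mean((x - m)^4) / mean((x - m)^2)^2
}

# Made-up costs with one missing value; na.omit() drops it before the call
cost <- c(234, 147, NA, 160, 262, 272)
skewness(na.omit(cost))
kurtosis(na.omit(cost))
```

Under these definitions a symmetric sample has skewness 0. Because the functions call mean() without na.rm, passing a vector that still contains NA returns NA, which is why na.omit() is applied first.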
Relative Frequency Histogram

To plot a frequency histogram we can use the function hist().

    > hist(dataset$Cost_of_Product)
A better function to use for plotting relative frequency histograms is the function truehist(). This function has more options compared to the function hist(), which allows us more control over how the histogram is created. This function is not included by default, but is in the 'MASS' package. We therefore need to tell R we want to use this package so we can use the truehist() function. We do this using:

    > library(MASS)
We can now use truehist(), and its various options, to make a relative frequency histogram.

    > truehist(dataset$Cost_of_Product,breaks=seq(95,315,20),ylim=c(0,0.008),xlab="Cost of product",ylab="Relative Frequency",las=1,col="darkblue",density=30,angle=45)
    > title(main = "Relative frequency histogram of cost of product")
Note that you should select values for the various options to create a well-displayed histogram. This will sometimes require some trial and error. A histogram should not have too few or too many bins. A general rule is that there should be ten to fifteen bins depending on the data set. A histogram should also be labeled.

Try changing the various options to see how the histogram looks.
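Before settling on a set of breaks, it can help to check how many bins they produce. The sketch below uses simulated data (not the GoodBuy data) and base R's hist() with plot=FALSE to inspect the bins without drawing anything; the bin width of 20 and the names `cost`, `width`, `breaks` and `h` are illustrative choices.

```r
set.seed(1)
# Simulated costs, used only for illustration
cost <- round(rnorm(200, mean = 210, sd = 50))

# Equal-width breaks that cover the whole range of the data
width <- 20
breaks <- seq(floor(min(cost) / width) * width,
              ceiling(max(cost) / width) * width,
              by = width)

h <- hist(cost, breaks = breaks, plot = FALSE)  # compute bins, draw nothing
length(h$counts)                  # number of bins
sum(h$density * diff(h$breaks))   # bar areas on the density scale sum to 1
```

The `density` values are the bar heights of a relative frequency histogram, the scale on which truehist() draws its bars by default. If the bin count falls well outside the ten to fifteen range, adjust `width` and check again.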
Adding a Gaussian Probability Density Function

We now want to superimpose a Gaussian probability density function with mean equal to the sample mean and standard deviation equal to the sample standard deviation for our data set. We do this with the following commands:

    > cost.mean <- mean(dataset$Cost_of_Product,na.rm=T)
    > cost.sd <- sd(dataset$Cost_of_Product,na.rm=T)
    > truehist(dataset$Cost_of_Product,breaks=seq(95,315,20),xlim=c(80,340),ylim=c(0,0.008),xlab="Cost of product",ylab="Relative Frequency",las=1,col="darkblue",density=30,angle=45)
    > title(main = "Relative frequency histogram of cost of product")
    > curve(dnorm(x,cost.mean,cost.sd),col="red",add=TRUE,lwd=2)
Syntax note: Look through these commands carefully. First we store the sample mean and sample standard deviation so that we can use them later. We then plot the histogram. Finally, we add the Gaussian pdf using the curve() and dnorm() commands.

Syntax note: sometimes you might find your probability density function curve is 'cut off' at the top or in the tails. If this happens, you need to expand the y or x axes of your plot. This can be done using the ylim and xlim options when you use the truehist() command.

Looking at the plot with the superimposed Gaussian probability density function, how well do you think these data are modeled by a Gaussian distribution?
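One way to choose axis limits that do not cut the curve off is to compute where the Gaussian pdf peaks and how far its tails extend. This sketch plugs in the sample mean and standard deviation reported earlier in the tutorial; the name `peak` and the choice of three standard deviations for the tails are illustrative assumptions.

```r
cost.mean <- 208.7102  # sample mean reported earlier in the tutorial
cost.sd   <- 48.72511  # sample standard deviation reported earlier

# A Gaussian pdf is highest at its mean, where it equals 1 / (sd * sqrt(2 * pi))
peak <- dnorm(cost.mean, mean = cost.mean, sd = cost.sd)
peak  # roughly 0.0082

# Range covering three standard deviations on either side of the mean
c(cost.mean - 3 * cost.sd, cost.mean + 3 * cost.sd)
```

Setting the upper ylim value a little above the larger of `peak` and the tallest histogram bar should keep the top of the curve inside the plot.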
Boxplots

Finally, let's make a boxplot of the variate Cost_of_Product. We use the boxplot() command:

    > boxplot(dataset$Cost_of_Product,ylab="Cost",col="dodgerblue3",las = 1)
    > title(main="Boxplot of Cost of Product")
What might be more interesting is a side-by-side boxplot comparing the cost for products of low and high product importance. To do this we again make use of [square brackets] to tell R we want to focus on specific subsets of the data:

    > clr=c("dodgerblue3","seagreen4")
    > namespi=c("low","high")
    > boxplot(dataset$Cost_of_Product[dataset$Product_importance=="low"],dataset$Cost_of_Product[dataset$Product_importance=="high"],col=clr,ylab="Cost",names=namespi)
    > title(main="Boxplots of cost for low and high product importance")
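The square-bracket subsetting can be checked on a small example before plotting. This sketch uses a made-up data frame (the name `toy` and its values are hypothetical) and calls boxplot() with plot=FALSE so that it returns the numbers behind each box instead of drawing them.

```r
# Made-up data; "" stands for a blank importance and NA for a missing cost
toy <- data.frame(
  Cost = c(150, 230, NA, 199, 260, 175, 240, 210),
  Product_importance = c("low", "high", "low", "low", "high", "", "high", "low")
)

# A logical condition inside [ ] keeps only the rows where it is TRUE
cost_low  <- toy$Cost[toy$Product_importance == "low"]
cost_high <- toy$Cost[toy$Product_importance == "high"]

# plot = FALSE returns the boxplot statistics without opening a plot
b <- boxplot(list(low = cost_low, high = cost_high), plot = FALSE)
b$stats  # rows: lower whisker, lower hinge, median, upper hinge, upper whisker
b$n      # non-missing observations in each group
```

Comparing the median row of `b$stats` across the two columns gives a quick numerical version of what the side-by-side boxplot shows.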
What can we learn from this boxplot?

You should now be able to complete the real data analyses from Assignment 1. If you have any questions about this R tutorial, please ask on Piazza using the 'R Code' tag!