Assignment 1 W22 R Tutorial PDF

Title	Assignment 1 W22 R Tutorial
Author	Anhad Chadha
Course	Statistical Learning
Institution	The University of British Columbia
Pages	10


Summary

R Studio tutorial for lecture 1. This tutorial was for Assignment 1....

Description

STAT231OnlineAssignment1RTutorial InAssignment1youwillbeusingRtoanalyzethevariatesTravel.method.to.schooland Wrist.circumferenceinyourdataset.ThistutorialwillcoversomeoftheRcodeneededtoconductthese analyses.FormoreinformationyoushouldconsulttheIntroductiontoRandRStudiofoundonLEARN,as wellastheexamplesofRcodeintheendofchapterproblemsintheCourseNotes.  Thistutorialwillanalyzeadifferentdatasetcalleddataset_goodbuywhichyoucandownloadfromthe Assignment1folderonLEARN.Eachrowofthedatasetrepresentsapurchasemadefromacompany called“GoodBuy”inJanuary2020.Foreachpurchaseanumberofvariateshavebeenrecorded.  Youshouldthinkabouthowthecodeusedinthistutorialmaybeadaptedtoanalyzethevariates Travel.method.to.schoolandWrist.circumferenceinyourdataset.  ThecodeusedinthistutorialisprovidedonLEARNintheAssignment1folder.Youmayadaptthis code foryourassignment. 

ImportingtheData 

YoushouldfollowtheinstructionstoimportthedataasgivenintheIntroductiontoRandRStudio.Forthis tutorialthedatasetisdataset_goodbuy.Youwillneedtoadaptyourcodeaccordinglytousethenameof yourdataset.  IfweusetheRcommanddim()onthedatasetweobtain: > dim(dataset_goodbuy) [1] 500 12

whichindicatesthatthedataset(calledadataframeinR)has500rowsand12columns. IfweusetheRcommandhead()weobtain: > head(dataset_goodbuy) 1 2 3 4 5 6 1 2 3 4 5 6

Index Warehouse_block Mode_of_Shipment Customer_care_calls Customer_rating Cost_of_the_Product 2353 D Ship NA 5 234 5702 E Ship 4 4 147 1030 B Ship 3 1 229 3164 E Flight 5 2 160 8089 D Flight 3 2 262 2140 B Ship 6 3 272 Prior_purchases Product_importance Gender Discount_offered Weight_in_gms Reached_on_Time 6 medium F 60 2443 1 4 low F 8 5882 0 2 medium F 12 NA 1 3 high M 1 5834 0 2 medium M 4 5738 0 3 low F 39 2518 1

whichdisplaysthefirst6rowsofthedatasetincludingthecolumnheadings. Usingthecommandhead()onthecolumnlabeledProduct_importancegivesusthefirst6rowsinthat column. > head(dataset_goodbuy$Product_importance) [1] medium low medium high medium low Levels: high low medium

Syntaxnote:A$signisusedtotellRtoapplyafunctiontoaspecificcolumninthedataset.  




RelativeFrequencyTables  Forconveniencewecreateanewdatasetordataframecalleddatasetwhichisequaltothedataset dataset_goodbuy. > dataset table(dataset$Product_importance) Product_importance high low medium 9 43 259 189

 Thistellsusthatthereare9observationsforwhichthevalueofProduct_importanceisblankormissing. Thisalsotellsusthatthereare43high,259low,and189mediumobservations. YoumightnoticethatRhasarrangedthetableinalphabeticalorder,startingwiththeblankobservations, high,low,andthenmedium.Wemightwanttodisplaytheresultsintheorderlow,medium,high,blank. Wecandothisusingthefactor()commandand‘levels’option,asfollows: > names table(factor(dataset$Product_importance,levels=names)) low medium 259 189

high 43

9

Ifwewanttodisplaytheresultsignoringthemissingvaluesthenweuse: > names table(factor(dataset$Product_importance,levels=names)) low medium 259 189

high 43
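The effect of the ‘levels’ option can be checked on a small example. This is a minimal sketch on made-up responses; the vector importance and the name level_order are hypothetical. level_order is used instead of names only to avoid masking R's built-in names() function.

```r
# Made-up responses; "" stands for a blank entry.
importance <- c("low", "high", "medium", "low", "", "medium", "low")

# Default table: categories appear in sorted order.
default_tab <- table(importance)

# Chosen display order, keeping the blank category last.
level_order <- c("low", "medium", "high", "")
ordered_tab <- table(factor(importance, levels = level_order))

# Leaving "" out of the levels turns blanks into NA, which table() drops.
no_blank_tab <- table(factor(importance, levels = c("low", "medium", "high")))

print(default_tab)
print(ordered_tab)
print(no_blank_tab)
```

Any value not listed in levels is treated as missing, so a misspelled level silently removes observations. Comparing sum() of the table with the number of rows is a quick check.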

Ifweuse: > sum(table(factor(dataset$Product_importance,levels=names))) [1] 491

wecanconfirmthatthereare500‐9=491observationsonthevariateProduct_importance. Toobtainatableofrelativefrequencieswhichignorethemissingvaluesweuse: > > > >

[...]
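A minimal sketch of one way to compute relative frequencies that ignore blank entries, on made-up data. The names importance, level_order, counts, and rel_freq are hypothetical, and this is not necessarily the code used in the tutorial.

```r
# Made-up responses; "" stands for a blank entry.
importance <- c("low", "high", "medium", "low", "", "medium", "low", "low")
level_order <- c("low", "medium", "high")

counts <- table(factor(importance, levels = level_order))  # blanks are dropped
n <- sum(counts)                     # number of non-missing observations
rel_freq <- counts / n               # relative frequencies, which sum to 1
rel_freq_alt <- prop.table(counts)   # base R function that does the same division

print(n)
print(round(rel_freq, 3))
```

Dividing by sum(counts) rather than by the number of rows is what makes the relative frequencies ignore the missing values.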

pposewew antedtoco omparetherelativefre quenciesfo orproductimportanceforthecostcategoriess Sup less than$200andgreate erthanorequalto$20 q 0. > n ames mean(dataset$Cost_of_Product) [1] NA

Ohno!WhydidwegettheanswerNA(meaningnotavailable)?Thisoccuredbecausetherearemissing valuesforthevariateCost_of_product.Insteadwemustuse: > mean(dataset$Cost_of_Product,na.rm=T) [1] 208.7102



Theoptionna.rm=Torna.rm=TRUEmeansthatthemissingvaluesareremovedbeforecalculatingthe samplemean.Aneasywaytocheckformissingvaluesistousethesummary()command. > summary(dataset$Cost_of_Product) Min. 1st Qu. Median Mean 3rd Qu. 97.0 166.0 213.0 208.7 251.8

Max. 307.0

NA's 10

Thisoutputindicatesthatthereare10missing(NA)valuesforthethevariateCost_of_Product. Notethatthecommandsummary()alsoprovidesthefivenumbersummaryandthesamplemean. Supposeweassignthistableaname: summarycost summarycost[4] Mean 208.7102 > summarycost[3] Median 213

Notethatsummary()hasalsodeterminedthenumericalsummariesbyignoringthemissingvalues. Toobtainthesampledeviationweuse > sd(dataset$Cost_of_Product,na.rm=T) [1] 48.72511

Wecanalsostorethesesummariesforlateruse,forexample: > cost.mean sd.mean round(cost.mean,3) [1] 208.71 > round(sd.mean,2) [1] 48.73

 Syntaxnote:Theround()commanddisplaystheresulttothenumberofdigitsindicated.  Sometimeswemayneedtoconstructourownfunctions.Forexample,forskewnessandkurtosiswecan define newfunctionstocalculatethese:  >skewnesskurtosis s kewness(n na.omit(da ataset$Cosst_of_Prod duct)) [1] -0.19648 894 > kurtosis(n k na.omit(da ataset$Cosst_of_Prod duct)) [1] 1.970631 1

Not ethatfortheseselfdeefinedfuncttionsthatw wedealwith hmissingva aluesbyusingthefuncc tionna.omit(). 

RellativeFreq quencyHiistogram  Top plotafrequuencyhistog g ramwecanusethefunctionhist u t().  > h ist(datas set$Cost_o of_Productt)

Ab etterfunctiontousefoorplottingrrelativefreq quencyhisttogramsist hefunctionntruehist().Thisfunctio on hasmoreoptio onscomparedtothefuunctionhistt()whichallowsusmorrecontrolo overhowthehistogram mis creaated.Thisfu unctionisn notincluded dbydefault,butisinthhe‘MASS’ppackage.Weethereforeneedtote llR wewanttouseethispackagesowecaanusethettruehist()fu unction.Weedothisusi ng: > library(MA ASS)

We cannowuss etruehist(),anditsvaariousoptioonstomakeearelativeffrequencyh histogram. > truehist(d t dataset$Co ost_of_Pro oduct,brea aks=seq(955,315,20),ylim=c(0,0.008),xl lab="Cost of pro oduct",yla ab="Relative Frequeency",las= =1,col="daarkblue",d density=30 0,angle=45) > title(main t n = "Relat tive frequuency hist togram of cost of product") p

Not ethatyoushouldseleectvaluesfo orthevario usoptionsttocreateawell‐displayyedhistogra am.Thiswi ll som metimesreq quiresome trialanderror.Ahistogramshouldnothave toofewortoomanyb bins.Ageneeral ruleeisthattheereshouldb betentofiftteenbinsdee pendingonthedatasset.Ahistogramshoul g dalsobe labeeled.


 Trychangingth hevariousooptionstoseehowtheehistogramlooks. 

Ad dingaGaussianProobabilityD DensityFu unction  We nowwanttosuperim poseaGaussianproba bilitydensityfunction withmeanequaltothesample plestandard ddeviation forourdataaset.Wedothiswith the meaanandstanndarddeviattionequalttothesamp follo owingcomm mands:  > cost.mean< c cost.sd truehist(d t dataset$Co ost_of_Prooduct,brea aks=seq(95 5,315,20) ,xlim=c(80 0,340),yli im=c(0,0. 008 ),x xlab="Cost t of product",ylab=="Relative e Frequenccy",las=1,col="dark kblue",den nsity=30, ang le= =45) > title(main t n = "Relat tive frequuency hist togram of cost of product") p > curve(dnor c rm(x,cost .mean,costt.sd),col= ="red",addd=TRUE,lwd d=2)



mmandscarrefully.Firsttwestoretthesamplemeanand sample s Syn ntaxnote:Lookthroug hthesecom stan ndarddeviaationsotha twecanussethemlat er.Wethen e nplotthehistogram.Fi nally,wead ddthe Gau ussianpdfuusingthecu rve()anddnorm()com mmands. 

Syn ntaxnote:sometimesy y oumightfiindyourprobabilitydeensityfunct ioncurveiss‘cutoff’a tthetopor t in thetails.Ifthisshappens,yyouneedto oexpandtheyorxaxeesofyourplot.Thiscan nbedoneu usingtheyli m and dxlimoptio nswhenyouusethetrruehist()command.  Lookingattheplotwiththhesuperim posedGausssianprobabilitydensitt yfunction,howwelld doyouthinkk the sedataaremodeledbyaGaussiaandistribution?


Boxplots

Finally, let’s make a boxplot of the variate Cost_of_Product. We use the boxplot() command:

> boxplot(dataset$Cost_of_Product,ylab="Cost",col="dodgerblue3",las = 1)
> title(main="Boxplot of Cost of Product")
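The numbers that a boxplot draws can be inspected without plotting. This is a minimal sketch on made-up data: boxplot(plot = FALSE) returns the whisker ends, hinges, and median in its stats component, and any points beyond the whiskers in out. The vector cost is hypothetical.

```r
# Made-up costs with one missing value, which boxplot() ignores.
cost <- c(97, 147, 160, 166, 213, 229, 234, 262, 272, 307, NA)

b <- boxplot(cost, plot = FALSE)   # same computation as the plot, no drawing

box_stats <- as.vector(b$stats)    # lower whisker, lower hinge, median, upper hinge, upper whisker
n_used <- b$n                      # number of non-missing observations used
outliers <- b$out                  # points drawn individually beyond the whiskers

print(box_stats)
print(n_used)
print(outliers)
```

When there are no outliers, as here, the five values match the output of fivenum() on the same data.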

  9

 Whatmightbeemoreinte restingisaside‐by‐side e boxplotcomparingth hecostfor productsofflowandhigh g useof[squaarebracketss]totellRw wewanttofocuson productimportance.Todothisweagainmakeu specificsubsettsofthedatta: > clr=c("dod c dgerblue3","seagreeen4") > n amespi=c( ("low","high") > boxplot(da b ataset$Cost_of_Prodduct[datas set$Producct_importa ance=="low w"],datase et$Cost_o f_P rod uct[datas set$Produc ct_importaance=="hig gh"],col=cclr,ylab=" "Cost",nam mes=namesp pi) > title(main t n="Boxplot ts of costt for low and high product importance i e")



  earnfromthisboxplott? Whatcanwele   ushouldnow w beableto ocomplete therealdaataanalysessfromAssiggnment1.I fyouhaveanyquestioons You abo outthisRtuutorial,plea seaskonPiazzausingthe‘RCode’tag!
