Assignment 1 W22 R Tutorial PDF

Title Assignment 1 W22 R Tutorial
Author Anhad Chadha
Course Statistical Learning
Institution The University of British Columbia

Summary

R Studio tutorial for lecture 1. This tutorial was for Assignment 1....


Description

STAT231OnlineAssignment1RTutorial InAssignment1youwillbeusingRtoanalyzethevariatesTravel.method.to.schooland Wrist.circumferenceinyourdataset.ThistutorialwillcoversomeoftheRcodeneededtoconductthese analyses.FormoreinformationyoushouldconsulttheIntroductiontoRandRStudiofoundonLEARN,as wellastheexamplesofRcodeintheendofchapterproblemsintheCourseNotes.  Thistutorialwillanalyzeadifferentdatasetcalleddataset_goodbuywhichyoucandownloadfromthe Assignment1folderonLEARN.Eachrowofthedatasetrepresentsapurchasemadefromacompany called“GoodBuy”inJanuary2020.Foreachpurchaseanumberofvariateshavebeenrecorded.  Youshouldthinkabouthowthecodeusedinthistutorialmaybeadaptedtoanalyzethevariates Travel.method.to.schoolandWrist.circumferenceinyourdataset.  ThecodeusedinthistutorialisprovidedonLEARNintheAssignment1folder.Youmayadaptthis code foryourassignment. 

ImportingtheData 

YoushouldfollowtheinstructionstoimportthedataasgivenintheIntroductiontoRandRStudio.Forthis tutorialthedatasetisdataset_goodbuy.Youwillneedtoadaptyourcodeaccordinglytousethenameof yourdataset.  IfweusetheRcommanddim()onthedatasetweobtain: > dim(dataset_goodbuy) [1] 500 12

whichindicatesthatthedataset(calledadataframeinR)has500rowsand12columns. IfweusetheRcommandhead()weobtain: > head(dataset_goodbuy) 1 2 3 4 5 6 1 2 3 4 5 6

Index Warehouse_block Mode_of_Shipment Customer_care_calls Customer_rating Cost_of_the_Product 2353 D Ship NA 5 234 5702 E Ship 4 4 147 1030 B Ship 3 1 229 3164 E Flight 5 2 160 8089 D Flight 3 2 262 2140 B Ship 6 3 272 Prior_purchases Product_importance Gender Discount_offered Weight_in_gms Reached_on_Time 6 medium F 60 2443 1 4 low F 8 5882 0 2 medium F 12 NA 1 3 high M 1 5834 0 2 medium M 4 5738 0 3 low F 39 2518 1

whichdisplaysthefirst6rowsofthedatasetincludingthecolumnheadings. Usingthecommandhead()onthecolumnlabeledProduct_importancegivesusthefirst6rowsinthat column. > head(dataset_goodbuy$Product_importance) [1] medium low medium high medium low Levels: high low medium

Syntaxnote:A$signisusedtotellRtoapplyafunctiontoaspecificcolumninthedataset.  




RelativeFrequencyTables  Forconveniencewecreateanewdatasetordataframecalleddatasetwhichisequaltothedataset dataset_goodbuy. > dataset table(dataset$Product_importance) Product_importance high low medium 9 43 259 189

 Thistellsusthatthereare9observationsforwhichthevalueofProduct_importanceisblankormissing. Thisalsotellsusthatthereare43high,259low,and189mediumobservations. YoumightnoticethatRhasarrangedthetableinalphabeticalorder,startingwiththeblankobservations, high,low,andthenmedium.Wemightwanttodisplaytheresultsintheorderlow,medium,high,blank. Wecandothisusingthefactor()commandand‘levels’option,asfollows: > names table(factor(dataset$Product_importance,levels=names)) low medium 259 189

high 43

9

Ifwewanttodisplaytheresultsignoringthemissingvaluesthenweuse: > names table(factor(dataset$Product_importance,levels=names)) low medium 259 189

high 43
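The effect of the ‘levels’ option can be sketched on a small made-up vector. The vector importance is an assumption for illustration, and blank entries are assumed to be stored as the empty string "".

```r
# Made-up categorical data; "" marks a blank entry (an assumption about
# how blanks are stored).
importance <- c("low", "high", "", "medium", "low", "low", "medium")

# Default ordering: levels are sorted alphabetically, blank first.
default_counts <- table(importance)

# Custom ordering: list the levels explicitly in the order wanted.
ordered_counts <- table(factor(importance,
                               levels = c("low", "medium", "high", "")))

# A value left out of 'levels' becomes NA, which table() drops by default.
nonblank_counts <- table(factor(importance,
                                levels = c("low", "medium", "high")))

print(default_counts)
print(ordered_counts)
print(nonblank_counts)
```

Leaving a value out of ‘levels’ is what removes the blank observations from the table: factor() converts them to NA and table() does not count NA unless asked to.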

Ifweuse: > sum(table(factor(dataset$Product_importance,levels=names))) [1] 491

wecanconfirmthatthereare500‐9=491observationsonthevariateProduct_importance. Toobtainatableofrelativefrequencieswhichignorethemissingvaluesweuse: > > > >

names n ames=200]) 5

hig gh 222

low medium 109 149
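One way to obtain relative frequencies that ignore missing values, sketched here on made-up data, is to divide the counts by their total; base R's prop.table() does the same thing. The vector importance and the object names are assumptions for illustration, and this is not necessarily the code the tutorial used.

```r
# Made-up data; "" marks a blank entry.
importance <- c("low", "high", "", "medium", "low", "low", "medium", "low")
lev <- c("low", "medium", "high")   # blank left out, so it is ignored

counts <- table(factor(importance, levels = lev))
n <- sum(counts)                    # number of non-missing observations
rel_freq <- counts / n              # relative frequencies
rel_freq2 <- prop.table(counts)     # same result using prop.table()

print(round(rel_freq, 3))
```

Because the blank entries are excluded before dividing, the relative frequencies add up to one over the non-missing observations only.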

pposewew antedtoco omparetherelativefre quenciesfo orproductimportanceforthecostcategoriess Sup less than$200andgreate erthanorequalto$20 q 0. > n ames mean(dataset$Cost_of_Product) [1] NA

Ohno!WhydidwegettheanswerNA(meaningnotavailable)?Thisoccuredbecausetherearemissing valuesforthevariateCost_of_product.Insteadwemustuse: > mean(dataset$Cost_of_Product,na.rm=T) [1] 208.7102



Theoptionna.rm=Torna.rm=TRUEmeansthatthemissingvaluesareremovedbeforecalculatingthe samplemean.Aneasywaytocheckformissingvaluesistousethesummary()command. > summary(dataset$Cost_of_Product) Min. 1st Qu. Median Mean 3rd Qu. 97.0 166.0 213.0 208.7 251.8

Max. 307.0

NA's 10

Thisoutputindicatesthatthereare10missing(NA)valuesforthethevariateCost_of_Product. Notethatthecommandsummary()alsoprovidesthefivenumbersummaryandthesamplemean. Supposeweassignthistableaname: summarycost summarycost[4] Mean 208.7102 > summarycost[3] Median 213

Notethatsummary()hasalsodeterminedthenumericalsummariesbyignoringthemissingvalues. Toobtainthesampledeviationweuse > sd(dataset$Cost_of_Product,na.rm=T) [1] 48.72511

Wecanalsostorethesesummariesforlateruse,forexample: > cost.mean sd.mean round(cost.mean,3) [1] 208.71 > round(sd.mean,2) [1] 48.73

 Syntaxnote:Theround()commanddisplaystheresulttothenumberofdigitsindicated.  Sometimeswemayneedtoconstructourownfunctions.Forexample,forskewnessandkurtosiswecan define newfunctionstocalculatethese:  >skewnesskurtosis s kewness(n na.omit(da ataset$Cosst_of_Prod duct)) [1] -0.19648 894 > kurtosis(n k na.omit(da ataset$Cosst_of_Prod duct)) [1] 1.970631 1

Not ethatfortheseselfdeefinedfuncttionsthatw wedealwith hmissingva aluesbyusingthefuncc tionna.omit(). 

RellativeFreq quencyHiistogram  Top plotafrequuencyhistog g ramwecanusethefunctionhist u t().  > h ist(datas set$Cost_o of_Productt)

Ab etterfunctiontousefoorplottingrrelativefreq quencyhisttogramsist hefunctionntruehist().Thisfunctio on hasmoreoptio onscomparedtothefuunctionhistt()whichallowsusmorrecontrolo overhowthehistogram mis creaated.Thisfu unctionisn notincluded dbydefault,butisinthhe‘MASS’ppackage.Weethereforeneedtote llR wewanttouseethispackagesowecaanusethettruehist()fu unction.Weedothisusi ng: > library(MA ASS)

We cannowuss etruehist(),anditsvaariousoptioonstomakeearelativeffrequencyh histogram. > truehist(d t dataset$Co ost_of_Pro oduct,brea aks=seq(955,315,20),ylim=c(0,0.008),xl lab="Cost of pro oduct",yla ab="Relative Frequeency",las= =1,col="daarkblue",d density=30 0,angle=45) > title(main t n = "Relat tive frequuency hist togram of cost of product") p

Not ethatyoushouldseleectvaluesfo orthevario usoptionsttocreateawell‐displayyedhistogra am.Thiswi ll som metimesreq quiresome trialanderror.Ahistogramshouldnothave toofewortoomanyb bins.Ageneeral ruleeisthattheereshouldb betentofiftteenbinsdee pendingonthedatasset.Ahistogramshoul g dalsobe labeeled.


 Trychangingth hevariousooptionstoseehowtheehistogramlooks. 

Ad dingaGaussianProobabilityD DensityFu unction  We nowwanttosuperim poseaGaussianproba bilitydensityfunction withmeanequaltothesample plestandard ddeviation forourdataaset.Wedothiswith the meaanandstanndarddeviattionequalttothesamp follo owingcomm mands:  > cost.mean< c cost.sd truehist(d t dataset$Co ost_of_Prooduct,brea aks=seq(95 5,315,20) ,xlim=c(80 0,340),yli im=c(0,0. 008 ),x xlab="Cost t of product",ylab=="Relative e Frequenccy",las=1,col="dark kblue",den nsity=30, ang le= =45) > title(main t n = "Relat tive frequuency hist togram of cost of product") p > curve(dnor c rm(x,cost .mean,costt.sd),col= ="red",addd=TRUE,lwd d=2)



mmandscarrefully.Firsttwestoretthesamplemeanand sample s Syn ntaxnote:Lookthroug hthesecom stan ndarddeviaationsotha twecanussethemlat er.Wethen e nplotthehistogram.Fi nally,wead ddthe Gau ussianpdfuusingthecu rve()anddnorm()com mmands. 

Syn ntaxnote:sometimesy y oumightfiindyourprobabilitydeensityfunct ioncurveiss‘cutoff’a tthetopor t in thetails.Ifthisshappens,yyouneedto oexpandtheyorxaxeesofyourplot.Thiscan nbedoneu usingtheyli m and dxlimoptio nswhenyouusethetrruehist()command.  Lookingattheplotwiththhesuperim posedGausssianprobabilitydensitt yfunction,howwelld doyouthinkk the sedataaremodeledbyaGaussiaandistribution?


Boxplots

Finally, let’s make a boxplot of the variate Cost_of_Product. We use the boxplot() command:
> boxplot(dataset$Cost_of_Product,ylab="Cost",col="dodgerblue3",las = 1)
> title(main="Boxplot of Cost of Product")
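The numbers that a boxplot draws can be inspected directly. As a sketch on made-up data (the vector cost is an assumption for illustration), boxplot.stats() returns the five values used for the lower whisker, the two box edges, the median line and the upper whisker, together with any points that would be plotted individually as outliers.

```r
# Made-up data with one unusually large value and one missing value.
cost <- c(150, 160, 170, 180, 190, 200, 210, 220, 230, 400, NA)

b <- boxplot.stats(cost)   # missing values are dropped automatically
box_values <- b$stats      # lower whisker, box edge, median, box edge, upper whisker
outliers <- b$out          # points more than 1.5 box lengths beyond the box
n_used <- b$n              # number of non-missing observations

print(box_values)
print(outliers)
```

For these values the box runs from 170 to 220 with the median at 195, the whiskers stop at 150 and 230, and 400 is flagged as an outlier.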

  9

 Whatmightbeemoreinte restingisaside‐by‐side e boxplotcomparingth hecostfor productsofflowandhigh g useof[squaarebracketss]totellRw wewanttofocuson productimportance.Todothisweagainmakeu specificsubsettsofthedatta: > clr=c("dod c dgerblue3","seagreeen4") > n amespi=c( ("low","high") > boxplot(da b ataset$Cost_of_Prodduct[datas set$Producct_importa ance=="low w"],datase et$Cost_o f_P rod uct[datas set$Produc ct_importaance=="hig gh"],col=cclr,ylab=" "Cost",nam mes=namesp pi) > title(main t n="Boxplot ts of costt for low and high product importance i e")



  earnfromthisboxplott? Whatcanwele   ushouldnow w beableto ocomplete therealdaataanalysessfromAssiggnment1.I fyouhaveanyquestioons You abo outthisRtuutorial,plea seaskonPiazzausingthe‘RCode’tag!
