Title | Stats 250 Completed Notes |
---|---|
Author | Han Na Shin |
Course | Introduction to Statistics and Data Analysis |
Institution | University of Michigan |
Pages | 212 |
Statistics 250
Interactive Lecture Notes Solutions
Fall 2015–Winter 2016 Dr. Brenda Gunderson Department of Statistics University of Michigan
Stat 250 Gunderson Lecture Notes Introduction
"Statistics ... the most important science in the whole world: for upon it depends the practical application of every other science and of every art: the one science essential to all political and social administration, all education, all organization based on experience, for it only gives results of our experience." Florence Nightingale, Statistician
Definitions:
Statistics are numbers measured for some purpose.
Statistics is a collection of procedures and principles for gathering data and analyzing information in order to help people make decisions when faced with uncertainty.
Course Goal: Learn various tools for using data to gain understanding and make sound decisions about the world around us.
1: Summarizing Data
“You must never tell a thing. You must illustrate it. We learn through the eye and not the noggin.” -- Will Rogers (1879-1935)
“Simple summaries of data can tell an interesting story and are easier to digest than long lists.” So we will begin by looking at some data.
Raw Data
Raw data correspond to numbers and category labels that have been collected or measured but have not yet been processed in any way. Below is a set of RAW DATA: information about a group of items, in this case individuals. The data set title is DEPRIVED and it has information about a sample of n = 86 college students. For each student we are provided with their answer to the question “Do you feel that you are sleep deprived?” (yes or no), and their self-reported typical amount of sleep per night (in hours). The information we have is organized into variables. In this case these 86 college students are a subset from a larger population of all college students, so we have sample data.
Definitions:
A variable is a characteristic that differs from one individual to the next.
Sample data are collected from a subset of a larger population.
Population data are collected when all individuals in a population are measured.
A statistic is a summary measure of sample data.
A parameter is a summary measure of population data.
Types of Variables
We have 2 variables in our data set. Next we want to distinguish between the different types of variables: different types of variables provide different kinds of information, and the type will guide what kinds of summaries (graphs/numerical) are appropriate.
Think about it:
Could you compute the “AVERAGE AMOUNT OF SLEEP” for these 86 students? YES
Could you compute the “AVERAGE SLEEP DEPRIVED STATUS” for these 86 students? NO (we could code the responses as numbers, but the coding would be arbitrary: 0 and 1, or any two values like 1 and 203)
SLEEP DEPRIVED STATUS is said to be a CATEGORICAL variable.
AMOUNT OF SLEEP is a QUANTITATIVE variable.
Definitions:
A categorical variable places an individual or item into one of several groups or categories. When the categories have an ordering or ranking, it is called an ordinal variable.
A quantitative variable takes numerical values for which arithmetic operations such as adding and averaging make sense. Other names for quantitative variable are: measurement variable and numerical variable.
Try It! For each variable listed below, give its type as categorical or quantitative.
Age (years): QUANTITATIVE
Typical Classroom Seat Location (Front, Middle, Back): CATEGORICAL
Number of songs on an iPod: QUANTITATIVE
Time spent studying material for this class in the last 24-hour period (in hours): QUANTITATIVE
Soft Drink Size (small, medium, large, super-sized): CATEGORICAL (ordinal)
The “And then...” count recorded in a psychology study on children (details will be provided), that is, the number of times a child says the phrase “and then” when asked to recall a study they just heard: QUANTITATIVE
Looking ahead: Later, when we talk about random variables, we will discuss whether a variable is modeled discretely (because its values are countable) or whether it would be modeled continuously (because it can take any value in an interval or collection of intervals). Go back through the list above and think about each variable: is it discrete or continuous?
DATA SET = DEPRIVED From Utts, Jessica M. and Robert F. Heckard. Mind on Statistics, Fourth Edition. 2012. Used with permission.
Feel Sleep Deprived? No No No Yes Yes Yes Yes Yes No No No No Yes Yes Yes No No No Yes Yes No No No Yes No No No No No Yes Yes Yes No Yes No Yes Yes Yes Yes Yes Yes No Yes
Amount Sleep per Night (hours) 9 7 8 7 7 8 7 8 10 8 9 8 8 4 6 8 10 4 7 8 9 9 7 8 9 9 8 6 9 7 11 7 9 7 8 7 7 9 1 7 6 8 6
Feel Sleep Deprived? No No No Yes Yes Yes Yes Yes No Yes No No Yes Yes No No Yes Yes Yes No Yes Yes No Yes Yes Yes Yes Yes Yes Yes No No Yes Yes Yes Yes Yes Yes No No Yes Yes Yes
Amount Sleep per Night (hours) 8 7 9 7 7 7 7 6 8 6 9 8 7 8 8 8 7 7 7 7 7 8 7 7 7 7 8 6 6 8 9 7 8 6 7 8 5 6 7 8 8 7 6
Our data set is somewhat large, containing a lot of measurements in a long list. Presented as a table listing, we can view the record of a particular college student, but it is just a listing, and it is not easy to find the largest value for the amount of sleep or the number of students who felt they are sleep deprived. We would like to learn appropriate ways to summarize the data.
Summarizing Categorical Variables
Numerical Summaries
How would you go about summarizing the SLEEP DEPRIVED STATUS data? The first step is to simply count how many individuals/items fall into each category. Since percents are generally more meaningful than counts, the second step is to calculate the percent (or proportion) of individuals/items that fall into each category.
Count = Frequency
Percent or Proportion is ok!
Sleep Deprived? | Count | Percent |
---|---|---|
Yes | 51 | (51/86)*100 = 59.3% |
No | 35 | (35/86)*100 = 40.7% |
Total | 86 | 100% |
The table above provides both the frequency distribution and the relative frequency distribution for the variable SLEEP DEPRIVED STATUS.
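The two-step summary (count, then convert to percent) can be sketched in a few lines of Python. This is only an illustration, not the R workflow used in the course; the `responses` list is a hypothetical stand-in rebuilt from the counts in the table, so the original ordering of the answers is not preserved.

```python
from collections import Counter

# Hypothetical stand-in for the 86 answers, rebuilt from the table's counts
# (51 Yes, 35 No); the original order of the responses is not preserved.
responses = ["Yes"] * 51 + ["No"] * 35

counts = Counter(responses)        # step 1: frequency distribution
n = sum(counts.values())           # sample size

# Step 2: relative frequency distribution, expressed in percent.
percents = {category: round(100 * count / n, 1)
            for category, count in counts.items()}

for category in ("Yes", "No"):
    print(f"{category:>3}  count={counts[category]:2d}  percent={percents[category]}%")
```

Dividing by `n` alone (without the factor of 100) would give proportions instead of percents.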
Visual Summaries
There are two simple visual summaries for categorical data: a bar graph or a pie chart. Here is the table summary and bar graph made with R. If you were making one: Don't forget to label each axis and show some values on each axis!
counts:
Deprived
 No Yes
 35  51
percentages:
Deprived
  No  Yes
40.7 59.3
Aside: Does it matter whether the 'No' or 'Yes' bar is given first? No, the variable is not ordinal here, so we should not comment on 'shape' (i.e., do not use words like “skewed” or “increasing pattern” here).
Pie Chart: Another graph for categorical data which helps us see what part of the whole each group forms. Pie charts are not as easy to draw by hand. It is not as easy to compare sizes of pie pieces versus comparing heights of bars. Thus we will prefer to use a bar graph for categorical data.
Recap: We have discussed that some variables are categorical and others are quantitative. We have seen that bar graphs and pie charts can be used to display data for categorical variables. We turn next to displaying the data for quantitative variables.
Exploring Features of Quantitative Data with Pictures
Recall our Sleep Deprived Data for n = 86 college students. We have data on two variables: sleep deprived status and hours of sleep per night. How would you go about summarizing the sleep hours data? These measurements do vary. How do they vary? What is the range of values? What is the pattern of variation?
Find the smallest value = 1 and largest value = 11.
Take this overall range and break it up into intervals (of equal width). What might be reasonable here? Perhaps by 2's; but we need to watch the endpoints.
Summary Table:
Class | Frequency (or count) | Relative Frequency (or proportion) | Percent |
---|---|---|---|
[0,2] | 1 | 1/86 = 0.012 | 1.2% |
(2,4] | 2 | 2/86 = 0.023 | 2.3% |
(4,6] | 12 | 0.139 | 13.9% |
(6,8] | 56 | 0.651 | 65.1% |
(8,10] | 14 | 0.163 | 16.3% |
(10,12] | 1 | 0.012 | 1.2% |
Watch the endpoints: different software will treat the endpoints differently. With the table we can readily draw a histogram.
Graph for quantitative data = Histogram:
Note: if we divide 'count' by 86, we would have proportion, but the 'picture' would look the same.
Note: each bar represents a class, and the base of the bar covers the class.
The above table and histogram show the distribution of this quantitative variable SLEEP HOURS, that is, the overall pattern of how often the possible values occur.
Remember to label axes and add some values!
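As a rough check on the summary table, the class counts can be recomputed from the raw sleep hours. The Python sketch below is only an illustration (the course itself uses R); the helper name `class_index` is hypothetical, and it assumes the same endpoint convention as the table: classes of width 2 that are closed on the right, with the first class also including 0.

```python
import math
from collections import Counter

# Hours of sleep for the n = 86 students, copied from the DEPRIVED listing.
sleep_hours = [
    9, 7, 8, 7, 7, 8, 7, 8, 10, 8, 9, 8, 8, 4, 6, 8, 10, 4, 7, 8, 9, 9,
    7, 8, 9, 9, 8, 6, 9, 7, 11, 7, 9, 7, 8, 7, 7, 9, 1, 7, 6, 8, 6,
    8, 7, 9, 7, 7, 7, 7, 6, 8, 6, 9, 8, 7, 8, 8, 8, 7, 7, 7, 7, 7, 8,
    7, 7, 7, 7, 8, 6, 6, 8, 9, 7, 8, 6, 7, 8, 5, 6, 7, 8, 8, 7, 6,
]

WIDTH = 2
labels = ["[0,2]", "(2,4]", "(4,6]", "(6,8]", "(8,10]", "(10,12]"]


def class_index(x):
    """Index of the right-closed class containing x; 0 goes in the first class."""
    return max(0, math.ceil(x / WIDTH) - 1)


tally = Counter(class_index(x) for x in sleep_hours)
frequencies = [tally.get(i, 0) for i in range(len(labels))]
n = len(sleep_hours)

for label, count in zip(labels, frequencies):
    print(f"{label:>8}  count={count:2d}  proportion={count / n:.3f}")
```

Switching to left-closed classes such as [4,6) would move the boundary values (4, 6, 8, 10) into the next class up, which is why the endpoint convention matters.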
R Histograms (default on the left and customized on the right):
[Figure: two R histograms of sleep hours]
How to interpret?
Look for Overall Pattern
Three summary characteristics of the overall distribution of the data:
Shape (approximately symmetric, skewed, bell-shaped, uniform)
Location (center, average): approximately the middle value or where it would balance
Spread (variability): range (overall and then where most of the observations are)
Look for deviations from Overall Pattern
Outliers = data points that are not consistent with the bulk of the data. Outliers should not be discarded without justification.
Describe the distribution for SLEEP HOURS:
Approximately bell-shaped, symmetric distribution, unimodal, centered around 7 hours, with most values between 4 and 10 hours. No apparent outliers.
What if ... you had some data and you made a histogram of it and it looked like this...
[Figure: histogram with two peaks; vertical axis: Count]
What would it tell you?
Response: We would call this a bimodal distribution. There appear to be two subgroups of observations. It would be best to investigate why (e.g., maybe M/F or OLD/YOUNG). It may lead to analyzing data separately for each group.
Other comments:
NO SPACE BETWEEN BARS! Unless there are no observations in that interval.
How Many Classes? Use your judgment: generally somewhere between 6 and 15 intervals.
Better to use relative frequencies on the y-axis when comparing two or more sets of observations.
Software has defaults and many options to modify settings.
One More Example:
A study was conducted in Detroit, Michigan to find out the number of hours children aged 8 to 12 years spent watching television on a typical day. A listing of all households in a certain housing area having children aged 8 to 12 years was first constructed. Out of the 100 households in this listing, a random sample of 20 households was selected and all children aged 8 to 12 years in the selected households were interviewed. The following histogram was obtained for all the children aged 8 to 12 years interviewed.
(Note on the histogram: How many are in the first class? 3)
a. Complete the sentence: Based on this histogram, the distribution of number of hours spent watching TV is unimodal, with a slight skewness to the ___left___.
b. Assuming that all children interviewed are represented in the histogram, what is the total number of children interviewed?
3 + 6 + 9 + 10 + 4 = 32
c. What proportion of children spent less than 2 hours watching television?
(3 + 6)/32 = 0.281 or about 28%
d. Can you determine the maximum number of hours spent watching television by one of the interviewed children? If so, report it. If not, explain why not.
No, it is somewhere between 4 and 5, but the exact value is not known for sure.
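The arithmetic in parts b and c can be reproduced from the bar heights alone. The Python sketch below assumes, as the answers above imply, that the histogram has five one-hour classes from 0 to 5 hours with counts 3, 6, 9, 10 and 4; the variable names are illustrative.

```python
# Bar heights read off the histogram, assumed to be one-hour classes
# covering 0-1, 1-2, 2-3, 3-4 and 4-5 hours of television.
class_counts = [3, 6, 9, 10, 4]

# Part b: every child falls in exactly one class, so the total is the sum.
total_children = sum(class_counts)

# Part c: "less than 2 hours" corresponds to the first two classes.
proportion_under_two = sum(class_counts[:2]) / total_children

print(total_children)
print(round(proportion_under_two, 3))
```

Part d has no such calculation: a histogram keeps only the class counts, so the individual values inside the last class cannot be recovered.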
Numerical Summaries of Quantitative Variables
We have discussed some interesting features of a quantitative data set and learned how to look for them in pictures (graphs). Section 2.5 focuses on numerical summaries of the center and the spread of the distribution (appropriate for quantitative data only).
Notation for a generic raw set of data: $x_1, x_2, x_3, \ldots, x_n$ where n = # items in the data set or sample size
Describing the Location or Center of a Data Set
Two basic measures of location or center:
Mean -- the numerical average value
We represent the mean of a sample (called a statistic) by
$$\bar{x} = \frac{x_1 + x_2 + \cdots + x_n}{n} = \frac{\sum x_i}{n}$$
Median -- the middle value when the data are arranged from smallest to largest.
n odd: M = middle observation; n even: M = average of the two middle observations
Try It! French Fries
Weight measurements for 16 small orders of French fries (in grams).
78 72 69 81 63 67 65 75 79 74 71 83 71 79 80 69
What should we do with data first? Graph it!
Based on our histogram, the distribution of weight is unimodal and approximately symmetric, so computing numerical summaries is reasonable. The weights (in grams) range from the 60's to the lower 80's, centered around the lower 70's.
1. Compute the mean weight.
$$\bar{x} = \frac{78 + 72 + 69 + \cdots + 80 + 69}{16} = \frac{1176}{16} = 73.5 \text{ grams}$$
Does 73.5 make sense? (Yes: look at the histogram.) Would 83? (No.)
2. Compute the median weight.
Ordered: 63, 65, 67, 69, 69, 71, 71, 72, 74, 75, 78, 79, 79, 80, 81, 83
(n+1)/2 = (16+1)/2 = 8.5, so average the 8th and 9th observations => (72+74)/2 = 73
Note: ½ of the observations are above 73 and ½ are below it.
3. What if the smallest weight was incorrectly entered as 3 grams instead of 63 grams?
Median would stay the same. Mean would decrease.
Note: The mean is sensitive to extreme observations. The median is resistant to extreme observations. Most graphical displays would have detected such an outlying value.
So the median is the better measure of center if there are outliers or the distribution is strongly skewed.
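The effect of the data-entry error in question 3 can be checked numerically. This Python sketch uses the standard library's `statistics` module; the list name `with_typo` is illustrative.

```python
from statistics import mean, median

# Weights (in grams) of the 16 small orders of French fries.
weights = [78, 72, 69, 81, 63, 67, 65, 75, 79, 74, 71, 83, 71, 79, 80, 69]

# The same data with the smallest weight mistyped as 3 instead of 63.
with_typo = [3 if w == 63 else w for w in weights]

print("original:", mean(weights), median(weights))
print("with typo:", mean(with_typo), median(with_typo))
```

The mean drops from 73.5 to 69.75 grams because every value enters the sum, while the median stays at 73 grams because the two middle observations are unchanged.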
Some Pictures: Mean versus Median
Describing Spread: Range and Interquartile Range
Midterms are returned and the “average” was reported as 76 out of 100. You received a score of 88. How should you feel? Happy to just be above average?
Often what is missing when the “average” of something is reported is a corresponding measure of spread or variability. Here we discuss various measures of variation, each useful in some situations, each with some limitations.
Range: Measures the spread over 100% of the data.
Range = High value – Low value = Maximum – Minimum
Percentiles: The pth percentile is the value such that p% of the observations fall at or below that value.
Some Common Percentiles:
Median: 50th percentile, Q2 or .50
First quartile: 25th percentile, Q1 or .25 (median of values below median)
Third quartile: 75th percentile, Q3 or .75 (median of values above median)
Five Number Summary:
Variable Name and Units (n = number of observations)
Median | M | |
---|---|---|
Quartiles | Q1 | Q3 |
Extremes | Min | Max |
Provides a quick overview of the data values and information about the center and spread. Divides the data set into approximate quarters.
Interquartile Range: Measures the spread over the middle 50% of the data.
Try It! French Fries Data
Ordered: 63, 65, 67, 69, 69, 71, 71, 72, 74, 75, 78, 79, 79, 80, 81, 83
Find the five-number summary: Provides measures of location and spread
Weight of Fries (in grams) (n = 16 orders)
Median | 73 | |
---|---|---|
Quartiles | 69 | 79 |
Extremes | 63 | 83 |
Range: 83 – 63 = 20 grams
IQR = Q3 – Q1 = 79 – 69 = 10 grams
And confirming these values using R we have:
> numSummary(FrenchFries[,"Weight"], statistics=c("mean", "sd", "IQR",
+ "quantiles"), quantiles=c(0,.25,.5,.75,1))
 mean     sd IQR 0% 25% 50% 75% 100%  n
 73.5 6.0663  10 63  69  73  79   83 16
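The hand calculation can also be sketched in Python. The function below follows the rule stated in these notes, where Q1 and Q3 are the medians of the values below and above the median; `five_number_summary` is a hypothetical helper, not a library function. Statistical software often uses other quartile rules, so results can differ slightly on other data sets.

```python
from statistics import median


def five_number_summary(data):
    """Return (min, Q1, median, Q3, max) using the median-of-halves rule."""
    xs = sorted(data)
    n = len(xs)
    half = n // 2
    lower = xs[:half]             # values below the median position
    upper = xs[half + n % 2:]     # values above it (middle value skipped if n is odd)
    return xs[0], median(lower), median(xs), median(upper), xs[-1]


weights = [78, 72, 69, 81, 63, 67, 65, 75, 79, 74, 71, 83, 71, 79, 80, 69]

minimum, q1, med, q3, maximum = five_number_summary(weights)
data_range = maximum - minimum    # spread over 100% of the data
iqr = q3 - q1                     # spread over the middle 50% of the data

print(minimum, q1, med, q3, maximum)
print("range:", data_range, "IQR:", iqr)
```

For the French fries data this agrees with both the hand calculation and the R output above: 63, 69, 73, 79, 83, with a range of 20 grams and an IQR of 10 grams.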
Example: Test Scores
The five-number summary for the distribution of test scores for a very large math class is provided below:
Test Score (points) (n = 1200 students)
Median | 58 | |
---|---|---|
Quartiles | 46 | 78 |
Extremes | 34 | 95 |
1. What is the test score interval containing the lowest ¼ of the students?
34 to 46 points
2. Suppose you scored a 46 on the test. What can you say about the percentage of students who scored higher than you?
75%
3. Suppose you scored a 50 on the test. What can you say about the percentage of students who scored higher than you?
Between 50% and 75%
4. If the top 25% of the students received an A on the test, what was the...