STAT1008 Lecture Notes PDF

Title STAT1008 Lecture Notes
Author Dorina Wu
Course Quantitative Research Methods
Institution Australian National University
Pages 36




STAT1008

CHAPTER 1 – DEFINING AND COLLECTING DATA

CAR PREFERENCES EXAMPLE

• Source: http://www.statsci.org/data/oz/carprefs.html
• Study aim: Data was collected to examine preferences for cars among men and women. Do women prefer smaller cars? Previous evidence showed that women who are involved in a car accident are more likely to suffer severe injury. Is this because they prefer smaller cars?
• See file Carprefs.pdf for an explanation of the variables.

SOME DEFINITIONS

• What does a row in the dataset represent?
  o Each row is a unit of observation: the entity/item on which we collect data.
• What information is contained in the columns?
  o Each column is a variable (a characteristic/feature of each unit of observation).
  o Rows and columns together form a data set.
• Note the tabular format of presenting data, which we will focus on in this class.



DATA COLLECTION

• How was the data collected? Do we have an unbiased sample?
  o Data was collected by face-to-face survey: research assistants approached people at random outside shopping centres and university car parks in Newcastle.
• What might the data be used for? (That is, what questions could we answer after analysing the data?)
  o The same data set can be used to answer all sorts of questions using different statistical tools.
• What statistical tools will we need to carry out the analysis?

POPULATION AND SAMPLE

• Population: all members of the group about which you want to draw a conclusion.
  o Q: What is the population of interest for the cars example?
• Sample: the portion of the population selected for analysis.
• Why do we analyse the sample rather than the entire population? Isn't it better to have information on all members of a population, and not just a subset?
• Parameter: a numerical measure that describes a characteristic of the population.
• Statistic: a numerical measure that describes a characteristic of the sample. It is calculated using the sample data and used as an estimate of the population parameter.

TYPES OF VARIABLES

• Why is it important to classify variables by type?
• Categorical (values are class labels or levels)
  o nominal (no natural order, e.g. make of car = {Ford, Subaru, Hyundai, …})
  o ordinal (natural order, e.g. very unsatisfied, unsatisfied, neutral, satisfied, very satisfied)
• Numerical (a measurement number: height, weight, age, number of children, …)
  o discrete (e.g. a count, number of car trips in past week = 0, 1, 2, …)
  o continuous (e.g. age, height, …)

CAR PREFERENCES EXAMPLE

AGE – NUMERICAL, CONTINUOUS

• Numerical – has a numerical value that quantifies the amount/size of something. Age is numerical because it quantifies how many years old the person is.
• Continuous – can take on any value between specified limits. Although age is usually reported in whole numbers, your exact age in years, months, days etc. can be reported as a real number. E.g. if your 18th birthday was exactly 3 months ago, your age is 18 + 3/12 = 18.25 years.

SEX: 1=FEMALE, 2=MALE – CATEGORICAL, NOMINAL

• Categorical – values fall into two or more classes. The classes can be coded as numbers as above, but the numbers are mere labels and have no quantitative meaning.
  o The variable Sex in the data set is a 2-level categorical variable → female and male.
• Nominal – no ranking is implied by the levels of the categorical variable.
  o In this case, the coding 1=Female and 2=Male does not imply that females are better than males, nor vice versa.
• LicYr and Mth?
• ActCar/Kids5/Kids6/PrefCarCar15L/Reason?
• What if, instead of Kids5/Kids6, participants were asked to provide the number of children they had under the age of 16? Call this variable Kids_count.
  o Kids_count – numerical, discrete.
  o Discrete – numerical responses that arise from some counting process.
• Cost/Reliable…./Colour – categorical, ordinal. Responses are categorised into 4 levels.
  o Ordinal – natural ordering/ranking implied by the distinct categories.
  o 1=not important, 2=little importance, 3=important, 4=very important → the levels 1 to 4 are in increasing order of importance.
• When classifying variables in your data set by type, you are making an assumption about the structure of the data.
• Different variable types imply different data structures and constraints, and convey different information.
• The variable type assumption(s) affect the choice of statistical tools used to analyse and model the data.
• It is important to make valid variable type assumptions and to correctly incorporate the assumed data structure into your analysis. E.g. what are the implications of assuming a nominal categorical variable is ordinal?

DATA COLLECTION METHODS

• Car preferences survey – data was collected by face-to-face interviews of people randomly selected in the university car park and shopping centre.
  o Pros?
  o Cons?
• Other methods of data collection?
• It is important to understand how the data were collected, as it affects the makeup of your sample and whether the sample is representative of your target population.
• For the car preferences survey, does the data collection method enable unbiased estimates for the Newcastle driving population?
• What if the survey was undertaken only in the university car park?
• Can the current sampling method be used to produce a sample that is representative of the Australian driving population?
• What if the survey was conducted by randomly calling home phone numbers in the Newcastle region? Would the sample be biased/unbiased?

TYPES OF SURVEY SAMPLING METHODS

• Sampling frame – the list of items that make up the population.
  o What is the sampling frame for the car preferences data set?
• Once you select a frame, you draw a sample from that frame.
• Non-probability sample – items are selected without knowing their probability of selection. Convenient and low cost, but selection bias is problematic.
  o E.g. convenience sampling, judgement sampling, quota sampling
  o Convenience sampling → you survey people only in a specific area, for the sake of convenience. This effectively causes coverage error: people outside the suburb you are in are cut off, creating a biased, unreliable sample.
  o Judgement sampling → you survey a specific, hand-picked sample (people have to fit certain criteria) to reach a desired outcome, causing a biased, subjective result. E.g. a politician looks at historical records of individuals likely to vote for them and interviews them to create survey results.
  o Quota sampling → you survey until you reach a specific quota and cut people off afterwards, which causes coverage error.
• Probability sample – selection probabilities are known; produces unbiased samples.

SIMPLE RANDOM SAMPLE

• Simple random sample – every item in the frame has an equal chance of being selected.
• E.g. selecting 50 employees from a company of 5000 employees, by drawing names from a hat, to participate in a new employee training program.
  o Everyone has the same chance of being selected: 50/5000 = 1/100.
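
The hat-drawing example can be sketched in Python as below. This is only an illustration: the employee IDs 1 to 5000 and the seed are assumptions standing in for the real list of names.

```python
import random

# Hypothetical sampling frame: employee IDs 1..5000 stand in for the names in the hat.
frame = list(range(1, 5001))

rng = random.Random(1008)          # fixed seed so the draw is reproducible
sample = rng.sample(frame, k=50)   # 50 distinct employees, drawn without replacement

# Every employee has the same chance of ending up in the sample.
selection_probability = len(sample) / len(frame)
print(sorted(sample)[:5])
print(selection_probability)       # 0.01, i.e. 1/100
```

Changing the seed gives a different but equally valid simple random sample.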

STRATIFIED SAMPLING

• Divide the frame into subpopulations (strata), then perform simple random sampling within each stratum.
• Pros: ensures specific groups of the population are represented (equally, in the example below).
• E.g. divide company employees into junior staff and senior staff, then randomly sample 25 employees from each group.
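
A minimal sketch of the junior/senior example, assuming a made-up split of 4000 junior and 1000 senior staff (the notes do not give the group sizes):

```python
import random

# Hypothetical frame: the 4000/1000 split between strata is an assumption.
strata = {
    "junior": [f"J{i}" for i in range(4000)],
    "senior": [f"S{i}" for i in range(1000)],
}

rng = random.Random(1008)  # fixed seed so the draw is reproducible

# Simple random sample of 25 employees within each stratum.
stratified_sample = {
    name: rng.sample(members, k=25) for name, members in strata.items()
}

for name, chosen in stratified_sample.items():
    print(name, len(chosen))
```

Both groups contribute exactly 25 people, even though senior staff are only a fifth of this hypothetical company.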

SYSTEMATIC SAMPLING

• Systematic sampling – start with the kth item (e.g. k=10, 20, ...) in the sampling frame, then pick every kth item thereafter.
• Example use: product testing in a manufacturing factory → companies use this to test the quality of the sample.
• Prone to selection bias, given that the probability of selection will be affected by the order in which the items in the frame appear.
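
As a rough sketch of the rule "take the kth item, then every kth item after it", assuming a hypothetical frame of 100 numbered products:

```python
def systematic_sample(frame, k):
    """Return the items at 1-based positions k, 2k, 3k, ... of the frame."""
    return frame[k - 1::k]


# Hypothetical frame: 100 products numbered in production order.
frame = list(range(1, 101))
picked = systematic_sample(frame, k=10)
print(picked)  # [10, 20, 30, 40, 50, 60, 70, 80, 90, 100]
```

The output depends entirely on the ordering of the frame, which is why the order of items can introduce selection bias.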

CLUSTER SAMPLING

• Cluster sampling – divide the items in the frame into clusters. Take a random sample of clusters, then collect data on every item in each sampled cluster.
  o An advantage is that you can sample many people at once; however, because people in the same cluster tend to share common attributes, you will be unable to cover a broad scope.
  o Cost and time efficient.
• A cluster sample typically gives less precise estimates than a systematic or simple random sample of the same size (especially if values tend to be similar within the same cluster), but it can be much cheaper.
• Cluster examples: households, postcode, electorate (more cost-effective than SRS if the population is spread over a large region).
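
A small sketch of the two-step procedure, assuming a made-up frame of 20 postcodes with 30 households each:

```python
import random

# Hypothetical frame: 20 postcodes (clusters), each containing 30 households.
clusters = {
    f"postcode_{c}": [f"household_{c}_{i}" for i in range(30)] for c in range(20)
}

rng = random.Random(1008)  # fixed seed so the draw is reproducible

# Step 1: take a random sample of clusters.
chosen_clusters = rng.sample(sorted(clusters), k=4)

# Step 2: collect data on every item in each sampled cluster.
cluster_sample = [hh for c in chosen_clusters for hh in clusters[c]]

print(chosen_clusters)
print(len(cluster_sample))  # 4 clusters x 30 households = 120
```

Only four postcodes need to be visited, which is where the cost saving comes from.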

CAR PREFERENCES EXAMPLE – WHAT IS THE SAMPLING METHOD?

Data collectors were instructed to obtain data from men and women with small, medium and large cars, with 50 people per group for a total of 300 respondents.

• People's groups (type of car) are not known in advance.
• Approach people until the required target in each group is met → QUOTA SAMPLING

SURVEY ERROR – CAN YOU TRUST THE DATA SOURCE?

Data is prone to errors. Four main types of survey errors:

COVERAGE ERROR

• Coverage error occurs when certain groups of items are excluded from the sampling scheme.
• Is coverage error an issue for the car preferences survey? Yes: the people present at a shopping centre are mostly those who live in the area, so other groups (such as hospital workers, among many others) may be excluded.

NON-RESPONSE ERROR

• Non-response error is the failure to collect data on all items in the survey, often denoted as a blank data entry.
• E.g. income level is typically a sensitive question and is often missing.

SAMPLING ERROR

• Sampling error reflects the chance differences from sample to sample: each sample contains different people, so each gives a slightly different result.
• Explain how sampling error would arise in the car preferences survey – depending on the time of day or day of the week, a different demographic would be present, which may affect the results of the survey.
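
Sample-to-sample variation can be seen by drawing several random samples from the same population and comparing their means. The sketch below uses an invented population of 10,000 ages between 18 and 80; none of these numbers come from the car preferences data.

```python
import random
import statistics

rng = random.Random(1008)  # fixed seed so the draws are reproducible

# Hypothetical population: 10,000 driver ages between 18 and 80.
population = [rng.randint(18, 80) for _ in range(10_000)]
population_mean = statistics.mean(population)  # the parameter

# Five independent simple random samples of 50 people each.
sample_means = [
    statistics.mean(rng.sample(population, k=50)) for _ in range(5)
]  # five values of the statistic

print(round(population_mean, 2))
print([round(m, 2) for m in sample_means])
```

The five sample means differ from each other and from the population mean purely by chance, even though the sampling method is the same each time.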

MEASUREMENT ERROR

• Measurement error occurs when values recorded in the survey are different from the true response.
• E.g. a leading question ("Do you have a problem with your boss?" vs "Tell me about your relationship with your boss"); incorrect interpretation of a question.
• Cars survey example – respondents inaccurately report their car type because they are ashamed it is an old model.

CHAPTER 2: ORGANISING AND VISUALISING DATA

RECAP – CHAPTER 1

Need to know:
• Appreciate the need for data
• Be familiar with how data is stored (rows and columns in a table; what do these represent?)
• Identify different sources of data
• Recognise different variable types in a data set
• Distinguish between a population and a sample, and between a parameter and a statistic
• Understand the requirement for unbiased random samples
• Be familiar with and identify common survey sampling methods
• Be familiar with and identify sources of survey error

REVIEW QUESTION – SURVEY ERROR

1. The Crime Victimisation Survey is designed to provide statistics on crime related events. Respondents are asked to recall their experiences of events that occurred in the last 12 months. What is a potential source of survey error from this method of data collection?
   a. Sampling error → exists in every survey, but unlikely to be material
   b. Coverage error → depends on the sampling frame
      i. E.g. if it is a register of crimes reported to police, only those who brought their crime to the police will be asked to participate in the survey; those who suffer a crime but don't report it will be excluded → this happens a lot in the news
   c. Non-response error → information may be sensitive if they are asking for lots of detail in the survey
2. A survey is conducted to measure the level of physical activity among year 12 students. Prior to completing the survey, students are given information on the benefits of physical activity. What is a potential source of survey error from reading the additional information prior to completing the survey?
   a. Having read about the benefits of physical activity, students may overstate the amount of physical activity they do → they may try to hide how little physical activity they do → this leads to measurement error

SUMMARISING CATEGORICAL DATA

• Data: GRADES_Ch2.xls (from textbook)
• Summary table: gives the frequency/count/proportion of the data in each level of the categorical variable (ordinal or nominal).
  o For each level of the categorical variable, we have the number of observations in the sample at that level.
  o It can be reported as a count or as a percentage.
• The way you summarise data depends on whether it is categorical or numerical.
• To calculate the frequency table in Excel, e.g. for HD:
  o Highlight the column → =COUNTIF(range, criteria)
  o Type =COUNTIF(D2:D56, and then click on the cell containing HD to supply the criteria and close the bracket. Writing the range as $D$2:$D$56 keeps it fixed when the formula is copied.
  o Hover your mouse over the bottom right corner of the cell, where it becomes a little cross, and then drag the rectangle down to apply the formula to the rest of the frequency column.
• To calculate the percentage:
  o Bring the number from the frequency column over to the percentage column and then "/" the total number → click "Home"
  o E.g. =O4/$O$9 → the "$" signs fix the reference to the total, and because it is a formula the percentages update if the data change as grades are appealed.
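
The same summary table can be sketched outside Excel. The grades below are invented for illustration (the real values are in GRADES_Ch2.xls), and the level names other than HD are assumptions.

```python
from collections import Counter

# Hypothetical data; only "HD" is named in the notes, the other levels are assumed.
grades = ["HD", "D", "CR", "P", "P", "CR", "D", "N", "P", "CR"]
levels = ["HD", "D", "CR", "P", "N"]

counts = Counter(grades)      # plays the role of COUNTIF for every level at once
total = sum(counts.values())  # plays the role of the fixed total cell ($O$9)

print("Grade  Frequency  Percentage")
for level in levels:
    frequency = counts[level]
    print(f"{level:<5}  {frequency:>9}  {frequency / total:>10.0%}")
```

Dividing every frequency by the same total is exactly what the absolute reference in =O4/$O$9 achieves.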

VISUALISING CATEGORICAL DATA

• Bar chart: graphical representation of the summary table.
• What conclusions can you draw from the bar chart of grades?

VISUALISING TWO CATEGORICAL VARIABLES

• Data: ROAD_FATALITIES_example.xls
• Two-way contingency table
• Create side-by-side bar charts.
• To find the percentage per gender → =B6/B$17
  o Only put the "$" before the row number (after the second B): the column is left unfixed so it changes when the formula is copied across, while the row stays fixed.
• By graphing percentages you can analyse the data, demonstrating that "20 to …

SCATTER PLOT

• Red line → semester mark = exam mark
  o Since more points lie below the red line, more people do better in semester work than in the exam.
• Strong positive linear relationship

TIME SERIES PLOT

• Plot the data vs time (x-axis). Can observe patterns in the value of a variable over time.

MISLEADING GRAPHS

• Consider 5 sales agents denoted as persons A, B, C, D and E. The bar chart shows their sales volumes for the past month. Which sales agents show outstanding performance relative to the others?
• Consider the revised graph. Which sales agents show outstanding performance relative to the others?
  o The first graph shows a significant difference between A and D; however, this is simply because the y-axis has been condensed → in reality there is very little difference between the two sales agents.
  o No agent performed exceedingly well.
• Bottles are not drawn to scale → no direct mathematical association between the size of the bottle and the actual volume.
• The first graph does not show the whole history of the road fatalities → this creates subjectivity by reducing information.
  o Cannot focus only on immediate impact.

MISLEADING USE OF STATISTICS → KELLOGG'S MINI-WHEATS LAWSUIT

• https://www.npr.org/sections/thesalt/2013/05/30/187330235/no-frosted-mini-wheats-don-t-make-your-kids-smarter
• Kellogg's claim: "Based upon independent clinical research, kids who ate Frosted Mini Wheats cereal for breakfast had up to 18% better attentiveness three hours after breakfast than kids who ate no breakfast".
• Kellogg's agreed to a $4m settlement in a class-action lawsuit because of the deceptive marketing campaign.
  o Deceptive marketing
  o Observational study, not a randomised treatment, so they could not claim that the cereal actually increased attentiveness
  o Other flaws include the small sample size and the disregard of the students who had worse attentiveness

ETHICAL ISSUES (SECTION 3.6 OF TEXTBOOK)

• Results should be presented in a "fair, objective and neutral manner".
  o Should show the whole picture → do not condense the scale or focus on a small time frame.
  o Should not magnify differences in statistics when the difference is minimal.
• It is unethical to choose "an inappropriate summary measure … to distort the facts to support a particular position."
  o Should not hide bad results.
• Results should not be selectively quoted to support a particular conclusion, or to exaggerate certainty. For example, by only showing results from businesses where a new technology gave improvements, or by showing data from a conveniently selected time range.

CHAPTER 3 – NUMERICAL DESCRIPTIVE MEASURES

CHAPTER 2 – NEED TO KNOW

How to summarise, visualise and interpret:
• Categorical data – summary table, bar charts
• Numerical data – frequency table, histogram
• Two categorical variables – contingency table (row, column or grand total based percentages), side-by-side bar plots
• Two numerical variables – scatter plots
• Time series plots

RECAP QUESTION

• The data set blocks.xls includes information on the number of blocked shots during the season for each of the players.
• Summarise the data on 'blocks' in both tabular and graphical format. Interpret your output.
  o Sample size: 176
  o Range: 195 → large range, so one can argue it can be treated as a continuous variable
  o The blocks are numerical discrete variables
  o The bottom two classes are known as modal classes
  o Right-skewed distribution

NUMERICAL DESCRIPTIVE MEASURES

• Graphs and tables are useful to assess some general features of the data collected (range, most common values, shape of data distribution).
  o Can see the shape of the distribution.
• We can also summarise the important features of numerical variables using calculated numerical measures, e.g. sample mean and sample standard deviation.
  o Can see things not obvious in a graphical summary.

SAMPLE MEAN – SIMPLE EXAMPLE

• How much time does an adult spend on their phone per day on average? The following times (in minutes) are collected from a random sample of 12 people:
  o 120 135 150 300 90 110 220 200 100 150 120 90
• In Excel: =AVERAGE(data range)
• Use the "text to columns" function to convert pasted data into separate cells.
• "Paste special" with "transpose" turns data laid out in a row into data laid out in a column.
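
As a quick check of the Excel result, the sample mean of the 12 phone times can be computed directly as the sum of the values divided by the number of values:

```python
# Phone usage times (minutes) for the 12 sampled people, as listed above.
times = [120, 135, 150, 300, 90, 110, 220, 200, 100, 150, 120, 90]

total_minutes = sum(times)                # 1785
sample_mean = total_minutes / len(times)  # 1785 / 12

print(sample_mean)  # 148.75
```

Note that 148.75 minutes is not one of the observed values, which is typical for a mean.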

SAMPLE MEAN

• The mathematical definition of the sample mean is the sum of the values divided by the number of values.
• The mean is a measure of location or central tendency of the data values for a numerical variable.
• Interpretation: typical or central value in the data set.
• The sample average is a commonly reported sample statistic. Examples:
  o https://variety.com/2020/film/news/average-budget-european-films-1203494821/
  o https://www.smh.com.au/business/the-economy/covid-hasn-t-changed-the-way-we-use-our-time-but-one-group-of-women-put-a-high-price-on-theirs-20210226-p5769f.html
• The sample mean does not have to be an actual observed value in the data set.
  o Example: the average number of children born to an Australian woman of childbearing age in 2017 was 1.74.
• All data values are equally weighted in the sample mean calculation, and all the data is used. Therefore, the sample mean will be affected by extremely large or extremely small data points.

ILLUSTRATION OF SENSITIVITY TO EXTREME VALUES

• Suppose in the past month, 243 houses were sold in Canberra. The average sale price was $595,000.
• Suppose I forgot to include in my calculation two houses that sold at the end of the month for $1.8m and $1.9m respectively.
1. How will the addition of these two data points affect the value of the average sale price?
   → It increases the average price → relatively large increase...
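
The size of the increase can be worked out from the figures given, since the original total is the original mean times the number of houses. A short sketch of that arithmetic:

```python
n_original = 243
mean_original = 595_000
extra_sales = [1_800_000, 1_900_000]  # the two forgotten houses

# Recover the original total from the mean, then add the missing sales.
total_original = n_original * mean_original  # 144,585,000
total_updated = total_original + sum(extra_sales)
mean_updated = total_updated / (n_original + len(extra_sales))

print(round(mean_updated, 2))                  # 605244.9
print(round(mean_updated - mean_original, 2))  # 10244.9
```

Two houses out of 245 (under 1% of the sales) move the average by roughly $10,000, which illustrates how sensitive the mean is to extreme values.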

