STAT1008 Lectures - Bronwyn Loong, course convenor. Lecture notes and summary notes of concepts

Author Taara Chaudhuri
Course Quantitative Research Methods
Institution Australian National University
Pages 7


STAT1008 LECTURES SEM 1 2019

Feb 27th Week 1 Lecture 2

Chapter 1 – Defining and Collecting Data

A good data set will have a data dictionary so that you can identify and understand what the variables mean.

- What does a row in the data set represent?
  o Each row is a unit of observation: the entity/item on which we collect data (e.g. a person, animal, day, location etc.)
- What information is contained in the columns?
  o Each column is a variable (a characteristic/feature of each unit of observation)
  o Rows and columns together form a data set. Note the tabular format of presenting data.
- How was the data collected?
  o Data was collected through face-to-face surveys conducted at random by research assistants, etc.
- What might the data be used for?
  o Consider what statistical tools are needed, what answers the data could give and why they are important. The same data set can answer many questions when different tools are applied.
- What is the population of interest?
  o Population: all members of the group about which you want to draw a conclusion
- What is the study sample?
  o Sample: the portion of the population selected for analysis
  o Why do we analyse the sample rather than the entire population? Isn’t it better to have information on all members of a population, and not just a subset?
- Give examples of some parameters of interest.
  o Parameter: a characteristic of the population
- What statistics can we use to estimate the parameters?
  o Statistic: a numerical measure that describes a characteristic of the sample, calculated using the sample data.
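
To make the parameter/statistic distinction concrete, here is a minimal Python sketch. The population of ages is invented for illustration (it is not the course data set), and the names `population`, `sample`, `population_mean` and `sample_mean` are assumptions of this example.

```python
import random
import statistics

# Hypothetical population: ages (in years) of 1000 drivers, invented for illustration.
rng = random.Random(1008)  # fixed seed so the sketch is repeatable
population = [rng.randint(18, 80) for _ in range(1000)]

# Parameter: a characteristic of the whole population (here, the mean age).
population_mean = statistics.mean(population)

# Statistic: the same characteristic calculated from a sample of 50 units.
sample = rng.sample(population, 50)
sample_mean = statistics.mean(sample)

print(f"Parameter (population mean): {population_mean:.2f}")
print(f"Statistic (sample mean):     {sample_mean:.2f}")
```

The sample mean will usually be close to, but not exactly equal to, the population mean. In practice the parameter is unknown, which is why it is estimated from a sample.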

Types of variables

- Why is it important to classify variables by type? (can filter your data to find specific answers)
  o Let’s classify the variables in the car preferences data set.

1. Age – numerical, continuous
  o Numerical – has a numerical value that quantifies the amount/size of something. Age is numerical because it quantifies how many years old the person is.
  o Continuous – can take on any value between specified limits. Although age is usually reported in whole numbers, your exact age in terms of years, months, days etc. can be reported as a real number. E.g. if your 18th birthday was exactly 3 months ago, your age is 18.25.

2. Sex: 1=Female, 2=Male – categorical, nominal
  o Categorical – values fall into two or more classes. The classes can be coded as numbers as above, but the numbers are mere labels and have no quantitative meaning. The variable Sex in the data set is a 2-level categorical variable.
  o Nominal – no ranking is implied by the levels of the categorical variable.
  o In this case, the coding of 1=Female and 2=Male does not imply that females are better than males, nor vice versa.

3. Cost/reliable…/Colour – categorical, ordinal
  o Responses are categorised into 4 levels.
  o Ordinal – natural ordering/ranking implied by the distinct categories
  o 1=not important, 2=little importance, 3=important, 4=very important
  o The levels 1 to 4 are in increasing order of importance.

LicYr and Mth? ActCar/Kids5/Kids6/PrefCar/Car15k/Reason?

What if, instead of Kids5/Kids6, participants were asked to provide the number of children they had under the age of 16? Call this variable Kids_count.
  o Kids_count – numerical, discrete
  o Discrete – numerical responses that arise from some counting process

When classifying variables in your data set by type, you are making an assumption about the structure of the data. Different variable types imply different data structures and constraints and convey different information. The variable type assumption(s) affect the choice of statistical tools used to analyse and model the data. It is important to make valid variable type assumptions and to correctly incorporate the assumed data structure into your analysis. E.g. what are the implications of assuming a nominal categorical variable is ordinal?
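
A small Python sketch of why the variable type assumption matters. The responses are invented for illustration; only the coding 1=Female, 2=Male comes from the notes. It shows that a numerical summary of a nominal variable can be computed but carries no meaning, whereas counts per level do.

```python
import statistics
from collections import Counter

# Hypothetical responses, invented for illustration (coding as in the notes: 1=Female, 2=Male).
sex_codes = [1, 2, 2, 1, 2, 1, 1, 2, 2, 2]
ages = [18.25, 22.0, 35.5, 41.0, 19.75, 27.0, 63.5, 30.0, 24.25, 52.0]

# Numerical, continuous: arithmetic on the values is meaningful.
mean_age = statistics.mean(ages)

# Categorical, nominal: the codes are labels, so count each level instead.
labels = {1: "Female", 2: "Male"}
sex_counts = Counter(labels[code] for code in sex_codes)

# Treating the labels as numbers still runs, but the result has no interpretation.
meaningless_mean = statistics.mean(sex_codes)

print(f"Mean age: {mean_age}")
print(f"Counts by sex: {dict(sex_counts)}")
print(f"'Mean sex' (not interpretable): {meaningless_mean}")
```

The software will not stop you from averaging category codes, so the variable type has to be decided by the analyst before choosing a tool.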

Data collection methods

- For the car preferences survey, does the data collection method provide an unbiased sample of the Newcastle driving population?
- What if the survey was undertaken only in the university car park?
- Can the current sampling method be used to produce a sample that is representative of the Australian driving population?
- What if the survey was conducted by randomly calling home phone numbers in the Newcastle region? Would the sample be biased/unbiased?

March 1st Week 1 Lecture 3

Types of survey sampling methods

Sampling frame – the list of items that make up the population. Once you have the frame, you draw the sample from that frame.

Types of samples used:

Non-probability samples:
- Judgement sample
- Quota sample
- Chunk sample
- Convenience sample

Probability samples:
- Simple random sample
- Systematic sample
- Stratified sample
- Cluster sample

Non-probability sample: items are selected without knowing their probability of selection. Convenient and low cost, but selection bias is problematic, e.g. convenience sampling, judgement sampling.

Probability sample: selection probabilities are known, which produces unbiased samples.

- Simple random sample: every item in the frame has an equal chance of being selected. E.g. selecting 50 employees by drawing names from a hat from a company of 5000 employees to participate in a new employee training program.
- Stratified sampling: divide the frame into subpopulations (strata), then perform simple random sampling in each stratum
  o Pros: ensures specific groups of the population are equally represented
  o E.g. for the car preferences data set, data collectors were instructed to obtain data from men and women with small, medium and large cars, with 50 people per group for a total of 300 respondents.
- Systematic sampling: start with the kth item (e.g. k=10, 20...) in the sampling frame, then pick every kth item thereafter
  o Prone to selection bias, given that the probability of selection will be affected by the order in which the items in the frame appear.
  o E.g. product testing in a manufacturing factory.
- Cluster sampling: divide the items in the frame into clusters. Each cluster is representative of the target population. Take a random sample of clusters, then collect data on every item in each sampled cluster.
  o Cluster examples: households, postcode, electorate
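
The four probability sampling methods above can be sketched in Python as follows. Every frame, stratum and cluster here is invented for illustration; the employee IDs echo the 5000-employee example, while the postcode range and strata sizes are assumptions.

```python
import random

rng = random.Random(2019)  # fixed seed so the sketch is repeatable

# Hypothetical sampling frame: 5000 employee IDs, as in the names-from-a-hat example.
frame = list(range(1, 5001))

# Simple random sample: every item in the frame has an equal chance of selection.
simple_random = rng.sample(frame, 50)

# Systematic sample, as described in the notes: start with the kth item,
# then pick every kth item thereafter.
k = 100
systematic = frame[k - 1::k]

# Stratified sample: simple random sampling within each stratum.
# Strata (sex by car size) mirror the car preferences example; members are invented.
strata = {
    (sex, size): [f"{sex}-{size}-{i}" for i in range(200)]
    for sex in ("Female", "Male")
    for size in ("small", "medium", "large")
}
stratified = {group: rng.sample(members, 50) for group, members in strata.items()}

# Cluster sample: randomly choose whole clusters, then take every item in them.
clusters = {
    postcode: [f"{postcode}-household-{i}" for i in range(20)]
    for postcode in range(2300, 2310)  # hypothetical postcodes
}
chosen_clusters = rng.sample(sorted(clusters), 3)
cluster_sample = [item for postcode in chosen_clusters for item in clusters[postcode]]

print(len(simple_random), len(systematic),
      sum(len(s) for s in stratified.values()), len(cluster_sample))
```

Note that the systematic sample here is fully determined by the order of the frame, which illustrates why that method is sensitive to how the frame is arranged.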

Survey errors – can you trust the data source?

Data is prone to errors. Four main types of survey errors:
1. Coverage error – certain groups of items are excluded from the sampling scheme
2. Non-response error – failure to collect data on all items in the survey, often denoted as a blank data entry (ignore that data set)
3. Sampling error – chance differences from sample to sample
4. Measurement error – values recorded in the survey are different from the true response. E.g. leading questions, which should be avoided (“Did you have a problem with your boss?” vs. “Tell me about your relationship with your boss”); incorrect interpretation of a question.

RECAP – CHAPTER 1

Need to know:
- Appreciate the need for data
- Be familiar with how data is stored (rows and columns in a table, what do these represent)
- Identify different sources of data
- Recognise different variable types in a data set
- Distinguish between a population and a sample, a parameter and a statistic
- Understand the requirement for unbiased random samples
- Be familiar with common survey sampling methods
- Be familiar with sources of survey error

Chapter 2 – Organising and Visualising Data

Summarising Categorical Data
- Summary table: gives the frequency/proportion of the data in each level of the categorical variable
- In Excel: use the COUNTIF() function to calculate
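
As an alternative to COUNTIF(), a summary table can be sketched in Python. The car-size responses below are invented for illustration, and `summary_table` is a name chosen for this example.

```python
from collections import Counter

# Hypothetical responses for one categorical variable (car size), invented for illustration.
car_size = ["small", "large", "medium", "small", "small", "medium", "large", "small"]

frequency = Counter(car_size)  # count of responses in each level
n = len(car_size)
summary_table = {level: (count, count / n) for level, count in frequency.items()}

print(f"{'Level':<8}{'Frequency':>10}{'Proportion':>12}")
for level, (count, proportion) in sorted(summary_table.items()):
    print(f"{level:<8}{count:>10}{proportion:>12.3f}")
```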

5th March – Week 2 Lecture 4

Summarising Two Categorical Variables
- Two-way summary table (contingency table)
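
A contingency table cross-classifies two categorical variables. The sketch below builds one from invented (sex, car size) pairs and computes both grand-total-based and row-based percentages; the data and variable names are assumptions of this example.

```python
from collections import Counter

# Hypothetical (sex, car size) pairs, invented for illustration.
responses = [
    ("Female", "small"), ("Female", "small"), ("Female", "large"),
    ("Male", "small"), ("Male", "large"), ("Male", "large"),
    ("Male", "medium"), ("Female", "medium"),
]

cell_counts = Counter(responses)
rows = sorted({sex for sex, _ in responses})
cols = sorted({size for _, size in responses})
grand_total = len(responses)
row_totals = Counter(sex for sex, _ in responses)

# Each cell as a percentage of the grand total.
pct_of_total = {cell: 100 * count / grand_total for cell, count in cell_counts.items()}

# Each cell as a percentage of its row total.
pct_of_row = {
    (sex, size): 100 * cell_counts[(sex, size)] / row_totals[sex]
    for sex in rows
    for size in cols
}

print(f"{'':<8}" + "".join(f"{c:>8}" for c in cols) + f"{'Total':>8}")
for r in rows:
    counts = "".join(f"{cell_counts[(r, c)]:>8}" for c in cols)
    print(f"{r:<8}{counts}{row_totals[r]:>8}")
```

Which percentage base to report depends on the question: row percentages compare car sizes within each sex, while grand-total percentages describe the whole sample.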

Ordered array – the data sorted in order of magnitude

6th March – Week 2 Lecture 5

- How to summarise, visualise and interpret:
  o Categorical data – summary table, bar charts
  o Numerical data – frequency table, histogram
  o Two categorical variables – contingency table (row, column or grand total based percentages), side-by-side bar plots
  o Two numerical variables – scatter plots
  o Time series plots
- Ignore section 2.6

Chapter 3 – Numerical Descriptive Measures

Sample Mean:
- The sample mean is the sum of the values divided by the number of values: $\bar{X} = \frac{1}{n}\sum_{i=1}^{n} X_i$
- Typical or central value in the data set, location point, allows you to reference your data in the bigger picture.
- In Excel: use AVERAGE() function
- The sample mean does not have to be an observed value in the data set
- All data values are equally weighted in the sample mean calculation, and all the data is used
- Therefore, the sample mean value will be affected by extremely high or low data points (outliers or extreme values)

Sample Median:
- Middle value: 50% of the data are below the median, 50% of the data are above the median
- Not sensitive to extreme values (resistant statistic)
- In Excel: use MEDIAN() function
- The median is the (n+1)/2 ranked value when the data are sorted from smallest to largest

Mode: - The most common (frequently occurring) value
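
The three measures of central tendency can be compared with a short Python sketch. The sample is invented for illustration; replacing its largest value with an extreme value shows that the mean moves while the median does not.

```python
import statistics

# Hypothetical sample, invented for illustration.
values = [3, 5, 5, 6, 8, 9, 10]
# Same sample with the largest value replaced by an extreme value.
with_outlier = values[:-1] + [100]

mean_before = statistics.mean(values)
mean_after = statistics.mean(with_outlier)
median_before = statistics.median(values)
median_after = statistics.median(with_outlier)
mode_value = statistics.mode(values)

# Median as the (n+1)/2 ranked value of the ordered array (n is odd here).
ordered = sorted(values)
n = len(ordered)
median_by_rank = ordered[(n + 1) // 2 - 1]  # ranks start at 1, list indices at 0

print(f"Mean:   {mean_before:.2f} -> {mean_after:.2f}")
print(f"Median: {median_before} -> {median_after}")
print(f"Mode:   {mode_value}")
```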

8th March – Week 2 Lecture 6

Quartiles/Percentiles
- 1st quartile/25th percentile/lower quartile/Q1 – 25% of values are smaller, 75% are larger
  o = (n+1)/4 ranked value
- 2nd quartile/50th percentile/median/Q2
  o = (n+1)/2 ranked value
- 3rd quartile/75th percentile/upper quartile/Q3 – 75% below, 25% above
  o = 3(n+1)/4 ranked value
- n = sample size; the ranked value is taken after sorting from smallest to largest, the smallest being the 1st ranked value
- In Excel: use PERCENTILE() function
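
The ranked-value rules above can be sketched in Python. The sample is invented and has n = 11, so every rank is a whole number. The helper `ranked_value` is hypothetical, and its linear interpolation for fractional ranks is an assumption of this sketch; the notes do not say how fractional ranks are handled.

```python
import statistics

def ranked_value(ordered, rank):
    """Return the value at a 1-based rank in sorted data.

    Fractional ranks are linearly interpolated (an assumption of this sketch).
    """
    lower = int(rank)        # whole part of the rank
    fraction = rank - lower  # fractional part of the rank
    if fraction == 0:
        return ordered[lower - 1]
    return ordered[lower - 1] + fraction * (ordered[lower] - ordered[lower - 1])

# Hypothetical sample of n = 11 values, invented for illustration.
data = sorted([12, 4, 15, 2, 8, 10, 4, 13, 5, 9, 7])
n = len(data)

q1 = ranked_value(data, (n + 1) / 4)      # 3rd ranked value
q2 = ranked_value(data, (n + 1) / 2)      # 6th ranked value
q3 = ranked_value(data, 3 * (n + 1) / 4)  # 9th ranked value

# Python's default "exclusive" method also places quartiles on an (n+1) basis.
library_quartiles = statistics.quantiles(data, n=4)

print(q1, q2, q3)
print(library_quartiles)
```

Different software packages use different quartile conventions, so results may differ slightly from this sketch when the ranks are not whole numbers.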

Ignore Geometric mean in textbook

Measures of variation/spread/dispersion
- Range = maximum – minimum
- In Excel: use MAX() – MIN()

Interquartile Range (IQR)
- IQR = Q3 – Q1. Tells us about the spread of the middle 50% of the data
- Note that the median, Q1, Q3 and IQR are not affected by extreme values. We call these resistant measures.

Variance and Standard Deviation
- The sample variance is the sum of the squared deviations from the sample mean divided by the sample size minus one.
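
A minimal Python sketch of these measures of spread, using an invented sample. The quartiles come from Python's default "exclusive" method, which is one of several quartile conventions and is an assumption here. The standard deviation is taken as the square root of the sample variance.

```python
import math
import statistics

# Hypothetical sample, invented for illustration.
data = sorted([12, 4, 15, 2, 8, 10, 4, 13, 5, 9, 7])
n = len(data)

data_range = max(data) - min(data)

# Quartiles via the standard library's default "exclusive" method.
q1, _, q3 = statistics.quantiles(data, n=4)
iqr = q3 - q1

# Sample variance: sum of squared deviations from the sample mean, divided by n - 1.
mean = sum(data) / n
ssx = sum((x - mean) ** 2 for x in data)
sample_variance = ssx / (n - 1)
sample_sd = math.sqrt(sample_variance)

print(f"Range: {data_range}, IQR: {iqr}")
print(f"Sample variance: {sample_variance:.3f}, standard deviation: {sample_sd:.3f}")
```

The hand calculation can be checked against `statistics.variance(data)` and `statistics.stdev(data)`, which also divide by n - 1.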

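- S^2 = notation for sample variance
- Each deviation is squared so that positive and negative deviations do not cancel out, which is why the variance is never negative
- The equation is “SSX / (n-1)”, where SSX is the sum of squared deviations from the sample mean $\bar{X}$: $S^2 = \frac{\sum_{i=1}^{n}(X_i - \bar{X})^2}{n-1}$
- In Excel: use VAR() and STDEV()

12th March - Week 3 Lecture 7...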

