STAT1008 Study Notes PDF

Title: STAT1008 Study Notes
Course: Quantitative Research Methods
Institution: Australian National University
Pages: 33



Description

STAT1008 MID SEMESTER EXAM STUDY NOTES

1. COLLECTING DATA

1.1. STRUCTURE OF DATA

Data:
- Set of measurements taken on a set of individual units
- Usually stored and presented in a data set, comprising variables measured on cases

Cases and Variables:
- We obtain information about cases or units
- Variable: any characteristic that is recorded for each case
- Generally, each case makes up a row in a dataset, and each variable makes up a column

Categorical vs. Quantitative:
- Categorical variable: divides cases into groups, e.g. gender, degree program, smoking status
- Quantitative variable: measures or records a numerical quantity for each case, e.g. height, weight
- A variable is either categorical or quantitative (not both)
- A quantitative variable may be converted into a categorical variable; always look at context
  - E.g. height: instead of recording the exact height, record a band such as 155 < x < 160

Explanatory and Response Variables:
- If we are using one variable to help us understand or predict values of another variable, we call the former the explanatory variable and the latter the response variable
- E.g. does meditation (explanatory) help reduce stress (response)?
- E.g. does sugar consumption (explanatory) increase hyperactivity (response)?

1.2. SAMPLING FROM A POPULATION
- Population: all individuals or objects of interest
- Sample: the cases that we have collected data on (a subset of the population)
- Size of sample: n
- Statistical inference: the process of using data from a sample to gain information about the population
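As a rough illustration of cases, variables, and converting a quantitative variable into a categorical one, here is a small Python sketch (the dataset and the 5 cm height bands are made up for illustration):

```python
# Hypothetical dataset: each case (row) is a dict, each key is a variable (column)
cases = [
    {"name": "A", "height_cm": 157, "smoker": "no"},   # height is quantitative, smoker is categorical
    {"name": "B", "height_cm": 171, "smoker": "yes"},
    {"name": "C", "height_cm": 163, "smoker": "no"},
]

def height_band(h):
    """Convert the quantitative height into a categorical band (e.g. 155 <= h < 160)."""
    lower = (h // 5) * 5
    return f"{lower}-{lower + 5} cm"

for case in cases:
    case["height_band"] = height_band(case["height_cm"])

print([c["height_band"] for c in cases])  # ['155-160 cm', '170-175 cm', '160-165 cm']
```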

Sampling Bias:
- Occurs when the method of selecting a sample causes the sample to differ from the population in some relevant/significant way. If sampling bias exists, we cannot trust generalisations from the sample to the population
- Samples comprised of volunteers (e.g. airline surveys) often create sampling bias in opinion surveys, because the people who choose to participate often have more extreme opinions than the population
- To avoid this, we try to obtain a sample that is representative of the population: one that resembles the population, only in smaller numbers

Random Sampling:
- Random sampling avoids sampling bias
- Simple random sample of n units: all groups of size n in the population have the same chance of becoming the sample
- Must use a formal random sampling method, e.g. technology, drawing names from a hat
- Random samples have averages centred around the correct number (the true average)
- Only random samples can truly be trusted when making generalisations to the entire population

Non-Random Samples:
- Non-random samples may suffer sampling bias, and their averages may not be centred around the correct number
- Bad methods of sampling:

  - Sampling units based on something obviously related to the variable you're studying
  - Letting the sample be comprised of whoever chooses to participate (volunteer bias): people who choose to participate/respond are not representative of the entire population

Other Forms of Bias:
- Bias exists when the method of collecting data causes the sample data to inaccurately reflect the population
- Question wording
- Context
- Inaccurate responses: people respond inaccurately, give the wrong response, misunderstand the question, or don't want to answer truthfully

1.3. EXPERIMENTS AND OBSERVATIONAL STUDIES

Association and Causation:
- Two variables are associated if values of one variable tend to be related to values of the other variable
  - An association between two variables, even a very strong one, does not imply that there is a cause-and-effect relationship between them
- Two variables are causally associated if changing the value of the explanatory variable influences the value of the response variable
  - E.g. taking pain-relieving medication causes a reduction in pain
- Association does not imply causation

Confounding Variable:
- A third variable associated with both the explanatory variable and the response variable
- Can offer a plausible explanation for an association between the explanatory and response variables
- When confounding variables are present (or may be), a causal association cannot be determined
- Example: more ice cream sales have been linked to more deaths by drowning
  - The variables are associated; possible confounding variables: summer, heat, weather, etc.

Experiments vs. Observational Studies:
- Observational study: a study in which the researcher does not actively control the value of any variable, but simply observes the values as they naturally exist
  - Causation caution: it is difficult to avoid confounding variables in observational studies, so observational studies can almost never be used to establish causality
- Experiment: a study in which the researcher actively controls one or more of the explanatory variables

Randomisation:
- Randomised experiment: the value of the explanatory variable for each unit is determined randomly, before the response variable is measured
- The different levels of the explanatory variable are known as treatments
- Randomly divide the units into groups, and randomly assign a different treatment to each group
- If the treatments are randomly assigned, the treatment groups should all look similar
- Because the explanatory variable is randomly assigned, it is not associated with any other variables; confounding variables are eliminated
- If a randomised experiment yields a significant association between the two variables, we can establish causation from the explanatory to the response variable
- Ways to randomise: names/numbers in a hat, names/numbers on cards, technology
- Types of randomised experiments:
  1. Randomised comparative experiment: randomly assign cases to different treatment groups and then compare results on the response variable(s)
  2. Matched pairs experiment: each case gets both treatments in random order, and we examine individual differences in the response variable between the two treatments
- If the focus of the study is using a sample to estimate a statistic for the entire population, you need a random sample but do not need a randomised experiment, e.g. election polling
- If the focus of the study is establishing causality from one variable to another, you need a randomised experiment and can settle for a non-random sample, e.g. drug testing

Control Group:

- Comparison group: nothing is done to this group that might directly influence the response variable
- Provides a good comparison for the group that actually got the treatment of interest

Placebo Effect:
- When people experience the effect they think they should be experiencing, even if they aren't actually receiving the treatment they think they are
- Control groups should be given a placebo/fake that resembles the active treatment

Blinding:
- Single-blind experiment: participants are not told which group they are in
- Double-blind experiment: neither the participants nor the researchers know which treatments the patients are getting
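Random assignment to treatment groups can be done with technology, as the notes suggest; a minimal Python sketch (the participant labels and group names are hypothetical):

```python
import random

def randomise_treatments(units, treatments, seed=None):
    """Randomly assign each unit to a treatment group (randomised comparative experiment)."""
    rng = random.Random(seed)
    shuffled = units[:]
    rng.shuffle(shuffled)
    # Deal shuffled units into the groups round-robin, so group sizes are as equal as possible
    groups = {t: [] for t in treatments}
    for i, unit in enumerate(shuffled):
        groups[treatments[i % len(treatments)]].append(unit)
    return groups

participants = [f"P{i}" for i in range(1, 21)]
groups = randomise_treatments(participants, ["treatment", "placebo"], seed=42)
print({t: len(g) for t, g in groups.items()})  # 10 units in each group
```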

2. DESCRIBING DATA

2.1. CATEGORICAL VARIABLES

ONE CATEGORICAL VARIABLE:
- Summary statistics: frequency table, proportion
- Visualisation: bar chart, pie chart

Frequency Table:
- Shows the number of cases that fall in each category
- Example: random sample of US adults in 2012, surveyed on what phone they owned

  Android          458
  iPhone           437
  Blackberry       141
  Non-smartphone   924
  No cell phone    293
  Total           2253

Proportion:

  Proportion in a category = (Number in that Category) / (Total Number)

- Proportion for a sample: p̂ ("p-hat")
- Proportion for a population: p
- Proportions help compare results between different categories without referring to sample size
- Always add up to 1 (or 100%)
- Proportions are also called relative frequencies
- Relative frequency table: shows the proportion of cases that fall in each category (sums to 1)

  Android          0.203
  iPhone           0.194
  Blackberry       0.063
  Non-smartphone   0.410
  No cell phone    0.130

Bar Chart/Plot/Graph:
- Height of each bar corresponds to the number of cases falling in that category; height represents frequency

Pie Chart:
- Avoid using pie charts
- Relative area of each slice of the pie corresponds to the proportion in each category
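The relative frequencies can be reproduced from the counts in the table above; a minimal Python sketch:

```python
# Frequency table from the notes: phone ownership, random sample of US adults (2012)
counts = {
    "Android": 458,
    "iPhone": 437,
    "Blackberry": 141,
    "Non-smartphone": 924,
    "No cell phone": 293,
}
n = sum(counts.values())                              # total sample size: 2253
proportions = {k: v / n for k, v in counts.items()}   # relative frequencies, sum to 1
for k, p in proportions.items():
    print(f"{k:15s} {p:.3f}")
```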

TWO CATEGORICAL VARIABLES:
- Summary statistics: two-way table, difference in proportions
- Visualisation: segmented or side-by-side bar chart

Two-Way Table:
- Shows the relationship between 2 categorical variables
- One variable is listed down the side (rows) and the other across the top (columns)

              Male   Female   Total
  Agree        372      363     735
  Disagree     807     1005    1812
  Don't know    34       44      78
  Total       1213     1412    2625

Difference in Proportions:
- A useful measure of association between 2 categorical variables
- E.g. the difference in proportions between population 1 and population 2 is p̂₁ − p̂₂

Side-by-Side Bar Chart:
- Height of each bar is the number in the corresponding cell of the two-way table

Segmented Bar Chart:
- Similar to a side-by-side bar chart, but the bars are stacked instead of side-by-side
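The difference in proportions (here, the proportion agreeing among males versus females) can be computed from the two-way table above; a small Python sketch:

```python
# Two-way table from the notes: agreement by gender
table = {
    "Male":   {"Agree": 372, "Disagree": 807, "Don't know": 34},
    "Female": {"Agree": 363, "Disagree": 1005, "Don't know": 44},
}

p_male = table["Male"]["Agree"] / sum(table["Male"].values())        # p-hat 1
p_female = table["Female"]["Agree"] / sum(table["Female"].values())  # p-hat 2
diff = p_male - p_female  # difference in proportions agreeing
print(round(p_male, 3), round(p_female, 3), round(diff, 3))  # 0.307 0.257 0.05
```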

2.2. ONE QUANTITATIVE VARIABLE: SHAPE AND CENTRE
- Visualisation: dotplot and histogram
- Shape: symmetric, skewed
- Measures of centre: mean and median
- Outliers and resistance

Dotplot:
- Each case is represented by a dot, and dots are stacked
- X-axis is the measure of choice (numerical/quantitative)
- Y-axis has no value, just the stack of dots

Histogram:
- Height of each bar corresponds to the number of cases within that range of the variable
- Different 'binwidths' are possible
- Shape:
  - Symmetric: the central bar is the tallest, the outer bars taper down
  - Right-skewed: longer right tail (higher on the left)
  - Left-skewed: longer left tail (higher on the right)
  - Bell-shaped: symmetric
- Notation:
  - Sample size (number of cases in the sample) is denoted by n
  - We often let x or y stand for any variable, and x₁, x₂, …, xₙ represent the n values of the variable x

MEASURES OF CENTRE

Mean:
- Numerical average of all the data values
- Sample mean: x̄ (x-bar)
- Population mean: μ ("mu")
- n represents the number of data cases in a dataset
- x₁, x₂, …, xₙ represent the numerical values of the quantitative variable of interest

  mean = (x₁ + x₂ + … + xₙ) / n = Σx / n

Median:
- The middle value when the data is ordered: m
- Even number of values: the median is the average of the two middle values
- Splits the data in half

Skewness and Centre:
- The mean is "pulled" in the direction of the skew (towards the long tail)

Resistance:
- A statistic is resistant if it is relatively unaffected by extreme values/outliers
- The median is a resistant statistic
- The mean is a non-resistant statistic
- With outliers present, the median changes very little while the mean can change a lot

Outliers:
- An observed value that is notably distinct from the other values in a dataset: too large or too small
- Think about whether the outlier is a mistake, decide whether it is part of your population of interest, and see how much it is affecting results

2.3. ONE QUANTITATIVE VARIABLE: MEASURES OF SPREAD

Standard Deviation:
- Measures the spread of the data in a sample
- Sample SD: s
- Population SD: σ ("sigma")
- Gives a rough estimate of the typical distance of a data value from the mean
- SD is not resistant to outliers
- Larger SD = more variability and more spread-out data

  s = √( Σ(x − x̄)² / (n − 1) )

95% Rule:
- If the distribution of the data is bell-shaped, about 95% of the data should fall within 2 SD of the mean
- That is, 95% falls within the interval x̄ − 2s to x̄ + 2s
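The mean, median, sample standard deviation, and 95% rule can all be computed directly; a small Python sketch (the data values are invented), which also shows why the median is resistant and the mean is not:

```python
import math

def mean(xs):
    return sum(xs) / len(xs)

def median(xs):
    s, n = sorted(xs), len(xs)
    mid = n // 2
    # Even n: average the two middle values; odd n: take the middle one
    return (s[mid - 1] + s[mid]) / 2 if n % 2 == 0 else s[mid]

def sample_sd(xs):
    xbar = mean(xs)
    # Sum of squared deviations divided by n - 1, then square-rooted
    return math.sqrt(sum((x - xbar) ** 2 for x in xs) / (len(xs) - 1))

data = [4, 8, 6, 5, 7]
print(mean(data), median(data))          # both equal 6 for this symmetric sample
print(round(sample_sd(data), 3))         # 1.581

# Resistance: one outlier moves the mean a lot, the median barely at all
with_outlier = data + [100]
print(mean(with_outlier), median(with_outlier))

# 95% rule interval for bell-shaped data: mean +/- 2 SD
lo, hi = mean(data) - 2 * sample_sd(data), mean(data) + 2 * sample_sd(data)
```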

Z-Score:
- The number of standard deviations a value falls from the mean
- Z-score for a data value x, with sample mean x̄ and standard deviation s:

  z = (x − x̄) / s

- For a population, x̄ is replaced with μ and s is replaced with σ
- Puts values on a common scale
- Values farther from 0 are more extreme
- For bell-shaped distributions, 95% of all z-scores fall between −2 and 2
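A z-score sketch in Python (the exam-score mean and SD are invented for illustration):

```python
def z_score(x, xbar, s):
    """Number of standard deviations the value x falls from the mean xbar."""
    return (x - xbar) / s

# Hypothetical example: exam scores with mean 70 and SD 10
print(z_score(85, 70, 10))  # 1.5 -> above the mean
print(z_score(50, 70, 10))  # -2.0 -> at the edge of the 95% range for bell-shaped data
```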

Percentiles:
- Pth percentile: the value of a quantitative variable which is greater than P percent of the data
- General method:
  - Sort the data from low to high
  - Count the number of values (n)
  - Select the (P/100) × (n + 1)th observation

Five-Number Summary:
- Minimum: 0th percentile, smallest data value
- Q1: 25th percentile, median of the values below m
- m: 50th percentile, median
- Q3: 75th percentile, median of the values above m
- Maximum: 100th percentile, largest data value
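The percentile rule and the five-number summary can be sketched as follows; note that linear interpolation between neighbouring observations, used here when the (P/100)(n + 1) position is not a whole number, is one common convention that the notes do not specify:

```python
def percentile(xs, p):
    """P-th percentile using the (p/100) * (n + 1) position rule,
    with linear interpolation when the position is not a whole number."""
    s = sorted(xs)
    n = len(s)
    pos = p / 100 * (n + 1)
    # Clamp to the available range, then interpolate between neighbours
    if pos <= 1:
        return s[0]
    if pos >= n:
        return s[-1]
    lo = int(pos)
    frac = pos - lo
    return s[lo - 1] + frac * (s[lo] - s[lo - 1])

data = [1, 3, 5, 7, 9, 11, 13]
five_number = [min(data), percentile(data, 25), percentile(data, 50),
               percentile(data, 75), max(data)]
print(five_number)
```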

Range and Interquartile Range:
- Range = maximum − minimum
- Interquartile range: IQR = Q3 − Q1
- IQR is resistant to outliers
- Range is not resistant to outliers

2.4. BOXPLOTS AND QUANTITATIVE/CATEGORICAL RELATIONSHIPS

Boxplot:
- Displays the five-number summary for a single quantitative variable
- Numerical scale appropriate for the data values
- Box stretches from Q1 to Q3
- A line dividing the box is drawn at the median
- Lines (whiskers) extend from each quartile to the most extreme data value that is not an outlier
- Each outlier is plotted individually with a symbol, e.g. a dot or asterisk
- Detecting outliers in boxplots; a value is an outlier if it is:
  - Smaller than Q1 − 1.5(IQR)
  - Larger than Q3 + 1.5(IQR)

One Quantitative and One Categorical Variable:
- Visualise the relationship through side-by-side graphs:
  - Include a graph of the quantitative variable (e.g. boxplot, histogram or dotplot) for each group of the categorical variable, all using a common numeric axis

2.5. TWO QUANTITATIVE VARIABLES: SCATTERPLOT AND CORRELATION

Scatterplot:
- Graph of the relationship between 2 quantitative variables
- Each dot represents a particular data item
- A pair of axes with appropriate numerical scales, one for each variable
- The scales of the two axes do not need to be the same
- Explanatory variable on the horizontal axis, response variable on the vertical axis
- Association:
  - Positive association: general upward trend
  - Negative association: general downward trend
  - Linear association: trend follows a straight line

Correlation:
- A measure of the strength and direction of linear association between 2 quantitative variables
- Sample correlation: r
- Population correlation: ρ ("rho")
- −1 ≤ r ≤ 1 (the sample correlation is always between −1 and 1)
- A value of exactly −1 or 1 means a perfect linear association
- The sign of r (positive or negative) indicates the direction of association:
  - Positive association: r > 0 (positive correlation/slope)
  - Negative association: r < 0 (negative correlation/slope)
  - No linear association: r ≈ 0
- The closer r is to ±1, the stronger the linear association
- r has no units and does not depend on the units of measurement (it's just a ratio)
- Correlation is symmetric: the correlation between X and Y is the same as between Y and X
  - The order of the variables is not important (the correlation is the same)
  - It doesn't matter which variable is on which axis
- Correlation cautions:
  - A strong positive or negative correlation does not necessarily imply a cause-and-effect relationship between the 2 variables
  - Correlation can be heavily affected by outliers
  - Correlation near 0 does not necessarily mean the two variables are not associated, as correlation measures only the strength of a linear relationship

Covariance:
- Measures the linear relationship between X and Y
- Its sign indicates the direction of the slope, but its magnitude depends on the units of measurement
- Positive association:
  - Large values of x tend to occur with large values of y (both z-scores positive)
  - Small values (with negative z-scores) tend to occur together
  - In either case the products are positive, leading to a positive sum
- Negative association:
  - Z-scores tend to have opposite signs (small x with large y and vice versa), so the products tend to be negative
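The sample correlation r can be computed from the products of deviations; a minimal Python sketch illustrating the ±1 extremes and the symmetry of correlation (the data values are invented):

```python
import math

def correlation(xs, ys):
    """Pearson correlation: sum of products of deviations, scaled by both spreads."""
    n = len(xs)
    xbar, ybar = sum(xs) / n, sum(ys) / n
    cov = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - xbar) ** 2 for x in xs))
    sy = math.sqrt(sum((y - ybar) ** 2 for y in ys))
    return cov / (sx * sy)

x = [1, 2, 3, 4, 5]
y = [2, 4, 6, 8, 10]                       # perfectly linear: y = 2x
print(round(correlation(x, y), 6))         # 1.0
print(round(correlation(x, y[::-1]), 6))   # -1.0: perfect negative association
# Symmetry: correlation(x, y) equals correlation(y, x)
```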

2.6. TWO QUANTITATIVE VARIABLES: LINEAR REGRESSION

Regression Line:
- Equation of the line: ŷ = a + bx
- A straight line fitted using the given data points: the line that best fits the data in a scatterplot
- Response variable: y
- Explanatory variable: x
- Slope: b, the increase in predicted y for every one-unit increase in x
  - If b is negative, the line decreases
  - The slope is interpreted as the change in the predicted response (y) when the explanatory variable (x) increases by 1
- Y-intercept: a, the predicted y value when x = 0
  - The point where the line cuts the y-axis
  - However, this does not necessarily describe what actually happens at x = 0
- We want to get the line as close to all the points as possible
- Residual: the difference between the observed/actual value and the predicted value
- If r = 0 (no correlation), the line would be flat (horizontal)

Predicted and Actual Values:
- Observed response value, y: the response value observed for a particular data point
- Predicted response value, ŷ: the response value that would be predicted for a given x value, based on a model
- The best-fitting line makes the predicted values closest to the actual values: it makes all the residual values as small as possible

Residual:
- The residual for each data point is observed − predicted = y − ŷ
- The vertical distance from each point to the line
- We want to minimise all the residuals

Least Squares Line:
- The line which minimises the sum of squared residuals
- For a quantitative response (y) and quantitative predictor (x), the least squares line is ŷ = a + bx, where the slope (b) and intercept (a) are chosen to minimise the sum of squared residuals:

  minimise Σᵢ₌₁ⁿ (yᵢ − ŷᵢ)² = Σᵢ₌₁ⁿ (yᵢ − a − bxᵢ)²

- Rely on technology to give the prediction equation (least squares line/regression line)
- "Least squares line" = "regression line"
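In practice we rely on technology for the least squares line, but the slope and intercept that minimise the sum of squared residuals have a standard closed form; a small Python sketch (the data values are invented):

```python
def least_squares(xs, ys):
    """Slope b and intercept a minimising the sum of squared residuals."""
    n = len(xs)
    xbar, ybar = sum(xs) / n, sum(ys) / n
    b = (sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
         / sum((x - xbar) ** 2 for x in xs))
    a = ybar - b * xbar  # the least squares line always passes through (xbar, ybar)
    return a, b

x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.2, 8.0, 9.8]
a, b = least_squares(x, y)
predicted = [a + b * xi for xi in x]
residuals = [yi - pi for yi, pi in zip(y, predicted)]  # observed - predicted
print(round(a, 3), round(b, 3))  # 0.15 1.95
```

One property worth checking: the residuals of a least squares fit always sum to (essentially) zero, which is a useful sanity test for any implementation.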

Regression Cautions:
- Do not use the regression equation/line to predict outside the range of x values available in the data (do not extrapolate)
  - If none of the x values are anywhere near 0, then the intercept is meaningless
- Computers will calculate a regression line for any 2 quantitative variables, even if they're not associated or the association is not linear
- Always plot the data
- Outliers can be very influential on the regression line
  - Assess outliers: are they genuine or just random (can you disregard them)?
- Higher values of x may lead to higher (or lower) predicted values of y, but this does NOT mean changing x will cause y to increase or decrease
  - Causation can only be determined if the values of the explanatory variable were determined randomly
  - The regression equation doesn't say anything about causal links; it just tells us about association

2.7. DATA VISUALISATION AND MULTIPLE VARIABLES

3. CONFIDENCE INTERVALS

3.1. SAMPLING DISTRIBUTIONS

Statistical inference: the process of drawing conclusions about an entire population based on information in a sample

Statistic and Parameter:
- Parameter: a number that describes some aspect of a population
- Statistic: a number that is computed from data in a sample
- We usually have a sample statistic and want to use it to make inferences about the population parameter

Point Estimate:
- Use the statistic from a sample as a point estimate for a population parameter
- Point estimates don't match population parameters exactly, but they are our best guess, given the data

Sampling Distribution:
- The distribution of sample statistics computed for different samples of the same size from the same population
- Shows us how the sample statistic varies from sample to sample
- We want to know about the variability of the sampling distribution

Centre and Shape:
- Centre: if samples are randomly selected, the sampling distribution will be centred around the population parameter
- Shape: for most statistics we consider, if the sample size is large enough the sampling distribution will be symmetric and bell-shaped

Sampling Caution:
- If sampling bias exists (if you don't take random samples), the sampling distribution may give bad information about the true parameter

Standard Error:
- The SE of a statistic is the standard deviation of the sample statistic
- Measures how much the statistic varies from sample to sample
- Calculated as the standard deviation of the sampling distribution

Sample Size Matters:
- As the sample size increases, the variability of the sample statistics tends to decrease, and the sample statistics tend to be closer to the true value of the population parameter
- For larger sample sizes you get less variability in the statistics, and therefore more certainty in estimates

3.2. UNDERSTANDING AND INTERPRETING CONFIDENCE INTERVALS

Interval Estimate: gives a range of plausible values for a population parameter

Margin of Error:
- One common form of interval estimate is statistic ± margin of error
- Reflects the precision of the sample sta...
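The claims about standard error and sample size in 3.1 can be checked by simulation; a Python sketch with a hypothetical normally distributed population:

```python
import random
import statistics

random.seed(1)
# Hypothetical population of 10,000 values; its mean is the (normally unknown) parameter
population = [random.gauss(50, 10) for _ in range(10_000)]
mu = statistics.mean(population)

def sampling_distribution(n, reps=2_000):
    """Sample means from many random samples of size n: an approximate sampling distribution."""
    return [statistics.mean(random.sample(population, n)) for _ in range(reps)]

se = {}
for n in (10, 100):
    dist = sampling_distribution(n)
    # Standard error = standard deviation of the sampling distribution
    se[n] = statistics.stdev(dist)
    print(f"n={n:4d}  centre={statistics.mean(dist):.2f}  SE={se[n]:.2f}")
# Larger sample size -> smaller SE: sample means cluster more tightly around mu
```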

