STAT200 Midterm 1 PDF

Title STAT200 Midterm 1
Course Elementary Statistics
Institution The Pennsylvania State University
Pages 3
File Size 246.8 KB
File Type PDF

Summary

Cheat sheet for midterm 1 for STAT200 ...


Description

Case: experimental unit from which data are collected
Variable: characteristic of cases that can take on different values
Categorical: groupings w/ no logical order, or groups with inconsistent differences (rankings)
Quantitative: numerical values with consistent intervals (cups of coffee/day, age in months)
Continuous: can take on any value between two values (3.333)
Discrete: can take on a limited number of values (whole numbers)
Explanatory Variable: independent/predictor (x axis, categorical)
Response Variable: dependent/outcome (y axis, quantitative)
Population: entire group of individuals; described by population parameters (parameter = fixed value)
Sample: subset of the population; described by sample statistics (random b/c samples differ); representative if it accurately reflects the population
Inferential Statistics: procedures that use data from an observed sample to make conclusions about a population
Bias: systematic favoring of certain outcomes
Sampling Bias: systematic favoring of certain outcomes b/c of the methods employed to obtain the sample
Simple Random Sampling: sample from a population where every member has an equal chance of selection
Convenience Sampling: sample from a population chosen by ease of accessibility
Non-Response Bias: occurs when people who choose not to participate differ from those who choose to
Response Bias: occurs when participants don’t respond truthfully
Experimental Study: researcher manipulates treatments for subjects & collects data
Observational Study: researcher collects data without performing any manipulations
Confounding Variables: third variable (ice cream sales & home invasions both go up b/c of warm weather)
Randomization: randomly assign participants to different levels of the explanatory variable (NOT random sampling)
Causation: changes in one variable can be attributed to changes in a second variable (randomization necessary)
Association: relationship between variables
Independent Groups: cases in each group are unrelated to one another
Paired Groups: cases in each group are matched somehow (dependent samples/matched pairs)
Blinding: procedure to prevent bias (single: participants don’t know; double: researchers & participants don’t know)
Cohen’s d: effect size; measures practical significance
Publication Bias: only statistically significant results are published

Population Mean = μ (mu)
Population Standard Deviation = σ (lowercase sigma)
Sample Mean = x̄ (x-bar)
Sample Standard Deviation = s (lowercase s)

Data concerning 1 categorical variable can be summarized using proportion/risk/probability or odds.
Population: use p or π to denote the population proportion
Sample: use p̂ (p-hat) = x/n = # w/ trait / total #; 0 ≤ p̂ ≤ 1
Proportion = p or p̂
Risk = # w/ trait / total #
Probability = P(variable)
Odds = risk/(1 − risk) = p/(1 − p) = # w/ trait / # w/o trait
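The proportion and odds formulas above can be checked with a short sketch. The counts (30 cases with the trait out of 120) are hypothetical and chosen only so the arithmetic is easy to follow; the function names are illustrative.

```python
def proportion(with_trait: int, total: int) -> float:
    """Proportion (risk) = # with trait / total #."""
    return with_trait / total


def odds(p: float) -> float:
    """Odds = p / (1 - p)."""
    return p / (1 - p)


# Hypothetical counts: 30 of 120 cases have the trait.
with_trait, total = 30, 120
p_hat = proportion(with_trait, total)
odds_from_p = odds(p_hat)
odds_from_counts = with_trait / (total - with_trait)  # # w/ trait / # w/o trait

print(p_hat)             # 0.25
print(odds_from_p)       # about 0.333 (1 to 3)
print(odds_from_counts)  # same value, computed directly from the counts
```

Both routes to the odds agree, which is the point of the last equality in the notes.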

Probability = P(A)
Complement (probability that an event does not occur): uses 1 categorical variable > P(A^C) = 1 − P(A)
Intersections, unions, conditional probabilities: use at least 2 categorical variables
Intersection (probability that events occur together) > P(A ∩ B) > P(female ∩ beer) = 25/500 (total #) = 0.05
Opposite of intersection = disjoint (events cannot occur together)
Union = probability of event A AND/OR event B occurring > P(A ∪ B) = P(A) + P(B) − P(A ∩ B)
Conditional Probability = P(A|B); the bar is read as “given”
If A and B are independent (unrelated) events, then P(A) = P(A|B)

Distributions
Symmetrical Distribution: normal/bell distribution (mean = median = mode)
Right Skewed Distribution: tail extends to the right (mode < median < mean)
Mean (“average”): μ or x̄ = Σx/n (sum of all values/number of values)
Median (“middle” value): if there are 2 middle values, find their average
Mode: “most frequent” value
Standard Deviation: roughly the average difference between each observation and the mean; Sample = s (lowercase s)
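A small sketch of the complement, union, and conditional probability rules above. Only the 25/500 intersection comes from the notes; the counts of 250 females and 100 beer drinkers are assumed for illustration. The conditional probability uses the standard definition P(A|B) = P(A ∩ B)/P(B), which the notes do not spell out.

```python
# Hypothetical survey of 500 people.
total = 500
female = 250          # assumed count
beer = 100            # assumed count
female_and_beer = 25  # intersection count from the notes

p_female = female / total
p_beer = beer / total
p_both = female_and_beer / total                # P(female ∩ beer)

p_not_female = 1 - p_female                     # complement: 1 - P(A)
p_either = p_female + p_beer - p_both           # union: P(A) + P(B) - P(A ∩ B)
p_beer_given_female = p_both / p_female         # conditional: P(beer | female)

# Independence check: is P(beer) equal to P(beer | female)?
independent = abs(p_beer - p_beer_given_female) < 1e-12

print(p_both, p_not_female, p_either, p_beer_given_female, independent)
```

With these assumed counts P(beer) is 0.2 but P(beer | female) is 0.1, so the two events are not independent.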

If you repeatedly pull samples of size n from a known population and record a sample statistic (p-hat, x-bar, r, etc.), the distribution of those sample statistics is known as the sampling distribution.
Sampling Distribution: distribution of sample statistics (e.g. sample means/proportions) from samples of the same size (n) drawn from a population
Standard deviation of a sampling distribution = standard error
STATKEY Confidence Intervals: sampling distribution (mean or proportion); be sure the sample size (n) is correct; always generate at least 5K samples (Std. Error = Std. Dev. of the sampling distribution)
Sampling Distribution for Prop/Mean only works when all population values are known. With a Confidence Interval there are no population values, just sample values used to make an estimate. In this case, use bootstrapping.
For a sampling distribution w/ a known population, the mean is approx. equal to the population parameter.
As sample size (n) increases, standard error decreases.
As sample size
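The link between sample size and standard error can be seen in a quick simulation. This is a minimal sketch, not StatKey's procedure: the population (10,000 values drawn around 100 with spread 15), the seed, and the function name are all assumptions made for illustration.

```python
import random
import statistics

random.seed(200)  # fixed seed so the sketch is repeatable

# Hypothetical "known" population of 10,000 values.
population = [random.gauss(100, 15) for _ in range(10_000)]


def sampling_distribution(n: int, reps: int = 5_000) -> list[float]:
    """Draw `reps` samples of size n and record each sample mean."""
    return [statistics.fmean(random.sample(population, n)) for _ in range(reps)]


means_small = sampling_distribution(10)
means_large = sampling_distribution(40)

# Standard error = standard deviation of the sampling distribution.
se_small = statistics.stdev(means_small)
se_large = statistics.stdev(means_large)

print(statistics.fmean(population), statistics.fmean(means_large))
print(se_small, se_large)
```

The mean of the sample means lands close to the population mean, and quadrupling n from 10 to 40 roughly halves the standard error.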

MiniTab Instructions: Statistics > Summary Statistics > Descriptive Statistics
Computing the sample standard deviation by hand:
1. Compute x̄ (mean)
2. Compute all deviations (difference between each observation and the mean > x − x̄); Σ(x − x̄) = 0
3. Square all deviations > (x − x̄)²
4. Add all squared deviations > Σ(x − x̄)² = sum of squares = SS
5. Divide SS by n − 1 (results in s², AKA variance)
6. Take the square root of the variance > √s² = s
Percentile: proportion of values falling below a given value, ex. student scored in 85th percentile
Z score: distance between an individual observation and the mean in standard deviation units, AKA standardized score
Calculate Z score: need mean and standard deviation of overall distribution
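The six hand-computation steps for the sample standard deviation can be traced in code. The data values are hypothetical; the final line cross-checks the result against Python's built-in `statistics.stdev`, which also divides by n − 1.

```python
import math
import statistics

data = [2, 4, 4, 5, 7, 8]  # hypothetical observations

n = len(data)
mean = sum(data) / n                        # step 1: compute the mean
deviations = [x - mean for x in data]       # step 2: deviations (they sum to 0)
squared = [d ** 2 for d in deviations]      # step 3: square each deviation
ss = sum(squared)                           # step 4: sum of squares (SS)
variance = ss / (n - 1)                     # step 5: divide SS by n - 1
s = math.sqrt(variance)                     # step 6: square root of the variance

print(mean, sum(deviations), ss, variance)  # 5.0 0.0 24.0 4.8
print(s, statistics.stdev(data))            # both about 2.19
```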

Sample Z Score: z = x-x/s Population Z Score: z = x-μ /σ Z Distribution = normal. mean=0, standard deviation=1 SAT scores normally distributed with μ=500, σ=100 Positive Z score is above the mean. Negative Z score is below the mean. Empirical Rule/95% Rule (applies to normal distributions) states that on a normal distribution about 68% of observations will fall within 1 standard deviation of the mean, about 95% will fall within 2 standard deviations of the mean, and about 99.7% will fall within 3 standard the mean. (95% is most used interval)

Five Number Summary [Minimum, Q1, Median, Q3, Maximum]
*Used to describe key features of a distribution*
Minimum = lowest observation
Q1 = first quartile (quarter of the distribution/25th percentile)
Median = middle observation (50th percentile)
Q3 = third quartile (75th percentile)
Maximum = highest observation
*Using these values, can compute range and interquartile range (IQR)*
Range = maximum − minimum (heavily influenced by outliers)
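A minimal sketch of the five number summary, using hypothetical data. IQR is taken as Q3 − Q1 (the standard definition). Quartile conventions differ between tools, so the `method="inclusive"` choice here is an assumption and Minitab may report slightly different quartiles for the same data.

```python
import statistics


def five_number_summary(values):
    """Return (minimum, Q1, median, Q3, maximum)."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return min(values), q1, median, q3, max(values)


def spread(values):
    """Return (range, IQR) computed from the five number summary."""
    lo, q1, _, q3, hi = five_number_summary(values)
    return hi - lo, q3 - q1


scores = [1, 2, 3, 4, 5, 6, 7, 8, 9]  # hypothetical observations
print(five_number_summary(scores))     # (1, 3.0, 5.0, 7.0, 9)
print(spread(scores))                  # (8, 4.0)

# Adding one outlier changes the range far more than the IQR.
print(spread(scores + [100]))          # (99, 4.5)
```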


Bootstrapping is a resampling procedure that uses data from a sample to construct a sampling distribution by sampling w/ replacement from the original data.
Increase sample size, decrease standard error.
Increase sample size, narrower confidence interval.
Increase sample size, shape becomes closer to a normal distribution.
Hypothesis Test → Null (no diff in the pop; H0; =), Alternative (diff in pop; Ha/H1; <, >, or ≠)
Equation: 1. Parameter of interest 2. Direction (>0; negative association r regression > correlation
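The bootstrapping procedure described above can be sketched as follows. The sample values and seed are hypothetical, and the 95% interval uses the percentile method (the middle 95% of the bootstrap means), which is one common choice and an assumption here, not something the notes specify.

```python
import random
import statistics

random.seed(200)  # fixed seed so the sketch is repeatable

sample = [12, 15, 9, 20, 14, 17, 11, 16, 13, 18]  # hypothetical sample


def bootstrap_means(data, reps=5_000):
    """Resample with replacement (same n as the data) and record each mean."""
    n = len(data)
    return [statistics.fmean(random.choices(data, k=n)) for _ in range(reps)]


boot = sorted(bootstrap_means(sample))
standard_error = statistics.stdev(boot)  # SD of the bootstrap distribution

# Percentile interval: cut 2.5% off each tail of the sorted bootstrap means.
lower = boot[int(0.025 * len(boot))]
upper = boot[int(0.975 * len(boot)) - 1]

print(statistics.fmean(sample), standard_error)
print(lower, upper)
```

The interval is built only from the sample values, which is why bootstrapping is usable when no population values are available.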

as x/y

Simple Linear Regression
Explanatory Variable = X; Response Variable = Y
X is ALWAYS used to predict Y (both quantitative).
y = mx + b → ŷ = a + bx (y-intercept = a; slope = b; y has a hat b/c it is a predicted value)
*Lesson 12: For population, may be written as: ŷ = α + βx OR ŷ = β0 + β1(x)
MiniTab Instructions: statistics > regression > simple regression
The least squares method computes the values of the slope and y-intercept that make the sum of squared errors (SSE) as small as possible.
Error = e; e = residual; e = y − ŷ (difference between the observed y-value and the y-value predicted using the regression equation)
Regression line AKA line of best fit → if a point is above the regression line, positive residual; if a point is below the regression line, negative residual
Cautions: avoid extrapolation, make a scatterplot to check linearity, outliers can heavily influence a regression model
Interpret points w/ 2+ variables → scatterplot
MiniTab instructions: graphs > scatterplot > single y-variable w/ groups...
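A minimal sketch of the least squares idea, using hypothetical (x, y) pairs. The slope uses the standard closed form b = Σ(x − x̄)(y − ȳ)/Σ(x − x̄)², which the notes do not state explicitly, and the intercept follows from a = ȳ − b·x̄.

```python
x = [1, 2, 3, 4, 5]  # hypothetical explanatory values
y = [2, 4, 5, 4, 5]  # hypothetical response values

n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n

# Slope b and y-intercept a of the least squares line y-hat = a + b*x.
b = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sum(
    (xi - x_bar) ** 2 for xi in x
)
a = y_bar - b * x_bar


def sse(intercept: float, slope: float) -> float:
    """Sum of squared errors for the line y-hat = intercept + slope * x."""
    return sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))


# Residual e = y - y-hat: positive if the point is above the line.
residuals = [yi - (a + b * xi) for xi, yi in zip(x, y)]

print(a, b)                     # about 2.2 and 0.6
print(residuals)
print(sse(a, b), sse(a, b + 0.1), sse(a + 0.1, b))
```

Nudging either the slope or the intercept away from the fitted values raises the SSE, which is what "as small as possible" means here.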

