Title | Stats exam 1 review |
---|---|
Author | Sajil Ismail |
Course | Biostatistics |
Institution | University of Texas at Austin |
Pages | 7 |
Sefia Khan
SDS 328M Exam 1 Review

DATA:
Descriptive Statistics: summarizing and displaying data (median, histogram)
Inferential Statistics: using estimates from a sample to draw conclusions about a population (hypothesis testing, correlation coefficient, etc.)
Population Data: entire collection of individuals of interest
- Use parameters to describe population characteristics (mean = µ, sd = σ)
Sample Data: subset of the population
- Use statistics to describe sample characteristics (mean = x̄, sd = s)
Properties of Good Samples: no bias and low sampling error (variance)
- Sampling Error (variance): difference between an estimate and the population parameter being estimated, caused by chance
  o The lower the sampling error, the higher the precision
  o Precise (low variance): values obtained are tightly grouped and repeatable
- Bias: systematic discrepancy between the true population parameter and the estimates we would obtain if we could sample the population repeatedly
  o Creates an inaccurate estimate of the population parameter
  o Accurate (no bias): estimates are centered around the bull’s eye, the true population parameter
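The difference between sampling error and bias can be seen in a small simulation. This is only a sketch: the population below is made up (normal, mean near 50, sd near 10), and the sample sizes and number of repeats are arbitrary choices, not values from the course.

```python
import random
import statistics

random.seed(0)  # fixed seed so the sketch is repeatable

# Hypothetical population (assumption): 10,000 values, mean near 50, sd near 10.
population = [random.gauss(50, 10) for _ in range(10_000)]
mu = statistics.mean(population)  # the population parameter

def sample_means(n, repeats=500):
    """Draw `repeats` simple random samples of size n and return each sample mean."""
    return [statistics.mean(random.sample(population, n)) for _ in range(repeats)]

means_small = sample_means(10)
means_large = sample_means(100)

# Sampling error (variance): how much the estimates scatter from sample to sample.
spread_small = statistics.stdev(means_small)
spread_large = statistics.stdev(means_large)

# Bias: how far the average estimate sits from the true parameter.
bias_small = statistics.mean(means_small) - mu
bias_large = statistics.mean(means_large) - mu

print(f"n=10:  spread={spread_small:.2f}, bias={bias_small:+.2f}")
print(f"n=100: spread={spread_large:.2f}, bias={bias_large:+.2f}")
```

Under random sampling both sample sizes should show bias close to zero, while the larger samples should give a visibly smaller spread (higher precision).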
Random Sampling: (external validity)
- Minimizes bias and makes it possible to quantify sampling error
1. Every unit in the population must have an equal chance of being included in the sample.
2. Selection of units must be independent.
Nonrandom Sampling:
- Convenience Sampling – selection based on being easily available to the researcher
- Volunteer Sampling – selection based on subjects’ desire to participate (volunteers may behave differently from the rest of the population)

VARIABLES:
Numeric Variables: quantitative measurements
a. Discrete: values are whole numbers with no intermediate values
   o Ex – shoe size, number of children
b. Continuous: any value in a range, no gaps in between
   o Ex – height, weight, distance, time
Categorical Variables: qualitative characteristics
a. Nominal: unordered categories
   o Ex – gender, pet preference
b. Ordinal: ordered categories
   o Ex – grade on exam, survey responses, classification in school

EXPERIMENTS:
Experiment: treatment is assigned randomly to individuals → can reveal cause and effect
Observational Study: treatment is not assigned randomly → reveals associations only
True Experiments: use random assignment to support the claim X → Y
- X and Y are correlated
- X preceded Y in time
- No other explanations for X → Y
Minimize Bias:
1. Randomization
   o Creates two or more groups of units that are similar to each other
   o No preexisting differences
   o Allows researchers to tease apart the effects of the explanatory variable from those of confounding variables
   o Random assignment – making sure treatment and control groups are not systematically different prior to treatment (internal validity)
   Confounding variable: variable that masks or distorts the relationship between the explanatory and response variables, biasing the estimate
   - Influences both explanatory and response variables so that we cannot discern the actual effect
   - Non-manipulable events: natural disaster, childhood trauma
   - Non-manipulable attributes: age, height, being a smoker
2. Control
   o Group of subjects who do not receive the treatment
   o Accounts for the placebo effect
   o Counterfactual – way to simulate what would happen if the treatment group was not treated
3. Blinding
   o Concealing information about who actually got treated from participants and researchers
   o Single blind: participants don’t know if they got treated
     Prevents subjects from behaving differently
     Requires indistinguishable treatments
   o Double blind: participants and experimenters are both unaware of who got treated
     Prevents researchers from behaving differently towards subjects
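Random assignment, as described under Randomization above, can be sketched in a few lines. The subject IDs, the 50/50 split, and the function name `randomly_assign` are assumptions made for illustration only.

```python
import random

def randomly_assign(units, seed=None):
    """Shuffle the units and split them into treatment and control groups."""
    rng = random.Random(seed)
    shuffled = list(units)   # copy so the caller's list is left untouched
    rng.shuffle(shuffled)    # every ordering is equally likely
    half = len(shuffled) // 2
    return shuffled[:half], shuffled[half:]

# Hypothetical subject IDs (assumption, not from the notes).
subjects = [f"subject_{i}" for i in range(1, 21)]
treatment, control = randomly_assign(subjects, seed=42)

print("Treatment:", treatment)
print("Control:  ", control)
```

Because chance alone decides who lands in each group, the groups should not differ systematically before treatment, which is what protects against selection bias.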
Minimize Sampling Error:
1. Replication
2. Balance
3. Blocking/Stratifying
   o Stratified random sampling: the population is divided into subpopulations (strata) and a random sample is drawn from each stratum
   o Ensures the sample ends up with the desired distribution
   o Reduces sampling error → higher precision
Selection Bias: happens when the average person receiving one condition (treatment) differs from the average person receiving another condition (control).
- Occurs when individuals select themselves into treatments, when there is no randomization, etc.
- Random assignment eliminates selection bias
- Matching: create a control group that ‘looks like’ the treatment group in terms of possible confounding variables. Match every individual who has the characteristic of interest with a similar individual who does not

DISPLAYING DATA:
Univariate: graphs/stats with a single variable
Bivariate: graphs/stats with two variables; deals with relationships/associations
- Use a scatterplot or a grouped boxplot/histogram/bar chart
Frequency Distribution: counts for each value (or range of values) in the dataset

Type of Variable | Categorical | Numeric |
---|---|---|
Graph | Bar chart, contingency table | Histogram |
Histogram: splits the range of the data into equal-sized bins and shows how many data points fall into each

DESCRIBING DATA:
Measures of Center:
- Mean
- Median
  o Resistant to outliers
Measures of Spread:
- Standard deviation
- Range (max – min)
- Interquartile range, IQR (Q3 – Q1)
  o Q1 – 25th percentile = median of lower half
  o Q3 – 75th percentile = median of upper half
  o To find outliers: compute 1.5 × IQR
    Add it to Q3: values above Q3 + 1.5 × IQR are outliers
    Subtract it from Q1: values below Q1 – 1.5 × IQR are outliers
Skewed Distribution → report median and IQR
Symmetric Distribution → report mean and standard deviation
The mean gets pulled in the direction of the tail: right-skewed → mean > median; left-skewed → mean < median
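The quartile and outlier rules above can be checked with a short sketch. The data set is made up, and the choice to leave the median out of both halves when n is odd is an assumption (textbooks differ on this convention).

```python
import statistics

def quartiles(values):
    """Return (Q1, Q3) as the medians of the lower and upper halves.

    Assumption: with an odd number of values the overall median is excluded
    from both halves. Needs at least two values.
    """
    data = sorted(values)
    half = len(data) // 2
    lower = data[:half]    # values below the median position
    upper = data[-half:]   # values above the median position
    return statistics.median(lower), statistics.median(upper)

# Hypothetical right-skewed data (assumption, for illustration only).
data = [2, 4, 4, 5, 6, 7, 8, 9, 30]

q1, q3 = quartiles(data)
iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = [x for x in data if x < lower_fence or x > upper_fence]

print("Q1, Q3, IQR:", q1, q3, iqr)
print("Fences:", lower_fence, upper_fence)
print("Outliers:", outliers)
print("Mean vs median:", statistics.mean(data), statistics.median(data))
```

Here the single large value drags the mean above the median, which is why the median and IQR are the better summary for skewed data.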
PROBABILITY:
Probability of an event is its ‘long run’ relative frequency
Conditional Probability: P(A | B) = P(A and B) / P(B), equivalently P(A | B) × P(B) = P(A and B)
- The probability of event A given that B has happened
- Ex: “what is the probability of surviving, given that you were in first class?” P(alive | first class) = P(alive and first class) / P(first class)
Independent: occurrence of one event does not inform us about the probability of another
- Events are independent if any of these are true:
  o P(A) × P(B) = P(A and B) (multiplication rule)
  o P(A) = P(A | B)
  o P(B) = P(B | A)
Dependent: occurrence of one event changes the probability of the other (none of the conditions above hold)
Addition Rule: P(A or B) = P(A) + P(B) – P(A and B)
- Probability of events A or B (or both) occurring
- If A and B are mutually exclusive, then P(A and B) = 0, so P(A or B) = P(A) + P(B)
Probability Trees:
- Roots are marginals: P(A)
- Branches are conditionals: P(B | A)
- Leaves are joints: P(A and B)
- Multiply root by branch to get leaf: P(A) × P(B | A) = P(A and B)
Law of Total Probability: P(B) = P(B | A) × P(A) + P(B | not A) × P(not A)
Bayes’ Theorem: P(A | B) = P(B | A) × P(A) / P(B)
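A worked example may help tie conditional probability, the Law of Total Probability, and Bayes’ Theorem together. The passenger counts below are invented for illustration (they are not real survival data), and the helper `p` is a hypothetical name.

```python
from fractions import Fraction

# Hypothetical passenger counts (assumption: made-up numbers, not real data).
counts = {
    ("first", "alive"): 200,
    ("first", "dead"): 100,
    ("other", "alive"): 300,
    ("other", "dead"): 400,
}
total = sum(counts.values())

def p(cls=None, status=None):
    """Probability that a passenger matches the given class and/or status."""
    matching = sum(n for (c, s), n in counts.items()
                   if cls in (None, c) and status in (None, s))
    return Fraction(matching, total)

p_first = p(cls="first")                 # marginal (root of the tree)
p_other = p(cls="other")
p_alive_and_first = p("first", "alive")  # joint (leaf of the tree)

# Conditional probability: P(alive | first) = P(alive and first) / P(first)
p_alive_given_first = p_alive_and_first / p_first
p_alive_given_other = p("other", "alive") / p_other

# Law of Total Probability: add up the leaves that end in "alive".
p_alive = p_alive_given_first * p_first + p_alive_given_other * p_other

# Bayes' Theorem: flip the conditional.
p_first_given_alive = p_alive_given_first * p_first / p_alive

# Independence check with the multiplication rule.
independent = p_alive * p_first == p_alive_and_first

print("P(alive | first) =", p_alive_given_first)
print("P(alive) =", p_alive)
print("P(first | alive) =", p_first_given_alive)
print("Independent?", independent)
```

Using `Fraction` keeps the arithmetic exact, so the independence check compares true equality rather than rounded decimals.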
DISTRIBUTION:
Normal Distribution: defined by two parameters: mean (µ) and standard deviation (σ)
- If variable X is normally distributed: X ~ N(µ, σ)
- Most of the data are found close to the mean; less at the extremes
Properties:
1. Unimodal
2. Symmetrical
3. Mean = median = mode
4. Bell shaped
5. Height and width determined by standard deviation
Standard Normal Distribution:
- Z scores follow a standard normal distribution
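  o For a population: z = (x – µ) / σ
  o For a sample: z = (x – x̄) / s
- Z-scores in the range from −1.96 to 1.96 are usual or typical (the middle 95% of the distribution)
Empirical Rule: for a normal distribution, about 68% of values fall within 1 standard deviation of the mean, about 95% within 2, and about 99.7% within 3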
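The z-score formula and the Empirical Rule percentages can be verified with Python’s standard library. The population values (µ = 100, σ = 15) and the observation x = 130 are assumptions chosen only to make the arithmetic easy.

```python
from statistics import NormalDist

# Hypothetical population parameters and observation (assumptions).
mu, sigma = 100, 15
x = 130

# Population z-score: how many standard deviations x lies from the mean.
z = (x - mu) / sigma

std_normal = NormalDist()  # standard normal: mean 0, sd 1

def within(k):
    """Probability that a standard normal value lies between -k and k."""
    return std_normal.cdf(k) - std_normal.cdf(-k)

print("z =", z)
for k in (1, 2, 3):
    print(f"within {k} sd: {within(k):.4f}")

central_95 = within(1.96)
print(f"between -1.96 and 1.96: {central_95:.4f}")
```

A z-score of 2 sits just outside the ±1.96 band, so x = 130 would count as unusual under these assumed parameters.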
Sampling Distributions:
- Mean of the sampling distribution of x̄ is the same as the population mean: µx̄ = µ
- Standard deviation of the sampling distribution (the standard error) is σx̄ = σ/√n
- The standard error is smaller than the population standard deviation and shrinks as n grows

HYPOTHESIS TESTING:
Steps:
1. Hypothesize: state a claim and counterclaim about a population parameter (usually the mean)
   o Null Hypothesis (H0): statement of no effect / no difference. Ex: µ = 0
   o Alternative Hypothesis (HA): statement of difference. Ex: µ ≠ 0, µ < 0, or µ > 0
2. Test Statistic: to measure how far the sample estimate is from the null value, use x̄ ~ N(µ, σ/√n)
   a. Compute the z score, z = (x̄ – µ0) / (σ/√n), where µ0 is the mean under the null hypothesis, and use it to find the p-value
3. P-value: probability of getting an estimate at least as extreme as the one we got, in the direction of the alternative hypothesis, if the null hypothesis is true
4. Conclusions:
   o If p is small (reject null; the data support the alternative hypothesis): our result is unlikely if the null hypothesis is true, so we reject the null
   o If p is large (fail to reject null): our result is not unlikely if the null hypothesis is true, so we cannot reject the null. This does not show that the null is true
   o ‘How small’ is set by the significance level α (cutoff)
   o Typically α = .05
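The four steps above can be sketched as a one-sample z-test. This assumes the population standard deviation σ is known; the function name `z_test` and the numbers (x̄ = 103, µ0 = 100, σ = 15, n = 100) are made up for illustration.

```python
from math import sqrt
from statistics import NormalDist

def z_test(sample_mean, mu0, sigma, n, alternative="two-sided"):
    """Return (z, p-value) for a one-sample z-test with known sigma."""
    standard_error = sigma / sqrt(n)           # sd of the sampling distribution
    z = (sample_mean - mu0) / standard_error   # test statistic
    cdf = NormalDist().cdf(z)
    if alternative == "greater":    # HA: mu > mu0
        p = 1 - cdf
    elif alternative == "less":     # HA: mu < mu0
        p = cdf
    else:                           # HA: mu != mu0, count both tails
        p = 2 * min(cdf, 1 - cdf)
    return z, p

# Hypothetical study (assumption): H0: mu = 100 versus HA: mu != 100.
z, p_value = z_test(sample_mean=103, mu0=100, sigma=15, n=100)

alpha = 0.05
decision = "reject H0" if p_value < alpha else "fail to reject H0"

print(f"z = {z:.2f}, p = {p_value:.4f} -> {decision}")
```

With these numbers the standard error is 15/√100 = 1.5, so the sample mean lies 2 standard errors from the null value and the two-sided p-value falls just under .05.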
Errors in Hypothesis Testing:

Conclusion | Reject H0 | Fail to reject H0 |
---|---|---|
H0 true | Type I error | Correct |
H0 false | Correct | Type II error |
Power: the probability that a random sample will lead to rejection of a false null hypothesis
- A study has more power:
  o If the sample size is large
  o If the true discrepancy from the null hypothesis is large
  o If variability in the population is low...
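The three influences on power listed above can be illustrated with a rough calculation. This sketch assumes an upper-tailed z-test with known σ; the function name `power_upper_tail` and all numbers are hypothetical.

```python
from math import sqrt
from statistics import NormalDist

def power_upper_tail(mu0, mu_true, sigma, n, alpha=0.05):
    """Power of an upper-tailed z-test of H0: mu = mu0 when the true mean is mu_true."""
    std_normal = NormalDist()
    z_crit = std_normal.inv_cdf(1 - alpha)           # rejection cutoff for z
    shift = (mu_true - mu0) / (sigma / sqrt(n))      # true discrepancy in standard errors
    return 1 - std_normal.cdf(z_crit - shift)        # P(reject H0 | true mean is mu_true)

# Hypothetical baseline (assumption): mu0 = 100, true mean 103, sigma 15, n = 25.
baseline = power_upper_tail(100, 103, 15, 25)
larger_sample = power_upper_tail(100, 103, 15, 100)
larger_effect = power_upper_tail(100, 106, 15, 25)
less_variability = power_upper_tail(100, 103, 7.5, 25)

print(f"baseline:         {baseline:.3f}")
print(f"larger sample:    {larger_sample:.3f}")
print(f"larger effect:    {larger_effect:.3f}")
print(f"less variability: {less_variability:.3f}")
```

Each change raises power above the baseline, and when the true mean equals the null value the rejection probability drops to α, which is the Type I error rate.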