Intro to Stats Final Study Guide PDF

Title Intro to Stats Final Study Guide
Course Introduction to Statistical Reasoning
Institution Columbia University in the City of New York
Pages 15
File Size 398.1 KB
File Type PDF

Summary

Class final exam study guide...


Description

Intro to Statistical Reasoning - Spring 2017 Final Study Guide

-Margin of error
-Regression line
-Issues with asking questions (polling)
-Random experiments, observational study, causality
-Only thing you need to memorize is empirical rule and margin of error
-Box plots, 5 number summary, histograms
-Z-score, empirical rule, margin of error, IQR
-Exam = chapters 1-11 minus 6 & 9
-Chapters 4, 5, 10, 11 are most important (1, 6, 9 are kind of irrelevant)

Lecture 1
-Statistics (the discipline) = procedures and principles for helping people make decisions under uncertainty
-Statistics (plural) = calculations made from data
-Data = numbers that have meaning attached to them
-Variable = a characteristic recorded about each individual
-Categorical variables = values represent categories
  -nominal = values have no order (Gender = Male/Female, Favorite Music = Rap, Pop, Country)
  -ordinal = values have a natural order (strongly agree, agree, neutral, disagree, strongly disagree)
-Quantitative variables = values are numerical
  -continuous = age, income, blood pressure, temperature (can take fractional values)
  -discrete = number of goals in a soccer game, number of accidents (can only take whole values)
-Population = larger group from which the sample was taken (not always the same as the population of interest)
-Sample = those actually studied in order to make inferences about the population of interest
-Parameter = number that describes the population
-Statistic = number that describes the sample
-Representative sample = sample resembles the population, so results can be extended to it
-Random sample = all individuals are given an equal chance to be chosen
-The larger the sample, the better the sample statistic estimates the population parameter
-Randomized experiment = measures the effect of manipulating the environment in some way; the manipulation is assigned to participants on a random basis (like flipping a coin)
  -To make a causal connection, a randomized experiment is generally needed
  -Randomization helps make groups equal in all respects except the explanatory variable
-Explanatory variable = feature being manipulated
-Observational study = manipulation occurs naturally and is not imposed; you cannot assume the explanatory variable is the only one responsible for observed differences in the response variable, because the groups were not equalized by randomization
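As a rough illustration of the random assignment described under "Randomized experiment" in Lecture 1, the sketch below shuffles a list of participants and splits it into two groups. The participant IDs, the group names, and the `randomly_assign` helper are hypothetical and are not part of the course material.

```python
import random


def randomly_assign(participants, seed=None):
    """Shuffle participants and split them into two equal-sized groups."""
    rng = random.Random(seed)      # a seed makes the split repeatable
    shuffled = list(participants)  # copy so the input list is left untouched
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return shuffled[:half], shuffled[half:]


# Hypothetical participant IDs, for illustration only.
participants = [f"subject_{i}" for i in range(1, 21)]
treatment, control = randomly_assign(participants, seed=2017)
print("treatment:", treatment)
print("control:  ", control)
```

Because chance alone decides who lands in each group, other characteristics of the participants tend to balance out across the two groups.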

-Sample survey = a subgroup of a large population is questioned on a set of topics, with no intervention or manipulation of the respondents

Lecture 2
-Questions to ask about a study:
  1. Who funded the study, and who conducted the study (and are they related)?
  2. Who were the researchers who had contact with the participants?
  3. Who were the individuals (or objects) studied, and how were they selected [population, sample size, response rate]?
  4. What was the setting in which the measurements were taken [time, location, method of contact, etc.]?
  5. What was the exact nature of the measurements made or questions asked?
  6. Were there any other differences in the groups being compared [any confounding variables] besides the factor of interest?
  7. What was the magnitude of any claimed effects or differences?

Lecture 3
-Interviewer bias: biases that appear in research findings because of the social nature of the interview
  -social scientists believe that in any situation where ethnic identity is important, it is best to have an interviewer of the same ethnicity as the survey subjects
  -however, in many cultures, some women will not admit to rape to someone from the same culture, because admitting to it might not only cause humiliation but also prevent a subsequent marriage
-Deliberate bias: deliberately using words that suggest the answer you want to hear
  -appropriate wording should not indicate a desired answer
-Unintentional bias: unclear wording of questions
  -different wording of questions can elicit different responses
-Desire-to-please: people tend to understate responses about undesirable social habits
-Asking the uninformed: people do not like to admit they don't know what you are talking about
-Unnecessary complexity: the question is worded so that it is unclear whether a yes or no answer means agreement or disagreement with the statement
-Ordering of questions: an earlier question can plant a seed that affects the answers to the following questions
  -the first question can create context for the answers to the following questions
-Confidentiality versus anonymity:
  -people answer differently based on the degree to which they are anonymous
  -confidentiality: researcher promises not to release identifying information about respondents
  -anonymity: researcher doesn't know the identity of respondents
-Closed questions: researcher lists the possible answers (what happens when the list is not complete? should there always be an "other" category?)
-Open questions: responder writes an answer in their own words, no choices are given (the difficulty is how to summarize/categorize the answers)

-Reliability (or reproducibility): the scale has been shown to give consistent results when it is re-administered
-Validity: the scale actually measures what it claims to measure
-Biased measurement: a measurement that is systematically off the mark in the same direction; the measurement tool has a systematic internal bias (e.g. a broken scale)
-Measurement error: amount by which each measurement differs from the true value
  -problem with measurement tool
  -problem with technician
-Variability: measurements are likely to differ from one time to the next or from one individual to the next because of unpredictable errors or differences that are not easily explained; it is very common to see a variable take different values across different subjects
-Natural variability: differences across time, within individuals, or due to measurement errors
-Sampling variation: a random sample is almost surely different from another random sample even if both come from the same population (due to natural variability)

Lecture 4
-Sampling frame: a list of units from which the sample is chosen (ideally it includes the whole population)
-Census: a survey in which the entire population is measured
-Sampling error: the uncertainty that is inherent in the random process of choosing a sample
  -a sample cannot guarantee results that are the same as the results of a census of all the individuals in the population of interest
  -a sample may give a proportion that is exactly the same as, near, or very far from the proportion in the population
-Sampling methods:
  -Probability sampling plans: everyone in the population has a specified chance of making it into the sample
  -Simple Random Sampling (SRS): random selection from the entire population so that each possible sample is equally likely to occur
    -requires a list of units and a source of random numbers
  -Stratified Random Sampling: divide the population into smaller groups (strata) based on shared characteristics and take a simple random sample from each, sized in proportion to the stratum
    -advantage: you can set sample sizes to allow separate conclusions for each stratum
    -a stratified sample usually has a smaller margin of error than an SRS, because individuals in each stratum are more alike than the population as a whole
-Margin of error: measure of accuracy (accounts for miscalculation etc.)
  -formula:

$\text{margin of error} = \dfrac{1}{\sqrt{n}}$

n = sample size (multiply by 100% to get a percentage instead of a proportion)
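A minimal sketch of the margin-of-error formula above, evaluated for a few sample sizes. The sample sizes and the function name `margin_of_error` are made up for illustration.

```python
import math


def margin_of_error(n):
    """Margin of error as a proportion, using the 1 / sqrt(n) rule."""
    if n <= 0:
        raise ValueError("sample size must be positive")
    return 1 / math.sqrt(n)


# Hypothetical sample sizes: each one is four times the previous one.
for n in (100, 400, 1600):
    print(f"n = {n}: margin of error = {margin_of_error(n) * 100:.1f}%")
```

Quadrupling the sample size halves the margin of error, because n sits under a square root.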

-as sample size increases, margin of error decreases
-Systematic sampling: sampling is decided by a formula or set of rules (every 10th person etc.)
-Convenience sampling: poll on a website, use a class you are in, pick people that come out of the mall etc.
-Mistakes in sampling:

  -Sampling from the wrong population (wrong sampling frame)
  -Low response rate
  -Using voluntary response (Facebook polls)
  -Using a convenience sample
-What to do when you get a low response rate:
  -Weighting the responses: one approach is to divide a misrepresented group's known share of the population by its share of the sample, and multiply that group's responses by the resulting weight

Lecture 5
-Response variable: the focus of a question in a study or experiment
-An association exists if there is a difference in the distribution of the response variable in populations with different values of the explanatory variable
-Association does not imply causation because:
  -the two populations may differ in ways other than the explanatory variable
-Experiment: a study design that can potentially uncover a causal relationship
-Four principles of experimental design:
  -Control (of certain variables and the environment)
    -try to control sources of variation other than the factors being tested by making conditions as similar as possible for all treatment groups
    -making generalizations from the experiment to other levels of a controlled factor can be risky
  -Randomize
    -randomization allows us to equalize the effects of unknown or uncontrollable sources of variation
  -Replicate
    -repeat the experiment, applying the treatments to a number of subjects
    -the outcome for a single subject doesn't mean anything
  -Block
    -sometimes, attributes of the experimental units that we are not studying and that we can't control may nevertheless affect the outcomes of an experiment
    -if we group similar individuals together and then randomize within each of these blocks, we can remove much of the variability
-Statistically significant: differences that are larger than we'd get from chance alone
-Placebo: the best way to blind patients as to whether they are receiving a treatment or not
-There are two main classes of individuals who can affect the outcome of an experiment:
  -those who could influence the results (subjects, treatment administrators, technicians)
  -those who could evaluate the results (judges, treating physicians, etc.)
-Single-blind: when all individuals in one of these classes are blinded
-Double-blind: when all individuals in both classes are blinded
-Effect modifier: a subgroup variable is called an effect modifier if it modifies the effect of the explanatory variable on the outcome (smoking is an effect modifier on the relationship between exercise and blood pressure)
-Observational study: researchers don't assign choices; they simply observe them.

  -Manipulation occurs naturally; it is not imposed
  -Can't assume the explanatory variable is the only one responsible for any observed differences in the response variable
-Case-control study: attempts to include an appropriate control group in an observational study; does a better job of reducing confounding than an observational study without one
-Prospective: participants are followed into the future, and events are recorded (usually better)
-Retrospective: participants are asked to recall past events
  -the problem with retrospective studies is that some participants may not remember the past accurately; using official records, if they exist, is a partial solution
-Reasons why we might have to use an observational study: it is unethical or impossible to assign people to a certain treatment; some explanatory variables cannot be assigned (inherent traits)
-Confounding variable: related to the explanatory variable, and affects the response variable
  -the effect of a confounding variable on the response variable cannot be separated from the effect of the explanatory variable on the response variable
  -example: seriousness of fire (confounding) is related to number of firefighters (explanatory) and affects amount of damage (response)
-It is possible to safely assume a causal relationship without a randomized experiment if: the association is strong, the association is consistent, and higher doses are associated with stronger responses (e.g. smoking and lung cancer)

Lecture 6
-Histograms: the bins and the count in each bin summarize how the values in the sample are distributed
  -unimodal = one hump (mode) in histogram, bimodal = two humps, multimodal = three or more humps, uniform = no humps
  -the thinner ends of the distribution are called tails; if one tail stretches out farther than the other, the histogram is said to be skewed to the side of the longer tail
  -a relative frequency histogram shows percentages instead of counts

-Typical value: a measure of central tendency that shows where the data tend to cluster
-Mean (arithmetic mean, average): a typical value, the sum divided by the count
  -pros of mean: easy to understand and calculate, uses all values
  -cons of mean: can be heavily affected by outliers, cannot be computed for qualitative (categorical) data
-Median: the value such that about half the observations fall below it and about half fall above it

  -pros of median: easily defined/calculated, stable (not affected by outliers)
  -cons of median: not based on all observations
-Mode: most common value
-Range: the difference between the maximum and minimum values (max - min)
  -con: a single extreme value can make it very large and thus not representative of the data overall
-Interquartile range (IQR): lets us ignore extreme data values and concentrate on the middle of the data (IQR = upper quartile - lower quartile)
  -order the data from lowest to highest and split it into a lower half and an upper half (with an odd number of values, leave the overall median out of both halves)
  -find the median of each half (if a half has an even number of values, average its two middle values)
  -subtract the lower median from the upper median
-Five number summary: min, lower quartile, median, upper quartile, max
-Box plot: the ends of the box are Q1 and Q3, and the line inside the box is the median; usually the whiskers extend from the box out to the max and min, but outliers that lie too far out are marked with an asterisk instead
  -an outlier is an observation located more than 1.5 IQR from the closest quartile
  -an outlier is extreme if it is more than 3 IQR from the closest quartile
-Standard deviation: how widely values in a group vary from the mean (better than IQR); the variance is found by summing the squared deviations and almost averaging them (dividing by the number of values - 1)
  1) find the mean
  2) find the deviations (value - mean)
  3) square the deviations
  4) add all the squared deviations
  5) divide by number of values - 1 (this gives the variance)
  6) take the square root
  7) check if it makes sense
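The Lecture 6 calculations (five number summary, IQR, the 1.5 IQR outlier rule, and the standard deviation steps) can be sketched in a few short functions. The data set and the function names are made up for illustration, and the quartiles use the "median of each half" method described in these notes, which can differ slightly from what statistical software reports.

```python
import math


def median(sorted_values):
    """Median of an already sorted list."""
    n = len(sorted_values)
    mid = n // 2
    if n % 2 == 1:
        return sorted_values[mid]
    return (sorted_values[mid - 1] + sorted_values[mid]) / 2


def five_number_summary(values):
    """Return (min, Q1, median, Q3, max) using medians of the two halves."""
    data = sorted(values)
    n = len(data)
    half = n // 2
    lower = data[:half]
    # With an odd count, the overall median is left out of both halves.
    upper = data[half + 1:] if n % 2 == 1 else data[half:]
    return data[0], median(lower), median(data), median(upper), data[-1]


def outliers(values):
    """Values lying more than 1.5 IQR beyond the closest quartile."""
    _, q1, _, q3, _ = five_number_summary(values)
    iqr = q3 - q1
    low_fence, high_fence = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [v for v in values if v < low_fence or v > high_fence]


def sample_std_dev(values):
    """Follow the steps in the notes: deviations, square, sum, / (n - 1), root."""
    mean = sum(values) / len(values)
    squared_deviations = [(v - mean) ** 2 for v in values]
    variance = sum(squared_deviations) / (len(values) - 1)
    return math.sqrt(variance)


# Hypothetical data with one unusually large value.
data = [2, 4, 4, 5, 6, 7, 8, 9, 30]
print("five number summary:", five_number_summary(data))
print("outliers:", outliers(data))
print("standard deviation:", round(sample_std_dev(data), 2))
```

For this data the quartiles are 4 and 8.5, so the IQR is 4.5 and anything above 8.5 + 1.5 × 4.5 = 15.25 counts as an outlier, which flags the value 30.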

Lecture 7
-Normal distribution (or normal curve): naturally occurring distribution; most individuals are clumped around the average, with numbers decreasing the farther values are from the average in either direction (e.g. physical measurements for adults of the same species and sex)
-Characteristics of a normal curve:
  -symmetric and bell-shaped
  -empirical rule:
    -68% of the values fall within 1 standard deviation of the mean in either direction
    -95% of the values fall within 2 standard deviations of the mean in either direction
    -99.7% of the values fall within 3 standard deviations of the mean in either direction
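The empirical rule percentages can be checked numerically. This sketch relies on a standard fact that is not stated in the notes: for a normal curve, the proportion within k standard deviations of the mean equals erf(k / √2). The function name is made up.

```python
import math


def proportion_within(k):
    """Proportion of a normal distribution within k standard deviations of the mean."""
    return math.erf(k / math.sqrt(2))


for k in (1, 2, 3):
    print(f"within {k} standard deviation(s): {proportion_within(k):.4f}")
```

The computed proportions round to the 68%, 95%, and 99.7% figures that need to be memorized.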

-Z-score: the distance of each data value from the mean, measured in standard deviations (negative means the value is below the mean, positive means the value is above the mean)
  -allows us to compare values that are measured on different scales, with different units, or from different populations
  -formula = (individual value - mean) / (standard deviation)

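$Z = \dfrac{y - \mu}{\sigma}$

where y is the individual value, μ is the mean, and σ is the standard deviation.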
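A small sketch of the z-score formula, used to compare two scores measured on different scales. The exam scores, means, and standard deviations are invented for illustration.

```python
def z_score(value, mean, std_dev):
    """Number of standard deviations that value lies above (+) or below (-) the mean."""
    return (value - mean) / std_dev


# Hypothetical scores on two exams with different scales.
z_exam_a = z_score(85, mean=75, std_dev=5)      # scored out of 100
z_exam_b = z_score(620, mean=500, std_dev=100)  # scored out of 800
print("exam A z-score:", z_exam_a)
print("exam B z-score:", z_exam_b)
```

The exam A score sits 2 standard deviations above its mean and the exam B score sits 1.2 above its mean, so the exam A result is the stronger one relative to its group even though the raw number is smaller.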

-standard normal curve: A normal curve with a mean of 0 and a standard deviation of 1. It is the curve that results when any normal curve is converted to standardized scores.

-At what percentile does a given individual fall, if you know their value?
  -if possible, use the empirical rule
  -otherwise calculate the standardized score
  -look up the normal table to find the percentile
-What proportion of individuals fall into a given range of values?
  -calculate the standardized score for the lowest and highest values of the range
  -look up the normal table to find the percentile below each value
  -subtract the smaller percentile from the larger percentile
-What value corresponds to a given percentile?
  -look up the closest percentile in the table
  -find the corresponding standardized score
  -the value you seek is that many standard deviations from the mean
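The three lookups above can be sketched with `statistics.NormalDist` from the Python standard library (Python 3.8 or later) in place of a printed normal table. The mean of 65 and standard deviation of 2.5 are hypothetical values, not numbers from the course.

```python
from statistics import NormalDist

# Hypothetical normal population: mean 65, standard deviation 2.5.
heights = NormalDist(mu=65, sigma=2.5)

# 1) Percentile of a known value: proportion of individuals below 70.
percentile_of_70 = heights.cdf(70)

# 2) Proportion in a range: percentile below the top minus percentile below the bottom.
between_62_5_and_67_5 = heights.cdf(67.5) - heights.cdf(62.5)

# 3) Value at a given percentile: the height with 90% of individuals below it.
value_at_90th = heights.inv_cdf(0.90)

print(f"percentile of 70: {percentile_of_70:.4f}")
print(f"proportion between 62.5 and 67.5: {between_62_5_and_67_5:.4f}")
print(f"90th percentile value: {value_at_90th:.2f}")
```

Here 70 is 2 standard deviations above the mean, and the range from 62.5 to 67.5 is the mean plus or minus 1 standard deviation, so the first two results can be checked against the empirical rule.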

Lecture 8
-Positive correlation: variables tend to move together (in the same direction)
-Negative correlation: variables tend to move in opposite directions
-Scatter plots: most common and most effective display for data
  -best way to start observing the relationship, and the ideal way to picture associations between two quantitative variables
  -explanatory variable on x axis, response variable (goes up) on y axis
-Correlation: measures the strength of the linear association (relationship) between two quantitative variables
-Correlation (r) conditions:
  -correlation applies only to quantitative variables
  -correlation measures the strength only of the linear association, and will be misleading if the relationship is not linear
  -outliers can distort the correlation dramatically
-Correlation properties:
  -the sign of a correlation coefficient gives the direction of the association
  -always between -1 and +1
  -can be exactly equal to -1 or +1, but these values are unusual in real data because they mean that all the data points fall exactly on a single straight line
  -correlation of 1 = points fall exactly on a line; a flat trend (no linear association) = 0
  -does not depend on units (remains the same)
  -a coefficient that is not 0 tells us how "line-like" the cloud of points in the scatter plot is
  -strong correlation: r is greater than 0.8 or less than -0.8
  -moderate correlation: r is between -0.8 and -0.5 or between 0.5 and 0.8
  -weak correlation: r is greater than -0.5 and less than 0.5
  -there can be a strong association that is nonlinear (exponential growth etc.)
-Regression: method for studying the association between two variables using a model
-Best fit line (regression line): a straight line that comes as close as possible to the points of a scatter plot
-Regression effect (or regression toward the mean): an attribute that is extreme on an initial measurement will tend to be closer to the mean of the group on a subsequent measurement.
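A sketch of the correlation coefficient r and a best fit line for a small made-up data set. The notes do not say how "as close as possible" is measured; this example assumes the usual least-squares definition, and the data and function names are hypothetical.

```python
import math


def correlation(xs, ys):
    """Correlation coefficient r between two equal-length lists of numbers."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def best_fit_line(xs, ys):
    """Slope and intercept of the least-squares line (an assumed definition)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical data: explanatory variable xs, response variable ys.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

r = correlation(xs, ys)
slope, intercept = best_fit_line(xs, ys)
print(f"r = {r:.4f}")
print(f"predicted y = {intercept:.1f} + {slope:.1f} * x")

# Changing the units of x (e.g. meters to centimeters) leaves r unchanged.
r_rescaled = correlation([x * 100 for x in xs], ys)
print(f"r after rescaling x = {r_rescaled:.4f}")
```

With r near 0.77, this made-up data would count as a moderate positive correlation under the cutoffs listed above.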
-Regression (or regressive) fallacy: the assumption that something has returned to normal because of corrective actions taken; this fails to account for natural fluctuations

Lecture 9
-Linear regression model: gives a predicted value for each case in the data
  -cannot assume the relationship exists beyond the range of the data given
  -the farther a value of x is from the mean, the less we should trust the prediction
  -outliers can have a major impact
-Extrapolation: when we get into new "x" territory
  -dangerous because we have to assume nothing has changed in the relationship
  -used to try to look into the future
-Simpson's paradox: a trend present in different groups is reversed when the groups are combined (2/8 > 1/5, 4/5 > 6/8, 6/13...
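Simpson's paradox can be checked with exact fractions. This sketch reuses the four within-group fractions quoted in the note above; the labels "A", "B", "group 1", and "group 2" are hypothetical, and the combined totals are simply the sums of the numerators and denominators.

```python
from fractions import Fraction

# (successes, total) per group; the labels are made up for illustration.
counts = {
    "A": {"group 1": (2, 8), "group 2": (4, 5)},
    "B": {"group 1": (1, 5), "group 2": (6, 8)},
}


def rate(successes, total):
    """Success rate as an exact fraction."""
    return Fraction(successes, total)


def combined_rate(groups):
    """Pool the groups by adding up successes and totals before dividing."""
    successes = sum(s for s, _ in groups.values())
    total = sum(t for _, t in groups.values())
    return Fraction(successes, total)


for group in ("group 1", "group 2"):
    a = rate(*counts["A"][group])
    b = rate(*counts["B"][group])
    print(f"{group}: A = {a}, B = {b}, A higher: {a > b}")

combined_a = combined_rate(counts["A"])
combined_b = combined_rate(counts["B"])
print(f"combined: A = {combined_a}, B = {combined_b}, A higher: {combined_a > combined_b}")
```

A has the higher rate inside each group, yet B has the higher rate once the groups are pooled, because most of A's cases fall in the group where rates are low.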

