Chapter 1 Notes PDF

Title Chapter 1 Notes
Author Clear Vision
Course Introduction to Statistics and Data Analysis
Institution University of Michigan
Pages 11

Summary

STATS 250 NOTES MADE EASIER...


Description

STATS 250 NOTES* (condensed) Chapter 1 **Report values to 3-4 decimal places.**

Stats investigation process --- ADEUFR:
1. Ask a research question
2. Design a study and collect data
3. Explore the data, provide graphical displays and numerical summaries
4. Use statistical analysis methods to draw inferences from the data
5. Formulate conclusions, communicate findings, and answer research questions
6. Reflect and look forward (point out limitations and suggest further studies)

When the data were not collected for the research question --- ITEUFR:
1. Import the data
2. Tidy the data
3. Explore the data, provide graphical displays and numerical summaries
4. Use statistical analysis methods to draw inferences from the data
5. Formulate conclusions, communicate findings, and answer research questions
6. Reflect and look forward (point out limitations and suggest further studies)

Categorical Variables --- or qualitative variables
● Place an individual or item into one of several groups or categories, which are called levels.
● Nominal vs. ordinal variables: ordinal levels follow a specific order (small, medium, large); nominal levels do not
● Summarized with frequency counts
● Relative frequency: the counts reported as decimals or percentages
● Graphs used to display: bar charts, pie charts

● Example: Ice cream flavor is a categorical variable. The individual flavors are the levels, and the number of people who like each flavor is the frequency count. Reporting those counts as decimals or percentages gives the relative frequency.

Numerical Variables --- or quantitative or measurement variables
● Variables that take on a wide range of numerical values, and it is sensible to do math with them.
● Discrete: numerical variables with jumps. Ex: 1, 2, 3, 4, 5
● Continuous: can take any value in an interval or collection of intervals. Ex: 2.3, 6.7, 7.808, 8.9999
● Numerical variables can be grouped into ranges. Ex: ages 18-24, 25-34, 35-44…
● It’s possible to turn numerical values into categorical ones by grouping them, but not categorical values into numerical ones
● Graphs: histograms, boxplots, scatterplots

Contingency tables: summarize counts for two categorical variables (data on a single variable is summarized with a one-way table)
Data Matrix: common way to organize raw, unprocessed data. Each row is a case and each column is a variable.
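The ice cream example and the grouping of ages above can be sketched in a few lines of Python. This is only an illustration: the survey responses, the ages, and the helper `age_group` are made up for the example and are not part of the course notes.

```python
from collections import Counter

# Hypothetical survey responses: each person's favorite flavor.
flavors = ["vanilla", "chocolate", "vanilla", "strawberry",
           "chocolate", "vanilla", "chocolate", "vanilla"]

counts = Counter(flavors)          # frequency count for each level
n = len(flavors)
# Relative frequency = count / total, reported to 4 decimal places.
rel_freq = {level: round(c / n, 4) for level, c in counts.items()}

def age_group(age):
    """Turn a numerical age into a categorical (ordinal) group."""
    if age < 25:
        return "18-24"
    if age < 35:
        return "25-34"
    return "35+"

ages = [19, 24, 25, 31, 40]        # hypothetical ages
groups = [age_group(a) for a in ages]

print(dict(counts))
print(rel_freq)
print(groups)
```

Note that the grouping goes one way only: once an age is stored as "25-34", the original number cannot be recovered.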

Population: The entire group we are interested in learning about. Ex: all undergraduate students in the US.
● We don't observe every case in a study. Doing so would be time-consuming and costly, and measurement can sometimes destroy the item being measured.
Sample: A subset of the cases, often a small fraction of the overall population. Ex: undergraduate students at UMICH.
● It provides an estimate for the overall population and is less time-consuming and less costly to collect.
● A sample can be biased, so the way we sample is important

● Sources of bias: convenience sampling, response bias, nonresponse bias
Anecdotal Evidence: Typically composed of unusual cases that are recalled because of their striking characteristics, so it is not representative of the population.

Sampling from a Population: To draw inferences about a population
1. The sample must be representative of the entire population
2. Use random sampling: subjects of the study/experiment should be selected randomly to ensure the sample is representative

Explanatory and Response variables
● Explanatory: a variable that predicts the outcome
● Response: a variable that is the outcome --- it responds to the explanatory variable

Two types of data collection: Observational and Experimental
● Observational Studies: researchers collect data in a way that does not directly interfere with how the data arise. They simply observe.
○ Interested in the relationship between two or more variables
○ Data is usually collected only by monitoring what occurs
○ Making causal inferences based on observational studies is difficult but not impossible
● Experiments: the researcher directly influences the process by which the data arise.
○ Subjects are usually assigned to one or more treatments RANDOMLY
○ There is usually a control group, which often receives a placebo
○ Require that the primary explanatory variable in the study be assigned to each subject by the researchers
○ Making causal conclusions is reasonable, depending on the way the explanatory variable is assigned.

Observational Studies

Simple Random Sample (SRS)
● An SRS of n observations from a population is one in which each possible sample of that size has the same chance of being the sample that is selected
● So, every member/case in the population has an equal chance of being included, and there is no implied connection between the members/cases in the sample.
● The best way of ensuring that the sample is representative of the population it is chosen from.
● For experiments: random assignment makes the groups as similar as possible (dispersing confounding factors evenly between groups), so that the only systematic difference is the treatment.

Stratified Sampling
● Strata: groups of individuals or cases in a population who share characteristics thought to be associated with the variable we want to measure
● A “divide and conquer” sampling strategy
● The population is divided into non-overlapping strata
● An SRS is taken from each stratum
● Works well when there is variability between strata but not much variability within each stratum
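The two sampling schemes above can be sketched with Python's standard `random` module. The population of 60 students, the class-year strata, and the sample sizes are all hypothetical choices made for illustration.

```python
import random

random.seed(250)  # fixed seed so the sketch is repeatable

# Hypothetical population: 60 students, each tagged with a class year.
population = [(f"student_{i}", "freshman" if i < 30 else "senior")
              for i in range(60)]

# Simple random sample: every possible sample of size 10 is equally likely.
srs = random.sample(population, 10)

# Stratified sample: divide into non-overlapping strata,
# then take an SRS of 5 from each stratum.
strata = {}
for name, year in population:
    strata.setdefault(year, []).append((name, year))

stratified = []
for year, members in strata.items():
    stratified.extend(random.sample(members, 5))

print(len(srs), len(stratified))
```

The SRS may by chance contain an uneven mix of freshmen and seniors, while the stratified sample always contains exactly five of each.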

Convenience Sampling
● Refers to samples that are obtained by measuring whatever or whoever is available to be measured, so the nearest cases and subjects --- convenient
● Rarely representative of a larger population
● The sampling method is often biased

BIAS in sampling:
● Results obtained from a survey are biased if the method used to obtain them would consistently produce values that are either too high or too low.
● Selection Bias: occurs when the method for selecting participants produces a sample that does not represent the population of interest. Ex: convenience sampling
● Nonresponse Bias: occurs when a representative sample is chosen for a survey, but a subset cannot be contacted or does not respond. Ex: a voluntary response sample, or a survey that ends up reaching only a specific group of people rather than the whole sample.
● Response Bias: occurs when participants respond differently from how they truly feel. The way questions are worded, the way the interviewer behaves, and many other factors might lead an individual to provide false information
● Sampling Bias: occurs when the method for selecting participants causes some individuals in the population to be more or less likely to be included in the sample than others. Ex: convenience sampling, or any type of sampling that is not random.

Limits of observational studies:
● Confounding variables: variables that are associated with both the explanatory and response variables. These are hidden variables that affect the conclusions we make in a study.
○ Get in the way of making causal conclusions about relationships between the explanatory and response variables.
○ No guarantee that all confounding variables can be examined or measured, which is why it is difficult to draw causal conclusions from observational studies

Why conduct observational studies?
*Observational studies are good at showing that a relationship exists. It is difficult to use the data from observational studies to show why a relationship exists.*
● Often it’s not feasible or ethical to conduct experiments to make causal conclusions, so we make use of observational studies instead.

● Causal relationships are described as one-way (asymmetrical) relationships between two variables
● Associative relationships are described as two-way (symmetrical) relationships between two variables

Experiments
● Establishing cause-and-effect relationships is central to science.
● Randomized Controlled Experiments: can demonstrate causal connections between variables. Instead of gathering data from simple observations, researchers randomly assign treatments to different cases and observe the treatments' effects.

Four principles of Experimental Design:
1. Controlling:
   a. Have a comparison group to compare the effects of treatments
   b. A control group lets us understand what the effect of the treatment is
      i. Also reduces or eliminates the effects of any other variables that might influence the results
   c. Expectations of researchers and subjects can influence the results of an experiment
   d. The control group often receives a placebo, a supposed treatment that has no real effect
   e. Placebo Effect: a beneficial effect produced by a placebo drug or treatment, which cannot be attributed to the properties of the placebo itself and must therefore be due to the patient's belief in that treatment
   f. Single-blind: when only one party is unaware of which treatment is given. Either the subjects do not know which treatment they were placed in but the researchers do, or the researchers do not know but the subjects do.
   g. Double-blind: when both subjects and researchers are unaware of the treatment assignment and/or the purpose of the experiment
2. Randomization:
   a. Subjects must be assigned randomly to treatments to account for confounding variables that cannot be controlled
   b. Evens out differences and produces groups that are comparable.
3. Replication:
   a. More cases = more accuracy in estimating the effect the explanatory variable has on the response
4. Blocking:
   a. A block is a group of experimental units that are known to be similar in some way that is expected to affect the response. Ex: subjects with heart disease and subjects without heart disease.
   b. The Randomized Block Design separates subjects into blocks and then randomly assigns treatments within each block

[Figures: Randomized Controlled Design; Randomized Block Design]
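Randomization and blocking can be sketched together in Python. The subject labels, the two treatment names, and the helper `assign_treatments` are hypothetical; the blocks follow the heart disease example from the notes.

```python
import random

def assign_treatments(subjects, treatments, rng):
    """Shuffle the subjects, then deal them out to the treatments in turn."""
    shuffled = subjects[:]
    rng.shuffle(shuffled)
    return {s: treatments[i % len(treatments)]
            for i, s in enumerate(shuffled)}

rng = random.Random(1)             # fixed seed so the sketch is repeatable
treatments = ["treatment", "placebo"]

# Hypothetical subjects, blocked by heart disease status.
blocks = {
    "heart disease": ["A", "B", "C", "D"],
    "no heart disease": ["E", "F", "G", "H", "I", "J"],
}

# Randomized block design: randomize separately inside each block,
# so each block contributes equally to both treatment groups.
assignment = {}
for block, subjects in blocks.items():
    assignment.update(assign_treatments(subjects, treatments, rng))

for block, subjects in blocks.items():
    print(block, {s: assignment[s] for s in subjects})
```

Calling `assign_treatments` once on all ten subjects, ignoring the blocks, would give a plain randomized controlled design instead.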

Summarizing Data

Parameters and Statistics:
● Summary values calculated from a population are called parameters (population --- parameters)
● Summary values calculated from a sample are called statistics (sample --- statistics)
● We use sample statistics to estimate the population parameters
● Dichotomous/Binary variables: sometimes we are interested in the proportion of the population (or the sample) that has a specific characteristic
○ We categorize the response into exactly two categories
○ Binary code: 1's and 0's. 1 = the category being counted and 0 = the category not being counted

Calculating proportions:

$p = \frac{x}{n}$

where x is the number of 1's (cases in the category being counted) and n is the total number of cases. When computed from a sample, the proportion is written $\hat{p}$.
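As a small worked example of the proportion formula, assuming a made-up list of binary responses coded as described above:

```python
# Hypothetical binary responses: 1 = has the characteristic, 0 = does not.
responses = [1, 0, 0, 1, 1, 0, 1, 0, 0, 1, 1, 1]

x = sum(responses)     # number of 1's (the category being counted)
n = len(responses)     # total number of observations
p_hat = x / n          # sample proportion

print(x, n, round(p_hat, 4))   # prints: 7 12 0.5833
```

Because the responses are coded 1 and 0, the proportion is simply the mean of the coded values.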

Summarizing Numerical Data
1. Central Tendency: Mean and Median
   a. Mean: the sample mean of a numerical variable is the sum of all the observations divided by the number of observations (the numerical average)
      i. Sensitive to extreme observations
   b. Median: the middle number when the data are arranged from smallest to largest. Also the 50th percentile.
      i. With an even number of observations: take the average of the two middle observations
      ii. Resistant/robust to extreme observations
2. Variation: Range, IQR
   a. Range = maximum - minimum
      i. Sensitive to extreme observations
   b. Interquartile range or IQR = Q3 - Q1, or 75th percentile - 25th percentile
      i. Resistant/robust to extreme observations
   c. The 5-number summary: Min, Q1, Median, Q3, and Max
3. Variance and Standard Deviation:
   a. A deviation tells us how far an observation departs from the mean.
      i. Deviation: observation - mean = $x - \bar{x}$
   b. We square the deviations before adding them up. The result is the sum of squares, which never decreases as observations are added
   c. Variance is the sum of squares divided by the number of observations minus one
   d. Standard deviation: the square root of the variance
      i. Interpreting SD: The [response variable in context] is, on average, about __SD__ away from the mean [response variable in context].
      ii. Sensitive to extreme observations
      iii. s = 0 means there is no variability in the data. All observations are the same and equal to the mean
      iv. Describes how close or far the data are from the mean, using the units in which the data are recorded
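The difference between sensitive and resistant measures in items 1 and 2 can be seen by adding one extreme value to a small data set. The data below are made up, and the quartile rule is an assumption: textbooks and software use slightly different conventions, and this sketch uses the `"inclusive"` method of `statistics.quantiles` (Python 3.8+), which may not match the course's convention exactly.

```python
import statistics

data = [2, 4, 4, 5, 7, 9, 11]      # hypothetical observations
with_outlier = data + [100]        # same data plus one extreme value

def summarize(values):
    """Return center and spread measures for a list of numbers."""
    values = sorted(values)
    # Assumed quartile convention: linear interpolation, endpoints included.
    q1, med, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return {
        "mean": statistics.mean(values),
        "median": med,
        "range": values[-1] - values[0],
        "iqr": q3 - q1,
        "five_number": (values[0], q1, med, q3, values[-1]),
    }

print(summarize(data))
print(summarize(with_outlier))
```

With these numbers the extreme value moves the mean from 6 to 17.75 and the range from 9 to 98, while the median only moves from 5 to 6 and the IQR from 4 to 5.5.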

Variance: $s^2 = \frac{\text{sum of squares}}{n - 1} = \frac{\sum (x - \bar{x})^2}{n - 1}$

Standard Deviation (SD): $s = \sqrt{\text{variance}} = \sqrt{\frac{\sum (x - \bar{x})^2}{n - 1}}$

● Median uses IQR
● Mean uses Standard deviation
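The variance and standard deviation formulas can be followed step by step in Python. The data are made up for the example; the final check against the standard library's `statistics.stdev` is there only to confirm that the hand calculation uses the same n - 1 divisor.

```python
import math
import statistics

data = [2, 4, 4, 5, 7, 9, 11]                      # hypothetical observations
n = len(data)
mean = sum(data) / n                               # x-bar = 6.0

deviations = [x - mean for x in data]              # each x - x-bar
sum_of_squares = sum(d ** 2 for d in deviations)   # 60.0
variance = sum_of_squares / (n - 1)                # s^2 = 60 / 6 = 10.0
sd = math.sqrt(variance)                           # s

print(round(variance, 4), round(sd, 4))            # prints: 10.0 3.1623

# The library's sample standard deviation agrees with the hand calculation.
assert math.isclose(sd, statistics.stdev(data))
```

The deviations themselves always sum to zero, which is why they are squared before being added.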

Notation:

Graphical Summaries of Data - SOCV
1. Shape:
   a. Symmetric: $\text{median} \approx \text{mean}$
   b. Skewed left: $\text{median} > \text{mean}$
   c. Skewed right: $\text{mean} > \text{median}$
   d. Use "-ly" words: roughly, slightly
   e. Bimodal (two peaks in the data) or unimodal (only one peak)
2. Outliers: observations that are far away from the majority of the data.
3. Center: measure of central tendency --- the mean or the median
4. Variability: the standard deviation or the IQR
● Use the median and IQR when the distribution is skewed
● Use the mean and standard deviation when the distribution is roughly symmetric.
● Describe histograms and dotplots using SOCV
● Dotplot: represents each observation in a data set using a single dot plotted along the x-axis
● Histogram: used for larger data sets of quantitative variables
● Boxplot: a data visualization of the 5-number summary
○ Be careful when determining the shape of a distribution from a boxplot; it might not be accurate.

Categorical Variables:
● Contingency tables
● Bar plots: segmented or side-by-side
● To describe the data in these displays: compare the different categories and provide simple observations you notice about the data....
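The mean-versus-median rules under Shape, together with the advice on which center and variability measures to report, can be turned into a rough check. This is a rule of thumb only, not a substitute for looking at a histogram. The function `describe_shape` and its `tolerance` cutoff are hypothetical choices made for this sketch.

```python
import statistics

def describe_shape(values, tolerance=0.1):
    """Rough shape label from the gap between mean and median.

    `tolerance` is an assumed cutoff, as a fraction of the standard
    deviation, below which the mean and median count as roughly equal.
    """
    mean = statistics.mean(values)
    median = statistics.median(values)
    sd = statistics.stdev(values)
    if sd == 0 or abs(mean - median) <= tolerance * sd:
        return "roughly symmetric: report mean and standard deviation"
    if mean > median:
        return "skewed right: report median and IQR"
    return "skewed left: report median and IQR"

print(describe_shape([1, 2, 3, 4, 5]))
print(describe_shape([1, 2, 2, 3, 3, 4, 50]))
print(describe_shape([-50, 1, 2, 2, 3, 3, 4]))
```

A check like this cannot detect bimodal data or outliers on both sides, so the S and O parts of SOCV still need a plot.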

