Midterm pt 1 study guide PDF

Title	Midterm pt 1 study guide
Author	Skyler Lowman
Course	Elementary Statistics
Institution	James Madison University
Pages	20
File Size	473.9 KB
File Type	PDF
Total Downloads	20
Total Views	148

Summary

Midterm study guide, the statistics half - Professor Arlene Casiple...

Description

Math 220 Midterm Part 1
Chapters 1-6, 11

Chapter 1: Basic Idea

Vocabulary Words
● Statistics: The science of collecting, summarizing/analyzing, and drawing conclusions from data
● Population: The entire collection of individuals or objects the researcher is interested in
● Sample: A subset of the population that is actually observed
  ○ Simple Random Sample: a sample where each person/thing in the population is equally likely to be included in the sample (like a lottery); the most basic method
    ■ On Random Digits Chart: every 2 digits
    ■ σn = population standard deviation
    ■ σn-1 = sample standard deviation
    ■ *** HOW TO FIND SAMPLE MEAN USING SCIENTIFIC CALC:
      ● 2nd 7 (CSR): Clear Data List
      ● Put numbers in calc → press the Σ+ button after each number (end number should be total numbers entered)
      ● 2nd x-bar (above x^2)
    ■ *** HOW TO FIND SAMPLE STANDARD DEVIATION ON CALC:
      ● 2nd 7 (CSR): Clear Data List
      ● Put numbers in calc → press the Σ+ button after each number (end number should be total numbers entered)
      ● 2nd σxn-1 (above square root x) for SAMPLE
        ○ POPULATION IS σxn
      ● ROUND TO 4 DECIMAL PLACES
  ○ Convenience Sample: a sample taken from a group of people who are easy to contact or reach, not drawn by a well-defined random method; only okay to use when there is no systematic difference between the sample and the population
  ○ Stratified Sampling: dividing the population into strata (groups) & then taking a simple random sample from each
    ■ Ie: dividing the population into groups based on age, with a different number of people surveyed from each group
    ■ Simple random sample of each group
  ○ Cluster Sampling: divide the population into clusters, randomly select a sample of clusters & sample everyone in those few clusters
    ■ Ie: only a few groups are selected, but everyone in the selected groups is surveyed
    ■ Simple random sample of groups & sample everyone in the few randomly selected groups
  ○ Systematic Sampling: the population items are ordered; randomly choose a starting point in the first segment, then sample at the same point in each segment
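
The random sampling schemes described so far (simple random, stratified, cluster, systematic) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the population of 100 numbered individuals, the group names, the group sizes, and the helper function names are all made up for the example.

```python
import random

def simple_random_sample(population, n, rng):
    # Every individual is equally likely to be chosen (like a lottery).
    return rng.sample(population, n)

def stratified_sample(strata, n_per_stratum, rng):
    # Take a simple random sample from EVERY stratum (group).
    return {name: rng.sample(members, n_per_stratum[name])
            for name, members in strata.items()}

def cluster_sample(clusters, n_clusters, rng):
    # Randomly pick a few clusters, then include everyone in those clusters.
    chosen = rng.sample(sorted(clusters), n_clusters)
    return [person for name in chosen for person in clusters[name]]

def systematic_sample(population, segment_size, rng):
    # Random start inside the first segment, then the same position in each segment.
    start = rng.randrange(segment_size)
    return population[start::segment_size]

rng = random.Random(220)  # fixed seed so the sketch is repeatable
population = list(range(1, 101))  # hypothetical population of 100 people
strata = {"under_30": population[:40], "30_and_over": population[40:]}
clusters = {f"dorm_{i}": population[i * 10:(i + 1) * 10] for i in range(10)}

srs = simple_random_sample(population, 10, rng)
strat = stratified_sample(strata, {"under_30": 4, "30_and_over": 6}, rng)
clus = cluster_sample(clusters, 2, rng)
syst = systematic_sample(population, 10, rng)
```

Note the contrast between the two group-based schemes: stratified sampling draws a few people from every group, while cluster sampling draws every person from a few groups.
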

  ○ Voluntary Response Sampling: highly unreliable; relies only on volunteers to take part in the sample → usually people who have strong opinions
● Parameter: A number that describes the population (in simple random sampling)
  ○ μ (mu): Population Mean
  ○ σ (sigma): Population Standard Deviation
  ○ p: Population Proportion
● Statistic: A number that describes the sample (in simple random sampling)
  ○ X-Bar: Sample Mean
  ○ s: Sample Standard Deviation
  ○ P-Hat: Sample Proportion
● Variables: Characteristics of individuals/subjects about which we collect information
  ○ Data Set: The collection of information (values of variables) / Data: the values of the variables we obtain
  ○ Qualitative/Categorical Variables: Classify individuals into categories (ie: gender, color of a car)
    ■ Ordinal: There is a natural ordering (ie: letter grade, size of drinks)
    ■ Nominal: There is NO order (ie: gender, state of residence)
  ○ Quantitative Variables: Tell us how much or how many of a variable there is (ie: age, mileage of a car); always numerical
    ■ Discrete: Variables that can be counted/listed; whole numbers (ie: age, number of siblings)
    ■ Continuous: Variables that are measured; possible values are not restricted to any list, fall in intervals, and can be decimals (ie: height, distance someone commutes to work)
  ○ Explanatory Variable: A variable that attempts to explain, or is purported to cause (at least partially), differences in the response variable (independent variable); in experiments
  ○ Outcome/Response Variable: What is measured on each individual after a treatment is applied; in experiments
  ○ Treatment: The procedures applied to each experimental unit; always 2 or more, in order to determine whether the choice of treatment affects the outcome; the different levels of the explanatory variable
  ○ Confounding Variables/Confounder: A variable related to both the treatment and the outcome
    ■ Can distort the results of an experiment because they, and not the explanatory variable, may be the agent that actually causes the change in the response variable
    ■ Solution: RANDOMIZATION!!

Types of Studies
● Randomized Experiments: A study in which an investigator assigns the treatment to subjects at random
  ○ Can conclude cause/effect!!!
  ○ Rules out confounding variables
  ○ If there are large differences in outcomes among treatment groups, we can conclude that the differences are due to differences in treatment
  ○ Double Blind: when neither the investigator nor the subjects know who has been assigned to which treatment
  ○ Randomized Block Experiment: subjects are divided into blocks (subjects in each block are similar with regard to a variable related to the outcome, ie: age or gender) → used to control known sources of variation
● Observational Studies: A study in which the assignment to treatment is NOT made by the investigator
  ○ CanNOT conclude cause/effect
  ○ Results in ASSOCIATION
  ○ Response variable difference might be caused by confounding variables
  ○ Cohort Studies: A group of subjects (the cohort) is studied to determine whether various factors of interest are associated with an outcome (Con: cannot be used to study rare diseases)
    ■ Prospective: subjects are followed over time; among the best of observational studies (Pros: quality of data is better & confounder information can potentially be collected, so results are reliable; Cons: expensive and time-consuming)
    ■ Retrospective: subjects are sampled after the outcome has occurred; investigators look back over time to determine whether certain factors are related to the outcome (Pros: less expensive, quick results; Cons: confounders)
    ■ Cross-Sectional: Measurements are taken at one point in time (Pros: inexpensive, quick results; Cons: confounding variables about how past exposures contributed to the outcome, no way to determine whether the cause or the effect came first)
  ○ Case-Control Studies: two samples are drawn (one with the disease: cases, and one without the disease: controls); can be used to study rare diseases
    ■ Investigators look back over time to determine factors of interest that differ between the two groups
    ■ Always retrospective

Bias
● Bias: the degree to which a procedure systematically overestimates or underestimates a population value
  ○ Simple random sampling: UNBIASED
● Sources of Bias:

  ○ Voluntary Response Bias: people with strong opinions are more likely to participate
    ■ Most of the time, the survey is optional (people are invited to participate in order to express their opinion)
    ■ Ie: President asks people to email their opinion
  ○ Self-Interest Bias: people who have an interest in the outcome of an experiment have an incentive to use biased methods
    ■ Ie: Advertisers will not report data that shows that their product is inferior; a protein supplement is claimed by its company to yield 15-20% muscle growth in 6 weeks
  ○ Social Acceptability Bias: people are reluctant to admit to behavior that may reflect negatively on them (ie: Did you vote last election?); solution: wording
    ■ Ie: prisoners saying they are innocent when asked
  ○ Leading Question Bias: questions are sometimes worded in a way that suggests a particular response; solution: wording
  ○ Nonresponse Bias: people who are asked to participate (nonresponders) refuse to do so
    ■ Surveys are sent out & people don’t respond!! Different from voluntary response bias, where participation is optional!
    ■ Ie: predicting life expectancy based on death certificates doesn’t include the people still living, OR people simply not responding to a survey
      ● Ie: questionnaires sent out to all constituents, only some respond
  ○ Sampling Bias: when some members of the population are more likely to be included in the sample than others (ie: convenience sampling); almost impossible to avoid sampling bias
● *** A sample has to be representative of the population; size does not matter ***

Chapter 2: Graphical Summaries of Data

Frequency Distributions for Qualitative Data
● Frequency: the number of times a category occurs in the data set
● Frequency Distribution: A table that presents the frequency for each category
  ○ Categories in one column, frequency in another column
● Relative Frequency: The frequency of the category divided by the sum of all the frequencies; the proportion of items in the category
● Relative Frequency Distribution: A table that presents the relative frequency of each category; often the frequency is presented as well
  ○ Categories in one column, frequency in another column, relative frequency in a third column
  ○ All relative frequencies should ADD UP TO 1
● Bar Graphs/Charts: A graphical representation of a frequency distribution
  ○ A bar for each category
  ○ Heights represent the frequencies or relative frequencies of the categories

  ○ Category names along the horizontal axis, evenly spaced
  ○ Vertical axis: the frequency or relative frequency
  ○ Bars do not touch & should all be the same width
  ○ CAN MOVE BARS; meanings won’t change
● Pareto Charts: Bar graphs which are ordered by size; largest frequency/relative frequency on the left, smallest on the right
  ○ Useful when it’s important to see the most frequently occurring categories
● Horizontal Bars: bar graph with the axes switched, so the bars are horizontal
  ○ Helpful when categories have long names
● Side-by-Side Bar Graphs: bar graphs with both sets of bars on the same axes, putting bars that correspond to the same categories next to each other
  ○ Helpful to compare two bar graphs that have the same categories (ie: difference of frequencies from 2 different years)
  ○ Bars right next to each other on the same axis, different colored bars for different years or other variable
● Pie Charts: Displays relative frequencies in a circle
  ○ Divided into sectors for each category; sizes match relative frequencies
  ○ Label each sector with its relative frequency as a percentage

Frequency Distributions for Quantitative Data
● Classes: we divide data into classes in order to construct a frequency distribution; classes are intervals of equal width that cover all the values that are observed
  ○ Ie: 0.00-0.99, 1.00-1.99, 2.00-2.99
  ○ Lower Class Limit: the smallest value that can appear in that class
  ○ Upper Class Limit: the largest value that can appear in that class
  ○ Class Width: the difference between the lower limit of a class & the lower limit of the next class
  ○ Every observation must fall into a class, classes cannot overlap, classes must be of equal width, no gaps between classes even if there are no observations in a class
    ■ Computing the class width for a given number of classes: (largest data value - smallest data value) / number of classes
● Frequency Distribution: a table that presents the frequency for each class
  ○ Class in one column, frequency in the other column
● Relative Frequency: the frequency of the class divided by the sum of all frequencies
● Relative Frequency Distribution: a table that presents the relative frequency of each class
  ○ Class in one column, frequency in one column, & relative frequency in another column
  ○ All relative frequencies should ADD UP TO 1
● Histograms: A graphical representation of a frequency or relative frequency distribution
  ○ On the horizontal axis, each class gets a bar; the left edge should be at the class's lower limit & the right edge at the lower limit of the next class
  ○ Vertical axis: frequency or relative frequency
  ○ Widths are equal to class widths & bars must touch!!
  ○ The number of classes should be at least 5 but no more than 20
  ○ Cannot move bars
  ○ Open Ended Classes: when the first class has no lower limit or the last class has no upper limit (ie: 85 and older)
  ○ Histograms for Discrete Data: ie: drawing a histogram for the variable: number of children (discrete data)
    ■ Bars have equal width, centered at the values of the variable, bars still touch
  ○ Shapes: histograms give the shape of a data set
    ■ Symmetric: when its right half is a mirror image of its left half
    ■ Skewed Right: more frequency on the left side; tall left, low right
      ● Right tail is longer than left tail
    ■ Skewed Left: more frequency on the right side; tall right, low left
      ● Left tail is longer than right tail
  ○ Modes: Peaks!
    ■ Unimodal: only has one mode or peak
    ■ Bimodal: has 2 modes or peaks

More Graphs for Quantitative Data
● Stem-and-Leaf Plots: a vertical list of all the stems in increasing order, a vertical line in between, and the list of leaves on the right (rightmost digit) in increasing order
  ○ If the leaves would have 2 digits, you have to round the data to make the leaves one digit
  ○ Tells us more information about extreme values
  ○ Split Stem-and-Leaf Plot: when one or two stems contain most of the leaves, we use 2 or more lines for each stem
    ■ Each stem must be given the same number of lines
  ○ Back-to-Back Stem-and-Leaf Plot: when 2 data sets have values similar enough that the same stems can be used, so that we can compare their shapes
    ■ Stems go down the middle, leaves from one set go off to the right (smallest leaf closest to the stem) and leaves from the other set go to the left
    ■ Title each data set at the top
● Dotplot: a graph that can be used to give a rough impression of the shape of a data set
  ○ Useful when the data set is not too large and there are some repeated values
  ○ A vertical column of dots is drawn; the number of dots in each column is equal to the number of times the value appears in the data set
  ○ Indicates where values are concentrated and where gaps are
● Time Series Plot: used when the data consists of values of a variable measured at different points in time
  ○ Horizontal axis represents time
  ○ Vertical axis represents the value of the variable being measured
  ○ Plot the values of the variable at each of the times & connect the points with straight lines
● Box and Whisker Plots
  ○ Good for finding outliers

Misleading Graphs:
● Positioning the vertical scale
  ○ The baseline has to be at zero so as not to exaggerate the differences between the bars or in a time-series plot
● The Area Principle: When amounts are compared by constructing an image for each amount, the AREAS of the images must be proportional to the amounts
  ○ Ie: if one amount is twice as much as another, its image should have twice as much area as the other image
● Three Dimensional Graphs & Perspective: 3 dimensional bar graphs are often drawn as if the reader is looking down; if you can see the tops of the bars, they may look shorter than they really are

Chapter 3: Numerical Summaries of Data

Describing Categorical Variables
● Do NOT write all the categories and their percentages
● If there are only 2 categories, mention only one category proportion (ie: a little more than ½ the class are female)
● If there are multiple categories, compare them using proportions by combining categories that make sense
● Mention patterns of similar bar heights/slice sizes
● In general, mention the tallest bar/biggest slice & smallest bar/smallest slice in proportion
● Just need 1-2 sentences!!!
Describing Quantitative Variables:
● Shape: Symmetric, Bell-Shaped
  ○ Center: Mean; the typical value (mean is sensitive to outliers: not resistant)
  ○ Spread: Standard Deviation (not resistant); the typical difference between an observation and its typical value
● Shape: Skewed, Non-symmetric
  ○ Center: Median; the typical value (median is NOT sensitive to outliers: resistant)
  ○ Spread: Range & IQR. Min: the smallest value is ___; Max: the largest value is ___
    ■ IQR: the range of the middle 50% of values is between Q1 and Q3
● BOTH:
  ○ Outliers: (find from box-plot) unusually large or unusually small
    ■ Mild or Extreme; for extreme: exceptionally large or small
    ■ What are their values
    ■ Mention if there are no outliers
  ○ Gaps: Where are the gaps?
    ■ No observations are recorded in the interval/s between ___
    ■ Where the gaps occur in the histogram
  ○ Peaks: How many?
    ■ Unimodal or Bimodal?
    ■ Where are the peaks; give the interval where the peaks occur in the histogram
● Mean & Median Relationship
  ○ Skewed Right: Mean is noticeably greater than median
  ○ Symmetric: Mean is approximately equal to the median
  ○ Skewed Left: Mean is noticeably less than median
● Empirical Rule: for bell-shaped data; using the standard deviation to provide an approximate description of the data
  ○ Approximately 68% of the data will be within one standard deviation of the mean (between μ - σ and μ + σ)
  ○ Approximately 95% of the data will be within 2 standard deviations of the mean (between μ - 2σ and μ + 2σ)
  ○ Approximately 99.7%, or almost all, of the data will be within 3 standard deviations of the mean (between μ - 3σ and μ + 3σ)
    ■ Can use it to describe a data set
    ■ Only appropriate when the data is bell-shaped
● Chebyshev’s Inequality Rule: used when the distribution is unknown or not bell-shaped
  ○ In any data set, the proportion of the data that will be within K standard deviations of the mean is at least 1 - 1/K^2.
  ○ Set K=2 or K=3: at least ¾ (75%) of the data will be within 2 standard deviations of the mean & at least 8/9 (88.9%) of the data will be within 3 standard deviations of the mean
  ○ Sample mean - 2s ; sample mean + 2s → at least 75% of the values are between the results of those equations
  ○ Sample mean - 3s ; sample mean + 3s → at least 88.9% of the values are between the results of those equations
  ○ K has to be greater than 1!!
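
The Empirical Rule and Chebyshev's rule above can be compared numerically. The sketch below is only an illustration: the list of scores is made up, and the function names are hypothetical. It computes the Chebyshev lower bound 1 - 1/K^2 and the share of values that actually fall within K sample standard deviations of the sample mean.

```python
import statistics

def chebyshev_bound(k):
    # At least 1 - 1/K^2 of ANY data set lies within K standard deviations of the mean.
    if k <= 1:
        raise ValueError("K has to be greater than 1")
    return 1 - 1 / k**2

def share_within(data, k):
    # Proportion of values between (mean - K*s) and (mean + K*s).
    mean = statistics.mean(data)
    s = statistics.stdev(data)  # sample standard deviation (n - 1 divisor)
    low, high = mean - k * s, mean + k * s
    return sum(low <= x <= high for x in data) / len(data)

scores = [62, 70, 71, 74, 75, 77, 78, 80, 81, 83, 85, 88, 95]  # made-up data
EMPIRICAL = {1: 0.68, 2: 0.95, 3: 0.997}  # approximate; bell-shaped data only

for k in (2, 3):
    print(k, round(chebyshev_bound(k), 4), EMPIRICAL[k], round(share_within(scores, k), 4))
```

Chebyshev's bound is a guaranteed minimum for any data set, so the observed share can never fall below it; the Empirical Rule percentages are approximations that only apply when the data are bell-shaped.
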
● Coefficient of Variation: tells how large the standard deviation is relative to the mean
  ○ Used to compare the spreads of data sets whose values have different units
  ○ CV = σ / μ
  ○ The data set with the larger CV has the greater spread relative to its mean

Measures of Position:
● Z-Score: tells how many standard deviations a value is from its population mean
  ○ z = (x - μ) / σ
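
As a worked illustration, here is a short Python sketch assuming the usual forms CV = σ / μ and z = (x - μ) / σ. The means and standard deviations are hypothetical numbers chosen for the example, not values from the course.

```python
def z_score(x, mu, sigma):
    # Number of standard deviations that x lies above (+) or below (-) the mean.
    return (x - mu) / sigma

def coefficient_of_variation(sigma, mu):
    # Standard deviation relative to the mean; unit-free, so spreads
    # measured in different units can be compared.
    return sigma / mu

# Hypothetical values: heights in inches, weights in pounds.
height_cv = coefficient_of_variation(sigma=3.0, mu=69.0)
weight_cv = coefficient_of_variation(sigma=30.0, mu=180.0)
z = z_score(x=75.0, mu=69.0, sigma=3.0)

print(f"height CV = {height_cv:.4f}, weight CV = {weight_cv:.4f}, z = {z}")
```

With these assumed numbers the weights have the larger CV, so they are more spread out relative to their mean, even though the two variables are in different units.
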

  ○ Interpretation: The value of one variable is ___ standard deviations above the mean of the variable
  ○ Ie: the height of the 73 inch man is 1.16 standard deviations above the mean height for men
  ○ With the empirical rule:
    ■ Approximately 68% of the data will have z-scores between -1 and 1
    ■ Approximately 95% of the data will have z-scores between -2 and 2
    ■ Approximately 99.7%, or almost all, of the data will have z-scores between -3 and 3
● Quartiles:
  ○ First Quartile: Separates the lowest 25% of the data from the highest 75%
  ○ Second Quartile or Median: Separates the lower 50% of the data from the upper 50%
  ○ Third Quartile: Separates the lowest 75% of the data from the highest 25%
    ■ Computing: n = total number of values in the data set
    ■ Position of Q1 in the sorted data: n(0.25)
    ■ Position of Q3 in the sorted data: n(0.75)
  ○ IQR: used to detect outliers
    ■ IQR = Q3 - Q1
    ■ Lower Outlier Boundary: Q1 - 1.5 IQR
    ■ Upper Outlier Boundary: Q3 + 1.5 IQR
    ■ Any value that is less than the lower outlier boundary or greater than the upper outlier boundary is an outlier
● Mode: the most frequent value; there may be more than one
● Boxplot: a graph that presents the 5 number summary
  ○ Whisker to minimum, Q1 (first line of box), median (second line of box), Q3 (third line of box), whisker to maximum
  ○ Mark any outliers: o for mild outliers, * for extreme outliers
  ○ If the median is closer to Q1 or the maximum whisker is longer than the minimum whisker: data skewed right
  ○ If the median is closer to Q3 or the minimum whisker is longer than the maximum whisker: data skewed left
  ○ If the median is approximately halfway between Q1 and Q3 & the 2 whiskers are about equal in length: data is symmetric
  ○ Comparative Boxplots: plotting 2 or more boxplots above one another to show comparison
● Percentiles: Divide a data set into hundredths
  ○ The pth percentile separates the lowest p% of the data from the highest (100-p)%
● Five Number Summary: Minimum, Q1, Median, Q3, Maximum
● Outlier: a value that is considerably larger or smaller than most of the values in the data set

Chapter 4: Probability
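
Before moving on, the Chapter 3 outlier boundaries (Q1 - 1.5 IQR and Q3 + 1.5 IQR) and the five number summary can be sketched in Python. This is a hedged illustration: the data are made up, the function names are hypothetical, and `statistics.quantiles` uses one quartile convention among several, so its quartiles may differ slightly from values found by hand with the position rule.

```python
import statistics

def outlier_bounds(q1, q3):
    # Lower boundary: Q1 - 1.5*IQR; upper boundary: Q3 + 1.5*IQR.
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr

def five_number_summary(data):
    # Assumption: quartiles come from statistics.quantiles (default method).
    q1, median, q3 = statistics.quantiles(data, n=4)
    return min(data), q1, median, q3, max(data)

def find_outliers(data):
    # Any value below the lower boundary or above the upper boundary.
    _, q1, _, q3, _ = five_number_summary(data)
    lower, upper = outlier_bounds(q1, q3)
    return [x for x in data if x < lower or x > upper]

data = [2, 4, 5, 7, 8, 9, 10, 12, 13, 30]  # made-up values
print(five_number_summary(data))
print(find_outliers(data))
```

Keeping `outlier_bounds` separate from the quartile calculation means the boundary rule can be reused with quartiles computed by whatever method the course requires.
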

Basic Concepts of Probability:
● Probability: the proportion of times the event occurs in the long run, as a probability experiment is repeated over and over again
  ○ Law of Large Numbers: As the number of trials increases, the proportion of occurrences of any given outcome approaches a particular number in the long run
    ■ Ie: flipping a coin 200 times, you’ll get closer and closer to its true probability over time
  ○ A probability can never be negative and never be greater than 1
    ■ If A cannot occur then P(A) = 0
    ■ If A is certain to occur then P(A) = 1
  ○ Probability Model: a sample space along with a probability for each event P(A)
● Random Phenomenon: Any activity or situation in which there is un...