Title | Midterm pt 1 study guide |
---|---|
Author | Skyler Lowman |
Course | Elementary Statistics |
Institution | James Madison University |
Pages | 20 |
File Size | 473.9 KB |
Midterm study guide, the statistics half - Professor Arlene Casiple...
Math 220 Midterm Part 1: Chapters 1-6, 11

Chapter 1: Basic Ideas

Vocabulary Words
● Statistics: The science of collecting, summarizing/analyzing, and drawing conclusions from data
● Population: The entire collection of individuals or objects the researcher is interested in
● Sample: A subset of the population that is actually observed
  ○ Simple Random Sample: a sample chosen so that every group of n individuals/things in the population is equally likely to make up the sample (like a lottery); the most basic method
    ■ On the Random Digits Chart: take every 2 digits
    ■ σxn = population standard deviation
    ■ σxn-1 = sample standard deviation
    ■ *** HOW TO FIND SAMPLE MEAN USING SCIENTIFIC CALC:
      ● 2nd 7 (CSR): Clear Data List
      ● Put the numbers in the calc → press the Σ+ button after each number (the end number should be the total count of numbers entered)
      ● 2nd x-bar (above x^2)
    ■ *** HOW TO FIND SAMPLE STANDARD DEVIATION ON CALC:
      ● 2nd 7 (CSR): Clear Data List
      ● Put the numbers in the calc → press the Σ+ button after each number (the end number should be the total count of numbers entered)
      ● 2nd σxn-1 (above square root x): for SAMPLE
        ○ POPULATION IS σxn
      ● ROUND TO 4 DECIMAL PLACES
  ○ Convenience Sample: a sample taken from a group of people who are easy to contact or reach, not drawn by a well-defined random method; only okay to use when there is no systematic difference between the sample and the population
  ○ Stratified Sampling: dividing the population into strata (groups) & then taking a simple random sample from each
    ■ e.g.: dividing the population into groups based on age, where a different number of people may be surveyed from each group
    ■ Simple random sample of each group
  ○ Cluster Sampling: divide the population into clusters, randomly select a sample of clusters & sample everyone in those few clusters
    ■ e.g.: only a few groups are selected, but everyone in the selected groups is surveyed
    ■ Simple random sample of groups & sample everyone in the few randomly selected groups
  ○ Systematic Sampling: the population items are ordered, a starting point is randomly chosen in the first segment, and then the item at the same point in each later segment is sampled
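The four random sampling schemes described so far (simple random, stratified, cluster, systematic) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the population of ID numbers, the two groups "A" and "B", and every function name below are hypothetical and do not come from the course materials.

```python
import random

def simple_random_sample(population, n, rng):
    """Every group of n items is equally likely to be the sample."""
    return rng.sample(population, n)

def stratified_sample(strata, n_per_stratum, rng):
    """Take a simple random sample from EVERY stratum (group)."""
    return {name: rng.sample(members, n_per_stratum)
            for name, members in strata.items()}

def cluster_sample(clusters, n_clusters, rng):
    """Randomly pick a few whole clusters, then keep everyone in them."""
    chosen = rng.sample(sorted(clusters), n_clusters)
    return [item for name in chosen for item in clusters[name]]

def systematic_sample(ordered_population, k, rng):
    """Random start inside the first segment of length k, then every k-th item."""
    start = rng.randrange(k)
    return ordered_population[start::k]

rng = random.Random(0)                      # fixed seed so reruns match
population = list(range(1, 101))            # hypothetical ID numbers 1..100
groups = {"A": population[:50], "B": population[50:]}  # hypothetical groups

srs = simple_random_sample(population, 10, rng)
strat = stratified_sample(groups, 5, rng)
clus = cluster_sample(groups, 1, rng)
syst = systematic_sample(population, 10, rng)

print("simple random:", sorted(srs))
print("stratified:   ", {name: sorted(ids) for name, ids in strat.items()})
print("cluster size: ", len(clus))
print("systematic:   ", syst)
```

Note the contrast between the stratified and cluster functions: the first samples a few items from every group, while the second keeps every item from a few groups.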
  ○ Voluntary Response Sampling: highly unreliable; relies only on volunteers to take part in the sample → usually people who have strong opinions
● Parameter: A number that describes the population (in simple random sampling)
  ○ μ (mu): Population Mean
  ○ σ (sigma): Population Standard Deviation
  ○ p: Population Proportion
● Statistic: A number that describes the sample (in simple random sampling)
  ○ X-Bar: Sample Mean
  ○ s: Sample Standard Deviation
  ○ P-Hat: Sample Proportion
● Variables: Characteristics of individuals/subjects about which we collect information
  ○ Data Set: The collection of information (values of variables) / Data: the values of the variables we obtain
  ○ Qualitative/Categorical Variables: Classify individuals into categories (e.g.: gender, color of a car)
    ■ Ordinal: There is a natural ordering (e.g.: letter grade, size of drinks)
    ■ Nominal: There is NO order (e.g.: gender, state of residence)
  ○ Quantitative Variables: Tell us how much or how many of something there is (e.g.: age, mileage of a car); always numerical
    ■ Discrete: Variables whose possible values can be counted/listed, usually whole numbers (e.g.: age in whole years, number of siblings)
    ■ Continuous: Variables that are measured; possible values are not restricted to any list and can be any value in an interval, including decimals (e.g.: height, distance someone commutes to work)
  ○ Explanatory Variable: A variable that attempts to explain, or is purported to cause (at least partially), differences in the response variable (the independent variable); used in experiments
  ○ Outcome/Response Variable: What is measured on each individual after a treatment is applied; used in experiments
  ○ Treatment: The procedures applied to each experimental unit; there are always 2 or more, in order to determine whether the choice of treatment affects the outcome; the treatments are different levels of the explanatory variable
  ○ Confounding Variables/Confounder: A variable related to both the treatment and the outcome
    ■ Can distort the results of an experiment because they, and not the explanatory variable, may be the agent that actually causes the change in the response variable
    ■ Solution: RANDOMIZATION!!

Types of Studies
● Randomized Experiments: A study in which an investigator assigns the treatments to subjects at random
  ○ Can conclude cause/effect!!!
  ○ Rules out confounding variables (randomization balances them across the treatment groups on average)
  ○ If there are large differences in outcomes among treatment groups, we can conclude that the differences are due to differences in treatment
  ○ Double Blind: when neither the investigator nor the subjects know who has been assigned to which treatment
  ○ Randomized Block Experiment: subjects are divided into blocks (subjects in each block are similar with regard to a variable related to the outcome, e.g.: age or gender), and treatments are assigned at random within each block → used to control known sources of variation
● Observational Studies: A study in which the assignment to treatment is NOT made by the investigator
  ○ CanNOT conclude cause/effect
  ○ Results show ASSOCIATION only
  ○ Differences in the response variable might be caused by confounding variables
  ○ Cohort Studies: A group of subjects (the cohort) is studied to determine whether various factors of interest are associated with an outcome (Con: not practical for studying rare diseases)
    ■ Prospective: subjects are followed over time; among the best of observational studies (Pros: quality of data is better & confounder information can potentially be collected, so results are more reliable; Cons: expensive and time-consuming)
    ■ Retrospective: subjects are sampled after the outcome has occurred; investigators look back over time to determine whether certain factors are related to the outcome (Pros: less expensive, quick results; Cons: confounders)
    ■ Cross-Sectional: Measurements are taken at one point in time (Pros: inexpensive, quick results; Cons: confounding from past exposures that contributed to the outcome, and no way to determine whether the cause or the effect came first)
  ○ Case-Control Studies: two samples are drawn (one with the disease: cases, and one without the disease: controls); can be used to study rare diseases
    ■ Investigators look back over time to determine factors of interest that differ between the two groups
    ■ Always retrospective

Bias
● Bias: the degree to which a procedure systematically overestimates or underestimates a population value
  ○ Simple random sampling: UNBIASED
● Sources of Bias:
  ○ Voluntary Response Bias: people with strong opinions are more likely to participate
    ■ Most of the time, the survey is optional (people are invited to participate in order to express their opinion)
    ■ e.g.: the President asks people to email their opinion
  ○ Self-Interest Bias: people who have an interest in the outcome of an experiment have an incentive to use biased methods
    ■ e.g.: Advertisers will not report data showing that their product is inferior; a protein supplement is claimed by its company to yield 15-20% muscle growth in 6 weeks
  ○ Social Acceptability Bias: people are reluctant to admit to behavior that may reflect negatively on them (e.g.: Did you vote last election?); solution: wording
    ■ e.g.: prisoners saying they are innocent when asked
  ○ Leading Question Bias: questions are sometimes worded in a way that suggests a particular response; solution: wording
  ○ Nonresponse Bias: Nonresponders are people who are asked to participate but refuse to do so
    ■ Surveys are sent out & people don’t respond!! This differs from voluntary response bias, where participation is optional!
    ■ e.g.: predicting life expectancy based on death certificates, which doesn’t include the people still living, OR people simply not responding to a survey
      ● e.g.: questionnaires are sent out to all constituents, but only some respond
  ○ Sampling Bias: when some members of the population are more likely to be included in the sample than others (e.g.: convenience sampling); almost impossible to avoid completely
● *** A sample has to be representative of the population; a large sample size does not make up for a biased sampling method ***

Chapter 2: Graphical Summaries of Data

Frequency Distributions for Qualitative Data
● Frequency: the number of times a category occurs in the data set
● Frequency Distribution: A table that presents the frequency for each category
  ○ Categories in one column, frequencies in another column
● Relative Frequency: The frequency of the category divided by the sum of all the frequencies; the proportion of items in the category
● Relative Frequency Distribution: A table that presents the relative frequency of each category; often the frequency is presented as well
  ○ Categories in one column, frequencies in another column, relative frequencies in a third column
  ○ All relative frequencies should ADD UP TO 1
● Bar Graphs/Charts: A graphical representation of a frequency distribution
  ○ A bar for each category
  ○ Heights represent the frequencies or relative frequencies of the categories
  ○ Category names along the horizontal axis, evenly spaced
  ○ Vertical axis: the frequency or relative frequency
  ○ Bars do not touch & should all be the same width
  ○ CAN MOVE BARS: the meaning won’t change
● Pareto Charts: Bar graphs ordered by size, with the largest frequency/relative frequency on the left and the smallest on the right
  ○ Useful when it’s important to see the most frequently occurring categories
● Horizontal Bars: a bar graph with the axes switched, so the bars are horizontal
  ○ Helpful when categories have long names
● Side-by-Side Bar Graphs: two bar graphs drawn on the same axes, putting bars that correspond to the same category next to each other
  ○ Helpful to compare two bar graphs that have the same categories, e.g.: the difference in frequencies between 2 different years
  ○ Bars right next to each other on the same axis, with different colored bars for the different years or other variable
● Pie Charts: Display relative frequencies in a circle
  ○ Divided into sectors for each category; sector sizes match relative frequencies
  ○ Label each sector with its relative frequency as a percentage

Frequency Distributions for Quantitative Data
● Classes: we divide data into classes in order to construct a frequency distribution; classes are intervals of equal width that cover all the values that are observed
  ○ e.g.: 0.00-0.99, 1.00-1.99, 2.00-2.99
  ○ Lower Class Limit: the smallest value that can appear in that class
  ○ Upper Class Limit: the largest value that can appear in that class
  ○ Class Width: the difference between the lower limit of a class & the lower limit of the next class
  ○ Every observation must fall into a class, classes cannot overlap, classes must be of equal width, and there are no gaps between classes even if there are no observations in a class
    ■ Computing the class width for a given number of classes: (largest data value - smallest data value) divided by the number of classes, rounded up to a convenient value
● Frequency Distribution: a table that presents the frequency for each class
  ○ Class in one column, frequency in the other column
● Relative Frequency: the frequency of the class divided by the sum of all frequencies
● Relative Frequency Distribution: a table that presents the relative frequency of each class
  ○ Class in one column, frequency in one column, & relative frequency in another column
  ○ All relative frequencies should ADD UP TO 1
● Histograms: A graphical representation of a frequency or relative frequency distribution
  ○ On the horizontal axis, each bar is drawn over its class: the left edge should be at the lower limit of the class & the right edge at the lower limit of the next class
  ○ Vertical axis: frequency or relative frequency
  ○ Bar widths are equal to class widths & bars must touch!!
  ○ The number of classes should be at least 5 but no more than 20
  ○ Cannot move bars
  ○ Open-Ended Classes: when the first class has no lower limit or the last class has no upper limit (e.g.: 85 and older)
  ○ Histograms for Discrete Data: e.g.: drawing a histogram for the variable "number of children" (discrete data)
    ■ Bars have equal width and are centered at the values of the variable; bars still touch
  ○ Shapes: histograms show the shape of a data set
    ■ Symmetric: when its right half is a mirror image of its left half
    ■ Skewed Right: more frequency on the left side (tall left, low right)
      ● Right tail is longer than left tail
    ■ Skewed Left: more frequency on the right side (tall right, low left)
      ● Left tail is longer than right tail
  ○ Modes: Peaks!
    ■ Unimodal: has only one mode or peak
    ■ Bimodal: has 2 modes or peaks

More Graphs for Quantitative Data
● Stem-and-Leaf Plots: a vertical list of all the stems in increasing order, a vertical line in between, and the leaves (the rightmost digit) listed on the right in increasing order
  ○ If the leaves would have 2 digits, you have to round the data to make the leaves one digit
  ○ Shows more information about extreme values
  ○ Split Stem-and-Leaf Plot: when one or two stems contain most of the leaves, we use 2 or more lines for each stem
    ■ Each stem must be given the same number of lines
  ○ Back-to-Back Stem-and-Leaf Plot: used when 2 data sets have values similar enough that the same stems can be used, so that we can compare their shapes
    ■ Stems go down the middle; leaves from one set go off to the right (smallest leaf closest to the stem) and leaves from the other set go off to the left
    ■ Title each data set at the top
● Dotplot: a graph that can be used to give a rough impression of the shape of a data set
  ○ Useful when the data set is not too large and there are some repeated values
  ○ A vertical column of dots is drawn for each value; the number of dots in each column is equal to the number of times the value appears in the data set
  ○ Indicates where values are concentrated and where the gaps are
● Time Series Plot: used when the data consist of values of a variable measured at different points in time
  ○ Horizontal axis represents time
  ○ Vertical axis represents the value of the variable being measured
  ○ Plot the values of the variable at each of the times & connect the points with straight lines
● Box and Whisker Plots
  ○ Good for finding outliers

Misleading Graphs:
● Positioning the vertical scale
  ○ The baseline has to be at zero so that the differences between the bars, or between points in a time-series plot, are not exaggerated
● The Area Principle: When amounts are compared by constructing an image for each amount, the AREAS of the images must be proportional to the amounts
  ○ e.g.: if one amount is twice as much as another, its image should have twice as much area as the other image
● Three-Dimensional Graphs & Perspective: 3-dimensional bar graphs are often drawn as if the reader is looking down; if you can see the tops of the bars, they may look shorter than they really are

Chapter 3: Numerical Summaries of Data

Describing Categorical Variables
● Do NOT write out all the categories and their percentages
● If there are only 2 categories, mention only one category proportion (e.g.: a little more than ½ the class are female)
● If there are multiple categories, compare them using proportions, combining categories where that makes sense
● Mention patterns of similar bar heights/slice sizes
● In general, mention the tallest bar/biggest slice & the shortest bar/smallest slice in terms of proportion
● Just need 1-2 sentences!!!
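The proportions quoted in such a description come from a relative frequency distribution. A minimal sketch of building one follows, assuming a small made-up list of survey responses (the data and variable names are hypothetical, not from the course):

```python
from collections import Counter

# Hypothetical categorical data: one response per student.
responses = ["female", "male", "female", "female",
             "male", "female", "male", "female"]

counts = Counter(responses)          # frequency of each category
total = sum(counts.values())         # sum of all the frequencies
rel_freq = {category: count / total for category, count in counts.items()}

# Print the table, largest frequency first (the same order a Pareto chart uses).
print(f"{'Category':<10}{'Freq':>6}{'Rel. freq':>11}")
for category, count in counts.most_common():
    print(f"{category:<10}{count:>6}{rel_freq[category]:>11.3f}")
```

With only two categories, one proportion is enough for the write-up: here 5 of 8 responses are "female", which matches the "a little more than half" style of sentence suggested above.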
Describing Quantitative Variables:
● Shape: Symmetric, Bell-Shaped
  ○ Center: Mean, the typical value (the mean is sensitive to outliers: not resistant)
  ○ Spread: Standard Deviation (not resistant), the typical difference between an observation and the mean
● Shape: Skewed, Non-symmetric
  ○ Center: Median, the typical value (the median is NOT sensitive to outliers: resistant)
  ○ Spread: Range & IQR. Min: the smallest value is ___; Max: the largest value is ___
    ■ IQR: the range of the middle 50% of values, between Q1 and Q3
● BOTH:
  ○ Outliers (find from the boxplot): unusually large or unusually small values
    ■ Mild or Extreme (extreme: exceptionally large or small)
    ■ What are their values?
    ■ Mention if there are no outliers
  ○ Gaps: Where are the gaps?
    ■ No observations are recorded in the interval/s between ___
    ■ Where the gaps occur in the histogram
  ○ Peaks: How many?
    ■ Unimodal or Bimodal?
    ■ Where are the peaks? Give the interval where each peak occurs in the histogram
● Mean & Median Relationship
  ○ Skewed Right: Mean is noticeably greater than the median
  ○ Symmetric: Mean is approximately equal to the median
  ○ Skewed Left: Mean is noticeably less than the median
● Empirical Rule: for bell-shaped data; uses the standard deviation to provide an approximate description of the data
  ○ Approximately 68% of the data will be within one standard deviation of the mean (between $\bar{x} - s$ and $\bar{x} + s$)
  ○ Approximately 95% of the data will be within 2 standard deviations of the mean (between $\bar{x} - 2s$ and $\bar{x} + 2s$)
  ○ Approximately 99.7%, or almost all, of the data will be within 3 standard deviations of the mean (between $\bar{x} - 3s$ and $\bar{x} + 3s$)
    ■ Can use it to describe a data set
    ■ Only appropriate when the data is bell-shaped
● Chebyshev’s Inequality Rule: used when the distribution is unknown or not bell-shaped
  ○ In any data set, the proportion of the data that will be within K standard deviations of the mean is at least $1 - 1/K^2$
  ○ Set K=2 or K=3: at least ¾ (75%) of the data will be within 2 standard deviations of the mean & at least 8/9 (88.9%) of the data will be within 3 standard deviations of the mean
  ○ $\bar{x} - 2s$ ; $\bar{x} + 2s$ → at least 75% of the values are between the results of those two calculations
  ○ $\bar{x} - 3s$ ; $\bar{x} + 3s$ → at least 88.9% of the values are between the results of those two calculations
  ○ K has to be greater than 1!!
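As a rough check of the two rules, the sketch below computes the intervals "mean ± K standard deviations" for a small made-up data set and compares the observed proportion inside each interval with the Chebyshev lower bound. The data values and function names are hypothetical; `statistics.stdev` divides by n - 1 (the sample version, like σxn-1 on the calculator) and `statistics.pstdev` divides by n (the population version).

```python
import statistics

# Hypothetical sample of heights in inches.
data = [62, 65, 66, 67, 68, 68, 69, 70, 70, 71, 72, 73, 74, 75, 78]

mean = statistics.mean(data)
s = statistics.stdev(data)        # sample standard deviation (divides by n - 1)
sigma = statistics.pstdev(data)   # population standard deviation (divides by n)

def interval(k):
    """Endpoints of mean - k*s and mean + k*s."""
    return mean - k * s, mean + k * s

def proportion_within(k):
    """Observed proportion of the data inside the k-standard-deviation interval."""
    low, high = interval(k)
    return sum(low <= x <= high for x in data) / len(data)

def chebyshev_bound(k):
    """Minimum proportion guaranteed by Chebyshev's inequality; needs k > 1."""
    if k <= 1:
        raise ValueError("K has to be greater than 1")
    return 1 - 1 / k**2

# Approximate percentages from the Empirical Rule (bell-shaped data only).
empirical = {1: 0.68, 2: 0.95, 3: 0.997}

print(f"mean = {mean:.4f}, s = {s:.4f}, sigma = {sigma:.4f}")
for k in (1, 2, 3):
    low, high = interval(k)
    line = f"K={k}: ({low:.2f}, {high:.2f}) observed {proportion_within(k):.3f}, empirical ~{empirical[k]}"
    if k > 1:
        line += f", Chebyshev at least {chebyshev_bound(k):.3f}"
    print(line)
```

The Chebyshev bound holds for any data set, so the observed proportion can never fall below it; the Empirical Rule figures are only approximations and apply only when the histogram is bell-shaped.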
● Coefficient of Variation: tells how large the standard deviation is relative to the mean
  ○ Used to compare the spreads of data sets whose values have different units
  ○ $CV = \sigma / \mu$ (for a sample, $s / \bar{x}$)
  ○ The data set with the larger CV has the greater spread relative to its mean

Measures of Position:
● Z-Score: tells how many standard deviations a value is from its population mean
  ○ $z = \frac{x - \mu}{\sigma}$
  ○ Interpretation: The value of the variable is ___ standard deviations above (or below) the mean of the variable
  ○ e.g.: the height of the 73 inch man is 1.16 standard deviations above the mean height for men
  ○ With the empirical rule:
    ■ Approximately 68% of the data will have z-scores between -1 and 1
    ■ Approximately 95% of the data will have z-scores between -2 and 2
    ■ Approximately 99.7%, or almost all, of the data will have z-scores between -3 and 3
● Quartiles:
  ○ First Quartile: Separates the lowest 25% of the data from the highest 75%
  ○ Second Quartile or Median: Separates the lower 50% of the data from the upper 50%
  ○ Third Quartile: Separates the lowest 75% of the data from the highest 25%
    ■ Computing: n = total number of values in the data set; the results below give the position of the quartile in the sorted data, not its value
    ■ Position of Q1 = n(0.25)
    ■ Position of Q3 = n(0.75)
  ○ IQR: used to detect outliers
    ■ IQR = Q3 - Q1
    ■ Lower Outlier Boundary: Q1 - 1.5 IQR
    ■ Upper Outlier Boundary: Q3 + 1.5 IQR
    ■ Any value that is less than the lower outlier boundary or greater than the upper outlier boundary is an outlier
● Mode: the most frequent value; there may be more than one
● Boxplot: a graph that presents the 5 number summary
  ○ Whisker to the minimum, Q1 (first line of the box), median (second line of the box), Q3 (third line of the box), whisker to the maximum
  ○ Mark any outliers: o for mild outliers; * for extreme outliers
  ○ If the median is closer to Q1, or the maximum whisker is longer than the minimum whisker: data is skewed right
  ○ If the median is closer to Q3, or the minimum whisker is longer than the maximum whisker: data is skewed left
  ○ If the median is approximately halfway between Q1 and Q3 & the 2 whiskers are about equal in length: data is symmetric
  ○ Comparative Boxplots: plotting 2 or more boxplots above one another to show a comparison
● Percentiles: Divide a data set into hundredths
  ○ The pth percentile separates the lowest p% of the data from the highest (100-p)%
● Five Number Summary: Minimum, Q1, Median, Q3, Maximum
● Outlier: a value that is considerably larger or smaller than most of the values in the data set

Chapter 4: Probability
Basic Concepts of Probability:
● Probability: the proportion of times the event occurs in the long run, as a probability experiment is repeated over and over again
  ○ Law of Large Numbers: As the number of trials increases, the proportion of occurrences of any given outcome approaches a particular number (its probability)
    ■ e.g.: flipping a coin 200 times, the proportion of heads tends to get closer and closer to its true probability as the flips go on
  ○ A probability can never be negative and never be greater than 1
    ■ If A cannot occur then P(A) = 0
    ■ If A is certain to occur then P(A) = 1
  ○ Probability Model: a sample space along with a probability, P(A), for each event
● Random Phenomenon: Any activity or situation in which there is un...