Q SCI 381 Notes - Instructor: Patrick Tobin
University of Washington

Q SCI 381 A: Notes (Final)

LECTURE 1: SAMPLING AND DATA

Types of data, with examples:
- Observations – species of bird that visit your bird feeder
- Counts – number of cars that drive across campus on a given day
- Measurements – your blood pressure, pulse, temperature, etc.
- Responses – which candidate do you support for some political office?

Statistics – science of collecting, organizing, analyzing, and interpreting data for some purpose

Purposes include:
- Making a decision – listing or delisting an endangered species; determining where to put a new Starbucks
- Determining cause and effect – exposure to a chemical hazard results in an undesired effect

- Descriptive statistics – branch of statistics that involves the organization, summary, and display of data
- Inferential statistics – branch of statistics that involves using data to draw conclusions
- Population – collection of all observations, measurements, or counts of interest
- Parameter – numerical description of a population characteristic
- Sample – subset of the population
- Statistic – numerical description of a sample characteristic

Types of qualitative data:

- Ordinal – scale for ordering or ranking observations (ex. some survey responses: strongly disagree to strongly agree)
- Nominal – observations grouped into unique categories without meaningful ranks (ex. eye color, hair color)
- Dichotomous – a specialized nominal variable with only two categories (ex. yes or no, present or absent, pass/fail)

Types of quantitative data:

- Numerical data – very common data type
  o Data from a counting process – purely integers
  o Fractional/percentage data – percentage of popular vote in an election
  o Continuous data – weight of newborns over a year, length of flower petals in a plant species

Experimental design  


- Observational studies – involve the collection of data without changing any conditions (ex. collection of vital signs from patients at a walk-in clinic)
- Experimental studies – involve the collection of data from two or more groups that are exposed to different conditions (ex. collecting vital signs from patients given a drug and from patients who did not get a drug)
  o Control group – group that receives no treatment or a neutral treatment
  o Relative effect of a treatment can be expressed as (1 – treatment group results / control group results) * 100

Census – count or measure the entire population (rarely used in studies)
Sample – want to randomize (to avoid bias) and replicate (to ensure data are meaningful)



Sampling methods (a short sketch in code follows this list):
- Systematic sampling – pick a random number, then sample every nth item
- Simple random sampling – randomly select on a case-by-case basis (make a rule like flipping a coin: heads to select, tails to not select)
- Cluster sampling – population is divided into "groups"
  o Groups are often randomly selected, then individuals are randomly selected within each selected group
  o Ex. divide by zip code, then select randomly using simple random, systematic, or stratified sampling
- Stratified sampling – individuals are randomly selected within groups or "strata" that have similar characteristics believed to be important to the question
  o Some variable is known or thought to be important
  o Want to collect data across the range of this variable
  o Ex. if sampling with consideration to age, want to randomly sample people from different age groups
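A minimal Python sketch (not from the notes; the population, sample sizes, clusters, and strata are invented for illustration) of how each sampling scheme could be simulated with the standard library:

```python
# Hedged sketch: simulating the four sampling schemes on a made-up population.
import random

population = list(range(1000))          # hypothetical population of individual IDs
n = 20                                  # desired sample size

# Simple random sampling: every individual equally likely to be chosen
srs = random.sample(population, n)

# Systematic sampling: random start, then every k-th individual
k = len(population) // n
start = random.randrange(k)
systematic = population[start::k]

# Cluster sampling: split into clusters (e.g., "zip codes"), randomly pick clusters,
# then randomly sample individuals within each chosen cluster
clusters = [population[i:i + 100] for i in range(0, 1000, 100)]
chosen_clusters = random.sample(clusters, 2)
cluster_sample = [ind for c in chosen_clusters for ind in random.sample(c, n // 2)]

# Stratified sampling: sample within predefined strata (e.g., age groups)
strata = {"young": population[:300], "middle": population[300:700], "old": population[700:]}
stratified = [ind for group in strata.values() for ind in random.sample(group, n // len(strata))]

print(len(srs), len(systematic), len(cluster_sample), len(stratified))
```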

LECTURE 2: DESCRIPTIVE STATISTICS

Histogram – graphical representation of the distribution of a set of data (uses numbers on the x-axis to denote groups)
Frequency histogram – represents the frequency distribution of a set of data
- Can include qualitative data as the categories (x-axis)

Range = max – min
Midpoint = (max + min) / 2
Relative frequency = f_i / ∑f
Cumulative proportion – sum of relative frequencies from left to right (can be converted into percentiles)
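A small Python sketch (invented data) of the quantities above: relative frequency, cumulative proportion, range, and midpoint:

```python
# Hedged sketch: building a simple frequency table with relative frequencies
# and cumulative proportions for a made-up dataset.
from collections import Counter

data = [2, 3, 3, 4, 4, 4, 5, 5, 6, 7]         # hypothetical measurements
counts = Counter(data)                         # frequency of each value
total = sum(counts.values())

cumulative = 0.0
for value in sorted(counts):
    rel_freq = counts[value] / total           # relative frequency = f_i / sum(f)
    cumulative += rel_freq                     # cumulative proportion, left to right
    print(value, counts[value], round(rel_freq, 2), round(cumulative, 2))

print("range:", max(data) - min(data))         # range = max - min
print("midpoint:", (max(data) + min(data)) / 2)
```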

Interquartile Range (IQR) = Q3 – Q1 – the "middle" 50% of the data are contained within the IQR
Stem-and-leaf plot – similar to a histogram but the original data are retained

Measures of central tendency    

- Mean – sum of values / sum of frequencies
- Median – middle value, 50th percentile, Q2
- Mode – most frequent value
- In a normal distribution, mean ≈ median ≈ mode

Variance – how do all the data in the dataset collectively vary from the mean? How far are the data spread out?





Steps to find variance (sketched in code below):
  o Find the mean of the dataset
  o Find each deviation from the mean (x – mean)
  o Square each deviation
  o Sum all the squared values
  o Divide the sum of squares by the degrees of freedom (n – 1)
Standard deviation (SD) = square root of variance
  o Within 1 SD: 68.3% of the data
  o Within 2 SD: 95.4% of the data
  o Within 3 SD: 99.7% of the data
The lower the variance and standard deviation, the smaller the spread of the data and the closer the data tend to be to the mean.
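A short Python sketch (invented data) following the variance steps above, then taking the square root for the standard deviation:

```python
# Hedged sketch: sample variance and standard deviation with n - 1 degrees of freedom.
import math

data = [4.2, 5.1, 3.8, 6.0, 5.5, 4.9]           # hypothetical measurements
n = len(data)

mean = sum(data) / n                             # step 1: mean of the dataset
deviations = [x - mean for x in data]            # step 2: deviations from the mean
squared = [d ** 2 for d in deviations]           # step 3: square each deviation
ss = sum(squared)                                # step 4: sum of squares
variance = ss / (n - 1)                          # step 5: divide by degrees of freedom
sd = math.sqrt(variance)                         # standard deviation

print(round(variance, 3), round(sd, 3))
```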

Outliers – observations that greatly deviate from all others  

- Can use the IQR to detect outliers
- Potential outliers are values < Q1 – 1.5*IQR or values > Q3 + 1.5*IQR
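A quick Python sketch (invented data; statistics.quantiles with n=4 is one common convention for computing quartiles) of the 1.5*IQR outlier rule:

```python
# Hedged sketch: flagging potential outliers with the 1.5*IQR rule.
import statistics

data = [12, 14, 15, 15, 16, 17, 18, 19, 20, 45]       # 45 is a suspicious value
q1, q2, q3 = statistics.quantiles(data, n=4)          # Q1, median, Q3
iqr = q3 - q1

low_fence = q1 - 1.5 * iqr
high_fence = q3 + 1.5 * iqr
outliers = [x for x in data if x < low_fence or x > high_fence]

print("IQR:", iqr, "fences:", low_fence, high_fence, "outliers:", outliers)
```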

Coefficient of variation – can be used to compare the variability in one dataset to another dataset  

- CV = (standard deviation / mean) * 100
- The higher the CV, the more variable the dataset
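A tiny Python sketch (invented values) comparing two datasets with the coefficient of variation:

```python
# Hedged sketch: CV = (standard deviation / mean) * 100 for two made-up datasets.
import statistics

petal_lengths_cm = [4.9, 5.1, 5.3, 5.0, 5.2]
tree_heights_m = [12.0, 18.5, 9.0, 22.0, 15.5]

def cv(data):
    # coefficient of variation, in percent
    return statistics.stdev(data) / statistics.mean(data) * 100

print(round(cv(petal_lengths_cm), 1), round(cv(tree_heights_m), 1))
```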

LECTURE 3: PROBABILITY

Theoretical probability – assumes that outcomes are equally likely to occur
Statistical probability – based on actual observations, such as from experiments
P{A} = # of outcomes of event A / total number of possible outcomes


Law of large numbers – as you increase your sample size, the statistical probability will approximate the theoretical probability
Independent events – when the occurrence of one event has no effect on the occurrence of the second event

- Ex. flipping a coin over and over
- Can use the multiplication rule to determine the probability of independent event A occurring followed by independent event B (sequence of events): P{A and B} = P{A} * P{B}
  o Ex. what's the probability of flipping heads on the first flip AND heads on the second flip?
  o Can be used with any independent events (flipping heads and rolling a 6)

Dependent events – when the occurrence of one event influences the occurrence of the second event 

- Have to consider each individual event's probability
  o Ex. P{green marble then blue marble} = P{green marble} * P{blue marble | green marble}

Multiplication rule – used when determining the probability of events occurring in sequence
Additive rule – used to determine the probability of at least one of two events occurring



- Mutually exclusive – cannot occur at the same time
  o Add event probabilities together: P{event A or event B} = P{A} + P{B}
  o Ex. scoring 0 goals and scoring 1 goal in the same game cannot both happen, therefore the events are mutually exclusive
- Not mutually exclusive – can occur at the same time
  o P{event A or event B} = P{A} + P{B} – P{A and B}
  o Ex. P{brown hair or brown eyes} = P{brown hair} + P{brown eyes} – P{brown hair and brown eyes}
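A short Python sketch (all probabilities invented) applying the multiplication rule for independent and dependent events and the additive rule for non-mutually-exclusive events:

```python
# Hedged sketch: multiplication and additive rules on simple made-up events.

# Independent events: P(heads then 6) = P(heads) * P(6)
p_heads = 1 / 2
p_six = 1 / 6
p_heads_then_six = p_heads * p_six

# Dependent events: drawing marbles without replacement (3 green, 2 blue in a bag)
p_green = 3 / 5
p_blue_given_green = 2 / 4                       # one green already removed
p_green_then_blue = p_green * p_blue_given_green

# Additive rule, not mutually exclusive: P(A or B) = P(A) + P(B) - P(A and B)
p_brown_hair, p_brown_eyes, p_both = 0.6, 0.5, 0.4
p_brown_hair_or_eyes = p_brown_hair + p_brown_eyes - p_both

print(p_heads_then_six, p_green_then_blue, p_brown_hair_or_eyes)
```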

Sample space – all possible outcomes
Fundamental counting principle – useful in calculating the size of a sample space: if event A can occur in X ways and event B can occur in Y ways, then the number of ways A and B can occur in sequence is X*Y; applies whether an item can be used more than once or only once (order matters)

- Can use an item more than once: n^n
- Can use an item only once: n!

Combinations – order doesn't matter; must consider distinguishable sets (Blinky and Inky ≠ Blinky and Clyde, but Blinky and Inky = Inky and Blinky)

- C(n, r) = n! / ((n – r)! r!) (sketched below)
  o n: total number of possible choices
  o r: number needed to make the combination
- How do you know when order matters? – look at the wording: e.g., you can only use a player once, or you are finding the # of possible subsets of a group of people
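A small Python sketch (numbers invented) checking the combination formula against math.comb:

```python
# Hedged sketch: C(n, r) = n! / ((n - r)! * r!), compared with the built-in count.
import math

n, r = 11, 5                                     # e.g., choosing 5 players from 11

by_formula = math.factorial(n) // (math.factorial(n - r) * math.factorial(r))
by_builtin = math.comb(n, r)                     # same count, built in (Python 3.8+)

print(by_formula, by_builtin)                    # 462 462
```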

LECTURE 4: DISCRETE RANDOM VARIABLES

Discrete random variable – discrete or finite number of possible outcomes; expressed as an integer
Discrete probability distribution – list of all possible outcomes that a random variable can assume with the corresponding probability of each outcome (0 ≤ P ≤ 1, ∑P = 1)

…

If test F > critical F, reject Ho that all means are the same
Test F = variation among means / variation within samples
  o (sum of squared differences among means / numerator df) / (sum of squared differences within samples / denominator df)

 

- Numerator df = # of populations – 1
- Denominator df = total # of samples – # of populations
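A hedged Python sketch (groups invented) of the F statistic described above, computed by hand:

```python
# Hedged sketch: one-way ANOVA F statistic,
# F = (among-group SS / numerator df) / (within-group SS / denominator df).
groups = [
    [23, 25, 27, 22],      # hypothetical sample from population 1
    [30, 31, 29, 33],      # population 2
    [26, 24, 28, 27],      # population 3
]

all_values = [x for g in groups for x in g]
grand_mean = sum(all_values) / len(all_values)
group_means = [sum(g) / len(g) for g in groups]

# Among-group variation: squared differences of group means from the grand mean
ss_among = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, group_means))
# Within-group variation: squared differences of observations from their own group mean
ss_within = sum((x - m) ** 2 for g, m in zip(groups, group_means) for x in g)

df_num = len(groups) - 1                         # numerator df = # populations - 1
df_den = len(all_values) - len(groups)           # denominator df = total n - # populations

f_stat = (ss_among / df_num) / (ss_within / df_den)
print(round(f_stat, 2))                          # compare to the critical F for (df_num, df_den)
```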

LECTURE 9: LINEAR REGRESSION AND CORRELATION

Correlation – measures how two variables co-vary
- Ex. if one variable increases, what does the other variable do?
Regression – uses the value of an independent variable (predictor variable) to predict the value of a dependent variable (response variable)
- Ex. the number of people in a family could be used to predict the family's monthly food bill

Correlation analysis measures:

- The strength of the correlation (strong to weak)
- The direction of the correlation (positive to negative)
- Whether the relationship is significant (reject or fail to reject the null)

Ho: variables x and y are not correlated (rho = 0)
Ha: variables x and y are correlated (rho ≠ 0)
Correlation coefficient (rho) – measures the strength and direction of the correlation
rho = ∑(x – mean x)(y – mean y) / sqrt(∑(x – mean x)^2 * ∑(y – mean y)^2)

- Denominator – adjusts the "scales" of the data in both x and y to put each in more equal units (always returns a positive value)
- Numerator – "centers" the raw values of x and y by subtracting their corresponding means

Rho ranges from -1 (perfect negative correlation) to +1 (perfect positive correlation)   

- df = (n_x + n_y)/2 – 2, i.e., the number of matched pairs – 2
- NOTE: you usually don't need to find df because the critical-value table that uses "n" has already adjusted for df (a short correlation sketch follows below)
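A short Python sketch (paired data invented) computing rho from the formula above and the df for the significance test:

```python
# Hedged sketch: correlation coefficient between paired x and y values.
import math

x = [2, 4, 5, 7, 9]                    # hypothetical paired measurements
y = [10, 14, 15, 19, 24]

mean_x = sum(x) / len(x)
mean_y = sum(y) / len(y)

num = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))        # "centers" x and y
den = math.sqrt(sum((xi - mean_x) ** 2 for xi in x) *
                sum((yi - mean_y) ** 2 for yi in y))                    # rescales both variables

rho = num / den
df = len(x) - 2                        # number of matched pairs - 2
print(round(rho, 3), df)               # compare |rho| to the table's critical value
```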

Critical values are listed as absolute values on the table; there are always two critical values (negative and positive) because this test is inherently two-tailed.

Simple linear regression – one predictor (independent) variable and one response (dependent) variable
Multiple linear regression – two or more predictor (independent) variables and one response (dependent) variable
Residuals – distance between observed values of y and predicted values of y
Sum of squares of the residuals – the residuals squared and added up
In regression, use the line of best fit – the line where the sum of the squares of the residuals is minimized: y = mx + b + e (m: estimate of the slope, b: y-intercept, e: error)
Coefficient of determination (r^2) – proportion of the variation explained by the regression model; ranges from 0 to 1, where 1 suggests the model explains 100% of the variation (no error term)



- Ex. if r^2 = 0.62, then 62% of the variation in petal length (response variable) is explained by petal width (predictor variable); a small regression sketch follows below
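A small Python sketch (data invented) fitting the least-squares line of best fit and computing r^2 as the proportion of variation explained:

```python
# Hedged sketch: simple linear regression by least squares, with r^2 = 1 - SS_res / SS_tot.
petal_width = [0.2, 0.4, 0.6, 1.0, 1.3, 1.5]     # hypothetical predictor (x)
petal_length = [1.4, 1.7, 2.5, 3.5, 4.1, 4.5]    # hypothetical response (y)
n = len(petal_width)

mean_x = sum(petal_width) / n
mean_y = sum(petal_length) / n

slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(petal_width, petal_length))
         / sum((x - mean_x) ** 2 for x in petal_width))
intercept = mean_y - slope * mean_x               # line of best fit: y = slope*x + intercept

predicted = [slope * x + intercept for x in petal_width]
ss_res = sum((y - yhat) ** 2 for y, yhat in zip(petal_length, predicted))   # residual SS (minimized)
ss_tot = sum((y - mean_y) ** 2 for y in petal_length)                       # total SS
r_squared = 1 - ss_res / ss_tot

print(round(slope, 2), round(intercept, 2), round(r_squared, 3))
```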

Coefficient of determination r^2 (% of the variation in one variable that is explained by another variable) vs. correlation coefficient rho (degree to which two variables covary):
- The correlation coefficient is the square root of r^2, and the coefficient of determination is rho squared
Test of the slope: Ho: slope estimate = 0, Ha: slope estimate ≠ 0

- Whether the critical value is positive or negative depends on the slope
- If the slope estimate's P value in the output table is very close to 0, the estimate is probably real (reject Ho that the slope is 0)

Finding improvement with multiple predictor variables: check whether r^2 shows significant improvement when another predictor variable is added or not

LECTURE 10: CHI-SQUARED DISTRIBUTION

Used to test whether an observed frequency distribution fits an expected distribution
Chi squared is only two-tailed when finding a confidence interval; the chi square test itself is inherently right-tailed
Ho: observed frequencies are the same as the expected frequencies
Ha: observed frequencies are not the same as the expected frequencies
Chi square test statistic = ∑((observed – expected)^2 / expected)
Chi square critical value is found using df = k – 1 (k is the number of categories)
If the test statistic > critical value, then reject Ho
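A quick Python sketch (frequencies invented) of the chi-square goodness-of-fit statistic and its df:

```python
# Hedged sketch: chi-square goodness-of-fit, sum((observed - expected)^2 / expected).
observed = [18, 22, 30, 30]                      # hypothetical observed counts
expected = [25, 25, 25, 25]                      # expected counts under Ho

chi_sq = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
df = len(observed) - 1                           # k - 1 categories

print(round(chi_sq, 2), df)                      # compare to the right-tail critical value
```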

Test of independence – used to test the association between the frequencies of two or more variables
Contingency table – r x c, frequencies are arranged in r rows and c columns; steps (sketched in code after the list):

- Find sums for columns
- Find sums for rows
- Expected frequency for each cell = (sum of row r)(sum of column c) / sample size
- Calculate the chi square contribution for each cell: (observed – expected)^2 / expected
- Calculate the sum of these contributions (the test statistic)
- Find the critical value by calculating df = (r – 1)(c – 1)...
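A short Python sketch (counts invented) carrying out these test-of-independence steps on a 2 x 3 contingency table:

```python
# Hedged sketch: chi-square test of independence on a made-up contingency table.
table = [
    [20, 30, 25],      # row 1 observed frequencies
    [30, 20, 25],      # row 2 observed frequencies
]

row_sums = [sum(row) for row in table]
col_sums = [sum(col) for col in zip(*table)]
n = sum(row_sums)

chi_sq = 0.0
for i, row in enumerate(table):
    for j, observed in enumerate(row):
        expected = row_sums[i] * col_sums[j] / n           # (row sum)(column sum) / sample size
        chi_sq += (observed - expected) ** 2 / expected    # cell contribution

df = (len(table) - 1) * (len(col_sums) - 1)                # (r - 1)(c - 1)
print(round(chi_sq, 2), df)                                # compare to the critical chi-square value
```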

