Lecture notes, Lecture 16 - Unit 10b - Final Exam Review STAT E100 PDF

Course Introduction to Statistical Methods
Institution Harvard University

Summary

2013/2014...


Description

Stat E100

Final Exam Review Units 1-10 (with an emphasis on regression)

Final Exam Logistics
• The Final Exam will be held on Tuesday, Dec 17, 5:30-7:30pm in Science Center Hall C.
• It’s closed-book. You are allowed three 8.5x11 pages of notes (front-and-back OK). No laptops or cell phones. Remember to bring a calculator.
• About 80% of the questions will come from Units 6-10 (Inference), with a higher emphasis on Unit 10 (Regression) and the material on HW 10. There are practice problems and a practice exam on the course website, which should guide your studying.
• Be sure to briefly explain your answers and show calculations.
• Don’t forget: if you are doing a project, it’s due on the 17th.
• Regular section times and OHs this week.

Old Topics from Midterm I
• Summarizing Data
– Univariate
– Bivariate
• Normal Distribution
• Study Design
– Experimental Design and Randomization
– Surveys and Sampling
• Probability & Random Variables

Old Topics From Midterm II
• Probability & Random Variables (Unit 5)
– Binomial Random Variables
– Normal Distribution
– Law of Large Numbers and the Central Limit Theorem
• Inference for a Population Mean (Unit 6)
– Confidence Intervals
– Hypothesis Tests
– Power and Sample Size Calculations
• Inference for a Proportion (z-based vs. binomial) (Unit 7)
• Two Sample Inference for Two Groups (Unit 8)
– Means (unpaired vs. paired)
– Two Sample Inference for Proportions
• Chi-Squared Tests (Unit 9)

New Topics Since Midterm II

• Inference for Linear Regression (Unit 10)
– Simple Regression
– Multiple Regression
– Regression with Binary Predictors

General Recap
• So let’s take a breath. Where do we stand?
• We’ve been looking at LOTS of different inferential analysis procedures.
• What’s the main difference between them?
– The type of data we have!
– Is the outcome data quantitative (means), or is it a count/proportion from a yes/no setting (proportions)?
– Do we have 1 sample, 2 samples, or more than 2?
– Or do we have one or more quantitative predictors?
• All of these questions can help identify the situation, and thus lead to the correct analysis technique.

Road Map to Inference in Stat E100
Start Here!
• Quantitative Data
– 1 group, σ known: Inference for μ (z-based)
– 1 group, σ unknown (use s): Inference for μ (t-based)
– 2 groups, independent groups: 2-sample t-procedure
– 2 groups, paired groups: Paired t-procedure
– 2 or more groups: Regression w/ Binary Predictors
– Quantitative predictor(s), one predictor: Simple linear regression
– Quantitative predictor(s), 2+ predictors: Multiple regression
• Binary Data (yes/no)
– 1 group: Inference for p
– 2 groups: Inference for p1 – p2
– 2 or more groups: χ2 test for association
– Quantitative predictor(s): Logistic Regression*
• Categorical Data (2+ categories)
– 2 or more groups: χ2 test for association
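The road map can be read as a small decision procedure. The sketch below is one possible encoding of it in Python; the function name `choose_procedure` and its parameters are hypothetical, invented here for illustration, and the returned strings simply repeat the labels from the road map.

```python
def choose_procedure(outcome, groups=1, quantitative_predictors=0,
                     sigma_known=False, paired=False):
    """Return the road-map label for a given data situation.

    outcome: "quantitative", "binary", or "categorical" (type of response data).
    groups: number of groups compared (ignored if there are quantitative predictors).
    All names here are illustrative assumptions, not part of the course material.
    """
    if outcome == "quantitative":
        if quantitative_predictors == 1:
            return "simple linear regression"
        if quantitative_predictors >= 2:
            return "multiple regression"
        if groups == 1:
            return "inference for mu (z-based)" if sigma_known else "inference for mu (t-based)"
        if groups == 2:
            # Regression with a binary predictor also applies to 2 independent groups.
            return "paired t-procedure" if paired else "2-sample t-procedure"
        return "regression with binary predictors"
    if outcome == "binary":
        if quantitative_predictors >= 1:
            return "logistic regression"
        if groups == 1:
            return "inference for p"
        if groups == 2:
            return "inference for p1 - p2"
        return "chi-squared test for association"
    if outcome == "categorical":
        return "chi-squared test for association"
    raise ValueError("outcome must be 'quantitative', 'binary', or 'categorical'")


# A quantitative outcome measured twice on the same subjects:
print(choose_procedure("quantitative", groups=2, paired=True))
# A yes/no outcome with one quantitative predictor:
print(choose_procedure("binary", quantitative_predictors=1))
```

The order of the checks mirrors the road map: the type of outcome data is settled first, then the number of groups or the presence of quantitative predictors.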

Linear Regression
• Model Statements
$\mu_y = \beta_0 + \beta_1 x_{i1} + \dots + \beta_k x_{ik}$
$y_i = \beta_0 + \beta_1 x_{i1} + \dots + \beta_k x_{ik} + \varepsilon_i$
• Main Inference Concepts
– t-test of indiv. β coefficients: $t = \frac{b_1 - \beta_{H_0}}{se_{b_1}} = \frac{0.234 - 0}{0.094} = 2.505$
– F-test of entire model (H0: β1 = β2 = … = 0)
• Simple Regression Only Topics
– R2 = r2
– Prediction and Confidence intervals at a particular x*
• Multiple Regression Only Topics
– Interpretation
– Step-Down Model Building
• Assumptions (4): εi ~ N(0,σ) & independent (transforming your variables could help fix 3 of these assumptions)
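As a rough illustration of the t-test for a single slope, the following Python sketch fits a simple regression by hand and forms t = (b1 − 0)/se_b1. The five data points are hypothetical, invented only to make the arithmetic easy to follow; they are not from the course.

```python
import math

# Hypothetical data, made up purely for illustration.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 5.0, 4.0, 5.0]

n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n

# Sums of squares and cross-products around the means.
sxx = sum((xi - x_bar) ** 2 for xi in x)
sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))

# Least-squares estimates of the slope and intercept.
b1 = sxy / sxx
b0 = y_bar - b1 * x_bar

# Residual standard deviation s, on n - 2 degrees of freedom.
residuals = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]
sse = sum(e ** 2 for e in residuals)
s = math.sqrt(sse / (n - 2))

# Standard error of the slope and the t statistic for H0: beta1 = 0.
se_b1 = s / math.sqrt(sxx)
t_stat = (b1 - 0) / se_b1

print(f"b1 = {b1:.3f}, se_b1 = {se_b1:.3f}, t = {t_stat:.3f}, df = {n - 2}")
```

The t statistic would then be compared to a t distribution with n − 2 degrees of freedom (n − k − 1 in multiple regression).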

Multiple Linear Regression


Analysis of Variance Tables In Regression
• Remember, the Total Sum of Squares in y can be decomposed as: SST = SSM + SSE
It’s all based on the prediction of the observations and the error:
$\sum_{i=1}^{n} (y_i - \bar{y})^2 = \sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2 + \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$

ANOVA table in regression
Source            SS    DF                MS               F
Model             SSM   DFM = k           MSM = SSM/DFM    MSM/MSE
Error (Residual)  SSE   DFE = n – k – 1   MSE = SSE/DFE
Total             SST   DFT = n – 1       MST = SST/DFT

$R^2 = \frac{SSM}{SST} = 1 - \frac{SSE}{SST}$
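The decomposition and the table entries can be checked numerically. The sketch below builds the ANOVA quantities for a simple regression (k = 1); the function name `anova_simple_regression` and the data are hypothetical, chosen only for illustration.

```python
def anova_simple_regression(x, y):
    """Compute the regression ANOVA quantities for one predictor (k = 1)."""
    n = len(x)
    k = 1
    x_bar = sum(x) / n
    y_bar = sum(y) / n

    # Least-squares fit.
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar
    y_hat = [b0 + b1 * xi for xi in x]

    # Sums of squares: total, model, and error.
    sst = sum((yi - y_bar) ** 2 for yi in y)
    ssm = sum((fi - y_bar) ** 2 for fi in y_hat)
    sse = sum((yi - fi) ** 2 for yi, fi in zip(y, y_hat))

    # Mean squares, F statistic, and R-squared.
    msm = ssm / k
    mse = sse / (n - k - 1)
    return {
        "SSM": ssm, "SSE": sse, "SST": sst,
        "DFM": k, "DFE": n - k - 1, "DFT": n - 1,
        "MSM": msm, "MSE": mse,
        "F": msm / mse, "R2": ssm / sst,
    }


# Hypothetical data, made up purely for illustration.
table = anova_simple_regression([1.0, 2.0, 3.0, 4.0, 5.0],
                                [2.0, 4.0, 5.0, 4.0, 5.0])
for name, value in table.items():
    print(f"{name}: {value:.4g}")
```

For these made-up numbers, SSM + SSE reproduces SST up to floating-point rounding, and in simple regression the F statistic equals the square of the slope's t statistic.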

Regression with Binary Predictors


1. For each of the situations described below, select the inference technique that you believe is the most applicable. If it is a statistical hypothesis test, state the null and alternative hypotheses. (Define all terms specific to the example, rather than just giving a response in general terms such as “μ1 = μ2”.) Do not go into details of the computations required.
a) A biologist wants to determine whether the cavity size of nests is different across 9 different species of rodents.
b) Patients who went through hip replacement surgery were asked on two surveys, once before and once after surgery, to rate the level of pain they were experiencing in their hip while walking (on a scale of 0 to 100). You want to know if the average response changed after surgery.

c) A Harvard student is interested in determining how much money recent Harvard graduates make in their first job after graduation. He wants to determine if this is related to gender.
d) A Harvard student is interested in determining how much money recent Harvard graduates make in their first job after graduation. He wants to determine if this is related to concentration at Harvard.
e) A Harvard student is interested in determining how much money recent Harvard graduates make in their first job after graduation. He wants to determine if this is related to GPA at Harvard.
f) A Harvard student is interested in determining how much money recent Harvard graduates make in their first job after graduation. He wants to determine if this is related to gender, concentration, and GPA all at once.

g) A Harvard student is interested in determining whether students are fans of Miley Cyrus or not. She wants to know if this is related to gender.
h) A Harvard student is interested in determining whether students are fans of Miley Cyrus or not. She wants to know if this is related to concentration.
i) A Harvard student is interested in determining whether students are fans of Miley Cyrus or not. She wants to know if this is related to GPA.

2. Kevin’s dog (a mixed-breed Akita named Rio) often barks when people are at the front door. If the person at the front door is a stranger, Rio barks 90% of the time. If the person at the front door is Kevin’s friend, Rio barks only 20% of the time. About 75% of people who come to the front door are Kevin’s friends. (Note: for this problem, everyone is either Kevin’s friend or a stranger.)

a) What is the probability that Rio barks at the next person at the front door?
b) If Rio is barking at someone at the front door, what is the probability that person is Kevin’s friend?
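One way to organize the arithmetic for this problem is the law of total probability for part (a) and Bayes' rule for part (b). The sketch below uses only the numbers given in the problem; the variable names are illustrative.

```python
# Probabilities stated in the problem.
p_friend = 0.75
p_stranger = 1 - p_friend          # everyone is either a friend or a stranger
p_bark_given_stranger = 0.90
p_bark_given_friend = 0.20

# (a) Law of total probability: P(bark) summed over both kinds of visitor.
p_bark = (p_bark_given_stranger * p_stranger
          + p_bark_given_friend * p_friend)

# (b) Bayes' rule: P(friend | bark) = P(bark | friend) * P(friend) / P(bark).
p_friend_given_bark = p_bark_given_friend * p_friend / p_bark

print(f"P(bark) = {p_bark:.3f}")
print(f"P(friend | bark) = {p_friend_given_bark:.3f}")
```

With these inputs, P(bark) works out to 0.375 and P(friend | bark) to 0.4, so a barking Rio still signals a friend 40% of the time because friends are the large majority of visitors.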


3. In the survey given in lecture we measured the following variables:
looks - the percent of Harvard students that a student thinks is better looking than him or her
relationship - a binary variable indicating whether the student is in a significant relationship (relationship = 1) or single (relationship = 0)
female - a binary variable indicating whether the student is female (female = 1) or male (female = 0)

(a) To the right is the histogram of the response variable, looks. Comment on the plot.


(b) Based on this model, what is the estimated mean looks for women?
(c) Is there a significant difference in average looks between men and women? How do you know?

(d) What is the interpretation of the coefficient for female in this model?

(e) Kevin is in a relationship. What is the estimated value of looks for Kevin?


(h) Briefly comment on the model's assumptions based on these 2 graphs.


3. The 2000 US census showed the average number of people per household in the US was 2.5. There is concern that this average has changed since 2000. The government conducted a random sample survey of 400 households in 2007 and found the average number of people per household to be 2.36 in their sample with a standard deviation of 2.01.

(a) Is this sufficient evidence to show a change in the average number of people per household since 2000? Test at level α = 0.05 and include all the usual elements of a test of hypothesis.
(b) Calculate the 95% confidence interval for the overall average number of people per household in the entire US in 2007.
(c) A sample of 500 families in 2013 found the average to be 2.29 with s_2013 = 1.89. Is this evidence of a difference compared to 2007?
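The computations for this problem can be sketched in Python as follows. As a simplifying assumption, the sketch uses the standard normal distribution in place of the t distribution for p-values and critical values, which is a close approximation with samples this large; exact t-based answers would differ slightly. Variable names are illustrative.

```python
import math
from statistics import NormalDist

std_normal = NormalDist()  # normal approximation to t (assumption: large n)

# (a) One-sample test of H0: mu = 2.5 vs. Ha: mu != 2.5, using the 2007 sample.
mu_0 = 2.5
n_2007, mean_2007, s_2007 = 400, 2.36, 2.01
se_2007 = s_2007 / math.sqrt(n_2007)
t_a = (mean_2007 - mu_0) / se_2007
p_a = 2 * std_normal.cdf(-abs(t_a))  # two-sided p-value

# (b) Approximate 95% confidence interval for the 2007 mean.
z_star = std_normal.inv_cdf(0.975)
ci_low = mean_2007 - z_star * se_2007
ci_high = mean_2007 + z_star * se_2007

# (c) Two-sample comparison of 2007 vs. 2013 (independent samples).
n_2013, mean_2013, s_2013 = 500, 2.29, 1.89
se_diff = math.sqrt(s_2007 ** 2 / n_2007 + s_2013 ** 2 / n_2013)
t_c = (mean_2007 - mean_2013) / se_diff
p_c = 2 * std_normal.cdf(-abs(t_c))

print(f"(a) t = {t_a:.3f}, approx. two-sided p = {p_a:.3f}")
print(f"(b) approx. 95% CI: ({ci_low:.3f}, {ci_high:.3f})")
print(f"(c) t = {t_c:.3f}, approx. two-sided p = {p_c:.3f}")
```

Under this approximation, both p-values come out above 0.05 and the interval in (b) contains 2.5, which agrees with the two-sided test in (a).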

4. An investigator is trying to determine whether the Red Sox play better or worse when they play in front of large crowds. He decided to collect the following variables on the first n = 34 games in 2013:
runs_diff - the differential in runs of (Red Sox runs) – (their opponent’s runs)
attendance - the attendance at the game (in thousands of people)
home - a binary variable indicating whether the game was played at home at Fenway (home = 1) or elsewhere (home = 0)

(a) To the right is the histogram of the response variable, runs_diff. Comment on the plot.


(b) Does attendance appear to be a significant predictor of score differential?
(c) You decide to attend a Red Sox game where 38 thousand fans attend. Based on the regression output above, give an approximate 95% prediction interval for the score differential for this one game.

(d) Compare the multiple regression model above to the simple regression model in the previous slide. Is attendance still a significant predictor of Red Sox score differential?
(e) Which model do you think is more accurate? Is there really an attendance effect on Red Sox performance?

