Homework-6 - 36-402 Advanced Methods for Data Analysis Spring 2021 Homework 6 PDF

Title	Homework-6 - 36-402 Advanced Methods for Data Analysis Spring 2021 Homework 6
Author	Julia Lee
Course	Advanced Data Analysis
Institution	Carnegie Mellon University
Pages	4


Homework 6
Advanced Methods for Data Analysis (36-402)
Due Monday March 22, 2021 at 3:00 pm ET

You should always show all your work and submit both a writeup and R code.

• Assignments must be submitted through Gradescope as a PDF. Follow the instructions here: https://www.cmu.edu/teaching/gradescope/
• Gradescope will ask you to mark which parts of your submission correspond to each homework problem. This is mandatory; if you do not, grading will be slowed down, and your assignment will be penalized.
• Make sure your work is legible in Gradescope. You may not receive credit for work the TAs cannot read. Note: If you submit a PDF with pages much larger than 8.5 × 11", they will be blurry and unreadable in Gradescope.
• For questions involving R code, we strongly recommend using R Markdown. The relevant code should be included with each question, rather than in an appendix. A template Rmd file is provided on Canvas.

1. Bootstrapping Regression Models. The goal of this homework problem is to practice using bootstrapping to quantify the uncertainty in regression models. We will reuse part of the data from Homework #1. In particular, we will use the first training set of 32 observations on engine displacement and miles-per-gallon. You can find these as the first columns of engine.xtrain and engine.ytrain respectively from the first assignment. Put these two columns into a data.frame with column names eng.disp and mpg. This data frame should consist of 2 columns with 32 rows. Compare the summary of your data frame to the following to make sure you got the correct parts of the data.

    eng.disp           mpg
    Min.   : 71.1      Min.   :10.40
    1st Qu.:120.8      1st Qu.:15.43
    Median :196.3      Median :19.20
    Mean   :230.7      Mean   :20.09
    3rd Qu.:326.0      3rd Qu.:22.80
    Max.   :472.0      Max.   :33.90

As always, include the code for the computational parts of each problem in your report, clearly labeling which block of code goes with which problem.
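The data-frame setup described above might be sketched as follows. The Homework 1 objects engine.xtrain and engine.ytrain are not available here, so this sketch makes an assumption: the summary shown above looks consistent with the disp and mpg columns of R's built-in mtcars data, and mtcars is used as a stand-in. The name cars is hypothetical.

```r
# Assumption: mtcars$disp and mtcars$mpg stand in for the first columns of
# engine.xtrain and engine.ytrain, which are not available in this sketch.
cars <- data.frame(eng.disp = mtcars$disp, mpg = mtcars$mpg)

# The assignment asks for 2 columns and 32 rows.
stopifnot(ncol(cars) == 2, nrow(cars) == 32)

# Compare this output against the summary printed in the assignment.
summary(cars)
```

With the real Homework 1 data, the first line would instead pull the first column from each of engine.xtrain and engine.ytrain.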


We saw in Homework #1 that a linear regression of mpg on eng.disp did not provide a very good fit.

(a) Plot mpg against eng.disp. Then plot 1/mpg against eng.disp. Does the relationship in either of the plots look more linear than in the other? Is there anything in the plot of 1/mpg against eng.disp to suggest that at least one assumption of the simple linear regression model might not hold here? Indicate what might be wrong and what methods might be used to address the problem.

(b) Fit the simple linear regression of 1/mpg on eng.disp, and plot the residuals against the predictor. Also draw a normal q-q plot and add the qqline. Say what, if anything, in these plots suggests any violations of the regression assumptions. Display the summary of the model fit, which includes standard errors for the coefficients.

(c) Next, perform a bootstrap analysis to see if we can improve on the standard errors calculated by the simple linear regression fit. We will use the “resampling cases” form of the bootstrap (also called “resampling (X,Y) pairs” or nonparametric bootstrap). Based on what you saw in parts (a) and (b), give a justification for this choice of regression bootstrap. Create B = 10000 bootstrap samples of the (eng.disp, mpg) pairs, each of the same size as the original data. For each bootstrap sample, save the estimate of the slope parameter. (You won’t need the intercept for the rest of the problem.) Report the sample mean and sample standard deviation of the bootstrapped slope parameter estimates. Compare the sample standard deviation to the standard error reported for the slope coefficient in the summary of the model fit. What is the percentage change in the magnitude of the standard error going from the linear model fit to the bootstrap estimate?

(d) We would like to be able to say whether the bootstrap analysis gives clear evidence that the standard error for the slope estimate reported by the simple linear regression fit is not appropriate. This requires a measure of uncertainty associated with the calculation of the bootstrap standard error. It almost sounds like we are going to suggest bootstrapping the bootstrap analysis, but this is not necessary. With 10,000 bootstrap samples, we have already replicated a 1,000-sized bootstrap analysis 10 times, or a 500-sized bootstrap analysis 20 times, etc. Suppose now we treat the 10,000 bootstrap samples as 50 replications of B = 200 bootstrap samples. For each of the 50 replications, compute the sample standard deviation of the estimated slope parameters. Treat the 50 resulting sample standard deviations as a sample of size 50 from a normal distribution. Perform a t-test of whether the mean of the distribution of these 50 bootstrap standard errors is equal to the standard error reported by the linear model fit for the estimated slope. Draw a q-q plot of the 50 bootstrap standard errors with the qqline added. Do they look like a normal sample? Why would it make sense to check this?

(e) First explain, in one sentence, what one of the 50 sample standard deviations you just calculated actually means. Specifically, what quantity does it measure the variation of, and where does that variation come from? Then explain, in another sentence, what’s different between looking at 50 replications of 200 bootstrap samples and 10 replications of 1000 bootstrap samples.
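The computations in parts (c) and (d) of Problem 1 could be organized roughly as below. This is only a sketch under stated assumptions, not a reference solution: mtcars again stands in for the Homework 1 training set, the seed is arbitrary, and the names cars, boot_slope, slopes, se_reps and pct_change are hypothetical.

```r
set.seed(402)  # arbitrary seed, for reproducibility only

# Assumption: mtcars stands in for the Homework 1 training set.
cars <- data.frame(eng.disp = mtcars$disp, mpg = mtcars$mpg)

# Part (b): linear model for 1/mpg and its reported slope standard error.
fit <- lm(I(1 / mpg) ~ eng.disp, data = cars)
se_lm <- summary(fit)$coefficients["eng.disp", "Std. Error"]

# Part (c): resample whole (eng.disp, mpg) rows with replacement, refit,
# and keep only the slope.
boot_slope <- function(data) {
  idx <- sample(nrow(data), replace = TRUE)
  coef(lm(I(1 / mpg) ~ eng.disp, data = data[idx, ]))[["eng.disp"]]
}
B <- 10000
slopes <- replicate(B, boot_slope(cars))
se_boot <- sd(slopes)

# Percentage change going from the lm standard error to the bootstrap one.
pct_change <- 100 * (se_boot - se_lm) / se_lm

# Part (d): treat the 10,000 slopes as 50 replications of 200 each.
# matrix() fills column by column, so each column holds one replication.
slope_blocks <- matrix(slopes, nrow = 200, ncol = 50)
se_reps <- apply(slope_blocks, 2, sd)

# t-test of whether the mean bootstrap SE equals the lm standard error.
t_res <- t.test(se_reps, mu = se_lm)

# Normal q-q plot of the 50 bootstrap standard errors.
qqnorm(se_reps)
qqline(se_reps)
```

Splitting by columns works here because the 10,000 bootstrap samples were drawn independently, so any fixed partition into 50 groups of 200 gives 50 independent replications.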

2. Cat Data: The Sequel. To load the data, run library(MASS). We looked at the cat data during Lecture 4B and in the BootstrapRegressionExample.Rmd demo, so refer back to that for context on the data and variables.

(a) Fit a linear regression model for Hwt, in which Bwt interacts with Sex and the intercept is forced to be zero. Show the model summary, and explain why forcing the intercept to be zero is a reasonable thing to do. Hint: There are some subtleties regarding how to force the intercept to be zero. Make sure you are not accidentally introducing intercept terms by using the wrong R syntax.

(b) Under the assumption that the residuals are normally distributed, conduct a statistical test to find out whether there is evidence that the slope coefficient for Bwt differs between female and male cats. Make sure you state the hypotheses (null and alternative), the null distribution, the value of the test statistic and the p-value. Hint: Find a way to write the hypothesis test as the comparison of two different linear models.

(c) Now let’s test the same hypothesis, but without making assumptions about normality of the residuals. To be specific, we would like to test whether the slope of Bwt is different for male and female cats. As you know, statistical hypothesis tests, like the one you did in part (b), work by comparing a statistic calculated from the data to the distribution of that statistic expected when the null hypothesis is true. The distribution is usually calculated mathematically using model assumptions, but if we don’t believe those assumptions, we need some other way to estimate it. In this problem, we’ll use bootstrapping to do so. Specifically, the statistic we will use is

$$T = (\hat\beta_{\text{Bwt:Male}} - \hat\beta_{\text{Bwt:Female}})^2.$$

To calculate the statistic on the observed data, use the model you fit in part (a). Report the value of the test statistic and save it in a variable for use later.

(d) Now let’s estimate the null distribution. We will use the bootstrap by resampling residuals to simulate the distribution of $T$ when $\hat\beta_{\text{Bwt:Male}} = \hat\beta_{\text{Bwt:Female}}$. When this is true, both regression lines are the same, so when bootstrapping, use $\hat r_{\text{Male}}(x) = \hat r_{\text{Female}}(x) = \hat\beta_1 x$, where $\hat\beta_1$ is the slope you calculate from a regression that ignores Sex. [1] Write an R function that produces bootstrap samples from the data, but using this model. In case the residual distribution is different for male and female cats, it should draw the residuals for male cats from the observed residuals for male cats, and the residuals for female cats from the observed residuals for female cats. Next, for each bootstrap sample $b$, let $T^*_b$ denote the value of $T$ computed from sample $b$. Run B = 1000 bootstrap samples and plot a histogram of $T^*_b$. Mark the $T$ statistic you calculated in (c) as a vertical line on the histogram.

[1] It is possible to prove that the bootstrap by resampling residuals will lead to the same results no matter what $\hat\beta_1$ you use for computing $\hat r_{\text{Male}}(x) = \hat r_{\text{Female}}(x)$. All that matters is that you use the same $\hat\beta$s for both.
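One way the residual bootstrap of parts (c) and (d) might be put together is sketched below. Several choices here are assumptions rather than anything the assignment fixes: the residuals are taken from the part (a) fit, the Sex-free regression is also fit without an intercept, the seed is arbitrary, and the names fit_full, fit_null, slope_diff_sq, resample_cats_null and t_star are hypothetical.

```r
library(MASS)  # provides the cats data with columns Sex, Bwt, Hwt
set.seed(402)  # arbitrary seed, for reproducibility only

# Part (a): one Bwt slope per Sex, with no intercept terms.
fit_full <- lm(Hwt ~ 0 + Bwt:Sex, data = cats)

# Common-slope fit that ignores Sex (no intercept here is an assumption).
fit_null <- lm(Hwt ~ 0 + Bwt, data = cats)

# Part (c): squared difference between the two fitted slopes.
# The fit has exactly two coefficients, so diff() gives their difference.
slope_diff_sq <- function(fit) unname(diff(coef(fit)))^2
t_obs <- slope_diff_sq(fit_full)

# Part (d): one bootstrap data set generated under the null.
# Assumption: residuals are those of the part (a) fit, resampled within Sex.
resample_cats_null <- function() {
  res <- residuals(fit_full)
  new_res <- numeric(nrow(cats))
  for (s in levels(cats$Sex)) {
    in_group <- cats$Sex == s
    new_res[in_group] <- sample(res[in_group], sum(in_group), replace = TRUE)
  }
  boot <- cats
  boot$Hwt <- coef(fit_null)[["Bwt"]] * cats$Bwt + new_res
  boot
}

# Refit the part (a) model on each bootstrap data set and recompute T.
B <- 1000
t_star <- replicate(
  B,
  slope_diff_sq(lm(Hwt ~ 0 + Bwt:Sex, data = resample_cats_null()))
)

hist(t_star, breaks = 40, main = "Bootstrap null distribution of T",
     xlab = "T*")
abline(v = t_obs, col = "red", lwd = 2)  # observed statistic from part (c)
```

Writing the formula as 0 + Bwt:Sex, rather than a form that also includes a main effect of Sex, is what keeps per-Sex intercepts from entering the model.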

(e) Recall that the p-value is defined by

$$\Pr(T \ge t_{\text{obs}} \text{ when } H_0 \text{ is true}), \tag{1}$$

where $t_{\text{obs}}$ is the observed value of $T$. Bootstrap the p-value, that is, compute $\Pr(T^* \ge t_{\text{obs}})$ as an estimate of (1). Based on your result, is there a significant difference between the regression lines for male and female cats?
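The bootstrap estimate of (1) is just the fraction of bootstrap statistics that are at least as large as the observed one. A minimal sketch follows; boot_p_value is a hypothetical helper name, and the numbers are made up purely to illustrate the calculation, not results from the cats data.

```r
# Hypothetical helper: proportion of bootstrap statistics >= observed value.
boot_p_value <- function(t_star, t_obs) {
  mean(t_star >= t_obs)
}

# Toy illustration with made-up values (not from the cats analysis).
toy_t_star <- c(0.1, 0.4, 0.9, 1.6, 2.5)
boot_p_value(toy_t_star, t_obs = 1.0)  # 2 of the 5 values are >= 1.0, so 0.4
```

In the problem itself, the first argument would be the vector of 1000 bootstrap statistics from part (d) and the second the observed statistic saved in part (c).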
