
STAT252 Statistics for the Natural Sciences
LABORATORY NOTES, Week 11
Linear Regression (Assumptions, Transformations and Estimation)

Aim: The aims of this lab are to check the assumptions required for linear regression models, to investigate data transformation when the assumptions are not satisfied, and to find confidence and prediction intervals.

1. Checking Assumptions for Regression

Three main assumptions are required in order to proceed with inference for a regression model.

• Linearity: the trend in the scatter plot should be a straight line (possibly horizontal) rather than curved. Residuals of the same sign should not cluster in regions of the residual plot.

• Normality: the response variable (y) should be Normally distributed for a fixed value of the explanatory variable x. In most examples there are insufficient observations of y sharing a common value of x to check this directly, but it is equivalent to check that the histogram of residuals is approximately bell-shaped with no extreme skewness or outliers. Regression outliers can also be detected in the original scatter plot.

• Homoscedasticity (constant standard deviation): the amount of scatter about the fitted line should remain similar as the explanatory variable x varies. This can be checked in the original scatter plot, but it is easier to assess in the residual plot.

Open the Assumptions data set (assumptions.jmp, available for download from eLearning). This file contains a single explanatory variable x and four different response variables y1, y2, y3 and y4, which have been artificially generated to show clear violations of the linear regression model. Use Analyze/Fit Y by X/Fit Line to fit a least squares regression line to each response variable. In each case, select Plot Residuals and Save Residuals from the Linear Fit pop-up menu below the scatter plot. Finally, apply Analyze/Distribution to Residuals y1, ..., Residuals y4 to obtain histograms of the residuals. For example, here are the three types of plot for the first response variable y1.
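JMP produces these plots entirely through menus, but the same checks can be sketched in a few lines of Python. The snippet below is a minimal illustration only: assumptions.jmp cannot be read directly by Python, so a simulated response standing in for y1 is used; with the real data, x and y1 would be replaced by columns exported from JMP (for example as CSV).

```python
# A minimal Python sketch of JMP's Fit Line / Plot Residuals / Distribution steps.
# The data here are simulated stand-ins, not the real assumptions.jmp columns.
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(1)
x = np.linspace(1, 20, 40)
y1 = np.exp(0.10 * x + rng.normal(0, 0.15, x.size))  # curved trend: violates linearity

b, a = np.polyfit(x, y1, 1)        # least squares slope b and intercept a
residuals = y1 - (a + b * x)       # the values JMP stores with Save Residuals

fig, axes = plt.subplots(1, 3, figsize=(12, 3.5))
axes[0].scatter(x, y1)
axes[0].plot(x, a + b * x, color="red")              # scatter plot with fitted line
axes[0].set(title="Fit Line", xlabel="x", ylabel="y1")
axes[1].scatter(x, residuals)
axes[1].axhline(0, color="red")                      # residual plot
axes[1].set(title="Residual plot", xlabel="x", ylabel="residual")
axes[2].hist(residuals, bins=10)                     # histogram of residuals
axes[2].set(title="Residual histogram", xlabel="residual")
plt.tight_layout()
plt.show()
```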

Log book question: On the basis of the graphs, fill in the following table with Yes or No and reasons. Note that if the first assumption of linearity is violated, the residuals are calculated relative to an inappropriate fitted line, which makes it a little more difficult to assess the assumptions of Normality and homoscedasticity.

Linearity?
• y1: No. The trend curves rather than following a straight line.
• y2: No. The data point to a parabolic relationship.
• y3: No. The data are scattered far beyond and around the line.
• y4: Yes, a linear relationship exists.

Normality?
• y1: No. The residual histogram is skewed to the left.
• y2: No. The residual histogram is skewed to the left and is not symmetrical.
• y3: No. The residual histogram is skewed to the left and is not symmetrical.
• y4: Yes, the histogram is bell-shaped and symmetrical.

Homoscedasticity?
• y1: No. The data are not evenly scattered, with more clustering on one side.
• y2: No. The residuals follow the shape of a curve.
• y3: No. The majority of the points fall on one side of the line.
• y4: Yes, the points are evenly scattered above and below the line.

JMP Output: [screenshots not included]

2. Transformations in Regression

When there is doubt about one or more of the assumptions required for linear regression, transformation of the response variable, the explanatory variable, or both is sometimes helpful. The Fit Special… menu item in JMP can be used to help find an appropriate transformation. For example, apply Analyze/Fit Y by X, using y1 as Y and x as X. Under the Bivariate Fit pop-up menu choose Fit Special, then select Natural Logarithm: log(y) as the Y Transformation and No Transformation for X. The fitted curve is shown on the original untransformed axes, so there will still be problems evident in the corresponding residual plot and histogram of residuals.


Alternatively, the response can be transformed directly. Apply Analyze/Fit Y by X, highlight the variable y1, right-click, choose Transform and select Log. A new log[y1] will then appear in the list of variables. Now proceed by selecting log[y1] as Y and x as X.
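For comparison, here is the same transformation step sketched in Python, continuing the simulated stand-in for y1 used earlier. After the log transformation the residual plot and histogram should look far better behaved.

```python
# Refit after a log transformation of the response (simulated stand-in for y1).
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(1)
x = np.linspace(1, 20, 40)
y1 = np.exp(0.10 * x + rng.normal(0, 0.15, x.size))

log_y1 = np.log(y1)                 # the Transform/Log step
b, a = np.polyfit(x, log_y1, 1)     # least squares fit of log(y1) on x
residuals = log_y1 - (a + b * x)

fig, axes = plt.subplots(1, 2, figsize=(9, 3.5))
axes[0].scatter(x, residuals)
axes[0].axhline(0, color="red")     # residuals now scatter evenly about zero
axes[0].set(title="Residuals of log(y1) fit", xlabel="x", ylabel="residual")
axes[1].hist(residuals, bins=10)
axes[1].set(title="Residual histogram", xlabel="residual")
plt.tight_layout()
plt.show()
```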

Log book question: Check that all three assumptions of linearity, Normality and homoscedasticity appear to be satisfied. The low outlier shown in the box plot of residuals is not sufficiently extreme to cause concern.

All assumptions appear to be satisfied.


3. Confidence and Prediction Intervals

In Part 1, all three assumptions required to proceed with inference appear to be satisfied for the response variable y4. Locate the relevant Fit Y by X window if it is still open, or otherwise apply Fit Y by X/Fit Line using y4 as Y and x as X. The least squares intercept a and slope b can be regarded as point estimates of the parameters α and β in the regression model for the conditional mean response μ_{y|x} = α + βx. Check that the values of a and b listed in the Estimate column within the Parameter Estimates section of the output agree with the corresponding values in the Linear Fit equation. Note that JMP labels the slope by the name of the explanatory variable rather than using the word 'Slope'. This convention turns out to be very useful when there are multiple explanatory variables.
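The same Parameter Estimates table can be reproduced outside JMP. Here is a minimal sketch using statsmodels, with a simulated stand-in for y4 (the real column would come from the exported data):

```python
# Reproduce JMP's Parameter Estimates with statsmodels (simulated stand-in for y4).
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(4)
x = np.linspace(1, 20, 40)
y4 = 0.2 + 0.1 * x + rng.normal(0, 0.5, x.size)   # well-behaved linear response

X = sm.add_constant(x)          # adds the intercept column
fit = sm.OLS(y4, X).fit()

print(fit.params)   # Estimate column: intercept a, then slope b
print(fit.bse)      # Std Err column: standard errors of a and b
```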

JMP Output: [screenshot not included]

Log book question: On the basis of the standard errors listed under Std Err in Parameter Estimates, which parameter has been estimated more accurately for this example, the slope or the intercept?

Point estimate ± multiplier × standard error. The multiplier × standard error term is known as the margin of error.

The smaller the standard error, the more accurate the estimate. The standard error for the slope is 0.014369 and the standard error for the intercept is 0.835785. Since the slope has the smaller standard error, it has been estimated more accurately of the two.

Confidence intervals for each parameter can be found by adding and subtracting a certain number (how many?) of standard errors to and from the point estimate. To obtain 95% confidence intervals in JMP, move the mouse over the contents of the Parameter Estimates output, right-click, and select Columns/Lower 95% and Columns/Upper 95%.
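The multiplier is the t critical value with n − 2 degrees of freedom (roughly 2 for moderate sample sizes). A sketch of the calculation, again using the simulated stand-in for y4:

```python
# 95% confidence intervals by hand and via statsmodels (simulated y4 as before).
import numpy as np
import scipy.stats as st
import statsmodels.api as sm

rng = np.random.default_rng(4)
x = np.linspace(1, 20, 40)
y4 = 0.2 + 0.1 * x + rng.normal(0, 0.5, x.size)
fit = sm.OLS(y4, sm.add_constant(x)).fit()

t_mult = st.t.ppf(0.975, df=len(x) - 2)     # the "certain number" of standard errors
by_hand = np.column_stack([fit.params - t_mult * fit.bse,
                           fit.params + t_mult * fit.bse])
print(by_hand)                              # estimate +/- t * SE
print(fit.conf_int(alpha=0.05))             # matches: Lower 95% / Upper 95%
```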

The confidence interval for the intercept provides the range of values of α which are consistent with the data. In particular, α = 0 corresponds to a regression line which passes exactly through the origin (0, 0). If the 95% confidence interval contains 0, there is insufficient evidence to reject H0: α = 0 versus Ha: α ≠ 0 at the 5% level of significance. An equivalent procedure is to accept the null hypothesis if the P-value (listed under Prob>|t|) is greater than 0.05. Similarly, the 95% confidence interval for the slope provides the range of values of β which are consistent with the data. The special case β = 0 corresponds to a horizontal regression line, such that the explanatory variable has no effect upon the response.

Log book question: Do either of the confidence intervals contain 0 for this example? What are the implications?

The confidence interval for the slope (labelled x) does not contain 0, ranging from 0.0751028 to 0.1321305, so the slope is significantly different from zero and x has a statistically significant effect on the response. The confidence interval for the intercept does contain 0, ranging from -1.444408 to 1.8727691, so there is insufficient evidence that the intercept differs from zero; the data are consistent with a line passing through the origin.
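The equivalence between the confidence intervals and the t-tests can be checked directly. A sketch, again with the simulated stand-in for y4:

```python
# A 95% CI excludes 0 exactly when the corresponding Prob>|t| is below 0.05.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(4)
x = np.linspace(1, 20, 40)
y4 = 0.2 + 0.1 * x + rng.normal(0, 0.5, x.size)
fit = sm.OLS(y4, sm.add_constant(x)).fit()

for name, p, (lo, hi) in zip(["intercept", "slope"], fit.pvalues, fit.conf_int()):
    verdict = "reject" if p < 0.05 else "retain"
    print(f"{name}: p = {p:.4f}, 95% CI = ({lo:.4f}, {hi:.4f}) "
          f"-> {verdict} H0: parameter = 0")
```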

Rather than finding confidence intervals for each parameter separately, it is often of more practical interest to find a confidence interval for μ_{y|x} = α + βx. To achieve this in JMP, select Confid Curves Fit from the Linear Fit pop-up menu below the scatter plot. Note that the vertical separation between the lower and upper confidence curves is smallest around the middle of the graph, where x is close to its mean value. Get the Crosshairs Tool from the Tools menu or toolbar, and drag it over the graph to read the approximate vertical coordinates of the lower and upper 95% confidence limits at x = 0. Check that these values correspond to the confidence limits for α listed under Parameter Estimates. (Why?)

JMP Output: [screenshot not included]

The confidence curves for the mean are designed so that they contain the population regression line with high probability, but for some purposes it is useful to construct bounds which are likely to contain additional individual observations. Select Confid Curves Indiv from the Linear Fit pop-up menu. A second set of dashed curves will appear on the graph. These 'prediction limits' always lie further from the fitted line than the confidence curves, in order to allow for the scatter of individual points about the mean.

Log book question: How would you expect the 99% confidence and prediction curves to differ from the 95% curves? Select Set Alpha Level/.01 from the Linear Fit menu to change the curves on the graph.

The values correspond to the confidence limits for α listed under Parameter Estimates: at x = 0 the mean response μ_{y|x} reduces to α, so the confidence curves evaluated at x = 0 give exactly the confidence interval for the intercept.


You would expect the 99% confidence and prediction curves to lie further from the fitted line than the 95% curves: a higher confidence level requires a wider interval in order to capture the mean response (or an individual observation) with greater probability. Setting the alpha level to .01 in JMP widens both sets of curves accordingly.

JMP Output: [screenshot not included]
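This widening can be confirmed numerically with the simulated y4 fit, using statsmodels' prediction machinery (the mean_ci_* columns correspond to Confid Curves Fit and the obs_ci_* columns to Confid Curves Indiv):

```python
# Average widths of the confidence (mean) and prediction (individual) bands
# at alpha = 0.05 and alpha = 0.01; both widen at the higher confidence level.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(4)
x = np.linspace(1, 20, 40)
y4 = 0.2 + 0.1 * x + rng.normal(0, 0.5, x.size)
fit = sm.OLS(y4, sm.add_constant(x)).fit()

grid = sm.add_constant(np.linspace(x.min(), x.max(), 100))
for alpha in (0.05, 0.01):
    frame = fit.get_prediction(grid).summary_frame(alpha=alpha)
    mean_w = (frame["mean_ci_upper"] - frame["mean_ci_lower"]).mean()
    obs_w = (frame["obs_ci_upper"] - frame["obs_ci_lower"]).mean()
    print(f"alpha = {alpha}: mean band width {mean_w:.3f}, "
          f"prediction band width {obs_w:.3f}")
```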


