
UNIVERSITY OF LIMPOPO

SSTB031 STUDY GUIDE: APPLIED LINEAR REGRESSION
Compiled by Mr Maluleke H, 2020

Table of Contents

CHAPTER 1: SIMPLE LINEAR REGRESSION AND CORRELATION
1. Introduction
1.1 Simple Linear Regression Model
1.2 Estimation of the Regression Parameters
1.3 Correlation
What does the standard error mean?
The sampling distribution of the slope (β1) of the regression model
Hypothesis testing for β0 and β1
CHAPTER 2: DIAGNOSTICS FOR SIMPLE LINEAR REGRESSION
2.1 Introduction
2.2 Residual Analysis
Examination of Residuals
Different Patterns of Residual Plots
2.3 Identification of Outliers
2.4 Detection of Influential Observations
2.4.1 Leverage procedure
See an electronic book!!!
2.4.2 Deleted residual method
2.4.3 Cook's distance
CHAPTER 3: MULTIPLE REGRESSION MODEL
3.1 Multiple Linear Regression Model
Matrix Approach to Regression Analysis
3.2 Estimation of Regression Coefficients
3.3 Test for the Significance of the Overall Model
3.4 Test for the Significance of the Regression Coefficients
3.5 Inferences about the Mean Response
3.6 Inferences about the Individual Response Fitted Values
3.7 Multiple Coefficient of Determination (R²)
3.8 Testing Portions of the Multiple Regression Model


CHAPTER 4: DIAGNOSTICS FOR MULTIPLE REGRESSION
4.1 Introduction
4.2 Residual Analysis
Examination of Residuals
Different patterns of the residual plots: as in simple linear regression
4.3 Identification of Outliers
4.4 Detection of Influential Observations
4.4.1 Leverage procedure
4.4.2 Deleted residual method
4.4.3 Cook's distance
4.5 Collinearity
CHAPTER 5: MODEL-BUILDING
5.1 Backward elimination
5.2 Forward elimination
5.3 Stepwise Regression
Appendices
Appendix A: Class Examples
Appendix B: Time table
Appendix C: Module Outline


CHAPTER 1: SIMPLE LINEAR REGRESSION AND CORRELATION

1.1 Introduction

We have dealt with data that involved a single variable x. In this section, we shall deal with paired variables x and y. Paired variables mean that, for each value of y, there is a corresponding value of x. Here is an example of paired variables:

x   24   15   17   32   19   18   25   34
y   22   11   14   30   17   12   23   31

When confronted with paired data, we are often faced with the following questions:

• Is there a relationship between the variable x and its counterpart y?

• If so, what is the exact nature of the relationship?

• Can we predict the value of y given the value of x?

1.2 Simple Linear Regression Model

Regression explores this relationship through the use of a regression line. We can express this statistical relationship in the form of a linear equation, which is used to predict the value of one variable given the value of its partner. The equation is known as a regression line. The analysis designed to derive an equation for the line that best models the relationship between the dependent and independent variables is called regression analysis. This equation has the mathematical form:

yi = β0 + β1xi + εi   (0)

where
yi is the value of the dependent variable for the ith observation,
xi is the value of the independent variable for the ith observation, and
εi is a random error term.

The error terms satisfy the following:
1. εi is a random variable with mean zero and variance σ², that is, E(εi) = 0 and V(εi) = σ².
2. εi and εj are uncorrelated for i ≠ j, so that COV(εi, εj) = 0. Thus E(yi) = β0 + β1xi, V(yi) = σ², and yi and yj are uncorrelated for i ≠ j.
3. εi is a normally and independently distributed random variable with mean zero and variance σ², that is, εi ~ N(0, σ²).

For example, a modeler might want to relate the weights of individuals to their heights using a linear regression model.
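As a rough illustration of this model and the error assumptions above, the short Python/NumPy sketch below simulates such height–weight data. It is not part of the guide; the parameter values β0 = 20, β1 = 0.45 and σ = 4 are invented purely for the example.

    import numpy as np

    rng = np.random.default_rng(seed=1)

    # Hypothetical "true" parameters (illustrative values only)
    beta0, beta1, sigma = 20.0, 0.45, 4.0

    n = 50
    x = rng.uniform(150, 190, size=n)                # heights in cm
    eps = rng.normal(loc=0.0, scale=sigma, size=n)   # E(eps) = 0, V(eps) = sigma^2, independent
    y = beta0 + beta1 * x + eps                      # weights generated from the linear model

    print(x[:5], y[:5])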

1.2.1 Interpretation of β0 and β1

Note that β0 is the y-intercept and β1 indicates how much the y-value changes when the x-value increases by one unit.

1.2.2 Assumptions about the Error Term εi for a Linear Regression Model

(i) The values of εi are independent.

Implication: The value of εi for a particular value of x is not related to the value of εi for any other value of x; thus, the value of y for a particular value of x is not related to the value of y for any other value of x.

(ii) The variance of εi, denoted by σ², is the same for all x.

This implies constant variance (homoscedasticity) of the error term: the variation around the line of regression is constant for all values of x.

Implication: The variance of y at a given value of x equals σ² and is the same for all values of x.

(iii) The error terms are random variables with mean (expected value) zero.

Implication: Because β0 and β1 are constants, E(β0) = β0 and E(β1) = β1; thus, for a given value of x, the expected value of y is E(yi) = β0 + β1xi.

(iv) The error term εi is a normally distributed random variable.

Implication: Because y is a linear function of εi, y is also a normally distributed random variable.

In this mathematical equation one dependent variable (y) is examined in relation to only one independent variable (x); therefore, this is a simple regression analysis. For a given value of x, we can predict the value of y. The variable y to be predicted is called the dependent variable, while the variable x used to predict y is the independent variable. The β0-value is the intercept of the regression line on the y-axis (the value of y when x = 0), and the β1-value is the slope of the regression line. The problem in linear regression is to find values for the regression coefficients β0 and β1 in such a way that we obtain the best-fitting line, that is, the line that passes through the observations as closely as possible.

We will look at two methods of estimation.

1.3 Estimation of the Regression Parameters

1.3.1 Least-Squares Method

The method of least squares (MLS) will be used to estimate β0 and β1. The least-squares method determines the best-fitting straight line as that line which minimizes the sum of the squares of the lengths of the vertical-line segments drawn from the observed data points on the scatter diagram to the fitted line. The idea here is that the smaller the deviations of the observed values from this line (and consequently the smaller the squares of these deviations), the closer the best-fitting line will be to the data. Let ŷi denote the estimated response at xi based on the fitted regression line; in other words,

ŷi = β̂0 + β̂1xi,

where β̂0 and β̂1 are the estimates of the regression parameters β0 and β1. The vertical distance between the observed point (xi, yi) and the corresponding point (xi, ŷi) on the fitted line is given by the absolute value |yi − ŷi|. The sum of squares of all such distances is then given by

SSE = Σ (yi − ŷi)² = Σ (yi − β̂0 − β̂1xi)²   ……(1)

(throughout, Σ denotes summation over i = 1, …, n)

Now

∂SSE/∂β̂0 = −2 Σ (yi − β̂0 − β̂1xi) = 0   ……(2)

∂SSE/∂β̂1 = −2 Σ (yi − β̂0 − β̂1xi) xi = 0   ……(3)

Equations (2) and (3) are obtained by setting the partial derivatives of SSE with respect to β̂0 and β̂1 equal to zero. Simplifying, we get

Σ yi − nβ̂0 − β̂1 Σ xi = 0   (2a)

Σ xiyi − β̂0 Σ xi − β̂1 Σ xi² = 0   (3a)

nβ̂0 + β̂1 Σ xi = Σ yi   (2b)

β̂0 Σ xi + β̂1 Σ xi² = Σ xiyi   (3b)

Equations (2b) and (3b) are called the normal equations. Solving them gives

β̂1 = [Σ xiyi − (Σ xi)(Σ yi)/n] / [Σ xi² − (Σ xi)²/n]   and   β̂0 = ȳ − β̂1x̄

Let "SS" stand for sum of squares and the subscript(s) indicate the variable(s) that we are summing over. Then the sums of squares are:

SSx = Σ xi² − (Σ xi)²/n = Σ (xi − x̄)²

SSy = Σ yi² − (Σ yi)²/n = Σ (yi − ȳ)²

SSxy = Σ xiyi − (Σ xi)(Σ yi)/n = Σ (xi − x̄)(yi − ȳ)

So that our regression coefficient β̂1 becomes

β̂1 = SSxy / SSx

The estimators β̂0 and β̂1 are unbiased estimators of the regression coefficients β0 and β1, respectively.
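As a numerical illustration of these formulas, the following minimal sketch (Python with NumPy; not part of the original guide) computes SSx, SSy, SSxy and the least-squares estimates for the paired data from Section 1.1.

    import numpy as np

    x = np.array([24, 15, 17, 32, 19, 18, 25, 34], dtype=float)
    y = np.array([22, 11, 14, 30, 17, 12, 23, 31], dtype=float)
    n = len(x)

    # Sums of squares, using the computational forms given above
    SSx  = np.sum(x**2) - np.sum(x)**2 / n          # = sum((x - xbar)^2)
    SSy  = np.sum(y**2) - np.sum(y)**2 / n          # = sum((y - ybar)^2)
    SSxy = np.sum(x*y) - np.sum(x)*np.sum(y) / n    # = sum((x - xbar)(y - ybar))

    # Least-squares estimates
    b1 = SSxy / SSx                  # slope estimate
    b0 = y.mean() - b1 * x.mean()    # intercept estimate

    print(SSx, SSy, SSxy, b1, b0)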

• Estimating the value of σ²

Assuming that the straight-line model is appropriate, we can obtain an estimate of σ² using SSE. Such an estimate is needed for making statistical inferences concerning the true (i.e. population) straight-line relationship between x and y. This estimate is given by:

s² = Σ (yi − ŷi)² / (n − 2) = SSE / (n − 2)

For computational purposes it can be shown that

s² = (SSy − β̂1 SSxy) / (n − 2) = (n − 1)(sy² − β̂1² sx²) / (n − 2)

[Note that (n − 1)sy² = SSy.]

where sx² and sy² are the sample variances of the observed xi's and yi's, respectively. Thus the estimate of σ² is given by

s² = MSE = SSE / (n − 2)
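A minimal numerical sketch (Python/NumPy, again using the example data from Section 1.1) showing that the two expressions for s² agree:

    import numpy as np

    x = np.array([24, 15, 17, 32, 19, 18, 25, 34], dtype=float)
    y = np.array([22, 11, 14, 30, 17, 12, 23, 31], dtype=float)
    n = len(x)

    SSx  = np.sum((x - x.mean())**2)
    SSy  = np.sum((y - y.mean())**2)
    SSxy = np.sum((x - x.mean()) * (y - y.mean()))
    b1 = SSxy / SSx
    b0 = y.mean() - b1 * x.mean()
    y_hat = b0 + b1 * x

    SSE_direct   = np.sum((y - y_hat)**2)    # definition of SSE
    SSE_shortcut = SSy - b1 * SSxy           # computational form

    s2 = SSE_direct / (n - 2)                # s^2 = MSE = SSE/(n - 2)
    print(SSE_direct, SSE_shortcut, s2)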

Exercise 1: Show that SSE = SSy − β̂1 SSxy.

• Remember that by partitioning the total sum of squares, we get SST = SSR + SSE, that is,

Σ (yi − ȳ)² = Σ (ŷi − ȳ)² + Σ (yi − ŷi)²

Because SSE = SSy − β̂1 SSxy, this implies that SST = SSy and SSR = β̂1 SSxy, which can be used to prove that SSR = Σ (ŷi − ȳ)².

Exercise 2:

• Prove that SSR = Σ (ŷi − ȳ)².
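Before attempting the proof, the partition SST = SSR + SSE can be checked numerically; a short sketch (Python/NumPy, with the same example data) is:

    import numpy as np

    x = np.array([24, 15, 17, 32, 19, 18, 25, 34], dtype=float)
    y = np.array([22, 11, 14, 30, 17, 12, 23, 31], dtype=float)

    b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean())**2)
    b0 = y.mean() - b1 * x.mean()
    y_hat = b0 + b1 * x

    SST = np.sum((y - y.mean())**2)
    SSR = np.sum((y_hat - y.mean())**2)
    SSE = np.sum((y - y_hat)**2)

    # SST should equal SSR + SSE up to floating-point rounding
    print(SST, SSR + SSE)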


Note: SSE = the sum of squares due to error (also called the error or residual sum of squares). SST = the total sum of squares. SSR = the sum of squares due to regression.

Some properties of the fitted line:
1. Σ ei = 0
2. Σ yi = Σ ŷi
3. Σ xiei = 0
4. Σ ŷiei = 0
5. (Σ ŷi)/n = ȳ, that is, the mean of the fitted values equals the mean of the observed values.
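These properties can be verified numerically for any least-squares fit; a brief sketch (Python/NumPy, with the example data from Section 1.1):

    import numpy as np

    x = np.array([24, 15, 17, 32, 19, 18, 25, 34], dtype=float)
    y = np.array([22, 11, 14, 30, 17, 12, 23, 31], dtype=float)

    b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean())**2)
    b0 = y.mean() - b1 * x.mean()
    y_hat = b0 + b1 * x
    e = y - y_hat                                   # residuals

    print(np.isclose(e.sum(), 0))                   # property 1
    print(np.isclose(y.sum(), y_hat.sum()))         # property 2
    print(np.isclose((x * e).sum(), 0))             # property 3
    print(np.isclose((y_hat * e).sum(), 0))         # property 4
    print(np.isclose(y_hat.mean(), y.mean()))       # property 5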

1.3.2 The Method of Maximum Likelihood

See electronic book!!!

Is there a relationship between x and y? This can be measured by correlation analysis.

1.4 Correlation

Correlation involves determining whether a linear relationship actually exists between two variables by using

• a scatter diagram,

• the linear correlation coefficient, and

• hypothesis testing.

1.4.1 The Scatter Diagram

The simplest way to reveal a possible correlation between bivariate data is to express it graphically.

For example, the graph may show a strong linear pattern in the points. A line joining the points would possess a positive slope, representing a positive relationship. One advantage of the scatter diagram is that it does not require any calculation. A more precise method of discovering a relationship is to apply the linear correlation coefficient. Other examples of scatter diagrams that help illustrate relationships between points are given below:
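For instance, a scatter diagram of the paired data from Section 1.1 can be drawn with matplotlib; the following is a minimal sketch (matplotlib is assumed to be available and is not part of the guide):

    import matplotlib.pyplot as plt

    x = [24, 15, 17, 32, 19, 18, 25, 34]
    y = [22, 11, 14, 30, 17, 12, 23, 31]

    plt.scatter(x, y)                       # plot each (x, y) pair as a point
    plt.xlabel("x (independent variable)")
    plt.ylabel("y (dependent variable)")
    plt.title("Scatter diagram of the paired data")
    plt.show()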


1.4.2 The Correlation Coefficient

Computing the strength of the relationship requires the use of the linear correlation coefficient. The linear correlation coefficient measures the strength of the relationship between the x and y values in a sample. It is a number between −1 and 1 which measures the degree to which two variables are linearly related.

How do we interpret r?

• If there is a perfect linear relationship with positive slope between the two variables, we have a correlation coefficient of 1. A positive correlation means that whenever one variable increases (or decreases) in value, so does the other. A value closer to 1 indicates a strong positive relationship.

• If there is a perfect linear relationship with negative slope between the two variables, we have a correlation coefficient of −1. A negative correlation means that whenever one variable increases (or decreases) in value, the other decreases (or increases). A value closer to −1 indicates a strong negative relationship.

• A correlation coefficient of 0 means that there is no linear relationship between the variables. A value of r closer to 0 indicates no significant linear correlation.

Note: The value of r must always fall between −1 and 1 (inclusive). The equation of the correlation coefficient is as follows:

r = SSxy / √(SSx SSy)

Note: r, β̂1 and SSxy must have the same sign because

r = SSxy / √(SSx SSy) = β̂1 √(SSx / SSy)

But how do we decide whether a significant relationship exists between the variables after calculating an r that does not closely approach either of the extremes (−1 and 1)? This will be done by conducting a relevant hypothesis test.

1.4.3 The Coefficient of Determination

It is given by the ratio SSR/SST. It measures the proportion of the total variability of y which is "explained" by the regression line. We have that SSR = β̂1 SSxy and SST = SSy; substituting, we have

coefficient of determination = β̂1 SSxy / SSy = SSxy² / (SSx SSy) = r²

Thus the coefficient of determination is equal to r².
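A quick numerical check (Python/NumPy, with the example data from Section 1.1) that r = SSxy/√(SSx SSy) agrees with NumPy's built-in correlation and that the coefficient of determination equals r²:

    import numpy as np

    x = np.array([24, 15, 17, 32, 19, 18, 25, 34], dtype=float)
    y = np.array([22, 11, 14, 30, 17, 12, 23, 31], dtype=float)

    SSx  = np.sum((x - x.mean())**2)
    SSy  = np.sum((y - y.mean())**2)
    SSxy = np.sum((x - x.mean()) * (y - y.mean()))

    r = SSxy / np.sqrt(SSx * SSy)           # correlation coefficient
    print(r, np.corrcoef(x, y)[0, 1])       # should agree with NumPy's value

    b1 = SSxy / SSx
    R2 = (b1 * SSxy) / SSy                  # SSR/SST
    print(R2, r**2)                         # coefficient of determination equals r^2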

1.4.4 Correlation and Hypothesis Testing

Conducting a hypothesis test of correlation is similar to the methods of hypothesis testing with which we have grown familiar. The rejection region is determined through the degrees of freedom and the significance level.

1.4.5 Interval Estimation and Prediction

We will look at two uses of our model:

• the model for estimating the mean value of y, E(y), for a specific value of x, and

• the model for predicting a particular y value for a given x value.

To assess whether the fitted line helps to predict y from x, and to take into account the uncertainties of using a sample, it is standard practice to compute confidence intervals and/or test statistical hypotheses about the unknown parameters in the assumed linear model. Such confidence intervals and tests require, as described earlier, the assumption that the random variable y has a normal distribution at each fixed value of x. Under this assumption it can be deduced that the estimators β̂0 and β̂1 are each normally distributed, with respective means β0 and β1 and easily derivable variances. The standard error of the estimate, sometimes called the residual standard deviation, is given by

se = √(SSE / (n − 2))

What does the standard error of the estimate measure? The standard error of the estimate indicates how much, on average, the predicted values differ from the observed values of the response variable.

The sampling distribution of the slope (β̂1) of the regression model

Suppose that the variables x and y satisfy all the conditions for the regression inferences. Then for samples of size n, each with the same values x1, x2, …, xn of the predictor variable, the following properties hold for the slope β̂1 of the sample regression line:

• The mean of β̂1 equals the slope of the population regression line, that is, E(β̂1) = β1.

• The standard deviation of β̂1 is σ(β̂1) = σ / √(Σ (xi − x̄)²) = σ / √SSx.

• The variable β̂1 is normally distributed.

Hence the standardize...
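Using the formulas above, the residual standard deviation se and an estimated standard deviation of the slope can be computed; in the sketch below (Python/NumPy, example data from Section 1.1) the unknown σ is replaced by se, a substitution assumed here rather than stated in this excerpt.

    import numpy as np

    x = np.array([24, 15, 17, 32, 19, 18, 25, 34], dtype=float)
    y = np.array([22, 11, 14, 30, 17, 12, 23, 31], dtype=float)
    n = len(x)

    SSx  = np.sum((x - x.mean())**2)
    SSxy = np.sum((x - x.mean()) * (y - y.mean()))
    b1 = SSxy / SSx
    b0 = y.mean() - b1 * x.mean()

    SSE = np.sum((y - (b0 + b1 * x))**2)
    se = np.sqrt(SSE / (n - 2))      # residual standard deviation (standard error of the estimate)
    sd_b1 = se / np.sqrt(SSx)        # estimated standard deviation of the slope estimator
    print(se, sd_b1)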

