Title | Chapter 6 - YU, Chi Wai
---|---
Course | Applied Statistics
Institution | The Hong Kong University of Science and Technology
MATH2411: Applied Statistics | Dr. YU, Chi Wai
Chapter 6: SIMPLE LINEAR REGRESSION

1 WHAT IS A SIMPLE LINEAR REGRESSION?
Regression is a very useful statistical model used to capture a relationship between related variables of our interest. If the relationship is shown to be “LINEAR”, then the regression is said to be a linear regression. “SIMPLE” means that there is only ONE variable (called explanatory variable labeled by 𝑥 ) used to explain our target variable (called response variable labeled by 𝑦) --- the variable we want to explain or predict.
A simple linear regression
is a statistical model used to study the relationship between 𝒚 and 𝒙 if they are related LINEARLY.
A scatter plot is a powerful graphical method used to visualize the relationship between 𝑦 and 𝑥. To be more precise, we would have a collection of paired data of 𝑥 and 𝑦, denoted by {(𝑥𝑖 , 𝑦𝑖 ), 𝑖 = 1, … , 𝑛}, and then plot a graph of 𝑦 against 𝑥, like the picture on the right.
Note that in Math class, we probably learned to see relationships displayed as GRAPHS. Given 𝑥 , we can predict 𝑦.
However, in statistics, things are never so clean. Data do not perfectly lie on a line or curve!
If the scatter plot shows a “linear” pattern, then we can FIND a straight line to fit the messy data statistically. Throughout this course, we only consider two variables (a response variable 𝑦 and an explanatory variable 𝑥) which are related linearly, and discuss how to fit their data statistically by the following model:
𝑌 = 𝛽0 + 𝛽1 𝑥 + 𝜀
Note that 𝜀 is random and is used to capture all the uncertainty of the model, such as measurement error, and that the regression coefficients 𝛽0 and 𝛽1 are unknown.
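The role of the random term 𝜀 can be illustrated with a small simulation. The sketch below uses invented values 𝛽0 = 1, 𝛽1 = 2 and 𝜎 = 0.5 (these numbers are illustrative assumptions, not from the notes):

```python
import random

# Hypothetical true coefficients and error spread -- invented for illustration.
beta0, beta1, sigma = 1.0, 2.0, 0.5

random.seed(0)
x = [i / 10 for i in range(1, 21)]  # fixed explanatory values

# Each response is the linear part beta0 + beta1*x plus a random error epsilon.
y = [beta0 + beta1 * xi + random.gauss(0, sigma) for xi in x]

# The observed points scatter around the true line rather than lying on it.
errors = [yi - (beta0 + beta1 * xi) for xi, yi in zip(x, y)]
print(any(abs(e) > 0 for e in errors))  # True: data never fit the line exactly
```

Plotting these (𝑥, 𝑦) pairs would produce exactly the kind of "messy" scatter plot described above.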
2 LEAST SQUARES APPROACH

How do we fit the data appropriately by a straight line? Which straight line should we use? Or equivalently, what are the “good” estimates of 𝛽0 and 𝛽1 used to fit the data? Intuitively, we would like to find a straight line that is “close” to the collected data. In other words, we need a measure of the closeness of the straight line to the data, or equivalently, an estimation criterion for finding the “good” estimates of 𝛽0 and 𝛽1. The “LEAST SQUARES approach” is the most commonly used method in statistics for finding the straight line that is close to the data.
To be more precise, we have the following function 𝑆(⋅,⋅) of the total squared difference between the data and a straight line:
S(u, v) = Σ_{i=1}^n [y_i − (u + v x_i)]².
As the name of the “LEAST SQUARES” approach indicates, we then find a point (𝑎, 𝑏) at which the function 𝑆(⋅,⋅) attains its minimum. Finally, we have the following result:
b = [Σ_{i=1}^n x_i y_i − (Σ_{i=1}^n x_i)(Σ_{i=1}^n y_i)/n] / [Σ_{i=1}^n x_i² − (Σ_{i=1}^n x_i)²/n] = Σ_{i=1}^n (x_i − x̄)(y_i − ȳ) / Σ_{i=1}^n (x_i − x̄)² = S_XY / S_XX,

a = ȳ − b x̄.
Hereafter, 𝑎 and 𝑏 are called the least-squares ESTIMATES of the unknown true values of 𝛽0 and 𝛽1 , respectively.
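The formulas above translate directly into code. The following is a minimal Python sketch on invented data (the notes' 10-car dataset is not reproduced here, so these numbers are made up):

```python
# Least-squares estimates b = S_XY / S_XX and a = y_bar - b * x_bar,
# computed on an invented, roughly linear dataset.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.9, 5.1, 7.0, 9.2, 10.8]

n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n

# S_XY and S_XX exactly as defined in the notes.
s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
s_xx = sum((xi - x_bar) ** 2 for xi in x)

b = s_xy / s_xx        # least-squares estimate of beta_1
a = y_bar - b * x_bar  # least-squares estimate of beta_0
print(round(b, 4), round(a, 4))  # roughly 1.99 and 1.03 for this data
```

For this toy dataset, S_XY = 19.9 and S_XX = 10, so b = 1.99 and a = 7 − 1.99·3 = 1.03, matching the two equivalent forms of the formula.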
3 FITTED REGRESSION LINE

Once we have the least-squares estimates, we can write down the so-called fitted regression line (or sometimes called the estimated regression line):
ŷ = 𝑎 + 𝑏𝑥 .
If we substitute 𝑥𝑖 (for 𝑖 = 1, … , 𝑛) into the fitted regression line, then we get a fitted value, labelled ŷ𝑖 , of the 𝑖th observation 𝑦𝑖 of the response variable, and then have the so-called residual 𝑒𝑖 of 𝑦𝑖 , where 𝑒𝑖 is defined as

𝑒𝑖 = 𝑦𝑖 − ŷ𝑖 .
Remark that in practice the residual 𝑒𝑖 is regarded as an “actual” value of the unobservable random term 𝜀𝑖 in the simple linear regression, and it plays a very important role in regression analysis because we can use it to perform model diagnostics and quantify the goodness of the regression model.
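One handy consequence of the least-squares formulas is that residuals from a fit with an intercept always sum to zero, since a = ȳ − b x̄ forces Σeᵢ = n(ȳ − a − b x̄) = 0. A Python sketch on invented data:

```python
# Fitted values and residuals for a toy dataset (invented numbers).
x = [1.0, 2.0, 3.0, 4.0]
y = [1.2, 1.9, 3.2, 3.8]

n = len(x)
x_bar, y_bar = sum(x) / n, sum(y) / n
b = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sum((xi - x_bar) ** 2 for xi in x)
a = y_bar - b * x_bar

y_hat = [a + b * xi for xi in x]                    # fitted values
residuals = [yi - yh for yi, yh in zip(y, y_hat)]  # e_i = y_i - y_hat_i

# For least squares with an intercept, the residuals sum to (numerically) zero.
print(abs(sum(residuals)) < 1e-9)  # True
```

This zero-sum property is a quick sanity check when implementing regression by hand.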
4 RESIDUAL SUM OF SQUARES (OR SUM OF SQUARED ERRORS)

The residual sum of squares, also called the Sum of Squared Errors (hereafter, SSE), is one commonly used measure for evaluating the goodness of a simple linear regression model. As its name suggests, this measure is defined as
SSE = Σ_{i=1}^n (y_i − ŷ_i)².
Suppose that we have two explanatory variables, say 𝑥1 and 𝑥2 . We now want to use a simple linear regression model, i.e. the model with only one of them, to explain 𝑦. So, which one should we use to get a better simple regression model?
We can use SSE to quantify the goodness of the simple linear regression models with 𝑥1 alone and 𝑥2 alone. If the model with 𝑥1 alone has a smaller SSE than the model with 𝑥2 alone, then we would say that 𝑥1 has a more significant effect on 𝑦 than 𝑥2 . So, we prefer to use 𝑥1 .
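This comparison can be sketched in a few lines of Python. All the data below are invented for illustration; the point is only the mechanics of fitting each candidate alone and comparing SSEs:

```python
def sse_of_simple_fit(x, y):
    """SSE of the least-squares line of y on a single explanatory variable x."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    b = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sum((xi - x_bar) ** 2 for xi in x)
    a = y_bar - b * x_bar
    return sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))

y  = [2.0, 4.1, 5.9, 8.2]
x1 = [1.0, 2.0, 3.0, 4.0]   # strongly linear with y
x2 = [1.0, 3.0, 2.0, 4.0]   # more weakly related to y

sse1 = sse_of_simple_fit(x1, y)
sse2 = sse_of_simple_fit(x2, y)
print(sse1 < sse2)  # True: the model with x1 alone fits better here
```

Since sse1 < sse2 for this data, we would prefer the simple regression on x1, exactly as the criterion above prescribes.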
EXAMPLE
Consider the relationship between the weight (𝑥 ) of an automobile and fuel consumption (𝑦), where the latter is measured by gpm --- the amount of fuel (in gallons) that is needed to drive 100 miles. Suppose we collect paired data on the weight and fuel consumption gpm of 10 cars, and have the following picture showing that there exists a linear relationship between them:
Here are the raw paired data of 𝑥 and 𝑦.
According to the above scatter plot, it is reasonable for us to use a simple linear regression to fit the paired data. Thus, we first find
and then by the least-squares approach we have
Finally, we can write down the fitted regression line
ŷ = −0.363 + 1.639𝑥.
If we draw this fitted regression line on the scatter plot, then we have
REMARKS:
According to the least-squares estimate 𝑏 = 1.639, we can say that on average each additional unit (1000 pounds) of weight requires an additional 1.639 gallons of fuel to drive 100 miles. That is, increasing 𝑥 by one unit will increase 𝑦 by 1.639 units on average. Note that the regression line always passes through the point (x̄, ȳ). Why?
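Both remarks can be checked numerically. Since a = ȳ − b x̄, substituting x = x̄ into the fitted line gives a + b x̄ = ȳ, so the line must pass through (x̄, ȳ). The sketch below verifies this on invented toy data, and also checks the slope interpretation using the chapter's fitted line ŷ = −0.363 + 1.639x:

```python
# Check 1: the fitted line passes through (x_bar, y_bar), on invented data.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [1.1, 2.3, 2.9, 4.2, 4.8]

n = len(x)
x_bar, y_bar = sum(x) / n, sum(y) / n
b = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sum((xi - x_bar) ** 2 for xi in x)
a = y_bar - b * x_bar  # this definition is exactly why the line hits (x_bar, y_bar)

print(abs((a + b * x_bar) - y_bar) < 1e-12)  # True

# Check 2: with the chapter's fitted line, one extra unit of x adds 1.639 to y_hat.
f = lambda w: -0.363 + 1.639 * w
print(round(f(3.0) - f(2.0), 3))  # 1.639
```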
In R (https://www.r-project.org/), we can use (i) the function plot() to create a scatter plot:
(ii) the function lm() to get the least-squares estimates:
and (iii) the function abline() to add a straight line to the existing scatter plot.
5 STATISTICAL INFERENCE ABOUT 𝜷𝟎 AND 𝜷𝟏

Recall that in Section 2, according to the least-squares approach, we have the following least-squares ESTIMATES of the unknown true values of 𝛽1 and 𝛽0 , respectively:
b = [Σ_{i=1}^n x_i y_i − (Σ_{i=1}^n x_i)(Σ_{i=1}^n y_i)/n] / [Σ_{i=1}^n x_i² − (Σ_{i=1}^n x_i)²/n] = Σ_{i=1}^n (x_i − x̄)(y_i − ȳ) / Σ_{i=1}^n (x_i − x̄)² = S_XY / S_XX,

a = ȳ − b x̄.
So, if we want to make statistical inferences about the true values of 𝛽0 and 𝛽1 , then we need the random counterparts of these estimates. Thus, we have the following random variables, called least-squares ESTIMATORS of the unknown true values of 𝛽1 and 𝛽0 , respectively:
β̂₁ = Σ_{i=1}^n (x_i − x̄)(Y_i − Ȳ) / Σ_{i=1}^n (x_i − x̄)²

and

β̂₀ = Ȳ − β̂₁ x̄.

We would then ask: How are β̂₀ and β̂₁ distributed around 𝛽0 and 𝛽1 , respectively? How do we construct CONFIDENCE INTERVALS and test HYPOTHESES?
6 MODEL ASSUMPTIONS

To make inferences about the true values of 𝜷𝟎 and 𝜷𝟏 , say to construct their confidence intervals or to do a hypothesis test, we need to make some assumptions first.
Model assumptions (in terms of the random error term)
1. Errors are independent;
2. Errors have constant variance 𝝈𝟐 ;
3. Errors have zero mean;
4. Errors follow a normal distribution.
So, now we can answer the above questions by the following results:
Under assumptions 1-4, we have
β̂₁ ~ N(𝛽1 , σ²/S_XX)   and   β̂₀ ~ N(𝛽0 , σ² Σ_{i=1}^n x_i² / (n S_XX)).
If 𝝈𝟐 is known, then we can use these results directly to construct a confidence interval and to formulate a test statement of 𝛽0 and 𝛽1 . However, we want to deal with a more practical problem, that is, the problem with Unknown 𝝈𝟐 . First, we need to know how to estimate the common population variance of the random error terms.
Recall that the residual is often regarded as an “actual” value of the unobservable random error term. Thus, we can use the sample variance of the residuals to estimate the unknown population variance 𝜎². In Chapter 4, we used unbiasedness to find a “good” estimator of the unknown parameter of our interest. So, let’s find an unbiased estimator of 𝜎². In theory, we can show that
the Mean Squared Error (hereafter, MSE), defined as

S² = Σ_{i=1}^n E_i² / (n − 2) = Σ_{i=1}^n (Y_i − Ŷ_i)² / (n − 2),

is our hero! (Its actual value s² = Σ_{i=1}^n e_i² / (n − 2) = Σ_{i=1}^n (y_i − ŷ_i)² / (n − 2) is also called MSE, for simplicity.)
REMARK: MSE can also be found in the following way:

s² = (S_YY − b S_XY) / (n − 2),

where S_YY = Σ_{i=1}^n (y_i − ȳ)² and S_XY = Σ_{i=1}^n (x_i − x̄)(y_i − ȳ).
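The shortcut in the remark follows from SSE = S_YY − b S_XY, and it can be checked numerically. A small Python sketch on invented data (not the notes' dataset):

```python
# Verify that SSE/(n-2) equals (S_YY - b*S_XY)/(n-2) on a toy dataset.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]   # invented values

n = len(x)
x_bar, y_bar = sum(x) / n, sum(y) / n
s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
s_xx = sum((xi - x_bar) ** 2 for xi in x)
s_yy = sum((yi - y_bar) ** 2 for yi in y)

b = s_xy / s_xx
a = y_bar - b * x_bar

sse = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))
mse_direct   = sse / (n - 2)              # definition of s^2
mse_shortcut = (s_yy - b * s_xy) / (n - 2)  # the remark's formula
print(abs(mse_direct - mse_shortcut) < 1e-9)  # True
```

The shortcut is convenient because S_YY, S_XY and b are usually already computed when fitting the line, so no loop over residuals is needed.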
7 CONFIDENCE INTERVALS FOR 𝜷𝟎 AND 𝜷𝟏

After replacing the unknown 𝜎² by the MSE, we have the following results:
1) T_{n−2} = (β̂₁ − 𝛽1) / √(S²/S_XX) ~ t_{n−2}. Consequently, the 100(1 − 𝛼)% C.I. for 𝛽1 is given by

b ± t_{n−2,α/2} √(s²/S_XX).

2) T_{n−2} = (β̂₀ − 𝛽0) / √(S² Σ_{i=1}^n x_i² / (n S_XX)) ~ t_{n−2}. Consequently, the 100(1 − 𝛼)% C.I. for 𝛽0 is given by

(ȳ − b x̄) ± t_{n−2,α/2} √(s² Σ_{i=1}^n x_i² / (n S_XX)).
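The C.I. for the slope can be sketched in Python. The data are invented, and the critical value t_{3,0.025} ≈ 3.182 (for n = 5, so n − 2 = 3 degrees of freedom) is hard-coded from a standard t table rather than computed:

```python
from math import sqrt

# Invented toy data (n = 5, so n - 2 = 3 degrees of freedom).
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.2, 5.9, 8.1, 9.8]

n = len(x)
x_bar, y_bar = sum(x) / n, sum(y) / n
s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
s_xx = sum((xi - x_bar) ** 2 for xi in x)

b = s_xy / s_xx
a = y_bar - b * x_bar
s2 = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y)) / (n - 2)  # MSE

t_crit = 3.182  # t_{3,0.025}, hard-coded table value for a 95% interval
half_width = t_crit * sqrt(s2 / s_xx)

# 95% C.I. for beta_1: b +/- t_{n-2,alpha/2} * sqrt(s^2 / S_XX)
print(round(b - half_width, 4), round(b + half_width, 4))
```

The interval is centered at b; a wider spread of the xᵢ (larger S_XX) or a smaller MSE shrinks it.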
In R (https://www.r-project.org/), we can use the function confint() to find the confidence interval of each regression coefficient in the case of UNKNOWN 𝝈𝟐 , when all collected data are given.

EXAMPLE

Referring to the previous example, we have
8 HYPOTHESIS TESTING FOR 𝜷𝟎 AND 𝜷𝟏
For the slope 𝛽1 ,

1. One-sided right test: Consider H₀: 𝛽1 = b₁ versus H₁: 𝛽1 > b₁. Intuitively, we reject H₀ if b > c. Consequently, we would reject H₀ at a significance level 𝛼 if the t value (b − b₁)/(s/√S_XX) > t_{n−2,α} (when 𝜎² is UNKNOWN).

2. One-sided left test: Consider H₀: 𝛽1 = b₁ versus H₁: 𝛽1 < b₁. Intuitively, we reject H₀ if b < c. Consequently, we would reject H₀ at a significance level 𝛼 if the t value (b − b₁)/(s/√S_XX) < −t_{n−2,α} (when 𝜎² is UNKNOWN).
3. Two-sided test: Consider H₀: 𝛽1 = b₁ versus H₁: 𝛽1 ≠ b₁. Intuitively, we reject H₀ if b < c₁ or b > c₂. Consequently, we would reject H₀ at a significance level 𝛼 if the absolute t value |b − b₁|/(s/√S_XX) > t_{n−2,α/2} (when 𝜎² is UNKNOWN).

Similarly, for the intercept 𝛽0 ,

1. One-sided right test: Consider H₀: 𝛽0 = b₀ versus H₁: 𝛽0 > b₀. Intuitively, we reject H₀ if a > c. Consequently, we would reject H₀ at a significance level 𝛼 if the t value (a − b₀)/(s √(Σ_{i=1}^n x_i²/(n S_XX))) > t_{n−2,α} (when 𝜎² is UNKNOWN).
2. One-sided left test: Consider H₀: 𝛽0 = b₀ versus H₁: 𝛽0 < b₀. Intuitively, we reject H₀ if a < c. Consequently, we would reject H₀ at a significance level 𝛼 if the t value (a − b₀)/(s √(Σ_{i=1}^n x_i²/(n S_XX))) < −t_{n−2,α} (when 𝜎² is UNKNOWN).

3. Two-sided test: Consider H₀: 𝛽0 = b₀ versus H₁: 𝛽0 ≠ b₀. Intuitively, we reject H₀ if a < c₁ or a > c₂. Consequently, we would reject H₀ at a significance level 𝛼 if the absolute t value |a − b₀|/(s √(Σ_{i=1}^n x_i²/(n S_XX))) > t_{n−2,α/2} (when 𝜎² is UNKNOWN).
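The two-sided slope test can be sketched as follows. The data are invented, and the critical value t_{3,0.025} ≈ 3.182 is a hard-coded table value (n = 5, so 3 degrees of freedom, α = 0.05):

```python
from math import sqrt

# Invented toy data whose least-squares slope is close to 2.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [1.1, 3.0, 4.9, 7.2, 9.0]

n = len(x)
x_bar, y_bar = sum(x) / n, sum(y) / n
s_xx = sum((xi - x_bar) ** 2 for xi in x)
b = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / s_xx
a = y_bar - b * x_bar
s = sqrt(sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y)) / (n - 2))

def reject_two_sided(b1, t_crit=3.182):
    """Reject H0: beta_1 = b1 iff |b - b1| / (s / sqrt(S_XX)) > t_{n-2,alpha/2}."""
    t_value = abs(b - b1) / (s / sqrt(s_xx))
    return t_value > t_crit

print(reject_two_sided(0.0))  # True: the slope is clearly nonzero for this data
print(reject_two_sided(b))    # False: at b1 = b the t value is exactly 0
```

The one-sided tests only change the comparison to (b − b₁)/(s/√S_XX) > t_{n−2,α} or < −t_{n−2,α}; the intercept tests swap in a, b₀ and the standard error s √(Σxᵢ²/(n S_XX)).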
9 PREDICTION

Once we get the fitted regression line, what we want to do next is “PREDICTION”! Given a new observation of the explanatory variable, denoted by x_new, the corresponding value, denoted by y_new, of the response is certainly UNKNOWN, and it is the term we want to predict.
According to the fitted regression model derived from our previous data, we have the following result to predict the value of y_new:

𝒂 + 𝒃 x_new.

It is a point estimate (or often called a predicted value) of y_new. We then denote it by ŷ_new.
Of course, we can use the random counterpart of this point estimate to get an interval estimate of the unknown quantity y_new. Here I skip the details of the proof and show you the following result directly. The 100(1 − 𝛼)% prediction interval for y_new is given by
ŷ_new ± t_{n−2,α/2} s √(1 + 1/n + (x_new − x̄)²/S_XX).

REMARK

The terms ŷ_new, s, x̄ and S_XX in the above result are based on the previous data.
QUESTION

Recall that in our earlier automobile example, we have already found that the fitted regression line based on the previous data set is
ŷ = −0.363 + 1.639𝑥.
Now, we want to (i) predict the value of the fuel consumption for an automobile weighing 2500 pounds, and (ii) get a 95% prediction interval of this value we predict.
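Part (i) follows directly from the fitted line, remembering that weight is measured in units of 1000 pounds. A short Python sketch:

```python
# Part (i): predicted fuel consumption for a 2500-pound automobile,
# using the chapter's fitted line y_hat = -0.363 + 1.639 * x.
x_new = 2.5  # 2500 pounds, in units of 1000 pounds
y_hat_new = -0.363 + 1.639 * x_new
print(round(y_hat_new, 4))  # 3.7345 gallons per 100 miles

# Part (ii) would compute
#   y_hat_new +/- t_{8,0.025} * s * sqrt(1 + 1/10 + (x_new - x_bar)**2 / S_XX),
# with t_{8,0.025} ~= 2.306 for n = 10, but s, x_bar and S_XX require the raw
# 10-car data, which is not reproduced in this text.
```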
In R (https://www.r-project.org/), we can use the function predict() to find the predicted value and to get the prediction interval in the case of UNKNOWN 𝝈𝟐 , when all collected data are given.