Title | Chapter 6 - YU, Chi Wai
---|---
Course | Applied Statistics
Institution | The Hong Kong University of Science and Technology
MATH2411: Applied Statistics | Dr. YU, Chi Wai
Chapter 6: SIMPLE LINEAR REGRESSION

1 WHAT IS A SIMPLE LINEAR REGRESSION?
Regression is a very useful statistical model used to capture a relationship between related variables of our interest. If the relationship is shown to be “LINEAR”, then the regression is said to be a linear regression. “SIMPLE” means that there is only ONE variable (called explanatory variable labeled by 𝑥 ) used to explain our target variable (called response variable labeled by 𝑦) --- the variable we want to explain or predict.
A simple linear regression
is a statistical model used to study the relationship between 𝒚 and 𝒙 if they are related LINEARLY.
A scatter plot is a powerful graphical method used to visualize the relationship between 𝑦 and 𝑥. To be more precise, we would have a collection of paired data of 𝑥 and 𝑦, denoted by {(𝑥𝑖 , 𝑦𝑖 ), 𝑖 = 1, … , 𝑛}, and then plot a graph of 𝑦 against 𝑥, like the picture on the right.
Note that in Math class, we probably learned to see relationships displayed as GRAPHS. Given 𝑥 , we can predict 𝑦.
However, in statistics, things are never so clean. Data do not perfectly lie on a line or curve!
If the scatter plot shows a “linear” pattern, then we can FIND a straight line to fit the messy data statistically. Throughout this course, we only consider two variables (a response variable 𝑦 and an explanatory variable 𝑥) which are related linearly, and discuss how to fit their data statistically by the following model:
𝑌 = 𝛽0 + 𝛽1 𝑥 + 𝜀
Note that 𝜀 is random and is used to capture all the uncertainty of the model, such as measurement error, and that the regression coefficients 𝛽0 and 𝛽1 are unknown.
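The role of the random term 𝜀 can be illustrated with a small simulation. The sketch below uses invented values 𝛽0 = 1, 𝛽1 = 2 and 𝜎 = 0.5 (these numbers are illustrative assumptions, not from the notes):

```python
import random

# Hypothetical true coefficients and error spread -- invented for illustration.
beta0, beta1, sigma = 1.0, 2.0, 0.5

random.seed(0)
x = [i / 10 for i in range(1, 21)]  # fixed explanatory values

# Each response is the linear part beta0 + beta1*x plus a random error epsilon.
y = [beta0 + beta1 * xi + random.gauss(0, sigma) for xi in x]

# The observed points scatter around the true line rather than lying on it.
errors = [yi - (beta0 + beta1 * xi) for xi, yi in zip(x, y)]
print(any(abs(e) > 0 for e in errors))  # True: data never fit the line exactly
```

Plotting these (𝑥, 𝑦) pairs would produce exactly the kind of "messy" scatter plot described above.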
2 LEAST SQUARES APPROACH

How do we fit the data appropriately by a straight line? Which straight line should we use? Or equivalently, what are the “good” estimates of 𝛽0 and 𝛽1 used to fit the data? Intuitively, we would like to find a straight line that is “close” to the collected data. In other words, we need a measure of the closeness of the straight line to the data, or equivalently, an estimation criterion for finding the “good” estimates of 𝛽0 and 𝛽1. The “LEAST SQUARES approach” is the most commonly used method in statistics for finding the straight line that is close to the data.
To be more precise, we have the following function 𝑆(⋅,⋅) of the total squared difference between the data and a straight line:
S(u, v) = Σ_{i=1}^n [y_i − (u + v x_i)]².
As the name of the “LEAST SQUARES” approach indicates, we then find a point (𝑎, 𝑏) at which the function 𝑆(⋅,⋅) attains its minimum. Finally, we have the following result:
b = [Σ_{i=1}^n x_i y_i − (Σ_{i=1}^n x_i)(Σ_{i=1}^n y_i)/n] / [Σ_{i=1}^n x_i² − (Σ_{i=1}^n x_i)²/n] = Σ_{i=1}^n (x_i − x̄)(y_i − ȳ) / Σ_{i=1}^n (x_i − x̄)² = S_XY / S_XX,

a = ȳ − b x̄.
Hereafter, 𝑎 and 𝑏 are called the least-squares ESTIMATES of the unknown true values of 𝛽0 and 𝛽1 , respectively.
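The formulas above translate directly into code. The following is a minimal Python sketch on invented data (the notes' 10-car dataset is not reproduced here, so these numbers are made up):

```python
# Least-squares estimates b = S_XY / S_XX and a = y_bar - b * x_bar,
# computed on an invented, roughly linear dataset.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.9, 5.1, 7.0, 9.2, 10.8]

n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n

# S_XY and S_XX exactly as defined in the notes.
s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
s_xx = sum((xi - x_bar) ** 2 for xi in x)

b = s_xy / s_xx        # least-squares estimate of beta_1
a = y_bar - b * x_bar  # least-squares estimate of beta_0
print(round(b, 4), round(a, 4))  # roughly 1.99 and 1.03 for this data
```

For this toy dataset, S_XY = 19.9 and S_XX = 10, so b = 1.99 and a = 7 − 1.99·3 = 1.03, matching the two equivalent forms of the formula.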
3 FITTED REGRESSION LINE

Once we have the least-squares estimates, we can write down the so-called fitted regression line (or sometimes called the estimated regression line):
ŷ = 𝑎 + 𝑏𝑥 .
If we substitute 𝑥𝑖 (for 𝑖 = 1, … , 𝑛) into the fitted regression line, then we get a fitted value, labelled ŷ𝑖 , of the 𝑖th observation 𝑦𝑖 of the response variable, and then have the so-called residual 𝑒𝑖 of 𝑦𝑖 , where 𝑒𝑖 is defined as

𝑒𝑖 = 𝑦𝑖 − ŷ𝑖 .
Remark that in practice the residual 𝑒𝑖 is regarded as an “actual” value of the unobservable random term 𝜀𝑖 in the simple linear regression, and it plays a very important role in regression analysis because we can use it to perform model diagnostics and quantify the goodness of the regression model.
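One handy consequence of the least-squares formulas is that residuals from a fit with an intercept always sum to zero, since a = ȳ − b x̄ forces Σeᵢ = n(ȳ − a − b x̄) = 0. A Python sketch on invented data:

```python
# Fitted values and residuals for a toy dataset (invented numbers).
x = [1.0, 2.0, 3.0, 4.0]
y = [1.2, 1.9, 3.2, 3.8]

n = len(x)
x_bar, y_bar = sum(x) / n, sum(y) / n
b = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sum((xi - x_bar) ** 2 for xi in x)
a = y_bar - b * x_bar

y_hat = [a + b * xi for xi in x]                    # fitted values
residuals = [yi - yh for yi, yh in zip(y, y_hat)]  # e_i = y_i - y_hat_i

# For least squares with an intercept, the residuals sum to (numerically) zero.
print(abs(sum(residuals)) < 1e-9)  # True
```

This zero-sum property is a quick sanity check when implementing regression by hand.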
4 RESIDUAL SUM OF SQUARES (OR SUM OF SQUARED ERRORS)

The residual sum of squares, also called the Sum of Squared Errors (hereafter, SSE), is one commonly used measure for evaluating the goodness of a simple linear regression model. As its name suggests, this measure is defined as
SSE = Σ_{i=1}^n (y_i − ŷ_i)².
Suppose that we have two explanatory variables, say 𝑥1 and 𝑥2 . We now want to use a simple linear regression model, i.e. the model with only one of them, to explain 𝑦. So, which one should we use to get a better simple regression model?
We can use SSE to quantify the goodness of the simple linear regression models with 𝑥1 alone and 𝑥2 alone. If the model with 𝑥1 alone has a smaller SSE than the model with 𝑥2 alone, then we would say that 𝑥1 has a more significant effect on 𝑦 than 𝑥2 . So, we prefer to use 𝑥1 .
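This comparison can be sketched in a few lines of Python. All the data below are invented for illustration; the point is only the mechanics of fitting each candidate alone and comparing SSEs:

```python
def sse_of_simple_fit(x, y):
    """SSE of the least-squares line of y on a single explanatory variable x."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    b = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sum((xi - x_bar) ** 2 for xi in x)
    a = y_bar - b * x_bar
    return sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))

y  = [2.0, 4.1, 5.9, 8.2]
x1 = [1.0, 2.0, 3.0, 4.0]   # strongly linear with y
x2 = [1.0, 3.0, 2.0, 4.0]   # more weakly related to y

sse1 = sse_of_simple_fit(x1, y)
sse2 = sse_of_simple_fit(x2, y)
print(sse1 < sse2)  # True: the model with x1 alone fits better here
```

Since sse1 < sse2 for this data, we would prefer the simple regression on x1, exactly as the criterion above prescribes.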
EXAMPLE
Consider the relationship between the weight (𝑥 ) of an automobile and fuel consumption (𝑦), where the latter is measured by gpm --- the amount of fuel (in gallons) that is needed to drive 100 miles. Suppose we collect paired data on the weight and fuel consumption gpm of 10 cars, and have the following picture showing that there exists a linear relationship between them:
Here are the raw paired data of 𝑥 and 𝑦.
According to the above scatter plot, it is reasonable for us to use a simple linear regression to fit the paired data. Thus, we first find
and then by the least-squares approach we have
Finally, we can write down the fitted regression line
ŷ = −0.363 + 1.639𝑥.
If we draw this fitted regression line on the scatter plot, then we have
REMARKS:
According to the least-squares estimate 𝑏 = 1.639, we can say that on average each additional unit (1000 pounds) of weight requires an additional 1.639 gallons of fuel to drive 100 miles. That is, increasing 𝑥 by one unit will increase 𝑦 by 1.639 units on average. Note that the regression line always passes through the point (x̄, ȳ). Why?
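Both remarks can be checked numerically. Since a = ȳ − b x̄, substituting x = x̄ into the fitted line gives a + b x̄ = ȳ, so the line must pass through (x̄, ȳ). The sketch below verifies this on invented toy data, and also checks the slope interpretation using the chapter's fitted line ŷ = −0.363 + 1.639x:

```python
# Check 1: the fitted line passes through (x_bar, y_bar), on invented data.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [1.1, 2.3, 2.9, 4.2, 4.8]

n = len(x)
x_bar, y_bar = sum(x) / n, sum(y) / n
b = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sum((xi - x_bar) ** 2 for xi in x)
a = y_bar - b * x_bar  # this definition is exactly why the line hits (x_bar, y_bar)

print(abs((a + b * x_bar) - y_bar) < 1e-12)  # True

# Check 2: with the chapter's fitted line, one extra unit of x adds 1.639 to y_hat.
f = lambda w: -0.363 + 1.639 * w
print(round(f(3.0) - f(2.0), 3))  # 1.639
```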
In R (https://www.r-project.org/), we can use (i) the function plot() to create a scatter plot:
(ii) the function lm() to get the least-squares estimates:
and (iii) the function abline() to add a straight line to the existing scatter plot.
5 STATISTICAL INFERENCE ABOUT 𝜷𝟎 AND 𝜷𝟏

Recall that in Section 2, according to the least-squares approach, we have the following least-squares ESTIMATES of the unknown true values of 𝛽1 and 𝛽0 , respectively:
b = [Σ_{i=1}^n x_i y_i − (Σ_{i=1}^n x_i)(Σ_{i=1}^n y_i)/n] / [Σ_{i=1}^n x_i² − (Σ_{i=1}^n x_i)²/n] = Σ_{i=1}^n (x_i − x̄)(y_i − ȳ) / Σ_{i=1}^n (x_i − x̄)² = S_XY / S_XX,

a = ȳ − b x̄.
So, if we want to make statistical inferences about the true values of 𝛽0 and 𝛽1 , then we need the random counterparts of these estimates. Thus, we have the following random variables, called least-squares ESTIMATORS of the unknown true values of 𝛽1 and 𝛽0 , respectively:
β̂₁ = Σ_{i=1}^n (x_i − x̄)(Y_i − Ȳ) / Σ_{i=1}^n (x_i − x̄)²

and

β̂₀ = Ȳ − β̂₁ x̄.

We would then ask: How are β̂₀ and β̂₁ distributed around 𝛽0 and 𝛽1 , respectively? How do we construct CONFIDENCE INTERVALS and test HYPOTHESES?
6 MODEL ASSUMPTIONS

To make inferences about the true values of 𝜷𝟎 and 𝜷𝟏 , say to construct their confidence intervals or to do a hypothesis test, we need to make some assumptions first.
Model assumptions (in terms of the random error term)
1. Errors are independent;
2. Errors have constant variance 𝝈𝟐 ;
3. Errors have zero mean;
4. Errors follow a normal distribution.
So, now we can answer the above questions by the following results:
Under assumptions 1-4, we have
β̂₁ ~ N(𝛽1 , σ²/S_XX)   and   β̂₀ ~ N(𝛽0 , σ² Σ_{i=1}^n x_i² / (n S_XX)).
If 𝝈𝟐 is known, then we can use these results directly to construct a confidence interval and to formulate a test statement of 𝛽0 and 𝛽1 . However, we want to deal with a more practical problem, that is, the problem with Unknown 𝝈𝟐 . First, we need to know how to estimate the common population variance of the random error terms.
Recall that the residual is often regarded as an “actual” value of the unobservable random error term. Thus, we can use the sample variance of the residuals to estimate the unknown population variance 𝜎². In Chapter 4, we used unbiasedness to find a “good” estimator of the unknown parameter of our interest. So, let’s find an unbiased estimator of 𝜎². In theory, we can show that
the Mean Squared Error (hereafter, MSE), defined as

S² = Σ_{i=1}^n E_i² / (n − 2) = Σ_{i=1}^n (Y_i − Ŷ_i)² / (n − 2),

is our hero! (Its actual value s² = Σ_{i=1}^n e_i² / (n − 2) = Σ_{i=1}^n (y_i − ŷ_i)² / (n − 2) is also called MSE, for simplicity.)
REMARK: MSE can also be found in the following way:

s² = (S_YY − b S_XY) / (n − 2),

where S_YY = Σ_{i=1}^n (y_i − ȳ)² and S_XY = Σ_{i=1}^n (x_i − x̄)(y_i − ȳ).
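The shortcut in the remark follows from SSE = S_YY − b S_XY, and it can be checked numerically. A small Python sketch on invented data (not the notes' dataset):

```python
# Verify that SSE/(n-2) equals (S_YY - b*S_XY)/(n-2) on a toy dataset.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]   # invented values

n = len(x)
x_bar, y_bar = sum(x) / n, sum(y) / n
s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
s_xx = sum((xi - x_bar) ** 2 for xi in x)
s_yy = sum((yi - y_bar) ** 2 for yi in y)

b = s_xy / s_xx
a = y_bar - b * x_bar

sse = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))
mse_direct   = sse / (n - 2)              # definition of s^2
mse_shortcut = (s_yy - b * s_xy) / (n - 2)  # the remark's formula
print(abs(mse_direct - mse_shortcut) < 1e-9)  # True
```

The shortcut is convenient because S_YY, S_XY and b are usually already computed when fitting the line, so no loop over residuals is needed.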
7 CONFIDENCE INTERVALS FOR 𝜷𝟎 AND 𝜷𝟏

After replacing the unknown 𝜎² by the MSE, we have the following results:
1) T_{n−2} = (β̂₁ − 𝛽1) / √(S²/S_XX) ~ t_{n−2}. Consequently, the 100(1 − 𝛼)% C.I. for 𝛽1 is given by

b ± t_{n−2,α/2} √(s²/S_XX).

2) T_{n−2} = (β̂₀ − 𝛽0) / √(S² Σ_{i=1}^n x_i² / (n S_XX)) ~ t_{n−2}. Consequently, the 100(1 − 𝛼)% C.I. for 𝛽0 is given by

(ȳ − b x̄) ± t_{n−2,α/2} √(s² Σ_{i=1}^n x_i² / (n S_XX)).
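The C.I. for the slope can be sketched in Python. The data are invented, and the critical value t_{3,0.025} ≈ 3.182 (for n = 5, so n − 2 = 3 degrees of freedom) is hard-coded from a standard t table rather than computed:

```python
from math import sqrt

# Invented toy data (n = 5, so n - 2 = 3 degrees of freedom).
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.2, 5.9, 8.1, 9.8]

n = len(x)
x_bar, y_bar = sum(x) / n, sum(y) / n
s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
s_xx = sum((xi - x_bar) ** 2 for xi in x)

b = s_xy / s_xx
a = y_bar - b * x_bar
s2 = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y)) / (n - 2)  # MSE

t_crit = 3.182  # t_{3,0.025}, hard-coded table value for a 95% interval
half_width = t_crit * sqrt(s2 / s_xx)

# 95% C.I. for beta_1: b +/- t_{n-2,alpha/2} * sqrt(s^2 / S_XX)
print(round(b - half_width, 4), round(b + half_width, 4))
```

The interval is centered at b; a wider spread of the xᵢ (larger S_XX) or a smaller MSE shrinks it.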
In R (https://www.r-project.org/), we can use the function confint() to find the confidence interval of each regression coefficient in the case of UNKNOWN 𝝈𝟐 , when all collected data are given.

EXAMPLE

Referring to the previous example, we have
8 HYPOTHESIS TESTING FOR 𝜷𝟎 AND 𝜷𝟏
For the slope 𝛽1 ,

1. One-sided right test: Consider H₀: 𝛽1 = b₁ versus H₁: 𝛽1 > b₁. Intuitively, we reject H₀ if b > c. Consequently, we would reject H₀ at a significance level 𝛼 if the t value (b − b₁)/(s/√S_XX) > t_{n−2,α} (when 𝜎² is UNKNOWN).

2. One-sided left test: Consider H₀: 𝛽1 = b₁ versus H₁: 𝛽1 < b₁. Intuitively, we reject H₀ if b < c. Consequently, we would reject H₀ at a significance level 𝛼 if the t value (b − b₁)/(s/√S_XX) < −t_{n−2,α} (when 𝜎² is UNKNOWN).
3. Two-sided test: Consider H₀: 𝛽1 = b₁ versus H₁: 𝛽1 ≠ b₁. Intuitively, we reject H₀ if b < c₁ or b > c₂. Consequently, we would reject H₀ at a significance level 𝛼 if the absolute t value |b − b₁|/(s/√S_XX) > t_{n−2,α/2} (when 𝜎² is UNKNOWN).

Similarly, for the intercept 𝛽0 ,

1. One-sided right test: Consider H₀: 𝛽0 = b₀ versus H₁: 𝛽0 > b₀. Intuitively, we reject H₀ if a > c. Consequently, we would reject H₀ at a significance level 𝛼 if the t value (a − b₀)/(s √(Σ_{i=1}^n x_i²/(n S_XX))) > t_{n−2,α} (when 𝜎² is UNKNOWN).
2. One-sided left test: Consider H₀: 𝛽0 = b₀ versus H₁: 𝛽0 < b₀. Intuitively, we reject H₀ if a < c. Consequently, we would reject H₀ at a significance level 𝛼 if the t value (a − b₀)/(s √(Σ_{i=1}^n x_i²/(n S_XX))) < −t_{n−2,α} (when 𝜎² is UNKNOWN).

3. Two-sided test: Consider H₀: 𝛽0 = b₀ versus H₁: 𝛽0 ≠ b₀. Intuitively, we reject H₀ if a < c₁ or a > c₂. Consequently, we would reject H₀ at a significance level 𝛼 if the absolute t value |a − b₀|/(s √(Σ_{i=1}^n x_i²/(n S_XX))) > t_{n−2,α/2} (when 𝜎² is UNKNOWN).
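The two-sided slope test can be sketched as follows. The data are invented, and the critical value t_{3,0.025} ≈ 3.182 is a hard-coded table value (n = 5, so 3 degrees of freedom, α = 0.05):

```python
from math import sqrt

# Invented toy data whose least-squares slope is close to 2.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [1.1, 3.0, 4.9, 7.2, 9.0]

n = len(x)
x_bar, y_bar = sum(x) / n, sum(y) / n
s_xx = sum((xi - x_bar) ** 2 for xi in x)
b = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / s_xx
a = y_bar - b * x_bar
s = sqrt(sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y)) / (n - 2))

def reject_two_sided(b1, t_crit=3.182):
    """Reject H0: beta_1 = b1 iff |b - b1| / (s / sqrt(S_XX)) > t_{n-2,alpha/2}."""
    t_value = abs(b - b1) / (s / sqrt(s_xx))
    return t_value > t_crit

print(reject_two_sided(0.0))  # True: the slope is clearly nonzero for this data
print(reject_two_sided(b))    # False: at b1 = b the t value is exactly 0
```

The one-sided tests only change the comparison to (b − b₁)/(s/√S_XX) > t_{n−2,α} or < −t_{n−2,α}; the intercept tests swap in a, b₀ and the standard error s √(Σxᵢ²/(n S_XX)).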
9 PREDICTION

Once we get the fitted regression line, what we want to do next is “PREDICTION”! Given a new observation of the explanatory variable, denoted by x_new, the corresponding value, denoted by y_new, of the response is certainly UNKNOWN, and it is the term we want to predict.
According to the fitted regression model derived from our previous data, we have the following result to predict the value of y_new:

𝒂 + 𝒃 x_new.

It is a point estimate (or often called a predicted value) of y_new. We then denote it by ŷ_new.
Of course, we can use the random counterpart of this point estimate to get an interval estimate of the unknown quantity y_new. Here I skip the details of the proof and show you the following result directly. The 100(1 − 𝛼)% prediction interval for y_new is given by
ŷ_new ± t_{n−2,α/2} s √(1 + 1/n + (x_new − x̄)²/S_XX).

REMARK

The terms ŷ_new, s, x̄ and S_XX in the above result are based on the previous data.
QUESTION

Recall that in our earlier automobile example, we have already found that the fitted regression line based on the previous data set is
ŷ = −0.363 + 1.639𝑥.
Now, we want to (i) predict the value of the fuel consumption for an automobile weighing 2500 pounds, and (ii) get a 95% prediction interval of this value we predict.
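Part (i) follows directly from the fitted line, remembering that weight is measured in units of 1000 pounds. A short Python sketch:

```python
# Part (i): predicted fuel consumption for a 2500-pound automobile,
# using the chapter's fitted line y_hat = -0.363 + 1.639 * x.
x_new = 2.5  # 2500 pounds, in units of 1000 pounds
y_hat_new = -0.363 + 1.639 * x_new
print(round(y_hat_new, 4))  # 3.7345 gallons per 100 miles

# Part (ii) would compute
#   y_hat_new +/- t_{8,0.025} * s * sqrt(1 + 1/10 + (x_new - x_bar)**2 / S_XX),
# with t_{8,0.025} ~= 2.306 for n = 10, but s, x_bar and S_XX require the raw
# 10-car data, which is not reproduced in this text.
```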
In R (https://www.r-project.org/), we can use the function predict() to find the predicted value and to get the prediction interval in the case of UNKNOWN 𝝈𝟐 , when all collected data are given.