
ST3006 REGRESSION ANALYSIS NOTE

Week 1 – Introduction
• Regression analysis is a statistical technique for investigating and modeling the relationships between variables.
• First plot the data and check whether some relationship exists.
• Identify the nature of the relationship using a mathematical equation.
• Course content:
  - Correlation and introduction to regression analysis
  - Simple linear regression
  - Parameter estimation and inferences about the model; prediction
  - Multiple regression
  - Regression using SPSS and R
  - Comparison of regression models
  - Categorical variables as predictors
  - Variable selection methods
  - Practical examples of regression analysis using SPSS and R

Consider the following situations:
• It is required to find the relationship between:
  - the height and weight of a group of children – simple regression model (S.R.M)
  - the family income and family expenditure of households – S.R.M
  - the marks in the Statistics and Mathematics subjects of students of this batch – S.R.M
• A farmer distributes various amounts of fertilizer, pesticide and irrigation water to different plots and investigates the variation in crop growth – multiple regression model (M.R.M)
• A person is interested in finding out whether the weight of a newborn baby changes with the length of time over the first 20 weeks after birth – time series regression

Week 1 – Simple regression model
• Consider the relationship between revenue and advertising cost. Revenue also depends on many other influencing factors:
  - number of workers
  - pricing
  - transport cost
  - …
• A simple regression model is used to find the relationship between revenue and advertising cost (only one X variable).
• Multiple regression is used for the relationship between revenue and a set of X factors.
• Dependent/response/Y variable – the variable whose variation is being studied.
• Predictor/explanatory/independent/X variable – determines the change in the dependent variable.
• One Y variable and one X variable: simple regression analysis. More than one X variable: multiple regression analysis.

Relationship between variables

Simple regression model

ε is the error term, made up of the other influencing factors affecting the response Y. The regression model assumes that:
• there is a probability distribution of Y for each level of X;
• the means of these probability distributions vary in some systematic fashion with X (the means of the distributions fall on a line).

Error (ε) term assumption:
• The errors follow a normal distribution.
The simple linear regression model is

Y_i = β_0 + β_1 X_i + ε_i,

where
• Y_i = the i-th observation of the response in the sample,
• X_i = the i-th observation of the predictor variable X in the sample,
• ε_i = the error in the linear approximation of Y_i.
Then,

• ε_i and ε_j are uncorrelated, so that the covariance between ε_i and ε_j is zero (i ≠ j).

The correlation coefficient r describes the strength (or degree) of the linear relationship between two quantitative variables using a single value.

Week 2 – Simple linear regression

[Figure: scatter plots illustrating values of the correlation coefficient – r = −1 (perfect negative relationship), r = −0.7 (negative relationship), r = −0.4 (negative relationship), r = 0 (no relationship), r = 0.3 (positive relationship), r = 0.8 (positive relationship), r = 1 (perfect positive relationship).]

r = Σ(x_i − x̄)(y_i − ȳ) / √[ Σ(x_i − x̄)² · Σ(y_i − ȳ)² ],  where the sums run over i = 1, 2, 3, …, n.

• The correlation coefficient itself does not indicate how the two variables are related or the nature of the relationship.
• The correlation between X and Y is the same as the correlation between Y and X, and it will not provide information to predict values; this is why regression analysis is important.
• Note that there can be strong curved (nonlinear) relationships between two variables that still have low correlation coefficients.

Consider the relationship Y = X² over the range −5 ≤ X ≤ 5. Observe that there is a poor linear relationship, indicated by a low correlation coefficient. However, Y = X² is a perfect curved (nonlinear) relationship.
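As a quick check, a minimal R sketch (the variable names are ours) shows that the linear correlation here is exactly zero even though the relationship is perfectly deterministic:

```r
x <- -5:5     # values of X in the range -5 to 5
y <- x^2      # perfect curved relationship Y = X^2
cor(x, y)     # Pearson correlation coefficient; equals 0 for this symmetric range
plot(x, y)    # the scatter plot still shows the parabola clearly
```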

Why do we need regression analysis?
• To obtain an equation that describes the response variable as a function of the predictor variables, which can be used for forecasting and prediction.
• To determine a set of important explanatory variables.
• To compare relationships between variables over several sets of data.

Usual steps followed in a regression analysis:
• Statement of the problem
• Selection of potentially relevant variables
• Data collection
• Model specification
• Choice of fitting method
• Model fitting
• Model validation and criticism
• Using the chosen model for the solution of the posed problem

Simple linear regression model: we considered this model previously. The next step is to estimate the unknown parameters β_0 and β_1 in the regression model. This process is called fitting the model to the data. In parameter estimation, the best-fitting line is called the least squares regression line.

The errors from the fitted line are called the ordinary least squares residuals. Each one is e_i = y_i − ŷ_i.

These residuals are important for future calculations.

Week 3 – Parameter estimation
Estimate β_0 and β_1:
• deriving β̂_0
• deriving β̂_1
(The least squares estimators are obtained by minimizing the sum of squared residuals.)
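The derivations themselves are not reproduced in these notes; for reference, the standard least squares estimators that result are:

$$\hat\beta_1 = \frac{\sum_{i=1}^{n}(x_i-\bar{x})(y_i-\bar{y})}{\sum_{i=1}^{n}(x_i-\bar{x})^2}, \qquad \hat\beta_0 = \bar{y} - \hat\beta_1\,\bar{x}.$$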

Standard regression assumptions:
• The Y and X variables are linearly related.
• ε_i is a normally distributed random variable with mean zero (i.e. E[ε_i] = 0) and constant variance σ² (i.e. Var[ε_i] = σ²) for all i = 1, 2, 3, …, n.

• 𝜀 and 𝜀 are uncorrelated so that the covariance between 𝜀 and 𝜀 is zero 𝑖. 𝑒. 𝐶𝑜𝑣 𝜀 , 𝜀  0 for all 𝑖𝑗; 𝑖  𝑗.

Week 3 – Formal statements of the model
• The values of the explanatory variable X are treated as fixed in repeated samples.
• β_0 and β_1 are unknown constant parameters, but their estimators β̂_0 and β̂_1 are random variables (not constants).

Important features of the model
• y_i, the i-th observation, is the sum of two components: a constant term β_0 + β_1 x_i and a random error term ε_i.
• Var(Y_i) = Var(β_0 + β_1 X_i + ε_i) = Var(ε_i) = σ², since Var(β_0 + β_1 X_i) = 0.

σ² is unknown in general, but we can estimate it from the observed data.

 𝛽𝑋 • 𝑌  𝛽   𝛽, the change estimated mean Y(𝑌) unit increase X  , the estimated mean 𝛽 probability distribution Y then X=0 𝛽 ,intercept of regression line 𝛽 ,slope of the regression line

Week 3 – Exercises

1. Plastic hardness data: Experience with a certain type of plastic indicates that a relationship exists between the hardness (measured in Brinell units) of items moulded from the plastic (Y) and the elapsed time (in hours) since termination of the moulding process (X). The following data have been collected for the analysis.

X (time, hours) | Y (hardness)
16  | 199
16  | 205
16  | 196
16  | 200
24  | 218
24  | 220
24  | 215
24  | 223
32  | 237
32  | 234
32  | 235
32  | 230
40  | 250
40  | 248
40  | 253
40  | 246

a) Draw a scatter plot using any package you know and observe the pattern of the data.
b) If a linear pattern is observed in the curve that joins the means of the distributions of Y, estimate the coefficients of the line using the estimators of β̂_0 and β̂_1 given in the last lesson.
c) Interpret the estimates once they have been computed.

2. Age/Strength data: It is suspected that the strength of some material (measured in some related unit) is related to its age (in years). The collected data are given below. Follow the steps of the above example.

Strength | Age
2158.7   | 15.5
1678.15  | 23.75
2316     | 8
2061.3   | 17
2207.5   | 5.5
1708.3   | 19
1784.7   | 24
2575     | 2.5
2357.9   | 7.5
2356.7   | 11
2165.2   | 13
2399.55  | 3.75
1679.8   | 25
2336.75  | 9.75
1765.3   | 22
2053.5   | 18
2414.4   | 6
2200.5   | 12.5
2654.2   | 2
1753.7   | 21.5
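A minimal R sketch for parts (a) and (b) of Exercise 1, computing β̂_0 and β̂_1 directly from the formulas given in the last lesson (the variable names are ours):

```r
# Plastic hardness data (Exercise 1)
x <- c(16, 16, 16, 16, 24, 24, 24, 24, 32, 32, 32, 32, 40, 40, 40, 40)  # elapsed time (hours)
y <- c(199, 205, 196, 200, 218, 220, 215, 223,
       237, 234, 235, 230, 250, 248, 253, 246)                          # hardness (Brinell units)

plot(x, y, xlab = "Elapsed time (hours)", ylab = "Hardness (Brinell)")   # part (a): scatter plot

b1 <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)          # slope estimate
b0 <- mean(y) - b1 * mean(x)                                             # intercept estimate
c(b0 = b0, b1 = b1)                                                      # part (b)

coef(lm(y ~ x))                                                          # check against R's built-in fit
```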

Week 4 – Model prediction
Population regression line: Y = β_0 + β_1 X + ε.
For a random sample from the population, the regression model is Y_i = β_0 + β_1 X_i + ε_i, i = 1, 2, 3, …, n.

For the observed y values given x: y_i is the i-th observed value in the sample, with y_i = β_0 + β_1 x_i + ε_i, i = 1, 2, 3, …, n. For the randomly observed Y values given x, the Y_i are random variables.

Properties of the least squares method
1. β̂_0 and β̂_1 are linear functions of the Y_i values, i = 1, 2, 3, …, n.
2. The two estimators are unbiased, i.e. E(β̂_0) = β_0 and E(β̂_1) = β_1.
3. β̂_0 and β̂_1 are the best linear unbiased estimators of β_0 and β_1 respectively (Gauss–Markov theorem; the proof is not covered).
• Variance of the least squares estimators (given below).
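The variance expressions themselves are not legible in this copy of the slides; the standard results under the model assumptions are:

$$\mathrm{Var}(\hat\beta_1) = \frac{\sigma^2}{\sum_{i=1}^{n}(x_i-\bar{x})^2}, \qquad \mathrm{Var}(\hat\beta_0) = \sigma^2\left(\frac{1}{n} + \frac{\bar{x}^2}{\sum_{i=1}^{n}(x_i-\bar{x})^2}\right).$$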

Some other properties of LS estimation
1. The sum of the residuals is zero, i.e. Σ e_i = Σ (y_i − ŷ_i) = 0.
2. The mean of the predicted values equals the mean of the response, i.e. (1/n) Σ y_i = (1/n) Σ ŷ_i.
3. The least squares line always passes through the centroid of the data, (x̄, ȳ).

Estimation of σ²
Population variance: σ² = Σ (y_i − μ)² / N.
Sample variance: S² = Σ (y_i − ȳ)² / (n − 1). The sample size is n and the degrees of freedom are (n − 1), because ȳ is estimated instead of μ.
Error (residual) sum of squares: ESS = Σ (y_i − ŷ_i)² = Σ e_i².
ESS has (n − 2) degrees of freedom, because β̂_0 and β̂_1 are estimated.
Residual/error mean square: EMS = ESS / (n − 2) = Σ e_i² / (n − 2).

Theorem: For linear regression models in which the error assumptions are satisfied, ESS/σ² follows a chi-square distribution with (n − 2) degrees of freedom and is independent of β̂_0 and β̂_1, i.e. ESS/σ² ~ χ²_{n−2}.

Now note that the mean of the χ²_{n−2} distribution is (n − 2). Hence, E(ESS/σ²) = n − 2, and we obtain E(ESS/(n − 2)) = σ². Thus, EMS is an unbiased estimator for σ².

Analysis of Variance (ANOVA)

(A) – deviation of the observations from their overall mean, which is known as the total sum of squares (TSS).
(B) – deviation of the observations from their fitted values, which is known as the residual sum of squares (ESS).
(C) – deviation of the fitted regression values from the mean. If the fitted line has zero slope (there is no regression), this quantity is zero. Thus, component (C) is called the regression sum of squares (RSS).
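In symbols, these three components give the usual decomposition (a standard identity, stated here for reference):

$$\underbrace{\sum_{i=1}^{n}(y_i-\bar{y})^2}_{\text{TSS (A)}} \;=\; \underbrace{\sum_{i=1}^{n}(\hat{y}_i-\bar{y})^2}_{\text{RSS (C)}} \;+\; \underbrace{\sum_{i=1}^{n}(y_i-\hat{y}_i)^2}_{\text{ESS (B)}}.$$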

Week 5 – Are the assumptions true? Goodness of fit

Residual plot
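A minimal R sketch of residual plots for checking the assumptions, reusing the x and y vectors from the plastic hardness sketch above (our illustration, not the lecture's own code):

```r
fit <- lm(y ~ x)                                   # fitted simple linear regression model
plot(fitted(fit), resid(fit),
     xlab = "Fitted values", ylab = "Residuals")   # residuals should scatter randomly around zero
abline(h = 0, lty = 2)
qqnorm(resid(fit)); qqline(resid(fit))             # rough check of the normality assumption
```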

Week 6 – Confidence intervals and hypothesis testing on the slope β_1
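The detailed slide content for this week is not legible in this copy; the standard results, using EMS as the estimate of σ², are sketched below for reference:

$$t = \frac{\hat\beta_1 - \beta_{1,0}}{s.e.(\hat\beta_1)} \sim t_{n-2} \ \text{ under } H_0:\ \beta_1=\beta_{1,0}, \qquad \hat\beta_1 \pm t_{n-2,\,\alpha/2}\; s.e.(\hat\beta_1),$$

where $s.e.(\hat\beta_1) = \sqrt{\mathrm{EMS}/\sum_{i=1}^{n}(x_i-\bar{x})^2}$.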

Week 7 – Multiple regression
In multiple regression, several predictor (independent or explanatory) variables are used to model a single response (dependent) variable.

Assumptions: The same assumptions of the linear regression model apply here. Understand that E(ε_i) = 0 and V(ε_i) = σ² for each i = 1, 2, …, n.
1. E(ε) = 0 (in vector form).

We now consider a model with k predictor variables X_1, X_2, …, X_k:

Y = β_0 + β_1 X_1 + β_2 X_2 + β_3 X_3 + ⋯ + β_k X_k + ε,  where ε ~ N(0, σ²).

Here, E(ε_i) = 0 for each i = 1, 2, …, n can be given in matrix form as above.

Now consider each observation as

Y_i = β_0 + β_1 X_i1 + β_2 X_i2 + β_3 X_i3 + ⋯ + β_k X_ik + ε_i  for i = 1, 2, …, n.

Then the model can be given in matrix form as Y = X β + ε.

 2   2 2. V(ε)= σ I n =    0  2

The matrices are described below.

0

     2  

Similarly to above 𝑉󰇛𝜀𝑖󰇜  𝜎 2 for each 𝑖  1, 2, … , 𝑛 can also be given as a

diagonal matrix with 𝜎2 in the diagonal and zeros in off-diagonal elements as covariance between error terms are assumed to be zero in the model.

3. ε ~ N(0, σ2 I n) Overall, the errors follow a multivariate normal distribution with mean vector 0 and

variance-covariance matrix σ2 I n. (Here 𝐼𝑛 is the 󰇛𝑛  𝑛󰇜 identity matrix).
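For reference, the matrix form written out explicitly (a standard layout; the exact notation on the original slide is not recoverable):

$$\underbrace{\begin{pmatrix} Y_1 \\ Y_2 \\ \vdots \\ Y_n \end{pmatrix}}_{\mathbf{Y}}
= \underbrace{\begin{pmatrix} 1 & X_{11} & \cdots & X_{1k} \\ 1 & X_{21} & \cdots & X_{2k} \\ \vdots & \vdots & & \vdots \\ 1 & X_{n1} & \cdots & X_{nk} \end{pmatrix}}_{\mathbf{X}}
\underbrace{\begin{pmatrix} \beta_0 \\ \beta_1 \\ \vdots \\ \beta_k \end{pmatrix}}_{\boldsymbol{\beta}}
+ \underbrace{\begin{pmatrix} \varepsilon_1 \\ \varepsilon_2 \\ \vdots \\ \varepsilon_n \end{pmatrix}}_{\boldsymbol{\varepsilon}}$$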

Parameter estimation: It can be shown that the least squares estimators are β̂ = (XᵀX)⁻¹ Xᵀ Y, with V(β̂) = σ² (XᵀX)⁻¹.
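A minimal R sketch of this computation on made-up data (the data and names are hypothetical, purely for illustration):

```r
# Hypothetical data for illustration
d <- data.frame(x1 = c(1, 2, 3, 4, 5, 6),
                x2 = c(2, 1, 4, 3, 6, 5),
                y  = c(3.0, 3.8, 6.1, 6.7, 9.2, 9.0))

X <- cbind(1, d$x1, d$x2)                    # design matrix with a leading column of ones
beta_hat <- solve(t(X) %*% X, t(X) %*% d$y)  # (X'X)^(-1) X'y
beta_hat

coef(lm(y ~ x1 + x2, data = d))              # same estimates via R's built-in fit
```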

Week 8 – Analysis of Variance (ANOVA)
Consider the model with k explanatory variables: Y = β_0 + β_1 X_1 + β_2 X_2 + β_3 X_3 + ⋯ + β_k X_k + ε.

The ANOVA table in multiple regression is:

Source     | Sum of Squares | d.f.                | Mean square         | F-test
Regression | RSS            | k                   | RMS = RSS / k       | F* = RMS / EMS
Residual   | ESS            | n − (k+1) = n−k−1   | EMS = ESS / (n−k−1) |
Total      | TSS            | n − 1               |                     |

For the model Y = β_0 + β_1 X_1 + β_2 X_2 + β_3 X_3 + ⋯ + β_k X_k + ε, where ε ~ N(0, σ²), we have E(Y) = β_0 + β_1 X_1 + β_2 X_2 + β_3 X_3 + ⋯ + β_k X_k, as E(ε) = 0.

The variance V(β̂) = σ² (XᵀX)⁻¹ involves the unknown constant variance σ² of the error terms. Therefore, once we substitute EMS as an estimator for σ², we get the estimated variance-covariance matrix V̂(β̂):

V̂(β̂) =
[ V̂(β̂_0)          ⋯  Cov(β̂_0, β̂_k) ]
[ ⋮                ⋱  ⋮               ]
[ Cov(β̂_0, β̂_k)   ⋯  V̂(β̂_k)        ]

Degrees of freedom:
1. For TSS: (n − 1). Here, one degree of freedom is lost because Σ (y_i − ȳ) = 0, so the number of independent quantities is (n − 1).
2. For ESS: (n − (k+1)). Here, (k+1) degrees of freedom are lost because the (k+1) estimates β̂_0, β̂_1, β̂_2, …, β̂_k have to be estimated to obtain the fitted values ŷ_i.
3. For RSS: k.
Note that TSS = RSS + ESS and (n − 1) = k + (n − k − 1).

Note that the response space is two-dimensional for the simple linear regression model with a single explanatory variable, and three-dimensional for the multiple regression model with two explanatory variables. It is not easy to visualize the response space in multiple regression models with three or more explanatory variables.

The meaning of the parameters is analogous in all situations. The parameter β_k indicates the change in the mean response E(Y) with a unit increase in the independent variable X_k, when all the other independent variables in the regression model are held constant.

The fitted model is Ŷ = β̂_0 + β̂_1 X_1 + β̂_2 X_2 + β̂_3 X_3 + ⋯ + β̂_k X_k.

In simple linear regression we tested whether β_1 = 0 or not. Here, we test whether all the regression coefficients (the β_j's) are equal to zero. In addition, testing whether a particular subset of regression coefficients is equal to zero is also possible in this context.

Week 9 – Testing hypotheses
We consider the marginal reduction in the error sum of squares when one or several independent variables are added to the regression model. Equivalently, we view the marginal increase in the regression sum of squares when one or several variables are added to the regression model.

Testing hypotheses in a multiple regression model: Let us consider the regression model with k explanatory variables,

Y = β_0 + β_1 X_1 + β_2 X_2 + β_3 X_3 + ⋯ + β_k X_k + ε,

as the full model (FM). Suppose we wish to test whether a reduced model (RM) of the FM is adequate:

H0: the reduced model is adequate.    H1: the full model is adequate.

The test compares the error sums of squares of the full model [ESS(FM)] and the reduced model [ESS(RM)]. Note that ESS(FM) ≤ ESS(RM), because the additional parameters in the full model cannot increase the ESS. Thus, the difference ESS(RM) − ESS(FM) represents the increase in the error sum of squares due to fitting the reduced model.

F = { [ESS(RM) − ESS(FM)] / (k + 1 − h) } / { ESS(FM) / (n − k − 1) },

where h is the number of parameters in the reduced model. [ESS(RM) − ESS(FM)] has (n − h) − (n − k − 1) = k + 1 − h degrees of freedom. Consequently, the F statistic defined above follows an F distribution with (k + 1 − h) and (n − k − 1) degrees of freedom when H0 is true. If the F value is large when compared with its table value (at the α level), the result is significant at the α level, indicating that the reduced model is unsatisfactory.

We can now use this general F test to test the two types of hypothesis mentioned above. Notice that in each case the reduced model changes.

Method 1
1. Testing all regression coefficients equal to zero
In this case the models are as follows.

RM: Y = β_0 + ε
FM: Y = β_0 + β_1 X_1 + β_2 X_2 + β_3 X_3 + ⋯ + β_k X_k + ε

Note that the least squares estimate of β_0 in the reduced model should be Ȳ, as there are no predictor variables. Therefore, the error sum of squares from the reduced model is ESS(RM) = TSS. The reduced model has 1 parameter and the full model has (k + 1) parameters. Therefore, the F-test reduces to

F* = { [TSS − ESS] / k } / { ESS / (n − k − 1) }.

We can write this as

F* = { [TSS − ESS] / k } / { ESS / (n − k − 1) } = { RSS / k } / { ESS / (n − k − 1) } = RMS / EMS,

as TSS = ESS + RSS. RMS is the mean square due to regression and EMS is the mean square due to error. This F-test can be used to test whether all parameters (except the constant) are zero. The null and alternative hypotheses in this context can also be stated as

H0: β_1 = β_2 = β_3 = ⋯ = β_k = 0
H1: At least one of the β_j's is not equal to zero.
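A minimal R sketch of this comparison, reusing the hypothetical data frame d from the parameter estimation sketch above; anova() on the two nested fits reproduces this general F test:

```r
rm_fit <- lm(y ~ 1, data = d)           # reduced model: intercept only
fm_fit <- lm(y ~ x1 + x2, data = d)     # full model with all predictors
anova(rm_fit, fm_fit)                   # F test of H0: beta_1 = beta_2 = 0
summary(fm_fit)$fstatistic              # same overall F statistic reported by summary()
```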

Method 2: Inference for individual regression coefficients

Using the properties of least squares estimators discussed earlier, one can make statistical inferences regarding the regression coefficients. The statistic for testing H0: β_j = β_j⁰, where β_j⁰ is a constant (or even zero) chosen by the investigator, is

t_j = (β̂_j − β_j⁰) / s.e.(β̂_j),

which has a Student t distribution with (n − k − 1) degrees of freedom. Note that the standard error of β̂_j (s.e.(β̂_j)) is the square root of the estimated variance of β̂_j. Hence, s.e.(β̂_j) = √V̂(β̂_j).

The test is carried out by comparing the observed value with the appropriate critical value t_{n−k−1, α/2}, which is obtained from the t-tables. Here, α is the significance level, and we divide the significance level by two because we have a two-sided alternative hypothesis. Accordingly, H0 is rejected at significance level α if |t_j| ≥ t_{n−k−1, α/2}, where |t_j| is the absolute value of t_j.
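A minimal R sketch of these individual t-tests, again reusing the hypothetical fit of y on x1 and x2 from above:

```r
fm_fit <- lm(y ~ x1 + x2, data = d)
summary(fm_fit)$coefficients   # Estimate, Std. Error, t value and Pr(>|t|) for each beta_j
confint(fm_fit, level = 0.95)  # the corresponding t-based confidence intervals
```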

For example, suppose k = 5 and α = 0.05. Then each t-test produces a significant result (rejects the null hypothesis) subject to a maximum type I error of 0.05, i.e. with 0.95 confidence. It follows that the overall confidence that all of the decisions are correct is (0.95)⁵ = 0.77378. Hence, there is a probability of 1 − 0.77378 = 0.22622 of making at least one incorrect decision. Therefore, the overall significance of the test (the probability of rejecting H0 when it is true for at least one of the coefficients) is considerably higher than the individual value α = 0.05. It follows that we need to think about performing t-tests for individual coefficients with the overall level of significance set to an acceptable value, especially when the number of coefficients is large.

2. Testing a subset of regression coefficients equal to zero
This has two advantages in a given context. First, it enables us to isolate the most important variables, and second, it provides us with an easy-to-understand, simple description of the process. Simplicity of description, or the principle of parsimony, is one of the important aspects of regression analysis.

To examine whether the variation of 𝑌 can be explained in terms of a fewer<...

