
Title Slides 2 Simple Regression
Author Samal Omarova
Course Econometrics I
Institution Nazarbayev University




THE SIMPLE REGRESSION MODEL: DEFINITION AND ESTIMATION
Introductory Econometrics: A Modern Approach, 5e (South-Western, Cengage Learning), Jeffrey M. Wooldridge

1. Definition of the Simple Regression Model
2. Deriving the Ordinary Least Squares Estimates
3. Properties of OLS on any Sample of Data
4. Units of Measurement and Functional Form


1. Definition of the Simple Regression Model

∙ We begin with cross-sectional analysis and will (eventually) assume we can collect a random sample from the population of interest.

∙ We begin with the following premise, once we have a population in mind. There are two variables, x and y, and we would like to “study how y varies with changes in x.”

∙ We have seen examples: x is amount of fertilizer, and y is soybean yield; x is years of schooling, y is hourly wage.


∙ We must confront three issues:
1. How do we allow factors other than x to affect y? There is never an exact relationship between two variables (in interesting cases).
2. What is the functional relationship between y and x?
3. How can we be sure we are capturing a ceteris paribus relationship between y and x (as is so often the goal)?


∙ Consider the following equation relating y to x: $y = \beta_0 + \beta_1 x + u$, which is assumed to hold in the population of interest.

∙ This equation defines the simple linear regression model (or two-variable regression model).

∙ The term “regression” has historical roots in the “regression-to-the-mean” phenomenon.


∙ y and x are not treated symmetrically. We want to explain y in terms of x. From a causality standpoint, it makes no sense to “explain” past educational attainment in terms of future labor earnings.

∙ As another example, we want to explain student performance y in terms of class size x, not the other way around.


Terminology for y and x:

y: Dependent Variable, Explained Variable, Response Variable, Predicted Variable, Regressand
x: Independent Variable, Explanatory Variable, Control Variable, Predictor Variable, Regressor

∙ The terms “explained” and “explanatory” are probably best, as they are the most descriptive and widely applicable. But “dependent” and “independent” are used often. (“Independent” here should not be confused with the notion of statistical independence.)


∙ We mentioned the error term or disturbance, u, before. The equation $y = \beta_0 + \beta_1 x + u$ explicitly allows for other factors, contained in u, to affect y.

∙ This equation also addresses the functional form issue (in a simple way). Namely, y is assumed to be linearly related to x. We call $\beta_0$ the intercept parameter and $\beta_1$ the slope parameter. These describe a population, and our ultimate goal is to estimate them.
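∙ To make the population model concrete, here is a minimal simulation sketch in Stata; all parameter values, the seed, and the variable names are hypothetical, chosen only for illustration. We build y from the population equation with known parameters, so estimates computed later can be compared against them.

clear
set obs 500
set seed 12345
gen x = rnormal(12, 2)     // an explanatory variable, e.g. years of schooling
gen u = rnormal(0, 4)      // unobserved factors, with E(u) = 0
gen y = 1 + 0.5*x + u      // population model with beta0 = 1, beta1 = 0.5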


∙ The equation also addresses the ceteris paribus issue. In $y = \beta_0 + \beta_1 x + u$, all other factors that affect y are in u. We want to know how y changes when x changes, holding u fixed.

∙ Let Δ denote "change." Then holding u fixed means $\Delta u = 0$. So $\Delta y = \beta_1 \Delta x + \Delta u = \beta_1 \Delta x$ when $\Delta u = 0$.

∙ This equation effectively defines $\beta_1$ as a slope, with the only difference being the restriction $\Delta u = 0$.


EXAMPLE: Yield and Fertilizer

∙ A model relating crop yield to fertilizer use is $yield = \beta_0 + \beta_1 fertilizer + u$, where u contains land quality, rainfall on a plot of land, and so on. The slope parameter, $\beta_1$, is of primary interest: it tells us how yield changes when the amount of fertilizer changes, holding all else fixed.

∙ Note: The linear function is probably not realistic here. The effect of fertilizer is likely to diminish at large amounts of fertilizer.


EXAMPLE: Wage and Education
$$wage = \beta_0 + \beta_1 educ + u,$$
where u contains somewhat nebulous factors ("ability") but also past workforce experience and tenure on the current job. $\Delta wage = \beta_1 \Delta educ$ when $\Delta u = 0$.

∙ Is each year of education really worth the same dollar amount no matter how much education one starts with?


∙ We said we must confront three issues:
1. How do we allow factors other than x to affect y?
2. What is the functional relationship between y and x?
3. How can we be sure we are capturing a ceteris paribus relationship between y and x?

∙ We have argued that the simple regression model $y = \beta_0 + \beta_1 x + u$ addresses each of them.


∙ This seems too easy! How can we hope to estimate the ceteris paribus effect of x on y in general when we have assumed all other factors affecting y are unobserved and lumped into u?

∙ The key is that the simple linear regression (SLR) model is a population model. When it comes to estimating $\beta_1$ (and $\beta_0$) using a random sample of data, we must restrict how u and x are related to each other.


∙ But x and u are properly viewed as having distributions in the population. For example, if x = educ then, in principle, we could figure out its distribution in the population of adults over, say, 30 years old. Suppose u is cognitive ability. Assuming we can measure what that means, it also has a distribution in the population.

∙ What we must do is restrict the way in which u and x relate to each other in the population.


∙ First, we make a simplifying assumption that is without loss of generality: the average, or expected, value of u is zero in the population: $E(u) = 0$, where $E(\cdot)$ is the expected value (or averaging) operator.

∙ Normalizing “land quality,” or “ability,” to be zero in the population should be harmless. It is.


∙ The presence of $\beta_0$ in $y = \beta_0 + \beta_1 x + u$ allows us to assume $E(u) = 0$. If the average of u is different from zero, we just adjust the intercept, leaving the slope the same. If $\alpha_0 = E(u)$, then we can write $y = (\beta_0 + \alpha_0) + \beta_1 x + (u - \alpha_0)$, where the new error, $u - \alpha_0$, has a zero mean.

∙ The new intercept is $\beta_0 + \alpha_0$. The important point is that the slope, $\beta_1$, has not changed.


KEY QUESTION: How do we need to restrict the dependence between u and x?

∙ We could assume u and x are uncorrelated in the population: $\text{Corr}(x, u) = 0$.

∙ Zero correlation actually works for many purposes, but it implies only that u and x are not linearly related. Ruling out only linear dependence can cause problems with interpretation and makes statistical analysis more difficult.
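∙ A standard counterexample, not from the slides, shows what zero correlation fails to rule out. Suppose x is symmetric about zero (so $E(x) = E(x^3) = 0$) and u depends on x only through its square:
$$u = x^2 - E(x^2) \quad\Rightarrow\quad \text{Cov}(x, u) = E(xu) - E(x)E(u) = E(x^3) = 0,$$
yet the mean of u clearly varies with x, so the conditional-mean restriction introduced next fails.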


∙ An assumption that meshes well with our introductory treatment involves the mean of the error term for each slice of the population determined by values of x: $E(u|x) = E(u)$ for all values of x, where $E(u|x)$ means "the expected value of u given x."

∙ We say u is mean independent of x.


∙ Suppose u is "ability" and x is years of education. We need, for example, $E(ability \mid x = 8) = E(ability \mid x = 12) = E(ability \mid x = 16)$, so that the average ability is the same in the different portions of the population with an 8th grade education, a 12th grade education, and a four-year college education.

∙ Because people choose education levels partly based on ability, this assumption is almost certainly false.


∙ Suppose u is "land quality" and x is fertilizer amount. Then $E(u|x) = E(u)$ if fertilizer amounts are chosen independently of quality. This assumption is reasonable, but it effectively assumes fertilizer amounts are assigned at random.

∙ Combining Eu|x  Eu (the substantive assumption) with Eu  0 (a normalization) gives Eu|x  0, all values x

∙ Called the zero conditional mean assumption.


∙ Because the expected value is a linear operator, $E(u|x) = 0$ implies $E(y|x) = \beta_0 + \beta_1 x + E(u|x) = \beta_0 + \beta_1 x$, which shows the population regression function is a linear function of x.

∙ A different approach to simple regression ignores the causality issue and just starts with a linear model for Ey|x as a descriptive device.


[Figure: the population regression function $E(y|x) = \beta_0 + \beta_1 x$, with the conditional distributions of y shown at three values $x_1$, $x_2$, $x_3$.]

∙ The straight line in the previous graph is the PRF, $E(y|x) = \beta_0 + \beta_1 x$. The conditional distributions of y at three different values of x are superimposed.

∙ For a given value of x, we see a range of y values: remember, $y = \beta_0 + \beta_1 x + u$, and u has a distribution in the population.


EXAMPLE: College versus High School GPA.

∙ Suppose for the population of students attending a university, we (somehow) know $E(colGPA \mid hsGPA) = 1.5 + 0.5\,hsGPA$, so y is colGPA and x is hsGPA. (Of course, it is unrealistic to assume we know $\beta_0 = 1.5$ and $\beta_1 = 0.5$.)

∙ If hsGPA = 3.6, then the average of colGPA among students with this particular high school GPA is $1.5 + 0.5(3.6) = 3.3$.


∙ Remember, anyone with hsGPA = 3.6 most likely will not have colGPA = 3.3. The value 3.3 is the average value of colGPA within the slice of the population with hsGPA = 3.6.

∙ Regression analysis is essentially about explaining effects of explanatory variables on average outcomes of y.


2. Deriving the Ordinary Least Squares Estimates

∙ Given data on x and y, how can we estimate the population parameters, $\beta_0$ and $\beta_1$?

∙ Let $\{(x_i, y_i) : i = 1, 2, \ldots, n\}$ be a sample of size n (the number of observations) from the population. Think of this as a random sample.

∙ The next graph shows n = 15 families and the population regression of saving on income.


[Figure: saving plotted against income for the n = 15 sample families, along with the population regression line.]

∙ Plug any observation into the population equation: $y_i = \beta_0 + \beta_1 x_i + u_i$, where the i subscript indicates a particular observation.

∙ We observe $y_i$ and $x_i$, but not $u_i$. (However, we know $u_i$ is there.)


∙ We use the two restrictions
$$E(u) = 0$$
$$\text{Cov}(x, u) = 0$$
to obtain estimating equations for $\beta_0$ and $\beta_1$.

∙ Remember, the first condition essentially defines the intercept.

∙ The second condition, stated in terms of the covariance, means that x and u are uncorrelated.

∙ Both conditions are implied by the zero conditional mean assumption $E(u|x) = 0$.
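∙ The last implication can be verified with the law of iterated expectations; a short derivation not spelled out in the slides:
$$E(u|x) = 0 \;\Rightarrow\; E(u) = E[E(u|x)] = 0 \quad\text{and}\quad E(xu) = E[x\,E(u|x)] = 0,$$
$$\text{so}\quad \text{Cov}(x, u) = E(xu) - E(x)E(u) = 0.$$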


∙ With Eu  0, Covx, u  0 is the same as Exu  0 because Covx, u  Exu − ExEu.

∙ Next we plug in for u into the two equations: Ey − * 0 − * 1 x  0 E¡xy − * 0 − * 1 x¢  0

∙ These are the two conditions in the population that determine * 0 and * 1 . So we use their sample analogs, which is a method of moments approach to estimation.


∙ In other words, we use
$$n^{-1}\sum_{i=1}^{n} (y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i) = 0$$
$$n^{-1}\sum_{i=1}^{n} x_i (y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i) = 0$$
to determine $\hat{\beta}_0$ and $\hat{\beta}_1$, the estimates from the data.

∙ These are two linear equations in the two unknowns $\hat{\beta}_0$ and $\hat{\beta}_1$.

∙ To solve the equations, pass the summation operator through the first equation:
$$n^{-1}\sum_{i=1}^{n} (y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i) = n^{-1}\sum_{i=1}^{n} y_i - n^{-1}\sum_{i=1}^{n} \hat{\beta}_0 - n^{-1}\sum_{i=1}^{n} \hat{\beta}_1 x_i$$
$$= n^{-1}\sum_{i=1}^{n} y_i - \hat{\beta}_0 - \hat{\beta}_1 \left( n^{-1}\sum_{i=1}^{n} x_i \right) = \bar{y} - \hat{\beta}_0 - \hat{\beta}_1 \bar{x}$$

i1

n y i for the average of the n ∙ We use the standard notation y  n −1 ∑ i1

numbers £y i : i  1, 2, . . . , n¤. For emphasis, we call y a sample average.

∙ We have shown that the first equation,
$$n^{-1}\sum_{i=1}^{n} (y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i) = 0,$$
implies
$$\bar{y} = \hat{\beta}_0 + \hat{\beta}_1 \bar{x}.$$

∙ Rewrite this equation so that the intercept is in terms of the slope (and the sample averages of y and x):
$$\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}$$
and plug this into the second equation (and drop the division by n):
$$\sum_{i=1}^{n} x_i (y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i) = 0,$$
so
$$\sum_{i=1}^{n} x_i \left[ y_i - (\bar{y} - \hat{\beta}_1 \bar{x}) - \hat{\beta}_1 x_i \right] = 0.$$

∙ Simple algebra gives
$$\sum_{i=1}^{n} x_i (y_i - \bar{y}) = \hat{\beta}_1 \sum_{i=1}^{n} x_i (x_i - \bar{x}),$$
and so we have one linear equation in the one unknown $\hat{\beta}_1$.

∙ Showing the solution for $\hat{\beta}_1$ uses three useful facts about the summation operator:
$$\sum_{i=1}^{n} (x_i - \bar{x}) = 0$$
$$\sum_{i=1}^{n} x_i (y_i - \bar{y}) = \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y}) = \sum_{i=1}^{n} (x_i - \bar{x}) y_i$$
$$\sum_{i=1}^{n} x_i (x_i - \bar{x}) = \sum_{i=1}^{n} (x_i - \bar{x})^2$$
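∙ These facts trace back to the definition of the sample average. For example, the first can be verified in one line (a step the slides leave implicit), and the other two follow by expanding the products and applying it:
$$\sum_{i=1}^{n} (x_i - \bar{x}) = \sum_{i=1}^{n} x_i - n\bar{x} = n\bar{x} - n\bar{x} = 0.$$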

∙ So the equation to solve is
$$\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y}) = \hat{\beta}_1 \sum_{i=1}^{n} (x_i - \bar{x})^2.$$

∙ If $\sum_{i=1}^{n} (x_i - \bar{x})^2 > 0$, we can write
$$\hat{\beta}_1 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2} = \frac{\text{Sample Covariance}(x_i, y_i)}{\text{Sample Variance}(x_i)}.$$

∙ The previous formula for $\hat{\beta}_1$ is important. It shows us how to take the data we have and compute the slope estimate. For reasons we will see, $\hat{\beta}_1$ is called the ordinary least squares (OLS) slope estimate. We often refer to it as the slope estimate.

∙ It can be computed whenever the sample variance of the $x_i$ is not zero, which only rules out the case where each $x_i$ takes the same value.
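∙ As a check on the formula, here is a minimal Stata sketch that computes $\hat{\beta}_1$ and $\hat{\beta}_0$ "by hand"; the variable names come from the WAGE2.DTA example used later, and the result can be compared with the output of reg wage educ.

use wage2.dta, clear
quietly summarize educ
scalar xbar = r(mean)
quietly summarize wage
scalar ybar = r(mean)
gen double dxdy = (educ - xbar)*(wage - ybar)   // (x_i - xbar)(y_i - ybar)
gen double dxdx = (educ - xbar)^2               // (x_i - xbar)^2
quietly summarize dxdy
scalar num = r(sum)
quietly summarize dxdx
scalar den = r(sum)
scalar b1 = num/den            // OLS slope estimate
scalar b0 = ybar - b1*xbar     // OLS intercept estimate
display "b1 = " b1 "    b0 = " b0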

∙ The following graph shows we have no way to determine the slope in a relationship between wage and educ if we observe a sample where everyone has 12 years of schooling.


[Figure: wage plotted against educ when every sampled worker has educ = 12; with no variation in educ, the slope cannot be determined.]

∙ Situations like those in the previous graph are very rare. Except with very small sample sizes, we will be able to compute a slope estimate.

∙ Once we have $\hat{\beta}_1$, we compute $\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}$. This is the OLS intercept estimate.

∙ These days, one lets a computer do the calculations, which can be tedious even if n is small.


∙ Where does the name "ordinary least squares" come from?

∙ For any candidates $\hat{\beta}_0$ and $\hat{\beta}_1$, define a fitted value for each data point i as
$$\hat{y}_i = \hat{\beta}_0 + \hat{\beta}_1 x_i.$$
We have n of these. It is the value we predict for $y_i$ given that x has taken on the value $x_i$.

∙ The mistake we make is the residual:
$$\hat{u}_i = y_i - \hat{y}_i = y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i,$$
and we have n residuals.
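∙ In Stata, once a regression has been run, the fitted values and residuals for every observation can be recovered with predict; a brief sketch (yhat and uhat are arbitrary new variable names):

reg wage educ
predict yhat, xb           // fitted values: yhat_i = b0 + b1*educ_i
predict uhat, residuals    // residuals: uhat_i = wage_i - yhat_i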


∙ Suppose we measure the size of the mistake, for each i, by squaring the residual: $\hat{u}_i^2$. Then we add them all up:
$$\sum_{i=1}^{n} \hat{u}_i^2 = \sum_{i=1}^{n} (y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i)^2$$

∙ This quantity is called the sum of squared residuals.

∙ If we choose $\hat{\beta}_0$ and $\hat{\beta}_1$ to minimize the sum of squared residuals, it can be shown (using calculus or other arguments) that the solutions are the slope and intercept estimates we obtained before.
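∙ The calculus argument mentioned above amounts to setting the two partial derivatives of the sum of squared residuals to zero; written out (a step the slides leave to the reader):
$$\frac{\partial}{\partial \hat{\beta}_0} \sum_{i=1}^{n} (y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i)^2 = -2 \sum_{i=1}^{n} (y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i) = 0$$
$$\frac{\partial}{\partial \hat{\beta}_1} \sum_{i=1}^{n} (y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i)^2 = -2 \sum_{i=1}^{n} x_i (y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i) = 0$$
Dividing each by $-2n$ reproduces exactly the two sample moment conditions used earlier.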

∙ Once we have the numbers $\hat{\beta}_0$ and $\hat{\beta}_1$ for a given data set, we write the OLS regression line as a function of x:
$$\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x$$

∙ The OLS regression line allows us to predict y for any (sensible) value of x. It is also called the sample regression function.

∙ The intercept, $\hat{\beta}_0$, is the predicted y when x = 0. (The prediction is usually meaningless if x = 0 is not possible.)


∙ The slope, $\hat{\beta}_1$, allows us to predict changes in y for any (reasonable) change in x:
$$\Delta\hat{y} = \hat{\beta}_1 \Delta x$$

∙ If $\Delta x = 1$, so that x increases by one unit, then $\Delta\hat{y} = \hat{\beta}_1$.


EXAMPLE: Effects of Education on Hourly Wage (WAGE2.DTA)

∙ Data are from 1991 on men only. wage is reported in dollars per hour, educ is highest grade completed.

∙ The estimated equation is
$$\widehat{wage} = -5.12 + 1.43\,educ, \quad n = 759$$

∙ Below we discuss the negative intercept. Literally, it says that wage is predicted to be -$5.12 when educ = 0!

∙ Each additional year of schooling is estimated to be worth $1.43.


∙ In the Stata output, $\hat{\beta}_0 = -5.12$ is the Coef. labeled "_cons" and $\hat{\beta}_1 = 1.43$ is the Coef. labeled "educ."

∙ We will learn about the other numbers as we go.

∙ General form of the Stata command:
reg y x
The order of y and x is critical!
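∙ To underline the warning about ordering, compare (an illustrative pair of commands):

reg wage educ    // wage regressed on educ: the model we want
reg educ wage    // educ regressed on wage: a different regression entirely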


. use wage2.dta
. des

Contains data from wage2.dta
  obs:       759
  vars:       13                 8 Sep 2006 13:47
  size:   23,529 (99.9% of memory free)

--------------------------------------------------------------------------
              storage  display    value
variable name   type   format     label      variable label
--------------------------------------------------------------------------
wage            float  %9.0g                 hourly wage, 1991
IQ              float  %9.0g                 IQ score
educ            byte   %9.0g                 highest grade completed by 1991
ne              byte   %9.0g                 1 if in northeast, 1991
nc              byte   %9.0g                 1 if in north central, 1991
west            byte   %9.0g                 1 if in west, 1991
south           byte   %9.0g                 1 if in south, 1991
exper           byte   %9.0g                 potential experience
motheduc        byte   %9.0g                 highest grade, mother
fatheduc        byte   %9.0g                 highest grade, father
urban           byte   %9.0g                 1 if in urban area, 1991
lwage           float  %9.0g                 log(wage)
expersq         int    %9.0g                 exper^2
--------------------------------------------------------------------------
Sorted by:


. sum wage educ IQ

    Variable |    Obs        Mean    Std. Dev.       Min        Max
-------------+------------------------------------------------------
        wage |    759    13.84755    9.485582       1.02      81.91
        educ |    759    13.22925    2.411524          8         20
          IQ |    759    100.0646    14.57751         54        131

. tab educ

highest grade |
 completed by |
         1991 |      Freq.     Percent        Cum.
--------------+-----------------------------------
            8 |         17        2.24        2.24
            9 |         21        2.77        5.01
           10 |         27        3.56        8.56
           11 |         37        4.87       13.44
           12 |        325       42.82       56.26
           13 |         56        7.38       63.64
           14 |         43        5.67       69.30
           15 |         40        5.27       74.57
           16 |        129       17.00       91.57
           17 |         32        4.22       95.78
           18 |         17        2.24       98.02
           19 |          5        0.66       98.68
           20 |         10        1.32      100.00
--------------+-----------------------------------
        Total |        759      100.00


. reg wage educ

      Source |       SS       df       MS              Number of obs =     759
-------------+------------------------------           F(  1,   757) =  116.04
       Model |  9065.37819     1  9065.37819           Prob > F      =  0.0000
    Residual |  59136.6266   757  78.1197181           R-squared     =  0.1329
-------------+------------------------------           Adj R-squared =  0.1318
       Total |  68202.0048   758  89.9762597           Root MSE      =  8.8385

------------------------------------------------------------------------------
        wage |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        educ |   1.434058   .1331233    10.77   0.000     1.172723    1.695393
       _cons |  -5.123961   1.790104    -2.86   0.004    -8.638119   -1.609803
------------------------------------------------------------------------------


∙ Reminder: When we write the population regression line, $wage = \beta_0 + \beta_1 educ + u$, we do not know $\beta_0$ and $\beta_1$. Rather, $\hat{\beta}_0 = -5.12$ and $\hat{\beta}_1 = 1.43$ are our estimates from this particular sample of 759 men. These estimates may or may not be close to the population values. If we obtained another sample of 759 men, the estimates would almost certainly change.


∙ The function $\widehat{wage} = -5.12 + 1.43\,educ$ is the OLS (or sample) regression line.

∙ Plugging in educ = 0 gives the silly prediction $\widehat{wage} = -5.12$. Extrapolating outside the range of the data can produce strange predictions. There are no men in the sample with educ < 8.


∙ When educ = 8, $\widehat{wage} = -5.12 + 1.43(8) = 6.32$.

∙ The predicted hourly wage at eight years of education is $6.32, which we can think of as our estimate of the average wage in the population when educ = 8. But no one in the sample earns exactly $6.32.

