
BEE2031: Econometrics
Handout 1: A population approach to regression
Climent Quintana-Domeque
University of Exeter
Last version: 18 September 2020

Abstract This topic has five main sections. Section 1 motivates why we focus on the population (of interest) rather than on a sample from that population. Section 2 reviews the core probability tools needed to understand this module, from expectation (or population mean) to mean independence and uncorrelatedness. Section 3 discusses bivariate regression from a population perspective, shows the key insight of any regression –regression residuals are uncorrelated with the explanatory variable(s) by construction– and derives the formulae for the population coefficients of any bivariate regression. Section 4 explains how regression can be useful (or not) to recover the parameters of an economic model. Finally, Section 5 concludes with some questions for you to test your knowledge of this topic.

Keywords: population, probability, expectation, mean independence, uncorrelatedness, regression, regression residuals, economic model.

First version: September 6, 2020. This handout is based on Angrist and Pischke (2015), Ashenfelter et al. (2002), notes from Master Joshway, and Stock and Watson (2011, 2020). Thanks to Sebastian Kripfganz and Xiaohui Zhang for helpful comments and suggestions. All errors are my own. Please email me at [email protected] if you find typos.

1 Population vs. Sample

Our focus in BEE2031 Econometrics will be the population. Our interest lies in describing and understanding relationships at the population level. Some of these questions are purely descriptive: What is the wage gap between men and women in the economy? Others are causal: What is the effect of the minimum wage on infant health? Regardless of whether these questions are descriptive or causal, they are important. Only after establishing what the object of interest in the population is (e.g. the mean difference in wages between men and women in the economy) and whether we can appropriately measure it with data (e.g. the required information on men and women to measure the wage gap) can one start thinking of using a sample to try and learn about the population. Our focus this week is on the population. This means that, even when we conduct an empirical data exercise (e.g. running a regression with a sample), we will assume that we have information on all the units (e.g. individuals, schools, countries, etc.) from the population, and not just one sample. Why? For one thing, pedagogical purposes.

2 Probability tools

2.1 Random variables and probability distributions

Randomness. Randomness is a key building block of BEE2031 Econometrics and a very important factor in our lives. The result of your job application (successful or unsuccessful) involves an element of chance or randomness: It is unknown but eventually revealed.


The mutually exclusive potential results of a random process are called the outcomes: your application can be either successful or unsuccessful. Of course, only one of these outcomes will actually occur: either it is successful (and you get the job) or it is not (and you do not). These outcomes need not be equally likely (e.g. the likelihood of getting a job is lower in an economic recession).

What is the probability of an outcome? It is the proportion of the time that the outcome occurs in the long run. For instance, if the probability of your job application being successful is 0.75, then in the long run 75% of your job applications would be successful.

What is a random variable?

It is a numerical summary of a random outcome. The result of your job application is either successful or not successful. Of course, we can assign a number to each of these outcomes, say 1 if successful, 0 if not successful. Random variables can be discrete (taking on only a discrete set of values) or continuous (taking on a continuum of possible values).

Probability distribution of a discrete random variable. It is the list of all possible values of the random variable and the probability associated with each value. These probabilities sum to 1. Continuing with our example of the result of your job application, let R be the result of your job application. Then: The probability that R = 1, denoted by P(R = 1), is the probability that your application is successful. The probability that R = 0, denoted by P(R = 0), is the probability that your application is not successful. This is the simplest case of a binary random variable, aka a Bernoulli random variable, and

its probability distribution is called the Bernoulli distribution. The outcomes of R and their probabilities can be denoted as

R = { 1  w.p.  P(R = 1) = p
    { 0  w.p.  P(R = 0) = 1 − p,

where w.p. signifies "with probability". In general, let X be a discrete random variable, that is, a variable that can take on J different values x1, x2, ..., xJ. A discrete probability distribution provides the probability that X = xj for each of the values xj assumed by the random variable X:

P(X = xj).
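To make the notation concrete, here is a minimal Python sketch (not part of the handout; the helper name check_distribution is made up) that stores a discrete probability distribution as a dictionary from values to probabilities and verifies that the probabilities sum to 1:

```python
def check_distribution(pmf):
    """Verify that probabilities are non-negative and sum to 1."""
    assert all(prob >= 0 for prob in pmf.values())
    assert abs(sum(pmf.values()) - 1.0) < 1e-12

p = 0.75
bernoulli = {1: p, 0: 1 - p}           # R = 1 w.p. p, R = 0 w.p. 1 - p
die = {x: 1 / 6 for x in range(1, 7)}  # fair die: each face w.p. 1/6

for pmf in (bernoulli, die):
    check_distribution(pmf)

print(bernoulli[1])  # P(R = 1) = 0.75
```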

Probability distribution of a continuous random variable. A continuous random variable can take on a continuum of possible values, hence the probability distribution used for discrete variables is not suitable for continuous variables. The probability that a random variable falls between any two points is given by the area under the probability density function (p.d.f., density function or density) between those two points. In general, a p.d.f. for a continuous random variable X is denoted f, and the probability that X falls between two values u and v is given by

P(u ≤ X ≤ v) = ∫_u^v f(x) dx.

The most well-known continuous distribution is the normal distribution. A continuous random variable with a normal distribution has the familiar bell-shaped probability density. We will get back to this important distribution in topic 2: A large sample approach to regression. Figure 1 plots the density of a standard normal distribution.


[Figure 1: Standard Normal Distribution. The bell-shaped density is plotted against Z from −4 to 4, with density values between 0 and about 0.4.]

Cumulative probability distribution. It is the probability that a random variable is less than or equal to a particular value. It is aka the cumulative distribution function, c.d.f., or cumulative distribution. Given our discrete random variable X, a cumulative probability distribution provides the probability that X ≤ xj for each of the values xj assumed by the random variable X: F(xj) = P(X ≤ xj). If the random variable is continuous, we write:

F(x) = P(X ≤ x).

Hence, the probability that X falls between two values u and v is given by

P(u ≤ X ≤ v) = P(X ≤ v) − P(X ≤ u) = F(v) − F(u).


2.2 Mean and Variance of a random variable

Expected value of a random variable. The expected value of a random variable Y (expectation of Y or mean of Y ), denoted E(Y ) (or µY ), is the long-run average value of the random variable over many repeated trials or occurrences. A nice example is the number you get after rolling a fair die. What is the average number you get after rolling a fair die many, many times? The probability of getting any number from 1 to 6 on a fair die is 1/6. Hence, if Y denotes the number you get after rolling a fair die, the expectation of Y is given by

E(Y) = P(Y = 1)×1 + P(Y = 2)×2 + P(Y = 3)×3 + P(Y = 4)×4 + P(Y = 5)×5 + P(Y = 6)×6 = (1/6)(1 + 2 + 3 + 4 + 5 + 6) = 3.5.

Figure 2 displays the average obtained after rolling a fair die from 1 to 1000 times (thin line), and the expectation of Y (solid line), which is 3.5.

[Figure 2: Average outcome when rolling a fair die n times. The running average (thin line) fluctuates for small n and settles near the expectation of 3.5 (solid line) as the number of rolls grows from 1 to 1000.]
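A short Python simulation (hypothetical, not part of the handout) reproduces the pattern of Figure 2: the running average of simulated rolls drifts towards the expectation of 3.5:

```python
import random

random.seed(0)
total = 0
for n in range(1, 1001):
    total += random.randint(1, 6)  # one roll of a fair die
    if n in (1, 10, 100, 1000):
        print(n, total / n)        # running average after n rolls
```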

What is the meaning of 3.5? The 3.5 is the average you would get if you rolled the fair die many, many times... an infinite number of times.

One random variable and many individuals. So far, when talking about a random variable, we had in mind one random variable for one individual, and we did not use any subscript other than the one indicating the different values that can be assumed by the discrete random variable. Going back to our example of the result of the job application:

E(R) = P(R = 0)×0 + P(R = 1)×1 = (1 − p)×0 + p×1 = p.

Let’s say that p = 0.75. This means that the expectation of R –the probability of your job application being successful– is 75%. In other words, if you could repeat the application 1000 times, you would get the job on approximately 750 occasions. Or, more accurately, if you could repeat the application infinitely many times, 75% of them would be successful. We now think of a population of N individuals, i = 1, ..., N. Each individual job application is a random variable. Thus, let Ri be the random variable "result of the job application" for individual i in the population. If each and every individual has the same probability p of succeeding in their application, the expectation of Ri is:

E(Ri) = P(Ri = 0)×0 + P(Ri = 1)×1 = (1 − p)×0 + p×1 = p.

If p = 0.75, then 75% of the applications in the population will be successful –will get the job– in expectation, but not exactly. We can see then why the population average is a representative value of the distribution of Ri in the population. In general, the mathematical expectation, expected value or mean of a random variable Yi is the population average of Yi. If Yi is a discrete random variable that can take on L different values yi1, yi2, ..., yiL and the probability that Yi takes on yi1 is P(Yi = yi1), the probability that Yi takes on yi2 is P(Yi = yi2), and so on and so forth, then the expected value of Yi is


E(Yi) = Σ_{l=1}^{L} P(Yi = yil) yil,

where E(Yi) is also denoted µY.

Variance of a random variable. In addition to representative values of the population, we are also interested in variability/dispersion. The variance measures the dispersion of a probability distribution: it is the average squared deviation from the mean. The population variance of Yi is given by

V(Yi) = E[(Yi − E[Yi])²] = E[(Yi − µY)²].

The population variance of Yi is a measure of dispersion, variability or inequality of Yi around its population mean. The population standard deviation is given by the square root of the population variance:

SD(Yi) = √V(Yi).

The variance of a Bernoulli random variable is very easy to compute. Why? If only because a Bernoulli random variable can take on only two values: 0 or 1. Consider the case of our random variable Ri; its variance is given by

V(Ri) = (1 − p)×(0 − p)² + p×(1 − p)² = (1 − p)p² + p(1 − p)² = (1 − p)(p² + p(1 − p)) = p(1 − p),

where we use the fact that E[Ri] = p. Its standard deviation is then

SD(Ri) = √(p(1 − p)).

In general, if Yi is a discrete random variable that can take on L different values yi1, yi2, ..., yiL and the probability that Yi takes on yi1 is P(Yi = yi1), the probability that Yi takes on yi2 is P(Yi = yi2), and so on and so forth, then the variance of Yi is

V(Yi) = Σ_{l=1}^{L} P(Yi = yil)(yil − µY)²,

where V(Yi) is also denoted σY², and its square root, SD(Yi), is also denoted σY.
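Here is a minimal Python sketch (the helper names mean and variance are made up) that computes the population mean, variance and standard deviation directly from a discrete probability distribution; it reproduces p(1 − p) for the Bernoulli case and 3.5 and 35/12 for the fair die:

```python
import math

def mean(pmf):
    """Population mean: sum over values of P(Y = y) * y."""
    return sum(prob * y for y, prob in pmf.items())

def variance(pmf):
    """Population variance: sum over values of P(Y = y) * (y - mu)^2."""
    mu = mean(pmf)
    return sum(prob * (y - mu) ** 2 for y, prob in pmf.items())

p = 0.75
bernoulli = {1: p, 0: 1 - p}
print(mean(bernoulli))                 # p = 0.75
print(variance(bernoulli))             # p(1 - p) = 0.1875
print(math.sqrt(variance(bernoulli)))  # sqrt(p(1 - p)) ≈ 0.433

die = {x: 1 / 6 for x in range(1, 7)}
print(mean(die), variance(die))        # 3.5 and 35/12 ≈ 2.917
```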

2.3 Probabilities, expectations and correlation: a world with two variables

We now consider a world with two random variables Xi and Yi. Our interest is in two random variables and their relationship in a population. Remember the two example questions in the first section? They involved at least two variables each, namely sex and wages, and minimum wages and infant health, respectively. Working with more than one variable requires additional concepts and tools.

Joint probability distribution. The joint probability distribution of two discrete random variables, say Xi and Yi, is the probability that the random variables simultaneously take on certain values, say xij and yil. The probabilities of all possible (xij, yil) combinations sum to 1. The joint probability distribution can be written as the function

P(Xi = xij, Yi = yil).

Marginal probability distribution. The adjective marginal is used to distinguish between probability distributions when there is more than one variable. If Xi can take on J different values xi1, xi2, ..., xiJ, then the marginal probability that Yi takes on the value yil is

P(Yi = yil) = Σ_{j=1}^{J} P(Xi = xij, Yi = yil).

Conditional probability distribution. The conditional probability distribution of Yi given Xi is the distribution of a random variable Yi conditional on another random variable Xi taking on a specific value. The conditional probability that Yi = yil when Xi = xij is written P(Yi = yil | Xi = xij). The conditional probability of Yi = yil when Xi = xij is given by

P(Yi = yil | Xi = xij) = P(Xi = xij, Yi = yil) / P(Xi = xij).

Conditional expectation. The conditional expectation of Yi given Xi, also called the conditional mean of Yi given Xi, is the mean of the conditional distribution of Yi given Xi. If Yi takes on L values yi1, yi2, ..., yiL, then the conditional mean of Yi given Xi = xij is

E(Yi | Xi = xij) = Σ_{l=1}^{L} P(Yi = yil | Xi = xij) yil.

The law of iterated expectations. The mean of Yi is the weighted average of the conditional expectation of Yi given Xi, weighted by the probability distribution of Xi. For example, the mean hourly wage in the population is the weighted average of the hourly wage of men and the hourly wage of women, weighted by the proportions of men and women. If Xi takes on J different values xi1, xi2, ..., xiJ, then

E(Yi) = Σ_{j=1}^{J} P(Xi = xij) E(Yi | Xi = xij).
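A Python sketch (reusing the same made-up joint distribution as above) verifies this weighted-average formula numerically:

```python
# Same hypothetical joint distribution as in the earlier sketch.
joint = {(0, 0): 0.10, (0, 1): 0.40,
         (1, 0): 0.05, (1, 1): 0.45}

def p_x(x):
    return sum(prob for (xx, y), prob in joint.items() if xx == x)

def e_y_given_x(x):
    """E(Y | X = x) = sum over y of P(Y = y | X = x) * y."""
    return sum((prob / p_x(x)) * y
               for (xx, y), prob in joint.items() if xx == x)

# E(Y) computed directly from the joint distribution...
e_y = sum(prob * y for (x, y), prob in joint.items())

# ...and as the weighted average of conditional expectations.
e_y_lie = sum(p_x(x) * e_y_given_x(x) for x in (0, 1))

print(e_y, e_y_lie)  # both 0.85
```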

The expectation of Yi is the expectation of the conditional expectation of Yi given Xi,

E(Yi) = E[E(Yi | Xi)],

where the inner expectation on the RHS is computed using the conditional distribution of Yi given Xi and the outer expectation is computed using the marginal distribution of Xi. An interesting implication of this law is that if the conditional expectation is zero, the unconditional expectation must be zero too:

E(Yi | Xi) = 0 ⇒ E(Yi) = 0.

An example illustrates this point clearly. If the unemployment rate of men is zero and the unemployment rate of women is zero, then the unemployment rate in the population is zero, since a weighted average of zeros must be zero!

Independence, mean independence and uncorrelatedness.

Independence. Two discrete random variables Xi and Yi are independently distributed or independent if and only if

P(Xi = xij, Yi = yil) = P(Xi = xij) P(Yi = yil) ∀ xij, yil.

We can also write that two discrete random variables Xi and Yi are independently distributed or independent if and only if

P(Xi = xij | Yi = yil) = P(Xi = xij) ∀ xij, yil.

Or

P(Yi = yil | Xi = xij) = P(Yi = yil) ∀ xij, yil.

Independence is a symmetric property: If Xi (resp. Yi) is independent of Yi (resp. Xi), Yi (resp. Xi) must be independent of Xi (resp. Yi).

Mean independence. Yi is mean independent of Xi if the mean of Yi does not depend on Xi: E(Yi | Xi) = E(Yi). Similarly, Xi is mean independent of Yi if the mean of Xi does not depend on Yi:

E(Xi | Yi) = E(Xi).

Note that mean independence is not a symmetric property: Yi (resp. Xi) can be mean independent of Xi (resp. Yi), while Xi (resp. Yi) is not mean independent of Yi (resp. Xi).

Uncorrelatedness. Yi and Xi are uncorrelated (i.e. they are not linearly related) when their correlation coefficient is zero. The correlation coefficient of two random variables is given by

Corr(Xi, Yi) = Cov(Xi, Yi) / √(V(Xi) V(Yi)),

where Cov(Xi , Yi ) is the covariance between Xi and Yi , which is a measure of the extent to which Xi and Yi move together (in a linear sense) and is given by

Cov(Xi, Yi) = E[(Xi − E(Xi))(Yi − E(Yi))] = E[(Xi − µX)(Yi − µY)],

where Cov(Xi, Yi) is also denoted by σXY. Note that the correlation coefficient varies between −1 (perfect negative linear relationship) and 1 (perfect positive linear relationship). If both Xi and Yi are discrete random variables, then the covariance between them is given by

Cov(Xi, Yi) = Σ_{j=1}^{J} Σ_{l=1}^{L} P(Xi = xij, Yi = yil)(xij − µX)(yil − µY).
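The double-sum formula can be checked on the same made-up joint distribution used earlier (a sketch with illustrative numbers only):

```python
import math

# Same hypothetical joint distribution as in the earlier sketches.
joint = {(0, 0): 0.10, (0, 1): 0.40,
         (1, 0): 0.05, (1, 1): 0.45}

mu_x = sum(prob * x for (x, y), prob in joint.items())  # 0.50
mu_y = sum(prob * y for (x, y), prob in joint.items())  # 0.85

# Covariance as the double sum over all (x, y) combinations.
cov = sum(prob * (x - mu_x) * (y - mu_y) for (x, y), prob in joint.items())
var_x = sum(prob * (x - mu_x) ** 2 for (x, y), prob in joint.items())
var_y = sum(prob * (y - mu_y) ** 2 for (x, y), prob in joint.items())

corr = cov / math.sqrt(var_x * var_y)
print(cov, corr)  # 0.025 and approximately 0.14
```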

Note that Cov(Xi, Yi) = Cov(Yi, Xi), so that uncorrelatedness is a symmetric property: If Xi (resp. Yi) is (un)correlated with Yi (resp. Xi), Yi (resp. Xi) must be (un)correlated with Xi (resp. Yi).

Things to note:

1. Independence ⇒ Mean independence ⇒ Uncorrelatedness.

2. Uncorrelatedness does not imply mean independence.

3. Mean independence does not imply independence.

Basic properties of the expectation operator. Let Xi, Yi and Zi be three random variables, and a and b two constants. (A numerical check of these properties follows the list.)

1. Expectation of the sum of a random variable and a constant:

E [Xi ± a] = E [Xi ] ± a

2. Expectation of the sum of two random variables:

E [Xi ± Yi ] = E [Xi ] ± E [Yi ]

3. Expectation of a constant times a random variable:

E[aXi ] = aE[Xi ]
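A minimal Python check of properties 1 and 3 on a small hypothetical distribution (property 2 involves two random variables and would require their joint distribution, so it is omitted here):

```python
pmf_x = {1: 0.2, 2: 0.5, 3: 0.3}  # hypothetical distribution of X
a = 10                            # a constant

e_x = sum(prob * x for x, prob in pmf_x.items())  # E[X] = 2.1

# Property 1: E[X + a] = E[X] + a
e_x_plus_a = sum(prob * (x + a) for x, prob in pmf_x.items())
assert abs(e_x_plus_a - (e_x + a)) < 1e-12

# Property 3: E[aX] = a * E[X]
e_ax = sum(prob * (a * x) for x, prob in pmf_x.items())
assert abs(e_ax - a * e_x) < 1e-12

print(e_x, e_x_plus_a, e_ax)  # 2.1, 12.1, 21.0
```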

Basic properties of the variance and covariance operators. (A simulation check follows the list.)

1. Variance of the sum of a random variable and a constant:

V [Xi ± a] = V [Xi ]


2. Variance of a constant times a random variable:

V[aXi] = a²V[Xi]

3. Variance of the sum of two random variables:

V [Xi ± Yi ] = V [Xi ] + V [Yi ] ± 2Cov(Xi , Yi )

4. Covariance of a random variable with a constant:

Cov(Xi , a) = 0

5. Covariance of a random variable with itself:

Cov(Xi , Xi ) = V (Xi )

6. Covariance of the sum of two random variables with a random variable:

Cov(Xi ± Yi , Zi ) = Cov(Xi , Zi ) ± Cov(Yi , Zi )
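A Python simulation sketch of property 3, the variance of a sum of two random variables (a hypothetical setup; sample moments only approximate the population quantities):

```python
import random

random.seed(1)
n = 200_000
xs = [random.gauss(0, 1) for _ in range(n)]
ys = [0.5 * x + random.gauss(0, 1) for x in xs]  # Y correlated with X

def mean(v):
    return sum(v) / len(v)

def var(v):
    m = mean(v)
    return sum((u - m) ** 2 for u in v) / len(v)

def cov(v, w):
    mv, mw = mean(v), mean(w)
    return sum((a - mv) * (b - mw) for a, b in zip(v, w)) / len(v)

# V[X + Y] = V[X] + V[Y] + 2 Cov(X, Y)
lhs = var([x + y for x, y in zip(xs, ys)])
rhs = var(xs) + var(ys) + 2 * cov(xs, ys)
print(lhs, rhs)  # approximately equal
```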

Mean independence and uncorrelatedness are key in BEE2031 Econometrics. Here we want to show that mean independence implies uncorrelatedness. Let Yi be mean independent of Xi : E(Yi |Xi ) = µY . Note that the covariance between Yi and Xi can always be written as

Cov(Yi, Xi) = E(Yi Xi) − E(Yi)E(Xi) = E(Yi Xi) − µY µX.


Because of the law of iterated expectations:

Cov(Yi, Xi) = E(E[Yi Xi | Xi]) − µY µX = E(Xi E[Yi | Xi]) − µY µX = E(Xi µY) − µY µX = µY E(Xi) − µY µX = µY µX − µY µX = 0.

Thus Xi and Yi are uncorrelated:

Corr(Yi, Xi) = 0 / √(V(Yi) V(Xi)) = 0.
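A simulation sketch of this result (a hypothetical construction, not from the handout): let Yi = (1 + |Xi|)·ei with ei independent of Xi and E(ei) = 0, so that E(Yi | Xi) = 0 while Yi is clearly not independent of Xi; the sample covariance should then be close to zero:

```python
import random

random.seed(2)
n = 500_000
xs = [random.gauss(0, 1) for _ in range(n)]
# e has mean 0 and is independent of X, so E(Y | X) = (1 + |X|) * E(e) = 0,
# even though the spread of Y depends on X (no full independence).
ys = [(1 + abs(x)) * random.gauss(0, 1) for x in xs]

mx = sum(xs) / n
my = sum(ys) / n
cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / n

print(cov)  # close to 0: mean independence implies uncorrelatedness
```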

3 Bivariate Population Regression

Having introduced all the basic probability tools to work in a world with two random variables, we now introduce bivariate regression from a population perspective.

3.1 Definition

The population bivariate linear regression is a tool which allows us to decompose the random variable Yi into a linear function of another random variable Xi and a (regression) residual ei: Yi = a + bXi + ei, where a + bXi is the popu...

