Solutions to Stata Exercise 5 IV PDF

Title	Solutions to Stata Exercise 5 IV
Course	Econometrics
Institution	University of Reading
Pages	23

Solutions to Stata Exercise 5: Instrumental Variables

Question 1

a. Why might you expect packs to be correlated with u?

Smoking behaviour during pregnancy may be correlated with other unobserved health-related behaviours that may also affect infant birth weight. For example, women who smoke may, on average, engage in other activities such as drinking coffee or alcohol or eating less nutritious meals, which could also affect infant birth weight. These omitted variables enter the error term, so packs may be picking up their effect: packs is correlated with the error term and is therefore endogenous, and the OLS estimate of its coefficient is biased.

b. Suppose that you have data on average cigarette price in each woman's state of residence. Discuss whether this information is likely to satisfy the properties of a good instrumental variable for packs.

A good instrument should be strongly related to the instrumented variable and uncorrelated with the error term in the structural equation, in this case equation 1.

Basic economics says the number of packs should be negatively correlated with cigarette price, since supply and demand theory suggests a higher price of a product will lead to lower demand. However, we may expect the correlation between price and the number of packs of cigarettes smoked to be quite small, for two reasons. Firstly, smoking is addictive, so the demand for cigarettes may be relatively price inelastic. Secondly, the price in this example is aggregated at the state level.

At first glance it seems that cigarette price should be exogenous in equation 1, but further consideration may suggest otherwise. One component of cigarette price is the state tax on cigarettes, and states that have lower taxes on cigarettes may also have lower quality of health care, on average. Quality of health care will be captured in u, since it is unobserved.
Therefore cigarette price may fail the exogeneity requirement for an IV if it is related to other factors, such as the quality of health care, that enter the error term.

c. Use the data in BWIGHT.dta to estimate equation 1. First, use OLS. Then use 2SLS, where cigprice is an instrument for packs. Discuss any important differences in the OLS and 2SLS estimates. (The Stata command for IV regression is ivregress.)

As always, start by describing and summarising your data, or at least the variables of interest. I will leave you to check that you understand the variables of interest.

. des lbwght male parity lfaminc packs

              storage   display    value
variable name   type    format     label      variable label
-------------------------------------------------------------------------
lbwght          float   %9.0g                 log of bwght
male            byte    %8.0g                 =1 if male child
parity          byte    %8.0g                 birth order of child
lfaminc         float   %9.0g                 log(faminc)
packs           float   %9.0g                 packs smked per day while preg

. su lbwght male parity lfaminc packs

    Variable |        Obs        Mean    Std. Dev.        Min        Max
-------------+----------------------------------------------------------
      lbwght |       1388    4.760031    .1906622   3.135494   5.602119
        male |       1388    .5208934    .4997433          0          1
      parity |       1388    1.632565    .8940273          1          6
     lfaminc |       1388    3.071271    .9180645  -.6931472   4.174387
       packs |       1388    .1043588    .2986344          0        2.5

Given it is the variable we want to instrument, it is worth tabulating the number of packs. We can see that 85% of the sample did not smoke any cigarettes during their pregnancy.

. tab packs

packs smked |
    per day |
 while preg |      Freq.     Percent        Cum.
------------+-----------------------------------
          0 |      1,176       84.73       84.73
        .05 |          3        0.22       84.94
         .1 |          4        0.29       85.23
        .15 |          7        0.50       85.73
         .2 |          9        0.65       86.38
        .25 |         19        1.37       87.75
         .3 |          6        0.43       88.18
        .35 |          4        0.29       88.47
         .4 |          5        0.36       88.83
        .45 |          1        0.07       88.90
         .5 |         55        3.96       92.87
         .6 |          5        0.36       93.23
        .75 |         19        1.37       94.60
          1 |         62        4.47       99.06
        1.5 |          5        0.36       99.42
          2 |          6        0.43       99.86
        2.3 |          1        0.07       99.93
        2.5 |          1        0.07      100.00
------------+-----------------------------------
      Total |      1,388      100.00

We start by running an OLS regression with robust standard errors, to correct for heteroskedasticity, given we are using cross-sectional data.

. reg lbwght male parity lfaminc packs, robust

Linear regression                               Number of obs     =       1388
                                                F(  4,  1383)     =      14.69
                                                Prob > F          =     0.0000
                                                R-squared         =     0.0350
                                                Root MSE          =     .18756

------------------------------------------------------------------------------
             |               Robust
      lbwght |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        male |   .0262407   .0100102     2.62   0.009      .006604    .0458775
      parity |   .0147292   .0054163     2.72   0.007     .0041041    .0253543
     lfaminc |   .0180498   .0053079     3.40   0.001     .0076373    .0284623
       packs |  -.0837281   .0174464    -4.80   0.000    -.1179523   -.0495039
       _cons |   4.675618   .0204597   228.53   0.000     4.635482    4.715753
------------------------------------------------------------------------------
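Because the dependent variable is in logs, the packs coefficient above is a change in log points rather than an exact percentage. The following is a minimal Python sketch of the conversion, not part of the original exercise; the variable names are illustrative and the only input is the coefficient copied from the output above.

```python
import math

# OLS coefficient on packs, copied from the regression output above.
b_packs = -0.0837281

# Log-point approximation: read 100 * b directly as a percentage.
approx_pct = 100 * b_packs

# Exact percentage change implied by a log-linear model: 100 * (exp(b) - 1).
exact_pct = 100 * (math.exp(b_packs) - 1)

print(f"approximate: {approx_pct:.1f}%  exact: {exact_pct:.1f}%")
```

For a coefficient this small the two figures are close, so quoting the coefficient as a percentage is a reasonable shortcut. The gap widens quickly as the coefficient grows in absolute value.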

Since we have one endogenous variable (packs) and one instrument (cigprice), the model is just-identified (i.e. the number of additional instruments is equal to the number of endogenous variables, L=K). The syntax of Stata's IV regression command ivregress requires you, firstly, to specify the estimator, with a choice of 2sls, gmm and liml. We use 2sls for this example, since the 2SLS and IV regression estimates are identical in the case of one instrument for one endogenous variable. The instrumented variable(s) should be placed in brackets and set equal to its instrument(s). Note that if you have more than one endogenous variable then you list all your endogenous variables before the equals sign and then list all of the additional instruments after the equals sign. Remember the exogenous variables serve as their own instruments, so they do not have to be included in the brackets, only the additional instruments. I have asked Stata to report the first stage regression so we can see the first stage prediction model for the number of cigarette packs smoked per day.

. ivregress 2sls lbwght male parity lfaminc (packs=cigprice), robust first

First-stage regressions
-----------------------

                                                Number of obs     =       1388
                                                F(  4,  1383)     =       7.02
                                                Prob > F          =     0.0000
                                                R-squared         =     0.0305
                                                Adj R-squared     =     0.0276
                                                Root MSE          =     0.2945

------------------------------------------------------------------------------
             |               Robust
       packs |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        male |  -.0047261   .0157547    -0.30   0.764    -.0356319    .0261796
      parity |   .0181491     .01151     1.58   0.115    -.0044298    .0407281
     lfaminc |  -.0526374   .0111516    -4.72   0.000    -.0745131   -.0307616
    cigprice |    .000777   .0008219     0.95   0.345    -.0008354    .0023894
       _cons |   .1374075   .1021316     1.35   0.179    -.0629421    .3377572
------------------------------------------------------------------------------

Instrumental variables (2SLS) regression        Number of obs     =       1388
                                                Wald chi2(4)      =      10.02
                                                Prob > chi2       =     0.0401
                                                R-squared         =          .
                                                Root MSE          =     .31959

------------------------------------------------------------------------------
             |               Robust
      lbwght |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
       packs |   .7971063   1.111214     0.72   0.473    -1.380833    2.975045
        male |   .0298205   .0171898     1.73   0.083    -.0038709     .063512
      parity |  -.0012391   .0253297    -0.05   0.961    -.0508844    .0484062
     lfaminc |    .063646   .0569698     1.12   0.264    -.0480128    .1753048
       _cons |   4.467861    .255852    17.46   0.000     3.966401    4.969322
------------------------------------------------------------------------------
Instrumented:  packs
Instruments:   male parity lfaminc cigprice
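To see why instrumenting can matter when the instrument is good, the following is a minimal Python sketch on simulated data. It does not use the BWIGHT data. Everything in it is an assumption made for illustration: the data-generating process, the true coefficient of -0.5, the strength of the instrument, and the omission of control variables. In a model with a constant, one endogenous regressor and one instrument, the IV (and 2SLS) estimate reduces to cov(z, y) / cov(z, x).

```python
import random

random.seed(0)

def cov(a, b):
    """Sample covariance of two equal-length lists."""
    ma = sum(a) / len(a)
    mb = sum(b) / len(b)
    return sum((p - ma) * (q - mb) for p, q in zip(a, b)) / (len(a) - 1)

n = 20000
true_beta = -0.5  # hypothetical structural effect of x on y

z, x, y = [], [], []
for _ in range(n):
    zi = random.gauss(0, 1)                  # instrument: moves x, unrelated to u
    vi = random.gauss(0, 1)                  # unobserved confounder
    xi = 1.0 * zi + vi + random.gauss(0, 1)  # endogenous regressor
    ui = 2.0 * vi + random.gauss(0, 1)       # error term contains the confounder
    z.append(zi)
    x.append(xi)
    y.append(true_beta * xi + ui)

beta_ols = cov(x, y) / cov(x, x)  # inconsistent: x is correlated with u
beta_iv = cov(z, y) / cov(z, x)   # just-identified IV estimator

print(f"OLS: {beta_ols:.3f}  IV: {beta_iv:.3f}  truth: {true_beta}")
```

In this simulation the instrument is strong by construction, so the IV estimate lands near the assumed true value while OLS is pulled away from it by the confounder. The cigprice results above behave very differently, which is the point of parts d and e.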

At the bottom of the output Stata lists the instrumented variables and the instruments, which include all the exogenous variables (which serve as their own instruments) and the additional instrument cigprice for packs. If we now look at the first stage output we can see that cigprice is not significant, and hence cigprice is not strongly related to packs. It therefore fails at least one of the requirements for a good instrument, and we will return to this in part d.

The second set of results is the 2SLS estimate of the structural equation for birth weight. The difference between OLS and IV in the estimated effect of packs on bwght is huge. Firstly, note the difference in the width of the confidence interval for packs: -1.38 to 2.98 in the IV regression, compared with -0.12 to -0.05 for the OLS. The standard error is 1.11 in the IV regression, which is much higher than the 0.017 in the OLS. The OLS estimate is inconsistent in the presence of endogeneity, so its greater efficiency is of no value, and we know that IV estimates are always less efficient than OLS, but the imprecision of these IV estimates is a cause for concern!

For the OLS estimate, one more pack of cigarettes is estimated to reduce bwght by about 8.4%, and the effect is statistically significant. The IV estimate has the opposite sign, is huge in magnitude (79.7%!), and is not statistically significant. The sign and size of the smoking effect are not realistic. We would not expect smoking to have a positive effect on the birth weight of the child, let alone an effect that big! This result should set alarm bells ringing and demonstrates how far off a poor IV regression can be.

d. Estimate the reduced form for packs. What do you conclude about identification of equation 1 using cigprice as an instrument for packs? What bearing does this conclusion have on your answer to part c?

We have already seen the reduced form for packs, as it is the first stage of our IV regression in part c. The reduced form estimates show that cigprice does not significantly affect packs and in fact has a positive sign, which is the opposite of what we might expect according to economic theory. Therefore, even without more formal testing, we can see that cigprice is a poor instrument for packs, as a) it is not significantly correlated with packs and b) it does not even have the expected sign.

Note this is separate from the problem that cigprice may not truly be exogenous (i.e. uncorrelated with u in equation 1) in the birth weight equation. That is the second thing you would want to consider when evaluating whether an instrument is a good one, but it is not something we can formally test in part e, as we only have one instrument. However, we discussed earlier the potential concerns about the endogeneity of cigprice from a theoretical/intuitive point of view.
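The loss of precision can be made concrete by rebuilding the confidence intervals from the reported coefficients and standard errors. This is a minimal Python sketch, assuming normal critical values throughout; the function name wald_ci is illustrative. Stata uses the normal distribution for ivregress but the t distribution for reg, so the OLS interval computed here differs from the reported one in the later decimal places.

```python
from statistics import NormalDist

def wald_ci(coef, se, level=0.95):
    """Normal-approximation confidence interval: coef +/- z * se."""
    z = NormalDist().inv_cdf(0.5 + level / 2)
    return coef - z * se, coef + z * se

# Coefficient and robust standard error on packs, copied from the outputs above.
iv_lo, iv_hi = wald_ci(0.7971063, 1.111214)
ols_lo, ols_hi = wald_ci(-0.0837281, 0.0174464)

# The ratio of interval widths equals the ratio of standard errors.
width_ratio = (iv_hi - iv_lo) / (ols_hi - ols_lo)

print(f"IV  95% CI: [{iv_lo:.3f}, {iv_hi:.3f}]")
print(f"OLS 95% CI: [{ols_lo:.3f}, {ols_hi:.3f}]")
print(f"IV interval is about {width_ratio:.0f} times wider")
```

The IV interval comfortably contains zero, the OLS point estimate and some implausibly large positive effects, so on its own it says almost nothing about the effect of smoking.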

e. Test formally whether cigprice is a good instrument for packs.

We can test whether an instrument is weak more formally through the command "estat firststage". To obtain the Stock and Yogo tests we include the option forcenonrobust (since they do not work with robust standard errors) and use the option all to report all available tests.

. estat firststage, all forcenonrobust

First-stage regression summary statistics
--------------------------------------------------------------------------
             |            Adjusted      Partial       Robust
    Variable |   R-sq.       R-sq.        R-sq.    F(1,1383)   Prob > F
-------------+------------------------------------------------------------
       packs |  0.0305      0.0276       0.0007      .893693     0.3446
--------------------------------------------------------------------------

Shea's partial R-squared
--------------------------------------------------
             |     Shea's             Shea's
    Variable |  Partial R-sq.   Adj. Partial R-sq.
-------------+------------------------------------
       packs |     0.0007            -0.0014
--------------------------------------------------

Minimum eigenvalue statistic = 1.0018

Critical Values                      # of endogenous regressors:    1
Ho: Instruments are weak             # of excluded instruments:     1
---------------------------------------------------------------------
                                   |    5%     10%     20%     30%
2SLS relative bias                 |         (not available)
-----------------------------------+---------------------------------
                                   |   10%     15%     20%     25%
2SLS Size of nominal 5% Wald test  |  16.38    8.96    6.66    5.53
LIML Size of nominal 5% Wald test  |  16.38    8.96    6.66    5.53
---------------------------------------------------------------------

Firstly, the F statistic (0.894), which tests that the effect of the additional instrument (cigprice) is zero against the alternative that it is non-zero, is well below the rule of thumb value of 10 and is insignificant (p=0.34). This provides strong evidence that our instrument is weak. Shea's partial R-squared also suggests that cigprice has little influence on packs (it adds 0.0007 to the R-squared).

The third table (which only appears with the option forcenonrobust) provides the results of the tests of Stock and Yogo, discussed in the lecture notes. As we only have one additional instrument we cannot run the first test, comparing the bias of the 2SLS estimator with that of OLS (see the lecture notes), so Stata reports this test as not available. The second test provides critical values for the 2SLS and LIML estimators and examines the potential distortion of t tests (or Wald tests), since we know the standard errors will be higher and the t statistics lower. The reported test statistic is the minimum eigenvalue of 1.0018, which suggests we cannot reject the null of weak instruments at any rejection level (since this eigenvalue is smaller than the critical value at each rejection level reported). So again there is strong evidence that cigprice is a weak instrument.

We cannot test whether cigprice is endogenous, since we only have one instrument, but we can consider this theoretically. Note that the user-written command ivreg2 will provide some of these tests, and a wider set of specification tests, directly in the output:

. ivreg2 lbwght male parity lfaminc (packs=cigprice), robust

IV (2SLS) estimation
--------------------

Estimates efficient for homoskedasticity only
Statistics robust to heteroskedasticity

                                                      Number of obs =     1388
                                                      F(  4,  1383) =     2.50
                                                      Prob > F      =   0.0412
Total (centered) SS     =  50.42033363                Centered R2   =  -1.8118
Total (uncentered) SS   =  31499.57971                Uncentered R2 =   0.9955
Residual SS             =  141.7703606                Root MSE      =    .3196

------------------------------------------------------------------------------
             |               Robust
      lbwght |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
       packs |   .7971063   1.111214     0.72   0.473    -1.380833    2.975045
        male |   .0298205   .0171898     1.73   0.083    -.0038709     .063512
      parity |  -.0012391   .0253297    -0.05   0.961    -.0508844    .0484062
     lfaminc |    .063646   .0569698     1.12   0.264    -.0480128    .1753048
       _cons |   4.467861    .255852    17.46   0.000     3.966401    4.969322
------------------------------------------------------------------------------
Underidentification test (Kleibergen-Paap rk LM statistic):              0.898
                                                   Chi-sq(1) P-val =    0.3434
------------------------------------------------------------------------------
Weak identification test (Kleibergen-Paap rk Wald F statistic):          0.894
Stock-Yogo weak ID test critical values: 10% maximal IV size             16.38
                                         15% maximal IV size              8.96
                                         20% maximal IV size              6.66
                                         25% maximal IV size              5.53
Source: Stock-Yogo (2005).  Reproduced by permission.
NB: Critical values are for Cragg-Donald F statistic and i.i.d. errors.
------------------------------------------------------------------------------
Hansen J statistic (overidentification test of all instruments):         0.000
                                                 (equation exactly identified)
------------------------------------------------------------------------------
Instrumented:         packs
Included instruments: male parity lfaminc
Excluded instruments: cigprice
------------------------------------------------------------------------------
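With a single excluded instrument, the robust first-stage F statistic is simply the square of the robust t statistic on cigprice in the first stage. The Python sketch below recovers it from the reported coefficient and standard error and sets it against the usual benchmarks. The variable names are illustrative. As the ivreg2 output itself warns, the Stock-Yogo critical values are derived for the non-robust statistic, so comparing the robust F with them is informal.

```python
# First-stage coefficient and robust standard error on cigprice (from the output).
coef_cigprice = 0.000777
se_cigprice = 0.0008219

t_stat = coef_cigprice / se_cigprice
f_robust = t_stat ** 2          # F(1, 1383) for one excluded instrument

min_eigenvalue = 1.0018         # non-robust statistic from estat firststage

# Stock-Yogo critical values, keyed by maximal size of a nominal 5% Wald test.
stock_yogo = {"10%": 16.38, "15%": 8.96, "20%": 6.66, "25%": 5.53}

below_rule_of_thumb = f_robust < 10
below_all_critical = all(min_eigenvalue < cv for cv in stock_yogo.values())

print(f"robust first-stage F = {f_robust:.3f}")
print(f"below rule of thumb (10): {below_rule_of_thumb}")
print(f"below every Stock-Yogo critical value: {below_all_critical}")
```

Because the inputs are rounded as printed in the output, the recovered F agrees with the reported .893693 only to about three decimal places.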

You may notice that ivreg2 reports both a centered and an uncentered R-squared. ivregress reports a centered R-squared if a constant is included in the regression (the default) and an uncentered R-squared if a constant is not included. Remember that whilst an R-squared value is reported for an IV regression, it is not particularly useful. In OLS the R-squared value is between 0 and 1, but in IV regression it can be negative, as the residual sum of squares (RSS) can be larger than the total sum of squares! IV regression is about producing a consistent estimate of the partial effect of an endogenous variable, rather than trying to estimate a model with the best fit.
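The negative R-squared follows directly from the sums of squares that ivreg2 prints. A minimal Python sketch of the arithmetic, using only the figures in the output above (the variable names are illustrative):

```python
# Sums of squares copied from the ivreg2 output.
tss_centered = 50.42033363     # total SS around the mean of lbwght
tss_uncentered = 31499.57971   # total SS around zero
rss = 141.7703606              # residual SS from the IV regression

# R-squared = 1 - RSS / TSS; negative whenever RSS exceeds TSS.
r2_centered = 1 - rss / tss_centered
r2_uncentered = 1 - rss / tss_uncentered

print(f"centered R2   = {r2_centered:.4f}")
print(f"uncentered R2 = {r2_uncentered:.4f}")
```

The residual sum of squares is nearly three times the centered total sum of squares, which is what drives the centered R-squared to -1.8118.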

I have not used the option first (for brevity), but if you add it, ivreg2 will report the first stage and include a number of further specification tests. You will notice that ivreg2 reports some of the tests reported by ivregress, such as the partial R-squared (reported if you use the option first) and the Stock and Yogo tests.

In addition to these tests, ivreg2 also reports a test for underidentification. Remember we need L1 (the number of additional instruments) to be at least equal to K1 (the number of endogenous variables), i.e. we need at least as many instruments as endogenous variables. The test works by testing the rank of a matrix (we want all columns and rows to be linearly independent, i.e. we cannot write one vector as a function of another); see the ivreg2 help file for more information. The null hypothesis is that the equation is underidentified, i.e. that we do not have enough relevant instruments. The degrees of freedom are equal to the number of instruments minus the number of endogenous variables plus 1. In this case we have 1 endogenous variable and 1 instrument, so the degrees of freedom = L1-K1+1 = 1-1+1 = 1 (remember that the exogenous variables serve as their own instruments). The null hypothesis is that the rank is equal to K1-1, which in this case = 0. You can see from the output that we have a p value of 0.34 (since we use robust standard errors the Kleibergen-Paap tests are reported), so we cannot reject the null that the equation is underidentified. This again indicates that our instrument cigprice is not relevant for packs.

ivreg2 also reports a number of other tests that relate to weak identification and weak instruments (some are only reported with the first option); see the ivreg2 help file for more information about individual tests. All the tests point to cigprice being a weak instrument.
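The degrees of freedom and p value of the underidentification test can be checked by hand. The Python sketch below is illustrative only: it takes the rounded LM statistic from the ivreg2 output and uses the fact that, for one degree of freedom, the upper tail probability of a chi-squared variable at s equals erfc(sqrt(s / 2)). That shortcut is valid only when the degrees of freedom equal 1, as they do here.

```python
import math

L1 = 1  # number of excluded (additional) instruments: cigprice
K1 = 1  # number of endogenous regressors: packs

df = L1 - K1 + 1          # degrees of freedom of the underidentification test
assert df == 1            # the erfc shortcut below holds only for 1 df

lm_stat = 0.898           # Kleibergen-Paap rk LM statistic, as printed

# Upper tail of chi-squared(1): P(X > s) = erfc(sqrt(s / 2)).
p_value = math.erfc(math.sqrt(lm_stat / 2))

print(f"df = {df}, p-value = {p_value:.4f}")
```

The result matches the reported p value of 0.3434 up to the rounding of the test statistic.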

FYI, if you try to export IV results using outreg2 it will only export the second stage results, as these are the results held in Stata's temporary store. However, the user-written command ivregress2 gets round this. To install ivregress2 use the command "ssc install ivregress2". Using the command ivregress2 instead of ivregress stores both stages, and you need to tell Stata which estimates you want to export (either the first or second stage) before each outreg2 command, using the command est restore. To export the OLS and IV results to an Excel file you can use the following set of commands (the Excel file will be saved to the working directory, which you should have set to your preferred directory earlier):

reg lbwght male parity lfaminc packs, robust
outreg2 using ivresults.xls, ctitle(ols)
ivregress2 2sls lbwght male parity lfaminc (packs=cigprice), robust first
est restore first
outreg2 using ivresults.xls, ctitle(first)
est restore second
outreg2 using ivresults.xls, ctitle(second)

You should then be able to open the file ivresults.xls in Excel, and if Excel asks if you are sure you want to open the file, select yes. The output will look something like:

                    (1)            (2)            (3)
VARIABLES           ols            first          second

male                0.0262***      -0.00473       0.0298*
                    (0.0100)       (0.0158)       (0.0172)
parity              0.0147***      0.0181         -0.00124
                    (0.00542)      (0.0115)       (0.0253)
lfaminc             0.0180***      -0.0526***     0.0636
                    (0.00531)      (0.0112)       (0.0570)
packs               -0.0837***                    0.797
                    (0.0174)                      (1.111)
cigprice                           0.000777
                                   (0.000822)
Constant            4.676***       0.137
                    (0.0205)       (0.102)

Observations        1,388          1,388
R-squared           0.035          0.030

Robust standard errors in parentheses
*** p ...

Number of obs     =        935
F                 =      83.48
Prob > F          =     0.0000
R-squared         =     0.2642
Adj R-squared     =     0.2610
Root MSE          =     1.8883

------------------------------------------------------------------------------
        educ |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
...