Metrics MT Notes

Title: Metrics MT Notes
Author: Ding Lin Chng
Course: Principles of Econometrics
Institution: The London School of Economics and Political Science



1. Introduction

- Econometrics exists to answer counterfactual questions → what would happen if we do something?
  o Y_i^T / Y_1i and Y_i^C / Y_0i are the potential outcomes, but we only get to see one of them (Y_i)
  o We are interested in Y_i^T − Y_i^C, but we cannot observe this, therefore we use the average effect E[Y_i^T − Y_i^C]
- We can calculate D = E[Y_i^T | T] − E[Y_i^C | C] = E[Y_i | D_i = 1] − E[Y_i | D_i = 0]
  o But this does not show the treatment effect, as the two groups might be essentially different
  o Doing the math (section 2) shows that D equals the treatment effect plus a SELECTION BIAS term
  o Selection bias captures the difference in potential outcomes between treated and untreated individuals

2. Counterfactuals and Causality

- Usually, we want to assess the causal relationship between 2 variables
  o Correlation doesn't mean causation
  o Sometimes, timing helps to establish causality
  o Three components: treatment, outcome and confounder
    ▪ A confounder is a third variable affecting both ice cream consumption and drowning
  o We need a counterfactual: e.g. what drowning rates would have been for the same ___ if there was no ice cream production
- There are 4 reasons why 2 variables might be correlated
  o A causes B
  o B causes A
  o A third variable C causes both A and B (omitted variable)
  o Dumb luck (usually can be quantified/ruled out by statistical techniques)
- Potential outcomes
  o Y_1i if D_i = 1, Y_0i if D_i = 0
- Observed outcome
  o Y_i = Y_1i if D_i = 1, Y_i = Y_0i if D_i = 0
    ▪ Y_i = Y_0i + (Y_1i − Y_0i)·D_i
  o In practice, we often look at average outcomes for particular groups in the data
    ▪ E[Y_i | D_i = 1] and E[Y_i | D_i = 0] (the average of each group, not of the whole population)
- The selection problem
  o Observed difference in expected outcomes = treatment effect + selection bias
    ▪ E[Y_i | D_i = 1] − E[Y_i | D_i = 0] = E[Y_1i − Y_0i | D_i = 1] + E[Y_0i | D_i = 1] − E[Y_0i | D_i = 0]
  o Selection bias might be caused by
    ▪ Reverse causation: the less healthy are more likely to buy insurance
    ▪ Confounders/omitted variables: e.g. the more educated buy more insurance and would also be healthier
- Comparing outcomes between treated and untreated observations may not reveal the causal effect of treatment (see the simulation sketch below)
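
A minimal simulation sketch of the decomposition above (Python; not part of the original notes, all numbers invented): when people with worse untreated outcomes select into treatment, the naive difference in means mixes the treatment effect with a negative selection-bias term.

```python
# Sketch: selection bias in a naive comparison of means (illustrative numbers only).
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Potential outcomes: Y0 is the baseline, the treatment adds a constant 2.
y0 = rng.normal(10, 2, n)
y1 = y0 + 2.0                      # true treatment effect = 2 for everyone

# Selection: people with worse baseline outcomes are more likely to take the treatment.
d = (y0 + rng.normal(0, 2, n) < 10).astype(int)

y = np.where(d == 1, y1, y0)       # we only observe one potential outcome per person

naive_diff = y[d == 1].mean() - y[d == 0].mean()
effect_on_treated = (y1 - y0)[d == 1].mean()                 # E[Y1 - Y0 | D = 1] = 2
selection_bias = y0[d == 1].mean() - y0[d == 0].mean()       # E[Y0 | D = 1] - E[Y0 | D = 0]

print(f"naive difference in means        : {naive_diff:.2f}")
print(f"avg treatment effect (on treated): {effect_on_treated:.2f}")
print(f"selection bias                   : {selection_bias:.2f}")
# The naive difference equals the treatment effect plus the (here negative) selection bias.
```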

3. Experiments

- Selection bias is related to the assignment rule
  o We want: E[Y_0i | D_i = 1] − E[Y_0i | D_i = 0] = 0
  o If those who are treated are essentially different from those who are not, the equation doesn't hold (we have a selection bias problem)
- Experiments use random assignment to solve the selection problem (see the sketch after this section)
  o Treatment (D_i) is randomly assigned → treatment is statistically independent of potential outcomes → selection bias is 0 in expectation (the untreated are a valid counterfactual)
    ▪ E[Y_0i | D_i = 1] − E[Y_0i | D_i = 0] = 0
    ▪ E[Y_i | D_i = 1] − E[Y_i | D_i = 0] = E[Y_1i − Y_0i | D_i = 1] + E[Y_0i | D_i = 1] − E[Y_0i | D_i = 0] = E[Y_1i − Y_0i]
  o Random assignment balances potential outcomes in expectation between treatment and control groups → gives us a valid counterfactual
    ▪ This balancing extends to both observed and unobserved characteristics (motivations, incentives, behaviours, etc.)
    ▪ There will be no systematic differences in characteristics between the 2 groups (e.g. a similar share of men and women)
  o Random assignment balances characteristics (idiosyncratic differences) only in expectation → if the sample size is too small, it does not resolve the selection bias problem
- Randomized experiments have had a long history (scurvy experiments, agricultural trials)
  o They became influential in economics relatively recently (RAND Health Insurance Experiment to evaluate the impact of health insurance on health care usage, Tennessee STAR project to evaluate the effect of school class sizes)
  o Experiments are common in the business world (known as A/B testing)
- Experiments cannot help us answer everything (ethics – choosing a control group to experience life without vaccines; feasibility)
- Quantitative experiments require data from both groups, qualitative experiments require data only on treated individuals
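
A short companion sketch (Python, invented numbers) of why random assignment removes the selection-bias term: with a coin-flip D_i, the treated and untreated groups have the same Y_0 in expectation, so the difference in observed means recovers the causal effect.

```python
# Sketch: under random assignment, E[Y0 | D=1] - E[Y0 | D=0] is approximately zero.
import numpy as np

rng = np.random.default_rng(1)
n = 100_000

y0 = rng.normal(10, 2, n)
y1 = y0 + 2.0                              # constant treatment effect of 2

d = rng.integers(0, 2, n)                  # coin-flip assignment, independent of potential outcomes
y = np.where(d == 1, y1, y0)

print("selection bias term:", y0[d == 1].mean() - y0[d == 0].mean())   # ~0
print("difference in means:", y[d == 1].mean() - y[d == 0].mean())     # ~2, the causal effect
```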

4. Standard Error of the Mean

- We usually want to compare mean outcomes in 2 different groups
  o Compare means when estimating treatment effects or checking for balance across treatment groups (e.g. gender balance) → how likely is it that the difference in means arose by chance?
  o Our estimates can vary because of
    ▪ Systematic differences
    ▪ Imprecision in measurement
    ▪ Sampling/idiosyncratic differences
- We use samples to learn about populations
  o Therefore, we need to decide how much sampling variation is acceptable
  o Under general conditions, as the sample size grows to infinity, the sample mean will get arbitrarily close to the population mean → idiosyncratic differences will disappear if the group is large enough
  o Law of large numbers: the sample mean approaches the population mean as the sample size increases
  o The target is the population mean, the expectation of the population distribution of the data: μ := E[Y_i]
  o The sample mean is an estimator for the population mean: Avg_n(Y_i) := Ȳ = (1/n) Σ Y_i
  o Sampling variation is about assessing the distribution of Ȳ − μ
- The sample average has some desirable properties
  o Unbiasedness: E[Ȳ] = E[Y_i] = μ
    ▪ If we take many repeated random samples, the mean of the sample means equals the population mean
    ▪ Unbiasedness holds for samples of any size, no need to be large
    ▪ On average, the sample mean does not deviate from the population mean in any direction
- Unbiasedness implies that E[Ȳ − μ] = 0, but for idiosyncratic reasons our estimated sample mean will differ from the population mean
  o For a random variable, Var(X_i) = E[(X_i − μ_x)²]
  o We are interested in the variance of the sample mean, Var(Ȳ) = E[(Ȳ − E[Ȳ])²]
    ▪ I.e. the variance of sample averages if we drew many samples
    ▪ Var(Ȳ) = Var((1/n) Σ Y_i)
    ▪ Since we are taking random samples, Cov(Y_i, Y_j) = 0
    ▪ Therefore, using the rules for the variance of a linear combination, Var(Ȳ) = (1/n²) Σ Var(Y_i)
  o Var(Ȳ) = σ²/n
  o The variance of the sample mean depends on
    ▪ The variance of the underlying observations
    ▪ The sample size
  o The sampling standard deviation of the sample mean is SE(Ȳ) = σ/√n
    ▪ Since E[(Y_i − μ)²] can be estimated from the sample, we can estimate the SE with the sample standard deviation: ŜE(Ȳ) = s_Y/√n, where s_Y² = (1/(n−1)) Σ (Y_i − Ȳ)²
- We want to assess how likely it is that an observed difference between means arose due to chance (see the sketch below)
  o Do this by calculating the SE of Ȳ_1 − Ȳ_0
  o Since random assignment gives independent samples, V̂ar(Ȳ_1 − Ȳ_0) = V̂ar(Ȳ_1) + V̂ar(Ȳ_0) = s_1²/n_1 + s_0²/n_0
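
A quick numerical check (Python, simulated data, invented parameters) that the estimated SE(Ȳ) = s/√n matches the actual spread of sample means across repeated samples, and how the SE of a difference in means is built from the two group SEs.

```python
# Sketch: SE(Ybar) = s / sqrt(n) versus the spread of sample means across repeated samples.
import numpy as np

rng = np.random.default_rng(2)
n, reps = 200, 5_000
pop_mean, pop_sd = 5.0, 3.0

# One sample: estimate the SE from the data alone.
y = rng.normal(pop_mean, pop_sd, n)
se_hat = y.std(ddof=1) / np.sqrt(n)

# Many repeated samples: the standard deviation of the sample means.
means = rng.normal(pop_mean, pop_sd, (reps, n)).mean(axis=1)
print(f"estimated SE from one sample : {se_hat:.3f}")
print(f"sd of sample means (sim)     : {means.std(ddof=1):.3f}")
print(f"theoretical sigma/sqrt(n)    : {pop_sd/np.sqrt(n):.3f}")

# SE of a difference in means between two independent (randomly assigned) groups.
y1, y0 = rng.normal(6, 3, n), rng.normal(5, 3, n)
se_diff = np.sqrt(y1.var(ddof=1)/n + y0.var(ddof=1)/n)
print(f"SE of difference in means    : {se_diff:.3f}")
```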

5. Inference about Means

- We want to know if an observed difference in means is the effect of the treatment (a systematic difference) or just idiosyncratic variation
  o While random assignment ensures that on average other differences should be balanced, in practice some differences can arise from noise
  o We use hypothesis testing to quantify how likely/unlikely a difference is
    ▪ Typically, the null hypothesis is that the treatment had no effect
- E.g. test the hypothesis that providing free care did not affect total health expenditures (see the worked sketch below)
  o H0: μ_0 = 0
  o Form a t-statistic: t = ((Ȳ_1 − Ȳ_2) − μ_0) / ŜE(Ȳ_1 − Ȳ_2)
  o Central limit theorem: for a large enough sample, the sample mean of n i.i.d. random variables with finite mean and variance is normally distributed → the underlying random variable does not need to follow any particular distribution
  o This means that in large samples, the t-statistic for the sample average has a standard normal distribution, allowing for further inference
    ▪ We can use the t-statistic for classical hypothesis tests, p-values for the hypothesis, and confidence intervals
  o The significance level determines the critical values for our test → reject the null when the t-statistic is further away from zero than the critical value
    ▪ Pr(|t| > t_crit) = 2Φ(−t_crit); evaluating this at the observed t gives the p-value
  o When we are testing our hypothesis, we can make 2 types of mistakes
    ▪ Type-I error (size): rejecting the null when it is in fact true
      • Given by the significance level
    ▪ Type-II error (power): failing to reject the null when it is false
      • Power = 1 − Pr(type-II error)
    ▪ Usually, type-I errors are considered more serious
- Classical hypothesis testing can be dissatisfying → we reject whether the t-statistic is 2.0 or 2.6 if the critical value is 1.96 → this wastes information that we have
- P-values compute the probability of observing an estimate at least as extreme as the actual estimate → p = 2Φ(−|t|)
  o P-values are a more informative reparameterization of the test results, rather than simply reject/do not reject
- Confidence intervals report the set of all values of μ_0 which could not be rejected at a particular level
  o NOTE: we are solving the CI for the estimate, not the t-statistic
  o CI_{1−α} = [(Ȳ_1 − Ȳ_2) − t_crit·ŜE, (Ȳ_1 − Ȳ_2) + t_crit·ŜE]
    ▪ CI_0.95 ≈ (Ȳ_1 − Ȳ_2) ± 2·ŜE
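
A by-hand sketch (Python; the numbers are made up and only stand in for the free-care example) of the t-statistic, the large-sample p-value 2Φ(−|t|) and the 95% confidence interval.

```python
# Sketch: two-sample t statistic, p value and 95% CI computed with the formulas above.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(3)
n1, n0 = 400, 400
treat = rng.normal(10.5, 4, n1)            # e.g. spending in the free-care group (simulated)
control = rng.normal(10.0, 4, n0)

diff = treat.mean() - control.mean()
se = np.sqrt(treat.var(ddof=1)/n1 + control.var(ddof=1)/n0)

t_stat = diff / se                         # H0: true difference = 0
p_value = 2 * norm.cdf(-abs(t_stat))       # large-sample (normal) approximation
ci_95 = (diff - 1.96 * se, diff + 1.96 * se)

print(f"difference: {diff:.3f}, SE: {se:.3f}")
print(f"t = {t_stat:.2f}, p = {p_value:.3f}, 95% CI = ({ci_95[0]:.3f}, {ci_95[1]:.3f})")
# Reject H0 at the 5% level when |t| > 1.96, i.e. when the CI excludes zero.
```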

- Economic and statistical significance are different → even if we reject H0, that alone cannot tell us whether the difference in means is large or small
- When comparing differences in means (see the power sketch below)
  o Larger samples are more precise → the sample mean will be closer to the true population mean
  o Bigger effects are easier to find → the smaller the effect we are trying to find, the more observations we will need to detect it
  o A given effect size is harder to detect in noisier environments → when there is less residual variation in an outcome, it is easier to detect an effect
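
A rough power simulation (Python; effect sizes, sample sizes and noise levels are illustrative choices) of the three points above: the rejection rate rises with sample size, with effect size, and as residual noise falls.

```python
# Sketch: power of the two-sample test as effect size, sample size and noise vary (simulated).
import numpy as np

rng = np.random.default_rng(4)

def power(effect, n, sd, reps=2_000):
    """Share of simulated experiments in which |t| exceeds 1.96."""
    treat = rng.normal(effect, sd, (reps, n))
    control = rng.normal(0.0, sd, (reps, n))
    diff = treat.mean(axis=1) - control.mean(axis=1)
    se = np.sqrt(treat.var(axis=1, ddof=1)/n + control.var(axis=1, ddof=1)/n)
    return np.mean(np.abs(diff / se) > 1.96)

print("bigger samples :", power(0.2, 100, 1.0), "->", power(0.2, 1000, 1.0))
print("bigger effects :", power(0.1, 400, 1.0), "->", power(0.5, 400, 1.0))
print("less noise     :", power(0.2, 400, 2.0), "->", power(0.2, 400, 0.5))
```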

6. Statistics Review

- The mean is a useful measure of central tendency, but
  o It is sensitive to outliers
  o It can be easily misinterpreted
  o E.g. the incidence of kidney cancer in the US is lowest in rural, sparsely populated counties that are traditionally Republican → but this might be due to low population, or because people are just not diagnosed → law of small numbers (1/10 vs 1/1000 are very different)
- What is the probability of landing another heads after 3 heads?
  o Hot-hand fallacy: expecting winning streaks to continue ("I'm on a roll…")
  o Gambler's fallacy: unreasonably expecting losing streaks to reverse ("tails was due because…")
- Punishment vs praise → Daniel Kahneman said that rewards for improved performance work better than punishment for mistakes
- Most outcomes are the combination of some underlying ability and some luck → if we score well now, we probably knew the material well and were a little lucky → we will probably do worse next time

7. Bivariate Regression

- We want to use regression as a tool to describe data
  o A simple way is to draw a line through all the data points that fits as closely as possible
    ▪ E.g. TestScore = α + β·STRatio → has the functional form of a straight line
    ▪ β tells us the difference in test scores for each unit difference in the student–teacher ratio
- We can summarize the relationship between 2 variables in a simple linear regression model: TestScore_i = α + β·STRatio_i + ε_i
  o ε_i is the error term: how far away the actual test score for district i is from the test score we would predict using the fitted line → it captures all other factors influencing test scores besides our estimated linear relationship
  o Y_i → dependent variable/outcome; X_i → independent variable/regressor/covariate; ε_i → residual; α̂ → estimate of the intercept; β̂ → estimate of the slope coefficient
- The most common method is ordinary least squares (OLS) estimation → it finds the regression coefficients that make the fitted line as close as possible to the data

  o OLS estimation minimizes the sum of the squared deviations of the data from the regression line
  o e_i = TestScore_i − predicted TestScore_i = TestScore_i − (α̂ + β̂·STRatio_i) → the residual → the distance of each point from our fitted line
  o The regression line minimizes the sum of these residuals squared
    ▪ Residual Sum of Squares (RSS) = E[(Y_i − α − β·X_i)²]
- After taking the derivatives to derive the FOC, the solutions are β̂ = Cov(Y_i, X_i) / Var(X_i) and α̂ = E(Y_i) − β̂·E(X_i) (see the sketch below)
- Remember
  o Residuals have mean 0, and residuals are uncorrelated with the regressors
    ▪ The regression partitions the variation in Y_i into 2 orthogonal (uncorrelated) parts: the regression line and the residual
- When interpreting the regression, say that something is ASSOCIATED with something; be careful not to imply causation
- Stata command for regressing Y on X: reg testscr str
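
A sketch (Python, simulated data; the variable names only mimic the test-score example and the coefficients are invented) computing OLS by hand via β̂ = Cov(Y, X)/Var(X) and checking the two residual properties listed above.

```python
# Sketch: bivariate OLS "by hand" using beta = Cov(Y, X)/Var(X).
import numpy as np

rng = np.random.default_rng(5)
n = 500
str_ratio = rng.uniform(14, 26, n)
test_score = 700 - 2.5 * str_ratio + rng.normal(0, 10, n)     # simulated, not real data

beta_hat = np.cov(test_score, str_ratio, ddof=1)[0, 1] / np.var(str_ratio, ddof=1)
alpha_hat = test_score.mean() - beta_hat * str_ratio.mean()

fitted = alpha_hat + beta_hat * str_ratio
residuals = test_score - fitted

print(f"alpha_hat = {alpha_hat:.1f}, beta_hat = {beta_hat:.2f}")
print(f"mean of residuals          : {residuals.mean():.2e}")                          # ~0
print(f"corr(residuals, regressor) : {np.corrcoef(residuals, str_ratio)[0, 1]:.2e}")   # ~0
```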

8. Bivariate Inference

- Once we have estimated parameters in a regression model, we typically want to test hypotheses, construct confidence intervals and calculate p-values
- The OLS estimator has sampling variation
  o Estimated values from different samples might not match the population regression values
  o Therefore, the estimator for the regression slope is a random variable (dependent on the sample we drew) and has sampling variation
  o To calculate the sampling variation for the OLS slope
    ▪ For the sample average, we could calculate the precise distribution of Ȳ across repeated samples and then calculate the variance of the mean → but this method only works with strong assumptions
    ▪ We can use the CLT to derive the sampling variance
    ▪ Recall the sampling standard deviation for the sample average: SE(Ȳ) = √(Var(Y_i)/n)
    ▪ For the bivariate regression, SE(β̂) = √(Var(e_i) / (n·Var(X_i)))
    ▪ In practice, we look at the estimated standard error: ŜE(β̂) = √(V̂ar(e_i) / (n·V̂ar(X_i)))
  o From the equation, β̂ will be more precise (lower SE) when
    ▪ There is less noise in the model (Var(e) is smaller)
    ▪ We have more observations
    ▪ We have more variation in our explanatory variable (Var(X) is larger)
- Conventional estimates of the SE assume the residual variance is unrelated to the regressor, but it is often related
  o Homoskedasticity: the dispersion of the residuals is unrelated to X_i
    ▪ E[e_i² | X_i] = Var(e_i) = σ², a constant
  o Heteroskedasticity: the dispersion of the residuals is related to X_i
    ▪ In such cases, conventional SEs are no longer valid → use robust SEs
  o Traditional econometrics takes homoskedasticity as a starting point, but modern applied work is cautious and treats residuals as heteroskedastic unless proven otherwise
- Robust SEs add on a potential interaction between X and e (see the sketch below)
  o RSE(β̂) = √( V̂ar((X_i − E[X_i])·e_i) / (n·[V̂ar(X_i)]²) )
  o Stata command: reg testscr str, robust
  o If there is not much difference between the conventional SE and the robust SE, there is little relationship between e_i and x_i
  o If there is a large difference, there is some form of relationship between e_i and x_i → we would have overstated the confidence in our estimated relationship (we would have rejected H0 at a lower significance level)
  o Note: RSE does not work well for relatively small data sets
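
A sketch (Python, simulated heteroskedastic data with invented coefficients) comparing the conventional SE with the robust SE formula reconstructed above; in this setup the robust SE comes out noticeably larger.

```python
# Sketch: conventional vs robust SE, RSE = sqrt(Var[(X - Xbar) e] / (n * Var(X)^2)).
import numpy as np

rng = np.random.default_rng(7)
n = 2_000
x = rng.uniform(1, 10, n)
e = rng.normal(0, 1, n) * x              # error dispersion grows with x: heteroskedastic
y = 2.0 + 0.5 * x + e

b = np.cov(y, x, ddof=1)[0, 1] / np.var(x, ddof=1)
a = y.mean() - b * x.mean()
resid = y - a - b * x

se_conv = np.sqrt(np.var(resid, ddof=1) / (n * np.var(x, ddof=1)))
se_robust = np.sqrt(np.var((x - x.mean()) * resid, ddof=1) / (n * np.var(x, ddof=1) ** 2))

print(f"conventional SE : {se_conv:.4f}")
print(f"robust SE       : {se_robust:.4f}")   # larger here, because of the heteroskedasticity
```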

9. Multivariate Regressions

- To understand how regression can be used to quantify a causal effect, we require a valid counterfactual (observations that would have been the same but for some treatment)
- Multivariate regression allows us to control for other factors that might affect the outcome
  o Adding more explanatory variables allows us to (i) explain more of the variation in outcomes, (ii) incorporate more general functional-form relationships, and (iii) move towards establishing causality
- Example: TestScore_i = α + β_1·STRatio_i + β_2·ESL_i + ε_i
  o β_1 represents the expected difference in test scores between observations with one unit difference in class size, holding ESL_i constant
  o β_1 is the partial derivative of TestScore_i with respect to STRatio_i
- The relationship between the long regression and the short regression is governed by the omitted variable bias formula (see the sketch below)
  o Long regression: Y_i = α + β·S_i + γ·X_i + e_i
  o Short regression: Y_i = α^s + β^s·S_i + e^s_i
  o OVB formula: β^s = Cov(Y_i, S_i)/Var(S_i) = β + γ·δ_XS, where δ_XS = Cov(X_i, S_i)/Var(S_i) (derivation)
  o δ_XS is the regression coefficient from regressing the omitted variable X_i on S_i: X_i = δ_0 + δ_XS·S_i + u_i
  o If the omitted variable is uncorrelated with the included regressor, there will be no OVB
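
A numerical check (Python, simulated data with invented coefficients) that the OVB identity β^s = β + γ·δ_XS holds exactly for the OLS estimates.

```python
# Sketch: beta_short = beta_long + gamma * delta_XS, verified on simulated data.
import numpy as np

rng = np.random.default_rng(8)
n = 10_000

s = rng.normal(0, 1, n)                              # included regressor
x = 0.7 * s + rng.normal(0, 1, n)                    # omitted variable, correlated with s
y = 1.0 + 2.0 * s + 3.0 * x + rng.normal(0, 1, n)    # "long" model with beta=2, gamma=3

def slope(b, a):
    """OLS slope from a bivariate regression of b on a (with a constant)."""
    return np.cov(b, a, ddof=1)[0, 1] / np.var(a, ddof=1)

beta_short = slope(y, s)                             # short regression of y on s only
delta_xs = slope(x, s)                               # regression of the omitted variable on s

# Long regression of y on a constant, s and x.
coefs = np.linalg.lstsq(np.column_stack([np.ones(n), s, x]), y, rcond=None)[0]
beta_long, gamma = coefs[1], coefs[2]

print(f"beta_short              : {beta_short:.4f}")
print(f"beta_long + gamma*delta : {beta_long + gamma * delta_xs:.4f}")   # equal up to rounding
```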



- It is okay to exclude covariates uncorrelated with the regressor of interest → it might affect the precision of our estimates, but it is not going to affect the bias of the estimate
- OVB is the mathematical relationship between the coefficients in any 2 regressions that are the same, except that one regression contains one or more additional regressors not included in the other
  o Sometimes we talk about OVB as referring to selection bias
    ▪ Selection bias is the bias that results from not having a valid counterfactual (when the determinants of the outcome are not the same in the treatment and control groups) → e.g. bigger classes may contain students that would have performed poorly even in a smaller class
    ▪ We use OVB to think about selection bias in regressions
  o The OVB formula gives us a guide to the possible bias when we cannot include a relevant variable
- However, it is more convenient to work with bivariate regressions → we can always turn a multiple regression into a bivariate regression using the regression anatomy formula, which makes it easier to work with multivariate data (see the sketch below)
  o Y_i = α + β_1·X_1i + β_2·X_2i + e_i
  o X_1i = π̂_0 + π̂_1·X_2i + X̃_1i → calculate the residuals, X̃_1i
  o Then, β_1 = Cov(Y_i, X̃_1i) / Var(X̃_1i) (derivation)
  o The process of first regressing X_1 on the other covariates and then running a bivariate regression is called partialling out covariates
  o The regression anatomy can also be written as β_1 = Cov(Ỹ_i, X̃_1i) / Var(X̃_1i), meaning we partial X_2 out of the dependent variable as well
  o After we partial out a variable from the regression, we can plot it (meaning we take away the variation in the regressor that is explained by the other variable)
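
A sketch (Python, simulated data, invented coefficients) of regression anatomy: the bivariate slope of Y on the residualised X̃_1 equals the multivariate coefficient on X_1.

```python
# Sketch: "partialling out" (regression anatomy) reproduces the multivariate coefficient.
import numpy as np

rng = np.random.default_rng(9)
n = 10_000
x2 = rng.normal(0, 1, n)
x1 = 0.5 * x2 + rng.normal(0, 1, n)
y = 1.0 + 2.0 * x1 - 1.5 * x2 + rng.normal(0, 1, n)

def slope(b, a):
    return np.cov(b, a, ddof=1)[0, 1] / np.var(a, ddof=1)

# Step 1: regress X1 on X2 (with a constant) and keep the residual.
pi1 = slope(x1, x2)
x1_tilde = x1 - (x1.mean() - pi1 * x2.mean()) - pi1 * x2

# Step 2: bivariate regression of Y on the residualised X1.
beta1_anatomy = slope(y, x1_tilde)

# Compare with the coefficient on X1 from the long regression.
beta1_long = np.linalg.lstsq(np.column_stack([np.ones(n), x1, x2]), y, rcond=None)[0][1]
print(f"beta1 via regression anatomy  : {beta1_anatomy:.4f}")
print(f"beta1 from the long regression: {beta1_long:.4f}")   # equal up to rounding
```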

10. Multivariate Inference

- The standard error changes when going from a short to a long regression → two opposing forces: (i) there is less noise, because the other variable explains some of the variation in Y, but (ii) some of the variation in X_1 is explained by X_2
  o Long regression: Y_i = α + β·X_1i + γ·X_2i + e_i
  o Short regression: Y_i = α^s + β^s·X_1i + e^s_i
  o SE(β̂^s) = √(Var(e^s_i) / (n·Var(X_1i))), SE(β̂) = √(Var(e_i) / (n·Var(X̃_1i)))
    ▪ Var(e) ≤ Var(e^s)
    ▪ Var(X̃_1) ≤ Var(X_1)
    ▪ Therefore, we can't tell which SE is larger (see the sketch below)
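
A sketch (Python, simulated data, parameter values invented) showing that the long-regression SE can be either smaller or larger than the short-regression SE, depending on how much of Y versus how much of X_1 the added control explains.

```python
# Sketch: SE(beta1_hat) can move either way when a control is added (formulas from the notes).
import numpy as np

rng = np.random.default_rng(10)
n = 5_000

def ses(rho, gamma):
    """Return (SE short, SE long) for y = 2*x1 + gamma*x2 + u with corr(x1, x2) ~ rho."""
    x2 = rng.normal(0, 1, n)
    x1 = rho * x2 + np.sqrt(1 - rho**2) * rng.normal(0, 1, n)
    y = 2.0 * x1 + gamma * x2 + rng.normal(0, 1, n)

    def slope(b, a):
        return np.cov(b, a, ddof=1)[0, 1] / np.var(a, ddof=1)

    # Short regression: residuals and SE(beta_short).
    b_s = slope(y, x1)
    e_s = y - (y.mean() - b_s * x1.mean()) - b_s * x1
    se_short = np.sqrt(np.var(e_s, ddof=1) / (n * np.var(x1, ddof=1)))

    # Long regression: residuals and residualised x1 for SE(beta_long).
    X = np.column_stack([np.ones(n), x1, x2])
    e_l = y - X @ np.linalg.lstsq(X, y, rcond=None)[0]
    pi = slope(x1, x2)
    x1_tilde = x1 - (x1.mean() - pi * x2.mean()) - pi * x2
    se_long = np.sqrt(np.var(e_l, ddof=1) / (n * np.var(x1_tilde, ddof=1)))
    return se_short, se_long

se_s, se_l = ses(rho=0.1, gamma=3.0)
print(f"x2 explains much of y, little of x1: short {se_s:.4f}, long {se_l:.4f}")   # long SE smaller
se_s, se_l = ses(rho=0.9, gamma=0.1)
print(f"x2 explains little of y, much of x1: short {se_s:.4f}, long {se_l:.4f}")   # long SE larger
```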

- In multivariate regression, we can test single hypotheses involving multiple coefficients, and test multiple hypotheses at the same time
  o Single hypothesis with multiple coefficients
    ▪ H0: γ_1 = γ_2; t-statistic: t = (γ̂_1 − γ̂_2) / SE(γ̂_1 − γ̂_2)
    ▪ Since Var(γ̂_1 − γ̂_2) = Var(γ̂_1) + Var(γ̂_2) − 2·Cov(γ̂_1, γ̂_2), the t-statistic is t = (γ̂_1 − γ̂_2) / √(Var(γ̂_1) + Var(γ̂_2) − 2·Cov(γ̂_1, γ̂_2))
  o Multiple hypotheses at the same time
    ▪ We cannot just combine single t-statistics → the estimated coefficients γ̂_1 and γ̂_2 will be correlated and we need to account for it
    ▪ Even if they are not correlated, rejecting the joint hypothesis whenever either of the 2 t-tests rejects would reject too often
    ▪ We need to perform an F-test to test a joint hypothesis (see the sketch below)
    ▪ First, estimate the model of interest (unrestricted model): TestScore_i = α + β·ClassSize_i + γ_1·pct.e.learners_i + γ_2·pct.free.lunch_i + e^U_i
    ▪ Re-estimate the model enforcing the null hypothesis γ_1 = γ_2 = 0 (restricted model): TestScore_i = α + β·ClassSize_i + e^R_i
    ▪ Intuition behind the F-test: we assess how much worse our fit gets when we force H0 to be true → if our null hypothesis is true, how surprising would it be for the sum of our squared residuals to increase by this much?
    ▪ F-test statistic: F = [(SSR_R − SSR_U)/q] / [SSR_U/(n − k − 1)], which follows an F-distribution with degrees of freedom (q, n − k − 1)
  o Stata command: test el_pct = meal_pct (tests that the 2 coefficients are equal)
  o Stata command: test el_pct meal_pct (defaults to testing that both coefficients = 0)
  o Stata automatically reports the F-test for excluding all regressors
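
A by-hand F-test sketch (Python, simulated data; the variable names only echo the class-size example above and the coefficients are invented), computing F from the restricted and unrestricted SSRs and the p-value from the F-distribution.

```python
# Sketch: how much does the SSR rise when the null gamma1 = gamma2 = 0 is imposed?
import numpy as np
from scipy.stats import f as f_dist

rng = np.random.default_rng(11)
n = 1_000
class_size = rng.normal(20, 3, n)
pct_el = rng.uniform(0, 50, n)
pct_lunch = rng.uniform(0, 80, n)
score = 700 - 1.5 * class_size - 0.4 * pct_el - 0.3 * pct_lunch + rng.normal(0, 10, n)

def ssr(y, X):
    """Sum of squared OLS residuals from regressing y on X (X already includes a constant)."""
    beta = np.linalg.lstsq(X, y, rcond=None)[0]
    return np.sum((y - X @ beta) ** 2)

ones = np.ones(n)
ssr_u = ssr(score, np.column_stack([ones, class_size, pct_el, pct_lunch]))   # unrestricted
ssr_r = ssr(score, np.column_stack([ones, class_size]))                      # restricted

q, k = 2, 3                              # 2 restrictions; 3 regressors in the unrestricted model
F = ((ssr_r - ssr_u) / q) / (ssr_u / (n - k - 1))
p_value = f_dist.sf(F, q, n - k - 1)
print(f"F = {F:.1f}, p-value = {p_value:.4f}")   # F is large here because the null is false
```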

- The regression splits each observed value of y_i into an explained part (ŷ_i) and the unexplained residual (e_i) → this leads to a natural decomposition of the variance (see the sketch below)
  o SST (total sum of squares) = Σ (y_i − ȳ)²
  o SSE (explained sum of squares) = Σ (ŷ_i − ȳ)²
  o SSR (sum of squared residuals) = Σ e_i² = Σ (y_i − ŷ_i)²
    ▪ SST = SSR + SSE
  o R-squared reports the fraction of the total variation (SST) that is explained by the regression (SSE/SST)
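
A quick numerical check (Python, simulated data) of the decomposition SST = SSR + SSE and of R² = SSE/SST.

```python
# Sketch: variance decomposition and R-squared for a bivariate regression.
import numpy as np

rng = np.random.default_rng(12)
n = 1_000
x = rng.normal(0, 1, n)
y = 1.0 + 2.0 * x + rng.normal(0, 2, n)

X = np.column_stack([np.ones(n), x])
beta = np.linalg.lstsq(X, y, rcond=None)[0]
y_hat = X @ beta
e = y - y_hat

sst = np.sum((y - y.mean()) ** 2)
sse = np.sum((y_hat - y.mean()) ** 2)
ssr = np.sum(e ** 2)

print(f"SST = {sst:.1f}, SSR + SSE = {ssr + sse:.1f}")          # equal
print(f"R-squared = {sse / sst:.3f}  (= 1 - SSR/SST = {1 - ssr / sst:.3f})")
```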

