TransWikia.com

Omitted Variable Bias and descriptive statements

Economics Asked by robot112 on February 23, 2021

Let’s say $y=c+ax+by+error$ (where the error term fulfills all the assumptions) describes reality. If we have $z=vx$, then $y=h+max+s$ will also describe reality. When reading about OVB I have seen statements like “if the true model is $y=c+ax+by+error$ and $x$ and $z$ are correlated, then regressing only on $x$ will overestimate (or underestimate) the true coefficient of $x$.” But this seems to be trying to establish causality? If we are only talking about descriptive models, then there could be many different correct coefficients for $x$, depending on the variables included in the model, couldn’t there?

One Answer

This is not about causality. Causality is the claim that $x$ causes $y$; it is not the value of the coefficient ($a$ in your model). Even if you care only about the coefficient estimates, you will get the wrong answer. This is because the estimate of the coefficient $a$, written $\hat{a}$, from a simple OLS regression of $y$ on $x$ alone can be expressed (in large samples) as:

$$\hat{a} = a + \beta \frac{\text{COV}(x,z)}{\text{VAR}(x)}$$

That is, the estimate of $a$ you get from OLS is the true value $a$ plus the true value of $\beta$ (the coefficient on $z$ if you actually included it in the model by estimating $y=c+ax+\beta z + error$) multiplied by the covariance of $x$ and $z$ and divided by the variance of $x$. The variance of $x$ must be non-zero just to be able to estimate OLS in the first place, and a variance is never negative, so whether $a$ is over- or underestimated is determined by the signs of the covariance and of $\beta$: the bias is positive when they share a sign and negative when they differ.
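
A quick way to see the formula at work is a small simulation. The sketch below is only an illustration under assumed settings: the data-generating process, the sample size, the seed, and all variable names are hypothetical and not part of the original answer. It regresses $y$ on $x$ alone and compares the resulting slope with $a + \beta\,\text{COV}(x,z)/\text{VAR}(x)$.

```python
import numpy as np

rng = np.random.default_rng(0)   # fixed seed so the run is reproducible
n = 200_000                      # assumed sample size, large to limit noise

# Hypothetical data: x and z are correlated by construction.
x = rng.normal(size=n)
z = 0.6 * x + rng.normal(size=n)
u = rng.normal(size=n)

# Illustrative "true" coefficients for y = c + a*x + beta*z + u.
c, a, beta = 10.0, -0.3, -0.2
y = c + a * x + beta * z + u

# Regression with z omitted: y on a constant and x.
X_short = np.column_stack([np.ones(n), x])
a_hat_short = np.linalg.lstsq(X_short, y, rcond=None)[0][1]

# Regression with z included: y on a constant, x and z.
X_long = np.column_stack([np.ones(n), x, z])
_, a_hat_long, beta_hat_long = np.linalg.lstsq(X_long, y, rcond=None)[0]

# Sample counterpart of COV(x, z) / VAR(x).
cov_xz = np.cov(x, z)[0, 1]
var_x = np.var(x, ddof=1)

# Slope predicted by the omitted variable bias formula.
predicted = a + beta * cov_xz / var_x

print("slope with z omitted: ", a_hat_short)
print("formula prediction:   ", predicted)
print("slope with z included:", a_hat_long)
```

With the true coefficients the two numbers agree up to sampling noise. If the formula is evaluated with the coefficients estimated from the regression that includes $z$, the match holds exactly within the sample, apart from floating-point error.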

Whether or not you care about causality, this bias is still important. Consider the following example. You want to estimate the relationship between the education of parents ($x$, in years) and the number of children they have ($y$). You don't care about causality; you just want to calculate the coefficient within your sample, i.e. you only care about the average number of children for a couple with average years of education. In addition, suppose that education is also correlated with income ($z$). To make the calculations easy, let's assume that $\text{VAR}(x)=0.5$ and $\text{COV}(x,z)=0.3$ and that the average education of parents is 10 years. Now suppose the true model looks like this (I have already replaced $c,a,\beta$ with actual numbers):

$$y_i= 10 -0.3x_i-0.2z_i + u_i$$

In this true model, a family with the average education of 10 years would have on average 3 fewer kids than a family with no education and the same income. Note that I am not saying anything about causation; this is purely about describing the household with average education.

However, note what happens if we now omit $z$ from the regression model and calculate $\hat{a}$ using the formula above:

$$y_i= 5 - 0.42x_i+ u_i$$

Now the estimate suggests that a family with average education has about 4 fewer kids ($0.42 \times 10 = 4.2$), which is plain wrong: the regression without $z$ overstates the reduction of 3 kids by about 40%, and so underestimates the number of kids for the average family. In this case the sign stayed the same, but that's just due to the numbers I chose; in real life the bias could easily change the sign of the coefficient altogether.
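
The arithmetic behind the slope of $-0.42$ can be checked with a short sketch. It assumes a ratio $\text{COV}(x,z)/\text{VAR}(x)$ of $0.6$ (for instance a covariance of 0.3 with a variance of 0.5), which is the ratio consistent with that slope. The function name and the sign-flip numbers are hypothetical, chosen only for illustration.

```python
def slope_with_z_omitted(a, beta, cov_xz, var_x):
    """Slope on x when z is left out: a + beta * COV(x, z) / VAR(x)."""
    if var_x <= 0:
        raise ValueError("VAR(x) must be positive")
    return a + beta * cov_xz / var_x


a, beta = -0.3, -0.2   # coefficients of the true model in the example
years = 10             # average years of education in the example

# Assumed ratio COV/VAR = 0.3 / 0.5 = 0.6.
slope = slope_with_z_omitted(a, beta, cov_xz=0.3, var_x=0.5)

print(round(slope, 2))          # -0.42
print(round(a * years, 1))      # -3.0, difference implied by the true model
print(round(slope * years, 1))  # -4.2, difference implied when z is omitted

# Hypothetical case: a positive beta of 0.6 turns the negative slope positive.
flipped = slope_with_z_omitted(a, 0.6, cov_xz=0.3, var_x=0.5)
print(round(flipped, 2))        # 0.06
```

The last call shows how a sufficiently large $\beta$ of the opposite sign can reverse the sign of the estimated coefficient, even though the true $a$ is negative.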

Hence, even if you don't care about causality and just want to use regression to describe a dataset, you should care about omitted variable bias.

PS: In my answer I left out $by$ from your example, $y=c+ax+by+error$, since there must be a typo: you can't regress $y$ on itself, yet your formula has $y$ on both sides. To be specific, you could do that, but mathematically the coefficient is always $1$ when you regress the dependent variable on itself, so there is never a reason to do it.

Answered by 1muflon1 on February 23, 2021
