Cross Validated: asked by Zhenkai Ran on December 3, 2021
After implementing an IV probit model, the signs of many exogenous covariates’ coefficients have been flipped, compared to those in the baseline probit model. These signs are now at odds with the past literature. I originally thought the IV approach would only significantly impact the coefficient estimates of the endogenous variables. What does this indicate? Is it because my instrument is not valid?
This is not a complete answer, but I thought I would organize some of my suggestions from the comments and give an empirical example of sign reversal, plus two simulation examples that give the intuition for why sign reversal is not necessarily a problem.
I think it is a common misconception that if only one RHS variable is correlated with the error, the other coefficients are still consistently estimated. That is only true if the endogenous variable is uncorrelated with the other included variables. So it's not surprising to me that this happens, especially in a nonlinear model, once you fix the endogeneity issue. I don't think it implies anything about instrument validity.
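The mechanics are easiest to see in the linear case (the probit's nonlinearity only adds to this, but the basic mechanism is the same). If $y = \beta_1 x_1 + \beta_2 x_2 + \varepsilon$ and only $x_1$ is correlated with the error, OLS converges to
$$\operatorname{plim}\hat\beta = \beta + \big[\operatorname{Var}(X)\big]^{-1}\operatorname{Cov}(X,\varepsilon),$$
and because $[\operatorname{Var}(X)]^{-1}$ is not diagonal when $x_1$ and $x_2$ are correlated, the endogeneity in $x_1$ gets "smeared" onto $x_2$. With two regressors this works out to
$$\operatorname{plim}\hat\beta_2 = \beta_2 - \frac{\rho_{12}\,\operatorname{Cov}(x_1,\varepsilon)}{\sigma_1\,\sigma_2\,(1-\rho_{12}^2)},$$
which is nonzero, and can easily flip the sign of $\hat\beta_2$, whenever $\rho_{12}\neq 0$.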
Here is an example where we want to estimate the effect of household income on supplemental insurance, a type of private health insurance sold to supplement Medicare in the United States. Medicare primarily provides health insurance for Americans aged 65 and older, but also covers some disabled or very sick people. The example is "borrowed" from Cameron and Trivedi's MUS book.
Note that the coefficient on White is negative in the baseline probit but becomes positive in the ivprobit:
. use "http://cameron.econ.ucdavis.edu/musbook/mus14data.dta", clear
. generate linc = log(hhincome)
(9 missing values generated)
. global xlist2 female age age2 educyear married hisp white chronic adl hstatusg
. global ivlist2 retire sretire
. probit ins linc $xlist2, nolog
Probit regression Number of obs = 3,197
LR chi2(11) = 403.86
Prob > chi2 = 0.0000
Log likelihood = -1933.4275 Pseudo R2 = 0.0946
------------------------------------------------------------------------------
ins | Coef. Std. Err. z P>|z| [95% Conf. Interval]
-------------+----------------------------------------------------------------
linc | .3466893 .0341499 10.15 0.000 .2797567 .4136219
female | -.0815374 .0511504 -1.59 0.111 -.1817904 .0187156
age | .1162879 .1221709 0.95 0.341 -.1231627 .3557384
age2 | -.0009395 .0009061 -1.04 0.300 -.0027153 .0008363
educyear | .0464387 .0089404 5.19 0.000 .0289159 .0639616
married | .1044152 .0640169 1.63 0.103 -.0210556 .2298859
hisp | -.3977334 .1140795 -3.49 0.000 -.621325 -.1741418
white | -.0418296 .0667783 -0.63 0.531 -.1727127 .0890535
chronic | .0472903 .0189362 2.50 0.013 .0101759 .0844047
adl | -.0945039 .0357786 -2.64 0.008 -.1646286 -.0243791
hstatusg | .1138708 .0639877 1.78 0.075 -.0115429 .2392845
_cons | -5.744548 4.114725 -1.40 0.163 -13.80926 2.320165
------------------------------------------------------------------------------
. ivprobit ins $xlist2 (linc = $ivlist2), nolog
Probit model with endogenous regressors Number of obs = 3,197
Wald chi2(11) = 354.58
Log likelihood = -5407.7151 Prob > chi2 = 0.0000
------------------------------------------------------------------------------------
| Coef. Std. Err. z P>|z| [95% Conf. Interval]
-------------------+----------------------------------------------------------------
linc | -.5338252 .3726921 -1.43 0.152 -1.264288 .1966379
female | -.1394072 .0495162 -2.82 0.005 -.2364571 -.0423572
age | .2862293 .1240301 2.31 0.021 .0431347 .5293238
age2 | -.0021472 .000908 -2.36 0.018 -.0039269 -.0003675
educyear | .1136881 .0231877 4.90 0.000 .0682411 .1591351
married | .7058309 .2303767 3.06 0.002 .254301 1.157361
hisp | -.5094514 .1020497 -4.99 0.000 -.7094651 -.3094378
white | .1563454 .100883 1.55 0.121 -.0413817 .3540726
chronic | .0061939 .0266342 0.23 0.816 -.0460082 .058396
adl | -.1347664 .0330215 -4.08 0.000 -.1994873 -.0700456
hstatusg | .2341789 .0688939 3.40 0.001 .0991493 .3692085
_cons | -10.00787 3.907857 -2.56 0.010 -17.66713 -2.348612
-------------------+----------------------------------------------------------------
corr(e.linc,e.ins)| .5879559 .2267331 -.0046375 .8749261
sd(e.linc)| .7177787 .0089768 .7003984 .7355902
------------------------------------------------------------------------------------
Instrumented: linc
Instruments: female age age2 educyear married hisp white chronic adl hstatusg
retire sretire
------------------------------------------------------------------------------------
Wald test of exogeneity (corr = 0): chi2(1) = 3.79 Prob > chi2 = 0.0516
One possible response is to say that neither one is significant and leave it at that. You should probably do this in terms of average marginal effects with margins, dydx(white), rather than index-function coefficients. For the probit, White is 1.4 percentage points less likely to have supplemental insurance; for the ivprobit it is roughly +15.6 percentage points:
. qui probit ins linc $xlist2, nolog
. margins, dydx(white)
Average marginal effects Number of obs = 3,197
Model VCE : OIM
Expression : Pr(ins), predict()
dy/dx w.r.t. : white
------------------------------------------------------------------------------
| Delta-method
| dy/dx Std. Err. z P>|z| [95% Conf. Interval]
-------------+----------------------------------------------------------------
white | -.0143965 .0229799 -0.63 0.531 -.0594363 .0306432
------------------------------------------------------------------------------
. qui ivprobit ins $xlist2 (linc = $ivlist2), nolog first
. margins, dydx(white)
Average marginal effects Number of obs = 3,197
Model VCE : OIM
Expression : Fitted values, predict()
dy/dx w.r.t. : white
------------------------------------------------------------------------------
| Delta-method
| dy/dx Std. Err. z P>|z| [95% Conf. Interval]
-------------+----------------------------------------------------------------
white | .1563454 .100883 1.55 0.121 -.0413817 .3540726
------------------------------------------------------------------------------
You can also appeal to the logic above to argue that this does not mean that White is endogenous (also see the simulations below).
You can also argue that if previous research on White insurance take-up did not adjust for HH income at all, or did not adjust for it correctly, that could explain the difference in sign now. For example, the price of race horses is positively related to the number of races won and to total winnings, individually. Controlling for winnings, the races coefficient can become negative, since it now corresponds to requiring more races to win the same amount of money (a minimal simulated version of this is sketched right after this paragraph). Here it is tougher to come up with an intuitive ceteris paribus explanation, but your problem might be different. In general, the bivariate intuitions about sign that come from previous work and theory are trickier when you have lots of covariates, because they are not always making the same comparison. This paper by Peter Kennedy covers some of the other cases.
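Here is that sketch, with an entirely made-up DGP (the variable names and numbers are illustrative, not from the insurance data): price loads on total winnings and on winnings per race, so the bivariate association with races is positive, but holding total winnings fixed, more races signals a worse horse.
/* Illustrative racehorse DGP (hypothetical numbers) */
clear
set seed 12345
set obs 2000
generate races    = 1 + floor(20*runiform())      // number of races
generate perrace  = exp(rnormal())                // winnings per race (horse quality)
generate winnings = races*perrace                 // total winnings
generate price    = winnings + 5*perrace + rnormal()
regress price races            // bivariate: positive coefficient on races
regress price races winnings   // holding winnings fixed: races coefficient turns negative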
You could also argue that own and spouse retirement status are bad instruments, because they probably alter supplemental insurance buying through channels other than HH income, perhaps through health.
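If the exclusion restriction fails in that way, the IV estimate of the endogenous coefficient is itself inconsistent. A minimal linear sketch with a made-up DGP (again, nothing here comes from the insurance data) where the "instrument" also enters the outcome directly:
/* Illustrative invalid-instrument DGP (hypothetical numbers) */
clear
set seed 999
set obs 10000
generate u = rnormal()                        // unobserved confounder
generate z = rnormal()                        // candidate instrument
generate x = z + u + rnormal()                // z is relevant for x
generate y = 1 + 2*x + 3*u + z + rnormal()    // but z also affects y directly
regress y x             // OLS biased upward by the omitted u
ivreg y (x = z)         // IV is also off: converges to about 3 rather than the true 2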
Here are some simulation examples showing the "smearing bias" that can happen when you have an exogenous variable that is correlated with an endogenous one and how that can alter the sign on the exogenous variable when the endogeneity is fixed. I did the linear IV case first, and then the control function version since that is more involved:
. /* Linear Example */
. set seed 07202020
. matrix C = (1, .25, .8 \ .25, 1, 0 \ .8, 0, 1)
. drawnorm x c u, n(10000) corr(C) clear
(obs 10,000)
. gen e = rnormal()
. gen y = 1 - 2*x + 0.5*c + 4.66*u + e
. gen E = 4.66*u + e
. corr x u c E
(obs=10,000)
| x u c E
-------------+------------------------------------
x | 1.0000
u | 0.8038 1.0000
c | 0.2422 -0.0055 1.0000
E | 0.7877 0.9777 -0.0034 1.0000
.
. /* fabricate an IV and solve things */
. reg x u
Source | SS df MS Number of obs = 10,000
-------------+---------------------------------- F(1, 9998) = 18250.83
Model | 6461.11847 1 6461.11847 Prob > F = 0.0000
Residual | 3539.46983 9,998 .354017787 R-squared = 0.6461
-------------+---------------------------------- Adj R-squared = 0.6460
Total | 10000.5883 9,999 1.00015885 Root MSE = .59499
------------------------------------------------------------------------------
x | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
u | .8022365 .0059383 135.10 0.000 .7905963 .8138768
_cons | -.0031226 .00595 -0.52 0.600 -.0147857 .0085406
------------------------------------------------------------------------------
. predict z, resid
. replace z = z + rnormal()
(10,000 real changes made)
. reg y x c u // true DGP
Source | SS df MS Number of obs = 10,000
-------------+---------------------------------- F(3, 9996) = 34830.25
Model | 104726.904 3 34908.968 Prob > F = 0.0000
Residual | 10018.5921 9,996 1.00226012 R-squared = 0.9127
-------------+---------------------------------- Adj R-squared = 0.9127
Total | 114745.496 9,999 11.4756972 Root MSE = 1.0011
------------------------------------------------------------------------------
y | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
x | -1.978855 .018491 -107.02 0.000 -2.015102 -1.942609
c | .5040411 .010988 45.87 0.000 .4825024 .5255798
u | 4.637909 .0179062 259.01 0.000 4.602809 4.673008
_cons | .9775556 .0100116 97.64 0.000 .9579308 .9971804
------------------------------------------------------------------------------
. reg y x c // baseline spec (omit u)
Source | SS df MS Number of obs = 10,000
-------------+---------------------------------- F(2, 9997) = 2425.47
Model | 37488.3041 2 18744.152 Prob > F = 0.0000
Residual | 77257.1921 9,997 7.72803762 R-squared = 0.3267
-------------+---------------------------------- Adj R-squared = 0.3266
Total | 114745.496 9,999 11.4756972 Root MSE = 2.7799
------------------------------------------------------------------------------
y | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
x | 1.995542 .0286515 69.65 0.000 1.93938 2.051705
c | -.4828779 .0286182 -16.87 0.000 -.5389752 -.4267805
_cons | .9864041 .0278001 35.48 0.000 .9319103 1.040898
------------------------------------------------------------------------------
. ivreg y (x=z) c // fix with IV
Instrumental variables (2SLS) regression
Source | SS df MS Number of obs = 10,000
-------------+---------------------------------- F(2, 9997) = 60.92
Model | -114035.469 2 -57017.7343 Prob > F = 0.0000
Residual | 228780.965 9,997 22.884962 R-squared = .
-------------+---------------------------------- Adj R-squared = .
Total | 114745.496 9,999 11.4756972 Root MSE = 4.7838
------------------------------------------------------------------------------
y | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
x | -2.016389 .1826737 -11.04 0.000 -2.374466 -1.658311
c | .4876413 .0650832 7.49 0.000 .3600652 .6152174
_cons | .9584056 .0478553 20.03 0.000 .8645995 1.052212
------------------------------------------------------------------------------
Instrumented: x
Instruments: c z
------------------------------------------------------------------------------
.
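Before moving on to the probit version, note that the baseline estimates above line up with the smearing formula from earlier: with $\operatorname{Cov}(x, 4.66u) \approx 4.66 \times 0.8 \approx 3.73$, $\rho_{xc} \approx 0.25$, and unit variances, the probability limits are roughly $-2 + 3.73/(1-0.25^2) \approx 1.98$ for x and $0.5 - 0.25 \times 3.73/(1-0.25^2) \approx -0.49$ for c, which is essentially what reg y x c reports, while the IV regression recovers the true values.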
. /* IV Probit Simulation: control function version */
. set seed 07232020
. matrix sigma = (1, -.25 \ -.25, 1)
. drawnorm e1 e2, n(10000) corr(sigma) clear
(obs 10,000)
. matrix C = (1, -.25 \ -.25, 1)
. drawnorm x1 x2, n(10000) corr(C)
. gen y2 = x2 + x1 + e2
. gen y1star = 1 + .05*y2 - .01*x2 + e1
. gen y1 = cond(y1star >= 1,1,0)
. corr x2 y2 e1
(obs=10,000)
| x2 y2 e1
-------------+---------------------------
x2 | 1.0000
y2 | 0.4716 1.0000
e1 | -0.0128 -0.1540 1.0000
. probit y1 y2 x2, nolog
Probit regression Number of obs = 10,000
LR chi2(2) = 76.22
Prob > chi2 = 0.0000
Log likelihood = -6893.103 Pseudo R2 = 0.0055
------------------------------------------------------------------------------
y1 | Coef. Std. Err. z P>|z| [95% Conf. Interval]
-------------+----------------------------------------------------------------
y2 | -.0775504 .0090042 -8.61 0.000 -.0951983 -.0599025
x2 | .0747156 .0143114 5.22 0.000 .0466658 .1027655
_cons | .0103127 .0125617 0.82 0.412 -.0143077 .0349331
------------------------------------------------------------------------------
. ivprobit y1 (y2=x1) x2, nolog
Probit model with endogenous regressors Number of obs = 10,000
Wald chi2(2) = 18.14
Log likelihood = -20957.715 Prob > chi2 = 0.0001
---------------------------------------------------------------------------------
| Coef. Std. Err. z P>|z| [95% Conf. Interval]
----------------+----------------------------------------------------------------
y2 | .0509049 .0125527 4.06 0.000 .0263021 .0755076
x2 | -.0221555 .0157077 -1.41 0.158 -.0529419 .008631
_cons | .0081544 .0124891 0.65 0.514 -.0163237 .0326325
----------------+----------------------------------------------------------------
corr(e.y2,e.y1)| -.2502787 .0167809 -.2828692 -.217111
sd(e.y2)| .9977357 .0070551 .9840034 1.01166
---------------------------------------------------------------------------------
Instrumented: y2
Instruments: x2 x1
---------------------------------------------------------------------------------
Wald test of exogeneity (corr = 0): chi2(1) = 204.02 Prob > chi2 = 0.0000
Stata Code:
cls
use "http://cameron.econ.ucdavis.edu/musbook/mus14data.dta", clear
generate linc = log(hhincome)
global xlist2 female age age2 educyear married hisp white chronic adl hstatusg
global ivlist2 retire sretire
probit ins linc $xlist2, nolog
margins, dydx(white)
ivprobit ins $xlist2 (linc = $ivlist2), nolog first
margins, dydx(white)
/* Linear Example */
set seed 07202020
matrix C = (1, .25, .8 \ .25, 1, 0 \ .8, 0, 1)
drawnorm x c u, n(10000) corr(C) clear
gen e = rnormal()
gen y = 1 - 2*x + 0.5*c + 4.66*u + e
gen E = 4.66*u + e
corr x u c E
/* fabricate an IV and solve things */
reg x u
predict z, resid
replace z = z + rnormal()
reg y x c u // true DGP
reg y x c // baseline spec (omit u)
ivreg y (x=z) c // fix with IV
/* IV Probit Simulation: control function version */
set seed 07232020
matrix sigma = (1, -.25 \ -.25, 1)
drawnorm e1 e2, n(10000) corr(sigma) clear
matrix C = (1, -.25 \ -.25, 1)
drawnorm x1 x2, n(10000) corr(C)
gen y2 = x2 + x1 + e2
gen y1star = 1 + .05*y2 - .01*x2 + e1
gen y1 = cond(y1star >= 1,1,0)
corr x2 y2 e1
probit y1 y2 x2, nolog
ivprobit y1 (y2=x1) x2, nolog
Answered by dimitriy on December 3, 2021