At what point in analysis do you perform imputation for missing variables?

Question

There is a dataset with 30 variables and over 5 million observations. We plan to use a subsample of the data for analysis. Between 0.02% and 2.5% of the values of EACH variable are missing. I plan to impute these in Stata, but I'm not sure whether we should impute ALL 30 variables at once or at different stages.
We will use 11 of the variables to create a subsample. As such, we plan to use imputation prior to this stage in order for the exclusion criteria to be applied correctly. However, once this is done, 3 different regressions will be run (OLS and logistic models). All 30 of the variables will be used at some point in these.
Here is the problem: should the imputation for the other 19 variables (those NOT used for the exclusion criteria) be conducted AFTER the exclusion criteria are applied, or should the imputation be done for ALL variables at the same time (prior to applying the exclusion criteria)?
We were planning to use the Stata command hotdeck.

Mark Ebden · Answer

Since you’ve decided on an imputation method relying on MCAR (missing completely at random) data, I infer that your data are indeed MCAR. In this case, you should impute the missing values after the exclusion criteria are applied, for two reasons:

Speed (because there are fewer data points to process, downstream of exclusion criteria);
Bespoke imputation for your data of interest (imputing all 30 variables before exclusion would instead draw on a larger, less specific population than the one under study).

The caveat in the above is that it’s based on my inference that because you've chosen hotdeck you have MCAR data. If I’m mistaken, then:

Don’t impute any data using hotdeck; use something such as multiple imputation by chained equations (MICE), for which there are toolboxes.
Impute the data before the exclusion criteria are applied. Basically, see the other answer here by Robert Long.
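
To illustrate why missing values in the criterion variables have to be dealt with before subsampling, here is a minimal sketch in Python rather than Stata. The variable names (`age`, `income`), the threshold, and the records are all invented for illustration; they are not from the asker's dataset.

```python
# Hypothetical records; None marks a missing value.
records = [
    {"id": 1, "age": 34, "income": 52000},
    {"id": 2, "age": None, "income": 41000},
    {"id": 3, "age": 16, "income": None},
    {"id": 4, "age": 45, "income": 67000},
    {"id": 5, "age": None, "income": 38000},
]

def classify(record, min_age=18):
    """Return 'keep', 'exclude', or 'unknown' for an age-based criterion."""
    if record["age"] is None:
        return "unknown"  # the criterion cannot be evaluated without a value
    return "keep" if record["age"] >= min_age else "exclude"

status = {r["id"]: classify(r) for r in records}
unknown_ids = [rid for rid, s in status.items() if s == "unknown"]

print(status)
print("cannot classify:", unknown_ids)
```

Records 2 and 5 cannot be assigned to the subsample until `age` has a value, so whatever is decided for the remaining variables, the criterion variables need to be imputed (or those rows handled explicitly) before the exclusion step.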

Good luck!
References:

Missing Data Problems in Machine Learning by B. Marlin (2008)
Section 9.6 of The Elements of Statistical Learning, arguing for multiple imputation when data are not MCAR

Robert Long · Answer

You should do all the imputations first, otherwise you may get biased results.
I don't know exactly what hotdeck in Stata does, but if it is a single imputation method (i.e. you get one completed/imputed dataset) then I would advise against it. At the very least I would advise creating several completed datasets, if the algorithm allows a different seed to create different imputations. I don't know what your reasons for choosing hot decking are, but I have always found multiple imputation to be superior, and it has desirable statistical properties when certain assumptions hold, namely that the missingness is MAR (missing at random) or MCAR (missing completely at random) and not MNAR (missing not at random). Roughly, this means that, for any particular variable, if the missing data can be predicted from the other variables, or if the missing values are simply a random sample, multiple imputation will produce unbiased results.
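
As a rough sketch of the "several completed datasets with different seeds" idea, the Python below fills missing values by a naive random hot deck and pools the resulting means with Rubin's rules. This is an assumption-laden illustration, not what Stata's hotdeck command does: the functions `hot_deck` and `pool_means` and the data are hypothetical, and drawing donors uniformly from all observed values is only reasonable under MCAR.

```python
import random
import statistics

def hot_deck(values, seed):
    """Fill each None with a random draw from the observed values (donors)."""
    rng = random.Random(seed)
    donors = [v for v in values if v is not None]
    return [v if v is not None else rng.choice(donors) for v in values]

def pool_means(completed_datasets):
    """Pool the sample mean across completed datasets using Rubin's rules."""
    m = len(completed_datasets)
    estimates = [statistics.mean(d) for d in completed_datasets]
    # Within-imputation variance: average squared standard error of the mean.
    within = statistics.mean(
        statistics.variance(d) / len(d) for d in completed_datasets
    )
    # Between-imputation variance: spread of the estimates across imputations.
    between = statistics.variance(estimates)
    total = within + (1 + 1 / m) * between
    return statistics.mean(estimates), total

# Invented data with three missing values.
values = [4.1, None, 5.3, 6.0, None, 4.8, 5.5, None, 5.1, 4.4]

# One completed dataset per seed.
completed = [hot_deck(values, seed) for seed in range(5)]
pooled_mean, pooled_var = pool_means(completed)

print("pooled mean:", pooled_mean)
print("pooled variance:", pooled_var)
```

The between-imputation term is what a single completed dataset cannot provide: it carries the extra uncertainty due to the missing values into the pooled variance.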
