Cross Validated Asked by iPlexipen on January 3, 2022
There is a dataset with 30 variables and over 5 million observations. We plan to use a subsample of the data for analysis. Around 0.02 – 2.5% of the values of EACH variable are missing. I plan to use imputation in Stata for this, but I’m not sure if we should do the imputation for ALL 30 variables at once, or at different stages.
We will use 11 of the variables to create a subsample. As such, we plan to use imputation prior to this stage in order for the exclusion criteria to be applied correctly. However, once this is done, 3 different regressions will be run (OLS and logistic models). All 30 of the variables will be used at some point in these.
Here is the problem: should the imputation for the other variables (the 19 NOT used for the exclusion criteria) be conducted AFTER the exclusion criteria are applied, or should the imputation be done for ALL variables at the same time (prior to application of the exclusion criteria)?
The command in Stata, hotdeck, is what we were going to use.
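For readers unfamiliar with the idea, a generic hot-deck imputation can be sketched roughly as below. This is only an illustration of the general technique (fill each missing value with a value drawn at random from observed "donor" rows in the same stratum), not a description of what Stata's hotdeck command does internally. The function name `hot_deck_impute`, the toy rows, and the choice of `sex` as the stratifying variable are all hypothetical.

```python
import random


def hot_deck_impute(rows, target, strata_keys, seed=0):
    """Return a copy of `rows` where missing `target` values (None) are
    replaced by a randomly chosen observed value from the same stratum.

    Hypothetical helper for illustration; not Stata's implementation.
    """
    rng = random.Random(seed)

    # Collect observed (donor) values for each stratum.
    donors = {}
    for row in rows:
        if row[target] is not None:
            key = tuple(row[k] for k in strata_keys)
            donors.setdefault(key, []).append(row[target])

    completed = []
    for row in rows:
        new_row = dict(row)  # copy, so the input data are left untouched
        if new_row[target] is None:
            key = tuple(row[k] for k in strata_keys)
            pool = donors.get(key)
            if pool:  # stays missing if the stratum has no donors
                new_row[target] = rng.choice(pool)
        completed.append(new_row)
    return completed


# Toy data: None marks a missing value.
rows = [
    {"sex": "f", "income": 30},
    {"sex": "f", "income": None},
    {"sex": "m", "income": 50},
    {"sex": "m", "income": 55},
    {"sex": "m", "income": None},
]
completed = hot_deck_impute(rows, "income", ["sex"], seed=1)
```

Because the donor pool depends on which rows are present, running such a procedure before or after applying exclusion criteria can give different imputed values, which is the crux of the question.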
Since you’ve decided on an imputation method relying on MCAR (missing completely at random) data, I infer that your data are indeed MCAR. In this case, you should impute the missing values after the exclusion criteria are applied, for two reasons:
The caveat in the above is that it’s based on my inference that because you've chosen hotdeck you have MCAR data. If I’m mistaken, then:
Good luck!
Answered by Mark Ebden on January 3, 2022
You should do all the imputations first, otherwise you may get biased results.
I don't know what hotdeck in Stata does exactly, but if it is a single imputation method (i.e. you get one completed/imputed dataset) then I would advise against it. At the very least I would advise creating several completed datasets, if the algorithm allows a different seed to create different imputations. I don't know what your reasons for choosing hot decking are, but I have always found multiple imputation to be superior; it has desirable statistical properties when certain assumptions hold, namely that the missingness is MAR (missing at random) or MCAR (missing completely at random) and not MNAR (missing not at random). Roughly, this means that, for any particular variable, if the missing data can be predicted from the other variables, or if the missing values are simply a random sample, multiple imputation will produce unbiased results.
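To make the "several completed datasets" idea concrete, here is a minimal sketch of how estimates from multiple imputed datasets are typically combined with Rubin's rules: the pooled estimate is the average of the per-dataset estimates, and the total variance is the average within-imputation variance plus (1 + 1/m) times the between-imputation variance. The imputation step here is a deliberately simple random draw from the observed values with a different seed each time, which is only plausible under MCAR and is a stand-in for a proper imputation model. The names `impute_once` and `pool_mean` and the toy data are assumptions made for illustration.

```python
import random
import statistics


def impute_once(values, rng):
    """Fill each None with a random draw from the observed values."""
    observed = [v for v in values if v is not None]
    return [v if v is not None else rng.choice(observed) for v in values]


def pool_mean(values, m=5, base_seed=0):
    """Estimate the mean from m imputed datasets and pool with Rubin's rules.

    Returns (pooled_estimate, total_variance). Requires m >= 2.
    """
    estimates, variances = [], []
    for i in range(m):
        rng = random.Random(base_seed + i)  # a different seed per imputation
        completed = impute_once(values, rng)
        estimates.append(statistics.mean(completed))
        # Squared standard error of the mean within this completed dataset.
        variances.append(statistics.variance(completed) / len(completed))

    pooled = statistics.mean(estimates)       # pooled point estimate
    within = statistics.mean(variances)       # within-imputation variance
    between = statistics.variance(estimates)  # between-imputation variance
    total = within + (1 + 1 / m) * between    # Rubin's total variance
    return pooled, total


# Toy data: None marks a missing value.
data = [2.0, 4.0, None, 6.0, 8.0, None, 10.0, 12.0]
pooled, total = pool_mean(data, m=5)

# With no missing values, every imputation is identical, so the
# between-imputation variance is zero.
full_pooled, full_total = pool_mean([1.0, 2.0, 3.0, 4.0], m=5)
```

The between-imputation term is what a single imputed dataset cannot provide: it reflects the extra uncertainty due to the values being missing, which is why a single imputation tends to understate standard errors.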
Answered by Robert Long on January 3, 2022