TransWikia.com

Confidence intervals and multiple regression for a multiply imputed data set

Cross Validated Asked by AppleSeed on November 24, 2021

I generated 20 imputed data sets to handle missing values. Using SPSS, I can’t carry out a multiple regression and calculate confidence intervals with this multiply imputed data set.
Is there a way to carry out these analyses in SPSS? If yes, how? If no, which other software could I use, or is there an alternative?

Background information about the study: I have 156 cases. Each case/person had to answer 202 items. 0.75% of the overall 31512 item answers are missing.

2 Answers

There is a newer option: the R package hmi (Speidel et al. 2020). Like mice, it handles both imputation of missing data and pooling of coefficients, but it also supports hierarchical designs. This would address the issue of multiple measures per person (a.k.a. longitudinal data, repeated measures, random effects).

A simple demonstration using the sleepstudy dataset would look something like this:

# load required libraries
library(hmi)    # imputation
library(lme4)   # mixed effects models

# simulate missing values in 10% of the rows: half of the selected rows
# lose Reaction and the other half lose Days, so no row is missing both
set.seed(321321)
sleep <- sleepstudy
iremove <- sample(nrow(sleepstudy), size = 0.1*nrow(sleepstudy))
sleep$Reaction[head(iremove, length(iremove)/2)] <- NA
sleep$Days[tail(iremove, length(iremove)/2)] <- NA

# Impute missing data and fit the model
# ! this takes a few minutes to run !
myForm <- formula(Reaction ~ 1 + Days + (1|Subject))
sleep.imp <- hmi(data = sleep, 
                 model_formula = myForm,
                 # family = Gamma(link = "log"),
                 m = 10)

# Get the pooled model coefficients
summary(sleep.imp$pooling)

According to the documentation, pooling also follows Rubin's rules. (Note: I left out diagnostic checks for conciseness, but they should be done!)

As an additional side note: If the 202 items are used as predictors in the multiple regression, collinearity is likely an issue.

I hope that helps!

Answered by Mick on November 24, 2021

Although the information in your question does not quite make sense (0.75% missingness with 156 observations), the answer is that you need to apply Rubin's rules to pool the analyses of the 20 imputed datasets. Essentially, the point estimates are simply averaged, but the standard errors are not, because the pooled variance has to reflect variation between imputations as well as within imputations.
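For reference, Rubin's rules for a single coefficient can be written out as follows. The notation is my own choice rather than anything from the question: $\hat{Q}_i$ is the estimate from imputed dataset $i$, $U_i$ is its squared standard error, and $m$ is the number of imputations (20 here). The degrees of freedom shown are Rubin's original large-sample version; software may apply a small-sample adjustment instead.

```latex
\begin{align*}
\bar{Q} &= \frac{1}{m}\sum_{i=1}^{m}\hat{Q}_i
  && \text{(pooled point estimate)} \\
\bar{U} &= \frac{1}{m}\sum_{i=1}^{m}U_i
  && \text{(within-imputation variance)} \\
B &= \frac{1}{m-1}\sum_{i=1}^{m}\bigl(\hat{Q}_i-\bar{Q}\bigr)^2
  && \text{(between-imputation variance)} \\
T &= \bar{U} + \Bigl(1+\frac{1}{m}\Bigr)B
  && \text{(total variance; pooled SE is } \sqrt{T}\text{)} \\
\nu &= (m-1)\Bigl(1+\frac{\bar{U}}{(1+1/m)\,B}\Bigr)^{2}
  && \text{(degrees of freedom)} \\
\text{CI} &= \bar{Q} \pm t_{\nu,\,1-\alpha/2}\,\sqrt{T}
\end{align*}
```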

You can apply Rubin's rules using any statistical software, but you will have to run the analysis model on each of the imputed datasets and then calculate the pooled values manually (or programmatically).
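As a rough sketch of what the programmatic route could look like, here is a small base-R helper that pools one coefficient. The function name pool_rubin and the numbers fed to it are hypothetical and only for illustration; in practice the estimates and standard errors would come from fitting the regression to each of the 20 imputed datasets. It uses Rubin's original degrees of freedom, so intervals may differ slightly from software that applies a small-sample correction.

```r
# Pool one coefficient across m imputed datasets with Rubin's rules.
# estimates:  the coefficient estimate from each imputed dataset
# std_errors: the matching standard errors
# (pool_rubin is a hypothetical helper, not part of any package)
pool_rubin <- function(estimates, std_errors, level = 0.95) {
  m <- length(estimates)
  stopifnot(m > 1, length(std_errors) == m)
  q_bar <- mean(estimates)            # pooled point estimate
  u_bar <- mean(std_errors^2)         # within-imputation variance
  b     <- var(estimates)             # between-imputation variance
  total <- u_bar + (1 + 1 / m) * b    # total variance
  # Rubin's (1987) degrees of freedom; becomes Inf when b is 0
  df    <- (m - 1) * (1 + u_bar / ((1 + 1 / m) * b))^2
  half  <- qt(1 - (1 - level) / 2, df) * sqrt(total)
  c(estimate = q_bar, std.error = sqrt(total), df = df,
    lower = q_bar - half, upper = q_bar + half)
}

# Made-up estimates and standard errors, for illustration only
est <- c(1.9, 2.1, 2.0, 2.2, 1.8)
se  <- c(0.50, 0.52, 0.48, 0.51, 0.49)
res <- pool_rubin(est, se)
print(round(res, 3))
```

The same function would be called once per regression coefficient. Note that the pooled standard error is always at least as large as the square root of the average squared standard error, because the between-imputation term only adds variance.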

Alternatively, you could just use the mice package in R and its supplied pool function. For example, using the nhanes dataset:

> library(mice)    
> imp <- mice(nhanes, seed = 15)  # perform imputations with defaults

> fit <- with(imp, lm(chl ~ age + bmi))   # fit the analysis model to each imputed dataset
> round(summary(pool(fit), conf.int = TRUE), 3)
            estimate std.error statistic     df p.value    2.5 %  97.5 %
(Intercept)   -7.222    68.901    -0.105  9.462   0.919 -161.933 147.490
age           36.655    10.351     3.541 11.529   0.004   13.998  59.311
bmi            5.199     2.156     2.412  9.332   0.038    0.349  10.050

and this produces the pooled estimates using Rubin's rules, including standard errors, confidence intervals and p values.

Answered by Robert Long on November 24, 2021
