Cross Validated Asked by Roccer on February 20, 2021
Is there a rationale for the number of observations per cluster in a random effects model? I have a sample size of 1,500 with 700 clusters modeled as an exchangeable random effect. I have the option to merge clusters in order to build fewer, but larger, clusters. How can I choose the minimum sample size per cluster so as to obtain meaningful results when predicting the random effect for each cluster? Is there a good paper that explains this?
TL;DR: The minimum sample size per cluster in a mixed-effects model is 1, provided that the number of clusters is adequate and the proportion of singleton clusters is not "too high".
Longer version:
In general, the number of clusters is more important than the number of observations per cluster. With 700 clusters, you clearly have no problem there.
Small cluster sizes are quite common, especially in social science surveys that follow stratified sampling designs, and there is a body of research that has investigated cluster-level sample size.
While increasing the cluster size increases statistical power to estimate the random effects (Austin & Leckie, 2018), small cluster sizes do not lead to serious bias (Bell et al., 2008; Clarke, 2008; Clarke & Wheaton, 2007; Maas & Hox, 2005). Thus, the minimum sample size per cluster is 1.
In particular, Bell et al. (2008) performed a Monte Carlo simulation study with proportions of singleton clusters (clusters containing only a single observation) ranging from 0% to 70%, and found that, provided the number of clusters was large (~500), the small cluster sizes had almost no impact on bias and Type I error control.
They also reported very few problems with model convergence under any of their modelling scenarios.
For the particular scenario in the OP, I would suggest running the model with 700 clusters in the first instance. Unless there was a clear problem with this, I would be disinclined to merge clusters. I ran a simple simulation in R:
Here we create a clustered dataset with a residual variance of 1, a single fixed effect also equal to 1, and 700 clusters, of which 690 are singletons and 10 have just 2 observations. We run the simulation 1000 times and inspect histograms of the estimated fixed effect and residual standard deviation.
library(lme4)

set.seed(15)

# 700 subjects with one measurement each; duplicate subjects 691-700 so that
# 690 clusters are singletons and 10 clusters have two observations
dtB <- expand.grid(Subject = 1:700, measure = c(1))
dtB <- rbind(dtB, dtB[691:700, ])

fixef.v <- numeric(1000)
ranef.v <- numeric(1000)

for (i in 1:1000) {
  dtB$x <- rnorm(nrow(dtB), 0, 1)
  dtB$y <- dtB$Subject/100 + rnorm(nrow(dtB), 0, 1) + dtB$x * 1
  fm0B <- lmer(y ~ x + (1|Subject), data = dtB)
  fixef.v[i] <- fixef(fm0B)[[2]]            # estimated fixed effect for x (true value 1)
  ranef.v[i] <- attr(VarCorr(fm0B), "sc")   # estimated residual standard deviation (true value 1)
}

hist(fixef.v, breaks = 15)
hist(ranef.v, breaks = 15)
As you can see, the fixed effect is very well estimated, while the residual standard deviation appears to be slightly downward-biased, but not drastically so:
> summary(fixef.v)
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.6479 0.9439 0.9992 1.0005 1.0578 1.2544
> summary(ranef.v)
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.2796 0.7745 0.9004 0.8993 1.0212 1.4837
The OP specifically mentions the estimation of cluster-level random effects. In the simulation above, the random effects were created simply as the value of each Subject's ID (scaled down by a factor of 100). Obviously these are not normally distributed, contrary to the usual assumption of linear mixed effects models; however, we can extract the (conditional modes of the) cluster-level effects and plot them against the actual Subject IDs:
re <- ranef(fm0B)[[1]][, 1]         # conditional modes of the random intercepts, one per Subject
dtB$re <- append(re, re[691:700])   # duplicate the last 10 to match the repeated rows in dtB
hist(dtB$re)
plot(dtB$re, dtB$Subject)
The histogram departs from normality somewhat, but this is due to the way we simulated the data. There is still a reasonable relationship between the estimated and actual random effects.
References:
Austin, P. C., & Leckie, G. (2018). The effect of number of clusters and cluster size on statistical power and Type I error rates when testing random effects variance components in multilevel linear and logistic regression models. Journal of Statistical Computation and Simulation, 88(16), 3151-3163. DOI: 10.1080/00949655.2018.1504945
Bell, B. A., Ferron, J. M., & Kromrey, J. D. (2008). Cluster size in multilevel models: the impact of sparse data structures on point and interval estimates in two-level models. JSM Proceedings, Section on Survey Research Methods, 1122-1129.
Clarke, P. (2008). When can group level clustering be ignored? Multilevel models versus single-level models with sparse data. Journal of Epidemiology and Community Health, 62(8), 752-758.
Clarke, P., & Wheaton, B. (2007). Addressing data sparseness in contextual population research using cluster analysis to create synthetic neighborhoods. Sociological Methods & Research, 35(3), 311-351.
Maas, C. J., & Hox, J. J. (2005). Sufficient sample sizes for multilevel modeling. Methodology, 1(3), 86-92.
Correct answer by Robert Long on February 20, 2021
In mixed models, the random effects are most often estimated using empirical Bayes methodology. A feature of this methodology is shrinkage: the estimated random effects are shrunk towards the overall mean described by the fixed-effects part of the model. The degree of shrinkage depends on two components:
The magnitude of the variance of the random effects compared to the magnitude of the variance of the error terms. The larger the variance of the random effects in relation to the variance of the error terms, the smaller the degree of shrinkage.
The number of repeated measurements in the clusters. Random effects estimates of clusters with more repeated measurements are shrunk less towards the overall mean compared to clusters with fewer measurements.
In your case, the second point is more relevant. However, note that your suggested solution of merging clusters may impact the first point as well; a small sketch of how the two components combine is given below.
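To make the shrinkage concrete, here is a minimal sketch in R (not part of the original answer) of the standard shrinkage factor for a random-intercept model: the empirical Bayes (BLUP) estimate of a cluster effect multiplies the cluster's raw mean deviation by sigma_b^2 / (sigma_b^2 + sigma_e^2 / n_i), where sigma_b^2 is the random-intercept variance, sigma_e^2 the residual variance, and n_i the cluster size.
shrinkage <- function(sigma_b2, sigma_e2, n_i) {
  # Fraction of a cluster's raw mean deviation retained by the BLUP
  sigma_b2 / (sigma_b2 + sigma_e2 / n_i)
}

# With equal random-effect and residual variances, a singleton cluster is shrunk
# halfway towards the overall mean, while a cluster of 10 keeps about 91% of its deviation
shrinkage(sigma_b2 = 1, sigma_e2 = 1, n_i = c(1, 2, 10, 50))
# 0.5000000 0.6666667 0.9090909 0.9803922

# A smaller random-effect variance relative to the residual variance gives
# stronger shrinkage at every cluster size
shrinkage(sigma_b2 = 0.25, sigma_e2 = 1, n_i = c(1, 2, 10, 50))
# 0.2000000 0.3333333 0.7142857 0.9259259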
Answered by Dimitris Rizopoulos on February 20, 2021