Cross Validated Asked by antifrax on December 13, 2021
Most examples for Gaussian Mixture Models (GMMs) employ datasets with fairly obvious underlying structure (well-separated clusters). How should one determine the order of a GMM (and interpret the result) when components overlap strongly?
For example, consider a dataset where the true data-generating process is 4 bivariate normal distributions. Working in R:
#generate data as a mixture of four bivariate Gaussians
library(MASS)
set.seed(123)
reds<-mvrnorm(n=1000,mu=c(2,2),Sigma=matrix(c(0.2,0.01,0.01,0.2),2))
greens<-mvrnorm(n=1000,mu=c(4,4),Sigma=matrix(c(2,0.9,0.9,2),2))
blues<-mvrnorm(n=1000,mu=c(2,3),Sigma=matrix(c(0.9,0.3,0.3,0.9),2))
blacks<-mvrnorm(n=1000,mu=c(2,3),Sigma=matrix(c(0.1,0.01,0.01,0.1),2))
#the data we observe:
dat<-rbind(reds,blues,greens,blacks)
#visualize
plot(NULL,xlim=c(-2,10),ylim=c(-2,10),
xlab="dimension 1",ylab="dimension 2",main="Observations")
points(reds,col="red",pch=".",cex=2)
points(blues,col="blue",pch=".",cex=2)
points(greens,col="green",pch=".",cex=2)
points(blacks,col="black",pch=".",cex=2)
Here is a plot of the data we observe:
[Figure: scatter plot of the four overlapping bivariate Gaussian samples, coloured red, green, blue, and black]
Now we want to fit a GMM to this. We will use the {mclust} package (Scrucca et al., 2016), which by default uses the Bayesian Information Criterion (BIC) to select both the model (i.e. the constraints on volume, shape, and orientation of the component Gaussians) and the order of the model (the number of components). The package also implements the Integrated Complete Likelihood (ICL) criterion, and a bootstrap-based Likelihood Ratio Test (LRT), for selecting the model order. Let's run them!
library(mclust)
#assess candidate models...
#...via BIC criterion
BIC<-mclustBIC(dat)
#...via ICL criterion
ICL<-mclustICL(dat)
#since VVE performs best for both, we will apply a likelihood ratio test using that model
LRT<-mclustBootstrapLRT(dat,modelName="VVE") #heads up, this is slow
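As a quick optional check before looking at the numbers, {mclust} also provides plot methods for these criterion objects, which show how each criterion varies with the model name and the number of components:
#optional: compare the criteria visually
plot(BIC)
plot(ICL)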
Here are the results:
> summary(BIC)
Best BIC values:
VVE,3 VVV,3 VEE,3
BIC -20693.33 -20704.26965 -20711.07961
BIC diff 0.00 -10.94192 -17.75188
> summary(ICL)
Best ICL values:
VVV,2 VVE,2 VEV,2
ICL -21621.86 -2.162203e+04 -21779.9036
ICL diff 0.00 -1.667867e-01 -158.0386
> LRT
-------------------------------------------------------------
Bootstrap sequential LRT for the number of mixture components
-------------------------------------------------------------
Model = VVE
Replications = 999
LRTS bootstrap p-value
1 vs 2 3242.884063 0.001
2 vs 3 158.729763 0.001
3 vs 4 2.290044 0.490
These answers aren't the same. BIC indicates 3 components and a VVE model (ellipsoidal Gaussians with variable volume and shape but equal orientation), while ICL indicates 2 components and a VVV model (orientation can vary too). The bootstrap LRT with a VVE model indicates 3 components. Assuming we don't know the data-generating process, what is the correct order of this model?
There isn't a "correct" order of the model. There is only a "most likely" order of the model, and what "most likely" means depends on the criterion you use to evaluate likelihood.
There is a good paper that surveys methods for determining model order, but it won't help unless you understand the fundamentals of Bayesian statistics. Looking at the paper for the {mclust} package by Scrucca et al. (2016), however, we learn a basic difference between the criteria we are using (p. 297):
BIC tends to select the number of mixture components needed to reasonably approximate the density, rather than the number of clusters as such...ICL penalises the BIC through an entropy term which measures clusters overlap
Scrucca et al. give a simplified formula for ICL as:
$$\mathrm{ICL}_{\mathcal{M},G} = \mathrm{BIC}_{\mathcal{M},G} + 2\sum_{i=1}^{n}\sum_{k=1}^{G} c_{ik}\log(z_{ik})$$
where BIC_{M,G} is the BIC for model M with G components, and the second term is the entropy penalty referenced above. It sums over observations i = 1, ..., n and components k = 1, ..., G; c_{ik} is an indicator that is 1 if the ith observation is assigned (by MAP classification) to cluster k and 0 otherwise, and z_{ik} is the conditional probability that the ith observation arises from the kth mixture component. Since log(z_{ik}) ≤ 0, this term gets smaller (i.e. more negative) the less certain the assignments are, that is, the lower the probability z_{ik} of each observation's assigned cluster. Since we are selecting the model that maximizes the ICL, the effect is to favour solutions where points are more clearly assigned to particular clusters (i.e., more separated solutions).
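To make the penalty concrete, here is a minimal sketch (reusing the simulated dat from above) that rebuilds ICL from a fitted model's BIC and entropy term, and checks it against mclust's own icl() function:
#fit one candidate model, then compute its ICL "by hand"
fit<-Mclust(dat,G=3,modelNames="VVE")
z<-fit$z #conditional probabilities z_ik
C<-unmap(fit$classification) #0/1 MAP indicators c_ik
fit$bic+2*sum(C*log(pmax(z,.Machine$double.xmin))) #BIC plus entropy term...
icl(fit) #...should match mclust's ICL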
What can we make of BIC vs ICL? Baudry et al (2010:333):
Biernacki, Celeux, and Govaert (2000) argued that the goal of clustering is not the same as that of estimating the best approximating mixture model, and so BIC may not be the best way of determining the number of clusters, even though it does perform well in selecting the number of components in a mixture model. Instead they proposed the ICL criterion, whose purpose is to assess the number of mixture components that leads to the best clustering. This turns out to be equivalent to BIC penalized by the entropy of the corresponding clustering. We argue here that the goal of selecting the number of mixture components for estimating the underlying probability density is well met by BIC, but that the goal of selecting the number of clusters may not be. Even when a multivariate Gaussian mixture model is used for clustering, the number of mixture components is not necessarily the same as the number of clusters. This is because a cluster may be better represented by a mixture of normals than by a single normal distribution.
The last line makes sense if you think about trying to approximate a bivariate lognormal distribution using bivariate normals. Even though the data came from a single distribution (one cluster), you'd have to fit at least two bivariate normals to get a reasonable approximation of its density (one for the "spike" and one for the "tail").
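Here is a minimal sketch of that idea (a hypothetical illustration, not part of the original example): a single skewed "cluster" drawn from a bivariate lognormal will typically lead BIC to select more than one Gaussian component to capture the density, while ICL's entropy penalty pushes toward fewer, better-separated clusters:
#one skewed cluster: exponentiate a bivariate normal
set.seed(123)
ln<-exp(mvrnorm(n=1000,mu=c(0,0),Sigma=matrix(c(0.5,0.2,0.2,0.5),2)))
summary(mclustBIC(ln)) #BIC tends to favour several components
summary(mclustICL(ln)) #ICL tends to favour fewer clusters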
So how do we interpret these results (assuming we don't know the DGP)? BIC's answer (3 components, VVE) tells us how many Gaussians are needed to approximate the density, while ICL's answer (2 components, VVV) tells us how many well-separated clusters the data form. The two criteria answer different questions, so neither is "wrong": choose the criterion that matches your goal, density estimation or clustering.
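A practical follow-up, sketched here as a suggestion rather than a prescription, is to fit both candidate solutions and compare the partitions they produce:
#compare the hard classifications of the two candidate solutions
fit2<-Mclust(dat,G=2,modelNames="VVV")
fit3<-Mclust(dat,G=3,modelNames="VVE")
table(fit2$classification,fit3$classification)
plot(fit2,what="classification")
plot(fit3,what="classification")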
Answered by antifrax on December 13, 2021