Cross Validated Asked by antifrax on December 13, 2021
Most examples for Gaussian Mixture Models (GMMs) employ datasets with fairly obvious underlying structure (well-separated clusters). How should one determine the order of a GMM (and interpret the result) when components overlap strongly?
For example, consider a dataset where the true data-generating process is 4 bivariate normal distributions. Working in R:
#generate data as a mixture of four bivariate Gaussians
library(MASS)
set.seed(123)
reds<-mvrnorm(n=1000,mu=c(2,2),Sigma=matrix(c(0.2,0.01,0.01,0.2),2))
greens<-mvrnorm(n=1000,mu=c(4,4),Sigma=matrix(c(2,0.9,0.9,2),2))
blues<-mvrnorm(n=1000,mu=c(2,3),Sigma=matrix(c(0.9,0.3,0.3,0.9),2))
blacks<-mvrnorm(n=1000,mu=c(2,3),Sigma=matrix(c(0.1,0.01,0.01,0.1),2))
#the data we observe:
dat<-rbind(reds,blues,greens,blacks)
#visualize
plot(NULL,xlim=c(-2,10),ylim=c(-2,10),
xlab="dimension 1",ylab="dimension 2",main="Observations")
points(reds,col="red",pch=".",cex=2)
points(blues,col="blue",pch=".",cex=2)
points(greens,col="green",pch=".",cex=2)
points(blacks,col="black",pch=".",cex=2)
Here is a plot of the data we observe:
[Figure: scatter plot of the four overlapping bivariate Gaussian samples, coloured red, green, blue, and black]
Now we want to fit a GMM to this. We will use the {mclust} package (Scrucca et al., 2016), which by default uses the Bayesian Information Criterion (BIC) to select both the model (i.e. the constraints on volume, shape, and orientation of the component Gaussians) and the order of the model (the number of components). The package also implements the Integrated Complete Likelihood (ICL) criterion, and a bootstrap-based Likelihood Ratio Test (LRT), for selecting the model order. Let's run them!
library(mclust)
#assess candidate models...
#...via BIC criterion
BIC<-mclustBIC(dat)
#...via ICL criterion
ICL<-mclustICL(dat)
#since VVE performs best for both, we will apply a likelihood ratio test using that model
LRT<-mclustBootstrapLRT(dat,modelName="VVE") #heads up, this is slow
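As a quick optional check before looking at the numbers, {mclust} also provides plot methods for these criterion objects, which show how each criterion varies with the model name and the number of components:
#optional: compare the criteria visually
plot(BIC)
plot(ICL)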
Here are the results:
> summary(BIC)
Best BIC values:
VVE,3 VVV,3 VEE,3
BIC -20693.33 -20704.26965 -20711.07961
BIC diff 0.00 -10.94192 -17.75188
> summary(ICL)
Best ICL values:
VVV,2 VVE,2 VEV,2
ICL -21621.86 -2.162203e+04 -21779.9036
ICL diff 0.00 -1.667867e-01 -158.0386
> LRT
-------------------------------------------------------------
Bootstrap sequential LRT for the number of mixture components
-------------------------------------------------------------
Model = VVE
Replications = 999
LRTS bootstrap p-value
1 vs 2 3242.884063 0.001
2 vs 3 158.729763 0.001
3 vs 4 2.290044 0.490
These answers aren't the same. BIC indicates 3 components and a VVE model (ellipsoidal Gaussians with variable volume and shape but equal orientation), while ICL indicates 2 components and a VVV model (orientation can vary too). The bootstrap LRT with a VVE model indicates 3 components. Assuming we don't know the data-generating process, what is the correct order of this model?
There isn't a "correct" order of the model. There is only a "most likely" order of the model, and what "most likely" means depends on the criterion you use to evaluate likelihood.
There is a good paper that surveys methods for determining model order, but it won't help unless you understand the fundamentals of Bayesian statistics. Looking at the paper for the {mclust} package by Scrucca et al. (2016), however, we learn a basic difference between the criteria we are using (p. 297):
BIC tends to select the number of mixture components needed to reasonably approximate the density, rather than the number of clusters as such...ICL penalises the BIC through an entropy term which measures clusters overlap
Scrucca et al. give a simplified formula for ICL as:
$$\mathrm{ICL}_{\mathcal{M},G} = \mathrm{BIC}_{\mathcal{M},G} + 2\sum_{i=1}^{n}\sum_{k=1}^{G} c_{ik}\log(z_{ik})$$
where BIC_{M,G} is the BIC for model M with G components, and the second term is the entropy penalty referenced above. It sums over observations i = 1, ..., n and components k = 1, ..., G; c_{ik} is an indicator that is 1 if the ith observation is assigned (by MAP classification) to cluster k and 0 otherwise, and z_{ik} is the conditional probability that the ith observation arises from the kth mixture component. Since log(z_{ik}) ≤ 0, this term gets smaller (i.e. more negative) the less certain the assignments are, that is, the lower the probability z_{ik} of each observation's assigned cluster. Since we are selecting the model that maximizes the ICL, the effect is to favour solutions where points are more clearly assigned to particular clusters (i.e., more separated solutions).
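To make the penalty concrete, here is a minimal sketch (reusing the simulated dat from above) that rebuilds ICL from a fitted model's BIC and entropy term, and checks it against mclust's own icl() function:
#fit one candidate model, then compute its ICL "by hand"
fit<-Mclust(dat,G=3,modelNames="VVE")
z<-fit$z #conditional probabilities z_ik
C<-unmap(fit$classification) #0/1 MAP indicators c_ik
fit$bic+2*sum(C*log(pmax(z,.Machine$double.xmin))) #BIC plus entropy term...
icl(fit) #...should match mclust's ICL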
What can we make of BIC vs ICL? Baudry et al (2010:333):
Biernacki, Celeux, and Govaert (2000) argued that the goal of clustering is not the same as that of estimating the best approximating mixture model, and so BIC may not be the best way of determining the number of clusters, even though it does perform well in selecting the number of components in a mixture model. Instead they proposed the ICL criterion, whose purpose is to assess the number of mixture components that leads to the best clustering. This turns out to be equivalent to BIC penalized by the entropy of the corresponding clustering. We argue here that the goal of selecting the number of mixture components for estimating the underlying probability density is well met by BIC, but that the goal of selecting the number of clusters may not be. Even when a multivariate Gaussian mixture model is used for clustering, the number of mixture components is not necessarily the same as the number of clusters. This is because a cluster may be better represented by a mixture of normals than by a single normal distribution.
The last line makes sense if you think about trying to approximate a bivariate lognormal distribution using bivariate normals. Even though the data came from a single distribution (one cluster), you'd have to fit at least two bivariate normals to get a reasonable approximation of its density (one for the "spike" and one for the "tail").
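Here is a minimal sketch of that idea (a hypothetical illustration, not part of the original example): a single skewed "cluster" drawn from a bivariate lognormal will typically lead BIC to select more than one Gaussian component to capture the density, while ICL's entropy penalty pushes toward fewer, better-separated clusters:
#one skewed cluster: exponentiate a bivariate normal
set.seed(123)
ln<-exp(mvrnorm(n=1000,mu=c(0,0),Sigma=matrix(c(0.5,0.2,0.2,0.5),2)))
summary(mclustBIC(ln)) #BIC tends to favour several components
summary(mclustICL(ln)) #ICL tends to favour fewer clusters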
So how do we interpret these results (assuming we don't know the DGP)? BIC's answer (3 components, VVE) tells us how many Gaussians are needed to approximate the density, while ICL's answer (2 components, VVV) tells us how many well-separated clusters the data form. The two criteria answer different questions, so neither is "wrong": choose the criterion that matches your goal, density estimation or clustering.
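A practical follow-up, sketched here as a suggestion rather than a prescription, is to fit both candidate solutions and compare the partitions they produce:
#compare the hard classifications of the two candidate solutions
fit2<-Mclust(dat,G=2,modelNames="VVV")
fit3<-Mclust(dat,G=3,modelNames="VVE")
table(fit2$classification,fit3$classification)
plot(fit2,what="classification")
plot(fit3,what="classification")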
Answered by antifrax on December 13, 2021