How to compare two clustering solutions when their labelling differs

Data Science Asked by fffrost on December 7, 2020

I am planning to test the reliability of a clustering approach for some data. My plan is to repeatedly draw pairs of random subsamples with replacement (e.g. 2x 10% of the total data), run the clustering on each individually, and then compare the results. The issue is that I am using HDBSCAN, which not only produces a varying number of clusters for different sets of data with the same parameters, but consequently also labels the clusters differently, since k is not fixed and the input data will always have a slightly different structure due to sampling variability.

I tested this by running HDBSCAN with the same parameters on two subsamples (A, B) of my data, and the issue is easy to see. The cluster labels and their corresponding sample counts for A were:
{-1: 4306, 0: 1737, 1: 2999, 2: 72068, 3: 20628, 4: 3120}

while for B they were:
{-1: 4478, 0: 1711, 1: 3048, 2: 72089, 3: 3123, 4: 20408}.

From this, it seems that the solution is very close until we compare label 3. It looks like label 3 of A corresponds to label 4 of B.

My initial thought was that I could just relabel both solutions in order of each cluster's sample size. But this assumes that the two solutions will be similar across many tests (which is ultimately the whole point of the testing in the first place). So my next thought is that I could impose the constraints that (1) there should be a "similar" number of samples in the noise group, and (2) the same number of clusters should be found. If these two conditions are met, I could relabel the clusters by order of their sample size and then compare them using the adjusted Rand index (ARI) or adjusted mutual information (AMI).
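A minimal sketch of that plan (function names and the noise tolerance are my own choices, and it assumes the two label vectors describe the same instances, e.g. the points that appear in both subsamples):

```python
import numpy as np
from sklearn.metrics import adjusted_rand_score, adjusted_mutual_info_score

def relabel_by_size(labels):
    """Map cluster ids to 0, 1, ... in decreasing order of cluster size; keep noise (-1) as-is."""
    labels = np.asarray(labels)
    ids, counts = np.unique(labels[labels != -1], return_counts=True)
    mapping = {old: new for new, old in enumerate(ids[np.argsort(-counts)])}
    mapping[-1] = -1
    return np.array([mapping[l] for l in labels])

def compare_runs(labels_a, labels_b, noise_tol=0.05):
    labels_a, labels_b = np.asarray(labels_a), np.asarray(labels_b)
    # Constraint (2): the same number of (non-noise) clusters
    if len(set(labels_a) - {-1}) != len(set(labels_b) - {-1}):
        return None
    # Constraint (1): a "similar" fraction of noise points (the tolerance is arbitrary)
    if abs(np.mean(labels_a == -1) - np.mean(labels_b == -1)) > noise_tol:
        return None
    a, b = relabel_by_size(labels_a), relabel_by_size(labels_b)
    return adjusted_rand_score(a, b), adjusted_mutual_info_score(a, b)
```

Note that ARI and AMI are already invariant to how the clusters are labelled, so the relabelling step mainly matters if you also want to compare matched clusters one by one rather than through the overall score.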

I am doubtful that this is sound, because I don't believe it is necessarily true (even given the two constraints) that two clusters given the same label on the basis of their sample size correspond to the same "global" cluster. It therefore seems problematic to me, but I can't think of an alternative.

Is the above approach generally reasonable? If not, is there something else I could do to assess the reliability/stability of HDBSCAN solutions? As an alternative, would it be better to just compute the DBCV score, the percentage of noise points, and the number of clusters, and use these as an indication of the quality of the clustering?

One Answer

This is only a partial answer since I'm not familiar with HDBSCAN, hopefully somebody else can provide a more complete answer.

As far as I understand, you need to find which cluster in A corresponds to which cluster in B, i.e. an alignment between the cluster labels of A and of B. Matching based only on size is not recommended, since a cluster in A could have the same (or similar) size as an unrelated cluster in B purely by chance. Since the instances are different, you would have to rely on how the method represents the clusters.

  • For example probabilistic clustering methods represent each cluster as a distribution over the features, so one can use a distance/similarity measure between these distributions.
  • With k-means one would compare the centroids and match the pairs of clusters whose distance is shortest (see the sketch below).
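As an illustration of the second idea, here is a small sketch (the function name and the use of Euclidean distance are my own choices) that pairs up the clusters of two k-means runs so that the total centroid distance is minimal, using the Hungarian algorithm:

```python
from scipy.optimize import linear_sum_assignment
from scipy.spatial.distance import cdist

def match_clusters(centroids_a, centroids_b):
    """Pair each cluster of run A with the best-fitting cluster of run B."""
    cost = cdist(centroids_a, centroids_b)      # pairwise Euclidean distances
    rows, cols = linear_sum_assignment(cost)    # Hungarian algorithm: minimal total cost
    return list(zip(rows, cols))

# Usage with two fitted k-means models (illustrative):
# pairs = match_clusters(kmeans_a.cluster_centers_, kmeans_b.cluster_centers_)
```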

I'm not familiar with HDBSCAN so I don't know how the clusters are represented inside the model: whatever this is, the idea would be to compare each internal representation of a cluster in A vs. the same in B.
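Without knowing HDBSCAN's internals, one crude stand-in (an assumption on my part, not something the model itself exposes) is to average the member points of each cluster and reuse a matching like the one sketched above:

```python
import numpy as np

def cluster_means(X, labels):
    """Mean of the points assigned to each non-noise cluster, ordered by cluster id."""
    labels = np.asarray(labels)
    ids = sorted(set(labels) - {-1})
    return np.vstack([X[labels == i].mean(axis=0) for i in ids])

# e.g. match_clusters(cluster_means(X_a, labels_a), cluster_means(X_b, labels_b))
```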

Correct answer by Erwan on December 7, 2020
