TransWikia.com

Unsupervised Hierarchical Agglomerative Clustering

Data Science Asked on July 1, 2021

I’ve read a number of papers where the authors talk about "Unsupervised Hierarchical Agglomerative Clustering". They seem to imply that the algorithm determines the number of clusters based on a hyper-parameter:

We define the heterogeneity metric within a cluster to be the average
of all-pair Jaccard distances, and at each step merge two clusters if
the heterogeneity of the resultant cluster is below a specified
threshold.

When I search for Python implementations of agglomerative clustering I keep coming up with sklearn, which requires the number of clusters to be specified a priori. In most examples this number is chosen by plotting a dendrogram and then, apparently, eyeballing the chart, for example https://towardsdatascience.com/machine-learning-algorithms-part-12-hierarchical-agglomerative-clustering-example-in-python-1e18e0075019. I'd argue it's impossible to determine from the chart alone whether 3 or 5 is optimal (based on the largest vertical distance). I believe this is Ward's method, but I'm not sure it's the same as "merging clusters where the heterogeneity is below a threshold".

Is this possible in sklearn, or is there another Python implementation which does this? I feel that at the very least there should be a way to process the dendrogram programmatically rather than plotting it.

One Answer

I think I've figured out how to implement the algorithm described in the paper I'm studying. I suspect they used scipy.cluster.hierarchy.

Anyway, my process is:

  1. Generate a distance matrix y from my list of examples x.
  2. Compute the linkage using scipy.cluster.hierarchy.linkage.
  3. Generate flat clusters using scipy.cluster.hierarchy.fcluster.

The last step is where the threshold mentioned in the paper is applied. I still have a question about how to use fcluster to generate clusters based on heterogeneity.
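A minimal sketch of the three steps, under some assumptions that are not from the paper: the toy boolean matrix x, the threshold value 0.6, and the choice of method="average" are all illustrative. Note that cutting an average-linkage tree at a distance threshold bounds the average distance between the two clusters being merged, which is related to, but not the same as, the average of all pairwise distances inside the merged cluster. The hypothetical helper heterogeneity below computes the latter so the flat clusters can at least be checked against the threshold afterwards.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import pdist, squareform

# Toy boolean feature matrix (hypothetical data): one row per example.
x = np.array([
    [1, 1, 0, 0, 0, 0],
    [1, 1, 1, 0, 0, 0],
    [1, 1, 0, 1, 0, 0],
    [0, 0, 0, 0, 1, 1],
    [0, 0, 0, 1, 1, 1],
], dtype=bool)

# 1. Condensed pairwise Jaccard distance matrix.
y = pdist(x, metric="jaccard")

# 2. Hierarchical linkage (average linkage is an assumption here).
Z = linkage(y, method="average")

# 3. Flat clusters: stop merging once the linkage distance exceeds the threshold.
threshold = 0.6  # illustrative value
labels = fcluster(Z, t=threshold, criterion="distance")


def heterogeneity(dist_square, members):
    """Mean of all pairwise distances among the given member indices."""
    if len(members) < 2:
        return 0.0
    sub = dist_square[np.ix_(members, members)]
    return float(sub[np.triu_indices(len(members), k=1)].mean())


# Check the paper-style heterogeneity of each resulting cluster.
D = squareform(y)
per_cluster = {
    int(label): heterogeneity(D, np.flatnonzero(labels == label))
    for label in np.unique(labels)
}
print(labels)
print(per_cluster)
```

If a cluster's heterogeneity comes out above the threshold, lowering t and re-running fcluster is one option; reproducing the paper's merge rule exactly would likely need a custom merge loop.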

What I've found confusing is that many tutorials on choosing the number of clusters for sklearn.cluster.AgglomerativeClustering use scipy.cluster.hierarchy.linkage and then scipy.cluster.hierarchy.dendrogram to plot a dendrogram, which is then inspected visually to decide how many clusters are required.
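The "largest vertical distance" heuristic can also be applied without plotting, because the third column of the linkage matrix holds the height of each merge. The sketch below is only an illustration: the three-blob toy data, the random seed, and the use of Ward linkage are assumptions, and the largest-gap rule is a heuristic rather than a guaranteed optimum.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import pdist

# Hypothetical toy data: three well-separated 2-D blobs of 10 points each.
rng = np.random.default_rng(0)
points = np.vstack([
    rng.normal(loc=center, scale=0.1, size=(10, 2))
    for center in ([0, 0], [5, 5], [0, 5])
])

# Ward linkage on Euclidean distances; Z has one row per merge.
Z = linkage(pdist(points), method="ward")

# Column 2 holds the merge heights (the vertical axis of a dendrogram).
heights = Z[:, 2]
gaps = np.diff(heights)
i = int(np.argmax(gaps))  # largest jump lies between merge i and merge i + 1

# Cutting inside that gap keeps merges 0..i, leaving n - (i + 1) clusters.
n_clusters = len(points) - (i + 1)
cut = (heights[i] + heights[i + 1]) / 2.0
labels = fcluster(Z, t=cut, criterion="distance")

print(n_clusters, np.unique(labels))
```

The same labels could be obtained by passing n_clusters to fcluster with criterion="maxclust".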

Answered by David Waterworth on July 1, 2021
