Data Science Asked on July 1, 2021
I’ve read a number of papers where the authors talk about "Unsupervised Hierarchical Agglomerative Clustering". They seem to imply that the algorithm determines the number of clusters based on a hyper-parameter:
"We define the heterogeneity metric within a cluster to be the average
of all-pair Jaccard distances, and at each step merge two clusters if
the heterogeneity of the resultant cluster is below a specified
threshold."
When I search for Python implementations of agglomerative clustering I keep coming up with sklearn, which requires the number of clusters to be specified a priori. In most examples this number is chosen by plotting a dendrogram and then, by what appears to be eyeballing the chart, deciding how many clusters there are, for example https://towardsdatascience.com/machine-learning-algorithms-part-12-hierarchical-agglomerative-clustering-example-in-python-1e18e0075019 I’d argue it’s impossible from the chart alone to determine whether 3 or 5 is optimal (based on the largest vertical distance). I believe this is Ward’s method, but I’m not sure it’s the same as "merging clusters where the heterogeneity is below a threshold".
Is this possible in sklearn, or is there another Python implementation which does this? I feel that at the very least there should be a way to process the dendrogram programmatically rather than plotting it.
I think I've figured out how to implement the algorithm described in the paper I'm studying. I suspect they used scipy.cluster.hierarchy.
Anyway, my process is:
1. Build the linkage matrix with scipy.cluster.hierarchy.linkage
2. Form flat clusters with scipy.cluster.hierarchy.fcluster
The last step is where the threshold mentioned above is applied. I still have a question about how to use fcluster to generate clusters based on heterogeneity.
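A minimal sketch of that two-step process, under assumptions that are mine and not confirmed by the paper: the toy data `X`, the `threshold` value of 0.6, average linkage, and `criterion="distance"` are all illustrative choices. Note that average linkage merges on the mean distance between two clusters, which is not the same quantity as the mean all-pair distance within the merged cluster, so this approximates the quoted rule rather than reproducing it. The hypothetical helper `heterogeneity` computes the within-cluster figure afterwards so the result can be checked against the threshold.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import pdist, squareform

# Hypothetical toy data: each row is an item, each column a binary feature.
X = np.array(
    [
        [1, 1, 1, 0, 0, 0],
        [1, 1, 0, 0, 0, 0],
        [1, 1, 1, 1, 0, 0],
        [0, 0, 0, 1, 1, 1],
        [0, 0, 0, 0, 1, 1],
        [0, 0, 1, 1, 1, 1],
    ],
    dtype=bool,
)

# Condensed vector of all-pair Jaccard distances.
dists = pdist(X, metric="jaccard")

# Average linkage: each merge height is the mean distance between two clusters.
Z = linkage(dists, method="average")

# Assumed threshold: cut the tree so no cluster spans a merge above this height.
threshold = 0.6
labels = fcluster(Z, t=threshold, criterion="distance")

# Square form of the distances, used to check within-cluster heterogeneity.
square = squareform(dists)


def heterogeneity(members):
    """Mean of all pairwise distances among the given row indices."""
    if len(members) < 2:
        return 0.0
    sub = square[np.ix_(members, members)]
    upper = np.triu_indices(len(members), k=1)
    return float(sub[upper].mean())


per_cluster = {
    int(label): heterogeneity(np.flatnonzero(labels == label))
    for label in np.unique(labels)
}
print(labels, per_cluster)
```

On this toy data the cut separates the first three rows from the last three, and each cluster's heterogeneity stays under the threshold. If the paper's exact merge rule matters, a custom merge loop that recomputes heterogeneity for every candidate merge would probably be needed instead.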
What I found confusing is that there are a lot of tutorials on how to determine the number of clusters for sklearn.cluster.AgglomerativeClustering which use scipy.cluster.hierarchy.linkage and then scipy.cluster.hierarchy.dendrogram to plot a dendrogram, which is then used to visually identify how many clusters are required.
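For what it's worth, recent versions of scikit-learn also accept a `distance_threshold` in place of a fixed cluster count, provided `n_clusters=None` is passed. The sketch below uses hypothetical toy points, the default Euclidean metric, and an assumed threshold of 1.0. It does not implement the paper's heterogeneity rule; it only shows the number of clusters being chosen from a threshold without plotting anything.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

# Hypothetical toy data: two tight groups of four points, far apart.
group = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [0.1, 0.1]])
points = np.vstack([group, group + 5.0])

# n_clusters must be None when distance_threshold is set.
# Merging stops once the linkage distance reaches the (assumed) threshold.
model = AgglomerativeClustering(
    n_clusters=None,
    distance_threshold=1.0,
    linkage="average",
)
model.fit(points)

# The number of clusters found and the label assigned to each point.
print(model.n_clusters_, model.labels_)
```

To use Jaccard distances here, a precomputed distance matrix can be passed instead of raw features, although the name of the parameter that selects the precomputed mode differs between scikit-learn versions, so check the documentation for the installed release.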
Answered by David Waterworth on July 1, 2021