Data Science Asked on September 5, 2021
I am using several metrics to determine the correct number of clusters. To do this I selected 3 clustering algorithms and 3 internal evaluation metrics: Silhouette (S), Calinski-Harabasz (CH) and Davies-Bouldin (DB).
The results were the following:
Algorithm   S    CH   DB
Kmean       3    3    9
Agglo       2    2    9
Gauss       3    3    10
The original dataset has 3 groups, and in general S and CH work well. The question is: why does DB always return a high value for the number of clusters?
Thanks
Each clustering evaluation metric follows a different philosophy:
Silhouette analysis can be used to study the separation distance between the resulting clusters. Silhouette coefficients near +1 indicate that the sample is far away from the neighboring clusters. A value of 0 indicates that the sample is on or very close to the decision boundary between two neighboring clusters, and negative values indicate that those samples might have been assigned to the wrong cluster.
Calinski–Harabasz rewards clusterings in which the cluster centroids are far apart and the cluster members are close to their respective centroids.
Davies-Bouldin is defined as the average similarity measure of each cluster with its most similar cluster, where similarity is the ratio of within-cluster distances to between-cluster distances. Thus, clusters that are farther apart and less dispersed result in a better score, which for this index means a lower value (the minimum is 0). It prefers clusters that are equally distanced from each other.
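The three descriptions above can be compared side by side with a minimal sketch, assuming scikit-learn is available. The synthetic data from `make_blobs`, the helper name `score_range`, and the range of k values are all illustrative assumptions, not the asker's actual setup. The one detail worth checking in any such comparison is the direction of each metric: Silhouette and Calinski-Harabasz are maximized, while Davies-Bouldin is minimized. Whether that explains the results in the question cannot be determined from the post.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import (
    calinski_harabasz_score,
    davies_bouldin_score,
    silhouette_score,
)

# Assumption: a synthetic stand-in for the dataset, with 3 well-separated groups.
X, _ = make_blobs(
    n_samples=300,
    centers=[[0, 0], [10, 10], [-10, 10]],
    cluster_std=1.0,
    random_state=0,
)


def score_range(X, k_values):
    """Hypothetical helper: fit K-means for each k and record the three metrics."""
    scores = {"S": {}, "CH": {}, "DB": {}}
    for k in k_values:
        labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
        scores["S"][k] = silhouette_score(X, labels)
        scores["CH"][k] = calinski_harabasz_score(X, labels)
        scores["DB"][k] = davies_bouldin_score(X, labels)
    return scores


scores = score_range(X, range(2, 11))

# Silhouette and Calinski-Harabasz: higher is better, so take the argmax.
# Davies-Bouldin: lower is better, so take the argmin.
best_k = {
    "S": max(scores["S"], key=scores["S"].get),
    "CH": max(scores["CH"], key=scores["CH"].get),
    "DB": min(scores["DB"], key=scores["DB"].get),
}
print(best_k)
```

Swapping `KMeans` for another estimator with a `fit_predict` method (for example `AgglomerativeClustering`) would give the per-algorithm rows of a table like the one in the question.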
We can't really say which clustering quality measure is good or bad in general. It depends on what you want to evaluate. You have to check whether the measure is relevant for the kind of clustering you are doing.
For more info, refer to this discussion.
Correct answer by prashant0598 on September 5, 2021