TransWikia.com

What criteria use in order to select the best internal validation for clustering?

Data Science Asked on September 4, 2021

I am doing homework about how to evaluate a clustering algorithm both hierarchical and partitional.

For doing this I have a dataset that I can plot as you can see:

enter image description here

The clustering algorithms that I am using are K-Means, Gaussian mixture, and Agglomerative.

The problem is that I don’t know which criteria use for selecting an internal validation for K-MEANS, for example, I read that silhouette is better for a convex cluster, but I am not able to see in the image if any of groups are or aren’t convex.

I am read some papers like:

but the problem is that I am not found any criteria to select A or B method for internal evaluation with facts, they’re only a test about all vs all.

Where can found or what is found the criteria for choosing one or the other method for evaluation criteria?
 

One Answer

There is no catch-all metric that can be used for evaluation (internal or otherwise) of the clustering achieved. This is why machine learning is also art. There are no hard limits, many things depend on application, domain, and data themselves.

TL;DR

The purpose of the homework is to familiarise yourself with the problem of clustering, but also with the fact that there is no definite best method nor evaluation metric as panacea for all cases.

As such, you can try various criteria (see below references) on your data and algorithm results and ponder on their effectiveness for your problem at hand.

An Impossibility Theorem for Clustering

Although the study of clustering is centered around an intuitively compelling goal,it has been very difficult to develop a unified framework for reasoning about it at a technical level, and profoundly diverse approaches to clustering abound in the research community. Here we suggest a formal perspective on the difficulty in finding such a unification, in the form of an impossibility theorem: for a set of three simple properties, we show that there is no clustering function satisfying all three. Relaxations of these properties expose some of the interesting (and unavoidable) trade-offs at work in well-studied clustering techniques such as single-linkage, sum-of-pairs, k-means, and k-median.

Wikipedia has a nice summary of internal evaluation metrics:

Therefore, the internal evaluation measures are best suited to get some insight into situations where one algorithm performs better than another, but this shall not imply that one algorithm produces more valid results than another. Validity as measured by such an index depends on the claim that this kind of structure exists in the data set. An algorithm designed for some kind of models has no chance if the data set contains a radically different set of models, or if the evaluation measures a radically different criterion. For example, k-means clustering can only find convex clusters, and many evaluation indexes assume convex clusters. On a data set with non-convex clusters neither the use of k-means, nor of an evaluation criterion that assumes convexity, is sound.

More than a dozen of internal evaluation measures exist, usually based on the intuition that items in the same cluster should be more similar than items in different clusters. For example, the following methods can be used to assess the quality of clustering algorithms based on internal criterion:

Davies–Bouldin index

The Davies–Bouldin index can be calculated by the following formula:

$$DB={frac {1}{n}}sum _{i=1}^{n}max _{jneq i}left({frac {sigma _{i}+sigma _{j}}{d(c_{i},c_{j})}}right)$$

where $n$ is the number of clusters, $c_{x}$ is the centroid of cluster $x$, $sigma _{x}$ is the average distance of all elements in cluster $x$ to centroid $c_{x}$, and $d(c_{i},c_{j})$ is the distance between centroids $c_{i}$ and $c_{j}$. Since algorithms that produce clusters with low intra-cluster distances (high intra-cluster similarity) and high inter-cluster distances (low inter-cluster similarity) will have a low Davies–Bouldin index, the clustering algorithm that produces a collection of clusters with the smallest Davies–Bouldin index is considered the best algorithm based on this criterion.

Dunn index

The Dunn index aims to identify dense and well-separated clusters. It is defined as the ratio between the minimal inter-cluster distance to maximal intra-cluster distance. For each cluster partition, the Dunn index can be calculated by the following formula:

$$D={frac {min _{1leq i<jleq n}d(i,j)}{max _{1leq kleq n}d^{prime }(k)}},,$$

where $d(i,j)$ represents the distance between clusters $i$ and $j$, and $d'(k)$ measures the intra-cluster distance of cluster $k$. The inter-cluster distance $d(i,j)$ between two clusters may be any number of distance measures, such as the distance between the centroids of the clusters. Similarly, the intra-cluster distance $d'(k)$ may be measured in a variety ways, such as the maximal distance between any pair of elements in cluster $k$. Since internal criterion seek clusters with high intra-cluster similarity and low inter-cluster similarity, algorithms that produce clusters with high Dunn index are more desirable.

Silhouette coefficient

The silhouette coefficient contrasts the average distance to elements in the same cluster with the average distance to elements in other clusters. Objects with a high silhouette value are considered well clustered, objects with a low value may be outliers. This index works well with k-means clustering, and is also used to determine the optimal number of clusters.

Furthermore:

An Evaluation of Criteria for Measuring the Quality of Clusters

An important problem in clustering is how to decide what is the best set of clusters for a given data set, in terms of both the number of clusters and the member-ship of those clusters. In this paper we develop four criteria for measuring the quality of different sets of clusters. These criteria are designed so that different criteria prefer cluster sets that generalise at different levels of granularity. We evaluate the suitability of these criteria for non-hierarchical clustering of the results returned by a search engine. We also compare the number of clusters chosen by these criteria with the number of clusters chosen by a group of human subjects. Our results demonstrate that our criteria match the variability exhibited by human subjects, indicating there is no single perfect criterion. Instead, it is necessary to select the correct criterion to match a human subject's generalisation needs.

Evaluation Metrics for Unsupervised Learning Algorithms

Determining the quality of the results obtained by clustering techniques is a key issue in unsupervised machine learning. Many authors have discussed the desirable features of good clustering algorithms. However, Jon Kleinberg established an impossibility theorem for clustering. As a consequence, a wealth of studies have proposed techniques to evaluate the quality of clustering results depending on the characteristics of the clustering problem and the algorithmic technique employed to cluster data.

Understanding of Internal Clustering Validation Measures

Clustering validation has long been recognized as one of the vital issues essential to the success of clustering applications. In general, clustering validation can be categorized into two classes, external clustering validation and internal clustering validation. In this paper, we focus on internal clustering validation and present a detailed study of 11 widely used internal clustering validation measures for crisp clustering. From five conventional aspects of clustering, we investigate their validation properties. Experiment results show that ?_??? is the only internal validation measure which performs well in all five aspects, while other measures have certain limitations in different application scenarios.

Correct answer by Nikos M. on September 4, 2021

Add your own answers!

Ask a Question

Get help from others!

© 2024 TransWikia.com. All rights reserved. Sites we Love: PCI Database, UKBizDB, Menu Kuliner, Sharing RPP