
Determine K in K-Means Clustering

Data Science Asked by Sreejithc321 on November 1, 2020

I have salary data for several users (a Python list).

Now I am using KMeans to cluster them.

Given this data, is there a way to figure out the best value for 'K' programmatically?

I tried silhouette analysis: take the k with the highest silhouette_avg score. Is this the best way?
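
For reference, a minimal sketch of that approach with scikit-learn (the toy salaries list and the candidate range for k are assumptions):

import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

salaries = [32000, 35000, 41000, 87000, 90000, 95000, 150000]  # toy data
X = np.array(salaries).reshape(-1, 1)  # KMeans expects a 2-D array

best_k, best_score = None, -1.0
for k in range(2, 6):  # silhouette needs at least 2 clusters
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    score = silhouette_score(X, labels)  # the silhouette_avg in the question
    if score > best_score:
        best_k, best_score = k, score

print(f"Best k = {best_k} (average silhouette = {best_score:.3f})")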

4 Answers

From Wang, Kaijun, Baijie Wang, and Liuqing Peng. "CVAP: Validation for cluster analyses." Data Science Journal 0 (2009): 0904220071:

To measure the quality of clustering results, there are two kinds of validity indices: external indices and internal indices.

An external index is a measure of agreement between two partitions where the first partition is the a priori known clustering structure, and the second results from the clustering procedure (Dudoit et al., 2002).

Internal indices are used to measure the goodness of a clustering structure without external information (Tseng et al., 2005).

For external indices, we evaluate the results of a clustering algorithm based on a known cluster structure of a data set (or cluster labels).

For internal indices, we evaluate the results using quantities and features inherent in the data set. The optimal number of clusters is usually determined based on an internal validity index.

(Dudoit et al., 2002): Dudoit, S. & Fridlyand, J. (2002) A prediction-based resampling method for estimating the number of clusters in a dataset. Genome Biology, 3(7): 0036.1-21.

(Tseng et al., 2005): Thalamuthu, A., Mukhopadhyay, I., Zheng, X., & Tseng, G. C. (2006) Evaluation and comparison of gene clustering methods in microarray analysis. Bioinformatics, 22(19):2405-12.


In your case, you need an internal index, since you have no a priori clustering structure. There are dozens of internal indices, such as:

  • Silhouette index (implementation in MATLAB)
  • Davies-Bouldin
  • Calinski-Harabasz
  • Dunn index (implementation in MATLAB)
  • R-squared index
  • Hubert-Levin (C-index)
  • Krzanowski-Lai index
  • Hartigan index
  • Root-mean-square standard deviation (RMSSTD) index
  • Semi-partial R-squared (SPR) index
  • Distance between two clusters (CD) index
  • Weighted inter-intra index
  • Homogeneity index
  • Separation index

Each of them has pros and cons, but at least they'll give you a more formal basis for your comparison. The MATLAB toolbox CVAP might be handy, as it contains many internal validity indices.
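
As an aside, if you are working in Python rather than MATLAB, scikit-learn implements a few of the indices above directly; here is a minimal sketch (the blob data is an assumption standing in for your salaries):

from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import (silhouette_score, davies_bouldin_score,
                             calinski_harabasz_score)

X, _ = make_blobs(n_samples=300, centers=3, random_state=42)

for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    print(k,
          round(silhouette_score(X, labels), 3),         # higher is better
          round(davies_bouldin_score(X, labels), 3),     # lower is better
          round(calinski_harabasz_score(X, labels), 3))  # higher is better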



Correct answer by Franck Dernoncourt on November 1, 2020

According to your question, you don't have prior knowledge of the clusters and you want to estimate them. Note that K-means is sensitive to outliers, so index-based scoring can fail in their presence. If you knew the clusters in advance, index scoring would measure cluster quality. For your problem, I would suggest agglomerative clustering, which I think gives better results than K-means here. As a hierarchical algorithm, it does not need k up front (scikit-learn's AgglomerativeClustering defaults to 2 clusters, which works in many cases); you only need to decide where to cut the edges, which you can usually see by looking at the dendrogram.
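
A minimal sketch of that idea with SciPy's hierarchical clustering (the toy salary data is an assumption); inspect the dendrogram, then cut the tree:

import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import dendrogram, linkage, fcluster

# Toy 1-D "salary" data with two obvious groups (assumption)
rng = np.random.default_rng(0)
X = np.concatenate([rng.normal(30000, 2000, 50),
                    rng.normal(90000, 5000, 50)]).reshape(-1, 1)

Z = linkage(X, method="ward")  # Ward linkage minimises within-cluster variance
dendrogram(Z)
plt.show()  # long vertical edges suggest where to cut

labels = fcluster(Z, t=2, criterion="maxclust")  # e.g. cut into 2 clusters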

Answered by Gaurav Koradiya on November 1, 2020

In this case, the best ways to find a good "K" value are:

  1. Domain knowledge, where you determine the number of groups you want to divide the data into.
  2. Plot a graph with the K values on the x-axis and the error (inertia) on the y-axis. The optimal k is the elbow point of that curve.

Answered by Raghuvaran M on November 1, 2020

In addition to the silhouette score, you could use the well-known elbow rule.

I also recommend using algorithms that do not need the number of clusters to be specified in advance, like DBSCAN or MeanShift, so that you can compare the number of clusters they find with the number you are specifying for K-means.
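
For example, a minimal sketch of that cross-check with DBSCAN (the blob data, eps, and min_samples are assumptions you would tune for your own data):

from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler

X, _ = make_blobs(n_samples=300, centers=3, random_state=42)
X_scaled = StandardScaler().fit_transform(X)  # DBSCAN is distance-based, so scale first

labels = DBSCAN(eps=0.3, min_samples=5).fit_predict(X_scaled)
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)  # label -1 marks noise
print(f"DBSCAN found {n_clusters} clusters")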

Finally, be aware that K-means is affected by data scaling, so you should scale your data beforehand so that distances between features are comparable. Also, K-means is sensitive to centroid initialization, so you should run the algorithm more than once and check how the clusters differ each time.
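
A minimal sketch of those two precautions, scaling first and then checking stability across random initializations (the blob data is an assumption):

from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler

X, _ = make_blobs(n_samples=300, centers=3, random_state=42)
X_scaled = StandardScaler().fit_transform(X)  # put features on a comparable scale

# Run K-means with several different seeds and compare the resulting inertias
for seed in range(3):
    km = KMeans(n_clusters=3, n_init=10, random_state=seed).fit(X_scaled)
    print(f"seed={seed}  inertia={km.inertia_:.2f}")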

EDIT: If you want to use the elbow rule to find the optimal number of clusters programmatically, take a look at ElbowRuleRepo; the example below uses the kneed package.

Example:

from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt
from kneed import KneeLocator

# Example data (X was undefined in the original snippet)
X, _ = make_blobs(n_samples=300, centers=3, random_state=42)
plt.scatter(X[:, 0], X[:, 1])
plt.show()

# Inertia (within-cluster sum of squares) for each candidate k
inertias = []
for k in range(1, 7):
    kmeans = KMeans(n_clusters=k, n_init=10).fit(X)
    inertias.append(kmeans.inertia_)

x = list(range(1, 7))
y = inertias
plt.plot(x, y)
plt.xlabel("k")
plt.ylabel("inertia")
plt.show()

# Inertia always decreases with k, so the curve is convex and decreasing;
# KneeLocator picks the elbow (here, 3) programmatically
optimal = KneeLocator(x=x, y=y, curve="convex", direction="decreasing")
print(f"The optimal number of clusters is {optimal.knee} "
      f"with an inertia of {optimal.knee_y}")

Answered by Julio Jesus Luna on November 1, 2020
