
Determine K in K-Means Clustering

Data Science Asked by Sreejithc321 on November 1, 2020

I have salary data for several users (a Python list).

Now I am using KMeans to cluster them.

Given this data, is there a way to figure out the best value for 'K' programmatically?

I tried silhouette analysis: take the k with the highest silhouette_avg score. Is this the best way?
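
For reference, a minimal sketch of that approach with scikit-learn (the toy salaries list and the candidate range for k are assumptions):

import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

salaries = [32000, 35000, 41000, 87000, 90000, 95000, 150000]  # toy data
X = np.array(salaries).reshape(-1, 1)  # KMeans expects a 2-D array

best_k, best_score = None, -1.0
for k in range(2, 6):  # silhouette needs at least 2 clusters
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    score = silhouette_score(X, labels)  # the silhouette_avg in the question
    if score > best_score:
        best_k, best_score = k, score

print(f"Best k = {best_k} (average silhouette = {best_score:.3f})")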

4 Answers

From Wang, Kaijun, Baijie Wang, and Liuqing Peng. "CVAP: Validation for cluster analyses." Data Science Journal 0 (2009): 0904220071:

To measure the quality of clustering results, there are two kinds of validity indices: external indices and internal indices.

An external index is a measure of agreement between two partitions where the first partition is the a priori known clustering structure, and the second results from the clustering procedure (Dudoit et al., 2002).

Internal indices are used to measure the goodness of a clustering structure without external information (Tseng et al., 2005).

For external indices, we evaluate the results of a clustering algorithm based on a known cluster structure of a data set (or cluster labels).

For internal indices, we evaluate the results using quantities and features inherent in the data set. The optimal number of clusters is usually determined based on an internal validity index.

(Dudoit et al., 2002): Dudoit, S. & Fridlyand, J. (2002) A prediction-based resampling method for estimating the number of clusters in a dataset. Genome Biology, 3(7): 0036.1-21.

(Tseng et al., 2005): Thalamuthu, A., Mukhopadhyay, I., Zheng, X., & Tseng, G. C. (2006) Evaluation and comparison of gene clustering methods in microarray analysis. Bioinformatics, 22(19):2405-12.


In your case, you need an internal index, since you have no a priori clustering structure. There are dozens of internal indices, such as:

  • Silhouette index (implementation in MATLAB)
  • Davies-Bouldin
  • Calinski-Harabasz
  • Dunn index (implementation in MATLAB)
  • R-squared index
  • Hubert-Levin (C-index)
  • Krzanowski-Lai index
  • Hartigan index
  • Root-mean-square standard deviation (RMSSTD) index
  • Semi-partial R-squared (SPR) index
  • Distance between two clusters (CD) index
  • Weighted inter-intra index
  • Homogeneity index
  • Separation index

Each of them has pros and cons, but at least they'll give you a more formal basis for your comparison. The MATLAB toolbox CVAP might be handy, as it contains many internal validity indices.
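
As an aside, if you are working in Python rather than MATLAB, scikit-learn implements a few of the indices above directly; here is a minimal sketch (the blob data is an assumption standing in for your salaries):

from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import (silhouette_score, davies_bouldin_score,
                             calinski_harabasz_score)

X, _ = make_blobs(n_samples=300, centers=3, random_state=42)

for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    print(k,
          round(silhouette_score(X, labels), 3),         # higher is better
          round(davies_bouldin_score(X, labels), 3),     # lower is better
          round(calinski_harabasz_score(X, labels), 3))  # higher is better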



Correct answer by Franck Dernoncourt on November 1, 2020

According to your question, you don't have prior knowledge of the clusters and you want to estimate them. Note that K-means is sensitive to outliers, so index-based scoring can fail in their presence. If you knew the clusters in advance, index scoring would measure cluster quality. For your problem, I would suggest agglomerative clustering, which I think gives better results than K-means here. As a hierarchical algorithm, it does not need k up front (scikit-learn's AgglomerativeClustering defaults to 2 clusters, which works in many cases); you only need to decide where to cut the edges, which you can usually see by looking at the dendrogram.
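
A minimal sketch of that idea with SciPy's hierarchical clustering (the toy salary data is an assumption); inspect the dendrogram, then cut the tree:

import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import dendrogram, linkage, fcluster

# Toy 1-D "salary" data with two obvious groups (assumption)
rng = np.random.default_rng(0)
X = np.concatenate([rng.normal(30000, 2000, 50),
                    rng.normal(90000, 5000, 50)]).reshape(-1, 1)

Z = linkage(X, method="ward")  # Ward linkage minimises within-cluster variance
dendrogram(Z)
plt.show()  # long vertical edges suggest where to cut

labels = fcluster(Z, t=2, criterion="maxclust")  # e.g. cut into 2 clusters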

Answered by Gaurav Koradiya on November 1, 2020

In this case, the best ways to find a good "K" value are:

  1. Domain knowledge, where you determine the number of groups you want to divide the data into.
  2. Plot a graph with the K values on the x-axis and the error (inertia) on the y-axis. The optimal k is the elbow point of that curve.

Answered by Raghuvaran M on November 1, 2020

In addition to the silhouette score, you could use the well-known elbow rule.

I also recommend using algorithms that do not need the number of clusters to be specified in advance, like DBSCAN or MeanShift, so that you can compare the number of clusters they find with the number you are specifying for K-means.
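
For example, a minimal sketch of that cross-check with DBSCAN (the blob data, eps, and min_samples are assumptions you would tune for your own data):

from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler

X, _ = make_blobs(n_samples=300, centers=3, random_state=42)
X_scaled = StandardScaler().fit_transform(X)  # DBSCAN is distance-based, so scale first

labels = DBSCAN(eps=0.3, min_samples=5).fit_predict(X_scaled)
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)  # label -1 marks noise
print(f"DBSCAN found {n_clusters} clusters")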

Finally, be aware that K-means is affected by data scaling, so you should scale your data beforehand so that distances between features are comparable. Also, K-means is sensitive to centroid initialization, so you should run the algorithm more than once and check how the clusters differ each time.
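
A minimal sketch of those two precautions, scaling first and then checking stability across random initializations (the blob data is an assumption):

from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler

X, _ = make_blobs(n_samples=300, centers=3, random_state=42)
X_scaled = StandardScaler().fit_transform(X)  # put features on a comparable scale

# Run K-means with several different seeds and compare the resulting inertias
for seed in range(3):
    km = KMeans(n_clusters=3, n_init=10, random_state=seed).fit(X_scaled)
    print(f"seed={seed}  inertia={km.inertia_:.2f}")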

EDIT: If you want to use the elbow rule to find the optimal number of clusters programmatically, take a look at ElbowRuleRepo; the example below uses the kneed package.

Example:

from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt
from kneed import KneeLocator

# Example data (X was undefined in the original snippet)
X, _ = make_blobs(n_samples=300, centers=3, random_state=42)
plt.scatter(X[:, 0], X[:, 1])
plt.show()

# Inertia (within-cluster sum of squares) for each candidate k
inertias = []
for k in range(1, 7):
    kmeans = KMeans(n_clusters=k, n_init=10).fit(X)
    inertias.append(kmeans.inertia_)

x = list(range(1, 7))
y = inertias
plt.plot(x, y)
plt.xlabel("k")
plt.ylabel("inertia")
plt.show()

# Inertia always decreases with k, so the curve is convex and decreasing;
# KneeLocator picks the elbow (here, 3) programmatically
optimal = KneeLocator(x=x, y=y, curve="convex", direction="decreasing")
print(f"The optimal number of clusters is {optimal.knee} "
      f"with an inertia of {optimal.knee_y}")

Answered by Julio Jesus Luna on November 1, 2020
