
Algorithm for deriving multiple clusters

Data Science Asked by girl101 on April 26, 2021

Suppose I have a set of data (with a 2-dimensional feature space) and I want to obtain clusters from it, but I do not know how many clusters will be formed.

Yet I want separate clusters (the number of clusters is more than 2).

I figured that k-means or k-medoids cannot be used in this case, nor can I use hierarchical clustering. Since there is no training set, I cannot use a KNN classifier or any other supervised method. I also cannot use the OPTICS algorithm, as I do not want to specify the radius (I don't know the radius).

Is there any machine learning technique that would give me multiple clusters (distance-based clustering) and that also deals well with outlier points?

This should be the output:

[image: example of the desired output, showing several distinct clusters]

4 Answers

I don't think that EM clustering algorithms like k-means and Gaussian mixture models are quite what you're looking for. There are definitely other algorithms that don't require you to pick the number of clusters. My personal favorite (most of the time) is mean-shift clustering. You can find a great little blog post about it online, and it has a good implementation in Python's scikit-learn library.
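
For concreteness, here is a minimal sketch of mean-shift with scikit-learn; the toy blob data and the bandwidth quantile are illustrative assumptions, not values from the question:

    # Mean-shift clustering: no cluster count required up front.
    import numpy as np
    from sklearn.cluster import MeanShift, estimate_bandwidth
    from sklearn.datasets import make_blobs

    # Toy 2-D data standing in for the asker's feature space (assumption).
    X, _ = make_blobs(n_samples=300, centers=4, random_state=0)

    # Estimate a kernel bandwidth from the data itself.
    bandwidth = estimate_bandwidth(X, quantile=0.2)
    labels = MeanShift(bandwidth=bandwidth, bin_seeding=True).fit_predict(X)

    print("clusters found:", len(np.unique(labels)))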

Correct answer by Jordan A on April 26, 2021

The fact is that you could use any of the algorithms you mentioned, and in general any algorithm that requires you to set the number of clusters as a parameter (or any other parameter that indirectly determines the final number of clusters, like the threshold in a hierarchical clustering algorithm).

The solution to your problem is model selection. Model selection methods evaluate different clustering solutions, and select the one that optimizes a given criterion.

For instance, in the case of K-means, you could fit a solution for a range of k values and keep the one that maximizes a cluster validation measure (see the Wikipedia entry on cluster analysis for examples of such measures), as in the sketch below.
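
A minimal sketch of that search, using the silhouette score as the validation measure (the score choice, the range of k, and the toy data are assumptions):

    # Select k for K-means by maximizing the silhouette score.
    from sklearn.cluster import KMeans
    from sklearn.datasets import make_blobs
    from sklearn.metrics import silhouette_score

    X, _ = make_blobs(n_samples=300, centers=4, random_state=0)

    best_k, best_score = None, -1.0
    for k in range(2, 11):
        labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
        score = silhouette_score(X, labels)
        if score > best_score:
            best_k, best_score = k, score

    print("selected k:", best_k)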

There are also more complex, automatic approaches (one specific example is "Automatic Cluster Number Selection Using a Split and Merge K-Means Approach" by Muhr, M. and Granitzer, M., but it is just one example). These methods use cluster validation measures to split or merge clusters automatically, but the underlying idea is the same.

Answered by Pablo Suau on April 26, 2021

If the data are suitable, you can use Gaussian mixture modelling, fit via the EM algorithm, to estimate several separate Gaussian clusters. To determine the number of clusters, you can use something like the BIC (or another penalized-likelihood criterion) to penalize the number of parameters you are estimating. Then simply search over different numbers of clusters and choose the number with the lowest BIC. This is a form of model-based clustering.

You should be able to use the mclust package in R to do this: mclust: Gaussian Mixture Modelling for Model-Based Clustering, Classification, and Density Estimation.
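
If you are working in Python instead, here is a rough equivalent using scikit-learn's GaussianMixture, whose bic() method implements the criterion above (the component range and toy data are assumptions):

    # Model-based clustering: pick the number of Gaussian components by BIC.
    from sklearn.datasets import make_blobs
    from sklearn.mixture import GaussianMixture

    X, _ = make_blobs(n_samples=300, centers=4, random_state=0)

    # Fit mixtures with 1..10 components and keep the lowest-BIC model.
    models = [GaussianMixture(n_components=k, random_state=0).fit(X)
              for k in range(1, 11)]
    best = min(models, key=lambda m: m.bic(X))

    print("components chosen by BIC:", best.n_components)
    labels = best.predict(X)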

Answered by Josh W. on April 26, 2021

The radius in OPTICS is a maximum value, and it can be set to infinity! So you don't need to know it, and you should give OPTICS as well as DBSCAN a try. There are heuristics for choosing their parameters if you know your data.

Similarly, try hierarchical clustering. There are good heuristics for extracting flat partitions from it.

You want something that handles noise well - this calls for DBSCAN, OPTICS and HAC.
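
As a minimal sketch: scikit-learn's OPTICS uses max_eps=np.inf by default, so no radius has to be known in advance, and points it cannot assign to any cluster get the noise label -1 (min_samples and the toy data here are illustrative choices):

    # OPTICS with an unbounded radius; outliers are labelled -1.
    import numpy as np
    from sklearn.cluster import OPTICS
    from sklearn.datasets import make_blobs

    X, _ = make_blobs(n_samples=300, centers=4, random_state=0)

    # max_eps=np.inf is the default, so no radius is needed up front.
    labels = OPTICS(min_samples=10, max_eps=np.inf).fit_predict(X)

    n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
    print("clusters:", n_clusters, "| noise points:", int(np.sum(labels == -1)))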

Answered by Has QUIT--Anony-Mousse on April 26, 2021
