Grouping already clustered data (with a pre-defined x and y)

Question

I have an already clustered data set (I want to keep my x and y), where there's clearly a small group of elements in the middle that don't follow the expected patterns.

I can select them manually, but I wonder if there's an efficient way to automate the selection of these elements.

I'm thinking of something like using just the grouping part of a clustering algorithm. I've been trying it with a threshold, but it doesn't produce good results when the points don't form a circular cluster.

daco · Answer

It would be helpful to know which clustering technique you are using.

You can use

Partition-based clustering: for example k-means, which does not handle outliers well.
Hierarchical clustering: produces a tree of clusters (agglomerative or divisive), which you can view as a dendrogram.
Density-based clustering: produces arbitrarily shaped clusters, for example DBSCAN.

If you are looking for something other than a circular cluster and you need clusters within clusters, I would try DBSCAN. It locates regions of high density, separates outliers from them, and can find clusters within clusters.

If you are using Python, you can use DBSCAN with sklearn:

from sklearn.cluster import DBSCAN
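
As a rough sketch of how that import might be used: the data below is synthetic (a dense blob plus a few scattered points standing in for your x and y), and the eps and min_samples values are illustrative guesses that would need tuning on real data. DBSCAN labels points it cannot attach to any dense region as -1, so the selection becomes a simple mask.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Synthetic stand-in for the real x/y data (assumption): one dense blob
# plus three isolated points far away from it.
rng = np.random.default_rng(0)
dense = rng.normal(loc=0.0, scale=0.3, size=(200, 2))
scattered = np.array([[5.0, 5.0], [-5.0, 4.0], [4.0, -5.0]])
X = np.vstack([dense, scattered])

# eps (neighbourhood radius) and min_samples are illustrative values only.
labels = DBSCAN(eps=0.5, min_samples=5).fit_predict(X)

# Points labelled -1 were not assigned to any dense region.
noise_mask = labels == -1
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)

print("clusters found:", n_clusters)
print("points flagged as noise:", int(noise_mask.sum()))
```

If the odd group in your data is itself dense, it should come out as its own cluster label rather than as noise, and you can select it with labels == k for the relevant k.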

I hope that helps!

Sean Owen · Answer

You have it right: you want your clustering to tell you which points are most anomalous. For k-means clustering, those are the points farthest from the centroid of their assigned cluster.
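
A minimal sketch of that idea, assuming scikit-learn and made-up data (two tight groups and one point sitting between them): compute each point's distance to its own centroid and rank by it.

```python
import numpy as np
from sklearn.cluster import KMeans

# Made-up data (assumption): two tight groups and one point between them.
rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal(loc=(0.0, 0.0), scale=0.3, size=(100, 2)),
    rng.normal(loc=(6.0, 6.0), scale=0.3, size=(100, 2)),
    [[3.0, 3.0]],  # index 200, the odd one out
])

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

# Distance from every point to the centroid of the cluster it was assigned to.
dist = np.linalg.norm(X - km.cluster_centers_[km.labels_], axis=1)

# Indices ordered from most to least anomalous.
ranking = np.argsort(dist)[::-1]
print("most anomalous point index:", int(ranking[0]))
```

Where to cut the ranking off is a separate choice, for example a fixed distance or a percentile of dist.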

I don't see a reason to expect that the anomalies form a cluster themselves. If that's what you're expecting, you may need to compute something else, such as a second clustering of only the points beyond a threshold.

Also consider a Gaussian mixture clustering, which is much like k-means except that it treats cluster assignments as soft and probabilistic. The outliers under that model might make more sense.
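
A hedged sketch of the mixture-model version, again on made-up data and assuming scikit-learn's GaussianMixture: each point gets a soft assignment to every component, and points with low density under the fitted model are the candidates for outliers. The 1% cutoff is an arbitrary illustrative choice.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Made-up data (assumption): two tight groups and one point between them.
rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal(loc=(0.0, 0.0), scale=0.3, size=(100, 2)),
    rng.normal(loc=(6.0, 6.0), scale=0.3, size=(100, 2)),
    [[3.0, 3.0]],  # index 200, the odd one out
])

gmm = GaussianMixture(n_components=2, random_state=0).fit(X)

# Soft assignments: one probability per component, each row sums to 1.
responsibilities = gmm.predict_proba(X)

# Log-density of each point under the fitted mixture; low means unusual.
log_density = gmm.score_samples(X)

# Flag the lowest 1% as outlier candidates (arbitrary cutoff).
threshold = np.percentile(log_density, 1)
outliers = np.where(log_density <= threshold)[0]
print("outlier candidates:", outliers)
```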
