What would be a good way to use clustering for outlier detection?

Data Science Asked by Solarbear on November 29, 2020

For simplicity let’s assume the feature space is the XY plane.

5 Answers

Perhaps you could cluster the items; the items farthest from the center of their own cluster would then be candidates for outliers.

Answered by Charlie Greenbacker on November 29, 2020

A clustering algorithm that is very robust against outliers is PFCM from Bezdek.

In the paper that introduces it, Bezdek proposes Possibilistic-Fuzzy-C-Means (PFCM), an improvement on earlier variants of fuzzy and possibilistic clustering. The algorithm is particularly good at detecting outliers and at preventing them from influencing the clustering. So with PFCM you can find which points are identified as outliers and, at the same time, get a very robust fuzzy clustering of your data.

Answered by Javierfdr on November 29, 2020

Gaussian mixture modeling can be used for outlier detection if your data is nicely Gaussian-like. Points with a low density under every component (cluster) are likely to be outliers.

This works well in idealized scenarios.
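A minimal sketch of the idea, assuming scikit-learn's `GaussianMixture` and made-up synthetic data. The component count, the three hand-placed far points, and the 2% cutoff are illustrative assumptions, not part of the original answer. It scores each point by its log-density under the fitted mixture, which is low exactly when the point has low density under every component:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)

# Assumed toy data: two Gaussian blobs in the XY plane plus three far-away points.
inliers = np.vstack([
    rng.normal([0.0, 0.0], 0.5, size=(100, 2)),
    rng.normal([5.0, 5.0], 0.5, size=(100, 2)),
])
far_points = np.array([[2.5, 2.5], [10.0, -3.0], [-4.0, 8.0]])
X = np.vstack([inliers, far_points])

# Fit a two-component mixture (the number of components is an assumption here).
gmm = GaussianMixture(n_components=2, random_state=0).fit(X)

# Log-density of every point under the fitted mixture.
log_density = gmm.score_samples(X)

# Flag the lowest-density 2% as outlier candidates (the cutoff is arbitrary).
threshold = np.percentile(log_density, 2)
outlier_idx = np.where(log_density < threshold)[0]
print(outlier_idx)
```

The cutoff is the part you would have to tune; a percentile is just one convenient choice.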

Answered by Has QUIT--Anony-Mousse on November 29, 2020

  1. Apply your clustering algorithm
  2. Calculate the distance from each data point to the center of its assigned cluster
  3. Label the data points furthest from their center as outliers

Randomly generating 100 data points from three Gaussians, clustering them with k-means, and marking the 10 data points 'furthest from a center' gave the following graph: [image]

See this notebook for the full example.

You will already have had to decide what "distance" means in order to run a clustering algorithm. It is still up to you to pick the distance beyond which a point counts as an outlier. In this example, I just picked the N most distant data points, though you'll probably want to flag every data point that lies more than a certain number of standard deviations from its center.
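The three steps above can be sketched roughly as follows. This is not the notebook's code; it assumes scikit-learn's `KMeans`, Euclidean distance, and arbitrary blob centers. The "standard deviations" rule is interpreted here as "more than `k` standard deviations above the mean distance within the point's own cluster", which is only one possible reading:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)

# Assumed toy data: 100 points drawn from three Gaussians in the XY plane.
centers = np.array([[0.0, 0.0], [5.0, 5.0], [0.0, 6.0]])
X = np.vstack([rng.normal(c, 1.0, size=(n, 2))
               for c, n in zip(centers, (34, 33, 33))])

# Step 1: apply the clustering algorithm.
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

# Step 2: distance from each point to the center of its assigned cluster.
dist = np.linalg.norm(X - km.cluster_centers_[km.labels_], axis=1)

# Step 3a: mark the N most distant points as outliers.
n_outliers = 10
top_n_idx = np.argsort(dist)[-n_outliers:]

# Step 3b (alternative): mark points whose distance exceeds the mean distance
# of their own cluster by more than k standard deviations.
k = 2.0
is_outlier = np.zeros(len(X), dtype=bool)
for label in range(km.n_clusters):
    mask = km.labels_ == label
    d = dist[mask]
    is_outlier[mask] = d > d.mean() + k * d.std()

print(top_n_idx)
print(np.where(is_outlier)[0])
```

The top-N rule always returns exactly N points, even if the data has no real outliers, while the threshold rule can return none at all, so the two will not necessarily agree.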

Answered by TheGrimmScientist on November 29, 2020

If your data points are dense and the noise points lie away from the dense regions, you can try the DBSCAN algorithm.

Tweak its parameters (the neighborhood radius and the minimum number of points per neighborhood) until you get a good fit.
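A small sketch of how this might look with scikit-learn's `DBSCAN`, on assumed synthetic data (two dense blobs plus three isolated points). The `eps` and `min_samples` values are guesses that suit this toy data, not recommended defaults. DBSCAN gives noise points the label `-1`, so those are the outlier candidates:

```python
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(0)

# Assumed toy data: two dense blobs in the XY plane plus three isolated points.
dense = np.vstack([
    rng.normal([0.0, 0.0], 0.3, size=(100, 2)),
    rng.normal([4.0, 4.0], 0.3, size=(100, 2)),
])
noise = np.array([[2.0, 2.0], [8.0, 0.0], [-3.0, 5.0]])
X = np.vstack([dense, noise])

# eps is the neighborhood radius; min_samples is the number of neighbors
# (including the point itself) required for a point to be a core point.
db = DBSCAN(eps=0.5, min_samples=5).fit(X)

# Points that belong to no cluster are labelled -1.
is_noise = db.labels_ == -1
print(np.where(is_noise)[0])
```

Points on the fringe of a dense blob can also come out as noise, so it is worth rerunning with a few different `eps` values and checking how stable the flagged set is.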

Answered by preems on November 29, 2020
