What is the most effective unsupervised ML algorithm to use when outliers are present in a data set?

Question

I am analyzing a portfolio of about 225 stocks and have collected three attributes for each of them: "Price/Earnings ratio", "Return on Assets", and "Earnings per share growth". I would like to cluster these stocks into 3 or 4 groups based on those attributes. However, there are substantial outliers in the data set, and I would like to keep them in rather than remove them. Which ML algorithm would be best suited for this? I have been told that k-means would not work well, since the outliers would skew the centroids of the clusters they fall into. Any and all thoughts welcome!

xChesster · Answer

You could try a hierarchical (top-down) clustering approach. For example, first partition the data points into K clusters. Then, within each of those K clusters, cluster again using only that cluster's points, splitting it into as many sub-clusters as needed to refine the result.
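A minimal sketch of that two-stage idea, using only the Python standard library. The answer does not say which base clusterer to use, so the plain k-means here is an assumption, as are the function names `kmeans` and `two_stage`, the farthest-point initialization, and the made-up toy points. Since k-means is itself sensitive to outliers, a different base clusterer could be substituted at either stage.

```python
import math

def kmeans(points, k, iters=50):
    """Plain Lloyd's k-means; returns a cluster index for each point."""
    # Deterministic farthest-point initialization (an illustrative choice).
    centers = [points[0]]
    while len(centers) < k:
        centers.append(
            max(points, key=lambda p: min(math.dist(p, c) for c in centers))
        )
    for _ in range(iters):
        # Assign every point to its nearest center.
        labels = [
            min(range(k), key=lambda c: math.dist(p, centers[c])) for p in points
        ]
        # Move each center to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old center if a cluster went empty
                centers[c] = tuple(sum(x) / len(members) for x in zip(*members))
    return labels

def two_stage(points, k_top, k_sub):
    """Cluster into k_top groups, then split each group into up to k_sub subgroups."""
    top = kmeans(points, k_top)
    result = [None] * len(points)
    for c in range(k_top):
        idx = [i for i, lab in enumerate(top) if lab == c]
        if not idx:
            continue
        # Re-cluster using only the points that fell into top-level cluster c.
        sub = kmeans([points[i] for i in idx], min(k_sub, len(idx)))
        for i, s in zip(idx, sub):
            result[i] = (c, s)  # (top-level cluster, sub-cluster within it)
    return result

# Toy, made-up 2-D data: two distant groups, each with two subgroups.
points = [(0, 0), (0, 1), (3, 0), (3, 1),
          (20, 20), (20, 21), (23, 20), (23, 21)]
result = two_stage(points, k_top=2, k_sub=2)
print(result)
```

Each point ends up with a pair of labels, so the coarse grouping and the refinement can be inspected separately.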

Answered by xChesster on December 4, 2020

bapowell · Answer

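DBSCAN is a density-based clustering method designed for data that contains noise. The user sets a neighborhood radius and the minimum number of points that must fall within it for a region to count as dense; ideally both can be informed by the problem. Points that do not belong to any sufficiently dense region are labeled as noise instead of being forced into a cluster.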
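To make the mechanics concrete, here is a small from-scratch sketch of DBSCAN in plain Python. It is an illustration under assumptions, not a tuned implementation: the function name `dbscan`, the toy points, and the chosen `eps` and `min_samples` values are all made up for the example. In practice scikit-learn's `sklearn.cluster.DBSCAN` exposes the same two parameters, `eps` and `min_samples`. Because the method is distance-based, attributes on very different scales (such as the three in the question) would normally be standardized first.

```python
import math

def dbscan(points, eps, min_samples):
    """Label each point with a cluster id (0, 1, ...) or -1 for noise."""
    n = len(points)
    # Neighborhoods include the point itself, the usual convention.
    neighbors = [
        [j for j in range(n) if math.dist(points[i], points[j]) <= eps]
        for i in range(n)
    ]
    labels = [None] * n  # None means "not visited yet"
    cluster_id = -1
    for i in range(n):
        if labels[i] is not None:
            continue
        if len(neighbors[i]) < min_samples:
            labels[i] = -1  # provisional noise; may become a border point later
            continue
        # i is a core point: start a new cluster and grow it outward.
        cluster_id += 1
        labels[i] = cluster_id
        queue = list(neighbors[i])
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster_id  # border point reached from a core point
            if labels[j] is not None:
                continue
            labels[j] = cluster_id
            if len(neighbors[j]) >= min_samples:
                queue.extend(neighbors[j])  # j is also core: keep expanding
    return labels

# Toy, made-up 2-D data: two tight groups plus one extreme outlier.
points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1),
          (5.0, 5.0), (5.1, 5.0), (5.0, 5.1), (5.1, 5.1),
          (20.0, -10.0)]
labels = dbscan(points, eps=0.5, min_samples=3)
print(labels)  # [0, 0, 0, 0, 1, 1, 1, 1, -1]
```

The outlier stays in the data set but receives the label -1, so it does not pull any cluster toward it. Note that the number of clusters is a result of `eps` and `min_samples` rather than an input, so getting exactly 3 or 4 groups may take some tuning.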

Answered by bapowell on December 4, 2020
