TransWikia.com

Clustering and getting threshold to classify data points

Data Science Asked by Meher Béjaoui on July 7, 2021

I have a real-world data set (120 data points) about enterprises, with 4 features each. I need to put these enterprises into exactly 4 categories based on the values of these features (an official requirement). I could either find an equation with the features as parameters, or find 3 threshold values for each feature that divide my data into 4 distinct categories. There are no other inputs. I tried KMeans, but I would like more insight.

How do I determine which method to use, and how do I compute the thresholds? Thank you.

My data looks like this (simplified):

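| Enterprise | Number of employees | Income (currency unit) | Expenditure (currency unit) | Investments (currency unit) |
|------------|---------------------|------------------------|-----------------------------|-----------------------------|
| First      | 1200                | 120                    | 110                         | 20                          |
| Second     | 5                   | 60                     | 70                          | 30                          |
| Last       | 125                 | 50                     | 55                          | 70                          |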

One Answer

There are several options:

  1. Hand-picked rules - Given domain expertise, manually choose the threshold values to create the four clusters.

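  2. Machine learning - Use a clustering algorithm that accepts the number of clusters and set it to four (e.g., k-means, Gaussian mixture model, spectral clustering). DBSCAN is a poorer fit here because it infers the number of clusters from its density parameters rather than taking it as an input. This has the advantage of learning the cluster boundaries from the data instead of setting thresholds by hand.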
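A minimal sketch of the second option, assuming scikit-learn is available. The data is a synthetic stand-in for the 120 enterprises, not the asker's data. The log transform and the decision tree are my own assumptions, not part of the answer. The tree is fitted to the cluster labels only to read off approximate per-feature thresholds in the original units.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier, export_text

# Synthetic stand-in for the 120 enterprises (NOT the real data).
rng = np.random.default_rng(0)
feature_names = ["employees", "income", "expenditure", "investments"]
X = rng.lognormal(mean=4.0, sigma=1.0, size=(120, 4))

# Assumption: log-transform and standardize so that no single
# feature dominates the Euclidean distances used by k-means.
X_scaled = StandardScaler().fit_transform(np.log1p(X))

# Ask for exactly four clusters, as the requirement demands.
kmeans = KMeans(n_clusters=4, n_init=10, random_state=0)
labels = kmeans.fit_predict(X_scaled)

# Assumption: a shallow tree trained to reproduce the cluster labels
# gives human-readable thresholds on the untransformed features.
tree = DecisionTreeClassifier(max_depth=3, random_state=0)
tree.fit(X, labels)

print(export_text(tree, feature_names=feature_names))
print("share of cluster labels reproduced by the tree:", tree.score(X, labels))
```

The tree only approximates the clustering. If the reproduced share is low, the clusters are not well described by simple per-feature cutoffs, and an equation over several features may be the more honest description.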

Choosing the best clustering result can be tricky since there are no external labels. It sounds like there are business requirements for the solution. Thus, business metrics should be used to evaluate the solution.
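What counts as a business metric is not specified here, so the following is only a sketch under an assumption: that a reviewer would at least want to see the size of each category and the typical feature values inside it. The rows come from the simplified table in the question. The category labels `A` and `B` are hypothetical, as if produced by either option above.

```python
from collections import defaultdict
from statistics import median

# Rows from the question's simplified table:
# (name, employees, income, expenditure, investments)
rows = [
    ("First", 1200, 120, 110, 20),
    ("Second", 5, 60, 70, 30),
    ("Last", 125, 50, 55, 70),
]

# Hypothetical category labels (not from the original post).
labels = {"First": "A", "Second": "B", "Last": "B"}

features = ["employees", "income", "expenditure", "investments"]


def summarize(rows, labels):
    """Return {category: {"size": n, feature: median value, ...}}."""
    groups = defaultdict(list)
    for name, *values in rows:
        groups[labels[name]].append(values)

    summary = {}
    for category, members in groups.items():
        stats = {"size": len(members)}
        for i, feature in enumerate(features):
            stats[feature] = median(m[i] for m in members)
        summary[category] = stats
    return summary


summary = summarize(rows, labels)
for category, stats in sorted(summary.items()):
    print(category, stats)
```

Running the same summary for each candidate clustering puts the results side by side in terms a domain expert can judge against the official requirement.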

Answered by Brian Spiering on July 7, 2021
