Clustering and getting threshold to classify data points

Question

I have a real-world data set (120 data points) about enterprises, containing 4 features. I would like to put these enterprises into exactly 4 categories based on the values of these features (an official requirement). I can either find an equation with the features as parameters, or find 3 threshold values for each feature that would divide my data into 4 distinct categories. There are no other inputs. I tried KMeans, but I would like more insight.
How do I determine which method to use, and how do I compute the thresholds? Thank you.
My data looks like this (simplified):

| Enterprise | Number of employees | Income (currency unit) | Expenditure (currency unit) | Investments (currency unit) |
|---|---|---|---|---|
| First | 1200 | 120 | 110 | 20 |
| Second | 5 | 60 | 70 | 30 |
| ... | ... | ... | ... | ... |
| Last | 125 | 50 | 55 | 70 |

Brian Spiering · Answer

There are several options:

Hand-picked rules - Given domain expertise, manually choose the threshold values to create the four clusters.

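Machine learning - Set the number of clusters to four, then use a clustering algorithm that accepts a cluster count (e.g., k-means, Gaussian mixture model, spectral clustering). Density-based methods such as DBSCAN infer the number of clusters from the data, so they cannot guarantee exactly four. Clustering has the advantage of learning the category boundaries from the data instead of fixing them by hand.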
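
As a rough illustration of the machine learning option, here is a minimal sketch using scikit-learn. It clusters with k-means and then fits a shallow decision tree to the cluster labels so that the result can be read as per-feature thresholds. The tree step is my own addition for getting readable cut-offs, not something k-means produces by itself. The data is synthetic (a random stand-in, not the asker's 120 enterprises), and the short feature names, `max_depth=3`, and the random seeds are all assumptions to adjust.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier, export_text

# Synthetic stand-in for the 120 enterprises (NOT real data).
rng = np.random.default_rng(0)
feature_names = ["employees", "income", "expenditure", "investments"]
X = rng.lognormal(mean=4.0, sigma=1.0, size=(120, 4))

# k-means is distance-based, so put head counts and currency
# amounts on a comparable scale before clustering.
X_scaled = StandardScaler().fit_transform(X)

# Exactly four clusters, as required.
kmeans = KMeans(n_clusters=4, n_init=10, random_state=0)
labels = kmeans.fit_predict(X_scaled)

# Approximate the clusters with a shallow tree fitted on the ORIGINAL
# units, so each split reads as "feature <= threshold".
tree = DecisionTreeClassifier(max_depth=3, random_state=0)
tree.fit(X, labels)

rules = export_text(tree, feature_names=feature_names)
print(rules)

# Share of points where the tree's rules match the k-means label.
agreement = tree.score(X, labels)
print(f"tree reproduces {agreement:.0%} of the cluster labels")
```

The tree is only an approximation of the clustering. If the agreement is low, the clusters are probably not separable by a few axis-aligned thresholds, and a deeper tree or a different method may be needed.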

Choosing the best clustering result can be tricky since there are no external labels. It sounds like there are business requirements for the solution. Thus, business metrics should be used to evaluate the solution.
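
One simple way to check a candidate grouping against business requirements is to look at each category's size and feature ranges. The sketch below does that in plain Python. The helper `summarize_clusters` is hypothetical, the rows are the three simplified examples from the question, and the labels are made up purely for illustration.

```python
from collections import defaultdict

def summarize_clusters(rows, labels, feature_names):
    """Return {label: {"size": n, feature: (min, mean, max), ...}}."""
    grouped = defaultdict(list)
    for row, label in zip(rows, labels):
        grouped[label].append(row)

    summary = {}
    for label, members in sorted(grouped.items()):
        stats = {"size": len(members)}
        for j, name in enumerate(feature_names):
            column = [member[j] for member in members]
            stats[name] = (min(column), sum(column) / len(column), max(column))
        summary[label] = stats
    return summary

feature_names = ["employees", "income", "expenditure", "investments"]
rows = [
    [1200, 120, 110, 20],  # "First" from the question
    [5, 60, 70, 30],       # "Second"
    [125, 50, 55, 70],     # "Last"
]
labels = [0, 1, 1]  # made-up cluster assignments, for illustration only

summary = summarize_clusters(rows, labels, feature_names)
for label, stats in summary.items():
    print(label, stats)
```

Categories that are very small, or whose ranges overlap heavily on every feature, are worth reviewing with whoever owns the official requirement.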
