Data Science Asked by Meher Béjaoui on July 7, 2021
I have a real-world data set (120 data points) about enterprises, containing 4 features. I would like to put these enterprises into exactly 4 categories based on the values of these features (an official requirement). I could either find an equation with the features as parameters, or find 3 threshold values for each feature that would divide my data into 4 distinct categories. There are no other inputs. I tried KMeans, but I would like more insight.
How do I determine which method to use, and how do I compute the thresholds? Thank you.
My data looks like this (simplified):
Enterprise | Number of employees | Income (currency unit) | Expenditure (currency unit) | Investments (currency unit) |
---|---|---|---|---|
First | 1200 | 120 | 110 | 20 |
Second | 5 | 60 | 70 | 30 |
… | … | … | … | … |
Last | 125 | 50 | 55 | 70 |
There are several options:
Hand-picked rules - Given domain expertise, manually choose the threshold values to create the four clusters.
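Machine learning - Set the number of clusters to four, then use a clustering algorithm that accepts a fixed cluster count (e.g., k-means, Gaussian mixture model, agglomerative, spectral). DBSCAN is less suitable here because it chooses the number of clusters itself. This has the advantage of learning the group boundaries from the data, although those boundaries are not necessarily simple per-feature thresholds.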
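As a minimal sketch of the hand-picked option, three cut points on a single feature split it into four categories. The threshold values and category names below are hypothetical placeholders, not values from the question; in practice a domain expert or the official requirement would supply them.

```python
from bisect import bisect_right

# Hypothetical cut points for one feature (number of employees).
# Three thresholds split the range into four categories.
EMPLOYEE_THRESHOLDS = [10, 50, 250]
# Hypothetical category names, ordered from lowest to highest range.
CATEGORY_NAMES = ["micro", "small", "medium", "large"]


def categorize(value, thresholds=EMPLOYEE_THRESHOLDS, names=CATEGORY_NAMES):
    """Return the name of the category whose range contains `value`."""
    # bisect_right counts how many thresholds are <= value,
    # which is the index of the matching category.
    return names[bisect_right(thresholds, value)]


# Employee counts taken from the simplified table in the question.
employees = {"First": 1200, "Second": 5, "Last": 125}
categories = {name: categorize(count) for name, count in employees.items()}
print(categories)
```

With several features, the same function could be applied per feature, but the rule for combining the per-feature categories into one final category would still have to come from the business side.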
Choosing the best clustering result can be tricky since there are no external labels. It sounds like there are business requirements for the solution. Thus, business metrics should be used to evaluate the solution.
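One way to support that kind of review is to fit a four-cluster model and print a per-cluster profile that a domain expert can inspect. The sketch below assumes scikit-learn and NumPy are available and uses randomly generated data as a stand-in for the 120 enterprises; the feature names are shortened from the question, and standardizing the features first is an assumption, chosen because k-means is sensitive to scale.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the real data: 120 enterprises x 4 features.
# These numbers are random and only illustrate the workflow.
rng = np.random.default_rng(0)
X = rng.lognormal(mean=4.0, sigma=1.0, size=(120, 4))
feature_names = ["employees", "income", "expenditure", "investments"]

# Put all features on a comparable scale before clustering (assumption).
X_scaled = StandardScaler().fit_transform(X)

# Fix the number of clusters at four, as the requirement demands.
model = KMeans(n_clusters=4, n_init=10, random_state=0)
labels = model.fit_predict(X_scaled)

# Per-cluster profile in the original units, for review by domain experts.
for k in range(4):
    members = X[labels == k]
    medians = np.round(np.median(members, axis=0), 1)
    profile = dict(zip(feature_names, medians.tolist()))
    print(f"cluster {k}: size={len(members)}, median={profile}")
```

Whether the resulting groups are acceptable remains a business judgment; the profile only makes each candidate solution easier to compare.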
Answered by Brian Spiering on July 7, 2021