Data Science Asked by lte__ on March 5, 2021
I’m facing an issue where I have a massive amount of data that I need to cluster. As we know, clustering algorithms can have very high time complexity, and I’m looking for ways to reduce my algorithm's running time.
I want to try a few different approaches, such as pre-clustering (canopy clustering), subspace clustering, correlation clustering, etc.
However, there's one approach I haven't heard about, and I wonder why: is it viable to simply take a representative sample from my dataset, run the clustering on that, and generalize the resulting model to the whole dataset? Why is this, or isn't this, a viable approach? Thank you!
I would get a sufficiently large random/representative sample and cluster that.
To check whether a sample is representative, draw two such samples and cluster each of them to get cluster solutions c1 and c2. If the matching clusters of c1 and c2 have the same model parameters, then you probably have representative samples.
You can match the clusters by looking at how c1 and c2 assign newly drawn data points to clusters.
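For illustration, here is a minimal sketch of that check in Python, assuming scikit-learn's KMeans and a synthetic stand-in dataset; the sample sizes, k, and variable names are my own choices, not part of the original answer:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from scipy.optimize import linear_sum_assignment

rng = np.random.default_rng(0)

# Synthetic stand-in for the "massive" dataset (assumption for the sketch).
X, _ = make_blobs(n_samples=200_000, centers=5, random_state=0)
n, k, m = X.shape[0], 5, 10_000  # k clusters, sample size m

# Draw two independent random samples and cluster each one.
i1 = rng.choice(n, size=m, replace=False)
i2 = rng.choice(n, size=m, replace=False)
c1 = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X[i1])
c2 = KMeans(n_clusters=k, n_init=10, random_state=1).fit(X[i2])

# Match c2's clusters to c1's by how both assign a common set of
# newly drawn probe points, then compare the matched centroids.
probe = X[rng.choice(n, size=m, replace=False)]
a1, a2 = c1.predict(probe), c2.predict(probe)
overlap = np.zeros((k, k))
for i, j in zip(a1, a2):
    overlap[i, j] += 1
row, col = linear_sum_assignment(-overlap)  # maximize assignment agreement

drift = np.linalg.norm(c1.cluster_centers_[row] - c2.cluster_centers_[col], axis=1)
print("centroid drift per matched cluster:", drift)

# If the drift is small relative to the cluster scale, fit on one sample
# and generalize: assign every point in the full dataset to a cluster.
labels_full = c1.predict(X)
```

If the per-cluster drift is small, fitting on one sample and calling predict on the full dataset is exactly the "generalize this model to the whole dataset" step the question asks about.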
Correct answer by Suren on March 5, 2021
It's definitely viable; the catch is that there's a catch-22.
In order to get a representative sample from your dataset, you have to sample from every cluster. But if you can already sample from every cluster, you already know the clusters, and hence you don't need unsupervised learning.
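As a back-of-the-envelope sketch of that catch (the sizes here are hypothetical): a uniform sample of size m from n points contains on average m·c/n points from a cluster of size c, so small clusters can be missed entirely.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, c = 1_000_000, 1_000, 100  # dataset size, sample size, rare-cluster size

# Mark the c points belonging to a rare cluster, then sample uniformly.
is_rare = np.zeros(n, dtype=bool)
is_rare[:c] = True
hits = is_rare[rng.choice(n, size=m, replace=False)].sum()

print(f"expected rare points in sample: {m * c / n:.2f}, drawn: {hits}")
# With m*c/n = 0.1, most samples contain no rare-cluster points at all,
# so that cluster cannot be recovered from the sample.
```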
Answered by Noah Weber on March 5, 2021