Data Science Asked on November 15, 2020
I have a dataset of 300,000 rows and an ensemble model, which includes a grid search to find the best parameters for every algorithm. Unfortunately, the grid search takes too long, and I am having trouble getting the different algorithms (XGBoost, LightGBM, ...) to run on the GPU. The scikit-learn models such as random forest also don't run on the GPU.
My idea is now, instead of using all 300,000 rows, to create a small dataset of at most 500 rows, which would take much less time than the full dataset.
Could a minimum required sample size calculation help here?
How big should the sample be to preserve the data distribution of the big dataset?
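Something like this is what I have in mind (a minimal sketch assuming a classification target, so the class proportions can be kept with stratified sampling; X and y are placeholders for my actual data):

    from sklearn.datasets import make_classification
    from sklearn.model_selection import train_test_split

    # Stand-in for the real data: 300,000 rows with a classification target.
    X, y = make_classification(n_samples=300_000, n_features=20, random_state=0)

    # Draw ~500 rows while keeping the class proportions of the full dataset.
    X_small, _, y_small, _ = train_test_split(
        X, y, train_size=500, stratify=y, random_state=0
    )

    print(X_small.shape)  # (500, 20)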
You could also use scikit-learn's RandomizedSearchCV, which is a bit less exhaustive than grid search since it won't test all combinations, but it can still find one of the best combinations. You then finish the tuning manually by slightly tweaking the parameters around what the randomized search gave you. I think this would be more profitable than downsampling your dataset to make the grid search feasible, since in that case you would be applying the most exhaustive tuning method to a far less representative dataset.
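For instance, a minimal sketch (the estimator and the parameter ranges below are only illustrative placeholders, not recommended values):

    from scipy.stats import randint
    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import RandomizedSearchCV

    # Stand-in data; replace with your own X, y.
    X, y = make_classification(n_samples=5_000, n_features=20, random_state=0)

    param_distributions = {
        "n_estimators": randint(100, 500),
        "max_depth": randint(3, 20),
        "min_samples_leaf": randint(1, 10),
    }

    search = RandomizedSearchCV(
        RandomForestClassifier(random_state=0),
        param_distributions=param_distributions,
        n_iter=20,   # evaluates 20 sampled combinations instead of the full grid
        cv=3,
        n_jobs=-1,
        random_state=0,
    )
    search.fit(X, y)
    print(search.best_params_)

You can then narrow the ranges around search.best_params_ and run a small, focused grid search to finish the tuning.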
Answered by BeamsAdept on November 15, 2020
This idea of using a small sample of the dataset to search for the hyperparameters is known as a multi-fidelity method. A good starting point is the book Automated Machine Learning: Methods, Systems, Challenges by Frank Hutter, Lars Kotthoff, and Joaquin Vanschoren, which is open access.
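For instance, scikit-learn ships one multi-fidelity method, successive halving, as HalvingRandomSearchCV (still behind an experimental import): early rounds evaluate many candidates on small subsamples, and only the best ones are re-evaluated on more data. A minimal sketch with illustrative parameter ranges:

    from scipy.stats import randint
    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.experimental import enable_halving_search_cv  # noqa: F401
    from sklearn.model_selection import HalvingRandomSearchCV

    # Stand-in data; replace with your own X, y.
    X, y = make_classification(n_samples=50_000, n_features=20, random_state=0)

    search = HalvingRandomSearchCV(
        RandomForestClassifier(random_state=0),
        param_distributions={
            "n_estimators": randint(100, 500),
            "max_depth": randint(3, 20),
        },
        resource="n_samples",  # the "fidelity" is the number of training rows
        min_resources=500,     # every candidate starts on roughly 500 rows
        factor=3,              # only the best third of candidates survives each round
        cv=3,
        n_jobs=-1,
        random_state=0,
    )
    search.fit(X, y)
    print(search.best_params_)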
Answered by Jacques Wainer on November 15, 2020