Asked on July 3, 2021
I have a 400 GB dataset that I want to train a model on. What is the cheapest way to train it? I have a few options in mind so far.
I have assumed XGBoost is the best package since the data is tabular, but if another gradient-boosted tree package would handle this better, that would be acceptable as well.
Any help would be greatly appreciated.
I don't know many boosting packages, but I've been using XGBoost for a while now, and the biggest tabular dataset I've worked with was more than 40 times smaller than yours; training took 2-3 days.
In my experience, training time scales worse than linearly with the size of the data, although it depends heavily on the data itself and the hyperparameters you choose. My guess is that your training would be prohibitively long.
If you really want to use XGBoost, you should train on a GPU. Since you seem to be looking at cloud providers: I know Google offers managed XGBoost training on GPUs, and other providers surely do as well.
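As a rough illustration, here is a minimal sketch of GPU training with XGBoost's native API. The file name, label column, and hyperparameters are placeholders (not from the question), and `device="cuda"` assumes XGBoost 2.0 or later; for data that doesn't fit in memory, XGBoost also has an external-memory mode, which is a separate topic.

```python
# Minimal sketch of GPU-accelerated XGBoost training.
# "train_sample.parquet" and the "target" column are hypothetical placeholders.
import pandas as pd
import xgboost as xgb

df = pd.read_parquet("train_sample.parquet")  # a sample that fits in RAM
X, y = df.drop(columns=["target"]), df["target"]

dtrain = xgb.DMatrix(X, label=y)
params = {
    "objective": "binary:logistic",
    "tree_method": "hist",  # histogram algorithm, pairs with device="cuda"
    "device": "cuda",       # run training on the GPU (XGBoost >= 2.0)
    "max_depth": 6,
    "eta": 0.1,
}
booster = xgb.train(params, dtrain, num_boost_round=500)
```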
With that amount of data, I think you should consider deep learning. You could try TabNet, a strong model developed for tabular data by Google AI; it is easy to try with PyTorch, for instance. A sketch follows below.
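For example, here is a minimal sketch using the pytorch-tabnet package (dreamquark-ai/tabnet). The random arrays stand in for your real features and labels, and the settings are illustrative only:

```python
# Minimal TabNet sketch with placeholder data standing in for a real dataset.
import numpy as np
from pytorch_tabnet.tab_model import TabNetClassifier

rng = np.random.default_rng(0)
X_train = rng.normal(size=(10_000, 20)).astype(np.float32)
y_train = rng.integers(0, 2, size=10_000)
X_valid = rng.normal(size=(2_000, 20)).astype(np.float32)
y_valid = rng.integers(0, 2, size=2_000)

clf = TabNetClassifier(device_name="auto")  # uses the GPU when available
clf.fit(
    X_train, y_train,
    eval_set=[(X_valid, y_valid)],  # early stopping monitors this set
    max_epochs=50,
    batch_size=1024,
)
preds = clf.predict(X_valid)
```

Because it trains in mini-batches, a network like this can also stream data from disk rather than loading all 400 GB at once, which is part of why deep learning becomes attractive at that scale.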
Answered by LouisB on July 3, 2021