Data Science Asked on July 3, 2021
I have a 400 GB data set that I want to train a model on. What is the cheapest method to train this model? I have a few options in mind so far, but nothing settled.
I have just assumed XGBoost is the best package since it's tabular data, but if another gradient-boosted tree package would handle data at this scale better, that would be acceptable as well.
Any help would be greatly appreciated.
I don't know many boosting packages, but I've been using XGBoost for a while now, and the biggest tabular dataset I've trained on was more than 40 times smaller than yours. That training took 2-3 days.
In my experience, training time scales worse than linearly with the size of the data, although it depends heavily on the data itself and the hyperparameters you choose. My guess is that your training would take very (too) long.
If you really want to use XGBoost, you should train on a GPU. Since it seems you are looking at cloud providers: I know Google offers managed training of XGBoost on GPU, and others surely do as well.
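To give a feel for what GPU training looks like, here is a minimal sketch with the xgboost Python package. The file path and column name are placeholders, and for the full 400 GB you would need to stream chunks or use XGBoost's external-memory support rather than loading everything with pandas; training on a sample first is a cheap way to gauge feasibility.

```python
import pandas as pd
import xgboost as xgb

# Hypothetical path and target column; substitute your own.
# Loading a manageable sample, not the full 400 GB.
df = pd.read_csv("train_sample.csv")
X, y = df.drop(columns=["target"]), df["target"]

dtrain = xgb.DMatrix(X, label=y)

params = {
    "objective": "binary:logistic",
    "tree_method": "hist",  # histogram-based split finding
    "device": "cuda",       # GPU training (XGBoost >= 2.0; older
                            # versions use tree_method="gpu_hist")
    "max_depth": 6,
    "eta": 0.1,
}

booster = xgb.train(params, dtrain, num_boost_round=500)
booster.save_model("model.json")
```

Timing a run like this on a sample and extrapolating (remembering the worse-than-linear scaling above) should tell you quickly whether full-scale XGBoost training is affordable at all.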
With that amount of data, I think you should also consider deep learning. You could try TabNet, a model developed for tabular data by Google AI; it is easy to try using PyTorch, for instance.
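As a sketch of how easy it is to try, the pytorch-tabnet package wraps TabNet behind a scikit-learn-style interface. The arrays below are synthetic stand-ins for your real data:

```python
import numpy as np
from pytorch_tabnet.tab_model import TabNetClassifier  # pip install pytorch-tabnet

# Synthetic stand-in data; replace with (a sample of) your real arrays.
rng = np.random.default_rng(0)
X_train = rng.normal(size=(10_000, 20)).astype(np.float32)
y_train = rng.integers(0, 2, size=10_000)
X_valid = rng.normal(size=(2_000, 20)).astype(np.float32)
y_valid = rng.integers(0, 2, size=2_000)

clf = TabNetClassifier()  # library defaults are a reasonable starting point
clf.fit(
    X_train, y_train,
    eval_set=[(X_valid, y_valid)],
    max_epochs=50,
    batch_size=1024,
    patience=10,  # early stopping on the validation set
)
preds = clf.predict(X_valid)
```

A practical advantage of a neural model here is that it trains by mini-batches, so the full dataset never has to fit in memory at once, which matters at 400 GB.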
Answered by LouisB on July 3, 2021