
Training CNN on a huge data set

Asked by Omar Rayyan on September 5, 2021

I am trying to train an AlexNet image model on the RVL-CDIP Dataset. The dataset consists of 320,000 training images, 40,000 validation images, and 40,000 test images.

Since the dataset is huge, I started by training on a 500-images-per-class sample from the training set. The result is below:

[Figure: training and validation loss curves for the 500-per-class sample, batch size 32, 40 epochs]

We can see from the graph above that the validation loss started decreasing at a much slower rate around epoch 20, while the training loss kept decreasing at the same pace. Does this mean the model started overfitting the data? I assume this is probably because the training set I'm using is not large enough to get better results on the validation set. (The validation data is also a 500-images-per-class sample from the full validation set.)

Is it a correct approach to train the model on a small sample (e.g. 500 images per class), save the model, load the saved weights, and then train again with a larger sample (e.g. 1,000 images)? My intuition is that this way the model would see new data on every run, which would help it do better on the validation set. And if this approach is correct, should the training sample for the second run include some or all of the images that were used in the first run?
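For concreteness, the two-stage workflow described above might look roughly like the Keras sketch below; the array names (`x_small`, `x_large`, `x_val`) and the file name are placeholders, and `model` is assumed to be an already compiled AlexNet:

```python
# Rough sketch of the two-stage idea (placeholder names throughout).
# Assumes `model` is a compiled tf.keras AlexNet and the data are NumPy arrays.
import tensorflow as tf

# Stage 1: train on the small 500-images-per-class subset, then save.
model.fit(x_small, y_small,
          validation_data=(x_val, y_val),
          batch_size=32, epochs=40)
model.save("alexnet_stage1.h5")  # persist architecture + weights

# Stage 2 (a later run): reload and continue training on a larger subset.
model = tf.keras.models.load_model("alexnet_stage1.h5")
model.fit(x_large, y_large,
          validation_data=(x_val, y_val),
          batch_size=32, epochs=40)
```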

You can find the full code with results here

One Answer

This reminds me of this question: the training loss is decreasing faster than the validation loss. I understand there is some overfitting, since the model is learning some patterns that exist only in the training set, but it is still learning more general patterns as well, because the validation loss is also decreasing. To me it would be more of an issue if the validation loss increased, but that is not the case here.
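As an aside (not something the answer itself suggests), if the concern is the validation loss eventually turning upward, a common Keras pattern is to monitor it with an EarlyStopping callback:

```python
# Hedged aside: stop training once validation loss stalls, so extra epochs
# don't just overfit. Assumes a compiled Keras `model` and data as above.
import tensorflow as tf

early_stop = tf.keras.callbacks.EarlyStopping(
    monitor="val_loss",         # watch the validation loss
    patience=5,                 # allow 5 epochs without improvement
    restore_best_weights=True,  # roll back to the best epoch's weights
)

history = model.fit(x_train, y_train,
                    validation_data=(x_val, y_val),
                    batch_size=32, epochs=40,
                    callbacks=[early_stop])
```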

Edit

Usually neural networks are trained with all the data; training with mini-batch gradient descent already does what you describe in your approach, without the need to save and reload the model. So I would train with as much data as possible, to obtain a model with the least possible variance. If you are not feeding the data through generators and the whole dataset doesn't fit into memory, I recommend using them, or training on a sample that is as big as possible given your memory limitations.
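For example, feeding the data from disk in mini-batches with tf.keras could look roughly like this; the directory layout, image size, and batch size are assumptions, not taken from the question's code:

```python
# Minimal sketch: stream the full training set from disk in mini-batches
# instead of loading it into memory. Assumes images are arranged as
# data/train/<class_name>/*.png and data/val/<class_name>/*.png.
import tensorflow as tf

train_ds = tf.keras.utils.image_dataset_from_directory(
    "data/train", image_size=(227, 227), batch_size=32)
val_ds = tf.keras.utils.image_dataset_from_directory(
    "data/val", image_size=(227, 227), batch_size=32)

# Prefetch batches so the GPU doesn't wait on disk I/O.
train_ds = train_ds.prefetch(tf.data.AUTOTUNE)
val_ds = val_ds.prefetch(tf.data.AUTOTUNE)

# `model` is assumed to be a compiled tf.keras AlexNet.
model.fit(train_ds, validation_data=val_ds, epochs=40)
```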

Answered by David Masip on September 5, 2021
