What to do when seed has a big impact on model performance?

Data Science: Asked by Louka Ewington-Pitsos on December 25, 2020

I have a training procedure set up for an image recognition task. Each time I train a model, I record training loss, validation loss, validation precision and validation recall.

Recently I switched from an EfficientNet to a ResNet-based model. Both models are pretrained, so weight initialization is deterministic. With the old model I ran 5 experiments (each with 5-fold cross-validation) using exactly the same parameters, varying only the seed, and got a standard deviation of around 0.001 for validation loss. With the new model I ran 3 experiments (also with 5 folds) using exactly the same parameters and got a standard deviation of 0.028 for validation loss.
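For concreteness, the spread I'm quoting is just the sample standard deviation of the per-run validation losses, along these lines (the loss values below are made-up placeholders, not my actual measurements):

```python
import numpy as np

# Validation loss per run (one run = one seed, averaged over its 5 folds).
# These numbers are illustrative placeholders, not the real measurements.
old_model_val_loss = np.array([0.215, 0.216, 0.214, 0.215, 0.217])  # 5 runs, old model
new_model_val_loss = np.array([0.190, 0.245, 0.221])                # 3 runs, new model

print("old model std:", old_model_val_loss.std(ddof=1))  # on the order of 0.001
print("new model std:", new_model_val_loss.std(ddof=1))  # on the order of 0.03
```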

This high variance makes it very hard to interpret the effects of changing non-seed parameters when running new experiments with the new model, since in many cases the differences in performance sit within one standard deviation. Experiments take multiple hours each, so it’s not feasible to run multiple experiments for each condition.

What does one do in cases like this?

2 Answers

It's just a simple idea that I often recommend for understanding what happens in cases like this:

You could try an ablation study in which you train the model with varying amounts of training data (e.g. 10%, 20%, ..., 100% of your full training set). Given that your problem is instability, it would be even better to repeat each size several times with different random subsets. Then you plot the performance as a function of the data size, for example with boxplots to visualize the variance. The evolution of the performance (and of its variance) shows whether there is enough training data for the complexity of the model: if there is, the curve increases and then stabilizes around its maximum; if not, the curve is still increasing when it reaches 100%, meaning that more data would be needed for the model to stabilize.
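A rough sketch of what that loop could look like (here train_and_evaluate is a stand-in for your own training and validation routine, and the subset sizes and repeat counts are only examples):

```python
import numpy as np
import matplotlib.pyplot as plt

def train_and_evaluate(subset_indices):
    """Stand-in for your own training + validation routine.
    Should return a validation score (or loss) for the given training subset."""
    raise NotImplementedError

rng = np.random.default_rng(0)
n_samples = 10_000                # size of the full training set (example value)
percentages = range(10, 101, 10)  # 10%, 20%, ..., 100%
n_repeats = 5                     # several random subsets per size

results = []
for pct in percentages:
    subset_size = n_samples * pct // 100
    scores = [train_and_evaluate(rng.choice(n_samples, size=subset_size, replace=False))
              for _ in range(n_repeats)]
    results.append(scores)

# Boxplots of the score at each training-set size, to visualize the variance.
plt.boxplot(results)
plt.xticks(range(1, len(results) + 1), [f"{pct}%" for pct in percentages])
plt.xlabel("fraction of training data")
plt.ylabel("validation score")
plt.show()
```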

Of course the disadvantage is that you need to run the training/testing many times, so this might not be practical.

Answered by Erwan on December 25, 2020

OK, so this is not a full answer, but I will mention that in my case simply training for more epochs reduced the variance across runs:

[Image: validation loss across runs for different numbers of training epochs]

As you can see, training for 9 epochs gives lower and more consistent validation loss. Things are still a bit unworkable, since a single 9-epoch experiment takes over 6 hours, but it's something.
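The comparison behind that plot boils down to something like this (run_experiment is a placeholder for the full training procedure, and the epoch counts and seeds are only examples):

```python
import numpy as np

def run_experiment(num_epochs, seed):
    """Placeholder for the full training procedure; returns the mean
    validation loss over the 5 folds for one seed."""
    raise NotImplementedError

epoch_settings = [3, 6, 9]  # example epoch counts
seeds = [0, 1, 2]           # example seeds

for num_epochs in epoch_settings:
    losses = np.array([run_experiment(num_epochs, seed) for seed in seeds])
    print(f"{num_epochs} epochs: mean={losses.mean():.3f}, std={losses.std(ddof=1):.3f}")
```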

Answered by Louka Ewington-Pitsos on December 25, 2020
