
Do smaller neural nets always converge faster than larger ones?

Data Science, asked on February 5, 2021

In your experience, do smaller CNN models (fewer params) converge faster than larger models?

I would think yes, naturally, because there are fewer parameters to optimize. However, I am training a custom MobileNetV2-based U-Net (with 2.9k parameters) for image segmentation, and it is taking longer to converge than a model with a greater number of parameters (5k). If this convergence behaviour is unexpected, it probably indicates a bug in my architecture.

2 Answers

For most cases, probably. For all cases, no. In particular, if you are training on a small dataset with very aggressive regularization in place, you may need a very long time to reach the desired performance level.

For instance, Transformers (a popular family of text-generation networks) trained on small datasets require very aggressive regularization techniques and a very large number of training iterations (see this Twitter thread describing how to train a Transformer model on the PTB and Wikitext-103 datasets).

Answered by noe on February 5, 2021

Interesting question. As @ncasas mentions: for most cases, probably; for all cases, no.

There are many things that impact how fast a network will converge.

  • The optimizer and training hyperparameters

Whether you use SGD, Adam, or another optimizer has a direct impact on convergence speed. Each optimizer also has hyperparameters, most notably the learning rate, which can make a huge difference.
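To make this concrete, here is a minimal sketch (PyTorch and the toy regression task are assumptions of mine, not part of the question's setup) showing that the very same small network can need wildly different numbers of steps to reach a target loss depending only on the learning rate:

```python
import torch
import torch.nn as nn

# Toy full-batch regression task (made up purely for illustration).
torch.manual_seed(0)
X = torch.randn(512, 10)
y = X.sum(dim=1, keepdim=True)

def steps_to_target(lr, tol=1e-2, max_steps=5000):
    """Train the same tiny MLP with plain SGD at a given learning rate
    and return how many steps it takes to push the MSE below `tol`."""
    torch.manual_seed(0)  # identical initial weights for every run
    model = nn.Sequential(nn.Linear(10, 16), nn.ReLU(), nn.Linear(16, 1))
    opt = torch.optim.SGD(model.parameters(), lr=lr)
    loss_fn = nn.MSELoss()
    for step in range(1, max_steps + 1):
        opt.zero_grad()
        loss = loss_fn(model(X), y)
        loss.backward()
        opt.step()
        if loss.item() < tol:
            return step
    return max_steps  # did not reach the target within the budget

for lr in (0.001, 0.01, 0.1):
    print(f"lr={lr}: {steps_to_target(lr)} steps to reach the target loss")
```

Depending on the learning rate, the identical model either converges quickly, crawls, or fails to hit the target within the step budget at all.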

  • The initial state of the network

Needless to say, a pre-trained network will typically converge faster than a randomly initialized one. Even with random initialization, different draws of the initial weights can place you closer to or further from convergence, although the effect is usually small.
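As an illustration only (the question does not say which framework is used, so torchvision here is an assumption), loading an ImageNet-pretrained MobileNetV2 backbone instead of a randomly initialized one usually gives fine-tuning a substantial head start:

```python
import torch
from torchvision.models import mobilenet_v2, MobileNet_V2_Weights

# Random initialization: the starting point depends on the seed and
# is typically far from any useful optimum.
torch.manual_seed(0)
scratch_backbone = mobilenet_v2(weights=None)

# ImageNet-pretrained initialization: already encodes generic visual
# features, so downstream training usually converges in fewer epochs.
pretrained_backbone = mobilenet_v2(weights=MobileNet_V2_Weights.IMAGENET1K_V1)
```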

  • The architecture of the network itself

A network with N parameters can be designed in many different ways: different layer types, numbers of layers, and layer sizes. Each such architecture will exhibit its own convergence behaviour.
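For instance (a PyTorch sketch with arbitrary, hand-picked layer sizes), two networks can have exactly the same parameter count yet very different shapes, and counting parameters is a quick sanity check when comparing models like the 2.9k- and 5k-parameter ones in the question:

```python
import torch.nn as nn

def num_params(model):
    """Count trainable parameters."""
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

# Wide and shallow: a single large hidden layer.
wide = nn.Sequential(nn.Linear(10, 68), nn.ReLU(), nn.Linear(68, 1))

# Narrow and deep: several small hidden layers, sized by hand to land
# on the same parameter budget as `wide`.
deep = nn.Sequential(
    nn.Linear(10, 20), nn.ReLU(),
    nn.Linear(20, 20), nn.ReLU(),
    nn.Linear(20, 8), nn.ReLU(),
    nn.Linear(8, 1),
)

print(num_params(wide), num_params(deep))  # 817 817
```

Despite identical parameter budgets, the wide and the deep toy models will generally train at different speeds, which is exactly why parameter count alone does not predict convergence time.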

  • The difficulty of the problem at hand

It is good to remember that neural networks typically do not converge to a global optimum; instead, the loss surface has many local optima that training can end up in. Training essentially explores many weight configurations and settles on one that is satisfactory. All of this is to say that some problems possess more "satisfactory" configurations than others, so some networks have more ways to converge than others and may therefore find one faster.

Answered by Valentin Calomme on February 5, 2021
