
BERT minimal batch size

Data Science Asked by Predicted Life on February 23, 2021

Is there a minimum batch size for training/fine-tuning a BERT model on custom data?

Could you name any cases where a mini-batch size between 1 and 8 would make sense?

Would a batch size of 1 make sense at all?

One Answer

A small mini-batch size leads to high variance in the gradient estimates. In theory, with a sufficiently small learning rate, you can learn anything even with very small batches.

In practice, Transformers are known to work best with very large batches. You can simulate a large batch by accumulating gradients over several mini-batches and only performing the parameter update once every several steps.
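A minimal sketch of gradient accumulation in PyTorch with Hugging Face Transformers is given below; the model name, the `train_dataloader` (assumed to yield batches containing `labels`), and the `accumulation_steps` value are illustrative assumptions, not part of the original answer.

```python
import torch
from transformers import BertForSequenceClassification

model = BertForSequenceClassification.from_pretrained("bert-base-uncased")
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)

accumulation_steps = 8  # e.g. a mini-batch of 4 behaves like an effective batch of 32

model.train()
optimizer.zero_grad()
for step, batch in enumerate(train_dataloader):  # train_dataloader is assumed to exist
    outputs = model(**batch)
    # Scale the loss so the accumulated gradient averages over the effective batch
    loss = outputs.loss / accumulation_steps
    loss.backward()
    if (step + 1) % accumulation_steps == 0:
        optimizer.step()
        optimizer.zero_grad()
```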

Also, when fine-tuning BERT, you might consider fine-tuning only the last layer (or the last several layers), so you save memory on the parameter gradients and can fit bigger batches.
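Below is a minimal sketch of freezing all but the last encoder layer of a Hugging Face BERT model; which layers are kept trainable and the optimizer settings are assumptions for illustration only.

```python
import torch
from transformers import BertForSequenceClassification

model = BertForSequenceClassification.from_pretrained("bert-base-uncased")

# Freeze the whole BERT encoder first (the classification head stays trainable)
for param in model.bert.parameters():
    param.requires_grad = False

# Unfreeze only the last encoder layer and the pooler
for param in model.bert.encoder.layer[-1].parameters():
    param.requires_grad = True
for param in model.bert.pooler.parameters():
    param.requires_grad = True

# Pass only the trainable parameters to the optimizer, so frozen weights
# keep no gradient or optimizer state
optimizer = torch.optim.AdamW(
    (p for p in model.parameters() if p.requires_grad), lr=2e-5
)
```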

Correct answer by Jindřich on February 23, 2021
