
BERT minimal batch size

Data Science Asked by Predicted Life on February 23, 2021

Is there a minimum batch size for training/fine-tuning a BERT model on custom data?

Could you name any cases where a mini-batch size between 1 and 8 would make sense?

Would a batch size of 1 make sense at all?

One Answer

A small mini-batch size leads to high variance in the gradient estimates. In theory, with a sufficiently small learning rate, you can learn anything even with very small batches.

In practice, Transformers are known to work best with very large batches. You can simulate a large batch by accumulating gradients over several mini-batches and only performing the parameter update once every several steps.
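A minimal sketch of gradient accumulation in PyTorch with Hugging Face Transformers is given below; the model name, the `train_dataloader` (assumed to yield batches containing `labels`), and the `accumulation_steps` value are illustrative assumptions, not part of the original answer.

```python
import torch
from transformers import BertForSequenceClassification

model = BertForSequenceClassification.from_pretrained("bert-base-uncased")
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)

accumulation_steps = 8  # e.g. a mini-batch of 4 behaves like an effective batch of 32

model.train()
optimizer.zero_grad()
for step, batch in enumerate(train_dataloader):  # train_dataloader is assumed to exist
    outputs = model(**batch)
    # Scale the loss so the accumulated gradient averages over the effective batch
    loss = outputs.loss / accumulation_steps
    loss.backward()
    if (step + 1) % accumulation_steps == 0:
        optimizer.step()
        optimizer.zero_grad()
```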

Also, when fine-tuning BERT, you might consider fine-tuning only the last layer (or the last several layers), so you save memory on the parameter gradients and can fit bigger batches.
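Below is a minimal sketch of freezing all but the last encoder layer of a Hugging Face BERT model; which layers are kept trainable and the optimizer settings are assumptions for illustration only.

```python
import torch
from transformers import BertForSequenceClassification

model = BertForSequenceClassification.from_pretrained("bert-base-uncased")

# Freeze the whole BERT encoder first (the classification head stays trainable)
for param in model.bert.parameters():
    param.requires_grad = False

# Unfreeze only the last encoder layer and the pooler
for param in model.bert.encoder.layer[-1].parameters():
    param.requires_grad = True
for param in model.bert.pooler.parameters():
    param.requires_grad = True

# Pass only the trainable parameters to the optimizer, so frozen weights
# keep no gradient or optimizer state
optimizer = torch.optim.AdamW(
    (p for p in model.parameters() if p.requires_grad), lr=2e-5
)
```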

Correct answer by Jindřich on February 23, 2021
