
Does batch_size in Keras have any effect on the quality of results?

Asked on Data Science on December 23, 2020

I am about to train a big LSTM network with 2-3 million articles and am struggling with Memory Errors (I use AWS EC2 g2x2large).

I found out that one solution is to reduce the batch_size. However, I am not sure if this parameter only relates to memory efficiency or if it will affect my results. As a matter of fact, I also noticed that the batch_size used in examples is usually a power of two, which I don't understand either.

I don’t mind if my network takes longer to train, but I would like to know if reducing the batch_size will decrease the quality of my predictions.

Thanks.

4 Answers

After one and a half years, I come back to my answer because my previous answer was wrong.

Batch size impacts learning significantly. What happens when you put a batch through your network is that you average the gradients. The concept is that if your batch size is big enough, this will provide a stable enough estimate of what the gradient of the full dataset would be. By taking samples from your dataset, you estimate the gradient while reducing computational cost significantly. The lower you go, the less accurate your estimate will be; however, in some cases these noisy gradients can actually help escape local minima. When the batch size is too low, your network weights can just jump around if your data is noisy, and it might be unable to learn, or it converges very slowly, thus negatively impacting total computation time.
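
To make the "estimate of the full gradient" point concrete, here is a small NumPy sketch (the toy least-squares problem, its sizes, and the batch sizes are illustrative assumptions, not part of the original answer). It compares mini-batch gradient estimates of several sizes against the full-dataset gradient; the larger the batch, the closer the estimate, at a proportionally higher cost per step.

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(10_000, 20))                     # toy dataset
true_w = rng.normal(size=20)
y = X @ true_w + rng.normal(scale=0.5, size=10_000)   # noisy linear targets

w = np.zeros(20)                                      # current weights

def gradient(X_batch, y_batch, w):
    # Gradient of 0.5 * mean((X @ w - y)^2) with respect to w.
    residual = X_batch @ w - y_batch
    return X_batch.T @ residual / len(y_batch)

full_grad = gradient(X, y, w)                         # the "true" full-dataset gradient

for batch_size in (8, 64, 512):
    errors = []
    for _ in range(200):                              # 200 random mini-batches per size
        idx = rng.choice(len(y), size=batch_size, replace=False)
        errors.append(np.linalg.norm(gradient(X[idx], y[idx], w) - full_grad))
    print(f"batch_size={batch_size:4d}  mean error vs full gradient={np.mean(errors):.3f}")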

Another advantage of batching is for GPU computation: GPUs are very good at parallelizing the calculations that happen in neural networks when part of the computation is the same (for example, repeated matrix multiplication with the same weight matrix of your network). This means that a batch size of 16 will take less than twice the time of a batch size of 8.
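
If you want to check that claim on your own hardware, a rough sketch like the one below (the model shape, layer sizes, and repeat count are arbitrary assumptions) times forward passes at batch sizes 8 and 16; on a GPU the larger batch typically costs well under twice as much.

import time
import numpy as np
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.Input(shape=(256,)),
    tf.keras.layers.Dense(512, activation="relu"),
    tf.keras.layers.Dense(1),
])

for batch_size in (8, 16):
    x = np.random.rand(batch_size, 256).astype("float32")
    model(x, training=False)              # warm-up so weight building is not timed
    start = time.perf_counter()
    for _ in range(1000):
        model(x, training=False)
    elapsed = time.perf_counter() - start
    print(f"batch_size={batch_size:2d}  {elapsed:.2f}s for 1000 forward passes")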

In the case that you do need bigger batch sizes but they will not fit on your GPU, you can feed a small batch, save the gradient estimates, feed one or more further batches, and only then do a weight update. This way you get a more stable gradient because you increased your virtual batch size.
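
A minimal sketch of that gradient-accumulation trick with a custom TensorFlow training loop (the model, the random data, and the accumulation factor of 4 are assumptions for illustration, not code from this answer):

import tensorflow as tf

# Toy data served in small "physical" batches of 8.
x = tf.random.normal((1024, 100))
y = tf.random.normal((1024, 1))
dataset = tf.data.Dataset.from_tensor_slices((x, y)).batch(8)

model = tf.keras.Sequential([
    tf.keras.Input(shape=(100,)),
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dense(1),
])
optimizer = tf.keras.optimizers.Adam(1e-3)
loss_fn = tf.keras.losses.MeanSquaredError()

accum_steps = 4                                             # virtual batch = 4 * 8 = 32
accum_grads = [tf.zeros_like(v) for v in model.trainable_variables]

for step, (x_batch, y_batch) in enumerate(dataset):
    with tf.GradientTape() as tape:
        # Divide by accum_steps so the summed gradients average correctly.
        loss = loss_fn(y_batch, model(x_batch, training=True)) / accum_steps
    grads = tape.gradient(loss, model.trainable_variables)
    accum_grads = [a + g for a, g in zip(accum_grads, grads)]

    if (step + 1) % accum_steps == 0:                       # update weights every 4 batches
        optimizer.apply_gradients(zip(accum_grads, model.trainable_variables))
        accum_grads = [tf.zeros_like(v) for v in model.trainable_variables]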

Correct answer by Jan van der Vegt on December 23, 2020

Oddly enough, I found that larger batch sizes with keras require more epochs to converge.

For example, the output of a script based on Keras' integration test is:

epochs 15   , batch size 16   , layer type Dense: final loss 0.56, seconds 1.46
epochs 15   , batch size 160  , layer type Dense: final loss 1.27, seconds 0.30
epochs 150  , batch size 160  , layer type Dense: final loss 0.55, seconds 1.74
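
A sketch in the spirit of that comparison (the script itself is not reproduced here, so the data, model, and optimizer below are assumptions, and the exact numbers will differ):

import time
import numpy as np
import tensorflow as tf

def run(epochs, batch_size):
    np.random.seed(0)
    tf.random.set_seed(0)
    x = np.random.rand(1000, 20).astype("float32")
    y = (x.sum(axis=1, keepdims=True) > 10).astype("float32")   # simple binary target
    model = tf.keras.Sequential([
        tf.keras.Input(shape=(20,)),
        tf.keras.layers.Dense(16, activation="relu"),
        tf.keras.layers.Dense(1, activation="sigmoid"),
    ])
    model.compile(optimizer="sgd", loss="binary_crossentropy")
    start = time.perf_counter()
    history = model.fit(x, y, epochs=epochs, batch_size=batch_size, verbose=0)
    seconds = time.perf_counter() - start
    print(f"epochs {epochs:<4} batch size {batch_size:<4} layer type Dense: "
          f"final loss {history.history['loss'][-1]:.2f}, seconds {seconds:.2f}")

run(epochs=15, batch_size=16)     # small batches: many weight updates per epoch
run(epochs=15, batch_size=160)    # large batches: 10x fewer updates, worse final loss
run(epochs=150, batch_size=160)   # large batches need more epochs to catch up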


Using too large a batch size can have a negative effect on the accuracy of your network during training since it reduces the stochasticity of the gradient descent.

Edit: most of the time, increasing batch_size is desired to speed up computation, but there are other, simpler ways to do this, such as using data types with a smaller memory footprint via the dtype argument, whether in Keras or TensorFlow, e.g. float32 instead of float64.
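
For instance, a sketch of that dtype suggestion (note that float32 is already the Keras default, so this mainly matters if you have previously switched to float64):

import tensorflow as tf

# Set the global default float type used by new Keras layers.
tf.keras.backend.set_floatx("float32")

# Or control the dtype per layer.
layer = tf.keras.layers.Dense(64, dtype="float32")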

Answered by shadi on December 23, 2020

I feel the accepted answer is possibly wrong. There are several variants of gradient descent algorithms:

  1. Vanilla (batch) gradient descent: the gradient is calculated over all the data points in a single shot and the average is taken. Hence we get a smoother gradient, but learning takes longer.

  2. Stochastic gradient descent: the gradient is computed on one data point at a time, so the gradient estimates are aggressive (noisy) and there are going to be a lot of oscillations (we use momentum parameters, e.g. Nesterov, to control this). There is a chance that these oscillations make the algorithm miss a local minimum and diverge.

  3. Mini-batch gradient descent: takes the perks of both of the previous ones by averaging gradients over a small batch. It is not as aggressive as SGD, and it allows online learning, which vanilla GD never did. (A sketch of all three follows this list.)
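
A NumPy sketch of the three variants on a toy least-squares problem (the data, learning rate, and step count are illustrative assumptions); the only difference between them is how many examples feed each gradient estimate:

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
y = X @ rng.normal(size=5) + rng.normal(scale=0.1, size=1000)   # toy regression data

def grad(X_batch, y_batch, w):
    # Gradient of 0.5 * mean squared error.
    return X_batch.T @ (X_batch @ w - y_batch) / len(y_batch)

def train(batch_size, lr=0.05, steps=500):
    w = np.zeros(5)
    for _ in range(steps):
        if batch_size >= len(y):              # 1. vanilla (full-batch) gradient descent
            idx = np.arange(len(y))
        else:                                 # 2. SGD (batch_size=1) or 3. mini-batch
            idx = rng.choice(len(y), size=batch_size, replace=False)
        w -= lr * grad(X[idx], y[idx], w)
    return np.mean((X @ w - y) ** 2)

for batch_size in (len(y), 1, 32):            # full batch, SGD, mini-batch
    print(f"batch_size={batch_size:5d}  final MSE={train(batch_size):.4f}")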

The smaller the mini-batch, the better the performance of your model tends to be (not always), and it also interacts with your number of epochs: smaller batches mean more weight updates per epoch. If you are training on a large dataset, you want fast convergence with good performance, hence we pick mini-batch GD.

SGD has a fixed learning rate, hence adaptive optimizers such as Adam, AdaDelta, RMSProp, etc. were introduced, which change the learning rate based on the history of gradients.
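
In Keras, switching from plain SGD (with or without Nesterov momentum) to an adaptive optimizer is just a change of the optimizer argument in compile; the model and learning rates below are only illustrative:

import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.Input(shape=(10,)),
    tf.keras.layers.Dense(1),
])

# Plain SGD with Nesterov momentum (fixed learning rate):
# model.compile(optimizer=tf.keras.optimizers.SGD(learning_rate=0.01, momentum=0.9, nesterov=True),
#               loss="mse")

# Adaptive optimizer: the effective step size adapts to the gradient history.
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=0.001), loss="mse")
# Alternatives: tf.keras.optimizers.RMSprop, tf.keras.optimizers.Adadelta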

Answered by Jil Jung Juk on December 23, 2020

A couple of papers have been published showing (and conventional wisdom in 2020 still seems persuaded) that, as Yann LeCun put it, large batches are bad for your health.

Two relevant papers are

and

which offers possible reasons. To paraphrase badly: big batches are likely to get stuck in local ("sharp") minima, while small batches are not. There is some interplay with the choice of learning rate.

Answered by Chris F Carroll on December 23, 2020
