
Do batch GD and stochastic GD give the same results?

Data Science: Asked by AI_new2 on October 21, 2020

If a neural network is trained on a dataset of M samples for N epochs, do batch GD and SGD give the same result? Is SGD faster because it utilizes the hardware better?

I am asking because I figured that both (batch GD & SGD) give mathematically the same result, but I have read that SGD avoids local minima. How can this be true if SGD and batch GD give the same result?

The two update rules in question, where $\eta$ is the learning rate and $\ell_i$ is the loss on sample $i$:

$$\theta \leftarrow \theta - \frac{\eta}{M}\sum_{i=1}^{M}\nabla_\theta \ell_i(\theta) \qquad \text{(batch GD)}$$

$$\theta \leftarrow \theta - \eta\,\nabla_\theta \ell_i(\theta), \quad i \text{ drawn at random from } \{1,\dots,M\} \qquad \text{(SGD)}$$

One Answer

They are not the same: batch gradient descent averages the gradients of samples 1 through M. This is seen in the first equation, where the sum over M elements is divided by M (hence the average).

Stochastic gradient descent, as illustrated in the second equation, updates via a single, randomly selected instance.

The confusion, I believe, stems from the loops over 1 through M. In batch gradient descent, the summation (the loop from 1 to M) is performed before each single weight update, whereas in stochastic gradient descent the loop from 1 to M yields M separate weight updates per epoch.

Thus the intermediate weight updates differ, and these differences compound. So even if both methods process every instance over the entire course of training, updating after each single instance inherently yields a different trajectory than averaging all the gradients before each update.
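To make this concrete, here is a minimal sketch with invented data (fitting y = w·x by least squares), showing that the two procedures diverge even though both see every sample the same number of times:

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.array([1.0, 2.0, 3.0, 4.0])   # made-up inputs, M = 4
y = np.array([2.1, 3.9, 6.2, 8.1])   # made-up targets
lr = 0.01

def grad(w, xi, yi):
    # Gradient of the squared error 0.5 * (w*xi - yi)^2 with respect to w
    return (w * xi - yi) * xi

# Batch GD: one update per epoch, using the average gradient over all M samples.
w_batch = 0.0
for epoch in range(3):
    g = np.mean([grad(w_batch, xi, yi) for xi, yi in zip(x, y)])
    w_batch -= lr * g

# SGD: M updates per epoch, one per randomly ordered sample.
w_sgd = 0.0
for epoch in range(3):
    for i in rng.permutation(len(x)):
        w_sgd -= lr * grad(w_sgd, x[i], y[i])

print(w_batch, w_sgd)  # different values: the trajectories diverge immediately
```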

SGD can be faster simply because it only processes a single instance per weight update. In fact, purely from a hardware efficiency perspective, SGD is suboptimal due to its intrinsic serialized nature over the course of training. Batch gradient descent can easily be parallelized and take advantage of GPUs by processing multiple instances simultaneously.
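As a rough illustration of that hardware point (sizes and data here are made up), the batch gradient for a least-squares model reduces to a single vectorized matrix expression, while plain SGD's updates form a serial chain:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(10_000, 50))  # M x d design matrix (made-up sizes)
y = X @ np.ones(50)                # targets generated from a known weight vector
w = np.zeros(50)

# Batch gradient: all M rows enter one matrix expression, which BLAS/GPUs
# can parallelize across samples.
g_batch = X.T @ (X @ w - y) / len(X)

# Plain SGD: each update reads the weights written by the previous one,
# so the M updates are inherently sequential.
for i in range(len(X)):
    w -= 1e-3 * (X[i] @ w - y[i]) * X[i]
```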

Because SGD is, well, stochastic, the perturbations it introduces into the evolving model can help counteract bias in the training set. This is why SGD can sometimes help avoid local minima, albeit via a rather crude optimization strategy.

Edit to answer the comment asking about the relationship between training set bias and local minima:

Picture an example where a model has 2 parameters that define the xy plane and the corresponding error is plotted as the z component. This defines a function mapping R² to R. This function is only an approximation of the real-world loss function, because you are not training on every possible instance. A biased training set approximates the real-world function poorly, so its local minima differ from the real-world local minima. When using batch GD, you will likely head toward the nearest local minimum, without really any change of target over the course of training. This is good if your training set is relatively unbiased, but otherwise not ideal, as you would be overfitting to a poor approximation.

SGD can improve generalization by forcing a change of target at each iteration: if you sample a different per-sample loss out of the M training samples every iteration, then you will likely change target minima multiple times. If your training set is relatively biased, this can help you avoid overfitting to a poor approximation. If your training set were a perfect approximation, then SGD would be pointless. Some may argue that, given a perfect approximation, the random sampling helps avoid getting trapped in the starting basin, but at that point it's just luck.
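To make the "changing target" idea concrete, here is a tiny sketch (data invented for illustration): for a one-parameter least-squares model y = w·x, each sample has its own minimizer w_i = y_i / x_i, so each SGD step pulls the weight toward a different point, while batch GD descends toward one fixed averaged target.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])  # made-up inputs
y = np.array([3.0, 3.0, 9.5, 7.0])  # made-up targets

per_sample_minima = y / x                          # SGD's "target" changes with each draw
full_batch_minimum = np.sum(x * y) / np.sum(x**2)  # batch GD's single fixed target

print(per_sample_minima)   # approx. [3.0, 1.5, 3.17, 1.75] -- a different pull each step
print(full_batch_minimum)  # approx. 2.18 -- the one point the averaged gradient descends toward
```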

Answered by Benji Albert on October 21, 2020
