
How is Stochastic Gradient Descent used like Mini Batch gradient descent?

Data Science Asked by Hunar on December 27, 2020

As far as I know, gradient descent has three variants:

1- Batch Gradient Descent: processes all the training examples for each iteration of gradient descent.

2- Stochastic Gradient Descent: processes one training example per iteration, so the parameters are updated after each individual example has been processed.

3- Mini-batch gradient descent: usually works faster than both batch gradient descent and stochastic gradient descent. Here, b examples, where b < m, are processed per iteration (see the sketch after this list).
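To make the distinction concrete, here is a minimal NumPy sketch of what I mean; the three variants differ only in the batch size b (the linear model, toy data, and learning rate are made up for illustration):

```python
import numpy as np

def gradient_step(w, X_batch, y_batch, lr=0.1):
    # Gradient of the mean squared error for a linear model y ~ X @ w,
    # computed on whatever batch of examples is passed in.
    grad = 2 * X_batch.T @ (X_batch @ w - y_batch) / len(y_batch)
    return w - lr * grad

# Toy dataset: m = 100 examples, 3 features (illustrative only).
X = np.random.randn(100, 3)
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * np.random.randn(100)
w = np.zeros(3)

# b = 1     -> stochastic gradient descent
# b = 100   -> batch gradient descent (b equals m)
# 1 < b < m -> mini-batch gradient descent
b = 10
for epoch in range(20):
    idx = np.random.permutation(len(y))
    for start in range(0, len(y), b):
        batch = idx[start:start + b]
        w = gradient_step(w, X[batch], y[batch])
```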

But in some cases people say they use Stochastic Gradient Descent and yet define a batch size for training, which is what I am confused about.

Also, what about Adam, AdaDelta, and AdaGrad: are they all mini-batch gradient descent or not?

One Answer

But in some cases they use Stochastic Gradient Descent and they define a batch size for training, which is what I am confused about.

If you are defining a batch size, then you are performing mini-batch gradient descent.

Stochastic Gradient Descent (SGD) and mini-batch gradient descent are often used interchangeably. The idea is that if the batch size is 1, then it is pure SGD.

If the batch size is equal to the number of samples in the dataset, then it is batch gradient descent.
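For example, in a framework like PyTorch the batch_size argument of the data loader is what selects the variant; here is a hedged sketch (the model, data, and hyperparameters are placeholders, not anything from the question):

```python
import torch
from torch import nn
from torch.utils.data import TensorDataset, DataLoader

# Toy data: 1000 samples, 20 features (placeholders).
X = torch.randn(1000, 20)
y = torch.randn(1000, 1)
dataset = TensorDataset(X, y)

# batch_size picks the variant:
#   1            -> stochastic gradient descent (one sample per update)
#   len(dataset) -> batch gradient descent (all samples per update)
#   in between   -> mini-batch gradient descent
loader = DataLoader(dataset, batch_size=32, shuffle=True)

model = nn.Linear(20, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
loss_fn = nn.MSELoss()

for xb, yb in loader:
    optimizer.zero_grad()
    loss = loss_fn(model(xb), yb)
    loss.backward()
    optimizer.step()  # one parameter update per batch
```

Note that the optimizer class is still called SGD even when batch_size > 1, which is one reason the two terms get used interchangeably.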

Also, what about Adam, AdaDelta, and AdaGrad: are they all mini-batch gradient descent or not?

Adam, AdaDelta, and AdaGrad differ in how they update the parameters. The core of gradient descent is the calculation of the gradient of the loss function, which is fundamental to all of the optimizers above.

What differs is how many samples from the dataset the gradients of the loss function are calculated on. After the gradients are calculated, the optimizers mentioned above use modified rules to update the parameters from those gradients.
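To illustrate, here is a minimal NumPy sketch of the standard textbook update rules, all consuming the same gradient; the learning rates and decay constants are common defaults, not anything specific from the question:

```python
import numpy as np

def sgd_update(w, grad, lr=0.01):
    # Plain (stochastic / mini-batch / batch) gradient descent:
    # step directly along the negative gradient.
    return w - lr * grad

def adagrad_update(w, grad, cache, lr=0.01, eps=1e-8):
    # AdaGrad: accumulate squared gradients and scale each
    # parameter's step by its accumulated history.
    cache = cache + grad ** 2
    w = w - lr * grad / (np.sqrt(cache) + eps)
    return w, cache

def adam_update(w, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    # Adam: exponential moving averages of the gradient (m) and its
    # square (v), with bias correction for the first few steps (t starts at 1).
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad ** 2
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v
```

Whether grad came from one sample, a mini-batch, or the full dataset is decided before any of these update functions are called.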

Finally,

SGD and mini-batch/batch GD are methods that determine how many samples the gradients are calculated on. AdaGrad, AdaDelta, etc. update the parameters differently based on the calculated gradients.

Correct answer by Shubham Panchal on December 27, 2020
