Why does RMSProp converge faster than Momentum?

Question

Why does RMSProp converge faster than Momentum in many cases?

Momentum:

$$v_{dw} := \beta v_{dw} + (1-\beta)\,dW$$
$$W := W - \alpha v_{dw}$$

RMSProp:

$$S_{dw} := B \cdot S_{dw} + (1-B) \cdot (dW)^2$$
$$W := W - \alpha \frac{dW}{\sqrt{S_{dw}}}$$

Where $\alpha$ is the learning rate (e.g. 0.01) and $\beta$ is the momentum term (e.g. 0.9); $B$ plays a similar role in RMSProp.

From my point of view, both Momentum and RMSProp have a "tendency to keep moving". I can see how RMSProp will naturally accelerate on flat surfaces due to the factor

$$\frac{1}{\sqrt{S_{dw}}}$$

when $S_{dw}$ is small, but is there another benefit that RMSprop provides?
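
To make the flat-surface observation concrete, here is a rough steady-state sketch. It assumes the gradient stays at a constant value $g$ for $t$ steps and that both running averages start at 0; $g$ and $t$ are symbols introduced only for this illustration.

$$v_{dw} = (1-\beta^t)\,g \to g, \qquad S_{dw} = (1-B^t)\,g^2 \to g^2$$
$$\text{Momentum step: } \alpha v_{dw} \to \alpha g, \qquad \text{RMSProp step: } \alpha \frac{g}{\sqrt{S_{dw}}} \to \alpha \frac{g}{|g|}$$

With $\alpha = 0.01$ and $|g| = 10^{-4}$, the Momentum step shrinks to about $10^{-6}$, while the RMSProp step stays near $10^{-2}$ regardless of how flat the surface is.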

Media · Answer

The basic intuition is that you should not use the same learning rate for every dimension. For instance, the loss can have a steep slope in one direction but a shallow one in another, so you should not move at the same speed in both directions. RMSProp handles this by dividing each component of the gradient by the root of its own running average of squared gradients. Momentum, by contrast, adds acceleration. Think of the gradient as your instantaneous velocity and its running average as your average velocity; the momentum term then acts somewhat like viscosity or friction. Near an optimum the gradients approach zero and the average is low, which means your speed changes slowly. Both methods have the alpha term, but what actually drives the update is the running average, a kind of average that is simple to calculate. Take a look here and here for an analogy.
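
A minimal sketch of this per-dimension intuition in plain Python. The bowl $f(x, y) = \frac{1}{2}(0.01x^2 + y^2)$, the starting point, the step count and the small `eps` added for numerical safety are all assumptions chosen for illustration; none of them come from the question.

```python
import math

# Hypothetical bowl: f(x, y) = 0.5 * (0.01 * x**2 + 1.0 * y**2)
# x is the shallow direction, y is the steep one.
CURVATURE = (0.01, 1.0)

def gradient(w):
    """Gradient of the assumed quadratic bowl at point w = [x, y]."""
    return [c * wi for c, wi in zip(CURVATURE, w)]

def run_momentum(w, steps, alpha=0.01, beta=0.9):
    """Momentum as written in the question: one shared step scale alpha."""
    v = [0.0] * len(w)
    for _ in range(steps):
        g = gradient(w)
        v = [beta * vi + (1 - beta) * gi for vi, gi in zip(v, g)]
        w = [wi - alpha * vi for wi, vi in zip(w, v)]
    return w

def run_rmsprop(w, steps, alpha=0.01, B=0.9, eps=1e-8):
    """RMSProp as written in the question, plus eps (an assumption) in the divisor."""
    s = [0.0] * len(w)
    for _ in range(steps):
        g = gradient(w)
        s = [B * si + (1 - B) * gi ** 2 for si, gi in zip(s, g)]
        # Each coordinate is divided by its own running RMS gradient.
        w = [wi - alpha * gi / (math.sqrt(si) + eps) for wi, gi, si in zip(w, g, s)]
    return w

w_momentum = run_momentum([1.0, 1.0], steps=200)
w_rmsprop = run_rmsprop([1.0, 1.0], steps=200)
print("momentum:", w_momentum)
print("rmsprop: ", w_rmsprop)
```

In this setup Momentum has barely moved along the shallow x direction after 200 steps, while RMSProp is close to the minimum in both coordinates, because the curvature cancels in $dW / \sqrt{S_{dw}}$. The same normalization can make RMSProp slower than Momentum when gradients are large relative to $\alpha$, so this is an illustration of the mechanism, not a general speed guarantee.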

Answered by Media on February 11, 2021

Varun Bajpai · Answer

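Momentum is linear in the gradient and provides speed to the update.

RMSProp contributes an exponentially decaying average of past squared gradients.

In RMSProp, dividing by the root of this average diminishes the vertical movement: a direction whose gradients are consistently large gets a large divisor and therefore a smaller step. With Momentum, it is the averaging itself that damps the oscillation, because gradients of alternating sign sum to approximately 0.

RMSProp provides averaging to the update.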
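
A small numeric sketch of how the two running averages treat an oscillating direction versus a steady one. The two gradient sequences are made up for illustration, and `running_averages` is a hypothetical helper, not part of any library.

```python
def running_averages(grads, beta=0.9):
    """Return the final averages of the gradients (v) and of the squared gradients (s)."""
    v, s = 0.0, 0.0
    for g in grads:
        v = beta * v + (1 - beta) * g        # Momentum-style average (signed)
        s = beta * s + (1 - beta) * g ** 2   # RMSProp-style average (squared)
    return v, s

# Assumed toy data: a "vertical" direction that flips sign every step
# and a "horizontal" direction with a small but steady gradient.
oscillating = [10.0 if i % 2 == 0 else -10.0 for i in range(200)]
steady = [0.1] * 200

v_osc, s_osc = running_averages(oscillating)
v_steady, s_steady = running_averages(steady)

print("oscillating: v =", v_osc, " sqrt(s) =", s_osc ** 0.5)
print("steady:      v =", v_steady, " sqrt(s) =", s_steady ** 0.5)
```

The signed average shrinks the oscillating gradient from 10 to roughly 0.53 in magnitude, because opposite signs largely cancel. The squared average does not cancel and stays near 100, so dividing the gradient by its root rescales the step to about the same size in both directions.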

Adam uses both RMSProp and Momentum.
The speed and the average of the update are combined: on average, it speeds up movement in the direction in which more update is needed.
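
As a rough sketch of how the two pieces fit together, here is one Adam step for a scalar weight. It assumes the commonly used form with bias correction and the usual default hyperparameters; the bias correction, `eps`, and the toy objective $f(w) = w^2$ are not mentioned in this answer and are included here as assumptions.

```python
import math

def adam_step(w, m, s, grad, t, alpha=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a scalar weight; t is the 1-based step count."""
    m = beta1 * m + (1 - beta1) * grad        # Momentum-style average of gradients
    s = beta2 * s + (1 - beta2) * grad ** 2   # RMSProp-style average of squared gradients
    m_hat = m / (1 - beta1 ** t)              # bias correction for the zero initialization
    s_hat = s / (1 - beta2 ** t)
    w = w - alpha * m_hat / (math.sqrt(s_hat) + eps)
    return w, m, s

# Toy problem (assumed): minimize f(w) = w**2, whose gradient is 2 * w.
w, m, s = 1.0, 0.0, 0.0
for t in range(1, 101):
    w, m, s = adam_step(w, m, s, grad=2 * w, t=t)
print(w)
```

With bias correction, the very first step has a size of about `alpha` whatever the scale of the gradient, which is the RMSProp-style normalization at work; the `m` average then smooths the direction over later steps.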

All three are usually faster than Stochastic Gradient Descent without an exponentially weighted average. In the worst case, use Momentum; don't go for plain weight updates.
