Data Science Asked on August 1, 2021
The problem with vanishing gradients is essentially that, since the step size is proportional to the gradient, a very small gradient means it can take a very long time to reach a local minimum. So why not drop that proportionality altogether and choose the step size with a line search instead?
The intuition for why gradient descent ought to work is simply that moving in the direction of steepest descent should tend to decrease the function. It is not obvious, however, why the step size should have to be proportional to the magnitude of the gradient. In an old paper on the subject, Haskell Curry shows that gradient descent converges as long as a line search is performed; he does not consider the case of a step size proportional to the gradient. The more I think about it, the less reason I see why taking the step size to be proportional to the gradient should be considered a good or natural way to do gradient descent.
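To make the contrast concrete, here is a minimal sketch of the two update rules being compared. Everything in it is illustrative and not taken from the question: the badly scaled quadratic objective, the learning rate, and the Armijo backtracking constants are all arbitrary choices.

```python
import numpy as np

# Toy objective: a badly scaled quadratic f(x) = 0.5 * x^T A x (illustrative only).
A = np.diag([1.0, 100.0])

def f(x):
    return 0.5 * x @ A @ x

def grad(x):
    return A @ x

def fixed_step(x, lr=1e-3):
    """Plain gradient descent: the step length is lr * ||grad||,
    i.e. proportional to the gradient's magnitude."""
    return x - lr * grad(x)

def line_search_step(x, alpha0=1.0, beta=0.5, c=1e-4):
    """Move along the negative gradient, but pick the step length by
    Armijo backtracking instead of tying it to the gradient's magnitude."""
    g = grad(x)
    d = -g / (np.linalg.norm(g) + 1e-12)       # unit descent direction
    alpha = alpha0
    while f(x + alpha * d) > f(x) + c * alpha * (g @ d):
        alpha *= beta                           # shrink until sufficient decrease
    return x + alpha * d

x_fixed = x_ls = np.array([1.0, 1.0])
for _ in range(50):
    x_fixed = fixed_step(x_fixed)
    x_ls = line_search_step(x_ls)

print("fixed step:", f(x_fixed), "  line search:", f(x_ls))
```

The point of the sketch is only that the line-search variant decouples the step length from the gradient norm, which is exactly the modification the question asks about.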
Line search may help with the exploding/vanishing gradient problem. However, line search does not work well with mini-batches, and most training uses mini-batches. One of the main advantages of line search is that it tells you whether you have stepped too far, and that advantage disappears when you subsample the data.
Answered by Brian Spiering on August 1, 2021
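Not part of the original answer, but here is a rough sketch of the mini-batch issue it describes. The data, batch size, and backtracking constants are all made up; the point is only that a line search run on one mini-batch's loss guarantees a decrease on that batch, while saying nothing about the full objective.

```python
import numpy as np

# Synthetic linear least-squares problem (illustrative only).
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
y = X @ rng.normal(size=5) + 0.5 * rng.normal(size=1000)

def loss(w, Xs, ys):
    return 0.5 * np.mean((Xs @ w - ys) ** 2)

def grad(w, Xs, ys):
    return Xs.T @ (Xs @ w - ys) / len(ys)

w = np.zeros(5)
idx = rng.choice(len(y), size=32, replace=False)    # one mini-batch
Xb, yb = X[idx], y[idx]

# Backtracking line search on the *mini-batch* loss only.
g = grad(w, Xb, yb)
alpha, beta, c = 1.0, 0.5, 1e-4
while loss(w - alpha * g, Xb, yb) > loss(w, Xb, yb) - c * alpha * (g @ g):
    alpha *= beta

w_new = w - alpha * g
print("batch loss change:", loss(w_new, Xb, yb) - loss(w, Xb, yb))  # always < 0 by construction
print("full  loss change:", loss(w_new, X, y) - loss(w, X, y))      # may be positive: batch overshoots
```

The "you stepped too far" signal that makes line search attractive refers only to the subsampled loss, which is why the advantage largely disappears in mini-batch training.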