Data Science Asked by Kenenbek Arzymatov on April 3, 2021
Consider a simple 1-D function $y = x^2$ to find a maximum with the gradient ascent method.
If we start at the point $x = 3$ on the x-axis:
$$ \frac{\partial f}{\partial x} \bigg\rvert_{x=3} = 2x \,\bigg\rvert_{x=3} = 6 $$
This means that the direction in which we should move is given by the value $6$.
Gradient ascent gives the update rule:
x = old_x + learning_rate * gradient
What I can't understand is why we need to multiply the learning_rate by the gradient. Why can't we just use x = old_x + learning_rate * sign(gradient)?
After all, if we take a learning_rate step in the positive direction, that is already the largest change of x we can make.
I know the reasoning behind finding the direction of steepest ascent from this equation:
$$ \operatorname{grad}(f(\vec{w})) \cdot \vec{d} = |\operatorname{grad}(f(\vec{w}))| \, |\vec{d}| \cos(\theta) $$
But I can't understand why just taking the sign of the gradient (plus or minus) is not enough for ascending.
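For concreteness, here is a minimal sketch of a single update step at x = 3 under both rules (the learning_rate value is an illustrative assumption, not something fixed by the question):

    # Illustrative one-step comparison for y = x^2, whose gradient is 2x.
    # The learning_rate value here is an arbitrary assumption for this sketch.
    old_x = 3.0
    gradient = 2.0 * old_x                           # = 6, as computed above
    learning_rate = 0.1

    def sign(g):
        return (g > 0) - (g < 0)

    x_full = old_x + learning_rate * gradient        # 3.0 + 0.1 * 6 = 3.6
    x_sign = old_x + learning_rate * sign(gradient)  # 3.0 + 0.1 * 1 = 3.1
    print(x_full, x_sign)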
Using only the sign of the gradient is a valid approach, but it may result in slow convergence. Nevertheless, it is a legitimate variation of the method.
Sign-based optimization methods have become popular in machine learning due to their favorable communication cost in distributed optimization and their surprisingly good performance in neural network training. Furthermore, they are closely connected to so-called adaptive gradient methods like Adam. Recent works on signSGD have used a non-standard "separable smoothness" assumption, whereas some older works study sign gradient descent as steepest descent with respect to the ℓ∞-norm. In this work, we unify these existing results by showing a close connection between separable smoothness and ℓ∞-smoothness and argue that the latter is the weaker and more natural assumption. We then proceed to study the smoothness constant with respect to the ℓ∞-norm and thereby isolate geometric properties of the objective function which affect the performance of sign-based methods. In short, we find sign-based methods to be preferable over gradient descent if (i) the Hessian is to some degree concentrated on its diagonal, and (ii) its maximal eigenvalue is much larger than the average eigenvalue. Both properties are common in deep networks.
The aim of this article is to study the properties of sign gradient descent algorithms, which use the sign of the gradient instead of the gradient itself and were first introduced in the RPROP algorithm. The article provides two convergence results for local optimization: one for nominal systems without uncertainty and one for systems with uncertainties. New sign gradient descent algorithms, including the dichotomy algorithm DICHO, are applied to several examples to show their effectiveness in terms of speed of convergence. As a novelty, sign gradient descent algorithms can in practice converge towards minima other than the one closest to the initial condition, making them suitable for global optimization as a new metaheuristic method.
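For reference, here is a hedged sketch of the kind of sign-based update these works study (a signSGD-style step on a simple quadratic; the objective, starting point, and step size are all assumptions for illustration, not taken from either paper):

    import numpy as np

    # Sign-based descent in several dimensions, in the spirit of the
    # signSGD / RPROP-style methods discussed above. The quadratic objective
    # and the step size are illustrative assumptions.

    def grad_f(x):
        # gradient of f(x) = 0.5 * (x1^2 + 10 * x2^2), minimised at the origin
        return np.array([1.0, 10.0]) * x

    x = np.array([3.14, 2.72])
    step_size = 0.25

    for _ in range(40):
        x = x - step_size * np.sign(grad_f(x))   # only the per-coordinate sign is used

    print(x)   # each coordinate ends up hovering within step_size of 0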
Correct answer by Nikos M. on April 3, 2021
Just focusing on the "sign" part: the sign of the gradient tells you in which direction to move, but only "locally". Once you move, you are at a new point on the surface, where the steepest direction (the one leading towards the minimum/maximum) may no longer be where it was, so you have to decide the direction again based only on the sign, and this might not be optimal.
Wikipedia (https://en.wikipedia.org/wiki/Gradient_descent) gives an example: imagine you are on top of a mountain in thick fog and want to get to the bottom. You take a step in what looks like the steepest direction, but because the fog limits your visibility, that step might land you in a position from which the next step does NOT bring you closer to the bottom.
As the answer above also suggests, this can lead to slow convergence as well.
Answered by BlackCurrant on April 3, 2021
Think about the convergence behavior of your algorithm (assuming a fixed learning rate): once it gets x to within learning_rate of the optimum, it will jump to the other side, and the sign will change. Then it will jump back to exactly the previous value, and it will oscillate between those two values until you choose to terminate it, never getting any closer to the true optimum.
An algorithm that used the value of the gradient would see a smaller gradient as you approached the optimum, and could therefore take smaller steps at that point and approach the optimum arbitrarily closely.
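A hedged sketch of this behaviour, assuming (purely for illustration) a concave objective f(x) = -(x - 3)^2 with its maximum at x = 3 and an arbitrary fixed learning_rate:

    # Compares full-gradient ascent with sign-only ascent near an optimum.
    # The objective f(x) = -(x - 3)**2 and the learning_rate are assumptions
    # for this sketch; they are not taken from the question.

    def gradient(x):
        return -2.0 * (x - 3.0)      # derivative of -(x - 3)^2

    def sign(g):
        return (g > 0) - (g < 0)

    learning_rate = 0.4
    x_grad = x_sign = 0.0            # same starting point for both variants

    for step in range(12):
        x_grad += learning_rate * gradient(x_grad)        # full gradient
        x_sign += learning_rate * sign(gradient(x_sign))  # sign only
        print(f"{step:2d}  gradient-based: {x_grad:.4f}   sign-based: {x_sign:.4f}")

    # The gradient-based iterate homes in on 3 because its steps shrink as the
    # gradient shrinks; the sign-based iterate ends up bouncing between roughly
    # 2.8 and 3.2 and never gets any closer to the optimum.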
Answered by hobbs on April 3, 2021