Data Science Asked on September 25, 2021
In gradient descent, I know that local minima occur where the derivative of the function is zero. But with the loss function used here, the gradient is zero only when the predicted output equals the true output (according to the update rule below).
So, when the predicted output equals the true output, the global minimum is reached! My question is: how can a local minimum occur, if the gradient is zero only for a "perfect" fit?
$$\theta_j := \theta_j - \frac{\alpha}{m} \sum_{i=1}^{m} (\hat{y}^i - y^i)\, x_j^i$$
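For concreteness, here is a minimal sketch of this update rule in NumPy (not part of the original question; the data, learning rate, and iteration count are invented for illustration):

```python
import numpy as np

# Invented toy data: the first column of X is the intercept term x_0 = 1.
X = np.array([[1.0, 0.0],
              [1.0, 1.0],
              [1.0, 3.0]])
y = np.array([1.0, 2.0, 3.0])

theta = np.zeros(2)   # [theta_0, theta_1]
alpha = 0.1           # learning rate (invented)
m = len(y)

for _ in range(1000):
    y_hat = X @ theta
    # theta_j := theta_j - (alpha / m) * sum_i (y_hat^i - y^i) * x_j^i
    theta -= (alpha / m) * (X.T @ (y_hat - y))

print(theta)  # approaches the least-squares solution; the residuals need not be zero there
```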
The equation you used for gradient descent isn't general; it's specific to linear regression.
In linear regression the loss surface is convex, so there is indeed only a single global minimum and no local minima; but for more complex models the loss surface is more complicated, and local minima are possible.
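To make the contrast concrete, here is a minimal sketch (not from the original answer; the model, data, starting points, learning rate, and step count are all invented). The one-parameter model $\hat{y} = \sin(w x)$ has a non-convex square loss in $w$: a run started near the data-generating value reaches the zero-loss global minimum, while a run started far away can settle in a local minimum whose loss is not zero.

```python
import numpy as np

# Toy data generated from y = sin(1.5 * x); the "true" parameter is w = 1.5.
x = np.array([0.5, 1.0, 2.0, 3.0])
y = np.sin(1.5 * x)

def loss(w):
    # Square loss for the one-parameter model y_hat = sin(w * x).
    return np.sum((np.sin(w * x) - y) ** 2)

def grad(w):
    # dL/dw = sum_i 2 * (sin(w * x_i) - y_i) * cos(w * x_i) * x_i
    return np.sum(2 * (np.sin(w * x) - y) * np.cos(w * x) * x)

def gradient_descent(w0, lr=0.01, steps=5000):
    w = w0
    for _ in range(steps):
        w -= lr * grad(w)
    return w

for w0 in (1.0, 6.0):
    w_final = gradient_descent(w0)
    print(f"start w = {w0:.1f} -> final w = {w_final:.3f}, loss = {loss(w_final):.4f}")
```

Both runs stop at a point where the gradient is (approximately) zero, but only one of them corresponds to a perfect fit.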
Answered by Itamar Mushkin on September 25, 2021
The premise of “no minimum without a perfect fit” is incorrect.
Let's look at a simple example with square loss.
$$L(\hat{y}, y) = \sum_i (y_i - \hat{y}_i)^2$$
$$ (x_1, y_1) = (0,1)$$ $$ (x_2, y_2) = (1,2)$$ $$ (x_3, y_3) = (3,3)$$
We decide to model this with a line: $\hat{y}_i = \beta_0 + \beta_1 x_i$.
Let's optimize the parameters according to the loss function.
$$L(\hat{y}, y) = (1-(\beta_0 + \beta_1(0)))^2 + (2-(\beta_0 + \beta_1(1)))^2 + (3-(\beta_0 + \beta_1(3)))^2$$
Now we take the partial derivatives of $L$ with respect to $\beta_0$ and $\beta_1$ and do the usual calculus of minimization.
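Carrying the calculation through (a worked-out step added here for concreteness; it is just the normal equations for these three points): setting the partial derivatives to zero gives
$$\frac{\partial L}{\partial \beta_0} = 0 \;\Rightarrow\; 3\beta_0 + 4\beta_1 = 6, \qquad \frac{\partial L}{\partial \beta_1} = 0 \;\Rightarrow\; 4\beta_0 + 10\beta_1 = 11,$$
whose solution is $\beta_0 = \frac{8}{7}$ and $\beta_1 = \frac{9}{14}$, where the loss equals $\frac{1}{14} \approx 0.07$, which is greater than zero.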
So we minimize the loss function, but we certainly do not have a perfect fit with a line.
Answered by Dave on September 25, 2021