Cross Validated Asked by Mike Land on November 12, 2020
I have noticed that PyTorch models perform significantly better when ReLU is used instead of Softplus, with Adam as the optimiser.
How can it be that a non-differentiable function is easier to optimise than an analytic one? Does that mean the optimisation is gradient-based in name only, and some kind of combinatorics is used under the hood?
ReLU is known to outperform many smoother activation functions in general. It is easy to optimize because it is piecewise linear: its gradient is either 0 or 1. Note that ReLU is non-differentiable only at a single point (zero), and autograd frameworks simply use a subgradient there (PyTorch uses 0), so training is still ordinary gradient-based optimisation, not combinatorics. The advantage of ReLU is usually speed of convergence, so it may well be that with more iterations, a different learning rate, batch size, or other hyperparameters, you would get similar results with Softplus.
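For illustration, here is a minimal sketch of the kind of comparison described above: the same small network trained with Adam, with only the activation swapped. The toy regression data, network width, and hyperparameters are illustrative choices, not from the question.

```python
import torch
import torch.nn as nn

# Toy regression task (illustrative data only).
torch.manual_seed(0)
X = torch.randn(1024, 10)
y = X.sum(dim=1, keepdim=True)

def train(activation, steps=500, lr=1e-3):
    # Same architecture and optimiser; only the activation differs.
    model = nn.Sequential(
        nn.Linear(10, 64), activation,
        nn.Linear(64, 64), activation,
        nn.Linear(64, 1),
    )
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = nn.MSELoss()
    for _ in range(steps):
        opt.zero_grad()
        loss = loss_fn(model(X), y)
        loss.backward()
        opt.step()
    return loss.item()

print("ReLU    :", train(nn.ReLU()))
print("Softplus:", train(nn.Softplus()))
```

Varying the number of steps or the learning rate in a sketch like this is one way to check whether the gap is a real difference in attainable loss or just a difference in convergence speed.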
Answered by Tim on November 12, 2020