Artificial Intelligence Asked by jr123456jr987654321 on December 7, 2021
In many diagrams, residual neural networks are depicted only with ReLU activation functions, but can residual NNs also use other activation functions, such as the sigmoid, the hyperbolic tangent, etc.?
The problem with certain activation functions, such as the sigmoid, is that they squash their input into a small finite interval (i.e. they are sometimes classified as saturating activation functions). For example, the sigmoid function has range $(0, 1)$, so its output is pinned near $0$ or $1$ for inputs of large magnitude and its derivative is close to zero there.
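To make the saturation concrete, here is a minimal NumPy sketch (not part of the original answer) that evaluates the sigmoid and its derivative at a few points: for inputs of large magnitude, the output is pinned near $0$ or $1$ and the derivative is almost zero.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1.0 - s)  # peaks at 0.25 when x = 0

for x in [-10.0, -2.0, 0.0, 2.0, 10.0]:
    print(f"x = {x:+6.1f}   sigmoid(x) = {sigmoid(x):.5f}   sigmoid'(x) = {sigmoid_grad(x):.5f}")
```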
This property/behaviour can lead to the vanishing gradient problem, which was one of the problems that Sepp Hochreiter was trying to solve in the context of recurrent neural networks when he developed the LSTM together with his advisor, Jürgen Schmidhuber.
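As a rough illustration of why this leads to vanishing gradients (a deliberate simplification that ignores the weight matrices entirely), the chain rule multiplies one local derivative per layer, and each sigmoid derivative is at most $0.25$, so the accumulated factor shrinks geometrically with depth:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

grad_factor = 1.0   # gradient factor accumulated by the chain rule
z = 0.0             # pre-activation kept at the best case (derivative = 0.25)
for layer in range(1, 21):
    s = sigmoid(z)
    grad_factor *= s * (1.0 - s)   # multiply by the local derivative, <= 0.25
    if layer % 5 == 0:
        print(f"after {layer:2d} sigmoid layers: gradient factor ~ {grad_factor:.2e}")
```

In a real network the weights also enter this product, so they can partially compensate, but with saturated pre-activations the per-layer factor is far smaller than $0.25$ and the decay is even faster.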
Empirically, people have noticed that the ReLU can avoid this vanishing gradient problem. See e.g. this blog post. The paper Deep Sparse Rectifier Neural Networks provides more details about the advantages of ReLUs (a.k.a. rectifiers), so you may want to read it. However, ReLUs can also suffer from the opposite problem, i.e. the exploding gradient problem, since their output is unbounded. Nevertheless, there are several ways to combat this issue. See e.g. this blog post.
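For contrast, under the same weight-free simplification as above, the ReLU derivative is exactly $1$ on positive pre-activations (and $0$ on negative ones), so the accumulated chain-rule factor does not decay with depth on the active path. A minimal sketch:

```python
import numpy as np

def sigmoid_grad(x):
    s = 1.0 / (1.0 + np.exp(-x))
    return s * (1.0 - s)

def relu_grad(x):
    return np.where(x > 0, 1.0, 0.0)

depth = 20
z = 1.0   # the same positive pre-activation at every layer
print("sigmoid chain factor after 20 layers:", sigmoid_grad(z) ** depth)   # ~ 7e-15
print("ReLU chain factor after 20 layers:   ", relu_grad(z) ** depth)      # 1.0
```

On the exploding gradient side, one common remedy is gradient clipping, e.g. `torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)` in PyTorch (the `max_norm` value here is just an illustrative choice).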
That being said, I am not an expert on residual networks, but I think they use the ReLU to further avoid the vanishing gradient problem. This answer (which I gave some time ago) should give you some intuition about why residual networks can avoid the vanishing gradient problem.
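To address the question directly: nothing in the skip connection itself forces the ReLU, so other activations can be plugged in, even though ReLU is the standard choice. Below is a minimal PyTorch sketch of a residual block with a pluggable activation; the fully connected layers and the placement of the activation are illustrative and do not reproduce the exact architecture of the ResNet paper.

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """A residual block y = act(F(x) + x) with a configurable activation."""
    def __init__(self, dim, activation=None):
        super().__init__()
        self.fc1 = nn.Linear(dim, dim)
        self.fc2 = nn.Linear(dim, dim)
        self.act = activation if activation is not None else nn.ReLU()

    def forward(self, x):
        # The identity skip connection (+ x) gives the gradient a path
        # around the weight layers of F(x).
        return self.act(self.fc2(self.act(self.fc1(x))) + x)

x = torch.randn(4, 16)
relu_block = ResidualBlock(16)                        # the usual choice
tanh_block = ResidualBlock(16, activation=nn.Tanh())  # also works
sig_block = ResidualBlock(16, activation=nn.Sigmoid())
print(relu_block(x).shape, tanh_block(x).shape, sig_block(x).shape)
```

Note that with a sigmoid or tanh applied after the addition, the gradient still passes through that (saturating) derivative at each block, which is consistent with the intuition above about why the ReLU was preferred.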
Answered by nbro on December 7, 2021