
How does a zero-centered activation function like tanh help in gradient descent?

Data Science Asked on April 11, 2021

I know that if the inputs X are all positive (or all negative), then the sign of every downstream gradient will be the same as that of the upstream gradient. What I don't understand is how a zero-centered activation function overcomes this problem.

Even in the case of the tanh function, if all X are positive, the sign still remains the same.

Forgive my English; I am not a native speaker.
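As a minimal worked sketch of the sign argument described above (the single-neuron notation here is mine, not from the original posts): for a neuron with pre-activation $z = \sum_i w_i x_i + b$ and loss $L$, backpropagation gives

$$\frac{\partial L}{\partial w_i} = \frac{\partial L}{\partial z}\, x_i.$$

If every input $x_i$ is positive (for example because the previous layer used a sigmoid, whose outputs lie in $(0,1)$), then every component $\partial L / \partial w_i$ has the same sign as the single scalar $\partial L / \partial z$, so one update can only increase all the weights together or decrease them all together, which forces a zig-zag path towards the optimum. A zero-centered activation such as $\tanh$ does not change the sign of the raw data $X$; it changes the sign of the layer's outputs, which become the inputs $x_i$ of the next layer. Because those outputs can be positive or negative, the gradient components of the next layer's weights are no longer forced to share a sign.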

2 Answers

$\tanh$ is a scaled and shifted version of the standard sigmoid $y=\frac{1}{1+e^{-ax}}$: with $a=2$, $\tanh(x) = 2y - 1$, i.e. the sigmoid stretched to the range $(-1,1)$. Because of that scaling it has a steeper gradient than the standard sigmoid. A steep gradient is important because it makes backprop training faster and less likely to get stuck in a near-zero-gradient region.

Furthermore, for the sigmoid $y(0) = \frac{1}{2} \neq 0$, whereas $\tanh(0) = 0$. Sometimes this is very important, especially when dealing with normalized $[0,1]$ signals as inputs.

Answered by maksylon on April 11, 2021
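A quick numerical check of both points in the answer above, as a sketch in NumPy (the helper names sigmoid, d_sigmoid and d_tanh are illustrative, not from the answer):

import numpy as np

def sigmoid(x, a=1.0):
    # standard logistic sigmoid y = 1 / (1 + exp(-a * x))
    return 1.0 / (1.0 + np.exp(-a * x))

def d_sigmoid(x, a=1.0):
    # derivative of the sigmoid: a * y * (1 - y)
    s = sigmoid(x, a)
    return a * s * (1.0 - s)

def d_tanh(x):
    # derivative of tanh: 1 - tanh(x)^2
    return 1.0 - np.tanh(x) ** 2

x = np.linspace(-3.0, 3.0, 7)
print(np.tanh(0.0), sigmoid(0.0))   # 0.0 vs 0.5: tanh output is zero-centered
print(d_tanh(0.0), d_sigmoid(0.0))  # 1.0 vs 0.25: tanh has the steeper slope at 0
print(np.allclose(np.tanh(x), 2.0 * sigmoid(x, a=2.0) - 1.0))  # True: tanh(x) = 2*sigmoid(2x) - 1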

You can have a look at this survey (https://arxiv.org/pdf/2004.06632.pdf), which discusses different aspects of activation functions and also explains why zero-centered activation functions are considered more suitable in practice.

Note that if you consider the universal approximation theorem, the activation function does not need to be zero-centered.

Answered by Graph4Me Consultant on April 11, 2021
