Data Science
Asked by Malvrok on July 7, 2021
The question is simple: is there any advantage to using the sigmoid function in a convolutional neural network? Every website that talks about CNNs uses the ReLU function.
The reason that sigmoid functions are being replaced by rectified linear units lies in the properties of their derivatives.
Let's take a quick look at the sigmoid function $\sigma$, defined as $\sigma(x) = \frac{1}{1+e^{-x}}$. Its derivative is $$\sigma'(x) = \sigma(x)\,(1-\sigma(x)).$$ The range of $\sigma$ is between 0 and 1, and the maximum of the derivative $\sigma'$ is $\frac{1}{4}$, attained at $x = 0$. When we stack multiple sigmoid layers, the chain rule used in backpropagation multiplies several $\sigma'$ factors together, so as we add more and more layers the maximum possible gradient decreases exponentially. This is commonly known as the vanishing gradient problem. The opposite problem occurs when the factors are greater than 1, in which case the gradients explode toward infinity (the exploding gradient problem).
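To see the scale of the effect numerically, here is a minimal NumPy sketch (my own illustration, not part of the original answer; the helper names `sigmoid` and `sigmoid_prime` are chosen just for this example). It checks the derivative's maximum of $\frac{1}{4}$ and shows how a best-case product of $n$ such factors shrinks as layers are stacked.

```python
import numpy as np

def sigmoid(x):
    """Logistic sigmoid, 1 / (1 + exp(-x))."""
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_prime(x):
    """Derivative of the sigmoid: sigma(x) * (1 - sigma(x))."""
    s = sigmoid(x)
    return s * (1.0 - s)

# The derivative peaks at x = 0 with value 0.25.
print(sigmoid_prime(0.0))  # 0.25

# Backpropagation through n stacked sigmoid layers multiplies n such
# factors together (chain rule), so even in the best case the gradient
# scale shrinks like (1/4)^n.
for n in (2, 5, 10, 20):
    print(n, 0.25 ** n)
```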
Now let's look at the ReLU activation function, defined as $$R(x) = \max(0, x).$$ Its graph is zero for all negative inputs and the identity line for positive inputs.
If you look at the derivative of this function (the slopes on the graph), the gradient is either 1 or 0, so we have neither the vanishing gradient problem nor the exploding gradient problem. And since the general trend in neural networks has been toward deeper and deeper architectures, ReLU became the activation of choice.
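For comparison, a short sketch of ReLU and its slope (again my own illustration, with `relu` and `relu_prime` as hypothetical helper names): the per-layer gradient factor is exactly 0 or 1, so products of these factors through many layers neither shrink gradually toward zero nor blow up.

```python
import numpy as np

def relu(x):
    """Rectified linear unit: max(0, x), applied elementwise."""
    return np.maximum(0.0, x)

def relu_prime(x):
    """Slope of ReLU: 1 for positive inputs, 0 otherwise
    (the kink at x = 0 is conventionally given slope 0 here)."""
    return (x > 0).astype(float)

x = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(relu(x))        # [0.  0.  0.  0.5 2. ]
print(relu_prime(x))  # [0. 0. 0. 1. 1.]

# During backprop the gradient is multiplied by these 0/1 factors, so it
# is either passed through unchanged or zeroed out; it neither vanishes
# gradually across layers nor explodes.
```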
Hope this helps.
Correct answer by Armen Aghajanyan on July 7, 2021