Advanced Activation Layers in Deep Neural Networks

Question

I'm wondering about the benefits of advanced activation layers such as LeakyReLU, Parametric ReLU, and Exponential Linear Unit (ELU). What are the differences between them and how do they benefit training?

Sophie Searcy - Metis · Accepted Answer

ReLU
Simply rectifies the input, meaning positive inputs are retained but negatives give an output of zero. (Hahnloser et al. 2010)
$$
f(x) = max(0,x)
$$
Pros:

Eliminates the vanishing/exploding gradient problem. (true for all following as well)
Sparse activation. (true for all following as well)
Noise-robust deactivation state (i.e. does not attempt to encode the degree of absence).

Cons:

Dying ReLU problem (many neurons end up in a state where they are inactive for most or all inputs).
Not differentiable. (true for all following as well)
No negative values means mean unit activation is often far from zero. This slows down learning.

Leaky ReLUs
Adds a small coefficient ($<1$) for negative values. (Maas, Hannun, & Ng 2013)
$$
f(x) = begin{cases}
x & text{if } x geq 0 
0.1 x & text{otherwise}
end{cases}
$$
Pros:

Alleviates dying ReLU problem. (true for all following)
Negative activations push mean unit activation closer to zero and thus speeds up learning. (true for all following)

Cons:

Deactivation state is not noise-robust (i.e. noise in deactivation results in different levels of absence).

PReLUs
Just like Leaky ReLUs but with a learnable coefficient. (Note that in the below equation a different $a$ can be learned for different channels.) (He et al. 2015)
$$
f(x) = begin{cases}
x & text{if } x geq 0 
a x & text{otherwise}
end{cases}
$$
Pros:

Improved performance (lower error rate on benchmark tasks) compared to Leaky ReLUs.

Cons:

Deactivation state is not noise-robust (i.e. noise in deactivation results in different levels of absence).

ELUs
$$
f(x) = begin{cases}
x & text{if } x geq 0 
alpha(exp(x)-1) & text{otherwise}
end{cases}
$$
Replaces the small linear gradient of Leaky ReLUs and PReLUs with a vanishing gradient. (Clevert, Unterthiner, Hochreiter 2016)
Pros:

Improved performance (lower error and faster learning) compared to ReLUs.
Deactivation state is noise-robust.

Advanced Activation Layers in Deep Neural Networks

One Answer

ReLU

Leaky ReLUs

PReLUs

ELUs

Add your own answers!

Ask a Question