Advanced Activation Layers in Deep Neural Networks

Asked on Data Science by Vadim Smolyakov, November 30, 2020

I’m wondering about the benefits of advanced activation layers such as LeakyReLU, Parametric ReLU, and Exponential Linear Unit (ELU). What are the differences between them and how do they benefit training?

One Answer

ReLU

Simply rectifies the input: positive values pass through unchanged, while negative values are mapped to zero. (Hahnloser et al. 2000)

$$ f(x) = \max(0, x) $$

Pros:

  • Mitigates the vanishing gradient problem: the gradient is 1 for all positive inputs, unlike saturating activations. (true for all following as well)
  • Sparse activation. (true for all following as well)
  • Noise-robust deactivation state (i.e. does not attempt to encode the degree of absence).

Cons:

  • Dying ReLU problem (many neurons end up in a state where they are inactive for most or all inputs, and because the gradient there is zero they may never recover).
  • Not differentiable at zero. (true for all following as well)
  • Because it produces no negative values, the mean unit activation is often far from zero, which slows down learning.
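
For concreteness, a minimal NumPy sketch of the rectifier and a conventional subgradient (the function names are my own, not from any particular library):

```python
import numpy as np

def relu(x):
    """Rectified linear unit: positive inputs pass through, negatives become 0."""
    return np.maximum(0.0, x)

def relu_subgrad(x):
    """Subgradient of ReLU: 1 for positive inputs, 0 otherwise
    (the kink at x = 0 is conventionally assigned 0)."""
    return (x > 0).astype(x.dtype)

x = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(relu(x))          # [0.  0.  0.  0.5 2. ]
print(relu_subgrad(x))  # [0. 0. 0. 1. 1.]
```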

Leaky ReLUs

Multiplies negative inputs by a small fixed coefficient ($<1$) instead of zeroing them out. (Maas, Hannun, & Ng 2013)

$$ f(x) = \begin{cases} x & \text{if } x \geq 0 \\ 0.1 x & \text{otherwise} \end{cases} $$

Pros:

  • Alleviates dying ReLU problem. (true for all following)
  • Negative activations push the mean unit activation closer to zero and thus speed up learning. (true for all following)

Cons:

  • Deactivation state is not noise-robust (i.e. noise in deactivation results in different levels of absence).
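
A minimal NumPy sketch of the same idea, using the 0.1 slope from the equation above; the slope is a fixed hyperparameter here, not learned:

```python
import numpy as np

def leaky_relu(x, alpha=0.1):
    """Leaky ReLU: identity for non-negative inputs, a small fixed slope
    (alpha, here 0.1 as in the equation above) for negative inputs."""
    return np.where(x >= 0, x, alpha * x)

x = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(leaky_relu(x))  # [-0.2  -0.05  0.    0.5   2.  ]
```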

PReLUs

Just like Leaky ReLUs, but the negative-slope coefficient $a$ is learned during training rather than fixed. (Note that in the equation below a different $a$ can be learned for each channel.) (He et al. 2015)

$$ f(x) = \begin{cases} x & \text{if } x \geq 0 \\ a x & \text{otherwise} \end{cases} $$

Pros:

  • Improved performance (lower error rate on benchmark tasks) compared to Leaky ReLUs.

Cons:

  • Deactivation state is not noise-robust (i.e. noise in deactivation results in different levels of absence).
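
A minimal NumPy sketch of the forward pass plus the extra gradient term that makes $a$ learnable; in a real framework the optimizer would update $a$ by backpropagation along with the weights:

```python
import numpy as np

def prelu(x, a):
    """PReLU forward pass: like Leaky ReLU, but the negative slope `a`
    is a learnable parameter (a scalar here; per-channel in He et al. 2015)."""
    return np.where(x >= 0, x, a * x)

def prelu_grad_a(x):
    """Partial derivative of the output with respect to `a`: x where x < 0, else 0.
    Backpropagation uses this term to update `a` during training."""
    return np.where(x >= 0, 0.0, x)

x = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
a = 0.25  # He et al. 2015 initialise a at 0.25
print(prelu(x, a))      # [-0.5   -0.125  0.     0.5    2.   ]
print(prelu_grad_a(x))  # [-2.  -0.5  0.   0.   0. ]
```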

ELUs

Replaces the small linear slope that Leaky ReLUs and PReLUs use for negative inputs with an exponential curve that saturates at $-\alpha$, so the gradient smoothly vanishes for large negative inputs. (Clevert, Unterthiner, & Hochreiter 2016)

$$ f(x) = \begin{cases} x & \text{if } x \geq 0 \\ \alpha(\exp(x)-1) & \text{otherwise} \end{cases} $$

Pros:

  • Improved performance (lower error and faster learning) compared to ReLUs.
  • Deactivation state is noise-robust.
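
A minimal NumPy sketch; note how the negative branch saturates towards $-\alpha$ rather than growing linearly, which is what makes the deactivation state noise-robust:

```python
import numpy as np

def elu(x, alpha=1.0):
    """ELU: identity for non-negative inputs; negative inputs decay
    exponentially towards -alpha, so the unit saturates instead of
    following a straight line."""
    return np.where(x >= 0, x, alpha * (np.exp(x) - 1.0))

x = np.array([-5.0, -1.0, 0.0, 1.0, 5.0])
print(elu(x))  # approx [-0.993 -0.632  0.     1.     5.   ]
```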

Correct answer by Sophie Searcy - Metis on November 30, 2020
