Vanishing Gradient vs Exploding Gradient as Activation function?

Data Science Asked on April 23, 2021

ReLU is used as an activation function that serves two purposes:

  1. Introducing non-linearity into a DNN.
  2. Helping to mitigate the vanishing gradient problem.

For the exploding gradient problem, we use the gradient clipping approach, where we set a maximum threshold on the gradient, similar to how ReLU puts a lower limit of 0 on a unit's output.

From what I have read, ReLU is considered an activation function. In a similar fashion, can gradient clipping also be used as an activation function? If yes, what are the pros and cons of doing so?
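To make the comparison concrete, here is a minimal NumPy sketch of the two operations as I understand them (the threshold value of 5.0 is just an illustrative assumption):

    import numpy as np

    def relu(z):
        # Activation: applied to a layer's outputs in the forward pass.
        return np.maximum(0.0, z)

    def clip_gradient(grad, threshold=5.0):
        # Gradient clipping: applied to gradients in the backward pass,
        # limiting each component to [-threshold, threshold].
        return np.clip(grad, -threshold, threshold)

    z = np.array([-2.0, 0.5, 3.0])
    g = np.array([-12.0, 0.7, 8.0])
    print(relu(z))           # 0, 0.5, 3 (negatives zeroed)
    print(clip_gradient(g))  # -5, 0.7, 5 (magnitudes capped at the threshold)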

One Answer

ReLU is considered an activation function; in a similar fashion, can we use gradient clipping also as an activation function?

ReLU is an activation function. Gradient clipping is a technique to keep the exploding gradient problem at bay: it operates on gradients during the backward pass, not on layer outputs during the forward pass, so it cannot play the role of an activation function.
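As a rough sketch of where each one sits (PyTorch here is just an example framework; the model, data, and clipping threshold are made up for illustration):

    import torch
    import torch.nn as nn

    # ReLU lives inside the model and acts on layer outputs in the forward pass.
    model = nn.Sequential(nn.Linear(10, 32), nn.ReLU(), nn.Linear(32, 1))
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
    loss_fn = nn.MSELoss()

    x, y = torch.randn(8, 10), torch.randn(8, 1)

    optimizer.zero_grad()
    loss = loss_fn(model(x), y)
    loss.backward()
    # Gradient clipping lives in the training loop and acts on gradients,
    # after backward() and before the optimizer step.
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
    optimizer.step()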

I also wish to stress that the best technique to control vanishing/exploding gradients is, at the moment, batch normalization. Dropout (a technique born to fight overfitting) also has a similar regularization effect, by forcing the model to distribute weights more evenly through the layer. That's why you don't see gradient clipping used as often as you used to.
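For reference, a minimal sketch of how batch normalization and dropout are typically slotted between layers (layer sizes and dropout rate are arbitrary choices here):

    import torch.nn as nn

    model = nn.Sequential(
        nn.Linear(10, 32),
        nn.BatchNorm1d(32),  # normalizes the layer's inputs, stabilizing gradients
        nn.ReLU(),
        nn.Dropout(p=0.2),   # randomly zeroes activations, regularizing the model
        nn.Linear(32, 1),
    )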


EDIT:

I forgot to mention that proper scaling of your input variables and appropriate weight initialization make vanishing/exploding gradients much less frequent. This is, of course, purely based on personal experience, but it's still very important to take into account.
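As an illustration of that point, a minimal sketch of input standardization and He (Kaiming) initialization, a common choice for ReLU layers (the toy data and layer sizes are assumptions):

    import torch
    import torch.nn as nn

    # Standardize inputs to zero mean and unit variance.
    x = torch.randn(8, 10) * 50 + 3        # toy, badly scaled data
    x = (x - x.mean(dim=0)) / x.std(dim=0)

    # He (Kaiming) initialization suits ReLU activations.
    layer = nn.Linear(10, 32)
    nn.init.kaiming_normal_(layer.weight, nonlinearity='relu')
    nn.init.zeros_(layer.bias)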

Answered by Leevo on April 23, 2021
