
BPTT vs Vanishing Gradient Problem

Data Science Asked on June 24, 2021

I know that BPTT (backpropagation through time) is the method for applying backpropagation to an RNN.

It works fine with RNNs because it stops at a certain point once the changes approach zero,

but isn't that exactly the vanishing gradient problem?

If it is the same thing, why does it have two names, one being a problem and the other a method?

If not, what am I missing here? What is the difference between them?

2 Answers

BPTT is the process of backpropagating the gradient calculations (chain rule) of a loss function through each time step of a recurrent network. The parameters $U$ (input weights) and $W$ (recurrent connection weights) contribute to the gradient at every time step, so the total gradient is the sum of those per-time-step contributions.
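To make the "sum of contributions" concrete, here is a minimal sketch of BPTT for a vanilla RNN with hidden state $h_t = \tanh(U x_t + W h_{t-1})$ and a simple squared-error loss at each step (the shapes, loss, and variable names are purely illustrative, not taken from any particular library):

```python
import numpy as np

rng = np.random.default_rng(0)
T, n_in, n_h = 5, 3, 4                        # time steps, input size, hidden size
U = rng.normal(scale=0.1, size=(n_h, n_in))   # input weights
W = rng.normal(scale=0.1, size=(n_h, n_h))    # recurrent connection weights
xs = rng.normal(size=(T, n_in))               # toy inputs
ys = rng.normal(size=(T, n_h))                # toy targets for the hidden state

# Forward pass: store every hidden state for reuse in the backward pass.
hs = [np.zeros(n_h)]
for t in range(T):
    hs.append(np.tanh(U @ xs[t] + W @ hs[-1]))

# Backward pass (BPTT): the gradients w.r.t. U and W are the SUM of the
# per-time-step contributions, propagated backwards with the chain rule.
dU, dW = np.zeros_like(U), np.zeros_like(W)
dh_next = np.zeros(n_h)                       # gradient flowing in from step t+1
for t in reversed(range(T)):
    dh = (hs[t + 1] - ys[t]) + dh_next        # loss term at step t + carried gradient
    dz = dh * (1.0 - hs[t + 1] ** 2)          # through tanh: d tanh(z)/dz = 1 - tanh(z)^2
    dU += np.outer(dz, xs[t])                 # contribution of step t to dL/dU
    dW += np.outer(dz, hs[t])                 # contribution of step t to dL/dW
    dh_next = W.T @ dz                        # pass the gradient back to step t-1

print("||dL/dW|| =", np.linalg.norm(dW))
```

The carried term `dh_next = W.T @ dz` is the factor that gets multiplied again and again as you move further back in time, which is where the vanishing gradient discussed below comes from.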

The vanishing gradient is a generic issue in neural networks in which gradient values decay as they propagate through the parameters of the network, meaning that the updates for some parameters become very small and hard to propagate. This gradient-based learning problem can occur in deep networks and in vanilla recurrent networks (because of the long product of contributions to the gradient over time steps), and it is often caused by the inherent nature of some activation functions.
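To see why that long product over time steps shrinks, here is a toy illustration (my own numbers, not part of the original answer): the gradient reaching early time steps is roughly a product of per-step Jacobians, and when their norms are below 1 the product decays exponentially with the number of steps.

```python
import numpy as np

rng = np.random.default_rng(1)
n_h = 4
W = rng.normal(scale=0.1, size=(n_h, n_h))    # small recurrent weights
grad = np.ones(n_h)                           # gradient at the last time step
for t in range(20):
    h = np.tanh(rng.normal(size=n_h))         # some hidden state along the way
    grad = W.T @ (grad * (1.0 - h ** 2))      # one backward step of the chain rule
    if t % 5 == 4:
        print(f"after {t + 1:2d} steps back: ||grad|| = {np.linalg.norm(grad):.2e}")
```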

Answered by Elliot on June 24, 2021

Backpropagation stops when, as it tunes the model parameters, it finds a minimum of the cost function. At the minimum, the cost function has zero gradient, by definition. Numerically, the gradient will never be exactly zero, which is why the algorithm stops once the gradient falls below a certain threshold.
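As a sketch of that stopping rule (the quadratic cost and tolerance value here are just for illustration), gradient descent halts once the gradient magnitude drops below a small threshold near the minimum:

```python
def cost(w):               # simple quadratic bowl with its minimum at w = 3
    return (w - 3.0) ** 2

def grad(w):               # its derivative
    return 2.0 * (w - 3.0)

w, lr, tol = 10.0, 0.1, 1e-6
for step in range(10_000):
    g = grad(w)
    if abs(g) < tol:       # "the gradient falls below a certain threshold"
        print(f"stopped at step {step}: w = {w:.6f}, |grad| = {abs(g):.1e}")
        break
    w -= lr * g            # gradient descent update
```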

That’s a vanishing gradient, but it’s what should happen when looking for a minimum, so that’s not the vanishing gradient problem. The problem is when the gradient is very small, even when you’re not near the minimum. You haven’t tuned your parameters at all yet, and the performance of the model is terrible, but the gradient of the cost function is practically zero. That means you don’t know how to update your parameters, so you can’t train.

This happens, for example, in a regular feed-forward network with many layers that all use the sigmoid activation function. The cost function is calculated from the output of the last layer. The gradient of the cost function with respect to the parameters of the first layer contains the product of the gradients in all later layers. The derivative of the sigmoid is always below 1 (at most 0.25), so you are multiplying many numbers below 1, which leads to a very small number. That is the vanishing gradient problem.
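A quick numeric check of that argument (with made-up pre-activations): chaining many sigmoid layers multiplies many factors that are at most 0.25, so the factor reaching the first layer collapses.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(2)
grad_factor = 1.0
for layer in range(30):
    z = rng.normal()                               # a pre-activation somewhere in the net
    grad_factor *= sigmoid(z) * (1 - sigmoid(z))   # sigmoid'(z) <= 0.25
    if layer % 10 == 9:
        print(f"after {layer + 1} sigmoid layers: factor = {grad_factor:.2e}")
```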

This also explains why using a ReLU rather than a sigmoid helps, and why the problem gets worse the more layers you have.
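For comparison, the same toy chain with ReLU (again only a sketch with made-up pre-activations): the ReLU derivative is exactly 1 for any positive pre-activation, so active units contribute a factor of 1 and the product does not shrink with depth.

```python
import numpy as np

rng = np.random.default_rng(3)
factor = 1.0
for layer in range(30):
    z = abs(rng.normal())               # keep the unit active (z > 0) for illustration
    factor *= 1.0 if z > 0 else 0.0     # ReLU derivative: 1 for z > 0, 0 otherwise
print(f"after 30 active ReLU layers: factor = {factor:.1f}")   # stays at 1.0
```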

Answered by Paul on June 24, 2021
