
Understanding dropout and gradient descent

Data Science Asked by emanuele on March 24, 2021

I am looking at how to implement dropout in deep neural networks and found something counter-intuitive. In the forward phase, dropout masks the activations with a random tensor of 1s and 0s to force the net to learn the average of the weights. This helps the net generalize better. But during the update phase of gradient descent, the activations are not masked. This seems counter-intuitive to me. If I mask the activations with dropout, why should I not also mask the gradient descent update phase?

One Answer

In dropout as described here, weights are not masked. Instead, the neuron activations are masked, per example, as it is presented for training (i.e. the mask is randomised for each forward run and gradient backprop, and never repeated).

The activations are masked during the forward pass, and the gradient calculations use the same mask during back-propagation of that example. This can be implemented as a modifier within a layer description, or as a separate dropout layer.
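
To make the forward/backward relationship concrete, here is a minimal NumPy sketch (not taken from the answer or any particular library); the names dropout_forward, dropout_backward and p_drop are illustrative, and it assumes the common "inverted dropout" scaling so the kept activations retain the same expected value:

    import numpy as np

    def dropout_forward(activations, p_drop, rng):
        # Zero activations with probability p_drop and keep the mask for backprop.
        mask = (rng.random(activations.shape) >= p_drop).astype(activations.dtype)
        mask /= (1.0 - p_drop)  # inverted dropout: rescale the kept units
        return activations * mask, mask

    def dropout_backward(grad_output, mask):
        # Apply the identical mask to this example's incoming gradient.
        return grad_output * mask

    rng = np.random.default_rng(0)
    a = rng.standard_normal((4, 8))                       # activations for a few examples
    out, mask = dropout_forward(a, p_drop=0.5, rng=rng)   # forward pass: mask is sampled here
    grad_in = dropout_backward(np.ones_like(out), mask)   # backward pass: same mask reused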

During the weight update phase, typically applied per mini-batch (where each example will have had a different mask applied), there is no further use of dropout masks. The gradient values used for the update have already been affected by the masks applied during back-propagation.
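
Continuing the sketch above, the following toy mini-batch loop illustrates that each example's gradient already carries its own mask, so the SGD update itself applies no dropout; the shapes, learning rate and dummy squared loss are all illustrative assumptions:

    import numpy as np

    rng = np.random.default_rng(1)
    W = rng.standard_normal((8, 3)) * 0.1      # toy weight matrix
    x_batch = rng.standard_normal((16, 8))     # mini-batch of 16 examples
    lr, p_drop = 0.1, 0.5

    grad_W = np.zeros_like(W)
    for x in x_batch:
        mask = (rng.random(x.shape) >= p_drop) / (1.0 - p_drop)  # fresh mask per example
        h = (x * mask) @ W                      # forward pass with masked inputs to W
        grad_h = 2 * h                          # gradient of a dummy loss sum(h**2)
        grad_W += np.outer(x * mask, grad_h)    # the mask is already baked into this gradient
    grad_W /= len(x_batch)

    W -= lr * grad_W                            # plain SGD step: no mask applied at update time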

A useful reference I found for learning how dropout works, and perhaps implementing it yourself, is the Deep Learn Toolbox for Matlab/Octave.

Correct answer by Neil Slater on March 24, 2021
