
Backprop Through Max-Pooling Layers?

Data Science Asked by shinvu on December 9, 2020

This is a small conceptual question that’s been nagging me for a while: How can we back-propagate through a max-pooling layer in a neural network?

I came across max-pooling layers while going through this tutorial for Torch 7’s nn library. The library abstracts the gradient calculation and forward passes for each layer of a deep network. I don’t understand how the gradient calculation is done for a max-pooling layer.

I know that if you have an input ${z_i}^l$ going into neuron $i$ of layer $l$, then ${\delta_i}^l$ (defined as ${\delta_i}^l = \frac{\partial E}{\partial {z_i}^l}$) is given by:
$$
{\delta_i}^l = \theta'({z_i}^l) \sum_{j} {\delta_j}^{l+1} w_{i,j}^{l,l+1}
$$

So, a max-pooling layer would receive the ${\delta_j}^{l+1}$'s of the next layer as usual; but since the activation function for the max-pooling neurons takes in a vector of values (over which it maxes) as input, ${\delta_i}^{l}$ isn't a single number anymore, but a vector ($\theta'({z_j}^l)$ would have to be replaced by $\nabla \theta(\{{z_j}^l\})$). Furthermore, $\theta$, being the max function, isn't differentiable with respect to its inputs.

So… how should it work out exactly?

3 Answers

There is no gradient with respect to non-maximum values, since changing them slightly does not affect the output. Further, the max is locally linear with slope 1 with respect to the input that actually achieves the max. Thus, the gradient from the next layer is passed back only to the neuron that achieved the max. All other neurons get zero gradient.

So in your example, $\delta_i^l$ would be a vector of all zeros, except that the $i^*$-th location gets the value $\delta_j^{l+1}$, where $i^* = \operatorname{argmax}_{i} (z_i^l)$.
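
For example, a minimal NumPy sketch of this routing rule for a single 1-D pooling window (the function names are illustrative, not from any particular library) could look like this:

    import numpy as np

    def maxpool_window_forward(z):
        """Forward pass for one pooling window: return the max and remember which input won."""
        i_star = np.argmax(z)           # index of the neuron that achieves the max
        return z[i_star], i_star

    def maxpool_window_backward(delta_next, i_star, n):
        """Backward pass: route the upstream gradient to the argmax position, zeros elsewhere."""
        delta = np.zeros(n)
        delta[i_star] = delta_next      # only the winning neuron receives gradient
        return delta

    z = np.array([0.3, 1.7, -0.2, 0.9])        # inputs z_i^l to one pooling window
    out, i_star = maxpool_window_forward(z)    # out = 1.7, i_star = 1
    delta = maxpool_window_backward(2.5, i_star, z.size)
    print(delta)                               # [0.  2.5 0.  0. ]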

Correct answer by abora on December 9, 2020

Max Pooling

So suppose you have a layer $P$ which comes on top of a layer $PR$. Then the forward pass will be something like this:

$P_i = f\left(\sum_j W_{ij} PR_j\right)$,

where $P_i$ is the activation of the $i$-th neuron of layer $P$, $f$ is the activation function, and $W$ are the weights. So if you differentiate that, by the chain rule you get that the gradients flow as follows:

$\operatorname{grad}(PR_j) = \sum_i \operatorname{grad}(P_i)\, f^\prime\, W_{ij}$.

But now, if you have max pooling, $f = \mathrm{id}$ for the max neuron and $f = 0$ for all other neurons, so $f^\prime = 1$ for the max neuron in the previous layer and $f^\prime = 0$ for all other neurons. So:

$\operatorname{grad}(PR_{\text{max neuron}}) = \sum_i \operatorname{grad}(P_i)\, W_{i\,\text{max neuron}}$,

$\operatorname{grad}(PR_{\text{others}}) = 0.$
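
To make that concrete, here is a rough NumPy sketch of a 2x2 max-pooling layer (stride 2) that caches the argmax position of each window in the forward pass and routes the upstream gradient back to it in the backward pass; the function names and the toy input are my own illustrative choices, not from this answer:

    import numpy as np

    def maxpool2x2_forward(x):
        """2x2 max pooling with stride 2; also records the argmax position of each window."""
        h, w = x.shape
        out = np.zeros((h // 2, w // 2))
        argmax = {}
        for i in range(0, h, 2):
            for j in range(0, w, 2):
                window = x[i:i+2, j:j+2]
                r, c = np.unravel_index(np.argmax(window), window.shape)
                out[i // 2, j // 2] = window[r, c]
                argmax[(i // 2, j // 2)] = (i + r, j + c)
        return out, argmax

    def maxpool2x2_backward(grad_out, argmax, x_shape):
        """Each upstream gradient goes to the max position of its window; all other entries stay 0."""
        grad_x = np.zeros(x_shape)
        for (oi, oj), (r, c) in argmax.items():
            grad_x[r, c] += grad_out[oi, oj]
        return grad_x

    x = np.array([[1., 2., 5., 6.],
                  [3., 4., 7., 8.],
                  [9., 1., 2., 0.],
                  [5., 6., 3., 1.]])
    out, argmax = maxpool2x2_forward(x)     # out = [[4., 8.], [9., 3.]]
    grad_x = maxpool2x2_backward(np.ones_like(out), argmax, x.shape)
    print(grad_x)                           # 1s at the four max positions, 0s everywhere else

Caching the argmax indices during the forward pass is what makes the backward pass a simple scatter of the upstream gradients.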

Answered by patapouf_ai on December 9, 2020

@Shinvu's answer is well written; I would just like to point to a video that explains the gradient of the max() operation within a computational graph, which is quick to grasp.

While implementing the max-pool operation (a computational node in a computational graph, i.e. your NN architecture), we need a function that creates a "mask" matrix which keeps track of where the maximum of the matrix is. True (1) indicates the position of the maximum in X; the other entries are False (0). We keep track of the position of the max because this is the input value that ultimately influenced the output, and therefore the cost. Backprop computes gradients with respect to the cost, so anything that influences the ultimate cost should have a non-zero gradient. So backprop will "propagate" the gradient back to this particular input value that influenced the cost, as sketched below.
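
A minimal sketch of that mask idea in NumPy (the helper name create_mask_from_window and the upstream gradient dA are illustrative assumptions, not taken from the video):

    import numpy as np

    def create_mask_from_window(x):
        """True (1) at the position of the maximum of x, False (0) everywhere else."""
        return x == np.max(x)

    window = np.array([[1., 3.],
                       [2., 0.]])
    mask = create_mask_from_window(window)    # [[False  True]
                                              #  [False False]]

    # During backprop, the upstream gradient for this window's single output,
    # say dA = 5.0, is routed only to the position that produced the max:
    dA = 5.0
    d_window = mask * dA                      # [[0. 5.]
                                              #  [0. 0.]]
    print(d_window)

Note that a mask built this way sends the full upstream gradient to every entry that ties for the maximum; implementations that use a single argmax index pick just one of the tied positions instead.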

Answered by Anu on December 9, 2020
