In much of the literature on weight initialization, I find the idea that it’s good to keep the activations stable through the layers, that is, to make sure they stay at roughly the same size/order of magnitude as you go deeper into the network. Sometimes it’s implied that this has something to do with avoiding exploding or vanishing gradients.
For example, in this blog article:
During the forward step, the activations (and then the gradients) can quickly get really big or really small — this is due to the fact that we repeat a lot of matrix multiplications. Either of these effects is fatal for training.
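To convince myself that the repeated-multiplication part of the claim is real, here is a small NumPy sketch (my own illustration, not from the blog) that pushes a random vector through a stack of ReLU layers, with the weight standard deviation set to three different multiples of $1/\sqrt{n}$:

```python
# My own illustration (not from the blog): how activation magnitudes behave
# when a random vector is pushed through many ReLU layers at different
# weight-initialization scales.
import numpy as np

rng = np.random.default_rng(0)
n, depth = 256, 50

for scale in [1.0, np.sqrt(2.0), 2.0]:    # multiplier on top of the 1/sqrt(n) baseline
    x = rng.standard_normal(n)
    for _ in range(depth):
        W = rng.standard_normal((n, n)) * scale / np.sqrt(n)
        x = np.maximum(W @ x, 0.0)        # linear layer followed by ReLU
    print(f"scale = {scale:.3f}: mean |activation| after {depth} layers = {np.abs(x).mean():.3e}")
```

With the smallest scale the activations collapse towards zero within a few dozen layers, with the largest they blow up by many orders of magnitude, and only the $\sqrt{2/n}$ scale keeps them roughly constant.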
But why? To me this seems like a non sequitur; I don’t see why unstable activations would imply unstable gradients. The only thing I can think of is that if we have an edge with weight $w$ connecting a node with activation $f(x)$ to a node with pre-activation $y$, we have the following formula:
$$
\frac{\partial C}{\partial w} = f(x)\,\frac{\partial C}{\partial y}
$$
Is it just because the activation $f(x)$ appears in that formula? I’m not sure, because no source I’ve found says this explicitly. But what if the differential $\frac{\partial C}{\partial y}$ cancels out the effect of $f(x)$, for instance if the activations get really big as we go through the layers while the differentials get correspondingly small?
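For what it’s worth, the formula itself is easy to check numerically. Here is a small PyTorch sketch (my own toy example; the two-layer net and the cost are made up just for the check) that compares the autograd gradient of one weight against $f(x)\,\frac{\partial C}{\partial y}$:

```python
# My own toy check: the gradient of the cost w.r.t. a single weight equals the
# incoming activation times the gradient w.r.t. the pre-activation it feeds.
import torch

torch.manual_seed(0)
x = torch.randn(4)                        # input vector
W1 = torch.randn(3, 4, requires_grad=True)
W2 = torch.randn(2, 3, requires_grad=True)

a = torch.tanh(W1 @ x)                    # hidden activations, i.e. the f(x) above
y = W2 @ a                                # output pre-activations
y.retain_grad()                           # keep dC/dy around for inspection
C = (y ** 2).sum()                        # an arbitrary scalar cost
C.backward()

# W2[0, 1] connects hidden activation a[1] to pre-activation y[0].
lhs = W2.grad[0, 1]                       # dC/dw from autograd
rhs = a[1] * y.grad[0]                    # f(x) * dC/dy from the formula above
print(lhs.item(), rhs.item())             # the two values agree
```

This confirms the formula, but it doesn’t settle whether $\frac{\partial C}{\partial y}$ could grow or shrink in a way that compensates for $f(x)$, which is the part I’m asking about.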