Data Science Asked on November 26, 2020
I’ve been reading the literature on vanishing/exploding gradients and specifically how they connect to weight initialization. An idea I’ve come across a few times, which seems very important in this area, is that we want the variance to remain the same throughout the layers of a neural network, that is, if $v_n$ is the variance of the $n$-th layer, we want all the $v_n$ to be about equal. For example in Kumar 2017:
In this paper, I revisit the oldest, and most widely used approach to the problem with the goal of resolving some of the unanswered theoretical questions which remain in the literature. The problem can be stated as follows: If the weights in a neural network are initialized using samples from a normal distribution, $N(0,v^2)$, how should $v^2$ be chosen to ensure that the variance of the outputs from the different layers are approximately the same?
The paper goes on to claim that "the first systematic analysis of this problem was conducted by Glorot and Bengio", citing Glorot & Bengio (2010). But that paper seems to assume the reader already accepts that keeping the variance stable is a good idea; I can't find anything like an explanation in it. They just make this claim:
From a forward-propagation point of view, to keep information flowing we would like that
$$\forall (i, i'), \ Var[z^i] = Var[z^{i'}]$$
From a back-propagation point of view we would similarly like to have
$$\forall (i, i'), \ Var\left(\frac{\partial Cost}{\partial s^i}\right) = Var\left(\frac{\partial Cost}{\partial s^{i'}}\right)$$
My questions:

1. What variance is actually being taken here? Variance with respect to what?
2. What exactly is helped by the variance being stable in this way?
3. Whatever the answer to (2) is, what is the proof or evidence that that's the case?
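To make question (1) concrete, here is a minimal NumPy sketch of how I would measure this myself: a stack of linear + tanh layers with weights drawn from $N(0, v^2)$, where I take the empirical variance of each layer's outputs over a random batch (whether this is exactly the variance the papers mean is part of what I'm asking):

```python
import numpy as np

rng = np.random.default_rng(0)
n_in, n_layers, batch = 256, 20, 1000

def layer_output_variances(v2):
    """Push a random batch through n_layers (linear + tanh) layers whose weights
    are drawn from N(0, v2); return the empirical variance of each layer's
    outputs, taken over the batch and the units."""
    z = rng.standard_normal((batch, n_in))
    variances = []
    for _ in range(n_layers):
        W = rng.normal(0.0, np.sqrt(v2), size=(n_in, n_in))
        z = np.tanh(z @ W)
        variances.append(z.var())
    return variances

# Compare a few choices of v^2 around the 1/fan_in scale
for v2 in (0.2 / n_in, 1.0 / n_in, 5.0 / n_in):
    v = layer_output_variances(v2)
    print(f"v^2 = {v2:.4f}: var after layer 1 = {v[0]:.3f}, after layer {n_layers} = {v[-1]:.3f}")
```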
Following the notation of the article, let's compute the gradient of the cost function w.r.t. the parameters of two consecutive layers (which we will call layer $i$ and layer $i+1$). In these layers, the quantity used to update the respective weight matrices is given by:
$$\text{Layer } i \rightarrow \frac{\partial Cost}{\partial W^i} = \frac{\partial Cost}{\partial s^i}\frac{\partial s^i}{\partial W^i} = \frac{\partial Cost}{\partial s^i}\,(z^{i-1})^T$$
$$\text{Layer } i+1 \rightarrow \frac{\partial Cost}{\partial W^{i+1}} = \frac{\partial Cost}{\partial s^{i+1}}\frac{\partial s^{i+1}}{\partial W^{i+1}} = \frac{\partial Cost}{\partial s^{i+1}}\,(z^{i})^T$$
There we can see that if we have: $$Var\left(\frac{\partial Cost}{\partial s^i}\right) = Var\left(\frac{\partial Cost}{\partial s^{i+1}}\right) \quad \text{and} \quad Var(z^i) = Var(z^{i-1})$$
Then we would have: $$Var\left(\frac{\partial Cost}{\partial W^i}\right) = Var\left(\frac{\partial Cost}{\partial W^{i+1}}\right)$$
This is a good thing because having the same variance in the updates of both layers means that the updates are spread in the same way globally. So, assuming that the mean value of $\partial Cost/\partial W$ is the same in both layers, these layers are learning at the same rhythm.
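A quick numerical sanity check of this step (a minimal NumPy sketch with synthetic arrays standing in for the backpropagated gradients and the activations): if the two backpropagated signals share one variance and the two activations share another, the resulting weight gradients do end up with approximately the same variance.

```python
import numpy as np

rng = np.random.default_rng(1)
n, batch = 128, 10_000

# Equal variance for the two backpropagated signals, and for the two activations
ds_i   = rng.normal(0.0, 0.3, (batch, n))   # stands in for dCost/ds^i
ds_ip1 = rng.normal(0.0, 0.3, (batch, n))   # stands in for dCost/ds^{i+1}
z_im1  = rng.normal(0.0, 1.0, (batch, n))   # stands in for z^{i-1}
z_i    = rng.normal(0.0, 1.0, (batch, n))   # stands in for z^i

# Batch-averaged outer products, i.e. dCost/dW = (dCost/ds) (z)^T
grad_W_i   = ds_i.T   @ z_im1 / batch
grad_W_ip1 = ds_ip1.T @ z_i   / batch

print("Var of dCost/dW^i    :", grad_W_i.var())
print("Var of dCost/dW^{i+1}:", grad_W_ip1.var())
```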
Regarding the proof or evidence (question 3): a good consequence of the previous reasoning is that if we could make this hold throughout the whole neural network, then all the layers in the NN would be learning at the same rhythm, so problems like vanishing or exploding gradients would be avoided.
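As a rough illustration of that last point, here is a minimal NumPy sketch (assuming a plain fully connected tanh network, with purely random data): with a Glorot-style weight variance of $1/n$ the variance of $\partial Cost/\partial s^i$ at the first layer stays within a few orders of magnitude of the one at the last layer, whereas with a much smaller initialization it collapses by tens of orders of magnitude.

```python
import numpy as np

rng = np.random.default_rng(2)
n, n_layers, batch = 256, 20, 512

def backprop_gradient_variances(w_var):
    """Forward a batch through n_layers (linear + tanh) layers, then backpropagate
    a unit-variance gradient from the top; return Var(dCost/ds^i) for each layer,
    ordered from the last layer down to the first."""
    z = rng.standard_normal((batch, n))
    Ws, pre_acts = [], []
    for _ in range(n_layers):
        W = rng.normal(0.0, np.sqrt(w_var), size=(n, n))
        s = z @ W
        Ws.append(W)
        pre_acts.append(s)
        z = np.tanh(s)
    grad_z = rng.standard_normal((batch, n))        # dCost/dz at the last layer
    grad_vars = []
    for W, s in zip(reversed(Ws), reversed(pre_acts)):
        grad_s = grad_z * (1.0 - np.tanh(s) ** 2)   # dCost/ds = dCost/dz * tanh'(s)
        grad_vars.append(grad_s.var())
        grad_z = grad_s @ W.T                       # dCost/dz of the previous layer
    return grad_vars

for w_var, label in [(1.0 / n, "Glorot-like 1/n"), (0.1 / n, "too small 0.1/n")]:
    gv = backprop_gradient_variances(w_var)
    print(f"{label}: Var(dCost/ds) at last layer = {gv[0]:.2e}, at first layer = {gv[-1]:.2e}")
```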
Correct answer by Javier TG on November 26, 2020