Data Science Asked on February 2, 2021
In this paper, on page 5, we find the formula
$$Var(z^i)=Var(x)\prod_{i'=0}^{i-1}n_{i'}Var(W^{i'})$$
I am really struggling to understand what is meant by this formula. I think at least some of the following are true:
However, it’s entirely unclear to me what these variances are. I thought maybe $Var(y)$ just meant the empirical variance of the vector $y$, i.e. the average of the squared differences of the elements of $y$ from the mean of $y$, and likewise for $Var(x)$ and $Var(W)$, where the latter is just the variance of all of the entries of $W$. But under this interpretation the formula turns out to be false numerically, so I’m at a bit of a loss to understand what this equation is supposed to mean.
$$Var(z^i)=Var(x)\prod_{i'=0}^{i-1}n_{i'}Var(W^{i'})$$
This formula expresses the variance of the activation value of each neuron $k$ of the layer $i$, i.e. $Var(z^i_k)$. This value, under the assumptions that are going to be mentioned throughout this post, is the same for all the neurons in the layer $i$. This is why the authors express it as $Var(z^i)$ instead of a particular $Var(z^i_k)$, just to simplify the notation.
Note that this value is a scalar, as are the other variances involved in the equation: $Var(x)$ and $Var(W^i)$. So, just as a summary, these variances have the following meanings:

- $Var(x)$: the variance of each input feature (assumed to be the same for every feature).
- $Var(W^i)$: the variance of each individual weight of layer $i$ at initialization.
- $Var(z^i)$: the variance of the activation of any single neuron of layer $i$, taken over the randomness of the inputs and of the weight initialization.
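Just to make the product explicit, the formula written out for, say, the second hidden layer ($i=2$) reads:
$$Var(z^2)=Var(x)\prod_{i'=0}^{1}n_{i'}Var(W^{i'})=n_0\,n_1\,Var(W^0)\,Var(W^1)\,Var(x)$$
i.e. the input variance gets multiplied by one factor $n_{i'}Var(W^{i'})$ per weight matrix crossed.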
So now let's see how to reach this formula.
Let's analyse some parts of that page of the article that answer the other questions and that will also be useful to understand the assumptions being made.
For a dense artificial neural network using symmetric activation function $f$ with unit derivative at $0$ (i.e. $f'(0) = 1$), if we write $z^i$ for the activation vector of layer $i$, and $s^i$ the argument vector of the activation function at layer $i$, we have: $$ s^i = z^iW^i + b^i \qquad\qquad z^{i+1}=f(s^i)$$
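As a side note, here is a minimal NumPy sketch of that notation; the layer sizes, the Gaussian initialization and the use of $\tanh$ as the symmetric activation are illustrative assumptions, not something taken from the paper:

```python
# Minimal sketch of the notation: z^0 = x, s^i = z^i W^i + b^i, z^{i+1} = f(s^i).
# Sizes, Gaussian initialization and tanh are assumptions made for illustration.
import numpy as np

rng = np.random.default_rng(0)

def forward(x, weights, biases, f=np.tanh):
    """Return the list of activation vectors z^0, z^1, ..."""
    z, activations = x, [x]
    for W, b in zip(weights, biases):
        s = z @ W + b            # s^i: argument of the activation at layer i
        z = f(s)                 # z^{i+1} = f(s^i)
        activations.append(z)
    return activations

sizes = [4, 5, 3]                                     # n_0, n_1, n_2 (hypothetical)
weights = [rng.normal(0.0, 0.1, size=(n_in, n_out))   # W^0, W^1
           for n_in, n_out in zip(sizes[:-1], sizes[1:])]
biases = [np.zeros(n_out) for n_out in sizes[1:]]     # b^i initialized to 0
x = rng.normal(size=sizes[0])
print([z.shape for z in forward(x, weights, biases)])  # [(4,), (5,), (3,)]
```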
We can draw some conclusions given that paragraph:
The variances will be expressed with respect to the input, output and weight initialization randomness
This means that the variances they are going to use are $Var(x)$, $Var(z^i)$ and $Var(W^i)$ respectively, and that they are taken at initialization (the weights are only randomly spread at initialization).
The variance of the biases is not considered, as they are assumed to be initialized to the same value $\Rightarrow Var(b^i)=0$.
Consider the hypothesis that we are in a linear regime at the initialization, that the weights are initialized independently and that the input features variances are the same ($= Var(x)$).
Given this, we know that at initialization:

- the weights are zero-mean and independent of each other and of the inputs,
- the biases are constant, so they add no variance ($Var(b^i)=0$),
- the activation behaves linearly around $0$, i.e. $f(s)\approx s$ and $f'(0)=1$.
So now, let's compute, for example, the variance of the activations of the first hidden layer ($i=1$), i.e. $Var(z^1_k)$:
$$ \begin{align} Var(z^1_k) &= Var(f(s^0_k)) && (1)\\ &\approx \left(f'(\mathbb{E}(s^0_k))\right)^2\,Var(s^0_k) && (2)\\ &= (f'(0))^2\,Var(s^0_k) && (3)\\ &= Var(s^0_k) && (4) \\ &= Var(W^0_k x + b_k^0) && (5)\\ &= Var(W^0_k x) = Var(w^0_{k1} x_1 + w^0_{k2} x_2 + \dots ) && (6)\\ &= Var(w^0_{k1} x_1) + Var( w^0_{k2} x_2) + \dots && (7)\\ &= n_0\,Var(w^0_{kj})\,Var(x_j) && (8)\\ &= n_0\,Var(W^0)\,Var(x) && (9) \end{align}$$
Note that we would end up with the same expression of $Var(z^i_k)$ for every neuron $k$ of the layer $i$ $\rightarrow$ now we understand why $Var(z^i)$ represents the variance of the activation of each neuron in the layer $i$, i.e. $Var(z^i) = Var(z^i_k)$.
The justifications for each step are listed below, followed by a small numerical check:

1. By definition, $z^1_k = f(s^0_k)$.
2. First-order Taylor approximation of $f$ around $\mathbb{E}(s^0_k)$, which gives $Var(f(s))\approx \left(f'(\mathbb{E}(s))\right)^2 Var(s)$.
3. $\mathbb{E}(s^0_k)=0$, because the weights are zero-mean (and the biases start at $0$).
4. $f'(0)=1$ by hypothesis (linear regime at initialization).
5. Definition of $s^0_k = W^0_k x + b^0_k$, where $W^0_k$ is the vector of weights connecting the inputs to neuron $k$.
6. $Var(b^0_k)=0$, so the bias drops, and the dot product is expanded as a sum over the $n_0$ input features.
7. The cross-covariances between the terms vanish, because the weights are zero-mean and independent of each other and of the inputs (see the covariance computation at the end of this post).
8. Since each $w^0_{kj}$ is zero-mean and independent of the (centered) input $x_j$, $Var(w^0_{kj}x_j)=Var(w^0_{kj})\,Var(x_j)$, and the $n_0$ terms all take the same value.
9. All the weights of layer $0$ share the same variance $Var(W^0)$ and all the input features share the same variance $Var(x)$.
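As a quick sanity check of this single-layer result, here is a sketch (not the paper's experiment): zero-mean Gaussian inputs and weights, $\tanh$ activation, and variances chosen small enough to stay in the linear regime are all assumptions made here. The key point is that the variance is estimated over many random draws of $x$ and $W^0$ for one fixed neuron, not over the entries of a single activation vector:

```python
# Monte Carlo estimate of Var(z^1_0) over random draws of x and W^0,
# compared with n_0 * Var(W^0) * Var(x).  All choices below are illustrative.
import numpy as np

rng = np.random.default_rng(0)
n0, n1 = 100, 50
var_w, var_x = 1e-4, 1.0          # small product keeps tanh near its linear regime
trials = 10000

samples = np.empty(trials)
for t in range(trials):
    W0 = rng.normal(0.0, np.sqrt(var_w), size=(n0, n1))  # zero-mean weights
    x = rng.normal(0.0, np.sqrt(var_x), size=n0)         # zero-mean inputs
    s0 = x @ W0                                          # biases initialized to 0
    samples[t] = np.tanh(s0)[0]                          # z^1_0, one fixed neuron

print(samples.var())               # empirical Var(z^1_0), close to the value below
print(n0 * var_w * var_x)          # n_0 * Var(W^0) * Var(x)
```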
Extending this reasoning to all the layers, we end up with the equation given by the authors: $$Var(z^i)=Var(x)\prod_{i'=0}^{i-1}n_{i'}Var(W^{i'})$$
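And the same kind of hedged sketch for the full product formula, now with several layers; again, the layer sizes, the Gaussian draws and the choice $n_{i'}\,Var(W^{i'}) = 0.8$ are arbitrary illustrative assumptions:

```python
# Monte Carlo check of Var(z^i) = Var(x) * prod_{i'<i} n_{i'} Var(W^{i'})
# for a few tanh layers kept in their linear regime.  Illustrative values only.
import numpy as np

rng = np.random.default_rng(0)
sizes = [100, 80, 60, 50]                  # n_0, n_1, n_2, n_3
var_w = [0.8 / n for n in sizes[:-1]]      # so that n_i' * Var(W^i') = 0.8
var_x, trials = 0.01, 5000

acts = np.zeros((trials, len(sizes)))      # activation of neuron 0 at each layer
for t in range(trials):
    z = rng.normal(0.0, np.sqrt(var_x), size=sizes[0])   # fresh input x
    acts[t, 0] = z[0]
    for i, (n_in, n_out) in enumerate(zip(sizes[:-1], sizes[1:])):
        W = rng.normal(0.0, np.sqrt(var_w[i]), size=(n_in, n_out))  # fresh W^i
        z = np.tanh(z @ W)                                          # b^i = 0
        acts[t, i + 1] = z[0]

for i in range(1, len(sizes)):
    predicted = var_x * np.prod([sizes[j] * var_w[j] for j in range(i)])
    print(f"layer {i}: empirical {acts[:, i].var():.5f}  predicted {predicted:.5f}")
```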
Edit regarding the comments: the reasoning holds even if the features are not independent. This is because we can prove that the covariance between the $z^{i+1}_k$ terms of the same layer is $0$. This also serves as an explanation of step $(7)$ of the example above.
To see this, let's compute $Cov(w_k^iz^i,\,w_j^iz^i)$, where $w_k^i$ and $w_j^i$ represent the vectors of weights related to the neurons $k$ and $j$ of the layer $i$, with $k\neq j$: $$\begin{align} Cov(w_k^iz^i,\,w_j^iz^i) &= \mathbb{E}\left[(w_k^iz^i-\mathbb{E}(w_k^iz^i))(w_j^iz^i-\mathbb{E}(w_j^iz^i))\right] \\ &=\mathbb{E}\left[(w_k^iz^i)(w_j^iz^i)\right] \\ &=\mathbb{E}\left[w_k^iz^i\,w_j^iz^i\right] \\ &=\mathbb{E}\left[w_k^i\right]\mathbb{E}\left[z^i\,w_j^iz^i\right] \\ &=0^T\,\mathbb{E}\left[z^i\,w_j^iz^i\right]=0 \end{align}$$
Note that we can do $\mathbb{E}\left[w_k^iz^i\,w_j^iz^i\right]=\mathbb{E}\left[w_k^i\right]\mathbb{E}\left[z^i\,w_j^iz^i\right]$ because $w_k^i$ is independent of $z^i\,w_j^iz^i$.
Then, by extending this to the other neurons of the layer, we can confirm that $Var(W^iz^i) = n_i\,Var(W^i)\,Var(z^i) \rightarrow$ we can reach the equation given by the authors.
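One last hedged numerical illustration of this point (the correlated construction of $z^i$ below, through a shared latent factor, is just one arbitrary way to make its components dependent): even with correlated features, the cross-covariance is negligible and the variance identity still holds.

```python
# Check that Cov(w_k^i z^i, w_j^i z^i) ≈ 0 and Var(w_k^i z^i) ≈ n_i Var(W^i) Var(z^i)
# when the zero-mean weights are independent of a *correlated* z^i.
import numpy as np

rng = np.random.default_rng(0)
n_i, var_w, trials = 50, 0.02, 50000

# Correlated, unit-variance components of z^i via a shared latent factor.
latent = rng.normal(size=(trials, 1))
z = 0.6 * latent + 0.8 * rng.normal(size=(trials, n_i))   # Var(z_m) = 0.36 + 0.64 = 1

w_k = rng.normal(0.0, np.sqrt(var_w), size=(trials, n_i)) # weights of neuron k
w_j = rng.normal(0.0, np.sqrt(var_w), size=(trials, n_i)) # weights of neuron j
a_k = np.sum(w_k * z, axis=1)                             # w_k^i z^i, one draw per trial
a_j = np.sum(w_j * z, axis=1)                             # w_j^i z^i

print(np.cov(a_k, a_j)[0, 1])         # ≈ 0 despite the correlated features
print(a_k.var(), n_i * var_w * 1.0)   # both ≈ n_i * Var(W^i) * Var(z^i) = 1
```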
Answered by Javier TG on February 2, 2021