Data Science Asked on February 2, 2021
In this paper, on page 5, we find the formula
$$Var(z^i)=Var(x)\prod_{i'=0}^{i-1}n_{i'}Var(W^{i'})$$
I am really struggling to understand what is meant by this formula. I think at least some of the following are true:
However, it’s entirely unclear to me what these variances are. I thought maybe $Var(y)$ just meant the empirical variance of the vector $y$, i.e. the average of the squared differences of the elements of $y$ from the mean of $y$, and likewise for $Var(x)$ and $Var(W)$, where the latter is just the variance of all of the entries of $W$. But under this interpretation the formula turns out to be false numerically, so I’m at a bit of a loss to understand what this equation is supposed to mean.
$$Var(z^i)=Var(x)\prod_{i'=0}^{i-1}n_{i'}Var(W^{i'})$$
This formula expresses the variance of the activation value of each neuron $k$ of the layer $i$, i.e. $Var(z^i_k)$. This value, under the assumptions that are going to be mentioned throughout this post, is the same for all the neurons in the layer $i$. This is why the authors express it as $Var(z^i)$ instead of a particular $Var(z^i_k)$, just to simplify the notation.
Note that this value is a scalar, as are the other variances involved in the equation: $Var(x)$ and $Var(W^i)$. So, just as a summary, these variances have the following meanings:

- $Var(x)$: the variance of each input feature (assumed to be the same for every feature).
- $Var(W^i)$: the variance of each individual weight of layer $i$ at initialization.
- $Var(z^i)$: the variance of the activation of any single neuron of layer $i$, taken over the randomness of the inputs and of the weight initialization.
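Just to make the product explicit, the formula written out for, say, the second hidden layer ($i=2$) reads:
$$Var(z^2)=Var(x)\prod_{i'=0}^{1}n_{i'}Var(W^{i'})=n_0\,n_1\,Var(W^0)\,Var(W^1)\,Var(x)$$
i.e. the input variance gets multiplied by one factor $n_{i'}Var(W^{i'})$ per weight matrix crossed.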
So now let's see how to reach this formula.
Let's analyse some parts of that page of the article that answer the other questions and that will also be useful to understand the assumptions being made.
For a dense artificial neural network using symmetric activation function $f$ with unit derivative at $0$ (i.e. $f'(0) = 1$), if we write $z^i$ for the activation vector of layer $i$, and $s^i$ the argument vector of the activation function at layer $i$, we have: $$ s^i = z^iW^i + b^i \qquad\qquad z^{i+1}=f(s^i)$$
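As a side note, here is a minimal NumPy sketch of that notation; the layer sizes, the Gaussian initialization and the use of $\tanh$ as the symmetric activation are illustrative assumptions, not something taken from the paper:

```python
# Minimal sketch of the notation: z^0 = x, s^i = z^i W^i + b^i, z^{i+1} = f(s^i).
# Sizes, Gaussian initialization and tanh are assumptions made for illustration.
import numpy as np

rng = np.random.default_rng(0)

def forward(x, weights, biases, f=np.tanh):
    """Return the list of activation vectors z^0, z^1, ..."""
    z, activations = x, [x]
    for W, b in zip(weights, biases):
        s = z @ W + b            # s^i: argument of the activation at layer i
        z = f(s)                 # z^{i+1} = f(s^i)
        activations.append(z)
    return activations

sizes = [4, 5, 3]                                     # n_0, n_1, n_2 (hypothetical)
weights = [rng.normal(0.0, 0.1, size=(n_in, n_out))   # W^0, W^1
           for n_in, n_out in zip(sizes[:-1], sizes[1:])]
biases = [np.zeros(n_out) for n_out in sizes[1:]]     # b^i initialized to 0
x = rng.normal(size=sizes[0])
print([z.shape for z in forward(x, weights, biases)])  # [(4,), (5,), (3,)]
```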
We can draw some conclusions given that paragraph:
The variances will be expressed with respect to the input, output and weight initialization randomness
This means that the variances they are going to use are $Var(x)$, $Var(z^i)$ and $Var(W^i)$ respectively, and that they are taken at initialization (the weights are only randomly spread at initialization).
The variance of the biases is not considered, as they are assumed to be initialized to the same value $\Rightarrow Var(b^i)=0$.
Consider the hypothesis that we are in a linear regime at the initialization, that the weights are initialized independently and that the input features variances are the same ($= Var(x)$).
Given this, we know that at initialization:

- the weights are zero-mean and independent of each other and of the inputs,
- the biases are constant, so they add no variance ($Var(b^i)=0$),
- the activation behaves linearly around $0$, i.e. $f(s)\approx s$ and $f'(0)=1$.
So now, let's compute, for example, the variance of the activations of the first hidden layer ($i=1$), i.e. $Var(z^1_k)$:
$$ \begin{align} Var(z^1_k) &= Var(f(s^0_k)) && (1)\\ &\approx \left(f'(\mathbb{E}(s^0_k))\right)^2\,Var(s^0_k) && (2)\\ &= (f'(0))^2\,Var(s^0_k) && (3)\\ &= Var(s^0_k) && (4) \\ &= Var(W^0_k x + b_k^0) && (5)\\ &= Var(W^0_k x) = Var(w^0_{k1} x_1 + w^0_{k2} x_2 + \dots ) && (6)\\ &= Var(w^0_{k1} x_1) + Var( w^0_{k2} x_2) + \dots && (7)\\ &= n_0\,Var(w^0_{kj})\,Var(x_j) && (8)\\ &= n_0\,Var(W^0)\,Var(x) && (9) \end{align}$$
Note that we would end up with the same expression of $Var(z^i_k)$ for every neuron $k$ of the layer $i$ $\rightarrow$ now we understand why $Var(z^i)$ represents the variance of the activation of each neuron in the layer $i$, i.e. $Var(z^i) = Var(z^i_k)$.
The justifications for each step are listed below, followed by a small numerical check:

1. By definition, $z^1_k = f(s^0_k)$.
2. First-order Taylor approximation of $f$ around $\mathbb{E}(s^0_k)$, which gives $Var(f(s))\approx \left(f'(\mathbb{E}(s))\right)^2 Var(s)$.
3. $\mathbb{E}(s^0_k)=0$, because the weights are zero-mean (and the biases start at $0$).
4. $f'(0)=1$ by hypothesis (linear regime at initialization).
5. Definition of $s^0_k = W^0_k x + b^0_k$, where $W^0_k$ is the vector of weights connecting the inputs to neuron $k$.
6. $Var(b^0_k)=0$, so the bias drops, and the dot product is expanded as a sum over the $n_0$ input features.
7. The cross-covariances between the terms vanish, because the weights are zero-mean and independent of each other and of the inputs (see the covariance computation at the end of this post).
8. Since each $w^0_{kj}$ is zero-mean and independent of the (centered) input $x_j$, $Var(w^0_{kj}x_j)=Var(w^0_{kj})\,Var(x_j)$, and the $n_0$ terms all take the same value.
9. All the weights of layer $0$ share the same variance $Var(W^0)$ and all the input features share the same variance $Var(x)$.
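As a quick sanity check of this single-layer result, here is a sketch (not the paper's experiment): zero-mean Gaussian inputs and weights, $\tanh$ activation, and variances chosen small enough to stay in the linear regime are all assumptions made here. The key point is that the variance is estimated over many random draws of $x$ and $W^0$ for one fixed neuron, not over the entries of a single activation vector:

```python
# Monte Carlo estimate of Var(z^1_0) over random draws of x and W^0,
# compared with n_0 * Var(W^0) * Var(x).  All choices below are illustrative.
import numpy as np

rng = np.random.default_rng(0)
n0, n1 = 100, 50
var_w, var_x = 1e-4, 1.0          # small product keeps tanh near its linear regime
trials = 10000

samples = np.empty(trials)
for t in range(trials):
    W0 = rng.normal(0.0, np.sqrt(var_w), size=(n0, n1))  # zero-mean weights
    x = rng.normal(0.0, np.sqrt(var_x), size=n0)         # zero-mean inputs
    s0 = x @ W0                                          # biases initialized to 0
    samples[t] = np.tanh(s0)[0]                          # z^1_0, one fixed neuron

print(samples.var())               # empirical Var(z^1_0), close to the value below
print(n0 * var_w * var_x)          # n_0 * Var(W^0) * Var(x)
```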
Extending this reasoning to all the layers, we end up with the equation given by the authors: $$Var(z^i)=Var(x)\prod_{i'=0}^{i-1}n_{i'}Var(W^{i'})$$
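And the same kind of hedged sketch for the full product formula, now with several layers; again, the layer sizes, the Gaussian draws and the choice $n_{i'}\,Var(W^{i'}) = 0.8$ are arbitrary illustrative assumptions:

```python
# Monte Carlo check of Var(z^i) = Var(x) * prod_{i'<i} n_{i'} Var(W^{i'})
# for a few tanh layers kept in their linear regime.  Illustrative values only.
import numpy as np

rng = np.random.default_rng(0)
sizes = [100, 80, 60, 50]                  # n_0, n_1, n_2, n_3
var_w = [0.8 / n for n in sizes[:-1]]      # so that n_i' * Var(W^i') = 0.8
var_x, trials = 0.01, 5000

acts = np.zeros((trials, len(sizes)))      # activation of neuron 0 at each layer
for t in range(trials):
    z = rng.normal(0.0, np.sqrt(var_x), size=sizes[0])   # fresh input x
    acts[t, 0] = z[0]
    for i, (n_in, n_out) in enumerate(zip(sizes[:-1], sizes[1:])):
        W = rng.normal(0.0, np.sqrt(var_w[i]), size=(n_in, n_out))  # fresh W^i
        z = np.tanh(z @ W)                                          # b^i = 0
        acts[t, i + 1] = z[0]

for i in range(1, len(sizes)):
    predicted = var_x * np.prod([sizes[j] * var_w[j] for j in range(i)])
    print(f"layer {i}: empirical {acts[:, i].var():.5f}  predicted {predicted:.5f}")
```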
Edit regarding the comments: the reasoning holds even if the features are not independent. This is because we can prove that the covariance between the $z^{i+1}_k$ terms of the same layer is $0$. This also serves as an explanation of step $(7)$ of the example above.
To see this, let's compute $Cov(w_k^iz^i,\,w_j^iz^i)$, where $w_k^i$ and $w_j^i$ represent the vectors of weights related to the neurons $k$ and $j$ of the layer $i$, with $k\neq j$: $$\begin{align} Cov(w_k^iz^i,\,w_j^iz^i) &= \mathbb{E}\left[(w_k^iz^i-\mathbb{E}(w_k^iz^i))(w_j^iz^i-\mathbb{E}(w_j^iz^i))\right] \\ &=\mathbb{E}\left[(w_k^iz^i)(w_j^iz^i)\right] \\ &=\mathbb{E}\left[w_k^iz^i\,w_j^iz^i\right] \\ &=\mathbb{E}\left[w_k^i\right]\mathbb{E}\left[z^i\,w_j^iz^i\right] \\ &=0^T\,\mathbb{E}\left[z^i\,w_j^iz^i\right]=0 \end{align}$$
Note that we can do $\mathbb{E}\left[w_k^iz^i\,w_j^iz^i\right]=\mathbb{E}\left[w_k^i\right]\mathbb{E}\left[z^i\,w_j^iz^i\right]$ because $w_k^i$ is independent of $z^i\,w_j^iz^i$.
Then, by extending this to the other neurons of the layer, we can confirm that $Var(W^iz^i) = n_i\,Var(W^i)\,Var(z^i) \rightarrow$ we can reach the equation given by the authors.
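One last hedged numerical illustration of this point (the correlated construction of $z^i$ below, through a shared latent factor, is just one arbitrary way to make its components dependent): even with correlated features, the cross-covariance is negligible and the variance identity still holds.

```python
# Check that Cov(w_k^i z^i, w_j^i z^i) ≈ 0 and Var(w_k^i z^i) ≈ n_i Var(W^i) Var(z^i)
# when the zero-mean weights are independent of a *correlated* z^i.
import numpy as np

rng = np.random.default_rng(0)
n_i, var_w, trials = 50, 0.02, 50000

# Correlated, unit-variance components of z^i via a shared latent factor.
latent = rng.normal(size=(trials, 1))
z = 0.6 * latent + 0.8 * rng.normal(size=(trials, n_i))   # Var(z_m) = 0.36 + 0.64 = 1

w_k = rng.normal(0.0, np.sqrt(var_w), size=(trials, n_i)) # weights of neuron k
w_j = rng.normal(0.0, np.sqrt(var_w), size=(trials, n_i)) # weights of neuron j
a_k = np.sum(w_k * z, axis=1)                             # w_k^i z^i, one draw per trial
a_j = np.sum(w_j * z, axis=1)                             # w_j^i z^i

print(np.cov(a_k, a_j)[0, 1])         # ≈ 0 despite the correlated features
print(a_k.var(), n_i * var_w * 1.0)   # both ≈ n_i * Var(W^i) * Var(z^i) = 1
```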
Answered by Javier TG on February 2, 2021