Maths of Xavier initialization

Data Science Asked on August 14, 2021

The paper I am reading is Glorot and Bengio (2010), and the math in question is in Section 4.2.1.
Formulas (5) and (10) make sense to me, but I cannot derive formulas (6) and (7) from (2) and (3) myself.

Many tutorials I found online use the formula
$$Var[XY] = Var[X]Var[Y] + (E[X])^2 Var[Y] + Var[X](E[Y])^2$$
which requires X and Y to be independent.
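
As a sanity check on that formula, here is a minimal numerical sketch using only the Python standard library. The distributions, their parameters, the sample size, and the helper names (`mean`, `var`, `product_variance_formula`) are my own assumptions for illustration; nothing here comes from the paper.

```python
import random

def mean(values):
    # Sample mean.
    return sum(values) / len(values)

def var(values):
    # Population-style sample variance (divides by n).
    m = mean(values)
    return sum((v - m) ** 2 for v in values) / len(values)

def product_variance_formula(x, y):
    # Var[X]Var[Y] + (E[X])^2 Var[Y] + Var[X](E[Y])^2, using sample moments.
    return var(x) * var(y) + mean(x) ** 2 * var(y) + var(x) * mean(y) ** 2

rng = random.Random(0)  # fixed seed so the run is repeatable
n = 200_000

# X and Y are drawn independently (assumed Gaussians with nonzero means).
x = [rng.gauss(1.0, 2.0) for _ in range(n)]
y = [rng.gauss(-0.5, 1.5) for _ in range(n)]

empirical = var([a * b for a, b in zip(x, y)])
predicted = product_variance_formula(x, y)

print(f"empirical Var[XY]: {empirical:.3f}")
print(f"formula prediction: {predicted:.3f}")
```

With independent draws the two numbers should agree up to sampling noise (the exact value for these assumed parameters is 4 * 2.25 + 1 * 2.25 + 4 * 0.25 = 12.25).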

But in formulas (2) and (3), the gradients are not independent of W and Z, because all of these quantities depend on one another through the output of the last layer.
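
To show why the independence assumption matters, here is a toy counterexample in plain Python. It is not the network from the paper: I simply take Y = X with X standard normal (my own choice), which is the most extreme form of dependence. The helper names are hypothetical.

```python
import random

def mean(values):
    # Sample mean.
    return sum(values) / len(values)

def var(values):
    # Population-style sample variance (divides by n).
    m = mean(values)
    return sum((v - m) ** 2 for v in values) / len(values)

rng = random.Random(0)  # fixed seed so the run is repeatable
n = 200_000

# X is standard normal; Y is fully dependent on X (toy assumption: Y = X).
x = [rng.gauss(0.0, 1.0) for _ in range(n)]
y = list(x)

# Left-hand side: the actual variance of the product, here Var[X^2].
empirical = var([a * b for a, b in zip(x, y)])

# Right-hand side: what the independence-based formula would claim.
predicted = var(x) * var(y) + mean(x) ** 2 * var(y) + var(x) * mean(y) ** 2

print(f"empirical Var[XY]: {empirical:.3f}")   # close to Var[X^2] = 2
print(f"formula prediction: {predicted:.3f}")  # close to 1
```

For a standard normal X, Var[X^2] = E[X^4] - (E[X^2])^2 = 3 - 1 = 2, while the formula gives about 1, so applying it to dependent quantities can be off by a large factor.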

I would appreciate it if anyone could give me a derivation of formulas (6) and (7).
Thanks in advance.

deep learning mathematics neural network statistics
