In the paper that introduced Batch Normalization, on page 5, the authors write the equation
$$\frac{\partial\,\text{BN}((aW)u)}{\partial u} = \frac{\partial\,\text{BN}(Wu)}{\partial u}$$
Here $W$ is the matrix of weights connecting the layer $u$ to the next, batch-normalized layer, so the conclusion is that scaling the weights by a constant doesn’t affect this partial derivative.
This seems false to me. Let $b$ be the value of some output neuron and $u$ the layer above, so that:
$$b=\sum_i w_iu_i$$
Now let $\hat b$ be the batch-normalized version of $b$:
$$\hat b = b - \frac1N\sum_i b^i$$
where by $b^i$ I mean the value of the neuron $b$ for the $i$-th training input in the batch. We have
$$\hat b = \sum_i w_iu_i - \frac1N\sum_j b^j = \sum_i w_iu_i - \frac 1N\sum_j \left(\sum_i w_iu_i^j\right)$$
Since the values $u_i$ never appear in the second sum, we simply have
$$\partial_{u_i} \hat b = w_i,$$
which very much does scale with $W$. Am I making a mistake, or misinterpreting the original equation?
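
For concreteness, here is a minimal numerical sketch of the calculation above, assuming the simplified mean-subtraction-only normalization used in this question (no variance normalization or learned parameters). The array names and the finite-difference helper are purely illustrative; it just checks whether the gradient of $\hat b$ with respect to $u$ changes when $W$ is scaled by a constant.

```python
import numpy as np

# Mean-only "batch normalization" as set up in the question:
#   b^j = sum_i w_i u_i^j,   hat b^j = b^j - (1/N) sum_k b^k

rng = np.random.default_rng(0)
N, D = 8, 5                      # batch size, input dimension (arbitrary)
U = rng.normal(size=(N, D))      # batch of inputs u^1, ..., u^N
w = rng.normal(size=D)           # weights of a single output neuron b

def bn_output(U, w):
    """Mean-centered outputs hat b^j for the whole batch."""
    b = U @ w
    return b - b.mean()

def grad_wrt_u0(U, w, eps=1e-6):
    """Finite-difference gradient of hat b^0 with respect to u^0."""
    g = np.zeros_like(w)
    for i in range(len(w)):
        Up = U.copy(); Up[0, i] += eps
        Um = U.copy(); Um[0, i] -= eps
        g[i] = (bn_output(Up, w)[0] - bn_output(Um, w)[0]) / (2 * eps)
    return g

g1 = grad_wrt_u0(U, w)       # gradient with weights w
g2 = grad_wrt_u0(U, 2 * w)   # gradient with weights scaled by a = 2

print(g1)
print(g2)  # roughly twice g1: under mean-only centering the gradient scales with W
```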