Asked by Toby on January 10, 2021
The technique in the title is one of the notable tricks introduced in Progressive GAN, a paper by the NVIDIA team. About this method, they claim:
Our approach ensures that the dynamic range, and thus the learning speed, is the same for all weights.
In detail, they initialize all learnable parameters from the normal distribution $N(0, 1)$. Then, at every forward pass during training, they scale the result by the per-layer normalization constant from He's initializer.
I reproduced the code from the PyTorch GAN Zoo GitHub repo:
import math
from numpy import prod

def forward(self, x, equalized):
    # compute the He constant from the shape of the weight tensor W:
    # fan_in = in_channels * kernel_height * kernel_width
    size = self.module.weight.size()
    fan_in = prod(size[1:])
    weight = math.sqrt(2.0 / fan_in)
    '''
    A module example:
    import torch.nn as nn
    module = nn.Conv2d(nChannelsPrevious, nChannels, kernelSize,
                       padding=padding, bias=bias)
    '''
    x = self.module(x)
    if equalized:
        # scale the layer output by the He constant
        x *= weight
    return x
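For completeness, here is a minimal, self-contained sketch of the kind of wrapper module this forward method lives in. The class name EqualizedLayer and its constructor arguments are my own illustration of how I assume the pieces fit together, not the exact GAN Zoo API:

import math

import torch
import torch.nn as nn
from numpy import prod


class EqualizedLayer(nn.Module):
    # hypothetical wrapper (my own naming): initialize the wrapped module's
    # weights from N(0, 1) and rescale its output by the He constant at
    # every forward pass
    def __init__(self, module, equalized=True):
        super().__init__()
        self.module = module
        self.equalized = equalized
        # trivial N(0, 1) initialization, as described in the ProGAN paper
        self.module.weight.data.normal_(0, 1)
        if self.module.bias is not None:
            self.module.bias.data.zero_()

    def forward(self, x):
        # He constant c = sqrt(2 / fan_in), recomputed from the weight shape
        size = self.module.weight.size()
        fan_in = prod(size[1:])
        c = math.sqrt(2.0 / fan_in)
        x = self.module(x)
        if self.equalized:
            x *= c
        return x


# usage: wrap a plain convolution and run a dummy batch through it
conv = EqualizedLayer(nn.Conv2d(512, 512, 3, padding=1))
y = conv(torch.randn(1, 512, 4, 4))

Note that multiplying the layer's output by the constant also scales the bias; apart from that detail, it is equivalent to scaling the weights themselves by $c$.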
At first, I thought the He constant would be $c = \frac{\sqrt{2}}{\sqrt{n_l}}$ as in He's paper. Normally $n_l > 2$, so $c < 1$, and the formula in ProGAN's paper, $\hat{w}_i = \frac{w_i}{c}$, scales the weights up, which increases the gradients during backpropagation and thus prevents vanishing gradients.
However, the code shows that $\hat{w}_i = w_i \cdot c$, i.e. the weights are scaled down.
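To make the magnitudes concrete, here is a quick sanity check with an arbitrary example shape, showing that $c = \sqrt{2 / n_l}$ is far below 1 for a typical convolution, so multiplying by $c$ shrinks the result while dividing by $c$ would enlarge it:

import math

# example layer shape (hypothetical): 3x3 convolution with 512 input channels
fan_in = 512 * 3 * 3
c = math.sqrt(2.0 / fan_in)
print(c)        # ~0.021, so c < 1
print(1.0 / c)  # ~48, the factor you would get by dividing by c instead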
In summary, I can't understand why scaling the parameters down at every forward pass during training helps keep the learning speed stable.
I have asked this question in several communities, e.g. Stack Overflow, Mathematics, and Data Science, and still haven't received an answer.
Please help me understand it, thank you!