
Problem with convergence of ReLU in MLP

Data Science Asked by Bartosz Gardziński on March 9, 2021

I created a neural network from scratch in Python using only NumPy, and I'm playing with different activation functions. What I observed is quite weird and I would love to understand why it happens.

The problem I observed depends on the initial weights. When using the sigmoid function, it does not matter much whether the weights are random numbers in the range [0,1], [-1,1], or [-0.5,0.5]. But when using ReLU, the network very often has a huge problem ever converging when I use random weights in the range [-1,1]. When I changed the weight initialization range to [-0.5,0.5], it started to work. This only applies to the ReLU activation function, and I totally don't get why it won't work for [-1,1]. Shouldn't it be able to converge with any random weights?

Also, when I changed the initial weights to a normal distribution, there was no problem with convergence. I understand that a normal distribution should work better and faster than uniform [-1,1]. What I don't understand is why the network can't converge at all (the error remains the same epoch after epoch) with [-1,1], yet has no problem converging with a normal distribution… Shouldn't it always be able to converge, just slower or faster depending on the initialization method?

PS. I'm using standard backpropagation with softmax as the last layer and MSE as the loss function.

One Answer

I will start with a toy example for the convergence part. Suppose that the loss function is $f(x) = x^4$ and we want to minimize it using gradient descent. Clearly, the minimum is attained at zero and, in general, we would like the magnitude of the current iterate to decrease.

The update rule of gradient descent is $$ x_{k+1} = x_k - \lambda \nabla f(x_k) = x_k - \lambda \cdot 4x_k^3.$$ Simplifying the expression, we get $$x_{k+1} = x_k(1 - 4\lambda x_k^2).$$ And now the combination of initialization and learning rate starts to appear. If $|1-4\lambda x_0^2| < 1$, then $|x_0| > |x_1| > |x_2| > \dots$, and the sequence will go to zero eventually. If $|1-4\lambda x_0^2| > 1$, then $|x_0| < |x_1|$. In this case, $|x_1| < |x_2|$ and so on -- the sequence will grow. Therefore, if the learning rate $\lambda$ is fixed, the initial value $x_0$ determines whether gradient descent converges or not.
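Here is a minimal NumPy sketch of this toy example (the learning rate and the two starting points are my own illustrative choices, not values from the answer): with the same fixed learning rate, a small $x_0$ shrinks towards the minimum while a large $x_0$ blows up.

```python
import numpy as np

# Gradient descent on f(x) = x^4, with grad f(x) = 4 x^3.
def gd_path(x0, lr=0.05, steps=6):
    xs = [x0]
    for _ in range(steps):
        xs.append(xs[-1] - lr * 4 * xs[-1] ** 3)  # x_{k+1} = x_k - lr * 4 x_k^3
    return np.array(xs)

# |1 - 4*lr*x0^2| = 0.8 < 1: the magnitudes shrink towards the minimum at 0.
print(gd_path(x0=1.0))

# |1 - 4*lr*x0^2| = 2.2 > 1: the magnitudes grow, gradient descent diverges.
print(gd_path(x0=4.0))
```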

When does gradient descent converge? The math says that gradient descent converges when the learning rate $\lambda$ and the gradient $\nabla f$ satisfy $$|\lambda \nabla f| < 1$$ along the optimization path. Note that this condition does not need to hold for all values of $x$, but only at every iterate $x_0, x_1, x_2, \dots$

For many "good" functions, it suffices to require $|\lambda \nabla f| < 1$ only at $x_0$. The reason is that after the first iteration, we are closer to the local minimum, and for many "good" functions this means that the gradient will be smaller.
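As a quick sanity check (continuing the toy sketch above, with the same illustrative learning rate), we can evaluate this condition at $x_0$ directly:

```python
lr = 0.05                   # same illustrative learning rate as above
grad = lambda x: 4 * x**3   # gradient of f(x) = x^4

for x0 in (1.0, 4.0):
    print(x0, abs(lr * grad(x0)) < 1)
# 1.0 True   -> condition holds at x_0, the descent above converges
# 4.0 False  -> condition fails at x_0, the descent above diverges
```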

What about choosing the weights in $[-0.5, 0.5]$ versus $[-1, 1]$? I think of it as follows: suppose that we selected weights uniformly in $[-0.5, 0.5]$ (model 1) and then multiplied all weights by 2 to get a uniform distribution in $[-1, 1]$ (model 2). Suppose that the learning rate is identical in both cases, and let's check how SGD performs. For simplicity of the argument, I replace it with gradient descent.

How does this transfer to the NN? Note that a linear map (say, a dense layer) has the following property: if all weights $W$ are multiplied by 2, then $$\|2W\| = 2\|W\|.$$ ReLU is quasi-linearly scalable: for every $a > 0$, $$a\cdot\text{ReLU}(x)=\text{ReLU}(ax).$$ Note that if your NN has $d$ layers, then multiplying all weights by 2 increases your output by a factor of $2^d$ (a factor of 2 for each layer).
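A small sketch of this scaling property (my own construction, not the asker's code): a bias-free ReLU MLP whose output grows by exactly $2^d$ when every weight matrix is doubled. Biases and the softmax layer are omitted for simplicity.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 3                                     # number of dense layers
widths = [5, 8, 8, 4]                     # layer sizes: input -> hidden -> output
Ws = [rng.uniform(-0.5, 0.5, size=(widths[i], widths[i + 1])) for i in range(d)]

def forward(x, weights):
    # bias-free dense layers with ReLU; ReLU(a*z) = a*ReLU(z) for a > 0
    for W in weights:
        x = np.maximum(0.0, x @ W)
    return x

x = rng.uniform(-1.0, 1.0, size=(1, widths[0]))
out1 = forward(x, Ws)                     # "model 1": weights in [-0.5, 0.5]
out2 = forward(x, [2.0 * W for W in Ws])  # "model 2": the same weights times 2

print(np.allclose(out2, out1 * 2**d))     # True: the output grew by a factor of 2^d
```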

It is hard to compute the gradient of the NN exactly, but using the product rule, I expect it to increase by approximately a factor of $2^{d}$. If model 1 satisfies the conditions for gradient descent convergence, then we need to decrease the learning rate by about a factor of $2^{d}$ to guarantee the convergence condition for model 2.

Correct answer by kate-melnykova on March 9, 2021
