
Chain function in backpropagation

Data Science Asked by Alon on September 21, 2020

I’m reading a Neural Networks Tutorial. In order to answer my question you might have to take a brief look at it.

I understand everything until something they declare as "chain function":

[image: the chain-rule equation from the tutorial]

It’s located under 4.5 Backpropagation in depth.

I know that the chain rule says that the derivative of a composite function equals the product of the derivative of the inner function and the derivative of the outer function, but I still don't understand how they got to this equation.

Update:

Please read the article and try to only use the terms and notation that are declared there. It's confusing and tough enough anyway, so I really don't need extra knowledge now. Just use the same "language" as in the article and explain how they arrived at that equation.

2 Answers

This is the Neural Network -

[image: the network diagram from the tutorial]

This is the equation -


$\frac{\partial J}{\partial w_{12}^{(2)}} = \frac{\partial J}{\partial h_{1}^{(3)}} \frac{\partial h_{1}^{(3)}}{\partial z_{1}^{(2)}} \frac{\partial z_{1}^{(2)}}{\partial w_{12}^{(2)}}$

$J$ - the calculated loss

$w_{12}^{(2)}$ - the weight with respect to which the rate of change of the loss is calculated

$h_{1}^{(3)}$ - the output of the output neuron after its activation [this is compared with the true value to get the loss]

$z_{1}^{(2)}$ - the output-neuron input before the activation function is applied

We build the chain so that each individual partial derivative is easy to calculate, and so that we obtain the derivative relating two quantities that are not directly connected, i.e. an inner-layer weight and the loss.

First term - Derivative of the loss w.r.t. the output -
$J = (y_{true} - h_{1}^{(3)})^2$ [Assuming a squared loss]
The derivative w.r.t. $h_{1}^{(3)}$ will be $2(h_{1}^{(3)} - y_{true})$
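As a sanity check, here is a minimal plain-Python sketch (the names and numbers are mine, not from the article) that compares this derivative with a finite-difference estimate:

```python
# Quick numerical check of dJ/dh for the squared loss
def loss(h, y_true):
    return (y_true - h) ** 2       # squared loss J

h, y_true = 0.7, 1.0
analytic = 2 * (h - y_true)        # dJ/dh = 2(h - y_true)

eps = 1e-6
numeric = (loss(h + eps, y_true) - loss(h - eps, y_true)) / (2 * eps)

print(analytic, numeric)           # both approximately -0.6
```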

Second term - Derivative of the output after activation w.r.t. the output before activation -
$h_{1}^{(3)} = z_{1}^{(2)}$ [Assuming a linear output activation]
The derivative will be 1.
If the activation is a sigmoid, call the sigmoid $f(x)$:
$h_{1}^{(3)} = f(z_{1}^{(2)})$
We know the derivative of the sigmoid is $f(x)(1 - f(x))$,
so the derivative is $f(z_{1}^{(2)})(1 - f(z_{1}^{(2)}))$
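Here is a small NumPy sketch (names are mine) that checks $f'(x) = f(x)(1 - f(x))$ against a numerical derivative:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_prime(x):
    s = sigmoid(x)
    return s * (1.0 - s)                 # f(x) * (1 - f(x))

z = 0.3                                  # some pre-activation value, e.g. z_1^(2)
eps = 1e-6
numeric = (sigmoid(z + eps) - sigmoid(z - eps)) / (2 * eps)

print(sigmoid_prime(z), numeric)         # both approximately 0.2445
```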

Third term - Derivative of the output before activation w.r.t. one of the three participating weights -
$z_{1}^{(2)} = h_{1}^{(2)} w_{11}^{(2)} + h_{2}^{(2)} w_{12}^{(2)} + h_{3}^{(2)} w_{13}^{(2)} + b$
The first term, the last term, and the bias differentiate to 0; the derivative of the middle term is $h_{2}^{(2)}$.
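Putting the three terms together, here is a minimal NumPy sketch of the full chain for this one weight (it assumes the linear output activation and squared loss used above; the numbers and variable names are made up for illustration). The product of the three partial derivatives matches a finite-difference estimate of $\frac{\partial J}{\partial w_{12}^{(2)}}$:

```python
import numpy as np

h_prev = np.array([0.2, 0.5, 0.1])   # h_1^(2), h_2^(2), h_3^(2): previous-layer outputs
w = np.array([0.4, -0.3, 0.6])       # w_11^(2), w_12^(2), w_13^(2)
b, y_true = 0.05, 1.0

def forward(w):
    z = h_prev @ w + b               # z_1^(2): weighted sum before activation
    h_out = z                        # h_1^(3): linear output activation
    return (y_true - h_out) ** 2     # squared loss J

# The three terms of the chain
z = h_prev @ w + b
h_out = z
dJ_dh = 2 * (h_out - y_true)         # dJ / dh_1^(3)
dh_dz = 1.0                          # dh_1^(3) / dz_1^(2), linear activation
dz_dw12 = h_prev[1]                  # dz_1^(2) / dw_12^(2) = h_2^(2)
chain = dJ_dh * dh_dz * dz_dw12

# Finite-difference check on w_12^(2)
eps = 1e-6
w_plus, w_minus = w.copy(), w.copy()
w_plus[1] += eps
w_minus[1] -= eps
numeric = (forward(w_plus) - forward(w_minus)) / (2 * eps)

print(chain, numeric)                # the two values agree (about -0.96)
```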

Correct answer by 10xAI on September 21, 2020

Well, maybe you should first understand how it is done in a perceptron and then fill in the gaps.

but still don't understand how they got to this equation.

In gradient descent we want to find the direction of steepest descent along the error surface, which we can do by calculating the derivative of $E$ with respect to each component of the weight vector, so we can write it like this. The negative of this gradient gives the steepest descent direction.

$\nabla E(\vec{w}) = \left[\frac{\partial E}{\partial w_0}, \frac{\partial E}{\partial w_1}, \ldots, \frac{\partial E}{\partial w_n}\right]$
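For illustration, a tiny sketch of one run of gradient descent on a toy error surface (the surface and the names are mine, just to show the update rule $w \leftarrow w - \eta \nabla E$):

```python
import numpy as np

# Toy error surface E(w) = ||w - w_star||^2, whose gradient is 2 * (w - w_star)
w_star = np.array([1.0, -2.0])
w = np.zeros(2)
eta = 0.1                            # learning rate

for _ in range(50):
    grad_E = 2 * (w - w_star)        # dE/dw_i for each weight component
    w = w - eta * grad_E             # step along the negative gradient

print(w)                             # approaches w_star = [1, -2]
```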

Now, in backpropagation, to put it in very simple terms, we have all these weights (now from one unit to another, since it is a network) which we want to tune so as to minimize the error for the complete network. For that we calculate the error in the forward pass and then go back and tune the weights accordingly. This time we have to figure out how to do it when there are weights from one node to another (your article is trying to show you this). The good news is that we have an equation which tells us how to do it for a single weight. From there it is basically straightforward: we just need to keep track of all these weights from one unit to another.

So now we have weights like this, from one unit to another, and the weighted sum into each unit:

[image: the network weights and the weighted sum into each unit]

And we again solve it by applying the chain rule, like this:

[image: the chain-rule expansion of the derivative of $E$ with respect to a weight]

Now, in your article they are doing the same thing; they wrote this equation to explain how a change in a weight can influence the error of the whole network, so it is just the mathematical representation. For the chain rule you can read https://www.khanacademy.org/math/ap-calculus-ab/ab-differentiation-2-new/ab-3-1a/v/chain-rule-introduction

Please read Tom Mitchell's chapter on Neural Networks.

Here is another answer to it: Compute backpropagation

Answered by BlackCurrant on September 21, 2020
