
In the case of invalid actions, which output probability matrix should we use in back-propagation?

Artificial Intelligence · Asked by guineu on February 20, 2021

As discussed in this thread, you can handle invalid moves in reinforcement learning by setting the probabilities of all illegal moves to zero and renormalising the output vector.

In back-propagation, which probability matrix should we use? The raw output probabilities, or the post-processed vector?
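For concreteness, here is a minimal sketch of the masking step I mean (PyTorch is just an assumption on my part; the framework itself does not matter for the question):

```python
import torch

def mask_and_renormalize(probs: torch.Tensor, legal_mask: torch.Tensor) -> torch.Tensor:
    """Set the probability of every illegal action to zero and renormalise.

    probs      -- raw network output, shape (batch, num_actions), rows sum to 1
    legal_mask -- boolean tensor of the same shape, True where an action is legal
    """
    masked = probs * legal_mask.to(probs.dtype)        # illegal actions get probability 0
    return masked / masked.sum(dim=-1, keepdim=True)   # each row sums to 1 again
```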

One Answer

I am also quite new to this field, but I think you should use the normalized outputs for backpropagation. In general, you want to backpropagate through all the calculations you did in the forward pass, so why would you exclude the normalization step from the backward pass? Skipping it would essentially make the renormalization have no effect: you would get different loss values, but no different model weight updates.

For example, in policy gradients you backpropagate through the log probability of the selected action. In the forward pass, the sampling that determines which action is selected is not affected by the renormalization (you might just get different loss values at the end in your loss function). In the backward pass, however, you need the actual value of that log probability to calculate the gradient that updates the model weights.

So (I think) the normalization is mostly done for backpropagation: it gives you "renormalized" gradients and avoids unbalanced gradients between states with more or fewer allowed actions.
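To make that concrete, here is a rough REINFORCE-style sketch of what I mean, where the gradient flows back through the renormalized probabilities. PyTorch and the toy dimensions are my own assumptions, not something taken from the question:

```python
import torch
import torch.nn as nn

# Toy setup (illustrative): 4-dimensional states, 5 discrete actions.
policy_net = nn.Sequential(nn.Linear(4, 32), nn.ReLU(), nn.Linear(32, 5), nn.Softmax(dim=-1))
states = torch.randn(8, 4)                          # batch of 8 states
legal_mask = torch.rand(8, 5) > 0.3                 # which actions are legal in each state
legal_mask[:, 0] = True                             # make sure every state has a legal action
returns = torch.randn(8)                            # stand-in for observed returns

probs = policy_net(states)                          # raw output probabilities
masked = probs * legal_mask.to(probs.dtype)         # zero out illegal actions
renorm = masked / masked.sum(dim=-1, keepdim=True)  # the post-processed vector from the question

dist = torch.distributions.Categorical(probs=renorm)
actions = dist.sample()                             # sampling uses the renormalized probabilities
log_probs = dist.log_prob(actions)                  # this log probability includes the renormalization

loss = -(log_probs * returns).mean()                # policy-gradient objective
loss.backward()                                     # so the weight update backpropagates through it
```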

Answered by Marcel_marcel1991 on February 20, 2021
