
In the case of invalid actions, which output probability matrix should we use in back-propagation?

Artificial Intelligence · Asked by guineu on February 20, 2021

As discussed in this thread, you can handle invalid moves in reinforcement learning by setting the probabilities of all illegal moves to zero and renormalising the output vector.

In back-propagation, which probability matrix should we use? The raw output probabilities, or the post-processed vector?
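For concreteness, here is a minimal sketch of the masking step I mean (PyTorch is just an assumption on my part; the framework itself does not matter for the question):

```python
import torch

def mask_and_renormalize(probs: torch.Tensor, legal_mask: torch.Tensor) -> torch.Tensor:
    """Set the probability of every illegal action to zero and renormalise.

    probs      -- raw network output, shape (batch, num_actions), rows sum to 1
    legal_mask -- boolean tensor of the same shape, True where an action is legal
    """
    masked = probs * legal_mask.to(probs.dtype)        # illegal actions get probability 0
    return masked / masked.sum(dim=-1, keepdim=True)   # each row sums to 1 again
```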

One Answer

I am also quite new to this field, but I think you should use the normalized outputs for backpropagation. In general, you want to backpropagate through all the calculations you did in the forward pass, so why would you exclude the normalization step from the backward pass? Skipping it would essentially make the renormalization have no effect: you would get different loss values, but no different model weight updates.

For example, in policy gradients you backpropagate through the log probability of the selected action. In the forward pass, the sampling that determines which action is selected is not affected by the renormalization (you might just get different loss values at the end in your loss function). In the backward pass, however, you need the actual value of that log probability to calculate the gradient that updates the model weights.

So (I think) the normalization is mostly done for backpropagation: it gives you "renormalized" gradients and avoids unbalanced gradients between states with more or fewer allowed actions.
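To make that concrete, here is a rough REINFORCE-style sketch of what I mean, where the gradient flows back through the renormalized probabilities. PyTorch and the toy dimensions are my own assumptions, not something taken from the question:

```python
import torch
import torch.nn as nn

# Toy setup (illustrative): 4-dimensional states, 5 discrete actions.
policy_net = nn.Sequential(nn.Linear(4, 32), nn.ReLU(), nn.Linear(32, 5), nn.Softmax(dim=-1))
states = torch.randn(8, 4)                          # batch of 8 states
legal_mask = torch.rand(8, 5) > 0.3                 # which actions are legal in each state
legal_mask[:, 0] = True                             # make sure every state has a legal action
returns = torch.randn(8)                            # stand-in for observed returns

probs = policy_net(states)                          # raw output probabilities
masked = probs * legal_mask.to(probs.dtype)         # zero out illegal actions
renorm = masked / masked.sum(dim=-1, keepdim=True)  # the post-processed vector from the question

dist = torch.distributions.Categorical(probs=renorm)
actions = dist.sample()                             # sampling uses the renormalized probabilities
log_probs = dist.log_prob(actions)                  # this log probability includes the renormalization

loss = -(log_probs * returns).mean()                # policy-gradient objective
loss.backward()                                     # so the weight update backpropagates through it
```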

Answered by Marcel_marcel1991 on February 20, 2021
