TransWikia.com

Reinforcement learning: negative reward (punish) illegal actions?

Data Science Asked by BigBadMe on September 2, 2020

If you train an agent using reinforcement learning (with a Q-function in this case), should you give a negative reward (punish) if the agent proposes illegal actions for the presented state?

I guess that over time, if you only select from among the legal actions, the illegal ones would eventually drop out. But would punishing them cause them to drop out sooner, and possibly cause the agent to explore more of the possible legal actions sooner?

To expand on this further: say you're training an autonomous vehicle, and the output is drive direction (forward or reverse) and speed. Say that, for the scenario you're in, the vehicle must drive within a speed range, e.g. 20mph min, 40mph max. What do you do when the agent gives an action to drive forward but gives a speed below the minimum speed? Or, as another example, say you're training an agent to play a game, and it proposes an illegal action which it cannot perform.

I can’t proceed with the action because it’s illegal, so what do I do? How do I proceed with training in that situation? I will of course enter the min/max speeds as part of the state given to the agent, but how do I prevent it from proposing actions that are illegal, and how do I proceed with training when it does?

One Answer

I think you should define more precisely what an illegal action is. Suppose I am on a highway in a self-driving car. I don't know what your legal limits are, but suppose they are 20mph and 40mph. Of course, the car itself can physically drive slower than 20mph, and let us suppose that its maximum possible speed is 60mph. If your self-driving car is driving at less than 20mph or between 40mph and 60mph, you should give a negative reward for each time step it spends outside the legal limits. If instead the agent asks for more than 60mph, the problem lies in your environment: 60mph is a physical limitation, so it should be handled by the environment. An easy solution would be to clip the action to the range 0mph to 60mph and give a negative reward if the resulting speed is in the range 0mph to 20mph or in the range 40mph to 60mph.
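To make the clip-then-penalize idea concrete, here is a minimal sketch in Python. The function name `apply_speed_action`, the constant names, and the penalty size of -1.0 per time step are all assumptions made for illustration. Treating exactly 20mph and 40mph as legal is also an assumption, since the limits above don't say which side the boundaries fall on.

```python
# Hypothetical values taken from the example above; tune them for your task.
LEGAL_MIN_MPH = 20.0      # legal lower limit
LEGAL_MAX_MPH = 40.0      # legal upper limit
PHYSICAL_MIN_MPH = 0.0    # the car cannot go slower than this
PHYSICAL_MAX_MPH = 60.0   # the car cannot go faster than this
PENALTY = -1.0            # assumed per-step penalty, not a recommended value


def apply_speed_action(requested_mph):
    """Return (executed_speed, penalty) for one time step.

    The environment clips the requested speed to what the car can
    physically do, then penalizes speeds outside the legal range.
    """
    # Physical limits are enforced by the environment, not learned.
    executed = min(max(requested_mph, PHYSICAL_MIN_MPH), PHYSICAL_MAX_MPH)

    # Legal limits are enforced through the reward signal.
    outside_legal = executed < LEGAL_MIN_MPH or executed > LEGAL_MAX_MPH
    penalty = PENALTY if outside_legal else 0.0
    return executed, penalty
```

In a training loop, the environment would execute `executed` rather than the raw request and add `penalty` to whatever task reward the step produces, so training can always proceed even when the agent proposes an out-of-range speed.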

Answered by gvgramazio on September 2, 2020
