Data Science Asked by Thomas Johnson on September 29, 2020
I’m trying to understand how Q-learning deals with games where the optimal policy is a mixed strategy. The Bellman equation says that you should act greedily, choosing $\arg\max_a Q(s,a)$, but this implies a single deterministic action for each $s$. Is Q-learning just not appropriate if you believe that the problem has a mixed-strategy optimum?
One possibility is to use a softmax and choose each action $a$ randomly with probability $p = \frac{\exp(Q(s,a))}{\sum_{a'} \exp(Q(s,a'))}$. I don't think that is still Q-learning, though.
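The softmax (Boltzmann) action selection described above can be sketched as follows. This is a minimal illustration, not taken from the thread; the temperature parameter is a common addition that interpolates between greedy selection (low temperature) and uniform random selection (high temperature):

```python
import math
import random

def softmax_policy(q_values, temperature=1.0):
    """Return action probabilities p(a) ∝ exp(Q(s,a) / temperature).

    Subtracting the max Q-value before exponentiating avoids overflow;
    it does not change the resulting probabilities.
    """
    m = max(q_values)
    exps = [math.exp((q - m) / temperature) for q in q_values]
    total = sum(exps)
    return [e / total for e in exps]

# Example: sample an action stochastically from Q(s, ·)
q_s = [1.0, 2.0, 0.5]          # hypothetical Q-values for one state
p = softmax_policy(q_s)
action = random.choices(range(len(q_s)), weights=p)[0]
```

Note that even with this stochastic behavior policy, standard Q-learning still updates toward $\max_a Q(s',a)$, so it remains an off-policy method learning the greedy (deterministic) policy's values.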
Answered by Robin Nicole on September 29, 2020