Asked by Ruye on Cross Validated, December 1, 2021
In the video by Prof Brunskill, "Stanford CS234 Winter 2019, Lecture 4" on model-free control (https://www.youtube.com/watch?v=j080VBVGkfQ), at 57:49 the pseudocode for SARSA includes a line 8 that performs an ε-greedy update of the current policy π. The output of this algorithm therefore seems to include the optimal policy π as well as Q(s,a). The Q-learning pseudocode at 1:10:53 is the same in this respect.
On the other hand, in the book by Sutton and Barto (https://web.stanford.edu/class/psych209/Readings/SuttonBartoIPRLBook2ndEd.pdf), the SARSA algorithm (Figure 6.9 on page 155) never updates the policy inside the iteration, so the output of that pseudocode appears to be just Q(s,a). The Q-learning pseudocode (Figure 6.12 on page 158) is the same in this respect.
In the latter case, how do I obtain the optimal policy? Do I need to run another round of greedy learning based on Q(s,a)? Or can I simply treat Q(s,a) as a 2-D table and, for each state s, choose the action a that maximizes Q(s,a)? Is such a policy the same as the one found by Prof Brunskill's algorithm?
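To make the second option concrete, here is a minimal sketch, assuming Q is stored as a plain 2-D array indexed by (state, action); the numbers below are made up purely for illustration. Reading off a greedy policy is just a row-wise argmax:

```python
import numpy as np

# Hypothetical Q-table: rows are states, columns are actions.
Q = np.array([
    [0.1, 0.5, 0.2],   # state 0
    [0.7, 0.3, 0.0],   # state 1
    [0.2, 0.2, 0.9],   # state 2
])

# The greedy policy picks, for each state (row), the action (column)
# with the largest Q-value.
greedy_policy = np.argmax(Q, axis=1)
print(greedy_policy)  # [1 0 2]
```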
Let me try to answer my own questions. Reading the SARSA pseudocode by Sutton and Barto more carefully, I realize that the phrase "using policy derived from Q (e.g., ε-greedy)" amounts to the same thing as line 8 in Brunskill's SARSA pseudocode. If that is the case, the policy is being modified during the iteration; doesn't that make the algorithm off-policy instead of on-policy?
However, in Sutton and Barto's pseudocode, although the actions come from a "policy derived from Q", the policy π itself is never explicitly updated. Does that mean one needs to derive the optimal policy from Q one more time, after the SARSA loop has finished?
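For reference, here is a minimal tabular SARSA sketch along these lines (the `env` interface with `reset()`/`step()`, the episode count, and all hyperparameters are illustrative assumptions, not taken from either source). The behaviour policy is never stored as a separate object; it is recomputed ε-greedily from the current Q at every step, and a deterministic policy is read off Q once at the end:

```python
import numpy as np

def epsilon_greedy(Q, s, epsilon, rng):
    """Sample an action from the ε-greedy policy implicitly defined by Q."""
    if rng.random() < epsilon:
        return int(rng.integers(Q.shape[1]))  # explore: uniform random action
    return int(np.argmax(Q[s]))               # exploit: greedy action for state s

def sarsa(env, n_states, n_actions, episodes=500, alpha=0.1, gamma=0.99,
          epsilon=0.1, seed=0):
    rng = np.random.default_rng(seed)
    Q = np.zeros((n_states, n_actions))
    for _ in range(episodes):
        s = env.reset()                              # hypothetical env API
        a = epsilon_greedy(Q, s, epsilon, rng)       # A ~ policy derived from Q
        done = False
        while not done:
            s_next, r, done = env.step(a)            # hypothetical env API
            a_next = epsilon_greedy(Q, s_next, epsilon, rng)  # A' ~ same policy
            # On-policy TD target: bootstraps on the action A' that will
            # actually be taken next (zero if the episode has terminated).
            target = r + gamma * Q[s_next, a_next] * (not done)
            Q[s, a] += alpha * (target - Q[s, a])
            s, a = s_next, a_next
    # Derive a deterministic policy from Q once, after learning.
    return Q, np.argmax(Q, axis=1)
```

Since the ε-greedy policy is completely determined by Q and ε, storing π explicitly (Brunskill's line 8) and deriving it from Q on the fly (Sutton and Barto) describe the same behaviour; the final argmax over Q is the only extra step needed to get a deterministic policy out of the book's version.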
Answered by Ruye on December 1, 2021