From Reinforcement Learning: An Introduction, second edition, by Richard S. Sutton and Andrew G. Barto:
- We made two unlikely assumptions above in order to easily obtain this guarantee of
convergence for the Monte Carlo method. … For now we focus on the assumption that policy evaluation operates on an infinite
number of episodes. This assumption is relatively easy to remove. In fact, the same issue
arises even in classical DP methods such as iterative policy evaluation, which also converge
only asymptotically to the true value function.
- There is a second approach to avoiding the infinite number of episodes nominally
required for policy evaluation, in which we give up trying to complete policy evaluation
before returning to policy improvement. On each evaluation step we move the value
function toward $q_{\pi_{k}}$, but we do not expect to actually get close except over many steps.
We used this idea when we first introduced the idea of GPI in Section 4.6. One extreme
form of the idea is value iteration, in which only one iteration of iterative policy evaluation
is performed between each step of policy improvement. The in-place version of value
iteration is even more extreme; there we alternate between improvement and evaluation
steps for single states.
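For reference (this is not part of the quote), the value-iteration update the quote alludes to folds a single sweep of evaluation into the improvement step; in the book's notation it is

$$
v_{k+1}(s) \doteq \max_{a} \sum_{s', r} p(s', r \mid s, a)\bigl[r + \gamma v_{k}(s')\bigr], \qquad \text{for all } s \in \mathcal{S}.
$$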
The original pseudocode:
Monte Carlo ES (Exploring Starts), for estimating $\pi \approx \pi_{*}$
Initialize:
$\quad$ $\pi(s) \in \mathcal{A}(s)$ (arbitrarily), for all $s \in \mathcal{S}$
$\quad$ $Q(s, a) \in \mathbb{R}$ (arbitrarily), for all $s \in \mathcal{S}, a \in \mathcal{A}(s)$
$\quad$ $Returns(s, a) \leftarrow$ empty list, for all $s \in \mathcal{S}, a \in \mathcal{A}(s)$
Loop forever (for each episode):
$\quad$ Choose $S_{0} \in \mathcal{S}, A_{0} \in \mathcal{A}\left(S_{0}\right)$ randomly such that all pairs have probability $> 0$
$\quad$ Generate an episode from $S_{0}, A_{0}$, following $\pi$: $S_{0}, A_{0}, R_{1}, \ldots, S_{T-1}, A_{T-1}, R_{T}$
$\quad$ $G \leftarrow 0$
$\quad$ Loop for each step of episode, $t = T-1, T-2, \ldots, 0$:
$\quad\quad$ $G \leftarrow \gamma G + R_{t+1}$
$\quad\quad$ Unless the pair $S_{t}, A_{t}$ appears in $S_{0}, A_{0}, S_{1}, A_{1}, \ldots, S_{t-1}, A_{t-1}$:
$\quad\quad\quad$ Append $G$ to $Returns\left(S_{t}, A_{t}\right)$
$\quad\quad\quad$ $Q\left(S_{t}, A_{t}\right) \leftarrow \text{average}\left(Returns\left(S_{t}, A_{t}\right)\right)$
$\quad\quad\quad$ $\pi\left(S_{t}\right) \leftarrow \arg\max_{a} Q\left(S_{t}, a\right)$
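For concreteness, here is a minimal Python sketch of that pseudocode. The `states` list, `actions(s)` function, and `generate_episode(s0, a0, pi)` function (returning a list of `(S_t, A_t, R_{t+1})` tuples) are assumed interfaces of my own, not anything defined in the book:

```python
import random
from collections import defaultdict

def mc_exploring_starts(states, actions, generate_episode, gamma=1.0, num_episodes=10_000):
    """Monte Carlo ES, following the book's pseudocode.

    Assumed interfaces: `states` is a list of states, `actions(s)` returns the
    actions available in s, and `generate_episode(s0, a0, pi)` returns a list
    of (S_t, A_t, R_{t+1}) tuples for an episode started at (s0, a0).
    """
    pi = {s: random.choice(actions(s)) for s in states}   # pi(s), arbitrary
    Q = defaultdict(float)                                # Q(s, a), arbitrary (0 here)
    returns = defaultdict(list)                           # Returns(s, a)

    for _ in range(num_episodes):
        # Exploring start: every (state, action) pair has probability > 0.
        s0 = random.choice(states)
        a0 = random.choice(actions(s0))
        episode = generate_episode(s0, a0, pi)            # [(S_t, A_t, R_{t+1}), ...]

        g = 0.0
        for t in range(len(episode) - 1, -1, -1):         # t = T-1, T-2, ..., 0
            s, a, r = episode[t]                          # S_t, A_t, R_{t+1}
            g = gamma * g + r
            # First-visit check: skip if (S_t, A_t) occurred earlier in the episode.
            if any((s, a) == (sp, ap) for sp, ap, _ in episode[:t]):
                continue
            returns[(s, a)].append(g)
            Q[(s, a)] = sum(returns[(s, a)]) / len(returns[(s, a)])
            pi[s] = max(actions(s), key=lambda act: Q[(s, act)])

    return pi, Q
```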
I want to make the same algorithm but with a model. The book states:
- With a model, state values alone are sufficient to determine a policy; one simply looks ahead one step and chooses whichever action leads to the best combination of reward and next state, as we did in the chapter on DP.
So, based on the first quote, I must use the "exploring starts" and "one evaluation step, one improvement step" ideas (as in the model-free version) to make the algorithm converge.
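Spelled out, the one-step lookahead from the second quote makes the policy greedy with respect to $V$ through the model $p$:

$$
\pi(s) \doteq \arg\max_{a} \sum_{s', r} p(s', r \mid s, a)\bigl[r + \gamma V(s')\bigr].
$$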
My version of the pseudocode:
Monte Carlo ES (Exploring Starts), for estimating $\pi \approx \pi_{*}$ (with model)
Initialize:
$\quad$ $\pi(s) \in \mathcal{A}(s)$ (arbitrarily), for all $s \in \mathcal{S}$
$\quad$ $V(s) \in \mathbb{R}$ (arbitrarily), for all $s \in \mathcal{S}$
$\quad$ $Returns(s) \leftarrow$ empty list, for all $s \in \mathcal{S}$
Loop forever (for each episode):
$\quad$ Choose $S_{0} \in \mathcal{S}, A_{0} \in \mathcal{A}\left(S_{0}\right)$ randomly such that all pairs have probability $> 0$
$\quad$ Generate an episode from $S_{0}, A_{0}$, following $\pi$: $S_{0}, A_{0}, R_{1}, \ldots, S_{T-1}, A_{T-1}, R_{T}$
$\quad$ $G \leftarrow 0$
$\quad$ Loop for each step of episode, $t = T-1, T-2, \ldots, 1$:
$\quad\quad$ $G \leftarrow \gamma G + R_{t+1}$
$\quad\quad$ Unless $S_{t}$ appears in $S_{0}, S_{1}, \ldots, S_{t-1}$:
$\quad\quad\quad$ Append $G$ to $Returns\left(S_{t}\right)$
$\quad\quad\quad$ $V\left(S_{t}\right) \leftarrow \text{average}\left(Returns\left(S_{t}\right)\right)$
$\quad\quad\quad$ $\pi\left(S_{t-1}\right) \leftarrow \arg\max_{a} \sum_{s', r} p\left(s', r \mid S_{t-1}, a\right)\left[r + \gamma V\left(s'\right)\right]$
Here I update the policy at $S_{t-1}$ because the previous step updates $V(S_{t})$, and a change to $V(S_{t})$ does not affect $\pi(S_{t})$ but does affect $\pi(S_{t-1})$, since $S_{t}$ is one of the successor states $s'$ of $S_{t-1}$.
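And a minimal Python sketch of my version, under the same assumed interfaces as above, plus an assumed model `p(s, a)` that returns `(probability, next_state, reward)` triples:

```python
import random
from collections import defaultdict

def mc_es_with_model(states, actions, p, generate_episode, gamma=1.0, num_episodes=10_000):
    """Model-based Monte Carlo ES variant from the pseudocode above.

    Assumed interfaces: `p(s, a)` returns a list of (probability, next_state,
    reward) triples; `states`, `actions`, and `generate_episode` are as in the
    model-free sketch.
    """
    pi = {s: random.choice(actions(s)) for s in states}   # pi(s), arbitrary
    V = defaultdict(float)                                # V(s), arbitrary (0 here)
    returns = defaultdict(list)                           # Returns(s)

    def lookahead(s, a):
        # One-step lookahead: sum over (s', r) of p(s', r | s, a) * [r + gamma * V(s')]
        return sum(prob * (r + gamma * V[s2]) for prob, s2, r in p(s, a))

    for _ in range(num_episodes):
        # Exploring start on the (state, action) pair.
        s0 = random.choice(states)
        a0 = random.choice(actions(s0))
        episode = generate_episode(s0, a0, pi)            # [(S_t, A_t, R_{t+1}), ...]

        g = 0.0
        for t in range(len(episode) - 1, 0, -1):          # t = T-1, T-2, ..., 1
            s, _, r = episode[t]                          # S_t, R_{t+1}
            g = gamma * g + r
            # First-visit check on the state alone.
            if any(s == sp for sp, _, _ in episode[:t]):
                continue
            returns[s].append(g)
            V[s] = sum(returns[s]) / len(returns[s])
            # Greedy improvement at the predecessor state S_{t-1}.
            s_prev = episode[t - 1][0]
            pi[s_prev] = max(actions(s_prev), key=lambda a: lookahead(s_prev, a))

    return pi, V
```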