Data Science Asked by raffaem on February 17, 2021
I am reading "Reinforcement Learning: An Introduction" by Sutton and Barto.
On page 59, there is the Bellman equation for the state-value function:
$\begin{array}{ll}
v_{\pi}(s) &= \mathbb{E}_{\pi}[G_t|S_t=s] \\
&= \mathbb{E}_{\pi}[R_{t+1} + \gamma G_{t+1}|S_t=s] \\
&= \sum\limits_{a} \pi(a|s) \sum\limits_{s'} \sum\limits_{r} p(s',r|s,a) \left[ r + \gamma \mathbb{E}_{\pi}[G_{t+1}|S_{t+1}=s'] \right]
\end{array}$
I didn't understand why the expected value survived in the last expression. The definition of the expected value is $\mathbb{E}[X] = \sum\limits_x x \cdot p(x)$, not $\mathbb{E}[X] = \sum\limits_x \mathbb{E}[x] \cdot p(x)$.
I don't know whether my question is clear. In the last line of the definition of $v_{\pi}(s)$, I would not have put the expected value inside.
With expected values you have a fair bit of freedom to expand/resolve or not.
For instance, assuming the distributions $X$ and $Y$ are independently resolved (i.e. the values are not correlated):
$$\mathbb{E}[X + Y] = \left(\sum_x x p(x)\right) + \mathbb{E}[Y]$$
$$\mathbb{E}[XY] = \sum_x x p(x) \mathbb{E}[Y]$$
Each time step of an MDP is independent in this way, so you can use this when handling sums and products within expectations in the Bellman equations (provided you separate terms by time step).
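As a quick numerical sanity check of the two identities above, here is a minimal sketch. The distributions `p_x` and `p_y` are made-up values chosen purely for illustration, and independence is assumed by building the joint probability as `p(x) * p(y)`.

```python
from itertools import product
import math

# Hypothetical discrete distributions (value -> probability), for illustration only.
p_x = {1: 0.2, 2: 0.5, 3: 0.3}
p_y = {10: 0.6, 20: 0.4}


def expectation(dist):
    """E[V] = sum over v of v * p(v)."""
    return sum(value * prob for value, prob in dist.items())


# Fully expanded: sum over the joint distribution.
# Independence is assumed here, so p(x, y) = p(x) * p(y).
pairs = list(product(p_x.items(), p_y.items()))
e_sum_full = sum((x + y) * px * py for (x, px), (y, py) in pairs)
e_prod_full = sum((x * y) * px * py for (x, px), (y, py) in pairs)

# Partially expanded: expand X into a sum, but leave E[Y] as a single number.
e_y = expectation(p_y)
e_sum_partial = sum(x * px for x, px in p_x.items()) + e_y
e_prod_partial = sum(x * px * e_y for x, px in p_x.items())

print(math.isclose(e_sum_full, e_sum_partial))
print(math.isclose(e_prod_full, e_prod_partial))
```

Both comparisons should agree up to floating-point rounding, which is the point: expanding one expectation while leaving the other unresolved does not change the result.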
For the Bellman equation, the goal is to relate $v_\pi(s_t)$ to $v_\pi(s_{t+1})$, and the definition of value is given as an expectation, so it makes sense to preserve the second expectation rather than expand it.
Something has to change though, as within the second sum a time step is effectively taken from $s$ to $s'$, so the new expectation has to include that. It has in some sense been expanded, just not fully broken down into the full product of every following policy decision and state transition.
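One way to spell out that step, as a sketch rather than the book's exact derivation: condition on the action, next state and reward (law of total expectation), then use the Markov property to drop everything except $S_{t+1}=s'$ from the inner condition. The inner expectation that remains is, by definition, $v_{\pi}(s')$.

$$\begin{array}{ll}
\mathbb{E}_{\pi}[R_{t+1} + \gamma G_{t+1}|S_t=s] &= \sum\limits_{a} \pi(a|s) \sum\limits_{s'} \sum\limits_{r} p(s',r|s,a) \, \mathbb{E}_{\pi}[R_{t+1} + \gamma G_{t+1}|S_t=s, A_t=a, S_{t+1}=s', R_{t+1}=r] \\
&= \sum\limits_{a} \pi(a|s) \sum\limits_{s'} \sum\limits_{r} p(s',r|s,a) \left[ r + \gamma \mathbb{E}_{\pi}[G_{t+1}|S_{t+1}=s'] \right] \\
&= \sum\limits_{a} \pi(a|s) \sum\limits_{s'} \sum\limits_{r} p(s',r|s,a) \left[ r + \gamma v_{\pi}(s') \right]
\end{array}$$

So the outer sums are the "$\sum x \cdot p(x)$" part for the first time step only, and the surviving expectation covers all the later steps.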
You could try to write out the full expansion from expectations to products of sums over distributions using some container like $\Pi_{n=t+1}^{T}$, showing how to calculate the expected value over the full tree of all possibilities, and the maths would still work. But it would be a very longhand way of showing the same relationship.
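To illustrate that the longhand expansion and the recursive form agree, here is a rough sketch on a toy MDP. Everything in it is an assumption made for the example: the two states, two actions, `POLICY`, `DYNAMICS` and `GAMMA` are invented numbers, and a finite horizon (`steps`) is used so that the tree of possibilities is finite and can be enumerated.

```python
import math

GAMMA = 0.9
STATES = [0, 1]
ACTIONS = [0, 1]

# Hypothetical policy pi(a|s), stored as POLICY[s][a].
POLICY = {0: {0: 0.5, 1: 0.5}, 1: {0: 0.8, 1: 0.2}}

# Hypothetical dynamics p(s', r | s, a): DYNAMICS[(s, a)] lists (s', r, prob).
DYNAMICS = {
    (0, 0): [(0, 0.0, 0.7), (1, 1.0, 0.3)],
    (0, 1): [(1, 2.0, 1.0)],
    (1, 0): [(0, -1.0, 0.4), (1, 0.5, 0.6)],
    (1, 1): [(0, 3.0, 1.0)],
}


def value_recursive(s, steps):
    """Bellman form: expand one time step, keep the rest as a value."""
    if steps == 0:
        return 0.0
    return sum(
        POLICY[s][a] * prob * (r + GAMMA * value_recursive(s_next, steps - 1))
        for a in ACTIONS
        for s_next, r, prob in DYNAMICS[(s, a)]
    )


def value_enumerated(s, steps):
    """Longhand form: sum over every trajectory of probability * return."""
    total = 0.0

    def walk(state, depth, path_prob, path_return):
        nonlocal total
        if depth == steps:
            # Leaf of the tree: one complete trajectory.
            total += path_prob * path_return
            return
        for a in ACTIONS:
            for s_next, r, prob in DYNAMICS[(state, a)]:
                walk(
                    s_next,
                    depth + 1,
                    path_prob * POLICY[state][a] * prob,  # product of all choices
                    path_return + (GAMMA ** depth) * r,   # discounted return so far
                )

    walk(s, 0, 1.0, 0.0)
    return total


for s in STATES:
    print(s, math.isclose(value_recursive(s, 4), value_enumerated(s, 4)))
```

The enumerated version multiplies out every policy decision and state transition along each path, while the recursive version only expands the first step and reuses the value of the next state. They should match up to rounding, but the enumeration grows exponentially with the horizon.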
Answered by Neil Slater on February 17, 2021