Artificial Intelligence: Asked by ijuneja on November 4, 2021
In the paper "Action Elimination and Stopping Conditions for the Multi-Armed Bandit and Reinforcement Learning Problems", on page 1083, on the 6th line from the bottom, the authors define the expectation of the empirical model as
$$\hat{\mathbb{E}}_{s,s',a}[V(s')] = \sum_{s' \in S} \hat{P}^{a}_{s,s'}V(s').$$
I don't understand the significance of this quantity, since it puts $V(s')$ inside an expectation while assuming knowledge of $V(s')$ on the right-hand side of the definition.
A clarification in this regard would be appreciated.
EDIT:
The paper defines $\hat{P}^{a}_{s,s'}$ as
$$\hat{P}^{a}_{s,s'} = \frac{|(s, a, s', t)|}{|(s, a, t)|},$$
where $|(s, a, t)|$ is the number of times state $s$ was visited and action $a$ was taken, and $|(s, a, s', t)|$ is the number of those $|(s, a, t)|$ visits for which the next state was $s'$ during model learning.
No explicit definition of $V$ is provided; however, $V^{\pi}$ is defined as the usual expected discounted return, using the same definition as Sutton and Barto and other sources.
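For concreteness, here is a minimal Python sketch of how I read this count-based estimator; the `counts` array, its shape, and the random numbers are my own assumptions for illustration, not from the paper:

```python
import numpy as np

# Hypothetical tabular counts: counts[s, a, s2] plays the role of |(s, a, s', t)|,
# i.e. how often taking action a in state s led to next state s'.
n_states, n_actions = 4, 2
rng = np.random.default_rng(0)
counts = rng.integers(0, 10, size=(n_states, n_actions, n_states))

# visits[s, a] plays the role of |(s, a, t)|: total visits to the pair (s, a).
visits = counts.sum(axis=2, keepdims=True)

# Empirical transition model: P_hat[s, a, s'] = |(s, a, s', t)| / |(s, a, t)|.
P_hat = counts / np.maximum(visits, 1)  # avoid division by zero for unvisited pairs
```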
If I understand your question correctly, the significance comes from the fact that $s'$ is random. On the right-hand side it is assumed that $V(\cdot)$ is known for each state, but the quantity measures the expected value of the next state, i.e. the average of $V(s')$ over the possible next states $s'$, given the current state and action. The hat indicates that this expectation is taken under the empirical transition model $\hat{P}$ rather than the true one.
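To make this concrete, here is a minimal sketch in Python; the values of $V$ and the estimated transition row are made up purely for illustration:

```python
import numpy as np

# Suppose V is already known for every state (e.g. the current estimate in a
# value-iteration sweep); these numbers are purely illustrative.
V = np.array([1.0, 0.5, 2.0, 0.0])

# A hypothetical estimated transition row P_hat[s, a, :] for one fixed (s, a);
# its entries sum to 1 because they are empirical frequencies.
P_hat_sa = np.array([0.1, 0.4, 0.3, 0.2])

# Empirical expectation: sum over s' of P_hat[s, a, s'] * V(s').
# V itself is deterministic here; the randomness is over which s' occurs.
E_hat = P_hat_sa @ V
print(E_hat)  # 0.1*1.0 + 0.4*0.5 + 0.3*2.0 + 0.2*0.0 = 0.9
```

So nothing circular is going on: $V$ enters as a fixed table of numbers, and the expectation only averages those numbers with the estimated transition probabilities.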
Answered by harwiltz on November 4, 2021