Data Science: Asked on February 26, 2021
The TD(0) algorithm is defined by the following iterative update:
$$ V(s) \leftarrow V(s) + \alpha \left( r + \gamma V(s') - V(s) \right) $$
Now, if we set $\alpha = 1$, we get the traditional policy-evaluation formula from dynamic programming. Is that correct?
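For reference, the traditional policy-evaluation update from dynamic programming (the standard Bellman expectation backup, as in Sutton & Barto) is:

$$ V(s) \leftarrow \sum_{a} \pi(a \mid s) \sum_{s', r} p(s', r \mid s, a) \left[ r + \gamma V(s') \right] $$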
$\alpha$ is independent of the type of RL algorithm. It is the learning rate, i.e. how far each update moves the state value toward its target. You can set it to 1 or to something smaller.
Policy evaluation is a general principle; temporal difference is one way to implement it. More precisely, TD determines how far into the future you look when accounting for the consequences of an action. In your equation, $\gamma$ determines how much weight that future is given.
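A minimal sketch of the TD(0) update with an explicit learning rate (the state names, reward, and values below are hypothetical, for illustration only):

```python
# Minimal sketch of the TD(0) update with an explicit learning rate alpha.
def td0_update(V, s, r, s_next, alpha=0.1, gamma=0.99):
    """Move V(s) toward the one-step TD target r + gamma * V(s')."""
    td_target = r + gamma * V[s_next]
    V[s] += alpha * (td_target - V[s])  # with alpha = 1, V(s) is replaced by the target
    return V

V = {"s0": 0.0, "s1": 0.5}
td0_update(V, "s0", r=1.0, s_next="s1", alpha=1.0)
print(V["s0"])  # 1.0 + 0.99 * 0.5 = 1.495, exactly the TD target
```

With $\alpha = 1$ the old estimate is discarded entirely; with $\alpha < 1$ the update averages the new target into the running estimate.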
Answered by Dany Yatim on February 26, 2021
No. Dynamic programming updates the value of a state by looking at all possible next states (a full-width backup, which requires a model of the environment), while TD(0) updates it from only a single sampled next state.
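A minimal sketch of this contrast on a made-up toy model under a fixed policy (all state names, probabilities, and rewards here are hypothetical):

```python
import random

# Hypothetical toy MDP under a fixed policy: transitions[s] lists
# (probability, reward, next_state) triples.
transitions = {
    "s0": [(0.7, 1.0, "s1"), (0.3, 0.0, "s2")],
    "s1": [(1.0, 0.0, "s1")],
    "s2": [(1.0, 0.0, "s2")],
}
V = {"s0": 0.0, "s1": 0.0, "s2": 0.0}
gamma = 0.9

def dp_backup(s):
    """Dynamic programming: full-width backup over ALL possible next states."""
    return sum(p * (r + gamma * V[nxt]) for p, r, nxt in transitions[s])

def td0_backup(s, alpha=0.1):
    """TD(0): backup from a SINGLE sampled next state (no model needed at update time)."""
    weights = [t[0] for t in transitions[s]]
    p, r, nxt = random.choices(transitions[s], weights=weights)[0]
    target = r + gamma * V[nxt]
    return V[s] + alpha * (target - V[s])

print(dp_backup("s0"))   # expectation over both successors: 0.7 * 1.0 = 0.7
print(td0_backup("s0"))  # depends on which single successor was sampled
```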
Answered by Brian Spiering on February 26, 2021