Data Science: Asked on February 26, 2021
The TD(0) algorithm is defined by the following iterative update:
$$ V(s) \leftarrow V(s) + \alpha \left( r + \gamma V(s') - V(s) \right) $$
Now, if we set $\alpha = 1$, we get the traditional policy-evaluation formula from dynamic programming. Is that correct?
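For reference, the traditional policy-evaluation update from dynamic programming (the standard Bellman expectation backup, as in Sutton & Barto) is:

$$ V(s) \leftarrow \sum_{a} \pi(a \mid s) \sum_{s', r} p(s', r \mid s, a) \left[ r + \gamma V(s') \right] $$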
$\alpha$ is independent of the type of RL algorithm. It is the learning rate, i.e. how far each update moves the state value toward its target. You can set it to 1 or to something smaller.
Policy evaluation is a general principle; temporal difference is one way to implement it. More precisely, TD determines how far into the future you look when accounting for the consequences of an action. In your equation, $\gamma$ determines how much weight that future is given.
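A minimal sketch of the TD(0) update with an explicit learning rate (the state names, reward, and values below are hypothetical, for illustration only):

```python
# Minimal sketch of the TD(0) update with an explicit learning rate alpha.
def td0_update(V, s, r, s_next, alpha=0.1, gamma=0.99):
    """Move V(s) toward the one-step TD target r + gamma * V(s')."""
    td_target = r + gamma * V[s_next]
    V[s] += alpha * (td_target - V[s])  # with alpha = 1, V(s) is replaced by the target
    return V

V = {"s0": 0.0, "s1": 0.5}
td0_update(V, "s0", r=1.0, s_next="s1", alpha=1.0)
print(V["s0"])  # 1.0 + 0.99 * 0.5 = 1.495, exactly the TD target
```

With $\alpha = 1$ the old estimate is discarded entirely; with $\alpha < 1$ the update averages the new target into the running estimate.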
Answered by Dany Yatim on February 26, 2021
No. Dynamic programming updates the value of a state by looking at all possible next states (a full-width backup, which requires a model of the environment), while TD(0) updates it from only a single sampled next state.
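A minimal sketch of this contrast on a made-up toy model under a fixed policy (all state names, probabilities, and rewards here are hypothetical):

```python
import random

# Hypothetical toy MDP under a fixed policy: transitions[s] lists
# (probability, reward, next_state) triples.
transitions = {
    "s0": [(0.7, 1.0, "s1"), (0.3, 0.0, "s2")],
    "s1": [(1.0, 0.0, "s1")],
    "s2": [(1.0, 0.0, "s2")],
}
V = {"s0": 0.0, "s1": 0.0, "s2": 0.0}
gamma = 0.9

def dp_backup(s):
    """Dynamic programming: full-width backup over ALL possible next states."""
    return sum(p * (r + gamma * V[nxt]) for p, r, nxt in transitions[s])

def td0_backup(s, alpha=0.1):
    """TD(0): backup from a SINGLE sampled next state (no model needed at update time)."""
    weights = [t[0] for t in transitions[s]]
    p, r, nxt = random.choices(transitions[s], weights=weights)[0]
    target = r + gamma * V[nxt]
    return V[s] + alpha * (target - V[s])

print(dp_backup("s0"))   # expectation over both successors: 0.7 * 1.0 = 0.7
print(td0_backup("s0"))  # depends on which single successor was sampled
```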
Answered by Brian Spiering on February 26, 2021