Artificial Intelligence Asked by THAT_AI_GUY on August 24, 2021
Why is the expected return in Reinforcement Learning (RL) computed as a sum of cumulative rewards?
Would it not make more sense to compute $\mathbb{E}(R \mid s, a)$ (the expected return for taking action $a$ in the given state $s$) as the average of all rewards recorded for being in state $s$ and taking action $a$?
In many examples, I've seen the value of a state computed as the expected return, which is itself computed as the cumulative sum of rewards multiplied by a discount factor:
$V^\pi(s) = \mathbb{E}(R \mid s)$ (the value of state $s$, if we follow policy $\pi$, is equal to the expected return given state $s$)
So, $V^\pi(s) = \mathbb{E}(r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + \dots \mid s) = \mathbb{E}\left(\sum_k \gamma^k r_{t+k+1} \mid s\right)$
as $R = r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + \dots$
Would it not make more sense to compute the value of a state as the following:
$V^\pi(s) = \mathbb{E}\left(r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + \dots \mid s\right)/k = \mathbb{E}\left(\sum_k \gamma^k r_{t+k+1} \mid s\right)/k$, where $k$ is the number of elements in the sum, thus giving us the average reward for being in state $s$.
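For concreteness, here is a minimal numerical sketch (the rewards are made up purely for illustration, not taken from the linked article) of the two quantities I am comparing - the usual discounted sum and the averaged version I am proposing:

```python
import numpy as np

# Made-up rewards observed along one trajectory starting from state s.
rewards = np.array([1.0, 0.0, 2.0, 1.0, 0.5])
gamma = 0.9

# Standard (discounted) return: cumulative sum of discounted rewards.
discounts = gamma ** np.arange(len(rewards))
discounted_return = np.sum(discounts * rewards)

# The variant proposed above: divide the same sum by the number of terms k.
k = len(rewards)
averaged_version = discounted_return / k

print(discounted_return, averaged_version)
```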
Reference for cumulative sum example: https://joshgreaves.com/reinforcement-learning/understanding-rl-the-bellman-equations/
Why is the expected return in Reinforcement Learning (RL) computed as a sum of cumulative rewards?
That is the definition of return.
In fact when applying a discount factor this should formally be called discounted return, and not simply "return". Usually the same symbol is used for both ($R$ in your case, $G$ in e.g. Sutton & Barto).
There are also other variations, such as truncated return (sum up to a given time horizon). They all share the feature that a return is a sum of reward values. You cannot really change that and keep the formal term "return", that's how it has been defined.
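For example, using the notation above, the (discounted) return and a truncated return up to a horizon $h$ can be written as follows (the superscript $(h)$ is just ad hoc notation for the horizon):

$$R_t = \sum_{k=0}^{\infty} \gamma^k r_{t+k+1}, \qquad R_t^{(h)} = \sum_{k=0}^{h-1} \gamma^k r_{t+k+1}$$

Both are sums over reward values; only the upper limit differs.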
You can however define the value function to be something other than the expected return. Rather than looking for alternative definitions of return as your title suggests, you could be looking for alternative metrics to use as value functions.
You do go on to ask about computing "the value of a state" without mentioning the word "return", but it is not entirely clear whether you are aware that the way to resolve this is to use something other than the return.
Would it not make more sense to compute the value of a state as the following: $V^\pi(s) = \mathbb{E}\left(r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + \dots \mid s\right)/k = \mathbb{E}\left(\sum_k \gamma^k r_{t+k+1} \mid s\right)/k$, where $k$ is the number of elements in the sum, thus giving us the average reward for being in state $s$.
Your example would give a value close to zero for nearly all long-running or non-episodic problems, as you are summing a decreasing geometric series (which converges) up to a possibly very large $k$, then dividing by the maximum $k$. Notation-wise, you are also using $k$ both as the iterator and as the maximum value of the same iterator; that would need fixing.
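To see the effect, here is a quick sketch (with a constant reward of $1$ per step, chosen only to make the numbers obvious) showing that the discounted sum converges while dividing it by $k$ drives the result towards zero:

```python
import numpy as np

gamma = 0.9
for k in [10, 100, 1_000, 10_000]:
    # Constant reward of 1 at every step, purely illustrative.
    discounted_sum = np.sum(gamma ** np.arange(k) * np.ones(k))
    # The discounted sum converges to 1 / (1 - gamma) = 10,
    # but dividing by k sends the proposed metric towards zero.
    print(k, discounted_sum, discounted_sum / k)
```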
However, this is very close to a real value metric used in reinforcement learning, called the average reward setting.
The expected average reward value function for a non-episodic problem is typically given by
$$V^\pi(s) = \mathbb{E}\left[\lim_{h \to \infty} \frac{1}{h} \sum_{k=0}^{h} r_{t+k+1} \,\middle|\, s_t = s\right]$$
Note there is no discount factor; it is not usually possible to combine a discount factor with the average reward setting.
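As a sketch of what this means in practice, the average reward under a fixed policy can be estimated directly from a long sampled trajectory, with no discounting involved (the reward stream below is simulated with arbitrary numbers, purely for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for rewards observed while following a fixed policy from
# state s in a continuing (non-episodic) task; the true mean here is 0.5.
h = 100_000
rewards = rng.normal(loc=0.5, scale=1.0, size=h)

# Monte Carlo estimate of the average reward: (1/h) * sum of rewards.
average_reward_estimate = rewards.mean()
print(average_reward_estimate)  # approaches 0.5 as h grows
```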
Sutton & Barto point out in Reinforcement Learning: An Introduction, chapter 10, section 10.4, that when using function approximation on continuing tasks, a discount factor is not a useful part of the problem setting, and that average reward is a more natural approach. The two settings are not so different, and it is quite easy to modify the Bellman equations and update rules accordingly. However, many DQN implementations still use discounted return to solve continuing tasks. That is because, with a high enough discount factor $\gamma$, e.g. $0.99$ or $0.999$, the end result is likely to be the same optimal solution - the discount factor has moved from being part of the problem formulation to being a solution hyperparameter.
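To show how small the modification is, here is a tabular sketch of a differential TD(0)-style prediction update in the average reward setting, loosely following the differential updates described in Sutton & Barto chapter 10 (the `env.reset()`/`env.step()` interface and the step sizes are assumptions for illustration, not a specific library's API):

```python
from collections import defaultdict

def differential_td0(env, policy, alpha=0.1, beta=0.01, n_steps=100_000):
    """Estimate differential state values and the average reward.

    `env` is assumed to expose reset() -> state and step(action) ->
    (next_state, reward); this is a stand-in interface for illustration.
    """
    V = defaultdict(float)   # differential value estimates
    avg_reward = 0.0         # running estimate of the average reward
    s = env.reset()
    for _ in range(n_steps):
        a = policy(s)
        s_next, r = env.step(a)
        # TD error: the average reward replaces discounting, and there
        # is no gamma in front of V[s_next].
        delta = r - avg_reward + V[s_next] - V[s]
        avg_reward += beta * delta
        V[s] += alpha * delta
        s = s_next
    return V, avg_reward
```

Compared with discounted TD(0), the only changes are that $r - \bar{R}$ replaces $r$ and that there is no $\gamma$ multiplying $V(s')$.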
Answered by Neil Slater on August 24, 2021