Data Science · Asked on March 16, 2021
The advantage function in GAE is defined as

$$\hat{A}_t^{GAE(\gamma,\lambda)} = \sum_{l=0}^{\infty} (\gamma \lambda)^l \, \delta_{t+l}^{V} \tag{1}$$

where

$$\delta_t^{V} = r_t + \gamma V(s_{t+1}) - V(s_t). \tag{2}$$
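For concreteness, here is a minimal sketch of how I understand Eqs 1 and 2 being computed over a finite trajectory, using the standard backward recursion $\hat{A}_t = \delta_t + \gamma\lambda\hat{A}_{t+1}$ (the function name and array layout are my own illustrative choices, not from any particular library):

```python
import numpy as np

def gae_advantages(rewards, values, gamma=0.99, lam=0.95):
    """Compute GAE advantages (Eq 1) from per-step TD residuals (Eq 2).

    rewards: r_t for one trajectory, length T
    values:  V(s_t) estimates, length T + 1 (last entry bootstraps the final state)
    """
    T = len(rewards)
    advantages = np.zeros(T)
    gae = 0.0
    # Walk backwards so each step reuses the discounted sum from step t + 1.
    for t in reversed(range(T)):
        delta = rewards[t] + gamma * values[t + 1] - values[t]  # Eq 2
        gae = delta + gamma * lam * gae  # recursive form of Eq 1
        advantages[t] = gae
    return advantages
```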
The question is: in Eq 2, why is a value function (an estimator that needs to be trained) used at all, especially when it is used to estimate a reward whose true value is already available to us? Why not just use that reward as the value?
During the training process, the value function is fit with the loss $MSE_{loss} = (Value - RewardToGo_{Discounted})^2$ and gradient descent [3]. Why is a separate value function being used here to estimate the reward? Why not just use the actual reward that is computed in Eq 2? If the value function were estimating the values of states off the trajectory that was taken, that would make sense; but here, since we have traversed this particular trajectory, we already have the true value through the reward.
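To make that loss concrete, here is a minimal sketch of the regression I am describing, i.e. fitting $V(s_t)$ to the discounted reward-to-go (the names are illustrative; the actual implementation in [3] may differ):

```python
import numpy as np

def reward_to_go(rewards, gamma=0.99):
    """Discounted reward-to-go targets: G_t = sum_{l >= 0} gamma^l * r_{t+l}."""
    T = len(rewards)
    targets = np.zeros(T)
    running = 0.0
    for t in reversed(range(T)):
        running = rewards[t] + gamma * running
        targets[t] = running
    return targets

def value_mse_loss(values, rewards, gamma=0.99):
    """MSE between V(s_t) estimates and the discounted reward-to-go targets."""
    targets = reward_to_go(rewards, gamma)
    return np.mean((values - targets) ** 2)
```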
Does this make sense? Please share your thoughts.