TransWikia.com

GAE - Understanding the TD-based advantage function

Data Science Asked on March 16, 2021

The advantage function in GAE is defined as (Eq. 1)

$$hat{A}_t^{GAE(gamma, lambda)} = sum_{l=0}^{infty} (gamma lambda)^l delta^V_{t+l}$$

where the TD residual is (Eq. 2)

$$delta^V_t = r_t + gamma V(s_{t+1}) - V(s_t)$$
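To make the two equations concrete, here is a minimal sketch (in NumPy, with hypothetical inputs) of how the advantage in Eq. 1 is computed from the TD residuals of Eq. 2 over one sampled trajectory. The function name and arguments are illustrative, not from the original question:

```python
import numpy as np

def gae_advantages(rewards, values, last_value, gamma=0.99, lam=0.95):
    """Compute GAE advantages for one trajectory.

    rewards:    r_0 .. r_{T-1} observed along the trajectory
    values:     V(s_0) .. V(s_{T-1}) from the learned value function
    last_value: V(s_T), the bootstrap value for the final transition
    """
    T = len(rewards)
    values = np.append(np.asarray(values, dtype=float), last_value)
    advantages = np.zeros(T)
    gae = 0.0
    # Backward recursion equivalent to Eq. 1:
    # A_t = delta_t + (gamma * lam) * A_{t+1}
    for t in reversed(range(T)):
        delta = rewards[t] + gamma * values[t + 1] - values[t]  # Eq. 2
        gae = delta + gamma * lam * gae
        advantages[t] = gae
    return advantages
```

Note that with $lambda = 1$ and $gamma = 1$ (and a zero value function) the recursion collapses to the plain reward-to-go, which is exactly the quantity the question below asks about.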

The question is: in Eq. 2, why is a value function (an estimator that must itself be trained) used at all, especially when it is used to estimate a quantity whose true value is already available to us from the observed rewards? Why not just use those rewards directly?

During training, the value function is fitted with the loss $MSE_{loss} = (Value - RewardToGo_{Discounted})^2$ and gradient descent [3]. Here, why is a separate value function being used to estimate the return? Why not just use the actual rewards that appear in Eq. 2? If the value function were being used to estimate the values of states off the sampled trajectory, that would make sense; but here, since we have already traversed this particular trajectory, we already have the true value through the observed rewards.
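For reference, the training procedure described above can be sketched as follows: regress the value function onto the discounted reward-to-go with the stated MSE loss. A linear value function and plain gradient descent are simplifying assumptions for illustration (real critics are usually neural networks), and the function names are hypothetical:

```python
import numpy as np

def discounted_rewards_to_go(rewards, gamma=0.99):
    """Regression target G_t = sum_{k >= t} gamma^(k-t) * r_k."""
    G = np.zeros(len(rewards))
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        G[t] = running
    return G

def value_mse_step(w, states, targets, lr=1e-2):
    """One gradient-descent step on MSE_loss = mean((V(s) - G_t)^2)
    for a linear value function V(s) = w . s."""
    preds = states @ w
    grad = 2.0 * states.T @ (preds - targets) / len(targets)
    return w - lr * grad
```

This makes the asker's point explicit: the target $RewardToGo_{Discounted}$ is computed purely from observed rewards, yet the bootstrapped estimate $V(s_{t+1})$, not the reward-to-go itself, appears inside Eq. 2.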

Does this reasoning make sense? Please share your thoughts.
