Cross Validated Asked on December 6, 2021
Suppose I have a tabular Q-learning problem, such as grid-world.
Let our loss be defined as,
$$\hat{L}(Q)=0.5\left(Q(s,a)-(r+\gamma\max_{a'}{Q(s',a')})\right)^2$$
Then $Q_{k+1}(s,a) = Q_k(s,a) - \eta \nabla \hat{L}(Q) = Q_k(s,a) - \eta\left(Q_k(s,a) - (r_k+\gamma\max_{a'}{Q_k(s',a')})\right)$, which is just Q-learning.
So, does a gradient descent approach make sense if we take our loss to be the squared TD error, i.e. the difference between the current Q value and the bootstrapped target?
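For concreteness, here is a minimal sketch of the update I have in mind, assuming a NumPy Q-table; the grid size, learning rate, and discount below are just placeholders:

```python
import numpy as np

# Illustrative tabular setup (sizes and hyperparameters are placeholders, not from the problem)
n_states, n_actions = 16, 4          # e.g. a 4x4 grid-world
Q = np.zeros((n_states, n_actions))
eta, gamma = 0.1, 0.99

def q_learning_step(Q, s, a, r, s_next):
    """One gradient step on 0.5 * (Q[s,a] - (r + gamma * max_a' Q[s',a']))**2,
    treating the bootstrapped target as a constant."""
    td_error = Q[s, a] - (r + gamma * Q[s_next].max())
    Q[s, a] -= eta * td_error        # dL/dQ[s,a] = td_error, so this is the usual tabular update
    return Q
```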
Yes, it is possible; you are close, but not quite there.
You lost a gradient in your equation; it should be: $$Q_{k+1}(s,a) = Q_k(s,a) - \eta \left(Q(s,a)-(r+\gamma\max_{a'}{Q(s',a')})\right)\left(\left.\frac{d~Q}{d~\theta}\right|_{(s,a)} - \gamma \left.\frac{d~\max_{a'}Q}{d~\theta}\right|_{(s')} \right)$$
This simplifies a bit in the case of a tabular representation:
$$Q_{k+1}(s,a) = Q_k(s,a) - \eta \left(Q(s,a)-(r+\gamma\max_{a'}{Q(s',a')})\right)\left(1 - \gamma \left.\frac{d~\max_{a'}Q}{d~\theta}\right|_{(s')} \right)$$
Problems may arise if $s=s'$ and $a=a'$: the last factor becomes $(1-\gamma)$, so the update nearly vanishes for $\gamma$ close to $1$ (and is exactly $0$ for $\gamma=1$), which it shouldn't. It's also not a good idea to try to differentiate the $\max$ function.
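As a rough sketch, the full-gradient update above would look something like this for a table (reusing the hypothetical NumPy `Q`, `eta`, and `gamma` from the question's sketch; fixing the argmax before differentiating is one possible way to handle the $\max$, not the only one):

```python
def full_gradient_step(Q, s, a, r, s_next):
    """Differentiate the loss w.r.t. every table entry, fixing a* = argmax_a' Q[s',a']
    and ignoring the kink of the max (itself a simplification)."""
    a_star = Q[s_next].argmax()
    delta = Q[s, a] - (r + gamma * Q[s_next, a_star])
    grad = np.zeros_like(Q)
    grad[s, a] += delta                     # from the Q(s,a) term of the loss
    grad[s_next, a_star] -= gamma * delta   # from inside the bootstrapped target
    # If (s, a) == (s_next, a_star), the two contributions combine to delta * (1 - gamma),
    # so the update on that entry nearly vanishes for gamma close to 1, as noted above.
    return Q - eta * grad
```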
You can do the "double deep Q-learning trick" and introduce $\theta_\textrm{old}$ to estimate $Q(s',a')$, i.e., use the Q-table from the previous step. This makes the other gradient disappear, and you are indeed left with Q-learning:
$$Q_{k+1}(s,a) = Q_k(s,a) - \eta \left(Q(s,a, \theta)-(r+\gamma\max_{a'}{Q(s',a', \theta_\textrm{old})})\right)$$
In this case, the loss will be
$$\hat L(\theta)= \frac12 \left(Q(s,a,\theta)-(r+\gamma \max_{a'}Q(s',a', \theta_\textrm{old}))\right)^2$$
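A minimal tabular sketch of this frozen-target version (again reusing the hypothetical NumPy names from above; `Q_old` plays the role of $\theta_\textrm{old}$ and is just a copy of the table from the previous step):

```python
def frozen_target_step(Q, Q_old, s, a, r, s_next):
    """Semi-gradient update: the bootstrapped target comes from a frozen copy Q_old,
    so its gradient is zero and only Q[s, a] changes; this is plain tabular Q-learning again."""
    target = r + gamma * Q_old[s_next].max()
    Q[s, a] -= eta * (Q[s, a] - target)
    return Q

# Q_old is simply the table from the previous step, e.g. Q_old = Q.copy() before updating.
```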
Answered by FirefoxMetzger on December 6, 2021