Data Science Asked by tandem on March 24, 2021
I am having a hard time understanding why my loss is constantly zero when using DQN. I’m trying to use the gym environment to play the game CartPole-v0.
In the code below, r_batch holds the rewards sampled from the replay buffer, and similarly s_batch, ns_batch, and done_batch hold the sampled states, next states, and done flags (whether the episode has ended). Each time, I sample 4 transitions from the buffer; a sketch of the sampling itself is shown after the snippet.
# This is y_i (the TD target)
target_q = r_batch + self.gamma * np.amax(self.get_target_value(ns_batch), axis=1) * (1 - done_batch)
# This is y (the network's current prediction for s_batch)
target_f = self.model.predict(s_batch)
# This is the gradient descent step
losses = self.model.train_on_batch(s_batch, target_f)
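For reference, the minibatch above comes from something along these lines; this is a simplified sketch with an assumed deque-based buffer and a hypothetical sample_batch helper, not the exact code from my agent:

import random
from collections import deque

import numpy as np

# Buffer of (state, action, reward, next_state, done) tuples (assumed layout).
replay_buffer = deque(maxlen=10000)

def sample_batch(buffer, batch_size=4):
    # Draw a random minibatch and split it into one array per field.
    batch = random.sample(list(buffer), batch_size)
    s_batch, a_batch, r_batch, ns_batch, done_batch = map(np.array, zip(*batch))
    return s_batch, a_batch, r_batch, ns_batch, done_batch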
The values of target_q are:
[ 0.42824322 0.01458293 1. -0.29854858]
The values of target_f are:
[[0.11004215 1.0435755 ]
[0.20311067 2.1085744 ]
[0.413234 4.376865 ]
[0.24785805 2.6716242 ]]
I’m missing something here, and I don’t know what.
OK, a bit more digging led me to this:
The TD target or "target value" gets its name because by updating a Q table or training a NN with it as a ground truth, the estimator will output values in future closer to the supplied value. The estimator "gets closer to the target".
So, I have to update the prediction network with the "ground truth":
# actions are the sampled actions; predict_q is the prediction network's output for
# s_batch (i.e. target_f above), so only the taken-action entries get the TD target.
for i, val in enumerate(actions):
    predict_q[i][val] = target_q[i]
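Putting the pieces together, the whole update step then looks roughly like this; the train_step wrapper is just an illustrative sketch around the snippets above, with self.gamma, self.model, and self.get_target_value as before:

import numpy as np

def train_step(self, s_batch, a_batch, r_batch, ns_batch, done_batch):
    # TD target: y_i = r_i + gamma * max_a' Q_target(s'_i, a') * (1 - done_i)
    target_q = r_batch + self.gamma * np.amax(self.get_target_value(ns_batch), axis=1) * (1 - done_batch)
    # Start from the network's own predictions so untaken actions contribute no error ...
    target_f = self.model.predict(s_batch)
    # ... and overwrite only the entries of the actions that were actually taken.
    for i, action in enumerate(a_batch):
        target_f[i][action] = target_q[i]
    # The loss is now non-zero whenever the prediction for the taken action
    # differs from the TD target.
    return self.model.train_on_batch(s_batch, target_f)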
Answered by tandem on March 24, 2021