Data Science Asked by tandem on March 24, 2021
I am having a hard time understanding why my loss is constantly zero when using DQN. I’m trying to use the gym environment to play the game CartPole-v0.
In the code below, r_batch holds the rewards sampled from the replay buffer, and similarly s_batch, ns_batch, and done_batch hold the sampled states, next states, and done flags (whether the episode has ended). Each time, I sample 4 transitions from the buffer; a sketch of the sampling itself is shown after the snippet.
# This is y_i (the TD target)
target_q = r_batch + self.gamma * np.amax(self.get_target_value(ns_batch), axis=1) * (1 - done_batch)
# This is y (the network's current prediction for s_batch)
target_f = self.model.predict(s_batch)
# This is the gradient descent step
losses = self.model.train_on_batch(s_batch, target_f)
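For reference, the minibatch above comes from something along these lines; this is a simplified sketch with an assumed deque-based buffer and a hypothetical sample_batch helper, not the exact code from my agent:

import random
from collections import deque

import numpy as np

# Buffer of (state, action, reward, next_state, done) tuples (assumed layout).
replay_buffer = deque(maxlen=10000)

def sample_batch(buffer, batch_size=4):
    # Draw a random minibatch and split it into one array per field.
    batch = random.sample(list(buffer), batch_size)
    s_batch, a_batch, r_batch, ns_batch, done_batch = map(np.array, zip(*batch))
    return s_batch, a_batch, r_batch, ns_batch, done_batch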
The values of target_q are:
[ 0.42824322 0.01458293 1. -0.29854858]
The values of target_f are:
[[0.11004215 1.0435755 ]
[0.20311067 2.1085744 ]
[0.413234 4.376865 ]
[0.24785805 2.6716242 ]]
I’m missing something here, and I don’t know what.
OK, a bit more digging led me to this:
The TD target or "target value" gets its name because by updating a Q table or training a NN with it as a ground truth, the estimator will output values in future closer to the supplied value. The estimator "gets closer to the target".
So, I have to update the prediction network with the "ground truth":
# actions are the sampled actions; predict_q is the prediction network's output for
# s_batch (i.e. target_f above), so only the taken-action entries get the TD target.
for i, val in enumerate(actions):
    predict_q[i][val] = target_q[i]
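Putting the pieces together, the whole update step then looks roughly like this; the train_step wrapper is just an illustrative sketch around the snippets above, with self.gamma, self.model, and self.get_target_value as before:

import numpy as np

def train_step(self, s_batch, a_batch, r_batch, ns_batch, done_batch):
    # TD target: y_i = r_i + gamma * max_a' Q_target(s'_i, a') * (1 - done_i)
    target_q = r_batch + self.gamma * np.amax(self.get_target_value(ns_batch), axis=1) * (1 - done_batch)
    # Start from the network's own predictions so untaken actions contribute no error ...
    target_f = self.model.predict(s_batch)
    # ... and overwrite only the entries of the actions that were actually taken.
    for i, action in enumerate(a_batch):
        target_f[i][action] = target_q[i]
    # The loss is now non-zero whenever the prediction for the taken action
    # differs from the TD target.
    return self.model.train_on_batch(s_batch, target_f)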
Answered by tandem on March 24, 2021