How does the target network in double DQNs find the maximum Q value for each action?

Artificial Intelligence Asked on November 7, 2021

I understand that the neural network takes states as inputs and outputs a Q-value for each state-action pair. However, in order to compute the loss and update the weights, we need the maximum Q-value for the next state $s'$. To get that, in the DDQN case, we feed the next state $s'$ into the target network.

What I’m not clear on is: how do we train this target network itself that will help us train the other NN? What is its cost function?

One Answer

Both in DQN and in DDQN, the target network starts as an exact copy of the Q-network: it has the same architecture (layers, input and output dimensions) and the same initial weights as the Q-network.

The main idea of the DQN agent is that the Q-network predicts the Q-values of all actions for a given state, selects the maximum of them, and is trained with a mean squared error (MSE) cost/loss function. That is, it performs gradient descent steps on

$$\left(Y_{t}^{\mathrm{DQN}} - Q\left(s_t, a_t; \boldsymbol{\theta}\right)\right)^2,$$

where the target $Y_{t}^{mathrm{DQN}}$ is defined (in the case of DQN) as

$$ Y_{t}^{\mathrm{DQN}} \equiv R_{t+1} + \gamma \max_{a} Q\left(S_{t+1}, a; \boldsymbol{\theta}_{t}^{-}\right). $$

Here, $\boldsymbol{\theta}$ are the Q-network weights and $\boldsymbol{\theta}^{-}$ are the target network weights.

After a (usually fixed) number of timesteps, the target network updates its weights by copying the weights of the Q-network. So the target network itself is never trained by gradient descent and therefore has no cost function of its own; it only periodically mirrors the Q-network.
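
To make this concrete, here is a minimal sketch (assuming PyTorch, a replay-buffer batch of tensors `s, a, r, s_next, done`, and hypothetical layer sizes) of how the Q-network is trained against the frozen target network, and how the target network is periodically synced rather than trained:

```python
import torch
import torch.nn as nn

# Minimal sketch, assuming PyTorch; the layer sizes below are hypothetical.
q_net = nn.Sequential(nn.Linear(4, 64), nn.ReLU(), nn.Linear(64, 2))
target_net = nn.Sequential(nn.Linear(4, 64), nn.ReLU(), nn.Linear(64, 2))
target_net.load_state_dict(q_net.state_dict())  # target starts as an exact copy

def dqn_loss(s, a, r, s_next, done, gamma=0.99):
    """MSE between Q(s_t, a_t; theta) and the DQN target Y_t."""
    q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)      # Q(s_t, a_t; theta)
    with torch.no_grad():                                     # no gradient flows through the target net
        max_q_next = target_net(s_next).max(dim=1).values     # max_a Q(s_{t+1}, a; theta^-)
        y = r + gamma * (1.0 - done) * max_q_next             # Y_t^DQN (no bootstrap at terminal states)
    return nn.functional.mse_loss(q_sa, y)

# Every C steps the target network simply copies theta; it is never trained itself:
# if step % C == 0:
#     target_net.load_state_dict(q_net.state_dict())
```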

In the case of DDQN, the target is defined as

$$ Y_{t}^{\text{DDQN}} \equiv R_{t+1} + \gamma\, Q\left(S_{t+1}, \underset{a}{\operatorname{argmax}}\, Q\left(S_{t+1}, a; \boldsymbol{\theta}_{t}\right); \boldsymbol{\theta}_{t}^{-}\right). $$

This target decouples the selection of the action (i.e. the argmax part) from its evaluation (i.e. the computation of the Q-value at the next state for the selected action), as stated in the paper that introduced DDQN:

"The max operator in standard Q-learning and DQN, in (2) and (3), uses the same values both to select and to evaluate an action. This makes it more likely to select overestimated values, resulting in overoptimistic value estimates. To prevent this, we can decouple the selection from the evaluation."
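
Along the same lines, here is a hedged sketch of the DDQN target, reusing the hypothetical `q_net` and `target_net` from the previous snippet: the online Q-network selects the action, the target network evaluates it.

```python
import torch

# Minimal sketch of the DDQN target; q_net / target_net are assumed to be
# the hypothetical networks defined in the previous snippet.
def ddqn_target(r, s_next, done, gamma=0.99):
    with torch.no_grad():
        # Selection: argmax over actions with the online Q-network (theta).
        a_star = q_net(s_next).argmax(dim=1, keepdim=True)
        # Evaluation: Q-value of that action from the target network (theta^-).
        q_eval = target_net(s_next).gather(1, a_star).squeeze(1)
        return r + gamma * (1.0 - done) * q_eval              # Y_t^DDQN
```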

Answered by ddaedalus on November 7, 2021
