
Actor Network Target Value in A2C Reinforcement Learning

Data Science Asked by datatech on June 1, 2021

In DQN, we use the equation

$Target = r + \gamma V(s')$

to train (fit) our network. It is easy to understand, since we use the $Target$ value as the dependent variable, just as in supervised learning. I.e., in Python we can train the model like

model.fit(state,target, verbose = 0)

where $r$ is the observed reward and $V(s')$ comes from the model's prediction.
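
A minimal sketch of that computation (hypothetical names: model is a Keras-style network mapping a state to $V(s)$, gamma is the discount factor; this is an illustration, not the exact code behind the question):

import numpy as np

gamma = 0.99  # discount factor (assumed value)

def td_target(model, reward, next_state, done):
    # Bootstrap from the model's own prediction of the next state's value.
    v_next = 0.0 if done else model.predict(next_state[np.newaxis], verbose=0)[0, 0]
    # Target = r + gamma * V(s'), the quantity used as the "label" in model.fit.
    return reward + gamma * v_next

# target = td_target(model, r, s_next, done)
# model.fit(state[np.newaxis], np.array([[target]]), verbose=0)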

When it comes to an A2C network, things become more complicated. Now we have two networks: the Actor and the Critic. It is said that the Critic network is no different from what is done in DQN; the only difference is that it now has a single output neuron. So, similarly, we calculate $Target = r + \gamma V(s')$ after acting via an action sampled from the $\pi(a|s)$ distribution, and we train the model with model.fit(state, target, verbose=0) in Python as well.
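
For instance, one Critic update could look roughly like this (a sketch only: actor and critic are assumed Keras-style models and env a gym-style environment with the classic 4-tuple step API; none of these names come from the question):

import numpy as np

gamma = 0.99

def critic_update(actor, critic, env, state):
    # Sample an action from the Actor's softmax output pi(a|s).
    probs = actor.predict(state[np.newaxis], verbose=0)[0]
    action = np.random.choice(len(probs), p=probs)
    # Act, observe the reward and next state.
    next_state, reward, done, _ = env.step(action)
    # Bootstrapped target r + gamma * V(s') from the Critic itself.
    v_next = 0.0 if done else critic.predict(next_state[np.newaxis], verbose=0)[0, 0]
    target = reward + gamma * v_next
    # Fit the Critic exactly as in the DQN-style case above.
    critic.fit(state[np.newaxis], np.array([[target]]), verbose=0)
    return next_state, action, target, done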

However, the Actor's case is confusing. Now we have another neural network which takes states as input and outputs probabilities via a softmax activation function. That is fine. But the point I am stuck on is the dependent variable used to adjust the Actor network. So in Python,

model2.fit(state,?,verbose=0)

What is the ? value that "supervises" the network to adjust the weights, and WHY?

In several resources I found the Advantage value, which is nothing but $Target - V(s)$.
And there is also something called the actor loss, which is calculated by

$L_{actor} = -\log\pi_\theta(a_t|s_t)\cdot A_t$

Why?

Thanks in advance!

One Answer

The target will still be a form of return estimate ($V(s_t)$, $Q(a_t,s_t)$, the Advantage, an n-step return, etc.). For example, in your case, the $Q_w$ that the Critic estimates.

You will need to review Policy Gradient methods a bit, in this order: the PG Theorem, REINFORCE (an Actor-only method), then AC (Actor-Critic), and then A2C. I will give you a conceptual explanation, abstracted from the math behind it. The general form of the Policy Gradient is:

the log-likelihood of an action given the state, multiplied by a form of return: $\log\pi_{\boldsymbol{\theta}}(a_t|s_t)\cdot R_t$

What this tells us is: maximize the likelihood of a specific action, scaled by the return. In classification we know the correct class (action), but here that is not the case; the only learning signal comes from the reward. Therefore, multiplying the likelihood of an action by a form of return for selecting that action will increase or decrease the probability of selecting that action again in the current state. The PG theorem states that this is the direction in which to update the weights $\theta$ in order to maximize the return.
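
As a tiny illustration of that quantity for a single transition (probs is the Actor's softmax output, action the sampled action, return_t some return estimate; all names are placeholders, not a fixed API):

import numpy as np

def pg_objective(probs, action, return_t):
    # log pi(a_t | s_t) * R_t for one transition; a small epsilon avoids log(0).
    log_likelihood = np.log(probs[action] + 1e-8)
    return log_likelihood * return_t

# Ascending this objective (or descending its negative as a loss) pushes
# probs[action] up when return_t is large and positive, and down when it is negative.
print(pg_objective(np.array([0.1, 0.7, 0.2]), action=1, return_t=2.0))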

In A2C, you can have various implementations: 2 separate networks, or 1 network with 2 separate heads. The Actor is responsible for learning the distribution of actions that maximizes the return given the state. The Critic is responsible for estimating the return from the current state (and action). Thus the loss function in A2C is usually the Policy Gradient loss plus the Mean Squared Error between the expected return and the observed return (plus an entropy term for exploration). Please refer to the original A2C paper for the equations. As you can see, we have 2 main loss functions (one for each network or head).
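
A loose sketch of that combined objective for one transition (the coefficients are implementation choices, not fixed by any paper; probs and value would come from the two heads, and R is a bootstrapped return such as $r + \gamma V(s')$):

import numpy as np

def a2c_loss(probs, action, value, R, value_coef=0.5, entropy_coef=0.01):
    # Advantage is treated as a constant here (no gradient flows through it to the Critic).
    advantage = R - value
    actor_loss = -np.log(probs[action] + 1e-8) * advantage      # policy-gradient term
    critic_loss = (R - value) ** 2                               # MSE between target and V(s)
    entropy = -np.sum(probs * np.log(probs + 1e-8))              # bonus that encourages exploration
    return actor_loss + value_coef * critic_loss - entropy_coef * entropy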

The return, in any form, affects training in 2 ways:

  • Learning signal for the Actor: it multiplies the likelihood of actions and, by doing so, increases or decreases the probability of selecting these actions. This changes the parameters of the policy so that the policy favors rewarding actions.
  • Learning signal for the Critic: it serves as the MSE target.

As you can see, in a loose sense you are doing supervised learning again for both the Actor and the Critic, but with a classification-like loss for the Actor and a regression loss for the Critic. The reason this works comes from the Policy Gradient theorem, which shows that, in order to maximize the return while following a parametrized policy, we need to update the parameters in the direction that maximizes the PG objective (for the Actor).
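
One hedged way to fill in the "?" of model2.fit with a stock Keras setup: compile the Actor with categorical cross-entropy and weight each sample by its advantage. Cross-entropy against a one-hot action is $-\log\pi(a|s)$, so the weighted loss becomes $-A\cdot\log\pi(a|s)$, i.e. the policy-gradient term. The names below are placeholders, not code from the question:

import numpy as np

def actor_fit_targets(action, advantage, n_actions):
    # One-hot of the action actually taken, plus its advantage as a sample weight.
    y = np.zeros((1, n_actions))
    y[0, action] = 1.0
    return y, np.array([advantage])

one_hot, weight = actor_fit_targets(action=2, advantage=1.7, n_actions=4)
# model2.compile(optimizer="adam", loss="categorical_crossentropy")
# model2.fit(state[np.newaxis], one_hot, sample_weight=weight, verbose=0)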

Please note that, for various reasons, using the raw return might not be such a good idea. That's why you will see the PG loss in various forms (e.g. the policy likelihood multiplied by the advantage $A_t = R_t - V(s_t)$, by $Q(a_t,s_t)$, etc.).

Correct answer by Constantinos on June 1, 2021
