Data Science Asked on May 14, 2021
I am reading through the paper Reinforcement Learning and Control as Probabilistic Inference: Tutorial and Review by Sergey Levine. I am having difficulty understanding this part of the derivation of Maximum Entropy Policy Gradients (Section 4.1).
Note that in the above derivation, the term $\mathcal{H}(q_\theta(a_t|s_t))$ should have been $\log q_\theta(a_t|s_t)$, where $\log$ denotes the natural logarithm. In the first line of the gradient, it should have been $r(s_t,a_t) - \log q_\theta(a_t|s_t)$.
In particular, I cannot understand how the second summation from $t'=t$ to $t'=T$ in the second line arises in the derivation. I managed to derive the formula by expanding the definition of expectation, but the result I got does not have the second summation term. Can anyone give me an idea of where this second summation comes from mathematically?
Writing the objective in a different way may be helpful.
If we focus only on the reward part and ignore the entropy part (since the question is about the second summation), the original objective is $$J(\theta)=\sum_{t=1}^T\mathbb{E}_{(s_t,a_t)\sim q(s_t,a_t)}[r(s_t,a_t)]$$
If we look at it from the perspective of the full trajectory $\tau$, as with the ELBO in Eq. 19 of the paper, the objective is $$J(\theta)=\mathbb{E}_{\tau\sim q_\theta(\tau)}\left[\sum_{t=1}^T r(s_t,a_t)\right]$$
Then $$\begin{aligned} \nabla_\theta J(\theta)&=\nabla_\theta\sum_\tau q_\theta(\tau)\sum_{t=1}^T r(s_t,a_t)\\ &=\sum_\tau \nabla_\theta q_\theta(\tau)\sum_{t=1}^T r(s_t,a_t)\\ &=\sum_\tau q_\theta(\tau)\nabla_\theta \log q_\theta(\tau)\sum_{t=1}^T r(s_t,a_t)\\ &=\mathbb{E}_{\tau\sim q_\theta(\tau)}\left[\nabla_\theta \log q_\theta(\tau)\sum_{t=1}^T r(s_t,a_t)\right] \end{aligned}$$
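The third line uses the log-derivative (score function) trick, $\nabla_\theta q_\theta = q_\theta \nabla_\theta \log q_\theta$. As a quick sanity check (my own toy example, not from the paper), here is a minimal sketch that verifies the resulting identity $\nabla_\theta\mathbb{E}_{x\sim q_\theta}[f(x)]=\mathbb{E}_{x\sim q_\theta}[f(x)\,\nabla_\theta\log q_\theta(x)]$ numerically for a 1-D Gaussian:

```python
# Toy check of the log-derivative trick (assumed example, not from the paper):
# q_theta = N(theta, 1), f(x) = x^2, so E[f] = theta^2 + 1 and its gradient is 2*theta.
import numpy as np

rng = np.random.default_rng(0)
theta = 1.5
x = rng.normal(theta, 1.0, size=200_000)   # samples from q_theta
f = x ** 2                                  # "return" of each sample
score = x - theta                           # d/dtheta log N(x; theta, 1)

grad_estimate = np.mean(f * score)          # E_q[f(x) * grad_theta log q_theta(x)]
print(grad_estimate, 2 * theta)             # both are close to 3.0
```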
Since $q_\theta(\tau)=q(s_1)\prod_{t=1}^T q(s_{t+1}|s_t,a_t)\, q_\theta(a_t|s_t)$, and the initial-state and dynamics terms do not depend on $\theta$ (so their gradients vanish), $$\nabla_\theta \log q_\theta(\tau)=\sum_{t=1}^T \nabla_\theta \log q_\theta(a_t|s_t)$$
Then $$\nabla_\theta J(\theta)=\mathbb{E}_{\tau\sim q_\theta(\tau)}\left[\sum_{t=1}^T \nabla_\theta \log q_\theta(a_t|s_t) \sum_{t'=1}^T r(s_{t'},a_{t'})\right]$$
That is where the second summation comes from.
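To make the double sum concrete, here is a minimal Monte Carlo sketch of that estimator (the array shapes and function name are hypothetical choices of mine, not code from the paper): for each sampled trajectory, the sum of per-step score terms is multiplied by the full-trajectory return, which is exactly the second summation.

```python
# Sketch of the estimator above; grad_log_pi and rewards are assumed inputs.
import numpy as np

def pg_estimate_full_return(grad_log_pi, rewards):
    """grad_log_pi: (N, T, P) array of grad_theta log q_theta(a_t | s_t) per step,
    rewards:      (N, T) array of r(s_t, a_t) for N sampled trajectories.
    Returns a (P,) gradient estimate."""
    returns = rewards.sum(axis=1)                      # sum_{t'} r(s_t', a_t'), shape (N,)
    scores = grad_log_pi.sum(axis=1)                   # sum_t grad log q_theta(a_t|s_t), shape (N, P)
    return (scores * returns[:, None]).mean(axis=0)    # average over trajectories
```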
Since the action at time $t$ cannot affect rewards at earlier times $t'<t$ (causality), and those earlier-reward terms have zero expectation (because $\mathbb{E}_{a_t\sim q_\theta(\cdot|s_t)}[\nabla_\theta \log q_\theta(a_t|s_t)]=0$), they can be dropped: $$\begin{aligned} \nabla_\theta J(\theta)&=\mathbb{E}_{\tau\sim q_\theta(\tau)}\left[\sum_{t=1}^T \nabla_\theta \log q_\theta(a_t|s_t) \sum_{t'=t}^T r(s_{t'},a_{t'})\right]\\ &=\sum_{t=1}^T \mathbb{E}_{(s_t,a_t)\sim q(s_t,a_t)}\left[\nabla_\theta \log q_\theta(a_t|s_t) \sum_{t'=t}^T r(s_{t'},a_{t'})\right] \end{aligned}$$
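Computationally, the causality simplification just replaces the full return with the reward-to-go from each time step. A minimal sketch of that variant, with the same assumed array shapes as above:

```python
# Reward-to-go variant of the sketch above (assumed shapes: grad_log_pi (N, T, P), rewards (N, T)).
import numpy as np

def pg_estimate_reward_to_go(grad_log_pi, rewards):
    """Returns a (P,) gradient estimate using sum_{t'=t}^T r(s_t', a_t') as the weight at step t."""
    # Reversed cumulative sum gives the reward-to-go at every time step, shape (N, T).
    rtg = np.flip(np.cumsum(np.flip(rewards, axis=1), axis=1), axis=1)
    # Weight each per-step score by its reward-to-go, sum over time, average over trajectories.
    return (grad_log_pi * rtg[..., None]).sum(axis=1).mean(axis=0)
```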
Answered by Wuops on May 14, 2021