Data Science Asked by marlineer43 on August 2, 2021
I’m making my way through Sutton and Barto’s *Reinforcement Learning: An Introduction*. The book gives the definition of the $q_*$ function as follows:
$$
q_*(a) = \mathbf{E}[R_t \mid A_t = a]
$$
where $A_t$ is the action taken at time t and $R_t$ is the reward associated with taking $A_t$. From my understanding, $q_*$ represents the true value of taking action $a$, which is the mean reward when $a$ is selected.
But I’m confused about why $t$ is included in this equation at all. Should $q_*(a)$ really be $q_*(a, t)$? Or are we to understand $q_*$ as taking the expected reward across all $t$?
The reward for action $a$ is drawn from a stationary probability distribution with mean $q_*(a)$. This distribution is independent of the time $t$. However, the *estimate* of $q_*(a)$ at time $t$, denoted $Q_t(a)$, does depend on $t$.
> Or are we to understand $q_*$ as taking the expected reward across all $t$?
The expectation is not over time, but over the reward distribution, which has mean $q_*(a)$.
For example, in the 10-armed bandit problem, the reward for each of the 10 actions comes from a normal distribution with mean $q_*(a)$, $a = 1, \dots, 10$, and variance 1.
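To make the distinction concrete, here is a minimal sketch of that 10-armed testbed (the seed, pull count, and variable names are my own choices, not from the book): each arm's true value $q_*(a)$ is fixed once and never changes with $t$, while the sample-average estimate $Q_t(a)$ is computed from observed rewards and converges toward $q_*(a)$ as more rewards are seen.

```python
import random

random.seed(42)

# 10-armed bandit testbed: each action's true value q_*(a) is itself
# drawn from N(0, 1), and the reward for action a is drawn from
# N(q_*(a), 1). The reward distribution depends only on a, never on t.
q_star = [random.gauss(0.0, 1.0) for _ in range(10)]

def reward(a):
    # Stationary reward distribution for arm a.
    return random.gauss(q_star[a], 1.0)

# Sample-average estimates Q_t(a) for each arm after many pulls.
pulls = 5000
estimates = [sum(reward(a) for _ in range(pulls)) / pulls for a in range(10)]

# Each estimate should land close to the corresponding (unchanging)
# true value q_*(a).
errors = [abs(est, ) if False else abs(est - q) for est, q in zip(estimates, q_star)]
print(max(errors))
```

Here $q_*(a)$ is the fixed list `q_star`, and `estimates` plays the role of $Q_t(a)$ at $t = 5000$: the time index matters only for how good the estimate is, not for the quantity being estimated.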
Correct answer by vineet gundecha on August 2, 2021