Data Science Asked by marlineer43 on August 2, 2021
I’m making my way through Sutton and Barto’s *Reinforcement Learning: An Introduction*. The book gives the definition of the $q_*$ function as follows:
$$
q_*(a) = \mathbf{E}[R_t \mid A_t = a]
$$
where $A_t$ is the action taken at time t and $R_t$ is the reward associated with taking $A_t$. From my understanding, $q_*$ represents the true value of taking action $a$, which is the mean reward when $a$ is selected.
But I’m confused about why $t$ is included in this equation at all. Should $q_*(a)$ really be $q_*(a, t)$? Or are we to understand $q_*$ as taking the expected reward across all $t$?
The reward for action $a$ is drawn from a stationary probability distribution with mean $q_*(a)$. This distribution is independent of the time $t$. However, the *estimate* of $q_*(a)$ at time $t$, denoted $Q_t(a)$, does depend on $t$.
> Or are we to understand $q_*$ as taking the expected reward across all $t$?
The expectation is not over time, but over the reward distribution, which has mean $q_*(a)$.
For example, in the 10-armed bandit problem, the reward for each of the 10 actions comes from a normal distribution with mean $q_*(a)$, $a = 1, \dots, 10$, and variance 1.
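To make the distinction concrete, here is a minimal sketch of that 10-armed testbed (the seed, pull count, and variable names are my own choices, not from the book): each arm's true value $q_*(a)$ is fixed once and never changes with $t$, while the sample-average estimate $Q_t(a)$ is computed from observed rewards and converges toward $q_*(a)$ as more rewards are seen.

```python
import random

random.seed(42)

# 10-armed bandit testbed: each action's true value q_*(a) is itself
# drawn from N(0, 1), and the reward for action a is drawn from
# N(q_*(a), 1). The reward distribution depends only on a, never on t.
q_star = [random.gauss(0.0, 1.0) for _ in range(10)]

def reward(a):
    # Stationary reward distribution for arm a.
    return random.gauss(q_star[a], 1.0)

# Sample-average estimates Q_t(a) for each arm after many pulls.
pulls = 5000
estimates = [sum(reward(a) for _ in range(pulls)) / pulls for a in range(10)]

# Each estimate should land close to the corresponding (unchanging)
# true value q_*(a).
errors = [abs(est, ) if False else abs(est - q) for est, q in zip(estimates, q_star)]
print(max(errors))
```

Here $q_*(a)$ is the fixed list `q_star`, and `estimates` plays the role of $Q_t(a)$ at $t = 5000$: the time index matters only for how good the estimate is, not for the quantity being estimated.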
Correct answer by vineet gundecha on August 2, 2021