Data Science Asked by James K J on August 23, 2020
I came across the paragraphs below, which I believe answer the question of why infinite sampling is not a realistic assumption in most real applications. Still, I don't get the explanation. When we draw more samples from the environment, doesn't Monte Carlo bring the approximate value function closer to the true value function? Then why is infinite sampling not considered a realistic assumption?
We made two unlikely assumptions above in order to easily obtain this guarantee of convergence for the Monte Carlo method. One was that the episodes have exploring starts, and the other was that policy evaluation could be done with an infinite number of episodes. To obtain a practical algorithm we will have to remove both assumptions. We postpone consideration of the first assumption until later in this chapter.
For now we focus on the assumption that policy evaluation operates on an infinite number of episodes. This assumption is relatively easy to remove. In fact, the same issue arises even in classical DP methods such as iterative policy evaluation, which also converge only asymptotically to the true value function. In both DP and Monte Carlo cases there are two ways to solve the problem. One is to hold firm to the idea of approximating $q_{\pi_k}$ in each policy evaluation. Measurements and assumptions are made to obtain bounds on the magnitude and probability of error in the estimates, and then sufficient steps are taken during each policy evaluation to assure that these bounds are sufficiently small. This approach can probably be made completely satisfactory in the sense of guaranteeing correct convergence up to some level of approximation. However, it is also likely to require far too many episodes to be useful in practice on any but the smallest problems.
The second approach to avoiding the infinite number of episodes nominally required for policy evaluation is to forgo trying to complete policy evaluation before returning to policy improvement. On each evaluation step we move the value function toward $q_{\pi_k}$, but we do not expect to actually get close except over many steps. We used this idea when we first introduced the idea of GPI in Section 4.6. One extreme form of the idea is value iteration, in which only one iteration of iterative policy evaluation is performed between each step of policy improvement. The in-place version of value iteration is even more extreme; there we alternate between improvement and evaluation steps for single states.
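To make the second approach concrete, here is a minimal sketch of Monte Carlo control in this truncated-evaluation style. The toy chain MDP, the hyperparameters, and the use of an ε-greedy policy in place of exploring starts are assumptions for illustration, not part of the quoted text: after every single episode the action-value estimates are updated from the observed returns, and the policy is improved immediately, instead of running policy evaluation to convergence first.

```python
import numpy as np
from collections import defaultdict

# Sketch under assumed toy dynamics: a 4-state chain where action 1 moves right
# (reward 1 on reaching the last state) and action 0 stays put.
n_states, n_actions, gamma, epsilon = 4, 2, 0.9, 0.1
rng = np.random.default_rng(0)

def step(state, action):
    next_state = min(state + 1, n_states - 1) if action == 1 else state
    reward = 1.0 if next_state == n_states - 1 else 0.0
    return next_state, reward, next_state == n_states - 1

Q = defaultdict(float)     # action-value estimates Q(s, a)
counts = defaultdict(int)  # visit counts for incremental sample averages

def act(state):
    # epsilon-greedy w.r.t. the current Q (stands in for exploring starts)
    if rng.random() < epsilon:
        return int(rng.integers(n_actions))
    return int(np.argmax([Q[(state, a)] for a in range(n_actions)]))

for _ in range(500):
    # generate ONE episode with the current (epsilon-greedy) policy
    state, episode, done = 0, [], False
    while not done:
        action = act(state)
        next_state, reward, done = step(state, action)
        episode.append((state, action, reward))
        state = next_state

    # one truncated evaluation step: update Q from this episode's returns only
    G = 0.0
    for state, action, reward in reversed(episode):
        G = reward + gamma * G
        counts[(state, action)] += 1
        Q[(state, action)] += (G - Q[(state, action)]) / counts[(state, action)]
    # policy improvement happens implicitly: `act` is already greedy w.r.t. the updated Q

greedy = [int(np.argmax([Q[(s, a)] for a in range(n_actions)])) for s in range(n_states - 1)]
print(greedy)  # should settle on [1, 1, 1]: always move right in this toy chain
```

Each pass through the outer loop is one cycle of GPI: a single, incomplete evaluation step followed by an immediate improvement step, rather than an infinite number of episodes per evaluation.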
It is not a realistic assumption because you don't have infinite time (or infinite numerical precision) to find the exactly correct value function. But you don't need that anyway: a rough estimate of it is enough to improve the policy.
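As a toy illustration of that point (the numbers are made up): greedy policy improvement depends only on the ordering of the action values in each state, so a rough Monte Carlo estimate will usually pick out the same greedy action as the exact values would.

```python
import numpy as np

rng = np.random.default_rng(0)
q_exact = np.array([1.0, 2.5, 0.3])            # hypothetical exact action values for one state
q_rough = q_exact + rng.normal(0.0, 0.2, 3)    # noisy estimate from a handful of episodes

# Despite the estimation error, the greedy (argmax) action is almost certainly unchanged.
print(np.argmax(q_exact), np.argmax(q_rough))
```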
If your question is why you can't let the agent learn indefinitely in real applications, I'm guessing it is because it may be expensive or dangerous to let it explore randomly in a real scenario, so you want to deploy it with an optimal or near-optimal deterministic policy.
Answered by nestor556 on August 23, 2020