Most reinforcement learning agents are trained in simulated environments. The goal is to maximize performance in (often) the same environment, preferably with a minimal number of interactions. Having a good model of the environment makes planning possible and can thus drastically improve sample efficiency!
Why is the simulation not used for planning in these cases? It is a sampling model of the environment, right? Can’t we try multiple actions at each or some states, follow the current policy to look several steps ahead and finally choose the action with the best outcome? Shouldn’t this allow us to find better actions more quickly compared to policy gradient updates?
In this case, our environment and the model are more or less identical, and this seems to be the problem. Or is the good old curse of dimensionality to blame again? Please help me figure out what I’m missing.
> Shouldn't this allow us to find better actions more quickly compared to policy gradient updates?
It depends on the nature of the simulation. If the simulation models a car as a rigid body moving in a plane with three degrees of freedom $(x, y, \theta)$ (assuming it doesn't hit anything and get launched vertically), the three ordinary differential equations of rigid-body motion can be solved quite quickly. Compare that with a simulation used to find the path of least resistance for a ship on a wavy sea, where the equations of fluid dynamics must be solved, which requires a huge amount of resources. Yes, the response time needed for a ship is much longer than for a car, but computing its path predictively still takes a huge amount of computational power.
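To make the cheap case concrete, here is a minimal sketch of planning by rollouts when the simulator is a planar car. Everything in it is an assumption chosen for illustration and not something from the question: the kinematic (unicycle) model, the explicit Euler integration, the reward (negative distance to a goal point), and the names `car_step`, `rollout_return` and `plan`. The planner tries each candidate first action, follows the current policy for the rest of the horizon, and counts how many simulator steps that costs.

```python
import math

def car_step(state, action, dt=0.05):
    """One explicit-Euler step of an assumed planar kinematic car model.

    state  = (x, y, theta): position and heading
    action = (v, omega):    forward speed and turn rate
    """
    x, y, theta = state
    v, omega = action
    return (x + dt * v * math.cos(theta),
            y + dt * v * math.sin(theta),
            theta + dt * omega)

def rollout_return(state, first_action, policy, goal, horizon):
    """Apply first_action once, then follow policy; return (return, sim steps)."""
    total, steps = 0.0, 0
    action = first_action
    for _ in range(horizon):
        state = car_step(state, action)
        steps += 1
        # Hypothetical reward: negative Euclidean distance to the goal.
        total -= math.hypot(goal[0] - state[0], goal[1] - state[1])
        action = policy(state)
    return total, steps

def plan(state, candidate_actions, policy, goal, horizon=20):
    """Pick the candidate first action with the best rollout return.

    Also returns the total number of simulator steps spent on this decision.
    """
    best_action, best_return, sim_steps = None, -math.inf, 0
    for action in candidate_actions:
        ret, steps = rollout_return(state, action, policy, goal, horizon)
        sim_steps += steps
        if ret > best_return:
            best_action, best_return = action, ret
    return best_action, sim_steps

if __name__ == "__main__":
    go_straight = lambda s: (1.0, 0.0)               # stand-in "current policy"
    candidates = [(1.0, -1.0), (1.0, 0.0), (1.0, 1.0)]
    action, cost = plan((0.0, 0.0, 0.0), candidates, go_straight, goal=(1.0, 1.0))
    print("chosen action:", action, "simulator steps used:", cost)
```

A single decision here costs (number of candidates) × (horizon) simulator steps, so 3 × 20 = 60 calls to `car_step` in the usage above. That is negligible when each step is three arithmetic updates, but the same loop wrapped around a fluid dynamics solver multiplies an already expensive step by the same factor.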
Answered by tmaric on August 9, 2020