How exactly does DQN learn?

Question

I created my custom environment in gym, which is a maze. I use a DQN model with BoltzmannQPolicy.
It trains fine with the following variables:

position of the agent
distance from the endpoint
position of the endpoint
which directions can it move to

So I don't give it an image or anything. If I train and test it in the same environment (the same maze, without changing the position of walls) it can solve it easily. But if I introduce it to a completely different environment (maze) without the training then it doesn't know what to do. I don't know if the problem is with my code, or DQN is just for solving the same environment.
Which algorithm should I use instead?

Neil Slater · Answer

The way you have set your DQN up, it is designed to solve just one maze at a time. It has not (and cannot) learn to solve mazes in general, because it has no access to data about the layout of the maze, and a basic DQN agent has no capability to memorise layout seen so far.
You could view the training process as general algorithm for "solving the maze". That is, given a new maze, you can take your written agent, run it for a while in training mode and it will produce a policy that solves that maze. You may be able to tune or adjust the algorithm so that it does it the most efficiently in terms of time or number of steps. It will not be able to do so as well as a hard-coded maze-solving algorithm, because DQN and Q-learning is a very generic learning algorithm that takes no advantage of consisent logic seen in mazes.
If you want to train a more generic maze-solving agent, this becomes a harder problem to solve using reinforcement learning, but it is achievable. There are two key - and linked - assumptions of the theory behind Q learning that you would need to address before you could make a general maze solver that coped with variations such s moving the walls, but without needing to be re-trained:

The Markov property is assumed for the state variables, which means that the state description contains all the relevant information about future state transitions and rewards. Without some way to know where the walls are, or at least note the position of walls it has seen before, the agent does not have access to a state representation with the Markov property.

There is at least one deterministic optimal policy. If you are intending to withold information from the agent (and not give it capabilities such as memory in order to construct missing information), then there may not be a deterministic optimal policy.

So it is possible to use reinforcement learning to teach an agent about solving mazes in general, and gain the ability to attempt to navigate out of a new unseen before maze. If that is of interest to you, you need to first decide what capabilities and knowledge about the environment the agent is allowed to observe. E.g. what specifically interests you about writing a maze solver?
Which approaches you could use to make a more general maze solver depends critically on how you want to frame the problem. For instance it is entirely valid to give the solver an image of the maze as input if you want the agent to learn to solve mazes in a similar manner to a human solving a printed puzzle.
There is one common factor that you will need for training a more generic maze solver - you will want to train on many example mazes taken from the population of all possible mazes. Training on just one maze as an example will typically be about as successful as training a supervised classifier or regression task on one example. Adding many more mazes, or better a maze generator, and training on many of them is likely to make training time much longer than before.

Constantinos · Answer

The only thing that you need to do is to start your agent and the goal/end at a random (non-overlapping) location. You can try your setup initially with an empty grid (no walls). If DQN learns, your set up is good and you can start introducing obstacles into the grid. Gradually, the agent will start associating the end location inputs as something rewarding and will learn to navigate.
Things to consider (thus my suggestion to run your algorithm first in a 10x10 empty grid with random initial locations):

Input feature space: Should better be at the same scale.
The Boltzmann's policy temperature should anneal throughout training in order to explore at the beginning (DQN will collect it in the memory buffer) and exploit later on.
In the maze scenario, random locations especially ones that the agent and the goal are quite close in space, will help DQN to "realize" the rewarding target soon enough. Therefore, starting with 0/few obstacles with random initial locations for all entities on the grid will act as a form of curriculum learning for your agent.

How exactly does DQN learn?

2 Answers

Add your own answers!

Ask a Question