I have a question about the way AlphaGo Zero is trained.
From the original AlphaGo Zero paper, I know that the AlphaGo Zero agent learns a policy and a value function from the gathered data $(s_t, \pi_t, z_t)$, where $z_t = r_T \in \{-1, +1\}$.
However, the fact that the agent tries to learn a policy distribution when $z_t = -1$ seems to be counter-intuitive (at least to me).
My assertion is that the agent should not learn the policy distribution from the games it loses (i.e., those where $z_t = -1$), since such a policy will guide it towards losing.
Have I missed some principle here, or is my assertion actually reasonable?
Intuitively, I think there's definitely something to be said for your idea, but it's not a 100% clear-cut case, and there are also some arguments for training the policy on data where $z_t = -1$ as well.
First, let's establish that if we do indeed choose to discard any and all data where $z_t = -1$, we are discarding a really significant part of our data: 50% of everything we generate in games like Go that have no draws (less than that in games like Chess, where there are many draws, but still a lot). So this is not a decision to be made lightly (it has a major impact on our sample efficiency), and we should probably only do it if we really believe that policy learning from data where $z_t = -1$ is actually harmful.
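To make that cost concrete, here is a minimal, purely illustrative sketch (the `fake_selfplay_game` helper and all names are my own invention, not from the paper or any AlphaZero codebase) of what discarding every sample with $z_t = -1$ does to a replay buffer in a draw-free game:

```python
# Purely illustrative: how much data is lost if we drop all samples with z = -1.
# Each sample is (state, mcts_policy_target, z), where z = +1 for positions
# stored from the eventual winner's perspective and z = -1 from the loser's.
import random

def fake_selfplay_game(game_id, length=200):
    winner = random.choice([+1, -1])  # who wins this simulated game
    return [
        (f"state_{game_id}_{t}", [0.5, 0.5],      # dummy state / policy target
         winner if t % 2 == 0 else -winner)       # alternating perspectives
        for t in range(length)
    ]

buffer = [sample for g in range(100) for sample in fake_selfplay_game(g)]
kept = [s for s in buffer if s[2] == +1]          # discard every z = -1 sample

# In a game with no draws, every finished game contributes data for both the
# winner and the loser, so roughly half the buffer disappears.
print(f"kept {len(kept)} of {len(buffer)} samples ({len(kept) / len(buffer):.0%})")
```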
The primary idea behind the self-play learning process in AlphaGo Zero / AlphaZero can intuitively be explained as:

1. The combination of MCTS + the current policy network $\pi_t$ plays stronger moves than the raw network $\pi_t$ would on its own; the search acts as a policy improvement operator.
2. We train the raw policy to more closely resemble the improved, search-derived distribution (the normalised root visit counts), so that the raw network becomes stronger, which in turn makes the MCTS + network combination stronger again, and so on.
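As a rough sketch of what point 2 means in terms of training targets: the policy target $\pi_t$ is built from the root visit counts, and the per-sample loss from the paper, $(z - v)^2 - \boldsymbol{\pi}^\top \log \mathbf{p}$ (plus L2 regularisation, omitted here), only uses $z_t$ in the value term. The function and variable names below are my own illustration, not DeepMind's implementation:

```python
# Sketch of the per-sample AlphaGo Zero training targets (illustrative only).
import numpy as np

def policy_target_from_visits(visit_counts):
    """pi_t: normalised MCTS visit counts at the root."""
    visits = np.asarray(visit_counts, dtype=np.float64)
    return visits / visits.sum()

def alphazero_loss(z, v, pi, p, eps=1e-12):
    """(z - v)^2 - pi . log p  (L2 weight decay omitted).

    z  : game outcome in {-1, +1} from the player-to-move's perspective
    v  : value head prediction for this state
    pi : policy target from MCTS visit counts
    p  : policy head prediction (probability distribution over moves)
    """
    value_loss = (z - v) ** 2
    policy_loss = -np.sum(pi * np.log(np.asarray(p, dtype=np.float64) + eps))
    return value_loss + policy_loss

pi = policy_target_from_visits([120, 40, 25, 15])   # MCTS favoured move 0
print(alphazero_loss(z=-1.0, v=0.3, pi=pi, p=[0.4, 0.3, 0.2, 0.1]))
print(alphazero_loss(z=+1.0, v=0.3, pi=pi, p=[0.4, 0.3, 0.2, 0.1]))
```

Note that the policy cross-entropy term is identical whether $z_t = -1$ or $z_t = +1$; only the value target changes sign.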
Of course, there can be exceptions to point 1 if we get unlucky, but on average we expect it to hold. Crucially for your question, we don't expect this to be true only in games that we actually won, but also in games that we ultimately end up losing. Even if we still lose the game played according to the MCTS search, we expect that we at least put up a slightly better fight with the MCTS + $\pi_t$ combo than we would have with just $\pi_t$, so it may still be useful to learn from it (to at least lose less badly).
On top of this, it is important to consider that we intentionally build exploration mechanisms into the self-play training process, which may "pollute" the signal $z_t$ without polluting the training target for the policy. In self-play, we do not always pick the action with the maximum visit count (as we would in an evaluation match or an important tournament game); instead, we pick actions proportionally to the MCTS visit counts. This is done for exploration, to introduce extra variety into the experience we generate and make sure we do not always learn from exactly the same games. It can clearly affect the $z_t$ signal (because sometimes we knowingly make a very bad move purely for the sake of exploration), but it does not affect the policy training targets encountered throughout that game; MCTS still tries to make the best of the situations it faces. So these policy training targets are still likely to be useful, even if we "intentionally" made a mistake somewhere along the way that caused us to lose the game.
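As a small sketch of the two move-selection rules just described (names are illustrative, and the exact temperature schedule used in self-play is not reproduced here):

```python
# Illustrative sketch: greedy evaluation play vs. exploratory self-play.
import numpy as np

rng = np.random.default_rng(0)

def select_move_evaluation(visit_counts):
    """Evaluation / tournament play: greedily pick the most-visited move."""
    return int(np.argmax(visit_counts))

def select_move_selfplay(visit_counts):
    """Self-play training: sample a move proportionally to MCTS visit counts.

    A rarely-visited (probably bad) move can still get played, which may flip
    the final outcome z_t, but the stored policy target pi_t (the normalised
    visit counts) is unaffected by which move we happen to sample.
    """
    visits = np.asarray(visit_counts, dtype=np.float64)
    probs = visits / visits.sum()
    return int(rng.choice(len(probs), p=probs))

visits = [120, 40, 25, 15]
print(select_move_evaluation(visits))                      # always move 0
print([select_move_selfplay(visits) for _ in range(10)])   # occasionally others
```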
Correct answer by Dennis Soemers on August 24, 2021