I have a question about the way AlphaGo Zero is trained.
From the original AlphaGo Zero paper, I know that the AlphaGo Zero agent learns a policy and a value function from the gathered data $(s_t, \pi_t, z_t)$, where $z_t = r_T \in \{-1, +1\}$.
However, the fact that the agent tries to learn a policy distribution when $z_t = -1$ seems to be counter-intuitive (at least to me).
My assertion is that the agent should not learn the policy distribution from the games it loses (i.e., those where $z_t = -1$), since such a policy will guide it towards losing.
Have I missed some principle here, or is my assertion actually reasonable?
Intuitively, I think there's definitely something to be said for your idea, but it's not a 100% clear-cut case, and there are also some arguments for training the policy on data where $z_t = -1$ as well.
First, let's establish that if we do indeed choose to discard any and all data where $z_t = -1$, we are discarding a really significant part of our data: 50% of everything we generate in games like Go that have no draws (less than that in games like Chess, where there are many draws, but still a lot). So this is not a decision to be made lightly (it has a major impact on our sample efficiency), and we should probably only do it if we really believe that policy learning from data where $z_t = -1$ is actually harmful.
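To make that cost concrete, here is a minimal, purely illustrative sketch (the `fake_selfplay_game` helper and all names are my own invention, not from the paper or any AlphaZero codebase) of what discarding every sample with $z_t = -1$ does to a replay buffer in a draw-free game:

```python
# Purely illustrative: how much data is lost if we drop all samples with z = -1.
# Each sample is (state, mcts_policy_target, z), where z = +1 for positions
# stored from the eventual winner's perspective and z = -1 from the loser's.
import random

def fake_selfplay_game(game_id, length=200):
    winner = random.choice([+1, -1])  # who wins this simulated game
    return [
        (f"state_{game_id}_{t}", [0.5, 0.5],      # dummy state / policy target
         winner if t % 2 == 0 else -winner)       # alternating perspectives
        for t in range(length)
    ]

buffer = [sample for g in range(100) for sample in fake_selfplay_game(g)]
kept = [s for s in buffer if s[2] == +1]          # discard every z = -1 sample

# In a game with no draws, every finished game contributes data for both the
# winner and the loser, so roughly half the buffer disappears.
print(f"kept {len(kept)} of {len(buffer)} samples ({len(kept) / len(buffer):.0%})")
```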
The primary idea behind the self-play learning process in AlphaGo Zero / AlphaZero can intuitively be explained as:

1. The combination of MCTS + the current policy network $\pi_t$ plays stronger moves than the raw network $\pi_t$ would on its own; the search acts as a policy improvement operator.
2. We train the raw policy to more closely resemble the improved, search-derived distribution (the normalised root visit counts), so that the raw network becomes stronger, which in turn makes the MCTS + network combination stronger again, and so on.
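As a rough sketch of what point 2 means in terms of training targets: the policy target $\pi_t$ is built from the root visit counts, and the per-sample loss from the paper, $(z - v)^2 - \boldsymbol{\pi}^\top \log \mathbf{p}$ (plus L2 regularisation, omitted here), only uses $z_t$ in the value term. The function and variable names below are my own illustration, not DeepMind's implementation:

```python
# Sketch of the per-sample AlphaGo Zero training targets (illustrative only).
import numpy as np

def policy_target_from_visits(visit_counts):
    """pi_t: normalised MCTS visit counts at the root."""
    visits = np.asarray(visit_counts, dtype=np.float64)
    return visits / visits.sum()

def alphazero_loss(z, v, pi, p, eps=1e-12):
    """(z - v)^2 - pi . log p  (L2 weight decay omitted).

    z  : game outcome in {-1, +1} from the player-to-move's perspective
    v  : value head prediction for this state
    pi : policy target from MCTS visit counts
    p  : policy head prediction (probability distribution over moves)
    """
    value_loss = (z - v) ** 2
    policy_loss = -np.sum(pi * np.log(np.asarray(p, dtype=np.float64) + eps))
    return value_loss + policy_loss

pi = policy_target_from_visits([120, 40, 25, 15])   # MCTS favoured move 0
print(alphazero_loss(z=-1.0, v=0.3, pi=pi, p=[0.4, 0.3, 0.2, 0.1]))
print(alphazero_loss(z=+1.0, v=0.3, pi=pi, p=[0.4, 0.3, 0.2, 0.1]))
```

Note that the policy cross-entropy term is identical whether $z_t = -1$ or $z_t = +1$; only the value target changes sign.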
Of course, there can be exceptions to point 1 if we get unlucky, but on average we expect it to hold. Crucially for your question, we don't expect this to be true only in games that we actually won, but also in games that we ultimately end up losing. Even if we still lose the game played according to the MCTS search, we expect that we at least put up a slightly better fight with the MCTS + $\pi_t$ combo than we would have with just $\pi_t$, so it may still be useful to learn from it (to at least lose less badly).
On top of this, it is important to consider that we intentionally build exploration mechanisms into the self-play training process, which may "pollute" the signal $z_t$ without polluting the training target for the policy. In self-play, we do not always pick the action with the maximum visit count (as we would in an evaluation match or an important tournament game); instead, we pick actions proportionally to the MCTS visit counts. This is done for exploration, to introduce extra variety into the experience we generate and make sure we do not always learn from exactly the same games. It can clearly affect the $z_t$ signal (because sometimes we knowingly make a very bad move purely for the sake of exploration), but it does not affect the policy training targets encountered throughout that game; MCTS still tries to make the best of the situations it faces. So these policy training targets are still likely to be useful, even if we "intentionally" made a mistake somewhere along the way that caused us to lose the game.
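As a small sketch of the two move-selection rules just described (names are illustrative, and the exact temperature schedule used in self-play is not reproduced here):

```python
# Illustrative sketch: greedy evaluation play vs. exploratory self-play.
import numpy as np

rng = np.random.default_rng(0)

def select_move_evaluation(visit_counts):
    """Evaluation / tournament play: greedily pick the most-visited move."""
    return int(np.argmax(visit_counts))

def select_move_selfplay(visit_counts):
    """Self-play training: sample a move proportionally to MCTS visit counts.

    A rarely-visited (probably bad) move can still get played, which may flip
    the final outcome z_t, but the stored policy target pi_t (the normalised
    visit counts) is unaffected by which move we happen to sample.
    """
    visits = np.asarray(visit_counts, dtype=np.float64)
    probs = visits / visits.sum()
    return int(rng.choice(len(probs), p=probs))

visits = [120, 40, 25, 15]
print(select_move_evaluation(visits))                      # always move 0
print([select_move_selfplay(visits) for _ in range(10)])   # occasionally others
```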
Correct answer by Dennis Soemers on August 24, 2021