AlphaGo Zero Paper Review


Paper: Mastering the game of Go without human knowledge. https://www.nature.com/articles/nature24270

Key takeaways:

  1. No human domain knowledge is needed, only the rules of the game.
  2. The network is a CNN with residual blocks, batch normalization, and rectifier nonlinearities.
  3. A single network outputs both the evaluation of the current state (value head) and a probability distribution over the next move (policy head); see the sketch after this list.
  4. Monte Carlo Tree Search guided by the policy network selects stronger moves than the raw network alone; a selection sketch also follows below.
  5. Board history is an important input feature.
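
To make takeaways 2 and 3 concrete, here is a minimal sketch of the dual-headed residual network in PyTorch. The sizes are assumptions scaled down for illustration: the paper uses 256 filters and 19 or 39 residual blocks on 19x19x17 inputs (8 board positions of history per player plus a colour-to-play plane).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

BOARD = 19          # board size
IN_PLANES = 17      # 8 history planes per player + colour plane, as in the paper
FILTERS = 64        # paper uses 256; reduced here for illustration
N_BLOCKS = 4        # paper uses 19 or 39

class ResBlock(nn.Module):
    def __init__(self, c):
        super().__init__()
        self.conv1 = nn.Conv2d(c, c, 3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(c)
        self.conv2 = nn.Conv2d(c, c, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(c)

    def forward(self, x):
        y = F.relu(self.bn1(self.conv1(x)))
        y = self.bn2(self.conv2(y))
        return F.relu(x + y)        # skip connection, then rectifier

class PolicyValueNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.stem = nn.Sequential(
            nn.Conv2d(IN_PLANES, FILTERS, 3, padding=1, bias=False),
            nn.BatchNorm2d(FILTERS), nn.ReLU())
        self.trunk = nn.Sequential(*[ResBlock(FILTERS) for _ in range(N_BLOCKS)])
        # policy head: move logits for every board point plus pass
        self.p_conv = nn.Sequential(
            nn.Conv2d(FILTERS, 2, 1, bias=False), nn.BatchNorm2d(2), nn.ReLU())
        self.p_fc = nn.Linear(2 * BOARD * BOARD, BOARD * BOARD + 1)
        # value head: scalar evaluation in [-1, 1]; hidden width 64 here (paper: 256)
        self.v_conv = nn.Sequential(
            nn.Conv2d(FILTERS, 1, 1, bias=False), nn.BatchNorm2d(1), nn.ReLU())
        self.v_fc = nn.Sequential(
            nn.Linear(BOARD * BOARD, 64), nn.ReLU(), nn.Linear(64, 1), nn.Tanh())

    def forward(self, s):
        h = self.trunk(self.stem(s))
        p_logits = self.p_fc(self.p_conv(h).flatten(1))
        v = self.v_fc(self.v_conv(h).flatten(1)).squeeze(-1)
        return p_logits, v
```

And for takeaway 4, a compact sketch of the PUCT selection step, where the policy head's prior P(s,a) biases the search toward promising moves: a = argmax_a Q(s,a) + U(s,a) with U(s,a) = c_puct * P(s,a) * sqrt(sum_b N(s,b)) / (1 + N(s,a)). The Node fields and the c_puct value are illustrative, not the paper's exact implementation.

```python
import math

class Node:
    def __init__(self, prior):
        self.prior = prior       # P(s,a) from the policy head
        self.visits = 0          # N(s,a)
        self.value_sum = 0.0     # W(s,a)
        self.children = {}       # move -> Node

    def q(self):                 # mean action value Q(s,a)
        return self.value_sum / self.visits if self.visits else 0.0

def select_child(node, c_puct=1.5):
    """Pick the (move, child) pair maximising Q + U (the PUCT rule)."""
    total = math.sqrt(sum(ch.visits for ch in node.children.values()))
    def score(item):
        _, ch = item
        u = c_puct * ch.prior * total / (1 + ch.visits)
        return ch.q() + u
    return max(node.children.items(), key=score)
```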

The loss function sums the value (evaluation) loss and the policy loss, together with an L2 regularization term weighted by a parameter c:

(p, v) = f_\theta(s) \quad \text{and} \quad l = (z - v)^2 - \pi^\top \log p + c \|\theta\|^2
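
Here is a sketch of that combined loss, assuming the `PolicyValueNet` above. In practice the L2 term is often applied through the optimiser's weight decay instead; the paper sets c = 10^-4.

```python
import torch
import torch.nn.functional as F

def alphago_zero_loss(net, states, pi, z, c=1e-4):
    """states: (B, 17, 19, 19); pi: (B, 362) MCTS visit distribution; z: (B,) game outcome."""
    p_logits, v = net(states)                    # (p, v) = f_theta(s)
    value_loss = F.mse_loss(v, z)                # (z - v)^2
    policy_loss = -(pi * F.log_softmax(p_logits, dim=1)).sum(1).mean()  # -pi^T log p
    l2 = sum((w ** 2).sum() for w in net.parameters())                  # ||theta||^2
    return value_loss + policy_loss + c * l2
```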

