Paper: Mastering the game of Go without human knowledge. https://www.nature.com/articles/nature24270
Key takeaways:
- No human domain knowledge beyond the rules of the game: the network is trained from scratch by self-play reinforcement learning, starting from random play.
- Use a CNN with residual blocks, batch normalization, and rectifier nonlinearities as the combined policy and value network.
- Use a single network with two heads: one outputs the current-state evaluation (the value network) and the other a probability distribution over next moves (the policy network); a minimal sketch follows this list.
- Use Monte Carlo Tree Search guided by the network's policy and value to pick moves stronger than the raw policy output; the search acts as a policy-improvement operator (see the PUCT sketch after this list).
- Board history is an important input feature: the input stacks the recent board positions together with the colour to play, since repetition rules (ko) make a position not fully observable from the current stones alone.
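
A minimal sketch of the two-headed architecture in PyTorch (an assumed framework; the channel count and block depth are illustrative and far smaller than the paper's 256 filters and 19 or 39 residual blocks):

```python
import torch.nn as nn
import torch.nn.functional as F

class ResidualBlock(nn.Module):
    # conv -> BN -> ReLU -> conv -> BN, then skip connection and ReLU
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)

    def forward(self, x):
        h = F.relu(self.bn1(self.conv1(x)))
        h = self.bn2(self.conv2(h))
        return F.relu(h + x)

class PolicyValueNet(nn.Module):
    # Shared residual trunk feeding a policy head and a value head.
    # in_planes=17 matches the paper's input: 8 past positions per
    # player plus a colour-to-play plane.
    def __init__(self, in_planes=17, channels=64, n_blocks=4, board=19):
        super().__init__()
        self.stem = nn.Sequential(
            nn.Conv2d(in_planes, channels, 3, padding=1, bias=False),
            nn.BatchNorm2d(channels), nn.ReLU())
        self.trunk = nn.Sequential(
            *[ResidualBlock(channels) for _ in range(n_blocks)])
        # Policy head: logits over all board points plus pass.
        self.policy = nn.Sequential(
            nn.Conv2d(channels, 2, 1, bias=False), nn.BatchNorm2d(2),
            nn.ReLU(), nn.Flatten(),
            nn.Linear(2 * board * board, board * board + 1))
        # Value head: scalar evaluation squashed to [-1, 1].
        self.value = nn.Sequential(
            nn.Conv2d(channels, 1, 1, bias=False), nn.BatchNorm2d(1),
            nn.ReLU(), nn.Flatten(),
            nn.Linear(board * board, channels), nn.ReLU(),
            nn.Linear(channels, 1), nn.Tanh())

    def forward(self, s):
        h = self.trunk(self.stem(s))
        return self.policy(h), self.value(h)  # (p logits, v)
```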
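
And a minimal sketch of the PUCT rule the search uses to descend the tree, maximizing Q(s,a) + U(s,a) (`Node`, `select_child`, and the `c_puct` value are illustrative names and defaults; expansion and value backup are omitted):

```python
import math
from dataclasses import dataclass, field

@dataclass
class Node:
    prior: float            # P(s, a): prior from the policy head
    visit_count: int = 0    # N(s, a)
    value_sum: float = 0.0  # W(s, a): accumulated value backups
    children: dict = field(default_factory=dict)  # action -> Node

    def q(self):
        # Q(s, a) = W(s, a) / N(s, a), zero before any visit
        return self.value_sum / self.visit_count if self.visit_count else 0.0

def select_child(node, c_puct=1.5):
    # U(s, a) = c_puct * P(s, a) * sqrt(sum_b N(s, b)) / (1 + N(s, a))
    total = sum(c.visit_count for c in node.children.values())
    def score(item):
        _, child = item
        u = c_puct * child.prior * math.sqrt(total) / (1 + child.visit_count)
        return child.q() + u
    return max(node.children.items(), key=score)
```

After the simulations finish, the move actually played is drawn from the root's visit counts (proportional to N(s,a)^{1/τ} for a temperature τ), which is what makes the search an improvement over the raw network policy.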
The loss function sums the value loss (mean squared error on the game outcome $z$) and the policy loss (cross-entropy against the MCTS search probabilities $\pi$), together with an L2 regularization term (a minimal sketch follows the equation):
$$(\mathbf{p}, v) = f_\theta(s), \qquad l = (z - v)^2 - \boldsymbol{\pi}^{\top} \log \mathbf{p} + c \lVert \theta \rVert^2$$
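
A minimal sketch of this objective in PyTorch (the function name and the `c` value are assumptions, not from the paper):

```python
import torch.nn.functional as F

def alphazero_loss(p_logits, v, pi, z, params, c=1e-4):
    # l = (z - v)^2  -  pi^T log p  +  c * ||theta||^2
    value_loss = F.mse_loss(v.squeeze(-1), z)          # (z - v)^2
    log_p = F.log_softmax(p_logits, dim=-1)            # log p
    policy_loss = -(pi * log_p).sum(dim=-1).mean()     # -pi^T log p
    l2 = c * sum((w ** 2).sum() for w in params)       # c * ||theta||^2
    return value_loss + policy_loss + l2
```

Here `pi` is the MCTS visit-count target and `z` the outcome from the current player's perspective; in practice the L2 term is usually folded into the optimizer (e.g. SGD `weight_decay`) rather than computed explicitly.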