Paper: Mastering the game of Go without human knowledge. https://www.nature.com/articles/nature24270
Key takeaways:
- No human domain knowledge beyond the rules of the game: the network is trained from scratch by self-play reinforcement learning, starting from random play.
- Use a CNN with residual blocks, batch normalization, and rectifier nonlinearities as the combined policy and value network.
- Use a single network with two heads: one outputs the current-state evaluation (the value network) and the other a probability distribution over next moves (the policy network); a minimal sketch follows this list.
- Use Monte Carlo Tree Search guided by the network's policy and value to pick moves stronger than the raw policy output; the search acts as a policy-improvement operator (see the PUCT sketch after this list).
- Board history is an important input feature: the input stacks the recent board positions together with the colour to play, since repetition rules (ko) make a position not fully observable from the current stones alone.
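
A minimal sketch of the two-headed architecture in PyTorch (an assumed framework; the channel count and block depth are illustrative and far smaller than the paper's 256 filters and 19 or 39 residual blocks):

```python
import torch.nn as nn
import torch.nn.functional as F

class ResidualBlock(nn.Module):
    # conv -> BN -> ReLU -> conv -> BN, then skip connection and ReLU
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)

    def forward(self, x):
        h = F.relu(self.bn1(self.conv1(x)))
        h = self.bn2(self.conv2(h))
        return F.relu(h + x)

class PolicyValueNet(nn.Module):
    # Shared residual trunk feeding a policy head and a value head.
    # in_planes=17 matches the paper's input: 8 past positions per
    # player plus a colour-to-play plane.
    def __init__(self, in_planes=17, channels=64, n_blocks=4, board=19):
        super().__init__()
        self.stem = nn.Sequential(
            nn.Conv2d(in_planes, channels, 3, padding=1, bias=False),
            nn.BatchNorm2d(channels), nn.ReLU())
        self.trunk = nn.Sequential(
            *[ResidualBlock(channels) for _ in range(n_blocks)])
        # Policy head: logits over all board points plus pass.
        self.policy = nn.Sequential(
            nn.Conv2d(channels, 2, 1, bias=False), nn.BatchNorm2d(2),
            nn.ReLU(), nn.Flatten(),
            nn.Linear(2 * board * board, board * board + 1))
        # Value head: scalar evaluation squashed to [-1, 1].
        self.value = nn.Sequential(
            nn.Conv2d(channels, 1, 1, bias=False), nn.BatchNorm2d(1),
            nn.ReLU(), nn.Flatten(),
            nn.Linear(board * board, channels), nn.ReLU(),
            nn.Linear(channels, 1), nn.Tanh())

    def forward(self, s):
        h = self.trunk(self.stem(s))
        return self.policy(h), self.value(h)  # (p logits, v)
```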
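
And a minimal sketch of the PUCT rule the search uses to descend the tree, maximizing Q(s,a) + U(s,a) (`Node`, `select_child`, and the `c_puct` value are illustrative names and defaults; expansion and value backup are omitted):

```python
import math
from dataclasses import dataclass, field

@dataclass
class Node:
    prior: float            # P(s, a): prior from the policy head
    visit_count: int = 0    # N(s, a)
    value_sum: float = 0.0  # W(s, a): accumulated value backups
    children: dict = field(default_factory=dict)  # action -> Node

    def q(self):
        # Q(s, a) = W(s, a) / N(s, a), zero before any visit
        return self.value_sum / self.visit_count if self.visit_count else 0.0

def select_child(node, c_puct=1.5):
    # U(s, a) = c_puct * P(s, a) * sqrt(sum_b N(s, b)) / (1 + N(s, a))
    total = sum(c.visit_count for c in node.children.values())
    def score(item):
        _, child = item
        u = c_puct * child.prior * math.sqrt(total) / (1 + child.visit_count)
        return child.q() + u
    return max(node.children.items(), key=score)
```

After the simulations finish, the move actually played is drawn from the root's visit counts (proportional to N(s,a)^{1/τ} for a temperature τ), which is what makes the search an improvement over the raw network policy.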
The loss function sums the value loss (mean squared error on the game outcome $z$) and the policy loss (cross-entropy against the MCTS search probabilities $\pi$), together with an L2 regularization term (a minimal sketch follows the equation):
$$(\mathbf{p}, v) = f_\theta(s), \qquad l = (z - v)^2 - \boldsymbol{\pi}^{\top} \log \mathbf{p} + c \lVert \theta \rVert^2$$
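
A minimal sketch of this objective in PyTorch (the function name and the `c` value are assumptions, not from the paper):

```python
import torch.nn.functional as F

def alphazero_loss(p_logits, v, pi, z, params, c=1e-4):
    # l = (z - v)^2  -  pi^T log p  +  c * ||theta||^2
    value_loss = F.mse_loss(v.squeeze(-1), z)          # (z - v)^2
    log_p = F.log_softmax(p_logits, dim=-1)            # log p
    policy_loss = -(pi * log_p).sum(dim=-1).mean()     # -pi^T log p
    l2 = c * sum((w ** 2).sum() for w in params)       # c * ||theta||^2
    return value_loss + policy_loss + l2
```

Here `pi` is the MCTS visit-count target and `z` the outcome from the current player's perspective; in practice the L2 term is usually folded into the optimizer (e.g. SGD `weight_decay`) rather than computed explicitly.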