This is a trial implementation of DeepMind's Oct 19th publication: Mastering the Game of Go without Human Knowledge.
repo: leela-zero (C++ AlphaGo Zero replica)
repo: reversi-alpha-zero (if you like Reversi/Othello)
Our program, AlphaGo Zero, differs from AlphaGo Fan and AlphaGo Lee in several important aspects. First and foremost, it is trained solely by self-play reinforcement learning, starting from random play, without any supervision or use of human data. Second, it only uses the black and white stones from the board as input features. Third, it uses a single neural network, rather than separate policy and value networks. Finally, it uses a simpler tree search that relies upon this single neural network to evaluate positions and sample moves, without performing any Monte Carlo rollouts. To achieve these results, we introduce a new reinforcement learning algorithm that incorporates lookahead search inside the training loop, resulting in rapid improvement and precise and stable learning.
Congratulations to DeepMind for piercing the frontier once again! AlphaGo Zero is trained fully by self-play reinforcement learning, with no human game examples.
I downloaded the paper Mastering the Game of Go without Human Knowledge right away, but found myself lacking prior knowledge of Monte Carlo Tree Search (MCTS). I have tried my best to highlight what is interesting.
This version of AlphaGo uses a combined policy & value network (the final layers diverge into two heads), which improves training stability. From the paper:
Innovations in MCTS (temperature annealing & Dirichlet noise) enable exploration
Exploration leads to learning ever more complex moves, making the games at the end of training (~70 h) both competitive and balanced
The input is still the raw stones, but the plain CNN has been replaced by a residual network
And finally, pure RL has outperformed the supervised-learning + RL agent
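The two exploration mechanisms mentioned above can be sketched in a few lines of numpy. This is a minimal illustration, not this repo's code: it assumes the paper's values (Dirichlet α = 0.03, ε = 0.25, and a move-selection temperature of 1 for the opening moves, annealed toward 0 afterwards).

```python
import numpy as np

def add_dirichlet_noise(priors, alpha=0.03, eps=0.25, rng=np.random):
    """Mix Dirichlet noise into the root prior probabilities (paper: eps=0.25, alpha=0.03)."""
    noise = rng.dirichlet([alpha] * len(priors))
    return (1 - eps) * priors + eps * noise

def select_move(visit_counts, temperature=1.0, rng=np.random):
    """Pick a move from the root visit counts.

    temperature=1 samples proportionally to visit counts (exploration);
    temperature -> 0 plays the most-visited move deterministically.
    """
    counts = np.asarray(visit_counts, dtype=np.float64)
    if temperature < 1e-3:                # fully annealed: greedy play
        probs = np.zeros_like(counts)
        probs[counts.argmax()] = 1.0
    else:
        pi = counts ** (1.0 / temperature)
        probs = pi / pi.sum()
    return rng.choice(len(counts), p=probs)
```

In the paper the noise is applied only to the priors at the root of the search tree, so exploration is injected without distorting evaluations deeper in the tree.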
- input 19 x 19 x 17: 8 planes of the current player's stones (current state + 7 previous states), 8 planes of the opponent's stones (current state + 7 previous states), and 1 plane for the player's colour
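As a sanity check on the 17-plane layout, here is a hedged numpy sketch of how such an input tensor could be stacked. The plane ordering follows the grouped description above (all player planes, then all opponent planes, then the colour plane); the `encode_position` helper and its board representation (+1 black, -1 white) are assumptions for illustration, not this repo's API.

```python
import numpy as np

def encode_position(history, to_play):
    """Stack the 17 input planes described above.

    history: list of up to 8 board snapshots (19x19 int arrays, newest last),
             with +1 for black stones and -1 for white stones.
    to_play: +1 if black is to move, -1 if white.
    Returns a 19 x 19 x 17 float array: 8 planes of the player's stones,
    8 planes of the opponent's stones, and 1 colour plane.
    """
    planes = np.zeros((19, 19, 17), dtype=np.float32)
    # Pad missing history with empty boards so the newest snapshot lands last.
    boards = [np.zeros((19, 19), dtype=np.int8)] * (8 - len(history)) + list(history[-8:])
    for t, board in enumerate(boards):
        planes[:, :, t] = (board == to_play)          # current player's stones
        planes[:, :, 8 + t] = (board == -to_play)     # opponent's stones
    planes[:, :, 16] = 1.0 if to_play == 1 else 0.0   # colour to play
    return planes
```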
Convolutional Block
- A convolution of 256 filters of kernel size 3 x 3 with stride 1
- Batch normalisation
- A rectifier non-linearity
Residual Blocks
- A convolution of 256 filters of kernel size 3 x 3 with stride 1
- Batch normalisation
- A rectifier non-linearity
- A convolution of 256 filters of kernel size 3 x 3 with stride 1
- Batch normalisation
- A skip connection that adds the input to the block
- A rectifier non-linearity
Policy Head
- A convolution of 2 filters of kernel size 1 x 1 with stride 1
- Batch normalisation
- A rectifier non-linearity
- A fully connected linear layer that outputs a vector of size 19^2 + 1 = 362, corresponding to logit probabilities for all intersections and the pass move
Value Head
- A convolution of 1 filter of kernel size 1 x 1 with stride 1
- Batch normalisation
- A rectifier non-linearity
- A fully connected linear layer to a hidden layer of size 256
- A rectifier non-linearity
- A fully connected linear layer to a scalar
- A tanh non-linearity outputting a scalar in the range [-1, 1]
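The shape arithmetic of the two heads can be checked with a small numpy sketch: the policy head flattens 2 x 19 x 19 = 722 values into 362 logits, and the value head reduces 1 x 19 x 19 = 361 values through a 256-unit layer to one tanh scalar. The weights here are random (this is a shape walkthrough, not a trained model), and batch normalisation is omitted; a 1x1 convolution is just a per-position matrix multiply over channels, hence the `@`.

```python
import numpy as np

rng = np.random.default_rng(0)
features = rng.standard_normal((19, 19, 256))       # output of the residual tower

# Policy head: 1x1 conv to 2 planes, flatten, fc to 19^2 + 1 = 362 logits.
w_pol_conv = rng.standard_normal((256, 2)) * 0.01
pol_planes = features @ w_pol_conv                  # (19, 19, 2)
w_pol_fc = rng.standard_normal((2 * 19 * 19, 362)) * 0.01
policy_logits = pol_planes.reshape(-1) @ w_pol_fc   # (362,)

# Value head: 1x1 conv to 1 plane, fc to 256 hidden units, fc to a scalar, tanh.
w_val_conv = rng.standard_normal((256, 1)) * 0.01
val_plane = (features @ w_val_conv).reshape(-1)     # (361,)
w_val_fc1 = rng.standard_normal((361, 256)) * 0.01
hidden = np.maximum(val_plane @ w_val_fc1, 0.0)     # ReLU
w_val_fc2 = rng.standard_normal((256, 1)) * 0.01
value = np.tanh(hidden @ w_val_fc2)[0]              # scalar in [-1, 1]
```

Both heads share the same residual tower, which is what the paper means by a single combined policy & value network.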
python 3.6
pip install -r requirement.txt
Under the repo's root dir
cd data/download
chmod +x download.sh
./download.sh
This is only an example; feel free to point it at your own local dataset directory
python preprocess.py preprocess ./data/SGFs/kgs-*
python main.py --mode=train --force_save --n_resid_units=20
python main.py --mode=gtp --policy=random --model_path='./savedmodels/model--0.0.ckpt'
Under the repo's root dir
python utils/selfplay.py
* Brian Lee
* Ritchie Ng