junxiaosong / alphazero_gomoku

An implementation of the AlphaZero algorithm for Gomoku (also called Gobang or Five in a Row)

License: MIT License

Python 100.00%
alphazero mcts alphago-zero gomoku gobang monte-carlo-tree-search alphago reinforcement-learning rl board-game

alphazero_gomoku's Introduction

AlphaZero-Gomoku

This is an implementation of the AlphaZero algorithm for playing the simple board game Gomoku (also called Gobang or Five in a Row), trained purely through self-play. Gomoku is much simpler than Go or chess, so we can focus on the AlphaZero training scheme and obtain a reasonably good AI model on a single PC within a few hours.

References:

  1. AlphaZero: Mastering Chess and Shogi by Self-Play with a General Reinforcement Learning Algorithm
  2. AlphaGo Zero: Mastering the game of Go without human knowledge

Update 2018.2.24: supports training with TensorFlow!

Update 2018.1.17: supports training with PyTorch!

Example Games Between Trained Models

  • Each move uses 400 MCTS playouts:
    [animation: playout400]

Requirements

To play with the trained AI models, you only need:

  • Python >= 2.7
  • Numpy >= 1.11

To train the AI model from scratch, you additionally need one of the following:

  • Theano >= 0.7 and Lasagne >= 0.1
    or
  • PyTorch >= 0.2.0
    or
  • TensorFlow

PS: if your Theano version is > 0.7, please follow this issue to install Lasagne;
otherwise, force pip to downgrade Theano to 0.7 with pip install --upgrade theano==0.7.0.

If you would like to train the model using other DL frameworks, you only need to rewrite policy_value_net.py.
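
As a rough guide, a replacement only needs to expose the small interface that train.py and the MCTS player call. The sketch below is a minimal outline under that assumption; the method names follow the existing policy_value_net.py, so please check that file for the exact signatures:

    import numpy as np

    class PolicyValueNet:
        """Skeleton of the interface the rest of the code expects (sketch)."""

        def __init__(self, board_width, board_height, model_file=None):
            self.board_width = board_width
            self.board_height = board_height
            # build the network in your framework of choice here

        def policy_value_fn(self, board):
            # Return (action, probability) pairs for the legal moves plus a
            # scalar value estimate of the position for the current player.
            legal = board.availables                   # legal move indices
            probs = np.ones(len(legal)) / len(legal)   # placeholder: uniform prior
            return zip(legal, probs), 0.0

        def train_step(self, state_batch, mcts_probs, winner_batch, lr):
            # Run one optimization step and return (loss, entropy).
            raise NotImplementedError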

Getting Started

To play with the provided models, run the following script from the project root directory:

python human_play.py  

You may modify human_play.py to try different provided models or the pure MCTS.
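
For example, switching between the provided models usually just means changing the model file and the matching board settings near the top of run() in human_play.py. The variable names below are a sketch and may differ slightly in your copy:

    # inside run() in human_play.py (sketch; variable names may differ)
    n = 5                                    # how many in a row wins
    width, height = 8, 8                     # board size, must match the model
    model_file = 'best_policy_8_8_5.model'   # point this at another provided model
                                             # (and change n, width, height to match)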

To train the AI model from scratch with Theano and Lasagne, directly run:

python train.py

With PyTorch or TensorFlow, first modify the file train.py, i.e., comment out the line

from policy_value_net import PolicyValueNet  # Theano and Lasagne

and uncomment the line

# from policy_value_net_pytorch import PolicyValueNet  # Pytorch
or
# from policy_value_net_tensorflow import PolicyValueNet # Tensorflow

and then execute python train.py. (To use a GPU with PyTorch, set use_gpu=True; if your PyTorch version is greater than 0.5, also change train_step in policy_value_net_pytorch.py to return loss.item(), entropy.item().)
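
For reference, the PyTorch >= 0.5 change simply replaces the deprecated .data[0] indexing on zero-dimensional tensors with .item(); a tiny self-contained illustration:

    import torch

    loss = torch.tensor(1.234, requires_grad=True) * 2   # stand-in for the training loss
    entropy = torch.tensor(3.21)                          # stand-in for the policy entropy

    # PyTorch < 0.5 style:   return loss.data[0], entropy.data[0]
    # PyTorch >= 0.5 style:  return loss.item(), entropy.item()
    print(loss.item(), entropy.item())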

The models (best_policy.model and current_policy.model) will be saved every few updates (default: every 50).

Note: the 4 provided models were trained using Theano/Lasagne; to use them with PyTorch, please refer to issue 5.

Tips for training:

  1. It is good to start with a 6 * 6 board and 4 in a row. In this case we can obtain a reasonably good model within 500~1000 self-play games in about 2 hours (see the configuration sketch after this list).
  2. For an 8 * 8 board and 5 in a row, it may take 2000~3000 self-play games to get a good model, which can take about 2 days on a single PC.
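
As a sketch of where these settings live: the board size and win condition are fixed when the board and game objects are created. The constructor arguments below follow game.py and train.py, but double-check them against your copy:

    # sketch of the board/game setup used by train.py
    from game import Board, Game

    board_width, board_height, n_in_row = 6, 6, 4     # start small (tip 1)
    # board_width, board_height, n_in_row = 8, 8, 5   # larger experiment (tip 2)
    board = Board(width=board_width, height=board_height, n_in_row=n_in_row)
    game = Game(board)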

Further reading

My article (in Chinese) describing some implementation details: https://zhuanlan.zhihu.com/p/32089487

alphazero_gomoku's People

Contributors

autodataming, bigballon, dshnightmare, junxiaosong, mingxuzhang, mrmitzh, observerspy, qinxiaozhi, realhurrison, ya0guang


alphazero_gomoku's Issues

On the difference in how Q values are computed in MCTS

In some articles I've seen, Q is computed as a plain average, but in the code Q is a moving (running) average; these two should differ, so I'd like to ask why the code uses a running average.

Purpose of subtracting the maximum value in the softmax

Hello, the probs output by the MCTS part go through a softmax, and the maximum is subtracted from each logit first:
probs = np.exp(x - np.max(x))
Is the max subtracted to prevent overflow in the result?
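
For reference, subtracting the maximum logit before exponentiating leaves the softmax output unchanged (the constant cancels in the normalization) while preventing overflow for large inputs; a minimal self-contained version:

    import numpy as np

    def softmax(x):
        # exp(x - max) / sum(exp(x - max)) equals exp(x) / sum(exp(x)),
        # but never overflows even when the entries of x are large.
        probs = np.exp(x - np.max(x))
        probs /= np.sum(probs)
        return probs

    print(softmax(np.array([1000.0, 1001.0, 1002.0])))   # naive np.exp(1000.0) overflows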

Why -leaf_value at mcts_alphaZero.py line 66?

    def update_recursive(self, leaf_value):
        """Like a call to update(), but applied recursively for all ancestors.
        """
        # If it is not root, this node's parent should be updated first.
        if self._parent:
            self._parent.update_recursive(-leaf_value)
        self.update(leaf_value)

A beginner's guess, please advise.

The difference between AlphaGo Zero and AlphaZero is that the latter drops the eval step; self-play and optimization stay the same. In AlphaGo Zero everything runs step by step: on our single PC we first generate self-play data, then train on it (opt), then evaluate (eval); we cannot run these three steps at the same time because they don't line up. Even though AlphaZero drops eval, it still has to go step by step. What would it take for the model used for self-play to be updated automatically while training? My idea is that as soon as a batch of self-play data is finished it becomes a new model, and that model immediately plays more self-play games. Wouldn't that be faster? Then we could sleep well and wake up the next day to a better model, as in: https://github.com/chncyhn/flappybird-qlearning-bot

Is this formula written incorrectly?

self._u = (c_puct * self._P * np.sqrt(self._parent._n_visits) / (1 + self._n_visits))
Should it be changed to the following:
self._u = (c_puct * self._P * np.sqrt(self._parent._n_visits / (1 + self._n_visits)))

That is, move the closing parenthesis after self._parent._n_visits to the end.
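
For reference, the exploration term in the AlphaGo Zero paper's PUCT selection rule places only the parent visit count under the square root:

    U(s, a) = c_puct * P(s, a) * sqrt( Σ_b N(s, b) ) / ( 1 + N(s, a) )

which corresponds to the parenthesization in the first line above.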

Can I change the board size?

I wonder how I can change the board size to 19.

Do you think it is feasible to train it with size 19 instead of 8?

A question about the TensorFlow version

In the TensorFlow version, isn't it a problem to convert the dimensions of input_state with a plain reshape? The input is [batch_size, c, h, w] while TensorFlow expects [batch_size, h, w, c]; a plain reshape only changes the declared shape, the data is not actually transposed.
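
For reference, converting [batch, c, h, w] to the [batch, h, w, c] layout needs an axis transpose rather than a plain reshape; a small numpy illustration:

    import numpy as np

    x = np.arange(2 * 3 * 4 * 4).reshape(2, 3, 4, 4)   # [batch, c, h, w]
    nhwc = np.transpose(x, (0, 2, 3, 1))               # correct: moves the channel axis last
    wrong = x.reshape(2, 4, 4, 3)                      # same shape, but the data is scrambled
    print(np.array_equal(nhwc, wrong))                 # False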

A question about the self-play game data

If the game being trained favors the first player (e.g., freestyle Gomoku without forbidden moves), the training data will mostly consist of games won by black. What effect does this have on the training result?

What is your PC configuration? I'd like it as a reference

What is your PC configuration? I'd like it as a reference.
Also, one more question: I ran the 8, 8, 5 setup with TensorFlow on an i5 6300HQ CPU and a GTX 965M GPU, and the GPU run took longer than the CPU run. Is that because the GPU is too weak?

No module named 'numpy.core.multiarray\r'

    Traceback (most recent call last):
      File "human_play.py", line 75, in <module>
        run()
      File "human_play.py", line 59, in run
        policy_param = pickle.load(open('best_policy_8_8_5.model', 'rb'))
    ImportError: No module named 'numpy.core.multiarray\r'

A question about -leaf_value

node.update_recursive(-leaf_value)

First of all, many thanks to the author; this code has great educational value. About this line: I do understand the -leaf_value inside the update_recursive function itself. But leaf_value is the evaluation of the current position, so according to the code above, why is the node that produced the evaluation updated with -leaf_value? I would expect the V of the evaluating node to be leaf_value rather than -leaf_value. I understand that the sign has to flip at every level while propagating back up the tree (because the tree alternates between the two players), but the minus sign at the call site of update_recursive is the part I don't quite get; it is my biggest point of confusion, and I hope the author can explain. Thanks again.

By the way, just to confirm: does the "current player" mentioned throughout the code mean the player to move in the current position?

A question about GPU utilization

While training, GPU memory is fully occupied, but GPU compute utilization is only about 12%. Is there any way to increase it?

How can I improve efficiency when training on a GPU?

Hello, I'm a university student who only recently started with machine learning and I'd like to ask a few questions. I noticed that when training with a GPU, GPU utilization stays below 20% for long stretches; after reading the other issues I understand this is because generating the self-play games takes so long. So if I increase the mini-batch size, that should improve training efficiency, right? But could it cause other problems? I hope you can find time to reply, thank you.

Wondering about the input to the model

I'm wondering whether it would be better to extend the input with more feature planes that encode additional board information, such as 'live three' and 'sleeping four' patterns...

On using KL divergence to control the learning rate

Hello, I noticed that the code adjusts the learning rate by comparing the KL divergence between the outputs of the old and new networks. In my experiments the learning rate first increases quickly and then gradually decreases, so the method clearly works. Is there any literature describing this technique, or did you come up with it from experience?
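
For readers looking for the mechanism being discussed: the adjustment is an adaptive rule that shrinks or grows a learning-rate multiplier depending on how the measured KL divergence compares with a target. The constants below are illustrative, not necessarily the repository's exact values; see train.py for the actual rule:

    # Adaptive learning-rate multiplier driven by the KL divergence between the
    # old and new policy outputs (illustrative constants).
    def adjust_lr_multiplier(kl, lr_multiplier, kl_targ=0.02):
        if kl > kl_targ * 2 and lr_multiplier > 0.1:
            lr_multiplier /= 1.5      # the update moved the policy too much: slow down
        elif kl < kl_targ / 2 and lr_multiplier < 10:
            lr_multiplier *= 1.5      # the update barely moved the policy: speed up
        return lr_multiplier

    print(adjust_lr_multiplier(kl=0.08, lr_multiplier=1.0))   # -> 0.666...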

A small issue in game.py

I noticed that your
game.py#L42

game.py#L97
compute h and w assuming height and width have the same length:

h = m // width
w = m % width
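
For reference, as long as the move index is encoded as row * width + column, this conversion is also valid for rectangular boards; a tiny check with made-up numbers:

    width, height = 8, 6       # hypothetical rectangular board
    m = 3 * width + 5          # stone at row 3, column 5
    h = m // width             # -> 3
    w = m % width              # -> 5
    print(h, w)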

A question about the model's playing policy

Hello! When testing with your code (8*8 board), after about 4000 training iterations the model beat pure MCTS with 4000 playouts 10 games to 0, and scored 5 wins and 5 draws against pure MCTS with 5000 playouts. But when watching it play against a traditional Gomoku engine, against humans, and against itself, I noticed two problems:

  1. When there is an obviously winning move available, it sometimes chooses not to play there.
  2. While there are still many empty points on the board, it seems to never even consider playing on the board edge. For example, if you open on the edge you are guaranteed to win; when playing against a traditional Gomoku engine, after some attacking and defending the opponent plays several consecutive edge moves, and this model does not defend and loses.

About the entropy

Hi, the entropy in your model seems to be computed incorrectly; shouldn't it be a log? You used tf.exp.
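
For context, when a network outputs log-probabilities, the policy entropy -Σ p·log p is often computed by recovering p with an exp; a small numpy illustration:

    import numpy as np

    log_p = np.log(np.array([0.7, 0.2, 0.1]))   # network output as log-probabilities
    entropy = -np.sum(np.exp(log_p) * log_p)    # -sum p * log p, with p = exp(log p)
    print(entropy)                              # ~0.80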

On converting the trained model

Hello, the trained model comes in three file formats (.model.meta; .model.index; model.data…). When I want to use the model, which one should I load? Or does it need to be converted first?

Training with TensorFlow still hasn't converged after 1000 games

Using the default configuration: 6x6 board and 4 in a row.
Running on macOS.

batch i:1100, episode_len:21
kl:0.00058,lr_multiplier:11.391,loss:4.518421649932861,entropy:3.5188446044921875,explained_var_old:0.000,explained_var_new:0.000
current self-play batch: 1100
num_playouts:1000, win: 2, lose: 8, tie:0

Please advise.

Questions about the final action policy π produced by MCTS

While working through your code I ran into a couple of questions; I'd appreciate your guidance:

  1. According to the AlphaGo Zero paper, during the MCTS backup the leaf node's p and v are first obtained from the policy-value network, and then v is used to update the Q values of the nodes inside the tree. Your code uses the function update_recursive(leaf_value), where leaf_value should be the leaf's v from the paper, right? Why do lines 66 and 137 of mcts_alphazero.py pass the negative of leaf_value rather than the positive value? Is AlphaZero's backup different from AlphaGo Zero's? I didn't see these details when reading the AlphaZero paper and would appreciate your guidance.
  2. When the MCTS for a state finishes, it outputs a policy π. According to the AlphaGo Zero paper this policy is a function of the action visit counts and the temperature, which corresponds to what your softmax function computes. The simplified result is indeed the paper's formula, but why subtract the maximum, i.e., probs = np.exp(x - np.max(x)), instead of computing the action probabilities directly from the formula in the paper?
  3. During self-play, the AlphaGo Zero paper adds Dirichlet noise Dir(0.03) to the actions, while the code uses the parameter value 0.3; how should I understand this? (See the sketch after this list.)
  4. I have read both the AlphaGo Zero and AlphaZero papers and still don't understand the difference between the two methods. Could you please explain?
    Thank you very much for sharing!
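
Regarding the Dirichlet noise in point 3 above: the AlphaZero paper scales the Dirichlet concentration parameter inversely with the typical number of legal moves (0.03 for 19x19 Go, 0.3 for chess), so a larger value is reasonable on a small Gomoku board. A hedged sketch of the usual mixing, with the 0.75/0.25 weights taken from the AlphaGo Zero paper's epsilon = 0.25:

    import numpy as np

    probs = np.array([0.5, 0.3, 0.2])    # MCTS move probabilities over the legal moves
    alpha = 0.3                          # concentration; smaller boards use larger alpha
    noise = np.random.dirichlet(alpha * np.ones(len(probs)))
    noisy = 0.75 * probs + 0.25 * noise  # still sums to 1; used only for self-play moves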

About self.data_buffer.extend(play_data)

If play_data = [1, 2] and data_buffer is an empty deque, the intended result of self.data_buffer.extend(play_data) is data_buffer = [1, 2] (where 1 and 2 stand for training samples [state, probs matrix, z]). But wouldn't the actual result be data_buffer = [[1, 2]]? Should each item of play_data be extracted and appended to data_buffer individually?
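
For reference, collections.deque.extend adds each element of the iterable individually, unlike append, which would add the whole list as a single item:

    from collections import deque

    buf = deque(maxlen=10000)
    play_data = [("s1", "pi1", "z1"), ("s2", "pi2", "z2")]   # stand-ins for (state, probs, z)

    buf.extend(play_data)    # -> deque([("s1", ...), ("s2", ...)]): two separate samples
    # buf.append(play_data)  # would instead add the whole list as one element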

Have you tried evaluating by comparing the newest trained model against the previous one?

Instead of comparing against pure MCTS, I wonder whether this would improve the results, and it would also remove the need to write a pure MCTS at all. Personally I feel that for pure MCTS, increasing the number of playouts yields diminishing returns in win rate.

I referred to another GitHub repo, https://github.com/suragnair/alpha-zero-general; their results are also quite good.

BTW, thanks for posting this code for others to learn from; I'm a complete beginner.

Feasibility of using human game data to speed up convergence

Hello, I'd like to use human game records to speed up convergence. In that case, how should the mcts_probs_batch probabilities be set during training? Is it OK to set the probability of the action actually played to 1 and all others to 0?

A question about c_puct

Hi, thank you very much for sharing!
I tried running your code with TensorFlow, but for some reason the results on an 8*8 board are not great. I then restricted the legal moves to points near the existing stones, but the results are still not ideal. If the number of legal actions is reduced this way, should c_puct be increased or decreased accordingly?

Neural Network Architecture

This neural network architecture is quite different from the one in the AlphaGo Zero paper; for instance, the latter takes a ResNet approach, using 1 convolutional block and 19 residual blocks.
Simply stacking layers may cause certain problems (e.g., in speed and accuracy) during network training.
