junxiaosong / alphazero_gomoku

An implementation of the AlphaZero algorithm for Gomoku (also called Gobang or Five in a Row)

License: MIT License

Python 100.00%
alphazero mcts alphago-zero gomoku gobang monte-carlo-tree-search alphago reinforcement-learning rl board-game

alphazero_gomoku's Introduction

AlphaZero-Gomoku

This is an implementation of the AlphaZero algorithm for playing the simple board game Gomoku (also called Gobang or Five in a Row), trained purely through self-play. Gomoku is much simpler than Go or chess, so we can focus on the AlphaZero training scheme and obtain a reasonably good AI model on a single PC within a few hours.

References:

  1. AlphaZero: Mastering Chess and Shogi by Self-Play with a General Reinforcement Learning Algorithm
  2. AlphaGo Zero: Mastering the game of Go without human knowledge

Update 2018.2.24: supports training with TensorFlow!

Update 2018.1.17: supports training with PyTorch!

Example Games Between Trained Models

  • Each move uses 400 MCTS playouts:
    [animation: playout400]

Requirements

To play with the trained AI models, you only need:

  • Python >= 2.7
  • Numpy >= 1.11

To train the AI model from scratch, you additionally need one of the following:

  • Theano >= 0.7 and Lasagne >= 0.1
    or
  • PyTorch >= 0.2.0
    or
  • TensorFlow

PS: if your Theano version is > 0.7, please follow this issue to install Lasagne;
otherwise, force pip to downgrade Theano to 0.7 with pip install --upgrade theano==0.7.0.

If you would like to train the model using other DL frameworks, you only need to rewrite policy_value_net.py.
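
As a rough guide, a replacement only needs to expose the small interface that train.py and the MCTS player call. The sketch below is a minimal outline under that assumption; the method names follow the existing policy_value_net.py, so please check that file for the exact signatures:

    import numpy as np

    class PolicyValueNet:
        """Skeleton of the interface the rest of the code expects (sketch)."""

        def __init__(self, board_width, board_height, model_file=None):
            self.board_width = board_width
            self.board_height = board_height
            # build the network in your framework of choice here

        def policy_value_fn(self, board):
            # Return (action, probability) pairs for the legal moves plus a
            # scalar value estimate of the position for the current player.
            legal = board.availables                   # legal move indices
            probs = np.ones(len(legal)) / len(legal)   # placeholder: uniform prior
            return zip(legal, probs), 0.0

        def train_step(self, state_batch, mcts_probs, winner_batch, lr):
            # Run one optimization step and return (loss, entropy).
            raise NotImplementedError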

Getting Started

To play with the provided models, run the following script from the project root directory:

python human_play.py  

You may modify human_play.py to try different provided models or the pure MCTS.
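
For example, switching between the provided models usually just means changing the model file and the matching board settings near the top of run() in human_play.py. The variable names below are a sketch and may differ slightly in your copy:

    # inside run() in human_play.py (sketch; variable names may differ)
    n = 5                                    # how many in a row wins
    width, height = 8, 8                     # board size, must match the model
    model_file = 'best_policy_8_8_5.model'   # point this at another provided model
                                             # (and change n, width, height to match)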

To train the AI model from scratch with Theano and Lasagne, directly run:

python train.py

With PyTorch or TensorFlow, first modify the file train.py, i.e., comment out the line

from policy_value_net import PolicyValueNet  # Theano and Lasagne

and uncomment the line

# from policy_value_net_pytorch import PolicyValueNet  # Pytorch
or
# from policy_value_net_tensorflow import PolicyValueNet # Tensorflow

and then execute python train.py. (To use a GPU with PyTorch, set use_gpu=True; if your PyTorch version is greater than 0.5, also change train_step in policy_value_net_pytorch.py to return loss.item(), entropy.item().)
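
For reference, the PyTorch >= 0.5 change simply replaces the deprecated .data[0] indexing on zero-dimensional tensors with .item(); a tiny self-contained illustration:

    import torch

    loss = torch.tensor(1.234, requires_grad=True) * 2   # stand-in for the training loss
    entropy = torch.tensor(3.21)                          # stand-in for the policy entropy

    # PyTorch < 0.5 style:   return loss.data[0], entropy.data[0]
    # PyTorch >= 0.5 style:  return loss.item(), entropy.item()
    print(loss.item(), entropy.item())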

The models (best_policy.model and current_policy.model) will be saved every few updates (default: every 50).

Note: the 4 provided models were trained using Theano/Lasagne; to use them with PyTorch, please refer to issue 5.

Tips for training:

  1. It is good to start with a 6 * 6 board and 4 in a row. In this case we can obtain a reasonably good model within 500~1000 self-play games in about 2 hours (see the configuration sketch after this list).
  2. For an 8 * 8 board and 5 in a row, it may take 2000~3000 self-play games to get a good model, which can take about 2 days on a single PC.
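
As a sketch of where these settings live: the board size and win condition are fixed when the board and game objects are created. The constructor arguments below follow game.py and train.py, but double-check them against your copy:

    # sketch of the board/game setup used by train.py
    from game import Board, Game

    board_width, board_height, n_in_row = 6, 6, 4     # start small (tip 1)
    # board_width, board_height, n_in_row = 8, 8, 5   # larger experiment (tip 2)
    board = Board(width=board_width, height=board_height, n_in_row=n_in_row)
    game = Game(board)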

Further reading

My article (in Chinese) describing some implementation details: https://zhuanlan.zhihu.com/p/32089487

alphazero_gomoku's People

Contributors

autodataming, bigballon, dshnightmare, junxiaosong, mingxuzhang, mrmitzh, observerspy, qinxiaozhi, realhurrison, ya0guang


alphazero_gomoku's Issues

On the difference in how Q values are computed in MCTS

In some articles I've seen, Q is computed as a plain average, but in the code Q is a moving (running) average; these two should differ, so I'd like to ask why the code uses a running average.

Purpose of subtracting the maximum value in the softmax

Hello, the probs output by the MCTS part go through a softmax, and the maximum is subtracted from each logit first:
probs = np.exp(x - np.max(x))
Is the max subtracted to prevent overflow in the result?
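
For reference, subtracting the maximum logit before exponentiating leaves the softmax output unchanged (the constant cancels in the normalization) while preventing overflow for large inputs; a minimal self-contained version:

    import numpy as np

    def softmax(x):
        # exp(x - max) / sum(exp(x - max)) equals exp(x) / sum(exp(x)),
        # but never overflows even when the entries of x are large.
        probs = np.exp(x - np.max(x))
        probs /= np.sum(probs)
        return probs

    print(softmax(np.array([1000.0, 1001.0, 1002.0])))   # naive np.exp(1000.0) overflows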

Why -leaf_value at mcts_alphaZero.py line 66?

    def update_recursive(self, leaf_value):
        """Like a call to update(), but applied recursively for all ancestors.
        """
        # If it is not root, this node's parent should be updated first.
        if self._parent:
            self._parent.update_recursive(-leaf_value)
        self.update(leaf_value)

A beginner's guess, please advise.

The difference between AlphaGo Zero and AlphaZero is that the latter drops the eval step; self-play and optimization stay the same. In AlphaGo Zero everything runs step by step: on our single PC we first generate self-play data, then train on it (opt), then evaluate (eval); we cannot run these three steps at the same time because they don't line up. Even though AlphaZero drops eval, it still has to go step by step. What would it take for the model used for self-play to be updated automatically while training? My idea is that as soon as a batch of self-play data is finished it becomes a new model, and that model immediately plays more self-play games. Wouldn't that be faster? Then we could sleep well and wake up the next day to a better model, as in: https://github.com/chncyhn/flappybird-qlearning-bot

Is this formula written incorrectly?

self._u = (c_puct * self._P * np.sqrt(self._parent._n_visits) / (1 + self._n_visits))
Should it be changed to the following:
self._u = (c_puct * self._P * np.sqrt(self._parent._n_visits / (1 + self._n_visits)))

That is, move the closing parenthesis after self._parent._n_visits to the end.
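
For reference, the exploration term in the AlphaGo Zero paper's PUCT selection rule places only the parent visit count under the square root:

    U(s, a) = c_puct * P(s, a) * sqrt( Σ_b N(s, b) ) / ( 1 + N(s, a) )

which corresponds to the parenthesization in the first line above.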

Can I change the board size?

I wonder how I can change the board size to 19.

Do you think it is feasible to train it with size 19 instead of 8?

A question about the TensorFlow version

In the TensorFlow version, isn't it a problem to convert the dimensions of input_state with a plain reshape? The input is [batch_size, c, h, w] while TensorFlow expects [batch_size, h, w, c]; a plain reshape only changes the declared shape, the data is not actually transposed.
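
For reference, converting [batch, c, h, w] to the [batch, h, w, c] layout needs an axis transpose rather than a plain reshape; a small numpy illustration:

    import numpy as np

    x = np.arange(2 * 3 * 4 * 4).reshape(2, 3, 4, 4)   # [batch, c, h, w]
    nhwc = np.transpose(x, (0, 2, 3, 1))               # correct: moves the channel axis last
    wrong = x.reshape(2, 4, 4, 3)                      # same shape, but the data is scrambled
    print(np.array_equal(nhwc, wrong))                 # False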

A question about the self-play game data

If the game being trained favors the first player (e.g., freestyle Gomoku without forbidden moves), the training data will mostly consist of games won by black. What effect does this have on the training result?

What is your PC configuration? I'd like it as a reference

What is your PC configuration? I'd like it as a reference.
Also, one more question: I ran the 8, 8, 5 setup with TensorFlow on an i5 6300HQ CPU and a GTX 965M GPU, and the GPU run took longer than the CPU run. Is that because the GPU is too weak?

No module named 'numpy.core.multiarray\r'

    Traceback (most recent call last):
      File "human_play.py", line 75, in <module>
        run()
      File "human_play.py", line 59, in run
        policy_param = pickle.load(open('best_policy_8_8_5.model', 'rb'))
    ImportError: No module named 'numpy.core.multiarray\r'

A question about -leaf_value

node.update_recursive(-leaf_value)

First of all, many thanks to the author; this code has great educational value. About this line: I do understand the -leaf_value inside the update_recursive function itself. But leaf_value is the evaluation of the current position, so according to the code above, why is the node that produced the evaluation updated with -leaf_value? I would expect the V of the evaluating node to be leaf_value rather than -leaf_value. I understand that the sign has to flip at every level while propagating back up the tree (because the tree alternates between the two players), but the minus sign at the call site of update_recursive is the part I don't quite get; it is my biggest point of confusion, and I hope the author can explain. Thanks again.

By the way, just to confirm: does the "current player" mentioned throughout the code mean the player to move in the current position?

A question about GPU utilization

While training, GPU memory is fully occupied, but GPU compute utilization is only about 12%. Is there any way to increase it?

How can I improve efficiency when training on a GPU?

Hello, I'm a university student who only recently started with machine learning and I'd like to ask a few questions. I noticed that when training with a GPU, GPU utilization stays below 20% for long stretches; after reading the other issues I understand this is because generating the self-play games takes so long. So if I increase the mini-batch size, that should improve training efficiency, right? But could it cause other problems? I hope you can find time to reply, thank you.

Wondering about the input to the model

I'm wondering whether it would be better to extend the input with more feature planes that encode additional board information, such as 'live three' and 'sleeping four' patterns...

On using KL divergence to control the learning rate

Hello, I noticed that the code adjusts the learning rate by comparing the KL divergence between the outputs of the old and new networks. In my experiments the learning rate first increases quickly and then gradually decreases, so the method clearly works. Is there any literature describing this technique, or did you come up with it from experience?
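
For readers looking for the mechanism being discussed: the adjustment is an adaptive rule that shrinks or grows a learning-rate multiplier depending on how the measured KL divergence compares with a target. The constants below are illustrative, not necessarily the repository's exact values; see train.py for the actual rule:

    # Adaptive learning-rate multiplier driven by the KL divergence between the
    # old and new policy outputs (illustrative constants).
    def adjust_lr_multiplier(kl, lr_multiplier, kl_targ=0.02):
        if kl > kl_targ * 2 and lr_multiplier > 0.1:
            lr_multiplier /= 1.5      # the update moved the policy too much: slow down
        elif kl < kl_targ / 2 and lr_multiplier < 10:
            lr_multiplier *= 1.5      # the update barely moved the policy: speed up
        return lr_multiplier

    print(adjust_lr_multiplier(kl=0.08, lr_multiplier=1.0))   # -> 0.666...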

A small issue in game.py

I noticed that your
game.py#L42

game.py#L97
compute h and w assuming height and width have the same length:

h = m // width
w = m % width
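
For reference, as long as the move index is encoded as row * width + column, this conversion is also valid for rectangular boards; a tiny check with made-up numbers:

    width, height = 8, 6       # hypothetical rectangular board
    m = 3 * width + 5          # stone at row 3, column 5
    h = m // width             # -> 3
    w = m % width              # -> 5
    print(h, w)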

A question about the model's playing policy

Hello! When testing with your code (8*8 board), after about 4000 training iterations the model beat pure MCTS with 4000 playouts 10 games to 0, and scored 5 wins and 5 draws against pure MCTS with 5000 playouts. But when watching it play against a traditional Gomoku engine, against humans, and against itself, I noticed two problems:

  1. When there is an obviously winning move available, it sometimes chooses not to play there.
  2. While there are still many empty points on the board, it seems to never even consider playing on the board edge. For example, if you open on the edge you are guaranteed to win; when playing against a traditional Gomoku engine, after some attacking and defending the opponent plays several consecutive edge moves, and this model does not defend and loses.

About the entropy

Hi, the entropy in your model seems to be computed incorrectly; shouldn't it be a log? You used tf.exp.
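
For context, when a network outputs log-probabilities, the policy entropy -Σ p·log p is often computed by recovering p with an exp; a small numpy illustration:

    import numpy as np

    log_p = np.log(np.array([0.7, 0.2, 0.1]))   # network output as log-probabilities
    entropy = -np.sum(np.exp(log_p) * log_p)    # -sum p * log p, with p = exp(log p)
    print(entropy)                              # ~0.80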

On converting the trained model

Hello, the trained model comes in three file formats (.model.meta; .model.index; model.data…). When I want to use the model, which one should I load? Or does it need to be converted first?

Training with TensorFlow still hasn't converged after 1000 games

Using the default configuration: 6x6 board and 4 in a row.
Running on macOS.

batch i:1100, episode_len:21
kl:0.00058,lr_multiplier:11.391,loss:4.518421649932861,entropy:3.5188446044921875,explained_var_old:0.000,explained_var_new:0.000
current self-play batch: 1100
num_playouts:1000, win: 2, lose: 8, tie:0

Please advise.

Questions about the final action policy π produced by MCTS

While working through your code I ran into a couple of questions; I'd appreciate your guidance:

  1. According to the AlphaGo Zero paper, during the MCTS backup the leaf node's p and v are first obtained from the policy-value network, and then v is used to update the Q values of the nodes inside the tree. Your code uses the function update_recursive(leaf_value), where leaf_value should be the leaf's v from the paper, right? Why do lines 66 and 137 of mcts_alphazero.py pass the negative of leaf_value rather than the positive value? Is AlphaZero's backup different from AlphaGo Zero's? I didn't see these details when reading the AlphaZero paper and would appreciate your guidance.
  2. When the MCTS for a state finishes, it outputs a policy π. According to the AlphaGo Zero paper this policy is a function of the action visit counts and the temperature, which corresponds to what your softmax function computes. The simplified result is indeed the paper's formula, but why subtract the maximum, i.e., probs = np.exp(x - np.max(x)), instead of computing the action probabilities directly from the formula in the paper?
  3. During self-play, the AlphaGo Zero paper adds Dirichlet noise Dir(0.03) to the actions, while the code uses the parameter value 0.3; how should I understand this? (See the sketch after this list.)
  4. I have read both the AlphaGo Zero and AlphaZero papers and still don't understand the difference between the two methods. Could you please explain?
    Thank you very much for sharing!
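
Regarding the Dirichlet noise in point 3 above: the AlphaZero paper scales the Dirichlet concentration parameter inversely with the typical number of legal moves (0.03 for 19x19 Go, 0.3 for chess), so a larger value is reasonable on a small Gomoku board. A hedged sketch of the usual mixing, with the 0.75/0.25 weights taken from the AlphaGo Zero paper's epsilon = 0.25:

    import numpy as np

    probs = np.array([0.5, 0.3, 0.2])    # MCTS move probabilities over the legal moves
    alpha = 0.3                          # concentration; smaller boards use larger alpha
    noise = np.random.dirichlet(alpha * np.ones(len(probs)))
    noisy = 0.75 * probs + 0.25 * noise  # still sums to 1; used only for self-play moves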

About self.data_buffer.extend(play_data)

If play_data = [1, 2] and data_buffer is an empty deque, the intended result of self.data_buffer.extend(play_data) is data_buffer = [1, 2] (where 1 and 2 stand for training samples [state, probs matrix, z]). But wouldn't the actual result be data_buffer = [[1, 2]]? Should each item of play_data be extracted and appended to data_buffer individually?
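
For reference, collections.deque.extend adds each element of the iterable individually, unlike append, which would add the whole list as a single item:

    from collections import deque

    buf = deque(maxlen=10000)
    play_data = [("s1", "pi1", "z1"), ("s2", "pi2", "z2")]   # stand-ins for (state, probs, z)

    buf.extend(play_data)    # -> deque([("s1", ...), ("s2", ...)]): two separate samples
    # buf.append(play_data)  # would instead add the whole list as one element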

Have you tried evaluating by comparing the newest trained model against the previous one?

Instead of comparing against pure MCTS, I wonder whether this would improve the results, and it would also remove the need to write a pure MCTS at all. Personally I feel that for pure MCTS, increasing the number of playouts yields diminishing returns in win rate.

I referred to another GitHub repo, https://github.com/suragnair/alpha-zero-general; their results are also quite good.

BTW, thanks for posting this code for others to learn from; I'm a complete beginner.

Feasibility of using human game data to speed up convergence

Hello, I'd like to use human game records to speed up convergence. In that case, how should the mcts_probs_batch probabilities be set during training? Is it OK to set the probability of the action actually played to 1 and all others to 0?

A question about c_puct

Hi, thank you very much for sharing!
I tried running your code with TensorFlow, but for some reason the results on an 8*8 board are not great. I then restricted the legal moves to points near the existing stones, but the results are still not ideal. If the number of legal actions is reduced this way, should c_puct be increased or decreased accordingly?

Neural Network Architecture

This neural network architecture is quite different from the one in the AlphaGo Zero paper; for instance, the latter takes a ResNet approach, using 1 convolutional block and 19 residual blocks.
Simply stacking layers may cause certain problems (e.g., in speed and accuracy) during network training.
