hijkzzz / alpha-zero-gomoku

A Multi-threaded Implementation of AlphaZero

Python 54.98% CMake 2.09% C++ 42.01% SWIG 0.91%

alphazero cpp gomoku-game libtorch multithreading



alpha-zero-gomoku's Issues

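Training fails at iteration 20

Training failed at iteration 20 on all three attempts. Could you take a look?
The first two runs produced the following error: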

terminate called after throwing an instance of 'c10::Error'
  what():  CUDA error: misaligned address
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

Exception raised from c10_cuda_check_implementation at ../c10/cuda/CUDAException.cpp:44 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f4bfb40d4d7 in /root/miniconda3/lib/python3.8/site-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x64 (0x7f4bfb3d736b in /root/miniconda3/lib/python3.8/site-packages/torch/lib/libc10.so)
frame #2: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, int, bool) + 0x118 (0x7f4b946cdb58 in /root/miniconda3/lib/python3.8/site-packages/torch/lib/libc10_cuda.so)
frame #3: <unknown function> + 0x1985457 (0x7f4b9696d457 in /root/miniconda3/lib/python3.8/site-packages/torch/lib/libtorch_cuda.so)
frame #4: <unknown function> + 0x1d4b680 (0x7f4be3baa680 in /root/miniconda3/lib/python3.8/site-packages/torch/lib/libtorch_cpu.so)
frame #5: at::native::copy_(at::Tensor&, at::Tensor const&, bool) + 0x62 (0x7f4be3bab812 in /root/miniconda3/lib/python3.8/site-packages/torch/lib/libtorch_cpu.so)
frame #6: at::_ops::copy_::call(at::Tensor&, at::Tensor const&, bool) + 0x15f (0x7f4be481a7bf in /root/miniconda3/lib/python3.8/site-packages/torch/lib/libtorch_cpu.so)
frame #7: at::native::_to_copy(at::Tensor const&, c10::optional<c10::ScalarType>, c10::optional<c10::Layout>, c10::optional<c10::Device>, c10::optional<bool>, bool, c10::optional<c10::MemoryFormat>) + 0x1b6b (0x7f4be3e9e2ab in /root/miniconda3/lib/python3.8/site-packages/torch/lib/libtorch_cpu.so)
frame #8: <unknown function> + 0x2d2206b (0x7f4be4b8106b in /root/miniconda3/lib/python3.8/site-packages/torch/lib/libtorch_cpu.so)
frame #9: at::_ops::_to_copy::redispatch(c10::DispatchKeySet, at::Tensor const&, c10::optional<c10::ScalarType>, c10::optional<c10::Layout>, c10::optional<c10::Device>, c10::optional<bool>, bool, c10::optional<c10::MemoryFormat>) + 0xf5 (0x7f4be4368455 in /root/miniconda3/lib/python3.8/site-packages/torch/lib/libtorch_cpu.so)
frame #10: <unknown function> + 0x2b5b453 (0x7f4be49ba453 in /root/miniconda3/lib/python3.8/site-packages/torch/lib/libtorch_cpu.so)
frame #11: at::_ops::_to_copy::redispatch(c10::DispatchKeySet, at::Tensor const&, c10::optional<c10::ScalarType>, c10::optional<c10::Layout>, c10::optional<c10::Device>, c10::optional<bool>, bool, c10::optional<c10::MemoryFormat>) + 0xf5 (0x7f4be4368455 in /root/miniconda3/lib/python3.8/site-packages/torch/lib/libtorch_cpu.so)
frame #12: <unknown function> + 0x4015f9b (0x7f4be5e74f9b in /root/miniconda3/lib/python3.8/site-packages/torch/lib/libtorch_cpu.so)
frame #13: <unknown function> + 0x401641e (0x7f4be5e7541e in /root/miniconda3/lib/python3.8/site-packages/torch/lib/libtorch_cpu.so)
frame #14: at::_ops::_to_copy::call(at::Tensor const&, c10::optional<c10::ScalarType>, c10::optional<c10::Layout>, c10::optional<c10::Device>, c10::optional<bool>, bool, c10::optional<c10::MemoryFormat>) + 0x1f9 (0x7f4be43ee819 in /root/miniconda3/lib/python3.8/site-packages/torch/lib/libtorch_cpu.so)
frame #15: at::native::to(at::Tensor const&, c10::optional<c10::ScalarType>, c10::optional<c10::Layout>, c10::optional<c10::Device>, c10::optional<bool>, bool, bool, c10::optional<c10::MemoryFormat>) + 0x11b (0x7f4be3e94e5b in /root/miniconda3/lib/python3.8/site-packages/torch/lib/libtorch_cpu.so)
frame #16: <unknown function> + 0x2eeef81 (0x7f4be4d4df81 in /root/miniconda3/lib/python3.8/site-packages/torch/lib/libtorch_cpu.so)
frame #17: at::_ops::to_dtype_layout::call(at::Tensor const&, c10::optional<c10::ScalarType>, c10::optional<c10::Layout>, c10::optional<c10::Device>, c10::optional<bool>, bool, bool, c10::optional<c10::MemoryFormat>) + 0x20e (0x7f4be456d15e in /root/miniconda3/lib/python3.8/site-packages/torch/lib/libtorch_cpu.so)
frame #18: at::Tensor::to(c10::TensorOptions, bool, bool, c10::optional<c10::MemoryFormat>) const + 0x132 (0x7f4bfb869d22 in /root/autodl-tmp/alpha-zero-gomoku/test/../build/_library.so)
frame #19: NeuralNetwork::infer() + 0xb6b (0x7f4bfb86777b in /root/autodl-tmp/alpha-zero-gomoku/test/../build/_library.so)
frame #20: <unknown function> + 0x5972d (0x7f4bfb86872d in /root/autodl-tmp/alpha-zero-gomoku/test/../build/_library.so)
frame #21: <unknown function> + 0x145a0 (0x7f4bfba115a0 in /root/miniconda3/lib/python3.8/site-packages/torch/lib/libtorch.so)
frame #22: <unknown function> + 0x8609 (0x7f4c1b7ff609 in /usr/lib/x86_64-linux-gnu/libpthread.so.0)
frame #23: clone + 0x43 (0x7f4c1b724133 in /usr/lib/x86_64-linux-gnu/libc.so.6)

Aborted (core dumped)

On the last run I set CUDA_LAUNCH_BLOCKING=1 before starting, as the error message suggests, and got the following error:

terminate called after throwing an instance of 'std::runtime_error'
  what():  The following operation failed in the TorchScript interpreter.
Traceback of TorchScript, serialized code (most recent call last):
  File "code/__torch__/neural_network/___torch_mangle_1624.py", line 30, in forward
    p_conv = self.p_conv
    res_layers = self.res_layers
    _0 = (res_layers).forward(inputs, )
          ~~~~~~~~~~~~~~~~~~~ <--- HERE
    _1 = (p_bn).forward((p_conv).forward(_0, ), )
    _2 = (relu).forward(_1, )
  File "code/__torch__/torch/nn/modules/container/___torch_mangle_1613.py", line 16, in forward
    _1 = getattr(self, "1")
    _0 = getattr(self, "0")
    _4 = (_1).forward((_0).forward(inputs, ), )
                       ~~~~~~~~~~~ <--- HERE
    return (_3).forward((_2).forward(_4, ), )
  File "code/__torch__/neural_network/___torch_mangle_1594.py", line 25, in forward
    _1 = (conv2).forward((relu).forward(_0, ), )
    _2 = (bn2).forward(_1, )
    _3 = (downsample_bn).forward((downsample_conv).forward(inputs, ), )
                                  ~~~~~~~~~~~~~~~~~~~~~~~~ <--- HERE
    input = torch.add_(_2, _3)
    return (relu).forward1(input, )
  File "code/__torch__/torch/nn/modules/conv/___torch_mangle_1592.py", line 10, in forward
    inputs: Tensor) -> Tensor:
    weight = self.weight
    input = torch._convolution(inputs, weight, None, [1, 1], [1, 1], [1, 1], False, [0, 0], 1, False, False, True, True)
            ~~~~~~~~~~~~~~~~~~ <--- HERE
    return input

Traceback of TorchScript, original code (most recent call last):
/root/miniconda3/lib/python3.8/site-packages/torch/nn/modules/conv.py(459): _conv_forward
/root/miniconda3/lib/python3.8/site-packages/torch/nn/modules/conv.py(463): forward
/root/miniconda3/lib/python3.8/site-packages/torch/nn/modules/module.py(1488): _slow_forward
/root/miniconda3/lib/python3.8/site-packages/torch/nn/modules/module.py(1501): _call_impl
/root/autodl-tmp/alpha-zero-gomoku/test/../src/neural_network.py(47): forward
/root/miniconda3/lib/python3.8/site-packages/torch/nn/modules/module.py(1488): _slow_forward
/root/miniconda3/lib/python3.8/site-packages/torch/nn/modules/module.py(1501): _call_impl
/root/miniconda3/lib/python3.8/site-packages/torch/nn/modules/container.py(217): forward
/root/miniconda3/lib/python3.8/site-packages/torch/nn/modules/module.py(1488): _slow_forward
/root/miniconda3/lib/python3.8/site-packages/torch/nn/modules/module.py(1501): _call_impl
/root/autodl-tmp/alpha-zero-gomoku/test/../src/neural_network.py(84): forward
/root/miniconda3/lib/python3.8/site-packages/torch/nn/modules/module.py(1488): _slow_forward
/root/miniconda3/lib/python3.8/site-packages/torch/nn/modules/module.py(1501): _call_impl
/root/miniconda3/lib/python3.8/site-packages/torch/jit/_trace.py(1056): trace_module
/root/miniconda3/lib/python3.8/site-packages/torch/jit/_trace.py(794): trace
/root/autodl-tmp/alpha-zero-gomoku/test/../src/neural_network.py(279): save_model
/root/autodl-tmp/alpha-zero-gomoku/test/../src/learner.py(114): learn
learner_test.py(17): <module>
RuntimeError: cuDNN error: CUDNN_STATUS_EXECUTION_FAILED

Aborted (core dumped)
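
The error text above suggests CUDA_LAUNCH_BLOCKING=1. A minimal sketch of setting it from inside the test script instead of the shell follows. It assumes the lines run before torch or the compiled library module is first imported in the process; the helper name cuda_debug_env is hypothetical.

```python
import os

# Must run before the first `import torch` / `import library` in the process.
# setdefault keeps any value that was already exported in the shell.
os.environ.setdefault("CUDA_LAUNCH_BLOCKING", "1")


def cuda_debug_env():
    """Return the CUDA-related variables currently visible to this process."""
    return {k: v for k, v in os.environ.items() if k.startswith("CUDA_")}


print(cuda_debug_env())
```

With blocking launches the reported stack trace should point at the kernel that actually failed, which is how the second trace above narrowed the failure down to a convolution.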

ImportError with the Ubuntu release

The libtorch package was downloaded from the official site:
https://download.pytorch.org/libtorch/cu101/libtorch-shared-with-deps-1.5.0%2Bcu101.zip
Environment variable:
export PATH=$PATH:/home/xxx/libtorch

Python was installed with Anaconda. The main library versions are:
python3.7.6
cuda10.1
pytorch1.1

The model was downloaded from Baidu Netdisk and placed in the project's models directory.

Run:
python leaner_test.py play
Error:
Traceback (most recent call last):
  File "leaner_test.py", line 7, in <module>
    import learner
  File "../src/learner.py", line 14, in <module>
    from library import MCTS, Gomoku, NeuralNetwork
  File "../build/library.py", line 15, in <module>
    import _library
ImportError: libpython3.7m.so.1.0: cannot open shared object file: No such file or directory

What else have I failed to configure?
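
The missing libpython3.7m.so.1.0 normally sits in the lib directory of the Python installation itself, which with Anaconda is often not on the dynamic loader's search path. The sketch below only prints where the running interpreter keeps that file. Adding the directory to LD_LIBRARY_PATH is a common remedy but is an assumption here, since the thread does not confirm it; the helper name is hypothetical.

```python
import os
import sysconfig


def libpython_location():
    """Return (directory, file name) of the shared libpython, where known."""
    libdir = sysconfig.get_config_var("LIBDIR")      # may be None, e.g. on Windows
    soname = sysconfig.get_config_var("INSTSONAME")  # e.g. libpython3.7m.so.1.0
    return libdir, soname


libdir, soname = libpython_location()
print("libpython directory:", libdir)
print("libpython file name:", soname)
if libdir and soname:
    print("file exists:", os.path.exists(os.path.join(libdir, soname)))
    # Assumed fix: export this in the shell before running the test script.
    print("export LD_LIBRARY_PATH={}:$LD_LIBRARY_PATH".format(libdir))
```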

ImportError: ../build/_library.so: undefined symbol: _ZTIN3c1021AutogradMetaInterfaceE

I successfully compiled the library on ubuntu 18.04, but encountered the undefined symbol problem. Could you help me to solve the problem?

pygame 1.9.6
Hello from the pygame community. https://www.pygame.org/contribute.html
Traceback (most recent call last):
  File "leaner_test.py", line 6, in <module>
    import learner
  File "../src/learner.py", line 17, in <module>
    from library import MCTS, Gomoku, NeuralNetwork
  File "../build/library.py", line 15, in <module>
    import _library
ImportError: ../build/_library.so: undefined symbol: _ZTIN3c1021AutogradMetaInterfaceE

Thank you very much
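
An undefined c10 symbol in _library.so is often a sign that the extension was compiled against a libtorch whose version or C++ ABI differs from the PyTorch that Python imports. That diagnosis is an assumption, not something the thread confirms. The sketch below prints the details of the installed PyTorch that the libtorch download would need to match; the function name is hypothetical.

```python
def torch_build_info():
    """Collect the PyTorch details a libtorch build should match, if torch imports."""
    try:
        import torch
    except (ImportError, OSError):
        return None  # torch is missing or itself fails to load
    abi_check = getattr(torch, "compiled_with_cxx11_abi", None)
    return {
        "version": torch.__version__,
        "cuda": torch.version.cuda,
        "cxx11_abi": abi_check() if abi_check else None,
    }


print(torch_build_info())
```

If the printed version or ABI flag differs from the libtorch archive used for the CMake build, rebuilding against a matching archive is the first thing to try.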

bug

sym = self.get_symmetries(board, prob)

Shouldn't the last action be symmetrized too?
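
A sketch of what symmetrizing the last action alongside the board could look like, written in plain Python so it does not depend on the repository. All function names are hypothetical, the action is assumed to be a flat index row * n + col, and a negative value is assumed to mean that no move has been played yet.

```python
def rot90(grid):
    """Rotate a square grid 90 degrees counterclockwise."""
    n = len(grid)
    return [[grid[j][n - 1 - i] for j in range(n)] for i in range(n)]


def fliplr(grid):
    """Mirror a grid left to right."""
    return [row[::-1] for row in grid]


def transform_action(action, n, k, flip):
    """Map a flat index through k counterclockwise rotations, then an optional flip."""
    if action < 0:  # assumed sentinel for "no move played yet"
        return action
    r, c = divmod(action, n)
    for _ in range(k):
        r, c = n - 1 - c, r
    if flip:
        c = n - 1 - c
    return r * n + c


def symmetries_with_last_action(board, last_action):
    """Return the 8 (board, last_action) pairs under rotation and reflection."""
    n = len(board)
    out = []
    grid = board
    for k in range(4):
        for flip in (False, True):
            g = fliplr(grid) if flip else grid
            out.append((g, transform_action(last_action, n, k, flip)))
        grid = rot90(grid)
    return out


# A 3x3 board whose only stone is the last move, at row 0, column 1 (index 1).
board = [[0, 1, 0], [0, 0, 0], [0, 0, 0]]
for g, a in symmetries_with_last_action(board, 1):
    assert g[a // 3][a % 3] == 1  # the stone and the index moved together
```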

I don't understand the parallel MCTS code

During simulate, couldn't multiple threads break out of the while loop at the same time and run inference on the same root?
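
One common way to keep concurrent searches from piling onto the same path is virtual loss: a thread marks the child it picked as temporarily losing, so the next thread prefers a different child. The sketch below illustrates that idea only. It is not taken from the repository, the class and function names are hypothetical, and it does not by itself stop two threads from reaching the same unexpanded leaf.

```python
import math
import threading


class Node:
    def __init__(self, prior):
        self.prior = prior
        self.n_visited = 0
        self.q_sum = 0.0
        self.virtual_loss = 0
        self.children = {}
        self.lock = threading.Lock()

    def score(self, parent_visits, c_puct=5.0):
        # In-flight visits count as losses, making this path less attractive.
        n = self.n_visited + self.virtual_loss
        q = (self.q_sum - self.virtual_loss) / n if n else 0.0
        u = c_puct * self.prior * math.sqrt(parent_visits + 1) / (1 + n)
        return q + u


def select_child(parent):
    """Pick the best child and mark it in flight, atomically under the parent's lock."""
    with parent.lock:
        visits = sum(c.n_visited + c.virtual_loss for c in parent.children.values())
        action, child = max(parent.children.items(),
                            key=lambda kv: kv[1].score(visits))
        child.virtual_loss += 1
    return action, child


def backup(parent, child, value):
    """Replace the virtual loss with the real evaluation result."""
    with parent.lock:
        child.virtual_loss -= 1
        child.n_visited += 1
        child.q_sum += value


root = Node(prior=1.0)
root.children = {0: Node(0.6), 1: Node(0.4)}
first, child_a = select_child(root)    # "thread A"
second, child_b = select_child(root)   # "thread B" arrives before A backs up
print(first, second)
```

Here the second selection goes to the other child because the first one still carries its virtual loss. Once backup runs, the penalty is removed and the real value takes its place.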

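A question about parallelism

Hello, I read the description: there are 10 simulations, and each one is searched by 4 threads. Are variables such as n_visited shared within each simulation, or is each simulation independent? Does the neural network on the GPU receive 10 board positions at once, or 4, or 40?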
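
The learner.py excerpt quoted in another issue on this page sets the libtorch batch size to the number of games still running multiplied by the MCTS threads per game. The sketch below repeats that arithmetic with the numbers from the question. The function name is hypothetical, and the value for zero finished games assumes the initial batch size follows the same formula.

```python
def batch_size_after(finished, num_eps, num_train_threads, num_mcts_threads):
    """Batch size once `finished` self-play games have returned, per the quoted loop."""
    remain = min(num_eps - finished, num_train_threads)
    return max(remain * num_mcts_threads, 1)


for finished in (0, 5, 9, 10):
    size = batch_size_after(finished, num_eps=10, num_train_threads=10,
                            num_mcts_threads=4)
    print(finished, size)
```

Under these assumptions the batch starts at 40 positions and shrinks as games finish, which suggests that the GPU batch spans all concurrent games.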

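The previously shared Baidu Netdisk model link has expired

I am working on a similar project and hope the author can re-upload the model that was previously shared on Baidu Netdisk.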

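Can the number of training threads be made as large as possible?

Hello, can the number of training threads be increased as far as the machine allows?
When the thread count is very large, will many threads simply wait for the GPU to return results, so that training time does not drop noticeably?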

How did you create SWIG for libtorch?

Also, is C++ calling Python or libtorch calling Python?
I know you are loading the trained PT in Libtorch, but why do you need SWIG?

Crashes immediately after ITER :: 1

"""
The program exits abruptly when it reaches this point: f.result() never completes, and no error is reported.
Environment:
win10
python 3.7
libtorch 1.3.1
pytorch 1.3.1

"""
# learner.py
with concurrent.futures.ThreadPoolExecutor(max_workers=self.num_train_threads) as executor:
    futures = [executor.submit(self.self_play, 1 if itr % 2 else -1, libtorch, k == 1)
               for k in range(1, self.num_eps + 1)]

    for k, f in enumerate(futures):
        examples = f.result()
        itr_examples += examples
        # decrease libtorch batch size
        remain = min(len(futures) - (k + 1), self.num_train_threads)
        libtorch.set_batch_size(max(remain * self.num_mcts_threads, 1))

        print("EPS: {}, EXAMPLES: {}".format(k + 1, len(examples)))
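
A sketch of making worker failures visible in a loop like the one above. It logs any Python exception inside the worker as soon as it is raised and puts a timeout on f.result() so that a hang becomes a TimeoutError. The self_play function here is a hypothetical stand-in, and none of this helps if native code terminates the whole process.

```python
import concurrent.futures
import traceback


def logged(fn):
    """Wrap a worker so any Python exception is printed as soon as it happens."""
    def wrapper(*args, **kwargs):
        try:
            return fn(*args, **kwargs)
        except Exception:
            traceback.print_exc()
            raise
    return wrapper


def self_play(k):
    # Hypothetical stand-in for the repository's self_play method.
    if k == 2:
        raise RuntimeError("episode {} failed".format(k))
    return [k]


results, errors = [], []
with concurrent.futures.ThreadPoolExecutor(max_workers=2) as executor:
    futures = [executor.submit(logged(self_play), k) for k in range(1, 4)]
    for k, f in enumerate(futures, start=1):
        try:
            # A timeout turns a silent hang into a visible TimeoutError.
            results += f.result(timeout=60)
        except Exception as exc:
            errors.append((k, repr(exc)))

print(results, errors)
```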

Error from Cmake ../build

/home/ubuntu/myCode/alpha-zero-gomoku/src/libtorch.cpp:15:19: error: no matching function for call to ‘std::shared_ptr<torch::jit::script::Module>::shared_ptr(torch::jit::script::Module)’
loop(nullptr) {
^
In file included from /usr/include/c++/6/memory:82:0,
from /home/ubuntu/libtorch/libtorch/include/c10/core/Allocator.h:4,
from /home/ubuntu/libtorch/libtorch/include/ATen/ATen.h:3,
from /home/ubuntu/libtorch/libtorch/include/torch/csrc/api/include/torch/types.h:3,
from /home/ubuntu/libtorch/libtorch/include/torch/script.h:3,
from /home/ubuntu/myCode/alpha-zero-gomoku/./src/libtorch.h:3,
from /home/ubuntu/myCode/alpha-zero-gomoku/src/libtorch.cpp:1:
/usr/include/c++/6/bits/shared_ptr.h:327:7: note: candidate: std::shared_ptr<_Tp>::shared_ptr(const std::weak_ptr<_Tp>&, std::nothrow_t) [with _Tp = torch::jit::script::Module]
shared_ptr(const weak_ptr<_Tp>& __r, std::nothrow_t)
^~~~~~~~~~

Segmentation fault on Ubuntu

pygame 1.9.5
Hello from the pygame community. https://www.pygame.org/contribute.html
ALSA lib confmisc.c:768:(parse_card) cannot find card '0'
ALSA lib conf.c:4292:(_snd_config_evaluate) function snd_func_card_driver returned error: No such file or directory
ALSA lib confmisc.c:392:(snd_func_concat) error evaluating strings
ALSA lib conf.c:4292:(_snd_config_evaluate) function snd_func_concat returned error: No such file or directory
ALSA lib confmisc.c:1251:(snd_func_refer) error evaluating name
ALSA lib conf.c:4292:(_snd_config_evaluate) function snd_func_refer returned error: No such file or directory
ALSA lib conf.c:4771:(snd_config_expand) Evaluate error: No such file or directory
ALSA lib pcm.c:2266:(snd_pcm_open_noupdate) Unknown PCM default
ITER :: 1
Segmentation fault
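
A bare "Segmentation fault" gives no hint of where the crash happened. Python's standard faulthandler module can dump the Python-level stack of every thread when the process receives a fatal signal, which at least shows which Python call was active. This is a minimal sketch; the log file name and helper name are hypothetical, and the dump does not include C++ frames.

```python
import faulthandler
import os
import tempfile


def enable_crash_tracebacks(path=None):
    """Dump all thread stacks to `path` if the process receives a fatal signal."""
    if path is None:
        # Hypothetical file name in the system temp directory.
        path = os.path.join(tempfile.gettempdir(), "gomoku_crash.log")
    log = open(path, "w")
    faulthandler.enable(file=log, all_threads=True)
    return log  # keep a reference so the file stays open


crash_log = enable_crash_tracebacks()
print("fatal-signal tracebacks go to", crash_log.name)
```

Placing these lines at the top of the test script, before importing the compiled library, covers crashes during import as well.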

Can it run on Ubuntu?

After some struggle I got it to compile on Windows, but compiling on Ubuntu runs into the following problem:
(dl) root@d93dac271dfe:/input/alpha-zero-gomoku/build# cmake --build .
[ 16%] Built target library_swig_compilation
CMakeFiles/_library.dir/build.make:138: *** target pattern contains no '%'.  Stop.
CMakeFiles/Makefile2:72: recipe for target 'CMakeFiles/_library.dir/all' failed
make[1]: *** [CMakeFiles/_library.dir/all] Error 2
Makefile:83: recipe for target 'all' failed
make: *** [all] Error 2

Link to the 1.5-day pre-trained models is broken.

https://pan.baidu.com/s/1c2Otxdl7VWFEXul-FyXaJA now says: "Oops, you are too late. The shared file has been cancelled; come earlier next time."

Implement multiple game threads in C++

Python's GIL limits the concurrency of multiple game threads, making GPU utilization difficult to improve. If I have time, I will rewrite it in C++.

libtorch_cpu.so: undefined symbol

I have a problem with learner_test.py.
I tried PyTorch 1.0 and 1.7; neither works.
The error is as follows. Can someone take a look?
Traceback (most recent call last):
  File "leaner_test.py", line 6, in <module>
    import learner
  File "../src/learner.py", line 16, in <module>
    from neural_network import NeuralNetWorkWrapper
  File "../src/neural_network.py", line 6, in <module>
    import torch
  File "/home/jinxiaoyang/anaconda3/envs/multiAlphaGomoku/lib/python3.6/site-packages/torch/__init__.py", line 197, in <module>
    from torch._C import * # noqa: F403
ImportError: /home/jinxiaoyang/anaconda3/envs/multiAlphaGomoku/lib/python3.6/site-packages/torch/lib/libtorch_cpu.so: undefined symbol: _ZNK3c1010TensorImpl23shallow_copy_and_detachERKNS_15VariableVersionEb

Does the evaluation process support batching?

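Running speed

Using the model downloaded from Baidu Netdisk for human-vs-computer play, I found that the computer takes about 5 minutes to compute each move.
Is this speed normal? (I am using a Xeon CPU, 2.1 GHz, 8 cores.)
Is there any way to speed it up?

Thanks for sharing. Has the author tried multi-machine, multi-card CPU training?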

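Errors when compiling and running; please help take a look

Hello, I have a question. When I clone your source code, compile it from scratch, and then train, I get this error.
But when I download the package from your release, training runs without problems. What causes this?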

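A question about the open-source license

Hello.

Your project provides an excellent, high-performance baseline for training Gomoku agents. Reading it was a great help in understanding the techniques used in AlphaZero, such as Monte Carlo tree search.

Building on your project, I added implementations of some other training methods, such as league training. May I use your project's code under the MIT license and open-source my implementation?

Thank you very much!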

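My own C++ implementation is very slow

I rewrote self-play in C++ with libtorch as well, but it runs extremely slowly and the training results are very poor.
What network architecture does the author use, and how long does one self-play game take on average when the author runs it?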

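Hello, how can your project be run on another computer?

Compared with Python, setting up a C++ environment is full of difficulties. With your limited description, it is hard for a beginner to quickly configure an environment that can run your code. If you have time, I hope you can improve the information on this; I would be very grateful.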

Cannot run on Windows

However I build it with VS, I only get a .lib, but Python says it needs a DLL.

Which libtorch version works? Can you give a ZIP link?

Thank you

Versions

Which versions of Ubuntu and numpy should be used?
I used libtorch-cxx11-abi-shared-with-deps-1.2.0.zip and torch 1.2.0 and got:
Segmentation fault (core dumped)
I cannot find a suitable libtorch 1.1.0.

How to use your mcts with python

Error after --build .

error: no matching function for call to ‘std::shared_ptr<torch::jit::script::Module>::shared_ptr(torch::jit::script::Module)’
loop(nullptr) {

I am not sure whether this is a torch version problem; I am using the latest torch version.

exploration

'num_explore': 1,

Just 1 step? Why so small?
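
The thread does not say what num_explore controls. A common reading in AlphaZero-style self-play, assumed here and not confirmed for this repository, is the number of opening moves that are sampled in proportion to MCTS visit counts before play turns greedy. The sketch below shows that reading; choose_action is a hypothetical name.

```python
import random


def choose_action(visit_counts, step, num_explore, rng=random):
    """Sample by visit count for the first `num_explore` moves, then play greedily."""
    if step < num_explore:
        return rng.choices(range(len(visit_counts)), weights=visit_counts, k=1)[0]
    return max(range(len(visit_counts)), key=visit_counts.__getitem__)


counts = [10, 70, 20]
print(choose_action(counts, step=0, num_explore=1))  # random, weighted by counts
print(choose_action(counts, step=1, num_explore=1))  # the most-visited move
```

Under this reading, a value of 1 means only the first move of each game is sampled, so any further variety between games would have to come from other sources of randomness in the search.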

The Baidu Netdisk share link has expired. Could you please upload the files again?

Value loss doesn't decrease

Hi hijkzzz,

I really appreciate the value of your work.

Currently I'm trying to reimplement your work in pure C++ to accommodate my project's needs. After reimplementing your code in C++, I found that the value loss doesn't decrease at all, even after ~2000 iterations, while the policy loss decreases quickly from the very beginning.

Did you run into a similar issue while developing your AlphaZero project? Could you suggest possible causes of my issue based on your experience?

Many thanks,
Alex
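
One thing worth checking when the value loss stays flat while the policy loss falls, offered as a guess and not a diagnosis: whether each training example's value target is written from the perspective of the player to move in that position. The sketch below shows that labeling convention. The function name and the +1/-1 player encoding are assumptions.

```python
def value_targets(players_to_move, winner):
    """Label each position +1 if the player to move there won, -1 if lost, 0 for a draw."""
    return [0 if winner == 0 else (1 if p == winner else -1)
            for p in players_to_move]


players = [1, -1, 1, -1, 1]  # hypothetical 5-move game, players alternate
print(value_targets(players, winner=1))
print(value_targets(players, winner=0))
```

If every position in a game receives the same target regardless of who is to move, and the network input is encoded relative to the current player, the targets conflict with each other and the value head has little to learn from.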

buffer_len

'examples_buffer_max_len': 20,

Why so small? In the AlphaZero paper, it's 500000 games.
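
The unit of examples_buffer_max_len is not stated in the thread. If each buffer entry holds all examples produced by one training iteration (an assumption, suggested by the itr_examples variable in the learner.py excerpt quoted elsewhere on this page), then 20 means the last 20 iterations of self-play and not 20 games. The sketch below shows that kind of buffer with placeholder data.

```python
from collections import deque

examples_buffer = deque(maxlen=20)  # mirrors 'examples_buffer_max_len': 20

for itr in range(1, 26):
    # Placeholder: three fake examples per iteration.
    itr_examples = [("state", "prob", "value")] * 3
    examples_buffer.append(itr_examples)  # the oldest iteration drops out once full

train_data = [ex for itr_examples in examples_buffer for ex in itr_examples]
print(len(examples_buffer), len(train_data))
```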

Was this tested against https://gomocup.org/?

Hello,
Is there any official score for the performance rating of this model?
Thanks!

No module named 'library' problem

(tensorflow_gpuenv) C:\Users\William\source\repos\alpha-zero-gomoku\test>python leaner_test.py play
Traceback (most recent call last):
  File "leaner_test.py", line 6, in <module>
    import learner
  File "../src\learner.py", line 17, in <module>
    from library import MCTS, Gomoku, NeuralNetwork
ModuleNotFoundError: No module named 'library'

This is what my screen shows. I am using an Anaconda environment with Python 3.6.
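
Other tracebacks on this page show library.py being loaded from ../build and importing a compiled _library module, so this error usually means the build output is missing or not on sys.path. The sketch below checks for both files. The helper name is hypothetical, and the relative path assumes the script is run from the test directory.

```python
import glob
import os
import sys


def check_build_dir(build_dir):
    """Report whether the SWIG wrapper and its compiled extension exist in `build_dir`."""
    wrapper = os.path.isfile(os.path.join(build_dir, "library.py"))
    extension = sorted(glob.glob(os.path.join(build_dir, "_library.*")))
    return wrapper, extension


# Path taken from the tracebacks on this page.
build_dir = os.path.abspath(os.path.join("..", "build"))
wrapper, extension = check_build_dir(build_dir)
print("library.py present:", wrapper)
print("compiled extension:", extension or "missing")
if wrapper and build_dir not in sys.path:
    sys.path.append(build_dir)
```

On Windows, CPython loads extension modules from files ending in .pyd, so a build that produces only a .lib will not satisfy the import of _library.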