
alpha-zero-gomoku's Introduction

🔭 I'm a Coding Lover.

Jian Hu's GitHub stats

alpha-zero-gomoku's People

Contributors

dougzheng, hijkzzz

alpha-zero-gomoku's Issues

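Training fails at iteration 20

I ran training three times, and every run failed at iteration 20. Could you take a look?
The first two runs failed with the following error: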

terminate called after throwing an instance of 'c10::Error'
  what():  CUDA error: misaligned address
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

Exception raised from c10_cuda_check_implementation at ../c10/cuda/CUDAException.cpp:44 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f4bfb40d4d7 in /root/miniconda3/lib/python3.8/site-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x64 (0x7f4bfb3d736b in /root/miniconda3/lib/python3.8/site-packages/torch/lib/libc10.so)
frame #2: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, int, bool) + 0x118 (0x7f4b946cdb58 in /root/miniconda3/lib/python3.8/site-packages/torch/lib/libc10_cuda.so)
frame #3: <unknown function> + 0x1985457 (0x7f4b9696d457 in /root/miniconda3/lib/python3.8/site-packages/torch/lib/libtorch_cuda.so)
frame #4: <unknown function> + 0x1d4b680 (0x7f4be3baa680 in /root/miniconda3/lib/python3.8/site-packages/torch/lib/libtorch_cpu.so)
frame #5: at::native::copy_(at::Tensor&, at::Tensor const&, bool) + 0x62 (0x7f4be3bab812 in /root/miniconda3/lib/python3.8/site-packages/torch/lib/libtorch_cpu.so)
frame #6: at::_ops::copy_::call(at::Tensor&, at::Tensor const&, bool) + 0x15f (0x7f4be481a7bf in /root/miniconda3/lib/python3.8/site-packages/torch/lib/libtorch_cpu.so)
frame #7: at::native::_to_copy(at::Tensor const&, c10::optional<c10::ScalarType>, c10::optional<c10::Layout>, c10::optional<c10::Device>, c10::optional<bool>, bool, c10::optional<c10::MemoryFormat>) + 0x1b6b (0x7f4be3e9e2ab in /root/miniconda3/lib/python3.8/site-packages/torch/lib/libtorch_cpu.so)
frame #8: <unknown function> + 0x2d2206b (0x7f4be4b8106b in /root/miniconda3/lib/python3.8/site-packages/torch/lib/libtorch_cpu.so)
frame #9: at::_ops::_to_copy::redispatch(c10::DispatchKeySet, at::Tensor const&, c10::optional<c10::ScalarType>, c10::optional<c10::Layout>, c10::optional<c10::Device>, c10::optional<bool>, bool, c10::optional<c10::MemoryFormat>) + 0xf5 (0x7f4be4368455 in /root/miniconda3/lib/python3.8/site-packages/torch/lib/libtorch_cpu.so)
frame #10: <unknown function> + 0x2b5b453 (0x7f4be49ba453 in /root/miniconda3/lib/python3.8/site-packages/torch/lib/libtorch_cpu.so)
frame #11: at::_ops::_to_copy::redispatch(c10::DispatchKeySet, at::Tensor const&, c10::optional<c10::ScalarType>, c10::optional<c10::Layout>, c10::optional<c10::Device>, c10::optional<bool>, bool, c10::optional<c10::MemoryFormat>) + 0xf5 (0x7f4be4368455 in /root/miniconda3/lib/python3.8/site-packages/torch/lib/libtorch_cpu.so)
frame #12: <unknown function> + 0x4015f9b (0x7f4be5e74f9b in /root/miniconda3/lib/python3.8/site-packages/torch/lib/libtorch_cpu.so)
frame #13: <unknown function> + 0x401641e (0x7f4be5e7541e in /root/miniconda3/lib/python3.8/site-packages/torch/lib/libtorch_cpu.so)
frame #14: at::_ops::_to_copy::call(at::Tensor const&, c10::optional<c10::ScalarType>, c10::optional<c10::Layout>, c10::optional<c10::Device>, c10::optional<bool>, bool, c10::optional<c10::MemoryFormat>) + 0x1f9 (0x7f4be43ee819 in /root/miniconda3/lib/python3.8/site-packages/torch/lib/libtorch_cpu.so)
frame #15: at::native::to(at::Tensor const&, c10::optional<c10::ScalarType>, c10::optional<c10::Layout>, c10::optional<c10::Device>, c10::optional<bool>, bool, bool, c10::optional<c10::MemoryFormat>) + 0x11b (0x7f4be3e94e5b in /root/miniconda3/lib/python3.8/site-packages/torch/lib/libtorch_cpu.so)
frame #16: <unknown function> + 0x2eeef81 (0x7f4be4d4df81 in /root/miniconda3/lib/python3.8/site-packages/torch/lib/libtorch_cpu.so)
frame #17: at::_ops::to_dtype_layout::call(at::Tensor const&, c10::optional<c10::ScalarType>, c10::optional<c10::Layout>, c10::optional<c10::Device>, c10::optional<bool>, bool, bool, c10::optional<c10::MemoryFormat>) + 0x20e (0x7f4be456d15e in /root/miniconda3/lib/python3.8/site-packages/torch/lib/libtorch_cpu.so)
frame #18: at::Tensor::to(c10::TensorOptions, bool, bool, c10::optional<c10::MemoryFormat>) const + 0x132 (0x7f4bfb869d22 in /root/autodl-tmp/alpha-zero-gomoku/test/../build/_library.so)
frame #19: NeuralNetwork::infer() + 0xb6b (0x7f4bfb86777b in /root/autodl-tmp/alpha-zero-gomoku/test/../build/_library.so)
frame #20: <unknown function> + 0x5972d (0x7f4bfb86872d in /root/autodl-tmp/alpha-zero-gomoku/test/../build/_library.so)
frame #21: <unknown function> + 0x145a0 (0x7f4bfba115a0 in /root/miniconda3/lib/python3.8/site-packages/torch/lib/libtorch.so)
frame #22: <unknown function> + 0x8609 (0x7f4c1b7ff609 in /usr/lib/x86_64-linux-gnu/libpthread.so.0)
frame #23: clone + 0x43 (0x7f4c1b724133 in /usr/lib/x86_64-linux-gnu/libc.so.6)

Aborted (core dumped)

For the last run I set CUDA_LAUNCH_BLOCKING=1 before launching, as the error message suggests. That run ended with the following error:

terminate called after throwing an instance of 'std::runtime_error'
  what():  The following operation failed in the TorchScript interpreter.
Traceback of TorchScript, serialized code (most recent call last):
  File "code/__torch__/neural_network/___torch_mangle_1624.py", line 30, in forward
    p_conv = self.p_conv
    res_layers = self.res_layers
    _0 = (res_layers).forward(inputs, )
          ~~~~~~~~~~~~~~~~~~~ <--- HERE
    _1 = (p_bn).forward((p_conv).forward(_0, ), )
    _2 = (relu).forward(_1, )
  File "code/__torch__/torch/nn/modules/container/___torch_mangle_1613.py", line 16, in forward
    _1 = getattr(self, "1")
    _0 = getattr(self, "0")
    _4 = (_1).forward((_0).forward(inputs, ), )
                       ~~~~~~~~~~~ <--- HERE
    return (_3).forward((_2).forward(_4, ), )
  File "code/__torch__/neural_network/___torch_mangle_1594.py", line 25, in forward
    _1 = (conv2).forward((relu).forward(_0, ), )
    _2 = (bn2).forward(_1, )
    _3 = (downsample_bn).forward((downsample_conv).forward(inputs, ), )
                                  ~~~~~~~~~~~~~~~~~~~~~~~~ <--- HERE
    input = torch.add_(_2, _3)
    return (relu).forward1(input, )
  File "code/__torch__/torch/nn/modules/conv/___torch_mangle_1592.py", line 10, in forward
    inputs: Tensor) -> Tensor:
    weight = self.weight
    input = torch._convolution(inputs, weight, None, [1, 1], [1, 1], [1, 1], False, [0, 0], 1, False, False, True, True)
            ~~~~~~~~~~~~~~~~~~ <--- HERE
    return input

Traceback of TorchScript, original code (most recent call last):
/root/miniconda3/lib/python3.8/site-packages/torch/nn/modules/conv.py(459): _conv_forward
/root/miniconda3/lib/python3.8/site-packages/torch/nn/modules/conv.py(463): forward
/root/miniconda3/lib/python3.8/site-packages/torch/nn/modules/module.py(1488): _slow_forward
/root/miniconda3/lib/python3.8/site-packages/torch/nn/modules/module.py(1501): _call_impl
/root/autodl-tmp/alpha-zero-gomoku/test/../src/neural_network.py(47): forward
/root/miniconda3/lib/python3.8/site-packages/torch/nn/modules/module.py(1488): _slow_forward
/root/miniconda3/lib/python3.8/site-packages/torch/nn/modules/module.py(1501): _call_impl
/root/miniconda3/lib/python3.8/site-packages/torch/nn/modules/container.py(217): forward
/root/miniconda3/lib/python3.8/site-packages/torch/nn/modules/module.py(1488): _slow_forward
/root/miniconda3/lib/python3.8/site-packages/torch/nn/modules/module.py(1501): _call_impl
/root/autodl-tmp/alpha-zero-gomoku/test/../src/neural_network.py(84): forward
/root/miniconda3/lib/python3.8/site-packages/torch/nn/modules/module.py(1488): _slow_forward
/root/miniconda3/lib/python3.8/site-packages/torch/nn/modules/module.py(1501): _call_impl
/root/miniconda3/lib/python3.8/site-packages/torch/jit/_trace.py(1056): trace_module
/root/miniconda3/lib/python3.8/site-packages/torch/jit/_trace.py(794): trace
/root/autodl-tmp/alpha-zero-gomoku/test/../src/neural_network.py(279): save_model
/root/autodl-tmp/alpha-zero-gomoku/test/../src/learner.py(114): learn
learner_test.py(17): <module>
RuntimeError: cuDNN error: CUDNN_STATUS_EXECUTION_FAILED

Aborted (core dumped)

ImportError with the Ubuntu release

The libtorch package was downloaded from the official site:
https://download.pytorch.org/libtorch/cu101/libtorch-shared-with-deps-1.5.0%2Bcu101.zip
Environment variable:
export PATH=$PATH:/home/xxx/libtorch

Python was installed with Anaconda. The main library versions are:
python 3.7.6
cuda 10.1
pytorch 1.1

The model was downloaded from Baidu Netdisk and placed in the project's models directory.

Running
python leaner_test.py play
fails with
Traceback (most recent call last):
File "leaner_test.py", line 7, in <module>
import learner
File "../src/learner.py", line 14, in <module>
from library import MCTS, Gomoku, NeuralNetwork
File "../build/library.py", line 15, in <module>
import _library
ImportError: libpython3.7m.so.1.0: cannot open shared object file: No such file or directory

What else do I still need to configure?
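One thing worth checking for this kind of ImportError: on Linux, PATH only affects how executables are found, while shared libraries such as libpython3.7m.so.1.0 are located by the dynamic loader (for example through LD_LIBRARY_PATH). The sketch below, run with the same Anaconda interpreter, reports whether the file named in the error exists in that interpreter's library directory. The helper name `locate_in_libdir` is hypothetical, and this is only a diagnostic aid under the assumption that the library ships with the Anaconda install, not a confirmed fix.

```python
import os
import sysconfig


def locate_in_libdir(filename):
    """Return the full path of `filename` inside this interpreter's
    library directory, or None if it is not there."""
    libdir = sysconfig.get_config_var("LIBDIR")  # may be None on some platforms
    if not libdir:
        return None
    candidate = os.path.join(libdir, filename)
    return candidate if os.path.exists(candidate) else None


if __name__ == "__main__":
    # File name copied from the ImportError message above.
    found = locate_in_libdir("libpython3.7m.so.1.0")
    if found:
        print("Found:", found)
        print("Consider adding its directory to LD_LIBRARY_PATH:")
        print("  export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:" + os.path.dirname(found))
    else:
        print("Not found in this interpreter's LIBDIR.")
```

If the file is found, exporting its directory in LD_LIBRARY_PATH before running the test script is a common way to make it visible to the loader.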

ImportError: ../build/_library.so: undefined symbol: _ZTIN3c1021AutogradMetaInterfaceE

I successfully compiled the library on ubuntu 18.04, but encountered the undefined symbol problem. Could you help me to solve the problem?

pygame 1.9.6
Hello from the pygame community. https://www.pygame.org/contribute.html
Traceback (most recent call last):
File "leaner_test.py", line 6, in <module>
import learner
File "../src/learner.py", line 17, in <module>
from library import MCTS, Gomoku, NeuralNetwork
File "../build/library.py", line 15, in <module>
import _library
ImportError: ../build/_library.so: undefined symbol: _ZTIN3c1021AutogradMetaInterfaceE

Thank you very much

bug

sym = self.get_symmetries(board, prob)

Shouldn't the last action be symmetrized too?
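To illustrate the concern: if the training example carries the last move, each symmetric copy needs that move index transformed with the same rotation or flip as the board and the policy. The following is a minimal sketch in plain Python, assuming a square board stored as a list of rows, a flat policy vector, and a last action encoded as row * n + col. These representations and all helper names are illustrative assumptions, not the repository's get_symmetries.

```python
def rot90(grid):
    """Rotate a square grid 90 degrees counterclockwise."""
    n = len(grid)
    return [[grid[j][n - 1 - i] for j in range(n)] for i in range(n)]


def fliplr(grid):
    """Mirror a grid left to right."""
    return [row[::-1] for row in grid]


def rot90_action(action, n):
    """Map a flat action index through the same counterclockwise rotation."""
    r, c = divmod(action, n)
    return (n - 1 - c) * n + r


def fliplr_action(action, n):
    """Map a flat action index through the same left-right mirror."""
    r, c = divmod(action, n)
    return r * n + (n - 1 - c)


def flatten(grid):
    return [x for row in grid for x in row]


def symmetries(board, prob, last_action):
    """Return the 8 symmetric (board, prob, last_action) triples."""
    n = len(board)
    pi = [prob[i * n:(i + 1) * n] for i in range(n)]  # policy as an n x n grid
    out = []
    b, p, a = board, pi, last_action
    for _ in range(4):
        out.append((b, flatten(p), a))
        out.append((fliplr(b), flatten(fliplr(p)), fliplr_action(a, n)))
        b, p, a = rot90(b), rot90(p), rot90_action(a, n)
    return out
```

In every returned triple, the stone that was placed by the last move still sits at the transformed action index, which is the property the question is asking about.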

I don't understand the parallel MCTS code

During simulate, couldn't multiple threads break out of the while loop at the same time and run inference on the same root?

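Question about parallelism

Hello, I read the description: there are 10 simulations, and each simulation is searched by 4 threads. Within each simulation, are variables such as n_visited shared, or is each simulation independent? Does the neural network on the GPU receive 10 board positions at a time, or 4, or 40?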

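Can the number of training threads be made as large as possible?

Hello, can the number of training threads be increased as far as the machine allows?
When the thread count is very large, won't many threads simply be waiting for the GPU to return results, so that training time does not drop noticeably?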

Crashes immediately after ITER::1

"""
The program exits silently once execution reaches this point; f.result() never runs and no error is reported.
Environment:
win10
python 3.7
libtorch 1.3.1
pytorch 1.3.1

"""
# learner.py
with concurrent.futures.ThreadPoolExecutor(max_workers=self.num_train_threads) as executor:
    futures = [executor.submit(self.self_play, 1 if itr % 2 else -1, libtorch, k == 1)
               for k in range(1, self.num_eps + 1)]

    for k, f in enumerate(futures):
        examples = f.result()
        itr_examples += examples
        # decrease libtorch batch size
        remain = min(len(futures) - (k + 1), self.num_train_threads)
        libtorch.set_batch_size(max(remain * self.num_mcts_threads, 1))

        print("EPS: {}, EXAMPLES: {}".format(k + 1, len(examples)))
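The batch-size update in the loop above can be read in isolation: as episodes finish, fewer self-play threads remain, so fewer positions can reach the network at once. Below is a small sketch of that arithmetic only. The function name and the example values (10 episodes, 10 training threads, 4 MCTS threads) are illustrative assumptions, not settings taken from the project.

```python
def remaining_batch_size(num_futures, k, num_train_threads, num_mcts_threads):
    """Batch size after the k-th future (0-based) has completed.

    Mirrors the two lines in the snippet above: the number of episodes
    still running is capped by the training thread count, and each running
    episode can contribute one request per MCTS thread.
    """
    remain = min(num_futures - (k + 1), num_train_threads)
    return max(remain * num_mcts_threads, 1)


if __name__ == "__main__":
    for k in range(10):
        print(k + 1, remaining_batch_size(10, k, 10, 4))
```

With these example values the batch size shrinks from 36 after the first episode completes down to 1 after the last, so the network never waits for a batch that can no longer be filled.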

Error from cmake ../build

/home/ubuntu/myCode/alpha-zero-gomoku/src/libtorch.cpp:15:19: error: no matching function for call to ‘std::shared_ptr<torch::jit::script::Module>::shared_ptr(torch::jit::script::Module)’
loop(nullptr) {
^
In file included from /usr/include/c++/6/memory:82:0,
from /home/ubuntu/libtorch/libtorch/include/c10/core/Allocator.h:4,
from /home/ubuntu/libtorch/libtorch/include/ATen/ATen.h:3,
from /home/ubuntu/libtorch/libtorch/include/torch/csrc/api/include/torch/types.h:3,
from /home/ubuntu/libtorch/libtorch/include/torch/script.h:3,
from /home/ubuntu/myCode/alpha-zero-gomoku/./src/libtorch.h:3,
from /home/ubuntu/myCode/alpha-zero-gomoku/src/libtorch.cpp:1:
/usr/include/c++/6/bits/shared_ptr.h:327:7: note: candidate: std::shared_ptr<_Tp>::shared_ptr(const std::weak_ptr<_Tp>&, std::nothrow_t) [with _Tp = torch::jit::script::Module]
shared_ptr(const weak_ptr<_Tp>& __r, std::nothrow_t)
^~~~~~~~~~

Segmentation fault on Ubuntu

pygame 1.9.5
Hello from the pygame community. https://www.pygame.org/contribute.html
ALSA lib confmisc.c:768:(parse_card) cannot find card '0'
ALSA lib conf.c:4292:(_snd_config_evaluate) function snd_func_card_driver returned error: No such file or directory
ALSA lib confmisc.c:392:(snd_func_concat) error evaluating strings
ALSA lib conf.c:4292:(_snd_config_evaluate) function snd_func_concat returned error: No such file or directory
ALSA lib confmisc.c:1251:(snd_func_refer) error evaluating name
ALSA lib conf.c:4292:(_snd_config_evaluate) function snd_func_refer returned error: No such file or directory
ALSA lib conf.c:4771:(snd_config_expand) Evaluate error: No such file or directory
ALSA lib pcm.c:2266:(snd_pcm_open_noupdate) Unknown PCM default
ITER :: 1
Segmentation fault

Can it run on Ubuntu?

After some struggle I managed to compile on Windows, but compiling on Ubuntu runs into the following problem:
(dl) root@d93dac271dfe:/input/alpha-zero-gomoku/build# cmake --build .
[ 16%] Built target library_swig_compilation
CMakeFiles/_library.dir/build.make:138: *** target pattern contains no '%'.  Stop.
CMakeFiles/Makefile2:72: recipe for target 'CMakeFiles/_library.dir/all' failed
make[1]: *** [CMakeFiles/_library.dir/all] Error 2
Makefile:83: recipe for target 'all' failed
make: *** [all] Error 2

libtorch_cpu.so: undefined symbol

I have a problem with learner_test.py.
I tried pytorch 1.0 and 1.7, and neither works.
The error is as follows. Can someone take a look?
Traceback (most recent call last):
File "leaner_test.py", line 6, in <module>
import learner
File "../src/learner.py", line 16, in <module>
from neural_network import NeuralNetWorkWrapper
File "../src/neural_network.py", line 6, in <module>
import torch
File "/home/jinxiaoyang/anaconda3/envs/multiAlphaGomoku/lib/python3.6/site-packages/torch/__init__.py", line 197, in <module>
from torch._C import * # noqa: F403
ImportError: /home/jinxiaoyang/anaconda3/envs/multiAlphaGomoku/lib/python3.6/site-packages/torch/lib/libtorch_cpu.so: undefined symbol: _ZNK3c1010TensorImpl23shallow_copy_and_detachERKNS_15VariableVersionEb

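Speed issue

I downloaded the model from Baidu Netdisk and played against the AI. The computer takes about 5 minutes to make each move.
Is this speed normal? (I am using a Xeon CPU, 2.1 GHz, 8 cores.)
Is there any way to speed it up?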

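Error when compiling and running; please help take a look

[image]
Hello, I have a question. I cloned your source code, compiled it from scratch, and training then fails with this error.
However, when I download the package from your release, training runs without problems. What could be causing this?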

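A question about open-source licensing

Hello.

Your project provides an excellent, high-performance baseline for training Gomoku agents. Reading it has been very helpful for understanding Monte Carlo tree search and the other techniques used in AlphaZero.

Building on your project, I have implemented some additional training methods, such as league training. May I use your project's code under the MIT license and open-source my implementation?

Thank you very much!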

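My own C++ implementation is very slow

I rewrote self-play with libtorch and C++ as well, but it runs extremely slowly and the training results are very poor.
Could you tell me which network architecture you use, and how long one self-play game takes on average when you run it?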

Versions

Which versions of Ubuntu and numpy should I use?
I used libtorch-cxx11-abi-shared-with-deps-1.2.0.zip with torch 1.2.0 and got:
Segmentation fault (core dumped)
I cannot find a suitable libtorch 1.1.0.

Error after --build .

error: no matching function for call to ‘std::shared_ptr<torch::jit::script::Module>::shared_ptr(torch::jit::script::Module)’
loop(nullptr) {

I am not sure whether this is a torch version problem. I am using the latest version of torch.

exploration

'num_explore': 1,

Just 1 step? Why so small?

Value loss doesn't decrease

Hi hijkzzz,

I really appreciate the value of your work.

Currently I'm trying to reimplement your work in pure C++ to accommodate my project's needs. After reimplementing your code in C++, I found that the value loss doesn't decrease at all, even after ~2000 iterations, while the policy loss decreases quickly from the very beginning.

Did you run into a similar issue when developing your AlphaZero project? Could you suggest possible causes based on your experience?

Many thanks,
Alex

buffer_len

'examples_buffer_max_len': 20,

Why so small? In the AlphaZero paper, it's 500000 games.

No module named 'library' problem

(tensorflow_gpuenv) C:\Users\William\source\repos\alpha-zero-gomoku\test>python leaner_test.py play
Traceback (most recent call last):
File "leaner_test.py", line 6, in <module>
import learner
File "../src\learner.py", line 17, in <module>
from library import MCTS, Gomoku, NeuralNetwork
ModuleNotFoundError: No module named 'library'

This is what my screen shows. I am using an Anaconda environment with Python 3.6.

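Why is the pretrained model so weak?

Hello, thank you for your code. I tried the pretrained model and found that the AI plays erratically and does not even block an open three. I play X and the AI plays O.
[Screenshot 2021-04-26 8:04:21 AM]
How many games was this model trained for? Do you have any thoughts on this?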

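Some confusion about parallel MCTS

Hello, could you explain roughly how your parallel MCTS is implemented? How is the virtual loss communicated between threads? Is batched prediction also used when playing against a human? Thank you for your help.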
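For readers unfamiliar with the term, virtual loss is usually not passed between threads as a message. It is typically a counter stored on the shared tree node, which every search thread reads during selection. The sketch below shows that general idea in Python under stated assumptions: the class and field names (apart from n_visited, which appears in another issue on this page) are hypothetical, and this is not taken from the project's C++ code.

```python
import threading


class Node:
    """A tree node shared by all search threads."""

    def __init__(self, prior):
        self.prior = prior
        self.n_visited = 0
        self.value_sum = 0.0
        self.virtual_loss = 0  # number of threads currently below this node
        self.lock = threading.Lock()

    def q_value(self):
        """Mean value, counting each in-flight visit as a loss (-1)."""
        with self.lock:
            n = self.n_visited + self.virtual_loss
            if n == 0:
                return 0.0
            return (self.value_sum - self.virtual_loss) / n

    def apply_virtual_loss(self):
        """Called when a thread selects this node on its way down."""
        with self.lock:
            self.virtual_loss += 1

    def backup(self, value):
        """Called on the way back up: undo the virtual loss, record the result."""
        with self.lock:
            self.virtual_loss -= 1
            self.n_visited += 1
            self.value_sum += value
```

Because a node with in-flight visits temporarily looks worse, other threads tend to pick different branches, which reduces the chance that several threads evaluate the same leaf at the same time.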
