hijkzzz / alpha-zero-gomoku
A Multi-threaded Implementation of AlphaZero
Training failed at iteration 20 in all three of my runs. Could you take a look?
The first two runs failed with the following error:
terminate called after throwing an instance of 'c10::Error'
what(): CUDA error: misaligned address
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
Exception raised from c10_cuda_check_implementation at ../c10/cuda/CUDAException.cpp:44 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f4bfb40d4d7 in /root/miniconda3/lib/python3.8/site-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x64 (0x7f4bfb3d736b in /root/miniconda3/lib/python3.8/site-packages/torch/lib/libc10.so)
frame #2: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, int, bool) + 0x118 (0x7f4b946cdb58 in /root/miniconda3/lib/python3.8/site-packages/torch/lib/libc10_cuda.so)
frame #3: <unknown function> + 0x1985457 (0x7f4b9696d457 in /root/miniconda3/lib/python3.8/site-packages/torch/lib/libtorch_cuda.so)
frame #4: <unknown function> + 0x1d4b680 (0x7f4be3baa680 in /root/miniconda3/lib/python3.8/site-packages/torch/lib/libtorch_cpu.so)
frame #5: at::native::copy_(at::Tensor&, at::Tensor const&, bool) + 0x62 (0x7f4be3bab812 in /root/miniconda3/lib/python3.8/site-packages/torch/lib/libtorch_cpu.so)
frame #6: at::_ops::copy_::call(at::Tensor&, at::Tensor const&, bool) + 0x15f (0x7f4be481a7bf in /root/miniconda3/lib/python3.8/site-packages/torch/lib/libtorch_cpu.so)
frame #7: at::native::_to_copy(at::Tensor const&, c10::optional<c10::ScalarType>, c10::optional<c10::Layout>, c10::optional<c10::Device>, c10::optional<bool>, bool, c10::optional<c10::MemoryFormat>) + 0x1b6b (0x7f4be3e9e2ab in /root/miniconda3/lib/python3.8/site-packages/torch/lib/libtorch_cpu.so)
frame #8: <unknown function> + 0x2d2206b (0x7f4be4b8106b in /root/miniconda3/lib/python3.8/site-packages/torch/lib/libtorch_cpu.so)
frame #9: at::_ops::_to_copy::redispatch(c10::DispatchKeySet, at::Tensor const&, c10::optional<c10::ScalarType>, c10::optional<c10::Layout>, c10::optional<c10::Device>, c10::optional<bool>, bool, c10::optional<c10::MemoryFormat>) + 0xf5 (0x7f4be4368455 in /root/miniconda3/lib/python3.8/site-packages/torch/lib/libtorch_cpu.so)
frame #10: <unknown function> + 0x2b5b453 (0x7f4be49ba453 in /root/miniconda3/lib/python3.8/site-packages/torch/lib/libtorch_cpu.so)
frame #11: at::_ops::_to_copy::redispatch(c10::DispatchKeySet, at::Tensor const&, c10::optional<c10::ScalarType>, c10::optional<c10::Layout>, c10::optional<c10::Device>, c10::optional<bool>, bool, c10::optional<c10::MemoryFormat>) + 0xf5 (0x7f4be4368455 in /root/miniconda3/lib/python3.8/site-packages/torch/lib/libtorch_cpu.so)
frame #12: <unknown function> + 0x4015f9b (0x7f4be5e74f9b in /root/miniconda3/lib/python3.8/site-packages/torch/lib/libtorch_cpu.so)
frame #13: <unknown function> + 0x401641e (0x7f4be5e7541e in /root/miniconda3/lib/python3.8/site-packages/torch/lib/libtorch_cpu.so)
frame #14: at::_ops::_to_copy::call(at::Tensor const&, c10::optional<c10::ScalarType>, c10::optional<c10::Layout>, c10::optional<c10::Device>, c10::optional<bool>, bool, c10::optional<c10::MemoryFormat>) + 0x1f9 (0x7f4be43ee819 in /root/miniconda3/lib/python3.8/site-packages/torch/lib/libtorch_cpu.so)
frame #15: at::native::to(at::Tensor const&, c10::optional<c10::ScalarType>, c10::optional<c10::Layout>, c10::optional<c10::Device>, c10::optional<bool>, bool, bool, c10::optional<c10::MemoryFormat>) + 0x11b (0x7f4be3e94e5b in /root/miniconda3/lib/python3.8/site-packages/torch/lib/libtorch_cpu.so)
frame #16: <unknown function> + 0x2eeef81 (0x7f4be4d4df81 in /root/miniconda3/lib/python3.8/site-packages/torch/lib/libtorch_cpu.so)
frame #17: at::_ops::to_dtype_layout::call(at::Tensor const&, c10::optional<c10::ScalarType>, c10::optional<c10::Layout>, c10::optional<c10::Device>, c10::optional<bool>, bool, bool, c10::optional<c10::MemoryFormat>) + 0x20e (0x7f4be456d15e in /root/miniconda3/lib/python3.8/site-packages/torch/lib/libtorch_cpu.so)
frame #18: at::Tensor::to(c10::TensorOptions, bool, bool, c10::optional<c10::MemoryFormat>) const + 0x132 (0x7f4bfb869d22 in /root/autodl-tmp/alpha-zero-gomoku/test/../build/_library.so)
frame #19: NeuralNetwork::infer() + 0xb6b (0x7f4bfb86777b in /root/autodl-tmp/alpha-zero-gomoku/test/../build/_library.so)
frame #20: <unknown function> + 0x5972d (0x7f4bfb86872d in /root/autodl-tmp/alpha-zero-gomoku/test/../build/_library.so)
frame #21: <unknown function> + 0x145a0 (0x7f4bfba115a0 in /root/miniconda3/lib/python3.8/site-packages/torch/lib/libtorch.so)
frame #22: <unknown function> + 0x8609 (0x7f4c1b7ff609 in /usr/lib/x86_64-linux-gnu/libpthread.so.0)
frame #23: clone + 0x43 (0x7f4c1b724133 in /usr/lib/x86_64-linux-gnu/libc.so.6)
Aborted (core dumped)
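For the last run I followed the suggestion in the error message and set CUDA_LAUNCH_BLOCKING=1 before running; that run failed with the following error: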
terminate called after throwing an instance of 'std::runtime_error'
what(): The following operation failed in the TorchScript interpreter.
Traceback of TorchScript, serialized code (most recent call last):
File "code/__torch__/neural_network/___torch_mangle_1624.py", line 30, in forward
p_conv = self.p_conv
res_layers = self.res_layers
_0 = (res_layers).forward(inputs, )
~~~~~~~~~~~~~~~~~~~ <--- HERE
_1 = (p_bn).forward((p_conv).forward(_0, ), )
_2 = (relu).forward(_1, )
File "code/__torch__/torch/nn/modules/container/___torch_mangle_1613.py", line 16, in forward
_1 = getattr(self, "1")
_0 = getattr(self, "0")
_4 = (_1).forward((_0).forward(inputs, ), )
~~~~~~~~~~~ <--- HERE
return (_3).forward((_2).forward(_4, ), )
File "code/__torch__/neural_network/___torch_mangle_1594.py", line 25, in forward
_1 = (conv2).forward((relu).forward(_0, ), )
_2 = (bn2).forward(_1, )
_3 = (downsample_bn).forward((downsample_conv).forward(inputs, ), )
~~~~~~~~~~~~~~~~~~~~~~~~ <--- HERE
input = torch.add_(_2, _3)
return (relu).forward1(input, )
File "code/__torch__/torch/nn/modules/conv/___torch_mangle_1592.py", line 10, in forward
inputs: Tensor) -> Tensor:
weight = self.weight
input = torch._convolution(inputs, weight, None, [1, 1], [1, 1], [1, 1], False, [0, 0], 1, False, False, True, True)
~~~~~~~~~~~~~~~~~~ <--- HERE
return input
Traceback of TorchScript, original code (most recent call last):
/root/miniconda3/lib/python3.8/site-packages/torch/nn/modules/conv.py(459): _conv_forward
/root/miniconda3/lib/python3.8/site-packages/torch/nn/modules/conv.py(463): forward
/root/miniconda3/lib/python3.8/site-packages/torch/nn/modules/module.py(1488): _slow_forward
/root/miniconda3/lib/python3.8/site-packages/torch/nn/modules/module.py(1501): _call_impl
/root/autodl-tmp/alpha-zero-gomoku/test/../src/neural_network.py(47): forward
/root/miniconda3/lib/python3.8/site-packages/torch/nn/modules/module.py(1488): _slow_forward
/root/miniconda3/lib/python3.8/site-packages/torch/nn/modules/module.py(1501): _call_impl
/root/miniconda3/lib/python3.8/site-packages/torch/nn/modules/container.py(217): forward
/root/miniconda3/lib/python3.8/site-packages/torch/nn/modules/module.py(1488): _slow_forward
/root/miniconda3/lib/python3.8/site-packages/torch/nn/modules/module.py(1501): _call_impl
/root/autodl-tmp/alpha-zero-gomoku/test/../src/neural_network.py(84): forward
/root/miniconda3/lib/python3.8/site-packages/torch/nn/modules/module.py(1488): _slow_forward
/root/miniconda3/lib/python3.8/site-packages/torch/nn/modules/module.py(1501): _call_impl
/root/miniconda3/lib/python3.8/site-packages/torch/jit/_trace.py(1056): trace_module
/root/miniconda3/lib/python3.8/site-packages/torch/jit/_trace.py(794): trace
/root/autodl-tmp/alpha-zero-gomoku/test/../src/neural_network.py(279): save_model
/root/autodl-tmp/alpha-zero-gomoku/test/../src/learner.py(114): learn
learner_test.py(17): <module>
RuntimeError: cuDNN error: CUDNN_STATUS_EXECUTION_FAILED
Aborted (core dumped)
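libtorch package, downloaded from the official site:
https://download.pytorch.org/libtorch/cu101/libtorch-shared-with-deps-1.5.0%2Bcu101.zip
Environment variable:
export PATH=$PATH:/home/xxx/libtorch
Python was installed with Anaconda; the versions of the main libraries are:
Python 3.7.6
CUDA 10.1
PyTorch 1.1
The model was downloaded from Baidu Pan and placed in the project's models directory.
Command run:
python leaner_test.py play
Error: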
Traceback (most recent call last):
  File "leaner_test.py", line 7, in <module>
    import learner
  File "../src/learner.py", line 14, in <module>
    from library import MCTS, Gomoku, NeuralNetwork
  File "../build/library.py", line 15, in <module>
    import _library
ImportError: libpython3.7m.so.1.0: cannot open shared object file: No such file or directory
What else do I still need to configure?
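As a hedged diagnostic sketch for errors of this kind: the dynamic loader cannot find libpython3.7m.so.1.0, which normally lives in the lib directory of the Python installation that built the extension. The snippet below asks the running interpreter where its shared library is expected. The helper name python_shared_lib_info is hypothetical, and whether adding that directory to LD_LIBRARY_PATH fixes this particular report is an assumption.

```python
import os
import sysconfig


def python_shared_lib_info():
    """Return (LIBDIR, LDLIBRARY) as reported by the running interpreter.

    Either value may be None on platforms that do not define it.
    """
    libdir = sysconfig.get_config_var("LIBDIR")
    ldlibrary = sysconfig.get_config_var("LDLIBRARY")
    return libdir, ldlibrary


libdir, ldlibrary = python_shared_lib_info()
print("LIBDIR:", libdir)
print("LDLIBRARY:", ldlibrary)

if libdir and ldlibrary:
    candidate = os.path.join(libdir, ldlibrary)
    print("library present:", os.path.exists(candidate))
    # Assumption: exporting this path lets the loader find libpython.
    print("try: export LD_LIBRARY_PATH={}:$LD_LIBRARY_PATH".format(libdir))
```

Run it with the same interpreter that is used to launch the test script, since a conda environment and the system Python report different directories.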
I successfully compiled the library on Ubuntu 18.04, but hit an undefined symbol error when importing it. Could you help me solve this problem?
pygame 1.9.6
Hello from the pygame community. https://www.pygame.org/contribute.html
Traceback (most recent call last):
  File "leaner_test.py", line 6, in <module>
    import learner
  File "../src/learner.py", line 17, in <module>
    from library import MCTS, Gomoku, NeuralNetwork
  File "../build/library.py", line 15, in <module>
    import _library
ImportError: ../build/_library.so: undefined symbol: _ZTIN3c1021AutogradMetaInterfaceE
Thank you very much
sym = self.get_symmetries(board, prob)
Shouldn't the last action be symmetrized too?
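To illustrate the question, here is a minimal sketch, not the repository's get_symmetries: if the last move is fed to the network, every rotation or flip applied to the board has to move the last-action coordinate in the same way (and the policy vector likewise). All helper names and the list-of-lists board layout are assumptions made for the example.

```python
def rot90(grid):
    """Rotate an n x n grid 90 degrees counter-clockwise."""
    n = len(grid)
    return [[grid[c][n - 1 - r] for c in range(n)] for r in range(n)]


def rot90_point(point, n):
    """Map a (row, col) coordinate through the same rotation as rot90."""
    i, j = point
    return (n - 1 - j, i)


def fliplr(grid):
    """Mirror a grid left to right."""
    return [row[::-1] for row in grid]


def fliplr_point(point, n):
    """Map a (row, col) coordinate through the same mirror as fliplr."""
    i, j = point
    return (i, n - 1 - j)


def symmetries(board, last_action):
    """Return the 8 dihedral variants as (board, last_action) pairs."""
    n = len(board)
    out = []
    b, a = board, last_action
    for _ in range(4):
        out.append((b, a))
        out.append((fliplr(b), fliplr_point(a, n)))
        b, a = rot90(b), rot90_point(a, n)
    return out


# The single stone on the board is the last move, at row 0, column 1.
board = [[0, 1, 0],
         [0, 0, 0],
         [0, 0, 0]]
for b, (i, j) in symmetries(board, (0, 1)):
    # In every variant the transformed coordinate still points at the stone.
    print((i, j), b[i][j])
```

If only the board were transformed, the last-action coordinate would point at an empty cell in most of the eight variants.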
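During simulate, couldn't several threads break out of the while loop at the same time and run inference on the same root?
Hello. From the description, there are 10 simulations, each searched by 4 threads. Are variables such as n_visited shared within each simulation, or is each simulation independent? And does the neural network on the GPU receive 10 board positions at a time, or 4, or 40?
I'm working on a similar project and hope the author can re-upload the model that was previously shared on Baidu Cloud.
Hello. If the machine allows it, can the number of training threads be increased?
With a very large number of threads, will many of them just be waiting for the GPU to return results, so that training time does not decrease noticeably?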
Also, is C++ calling Python or libtorch calling Python?
I know you are loading the trained PT in Libtorch, but why do you need SWIG?
"""
The program crashes starting at this point: f.result() never completes, and no error is reported.
Environment:
Windows 10
Python 3.7
libtorch 1.3.1
PyTorch 1.3.1
"""
# learner.py
with concurrent.futures.ThreadPoolExecutor(max_workers=self.num_train_threads) as executor:
    futures = [executor.submit(self.self_play, 1 if itr % 2 else -1, libtorch, k == 1) for k in
               range(1, self.num_eps + 1)]
    for k, f in enumerate(futures):
        examples = f.result()
        itr_examples += examples
        # decrease libtorch batch size
        remain = min(len(futures) - (k + 1), self.num_train_threads)
        libtorch.set_batch_size(max(remain * self.num_mcts_threads, 1))
        print("EPS: {}, EXAMPLES: {}".format(k + 1, len(examples)))
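To make the batch-size arithmetic in the snippet above concrete, here is a small sketch that reproduces only that calculation. The function name and the example values (10 episodes, 4 training threads, 4 MCTS threads) are hypothetical. The apparent intent, which is an assumption here, is that as games finish fewer search threads remain in flight, so the inference batch size is lowered to avoid waiting for requests that will never arrive.

```python
def batch_size_schedule(num_eps, num_train_threads, num_mcts_threads):
    """Batch size set after each finished episode, using the same formula
    as the learner.py loop: max(remain * num_mcts_threads, 1)."""
    sizes = []
    for k in range(num_eps):
        # Episodes not yet finished, capped by the number of worker threads.
        remain = min(num_eps - (k + 1), num_train_threads)
        sizes.append(max(remain * num_mcts_threads, 1))
    return sizes


print(batch_size_schedule(10, 4, 4))
# [16, 16, 16, 16, 16, 16, 12, 8, 4, 1]
```

The batch size stays at 16 while at least four games are still pending, then shrinks in steps of num_mcts_threads and never drops below 1.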
/home/ubuntu/myCode/alpha-zero-gomoku/src/libtorch.cpp:15:19: error: no matching function for call to ‘std::shared_ptr<torch::jit::script::Module>::shared_ptr(torch::jit::script::Module)’
loop(nullptr) {
^
In file included from /usr/include/c++/6/memory:82:0,
from /home/ubuntu/libtorch/libtorch/include/c10/core/Allocator.h:4,
from /home/ubuntu/libtorch/libtorch/include/ATen/ATen.h:3,
from /home/ubuntu/libtorch/libtorch/include/torch/csrc/api/include/torch/types.h:3,
from /home/ubuntu/libtorch/libtorch/include/torch/script.h:3,
from /home/ubuntu/myCode/alpha-zero-gomoku/./src/libtorch.h:3,
from /home/ubuntu/myCode/alpha-zero-gomoku/src/libtorch.cpp:1:
/usr/include/c++/6/bits/shared_ptr.h:327:7: note: candidate: std::shared_ptr<_Tp>::shared_ptr(const std::weak_ptr<_Tp>&, std::nothrow_t) [with _Tp = torch::jit::script::Module]
shared_ptr(const weak_ptr<_Tp>& __r, std::nothrow_t)
^~~~~~~~~~
pygame 1.9.5
Hello from the pygame community. https://www.pygame.org/contribute.html
ALSA lib confmisc.c:768:(parse_card) cannot find card '0'
ALSA lib conf.c:4292:(_snd_config_evaluate) function snd_func_card_driver returned error: No such file or directory
ALSA lib confmisc.c:392:(snd_func_concat) error evaluating strings
ALSA lib conf.c:4292:(_snd_config_evaluate) function snd_func_concat returned error: No such file or directory
ALSA lib confmisc.c:1251:(snd_func_refer) error evaluating name
ALSA lib conf.c:4292:(_snd_config_evaluate) function snd_func_refer returned error: No such file or directory
ALSA lib conf.c:4771:(snd_config_expand) Evaluate error: No such file or directory
ALSA lib pcm.c:2266:(snd_pcm_open_noupdate) Unknown PCM default
ITER :: 1
Segmentation fault
After some struggle I got it to compile on Windows, but compiling on Ubuntu fails with the following problem:
(dl) root@d93dac271dfe:/input/alpha-zero-gomoku/build# cmake --build .
[ 16%] Built target library_swig_compilation
CMakeFiles/_library.dir/build.make:138: *** target pattern contains no '%'.  Stop.
CMakeFiles/Makefile2:72: recipe for target 'CMakeFiles/_library.dir/all' failed
make[1]: *** [CMakeFiles/_library.dir/all] Error 2
Makefile:83: recipe for target 'all' failed
make: *** [all] Error 2
Link to the 1.5-day pre-trained models is broken.
https://pan.baidu.com/s/1c2Otxdl7VWFEXul-FyXaJA
The page says: "Oops, you're too late. The shared file has been cancelled; come earlier next time."
Python's GIL limits the concurrency of multiple game threads, making GPU utilization difficult to improve. If I have time, I will rewrite it in C++.
I have a problem with learner_test.py.
I tried PyTorch 1.0 and 1.7, and neither works.
The error is as follows; can someone take a look?
Traceback (most recent call last):
  File "leaner_test.py", line 6, in <module>
    import learner
  File "../src/learner.py", line 16, in <module>
    from neural_network import NeuralNetWorkWrapper
  File "../src/neural_network.py", line 6, in <module>
    import torch
  File "/home/jinxiaoyang/anaconda3/envs/multiAlphaGomoku/lib/python3.6/site-packages/torch/__init__.py", line 197, in <module>
    from torch._C import * # noqa: F403
ImportError: /home/jinxiaoyang/anaconda3/envs/multiAlphaGomoku/lib/python3.6/site-packages/torch/lib/libtorch_cpu.so: undefined symbol: _ZNK3c1010TensorImpl23shallow_copy_and_detachERKNS_15VariableVersionEb
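Using the model downloaded from Baidu Cloud in human-vs-computer play, the computer takes about 5 minutes to compute each move.
Is this speed normal? (I'm using a Xeon CPU, 2.1 GHz, 8 cores.)
Is there any way to speed it up?
Hello.
Your project provides an excellent, high-performance baseline for training Gomoku agents. Reading it helped me a great deal in understanding the techniques used in AlphaZero, such as Monte Carlo tree search.
Building on your project, I added implementations of some other training methods, such as league training. May I use your project's code under the MIT license and open-source my implementation?
Thank you very much!
I rewrote self-play in C++ with libtorch as well, but it runs extremely slowly and the training results are very poor.
What network architecture does the author use, and how long does one self-play game take on average when you run it?
Compared with Python, setting up a C++ environment is full of difficulties. With the limited description you provide, it is hard for a newcomer to quickly configure an environment that can run your code. If you have time, I would be very grateful if you could expand the documentation on this.
No matter how I build it in Visual Studio I get a .lib, but Python says it needs a DLL.
Thank you
Which version of Ubuntu should I use, and which version of numpy?
I am using libtorch-cxx11-abi-shared-with-deps-1.2.0.zip and torch 1.2.0: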
Segmentation fault (core dumped)
I can't find a suitable libtorch 1.1.0.
error: no matching function for call to ‘std::shared_ptr<torch::jit::script::Module>::shared_ptr(torch::jit::script::Module)’
loop(nullptr) {
I don't know whether this is a torch version problem; I am using the latest version of torch.
'num_explore': 1,
Just 1 step? Why so small?
Hi hijkzzz,
I really appreciate the value of your work.
Currently I'm trying to reimplement your work in pure C++ to accommodate my project's needs. After reimplementing your code in C++, however, I found that the value loss doesn't decrease at all even after ~2000 iterations, while the policy loss decreases quite fast from the very beginning.
Did you run into a similar issue when developing your AlphaZero project? Could you suggest possible causes based on your experience?
Much appreciated,
Alex
'examples_buffer_max_len': 20,
Why so small? In the AlphaZero paper, it's 500,000 games.
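For context, a bounded example buffer is commonly built on collections.deque with maxlen, which silently drops the oldest entry once the limit is reached. The sketch below shows that behaviour only. Whether examples_buffer_max_len in this repository counts games, iterations, or individual positions is not stated here, so the unit used in the sketch (one entry per training iteration) is an assumption, as are the placeholder strings.

```python
from collections import deque

# Assumption: one buffer entry holds all examples from one iteration.
examples_buffer = deque(maxlen=20)

for itr in range(1, 26):
    itr_examples = ["position from iteration {}".format(itr)]
    examples_buffer.append(itr_examples)

print(len(examples_buffer))   # 20
print(examples_buffer[0])     # ['position from iteration 6']
print(examples_buffer[-1])    # ['position from iteration 25']
```

If each entry holds many games, a limit of 20 entries can still cover far more than 20 games, so the number is not directly comparable to a per-game count without knowing the unit.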
Hello,
Is there any official score for the performance rating of this model?
Thanks!
(tensorflow_gpuenv) C:\Users\William\source\repos\alpha-zero-gomoku\test>python leaner_test.py play
Traceback (most recent call last):
  File "leaner_test.py", line 6, in <module>
    import learner
  File "../src\learner.py", line 17, in <module>
    from library import MCTS, Gomoku, NeuralNetwork
ModuleNotFoundError: No module named 'library'
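This is what my screen shows; I am using an Anaconda environment with Python 3.6.
Hello. Could you explain roughly how your parallel MCTS is implemented? How is virtual loss propagated between threads? Is batched prediction also used when playing against a human? Thank you for your help.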
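As background for the virtual-loss question, here is a generic sketch of the technique, not the author's implementation. Each thread adds a temporary penalty to a node it descends through, so other threads see a lower value and tend to explore elsewhere. The penalty is removed when the real value is backed up. The class layout, the lock per node, and the penalty of 3 are all assumptions.

```python
import threading


class Node:
    """Tree node statistics shared by all search threads (hypothetical layout)."""

    def __init__(self):
        self.n_visited = 0
        self.value_sum = 0.0
        self.virtual_loss = 0
        self.lock = threading.Lock()

    def apply_virtual_loss(self, amount=3):
        # Called on the way down: pretend this path already lost `amount` games.
        with self.lock:
            self.virtual_loss += amount

    def backup(self, value, amount=3):
        # Called on the way up: undo the penalty and record the real result.
        with self.lock:
            self.virtual_loss -= amount
            self.n_visited += 1
            self.value_sum += value

    def q_value(self):
        # Mean value as seen by other threads while the penalty is active.
        with self.lock:
            n = self.n_visited + self.virtual_loss
            if n == 0:
                return 0.0
            return (self.value_sum - self.virtual_loss) / n


def worker(node, value):
    node.apply_virtual_loss()
    node.backup(value)


shared = Node()
threads = [threading.Thread(target=worker, args=(shared, 0.5)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(shared.n_visited, shared.virtual_loss, shared.q_value())
```

In this formulation nothing is passed between threads explicitly: the penalty lives in the shared node, and the lock keeps concurrent updates consistent.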
https://github.com/junxiaosong/AlphaZero_Gomoku
That repo uses 4 input channels: current state, opponent state, last move, and current color.
What are the input channels of your model?