openbmb / bmtrain

Efficient Training (including pre-training and fine-tuning) for Big Models

License: Apache License 2.0

Python 84.89% C++ 7.78% Cuda 5.34% C 0.21% Shell 0.16% Dockerfile 0.17% CMake 1.45%

bmtrain's Issues

Could you publish detailed performance data showing how the 90% saving is achieved?

What can I do to handle the overflow?

About CPU Offloading

If I want to use CPU offloading in my model, which BMTrain API should I call?

Disabling ZeRO Optimization

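When I use this framework, ZeRO optimization is enabled by default, and the backward pass takes about 3 times as long as the forward pass in exchange for lower GPU memory usage. If GPU memory is not a constraint, I would prefer faster training. How can I disable ZeRO optimization?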

python setup.py install
running install
running bdist_egg
running egg_info
writing bmtrain.egg-info\PKG-INFO
writing dependency_links to bmtrain.egg-info\dependency_links.txt
writing requirements to bmtrain.egg-info\requires.txt
writing top-level names to bmtrain.egg-info\top_level.txt
reading manifest file 'bmtrain.egg-info\SOURCES.txt'
reading manifest template 'MANIFEST.in'
writing manifest file 'bmtrain.egg-info\SOURCES.txt'
installing library code to build\bdist.win-amd64\egg
running install_lib
running build_py
running build_ext
building 'bmtrain.nccl._C' extension
Emitting ninja build file D:\code\python\nlp\BMTrain-main\build\temp.win-amd64-3.7\Release\build.ninja...
Compiling objects...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
ninja: no work to do.
F:\Program Files\Microsoft Visual Studio\2022\Community\VC\Tools\MSVC\14.34.31933\bin\HostX86\x64\link.exe /nologo /INCREMENTAL:NO /LTCG /DLL /MANIFEST:EMBED,ID=2 /MANIFESTUAC:NO /LIBPATH:F:\Python\Python37\lib\site-packages\torch\lib "/LIBPATH:C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v11.7\lib\x64" /LIBPATH:F:\Python\Python37\libs /LIBPATH:F:\Python\Python37\PCbuild\amd64 "/LIBPATH:F:\Program Files\Microsoft Visual Studio\2022\Community\VC\Tools\MSVC\14.34.31933\ATLMFC\lib\x64" "/LIBPATH:F:\Program Files\Microsoft Visual Studio\2022\Community\VC\Tools\MSVC\14.34.31933\lib\x64" "/LIBPATH:C:\Program Files (x86)\Windows Kits\10\lib\10.0.22000.0\ucrt\x64" "/LIBPATH:C:\Program Files (x86)\Windows Kits\10\lib\10.0.22000.0\um\x64" c10.lib torch.lib torch_cpu.lib torch_python.lib cudart.lib c10_cuda.lib torch_cuda_cu.lib torch_cuda_cpp.lib /EXPORT:PyInit__C D:\code\python\nlp\BMTrain-main\build\temp.win-amd64-3.7\Release\csrc/nccl.obj /OUT:build\lib.win-amd64-3.7\bmtrain\nccl_C.cp37-win_amd64.pyd /IMPLIB:D:\code\python\nlp\BMTrain-main\build\temp.win-amd64-3.7\Release\csrc_C.cp37-win_amd64.lib
Creating library D:\code\python\nlp\BMTrain-main\build\temp.win-amd64-3.7\Release\csrc_C.cp37-win_amd64.lib and object D:\code\python\nlp\BMTrain-main\build\temp.win-amd64-3.7\Release\csrc_C.cp37-win_amd64.exp
nccl.obj : error LNK2001: unresolved external symbol ncclCommInitRank
nccl.obj : error LNK2001: unresolved external symbol ncclReduce
nccl.obj : error LNK2001: unresolved external symbol ncclRecv
nccl.obj : error LNK2001: unresolved external symbol ncclGroupEnd
nccl.obj : error LNK2001: unresolved external symbol ncclSend
nccl.obj : error LNK2001: unresolved external symbol ncclCommCount
nccl.obj : error LNK2001: unresolved external symbol ncclGetUniqueId
nccl.obj : error LNK2001: unresolved external symbol ncclCommDestroy
nccl.obj : error LNK2001: unresolved external symbol ncclBroadcast
nccl.obj : error LNK2001: unresolved external symbol ncclGroupStart
nccl.obj : error LNK2001: unresolved external symbol ncclCommUserRank
nccl.obj : error LNK2001: unresolved external symbol ncclReduceScatter
nccl.obj : error LNK2001: unresolved external symbol ncclAllGather
nccl.obj : error LNK2001: unresolved external symbol ncclAllReduce
nccl.obj : error LNK2001: unresolved external symbol ncclGetErrorString
build\lib.win-amd64-3.7\bmtrain\nccl_C.cp37-win_amd64.pyd : fatal error LNK1120: 15 unresolved externals
error: command 'F:\Program Files\Microsoft Visual Studio\2022\Community\VC\Tools\MSVC\14.34.31933\bin\HostX86\x64\link.exe' failed with exit status 1120

cuda 11.7
torch 1.13.1+cu117
torchaudio 0.13.1+cu117
torchvision 0.14.1+cu117
vs 2022

I can't find a solution. I tried many different versions of torch and VS, and it still fails. Is it because the CUDA version is too new? My GPU is a 3070. Is building on Windows really this hard?

Model extensibility

Thanks for your great work.

According to the official website, BMTrain supports the following model structures:
Encoder(bert-base-cased bert-base-uncased bert-large-cased bert-large-uncased bert-base-chinese bert-base-multilingual-cased)
Decoder(CPM-1(large) GPT-2(base) GPT-2(medium) GPT-2(large) GPT-2(XL) GPT-J(6B))
Encoder-Decoder(CPM-2(large) T5-small T5-base T5-large T5(3B) T5(11B))

Does BMTrain support models outside this list (e.g. ResNet)?
Is there a tutorial?

Looking forward to your prompt reply.

If one machine does not have enough GPU memory to load the model, will the model be loaded onto other machines?

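Hello, my question is as follows:
I have 4 machines, each with two 2080 Ti GPUs (11 GB). If the model is too large for one machine to load, will it be loaded onto the other machines through model parallelism?
If so, does each node run the following command in turn to start training?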
torchrun --nnodes=4 --nproc_per_node=2 --rdzv_id=1 --rdzv_backend=c10d --rdzv_endpoint=xxx.xxx.xxx.xx:88688} train.py
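Note: when running the command above on each node, I change rdzv_id= to 1, 2, 3, and 4 so that each value corresponds to one of the four machines. Is that right?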
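On the launch part of this question: in torchrun, `--rdzv_id` names the job rather than the node, so every node taking part in the same job passes the same value, and the port in `--rdzv_endpoint` has to be a valid TCP port (1 to 65535). The sketch below only builds the command line to make that visible. The helper name `torchrun_command` and the endpoint `10.0.0.1:29500` are hypothetical placeholders, not values from the issue.

```python
import shlex


def torchrun_command(endpoint, nnodes=4, nproc_per_node=2, rdzv_id="1", script="train.py"):
    """Build the torchrun command line that every node runs unchanged."""
    host, _, port = endpoint.rpartition(":")
    if not host or not port.isdigit() or not 0 < int(port) <= 65535:
        raise ValueError(f"endpoint must be host:port with port in 1..65535, got {endpoint!r}")
    return [
        "torchrun",
        f"--nnodes={nnodes}",
        f"--nproc_per_node={nproc_per_node}",
        f"--rdzv_id={rdzv_id}",  # same value on every node: it identifies the job
        "--rdzv_backend=c10d",
        f"--rdzv_endpoint={endpoint}",
        script,
    ]


# Hypothetical endpoint; all four nodes receive an identical command.
commands = [torchrun_command("10.0.0.1:29500") for _ in range(4)]
print(shlex.join(commands[0]))
```

Giving each machine a different `rdzv_id` would, under this reading, start four separate one-node rendezvous attempts instead of one four-node job.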

How can I apply checkpoint block on cpm-1?

Code like this always gives me a division-by-zero error at the CPM line.

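Installation succeeds, but import fails

My environment is CUDA 11.2 with torch 1.12.1+cu113. BMTrain installs successfully, but importing it fails with the error below. What could be the cause? I need to use BMTrain fairly urgently, so any help is appreciated.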
import bmtrain
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/root/anaconda3/lib/python3.8/site-packages/bmtrain/__init__.py", line 16, in <module>
    from . import optim
  File "/root/anaconda3/lib/python3.8/site-packages/bmtrain/optim/__init__.py", line 1, in <module>
    from .adam import AdamOptimizer
  File "/root/anaconda3/lib/python3.8/site-packages/bmtrain/optim/adam.py", line 4, in <module>
    from . import _cuda as C
ImportError: /root/anaconda3/lib/python3.8/site-packages/bmtrain/optim/_cuda.cpython-38-x86_64-linux-gnu.so: undefined symbol: _ZNK2at6Tensor8data_ptrIhEEPT_v

Question about training time

In your table, how long did it take to train this model?

stuck during synchronize

Hi,

When using BMCook with BMTrain, I hit a bug where the second bmtrain.synchronize() call always hangs. Do you have any idea what might cause this?

Below is the code:

import os
import json
import torch
import random
import time
import bmtrain as bmt
from data import MMapIndexedDataset, Dataset
from bmcook import CookTrainer
from bmcook.utils.config import ConfigParser
from bmcook.utils.arguments import parse_args
from pathlib import Path

bmt.init_distributed()
args = parse_args()
save_dir = Path(args.save_dir)
ckpt_dir = save_dir / 'checkpoints'
os.makedirs(ckpt_dir, exist_ok=True)
json.dump(vars(args), open(save_dir / 'train_args.json', 'w'), indent=2)
model_config = config_map[args.model].from_pretrained(args.model)
model = model_map[args.model].from_pretrained(args.model, config=model_config)
# teacher model has the same config as the student model
teacher = model_map[args.model].from_pretrained(args.model, config=model_config)
bmt.synchronize() #this works

...

CookTrainer.set_compression(config, model, optimizer, teacher)    # this step calls bmt.synchronize() again, which is where it hangs

BMTrain installation fails. My environment is GCC 5.4.0, torch 1.7.0, and cudnn 10.2. I also tried other torch versions, for example 1.12.0, and it failed again.

Error message: csrc/adam_cpu.cpp:158:27: error: 'const class at::Tensor' has no member named 'is_cpu'

install error with WinError 5

(cpm) D:\GitHub\BMTrain>python setup.py install
running install
C:\ProgramData\Anaconda3\envs\cpm\lib\site-packages\setuptools\command\install.py:37: SetuptoolsDeprecationWarning: setup.py install is deprecated. Use build and pip and other standards-based tools.
setuptools.SetuptoolsDeprecationWarning,
C:\ProgramData\Anaconda3\envs\cpm\lib\site-packages\setuptools\command\easy_install.py:147: EasyInstallDeprecationWarning: easy_install command is deprecated. Use build and pip and other standards-based tools.
EasyInstallDeprecationWarning,
running bdist_egg
running egg_info
writing bmtrain.egg-info\PKG-INFO
writing dependency_links to bmtrain.egg-info\dependency_links.txt
writing requirements to bmtrain.egg-info\requires.txt
writing top-level names to bmtrain.egg-info\top_level.txt
reading manifest file 'bmtrain.egg-info\SOURCES.txt'
reading manifest template 'MANIFEST.in'
adding license file 'LICENSE'
writing manifest file 'bmtrain.egg-info\SOURCES.txt'
installing library code to build\bdist.win-amd64\egg
running install_lib
running build_py
creating build\lib.win-amd64-cpython-37
creating build\lib.win-amd64-cpython-37\bmtrain
copying bmtrain\block_layer.py -> build\lib.win-amd64-cpython-37\bmtrain
copying bmtrain\checkpointing.py -> build\lib.win-amd64-cpython-37\bmtrain
copying bmtrain\debug.py -> build\lib.win-amd64-cpython-37\bmtrain
copying bmtrain\global_var.py -> build\lib.win-amd64-cpython-37\bmtrain
copying bmtrain\init.py -> build\lib.win-amd64-cpython-37\bmtrain
copying bmtrain\layer.py -> build\lib.win-amd64-cpython-37\bmtrain
copying bmtrain\parameter.py -> build\lib.win-amd64-cpython-37\bmtrain
copying bmtrain\param_init.py -> build\lib.win-amd64-cpython-37\bmtrain
copying bmtrain\pipe_layer.py -> build\lib.win-amd64-cpython-37\bmtrain
copying bmtrain\store.py -> build\lib.win-amd64-cpython-37\bmtrain
copying bmtrain\synchronize.py -> build\lib.win-amd64-cpython-37\bmtrain
copying bmtrain\utils.py -> build\lib.win-amd64-cpython-37\bmtrain
copying bmtrain\wrapper.py -> build\lib.win-amd64-cpython-37\bmtrain
copying bmtrain\__init__.py -> build\lib.win-amd64-cpython-37\bmtrain
creating build\lib.win-amd64-cpython-37\bmtrain\benchmark
copying bmtrain\benchmark\all_gather.py -> build\lib.win-amd64-cpython-37\bmtrain\benchmark
copying bmtrain\benchmark\reduce_scatter.py -> build\lib.win-amd64-cpython-37\bmtrain\benchmark
copying bmtrain\benchmark\send_recv.py -> build\lib.win-amd64-cpython-37\bmtrain\benchmark
copying bmtrain\benchmark\shape.py -> build\lib.win-amd64-cpython-37\bmtrain\benchmark
copying bmtrain\benchmark\utils.py -> build\lib.win-amd64-cpython-37\bmtrain\benchmark
copying bmtrain\benchmark\__init__.py -> build\lib.win-amd64-cpython-37\bmtrain\benchmark
creating build\lib.win-amd64-cpython-37\bmtrain\distributed
copying bmtrain\distributed\ops.py -> build\lib.win-amd64-cpython-37\bmtrain\distributed
copying bmtrain\distributed\__init__.py -> build\lib.win-amd64-cpython-37\bmtrain\distributed
creating build\lib.win-amd64-cpython-37\bmtrain\inspect
copying bmtrain\inspect\format.py -> build\lib.win-amd64-cpython-37\bmtrain\inspect
copying bmtrain\inspect\model.py -> build\lib.win-amd64-cpython-37\bmtrain\inspect
copying bmtrain\inspect\tensor.py -> build\lib.win-amd64-cpython-37\bmtrain\inspect
copying bmtrain\inspect\__init__.py -> build\lib.win-amd64-cpython-37\bmtrain\inspect
creating build\lib.win-amd64-cpython-37\bmtrain\loss
copying bmtrain\loss\cross_entropy.py -> build\lib.win-amd64-cpython-37\bmtrain\loss
copying bmtrain\loss\__init__.py -> build\lib.win-amd64-cpython-37\bmtrain\loss
creating build\lib.win-amd64-cpython-37\bmtrain\lr_scheduler
copying bmtrain\lr_scheduler\cosine.py -> build\lib.win-amd64-cpython-37\bmtrain\lr_scheduler
copying bmtrain\lr_scheduler\exponential.py -> build\lib.win-amd64-cpython-37\bmtrain\lr_scheduler
copying bmtrain\lr_scheduler\linear.py -> build\lib.win-amd64-cpython-37\bmtrain\lr_scheduler
copying bmtrain\lr_scheduler\noam.py -> build\lib.win-amd64-cpython-37\bmtrain\lr_scheduler
copying bmtrain\lr_scheduler\no_decay.py -> build\lib.win-amd64-cpython-37\bmtrain\lr_scheduler
copying bmtrain\lr_scheduler\warmup.py -> build\lib.win-amd64-cpython-37\bmtrain\lr_scheduler
copying bmtrain\lr_scheduler\__init__.py -> build\lib.win-amd64-cpython-37\bmtrain\lr_scheduler
creating build\lib.win-amd64-cpython-37\bmtrain\nccl
copying bmtrain\nccl\enums.py -> build\lib.win-amd64-cpython-37\bmtrain\nccl
copying bmtrain\nccl\__init__.py -> build\lib.win-amd64-cpython-37\bmtrain\nccl
creating build\lib.win-amd64-cpython-37\bmtrain\optim
copying bmtrain\optim\adam.py -> build\lib.win-amd64-cpython-37\bmtrain\optim
copying bmtrain\optim\adam_offload.py -> build\lib.win-amd64-cpython-37\bmtrain\optim
copying bmtrain\optim\optim_manager.py -> build\lib.win-amd64-cpython-37\bmtrain\optim
copying bmtrain\optim\__init__.py -> build\lib.win-amd64-cpython-37\bmtrain\optim
running build_ext
error: [WinError 5] Access is denied.

[Feature Request] Synchronizing scalers of multiple optimizers

Sometimes, we need to use multiple optimizers for different parameters so that we can turn on and off the optimization of different parameters easily.

However, in the current implementation of BMTrain, every optimizer has its own scale. To get correct gradients, I either need to put all parameters into one optimizer, or call backward multiple times, once per optimizer with its own scaler (I'm not sure this works; I haven't tried it yet).

So I am requesting a utility that synchronizes the scalers of multiple optimizers. It would take the loss and a list of optimizers as parameters and, as far as I can see, work roughly like this:

min_scale = float("inf")  # initialize
for optimizer in optimizers:
  if optimizer.scale < min_scale:
    min_scale = optimizer.scale
for optimizer in optimizers:
  optimizer.scale = min_scale
loss = loss * min_scale  # scale the loss
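The requested behavior can be tried out in isolation with a stand-in optimizer. This is a minimal sketch, not BMTrain code: `FakeOptimizer` and `sync_scales` are hypothetical names, and the only assumption taken from the request is that each optimizer exposes a writable `scale` attribute.

```python
from dataclasses import dataclass


@dataclass
class FakeOptimizer:
    """Stand-in for an optimizer that exposes a loss `scale` attribute."""
    scale: float


def sync_scales(loss, optimizers):
    """Set every optimizer to the smallest scale and return the scaled loss."""
    min_scale = min(opt.scale for opt in optimizers)
    for opt in optimizers:
        opt.scale = min_scale
    return loss * min_scale


optimizers = [FakeOptimizer(1024.0), FakeOptimizer(256.0), FakeOptimizer(4096.0)]
scaled_loss = sync_scales(0.5, optimizers)
print(scaled_loss, [opt.scale for opt in optimizers])  # 128.0 [256.0, 256.0, 256.0]
```

Taking the minimum is the conservative choice: a single backward pass then uses a scale that none of the optimizers has already found to overflow.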

Installation failed

Building wheels for collected packages: bmtrain
Building wheel for bmtrain (setup.py) ... error
error: subprocess-exited-with-error

× python setup.py bdist_wheel did not run successfully.
│ exit code: 1
╰─> [156 lines of output]
running bdist_wheel
running build
running build_py
creating build
creating build/lib.linux-x86_64-3.8
creating build/lib.linux-x86_64-3.8/bmtrain
copying bmtrain/__init__.py -> build/lib.linux-x86_64-3.8/bmtrain
copying bmtrain/backward.py -> build/lib.linux-x86_64-3.8/bmtrain
copying bmtrain/block_layer.py -> build/lib.linux-x86_64-3.8/bmtrain
copying bmtrain/checkpointing.py -> build/lib.linux-x86_64-3.8/bmtrain
copying bmtrain/debug.py -> build/lib.linux-x86_64-3.8/bmtrain
copying bmtrain/global_var.py -> build/lib.linux-x86_64-3.8/bmtrain
copying bmtrain/init.py -> build/lib.linux-x86_64-3.8/bmtrain
copying bmtrain/layer.py -> build/lib.linux-x86_64-3.8/bmtrain
copying bmtrain/param_init.py -> build/lib.linux-x86_64-3.8/bmtrain
copying bmtrain/parameter.py -> build/lib.linux-x86_64-3.8/bmtrain
copying bmtrain/store.py -> build/lib.linux-x86_64-3.8/bmtrain
copying bmtrain/synchronize.py -> build/lib.linux-x86_64-3.8/bmtrain
copying bmtrain/utils.py -> build/lib.linux-x86_64-3.8/bmtrain
copying bmtrain/wrapper.py -> build/lib.linux-x86_64-3.8/bmtrain
creating build/lib.linux-x86_64-3.8/bmtrain/benchmark
copying bmtrain/benchmark/__init__.py -> build/lib.linux-x86_64-3.8/bmtrain/benchmark
copying bmtrain/benchmark/all_gather.py -> build/lib.linux-x86_64-3.8/bmtrain/benchmark
copying bmtrain/benchmark/reduce_scatter.py -> build/lib.linux-x86_64-3.8/bmtrain/benchmark
copying bmtrain/benchmark/shape.py -> build/lib.linux-x86_64-3.8/bmtrain/benchmark
copying bmtrain/benchmark/utils.py -> build/lib.linux-x86_64-3.8/bmtrain/benchmark
creating build/lib.linux-x86_64-3.8/bmtrain/distributed
copying bmtrain/distributed/__init__.py -> build/lib.linux-x86_64-3.8/bmtrain/distributed
copying bmtrain/distributed/ops.py -> build/lib.linux-x86_64-3.8/bmtrain/distributed
creating build/lib.linux-x86_64-3.8/bmtrain/inspect
copying bmtrain/inspect/__init__.py -> build/lib.linux-x86_64-3.8/bmtrain/inspect
copying bmtrain/inspect/format.py -> build/lib.linux-x86_64-3.8/bmtrain/inspect
copying bmtrain/inspect/model.py -> build/lib.linux-x86_64-3.8/bmtrain/inspect
copying bmtrain/inspect/tensor.py -> build/lib.linux-x86_64-3.8/bmtrain/inspect
creating build/lib.linux-x86_64-3.8/bmtrain/loss
copying bmtrain/loss/__init__.py -> build/lib.linux-x86_64-3.8/bmtrain/loss
copying bmtrain/loss/cross_entropy.py -> build/lib.linux-x86_64-3.8/bmtrain/loss
creating build/lib.linux-x86_64-3.8/bmtrain/lr_scheduler
copying bmtrain/lr_scheduler/__init__.py -> build/lib.linux-x86_64-3.8/bmtrain/lr_scheduler
copying bmtrain/lr_scheduler/cosine.py -> build/lib.linux-x86_64-3.8/bmtrain/lr_scheduler
copying bmtrain/lr_scheduler/exponential.py -> build/lib.linux-x86_64-3.8/bmtrain/lr_scheduler
copying bmtrain/lr_scheduler/linear.py -> build/lib.linux-x86_64-3.8/bmtrain/lr_scheduler
copying bmtrain/lr_scheduler/no_decay.py -> build/lib.linux-x86_64-3.8/bmtrain/lr_scheduler
copying bmtrain/lr_scheduler/noam.py -> build/lib.linux-x86_64-3.8/bmtrain/lr_scheduler
copying bmtrain/lr_scheduler/warmup.py -> build/lib.linux-x86_64-3.8/bmtrain/lr_scheduler
creating build/lib.linux-x86_64-3.8/bmtrain/nccl
copying bmtrain/nccl/__init__.py -> build/lib.linux-x86_64-3.8/bmtrain/nccl
copying bmtrain/nccl/enums.py -> build/lib.linux-x86_64-3.8/bmtrain/nccl
creating build/lib.linux-x86_64-3.8/bmtrain/optim
copying bmtrain/optim/__init__.py -> build/lib.linux-x86_64-3.8/bmtrain/optim
copying bmtrain/optim/adam.py -> build/lib.linux-x86_64-3.8/bmtrain/optim
copying bmtrain/optim/adam_offload.py -> build/lib.linux-x86_64-3.8/bmtrain/optim
copying bmtrain/optim/clip_grad.py -> build/lib.linux-x86_64-3.8/bmtrain/optim
running build_ext
/usr/local/lib/python3.8/site-packages/torch/utils/cpp_extension.py:329: UserWarning:

                                 !! WARNING !!
  
  !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
  Your compiler (g++ 4.8.5) may be ABI-incompatible with PyTorch!
  Please use a compiler that is ABI-compatible with GCC 5.0 and above.
  See https://gcc.gnu.org/onlinedocs/libstdc++/manual/abi.html.
  
  See https://gist.github.com/goldsborough/d466f43e8ffc948ff92de7486c5216d6
  for instructions on how to install GCC 5 or higher.
  !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
  
                                !! WARNING !!
  
    warnings.warn(ABI_INCOMPATIBILITY_WARNING.format(compiler))
  building 'bmtrain.nccl._C' extension
  creating /tmp/pip-install-9z6c_ooh/bmtrain_926bd6a4c18c493a82619c6ec3553d2e/build/temp.linux-x86_64-3.8
  creating /tmp/pip-install-9z6c_ooh/bmtrain_926bd6a4c18c493a82619c6ec3553d2e/build/temp.linux-x86_64-3.8/csrc
  /usr/local/lib/python3.8/site-packages/torch/utils/cpp_extension.py:329: UserWarning:
  
                                 !! WARNING !!
  
  !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
  Your compiler (c++ 4.8.5) may be ABI-incompatible with PyTorch!
  Please use a compiler that is ABI-compatible with GCC 5.0 and above.
  See https://gcc.gnu.org/onlinedocs/libstdc++/manual/abi.html.
  
  See https://gist.github.com/goldsborough/d466f43e8ffc948ff92de7486c5216d6
  for instructions on how to install GCC 5 or higher.
  !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
  
                                !! WARNING !!
  
    warnings.warn(ABI_INCOMPATIBILITY_WARNING.format(compiler))
  Emitting ninja build file /tmp/pip-install-9z6c_ooh/bmtrain_926bd6a4c18c493a82619c6ec3553d2e/build/temp.linux-x86_64-3.8/build.ninja...
  Compiling objects...
  Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
  [1/1] c++ -MMD -MF /tmp/pip-install-9z6c_ooh/bmtrain_926bd6a4c18c493a82619c6ec3553d2e/build/temp.linux-x86_64-3.8/csrc/nccl.o.d -pthread -Wno-unused-result -Wsign-compare -DNDEBUG -g -fwrapv -O3 -Wall -fPIC -Icsrc/nccl/build/include -I/usr/local/lib/python3.8/site-packages/torch/include -I/usr/local/lib/python3.8/site-packages/torch/include/torch/csrc/api/include -I/usr/local/lib/python3.8/site-packages/torch/include/TH -I/usr/local/lib/python3.8/site-packages/torch/include/THC -I/usr/local/cuda/include -I/usr/local/include/python3.8 -c -c /tmp/pip-install-9z6c_ooh/bmtrain_926bd6a4c18c493a82619c6ec3553d2e/csrc/nccl.cpp -o /tmp/pip-install-9z6c_ooh/bmtrain_926bd6a4c18c493a82619c6ec3553d2e/build/temp.linux-x86_64-3.8/csrc/nccl.o -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=_C -D_GLIBCXX_USE_CXX11_ABI=0 -std=c++14
  FAILED: /tmp/pip-install-9z6c_ooh/bmtrain_926bd6a4c18c493a82619c6ec3553d2e/build/temp.linux-x86_64-3.8/csrc/nccl.o
  c++ -MMD -MF /tmp/pip-install-9z6c_ooh/bmtrain_926bd6a4c18c493a82619c6ec3553d2e/build/temp.linux-x86_64-3.8/csrc/nccl.o.d -pthread -Wno-unused-result -Wsign-compare -DNDEBUG -g -fwrapv -O3 -Wall -fPIC -Icsrc/nccl/build/include -I/usr/local/lib/python3.8/site-packages/torch/include -I/usr/local/lib/python3.8/site-packages/torch/include/torch/csrc/api/include -I/usr/local/lib/python3.8/site-packages/torch/include/TH -I/usr/local/lib/python3.8/site-packages/torch/include/THC -I/usr/local/cuda/include -I/usr/local/include/python3.8 -c -c /tmp/pip-install-9z6c_ooh/bmtrain_926bd6a4c18c493a82619c6ec3553d2e/csrc/nccl.cpp -o /tmp/pip-install-9z6c_ooh/bmtrain_926bd6a4c18c493a82619c6ec3553d2e/build/temp.linux-x86_64-3.8/csrc/nccl.o -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=_C -D_GLIBCXX_USE_CXX11_ABI=0 -std=c++14
  c++: error: unrecognized command line option ‘-std=c++14’
  ninja: build stopped: subcommand failed.
  Traceback (most recent call last):
    File "/usr/local/lib/python3.8/site-packages/torch/utils/cpp_extension.py", line 1740, in _run_ninja_build
      subprocess.run(
    File "/usr/local/lib/python3.8/subprocess.py", line 516, in run
      raise CalledProcessError(retcode, process.args,
  subprocess.CalledProcessError: Command '['ninja', '-v']' returned non-zero exit status 1.
  
  The above exception was the direct cause of the following exception:
  
  Traceback (most recent call last):
    File "<string>", line 2, in <module>
    File "<pip-setuptools-caller>", line 34, in <module>
    File "/tmp/pip-install-9z6c_ooh/bmtrain_926bd6a4c18c493a82619c6ec3553d2e/setup.py", line 74, in <module>
      setup(
    File "/usr/local/lib/python3.8/site-packages/setuptools/__init__.py", line 153, in setup
      return distutils.core.setup(**attrs)
    File "/usr/local/lib/python3.8/distutils/core.py", line 148, in setup
      dist.run_commands()
    File "/usr/local/lib/python3.8/distutils/dist.py", line 966, in run_commands
      self.run_command(cmd)
    File "/usr/local/lib/python3.8/distutils/dist.py", line 985, in run_command
      cmd_obj.run()
    File "/usr/local/lib/python3.8/site-packages/wheel/bdist_wheel.py", line 325, in run
      self.run_command("build")
    File "/usr/local/lib/python3.8/distutils/cmd.py", line 313, in run_command
      self.distribution.run_command(command)
    File "/usr/local/lib/python3.8/distutils/dist.py", line 985, in run_command
      cmd_obj.run()
    File "/usr/local/lib/python3.8/distutils/command/build.py", line 135, in run
      self.run_command(cmd_name)
    File "/usr/local/lib/python3.8/distutils/cmd.py", line 313, in run_command
      self.distribution.run_command(command)
    File "/usr/local/lib/python3.8/distutils/dist.py", line 985, in run_command
      cmd_obj.run()
    File "/usr/local/lib/python3.8/site-packages/setuptools/command/build_ext.py", line 79, in run
      _build_ext.run(self)
    File "/usr/local/lib/python3.8/site-packages/Cython/Distutils/old_build_ext.py", line 186, in run
      _build_ext.build_ext.run(self)
    File "/usr/local/lib/python3.8/distutils/command/build_ext.py", line 340, in run
      self.build_extensions()
    File "/usr/local/lib/python3.8/site-packages/torch/utils/cpp_extension.py", line 741, in build_extensions
      build_ext.build_extensions(self)
    File "/usr/local/lib/python3.8/site-packages/Cython/Distutils/old_build_ext.py", line 195, in build_extensions
      _build_ext.build_ext.build_extensions(self)
    File "/usr/local/lib/python3.8/distutils/command/build_ext.py", line 449, in build_extensions
      self._build_extensions_serial()
    File "/usr/local/lib/python3.8/distutils/command/build_ext.py", line 474, in _build_extensions_serial
      self.build_extension(ext)
    File "/usr/local/lib/python3.8/site-packages/setuptools/command/build_ext.py", line 196, in build_extension
      _build_ext.build_extension(self, ext)
    File "/usr/local/lib/python3.8/distutils/command/build_ext.py", line 528, in build_extension
      objects = self.compiler.compile(sources,
    File "/usr/local/lib/python3.8/site-packages/torch/utils/cpp_extension.py", line 562, in unix_wrap_ninja_compile
      _write_ninja_file_and_compile_objects(
    File "/usr/local/lib/python3.8/site-packages/torch/utils/cpp_extension.py", line 1419, in _write_ninja_file_and_compile_objects
      _run_ninja_build(
    File "/usr/local/lib/python3.8/site-packages/torch/utils/cpp_extension.py", line 1756, in _run_ninja_build
      raise RuntimeError(message) from e
  RuntimeError: Error compiling objects for extension
  [end of output]

note: This error originates from a subprocess, and is likely not a problem with pip.
ERROR: Failed building wheel for bmtrain
Running setup.py clean for bmtrain
Failed to build bmtrain
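
The failure above comes down to `c++` 4.8.5 rejecting `-std=c++14`, and the PyTorch warning in the same log asks for a compiler that is ABI-compatible with GCC 5.0 or later. A minimal sketch of a check that could be run before attempting the build follows. The helper names `parse_version`, `compiler_is_new_enough`, and `installed_compiler_version` are hypothetical and not part of BMTrain or PyTorch; the threshold is the one quoted in the warning.

```python
import re
import subprocess

MIN_GCC = (5, 0, 0)  # threshold quoted by the PyTorch warning in the log


def parse_version(text):
    """Extract the first dotted version number as a (major, minor, patch) tuple."""
    match = re.search(r"(\d+)(?:\.(\d+))?(?:\.(\d+))?", text)
    if match is None:
        raise ValueError(f"no version number in {text!r}")
    # Missing components count as 0, so "9" becomes (9, 0, 0).
    return tuple(int(part) if part else 0 for part in match.groups())


def compiler_is_new_enough(version_text, minimum=MIN_GCC):
    """Return True when the reported version meets the minimum."""
    return parse_version(version_text) >= minimum


def installed_compiler_version(compiler="c++"):
    """Ask the compiler for its version; returns None if it cannot be run."""
    try:
        result = subprocess.run(
            [compiler, "-dumpversion"], capture_output=True, text=True, check=True
        )
    except (OSError, subprocess.CalledProcessError):
        return None
    return result.stdout.strip()


print(compiler_is_new_enough("4.8.5"))  # False: the compiler from this log
```

Calling `installed_compiler_version()` on the build machine and passing the result to `compiler_is_new_enough` gives the same answer without waiting for ninja to fail.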
Installing collected packages: bmtrain, blis, absl-py, requests-oauthlib, pathy, markdown, google-auth, confection, thinc, google-auth-oauthlib, tensorboard, spacy
Running setup.py install for bmtrain ... error
error: subprocess-exited-with-error

× Running setup.py install for bmtrain did not run successfully.
│ exit code: 1
╰─> [158 lines of output]
running install
running build
running build_py
creating build
creating build/lib.linux-x86_64-3.8
creating build/lib.linux-x86_64-3.8/bmtrain
copying bmtrain/__init__.py -> build/lib.linux-x86_64-3.8/bmtrain
copying bmtrain/backward.py -> build/lib.linux-x86_64-3.8/bmtrain
copying bmtrain/block_layer.py -> build/lib.linux-x86_64-3.8/bmtrain
copying bmtrain/checkpointing.py -> build/lib.linux-x86_64-3.8/bmtrain
copying bmtrain/debug.py -> build/lib.linux-x86_64-3.8/bmtrain
copying bmtrain/global_var.py -> build/lib.linux-x86_64-3.8/bmtrain
copying bmtrain/init.py -> build/lib.linux-x86_64-3.8/bmtrain
copying bmtrain/layer.py -> build/lib.linux-x86_64-3.8/bmtrain
copying bmtrain/param_init.py -> build/lib.linux-x86_64-3.8/bmtrain
copying bmtrain/parameter.py -> build/lib.linux-x86_64-3.8/bmtrain
copying bmtrain/store.py -> build/lib.linux-x86_64-3.8/bmtrain
copying bmtrain/synchronize.py -> build/lib.linux-x86_64-3.8/bmtrain
copying bmtrain/utils.py -> build/lib.linux-x86_64-3.8/bmtrain
copying bmtrain/wrapper.py -> build/lib.linux-x86_64-3.8/bmtrain
creating build/lib.linux-x86_64-3.8/bmtrain/benchmark
copying bmtrain/benchmark/__init__.py -> build/lib.linux-x86_64-3.8/bmtrain/benchmark
copying bmtrain/benchmark/all_gather.py -> build/lib.linux-x86_64-3.8/bmtrain/benchmark
copying bmtrain/benchmark/reduce_scatter.py -> build/lib.linux-x86_64-3.8/bmtrain/benchmark
copying bmtrain/benchmark/shape.py -> build/lib.linux-x86_64-3.8/bmtrain/benchmark
copying bmtrain/benchmark/utils.py -> build/lib.linux-x86_64-3.8/bmtrain/benchmark
creating build/lib.linux-x86_64-3.8/bmtrain/distributed
copying bmtrain/distributed/__init__.py -> build/lib.linux-x86_64-3.8/bmtrain/distributed
copying bmtrain/distributed/ops.py -> build/lib.linux-x86_64-3.8/bmtrain/distributed
creating build/lib.linux-x86_64-3.8/bmtrain/inspect
copying bmtrain/inspect/__init__.py -> build/lib.linux-x86_64-3.8/bmtrain/inspect
copying bmtrain/inspect/format.py -> build/lib.linux-x86_64-3.8/bmtrain/inspect
copying bmtrain/inspect/model.py -> build/lib.linux-x86_64-3.8/bmtrain/inspect
copying bmtrain/inspect/tensor.py -> build/lib.linux-x86_64-3.8/bmtrain/inspect
creating build/lib.linux-x86_64-3.8/bmtrain/loss
copying bmtrain/loss/__init__.py -> build/lib.linux-x86_64-3.8/bmtrain/loss
copying bmtrain/loss/cross_entropy.py -> build/lib.linux-x86_64-3.8/bmtrain/loss
creating build/lib.linux-x86_64-3.8/bmtrain/lr_scheduler
copying bmtrain/lr_scheduler/__init__.py -> build/lib.linux-x86_64-3.8/bmtrain/lr_scheduler
copying bmtrain/lr_scheduler/cosine.py -> build/lib.linux-x86_64-3.8/bmtrain/lr_scheduler
copying bmtrain/lr_scheduler/exponential.py -> build/lib.linux-x86_64-3.8/bmtrain/lr_scheduler
copying bmtrain/lr_scheduler/linear.py -> build/lib.linux-x86_64-3.8/bmtrain/lr_scheduler
copying bmtrain/lr_scheduler/no_decay.py -> build/lib.linux-x86_64-3.8/bmtrain/lr_scheduler
copying bmtrain/lr_scheduler/noam.py -> build/lib.linux-x86_64-3.8/bmtrain/lr_scheduler
copying bmtrain/lr_scheduler/warmup.py -> build/lib.linux-x86_64-3.8/bmtrain/lr_scheduler
creating build/lib.linux-x86_64-3.8/bmtrain/nccl
copying bmtrain/nccl/__init__.py -> build/lib.linux-x86_64-3.8/bmtrain/nccl
copying bmtrain/nccl/enums.py -> build/lib.linux-x86_64-3.8/bmtrain/nccl
creating build/lib.linux-x86_64-3.8/bmtrain/optim
copying bmtrain/optim/__init__.py -> build/lib.linux-x86_64-3.8/bmtrain/optim
copying bmtrain/optim/adam.py -> build/lib.linux-x86_64-3.8/bmtrain/optim
copying bmtrain/optim/adam_offload.py -> build/lib.linux-x86_64-3.8/bmtrain/optim
copying bmtrain/optim/clip_grad.py -> build/lib.linux-x86_64-3.8/bmtrain/optim
running build_ext
/usr/local/lib/python3.8/site-packages/torch/utils/cpp_extension.py:329: UserWarning:

                                 !! WARNING !!
  
  !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
  Your compiler (g++ 4.8.5) may be ABI-incompatible with PyTorch!
  Please use a compiler that is ABI-compatible with GCC 5.0 and above.
  See https://gcc.gnu.org/onlinedocs/libstdc++/manual/abi.html.
  
  See https://gist.github.com/goldsborough/d466f43e8ffc948ff92de7486c5216d6
  for instructions on how to install GCC 5 or higher.
  !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
  
                                !! WARNING !!
  
    warnings.warn(ABI_INCOMPATIBILITY_WARNING.format(compiler))
  building 'bmtrain.nccl._C' extension
  creating /tmp/pip-install-9z6c_ooh/bmtrain_926bd6a4c18c493a82619c6ec3553d2e/build/temp.linux-x86_64-3.8
  creating /tmp/pip-install-9z6c_ooh/bmtrain_926bd6a4c18c493a82619c6ec3553d2e/build/temp.linux-x86_64-3.8/csrc
  /usr/local/lib/python3.8/site-packages/torch/utils/cpp_extension.py:329: UserWarning:
  
                                 !! WARNING !!
  
  !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
  Your compiler (c++ 4.8.5) may be ABI-incompatible with PyTorch!
  Please use a compiler that is ABI-compatible with GCC 5.0 and above.
  See https://gcc.gnu.org/onlinedocs/libstdc++/manual/abi.html.
  
  See https://gist.github.com/goldsborough/d466f43e8ffc948ff92de7486c5216d6
  for instructions on how to install GCC 5 or higher.
  !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
  
                                !! WARNING !!
  
    warnings.warn(ABI_INCOMPATIBILITY_WARNING.format(compiler))
  Emitting ninja build file /tmp/pip-install-9z6c_ooh/bmtrain_926bd6a4c18c493a82619c6ec3553d2e/build/temp.linux-x86_64-3.8/build.ninja...
  Compiling objects...
  Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
  [1/1] c++ -MMD -MF /tmp/pip-install-9z6c_ooh/bmtrain_926bd6a4c18c493a82619c6ec3553d2e/build/temp.linux-x86_64-3.8/csrc/nccl.o.d -pthread -Wno-unused-result -Wsign-compare -DNDEBUG -g -fwrapv -O3 -Wall -fPIC -Icsrc/nccl/build/include -I/usr/local/lib/python3.8/site-packages/torch/include -I/usr/local/lib/python3.8/site-packages/torch/include/torch/csrc/api/include -I/usr/local/lib/python3.8/site-packages/torch/include/TH -I/usr/local/lib/python3.8/site-packages/torch/include/THC -I/usr/local/cuda/include -I/usr/local/include/python3.8 -c -c /tmp/pip-install-9z6c_ooh/bmtrain_926bd6a4c18c493a82619c6ec3553d2e/csrc/nccl.cpp -o /tmp/pip-install-9z6c_ooh/bmtrain_926bd6a4c18c493a82619c6ec3553d2e/build/temp.linux-x86_64-3.8/csrc/nccl.o -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=_C -D_GLIBCXX_USE_CXX11_ABI=0 -std=c++14
  FAILED: /tmp/pip-install-9z6c_ooh/bmtrain_926bd6a4c18c493a82619c6ec3553d2e/build/temp.linux-x86_64-3.8/csrc/nccl.o
  c++ -MMD -MF /tmp/pip-install-9z6c_ooh/bmtrain_926bd6a4c18c493a82619c6ec3553d2e/build/temp.linux-x86_64-3.8/csrc/nccl.o.d -pthread -Wno-unused-result -Wsign-compare -DNDEBUG -g -fwrapv -O3 -Wall -fPIC -Icsrc/nccl/build/include -I/usr/local/lib/python3.8/site-packages/torch/include -I/usr/local/lib/python3.8/site-packages/torch/include/torch/csrc/api/include -I/usr/local/lib/python3.8/site-packages/torch/include/TH -I/usr/local/lib/python3.8/site-packages/torch/include/THC -I/usr/local/cuda/include -I/usr/local/include/python3.8 -c -c /tmp/pip-install-9z6c_ooh/bmtrain_926bd6a4c18c493a82619c6ec3553d2e/csrc/nccl.cpp -o /tmp/pip-install-9z6c_ooh/bmtrain_926bd6a4c18c493a82619c6ec3553d2e/build/temp.linux-x86_64-3.8/csrc/nccl.o -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=_C -D_GLIBCXX_USE_CXX11_ABI=0 -std=c++14
  c++: error: unrecognized command line option ‘-std=c++14’
  ninja: build stopped: subcommand failed.
  Traceback (most recent call last):
    File "/usr/local/lib/python3.8/site-packages/torch/utils/cpp_extension.py", line 1740, in _run_ninja_build
      subprocess.run(
    File "/usr/local/lib/python3.8/subprocess.py", line 516, in run
      raise CalledProcessError(retcode, process.args,
  subprocess.CalledProcessError: Command '['ninja', '-v']' returned non-zero exit status 1.
  
  The above exception was the direct cause of the following exception:
  
  Traceback (most recent call last):
    File "<string>", line 2, in <module>
    File "<pip-setuptools-caller>", line 34, in <module>
    File "/tmp/pip-install-9z6c_ooh/bmtrain_926bd6a4c18c493a82619c6ec3553d2e/setup.py", line 74, in <module>
      setup(
    File "/usr/local/lib/python3.8/site-packages/setuptools/__init__.py", line 153, in setup
      return distutils.core.setup(**attrs)
    File "/usr/local/lib/python3.8/distutils/core.py", line 148, in setup
      dist.run_commands()
    File "/usr/local/lib/python3.8/distutils/dist.py", line 966, in run_commands
      self.run_command(cmd)
    File "/usr/local/lib/python3.8/distutils/dist.py", line 985, in run_command
      cmd_obj.run()
    File "/usr/local/lib/python3.8/site-packages/setuptools/command/install.py", line 61, in run
      return orig.install.run(self)
    File "/usr/local/lib/python3.8/distutils/command/install.py", line 545, in run
      self.run_command('build')
    File "/usr/local/lib/python3.8/distutils/cmd.py", line 313, in run_command
      self.distribution.run_command(command)
    File "/usr/local/lib/python3.8/distutils/dist.py", line 985, in run_command
      cmd_obj.run()
    File "/usr/local/lib/python3.8/distutils/command/build.py", line 135, in run
      self.run_command(cmd_name)
    File "/usr/local/lib/python3.8/distutils/cmd.py", line 313, in run_command
      self.distribution.run_command(command)
    File "/usr/local/lib/python3.8/distutils/dist.py", line 985, in run_command
      cmd_obj.run()
    File "/usr/local/lib/python3.8/site-packages/setuptools/command/build_ext.py", line 79, in run
      _build_ext.run(self)
    File "/usr/local/lib/python3.8/site-packages/Cython/Distutils/old_build_ext.py", line 186, in run
      _build_ext.build_ext.run(self)
    File "/usr/local/lib/python3.8/distutils/command/build_ext.py", line 340, in run
      self.build_extensions()
    File "/usr/local/lib/python3.8/site-packages/torch/utils/cpp_extension.py", line 741, in build_extensions
      build_ext.build_extensions(self)
    File "/usr/local/lib/python3.8/site-packages/Cython/Distutils/old_build_ext.py", line 195, in build_extensions
      _build_ext.build_ext.build_extensions(self)
    File "/usr/local/lib/python3.8/distutils/command/build_ext.py", line 449, in build_extensions
      self._build_extensions_serial()
    File "/usr/local/lib/python3.8/distutils/command/build_ext.py", line 474, in _build_extensions_serial
      self.build_extension(ext)
    File "/usr/local/lib/python3.8/site-packages/setuptools/command/build_ext.py", line 196, in build_extension
      _build_ext.build_extension(self, ext)
    File "/usr/local/lib/python3.8/distutils/command/build_ext.py", line 528, in build_extension
      objects = self.compiler.compile(sources,
    File "/usr/local/lib/python3.8/site-packages/torch/utils/cpp_extension.py", line 562, in unix_wrap_ninja_compile
      _write_ninja_file_and_compile_objects(
    File "/usr/local/lib/python3.8/site-packages/torch/utils/cpp_extension.py", line 1419, in _write_ninja_file_and_compile_objects
      _run_ninja_build(
    File "/usr/local/lib/python3.8/site-packages/torch/utils/cpp_extension.py", line 1756, in _run_ninja_build
      raise RuntimeError(message) from e
  RuntimeError: Error compiling objects for extension
  [end of output]

note: This error originates from a subprocess, and is likely not a problem with pip.
error: legacy-install-failure

× Encountered error while trying to install package.
╰─> bmtrain

note: This is an issue with the package mentioned above, not pip.
hint: See above for output from the failure.
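
The failure above comes down to the compiler: the log shows c++ 4.8.5 rejecting `-std=c++14`, and the PyTorch warning in the same log asks for GCC 5.0 or newer. A minimal sketch of a pre-flight check follows, assuming the version string has already been obtained (for example from `c++ -dumpversion`). The helper names `parse_version` and `compiler_is_new_enough` are hypothetical and are not part of BMTrain or PyTorch.

```python
def parse_version(text):
    """Turn a dotted version string such as '4.8.5' into a tuple of ints.

    Short strings such as '9' are padded to (9, 0) so comparisons stay sane.
    """
    parts = tuple(int(part) for part in text.strip().split("."))
    return parts + (0,) * max(0, 2 - len(parts))


def compiler_is_new_enough(version_text, minimum=(5, 0)):
    """Return True when the compiler version meets the minimum.

    The default minimum (GCC 5.0) is taken from the PyTorch warning in the log.
    """
    return parse_version(version_text) >= minimum


if __name__ == "__main__":
    for candidate in ("4.8.5", "7.5.0"):
        status = "ok" if compiler_is_new_enough(candidate) else "too old"
        print(candidate, status)
```

Running a check like this before the build gives a clearer message than the ninja failure, but it only covers the version requirement stated in the warning, not every possible ABI mismatch.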

BUG:TypeError: linear(): argument 'input' (position 1) must be Tensor, not NoneType

Several errors occur when I try to reproduce the project quickstart:
https://modelcenter.readthedocs.io/en/latest/notes/quickstart.html

1. TypeError: __init__() got an unexpected keyword argument 'architectures'
It seems that Hugging Face updated the configuration file. The problem went away after I emptied the configuration file.

2. TypeError: linear(): argument 'input' (position 1) must be Tensor, not NoneType
The traceback shows that it originates from:
logits = model(input_ids, attention_mask)
...
TypeError: linear(): argument 'input' (position 1) must be Tensor, not NoneType
return torch._C._nn.linear(input, weight, bias)

[replace usage of tensor.storage()]

Many thanks for your fantastic project. Just a suggestion: bmtrain currently emits a lot of warnings, all of them related to tensor.storage(), like the following:

/opt/conda/envs/compression/lib/python3.10/site-packages/bmtrain/store.py:178: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly. To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage()

It seems to be related to a deprecation in torch. You could replace every tensor.storage() call with tensor.untyped_storage() so that we get clean logs :)
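
One way to make that replacement without breaking older PyTorch releases is a small compatibility helper. This is a minimal sketch, not BMTrain's code: the name `raw_storage` is hypothetical, and it only assumes that newer tensors expose `untyped_storage()` (as the warning says) while older ones expose only `storage()`.

```python
def raw_storage(tensor):
    """Return the tensor's untyped storage when available, else fall back.

    Uses duck typing so the same call site works on old and new PyTorch.
    """
    getter = getattr(tensor, "untyped_storage", None)
    if getter is not None:
        return getter()
    # Older releases only provide the (typed) storage accessor.
    return tensor.storage()
```

Note that an untyped storage is addressed in bytes rather than elements, so any call site that relied on element counts from tensor.storage() would likely need adjusting as well.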

How can I train T5 model with my own dataset

If I want to train a T5 model with BMTrain, how can I do it?

About DistributedDataloader

Hi, is a DistributedDataloader design necessary to get the acceleration from bmtrain, or does bmtrain by itself handle the optimization of both memory and speed?

[Error] in Adam implementation

I noticed an error in how scale is applied, which could lead to NaN problems.

The eps*scale here should be eps*sqrtf(scale):
https://github.com/OpenBMB/BMTrain/blob/9cc975593f628a3fcc8c71328081e238914eca1d/csrc/cuda/adam.cu#LL29C115-L29C115

[FeatureRequest] `bmt.OpTransformerBlockList` does NOT support multiple return values from a transformer block's forward propagation

1. Currently `bmt.OpTransformerBlockList` can only handle the hidden states returned by a transformer block.

The transformer block in the recently released flash_attn returns hidden_states as well as residual in order to fuse Dropout -> Add -> LN. Both values are then passed to the next block as input:

 class Block(nn.Module):
     def forward(self, hidden_states: Tensor, residual: Optional[Tensor] = None,
           mixer_subset=None, mixer_kwargs=None):
         if self.prenorm:
              ...
             return hidden_states, residual
         ...

https://github.com/HazyResearch/flash-attention/blob/v1.0.4/flash_attn/modules/block.py#L172

The case above does not seem to be considered by our bmt.OpTransformerBlockList and is not handled properly.
- https://github.com/OpenBMB/BMTrain/blob/0.2.1/bmtrain/block_layer.py#L672

2. Request: support the case above, where a transformer block returns multiple values.
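
The requested behaviour can be sketched in plain Python. This is a toy illustration, not `bmt.OpTransformerBlockList`: the names `TupleBlockList` and `prenorm_block` are hypothetical, and the sketch leaves out everything BMTrain's block list does around parameter sharding and checkpointing. It only shows the chaining rule: whatever tuple a block returns becomes the positional arguments of the next block.

```python
class TupleBlockList:
    """Run blocks in order, forwarding every returned value to the next block."""

    def __init__(self, blocks):
        self.blocks = list(blocks)

    def __len__(self):
        return len(self.blocks)

    def __call__(self, *state):
        for block in self.blocks:
            out = block(*state)
            # Normalize a single return value to a 1-tuple so both styles chain.
            state = out if isinstance(out, tuple) else (out,)
        return state[0] if len(state) == 1 else state


def prenorm_block(hidden_states, residual=None):
    """Toy stand-in for a block that returns (hidden_states, residual)."""
    residual = hidden_states if residual is None else residual + hidden_states
    return residual * 2, residual


if __name__ == "__main__":
    layers = TupleBlockList([prenorm_block, prenorm_block])
    print(layers(1))
```

Blocks that return a single value still work with the same container, so existing models would not need to change under this scheme.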

TypeError: object of type 'TransformerBlockList' has no len()

I'm trying to run EleutherAI/gpt-j-6B on a Titan V (12 GB), which can't fit the weights, so after loading the PyTorch model I'm doing:

with torch.cuda.device(0):  model = bminf.wrapper(model)

but, now I get the error (full error log at the end):

TypeError: object of type 'TransformerBlockList' has no len()

So I tried patching class TransformerBlockList(torch.nn.Module) by adding a __len__ method:

    def __len__(self):
        return len(self.layers)

and the model finally runs, but I'm getting nonsense (random words) as output.

The code is roughly:

import os
import shutil

import bminf
import torch
import transformers

model = transformers.GPTJForCausalLM.from_pretrained(f'EleutherAI/{MODELN}', revision='float16', torch_dtype=torch.float16, low_cpu_mem_usage=True)
torch.save(model, MODELN)
print(model)

if not os.path.exists(f'{DATAD}/{MODELN}.pt'):  shutil.copy(f'{MODELN}.pt', DATAD)
model     = torch.load(f'{DATAD}/{MODELN}.pt')                                  # 11 s @ /dev/shm
tokenizer = transformers.AutoTokenizer.from_pretrained(f'EleutherAI/{MODELN}')  #  8 s
with torch.cuda.device(0):  model = bminf.wrapper(model)                        # 14 s

pl    = transformers.pipeline('text-generation', model=model, tokenizer=tokenizer, max_length=100, device=0)  # create pipeline
otxts = pl(prompt)
for txt in otxts:
	print(f"\x1b[32m{txt['generated_text']}\x1b[0m")

Full error log:

/home/da/py38/lib/python3.8/site-packages/torch/nn/modules/module.py:673: UserWarning: The .grad attribute of a Tensor that is not a leaf Tensor is being accessed. Its .grad attribute won't be populated during autograd.backward(). If you indeed want the .grad field to be populated for a non-leaf Tensor, use .retain_grad() on the non-leaf Tensor. If you access the non-leaf Tensor by mistake, make sure you access the leaf Tensor instead. See github.com/pytorch/pytorch/pull/30531 for more informations. (Triggered internally at aten/src/ATen/core/TensorBody.h:480.)
  if param.grad is not None:
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
/home/da/.local/lib/python3.8/site-packages/transformers/generation/utils.py:1387: UserWarning: Neither `max_length` nor `max_new_tokens` has been set, `max_length` will default to 50 (`self.config.max_length`). Controlling `max_length` via the config is deprecated and `max_length` will be removed from the config in v5 of Transformers -- we recommend using `max_new_tokens` to control the maximum length of the generation.
  warnings.warn(
Traceback (most recent call last):
  File "./app.py", line 84, in <module>
    txt = pl(prompt)  # inf  # [{'generated_text': 'My Name is philipp k. and I live just outside of Detroit....
  File "/home/da/.local/lib/python3.8/site-packages/transformers/pipelines/text_generation.py", line 202, in __call__
    return super().__call__(text_inputs, **kwargs)
  File "/home/da/.local/lib/python3.8/site-packages/transformers/pipelines/base.py", line 1074, in __call__
    return self.run_single(inputs, preprocess_params, forward_params, postprocess_params)
  File "/home/da/.local/lib/python3.8/site-packages/transformers/pipelines/base.py", line 1081, in run_single
    model_outputs = self.forward(model_inputs, **forward_params)
  File "/home/da/.local/lib/python3.8/site-packages/transformers/pipelines/base.py", line 990, in forward
    model_outputs = self._forward(model_inputs, **forward_params)
  File "/home/da/.local/lib/python3.8/site-packages/transformers/pipelines/text_generation.py", line 244, in _forward
    generated_sequence = self.model.generate(input_ids=input_ids, attention_mask=attention_mask, **generate_kwargs)
  File "/home/da/py38/lib/python3.8/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
    return func(*args, **kwargs)
  File "/home/da/.local/lib/python3.8/site-packages/transformers/generation/utils.py", line 1571, in generate
    return self.sample(
  File "/home/da/.local/lib/python3.8/site-packages/transformers/generation/utils.py", line 2534, in sample
    outputs = self(
  File "/home/da/py38/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1190, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/da/.local/lib/python3.8/site-packages/transformers/models/gptj/modeling_gptj.py", line 821, in forward
    transformer_outputs = self.transformer(
  File "/home/da/py38/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1190, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/da/.local/lib/python3.8/site-packages/transformers/models/gptj/modeling_gptj.py", line 587, in forward
    past_key_values = tuple([None] * len(self.h))
TypeError: object of type 'TransformerBlockList' has no len()

RuntimeError: NCCL Error: unhandled cuda error

When I call bmt.init_distributed(seed=0), I get the following error.

Traceback (most recent call last):
  File "train_inner.py", line 19, in main
    bmt.init_distributed(seed=0)
  File "/python3.9/site-packages/bmtrain/init.py", line 88, in init_distributed
    config['comm'] = nccl.commInitRank(unique_id, world_size, rank)
  File "//python3.9/site-packages/bmtrain/nccl/__init__.py", line 77, in commInitRank
    return NCCLCommunicator(C.ncclCommInitRank(unique_id, world_size, rank))

Any idea why this would occur, or how I can solve it? Thanks.

torchrun --nnodes=1 --nproc_per_node=8 --rdzv_id=1 --rdzv_backend=c10d --rdzv_endpoint=localhost train.py
This is how I run the program.

ERROR: Command errored out with exit status 1:

ERROR: Command errored out with exit status 1: 'C:\Users\46213\anaconda3\python.exe' -u -c 'import io, os, sys, setuptools, tokenize; sys.argv[0] = '"'"'C:\Users\46213\AppData\Local\Temp\pip-install-0s5mkooc\bmtrain_23dc03e3d7e841b88ef095cfb3ede34b\setup.py'"'"'; file='"'"'C:\Users\46213\AppData\Local\Temp\pip-install-0s5mkooc\bmtrain_23dc03e3d7e841b88ef095cfb3ede34b\setup.py'"'"';f = getattr(tokenize, '"'"'open'"'"', open)(file) if os.path.exists(file) else io.StringIO('"'"'from setuptools import setup; setup()'"'"');code = f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, file, '"'"'exec'"'"'))' install --record 'C:\Users\46213\AppData\Local\Temp\pip-record-z5irnjba\install-record.txt' --single-version-externally-managed --compile --install-headers 'C:\Users\46213\anaconda3\Include\bmtrain' Check the logs for full command output.

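[Question] Loading optimizer state

When saving a checkpoint during multi-node parallel training (ZeRO-3), the optim state corresponding to each CPU is saved to file.${bmt.rank()}.
When loading the optimizer to resume training, I run into a shape mismatch. What could be the cause?
On restart, the master node is unchanged, and the node rank of each node has not changed either.

What needs to be changed to run on a single machine with a single GPU?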

[Install Error] CUDA 12.1 mismatch Pytorch

Command run:
python setup.py install

Error message:
The detected CUDA version (12.1) mismatches the version that was used to compile
PyTorch (11.8). Please make sure to use the same CUDA versions.

Question:
PyTorch does not yet have a build for CUDA 12.1. Is there any way to work around this error?

Output hidden states and attention scores for each transformer layer

The existing TransformerBlockList cannot output the hidden states and attention scores of each transformer layer. Sometimes we want these hidden states and attention scores for analysis, or to feed them into subsequent modules.

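How can I load already-trained delta-tuning weights when training with bmtrain?

I tried several approaches:
1 model.load_state_dict(torch.load(args.model_path), strict=False)
2 bmt.load(model, args.LoRA_path,strict=False)
However, after printing the model parameters I found that the weights had not been loaded. Why does this happen?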

load model

bmt.init_distributed(seed=0)
config = CPMAntConfig.from_json_file(args.config_path)
model = CPMAntPlus(config=config)
bmt.load(model, args.model_path,strict=True)

insert LoRA

#delta_model = AutoDeltaModel.from_finetuned(args.LoRA_path, backbone_model=model)
delta_model = LoraModel(backbone_model=model, modified_modules=["project_q", "project_v"], backend="bmt")
delta_model.freeze_module(exclude=["deltas"], set_state_dict=True)

training script is stuck in bmt.init_distributed(seed=0)

GPU: V100
torch version: 1.10.0+cu113
Python: 3.7.13
bmtrain: 0.2.1

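Can't operators added via a CUDA extension be used with bmtrain?

I added an op as a CUDA extension. Running it under the bmtrain framework reports OOM, presumably because ZeRO is not taking effect. How can this be solved?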

Segmentation fault when the code ends

When using bmtrain to train my model, the code runs successfully to the last line but then ends with a segmentation fault.

[BUG] KeyError: 'LOCAL_RANK'

  File "train.py", line 15, in main
    bmt.init_distributed(
  File "lib/python3.9/site-packages/bmtrain/init.py", line 40, in init_distributed
    local_rank = int(os.environ["LOCAL_RANK"])
  File "lib/python3.9/os.py", line 679, in __getitem__
    raise KeyError(key) from None
KeyError: 'LOCAL_RANK'
An error occurred when calling the bmt.init_distributed function in train.py.
After checking 'os.environ.keys()', I couldn't find 'LOCAL_RANK'.
It seems that 'bmtrain' wasn't installed successfully.
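
The traceback shows init_distributed reading os.environ["LOCAL_RANK"], a variable that launchers such as torchrun export for each worker process; starting the script with plain `python train.py` would leave it unset. A minimal sketch of a check that could run before bmt.init_distributed follows. The helper name `missing_launcher_vars` is hypothetical, and whether BMTrain reads every variable in the list is an assumption; only LOCAL_RANK is confirmed by the traceback.

```python
import os

# Variables that torchrun sets for every worker it spawns.
LAUNCHER_VARS = ("RANK", "LOCAL_RANK", "WORLD_SIZE", "MASTER_ADDR", "MASTER_PORT")


def missing_launcher_vars(env=None):
    """Return the launcher variables that are absent from the environment."""
    env = os.environ if env is None else env
    return [name for name in LAUNCHER_VARS if name not in env]


if __name__ == "__main__":
    missing = missing_launcher_vars()
    if missing:
        print("Not started by a distributed launcher; missing:", ", ".join(missing))
    else:
        print("Launcher environment looks complete.")
```

If the list comes back non-empty, the likelier explanation is how the script was launched rather than a broken installation.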

Does BMTrain 0.2.0 support cuda 11.1?

In an environment with pytorch 1.12.1, cuda 11.1, and python 3.8.0, I failed to install BMTrain 0.2.0 using "pip setup.py install". The installer reports that cuda needs to be version 11.3. Does BMTrain 0.2.0 support cuda 11.1?

[Bug] Tensor with a size of 1 being split incorrectly when broadcasting

Say, I have a tensor z with a size of [1], and a tensor x with a size of [batch_size, intermediate_dim, model_dim].

When calculating z*x, z should be broadcasted to the same size as x.

However, in the current implementation of BMTrain, if we have 4 GPUs, as far as I can see, z would first be split into tensors z0 to z3 with sizes of [1], [0], [0], [0], and x would also be split into 4 tensors x0 to x3. Then products such as zi*xi would be calculated. However, z1*x1 fails because the tensor z1 with a size of [0] does not match the tensor x1 in size.

Code that reproduces the problem will be attached later.
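
The shard sizes described above can be reproduced with a few lines of arithmetic. This is a sketch under an assumption: it uses a ceil-division partition, which yields exactly the sizes in the report, but whether BMTrain partitions parameters this way is not confirmed here. The function name `shard_sizes` is hypothetical.

```python
def shard_sizes(numel, world_size):
    """Element count held by each rank under a ceil-division partition."""
    per_rank = -(-numel // world_size)  # ceil(numel / world_size)
    sizes = []
    for rank in range(world_size):
        # Clamp both ends so trailing ranks get empty shards instead of errors.
        start = min(rank * per_rank, numel)
        end = min(start + per_rank, numel)
        sizes.append(end - start)
    return sizes


if __name__ == "__main__":
    print(shard_sizes(1, 4))   # the one-element tensor z from the report
    print(shard_sizes(10, 4))
```

With one element and four ranks, only rank 0 holds data, so any per-shard multiplication on the other ranks sees an empty operand where a broadcastable scalar was expected.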

Is there a Docker image with which I can use bmtrain directly?

The installation fails with "RuntimeError: Error compiling objects for extension".

Thank you!

Install error

ValueError: Unknown CUDA arch (9.0+PTX) or GPU not supported

nvcc --version nvcc: NVIDIA (R) Cuda compiler driver Copyright (c) 2005-2022 NVIDIA Corporation Built on Tue_May__3_18:49:52_PDT_2022 Cuda compilation tools, release 11.7, V11.7.64 Build cuda_11.7.r11.7/compiler.31294372_0

[Bug]TypeError: gather() received an invalid combination of arguments

Hi developer, when I tried to use the 'gather()' method of 'DistributedParameter', I received the following error:

, line 43, in __init__
    self.rel_embed.weight /= torch.norm(self.rel_embed.weight.gather().detach(), p=self.p_norm, dim=-1)[:, None]
TypeError: gather() received an invalid combination of arguments - got (), but expected one of:
 * (int dim, Tensor index, *, bool sparse_grad)
 * (name dim, Tensor index, *, bool sparse_grad)

I couldn't find any information about this required set of arguments. Any idea why this occurs, or how I can solve it?
Thanks a lot.

can't run the example of BMTrain's implementation of GPT-2.

As mentioned, I tried to run the example provided in the example folder using the run.sh script, but it throws this:

WARNING:torch.distributed.run:

Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.

ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -7) local_rank: 0 (pid: 37831) of binary: /usr/local/bin/python
Traceback (most recent call last):
  File "/usr/local/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.8/dist-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
    return f(*args, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/torch/distributed/run.py", line 762, in main
    run(args)
  File "/usr/local/lib/python3.8/dist-packages/torch/distributed/run.py", line 753, in run
    elastic_launch(
  File "/usr/local/lib/python3.8/dist-packages/torch/distributed/launcher/api.py", line 132, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/usr/local/lib/python3.8/dist-packages/torch/distributed/launcher/api.py", line 246, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

train.py FAILED

Failures:
[1]:
time : 2023-03-21_01:00:08
host : 2b0ea7cdb636
rank : 1 (local_rank: 1)
exitcode : -7 (pid: 37832)
error_file: <N/A>
traceback : Signal 7 (SIGBUS) received by PID 37832
[2]:
time : 2023-03-21_01:00:08
host : 2b0ea7cdb636
rank : 2 (local_rank: 2)
exitcode : -7 (pid: 37833)
error_file: <N/A>
traceback : Signal 7 (SIGBUS) received by PID 37833
[3]:
time : 2023-03-21_01:00:08
host : 2b0ea7cdb636
rank : 3 (local_rank: 3)
exitcode : -7 (pid: 37834)
error_file: <N/A>
traceback : Signal 7 (SIGBUS) received by PID 37834

Root Cause (first observed failure):
[0]:
time : 2023-03-21_01:00:08
host : 2b0ea7cdb636
rank : 0 (local_rank: 0)
exitcode : -7 (pid: 37831)
error_file: <N/A>
traceback : Signal 7 (SIGBUS) received by PID 37831

I am using a server with
8*3090
630GB cpu RAM
python 3.8
cuda 11.8

Failed to install BMTrain: ~/has_inf_nan.cu(11): error: identifier "__heq" is undefined

When I run the following command to install BMTrain
python setup.py install
I get the following error:


running install
/home/yhlin/torch_env/lib/python3.8/site-packages/setuptools/command/install.py:34: SetuptoolsDeprecationWarning: setup.py install is deprecated. Use build and pip and other standards-based tools.
 warnings.warn(
/home/yhlin/torch_env/lib/python3.8/site-packages/setuptools/command/easy_install.py:144: EasyInstallDeprecationWarning: easy_install command is deprecated. Use build and pip and other standards-based tools.
 warnings.warn(
running bdist_egg
running egg_info
writing bmtrain.egg-info/PKG-INFO
writing dependency_links to bmtrain.egg-info/dependency_links.txt
writing requirements to bmtrain.egg-info/requires.txt
writing top-level names to bmtrain.egg-info/top_level.txt
reading manifest file 'bmtrain.egg-info/SOURCES.txt'
reading manifest template 'MANIFEST.in'
adding license file 'LICENSE'
writing manifest file 'bmtrain.egg-info/SOURCES.txt'
installing library code to build/bdist.linux-x86_64/egg
running install_lib
running build_py
creating build/lib.linux-x86_64-cpython-38
creating build/lib.linux-x86_64-cpython-38/bmtrain
copying bmtrain/layer.py -> build/lib.linux-x86_64-cpython-38/bmtrain
copying bmtrain/block_layer.py -> build/lib.linux-x86_64-cpython-38/bmtrain
copying bmtrain/synchronize.py -> build/lib.linux-x86_64-cpython-38/bmtrain
copying bmtrain/parameter.py -> build/lib.linux-x86_64-cpython-38/bmtrain
copying bmtrain/param_init.py -> build/lib.linux-x86_64-cpython-38/bmtrain
copying bmtrain/global_var.py -> build/lib.linux-x86_64-cpython-38/bmtrain
copying bmtrain/wrapper.py -> build/lib.linux-x86_64-cpython-38/bmtrain
copying bmtrain/utils.py -> build/lib.linux-x86_64-cpython-38/bmtrain
copying bmtrain/store.py -> build/lib.linux-x86_64-cpython-38/bmtrain
copying bmtrain/pipe_layer.py -> build/lib.linux-x86_64-cpython-38/bmtrain
copying bmtrain/__init__.py -> build/lib.linux-x86_64-cpython-38/bmtrain
copying bmtrain/init.py -> build/lib.linux-x86_64-cpython-38/bmtrain
copying bmtrain/debug.py -> build/lib.linux-x86_64-cpython-38/bmtrain
copying bmtrain/checkpointing.py -> build/lib.linux-x86_64-cpython-38/bmtrain
creating build/lib.linux-x86_64-cpython-38/bmtrain/loss
copying bmtrain/loss/__init__.py -> build/lib.linux-x86_64-cpython-38/bmtrain/loss
copying bmtrain/loss/cross_entropy.py -> build/lib.linux-x86_64-cpython-38/bmtrain/loss
creating build/lib.linux-x86_64-cpython-38/bmtrain/benchmark
copying bmtrain/benchmark/all_gather.py -> build/lib.linux-x86_64-cpython-38/bmtrain/benchmark
copying bmtrain/benchmark/reduce_scatter.py -> build/lib.linux-x86_64-cpython-38/bmtrain/benchmark
copying bmtrain/benchmark/send_recv.py -> build/lib.linux-x86_64-cpython-38/bmtrain/benchmark
copying bmtrain/benchmark/shape.py -> build/lib.linux-x86_64-cpython-38/bmtrain/benchmark
copying bmtrain/benchmark/utils.py -> build/lib.linux-x86_64-cpython-38/bmtrain/benchmark
copying bmtrain/benchmark/__init__.py -> build/lib.linux-x86_64-cpython-38/bmtrain/benchmark
creating build/lib.linux-x86_64-cpython-38/bmtrain/distributed
copying bmtrain/distributed/ops.py -> build/lib.linux-x86_64-cpython-38/bmtrain/distributed
copying bmtrain/distributed/__init__.py -> build/lib.linux-x86_64-cpython-38/bmtrain/distributed
creating build/lib.linux-x86_64-cpython-38/bmtrain/optim
copying bmtrain/optim/adam.py -> build/lib.linux-x86_64-cpython-38/bmtrain/optim
copying bmtrain/optim/adam_offload.py -> build/lib.linux-x86_64-cpython-38/bmtrain/optim
copying bmtrain/optim/optim_manager.py -> build/lib.linux-x86_64-cpython-38/bmtrain/optim
copying bmtrain/optim/__init__.py -> build/lib.linux-x86_64-cpython-38/bmtrain/optim
creating build/lib.linux-x86_64-cpython-38/bmtrain/nccl
copying bmtrain/nccl/enums.py -> build/lib.linux-x86_64-cpython-38/bmtrain/nccl
copying bmtrain/nccl/__init__.py -> build/lib.linux-x86_64-cpython-38/bmtrain/nccl
creating build/lib.linux-x86_64-cpython-38/bmtrain/lr_scheduler
copying bmtrain/lr_scheduler/no_decay.py -> build/lib.linux-x86_64-cpython-38/bmtrain/lr_scheduler
copying bmtrain/lr_scheduler/linear.py -> build/lib.linux-x86_64-cpython-38/bmtrain/lr_scheduler
copying bmtrain/lr_scheduler/noam.py -> build/lib.linux-x86_64-cpython-38/bmtrain/lr_scheduler
copying bmtrain/lr_scheduler/exponential.py -> build/lib.linux-x86_64-cpython-38/bmtrain/lr_scheduler
copying bmtrain/lr_scheduler/warmup.py -> build/lib.linux-x86_64-cpython-38/bmtrain/lr_scheduler
copying bmtrain/lr_scheduler/__init__.py -> build/lib.linux-x86_64-cpython-38/bmtrain/lr_scheduler
copying bmtrain/lr_scheduler/cosine.py -> build/lib.linux-x86_64-cpython-38/bmtrain/lr_scheduler
creating build/lib.linux-x86_64-cpython-38/bmtrain/inspect
copying bmtrain/inspect/model.py -> build/lib.linux-x86_64-cpython-38/bmtrain/inspect
copying bmtrain/inspect/format.py -> build/lib.linux-x86_64-cpython-38/bmtrain/inspect
copying bmtrain/inspect/__init__.py -> build/lib.linux-x86_64-cpython-38/bmtrain/inspect
copying bmtrain/inspect/tensor.py -> build/lib.linux-x86_64-cpython-38/bmtrain/inspect
running build_ext
building 'bmtrain.nccl._C' extension
creating /home/yhlin/BMTrain/build/temp.linux-x86_64-cpython-38
creating /home/yhlin/BMTrain/build/temp.linux-x86_64-cpython-38/csrc
Emitting ninja build file /home/yhlin/BMTrain/build/temp.linux-x86_64-cpython-38/build.ninja...
Compiling objects...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
[1/1] c++ -MMD -MF /home/yhlin/BMTrain/build/temp.linux-x86_64-cpython-38/csrc/nccl.o.d -pthread -B /opt/conda/compiler_compat -Wl,--sysroot=/ -Wsign-compare -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes -fPIC -Icsrc/nccl/build/include -I/home/yhlin/torch_env/lib/python3.8/site-packages/torch/include -I/home/yhlin/torch_env/lib/python3.8/site-packages/torch/include/torch/csrc/api/include -I/home/yhlin/torch_env/lib/python3.8/site-packages/torch/include/TH -I/home/yhlin/torch_env/lib/python3.8/site-packages/torch/include/THC -I/usr/local/cuda/include -I/home/yhlin/torch_env/include -I/opt/conda/include/python3.8 -c -c /home/yhlin/BMTrain/csrc/nccl.cpp -o /home/yhlin/BMTrain/build/temp.linux-x86_64-cpython-38/csrc/nccl.o -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=_C -D_GLIBCXX_USE_CXX11_ABI=0 -std=c++14
cc1plus: warning: command line option ‘-Wstrict-prototypes’ is valid for C/ObjC but not for C++
g++ -pthread -B /opt/conda/compiler_compat -Wl,--sysroot=/ -pthread -shared -B /opt/conda/compiler_compat -L/opt/conda/lib -Wl,-rpath=/opt/conda/lib -Wl,--no-as-needed -Wl,--sysroot=/ /home/yhlin/BMTrain/build/temp.linux-x86_64-cpython-38/csrc/nccl.o -L/home/yhlin/torch_env/lib/python3.8/site-packages/torch/lib -L/usr/local/cuda/lib64 -lc10 -ltorch -ltorch_cpu -ltorch_python -lcudart -lc10_cuda -ltorch_cuda_cu -ltorch_cuda_cpp -o build/lib.linux-x86_64-cpython-38/bmtrain/nccl/_C.cpython-38-x86_64-linux-gnu.so
building 'bmtrain.optim._cuda' extension
creating /home/yhlin/BMTrain/build/temp.linux-x86_64-cpython-38/csrc/cuda
Emitting ninja build file /home/yhlin/BMTrain/build/temp.linux-x86_64-cpython-38/build.ninja...
Compiling objects...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
[1/3] c++ -MMD -MF /home/yhlin/BMTrain/build/temp.linux-x86_64-cpython-38/csrc/adam_cuda.o.d -pthread -B /opt/conda/compiler_compat -Wl,--sysroot=/ -Wsign-compare -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes -fPIC -I/home/yhlin/torch_env/lib/python3.8/site-packages/torch/include -I/home/yhlin/torch_env/lib/python3.8/site-packages/torch/include/torch/csrc/api/include -I/home/yhlin/torch_env/lib/python3.8/site-packages/torch/include/TH -I/home/yhlin/torch_env/lib/python3.8/site-packages/torch/include/THC -I/usr/local/cuda/include -I/home/yhlin/torch_env/include -I/opt/conda/include/python3.8 -c -c /home/yhlin/BMTrain/csrc/adam_cuda.cpp -o /home/yhlin/BMTrain/build/temp.linux-x86_64-cpython-38/csrc/adam_cuda.o -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=_cuda -D_GLIBCXX_USE_CXX11_ABI=0 -std=c++14
cc1plus: warning: command line option ‘-Wstrict-prototypes’ is valid for C/ObjC but not for C++
[2/3] /usr/local/cuda/bin/nvcc  -I/home/yhlin/torch_env/lib/python3.8/site-packages/torch/include -I/home/yhlin/torch_env/lib/python3.8/site-packages/torch/include/torch/csrc/api/include -I/home/yhlin/torch_env/lib/python3.8/site-packages/torch/include/TH -I/home/yhlin/torch_env/lib/python3.8/site-packages/torch/include/THC -I/usr/local/cuda/include -I/home/yhlin/torch_env/include -I/opt/conda/include/python3.8 -c -c /home/yhlin/BMTrain/csrc/cuda/has_inf_nan.cu -o /home/yhlin/BMTrain/build/temp.linux-x86_64-cpython-38/csrc/cuda/has_inf_nan.o -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=_cuda -D_GLIBCXX_USE_CXX11_ABI=0 -gencode=arch=compute_52,code=sm_52 -gencode=arch=compute_60,code=sm_60 -gencode=arch=compute_61,code=sm_61 -gencode=arch=compute_70,code=sm_70 -gencode=arch=compute_75,code=sm_75 -gencode=arch=compute_80,code=sm_80 -gencode=arch=compute_86,code=compute_86 -gencode=arch=compute_86,code=sm_86 -std=c++14
FAILED: /home/yhlin/BMTrain/build/temp.linux-x86_64-cpython-38/csrc/cuda/has_inf_nan.o
/usr/local/cuda/bin/nvcc  -I/home/yhlin/torch_env/lib/python3.8/site-packages/torch/include -I/home/yhlin/torch_env/lib/python3.8/site-packages/torch/include/torch/csrc/api/include -I/home/yhlin/torch_env/lib/python3.8/site-packages/torch/include/TH -I/home/yhlin/torch_env/lib/python3.8/site-packages/torch/include/THC -I/usr/local/cuda/include -I/home/yhlin/torch_env/include -I/opt/conda/include/python3.8 -c -c /home/yhlin/BMTrain/csrc/cuda/has_inf_nan.cu -o /home/yhlin/BMTrain/build/temp.linux-x86_64-cpython-38/csrc/cuda/has_inf_nan.o -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=_cuda -D_GLIBCXX_USE_CXX11_ABI=0 -gencode=arch=compute_52,code=sm_52 -gencode=arch=compute_60,code=sm_60 -gencode=arch=compute_61,code=sm_61 -gencode=arch=compute_70,code=sm_70 -gencode=arch=compute_75,code=sm_75 -gencode=arch=compute_80,code=sm_80 -gencode=arch=compute_86,code=compute_86 -gencode=arch=compute_86,code=sm_86 -std=c++14
/home/yhlin/BMTrain/csrc/cuda/has_inf_nan.cu(11): error: identifier "__heq" is undefined

1 error detected in the compilation of "/home/yhlin/BMTrain/csrc/cuda/has_inf_nan.cu".

I looked into has_inf_nan.cu and could not find a definition of "__heq" anywhere in the code. Can you help solve this? Thanks!

Comparison with Google Jax & Nvidia Apex

Can BMTrain be used together with tools like Jax or Apex? Are there any comparisons with these tools, or plans for such experiments? Thanks

Can BMTrain be directly used for huggingface transformer models?

Cannot compile and install on Windows

pip install bmtrain
Collecting bmtrain
Using cached bmtrain-0.1.8.post1.tar.gz (48 kB)
Preparing metadata (setup.py) ... done
Requirement already satisfied: numpy in c:\users\chenliyu\anaconda3\envs\cpm\lib\site-packages (from bmtrain) (1.21.6)
Building wheels for collected packages: bmtrain
Building wheel for bmtrain (setup.py) ... error
error: subprocess-exited-with-error

× python setup.py bdist_wheel did not run successfully.
│ exit code: 1
╰─> [55 lines of output]
running bdist_wheel
running build
running build_py
creating build
creating build\lib.win-amd64-cpython-37
creating build\lib.win-amd64-cpython-37\bmtrain
copying bmtrain\backward.py -> build\lib.win-amd64-cpython-37\bmtrain
copying bmtrain\block_layer.py -> build\lib.win-amd64-cpython-37\bmtrain
copying bmtrain\checkpointing.py -> build\lib.win-amd64-cpython-37\bmtrain
copying bmtrain\debug.py -> build\lib.win-amd64-cpython-37\bmtrain
copying bmtrain\global_var.py -> build\lib.win-amd64-cpython-37\bmtrain
copying bmtrain\init.py -> build\lib.win-amd64-cpython-37\bmtrain
copying bmtrain\layer.py -> build\lib.win-amd64-cpython-37\bmtrain
copying bmtrain\parameter.py -> build\lib.win-amd64-cpython-37\bmtrain
copying bmtrain\param_init.py -> build\lib.win-amd64-cpython-37\bmtrain
copying bmtrain\store.py -> build\lib.win-amd64-cpython-37\bmtrain
copying bmtrain\synchronize.py -> build\lib.win-amd64-cpython-37\bmtrain
copying bmtrain\utils.py -> build\lib.win-amd64-cpython-37\bmtrain
copying bmtrain\wrapper.py -> build\lib.win-amd64-cpython-37\bmtrain
copying bmtrain\__init__.py -> build\lib.win-amd64-cpython-37\bmtrain
creating build\lib.win-amd64-cpython-37\bmtrain\benchmark
copying bmtrain\benchmark\all_gather.py -> build\lib.win-amd64-cpython-37\bmtrain\benchmark
copying bmtrain\benchmark\reduce_scatter.py -> build\lib.win-amd64-cpython-37\bmtrain\benchmark
copying bmtrain\benchmark\shape.py -> build\lib.win-amd64-cpython-37\bmtrain\benchmark
copying bmtrain\benchmark\utils.py -> build\lib.win-amd64-cpython-37\bmtrain\benchmark
copying bmtrain\benchmark\__init__.py -> build\lib.win-amd64-cpython-37\bmtrain\benchmark
creating build\lib.win-amd64-cpython-37\bmtrain\distributed
copying bmtrain\distributed\ops.py -> build\lib.win-amd64-cpython-37\bmtrain\distributed
copying bmtrain\distributed\__init__.py -> build\lib.win-amd64-cpython-37\bmtrain\distributed
creating build\lib.win-amd64-cpython-37\bmtrain\inspect
copying bmtrain\inspect\format.py -> build\lib.win-amd64-cpython-37\bmtrain\inspect
copying bmtrain\inspect\model.py -> build\lib.win-amd64-cpython-37\bmtrain\inspect
copying bmtrain\inspect\tensor.py -> build\lib.win-amd64-cpython-37\bmtrain\inspect
copying bmtrain\inspect\__init__.py -> build\lib.win-amd64-cpython-37\bmtrain\inspect
creating build\lib.win-amd64-cpython-37\bmtrain\loss
copying bmtrain\loss\cross_entropy.py -> build\lib.win-amd64-cpython-37\bmtrain\loss
copying bmtrain\loss\__init__.py -> build\lib.win-amd64-cpython-37\bmtrain\loss
creating build\lib.win-amd64-cpython-37\bmtrain\lr_scheduler
copying bmtrain\lr_scheduler\cosine.py -> build\lib.win-amd64-cpython-37\bmtrain\lr_scheduler
copying bmtrain\lr_scheduler\exponential.py -> build\lib.win-amd64-cpython-37\bmtrain\lr_scheduler
copying bmtrain\lr_scheduler\linear.py -> build\lib.win-amd64-cpython-37\bmtrain\lr_scheduler
copying bmtrain\lr_scheduler\noam.py -> build\lib.win-amd64-cpython-37\bmtrain\lr_scheduler
copying bmtrain\lr_scheduler\no_decay.py -> build\lib.win-amd64-cpython-37\bmtrain\lr_scheduler
copying bmtrain\lr_scheduler\warmup.py -> build\lib.win-amd64-cpython-37\bmtrain\lr_scheduler
copying bmtrain\lr_scheduler\__init__.py -> build\lib.win-amd64-cpython-37\bmtrain\lr_scheduler
creating build\lib.win-amd64-cpython-37\bmtrain\nccl
copying bmtrain\nccl\enums.py -> build\lib.win-amd64-cpython-37\bmtrain\nccl
copying bmtrain\nccl\__init__.py -> build\lib.win-amd64-cpython-37\bmtrain\nccl
creating build\lib.win-amd64-cpython-37\bmtrain\optim
copying bmtrain\optim\adam.py -> build\lib.win-amd64-cpython-37\bmtrain\optim
copying bmtrain\optim\adam_offload.py -> build\lib.win-amd64-cpython-37\bmtrain\optim
copying bmtrain\optim\clip_grad.py -> build\lib.win-amd64-cpython-37\bmtrain\optim
copying bmtrain\optim\__init__.py -> build\lib.win-amd64-cpython-37\bmtrain\optim
running build_ext
error: [WinError 2] The system cannot find the file specified.
[end of output]

note: This error originates from a subprocess, and is likely not a problem with pip.
ERROR: Failed building wheel for bmtrain
Running setup.py clean for bmtrain
Failed to build bmtrain
Installing collected packages: bmtrain
Running setup.py install for bmtrain ... error
error: subprocess-exited-with-error

× Running setup.py install for bmtrain did not run successfully.
│ exit code: 1
╰─> [57 lines of output]
running install
C:\Users\chenliyu\anaconda3\envs\cpm\lib\site-packages\setuptools\command\install.py:37: SetuptoolsDeprecationWarning: setup.py install is deprecated. Use build and pip and other standards-based tools.
setuptools.SetuptoolsDeprecationWarning,
running build
running build_py
creating build
creating build\lib.win-amd64-cpython-37
creating build\lib.win-amd64-cpython-37\bmtrain
copying bmtrain\backward.py -> build\lib.win-amd64-cpython-37\bmtrain
copying bmtrain\block_layer.py -> build\lib.win-amd64-cpython-37\bmtrain
copying bmtrain\checkpointing.py -> build\lib.win-amd64-cpython-37\bmtrain
copying bmtrain\debug.py -> build\lib.win-amd64-cpython-37\bmtrain
copying bmtrain\global_var.py -> build\lib.win-amd64-cpython-37\bmtrain
copying bmtrain\init.py -> build\lib.win-amd64-cpython-37\bmtrain
copying bmtrain\layer.py -> build\lib.win-amd64-cpython-37\bmtrain
copying bmtrain\parameter.py -> build\lib.win-amd64-cpython-37\bmtrain
copying bmtrain\param_init.py -> build\lib.win-amd64-cpython-37\bmtrain
copying bmtrain\store.py -> build\lib.win-amd64-cpython-37\bmtrain
copying bmtrain\synchronize.py -> build\lib.win-amd64-cpython-37\bmtrain
copying bmtrain\utils.py -> build\lib.win-amd64-cpython-37\bmtrain
copying bmtrain\wrapper.py -> build\lib.win-amd64-cpython-37\bmtrain
copying bmtrain\__init__.py -> build\lib.win-amd64-cpython-37\bmtrain
creating build\lib.win-amd64-cpython-37\bmtrain\benchmark
copying bmtrain\benchmark\all_gather.py -> build\lib.win-amd64-cpython-37\bmtrain\benchmark
copying bmtrain\benchmark\reduce_scatter.py -> build\lib.win-amd64-cpython-37\bmtrain\benchmark
copying bmtrain\benchmark\shape.py -> build\lib.win-amd64-cpython-37\bmtrain\benchmark
copying bmtrain\benchmark\utils.py -> build\lib.win-amd64-cpython-37\bmtrain\benchmark
copying bmtrain\benchmark\__init__.py -> build\lib.win-amd64-cpython-37\bmtrain\benchmark
creating build\lib.win-amd64-cpython-37\bmtrain\distributed
copying bmtrain\distributed\ops.py -> build\lib.win-amd64-cpython-37\bmtrain\distributed
copying bmtrain\distributed\__init__.py -> build\lib.win-amd64-cpython-37\bmtrain\distributed
creating build\lib.win-amd64-cpython-37\bmtrain\inspect
copying bmtrain\inspect\format.py -> build\lib.win-amd64-cpython-37\bmtrain\inspect
copying bmtrain\inspect\model.py -> build\lib.win-amd64-cpython-37\bmtrain\inspect
copying bmtrain\inspect\tensor.py -> build\lib.win-amd64-cpython-37\bmtrain\inspect
copying bmtrain\inspect\__init__.py -> build\lib.win-amd64-cpython-37\bmtrain\inspect
creating build\lib.win-amd64-cpython-37\bmtrain\loss
copying bmtrain\loss\cross_entropy.py -> build\lib.win-amd64-cpython-37\bmtrain\loss
copying bmtrain\loss\__init__.py -> build\lib.win-amd64-cpython-37\bmtrain\loss
creating build\lib.win-amd64-cpython-37\bmtrain\lr_scheduler
copying bmtrain\lr_scheduler\cosine.py -> build\lib.win-amd64-cpython-37\bmtrain\lr_scheduler
copying bmtrain\lr_scheduler\exponential.py -> build\lib.win-amd64-cpython-37\bmtrain\lr_scheduler
copying bmtrain\lr_scheduler\linear.py -> build\lib.win-amd64-cpython-37\bmtrain\lr_scheduler
copying bmtrain\lr_scheduler\noam.py -> build\lib.win-amd64-cpython-37\bmtrain\lr_scheduler
copying bmtrain\lr_scheduler\no_decay.py -> build\lib.win-amd64-cpython-37\bmtrain\lr_scheduler
copying bmtrain\lr_scheduler\warmup.py -> build\lib.win-amd64-cpython-37\bmtrain\lr_scheduler
copying bmtrain\lr_scheduler\__init__.py -> build\lib.win-amd64-cpython-37\bmtrain\lr_scheduler
creating build\lib.win-amd64-cpython-37\bmtrain\nccl
copying bmtrain\nccl\enums.py -> build\lib.win-amd64-cpython-37\bmtrain\nccl
copying bmtrain\nccl\__init__.py -> build\lib.win-amd64-cpython-37\bmtrain\nccl
creating build\lib.win-amd64-cpython-37\bmtrain\optim
copying bmtrain\optim\adam.py -> build\lib.win-amd64-cpython-37\bmtrain\optim
copying bmtrain\optim\adam_offload.py -> build\lib.win-amd64-cpython-37\bmtrain\optim
copying bmtrain\optim\clip_grad.py -> build\lib.win-amd64-cpython-37\bmtrain\optim
copying bmtrain\optim\__init__.py -> build\lib.win-amd64-cpython-37\bmtrain\optim
running build_ext
error: [WinError 2] The system cannot find the file specified.

I have already added cl to the environment variables, the Python version is 3.7, and the VC++ build tools have been updated.
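
The log above does not say which file is missing. One way to narrow it down, assuming the error comes from a build tool that the build process cannot locate on PATH, is to check what the Python interpreter used for the install can actually resolve. This is only a diagnostic sketch: the tool list (cl, nvcc, ninja) is a guess at what the extension build invokes, and `find_build_tools` is a hypothetical helper, not part of BMTrain.

```python
import shutil


def find_build_tools(names=("cl", "nvcc", "ninja")):
    """Map each tool name to its resolved path, or None if it is not on PATH.

    The default names are an assumption about what the build needs.
    """
    return {name: shutil.which(name) for name in names}


if __name__ == "__main__":
    for tool, path in find_build_tools().items():
        print(f"{tool}: {path or 'NOT FOUND'}")
```

Running this inside the same conda environment and shell that ran pip shows whether cl is visible to that process, which can differ from what an interactive terminal sees.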

BMTrain installation failed? FAILED: /tmp/pip-install-c6m09ftc/bmtrain_85cbac71a04c4746abbb40699f06db93/build/temp.linux-x86_64-cpython-37/csrc/cuda/adam.o

Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
[1/3] c++ -MMD -MF /tmp/pip-install-c6m09ftc/bmtrain_85cbac71a04c4746abbb40699f06db93/build/temp.linux-x86_64-cpython-37/csrc/adam_cuda.o.d -pthread -B /data/home/youzan/xiaoyang/anaconda3/compiler_compat -Wl,--sysroot=/ -Wsign-compare -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes -fPIC -I/data/home/youzan/xiaoyang/py37env_tf2/lib/python3.7/site-packages/torch/include -I/data/home/youzan/xiaoyang/py37env_tf2/lib/python3.7/site-packages/torch/include/torch/csrc/api/include -I/data/home/youzan/xiaoyang/py37env_tf2/lib/python3.7/site-packages/torch/include/TH -I/data/home/youzan/xiaoyang/py37env_tf2/lib/python3.7/site-packages/torch/include/THC -I/usr/local/cuda/include -I/data/home/youzan/xiaoyang/py37env_tf2/include -I/data/home/youzan/xiaoyang/anaconda3/include/python3.7m -c -c /tmp/pip-install-c6m09ftc/bmtrain_85cbac71a04c4746abbb40699f06db93/csrc/adam_cuda.cpp -o /tmp/pip-install-c6m09ftc/bmtrain_85cbac71a04c4746abbb40699f06db93/build/temp.linux-x86_64-cpython-37/csrc/adam_cuda.o -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=_cuda -D_GLIBCXX_USE_CXX11_ABI=0 -std=c++14
cc1plus: warning: command line option ‘-Wstrict-prototypes’ is valid for C/ObjC but not for C++
[2/3] /usr/local/cuda/bin/nvcc -I/data/home/youzan/xiaoyang/py37env_tf2/lib/python3.7/site-packages/torch/include -I/data/home/youzan/xiaoyang/py37env_tf2/lib/python3.7/site-packages/torch/include/torch/csrc/api/include -I/data/home/youzan/xiaoyang/py37env_tf2/lib/python3.7/site-packages/torch/include/TH -I/data/home/youzan/xiaoyang/py37env_tf2/lib/python3.7/site-packages/torch/include/THC -I/usr/local/cuda/include -I/data/home/youzan/xiaoyang/py37env_tf2/include -I/data/home/youzan/xiaoyang/anaconda3/include/python3.7m -c -c /tmp/pip-install-c6m09ftc/bmtrain_85cbac71a04c4746abbb40699f06db93/csrc/cuda/has_inf_nan.cu -o /tmp/pip-install-c6m09ftc/bmtrain_85cbac71a04c4746abbb40699f06db93/build/temp.linux-x86_64-cpython-37/csrc/cuda/has_inf_nan.o -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=_cuda -D_GLIBCXX_USE_CXX11_ABI=0 -gencode=arch=compute_61,code=sm_61 -std=c++14
FAILED: /tmp/pip-install-c6m09ftc/bmtrain_85cbac71a04c4746abbb40699f06db93/build/temp.linux-x86_64-cpython-37/csrc/cuda/has_inf_nan.o
/usr/local/cuda/bin/nvcc -I/data/home/youzan/xiaoyang/py37env_tf2/lib/python3.7/site-packages/torch/include -I/data/home/youzan/xiaoyang/py37env_tf2/lib/python3.7/site-packages/torch/include/torch/csrc/api/include -I/data/home/youzan/xiaoyang/py37env_tf2/lib/python3.7/site-packages/torch/include/TH -I/data/home/youzan/xiaoyang/py37env_tf2/lib/python3.7/site-packages/torch/include/THC -I/usr/local/cuda/include -I/data/home/youzan/xiaoyang/py37env_tf2/include -I/data/home/youzan/xiaoyang/anaconda3/include/python3.7m -c -c /tmp/pip-install-c6m09ftc/bmtrain_85cbac71a04c4746abbb40699f06db93/csrc/cuda/has_inf_nan.cu -o /tmp/pip-install-c6m09ftc/bmtrain_85cbac71a04c4746abbb40699f06db93/build/temp.linux-x86_64-cpython-37/csrc/cuda/has_inf_nan.o -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=_cuda -D_GLIBCXX_USE_CXX11_ABI=0 -gencode=arch=compute_61,code=sm_61 -std=c++14
/usr/include/c++/7/bits/basic_string.tcc: In instantiation of ‘static std::basic_string<_CharT, _Traits, _Alloc>::_Rep* std::basic_string<_CharT, _Traits, _Alloc>::_Rep::_S_create(std::basic_string<_CharT, _Traits, _Alloc>::size_type, std::basic_string<_CharT, _Traits, _Alloc>::size_type, const _Alloc&) [with _CharT = char16_t; _Traits = std::char_traits<char16_t>; _Alloc = std::allocator<char16_t>; std::basic_string<_CharT, _Traits, _Alloc>::size_type = long unsigned int]’:
/usr/include/c++/7/bits/basic_string.tcc:578:28: required from ‘static _CharT* std::basic_string<_CharT, _Traits, _Alloc>::_S_construct(_InIterator, _InIterator, const _Alloc&, std::forward_iterator_tag) [with _FwdIterator = const char16_t*; _CharT = char16_t; _Traits = std::char_traits<char16_t>; _Alloc = std::allocator<char16_t>]’
/usr/include/c++/7/bits/basic_string.h:5042:20: required from ‘static _CharT* std::basic_string<_CharT, _Traits, _Alloc>::_S_construct_aux(_InIterator, _InIterator, const _Alloc&, std::__false_type) [with _InIterator = const char16_t*; _CharT = char16_t; _Traits = std::char_traits<char16_t>; _Alloc = std::allocator<char16_t>]’
/usr/include/c++/7/bits/basic_string.h:5063:24: required from ‘static _CharT* std::basic_string<_CharT, _Traits, _Alloc>::_S_construct(_InIterator, _InIterator, const _Alloc&) [with _InIterator = const char16_t*; _CharT = char16_t; _Traits = std::char_traits<char16_t>; _Alloc = std::allocator<char16_t>]’
/usr/include/c++/7/bits/basic_string.tcc:656:134: required from ‘std::basic_string<_CharT, _Traits, _Alloc>::basic_string(const _CharT*, std::basic_string<_CharT, _Traits, _Alloc>::size_type, const _Alloc&) [with _CharT = char16_t; _Traits = std::char_traits<char16_t>; _Alloc = std::allocator<char16_t>; std::basic_string<_CharT, _Traits, _Alloc>::size_type = long unsigned int]’
/usr/include/c++/7/bits/basic_string.h:6688:95: required from here
/usr/include/c++/7/bits/basic_string.tcc:1067:16: error: cannot call member function ‘void std::basic_string<_CharT, _Traits, _Alloc>::_Rep::_M_set_sharable() [with _CharT = char16_t; _Traits = std::char_traits<char16_t>; _Alloc = std::allocator<char16_t>]’ without object
__p->_M_set_sharable();
~~~~~~~~~^~
/usr/include/c++/7/bits/basic_string.tcc: In instantiation of ‘static std::basic_string<_CharT, _Traits, _Alloc>::_Rep* std::basic_string<_CharT, _Traits, _Alloc>::_Rep::_S_create(std::basic_string<_CharT, _Traits, _Alloc>::size_type, std::basic_string<_CharT, _Traits, _Alloc>::size_type, const _Alloc&) [with _CharT = char32_t; _Traits = std::char_traits<char32_t>; _Alloc = std::allocator<char32_t>; std::basic_string<_CharT, _Traits, _Alloc>::size_type = long unsigned int]’:
/usr/include/c++/7/bits/basic_string.tcc:578:28: required from ‘static _CharT* std::basic_string<_CharT, _Traits, _Alloc>::_S_construct(_InIterator, _InIterator, const _Alloc&, std::forward_iterator_tag) [with _FwdIterator = const char32_t*; _CharT = char32_t; _Traits = std::char_traits<char32_t>; _Alloc = std::allocator<char32_t>]’
/usr/include/c++/7/bits/basic_string.h:5042:20: required from ‘static _CharT* std::basic_string<_CharT, _Traits, _Alloc>::_S_construct_aux(_InIterator, _InIterator, const _Alloc&, std::__false_type) [with _InIterator = const char32_t*; _CharT = char32_t; _Traits = std::char_traits<char32_t>; _Alloc = std::allocator<char32_t>]’
/usr/include/c++/7/bits/basic_string.h:5063:24: required from ‘static _CharT* std::basic_string<_CharT, _Traits, _Alloc>::_S_construct(_InIterator, _InIterator, const _Alloc&) [with _InIterator = const char32_t*; _CharT = char32_t; _Traits = std::char_traits<char32_t>; _Alloc = std::allocator<char32_t>]’
/usr/include/c++/7/bits/basic_string.tcc:656:134: required from ‘std::basic_string<_CharT, _Traits, _Alloc>::basic_string(const _CharT*, std::basic_string<_CharT, _Traits, _Alloc>::size_type, const _Alloc&) [with _CharT = char32_t; _Traits = std::char_traits<char32_t>; _Alloc = std::allocator<char32_t>; std::basic_string<_CharT, _Traits, _Alloc>::size_type = long unsigned int]’
/usr/include/c++/7/bits/basic_string.h:6693:95: required from here
/usr/include/c++/7/bits/basic_string.tcc:1067:16: error: cannot call member function ‘void std::basic_string<_CharT, _Traits, _Alloc>::_Rep::_M_set_sharable() [with _CharT = char32_t; _Traits = std::char_traits<char32_t>; _Alloc = std::allocator<char32_t>]’ without object
[3/3] /usr/local/cuda/bin/nvcc -I/data/home/youzan/xiaoyang/py37env_tf2/lib/python3.7/site-packages/torch/include -I/data/home/youzan/xiaoyang/py37env_tf2/lib/python3.7/site-packages/torch/include/torch/csrc/api/include -I/data/home/youzan/xiaoyang/py37env_tf2/lib/python3.7/site-packages/torch/include/TH -I/data/home/youzan/xiaoyang/py37env_tf2/lib/python3.7/site-packages/torch/include/THC -I/usr/local/cuda/include -I/data/home/youzan/xiaoyang/py37env_tf2/include -I/data/home/youzan/xiaoyang/anaconda3/include/python3.7m -c -c /tmp/pip-install-c6m09ftc/bmtrain_85cbac71a04c4746abbb40699f06db93/csrc/cuda/adam.cu -o /tmp/pip-install-c6m09ftc/bmtrain_85cbac71a04c4746abbb40699f06db93/build/temp.linux-x86_64-cpython-37/csrc/cuda/adam.o -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=_cuda -D_GLIBCXX_USE_CXX11_ABI=0 -gencode=arch=compute_61,code=sm_61 -std=c++14
FAILED: /tmp/pip-install-c6m09ftc/bmtrain_85cbac71a04c4746abbb40699f06db93/build/temp.linux-x86_64-cpython-37/csrc/cuda/adam.o
/usr/local/cuda/bin/nvcc -I/data/home/youzan/xiaoyang/py37env_tf2/lib/python3.7/site-packages/torch/include -I/data/home/youzan/xiaoyang/py37env_tf2/lib/python3.7/site-packages/torch/include/torch/csrc/api/include -I/data/home/youzan/xiaoyang/py37env_tf2/lib/python3.7/site-packages/torch/include/TH -I/data/home/youzan/xiaoyang/py37env_tf2/lib/python3.7/site-packages/torch/include/THC -I/usr/local/cuda/include -I/data/home/youzan/xiaoyang/py37env_tf2/include -I/data/home/youzan/xiaoyang/anaconda3/include/python3.7m -c -c /tmp/pip-install-c6m09ftc/bmtrain_85cbac71a04c4746abbb40699f06db93/csrc/cuda/adam.cu -o /tmp/pip-install-c6m09ftc/bmtrain_85cbac71a04c4746abbb40699f06db93/build/temp.linux-x86_64-cpython-37/csrc/cuda/adam.o -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=_cuda -D_GLIBCXX_USE_CXX11_ABI=0 -gencode=arch=compute_61,code=sm_61 -std=c++14
/usr/include/c++/7/bits/basic_string.tcc: In instantiation of ‘static std::basic_string<_CharT, _Traits, _Alloc>::_Rep* std::basic_string<_CharT, _Traits, _Alloc>::_Rep::_S_create(std::basic_string<_CharT, _Traits, _Alloc>::size_type, std::basic_string<_CharT, _Traits, _Alloc>::size_type, const _Alloc&) [with _CharT = char16_t; _Traits = std::char_traits<char16_t>; _Alloc = std::allocator<char16_t>; std::basic_string<_CharT, _Traits, _Alloc>::size_type = long unsigned int]’:
/usr/include/c++/7/bits/basic_string.tcc:578:28: required from ‘static _CharT* std::basic_string<_CharT, _Traits, _Alloc>::_S_construct(_InIterator, _InIterator, const _Alloc&, std::forward_iterator_tag) [with _FwdIterator = const char16_t*; _CharT = char16_t; _Traits = std::char_traits<char16_t>; _Alloc = std::allocator<char16_t>]’
/usr/include/c++/7/bits/basic_string.h:5042:20: required from ‘static _CharT* std::basic_string<_CharT, _Traits, _Alloc>::_S_construct_aux(_InIterator, _InIterator, const _Alloc&, std::__false_type) [with _InIterator = const char16_t*; _CharT = char16_t; _Traits = std::char_traits<char16_t>; _Alloc = std::allocator<char16_t>]’
/usr/include/c++/7/bits/basic_string.h:5063:24: required from ‘static _CharT* std::basic_string<_CharT, _Traits, _Alloc>::_S_construct(_InIterator, _InIterator, const _Alloc&) [with _InIterator = const char16_t*; _CharT = char16_t; _Traits = std::char_traits<char16_t>; _Alloc = std::allocator<char16_t>]’
/usr/include/c++/7/bits/basic_string.tcc:656:134: required from ‘std::basic_string<_CharT, _Traits, _Alloc>::basic_string(const _CharT*, std::basic_string<_CharT, _Traits, _Alloc>::size_type, const _Alloc&) [with _CharT = char16_t; _Traits = std::char_traits<char16_t>; _Alloc = std::allocator<char16_t>; std::basic_string<_CharT, _Traits, _Alloc>::size_type = long unsigned int]’
/usr/include/c++/7/bits/basic_string.h:6688:95: required from here
/usr/include/c++/7/bits/basic_string.tcc:1067:16: error: cannot call member function ‘void std::basic_string<_CharT, _Traits, _Alloc>::_Rep::_M_set_sharable() [with _CharT = char16_t; _Traits = std::char_traits<char16_t>; _Alloc = std::allocator<char16_t>]’ without object
__p->_M_set_sharable();
~~~~~~~~~^~
/usr/include/c++/7/bits/basic_string.tcc: In instantiation of ‘static std::basic_string<_CharT, _Traits, _Alloc>::_Rep* std::basic_string<_CharT, _Traits, _Alloc>::_Rep::_S_create(std::basic_string<_CharT, _Traits, _Alloc>::size_type, std::basic_string<_CharT, _Traits, _Alloc>::size_type, const _Alloc&) [with _CharT = char32_t; _Traits = std::char_traits<char32_t>; _Alloc = std::allocator<char32_t>; std::basic_string<_CharT, _Traits, _Alloc>::size_type = long unsigned int]’:
/usr/include/c++/7/bits/basic_string.tcc:578:28: required from ‘static _CharT* std::basic_string<_CharT, _Traits, _Alloc>::_S_construct(_InIterator, _InIterator, const _Alloc&, std::forward_iterator_tag) [with _FwdIterator = const char32_t*; _CharT = char32_t; _Traits = std::char_traits<char32_t>; _Alloc = std::allocator<char32_t>]’
/usr/include/c++/7/bits/basic_string.h:5042:20: required from ‘static _CharT* std::basic_string<_CharT, _Traits, _Alloc>::_S_construct_aux(_InIterator, _InIterator, const _Alloc&, std::__false_type) [with _InIterator = const char32_t*; _CharT = char32_t; _Traits = std::char_traits<char32_t>; _Alloc = std::allocator<char32_t>]’
/usr/include/c++/7/bits/basic_string.h:5063:24: required from ‘static _CharT* std::basic_string<_CharT, _Traits, _Alloc>::_S_construct(_InIterator, _InIterator, const _Alloc&) [with _InIterator = const char32_t*; _CharT = char32_t; _Traits = std::char_traits<char32_t>; _Alloc = std::allocator<char32_t>]’
/usr/include/c++/7/bits/basic_string.tcc:656:134: required from ‘std::basic_string<_CharT, _Traits, _Alloc>::basic_string(const _CharT*, std::basic_string<_CharT, _Traits, _Alloc>::size_type, const _Alloc&) [with _CharT = char32_t; _Traits = std::char_traits<char32_t>; _Alloc = std::allocator<char32_t>; std::basic_string<_CharT, _Traits, _Alloc>::size_type = long unsigned int]’
/usr/include/c++/7/bits/basic_string.h:6693:95: required from here
/usr/include/c++/7/bits/basic_string.tcc:1067:16: error: cannot call member function ‘void std::basic_string<_CharT, _Traits, _Alloc>::_Rep::_M_set_sharable() [with _CharT = char32_t; _Traits = std::char_traits<char32_t>; _Alloc = std::allocator<char32_t>]’ without object
ninja: build stopped: subcommand failed.
Traceback (most recent call last):
File "/data/home/youzan/xiaoyang/py37env_tf2/lib/python3.7/site-packages/torch/utils/cpp_extension.py", line 1723, in _run_ninja_build
env=env)
File "/data/home/youzan/xiaoyang/anaconda3/lib/python3.7/subprocess.py", line 468, in run
output=stdout, stderr=stderr)
subprocess.CalledProcessError: Command '['ninja', '-v']' returned non-zero exit status 1.

  The above exception was the direct cause of the following exception:

  Traceback (most recent call last):
    File "<string>", line 36, in <module>
    File "<pip-setuptools-caller>", line 34, in <module>
    File "/tmp/pip-install-c6m09ftc/bmtrain_85cbac71a04c4746abbb40699f06db93/setup.py", line 86, in <module>
      'build_ext': BuildExtension
    File "/data/home/youzan/xiaoyang/py37env_tf2/lib/python3.7/site-packages/setuptools/__init__.py", line 87, in setup
      return distutils.core.setup(**attrs)
    File "/data/home/youzan/xiaoyang/py37env_tf2/lib/python3.7/site-packages/setuptools/_distutils/core.py", line 148, in setup
      return run_commands(dist)
    File "/data/home/youzan/xiaoyang/py37env_tf2/lib/python3.7/site-packages/setuptools/_distutils/core.py", line 163, in run_commands
      dist.run_commands()
    File "/data/home/youzan/xiaoyang/py37env_tf2/lib/python3.7/site-packages/setuptools/_distutils/dist.py", line 967, in run_commands
      self.run_command(cmd)
    File "/data/home/youzan/xiaoyang/py37env_tf2/lib/python3.7/site-packages/setuptools/dist.py", line 1224, in run_command
      super().run_command(command)
    File "/data/home/youzan/xiaoyang/py37env_tf2/lib/python3.7/site-packages/setuptools/_distutils/dist.py", line 986, in run_command
      cmd_obj.run()
    File "/data/home/youzan/xiaoyang/py37env_tf2/lib/python3.7/site-packages/wheel/bdist_wheel.py", line 299, in run
      self.run_command('build')
    File "/data/home/youzan/xiaoyang/py37env_tf2/lib/python3.7/site-packages/setuptools/_distutils/cmd.py", line 313, in run_command
      self.distribution.run_command(command)
    File "/data/home/youzan/xiaoyang/py37env_tf2/lib/python3.7/site-packages/setuptools/dist.py", line 1224, in run_command
      super().run_command(command)
    File "/data/home/youzan/xiaoyang/py37env_tf2/lib/python3.7/site-packages/setuptools/_distutils/dist.py", line 986, in run_command
      cmd_obj.run()
    File "/data/home/youzan/xiaoyang/py37env_tf2/lib/python3.7/site-packages/setuptools/_distutils/command/build.py", line 136, in run
      self.run_command(cmd_name)
    File "/data/home/youzan/xiaoyang/py37env_tf2/lib/python3.7/site-packages/setuptools/_distutils/cmd.py", line 313, in run_command
      self.distribution.run_command(command)
    File "/data/home/youzan/xiaoyang/py37env_tf2/lib/python3.7/site-packages/setuptools/dist.py", line 1224, in run_command
      super().run_command(command)
    File "/data/home/youzan/xiaoyang/py37env_tf2/lib/python3.7/site-packages/setuptools/_distutils/dist.py", line 986, in run_command
      cmd_obj.run()
    File "/data/home/youzan/xiaoyang/py37env_tf2/lib/python3.7/site-packages/setuptools/command/build_ext.py", line 79, in run
      _build_ext.run(self)
    File "/data/home/youzan/xiaoyang/py37env_tf2/lib/python3.7/site-packages/Cython/Distutils/old_build_ext.py", line 186, in run
      _build_ext.build_ext.run(self)
    File "/data/home/youzan/xiaoyang/py37env_tf2/lib/python3.7/site-packages/setuptools/_distutils/command/build_ext.py", line 339, in run
      self.build_extensions()
    File "/data/home/youzan/xiaoyang/py37env_tf2/lib/python3.7/site-packages/torch/utils/cpp_extension.py", line 735, in build_extensions
      build_ext.build_extensions(self)
    File "/data/home/youzan/xiaoyang/py37env_tf2/lib/python3.7/site-packages/Cython/Distutils/old_build_ext.py", line 195, in build_extensions
      _build_ext.build_ext.build_extensions(self)
    File "/data/home/youzan/xiaoyang/py37env_tf2/lib/python3.7/site-packages/setuptools/_distutils/command/build_ext.py", line 448, in build_extensions
      self._build_extensions_serial()
    File "/data/home/youzan/xiaoyang/py37env_tf2/lib/python3.7/site-packages/setuptools/_distutils/command/build_ext.py", line 473, in _build_extensions_serial
      self.build_extension(ext)
    File "/data/home/youzan/xiaoyang/py37env_tf2/lib/python3.7/site-packages/setuptools/command/build_ext.py", line 202, in build_extension
      _build_ext.build_extension(self, ext)
    File "/data/home/youzan/xiaoyang/py37env_tf2/lib/python3.7/site-packages/setuptools/_distutils/command/build_ext.py", line 534, in build_extension
      depends=ext.depends)
    File "/data/home/youzan/xiaoyang/py37env_tf2/lib/python3.7/site-packages/torch/utils/cpp_extension.py", line 565, in unix_wrap_ninja_compile
      with_cuda=with_cuda)
    File "/data/home/youzan/xiaoyang/py37env_tf2/lib/python3.7/site-packages/torch/utils/cpp_extension.py", line 1404, in _write_ninja_file_and_compile_objects
      error_prefix='Error compiling objects for extension')
    File "/data/home/youzan/xiaoyang/py37env_tf2/lib/python3.7/site-packages/torch/utils/cpp_extension.py", line 1733, in _run_ninja_build
      raise RuntimeError(message) from e
  RuntimeError: Error compiling objects for extension
  [end of output]

note: This error originates from a subprocess, and is likely not a problem with pip.
ERROR: Failed building wheel for bmtrain
Running setup.py clean for bmtrain
Successfully built model-center
Failed to build bmtrain
Installing collected packages: bmtrain, model-center
Running setup.py install for bmtrain ... error
error: subprocess-exited-with-error

× Running setup.py install for bmtrain did not run successfully.
│ exit code: 1
╰─> [183 lines of output]
running install
/data/home/youzan/xiaoyang/py37env_tf2/lib/python3.7/site-packages/setuptools/command/install.py:37: SetuptoolsDeprecationWarning: setup.py install is deprecated. Use build and pip and other standards-based tools.

[Feature] Supporting the `maximize` parameter of Adam optimizer when `dtype` is `torch.half`

Procedure: simply negate p.grad when ('maximize' in group) and (group['maximize'] is True), so that the code better matches the description in https://pytorch.org/docs/stable/generated/torch.optim.Adam.html#torch.optim.Adam.
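
To illustrate the idea, here is a minimal sketch of a single Adam update on plain Python floats, where `maximize=True` just flips the sign of the gradient before the usual update. It is not BMTrain's fused half-precision kernel; `adam_step` and its `state` dictionary are hypothetical names used only for this example.

```python
import math


def adam_step(params, grads, state, lr=1e-3, betas=(0.9, 0.999),
              eps=1e-8, maximize=False):
    """One Adam update over lists of floats (illustrative only)."""
    beta1, beta2 = betas
    state["step"] = state.get("step", 0) + 1
    t = state["step"]
    m = state.setdefault("exp_avg", [0.0] * len(params))
    v = state.setdefault("exp_avg_sq", [0.0] * len(params))

    new_params = []
    for i, (p, g) in enumerate(zip(params, grads)):
        if maximize:
            g = -g  # ascend instead of descend: negate the gradient
        m[i] = beta1 * m[i] + (1 - beta1) * g
        v[i] = beta2 * v[i] + (1 - beta2) * g * g
        m_hat = m[i] / (1 - beta1 ** t)  # bias-corrected first moment
        v_hat = v[i] / (1 - beta2 ** t)  # bias-corrected second moment
        new_params.append(p - lr * m_hat / (math.sqrt(v_hat) + eps))
    return new_params
```

With a positive gradient, the default call moves the parameter down and the `maximize=True` call moves it up by the same amount, which is the behaviour requested here.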

[Question] bf16 & pipeline parallel

BMTrain mentions support for bf16 and pipeline parallelism.
Are there any usage examples? Can pipeline parallelism and ZeRO be used at the same time? Thanks.

bf16 optimizer

Are there any plans for adam and adam_offload to support bf16? Thanks.

Can BMTrain work with Megatron-LM?

We want to try a large LM (>30B parameters).
Are there any examples of how to do that?

support torch 1.12.0

Cannot load a model with BMTrain when using torch==1.12.0:

  File "/home/ubuntu/anaconda3/envs/xx/lib/python3.9/site-packages/model_center/model/basemodel.py", line 33, in from_pretrained
    bmt.load(model, os.path.join(path, 'pytorch_model.pt'), strict=False)
  File "/home/ubuntu/anaconda3/envs/xx/lib/python3.9/site-packages/bmtrain/store.py", line 197, in load
    ret = model.load_state_dict(
  File "/home/ubuntu/anaconda3/envs/xx/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1559, in load_state_dict
    raise TypeError("Expected state_dict to be dict-like, got {}.".format(type(state_dict)))
TypeError: Expected state_dict to be dict-like, got <class 'bmtrain.store.DistributedStateDictWrapper'>.

This is because load_state_dict in torch/nn/modules/module.py now checks the type of state_dict:

    def load_state_dict(self, state_dict: Mapping[str, Any],
                        strict: bool = True):
        r"""Copies parameters and buffers from :attr:`state_dict` into
        this module and its descendants. If :attr:`strict` is ``True``, then
        the keys of :attr:`state_dict` must exactly match the keys returned
        by this module's :meth:`~torch.nn.Module.state_dict` function.

        Args:
            state_dict (dict): a dict containing parameters and
                persistent buffers.
            strict (bool, optional): whether to strictly enforce that the keys
                in :attr:`state_dict` match the keys returned by this module's
                :meth:`~torch.nn.Module.state_dict` function. Default: ``True``

        Returns:
            ``NamedTuple`` with ``missing_keys`` and ``unexpected_keys`` fields:
                * **missing_keys** is a list of str containing the missing keys
                * **unexpected_keys** is a list of str containing the unexpected keys

        Note:
            If a parameter or buffer is registered as ``None`` and its corresponding key
            exists in :attr:`state_dict`, :meth:`load_state_dict` will raise a
            ``RuntimeError``.
        """
        if not isinstance(state_dict, Mapping):
            raise TypeError("Expected state_dict to be dict-like, got {}.".format(type(state_dict)))
        #....
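
The check quoted above can be reproduced without torch. The sketch below shows why a wrapper that is dict-like only by duck typing fails `isinstance(..., Mapping)`, and one possible way to satisfy it by subclassing `collections.abc.Mapping`. The classes `PlainWrapper` and `MappingWrapper` and the function `check_state_dict` are hypothetical stand-ins; this is not BMTrain's `DistributedStateDictWrapper`, and the issue does not say how BMTrain resolved it.

```python
from collections import OrderedDict
from collections.abc import Mapping

class PlainWrapper:
    """Dict-like by duck typing only (hypothetical stand-in for a state-dict wrapper)."""
    def __init__(self, data):
        self._data = data
    def __getitem__(self, key):
        return self._data[key]
    def __len__(self):
        return len(self._data)
    def keys(self):
        return self._data.keys()

class MappingWrapper(Mapping):
    """Same wrapper, but registered as a Mapping by inheritance."""
    def __init__(self, data):
        self._data = data
    def __getitem__(self, key):
        return self._data[key]
    def __iter__(self):
        return iter(self._data)
    def __len__(self):
        return len(self._data)

def check_state_dict(state_dict):
    # Mirrors the type check quoted from torch/nn/modules/module.py.
    if not isinstance(state_dict, Mapping):
        raise TypeError("Expected state_dict to be dict-like, got {}.".format(type(state_dict)))
    return True
```

`Mapping` does not recognise classes structurally, so `PlainWrapper` is rejected even though it implements `__getitem__`, `__len__` and `keys`; `MappingWrapper` passes and inherits `keys()`, `items()` and `values()` from the abstract base class.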

Some save and load problems when incorporated with BMCook

Hi, I found there might be some problems between bmt.save() and bmt.load().
Following the examples of BMCook, I loaded a gpt2-base model and trained it for several epochs. Note that all BMCook operations had been disabled in --cook-config. After training, I called bmt.save() to save the checkpoint. However, this checkpoint does not seem to match the parameters of a freshly initialized model:

Traceback (most recent call last):
  File "eval.py", line 207, in <module>
    main()
  File "eval.py", line 202, in main
    bmt.load(model, args.load_path)
  File "/opt/conda/lib/python3.8/site-packages/bmtrain/store.py", line 202, in load
    ret = model.load_state_dict()
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1604, in load_state_dict
    raise RuntimeError('Error(s) in loading state_dict for {}:\n\t{}'.format()
RuntimeError: Error(s) in loading state_dict for GPT2:
        While copying the parameter named "encoder.layers.0.self_att.self_attention.project_q.weight", whose dimensions in the model are torch.Size([768, 768]) and whose dimensions in the checkpoint are torch.Size([768, 768]), an exception occurred : ('The size of tensor a (768) must match the size of tensor b (589824) at non-singleton dimension 1',).
        While copying the parameter named "encoder.layers.0.self_att.self_attention.project_k.weight", whose dimensions in the model are torch.Size([768, 768]) and whose dimensions in the checkpoint are torch.Size([768, 768]), an exception occurred : ('The size of tensor a (768) must match the size of tensor b (589824) at non-singleton dimension 1',).
        While copying the parameter named "encoder.layers.0.self_att.self_attention.project_v.weight", whose dimensions in the model are torch.Size([768, 768]) and whose dimensions in the checkpoint are torch.Size([768, 768]), an exception occurred : ('The size of tensor a (768) must match the size of tensor b (589824) at non-singleton dimension 1',).
        While copying the parameter named "encoder.layers.0.self_att.self_attention.attention_out.weight", whose dimensions in the model are torch.Size([768, 768]) and whose dimensions in the checkpoint are torch.Size([768, 768]), an exception occurred : ('The size of tensor a (768) must match the size of tensor b (589824) at non-singleton dimension 1',).
        While copying the parameter named "encoder.layers.0.ffn.ffn.w_in.w.weight", whose dimensions in the model are torch.Size([3072, 768]) and whose dimensions in the checkpoint are torch.Size([3072, 768]), an exception occurred : ('The size of tensor a (768) must match the size of tensor b (2359296) at non-singleton dimension 1',).
...
        While copying the parameter named "encoder.layers.11.ffn.ffn.w_out.weight", whose dimensions in the model are torch.Size([768, 3072]) and whose dimensions in the checkpoint are torch.Size([768, 3072]), an exception occurred : ('The size of tensor a (3072) must match the size of tensor b (2359296) at non-singleton dimension 1',).
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 253443) of binary: /opt/conda/bin/python
Traceback (most recent call last):
  File "/opt/conda/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/opt/conda/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/launch.py", line 193, in <module>
    main()
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/launch.py", line 189, in main
    launch(args)
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/launch.py", line 174, in launch
    run(args)
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/run.py", line 752, in run
    elastic_launch()
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 131, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 245, in launch_agent
    raise ChildFailedError()
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
============================================================
eval.py FAILED
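
The sizes reported in the errors above are related by simple arithmetic: 589824 = 768 × 768 and 2359296 = 3072 × 768, so "tensor b" has exactly as many elements as the full weight matrix, laid out along one dimension. A short check (the helper name `numel` is used only for illustration):

```python
from functools import reduce
from operator import mul

def numel(shape):
    """Number of elements in a tensor of the given shape."""
    return reduce(mul, shape, 1)

# Sizes taken from the error messages above.
assert numel((768, 768)) == 589824     # attention projection weights
assert numel((3072, 768)) == 2359296   # ffn w_in weight
assert numel((768, 3072)) == 2359296   # ffn w_out weight
```

This is consistent with one side of the copy holding a flattened 1-D buffer, although the traceback alone does not show which side.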

It seems that bmt.load could not align the saved parameters with the flattened parameters of an initialized model. I'm not sure whether this is caused by BMCook or BMTrain. All hyper-parameters have been aligned, including BMCook's preprocessing. This is the main part of my code, which is the same as the given BMCook example except for the final bmt.load():

bmt.init_distributed()
args = parse_args()
save_dir = Path(args.save_dir)
ckpt_dir = save_dir / 'checkpoints'
os.makedirs(ckpt_dir, exist_ok=True)
json.dump(vars(args), open(save_dir / 'train_args.json', 'w'), indent=2)

model_config = config_map[args.model].from_pretrained(args.model)
model = model_map[args.model].from_pretrained(args.model, config=model_config)
# teacher model has the same config as the student model
teacher = model_map[args.model].from_pretrained(args.model, config=model_config)

def new_forward(model_self, enc_input, enc_length, dec_input, dec_length, return_logits=False):
    return model_self.forward_old(dec_input, dec_length, output_logits=return_logits)

model.forward_old = model.forward
model.forward = types.MethodType(new_forward, model)
teacher.forward_old = teacher.forward
teacher.forward = types.MethodType(new_forward, teacher)

bmt.synchronize()

# data
batch_size = 8
dec_len = 512

loss_func = torch.nn.CrossEntropyLoss(ignore_index=-100)
optimizer = bmt.optim.AdamOptimizer(model.parameters(), scale=2**20)
lr_scheduler = bmt.lr_scheduler.Noam(optimizer, start_lr=args.start_lr, warmup_iter=2000, end_iter=100000)

# bmcook config
from bmcook.utils.config import ConfigParser
config = ConfigParser(args.cook_config)

# remove checkpointing
for _, v in model.named_modules():

    if isinstance(v, bmt.TransformerBlockList):

        def new_func(list_self, hidden_states, *args):
            for i in range(len(list_self._modules)):
                hidden_states = list_self._modules[str(i)](hidden_states, *args)
            return hidden_states

        v.forward = types.MethodType(new_func, v)

        for k in v._modules.keys():
            state_dict = v._modules[k].state_dict()
            for kk, vv in v._modules[k]._module.named_modules():
                if kk+'.weight' in state_dict:
                    vv.weight.data = state_dict[kk+'.weight'].clone().cuda()
                if kk+'.bias' in state_dict:
                    vv.bias.data = state_dict[kk+'.bias'].clone().cuda()
            v._modules[k] = v._modules[k]._module

# for distillation
Trainer.forward = BMDistill.set_forward(model, teacher, Trainer.forward, config)

# for pruning
BMPrune.compute_mask(model, config)
BMPrune.set_optim_for_pruning(optimizer)

# for quantization
BMQuant.quantize(model, config)

# for moefication
Trainer.forward = BMMoE.get_hidden(model, config, Trainer.forward)

bmt.synchronize()

average_time = 0
average_time_shift = 0.9

dataset = Dataset(
    MMapIndexedDataset(args.data_path),
    dec_len
)

if config.get('MoEfication')['is_moefy']:
    os.makedirs(save_dir / 'hiddens', exist_ok=True)
    model.eval()

    for iteration, data in enumerate(Trainer.batch_iter(dataset, batch_size, bmt.rank(), bmt.world_size())):

        if iteration == 100:
            break

        dec_input = data["ctx"].int()
        dec_length = data["len_ctx"].int()
        dec_mask = torch.arange(dec_len)[None, :].repeat(batch_size, 1) < dec_length[:, None]
        targets = torch.where(dec_mask, data["target"].long(), torch.scalar_tensor(-100, dtype=torch.long))

        targets = targets.cuda()
        dec_input = dec_input.cuda()
        dec_length = dec_length.cuda()
        
        with torch.no_grad():
            outputs = Trainer.forward(model, None, None, dec_input, dec_length, targets, loss_func)
        
        torch.save(outputs[-1], save_dir / 'hiddens' / '{}_{}'.format(iteration, bmt.rank()))
           
        bmt.print_rank("Iteration:", iteration)
    exit()

do_distill = True
distill_config = config.get('distillation')
if distill_config['ce_scale'] + distill_config['mse_hidn_scale'] + distill_config['mse_att_scale'] == 0:
    do_distill = False

bmt.load(model, args.load_path)
# model.train()
teacher.eval()
model.eval()