microsoft / DeepSpeed

DeepSpeed is a deep learning optimization library that makes distributed training and inference easy, efficient, and effective.

Home Page: https://www.deepspeed.ai/

License: Apache License 2.0

Languages: Python 68.45%, Shell 0.37%, C++ 20.60%, Cuda 10.09%, Dockerfile 0.12%, C 0.37%, Batchfile 0.01%
Topics: deep-learning, pytorch, gpu, machine-learning, billion-parameters, data-parallelism, model-parallelism, inference, pipeline-parallelism, compression

DeepSpeed's Introduction


Latest News

DeepSpeed empowers ChatGPT-like model training with a single click, offering 15x speedup over SOTA RLHF systems with unprecedented cost reduction at all scales; learn how.

More news

Extreme Speed and Scale for DL Training and Inference

DeepSpeed enables the world's most powerful language models such as MT-530B and BLOOM. It is an easy-to-use deep learning optimization software suite that powers unprecedented scale and speed for both training and inference. With DeepSpeed you can:

  • Train/Inference dense or sparse models with billions or trillions of parameters
  • Achieve excellent system throughput and efficiently scale to thousands of GPUs
  • Train/Inference on resource-constrained GPU systems
  • Achieve unprecedented low latency and high throughput for inference
  • Achieve extreme compression for unparalleled inference latency and model size reduction at low cost

DeepSpeed's four innovation pillars

DeepSpeed-Training

DeepSpeed offers a confluence of system innovations that have made large-scale DL training effective and efficient, greatly improved ease of use, and redefined the DL training landscape in terms of the scale that is possible. Innovations such as ZeRO, 3D-Parallelism, DeepSpeed-MoE, and ZeRO-Infinity fall under the training pillar. Learn more: DeepSpeed-Training
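For illustration, the training pillar is driven by a small JSON-style configuration. Below is a minimal sketch (expressed as a Python dict, as accepted by deepspeed.initialize) that enables ZeRO stage 2 and fp16; the batch size and learning rate are placeholder values, not recommendations.

# Minimal sketch of a DeepSpeed training config enabling ZeRO stage 2 and fp16.
# Values are placeholders; see the DeepSpeed configuration docs for the full schema.
ds_config = {
    "train_batch_size": 32,
    "gradient_accumulation_steps": 1,
    "fp16": {"enabled": True},
    "zero_optimization": {"stage": 2},
    "optimizer": {"type": "Adam", "params": {"lr": 1e-4}},
}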

DeepSpeed-Inference

DeepSpeed brings together innovations in parallelism technology such as tensor, pipeline, expert and ZeRO-parallelism, and combines them with high performance custom inference kernels, communication optimizations and heterogeneous memory technologies to enable inference at an unprecedented scale, while achieving unparalleled latency, throughput and cost reduction. This systematic composition of system technologies for inference falls under the inference pillar. Learn more: DeepSpeed-Inference
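As a rough sketch of how this surfaces in the library (assuming a Hugging Face model and the deepspeed.init_inference entry point; argument names may differ across DeepSpeed versions):

import torch
import deepspeed
from transformers import AutoModelForCausalLM  # assumption: any Hugging Face causal LM

model = AutoModelForCausalLM.from_pretrained("gpt2")  # placeholder model id

# init_inference injects DeepSpeed's optimized inference kernels and, optionally,
# tensor (model) parallelism across mp_size GPUs.
engine = deepspeed.init_inference(
    model,
    mp_size=1,                        # tensor-parallel degree; 1 means no model parallelism
    dtype=torch.half,
    replace_with_kernel_inject=True,  # swap in fused kernels where supported
)
model = engine.module  # use the wrapped module for generation as usual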

DeepSpeed-Compression

To further increase inference efficiency, DeepSpeed offers easy-to-use and flexible-to-compose compression techniques for researchers and practitioners to compress their models while delivering faster speed, smaller model size, and significantly reduced compression cost. Moreover, SoTA innovations on compression like ZeroQuant and XTC are included under the compression pillar. Learn more: DeepSpeed-Compression

DeepSpeed4Science

In line with Microsoft's mission to solve humanity's most pressing challenges, the DeepSpeed team at Microsoft is responding to this opportunity by launching a new initiative called DeepSpeed4Science, aiming to build unique capabilities through AI system technology innovations to help domain experts to unlock today's biggest science mysteries. Learn more: DeepSpeed4Science website and tutorials


DeepSpeed Software Suite

DeepSpeed Library

The DeepSpeed library (this repository) implements and packages the innovations and technologies of the DeepSpeed Training, Inference and Compression pillars into a single easy-to-use, open-sourced repository. It allows for easy composition of a multitude of features within a single training, inference or compression pipeline. The DeepSpeed Library is heavily adopted by the DL community and has been used to enable some of the most powerful models (see DeepSpeed Adoption).
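To give a flavor of that composition, here is a minimal, illustrative training step with the DeepSpeed engine (a toy model and toy data stand in for a real pipeline; scripts like this are normally launched with the deepspeed launcher so that distributed initialization is handled for you):

import torch
import deepspeed

model = torch.nn.Linear(16, 1)  # toy model standing in for a real network
ds_config = {
    "train_batch_size": 8,
    "optimizer": {"type": "Adam", "params": {"lr": 1e-3}},
}

# deepspeed.initialize wraps the model (and, if provided, optimizer, data loader and
# LR scheduler) into a single engine that applies the configured features.
engine, _, _, _ = deepspeed.initialize(
    model=model,
    model_parameters=model.parameters(),
    config=ds_config,
)

x = torch.randn(8, 16).to(engine.device)  # toy batch; real code would use a DataLoader
y = torch.randn(8, 1).to(engine.device)

loss = torch.nn.functional.mse_loss(engine(x), y)
engine.backward(loss)  # DeepSpeed handles loss scaling and gradient partitioning
engine.step()          # optimizer step, LR schedule, gradient zeroing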

Model Implementations for Inference (MII)

Model Implementations for Inference (MII) is an open-sourced repository for making low-latency and high-throughput inference accessible to all data scientists by alleviating the need to apply complex system optimization techniques themselves. Out of the box, MII offers support for thousands of widely used DL models, optimized using DeepSpeed-Inference, that can be deployed with a few lines of code, while achieving significant latency reduction compared to their vanilla open-sourced versions.
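For example (a sketch assuming the mii.pipeline API from recent MII releases; the model id and generation arguments are placeholders):

import mii

# Load a text-generation pipeline optimized with DeepSpeed-Inference.
pipe = mii.pipeline("mistralai/Mistral-7B-v0.1")  # placeholder Hugging Face model id

# Generate a completion; max_new_tokens is illustrative.
response = pipe(["DeepSpeed is"], max_new_tokens=64)
print(response)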

DeepSpeed on Azure

DeepSpeed users are diverse and have access to different environments. We recommend trying DeepSpeed on Azure, as it is the simplest and easiest method; the recommended way to do so is through the AzureML recipes. The job submission and data preparation scripts have been made available here. For more details on how to use DeepSpeed on Azure, please follow the Azure tutorial.


DeepSpeed Adoption

DeepSpeed is an important part of Microsoft's new AI at Scale initiative to enable next-generation AI capabilities at scale; you can find more information here.

DeepSpeed has been used to train many different large-scale models, below is a list of several examples that we are aware of (if you'd like to include your model please submit a PR):

DeepSpeed has been integrated with several different popular open-source DL frameworks such as:

Documentation:
  • Transformers with DeepSpeed
  • Accelerate with DeepSpeed
  • Lightning with DeepSpeed
  • MosaicML with DeepSpeed
  • Determined with DeepSpeed
  • MMEngine with DeepSpeed

Build Pipeline Status

  • NVIDIA: nv-torch110-p40, nv-torch110-v100, nv-torch-latest-v100, nv-h100, nv-inference, nv-nightly
  • AMD: amd-mi200
  • CPU: torch-latest-cpu, cpu-inference
  • Intel Gaudi: hpu-gaudi2
  • Intel XPU: xpu-max1100
  • PyTorch Nightly: nv-torch-nightly-v100
  • Integrations: nv-transformers-v100, nv-lightning-v100, nv-accelerate-v100, nv-mii, nv-ds-chat, nv-sd
  • Misc: Formatting, pages-build-deployment, Documentation Status, python

Installation

The quickest way to get started with DeepSpeed is via pip; this installs the latest release of DeepSpeed, which is not tied to specific PyTorch or CUDA versions. DeepSpeed includes several C++/CUDA extensions that we commonly refer to as our 'ops'. By default, all of these extensions/ops are built just-in-time (JIT) using torch's JIT C++ extension loader, which relies on ninja to build and dynamically link them at runtime.

Requirements

  • PyTorch must be installed before installing DeepSpeed.
  • For full feature support we recommend a version of PyTorch that is >= 1.9 and ideally the latest PyTorch stable release.
  • A CUDA or ROCm compiler such as nvcc or hipcc is required to compile C++/CUDA/HIP extensions.
  • The specific GPUs we develop and test against are listed below. This doesn't mean your GPU won't work if it isn't listed; it simply means DeepSpeed is most well tested on the following:
    • NVIDIA: Pascal, Volta, Ampere, and Hopper architectures
    • AMD: MI100 and MI200

Contributed HW support

  • DeepSpeed now supports various HW accelerators.

Contributor | Hardware | Accelerator Name | Contributor validated | Upstream validated
Huawei | Huawei Ascend NPU | npu | Yes | No
Intel | Intel(R) Gaudi(R) 2 AI accelerator | hpu | Yes | Yes
Intel | Intel(R) Xeon(R) Processors | cpu | Yes | Yes
Intel | Intel(R) Data Center GPU Max series | xpu | Yes | Yes

PyPI

We regularly push releases to PyPI and encourage users to install from there in most cases.

pip install deepspeed

After installation, you can validate your install and see which extensions/ops your machine is compatible with via the DeepSpeed environment report.

ds_report

If you would like to pre-install any of the DeepSpeed extensions/ops (instead of JIT compiling) or install pre-compiled ops via PyPI please see our advanced installation instructions.

Windows

Windows is partially supported by DeepSpeed. On Windows you can build a wheel with the following steps; currently only inference mode is supported.

  1. Install PyTorch, such as PyTorch 1.8 + CUDA 11.1
  2. Install Visual C++ build tools, such as the VS2019 C++ x64/x86 build tools
  3. Launch a cmd console with Administrator privileges (required to create symlink folders)
  4. Run python setup.py bdist_wheel to build a wheel in the dist folder

Features

Please check out the DeepSpeed-Training, DeepSpeed-Inference and DeepSpeed-Compression pages for the full set of features offered along each of these three pillars.

Further Reading

All DeepSpeed documentation, tutorials, and blogs can be found on our website: deepspeed.ai

  • Getting Started: First steps with DeepSpeed
  • DeepSpeed JSON Configuration: Configuring DeepSpeed
  • API Documentation: Generated DeepSpeed API documentation
  • Tutorials: Tutorials
  • Blogs: Blogs

Contributing

DeepSpeed welcomes your contributions! Please see our contributing guide for more details on formatting, testing, etc.
Thanks so much to all of our amazing contributors!

Contributor License Agreement

This project welcomes contributions and suggestions. Most contributions require you to agree to a Contributor License Agreement (CLA) declaring that you have the right to, and actually do, grant us the rights to use your contribution. For details, visit https://cla.opensource.microsoft.com.

When you submit a pull request, a CLA bot will automatically determine whether you need to provide a CLA and decorate the PR appropriately (e.g., status check, comment). Simply follow the instructions provided by the bot. You will only need to do this once across all repos using our CLA.

Code of Conduct

This project has adopted the Microsoft Open Source Code of Conduct. For more information see the Code of Conduct FAQ or contact [email protected] with any additional questions or comments.

Publications

  1. Samyam Rajbhandari, Jeff Rasley, Olatunji Ruwase, Yuxiong He. (2019) ZeRO: memory optimizations toward training trillion parameter models. arXiv:1910.02054 and In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (SC '20).

  2. Jeff Rasley, Samyam Rajbhandari, Olatunji Ruwase, and Yuxiong He. (2020) DeepSpeed: System Optimizations Enable Training Deep Learning Models with Over 100 Billion Parameters. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining (KDD '20, Tutorial).

  3. Minjia Zhang, Yuxiong He. (2020) Accelerating Training of Transformer-Based Language Models with Progressive Layer Dropping. arXiv:2010.13369 and NeurIPS 2020.

  4. Jie Ren, Samyam Rajbhandari, Reza Yazdani Aminabadi, Olatunji Ruwase, Shuangyan Yang, Minjia Zhang, Dong Li, Yuxiong He. (2021) ZeRO-Offload: Democratizing Billion-Scale Model Training. arXiv:2101.06840 and USENIX ATC 2021. [paper] [slides] [blog]

  5. Hanlin Tang, Shaoduo Gan, Ammar Ahmad Awan, Samyam Rajbhandari, Conglong Li, Xiangru Lian, Ji Liu, Ce Zhang, Yuxiong He. (2021) 1-bit Adam: Communication Efficient Large-Scale Training with Adam's Convergence Speed. arXiv:2102.02888 and ICML 2021.

  6. Samyam Rajbhandari, Olatunji Ruwase, Jeff Rasley, Shaden Smith, Yuxiong He. (2021) ZeRO-Infinity: Breaking the GPU Memory Wall for Extreme Scale Deep Learning. arXiv:2104.07857 and SC 2021. [paper] [slides] [blog]

  7. Conglong Li, Ammar Ahmad Awan, Hanlin Tang, Samyam Rajbhandari, Yuxiong He. (2021) 1-bit LAMB: Communication Efficient Large-Scale Large-Batch Training with LAMB's Convergence Speed. arXiv:2104.06069 and HiPC 2022.

  8. Conglong Li, Minjia Zhang, Yuxiong He. (2021) The Stability-Efficiency Dilemma: Investigating Sequence Length Warmup for Training GPT Models. arXiv:2108.06084 and NeurIPS 2022.

  9. Yucheng Lu, Conglong Li, Minjia Zhang, Christopher De Sa, Yuxiong He. (2022) Maximizing Communication Efficiency for Large-scale Training via 0/1 Adam. arXiv:2202.06009.

  10. Samyam Rajbhandari, Conglong Li, Zhewei Yao, Minjia Zhang, Reza Yazdani Aminabadi, Ammar Ahmad Awan, Jeff Rasley, Yuxiong He. (2022) DeepSpeed-MoE: Advancing Mixture-of-Experts Inference and Training to Power Next-Generation AI Scale arXiv:2201.05596 and ICML 2022. [pdf] [slides] [blog]

  11. Shaden Smith, Mostofa Patwary, Brandon Norick, Patrick LeGresley, Samyam Rajbhandari, Jared Casper, Zhun Liu, Shrimai Prabhumoye, George Zerveas, Vijay Korthikanti, Elton Zhang, Rewon Child, Reza Yazdani Aminabadi, Julie Bernauer, Xia Song, Mohammad Shoeybi, Yuxiong He, Michael Houston, Saurabh Tiwary, Bryan Catanzaro. (2022) Using DeepSpeed and Megatron to Train Megatron-Turing NLG 530B, A Large-Scale Generative Language Model arXiv:2201.11990.

  12. Xiaoxia Wu, Zhewei Yao, Minjia Zhang, Conglong Li, Yuxiong He. (2022) Extreme Compression for Pre-trained Transformers Made Simple and Efficient. arXiv:2206.01859 and NeurIPS 2022.

  13. Zhewei Yao, Reza Yazdani Aminabadi, Minjia Zhang, Xiaoxia Wu, Conglong Li, Yuxiong He. (2022) ZeroQuant: Efficient and Affordable Post-Training Quantization for Large-Scale Transformers. arXiv:2206.01861 and NeurIPS 2022 [slides] [blog]

  14. Reza Yazdani Aminabadi, Samyam Rajbhandari, Minjia Zhang, Ammar Ahmad Awan, Cheng Li, Du Li, Elton Zheng, Jeff Rasley, Shaden Smith, Olatunji Ruwase, Yuxiong He. (2022) DeepSpeed Inference: Enabling Efficient Inference of Transformer Models at Unprecedented Scale. arXiv:2207.00032 and SC 2022. [paper] [slides] [blog]

  15. Zhewei Yao, Xiaoxia Wu, Conglong Li, Connor Holmes, Minjia Zhang, Cheng Li, Yuxiong He. (2022) Random-LTD: Random and Layerwise Token Dropping Brings Efficient Training for Large-scale Transformers. arXiv:2211.11586.

  16. Conglong Li, Zhewei Yao, Xiaoxia Wu, Minjia Zhang, Yuxiong He. (2022) DeepSpeed Data Efficiency: Improving Deep Learning Model Quality and Training Efficiency via Efficient Data Sampling and Routing. arXiv:2212.03597 ENLSP2023 Workshop at NeurIPS2023

  17. Xiaoxia Wu, Cheng Li, Reza Yazdani Aminabadi, Zhewei Yao, Yuxiong He. (2023) Understanding INT4 Quantization for Transformer Models: Latency Speedup, Composability, and Failure Cases. arXiv:2301.12017 and ICML2023.

  18. Syed Zawad, Cheng Li, Zhewei Yao, Elton Zheng, Yuxiong He, Feng Yan. (2023) DySR: Adaptive Super-Resolution via Algorithm and System Co-design. ICLR 2023.

  19. Sheng Shen, Zhewei Yao, Chunyuan Li, Trevor Darrell, Kurt Keutzer, Yuxiong He. (2023) Scaling Vision-Language Models with Sparse Mixture of Experts. arXiv:2303.07226 and Finding at EMNLP2023.

  20. Quentin Anthony, Ammar Ahmad Awan, Jeff Rasley, Yuxiong He, Aamir Shafi, Mustafa Abduljabbar, Hari Subramoni, Dhabaleswar Panda. (2023) MCR-DL: Mix-and-Match Communication Runtime for Deep Learning arXiv:2303.08374 and will appear at IPDPS 2023.

  21. Siddharth Singh, Olatunji Ruwase, Ammar Ahmad Awan, Samyam Rajbhandari, Yuxiong He, Abhinav Bhatele. (2023) A Hybrid Tensor-Expert-Data Parallelism Approach to Optimize Mixture-of-Experts Training arXiv:2303.06318 and will appear at ICS 2023.

  22. Guanhua Wang, Heyang Qin, Sam Ade Jacobs, Xiaoxia Wu, Connor Holmes, Zhewei Yao, Samyam Rajbhandari, Olatunji Ruwase, Feng Yan, Lei Yang, Yuxiong He. (2023) ZeRO++: Extremely Efficient Collective Communication for Giant Model Training arXiv:2306.10209 and ML for Sys Workshop at NeurIPS2023 [blog]

  23. Zhewei Yao, Xiaoxia Wu, Cheng Li, Stephen Youn, Yuxiong He. (2023) ZeroQuant-V2: Exploring Post-training Quantization in LLMs from Comprehensive Study to Low Rank Compensation arXiv:2303.08302 and ENLSP2023 Workshop at NeurIPS2023 [slides]

  24. Pareesa Ameneh Golnari, Zhewei Yao, Yuxiong He. (2023) Selective Guidance: Are All the Denoising Steps of Guided Diffusion Important? arXiv:2305.09847

  25. Zhewei Yao, Reza Yazdani Aminabadi, Olatunji Ruwase, Samyam Rajbhandari, Xiaoxia Wu, Ammar Ahmad Awan, Jeff Rasley, Minjia Zhang, Conglong Li, Connor Holmes, Zhongzhu Zhou, Michael Wyatt, Molly Smith, Lev Kurilenko, Heyang Qin, Masahiro Tanaka, Shuai Che, Shuaiwen Leon Song, Yuxiong He. (2023) DeepSpeed-Chat: Easy, Fast and Affordable RLHF Training of ChatGPT-like Models at All Scales arXiv:2308.01320.

  26. Xiaoxia Wu, Zhewei Yao, Yuxiong He. (2023) ZeroQuant-FP: A Leap Forward in LLMs Post-Training W4A8 Quantization Using Floating-Point Formats arXiv:2307.09782 and ENLSP2023 Workshop at NeurIPS2023 [slides]

  27. Zhewei Yao, Xiaoxia Wu, Conglong Li, Minjia Zhang, Heyang Qin, Olatunji Ruwase, Ammar Ahmad Awan, Samyam Rajbhandari, Yuxiong He. (2023) DeepSpeed-VisualChat: Multi-Round Multi-Image Interleave Chat via Multi-Modal Causal Attention arXiv:2309.14327

  28. Shuaiwen Leon Song, Bonnie Kruft, Minjia Zhang, Conglong Li, Shiyang Chen, Chengming Zhang, Masahiro Tanaka, Xiaoxia Wu, Jeff Rasley, Ammar Ahmad Awan, Connor Holmes, Martin Cai, Adam Ghanem, Zhongzhu Zhou, Yuxiong He, et al. (2023) DeepSpeed4Science Initiative: Enabling Large-Scale Scientific Discovery through Sophisticated AI System Technologies arXiv:2310.04610 [blog]

  29. Zhewei Yao, Reza Yazdani Aminabadi, Stephen Youn, Xiaoxia Wu, Elton Zheng, Yuxiong He. (2023) ZeroQuant-HERO: Hardware-Enhanced Robust Optimized Post-Training Quantization Framework for W8A8 Transformers arXiv:2310.17723

  30. Xiaoxia Wu, Haojun Xia, Stephen Youn, Zhen Zheng, Shiyang Chen, Arash Bakhtiari, Michael Wyatt, Reza Yazdani Aminabadi, Yuxiong He, Olatunji Ruwase, Leon Song, Zhewei Yao (2023) ZeroQuant(4+2): Redefining LLMs Quantization with a New FP6-Centric Strategy for Diverse Generative Tasks arXiv:2312.08583

  31. Haojun Xia, Zhen Zheng, Xiaoxia Wu, Shiyang Chen, Zhewei Yao, Stephen Youn, Arash Bakhtiari, Michael Wyatt, Donglin Zhuang, Zhongzhu Zhou, Olatunji Ruwase, Yuxiong He, Shuaiwen Leon Song. (2024) FP6-LLM: Efficiently Serving Large Language Models Through FP6-Centric Algorithm-System Co-Design arXiv:2401.14112

Videos

  1. DeepSpeed KDD 2020 Tutorial
    1. Overview
    2. ZeRO + large model training
    3. 17B T-NLG demo
    4. Fastest BERT training + RScan tuning
    5. DeepSpeed hands on deep dive: part 1, part 2, part 3
    6. FAQ
  2. Microsoft Research Webinar
  3. DeepSpeed on AzureML
  4. Large Model Training and Inference with DeepSpeed // Samyam Rajbhandari // LLMs in Prod Conference [slides]
  5. Community Tutorials

DeepSpeed's People

Contributors

aphedges, arashb, awan-10, bacharl, bm-synth, chunyang-wen, cli99, cmikeh2, conglongli, delock, digger-yu, guoyejun, heyangqin, inkcherry, jeffra, jomayeri, lekurile, loadams, molly-smith, mrwyattii, nelyahu, quentin-anthony, rezayazdaniaminabadi, rraminen, samadejacobs, samyam, stas00, tjruwase, tohtana, yejing-lai


DeepSpeed's Issues

Detect split brain issues w.r.t. user code and deepspeed batch sizes

Often user code will have a user-defined batch size and the DeepSpeed config JSON will have its own batch size. When using gradient accumulation this can cause bugs where DeepSpeed thinks the gradient accumulation steps should be different from what the user code is actually doing.

If the user is using the default collate_fn, then DeepSpeed should be able to detect these cases and throw an exception. We can check what batch size is being passed in the forward pass by inspecting the first dimension.

Lastly, we probably want to add an error suppression flag to the DeepSpeed config so users can turn this error off if they know what they are doing and their batch alignment is non-standard.
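A rough sketch of what such a check could look like (purely illustrative; the function name and the suppression flag below are hypothetical, not existing DeepSpeed API):

import torch

def check_micro_batch(inputs: torch.Tensor, ds_config: dict, world_size: int) -> None:
    # Batch dimension actually seen in forward() when the default collate_fn is used.
    seen = inputs.shape[0]
    # Micro-batch size implied by the DeepSpeed config.
    expected = ds_config["train_batch_size"] // (
        ds_config.get("gradient_accumulation_steps", 1) * world_size
    )
    # "ignore_batch_size_mismatch" is the hypothetical suppression flag discussed above.
    if seen != expected and not ds_config.get("ignore_batch_size_mismatch", False):
        raise ValueError(
            f"Per-GPU batch size {seen} does not match the micro-batch size "
            f"{expected} implied by the DeepSpeed config"
        )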

Unable to detect local GPU Resources

After running DeepSpeed locally from the deepspeed/deepspeed:latest Docker container, it is unable to detect my local NVIDIA GTX 1080.

Edit: I am on Windows 10 which is complicating this issue

Training the 20 and 8 billion parameter models failed on SUMMIT

Hello,

I am trying to train the 8 billion and the 20 billion parameter models on SUMMIT and both failed.
SUMMIT has 6 NVIDIA V100 16GB GPUs per node.
Both the 8 billion and the 20 billion models give OOM errors.

The training command is:

export MP_SIZE=6

jsrun -n${NODES} -a6 -c42 -g6 -r1 --smpiargs $SMPIARGS python pretrain_bert.py --sharedfile=$SHAREDFILE \
       --deepspeed_mpi --deepspeed --deepspeed_config ${DS_CONFIG} \
       --model-parallel-size ${MP_SIZE} \
       --num-layers 100 \
       --hidden-size 3720 \
       --num-attention-heads 30 \
       --batch-size 1 \
       --seq-length 512 \
       --max-preds-per-seq 76 \
       --max-position-embeddings 512 \
       --train-iters 1000000 \
       --save ${SAVEPATH} \
       --use-tfrecords \
       --train-data ${TRAINDATAPATH} \
       --tokenizer-type BertWordPieceTokenizer \
       --tokenizer-model-type ${VOCABPATH} \
       --presplit-sentences \
       --cache-dir ${CACHEPATH} \
       --split 949,50,1 \
       --distributed-backend nccl \
       --lr 0.0001 \
       --lr-decay-style linear \
       --lr-decay-iters 990000 \
       --weight-decay 1e-2 \
       --clip-grad 1.0 \
       --warmup .01 \
       --fp16 \
       --fp32-layernorm \
       --fp32-embedding \
       --vocab-size 30 \
       --make-vocab-size-divisible-by 5 \
       --checkpoint-activations \
       --checkpoint-num-layers 1

jsrun -n${NODES} -a6 -c42 -g6 -r1 --smpiargs $SMPIARGS python pretrain_bert_nccl.py --sharedfile=$SHAREDFILE \
       --deepspeed_mpi --deepspeed --deepspeed_config ${DS_CONFIG} \
       --model-parallel-size ${MP_SIZE} \
       --num-layers 72 \
       --hidden-size 3072 \
       --num-attention-heads 24 \
       --batch-size 1 \
       --seq-length 512 \
       --max-preds-per-seq 76 \
       --max-position-embeddings 512 \
       --train-iters 1000000 \
       --save ${SAVEPATH} \
       --use-tfrecords \
       --train-data ${TRAINDATAPATH} \
       --tokenizer-type BertWordPieceTokenizer \
       --tokenizer-model-type ${VOCABPATH} \
       --presplit-sentences \
       --cache-dir ${CACHEPATH} \
       --split 949,50,1 \
       --distributed-backend nccl \
       --lr 0.0001 \
       --lr-decay-style linear \
       --lr-decay-iters 990000 \
       --weight-decay 1e-2 \
       --clip-grad 1.0 \
       --warmup .01 \
       --fp16 \
       --fp32-layernorm \
       --fp32-embedding \
       --vocab-size 30 \
       --make-vocab-size-divisible-by 5 \
       --checkpoint-activations \
       --checkpoint-num-layers 1

The config file is:

{
  "train_batch_size": 1,
  "gradient_accumulation_steps": 1,
  "steps_per_print": 1,
  "zero_optimization": true,
  "optimizer": {
    "type": "Adam",
    "params": {
      "lr": 0.00015,
      "max_grad_norm": 1.0
    }
  },

  "fp16": {
    "enabled": true,
    "loss_scale": 0,
    "loss_scale_window": 1000,
    "hysteresis": 2,
    "min_loss_scale": 1
  } 
}

I am testing it on 1 node and even after I reduced the train batch size to 1, it didn't work:


The logs are:
  use_npy_data_loader .......... False
  train_data_path .............. 
  val_data_path ................ 
  test_data_path ............... 
  input_data_sizes_file ........ sizes.txt
  delim ........................ ,
  text_key ..................... sentence
  eval_text_key ................ None
  valid_data ................... None
  split ........................ 949,50,1
  test_data .................... None
  lazy_loader .................. False
  loose_json ................... False
  presplit_sentences ........... True
  num_workers .................. 2
  tokenizer_model_type ......... /ccs/proj/bif120/deepforce/scripts/deepspeed/bio-bfd/
  tokenizer_path ............... tokenizer.model
  tokenizer_type ............... BertWordPieceTokenizer
  cache_dir .................... /gpfs/alpine/proj-shared/bif120/dataset/bfd100/models/deepspeed/cache/
  use_tfrecords ................ True
  seq_length ................... 512
  max_preds_per_seq ............ 76
  deepspeed .................... True
  deepspeed_config ............. /ccs/proj/bif120/deepforce/scripts/deepspeed/bio-bfd/ds_bert_config.json
  deepscale .................... False
  deepscale_config ............. None
  deepspeed_mpi ................ True
  sharedfile ................... /gpfs/alpine/proj-shared/bif120/dataset/bfd100/models/deepspeed/test/.sharedfile
  cuda ......................... True
  rank ......................... 0
  world_size ................... 6
  dynamic_loss_scale ........... True
> initializing model parallel cuda seeds on global rank 0, model parallel rank 0, and data parallel rank 0 with model parallel seed: 3952 and data parallel seed: 1234
2020-02-29 04:40:19.647170: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcudart.so.10.1
WARNING: Logging before flag parsing goes to stderr.
W0229 04:40:22.566024 35184372395936 deprecation_wrapper.py:119] From /autofs/nccs-svm1_proj/bif120/deepforce/scripts/deepspeed/DeepSpeedExamples/Megatron-LM/data_utils/tf_dl.py:46: The name tf.ConfigProto is deprecated. Please use tf.compat.v1.ConfigProto instead.

W0229 04:40:22.567073 35184372395936 deprecation_wrapper.py:119] From /autofs/nccs-svm1_proj/bif120/deepforce/scripts/deepspeed/DeepSpeedExamples/Megatron-LM/data_utils/tf_dl.py:55: The name tf.set_random_seed is deprecated. Please use tf.compat.v1.set_random_seed instead.

W0229 04:40:22.567220 35184372395936 deprecation_wrapper.py:119] From /autofs/nccs-svm1_proj/bif120/deepforce/scripts/deepspeed/DeepSpeedExamples/Megatron-LM/data_utils/tf_dl.py:66: The name tf.FixedLenFeature is deprecated. Please use tf.io.FixedLenFeature instead.

2020-02-29 04:40:22.567455: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcuda.so.1
2020-02-29 04:40:22.570236: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1640] Found device 0 with properties: 
name: Tesla V100-SXM2-16GB major: 7 minor: 0 memoryClockRate(GHz): 1.53
pciBusID: 0004:04:00.0
2020-02-29 04:40:22.572765: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1640] Found device 1 with properties: 
name: Tesla V100-SXM2-16GB major: 7 minor: 0 memoryClockRate(GHz): 1.53
pciBusID: 0004:05:00.0
2020-02-29 04:40:22.575278: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1640] Found device 2 with properties: 
name: Tesla V100-SXM2-16GB major: 7 minor: 0 memoryClockRate(GHz): 1.53
pciBusID: 0004:06:00.0
2020-02-29 04:40:22.577850: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1640] Found device 3 with properties: 
name: Tesla V100-SXM2-16GB major: 7 minor: 0 memoryClockRate(GHz): 1.53
pciBusID: 0035:03:00.0
2020-02-29 04:40:22.580415: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1640] Found device 4 with properties: 
name: Tesla V100-SXM2-16GB major: 7 minor: 0 memoryClockRate(GHz): 1.53
pciBusID: 0035:04:00.0
2020-02-29 04:40:22.582986: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1640] Found device 5 with properties: 
name: Tesla V100-SXM2-16GB major: 7 minor: 0 memoryClockRate(GHz): 1.53
pciBusID: 0035:05:00.0
2020-02-29 04:40:22.583008: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcudart.so.10.1
2020-02-29 04:40:22.583068: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcublas.so.10
2020-02-29 04:40:22.583108: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcufft.so.10
2020-02-29 04:40:22.583146: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcurand.so.10
2020-02-29 04:40:22.585072: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcusolver.so.10
2020-02-29 04:40:22.585118: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcusparse.so.10
2020-02-29 04:40:22.585156: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcudnn.so.7
2020-02-29 04:40:22.615387: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1763] Adding visible gpu devices: 0, 1, 2, 3, 4, 5
2020-02-29 04:40:22.623295: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1181] Device interconnect StreamExecutor with strength 1 edge matrix:
2020-02-29 04:40:22.623314: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1187]      
W0229 04:40:22.646660 35184372395936 deprecation.py:323] From /ccs/proj/bif120/deepforce/virtualenvs/ibm_wml_ce-1.6.1-3/lib/python3.6/site-packages/tensorflow/python/data/util/random_seed.py:58: add_dispatch_support.<locals>.wrapper (from tensorflow.python.ops.array_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use tf.where in 2.0, which has the same broadcast rule as np.where
W0229 04:40:25.123421 35184372395936 lazy_loader.py:50] 
The TensorFlow contrib module will not be included in TensorFlow 2.0.
For more information, please see:
  * https://github.com/tensorflow/community/blob/master/rfcs/20180907-contrib-sunset.md
  * https://github.com/tensorflow/addons
  * https://github.com/tensorflow/io (for I/O related ops)
If you depend on functionality not listed there, please file an issue.

W0229 04:40:25.123578 35184372395936 deprecation.py:323] From /autofs/nccs-svm1_proj/bif120/deepforce/scripts/deepspeed/DeepSpeedExamples/Megatron-LM/data_utils/tf_dl.py:86: parallel_interleave (from tensorflow.contrib.data.python.ops.interleave_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use `tf.data.experimental.parallel_interleave(...)`.
W0229 04:40:25.123658 35184372395936 deprecation.py:323] From /ccs/proj/bif120/deepforce/virtualenvs/ibm_wml_ce-1.6.1-3/lib/python3.6/site-packages/tensorflow/contrib/data/python/ops/interleave_ops.py:77: parallel_interleave (from tensorflow.python.data.experimental.ops.interleave_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use `tf.data.Dataset.interleave(map_func, cycle_length, block_length, num_parallel_calls=tf.data.experimental.AUTOTUNE)` instead. If sloppy execution is desired, use `tf.data.Options.experimental_determinstic`.
2020-02-29 04:40:25.149839: W tensorflow/core/common_runtime/eager/context.cc:371] Added two functions with the same name: __inference_Dataset_flat_map_read_one_file_28
W0229 04:40:25.153336 35184372395936 deprecation.py:323] From /autofs/nccs-svm1_proj/bif120/deepforce/scripts/deepspeed/DeepSpeedExamples/Megatron-LM/data_utils/tf_dl.py:96: map_and_batch (from tensorflow.contrib.data.python.ops.batching) is deprecated and will be removed in a future version.
Instructions for updating:
Use `tf.data.experimental.map_and_batch(...)`.
W0229 04:40:25.153439 35184372395936 deprecation.py:323] From /ccs/proj/bif120/deepforce/virtualenvs/ibm_wml_ce-1.6.1-3/lib/python3.6/site-packages/tensorflow/contrib/data/python/ops/batching.py:273: map_and_batch (from tensorflow.python.data.experimental.ops.batching) is deprecated and will be removed in a future version.
Instructions for updating:
Use `tf.data.Dataset.map(map_func, num_parallel_calls)` followed by `tf.data.Dataset.batch(batch_size, drop_remainder)`. Static tf.data optimizations will take care of using the fused implementation.
W0229 04:40:25.154995 35184372395936 deprecation_wrapper.py:119] From /autofs/nccs-svm1_proj/bif120/deepforce/scripts/deepspeed/DeepSpeedExamples/Megatron-LM/data_utils/tf_dl.py:116: The name tf.parse_single_example is deprecated. Please use tf.io.parse_single_example instead.

W0229 04:40:25.166115 35184372395936 deprecation.py:323] From /autofs/nccs-svm1_proj/bif120/deepforce/scripts/deepspeed/DeepSpeedExamples/Megatron-LM/data_utils/tf_dl.py:119: to_int32 (from tensorflow.python.ops.math_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use `tf.cast` instead.
configuring data
loading BertWordPieceTokenizer ( /ccs/proj/bif120/deepforce/scripts/deepspeed/bio-bfd/ ) from cache_dir  /gpfs/alpine/proj-shared/bif120/dataset/bfd100/models/deepspeed/cache/
loaded /ccs/proj/bif120/deepforce/scripts/deepspeed/bio-bfd/
> padded vocab (size: 30) with 0 dummy tokens (new size: 30)
h36n18:125722:125722 [0] NCCL INFO NET/Socket : Using [0]ib0:10.41.20.224<0>
h36n18:125722:125722 [0] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so).
h36n18:125722:125722 [0] NCCL INFO NET/IB : Using [0]mlx5_1:1/IB [1]mlx5_3:1/IB [2]mlx5_0:1/IB [3]mlx5_2:1/IB ; OOB ib0:10.41.20.224<0>
NCCL version 2.4.7nvb1+cuda10.1
h36n18:125724:125724 [2] NCCL INFO NET/Socket : Using [0]ib0:10.41.20.224<0>
h36n18:125724:125724 [2] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so).
h36n18:125726:125726 [4] NCCL INFO NET/Socket : Using [0]ib0:10.41.20.224<0>
h36n18:125726:125726 [4] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so).
h36n18:125727:125727 [5] NCCL INFO NET/Socket : Using [0]ib0:10.41.20.224<0>
h36n18:125727:125727 [5] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so).
h36n18:125723:125723 [1] NCCL INFO NET/Socket : Using [0]ib0:10.41.20.224<0>
h36n18:125723:125723 [1] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so).
h36n18:125722:125971 [0] NCCL INFO Setting affinity for GPU 0 to 0fffff,ffffffff,ffffffff
h36n18:125725:125725 [3] NCCL INFO NET/Socket : Using [0]ib0:10.41.20.224<0>
h36n18:125725:125725 [3] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so).
h36n18:125725:125725 [3] NCCL INFO NET/IB : Using [0]mlx5_1:1/IB [1]mlx5_3:1/IB [2]mlx5_0:1/IB [3]mlx5_2:1/IB ; OOB ib0:10.41.20.224<0>
h36n18:125726:125726 [4] NCCL INFO NET/IB : Using [0]mlx5_1:1/IB [1]mlx5_3:1/IB [2]mlx5_0:1/IB [3]mlx5_2:1/IB ; OOB ib0:10.41.20.224<0>
h36n18:125724:125724 [2] NCCL INFO NET/IB : Using [0]mlx5_1:1/IB [1]mlx5_3:1/IB [2]mlx5_0:1/IB [3]mlx5_2:1/IB ; OOB ib0:10.41.20.224<0>
h36n18:125723:125723 [1] NCCL INFO NET/IB : Using [0]mlx5_1:1/IB [1]mlx5_3:1/IB [2]mlx5_0:1/IB [3]mlx5_2:1/IB ; OOB ib0:10.41.20.224<0>
h36n18:125727:125727 [5] NCCL INFO NET/IB : Using [0]mlx5_1:1/IB [1]mlx5_3:1/IB [2]mlx5_0:1/IB [3]mlx5_2:1/IB ; OOB ib0:10.41.20.224<0>
h36n18:125725:125992 [3] NCCL INFO Setting affinity for GPU 3 to ffffffff,ff000000,00000000,00000000
h36n18:125723:125993 [1] NCCL INFO Setting affinity for GPU 1 to 0fffff,ffffffff,ffffffff
h36n18:125724:125994 [2] NCCL INFO Setting affinity for GPU 2 to 0fffff,ffffffff,ffffffff
h36n18:125726:125995 [4] NCCL INFO Setting affinity for GPU 4 to ffffffff,ff000000,00000000,00000000
h36n18:125727:125996 [5] NCCL INFO Setting affinity for GPU 5 to ffffffff,ff000000,00000000,00000000
h36n18:125722:125971 [0] NCCL INFO Duplicating rings to 4 per user request.
h36n18:125722:125971 [0] NCCL INFO Channel 00 :    0   1   2   3   4   5
h36n18:125722:125971 [0] NCCL INFO Channel 01 :    0   1   2   3   4   5
h36n18:125722:125971 [0] NCCL INFO Channel 02 :    0   1   2   3   4   5
h36n18:125722:125971 [0] NCCL INFO Channel 03 :    0   1   2   3   4   5
h36n18:125726:125995 [4] NCCL INFO Ring 00 : 4[4] -> 5[5] via P2P/IPC
h36n18:125727:125996 [5] NCCL INFO Ring 00 : 5[5] -> 0[0] via P2P/IPC
h36n18:125725:125992 [3] NCCL INFO Ring 00 : 3[3] -> 4[4] via P2P/IPC
h36n18:125723:125993 [1] NCCL INFO Ring 00 : 1[1] -> 2[2] via P2P/IPC
h36n18:125724:125994 [2] NCCL INFO Ring 00 : 2[2] -> 3[3] via P2P/IPC
h36n18:125722:125971 [0] NCCL INFO Ring 00 : 0[0] -> 1[1] via P2P/IPC
h36n18:125727:125996 [5] NCCL INFO Ring 01 : 5[5] -> 0[0] via P2P/IPC
h36n18:125725:125992 [3] NCCL INFO Ring 01 : 3[3] -> 4[4] via P2P/IPC
h36n18:125724:125994 [2] NCCL INFO Ring 01 : 2[2] -> 3[3] via P2P/IPC
h36n18:125722:125971 [0] NCCL INFO Ring 01 : 0[0] -> 1[1] via P2P/IPC
h36n18:125726:125995 [4] NCCL INFO Ring 01 : 4[4] -> 5[5] via P2P/IPC
h36n18:125723:125993 [1] NCCL INFO Ring 01 : 1[1] -> 2[2] via P2P/IPC
h36n18:125727:125996 [5] NCCL INFO Ring 02 : 5[5] -> 0[0] via P2P/IPC
h36n18:125725:125992 [3] NCCL INFO Ring 02 : 3[3] -> 4[4] via P2P/IPC
h36n18:125724:125994 [2] NCCL INFO Ring 02 : 2[2] -> 3[3] via P2P/IPC
h36n18:125722:125971 [0] NCCL INFO Ring 02 : 0[0] -> 1[1] via P2P/IPC
h36n18:125726:125995 [4] NCCL INFO Ring 02 : 4[4] -> 5[5] via P2P/IPC
h36n18:125723:125993 [1] NCCL INFO Ring 02 : 1[1] -> 2[2] via P2P/IPC
h36n18:125727:125996 [5] NCCL INFO Ring 03 : 5[5] -> 0[0] via P2P/IPC
h36n18:125725:125992 [3] NCCL INFO Ring 03 : 3[3] -> 4[4] via P2P/IPC
h36n18:125724:125994 [2] NCCL INFO Ring 03 : 2[2] -> 3[3] via P2P/IPC
h36n18:125722:125971 [0] NCCL INFO Ring 03 : 0[0] -> 1[1] via P2P/IPC
h36n18:125726:125995 [4] NCCL INFO Ring 03 : 4[4] -> 5[5] via P2P/IPC
h36n18:125723:125993 [1] NCCL INFO Ring 03 : 1[1] -> 2[2] via P2P/IPC
h36n18:125727:125996 [5] NCCL INFO comm 0x200104006650 rank 5 nranks 6 cudaDev 5 nvmlDev 5 - Init COMPLETE
h36n18:125725:125992 [3] NCCL INFO comm 0x200104006650 rank 3 nranks 6 cudaDev 3 nvmlDev 3 - Init COMPLETE
h36n18:125724:125994 [2] NCCL INFO comm 0x200104006650 rank 2 nranks 6 cudaDev 2 nvmlDev 2 - Init COMPLETE
h36n18:125722:125971 [0] NCCL INFO Using 256 threads, Min Comp Cap 7, Trees disabled
h36n18:125722:125971 [0] NCCL INFO comm 0x20040c006650 rank 0 nranks 6 cudaDev 0 nvmlDev 0 - Init COMPLETE
h36n18:125722:125722 [0] NCCL INFO Launch mode Parallel
building BERT model ...
h36n18:125726:125995 [4] NCCL INFO comm 0x200104006650 rank 4 nranks 6 cudaDev 4 nvmlDev 4 - Init COMPLETE
h36n18:125723:125993 [1] NCCL INFO comm 0x200104006650 rank 1 nranks 6 cudaDev 1 nvmlDev 1 - Init COMPLETE
 > number of parameters on model parallel rank 0: 2799983247
h36n18:125722:126579 [0] NCCL INFO Setting affinity for GPU 0 to 0fffff,ffffffff,ffffffff
h36n18:125722:126579 [0] NCCL INFO Using 256 threads, Min Comp Cap 7, Trees enabled up to size -2
h36n18:125722:126579 [0] NCCL INFO comm 0x200404006620 rank 0 nranks 1 cudaDev 0 nvmlDev 0 - Init COMPLETE
 > number of parameters on model parallel rank 5: 2799983247
 > number of parameters on model parallel rank 3: 2799983247
Traceback (most recent call last):
  File "pretrain_bert_nccl.py", line 629, in <module>
    main()
  File "pretrain_bert_nccl.py", line 579, in main
    model, optimizer, lr_scheduler = setup_model_and_optimizer(args)
  File "pretrain_bert_nccl.py", line 170, in setup_model_and_optimizer
    optimizer = get_optimizer(model, args)
  File "pretrain_bert_nccl.py", line 141, in get_optimizer
    'delayed_shift': args.hysteresis})
  File "/autofs/nccs-svm1_proj/bif120/deepforce/scripts/deepspeed/DeepSpeedExamples/Megatron-LM/fp16/fp16.py", line 198, in __init__
    master_param = param.detach().clone().float()
RuntimeError: CUDA out of memory. Tried to allocate 36.00 MiB (GPU 0; 15.75 GiB total capacity; 14.50 GiB already allocated; 16.94 MiB free; 373.95 MiB cached; 0 bytes inactive)
 > number of parameters on model parallel rank 2: 2799983247
 > number of parameters on model parallel rank 1: 2799983247
 > number of parameters on model parallel rank 4: 2799983247

  use_npy_data_loader .......... False
  train_data_path .............. 
  val_data_path ................ 
  test_data_path ............... 
  input_data_sizes_file ........ sizes.txt
  delim ........................ ,
  text_key ..................... sentence
  eval_text_key ................ None
  valid_data ................... None
  split ........................ 949,50,1
  test_data .................... None
  lazy_loader .................. False
  loose_json ................... False
  presplit_sentences ........... True
  num_workers .................. 2
  tokenizer_model_type ......... /ccs/proj/bif120/deepforce/scripts/deepspeed/bio-bfd/
  tokenizer_path ............... tokenizer.model
  tokenizer_type ............... BertWordPieceTokenizer
  cache_dir .................... /gpfs/alpine/proj-shared/bif120/dataset/bfd100/models/deepspeed/cache/
  use_tfrecords ................ True
  seq_length ................... 512
  max_preds_per_seq ............ 76
  deepspeed .................... True
  deepspeed_config ............. /ccs/proj/bif120/deepforce/scripts/deepspeed/bio-bfd/ds_bert_config.json
  deepscale .................... False
  deepscale_config ............. None
  deepspeed_mpi ................ True
  sharedfile ................... /gpfs/alpine/proj-shared/bif120/dataset/bfd100/models/deepspeed/test/.sharedfile
  cuda ......................... True
  rank ......................... 0
  world_size ................... 6
  dynamic_loss_scale ........... True
> initializing model parallel cuda seeds on global rank 0, model parallel rank 0, and data parallel rank 0 with model parallel seed: 3952 and data parallel seed: 1234
2020-02-29 05:07:35.425203: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcudart.so.10.1
WARNING: Logging before flag parsing goes to stderr.
W0229 05:07:38.074505 35184372395936 deprecation_wrapper.py:119] From /autofs/nccs-svm1_proj/bif120/deepforce/scripts/deepspeed/DeepSpeedExamples/Megatron-LM/data_utils/tf_dl.py:46: The name tf.ConfigProto is deprecated. Please use tf.compat.v1.ConfigProto instead.

W0229 05:07:38.074888 35184372395936 deprecation_wrapper.py:119] From /autofs/nccs-svm1_proj/bif120/deepforce/scripts/deepspeed/DeepSpeedExamples/Megatron-LM/data_utils/tf_dl.py:55: The name tf.set_random_seed is deprecated. Please use tf.compat.v1.set_random_seed instead.

W0229 05:07:38.075031 35184372395936 deprecation_wrapper.py:119] From /autofs/nccs-svm1_proj/bif120/deepforce/scripts/deepspeed/DeepSpeedExamples/Megatron-LM/data_utils/tf_dl.py:66: The name tf.FixedLenFeature is deprecated. Please use tf.io.FixedLenFeature instead.

2020-02-29 05:07:38.075261: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcuda.so.1
2020-02-29 05:07:38.078041: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1640] Found device 0 with properties: 
name: Tesla V100-SXM2-16GB major: 7 minor: 0 memoryClockRate(GHz): 1.53
pciBusID: 0004:04:00.0
2020-02-29 05:07:38.080565: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1640] Found device 1 with properties: 
name: Tesla V100-SXM2-16GB major: 7 minor: 0 memoryClockRate(GHz): 1.53
pciBusID: 0004:05:00.0
2020-02-29 05:07:38.083095: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1640] Found device 2 with properties: 
name: Tesla V100-SXM2-16GB major: 7 minor: 0 memoryClockRate(GHz): 1.53
pciBusID: 0004:06:00.0
2020-02-29 05:07:38.085669: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1640] Found device 3 with properties: 
name: Tesla V100-SXM2-16GB major: 7 minor: 0 memoryClockRate(GHz): 1.53
pciBusID: 0035:03:00.0
2020-02-29 05:07:38.088239: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1640] Found device 4 with properties: 
name: Tesla V100-SXM2-16GB major: 7 minor: 0 memoryClockRate(GHz): 1.53
pciBusID: 0035:04:00.0
2020-02-29 05:07:38.090805: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1640] Found device 5 with properties: 
name: Tesla V100-SXM2-16GB major: 7 minor: 0 memoryClockRate(GHz): 1.53
pciBusID: 0035:05:00.0
2020-02-29 05:07:38.090827: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcudart.so.10.1
2020-02-29 05:07:38.090887: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcublas.so.10
2020-02-29 05:07:38.090926: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcufft.so.10
2020-02-29 05:07:38.090965: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcurand.so.10
2020-02-29 05:07:38.092861: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcusolver.so.10
2020-02-29 05:07:38.092907: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcusparse.so.10
2020-02-29 05:07:38.092946: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcudnn.so.7
2020-02-29 05:07:38.123406: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1763] Adding visible gpu devices: 0, 1, 2, 3, 4, 5
2020-02-29 05:07:38.130912: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1181] Device interconnect StreamExecutor with strength 1 edge matrix:
2020-02-29 05:07:38.130926: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1187]      
W0229 05:07:38.154345 35184372395936 deprecation.py:323] From /ccs/proj/bif120/deepforce/virtualenvs/ibm_wml_ce-1.6.1-3/lib/python3.6/site-packages/tensorflow/python/data/util/random_seed.py:58: add_dispatch_support.<locals>.wrapper (from tensorflow.python.ops.array_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use tf.where in 2.0, which has the same broadcast rule as np.where
W0229 05:07:39.526942 35184372395936 lazy_loader.py:50] 
The TensorFlow contrib module will not be included in TensorFlow 2.0.
For more information, please see:
  * https://github.com/tensorflow/community/blob/master/rfcs/20180907-contrib-sunset.md
  * https://github.com/tensorflow/addons
  * https://github.com/tensorflow/io (for I/O related ops)
If you depend on functionality not listed there, please file an issue.

W0229 05:07:39.527102 35184372395936 deprecation.py:323] From /autofs/nccs-svm1_proj/bif120/deepforce/scripts/deepspeed/DeepSpeedExamples/Megatron-LM/data_utils/tf_dl.py:86: parallel_interleave (from tensorflow.contrib.data.python.ops.interleave_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use `tf.data.experimental.parallel_interleave(...)`.
W0229 05:07:39.527187 35184372395936 deprecation.py:323] From /ccs/proj/bif120/deepforce/virtualenvs/ibm_wml_ce-1.6.1-3/lib/python3.6/site-packages/tensorflow/contrib/data/python/ops/interleave_ops.py:77: parallel_interleave (from tensorflow.python.data.experimental.ops.interleave_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use `tf.data.Dataset.interleave(map_func, cycle_length, block_length, num_parallel_calls=tf.data.experimental.AUTOTUNE)` instead. If sloppy execution is desired, use `tf.data.Options.experimental_determinstic`.
2020-02-29 05:07:39.553327: W tensorflow/core/common_runtime/eager/context.cc:371] Added two functions with the same name: __inference_Dataset_flat_map_read_one_file_28
W0229 05:07:39.556849 35184372395936 deprecation.py:323] From /autofs/nccs-svm1_proj/bif120/deepforce/scripts/deepspeed/DeepSpeedExamples/Megatron-LM/data_utils/tf_dl.py:96: map_and_batch (from tensorflow.contrib.data.python.ops.batching) is deprecated and will be removed in a future version.
Instructions for updating:
Use `tf.data.experimental.map_and_batch(...)`.
W0229 05:07:39.556953 35184372395936 deprecation.py:323] From /ccs/proj/bif120/deepforce/virtualenvs/ibm_wml_ce-1.6.1-3/lib/python3.6/site-packages/tensorflow/contrib/data/python/ops/batching.py:273: map_and_batch (from tensorflow.python.data.experimental.ops.batching) is deprecated and will be removed in a future version.
Instructions for updating:
Use `tf.data.Dataset.map(map_func, num_parallel_calls)` followed by `tf.data.Dataset.batch(batch_size, drop_remainder)`. Static tf.data optimizations will take care of using the fused implementation.
W0229 05:07:39.559207 35184372395936 deprecation_wrapper.py:119] From /autofs/nccs-svm1_proj/bif120/deepforce/scripts/deepspeed/DeepSpeedExamples/Megatron-LM/data_utils/tf_dl.py:116: The name tf.parse_single_example is deprecated. Please use tf.io.parse_single_example instead.

W0229 05:07:39.570396 35184372395936 deprecation.py:323] From /autofs/nccs-svm1_proj/bif120/deepforce/scripts/deepspeed/DeepSpeedExamples/Megatron-LM/data_utils/tf_dl.py:119: to_int32 (from tensorflow.python.ops.math_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use `tf.cast` instead.
configuring data
loading BertWordPieceTokenizer ( /ccs/proj/bif120/deepforce/scripts/deepspeed/bio-bfd/ ) from cache_dir  /gpfs/alpine/proj-shared/bif120/dataset/bfd100/models/deepspeed/cache/
loaded /ccs/proj/bif120/deepforce/scripts/deepspeed/bio-bfd/
> padded vocab (size: 30) with 0 dummy tokens (new size: 30)
h36n18:127714:127714 [0] NCCL INFO NET/Socket : Using [0]ib0:10.41.20.224<0>
h36n18:127714:127714 [0] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so).
h36n18:127714:127714 [0] NCCL INFO NET/IB : Using [0]mlx5_1:1/IB [1]mlx5_3:1/IB [2]mlx5_0:1/IB [3]mlx5_2:1/IB ; OOB ib0:10.41.20.224<0>
NCCL version 2.4.7nvb1+cuda10.1
h36n18:127718:127718 [4] NCCL INFO NET/Socket : Using [0]ib0:10.41.20.224<0>
h36n18:127718:127718 [4] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so).
h36n18:127719:127719 [5] NCCL INFO NET/Socket : Using [0]ib0:10.41.20.224<0>
h36n18:127719:127719 [5] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so).
h36n18:127717:127717 [3] NCCL INFO NET/Socket : Using [0]ib0:10.41.20.224<0>
h36n18:127717:127717 [3] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so).
h36n18:127714:127963 [0] NCCL INFO Setting affinity for GPU 0 to 0fffff,ffffffff,ffffffff
h36n18:127716:127716 [2] NCCL INFO NET/Socket : Using [0]ib0:10.41.20.224<0>
h36n18:127715:127715 [1] NCCL INFO NET/Socket : Using [0]ib0:10.41.20.224<0>
h36n18:127716:127716 [2] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so).
h36n18:127715:127715 [1] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so).
h36n18:127715:127715 [1] NCCL INFO NET/IB : Using [0]mlx5_1:1/IB [1]mlx5_3:1/IB [2]mlx5_0:1/IB [3]mlx5_2:1/IB ; OOB ib0:10.41.20.224<0>
h36n18:127719:127719 [5] NCCL INFO NET/IB : Using [0]mlx5_1:1/IB [1]mlx5_3:1/IB [2]mlx5_0:1/IB [3]mlx5_2:1/IB ; OOB ib0:10.41.20.224<0>
h36n18:127716:127716 [2] NCCL INFO NET/IB : Using [0]mlx5_1:1/IB [1]mlx5_3:1/IB [2]mlx5_0:1/IB [3]mlx5_2:1/IB ; OOB ib0:10.41.20.224<0>
h36n18:127717:127717 [3] NCCL INFO NET/IB : Using [0]mlx5_1:1/IB [1]mlx5_3:1/IB [2]mlx5_0:1/IB [3]mlx5_2:1/IB ; OOB ib0:10.41.20.224<0>
h36n18:127718:127718 [4] NCCL INFO NET/IB : Using [0]mlx5_1:1/IB [1]mlx5_3:1/IB [2]mlx5_0:1/IB [3]mlx5_2:1/IB ; OOB ib0:10.41.20.224<0>
h36n18:127716:127984 [2] NCCL INFO Setting affinity for GPU 2 to 0fffff,ffffffff,ffffffff
h36n18:127715:127985 [1] NCCL INFO Setting affinity for GPU 1 to 0fffff,ffffffff,ffffffff
h36n18:127719:127986 [5] NCCL INFO Setting affinity for GPU 5 to ffffffff,ff000000,00000000,00000000
h36n18:127717:127987 [3] NCCL INFO Setting affinity for GPU 3 to ffffffff,ff000000,00000000,00000000
h36n18:127718:127988 [4] NCCL INFO Setting affinity for GPU 4 to ffffffff,ff000000,00000000,00000000
h36n18:127714:127963 [0] NCCL INFO Duplicating rings to 4 per user request.
h36n18:127714:127963 [0] NCCL INFO Channel 00 :    0   1   2   3   4   5
h36n18:127714:127963 [0] NCCL INFO Channel 01 :    0   1   2   3   4   5
h36n18:127714:127963 [0] NCCL INFO Channel 02 :    0   1   2   3   4   5
h36n18:127714:127963 [0] NCCL INFO Channel 03 :    0   1   2   3   4   5
h36n18:127715:127985 [1] NCCL INFO Ring 00 : 1[1] -> 2[2] via P2P/IPC
h36n18:127719:127986 [5] NCCL INFO Ring 00 : 5[5] -> 0[0] via P2P/IPC
h36n18:127718:127988 [4] NCCL INFO Ring 00 : 4[4] -> 5[5] via P2P/IPC
h36n18:127717:127987 [3] NCCL INFO Ring 00 : 3[3] -> 4[4] via P2P/IPC
h36n18:127716:127984 [2] NCCL INFO Ring 00 : 2[2] -> 3[3] via P2P/IPC
h36n18:127714:127963 [0] NCCL INFO Ring 00 : 0[0] -> 1[1] via P2P/IPC
h36n18:127719:127986 [5] NCCL INFO Ring 01 : 5[5] -> 0[0] via P2P/IPC
h36n18:127717:127987 [3] NCCL INFO Ring 01 : 3[3] -> 4[4] via P2P/IPC
h36n18:127716:127984 [2] NCCL INFO Ring 01 : 2[2] -> 3[3] via P2P/IPC
h36n18:127714:127963 [0] NCCL INFO Ring 01 : 0[0] -> 1[1] via P2P/IPC
h36n18:127715:127985 [1] NCCL INFO Ring 01 : 1[1] -> 2[2] via P2P/IPC
h36n18:127718:127988 [4] NCCL INFO Ring 01 : 4[4] -> 5[5] via P2P/IPC
h36n18:127719:127986 [5] NCCL INFO Ring 02 : 5[5] -> 0[0] via P2P/IPC
h36n18:127717:127987 [3] NCCL INFO Ring 02 : 3[3] -> 4[4] via P2P/IPC
h36n18:127716:127984 [2] NCCL INFO Ring 02 : 2[2] -> 3[3] via P2P/IPC
h36n18:127714:127963 [0] NCCL INFO Ring 02 : 0[0] -> 1[1] via P2P/IPC
h36n18:127715:127985 [1] NCCL INFO Ring 02 : 1[1] -> 2[2] via P2P/IPC
h36n18:127718:127988 [4] NCCL INFO Ring 02 : 4[4] -> 5[5] via P2P/IPC
h36n18:127719:127986 [5] NCCL INFO Ring 03 : 5[5] -> 0[0] via P2P/IPC
h36n18:127717:127987 [3] NCCL INFO Ring 03 : 3[3] -> 4[4] via P2P/IPC
h36n18:127716:127984 [2] NCCL INFO Ring 03 : 2[2] -> 3[3] via P2P/IPC
h36n18:127714:127963 [0] NCCL INFO Ring 03 : 0[0] -> 1[1] via P2P/IPC
h36n18:127715:127985 [1] NCCL INFO Ring 03 : 1[1] -> 2[2] via P2P/IPC
h36n18:127718:127988 [4] NCCL INFO Ring 03 : 4[4] -> 5[5] via P2P/IPC
h36n18:127719:127986 [5] NCCL INFO comm 0x200104006650 rank 5 nranks 6 cudaDev 5 nvmlDev 5 - Init COMPLETE
h36n18:127717:127987 [3] NCCL INFO comm 0x200104006650 rank 3 nranks 6 cudaDev 3 nvmlDev 3 - Init COMPLETE
h36n18:127716:127984 [2] NCCL INFO comm 0x200104006650 rank 2 nranks 6 cudaDev 2 nvmlDev 2 - Init COMPLETE
h36n18:127714:127963 [0] NCCL INFO Using 256 threads, Min Comp Cap 7, Trees disabled
h36n18:127714:127963 [0] NCCL INFO comm 0x20040c006650 rank 0 nranks 6 cudaDev 0 nvmlDev 0 - Init COMPLETE
h36n18:127714:127714 [0] NCCL INFO Launch mode Parallel
h36n18:127715:127985 [1] NCCL INFO comm 0x200104006650 rank 1 nranks 6 cudaDev 1 nvmlDev 1 - Init COMPLETE
h36n18:127718:127988 [4] NCCL INFO comm 0x200104006650 rank 4 nranks 6 cudaDev 4 nvmlDev 4 - Init COMPLETE
building BERT model ...
 > number of parameters on model parallel rank 0: 1381032967
 > number of parameters on model parallel rank 1: 1381032967
 > number of parameters on model parallel rank 5: 1381032967
 > number of parameters on model parallel rank 3: 1381032967
 > number of parameters on model parallel rank 2: 1381032967
h36n18:127714:128267 [0] NCCL INFO Setting affinity for GPU 0 to 0fffff,ffffffff,ffffffff
h36n18:127714:128267 [0] NCCL INFO Using 256 threads, Min Comp Cap 7, Trees enabled up to size -2
h36n18:127714:128267 [0] NCCL INFO comm 0x200404006620 rank 0 nranks 1 cudaDev 0 nvmlDev 0 - Init COMPLETE
 > number of parameters on model parallel rank 4: 1381032967
NCCL version 2.4.7nvb1+cuda10.1
h36n18:127715:128279 [1] NCCL INFO Setting affinity for GPU 1 to 0fffff,ffffffff,ffffffff
h36n18:127715:128279 [1] NCCL INFO Using 256 threads, Min Comp Cap 7, Trees enabled up to size -2
h36n18:127715:128279 [1] NCCL INFO comm 0x2001c8006620 rank 0 nranks 1 cudaDev 1 nvmlDev 1 - Init COMPLETE
NCCL version 2.4.7nvb1+cuda10.1
h36n18:127719:128281 [5] NCCL INFO Setting affinity for GPU 5 to ffffffff,ff000000,00000000,00000000
h36n18:127719:128281 [5] NCCL INFO Using 256 threads, Min Comp Cap 7, Trees enabled up to size -2
h36n18:127719:128281 [5] NCCL INFO comm 0x2001ec006620 rank 0 nranks 1 cudaDev 5 nvmlDev 5 - Init COMPLETE
NCCL version 2.4.7nvb1+cuda10.1
h36n18:127716:128283 [2] NCCL INFO Setting affinity for GPU 2 to 0fffff,ffffffff,ffffffff
h36n18:127716:128283 [2] NCCL INFO Using 256 threads, Min Comp Cap 7, Trees enabled up to size -2
h36n18:127716:128283 [2] NCCL INFO comm 0x200340006620 rank 0 nranks 1 cudaDev 2 nvmlDev 2 - Init COMPLETE
NCCL version 2.4.7nvb1+cuda10.1
h36n18:127717:128286 [3] NCCL INFO Setting affinity for GPU 3 to ffffffff,ff000000,00000000,00000000
h36n18:127717:128286 [3] NCCL INFO Using 256 threads, Min Comp Cap 7, Trees enabled up to size -2
h36n18:127717:128286 [3] NCCL INFO comm 0x200320006620 rank 0 nranks 1 cudaDev 3 nvmlDev 3 - Init COMPLETE
NCCL version 2.4.7nvb1+cuda10.1
h36n18:127718:128288 [4] NCCL INFO Setting affinity for GPU 4 to ffffffff,ff000000,00000000,00000000
h36n18:127718:128288 [4] NCCL INFO Using 256 threads, Min Comp Cap 7, Trees enabled up to size -2
h36n18:127718:128288 [4] NCCL INFO comm 0x2001f4006620 rank 0 nranks 1 cudaDev 4 nvmlDev 4 - Init COMPLETE
h36n18:127714:128336 [0] NCCL INFO Setting affinity for GPU 0 to 0fffff,ffffffff,ffffffff
h36n18:127718:128337 [4] NCCL INFO Setting affinity for GPU 4 to ffffffff,ff000000,00000000,00000000
h36n18:127719:128338 [5] NCCL INFO Setting affinity for GPU 5 to ffffffff,ff000000,00000000,00000000
h36n18:127715:128339 [1] NCCL INFO Setting affinity for GPU 1 to 0fffff,ffffffff,ffffffff
h36n18:127717:128341 [3] NCCL INFO Setting affinity for GPU 3 to ffffffff,ff000000,00000000,00000000
h36n18:127716:128340 [2] NCCL INFO Setting affinity for GPU 2 to 0fffff,ffffffff,ffffffff
h36n18:127714:128336 [0] NCCL INFO Duplicating rings to 4 per user request.
h36n18:127714:128336 [0] NCCL INFO Channel 00 :    0   1   2   3   4   5
h36n18:127714:128336 [0] NCCL INFO Channel 01 :    0   1   2   3   4   5
h36n18:127714:128336 [0] NCCL INFO Channel 02 :    0   1   2   3   4   5
h36n18:127714:128336 [0] NCCL INFO Channel 03 :    0   1   2   3   4   5
h36n18:127719:128338 [5] NCCL INFO Ring 00 : 5[5] -> 0[0] via P2P/IPC
h36n18:127715:128339 [1] NCCL INFO Ring 00 : 1[1] -> 2[2] via P2P/IPC
h36n18:127718:128337 [4] NCCL INFO Ring 00 : 4[4] -> 5[5] via P2P/IPC
h36n18:127717:128341 [3] NCCL INFO Ring 00 : 3[3] -> 4[4] via P2P/IPC
h36n18:127714:128336 [0] NCCL INFO Ring 00 : 0[0] -> 1[1] via P2P/IPC
h36n18:127716:128340 [2] NCCL INFO Ring 00 : 2[2] -> 3[3] via P2P/IPC
h36n18:127719:128338 [5] NCCL INFO Ring 01 : 5[5] -> 0[0] via P2P/IPC
h36n18:127715:128339 [1] NCCL INFO Ring 01 : 1[1] -> 2[2] via P2P/IPC
h36n18:127718:128337 [4] NCCL INFO Ring 01 : 4[4] -> 5[5] via P2P/IPC
h36n18:127717:128341 [3] NCCL INFO Ring 01 : 3[3] -> 4[4] via P2P/IPC
h36n18:127714:128336 [0] NCCL INFO Ring 01 : 0[0] -> 1[1] via P2P/IPC
h36n18:127716:128340 [2] NCCL INFO Ring 01 : 2[2] -> 3[3] via P2P/IPC
h36n18:127719:128338 [5] NCCL INFO Ring 02 : 5[5] -> 0[0] via P2P/IPC
h36n18:127715:128339 [1] NCCL INFO Ring 02 : 1[1] -> 2[2] via P2P/IPC
h36n18:127718:128337 [4] NCCL INFO Ring 02 : 4[4] -> 5[5] via P2P/IPC
h36n18:127717:128341 [3] NCCL INFO Ring 02 : 3[3] -> 4[4] via P2P/IPC
h36n18:127714:128336 [0] NCCL INFO Ring 02 : 0[0] -> 1[1] via P2P/IPC
h36n18:127716:128340 [2] NCCL INFO Ring 02 : 2[2] -> 3[3] via P2P/IPC
h36n18:127719:128338 [5] NCCL INFO Ring 03 : 5[5] -> 0[0] via P2P/IPC
h36n18:127715:128339 [1] NCCL INFO Ring 03 : 1[1] -> 2[2] via P2P/IPC
h36n18:127718:128337 [4] NCCL INFO Ring 03 : 4[4] -> 5[5] via P2P/IPC
h36n18:127717:128341 [3] NCCL INFO Ring 03 : 3[3] -> 4[4] via P2P/IPC
h36n18:127714:128336 [0] NCCL INFO Ring 03 : 0[0] -> 1[1] via P2P/IPC
h36n18:127716:128340 [2] NCCL INFO Ring 03 : 2[2] -> 3[3] via P2P/IPC
h36n18:127719:128338 [5] NCCL INFO comm 0x200408006620 rank 5 nranks 6 cudaDev 5 nvmlDev 5 - Init COMPLETE
h36n18:127715:128339 [1] NCCL INFO comm 0x200424006620 rank 1 nranks 6 cudaDev 1 nvmlDev 1 - Init COMPLETE
h36n18:127718:128337 [4] NCCL INFO comm 0x200410006620 rank 4 nranks 6 cudaDev 4 nvmlDev 4 - Init COMPLETE
h36n18:127717:128341 [3] NCCL INFO comm 0x20033c006620 rank 3 nranks 6 cudaDev 3 nvmlDev 3 - Init COMPLETE
h36n18:127714:128336 [0] NCCL INFO Using 256 threads, Min Comp Cap 7, Trees disabled
h36n18:127714:128336 [0] NCCL INFO comm 0x200718006620 rank 0 nranks 6 cudaDev 0 nvmlDev 0 - Init COMPLETE
h36n18:127714:127714 [0] NCCL INFO Launch mode Parallel
h36n18:127716:128340 [2] NCCL INFO comm 0x20035c006620 rank 2 nranks 6 cudaDev 2 nvmlDev 2 - Init COMPLETE
learning rate decaying linear
Partition Activations False and Correctness Check False
Traceback (most recent call last):
  File "pretrain_bert_nccl.py", line 629, in <module>
    main()
  File "pretrain_bert_nccl.py", line 607, in main
    timers, args)
  File "pretrain_bert_nccl.py", line 338, in train
    args, timers)
  File "pretrain_bert_nccl.py", line 297, in train_step
    nsp_loss, args)
  File "pretrain_bert_nccl.py", line 272, in backward_step
    optimizer.update_master_grads()
  File "/autofs/nccs-svm1_proj/bif120/deepforce/scripts/deepspeed/DeepSpeedExamples/Megatron-LM/fp16/fp16.py", line 566, in update_master_grads
    self._model_grads_to_master_grads()
  File "/autofs/nccs-svm1_proj/bif120/deepforce/scripts/deepspeed/DeepSpeedExamples/Megatron-LM/fp16/fp16.py", line 303, in _model_grads_to_master_grads
    model_grads_to_master_grads(fp16_group, fp32_from_fp16_group)
  File "/autofs/nccs-svm1_proj/bif120/deepforce/scripts/deepspeed/DeepSpeedExamples/Megatron-LM/fp16/fp16util.py", line 167, in model_grads_to_master_grads
    master.grad = Variable(master.data.new(*master.data.size()))
RuntimeError: CUDA out of memory. Tried to allocate 24.00 MiB (GPU 0; 15.75 GiB total capacity; 14.04 GiB already allocated; 580.94 MiB free; 200.72 MiB cached; 0 bytes inactive)
Traceback (most recent call last):
  File "pretrain_bert_nccl.py", line 629, in <module>
    main()
  File "pretrain_bert_nccl.py", line 607, in main
    timers, args)
  File "pretrain_bert_nccl.py", line 338, in train
    args, timers)
  File "pretrain_bert_nccl.py", line 297, in train_step
    nsp_loss, args)
  File "pretrain_bert_nccl.py", line 272, in backward_step
    optimizer.update_master_grads()
  File "/autofs/nccs-svm1_proj/bif120/deepforce/scripts/deepspeed/DeepSpeedExamples/Megatron-LM/fp16/fp16.py", line 566, in update_master_grads
    self._model_grads_to_master_grads()
  File "/autofs/nccs-svm1_proj/bif120/deepforce/scripts/deepspeed/DeepSpeedExamples/Megatron-LM/fp16/fp16.py", line 303, in _model_grads_to_master_grads
    model_grads_to_master_grads(fp16_group, fp32_from_fp16_group)
  File "/autofs/nccs-svm1_proj/bif120/deepforce/scripts/deepspeed/DeepSpeedExamples/Megatron-LM/fp16/fp16util.py", line 167, in model_grads_to_master_grads
    master.grad = Variable(master.data.new(*master.data.size()))
RuntimeError: CUDA out of memory. Tried to allocate 24.00 MiB (GPU 2; 15.75 GiB total capacity; 14.13 GiB already allocated; 586.94 MiB free; 188.72 MiB cached; 0 bytes inactive)
Traceback (most recent call last):
  File "pretrain_bert_nccl.py", line 629, in <module>
    main()
  File "pretrain_bert_nccl.py", line 607, in main
    timers, args)
  File "pretrain_bert_nccl.py", line 338, in train
    args, timers)
  File "pretrain_bert_nccl.py", line 297, in train_step
    nsp_loss, args)
  File "pretrain_bert_nccl.py", line 272, in backward_step
    optimizer.update_master_grads()
  File "/autofs/nccs-svm1_proj/bif120/deepforce/scripts/deepspeed/DeepSpeedExamples/Megatron-LM/fp16/fp16.py", line 566, in update_master_grads
    self._model_grads_to_master_grads()
  File "/autofs/nccs-svm1_proj/bif120/deepforce/scripts/deepspeed/DeepSpeedExamples/Megatron-LM/fp16/fp16.py", line 303, in _model_grads_to_master_grads
    model_grads_to_master_grads(fp16_group, fp32_from_fp16_group)
  File "/autofs/nccs-svm1_proj/bif120/deepforce/scripts/deepspeed/DeepSpeedExamples/Megatron-LM/fp16/fp16util.py", line 167, in model_grads_to_master_grads
    master.grad = Variable(master.data.new(*master.data.size()))
RuntimeError: CUDA out of memory. Tried to allocate 24.00 MiB (GPU 1; 15.75 GiB total capacity; 14.13 GiB already allocated; 582.88 MiB free; 192.72 MiB cached; 0 bytes inactive)
Traceback (most recent call last):
  File "pretrain_bert_nccl.py", line 629, in <module>
    main()
  File "pretrain_bert_nccl.py", line 607, in main
    timers, args)
  File "pretrain_bert_nccl.py", line 338, in train
    args, timers)
  File "pretrain_bert_nccl.py", line 297, in train_step
    nsp_loss, args)
  File "pretrain_bert_nccl.py", line 272, in backward_step
    optimizer.update_master_grads()
  File "/autofs/nccs-svm1_proj/bif120/deepforce/scripts/deepspeed/DeepSpeedExamples/Megatron-LM/fp16/fp16.py", line 566, in update_master_grads
    self._model_grads_to_master_grads()
  File "/autofs/nccs-svm1_proj/bif120/deepforce/scripts/deepspeed/DeepSpeedExamples/Megatron-LM/fp16/fp16.py", line 303, in _model_grads_to_master_grads
    model_grads_to_master_grads(fp16_group, fp32_from_fp16_group)
  File "/autofs/nccs-svm1_proj/bif120/deepforce/scripts/deepspeed/DeepSpeedExamples/Megatron-LM/fp16/fp16util.py", line 167, in model_grads_to_master_grads
    master.grad = Variable(master.data.new(*master.data.size()))
RuntimeError: CUDA out of memory. Tried to allocate 18.00 MiB (GPU 5; 15.75 GiB total capacity; 14.16 GiB already allocated; 554.94 MiB free; 196.72 MiB cached; 0 bytes inactive)
Traceback (most recent call last):
  File "pretrain_bert_nccl.py", line 629, in <module>
    main()
  File "pretrain_bert_nccl.py", line 607, in main
    timers, args)
  File "pretrain_bert_nccl.py", line 338, in train
    args, timers)
  File "pretrain_bert_nccl.py", line 297, in train_step
    nsp_loss, args)
  File "pretrain_bert_nccl.py", line 272, in backward_step
    optimizer.update_master_grads()
  File "/autofs/nccs-svm1_proj/bif120/deepforce/scripts/deepspeed/DeepSpeedExamples/Megatron-LM/fp16/fp16.py", line 566, in update_master_grads
    self._model_grads_to_master_grads()
  File "/autofs/nccs-svm1_proj/bif120/deepforce/scripts/deepspeed/DeepSpeedExamples/Megatron-LM/fp16/fp16.py", line 303, in _model_grads_to_master_grads
    model_grads_to_master_grads(fp16_group, fp32_from_fp16_group)
  File "/autofs/nccs-svm1_proj/bif120/deepforce/scripts/deepspeed/DeepSpeedExamples/Megatron-LM/fp16/fp16util.py", line 167, in model_grads_to_master_grads
    master.grad = Variable(master.data.new(*master.data.size()))
RuntimeError: CUDA out of memory. Tried to allocate 18.00 MiB (GPU 4; 15.75 GiB total capacity; 14.16 GiB already allocated; 554.94 MiB free; 196.72 MiB cached; 0 bytes inactive)
Traceback (most recent call last):
  File "pretrain_bert_nccl.py", line 629, in <module>
    main()
  File "pretrain_bert_nccl.py", line 607, in main
    timers, args)
  File "pretrain_bert_nccl.py", line 338, in train
    args, timers)
  File "pretrain_bert_nccl.py", line 297, in train_step
    nsp_loss, args)
  File "pretrain_bert_nccl.py", line 272, in backward_step
    optimizer.update_master_grads()
  File "/autofs/nccs-svm1_proj/bif120/deepforce/scripts/deepspeed/DeepSpeedExamples/Megatron-LM/fp16/fp16.py", line 566, in update_master_grads
    self._model_grads_to_master_grads()
  File "/autofs/nccs-svm1_proj/bif120/deepforce/scripts/deepspeed/DeepSpeedExamples/Megatron-LM/fp16/fp16.py", line 303, in _model_grads_to_master_grads
    model_grads_to_master_grads(fp16_group, fp32_from_fp16_group)
  File "/autofs/nccs-svm1_proj/bif120/deepforce/scripts/deepspeed/DeepSpeedExamples/Megatron-LM/fp16/fp16util.py", line 167, in model_grads_to_master_grads
    master.grad = Variable(master.data.new(*master.data.size()))
RuntimeError: CUDA out of memory. Tried to allocate 18.00 MiB (GPU 3; 15.75 GiB total capacity; 14.16 GiB already allocated; 558.94 MiB free; 192.72 MiB cached; 0 bytes inactive)

From my understanding of Table 8 in the paper, you were able to train both the 8 and 20 billion parameter models on 4 x 16GB GPUs using 4-way model parallelism.
In my case I am using 6-way model parallelism with batch size 1 and it doesn't work.

Did I misunderstand something?
Do you have any idea how to make it work?

ZeRO with non-zero loss scale crashes

Typically users want to use dynamic loss scaling. While developing a new feature for ZeRO, I discovered that ZeRO crashes when given a non-zero (i.e., static) loss scale value in DeepSpeed's config JSON. I've created a unit test that passes when ZeRO is disabled, and another test with ZeRO enabled that triggers this bug, so we can verify the behavior once it is fixed.
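For reference, this is the kind of fp16 section that triggers it; a non-zero loss_scale means a static loss scale (the value here is illustrative):

"fp16": {
    "enabled": true,
    "loss_scale": 128
}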

https://github.com/microsoft/DeepSpeed/blob/jeffra/zero_loss_scale_bug/tests/unit/test_fp16.py#L211-L285

Failed test with stack trace: https://dev.azure.com/DeepSpeedMSFT/DeepSpeed/_build/results?buildId=198&view=logs&j=75347757-894e-5c54-3c11-df095f4d729a&t=50de4b86-57af-55e8-ca98-b1a0d42235e2

Here's the stack trace:
[stack trace attached as an image in the original issue]

Pip install support

Hi. I looked at the install.sh file, and it looks like pip installation support is doable. It would be great to add it before somebody nabs the deepspeed package name on PyPI.

_init_distributed does not use dist_init_required

In the deepspeed_light.py script, the function _init_distributed gets passed dist_init_required, but the variable is never used. If dist_init_required is False, this causes an AssertionError even though the functionality should be disabled.
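A minimal sketch of the behavior I would expect (illustrative only, not the actual DeepSpeed code):

import torch.distributed as dist

def _init_distributed(self, dist_init_required):
    # Sketch: only touch torch.distributed when the caller actually asked for it,
    # and never re-initialize an existing process group.
    if dist_init_required and not dist.is_initialized():
        dist.init_process_group(backend="nccl")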

Support configuration for general devices and backends.

DeepSpeed currently assumes NVIDIA GPUs and the NCCL backend. It would be nice to support more general configurations; a rough sketch follows the list below.

Non-exhaustive list of things to consider:

  • Configuration mechanisms (e.g., JSON config file, deepspeed.initialize())
  • Data movement
  • Resource querying and specification: we currently query the number of local GPUs and would need to add additional capabilities for CPUs, etc.
  • Documentation: often assumes GPUs and would need to be revised to be more general (e.g., https://github.com/microsoft/DeepSpeed#resource-configuration-multi-node)
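As a rough sketch of the direction, backend selection could key off device availability rather than hard-coding NCCL (illustrative only; the configuration surface for this is an open question, and env:// rendezvous variables are assumed to be set):

import torch
import torch.distributed as dist

# Pick a communication backend based on available devices instead of assuming NCCL.
backend = "nccl" if torch.cuda.is_available() else "gloo"
dist.init_process_group(backend=backend)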

Turing NLG

Are there any plans to release the Turing NLG pre-trained model?

FP32 Mode for ZeRO

I figured out that ZeRO only works in FP16 mode. Are there any plans to also introduce an FP32 mode?

Kind regards

DDLRUN + DeepSpeed on SUMMIT

Hi,

I am trying to use deepspeed on SUMMIT using ddlrun, but it doesn't work properly.
I am testing it with cifar like:
ddlrun deepspeed cifar10_deepspeed.py --deepspeed --deepspeed_config ds_config.json

Could you please give us an example of using deepspeed with horovod, mpi, and ddlrun?

Megatron tutorial

We need to port the Megatron tutorial into docs/ and then update links to it in README.md, etc.

Initialization of two nn.Modules (e.g. generator and discriminator)

Dear DeepSpeed-Team,

first of all thank you for your effort, I was very excited to hear about this approach.

I am currently trying to build a GAN, which requires me to initialize two networks. I tried the following without success:

generator_engine, _, _, __ = deepspeed.initialize(args=args,
                                                  model=self.generator,
                                                  model_parameters=filter(
                                                      lambda p: p.requires_grad,
                                                      self.generator.parameters()),
                                                  training_data=data)
discriminator_engine, _, data_loader, __ = deepspeed.initialize(args=args,
                                                                model=self.discriminator,
                                                                model_parameters=filter(
                                                                    lambda p: p.requires_grad,
                                                                    self.discriminator.parameters()),
                                                                training_data=data)

Executing this, I get the following:

DeepSpeed info: version=0.1.0, git-hash=6d60206, git-branch=master
File "/home/deepspeed/Code/identification/generative/CPGAN/orchestrator_msggan_deepspeed.py", line 253, in train
training_data=data)
File "/usr/local/lib/python3.6/dist-packages/deepspeed/init.py", line 95, in initialize
Traceback (most recent call last):
File "identification/generative/CPGAN/train_deepspeed.py", line 186, in
collate_fn=collate_fn)
File "/usr/local/lib/python3.6/dist-packages/deepspeed/pt/deepspeed_light.py", line 123, in init
dist.init_process_group(backend="nccl")
File "/usr/local/lib/python3.6/dist-packages/torch/distributed/distributed_c10d.py", line 372, in init_process_group
main()
File "identification/generative/CPGAN/train_deepspeed.py", line 179, in main
raise RuntimeError("trying to initialize the default process group "
RuntimeError: trying to initialize the default process group twice!
save_every_n_steps=args.save_every_n_steps)
File "/home/deepspeed/Code/identification/generative/CPGAN/orchestrator_msggan_deepspeed.py", line 253, in train
training_data=data)
File "/usr/local/lib/python3.6/dist-packages/deepspeed/init.py", line 95, in initialize
collate_fn=collate_fn)
File "/usr/local/lib/python3.6/dist-packages/deepspeed/pt/deepspeed_light.py", line 123, in init
dist.init_process_group(backend="nccl")
File "/usr/local/lib/python3.6/dist-packages/torch/distributed/distributed_c10d.py", line 372, in init_process_group
raise RuntimeError("trying to initialize the default process group "
RuntimeError: trying to initialize the default process group twice!

It would be nice to get a pointer on how to tackle such a situation, especially since it is a very common use case.

Kind regards

Conda Environment Install Issue

Trying to get DeepSpeed installed for local use with a Conda environment, but it seems that DeepSpeed is not installing into the environment itself. After building the wheel, DeepSpeed is not installed into the proper Conda environment location, while Apex is installed in the proper location. It is unclear why DeepSpeed is not working but Apex is.

Support old and new apex optimizer fusion

apex now provides a new way of optimizer fusion which is incompatible with the old way. One key API difference is that the old step() required additional arguments, while the new step() takes no arguments.

This work item will provide a smoother transition for users from old to new fusion by making desired fusion style a configuration option.

Following CIFAR Tutorial but Code Forcing RANK variable

I am trying to get DeepSpeed working and have been following the CIFAR tutorial example. In the example, local_rank=-1 and dist_init_required=None since it runs on a single system (not distributed). However, it seems that I am forced to have RANK, LOCAL_RANK, and other distributed environment variables set. Should dist_init_required=False?
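For reference, the call looks roughly like this (a sketch following the CIFAR tutorial; whether passing dist_init_required=False is the right answer is exactly my question):

import deepspeed

# Sketch based on the CIFAR tutorial; 'net', 'parameters', and 'trainset' come from the
# tutorial code. dist_init_required=False is the behavior I am asking about for a
# single, non-distributed machine.
model_engine, optimizer, trainloader, _ = deepspeed.initialize(
    args=args,
    model=net,
    model_parameters=parameters,
    training_data=trainset,
    dist_init_required=False)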

Load model checkpoint without loading the optimizer states.

Extend the load_checkpoint API to allow loading a checkpoint without loading the optimizer states. This is useful during evaluation and fine-tuning. We need to make sure the FP32 master parameters are loaded alongside the FP16 weights to avoid immediate model divergence when the model is loaded without the optimizer states.
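A sketch of what the extended API could look like (the load_optimizer_states flag name is just a proposal here, not the current signature):

# Proposed usage: load the model weights only, skipping optimizer state.
load_path, client_state = model_engine.load_checkpoint(
    checkpoint_dir, tag, load_optimizer_states=False)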

Optimization for a Single GPU

Hi,
Thanks for this great work!

Just one question: Is it possible to train my model on a single GPU using this library and obtain the reported optimization benefits in memory consumption/training efficiency, or is this only achievable when using multiple GPUs?

Update default deepspeed config

Simplify the DeepSpeed config JSON in the README to show reasonable defaults for getting started with DeepSpeed (e.g., disable_allgather is currently included and probably doesn't need to be).
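One possible simplified starting point (illustrative values; only commonly needed keys):

{
  "train_batch_size": 32,
  "optimizer": {
    "type": "Adam",
    "params": {
      "lr": 0.001
    }
  },
  "fp16": {
    "enabled": true
  }
}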

Local Install Issue - Apex

I noticed that in the install.sh file, when doing a local install (sh install.sh -l), the installer seems to uninstall and reinstall Apex twice. Is this normal behavior?

Config and core arguments API docstrings

The documentation for add_core_arguments() and add_config_arguments() could be expanded. Can we document which arguments are added in the docstrings? (These could be unit tested too.)

train_batch_size + dataset + actual batch size

Hello,

I have 4 questions for clarification:

  1. Why should we pass the training_data to deepspeed.initialize to generate a new trainloader rather than using a normal torch trainloader?
  2. Can we use a custom PyTorch trainloader in case we have a custom dataset that returns, for example, inputs, outputs, and a mask?
  3. If the actual batch size passed to the model differs from the train_batch_size in the JSON file, what will happen?
  4. Can we define only gradient_accumulation_steps and train_micro_batch_size_per_gpu and leave DeepSpeed to calculate train_batch_size automatically? (See the illustrative numbers below.)
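For question 4, this is my understanding of the relationship between the three knobs (illustrative numbers only):

# If only the first two were given, train_batch_size could in principle be derived:
train_micro_batch_size_per_gpu = 4
gradient_accumulation_steps = 2
world_size = 8  # number of data-parallel GPUs
train_batch_size = train_micro_batch_size_per_gpu * gradient_accumulation_steps * world_size
print(train_batch_size)  # 64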

DeepSpeed using DistributedSampler with model parallelism

DeepSpeed's data loader will use DistributedSampler by default unless another is provided:

data_sampler = DistributedSampler(dataset)

If DeepSpeed is configured with model parallelism, or called from a library with a sub-group of the world processes, the default behavior of DistributedSampler is incorrect because it queries the global world size and rank information. We should specify num_replicas and rank when creating the sampler.

If mpu is provided to deepspeed.initialize(), we should query mpu.get_data_parallel_world_size() and mpu.get_data_parallel_rank() and forward that information to the sampler.
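A minimal sketch of the proposed fix, assuming an mpu object with the usual Megatron-style accessors:

from torch.utils.data.distributed import DistributedSampler

# Size the sampler by the data-parallel group, not the global world, when an mpu is given.
if mpu is not None:
    data_sampler = DistributedSampler(
        dataset,
        num_replicas=mpu.get_data_parallel_world_size(),
        rank=mpu.get_data_parallel_rank())
else:
    data_sampler = DistributedSampler(dataset)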

NoneType has no attribute to

I get the following error when I run my training. When I comment out the line that triggers it, the training runs but the loss doesn't decrease.

File "/storage/home/ec2-user/ner/trainer/trainer.py", line 131, in _train_epoch
  self.model.step()
File "/home/ec2-user/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/deepspeed/pt/deepspeed_light.py", line 628, in step
  fp32_param.grad = fp16_param.grad.to(fp32_param.dtype)
AttributeError: 'NoneType' object has no attribute 'to'
  self.optimizer.step()
File "/home/ec2-user/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/deepspeed/pt/fp16_unfused_optimizer.py", line 165, in step
  fp32_param.grad = fp16_param.grad.to(fp32_param.dtype)
AttributeError: 'NoneType' object has no attribute 'to'
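One possible explanation is a parameter that never receives a gradient (e.g., it is not used in the loss), so its .grad stays None. A defensive sketch of the copy step (hypothetical, not the actual DeepSpeed fix; the group names follow the traceback above):

# Skip parameters whose gradient was never populated instead of calling .to() on None.
for fp16_param, fp32_param in zip(fp16_group, fp32_from_fp16_group):
    if fp16_param.grad is not None:
        fp32_param.grad = fp16_param.grad.to(fp32_param.dtype)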

TypeError: FP16_DeepSpeedZeroOptimizer is not an Optimizer

I'm trying to use the 1-Cycle scheduler, but I get the following error:

TypeError: FP16_DeepSpeedZeroOptimizer is not an Optimizer


Here is my configuration file:

{
    "train_batch_size": 64,
    "train_micro_batch_size_per_gpu": 1,
    "gradient_accumulation_steps": 16,
    "optimizer": {
        "type": "Adam",
        "params": {
            "lr": 3e-05,
            "betas": [
                0.9,
                0.999
            ],
            "eps": 1e-8,
            "weight_decay": 0.01
        }
    },
    "gradient_clipping": 0.1,
    "scheduler": {
        "type": "OneCycle",
        "params": {
            "cycle_first_step_size": 16000,
            "cycle_first_stair_count": 8000,
            "decay_step_size": 16000,
            "cycle_min_lr": 1e-06,
            "cycle_max_lr": 3e-05,
            "decay_lr_rate": 1e-07,
            "cycle_min_mom": 0.85,
            "cycle_max_mom": 0.99,
            "decay_mom_rate": 0.0
        }
    },
    "zero_optimization": true,
    "disable_allgather": true,
    "fp16": {
        "enabled": true,
        "loss_scale": 0,
        "min_loss_scale": 1
    }
}

When using another scheduler (with FP16), I have no problem.

batch config issue?

There are a few things in the train batch size configuration that do not seem correct to me, and a few things that we do not currently support.

  1. The following assertion (see the illustrative check below)

train_batch_size == train_micro_batch_size_per_gpu * gradient_accumulation_steps * world_size

should always hold, but currently it does not in some cases.
For example, when train_micro_batch_size_per_gpu and gradient_accumulation_steps are None in the ds_config, they are initialized to train_batch_size and 1 respectively, which leads to

train_batch_size == train_batch_size * 1 * world_size

  2. If train_micro_batch_size_per_gpu > per_device_batch_size, we should throw a config error. Currently, it is assigned to be equal to per_device_batch_size.

  3. We do not currently support the user providing only train_micro_batch_size_per_gpu, or train_micro_batch_size_per_gpu together with gradient_accumulation_steps.
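An illustrative version of the consistency check described in item 1 (a sketch, not the actual implementation):

def check_batch_config(train_batch_size, micro_batch_per_gpu, grad_accum_steps, world_size):
    # The three batch knobs and the world size must agree exactly.
    expected = micro_batch_per_gpu * grad_accum_steps * world_size
    assert train_batch_size == expected, (
        f"train_batch_size={train_batch_size} but "
        f"{micro_batch_per_gpu} * {grad_accum_steps} * {world_size} = {expected}")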

Error while initializing multiple models

Hi.

I'm trying to use DeepSpeed in my code with multiple models, but I got an error like the one below. Do you have any idea how to solve this issue? Thanks in advance.

  File "train_ds.py", line 98, in <module>
    solver = Solver(opt)
  File "/data2/1konny/svg/solver_ds.py", line 40, in __init__
    self.init_models_and_optimizers()
  File "/data2/1konny/svg/solver_ds.py", line 117, in init_models_and_optimizers
    self.decoder, self.decoder_optimizer, _, _ = ds.initialize(opt, model=decoder, model_parameters=decoder_params)
  File "/usr/local/lib/python3.6/dist-packages/deepspeed/__init__.py", line 87, in initialize
    collate_fn=collate_fn)
  File "/usr/local/lib/python3.6/dist-packages/deepspeed/pt/deepspeed_light.py", line 123, in __init__
    dist.init_process_group(backend="nccl")
  File "/usr/local/lib/python3.6/dist-packages/torch/distributed/distributed_c10d.py", line 372, in init_process_group
    raise RuntimeError("trying to initialize the default process group "
RuntimeError: trying to initialize the default process group twice!

ds_config.json

{
  "train_batch_size": 4,
  "gradient_accumulation_steps": 1,
  "steps_per_print": 1,
  "optimizer": {
    "type": "Adam",
    "params": {
      "lr": 0.0001,
      "max_grad_norm": 1.0,
      "betas": [
         0.9,
         0.999
       ]
    }
  }
}

command-line

deepspeed train_ds.py --deepspeed --deepspeed_config deepspeed_util/ds_config.json ...

code

training_data = load_dataset()
encoder_params = filter(lambda p: p.requires_grad, encoder.parameters())
decoder_params = filter(lambda p: p.requires_grad, decoder.parameters())
self.encoder, self.encoder_optim, train_loader, _ = deepspeed.initialize(opt, model=encoder, model_parameters=encoder_params, training_data=training_data)
self.decoder, self.decoder_optim, _, _ = deepspeed.initialize(opt, model=decoder, model_parameters=decoder_params)
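A possible workaround I am considering (a sketch; it relies on the dist_init_required flag actually being honored, see the _init_distributed issue above):

# Let the first initialize() set up torch.distributed; tell the second call not to
# initialize the default process group again.
self.encoder, self.encoder_optim, train_loader, _ = deepspeed.initialize(
    opt, model=encoder, model_parameters=encoder_params, training_data=training_data)
self.decoder, self.decoder_optim, _, _ = deepspeed.initialize(
    opt, model=decoder, model_parameters=decoder_params, dist_init_required=False)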

Detect if init distributed is needed

Ideally we could remove the dist_init_required flag from deepspeed.initialize if we can detect whether torch.distributed has already been initialized. This could prevent some types of bugs like #65.
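A minimal sketch of the detection, using only public torch.distributed calls (env:// rendezvous variables assumed to be set):

import torch.distributed as dist

# Only create the default process group if nobody (e.g., the user's own code) has done so.
if not dist.is_initialized():
    dist.init_process_group(backend="nccl")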

max_grad_norm is ignored in FP16 training

I am currently fighting with a dynamic loss scale that is constantly decreasing due to gradient overflows. Setting max_grad_norm in the JSON config has no effect since it is overridden in deepspeed_light.py:

if self.fp16_enabled() and 'max_grad_norm' in optimizer_parameters.keys():
    optimizer_parameters['max_grad_norm'] = 0.0

I think this modification should be removed.

Kind regards

Install details

Add more details to the install section of the README to talk about local vs. multi-node installs. Also update the resource configuration section to discuss single-node support.

Catch spawned process failures and terminate

The DeepSpeed launcher should detect failed processes and then ensure that the remaining children are joined with a timeout. The distributed_test decorator does this; we should evaluate it more rigorously and see if it's appropriate for deepspeed_run.
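An illustrative sketch of the desired behavior (generic multiprocessing, not the actual launcher code):

import multiprocessing as mp

def join_or_terminate(processes, timeout=60):
    # 'processes' is a list of mp.Process children. Give each a bounded time to finish,
    # then kill whatever is still alive if any sibling exited with a failure.
    for p in processes:
        p.join(timeout)
    if any(p.exitcode not in (0, None) for p in processes):
        for p in processes:
            if p.is_alive():
                p.terminate()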

pytorch gradient checkpointing is much better than deepspeed !

Hello,

I have a script that trains a 12-layer transformer model (about 85 million parameters) using gradient checkpointing. It was working with a local batch size of 32 per Nvidia Titan GPU.
I tried to use DeepSpeed instead and I always get OOM, even with a batch size of 8.

minimal code:
Initialization:

model = models.TransformerModel(ntokens, args.emsize, args.nhead, args.nhid, args.nlayers, args.dropout)

parameters = filter(lambda p: p.requires_grad, model.parameters())

model_engine, optimizer, _, _ = deepspeed.initialize(args=args, model=model, model_parameters=parameters)

Training:

with tqdm(total=int(args.log_interval),
          desc='Train Step     #{}-{}'.format(step + 1, step + args.log_interval),
          disable=False) as t:
    for batch_idx, batch in enumerate(datasetGenerator):
        data, target, src_padding = (batch['input'].to(model_engine.local_rank),
                                     batch['target'].to(model_engine.local_rank),
                                     batch['padding_mask'].to(model_engine.local_rank))

        output = model_engine(data, has_mask=False, src_key_padding_mask=src_padding.t())

        train_accuracy.update(accuracy(target, output))
        loss = criterion(output.view(-1, ntokens), target.view(-1))

        model_engine.backward(loss)
        model_engine.step()

        t.set_postfix({'loss': train_loss.avg.item(),
                       'accuracy': 100. * train_accuracy.avg.item()})
        t.update(1)

Original Transformer code with gradient checkpointing (checkpoint here is torch.utils.checkpoint.checkpoint):

def forward(self, src, mask=None, src_key_padding_mask=None):
        r"""Pass the input through the encoder layers in turn.
        Args:
            src: the sequence to the encoder (required).
            mask: the mask for the src sequence (optional).
            src_key_padding_mask: the mask for the src keys per batch (optional).
        Shape:
            see the docs in Transformer class.
        """
        output = src

        for i in range(self.num_layers):
            #output = self.layers[i](output, src_mask=mask,
            #                        src_key_padding_mask=src_key_padding_mask)
            output = checkpoint(self.layers[i], output, mask, src_key_padding_mask)


        if self.norm:
            output = self.norm(output)

        return output 

The working batch size for the dataloader is only 4.
Any idea how I can achieve the same batch size with DeepSpeed as I could with gradient checkpointing?
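For what it's worth, what I plan to try next (a sketch, assuming the deepspeed.checkpointing module is available in my version) is swapping the torch checkpoint call for DeepSpeed's drop-in inside the encoder loop, since wrapping the model in a DeepSpeed engine does not by itself enable activation checkpointing:

import deepspeed

# Drop-in replacement for torch.utils.checkpoint.checkpoint inside the layer loop
# (assumed available as deepspeed.checkpointing.checkpoint).
output = deepspeed.checkpointing.checkpoint(self.layers[i], output, mask, src_key_padding_mask)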

ZeRO optimizer LAMB compatibility

My use case for this library is mostly BERT-sized models, as opposed to Megatron-sized LMs. ZeRO in that context is mainly useful for fitting larger batch sizes and increasing throughput. For that reason, I'm wondering if/when you are planning to add a ZeRO-compatible LAMB optimizer.
