
HetSeq: Distributed GPU Training on Heterogeneous Infrastructure

This is our code implementation of the paper:

Yifan Ding, Nicholas Botzer, Tim Weninger. HetSeq: Distributed GPU Training on Heterogeneous Infrastructure, Proc. of Association for the Advancement of Artificial Intelligence (AAAI) Innovative Application of Artificial Intelligence, February 2021.

Author: Yifan Ding ([email protected])

arXiv paper available: https://arxiv.org/abs/2009.14783

Documentation available: https://hetseq.readthedocs.io

Towards Data Science post on Medium: Training BERT at a University

Documentation includes Distributed Setting, Scripts to Run HetSeq, Extending HetSeq, Parameter Explanation and Code Reference.

Overview

HetSeq is a distributed neural network platform designed to run on heterogeneous infrastructure with a common shared file system, as is typical of scientific computing clusters. It can be launched directly from the command line over SSH or through a task queue submission system, without root privileges or any extra packages. It takes care of data index randomization and assignment to different GPUs in the multi-node, multi-GPU setting, and users can easily extend HetSeq to other models with minimal effort.

HetSeq requires installation of PyTorch with GPU support and NCCL.
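A quick way to confirm that the installed PyTorch build sees your GPUs and ships with NCCL support is the minimal check below (both calls are standard PyTorch APIs):

$ python -c "import torch; print(torch.cuda.is_available(), torch.distributed.is_nccl_available())"
# On a correctly configured GPU node this should print: True True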

Installation

  1. Create and activate a conda virtual environment with Python 3.7.4 (recommended)
$ conda create --name hetseq
$ conda activate hetseq
$ conda install python=3.7.4
  2. Clone the repository and install the necessary packages (a quick verification sketch follows this list)
$ git clone https://github.com/yifding/hetseq.git
$ cd /path/to/hetseq
$ pip install -r requirements.txt
$ pip install --editable .
  3. To run BERT: download the data files, including the training corpus, model configuration, and BPE dictionary. The test corpus is available from here, the full data from this link. Download test_DATA.zip for a test run or DATA.zip for a full run, unzip it, and place the preprocessing/ directory inside the package directory. Available corpora under preprocessing/:
  • phase one of BERT training corpus: preprocessing/hdf5_lower_case_1_seq_len_128.../wikicorpus_en/
  • phase two of BERT training corpus: preprocessing/hdf5_lower_case_1_seq_len_512.../wikicorpus_en/
  • sample test for phase one: preprocessing/test128/
  • sample test for phase two: preprocessing/test512/
  • see NVIDIA-pytorch-BERT, google_original_BERT, and the BERT paper for more information.
  • the provided corpus is generated with NVIDIA-pytorch-BERT from Wikipedia data (the book corpus is not available)
  4. Scripts for running HetSeq are available at https://hetseq.readthedocs.io/en/master/examples.html.
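To verify the editable install from step 2, a minimal check (assuming the package is importable as hetseq, the top-level package in this repository):

$ python -c "import hetseq; print('hetseq imported successfully')"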

Distributed Configuration

HetSeq can be executed on a single GPU on a single node, on multiple GPUs on a single node, or on multiple GPUs across multiple nodes. The main logic is defined in train.py.

  • --distributed-init-method: defines how the processes find each other, e.g. "tcp://10.32.82.207:11111" (TCP address for multiple nodes) or "file:///hetseq/communicate.txt" (shared file for multiple nodes).
  • --distributed-world-size: total number of GPUs used in the training.
  • --distributed-gpus: number of GPUs on the current node.
  • --distributed-rank: rank/index of the first GPU used on the current node.
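For example, a hypothetical two-node launch with 4 GPUs per node might look like the sketch below; only the distributed flags documented above are shown, and the task- and data-specific flags required for a real run are omitted:

# On node 0 (its IP, e.g. 10.32.82.207, serves as the init address):
$ python3 train.py \
    --distributed-init-method tcp://10.32.82.207:11111 \
    --distributed-world-size 8 \
    --distributed-gpus 4 \
    --distributed-rank 0 \
    [task- and data-specific flags ...]

# On node 1 (its first GPU is ranked after node 0's four GPUs):
$ python3 train.py \
    --distributed-init-method tcp://10.32.82.207:11111 \
    --distributed-world-size 8 \
    --distributed-gpus 4 \
    --distributed-rank 4 \
    [task- and data-specific flags ...]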

Performance table

Running BERT on nodes with 4 GPUs each.

| nodes | GPUs | epochs | batch size | steps   | avg. time per step | training time | training loss | expansion | speedup |
|-------|------|--------|------------|---------|--------------------|---------------|---------------|-----------|---------|
| 1     | 4    | 5      | 128        | 267,139 | 2.60 s             | 7.19 d        | 0.026         | 1         | 1       |
| 2     | 8    | 5      | 256        | 133,570 | 2.69 s             | 4.19 d        | 0.028         | 0.86      | 1.72    |
| 4     | 16   | 5      | 512        | 66,785  | 2.794 s            | 2.23 d        | 0.031         | 0.81      | 3.22    |
| 8     | 32   | 5      | 1024       | 33,393  | 3.126 s            | 1.21 d        | 0.055         | 0.74      | 5.94    |
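Reading the table, speedup appears to be the single-node training time divided by the n-node training time (e.g. 7.19 d / 4.19 d ≈ 1.72 for two nodes), and expansion the speedup divided by the number of nodes, i.e. the scaling efficiency (1.72 / 2 = 0.86).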

Notice and tips

Loading the BERT data takes a while.

Known issues

  • resuming training from a checkpoint is not yet supported
  • the MNIST dataset download does not support multiple GPUs

Future patches

  • BERT preprocessing pipeline not included
  • interface to datasets/transformers not included
  • HetSeq cannot yet be installed from pip
  • separate/combined evaluation not included
  • fp16 support

License

This repository is MIT-licensed. It is built on fairseq, NVIDIA-BERT, and PyTorch.

Please send us an e-mail or leave comments on GitHub if you have any questions.

Copyright (c) 2020 Yifan Ding and Weninger Lab
