Run fairseq-train with SLURM srun/sbatch

Create conda env

conda create -yn fsdp_1T python=3.8
conda activate fsdp_1T

Check out and build PyTorch from source (for EFA support, see the instructions)

conda install -y astunparse numpy ninja pyyaml mkl mkl-include setuptools cmake cffi typing_extensions future six requests dataclasses
conda install -y -c pytorch magma-cuda110
git clone --recursive git@github.com:pytorch/pytorch.git
cd pytorch
TORCH_CUDA_ARCH_LIST=8.0 python setup.py install
cd ..

Clone and install pbelevich/fairscale from source

git clone git@github.com:pbelevich/fairscale.git pbelevich-fairscale
cd pbelevich-fairscale
pip install -e .
cd ..

Clone and install pbelevich/fairseq from branch fsdp_1T

git clone -b fsdp_1T git@github.com:pbelevich/fairseq.git pbelevich-fairseq
cd pbelevich-fairseq
pip install -e .
cd ..

Install deepspeed

pip install deepspeed

Clone and build NVIDIA/apex

git clone https://github.com/NVIDIA/apex
cd apex
pip install -v --no-cache-dir --global-option="--cpp_ext" --global-option="--cuda_ext" \
  --global-option="--deprecated_fused_adam" --global-option="--xentropy" \
  --global-option="--fast_multihead_attn" ./
cd ..
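
Before moving on, it can help to confirm that every package installed above is importable from the active environment. The following is a minimal sketch and not part of this repo: check_imports is a hypothetical helper name, PYTHON is an assumed override variable, and the module names are assumed to match each package's usual import name.

```shell
# Hypothetical helper: report which Python modules fail to import.
# PYTHON is an assumed override for the interpreter (defaults to "python").
check_imports() {
    missing=0
    for mod in "$@"; do
        if "${PYTHON:-python}" -c "import ${mod}" >/dev/null 2>&1; then
            echo "ok:      ${mod}"
        else
            echo "MISSING: ${mod}"
            missing=1
        fi
    done
    # Non-zero exit status if at least one module could not be imported.
    return "${missing}"
}

# Usage (run inside the fsdp_1T conda env):
# check_imports torch fairscale fairseq deepspeed apex
```

A non-zero exit status means at least one import failed, so the call can gate later steps in a setup script.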

[Not needed if you use the fsdp_1T branch of pbelevich/fairseq] Quick fix for the fairseq-deepspeed issue: open fairseq/optim/cpu_adam.py and add ", False" to line 116

Clone this repo

git clone https://github.com/pbelevich/fsdp_1T.git
cd fsdp_1T

Preprocess the data for RoBERTa

wget https://s3.amazonaws.com/research.metamind.io/wikitext/wikitext-103-raw-v1.zip
unzip wikitext-103-raw-v1.zip
mkdir -p gpt2_bpe
wget -O gpt2_bpe/encoder.json https://dl.fbaipublicfiles.com/fairseq/gpt2_bpe/encoder.json
wget -O gpt2_bpe/vocab.bpe https://dl.fbaipublicfiles.com/fairseq/gpt2_bpe/vocab.bpe
for SPLIT in train valid test; do \
    python -m examples.roberta.multiprocessing_bpe_encoder \
        --encoder-json gpt2_bpe/encoder.json \
        --vocab-bpe gpt2_bpe/vocab.bpe \
        --inputs wikitext-103-raw/wiki.${SPLIT}.raw \
        --outputs wikitext-103-raw/wiki.${SPLIT}.bpe \
        --keep-empty \
        --workers 60; \
done
wget -O gpt2_bpe/dict.txt https://dl.fbaipublicfiles.com/fairseq/gpt2_bpe/dict.txt
fairseq-preprocess \
    --only-source \
    --srcdict gpt2_bpe/dict.txt \
    --trainpref wikitext-103-raw/wiki.train.bpe \
    --validpref wikitext-103-raw/wiki.valid.bpe \
    --testpref wikitext-103-raw/wiki.test.bpe \
    --destdir data-bin/wikitext-103 \
    --workers 60
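
A quick sanity check after preprocessing can catch a failed or partial run before a job is submitted. The sketch below assumes that fairseq-preprocess with --only-source writes dict.txt plus a .bin/.idx pair per split into the destination directory; check_data_bin is a hypothetical helper, not part of fairseq, so adjust the file list if your output is named differently.

```shell
# Hypothetical helper: verify that a fairseq data-bin directory looks complete.
# Assumed layout: dict.txt and {train,valid,test}.{bin,idx}.
check_data_bin() {
    dir="$1"
    status=0
    for f in dict.txt train.bin train.idx valid.bin valid.idx test.bin test.idx; do
        # -s is true only for files that exist and are non-empty.
        if [ ! -s "${dir}/${f}" ]; then
            echo "missing or empty: ${dir}/${f}"
            status=1
        fi
    done
    return "${status}"
}

# Usage:
# check_data_bin data-bin/wikitext-103 && echo "data-bin looks complete"
```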

Run fairseq-train with SLURM sbatch (output to the file slurm-XXXXX.out)

sbatch fairseq_fsdp_sbatch.sh

To see the log:

tail -f -n +1 slurm-XXXXX.out
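
The XXXXX in slurm-XXXXX.out is the job ID that sbatch prints on submission ("Submitted batch job <id>"), assuming the batch script does not override the default output file name. A small sketch for deriving the log name automatically; slurm_log_name is a hypothetical helper, not something shipped with this repo or with SLURM:

```shell
# Hypothetical helper: turn sbatch's confirmation line into the default log name.
# Expects input such as "Submitted batch job 12345".
slurm_log_name() {
    job_id="${1##* }"          # keep the text after the last space
    case "${job_id}" in
        # Reject empty strings and anything containing a non-digit.
        ''|*[!0-9]*) echo "could not parse job ID from: $1" >&2; return 1 ;;
    esac
    echo "slurm-${job_id}.out"
}

# Usage:
# log=$(slurm_log_name "$(sbatch fairseq_fsdp_sbatch.sh)") && tail -f -n +1 "${log}"
```

The log file may not exist until the job actually starts, so tail can fail if it is run immediately after submission; retrying after a short wait is usually enough.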

Run fairseq-train with SLURM srun (output to the screen)

./fairseq_fsdp_interactive.sh

Contributors

mrshenli, pbelevich
