
GPT-NeoX

This repository records EleutherAI's library for training large-scale language models on GPUs. Our current framework is based on NVIDIA's Megatron Language Model and has been augmented with techniques from DeepSpeed as well as some novel optimizations. We aim to make this repo a centralized and accessible place to gather techniques for training large-scale autoregressive language models, and to accelerate research into large-scale training. This library is in widespread use in academic, industry, and government labs, including by researchers at Oak Ridge National Lab, CarperAI, Stability AI, Together.ai, Korea University, Carnegie Mellon University, and the University of Tokyo, among others. Uniquely among similar libraries, GPT-NeoX supports a wide variety of systems and hardware, including launching via Slurm, MPI, and the IBM Job Step Manager, and has been run at scale on AWS, CoreWeave, ORNL Summit, ORNL Frontier, LUMI, and others.

If you are not looking to train models with billions of parameters from scratch, this is likely the wrong library to use. For generic inference needs, we recommend you instead use the Hugging Face transformers library, which supports GPT-NeoX models.

Why GPT-NeoX?

GPT-NeoX leverages many of the same features and technologies as the popular Megatron-DeepSpeed library but with substantially increased usability and novel optimizations. Major features include:

  • Distributed training with ZeRO and 3D parallelism
  • Support for a wide variety of systems and hardware, including launching via Slurm, MPI, and the IBM Job Step Manager; the library has been run at scale on AWS, CoreWeave, Oak Ridge's Summit and Frontier, Pacific Northwest National Laboratory, Argonne's Polaris, LUMI, and more.
  • Cutting-edge architectural innovations, including rotary and ALiBi positional embeddings, parallel feedforward attention layers, and Flash Attention.
  • Predefined configurations for popular architectures including Pythia, PaLM, Falcon, and LLaMA 1 & 2
  • Curriculum Learning
  • Easy connections with the open source ecosystem, including Hugging Face's tokenizers and transformers libraries, logging via WandB, and evaluation via our Language Model Evaluation Harness.

News

[9/20/2023] As of #1035, we have deprecated Flash Attention 0.x and 1.x, and migrated support to Flash Attention 2.x. We don't believe this will cause problems, but if you have a specific use case that requires old Flash Attention support with the latest GPT-NeoX, please raise an issue.

[8/10/2023] We now support checkpointing with AWS S3! Activate with the s3_path config option (for more detail, see the PR).

[8/10/2023] We have experimental support for LLaMA 2 and Flash Attention v2 in our math-lm project that will be upstreamed later this month.

[5/17/2023] After fixing some miscellaneous bugs we now fully support bf16.

[4/11/2023] We have upgraded our Flash Attention implementation to now support Alibi positional embeddings.

[3/9/2023] We have released GPT-NeoX 2.0.0, an upgraded version built on the latest DeepSpeed, which we will regularly sync with going forward.

Versions

Prior to 3/9/2023, GPT-NeoX relied on DeeperSpeed, which was based on an old version of DeepSpeed (0.3.15). In order to migrate to the latest upstream DeepSpeed version while allowing users to access the old versions of GPT-NeoX and DeeperSpeed, we have introduced two versioned releases for both libraries: the 1.0 releases preserve the legacy DeepSpeed 0.3.15-based stack, while the 2.0 releases are built on, and kept in sync with, the latest upstream DeepSpeed.

Quick Start

Environment and Dependencies

Host Setup

First make sure you are in an environment with Python 3.8 and an appropriate version of PyTorch (1.8 or later) installed. Note: some of the libraries that GPT-NeoX depends on have not yet been updated to be compatible with Python 3.10+. Python 3.9 appears to work, but this codebase has been developed and tested for Python 3.8.

To install the remaining basic dependencies, run:

pip install -r requirements/requirements.txt
pip install -r requirements/requirements-wandb.txt # optional, if logging using WandB
pip install -r requirements/requirements-tensorboard.txt # optional, if logging via tensorboard
python ./megatron/fused_kernels/setup.py install # optional, if using fused kernels

from the repository root.

Warning

Our codebase relies on DeeperSpeed, our fork of the DeepSpeed library with some added changes. We strongly recommend using Anaconda, a virtual machine, or some other form of environment isolation before continuing. Failure to do so may cause other repositories that rely on DeepSpeed to break.

Flash Attention

To use Flash-Attention, install the additional dependencies in ./requirements/requirements-flashattention.txt and set the attention type in your configuration accordingly (see configs). This can provide significant speed-ups over regular attention on certain GPU architectures, including Ampere GPUs (such as A100s); see the repository for more details.
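
For example, from the repository root (the attention type itself is then selected in your model config, as noted above):

pip install -r ./requirements/requirements-flashattention.txt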

Multi-Node Launching

NeoX and Deep(er)Speed support training across multiple nodes, and you have the option of using a variety of launchers to orchestrate multi-node jobs.

In general there needs to be a "hostfile" somewhere accessible with the format:

node1_ip slots=8
node2_ip slots=8

where the first column contains the IP address of each node in your setup and the number of slots is the number of GPUs that node has access to. In your config you must pass in the path to the hostfile with "hostfile": "/path/to/hostfile". Alternatively, the path to the hostfile can be provided in the environment variable DLTS_HOSTFILE.
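
As a concrete sketch, you could create such a hostfile and point DeepSpeed at it like this (the IP addresses and the /path/to/hostfile location are placeholders):

# write a two-node hostfile with 8 GPUs ("slots") per node
cat > /path/to/hostfile <<'EOF'
192.168.0.1 slots=8
192.168.0.2 slots=8
EOF
# either reference it in your config as "hostfile": "/path/to/hostfile",
# or export it for DeepSpeed to pick up
export DLTS_HOSTFILE=/path/to/hostfile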

pdsh

pdsh is the default launcher, and if you're using pdsh then all you must do (besides ensuring that pdsh is installed in your environment) is set {"launcher": "pdsh"} in your config files.

MPI

If using MPI then you must specify the MPI library (DeepSpeed/GPT-NeoX currently supports mvapich, openmpi, mpich, and impi, though openmpi is the most commonly used and tested) as well as pass the deepspeed_mpi flag in your config file:

{
    "launcher": "openmpi",
    "deepspeed_mpi": true
}

With your environment properly set up and the correct configuration files you can use deepy.py like a normal python script and start (for example) a training job with:

python3 deepy.py train.py /path/to/configs/my_model.yml

Slurm

Using Slurm can be slightly more involved. Like with MPI, you must add the following to your config:

{
    "launcher": "slurm",
    "deepspeed_slurm": true
}

If you do not have SSH access to the compute nodes in your Slurm cluster, you also need to add {"no_ssh_check": true} to your config (see the combined example below).
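
Putting the Slurm-related options together, a config fragment for such a cluster might look like this sketch, which combines only the fields described above:

{
    "launcher": "slurm",
    "deepspeed_slurm": true,
    "no_ssh_check": true
}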

(Advanced) Custom Launching

There are many cases where the above default launching options are not sufficient:

  • Many clusters have their own unique job scheduler or specific MPI/Slurm arguments necessary for launching jobs such as Summit JSRun or LLNL Flux
  • While the above Slurm/MPI/pdsh default options are enough for most job runs, advanced users may want to add arguments for optimization or debugging purposes

In these cases, you will need to modify the DeepSpeed multinode runner utility to support your use case. Broadly, these enhancements fall under two categories:

1. Adding a Launcher (e.g. JSRun, Flux, etc)

In this case, you must add a new multinode runner class to deepspeed/launcher/multinode_runner.py and expose it as a configuration option in GPT-NeoX. Examples of how we did this for Summit JSRun are in this DeeperSpeed commit and this GPT-NeoX commit, respectively.

2. Modifying Run Command or Environment Variables

We have encountered many cases where we wish to modify the MPI/Slurm run command for an optimization or to debug (e.g. to modify the Slurm srun CPU binding or to tag MPI logs with the rank). In this case, you must modify the multinode runner class's run command under its get_cmd method (e.g. mpirun_cmd for OpenMPI). Examples of how we did this to provide optimized and rank-tagged run commands using Slurm and OpenMPI for the Stability cluster are in this DeeperSpeed branch.

Hostfile Generation

In general you will not be able to have a single fixed hostfile, so you need to have a script to generate one dynamically when your job starts. An example script to dynamically generate a hostfile using Slurm and 8 GPUs per node is:

#!/bin/bash
GPUS_PER_NODE=8
mkdir -p /sample/path/to/hostfiles
# need to add the current slurm jobid to hostfile name so that we don't add to previous hostfile
hostfile=/sample/path/to/hostfiles/hosts_$SLURM_JOBID
# be extra sure we aren't appending to a previous hostfile
rm $hostfile &> /dev/null
# loop over the node names
for i in `scontrol show hostnames $SLURM_NODELIST`
do
    # add a line to the hostfile
    echo $i slots=$GPUS_PER_NODE >>$hostfile
done

$SLURM_JOBID and $SLURM_NODELIST are environment variables that Slurm creates for you. See the sbatch documentation for a full list of available Slurm environment variables set at job creation time.

Job Launching

Then you can create an sbatch script from which to kick off your GPT-NeoX job. A bare-bones sbatch script on a Slurm-based cluster with 8 GPUs per node would look like this:

#!/bin/bash
#SBATCH --job-name="neox"
#SBATCH --partition=your-partition
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=8
#SBATCH --gres=gpu:8

# Some potentially useful distributed environment variables
export HOSTNAMES=`scontrol show hostnames "$SLURM_JOB_NODELIST"`
export MASTER_ADDR=$(scontrol show hostnames "$SLURM_JOB_NODELIST" | head -n 1)
export MASTER_PORT=12802
export COUNT_NODE=`scontrol show hostnames "$SLURM_JOB_NODELIST" | wc -l`

# Your hostfile creation script from above
./write_hostfile.sh
# Tell DeepSpeed where to find our generated hostfile via DLTS_HOSTFILE
export DLTS_HOSTFILE=/sample/path/to/hostfiles/hosts_$SLURM_JOBID

# Launch training
python3 deepy.py train.py /sample/path/to/your/configs/my_model.yml

You can then kick off a training run with sbatch my_sbatch_script.sh.

Containerized Setup

We also provide a Dockerfile and docker-compose configuration if you prefer to run NeoX in a container.

To run the container you need appropriate GPU drivers, an up-to-date installation of Docker, and the nvidia-container-toolkit installed. To test that your installation is working, you can use NVIDIA's "sample workload":

docker run --rm --runtime=nvidia --gpus all ubuntu nvidia-smi

Provided that runs successfully, you next need to export NEOX_DATA_PATH and NEOX_CHECKPOINT_PATH in your environment to specify your data directory and the directory for storing and loading checkpoints:

export NEOX_DATA_PATH=/mnt/sda/data/enwiki8 #or wherever your data is stored on your system
export NEOX_CHECKPOINT_PATH=/mnt/sda/checkpoints

And then, from the gpt-neox directory, you can build the image and run a shell in a container with

docker compose run gpt-neox bash

After the build, you should be able to do this:

mchorse@537851ed67de:~$ echo $(pwd)
/home/mchorse
mchorse@537851ed67de:~$ ls -al
total 48
drwxr-xr-x  1 mchorse mchorse 4096 Jan  8 05:33 .
drwxr-xr-x  1 root    root    4096 Jan  8 04:09 ..
-rw-r--r--  1 mchorse mchorse  220 Feb 25  2020 .bash_logout
-rw-r--r--  1 mchorse mchorse 3972 Jan  8 04:09 .bashrc
drwxr-xr-x  4 mchorse mchorse 4096 Jan  8 05:35 .cache
drwx------  3 mchorse mchorse 4096 Jan  8 05:33 .nv
-rw-r--r--  1 mchorse mchorse  807 Feb 25  2020 .profile
drwxr-xr-x  2 root    root    4096 Jan  8 04:09 .ssh
drwxrwxr-x  8 mchorse mchorse 4096 Jan  8 05:35 chk
drwxrwxrwx  6 root    root    4096 Jan  7 17:02 data
drwxr-xr-x 11 mchorse mchorse 4096 Jan  8 03:52 gpt-neox

For a long-running job, you should run

docker compose up -d

to run the container in detached mode, and then, in a separate terminal session, run

docker compose exec gpt-neox bash

You can then run any job you want from inside the container.

Concerns when running for a long time or in detached mode include:

  • You will have to terminate the container manually when you are no longer using it.
  • If you want processes to continue running when your shell session ends, you will need to background them.
  • If you then want logging, you will have to make sure to pipe logs to disk or set up wandb (see the sketch below).
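
A minimal sketch addressing the last two points from inside the container, assuming the example configs used elsewhere in this README (the config files and log path are illustrative):

# from inside the container, move into the repository checkout
cd gpt-neox
# background the training job and pipe its logs to disk so it survives this shell session
nohup python3 deepy.py train.py ./configs/125M.yml ./configs/local_setup.yml > train.log 2>&1 &
# follow the log from another shell inside the container
tail -f train.log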

If you prefer to run the prebuilt container image from Docker Hub, you can instead run the docker compose commands with -f docker-compose-dockerhub.yml, e.g.,

docker compose -f docker-compose-dockerhub.yml run gpt-neox bash

Usage

All functionality should be launched using deepy.py, a wrapper around the deepspeed launcher.

We currently offer three main functions:

  1. train.py is used for training and finetuning models.
  2. eval.py is used to evaluate a trained model using the language model evaluation harness.
  3. generate.py is used to sample text from a trained model.

which can be launched with:

./deepy.py [script.py] [./path/to/config_1.yml] [./path/to/config_2.yml] ... [./path/to/config_n.yml]

For example, to launch training you can run

./deepy.py train.py ./configs/20B.yml ./configs/local_cluster.yml

For more details on each entry point, see the Training and Finetuning, Inference, and Evaluation sections respectively.

Configuration

GPT-NeoX parameters are defined in a YAML configuration file which is passed to the deepy.py launcher. We have provided some example .yml files in configs, showing a diverse array of features and model sizes.

These files are generally complete, but non-optimal. For example, depending on your specific GPU configuration, you may need to change some settings such as pipe-parallel-size, model-parallel-size to increase or decrease the degree of parallelisation, train_micro_batch_size_per_gpu or gradient-accumulation-steps to modify batch size related settings, or the zero_optimization dict to modify how optimizer states are parallelised across workers.
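
As a rough illustration, an override file touching these knobs might look like the fragment below; the field names come from the paragraph above, while the values are placeholders you would tune for your hardware:

{
  "pipe-parallel-size": 1,
  "model-parallel-size": 2,
  "train_micro_batch_size_per_gpu": 4,
  "gradient-accumulation-steps": 8,
  "zero_optimization": {
    "stage": 1
  }
}

Because configs are merged at runtime, such an override can be passed to deepy.py alongside a base model config.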

For a more detailed guide to the features available and how to configure them, see the configuration README, and for documentation of every possible argument, see configs/neox_arguments.md.

Datasets

Preconfigured Datasets

Several preconfigured datasets are available, including most components from the Pile, as well as the Pile train set itself, for straightforward tokenization using the prepare_data.py entry point.

E.g., to download and tokenize the enwik8 dataset with the GPT2 tokenizer, saving it to ./data, you can run:

python prepare_data.py -d ./data

or a single shard of the Pile (pile_subset) with the GPT-NeoX-20B tokenizer (assuming you have it saved at ./20B_checkpoints/20B_tokenizer.json):

python prepare_data.py -d ./data -t HFTokenizer --vocab-file ./20B_checkpoints/20B_tokenizer.json pile_subset

The tokenized data will be saved out to two files: [data-dir]/[dataset-name]/[dataset-name]_text_document.bin and [data-dir]/[dataset-name]/[dataset-name]_text_document.idx. You will need to add the prefix that both these files share to your training configuration file under the data-path field, e.g.:

  "data-path": "./data/enwik8/enwik8_text_document",

Using Custom Data

To prepare your own dataset for training with custom data, format it as one large JSONL file, with each line containing a JSON dictionary representing a separate document. The document text should be grouped under one JSON key, i.e. "text". Any auxiliary data stored in other fields will not be used.
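
For instance, each line of the file would look something like this minimal sketch (only the "text" field is consumed; the "meta" field is an illustrative extra that will be ignored):

{"text": "The full text of the first document as a single string...", "meta": {"source": "example"}}
{"text": "The full text of the second document..."}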

Next make sure to download the GPT2 tokenizer vocab and merge files from the following links:

Or use the 20B tokenizer (for which only a single Vocab file is needed):

(alternatively, you can provide any tokenizer file that can be loaded by Hugging Face's tokenizers library with the Tokenizer.from_pretrained() command)

You can now pretokenize your data using tools/datasets/preprocess_data.py, the arguments for which are detailed below:

usage: preprocess_data.py [-h] --input INPUT [--jsonl-keys JSONL_KEYS [JSONL_KEYS ...]] [--num-docs NUM_DOCS] --tokenizer-type {HFGPT2Tokenizer,HFTokenizer,GPT2BPETokenizer,CharLevelTokenizer} [--vocab-file VOCAB_FILE] [--merge-file MERGE_FILE] [--append-eod] [--ftfy] --output-prefix OUTPUT_PREFIX
                          [--dataset-impl {lazy,cached,mmap}] [--workers WORKERS] [--log-interval LOG_INTERVAL]

optional arguments:
  -h, --help            show this help message and exit

input data:
  --input INPUT         Path to input jsonl files or lmd archive(s) - if using multiple archives, put them in a comma separated list
  --jsonl-keys JSONL_KEYS [JSONL_KEYS ...]
                        space-separated list of keys to extract from jsonl. Default: text
  --num-docs NUM_DOCS   Optional: Number of documents in the input data (if known) for an accurate progress bar.

tokenizer:
  --tokenizer-type {HFGPT2Tokenizer,HFTokenizer,GPT2BPETokenizer,CharLevelTokenizer}
                        What type of tokenizer to use.
  --vocab-file VOCAB_FILE
                        Path to the vocab file
  --merge-file MERGE_FILE
                        Path to the BPE merge file (if necessary).
  --append-eod          Append an <eod> token to the end of a document.
  --ftfy                Use ftfy to clean text

output data:
  --output-prefix OUTPUT_PREFIX
                        Path to binary output file without suffix
  --dataset-impl {lazy,cached,mmap}
                        Dataset implementation to use. Default: mmap

runtime:
  --workers WORKERS     Number of worker processes to launch
  --log-interval LOG_INTERVAL
                        Interval between progress updates

For example:

python tools/datasets/preprocess_data.py \
            --input ./data/mydataset.jsonl.zst \
            --output-prefix ./data/mydataset \
            --vocab-file ./data/gpt2-vocab.json \
            --merge-file gpt2-merges.txt \
            --dataset-impl mmap \
            --tokenizer-type GPT2BPETokenizer \
            --append-eod

You would then run training with the following settings added to your configuration file:

  "data-path": "data/mydataset_text_document",

Training and Finetuning

Training is launched using deepy.py, a wrapper around DeepSpeed's launcher, which launches the same script in parallel across many GPUs / nodes.

The general usage pattern is:

python ./deepy.py train.py [path/to/config1.yml] [path/to/config2.yml] ...

You can pass in an arbitrary number of configs which will all be merged at runtime.

You can also optionally pass in a config prefix, which will assume all your configs are in the same folder and append that prefix to their path.

E.g.:

python ./deepy.py train.py -d configs 125M.yml local_setup.yml

This will deploy the train.py script on all nodes with one process per GPU. The worker nodes and number of GPUs are specified in the /job/hostfile file (see parameter documentation), or can simply be passed in as the num_gpus arg if running on a single node setup.

Although this is not strictly necessary, we find it useful to define the model parameters in one config file (e.g configs/125M.yml) and the data path parameters in another (e.g configs/local_setup.yml).

Pretrained Models

GPT-NeoX-20B

GPT-NeoX-20B is a 20 billion parameter autoregressive language model trained on the Pile. Technical details about GPT-NeoX-20B can be found in the associated paper. The configuration file for this model is both available at ./configs/20B.yml and included in the download links below.

Slim weights - (No optimizer states, for inference or finetuning, 39GB)

To download from the command line to a folder named 20B_checkpoints, use the following command:

wget --cut-dirs=5 -nH -r --no-parent --reject "index.html*" https://the-eye.eu/public/AI/models/GPT-NeoX-20B/slim_weights/ -P 20B_checkpoints

Full weights - (Including optimizer states, 268GB)

To download from the command line to a folder named 20B_checkpoints, use the following command:

wget --cut-dirs=5 -nH -r --no-parent --reject "index.html*" https://the-eye.eu/public/AI/models/GPT-NeoX-20B/full_weights/ -P 20B_checkpoints

Weights can alternatively be downloaded using a BitTorrent client. Torrent files can be downloaded here: slim weights, full weights.

We additionally have 150 checkpoints saved throughout training, one every 1,000 steps. We are working out how to best serve these at scale, but in the meantime people interested in working with the partially trained checkpoints can email us at [email protected] to arrange access.

Pythia

The Pythia Scaling Suite is a suite of models ranging from 70M to 12B parameters trained on the Pile, intended to promote research on the interpretability and training dynamics of large language models. Further details about the project and links to the models can be found in the paper and on the project's GitHub.

Polyglot

The Polyglot Project is an effort to train powerful non-English pretrained language models to promote the accessibility of this technology to researchers outside the dominant powerhouses of machine learning. EleutherAI has trained and released 1.3B, 3.8B, and 5.8B parameter Korean language models, the largest of which outperforms all other publicly available language models on Korean language tasks. Further details about the project and links to the models can be found here.

Inference

For most uses we recommend deploying models trained using the GPT-NeoX library via the Hugging Face Transformers library which is better optimized for inference.

We support three types of generation from a pretrained model:

  1. Unconditional generation
  2. Conditional generation based on an input read from a file
  3. Interactive generation, which allows for multiple rounds of back-and-forth between a user and the language model via a command line interface

All three types of text generation can be launched via python ./deepy.py generate.py -d configs 125M.yml local_setup.yml text_generation.yml with the appropriate values set in configs/text_generation.yml.

Evaluation

GPT-NeoX supports evaluation on downstream tasks through the language model evaluation harness.

To evaluate a trained model on the evaluation harness, simply run:

python ./deepy.py eval.py -d configs your_configs.yml --eval_tasks task1 task2 ... taskn

where --eval_tasks is a space-separated list of evaluation tasks, e.g. --eval_tasks lambada hellaswag piqa sciq. For details of all available tasks, refer to the lm-evaluation-harness repo.

Exporting to Hugging Face

GPT-NeoX is optimized heavily for training only, and GPT-NeoX model checkpoints are not compatible out of the box with other deep learning libraries. To make models easily loadable and shareable with end users, and for further exporting to various other frameworks, GPT-NeoX supports checkpoint conversion to the Hugging Face Transformers format.

Though NeoX supports a number of different architectural configurations, including AliBi positional embeddings, not all of these configurations map cleanly onto the supported configurations within Hugging Face Transformers.

NeoX supports export of compatible models into the following architectures:

  • GPTNeoXForCausalLM
  • LlamaForCausalLM
  • MistralForCausalLM

Training a model which does not fit into one of these Hugging Face Transformers architectures cleanly will require writing custom modeling code for the exported model.

To convert a GPT-NeoX library checkpoint to Hugging Face-loadable format, run:

python ./tools/ckpts/convert_neox_to_hf.py --input_dir /path/to/model/global_stepXXX --config_file your_config.yml --output_dir hf_model/save/location --precision {auto,fp16,bf16,fp32} --architecture {neox,mistral,llama}

Then to upload a model to the Hugging Face Hub, run:

huggingface-cli login
python ./tools/ckpts/upload.py

and input the requested information, including your Hugging Face Hub user token.

Importing Models Into GPT-NeoX

NeoX supplies several utilities for converting a pretrained model checkpoint into a format that can be trained within the library.

The following models or model families can be loaded in GPT-NeoX:

  • Llama 1
  • Llama 2
  • CodeLlama
  • Mistral-7b-v0.1

We provide two utilities for converting from two different checkpoint formats into a format compatible with GPT-NeoX.

To convert a Llama 1 or Llama 2 checkpoint distributed by Meta AI from its original file format (downloadable here or here) into the GPT-NeoX library, run

python tools/ckpts/convert_raw_llama_weights_to_neox.py --input_dir /path/to/model/parent/dir/7B --model_size 7B --output_dir /path/to/save/ckpt --num_output_shards <TENSOR_PARALLEL_SIZE> (--pipeline_parallel if pipeline-parallel-size >= 1)

To convert from a Hugging Face model into a NeoX-loadable format, run tools/ckpts/convert_hf_to_sequential.py. See documentation within that file for further options.

Monitoring

In addition to storing logs locally, we provide built-in support for two popular experiment monitoring frameworks: Weights & Biases and TensorBoard.

Weights and Biases

EleutherAI is currently using Weights & Biases to record our experiments. If you are logged into Weights & Biases on your machine—you can do this by executing wandb login—your runs will automatically be recorded. There are two optional fields associated with Weights & Biases: wandb_group allows you to name the run group and wandb_team allows you to assign your runs to an organization or team account.
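
For example, assuming you installed the optional WandB requirements from the Host Setup section, logging in once per machine is enough:

pip install -r requirements/requirements-wandb.txt # optional, if not already installed
wandb login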

TensorBoard

We also support using TensorBoard via the tensorboard-dir field. Dependencies required for TensorBoard monitoring can be found in and installed from ./requirements/requirements-tensorboard.txt.

Running on multi-node

If you need to supply a hostfile for use with the MPI-based DeepSpeed launcher, you can set the environment variable DLTS_HOSTFILE to point to the hostfile.

Profiling

We support profiling with Nsight Systems and PyTorch Memory Profiling.

Nsight Systems Profiling

To use Nsight Systems profiling, set the config options profile, profile_step_start, and profile_step_stop. Launch training with:

nsys profile -s none -t nvtx,cuda -o <path/to/profiling/output> --force-overwrite true \
--capture-range=cudaProfilerApi --capture-range-end=stop python $TRAIN_PATH/deepy.py \
$TRAIN_PATH/train.py --conf_dir configs <config files>

The generated output file can then be viewed with the Nsight Systems GUI.


PyTorch Memory Profiling

To use PyTorch Memory Profiling, set config options memory_profiling and memory_profiling_path.


View the generated profile with the _memory_viz.py script. Run it with:

python _memory_viz.py trace_plot <generated_profile> -o trace.html

Adoption and Publications

The GPT-NeoX library has been widely adopted by academic and industry researchers and ported onto many HPC systems.

If you have found this library useful in your research, please reach out and let us know! We would love to add you to our lists.

Publications

EleutherAI and our collaborators have used it in the following publications:

The following publications by other research groups use this library:

Models

The following models were trained using this library:

English LLMs

Non-English LLMs

Code Models

AI for Science

Other Modalities

Administrative Notes

Citing GPT-NeoX

If you have found the GPT-NeoX library helpful in your work, you can cite this repository as

@software{gpt-neox-library,
  title = {{GPT-NeoX: Large Scale Autoregressive Language Modeling in PyTorch}},
  author = {Andonian, Alex and Anthony, Quentin and Biderman, Stella and Black, Sid and Gali, Preetham and Gao, Leo and Hallahan, Eric and Levy-Kramer, Josh and Leahy, Connor and Nestler, Lucas and Parker, Kip and Pieler, Michael and Phang, Jason and Purohit, Shivanshu and Schoelkopf, Hailey and Stander, Dashiell and Songz, Tri and Tigges, Curt and Thérien, Benjamin and Wang, Phil and Weinbach, Samuel},
  url = {https://www.github.com/eleutherai/gpt-neox},
  doi = {10.5281/zenodo.5879544},
  month = {9},
  year = {2023},
  version = {2.0.0},
}

To cite the 20 billion parameter model named GPT-NeoX-20B, please use

@inproceedings{gpt-neox-20b,
  title={{GPT-NeoX-20B}: An Open-Source Autoregressive Language Model},
  author={Black, Sid and Biderman, Stella and Hallahan, Eric and Anthony, Quentin and Gao, Leo and Golding, Laurence and He, Horace and Leahy, Connor and McDonell, Kyle and Phang, Jason and Pieler, Michael and Prashanth, USVSN Sai and Purohit, Shivanshu and Reynolds, Laria and Tow, Jonathan and Wang, Ben and Weinbach, Samuel},
  booktitle={Proceedings of the ACL Workshop on Challenges \& Perspectives in Creating Large Language Models},
  url={https://arxiv.org/abs/2204.06745},
  year={2022}
}

Contributing

GPT-NeoX is built by the open-source AI community, and relies on our amazing contributors! Please see our contributing guide for more details on our CLA, code formatting, testing, etc.

Licensing

This repository hosts code that is part of EleutherAI's GPT-NeoX project. Copyright (c) 2024, EleutherAI. Licensed under the Apache License:

Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

    http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.

This repository is based off code written by NVIDIA that is licensed under the Apache License, Version 2.0. In accordance with the Apache License, all files that are modifications of code originally written by NVIDIA maintain an NVIDIA copyright header. All files that do not contain such a header are the exclusive copyright of EleutherAI. When the NVIDIA code has been modified from its original version, that fact is noted in the copyright header. All derivative works of this repository must preserve these headers under the terms of the Apache License.

This repository also contains code written by a number of other authors. Such contributions are marked and the relevant licensing is included where appropriate.

For full terms, see the LICENSE file. If you have any questions, comments, or concerns about licensing please email us at [email protected].

Acknowledgements

We run our experiments on a Kubernetes cluster provided by CoreWeave and a Slurm cluster provided by Stability AI. We are thankful to the DeepSpeed team for their advice and consultation.


gpt-neox's Issues

Inference with the 2.5B Model

Hi all,

I was told to come to this repository from the original gpt-neo repository if I wanted to try to run inference on the released 2.5B model. What's the current status of this repo compared to the old one? Can you run inference on a single GPU with the pretrained model?

Thanks and great work to everyone working on this.

Number of GPUs not automatically detected on a single node instance

While testing the codebase on an AWS instance, I found that if /job/hostfile is not present, you need to add num_gpus to the config to get training working.

Could (should?) we autodetect the number of GPUs if nothing is specified?

If not, we should add a more informative error message. This is the current traceback if num_gpus isn't specified and /job/hostfile isn't present:

Traceback (most recent call last):
  File "./deepy.py", line 67, in <module>
    old_style_args, conf = ConfigMonster().consume_args(extra_conf=extra_conf)
  File "/home/ubuntu/gpt-neox/megatron/config_monster.py", line 379, in consume_args
    ds_runner_conf, megatron_conf, ds_config_conf = self.derive_params_and_split(conf)
  File "/home/ubuntu/gpt-neox/megatron/config_monster.py", line 318, in derive_params_and_split
    world_size = ((num_gpus / pp_size) / mp_size)
TypeError: unsupported operand type(s) for /: 'NoneType' and 'int'

Fix our configs

Our configs are a mess. We have some hardcoded global variables in training scripts, some config files, duplicated arguments, unstated arithmetic dependencies, and probably more I am not thinking of right away.

parameters

How can I display the total number of parameters?
How can I adjust the parameters? I changed some numbers in gpt3_small.json, but it didn't work.

Expand to all 8 CoreWeave Machines

Right now we are running on a single 8-core machine, but we have access to 8. We need to figure out how to set up cross-machine distributed learning using Kubernetes.

Build a Tensorboard

Exactly what it says on the tin. Get a tensorboard up and running so that we can easily keep an eye on how our models are doing.

Implement the MPU from Megatron

The Megatron code contains an "MPU" library. MPU stands for "model parallelism unit." The purpose of an MPU is to allow custom tensor slicing across GPUs. DeepSpeed allows you to hook up an MPU, but doesn't provide one. The goal is to convert the MPU from Megatron to GPT-NeoX. This is a modified clone of Megatron: https://github.com/EleutherAI/MegatronPipeline

You may find the (minimalistic) descriptions DeepSpeed provides helpful:
https://www.deepspeed.ai/features/#model-parallelism
https://www.deepspeed.ai/tutorials/megatron/

The full DeepSpeed docs can be found here: https://deepspeed.readthedocs.io/en/latest/index.html

Figure out why 1-bit Adam pretends to run

We have been mystified by the fact that we haven’t seen any speed-up from 1-bit Adam, but we recently discovered that it requires OpenMPI to function. This explains the lack of improvement (we do not currently have a working OpenMPI implementation) but does not explain why the code acts like it runs without a problem. There are two tasks here:

[ ] figure out what the code is currently doing
[ ] figure out how to get 1-bit Adam to work (once OpenMPI exists)

Fix DeepSpeed (ZeRO2 + Pipeline Parallel)

There is an issue with the DeepSpeed library that prevents you from using Pipeline Parallelism and ZeRO Stage 2 at the same time. @leogao2 has a rudimentary patch that allows the code to run (see here) but it causes a significant slowdown. We need to figure out how to do this better. For additional context on the problem at hand, see #62

Profiling results from initial patch:

  • patched, zero2+checkpoint+pipeline: samples/sec: 1159.741, max vram: 3245MiB
  • patched, zero2+checkpoint: samples/sec: 1120.8568733324405, max vram: 1704MiB

Figure out what’s taking so long to do comms

Our code is spending an absurd amount of time doing communication. Here’s a breakdown for one iteration

%comms: 85.43695925203818 
%optimizer_step 33.48256986158767 
%forward: 3.2507325623113363 
%backward: 9.970107994957203

rank=0 time (ms) | train_batch: 87510.72 | batch_input: 78.66 | forward: 2844.74 | pipe_send_output: 5377.75 | comms: 74766.41 | pipe_recv_grad: 10826.45 | backward: 8724.91 | reduce_tied_grads: 0.80 | reduce_grads: 29651.29 | step: 29300.81 | _step_clipping: 0.20 | _step_step: 29295.69 | _step_zero_grad: 2.97 | _step_check_overflow: 0.80

so there are several stages where communication is happening - where one pipeline stage sends its output to the next (pipe_send_output), where one pipeline stage receives gradients from the previous stage (pipe_recv_grad), where gradients (reduce_grads) and tied gradients (reduce_tied_grads) are reduced across machines and where partitions are allgathered in the zero optimizer (a part of step).

In total, all these stages take up 85% of one iteration across 4 machines. We have no idea why we aren't saturating the throughput for the communication steps, but it definitely is the bottleneck

4 GPUS, 1 Node, pp=2, dp=2:
%comms: 48.58126341752161
%optimizer_step 1.4845065630029386
%forward: 12.781570852708713
%backward: 37.94681723603108

rank=0 time (ms) | train_batch: 186642.76 | batch_input: 521.31 | forward: 23855.87 | pipe_send_output: 26564.80 | comms: 90673.35 | pipe_recv_grad: 2339.02 | backward: 70824.96 | reduce_tied_grads: 0.58 | reduce_grads: 59596.19 | step: 2770.72 | _step_clipping: 0.17 | _step_step: 2766.47 | _step_zero_grad: 2.57 | _step_check_overflow: 0.61

8 GPUS, 1 Node, pp=2, dp=4:

%comms: 39.16614816474212
%optimizer_step 4.681605995231978
%forward: 15.035426649657623
%backward: 44.66480575797971

rank=0 time (ms) | train_batch: 79330.51 | batch_input: 335.96 | forward: 11927.67 | pipe_send_output: 16225.03 | comms: 31070.66 | pipe_recv_grad: 2078.04 | backward: 35432.79 | reduce_tied_grads: 0.76 | reduce_grads: 9496.98 | step: 3713.94 | _step_clipping: 0.14 | _step_step: 3709.45 | _step_zero_grad: 2.74 | _step_check_overflow: 0.67

16 GPUS, 2 Nodes, pp=2, dp=8:

%comms: 52.51962050838257
%optimizer_step 7.945747803546135
%forward: 11.523001023940648
%backward: 34.636627991799266

rank=0 time (ms) | train_batch: 51027.40 | batch_input: 189.07 | forward: 5879.88 | pipe_send_output: 4394.57 | comms: 26799.35 | pipe_recv_grad: 9948.94 | backward: 17674.15 | reduce_tied_grads: 0.84 | reduce_grads: 8792.55 | step: 4054.50 | _step_clipping: 0.14 | _step_step: 4050.02 | _step_zero_grad: 2.76 | _step_check_overflow: 0.67

32 GPUS, 4 Nodes, pp=2, dp=16:

%comms: 86.65624745291373
%optimizer_step 33.94892399535246
%forward: 3.0413984988055836
%backward: 9.345754953621045

rank=0 time (ms) | train_batch: 93470.05 | batch_input: 74.21 | forward: 2842.79 | pipe_send_output: 6155.78 | comms: 80997.49 | pipe_recv_grad: 10431.66 | backward: 8735.47 | reduce_tied_grads: 0.59 | reduce_grads: 33104.09 | step: 31732.03 | _step_clipping: 0.16 | _step_step: 31727.09 | _step_zero_grad: 2.96 | _step_check_overflow: 0.77

32 GPUS, 4 Nodes, pp=4, dp=8:

%comms: 90.24613435844124
%optimizer_step 9.220154240256214
%forward: 2.199843158054571
%backward: 7.178205155294351

rank=0 time (ms) | train_batch: 125101.70 | batch_input: 140.49 | forward: 2752.04 | pipe_send_output: 43070.37 | comms: 112899.34 | pipe_recv_grad: 50213.59 | backward: 8980.05 | reduce_tied_grads: 0.69 | reduce_grads: 8346.74 | step: 11534.56 | _step_clipping: 0.14 | _step_step: 11531.15 | _step_zero_grad: 1.64 | _step_check_overflow: 0.66

Update documentation

The current README is wrong. We should fix it so that it’s correct and recommends current best practices. It also must have a link to OWT2 for people to download.

Fix deprecated code

When you run GPT-3 Small, you get a deprecation warning because of how we handle iterators.

Integrate HuggingFace Tokenizers

Would be good to replace the weird tokenizer class megatron has with HF tokenizers, or at least accept both types of tokenizer. HF library is super intuitive for training and nice and easy to use. This would allow us to experiment with different tokenization schemes, etc.

The tokenizer code is a little spread out through the repo currently so this might be a bit of a task.

How to change parameters

GPT-NeoX is designed to be able to train models with hundreds of billions of parameters, so how should gpt3_small.json be adjusted to realize this?

Deeperspeed

Does deeperspeed provide any performance advantages over deepspeed?
I'm also assuming the way to use deeperspeed is the same as deepspeed.

10 billion parameters

How do I adjust the parameters so that the total parameter count is 10 billion?

If I use NVIDIA GeForce RTX 3090 (8 cards, 24 GB per card), could I achieve this?

Build PyTorch from source

Apparently PyTorch only works with MPI if you build it from source on an MPI-compatible system. The NVIDIA Docker image has this built-in and may be a time-saving path forward depending on how hard it is to migrate our existing stuff.

Fix tfrecord dataset to load less files into memory

The current tfrecord dataset loads 1 tfrecord at a time into memory.
The deepspeed distributed wrapper causes the dataset to do this once, for every sample, for every GPU.
Maybe it would be best to preprocess / prefetch n samples, write them to disk, then load the correct sample from disk at train time.

Version conflict on colab

Describe the bug
Using pytorch 1.7.1 results in an error while keeping it at 1.7.0 (default on colab) rectifies it

To Reproduce
Steps to reproduce the behavior:
!pip install -r requirements.txt

Results in

Traceback (most recent call last):
  File "/usr/local/lib/python3.6/dist-packages/pkg_resources/__init__.py", line 568, in _build_master
    ws.require(__requires__)
  File "/usr/local/lib/python3.6/dist-packages/pkg_resources/__init__.py", line 886, in require
    needed = self.resolve(parse_requirements(requirements))
  File "/usr/local/lib/python3.6/dist-packages/pkg_resources/__init__.py", line 777, in resolve
    raise VersionConflict(dist, req).with_context(dependent_req)
pkg_resources.ContextualVersionConflict: (torch 1.7.1 (/usr/local/lib/python3.6/dist-packages), Requirement.parse('torch==1.7.0'), {'torchvision'})

Integrate DeepSpeed

To get this code running as efficiently as we will need, we should use the DeepSpeed library by Microsoft. It has a lot of bells and whistles and optimization options.

Allow for alternative architectures

We will want to experiment with different types of transformers. Modularization so that we can substitute different architectures is essential.

Integrate ZeRO-Powered Data Parallelism

Per DeepSpeed

We developed ZeRO to conquer the limitations of data parallelism and model parallelism while achieving the merits of both. ZeRO removes the memory redundancies across data-parallel processes by partitioning the model states—parameters, gradients, and optimizer state—across data parallel processes instead of replicating them. It uses a dynamic communication schedule during training to share the necessary state across distributed devices to retain the computational granularity and communication volume of data parallelism

We call this ZeRO-powered data parallelism, which allows per-device memory usage to scale linearly with the degree of data parallelism and incurs similar communication volume as data parallelism. ZeRO-powered data parallelism can fit models of arbitrary size—as long as the aggregated device memory is large enough to share the model states.

Create experiment runners

We will want to run experiments with a variety of configs and options. To enable this, we need two things:

  • configs files that we can use to specify the settings for a particular run
  • an experiment runner for managing and automatically executing several runs

Can't install Triton

I booted up the server and figured I would try giving the code a whirl. I appear to be unable to successfully run pip install -r requirements.txt, with it throwing errors when it got to Triton. The first problem was that I didn't have llvm-9-dev installed, but apt-get install llvm-9-dev took care of that. The second problem was that I had to run pip install CMake. Now both packages are installed and yet pip is throwing its most complicated error yet. Below is the full output:

    Running setup.py install for triton ... error
    ERROR: Command errored out with exit status 1:
     command: /root/anaconda3/envs/ds/bin/python -u -c 'import sys, setuptools, tokenize; sys.argv[0] = '"'"'/tmp/pip-install-gwvnpboi/triton_22306e539ed64a7a9b6510483f530faa/setup.py'"'"'; __file__='"'"'/tmp/pip-install-gwvnpboi/triton_22306e539ed64a7a9b6510483f530faa/setup.py'"'"';f=getattr(tokenize, '"'"'open'"'"', open)(__file__);code=f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' install --record /tmp/pip-record-ukz_au5s/install-record.txt --single-version-externally-managed --compile --install-headers /root/anaconda3/envs/ds/include/python3.7m/triton
         cwd: /tmp/pip-install-gwvnpboi/triton_22306e539ed64a7a9b6510483f530faa/
    Complete output (193 lines):
    /root/anaconda3/envs/ds/lib/python3.7/distutils/dist.py:274: UserWarning: Unknown distribution option: 'keyword'
      warnings.warn(msg)
    running install
    running build
    running build_py
    creating build
    creating build/lib.linux-x86_64-3.7
    creating build/lib.linux-x86_64-3.7/triton
    copying triton/__init__.py -> build/lib.linux-x86_64-3.7/triton
    copying triton/kernel.py -> build/lib.linux-x86_64-3.7/triton
    package init file 'triton/_C/__init__.py' not found (or not a regular file)
    creating build/lib.linux-x86_64-3.7/triton/ops
    copying triton/ops/__init__.py -> build/lib.linux-x86_64-3.7/triton/ops
    copying triton/ops/batchnorm.py -> build/lib.linux-x86_64-3.7/triton/ops
    copying triton/ops/einsum.py -> build/lib.linux-x86_64-3.7/triton/ops
    creating build/lib.linux-x86_64-3.7/triton/_C
    creating build/lib.linux-x86_64-3.7/triton/_C/include
    creating build/lib.linux-x86_64-3.7/triton/_C/include/triton
    creating build/lib.linux-x86_64-3.7/triton/_C/include/triton/codegen
    copying triton/_C/include/triton/codegen/pass.h -> build/lib.linux-x86_64-3.7/triton/_C/include/triton/codegen
    copying triton/_C/include/triton/codegen/target.h -> build/lib.linux-x86_64-3.7/triton/_C/include/triton/codegen
    creating build/lib.linux-x86_64-3.7/triton/_C/include/triton/codegen/analysis
    copying triton/_C/include/triton/codegen/analysis/align.h -> build/lib.linux-x86_64-3.7/triton/_C/include/triton/codegen/analysis
    copying triton/_C/include/triton/codegen/analysis/allocation.h -> build/lib.linux-x86_64-3.7/triton/_C/include/triton/codegen/analysis
    copying triton/_C/include/triton/codegen/analysis/axes.h -> build/lib.linux-x86_64-3.7/triton/_C/include/triton/codegen/analysis
    copying triton/_C/include/triton/codegen/analysis/layout.h -> build/lib.linux-x86_64-3.7/triton/_C/include/triton/codegen/analysis
    copying triton/_C/include/triton/codegen/analysis/liveness.h -> build/lib.linux-x86_64-3.7/triton/_C/include/triton/codegen/analysis
    creating build/lib.linux-x86_64-3.7/triton/_C/include/triton/codegen/selection
    copying triton/_C/include/triton/codegen/selection/generator.h -> build/lib.linux-x86_64-3.7/triton/_C/include/triton/codegen/selection
    copying triton/_C/include/triton/codegen/selection/machine_layout.h -> build/lib.linux-x86_64-3.7/triton/_C/include/triton/codegen/selection
    copying triton/_C/include/triton/codegen/selection/machine_value.h -> build/lib.linux-x86_64-3.7/triton/_C/include/triton/codegen/selection
    creating build/lib.linux-x86_64-3.7/triton/_C/include/triton/codegen/transform
    copying triton/_C/include/triton/codegen/transform/coalesce.h -> build/lib.linux-x86_64-3.7/triton/_C/include/triton/codegen/transform
    copying triton/_C/include/triton/codegen/transform/cts.h -> build/lib.linux-x86_64-3.7/triton/_C/include/triton/codegen/transform
    copying triton/_C/include/triton/codegen/transform/dce.h -> build/lib.linux-x86_64-3.7/triton/_C/include/triton/codegen/transform
    copying triton/_C/include/triton/codegen/transform/disassociate.h -> build/lib.linux-x86_64-3.7/triton/_C/include/triton/codegen/transform
    copying triton/_C/include/triton/codegen/transform/membar.h -> build/lib.linux-x86_64-3.7/triton/_C/include/triton/codegen/transform
    copying triton/_C/include/triton/codegen/transform/peephole.h -> build/lib.linux-x86_64-3.7/triton/_C/include/triton/codegen/transform
    copying triton/_C/include/triton/codegen/transform/reassociate.h -> build/lib.linux-x86_64-3.7/triton/_C/include/triton/codegen/transform
    creating build/lib.linux-x86_64-3.7/triton/_C/include/triton/driver
    copying triton/_C/include/triton/driver/backend.h -> build/lib.linux-x86_64-3.7/triton/_C/include/triton/driver
    copying triton/_C/include/triton/driver/buffer.h -> build/lib.linux-x86_64-3.7/triton/_C/include/triton/driver
    copying triton/_C/include/triton/driver/context.h -> build/lib.linux-x86_64-3.7/triton/_C/include/triton/driver
    copying triton/_C/include/triton/driver/device.h -> build/lib.linux-x86_64-3.7/triton/_C/include/triton/driver
    copying triton/_C/include/triton/driver/dispatch.h -> build/lib.linux-x86_64-3.7/triton/_C/include/triton/driver
    copying triton/_C/include/triton/driver/error.h -> build/lib.linux-x86_64-3.7/triton/_C/include/triton/driver
    copying triton/_C/include/triton/driver/handle.h -> build/lib.linux-x86_64-3.7/triton/_C/include/triton/driver
    copying triton/_C/include/triton/driver/kernel.h -> build/lib.linux-x86_64-3.7/triton/_C/include/triton/driver
    copying triton/_C/include/triton/driver/module.h -> build/lib.linux-x86_64-3.7/triton/_C/include/triton/driver
    copying triton/_C/include/triton/driver/platform.h -> build/lib.linux-x86_64-3.7/triton/_C/include/triton/driver
    copying triton/_C/include/triton/driver/stream.h -> build/lib.linux-x86_64-3.7/triton/_C/include/triton/driver
    creating build/lib.linux-x86_64-3.7/triton/_C/include/triton/external
    copying triton/_C/include/triton/external/half.hpp -> build/lib.linux-x86_64-3.7/triton/_C/include/triton/external
    creating build/lib.linux-x86_64-3.7/triton/_C/include/triton/external/CL
    copying triton/_C/include/triton/external/CL/cl.h -> build/lib.linux-x86_64-3.7/triton/_C/include/triton/external/CL
    copying triton/_C/include/triton/external/CL/cl_d3d10.h -> build/lib.linux-x86_64-3.7/triton/_C/include/triton/external/CL
    copying triton/_C/include/triton/external/CL/cl_d3d11.h -> build/lib.linux-x86_64-3.7/triton/_C/include/triton/external/CL
    copying triton/_C/include/triton/external/CL/cl_dx9_media_sharing.h -> build/lib.linux-x86_64-3.7/triton/_C/include/triton/external/CL
    copying triton/_C/include/triton/external/CL/cl_dx9_media_sharing_intel.h -> build/lib.linux-x86_64-3.7/triton/_C/include/triton/external/CL
    copying triton/_C/include/triton/external/CL/cl_egl.h -> build/lib.linux-x86_64-3.7/triton/_C/include/triton/external/CL
    copying triton/_C/include/triton/external/CL/cl_ext.h -> build/lib.linux-x86_64-3.7/triton/_C/include/triton/external/CL
    copying triton/_C/include/triton/external/CL/cl_ext_intel.h -> build/lib.linux-x86_64-3.7/triton/_C/include/triton/external/CL
    copying triton/_C/include/triton/external/CL/cl_gl.h -> build/lib.linux-x86_64-3.7/triton/_C/include/triton/external/CL
    copying triton/_C/include/triton/external/CL/cl_gl_ext.h -> build/lib.linux-x86_64-3.7/triton/_C/include/triton/external/CL
    copying triton/_C/include/triton/external/CL/cl_platform.h -> build/lib.linux-x86_64-3.7/triton/_C/include/triton/external/CL
    copying triton/_C/include/triton/external/CL/cl_va_api_media_sharing_intel.h -> build/lib.linux-x86_64-3.7/triton/_C/include/triton/external/CL
    copying triton/_C/include/triton/external/CL/opencl.h -> build/lib.linux-x86_64-3.7/triton/_C/include/triton/external/CL
    copying triton/_C/include/triton/external/CL/cl.hpp -> build/lib.linux-x86_64-3.7/triton/_C/include/triton/external/CL
    copying triton/_C/include/triton/external/CL/cl2.hpp -> build/lib.linux-x86_64-3.7/triton/_C/include/triton/external/CL
    creating build/lib.linux-x86_64-3.7/triton/_C/include/triton/external/CUDA
    copying triton/_C/include/triton/external/CUDA/cuda.h -> build/lib.linux-x86_64-3.7/triton/_C/include/triton/external/CUDA
    copying triton/_C/include/triton/external/CUDA/nvml.h -> build/lib.linux-x86_64-3.7/triton/_C/include/triton/external/CUDA
    creating build/lib.linux-x86_64-3.7/triton/_C/include/triton/ir
    copying triton/_C/include/triton/ir/basic_block.h -> build/lib.linux-x86_64-3.7/triton/_C/include/triton/ir
    copying triton/_C/include/triton/ir/builder.h -> build/lib.linux-x86_64-3.7/triton/_C/include/triton/ir
    copying triton/_C/include/triton/ir/constant.h -> build/lib.linux-x86_64-3.7/triton/_C/include/triton/ir
    copying triton/_C/include/triton/ir/context.h -> build/lib.linux-x86_64-3.7/triton/_C/include/triton/ir
    copying triton/_C/include/triton/ir/context_impl.h -> build/lib.linux-x86_64-3.7/triton/_C/include/triton/ir
    copying triton/_C/include/triton/ir/enums.h -> build/lib.linux-x86_64-3.7/triton/_C/include/triton/ir
    copying triton/_C/include/triton/ir/function.h -> build/lib.linux-x86_64-3.7/triton/_C/include/triton/ir
    copying triton/_C/include/triton/ir/instructions.h -> build/lib.linux-x86_64-3.7/triton/_C/include/triton/ir
    copying triton/_C/include/triton/ir/metadata.h -> build/lib.linux-x86_64-3.7/triton/_C/include/triton/ir
    copying triton/_C/include/triton/ir/module.h -> build/lib.linux-x86_64-3.7/triton/_C/include/triton/ir
    copying triton/_C/include/triton/ir/print.h -> build/lib.linux-x86_64-3.7/triton/_C/include/triton/ir
    copying triton/_C/include/triton/ir/type.h -> build/lib.linux-x86_64-3.7/triton/_C/include/triton/ir
    copying triton/_C/include/triton/ir/utils.h -> build/lib.linux-x86_64-3.7/triton/_C/include/triton/ir
    copying triton/_C/include/triton/ir/value.h -> build/lib.linux-x86_64-3.7/triton/_C/include/triton/ir
    copying triton/_C/include/triton/ir/visitor.h -> build/lib.linux-x86_64-3.7/triton/_C/include/triton/ir
    creating build/lib.linux-x86_64-3.7/triton/_C/include/triton/lang
    copying triton/_C/include/triton/lang/ast.h -> build/lib.linux-x86_64-3.7/triton/_C/include/triton/lang
    copying triton/_C/include/triton/lang/code_gen.h -> build/lib.linux-x86_64-3.7/triton/_C/include/triton/lang
    copying triton/_C/include/triton/lang/cpp.h -> build/lib.linux-x86_64-3.7/triton/_C/include/triton/lang
    copying triton/_C/include/triton/lang/encoding.h -> build/lib.linux-x86_64-3.7/triton/_C/include/triton/lang
    copying triton/_C/include/triton/lang/error.h -> build/lib.linux-x86_64-3.7/triton/_C/include/triton/lang
    copying triton/_C/include/triton/lang/evaluator.h -> build/lib.linux-x86_64-3.7/triton/_C/include/triton/lang
    copying triton/_C/include/triton/lang/mem_pool.h -> build/lib.linux-x86_64-3.7/triton/_C/include/triton/lang
    copying triton/_C/include/triton/lang/parser.h -> build/lib.linux-x86_64-3.7/triton/_C/include/triton/lang
    copying triton/_C/include/triton/lang/scanner.h -> build/lib.linux-x86_64-3.7/triton/_C/include/triton/lang
    copying triton/_C/include/triton/lang/scope.h -> build/lib.linux-x86_64-3.7/triton/_C/include/triton/lang
    copying triton/_C/include/triton/lang/token.h -> build/lib.linux-x86_64-3.7/triton/_C/include/triton/lang
    copying triton/_C/include/triton/lang/type.h -> build/lib.linux-x86_64-3.7/triton/_C/include/triton/lang
    copying triton/_C/include/triton/lang/visitor.h -> build/lib.linux-x86_64-3.7/triton/_C/include/triton/lang
    creating build/lib.linux-x86_64-3.7/triton/_C/include/triton/runtime
    copying triton/_C/include/triton/runtime/arg.h -> build/lib.linux-x86_64-3.7/triton/_C/include/triton/runtime
    copying triton/_C/include/triton/runtime/function.h -> build/lib.linux-x86_64-3.7/triton/_C/include/triton/runtime
    creating build/lib.linux-x86_64-3.7/triton/_C/include/triton/tools
    copying triton/_C/include/triton/tools/graph.h -> build/lib.linux-x86_64-3.7/triton/_C/include/triton/tools
    copying triton/_C/include/triton/tools/thread_pool.h -> build/lib.linux-x86_64-3.7/triton/_C/include/triton/tools
    copying triton/_C/include/triton/tools/bench.hpp -> build/lib.linux-x86_64-3.7/triton/_C/include/triton/tools
    copying triton/_C/include/triton/tools/sha1.hpp -> build/lib.linux-x86_64-3.7/triton/_C/include/triton/tools
    creating build/lib.linux-x86_64-3.7/triton/_C/include/triton/tools/sys
    copying triton/_C/include/triton/tools/sys/getenv.hpp -> build/lib.linux-x86_64-3.7/triton/_C/include/triton/tools/sys
    copying triton/_C/include/triton/tools/sys/mkdir.hpp -> build/lib.linux-x86_64-3.7/triton/_C/include/triton/tools/sys
    running build_ext
    -- The C compiler identification is GNU 7.5.0
    -- The CXX compiler identification is GNU 7.5.0
    -- Detecting C compiler ABI info
    -- Detecting C compiler ABI info - done
    -- Check for working C compiler: /usr/bin/cc - skipped
    -- Detecting C compile features
    -- Detecting C compile features - done
    -- Detecting CXX compiler ABI info
    -- Detecting CXX compiler ABI info - done
    -- Check for working CXX compiler: /usr/bin/c++ - skipped
    -- Detecting CXX compile features
    -- Detecting CXX compile features - done
    -- Found LLVM: /usr/lib/llvm-9 (found version "9.0.0")
    -- Adding Python module
    -- Configuring done
    CMake Warning (dev) at CMakeLists.txt:45 (add_library):
      Policy CMP0038 is not set: Targets may not link directly to themselves.
      Run "cmake --help-policy CMP0038" for policy details.  Use the cmake_policy
      command to set the policy and suppress this warning.
    
      Target "triton" links to itself.
    This warning is for project developers.  Use -Wno-dev to suppress it.
    
    -- Generating done
    -- Build files have been written to: /tmp/pip-install-gwvnpboi/triton_22306e539ed64a7a9b6510483f530faa/build/temp.linux-x86_64-3.7
    Scanning dependencies of target triton
    [  1%] Building CXX object CMakeFiles/triton.dir/lib/codegen/analysis/allocation.cc.o
    [  3%] Building CXX object CMakeFiles/triton.dir/lib/codegen/analysis/align.cc.o
    [  5%] Building CXX object CMakeFiles/triton.dir/lib/codegen/analysis/axes.cc.o
    [  7%] Building CXX object CMakeFiles/triton.dir/lib/codegen/analysis/layout.cc.o
    [  9%] Building CXX object CMakeFiles/triton.dir/lib/codegen/analysis/liveness.cc.o
    [ 10%] Building CXX object CMakeFiles/triton.dir/lib/codegen/selection/generator.cc.o
    [ 12%] Building CXX object CMakeFiles/triton.dir/lib/codegen/selection/machine_layout.cc.o
    [ 14%] Building CXX object CMakeFiles/triton.dir/lib/codegen/selection/machine_value.cc.o
    [ 16%] Building CXX object CMakeFiles/triton.dir/lib/codegen/target.cc.o
    /tmp/pip-install-gwvnpboi/triton_22306e539ed64a7a9b6510483f530faa/src/lib/codegen/target.cc:5:10: fatal error: llvm/IR/IntrinsicsNVPTX.h: No such file or directory
     #include "llvm/IR/IntrinsicsNVPTX.h"
              ^~~~~~~~~~~~~~~~~~~~~~~~~~~
    compilation terminated.
    CMakeFiles/triton.dir/build.make:185: recipe for target 'CMakeFiles/triton.dir/lib/codegen/target.cc.o' failed
    make[2]: *** [CMakeFiles/triton.dir/lib/codegen/target.cc.o] Error 1
    make[2]: *** Waiting for unfinished jobs....
    CMakeFiles/Makefile2:122: recipe for target 'CMakeFiles/triton.dir/all' failed
    make[1]: *** [CMakeFiles/triton.dir/all] Error 2
    Makefile:113: recipe for target 'all' failed
    make: *** [all] Error 2
    Traceback (most recent call last):
      File "<string>", line 1, in <module>
      File "/tmp/pip-install-gwvnpboi/triton_22306e539ed64a7a9b6510483f530faa/setup.py", line 129, in <module>
        'Programming Language :: Python :: 3.6',
      File "/root/anaconda3/envs/ds/lib/python3.7/site-packages/setuptools/__init__.py", line 153, in setup
        return distutils.core.setup(**attrs)
      File "/root/anaconda3/envs/ds/lib/python3.7/distutils/core.py", line 148, in setup
        dist.run_commands()
      File "/root/anaconda3/envs/ds/lib/python3.7/distutils/dist.py", line 966, in run_commands
        self.run_command(cmd)
      File "/root/anaconda3/envs/ds/lib/python3.7/distutils/dist.py", line 985, in run_command
        cmd_obj.run()
      File "/root/anaconda3/envs/ds/lib/python3.7/site-packages/setuptools/command/install.py", line 61, in run
        return orig.install.run(self)
      File "/root/anaconda3/envs/ds/lib/python3.7/distutils/command/install.py", line 545, in run
        self.run_command('build')
      File "/root/anaconda3/envs/ds/lib/python3.7/distutils/cmd.py", line 313, in run_command
        self.distribution.run_command(command)
      File "/root/anaconda3/envs/ds/lib/python3.7/distutils/dist.py", line 985, in run_command
        cmd_obj.run()
      File "/root/anaconda3/envs/ds/lib/python3.7/distutils/command/build.py", line 135, in run
        self.run_command(cmd_name)
      File "/root/anaconda3/envs/ds/lib/python3.7/distutils/cmd.py", line 313, in run_command
        self.distribution.run_command(command)
      File "/root/anaconda3/envs/ds/lib/python3.7/distutils/dist.py", line 985, in run_command
        cmd_obj.run()
      File "/tmp/pip-install-gwvnpboi/triton_22306e539ed64a7a9b6510483f530faa/setup.py", line 55, in run
        self.build_extension(ext)
      File "/tmp/pip-install-gwvnpboi/triton_22306e539ed64a7a9b6510483f530faa/setup.py", line 95, in build_extension
        subprocess.check_call(['cmake', '--build', '.'] + build_args, cwd=self.build_temp)
      File "/root/anaconda3/envs/ds/lib/python3.7/subprocess.py", line 363, in check_call
        raise CalledProcessError(retcode, cmd)
    subprocess.CalledProcessError: Command '['cmake', '--build', '.', '--config', 'Release', '--', '-j4']' returned non-zero exit status 2.
    ----------------------------------------
ERROR: Command errored out with exit status 1: /root/anaconda3/envs/ds/bin/python -u -c 'import sys, setuptools, tokenize; sys.argv[0] = '"'"'/tmp/pip-install-gwvnpboi/triton_22306e539ed64a7a9b6510483f530faa/setup.py'"'"'; __file__='"'"'/tmp/pip-install-gwvnpboi/triton_22306e539ed64a7a9b6510483f530faa/setup.py'"'"';f=getattr(tokenize, '"'"'open'"'"', open)(__file__);code=f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' install --record /tmp/pip-record-ukz_au5s/install-record.txt --single-version-externally-managed --compile --install-headers /root/anaconda3/envs/ds/include/python3.7m/triton Check the logs for full command output.
```

Take ZeRO 3 for a test drive

Is your feature request related to a problem? Please describe.
Model too smol

Describe the solution you'd like
DeepSpeed ZeRO-3 is finally public! Let's take it for a test drive (remember to turn pipeline parallelism off) and see if we can get it to run; a config sketch is included after this issue.

Describe alternatives you've considered
Use Pipeline Parallelism

Additional context
It probably doesn't work out of the box, or will need to be modified to work, because DeepSpeed :works-internally:
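
For reference, a minimal sketch of what turning on ZeRO stage 3 with pipeline parallelism off could look like. The config keys follow DeepSpeed's documented zero_optimization schema, but the model, optimizer settings, and batch sizes below are placeholders rather than this repo's actual setup:

```python
# Hedged sketch, not the repo's training script: enable ZeRO stage 3 on a plain
# nn.Module (ZeRO-3 cannot be combined with DeepSpeed pipeline parallelism).
import deepspeed
import torch

model = torch.nn.Linear(1024, 1024)  # stand-in for the transformer

ds_config = {
    "train_micro_batch_size_per_gpu": 4,
    "gradient_accumulation_steps": 1,
    "optimizer": {"type": "Adam", "params": {"lr": 1e-4}},
    "fp16": {"enabled": True},
    "zero_optimization": {
        "stage": 3,                  # partition params, grads, and optimizer states
        "overlap_comm": True,
        "contiguous_gradients": True,
    },
}

model_engine, optimizer, _, _ = deepspeed.initialize(
    model=model,
    model_parameters=model.parameters(),
    config=ds_config,  # older DeepSpeed releases call this kwarg config_params
)
```

Run it under the deepspeed launcher (e.g. deepspeed test_zero3.py) so the distributed environment is set up before initialize is called.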

gpt3small is broken

gpt3small seems to have been left behind in some of our updates; neither scripts/train_gpt3small.sh nor scripts/train_gpt3small_pipeline.sh runs.

Implement Generation / Eval with deepspeed model engine

Currently, generation and evaluation happen with the raw PyTorch model rather than the DeepSpeed model engine. This is already causing memory problems and won't allow us to scale up, so we'll need to implement both with the DeepSpeed model engine; see the sketch below.
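
A minimal sketch of what routing evaluation through the engine might look like, assuming a dataloader that yields token/label tensors and a forward pass that returns the loss (both placeholders, not this repo's actual interfaces):

```python
# Hedged sketch: evaluate through the DeepSpeed engine rather than the raw torch
# module, so ZeRO partitioning and fp16 handling stay on the forward path.
import torch

@torch.no_grad()
def evaluate(model_engine, eval_dataloader):
    model_engine.eval()
    total_loss, n_batches = 0.0, 0
    for batch in eval_dataloader:
        tokens = batch["tokens"].to(model_engine.device)
        labels = batch["labels"].to(model_engine.device)
        # call the engine itself (not model_engine.module) to keep DeepSpeed's hooks in the loop
        loss = model_engine(tokens, labels=labels)
        total_loss += loss.item()
        n_batches += 1
    model_engine.train()
    return total_loss / max(n_batches, 1)
```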

AttributeError: module 'torch.utils' has no attribute 'checkpoint' in gpt-neox/gpt-neox

While running "deepspeed train_enwik8.py --deepspeed --deepspeed_config ./configs/deepspeed_zero2.json" I got the error AttributeError: module 'torch.utils' has no attribute 'checkpoint' in gpt-neox/gpt-neox.
This seems similar to the problem described in the PyTorch forum thread attributeerror-module-torch-utils-has-no-attribute-checkpoint.

Quick fix: adding
"from torch.utils.checkpoint import checkpoint"
to the top of the file seems to fix the problem.
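
A minimal illustration of why the explicit import matters (assumed repro, not code from this repo): torch does not eagerly import every submodule, so torch.utils.checkpoint has to be imported before it can be referenced.

```python
# Without the explicit import below, torch.utils.checkpoint may raise
# AttributeError because the submodule was never loaded.
import torch
from torch.utils.checkpoint import checkpoint  # the quick-fix import

def block(x):
    return torch.relu(x @ x.t())

x = torch.randn(8, 8, requires_grad=True)
y = checkpoint(block, x)  # recomputes `block` during backward to save memory
y.sum().backward()
```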

(T5) Relative positional encodings?

[This is a reminder of a conversation I had with @sdtblck]

Simpler than the Transformer-XL relative encodings, the T5 relative bias should enable less expensive inference (~1000x?) thanks to caching, along with an improved effective context size of num_layers * ctx_length. A sketch of the T5-style bias is included below.
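
A rough sketch of the T5-style bias for the causal case, following the bucketing scheme from the T5 paper (bucket counts and distances here are the paper's defaults, not a decision for this repo):

```python
# Hedged sketch of T5 relative position bias: a learned [heads, q_len, k_len] bias
# added to attention logits, depending only on relative distance (so it can be cached).
import torch
import torch.nn as nn

class T5RelativeBias(nn.Module):
    def __init__(self, num_heads, num_buckets=32, max_distance=128):
        super().__init__()
        self.num_buckets = num_buckets
        self.max_distance = max_distance
        self.bias = nn.Embedding(num_buckets, num_heads)

    def _bucket(self, relative_position):
        # causal case: only the past is attended to, so future offsets collapse to 0
        n = (-relative_position).clamp(min=0)
        max_exact = self.num_buckets // 2          # half the buckets are exact offsets
        is_small = n < max_exact
        log_bucket = max_exact + (
            torch.log(n.float().clamp(min=1) / max_exact)
            / torch.log(torch.tensor(self.max_distance / max_exact))
            * (self.num_buckets - max_exact)
        ).long()
        log_bucket = log_bucket.clamp(max=self.num_buckets - 1)
        return torch.where(is_small, n, log_bucket)   # far offsets share log-spaced buckets

    def forward(self, q_len, k_len):
        context = torch.arange(q_len)[:, None]
        memory = torch.arange(k_len)[None, :]
        buckets = self._bucket(memory - context)      # [q_len, k_len]
        return self.bias(buckets).permute(2, 0, 1)    # [num_heads, q_len, k_len]

# usage: attn_logits = q @ k.transpose(-1, -2) + T5RelativeBias(num_heads)(T, T)
```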

Data loading

Modify the code base so that it can download the Pile and train a model on it.

Hardcoded paths in gpt3_small.json

In the gpt3_small.json file there are two hardcoded paths (one possible way to make them configurable is sketched below the snippet):

"train_path": "/root/data/owt2/train/*.tfrecords",
"eval_path": "/root/data/owt2/eval/*.tfrecords",
