
mdel's Introduction

MDEL

Multi-Domain Expert Learning

Environment Setup

To set up the development environment, run make setup_dev. This will set up the pre-commit hooks.

Creating Expert Datasets

First, make sure you followed the Environment Setup guidelines.

To create an expert dataset using the Pile data, follow these steps:

  1. Download the Pile shard 1 data: ./scripts/get_pile_shard1_data.sh
  2. To set the domain, edit the variable SUBSET_NAME in scripts/create_domain_pile_mix.sh. This should be set to a valid value of the Pile's pile_set_name field; a list of valid values can be found below.
  3. Run scripts/create_domain_pile_mix.sh to process the dataset (a sketch of the filtering it performs follows this list)
  4. Authenticate into Hugging Face: export HF_ACCESS_TOKEN={YOUR HUGGINGFACE TOKEN}
  5. Set the dataset name in scripts/upload_to_hf.sh
  6. Run scripts/upload_to_hf.sh to upload the processed dataset to Hugging Face
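
For orientation, the domain filter amounts to keeping the records whose pile_set_name matches the chosen subset. A minimal Python sketch of that step, assuming the standard Pile JSONL-zst shard format (file paths are illustrative; this is not the script's actual interface):

import io
import json

import zstandard as zstd  # pip install zstandard

SUBSET_NAME = "ArXiv"  # must be a valid pile_set_name (see the subset list below)

with open("pile/train/01.jsonl.zst", "rb") as fh, \
        open("domain_subset.jsonl", "w", encoding="utf-8") as out:
    reader = io.TextIOWrapper(
        zstd.ZstdDecompressor().stream_reader(fh), encoding="utf-8"
    )
    for line in reader:
        record = json.loads(line)  # {"text": ..., "meta": {"pile_set_name": ...}}
        if record["meta"]["pile_set_name"] == SUBSET_NAME:
            out.write(json.dumps(record) + "\n")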

Pile Subsets

  • Pile-CC
  • PubMed Central
  • Books3†
  • OpenWebText2
  • ArXiv
  • Github
  • FreeLaw
  • Stack Exchange
  • USPTO Backgrounds
  • PubMed Abstracts
  • Gutenberg (PG-19)†
  • OpenSubtitles†
  • Wikipedia (en)†
  • DM Mathematics†
  • Ubuntu IRC
  • BookCorpus2
  • EuroParl†
  • HackerNews
  • YoutubeSubtitles
  • PhilPapers
  • NIH ExPorter
  • Enron Emails†

Training Expert Models

  1. Clone this repo and follow the Environment Setup instructions
  2. Set up HF authentication: export HUGGING_FACE_HUB_TOKEN=[FILL ME]
  3. Set up W&B authentication: export WANDB_API_KEY=[FILL ME]
  4. Edit the variable DATASET in the script src/mdel/train.sh to match a valid dataset name in the MDEL HF org.
  5. Run the script in the background to start the training: ./train.sh &
  6. The trained model should be uploaded to the MDEL HF org; a quick way to sanity-check the upload is sketched below
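
Once a run finishes, you can verify the upload by loading the model back from the Hub. The repo id below is one of the existing experts; substitute the one your run produced:

from transformers import AutoModelForCausalLM, AutoTokenizer

repo_id = "Multi-Domain-Expert-Layers/expert-arxiv"  # substitute your model
tokenizer = AutoTokenizer.from_pretrained(repo_id)
model = AutoModelForCausalLM.from_pretrained(repo_id)
print(model.config.model_type, model.num_parameters())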

Merging Expert Models

  1. Clone this repo and follow the Environment Setup instructions
  2. Set up HF authentication: export HUGGING_FACE_HUB_TOKEN=[FILL ME]
  3. Run the merge script
python src/mdel/merge_experts.py \
   --hf-repo your_hf_username/desired_name_of_merged_model \
   -e mdel/expert_1 \
   -e mdel/expert_2 \
   -e mdel/expert_n

Evaluating Perplexity of Models

  1. Clone this repo and follow the Environment Setup instructions
  2. Set up HF authentication: export HUGGING_FACE_HUB_TOKEN=[FILL ME]
  3. Run the perplexity script
python3 src/mdel/calculate_perplexity.py \
   --model Multi-Domain-Expert-Layers/expert-arxiv \
   --dataset Multi-Domain-Expert-Layers/arxiv \
   --split validation_domain

References

Gao, L., Biderman, S., Black, S., Golding, L., Hoppe, T., Foster, C., ... & Leahy, C. (2020). The Pile: An 800GB dataset of diverse text for language modeling. arXiv preprint arXiv:2101.00027.

mdel's People

Contributors

bentherien, digitous, henk717, huu4ontocord, jordiclive, mrcabbage972, mrseeker, nicolomonti, nourfahmy, p1ayer-1, stillerman, thedarkzeno, vmay-chegg


mdel's Issues

Expert merging: c-BTM

We would like to create a script for producing a merged model using the c-BTM method.

The script would take as input:

  1. A list of expert models from the MDEL HF repo (https://huggingface.co/Multi-Domain-Expert-Layers)
  2. The name of the output model

The resulting model would be uploaded to the MDEL HF repo. Its model card should contain the names of the experts it was created from.
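
For context, c-BTM combines the experts at inference time rather than in weight space: each expert's next-token distribution is weighted by how likely the input is to belong to that expert's domain cluster. A minimal sketch of the ensembling step, assuming HF causal LMs and externally computed weights (the domain scorer itself is out of scope here):

import torch

@torch.no_grad()
def cbtm_next_token_probs(experts, weights, input_ids):
    # experts: list of HF causal LMs; weights: per-expert mixture weights
    # for this input, summing to 1 (produced by a domain/cluster scorer).
    mixed = None
    for expert, weight in zip(experts, weights):
        logits = expert(input_ids).logits[:, -1, :]  # last-position logits
        probs = torch.softmax(logits, dim=-1)
        mixed = weight * probs if mixed is None else mixed + weight * probs
    return mixed  # ensembled next-token distribution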

input_ids cast to fp16 in DeeperSpeed bug

{
  "pipe-parallel-size": 1,
  "model-parallel-size": 1,

  "num-layers": 16,
  "hidden-size": 2048,
  "num-attention-heads": 8,
  "seq-length": 2048,
  "max-position-embeddings": 2048,
  "pos-emb": "rotary",
  "rotary-pct": 0.25,
  "no-weight-tying": true,
  "gpt-j-residual": true,
  "output-layer-parallelism": "column",
  
  "scaled-upper-triang-masked-softmax-fusion": false,
  "bias-gelu-fusion": false,

  "init_method": "small_init",
  "output_layer_init_method": "wang_init",

  "optimizer": {
    "type": "Adam",
    "params": {
      "lr": 0.00025,
      "betas": [0.9, 0.95],
      "eps": 1.0e-8
    }
  },
  "min_lr": 0.000025,

  "zero_optimization": {
    "stage": 0,
    "allgather_partitions": true,
    "allgather_bucket_size": 500000000,
    "overlap_comm": true,
    "reduce_scatter": true,
    "reduce_bucket_size": 500000000,
    "contiguous_gradients": true,
    "cpu_offload": false
  }, 

  "fp16": {
    "enabled": true,
    "type": "bfloat16",
    "auto_cast": true,
    "loss_scale": 0,
    "loss_scale_window": 1000,
    "initial_scale_power": 12,
    "hysteresis": 2,
    "min_loss_scale": 1
  }, 

  "fp32_allreduce": true,

  "train_micro_batch_size_per_gpu": 4,
  "gradient-accumulation-steps": 4,

  "data-path": "data/debug_text_document",
  "data-impl": "mmap",
  "num_workers": 1,

  "checkpoint-activations": true,
  "checkpoint-num-layers": 1,
  "partition-activations": true,
  "synchronize-each-layer": true,

  "gradient_clipping": 1.0,
  "weight-decay": 0.1,
  "hidden-dropout": 0,
  "attention-dropout": 0,

  "train-iters": 143000,
  "lr-decay-iters": 143000,
  "distributed-backend": "nccl",
  "lr-decay-style": "cosine",
  "warmup": 0.01,
  "checkpoint-factor": 1000,
  "extra-save-iters": [0,1,2,4,8,16,32,64,128,256,512],
  "eval-interval": 143000,
  "eval-iters": 10,

  "log-interval": 10,
  "steps_per_print": 10,
  "wall_clock_breakdown": true,

  "tokenizer-type": "HFGPT2Tokenizer"
}

I tried this config, but I see the following error:

Traceback (most recent call last):
  File "train.py", line 27, in <module>
    pretrain(neox_args=neox_args)
  File "/dccstor/mayankgpfs/scratch/gpt-neox/megatron/training.py", line 226, in pretrain
    iteration = train(
  File "/dccstor/mayankgpfs/scratch/gpt-neox/megatron/training.py", line 782, in train
    loss_dict, skipped_iter = train_step(
  File "/dccstor/mayankgpfs/scratch/gpt-neox/megatron/training.py", line 688, in train_step
    reduced_loss = train_step_pipe(
  File "/dccstor/mayankgpfs/scratch/gpt-neox/megatron/training.py", line 738, in train_step_pipe
    loss = model.train_batch(data_iter=data_iterator)
  File "/dccstor/mayankgpfs/scratch/DeeperSpeed/deepspeed/runtime/pipe/engine.py", line 346, in train_batch
    self._exec_schedule(sched)
  File "/dccstor/mayankgpfs/scratch/DeeperSpeed/deepspeed/runtime/pipe/engine.py", line 1376, in _exec_schedule
    self._exec_instr(**cmd.kwargs)
  File "/dccstor/mayankgpfs/scratch/DeeperSpeed/deepspeed/runtime/pipe/engine.py", line 658, in _exec_forward_pass
    outputs = super().forward(inputs)
  File "/dccstor/mayankgpfs/scratch/DeeperSpeed/deepspeed/utils/nvtx.py", line 11, in wrapped_fn
    ret_val = func(*args, **kwargs)
  File "/dccstor/mayankgpfs/scratch/DeeperSpeed/deepspeed/runtime/engine.py", line 1842, in forward
    loss = self.module(*inputs, **kwargs)
  File "/dccstor/mayankgpfs/conda/envs/laion/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/dccstor/mayankgpfs/scratch/DeeperSpeed/deepspeed/runtime/pipe/module.py", line 364, in forward
    x = exec_range_func(start_idx, end_idx)(*x)
  File "/dccstor/mayankgpfs/scratch/DeeperSpeed/deepspeed/runtime/pipe/module.py", line 337, in exec_func
    inputs = layer(inputs)
  File "/dccstor/mayankgpfs/conda/envs/laion/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/dccstor/mayankgpfs/scratch/gpt-neox/megatron/model/word_embeddings.py", line 181, in forward
    embeddings = super().forward(input_ids, position_ids)
  File "/dccstor/mayankgpfs/scratch/gpt-neox/megatron/model/word_embeddings.py", line 136, in forward
    words_embeddings = self.word_embeddings(input_ids)
  File "/dccstor/mayankgpfs/conda/envs/laion/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/dccstor/mayankgpfs/scratch/gpt-neox/megatron/mpu/layers.py", line 196, in forward
    output_parallel = F.embedding(
  File "/dccstor/mayankgpfs/conda/envs/laion/lib/python3.8/site-packages/torch/nn/functional.py", line 2210, in embedding
    return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)
RuntimeError: Expected tensor for argument #1 'indices' to have one of the following scalar types: Long, Int; but got torch.cuda.HalfTensor instead (while checking arguments for embedding)

Add script for merging expert models via weight averaging

We would like to create a script that produces a merged model by averaging expert weights.

The script would take as input:

  1. A list of expert models from the MDEL HF repo.
  2. The name of the output model

The averaged model would be uploaded to the MDEL HF repo. Its model card should contain the names of the experts it was created from.
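
At its core, the merge is a per-parameter mean over expert checkpoints with identical architectures. A minimal sketch of that idea (repo ids are placeholders; this is not necessarily how merge_experts.py implements it):

import torch
from transformers import AutoModelForCausalLM

expert_ids = ["mdel/expert_1", "mdel/expert_2"]  # placeholder repo ids
experts = [AutoModelForCausalLM.from_pretrained(rid) for rid in expert_ids]

merged = experts[0]
averaged = {}
for key, value in merged.state_dict().items():
    if value.is_floating_point():
        # Mean of this parameter across all experts.
        averaged[key] = torch.stack(
            [e.state_dict()[key].float() for e in experts]
        ).mean(dim=0)
    else:
        averaged[key] = value  # leave integer/bool buffers untouched
merged.load_state_dict(averaged)
merged.push_to_hub("your_hf_username/desired_name_of_merged_model")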

Train baseline models for evaluation

We need to evaluate the merged experts against a 1B Pythia model trained on everything together:

  1. Trained with all layers on the 6 datasets we have.
  2. Trained with just the upper layers.

To keep it fair, we would need to use the exact same 8,000 random training examples for each of the 7 datasets used in the other experiments. We would then merge the 6 experts with basic averaging and run the same evaluation on the 7 datasets against that model.

This will give us a comparison of:

  1. training all layers on the same tokens and data
  2. training some layers on the same tokens and data
  3. merging different experts trained with the same compute

Get all relevant data for StarCoder into LUMI

80B tokens each of language data from vi, en, fi, hi, ja. For some of these languages we won't have enough data, so we will need to do multiple passes over the data.

We will also do about 80B tokens of code, including multilingual code already gathered by @Taishi Nakamura.

That leaves 20B tokens, which we can reserve for instruction data, math, science, etc. - basically our expert data.

Evaluate a merged expert model's perplexity

The goal is to do a perplexity calculation on a few models:

  1. A model that is a weighted average of a few expert models
  2. A baseline model which is fine-tuned on the union of the experts' datasets
  3. Same as above, but tuning only the layers that were tuned for the experts (layers 9-13)

The model in (1) can be created using the script in this PR. The list of experts is:

  1. ArXiv
  2. Pubmed Central
  3. Pubmed Abstracts
  4. USPTO
  5. Philpapers
  6. Github
  7. Freelaw

The models in (2) and (3) should be prepared in the following issue.

The evaluation should be done on the evaluation fold of each expert's dataset, but excluding the Pile part of it. The datasets are in the MDEL HF org. The perplexity can be calculated with the Hugging Face Trainer's evaluate() method (see example here).
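
A hedged sketch of that calculation with evaluate(), assuming the dataset exposes a "text" column (batch size and sequence length are illustrative):

import math

from datasets import load_dataset
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

repo_id = "Multi-Domain-Expert-Layers/expert-arxiv"
tokenizer = AutoTokenizer.from_pretrained(repo_id)
tokenizer.pad_token = tokenizer.pad_token or tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(repo_id)

dataset = load_dataset("Multi-Domain-Expert-Layers/arxiv", split="validation_domain")
dataset = dataset.map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=1024),
    batched=True,
    remove_columns=dataset.column_names,
)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="ppl_eval", per_device_eval_batch_size=4),
    eval_dataset=dataset,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
# eval_loss is the mean token-level cross-entropy; exponentiate for perplexity.
print("perplexity:", math.exp(trainer.evaluate()["eval_loss"]))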

The deliverables of this issue should be:

  1. The perplexity numbers for each model
  2. A script to reproduce the result

Dataset generation open issues

  1. We are currently using shard 1 only, which doesn't have enough data for some domains - we should preprocess a bigger part of the Pile to get good coverage for these.
  2. Need to publish the subset names
  3. Not clear whether to balance domain/pile data by number of samples or number of tokens
  4. Report validation loss split by domain/pile data

Fix HF Hub Upload Error

When using the trainer script, the last HF Hub update of the model repo fails with the message:
remote: Sorry, your push was rejected during YAML metadata verification:
remote: - Error: "model-index[0].results[0].dataset.config" must be a string
remote: -------------------------------------------------------------------------
remote: -------------------------------------------------------------------------
remote: Please find the documentation at:
remote: https://huggingface.co/docs/hub/model-cards#model-card-metadata
remote:
remote: -------------------------------------------------------------------------
To https://huggingface.co/Multi-Domain-Expert-Layers/expert-uspto
 ! [remote rejected] main -> main (pre-receive hook declined)
error: failed to push some refs to 'https://huggingface.co/Multi-Domain-Expert-Layers/expert-uspto'

Setup separate environments on Redmond.ai box

It's currently difficult to work on the box because it has a single user with a single environment, which frequently gets broken.

To solve it, we should either create separate users or install docker.

Automatic Training Scripts for All Expert Models

If most of the training setup is homogeneous except for the data_path args/config (I assume it is, since the experts started from the same seed LM), then we could write a script that trains the expert models sequentially and keeps the GPUs busy; a sketch follows.
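
A minimal sketch of such a driver, assuming train.sh were parameterized to read DATASET from the environment (today it is edited by hand, per the training steps above); the domain list is illustrative:

import os
import subprocess

DATASETS = ["arxiv", "github", "pubmed_central", "freelaw"]  # illustrative

for name in DATASETS:
    # Run one training job at a time so the GPUs stay busy back-to-back.
    env = {**os.environ, "DATASET": name}
    subprocess.run(["./train.sh"], env=env, check=True)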

Investigate Expert Models Having High Perplexity

Our analysis in #53 has shown that the expert models we had previously trained actually have a higher perplexity than the base model.

Here are some issues that may have caused this:

  • no warmup
  • LR too high
  • too few steps
  • mixing in pile data
  • too many gradient accumulation steps
  • measurement error

The expert models were trained with an old version of the trainer, so we don't know which wandb run they belong to or what the pile/domain losses were during training. Re-doing the training of one of the experts should help clarify.

Train 2nd batch of expert models

Train the 3B experts again, with the following variations:

Data sizes: 500K, 1M, 10M, and, if there are enough examples, 50M (basically the maximum data)

  1. 80% pile/20% expert
  2. Expert only

Both should train layers 16-20 and 16-25. Save checkpoints at 50k and 100k steps.

Domains to train:

  1. ArXiv
  2. Github
  3. Pubmed Central
  4. Pubmed Abstracts
  5. Philpapers
  6. Freelaw
  7. USPTO

Create minimal example of training on LUMI

Guidelines from Sampo:

  1. Single pure torch (no custom CUDA kernels) codebase
  2. Start with small model (< 10B params)
  3. Use this fork of Megatron-DeepSpeed

A potential way of implementing this flexibly is to mount the fork as a submodule in the MDEL repo. Example here.

The purpose of this ticket is to create a minimal script or a step-by-step guide for starting a training run of a GPT-NeoX model on LUMI.
