
mdel's Introduction

MDEL

Multi-Domain Expert Learning

Environment Setup

To set up the development environment, run make setup_dev. This will set up the pre-commit hooks.

Creating Expert Datasets

First, make sure you followed the Environment Setup guidelines.

To create an expert dataset using the Pile data, follow these steps:

  1. Download the Pile shard 1 data: ./scripts/get_pile_shard1_data.sh
  2. To set the domain, edit the variable SUBSET_NAME in scripts/create_domain_pile_mix.sh. This should be set to a valid value of the Pile's pile_set_name field; a list of valid values can be found below.
  3. Run scripts/create_domain_pile_mix.sh to process the dataset (a sketch of the filtering it performs follows this list)
  4. Authenticate into Hugging Face: export HF_ACCESS_TOKEN={YOUR HUGGINGFACE TOKEN}
  5. Set the dataset name in scripts/upload_to_hf.sh
  6. Run scripts/upload_to_hf.sh to upload the processed dataset to Hugging Face
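
For orientation, the domain filter amounts to keeping the records whose pile_set_name matches the chosen subset. A minimal Python sketch of that step, assuming the standard Pile JSONL-zst shard format (file paths are illustrative; this is not the script's actual interface):

import io
import json

import zstandard as zstd  # pip install zstandard

SUBSET_NAME = "ArXiv"  # must be a valid pile_set_name (see the subset list below)

with open("pile/train/01.jsonl.zst", "rb") as fh, \
        open("domain_subset.jsonl", "w", encoding="utf-8") as out:
    reader = io.TextIOWrapper(
        zstd.ZstdDecompressor().stream_reader(fh), encoding="utf-8"
    )
    for line in reader:
        record = json.loads(line)  # {"text": ..., "meta": {"pile_set_name": ...}}
        if record["meta"]["pile_set_name"] == SUBSET_NAME:
            out.write(json.dumps(record) + "\n")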

Pile Subsets

  • Pile-CC
  • PubMed Central
  • Books3†
  • OpenWebText2
  • ArXiv
  • Github
  • FreeLaw
  • Stack Exchange
  • USPTO Backgrounds
  • PubMed Abstracts
  • Gutenberg (PG-19)†
  • OpenSubtitles†
  • Wikipedia (en)†
  • DM Mathematics†
  • Ubuntu IRC
  • BookCorpus2
  • EuroParl†
  • HackerNews
  • YoutubeSubtitles
  • PhilPapers
  • NIH ExPorter
  • Enron Emails†

Training Expert Models

  1. Clone this repo and follow the Environment Setup instructions
  2. Set up HF authentication: export HUGGING_FACE_HUB_TOKEN=[FILL ME]
  3. Set up W&B authentication: export WANDB_API_KEY=[FILL ME]
  4. Edit the variable DATASET in the script src/mdel/train.sh to match a valid dataset name in the MDEL HF org.
  5. Run the script in the background to start the training: ./train.sh &
  6. The trained model should be uploaded to the MDEL HF org; a quick way to sanity-check the upload is sketched below
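
Once a run finishes, you can verify the upload by loading the model back from the Hub. The repo id below is one of the existing experts; substitute the one your run produced:

from transformers import AutoModelForCausalLM, AutoTokenizer

repo_id = "Multi-Domain-Expert-Layers/expert-arxiv"  # substitute your model
tokenizer = AutoTokenizer.from_pretrained(repo_id)
model = AutoModelForCausalLM.from_pretrained(repo_id)
print(model.config.model_type, model.num_parameters())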

Merging Expert Models

  1. Clone this repo and follow the Environment Setup instructions
  2. Set up HF authentication: export HUGGING_FACE_HUB_TOKEN=[FILL ME]
  3. Run the merge script
python src/mdel/merge_experts.py \
   --hf-repo your_hf_username/desired_name_of_merged_model \
   -e mdel/expert_1 \
   -e mdel/expert_2 \
   -e mdel/expert_n

Evaluating Perplexity of Models

  1. Clone this repo and follow the Environment Setup instructions
  2. Set up HF authentication: export HUGGING_FACE_HUB_TOKEN=[FILL ME]
  3. Run the perplexity script
python3 src/mdel/calculate_perplexity.py \
   --model Multi-Domain-Expert-Layers/expert-arxiv \
   --dataset Multi-Domain-Expert-Layers/arxiv \
   --split validation_domain

References

Gao, L., Biderman, S., Black, S., Golding, L., Hoppe, T., Foster, C., ... & Leahy, C. (2020). The Pile: An 800GB dataset of diverse text for language modeling. arXiv preprint arXiv:2101.00027.

mdel's People

Contributors

bentherien, digitous, henk717, huu4ontocord, jordiclive, mrcabbage972, mrseeker, nicolomonti, nourfahmy, p1ayer-1, stillerman, thedarkzeno, vmay-chegg


mdel's Issues

Expert merging: c-BTM

We would like to create a script for producing a merged model using the c-BTM method.

The script would take as input:

  1. A list of expert models from the MDEL HF repo (https://huggingface.co/Multi-Domain-Expert-Layers)
  2. The name of the output model

The resulting model would be uploaded to the MDEL HF repo. Its model card should contain the names of the experts it was created from.
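
For context, c-BTM combines the experts at inference time rather than in weight space: each expert's next-token distribution is weighted by how likely the input is to belong to that expert's domain cluster. A minimal sketch of the ensembling step, assuming HF causal LMs and externally computed weights (the domain scorer itself is out of scope here):

import torch

@torch.no_grad()
def cbtm_next_token_probs(experts, weights, input_ids):
    # experts: list of HF causal LMs; weights: per-expert mixture weights
    # for this input, summing to 1 (produced by a domain/cluster scorer).
    mixed = None
    for expert, weight in zip(experts, weights):
        logits = expert(input_ids).logits[:, -1, :]  # last-position logits
        probs = torch.softmax(logits, dim=-1)
        mixed = weight * probs if mixed is None else mixed + weight * probs
    return mixed  # ensembled next-token distribution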

input_ids cast to fp16 in DeeperSpeed bug

{
  "pipe-parallel-size": 1,
  "model-parallel-size": 1,

  "num-layers": 16,
  "hidden-size": 2048,
  "num-attention-heads": 8,
  "seq-length": 2048,
  "max-position-embeddings": 2048,
  "pos-emb": "rotary",
  "rotary-pct": 0.25,
  "no-weight-tying": true,
  "gpt-j-residual": true,
  "output-layer-parallelism": "column",
  
  "scaled-upper-triang-masked-softmax-fusion": false,
  "bias-gelu-fusion": false,

  "init_method": "small_init",
  "output_layer_init_method": "wang_init",

  "optimizer": {
    "type": "Adam",
    "params": {
      "lr": 0.00025,
      "betas": [0.9, 0.95],
      "eps": 1.0e-8
    }
  },
  "min_lr": 0.000025,

  "zero_optimization": {
    "stage": 0,
    "allgather_partitions": true,
    "allgather_bucket_size": 500000000,
    "overlap_comm": true,
    "reduce_scatter": true,
    "reduce_bucket_size": 500000000,
    "contiguous_gradients": true,
    "cpu_offload": false
  }, 

  "fp16": {
    "enabled": true,
    "type": "bfloat16",
    "auto_cast": true,
    "loss_scale": 0,
    "loss_scale_window": 1000,
    "initial_scale_power": 12,
    "hysteresis": 2,
    "min_loss_scale": 1
  }, 

  "fp32_allreduce": true,

  "train_micro_batch_size_per_gpu": 4,
  "gradient-accumulation-steps": 4,

  "data-path": "data/debug_text_document",
  "data-impl": "mmap",
  "num_workers": 1,

  "checkpoint-activations": true,
  "checkpoint-num-layers": 1,
  "partition-activations": true,
  "synchronize-each-layer": true,

  "gradient_clipping": 1.0,
  "weight-decay": 0.1,
  "hidden-dropout": 0,
  "attention-dropout": 0,

  "train-iters": 143000,
  "lr-decay-iters": 143000,
  "distributed-backend": "nccl",
  "lr-decay-style": "cosine",
  "warmup": 0.01,
  "checkpoint-factor": 1000,
  "extra-save-iters": [0,1,2,4,8,16,32,64,128,256,512],
  "eval-interval": 143000,
  "eval-iters": 10,

  "log-interval": 10,
  "steps_per_print": 10,
  "wall_clock_breakdown": true,

  "tokenizer-type": "HFGPT2Tokenizer"
}

I tried this config, but I see the following error:

Traceback (most recent call last):
  File "train.py", line 27, in <module>
    pretrain(neox_args=neox_args)
  File "/dccstor/mayankgpfs/scratch/gpt-neox/megatron/training.py", line 226, in pretrain
    iteration = train(
  File "/dccstor/mayankgpfs/scratch/gpt-neox/megatron/training.py", line 782, in train
    loss_dict, skipped_iter = train_step(
  File "/dccstor/mayankgpfs/scratch/gpt-neox/megatron/training.py", line 688, in train_step
    reduced_loss = train_step_pipe(
  File "/dccstor/mayankgpfs/scratch/gpt-neox/megatron/training.py", line 738, in train_step_pipe
    loss = model.train_batch(data_iter=data_iterator)
  File "/dccstor/mayankgpfs/scratch/DeeperSpeed/deepspeed/runtime/pipe/engine.py", line 346, in train_batch
    self._exec_schedule(sched)
  File "/dccstor/mayankgpfs/scratch/DeeperSpeed/deepspeed/runtime/pipe/engine.py", line 1376, in _exec_schedule
    self._exec_instr(**cmd.kwargs)
  File "/dccstor/mayankgpfs/scratch/DeeperSpeed/deepspeed/runtime/pipe/engine.py", line 658, in _exec_forward_pass
    outputs = super().forward(inputs)
  File "/dccstor/mayankgpfs/scratch/DeeperSpeed/deepspeed/utils/nvtx.py", line 11, in wrapped_fn
    ret_val = func(*args, **kwargs)
  File "/dccstor/mayankgpfs/scratch/DeeperSpeed/deepspeed/runtime/engine.py", line 1842, in forward
    loss = self.module(*inputs, **kwargs)
  File "/dccstor/mayankgpfs/conda/envs/laion/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/dccstor/mayankgpfs/scratch/DeeperSpeed/deepspeed/runtime/pipe/module.py", line 364, in forward
    x = exec_range_func(start_idx, end_idx)(*x)
  File "/dccstor/mayankgpfs/scratch/DeeperSpeed/deepspeed/runtime/pipe/module.py", line 337, in exec_func
    inputs = layer(inputs)
  File "/dccstor/mayankgpfs/conda/envs/laion/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/dccstor/mayankgpfs/scratch/gpt-neox/megatron/model/word_embeddings.py", line 181, in forward
    embeddings = super().forward(input_ids, position_ids)
  File "/dccstor/mayankgpfs/scratch/gpt-neox/megatron/model/word_embeddings.py", line 136, in forward
    words_embeddings = self.word_embeddings(input_ids)
  File "/dccstor/mayankgpfs/conda/envs/laion/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/dccstor/mayankgpfs/scratch/gpt-neox/megatron/mpu/layers.py", line 196, in forward
    output_parallel = F.embedding(
  File "/dccstor/mayankgpfs/conda/envs/laion/lib/python3.8/site-packages/torch/nn/functional.py", line 2210, in embedding
    return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)
RuntimeError: Expected tensor for argument #1 'indices' to have one of the following scalar types: Long, Int; but got torch.cuda.HalfTensor instead (while checking arguments for embedding)

Add script for merging expert models via weight averaging

We would like to create a script that produces a merged model by averaging expert weights.

The script would take as input:

  1. A list of expert models from the MDEL HF repo.
  2. The name of the output model

The averaged model would be uploaded to the MDEL HF repo. Its model card should contain the names of the experts it was created from.
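
At its core, the merge is a per-parameter mean over expert checkpoints with identical architectures. A minimal sketch of that idea (repo ids are placeholders; this is not necessarily how merge_experts.py implements it):

import torch
from transformers import AutoModelForCausalLM

expert_ids = ["mdel/expert_1", "mdel/expert_2"]  # placeholder repo ids
experts = [AutoModelForCausalLM.from_pretrained(rid) for rid in expert_ids]

merged = experts[0]
averaged = {}
for key, value in merged.state_dict().items():
    if value.is_floating_point():
        # Mean of this parameter across all experts.
        averaged[key] = torch.stack(
            [e.state_dict()[key].float() for e in experts]
        ).mean(dim=0)
    else:
        averaged[key] = value  # leave integer/bool buffers untouched
merged.load_state_dict(averaged)
merged.push_to_hub("your_hf_username/desired_name_of_merged_model")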

Train baseline models for evaluation

We need to evaluate the merged experts against a 1B Pythia model trained on everything together:

  1. Trained with all layers on the 6 datasets we have.
  2. Trained with just the upper layers.

To keep it fair, we would need to use the exact same 8,000 random training examples for each of the 7 datasets used in the other experiments. We would then merge the 6 experts with basic averaging and run the same evaluation on the 7 datasets against that model.

This will give us a comparison of:

  1. training all layers on the same tokens and data
  2. training some layers on the same tokens and data
  3. merging different experts trained with the same compute

Get all relevant data for StarCoder into LUMI

80B tokens each of language data from vi, en, fi, hi, ja. For some of these languages we won't have enough data, so we will need to do multiple passes over the data.

We will also do about 80B tokens of code, including multilingual code already gathered by @Taishi Nakamura.

That leaves 20B tokens, which we can reserve for instruction data, math, science, etc. - basically our expert data.

Evaluate a merged expert model's perplexity

The goal is to do a perplexity calculation on a few models:

  1. A model that is a weighted average of a few expert models
  2. A baseline model which is fine-tuned on the union of the experts' datasets
  3. Same as above, but tuning only the layers that were tuned for the experts (layers 9-13)

The model in (1) can be created using the script in this PR. The list of experts is:

  1. ArXiv
  2. Pubmed Central
  3. Pubmed Abstracts
  4. USPTO
  5. Philpapers
  6. Github
  7. Freelaw

The models in (2) and (3) should be prepared in the following issue.

The evaluation should be done on the evaluation fold of each expert's dataset, but excluding the Pile part of it. The datasets are in the MDEL HF org. The perplexity can be calculated with the Hugging Face Trainer's evaluate() method (see example here).
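
A hedged sketch of that calculation with evaluate(), assuming the dataset exposes a "text" column (batch size and sequence length are illustrative):

import math

from datasets import load_dataset
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

repo_id = "Multi-Domain-Expert-Layers/expert-arxiv"
tokenizer = AutoTokenizer.from_pretrained(repo_id)
tokenizer.pad_token = tokenizer.pad_token or tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(repo_id)

dataset = load_dataset("Multi-Domain-Expert-Layers/arxiv", split="validation_domain")
dataset = dataset.map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=1024),
    batched=True,
    remove_columns=dataset.column_names,
)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="ppl_eval", per_device_eval_batch_size=4),
    eval_dataset=dataset,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
# eval_loss is the mean token-level cross-entropy; exponentiate for perplexity.
print("perplexity:", math.exp(trainer.evaluate()["eval_loss"]))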

The deliverables of this issue should be:

  1. The perplexity numbers for each model
  2. A script to reproduce the result

Dataset generation open issues

  1. We are currently using shard 1 only, which doesn't have enough data for some domains - we should preprocess a bigger part of the Pile to get good coverage for these.
  2. Need to publish the subset names
  3. Not clear whether to balance domain/pile data by number of samples or number of tokens
  4. Report validation loss split by domain/pile data

Fix HF Hub Upload Error

When using the trainer script, the last HF Hub update of the model repo fails with the message:
remote: Sorry, your push was rejected during YAML metadata verification:
remote: - Error: "model-index[0].results[0].dataset.config" must be a string
remote: -------------------------------------------------------------------------
remote: -------------------------------------------------------------------------
remote: Please find the documentation at:
remote: https://huggingface.co/docs/hub/model-cards#model-card-metadata
remote:
remote: -------------------------------------------------------------------------
To https://huggingface.co/Multi-Domain-Expert-Layers/expert-uspto
 ! [remote rejected] main -> main (pre-receive hook declined)
error: failed to push some refs to 'https://huggingface.co/Multi-Domain-Expert-Layers/expert-uspto'

Setup separate environments on Redmond.ai box

It's currently difficult to work on the box because it has a single user with a single environment, which frequently gets broken.

To solve it, we should either create separate users or install docker.

Automatic Training Scripts for All Expert Models

If most of the training setup is homogeneous except for the data_path args/config (I assume it is, since the experts started from the same seed LM), then we could write a script that trains the expert models sequentially and keeps the GPUs busy; a sketch follows.
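
A minimal sketch of such a driver, assuming train.sh were parameterized to read DATASET from the environment (today it is edited by hand, per the training steps above); the domain list is illustrative:

import os
import subprocess

DATASETS = ["arxiv", "github", "pubmed_central", "freelaw"]  # illustrative

for name in DATASETS:
    # Run one training job at a time so the GPUs stay busy back-to-back.
    env = {**os.environ, "DATASET": name}
    subprocess.run(["./train.sh"], env=env, check=True)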

Investigate Expert Models Having High Perplexity

Our analysis in #53 has shown that the expert models we had previously trained actually have a higher perplexity than the base model.

Here are some issues that may have caused this:

  • no warmup
  • LR too high
  • too few steps
  • mixing in pile data
  • too many gradient accumulation steps
  • measurement error

The expert models were trained with an old version of the trainer, so we don't know which wandb run they belong to or what the pile/domain losses were during training. Re-doing the training of one of the experts should help clarify.

Train 2nd batch of expert models

Train the 3B experts again, with the following variations:

Data sizes: 500K, 1M, 10M, and, if there are enough examples, 50M (basically the maximum data)

  1. 80% pile/20% expert
  2. Expert only

Both should train layers 16-20 and 16-25. Save checkpoints at 50k and 100k steps.

Domains to train:

  1. ArXiv
  2. Github
  3. Pubmed Central
  4. Pubmed Abstracts
  5. Philpapers
  6. Freelaw
  7. USPTO

Create minimal example of training on LUMI

Guidelines from Sampo:

  1. Single pure torch (no custom CUDA kernels) codebase
  2. Start with small model (< 10B params)
  3. Use this fork of Megatron-DeepSpeed

A potential way of implementing this flexibly is to mount the fork as a submodule in the MDEL repo. Example here.

The purpose of this ticket is to create a minimal script or a step-by-step guide for starting a training run of a GPT-NeoX model on LUMI.
