fine tuning on seqcls task with deepspeed hit RuntimeError: a leaf Variable that requires grad is being used in an in-place operation. (biomedlm, 14 comments, closed)

guathwa commented on June 25, 2024
fine tuning on seqcls task with deepspeed hit RuntimeError: a leaf Variable that requires grad is being used in an in-place operation.
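
For context, this RuntimeError is PyTorch's generic complaint about an in-place update of a leaf tensor that requires grad; a minimal standalone repro of the error class (purely illustrative, not the biomedlm/DeepSpeed code path) is:

import torch

w = torch.zeros(3, requires_grad=True)  # leaf tensor tracked by autograd
w += 1.0  # RuntimeError: a leaf Variable that requires grad is being used in an in-place operation.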

Comments (14)

J38 commented on June 25, 2024

Hi, I'll work today on running a basic fine-tuning example and looking at the memory footprint, and I'll get back to you!

guathwa commented on June 25, 2024

Thanks!

J38 commented on June 25, 2024

Hi I'll get you a sample command in the next day or two, but this link here explains using deepspeed:

https://huggingface.co/docs/transformers/main_classes/deepspeed#deployment-with-one-gpu

You should be using the deepspeed launcher command, and make sure you've installed deepspeed.

I think if you only have 1 GPU you're going to need to try ZeRO-offload ... this is explained in that link I provided.

But I'll try to get this working on my own and let you know ...
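
For orientation, the ZeRO-Offload setup described in that link amounts to a small config block; here is a minimal sketch as a Python dict, using the stage-2 offload form from the HF docs (an assumption for illustration; the actual deepspeed_config.json used for this issue appears later in the thread):

ds_config = {
    "zero_optimization": {
        "stage": 2,
        "offload_optimizer": {"device": "cpu", "pin_memory": True},  # ZeRO-Offload: keep optimizer state in CPU RAM
    },
    "fp16": {"enabled": True},
    "train_batch_size": "auto",
}
# Write this out to a JSON file and pass it via --deepspeed, or hand the dict directly to
# transformers.TrainingArguments(deepspeed=ds_config).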

J38 commented on June 25, 2024

This is another good link:

huggingface/transformers#8771 (comment)

J38 commented on June 25, 2024

Over the next few days I'll try to get this working so we have a great working example of fine-tuning the model with 1 GPU!

guathwa commented on June 25, 2024

Thank you, I will read up and also try it out on my 1 GPU!

J38 commented on June 25, 2024

I've gotten the code running and it uses 20GB of GPU memory and 50GB of RAM. So as long as the machine with your A100 has plenty of RAM this could work with 1 GPU.

Set up environment:

# create conda environment
conda create -n biomedlm python=3.8.12 pytorch=1.12.1 torchdata cudatoolkit=11.6.0 -c pytorch -c nvidia

# activate conda environment
conda activate biomedlm

# install python dependencies
# note that flash-attn can take around 30 minutes to build, so it is normal for it to appear to do nothing for that long
pip install flash-attn
pip install numpy
pip install transformers==4.26.0 datasets==2.9.0 omegaconf wandb
pip install fairscale
pip install accelerate
# deepspeed itself is also needed for the deepspeed launcher used below
pip install deepspeed
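
As a quick sanity check before launching (my suggestion, not part of the original instructions), the key packages can be verified from Python:

# verify the pieces the fine-tuning command below relies on
import torch, transformers, datasets, deepspeed
print(torch.__version__, torch.cuda.is_available())    # expect 1.12.1 and True
print(transformers.__version__, datasets.__version__)  # expect 4.26.0 and 2.9.0
print(deepspeed.__version__)                            # needed by the deepspeed launcher
try:
    import flash_attn                                   # optional, only needed for --use_flash true
    print("flash-attn available")
except ImportError:
    print("flash-attn not installed")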

DeepSpeed config: deepspeed_config.json

{
  "optimizer": {
    "type": "AdamW",
    "params": {
      "lr": 2e-06,
      "betas": [
        0.9,
        0.999
      ],
      "eps": 1e-8,
      "weight_decay": 0.0
    }
  },

  "scheduler": {
    "type": "WarmupDecayLR",
    "params": {
      "total_num_steps": "auto",
      "warmup_max_lr": 2e-06,
      "warmup_num_steps": "auto"
    }
  },

  "zero_optimization": {
    "stage": 1,
    "allgather_partitions": true,
    "allgather_bucket_size": 5e8,
    "reduce_scatter": true,
    "reduce_bucket_size": 5e8,
    "overlap_comm": true,
    "contiguous_gradients": true,
    "cpu_offload": true
  },
  
  "train_batch_size": "auto",

  "fp16": {
   "enabled": true
  }

}
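
One note on the "auto" fields, as I understand the Hugging Face Trainer integration: train_batch_size is filled in from the training arguments as per-device batch size × gradient accumulation steps × number of GPUs, so with the command below it resolves to 2:

# how "train_batch_size": "auto" resolves for the command below (my reading of the HF/DeepSpeed integration)
per_device_train_batch_size = 1
gradient_accumulation_steps = 2
world_size = 1  # --num_gpus 1
print(per_device_train_batch_size * gradient_accumulation_steps * world_size)  # -> 2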

Command I ran in seqcls directory:

task=pubmedqa_hf ; datadir=data/$task ; export WANDB_PROJECT=biomedical-nlp-eval

deepspeed --num_gpus 1 --num_nodes 1 run_seqcls_gpt.py --tokenizer_name stanford-crfm/pubmed_gpt_tokenizer --model_name_or_path /path/to/model --train_file $datadir/train.json --validation_file $datadir/dev.json --test_file $datadir/test.json --do_train --do_eval --do_predict --per_device_train_batch_size 1 --gradient_accumulation_steps 2 --learning_rate 2e-06 --warmup_ratio 0.5 --num_train_epochs 20  --max_seq_length 560  --logging_steps 100 --save_strategy no --evaluation_strategy no --output_dir pubmedqa-finetune-demo --overwrite_output_dir --fp16 --use_flash true  --seed 1 --run_name pubmedqa-finetune-demo --deepspeed deepspeed_config.json 

Please let me know if you can get this working!

J38 commented on June 25, 2024

So to summarize: it looks like you can run the sequence classification with 1 GPU and 40GB of GPU memory (maybe even 20GB) ... but I do think you are going to need something like 50GB of machine RAM to take advantage of the CPU offloading.

guathwa commented on June 25, 2024

That's really great news! Thank you so much for your help! I will try it out and let you know. This machine has plenty of RAM too.

guathwa commented on June 25, 2024

Hi J38, I am happy to share that I was able to complete the training following your instructions, without using --use_flash true. If I include --use_flash true, it gives me the following error. I am still trying to troubleshoot the cause. If you have any clue, do let me know. Thanks.

(biomedlm) dro@dro-DGX-Station:~/guathwa/pubmedgpt/finetune/seqcls_tr_dro$ deepspeed --num_gpus 1 --num_nodes 1 run_seqcls_gpt.py --tokenizer_name stanford-crfm/pubmed_gpt_tokenizer --model_name_or_path /home/dro/guathwa/pubmedgpt/finetune/seqcls_tr_dro/stanford-crfm-pubmedgpt --train_file $datadir/train.csv --validation_file $datadir/dev.csv --test_file $datadir/test.csv --do_train --do_eval --do_predict --per_device_train_batch_size 1 --gradient_accumulation_steps 2 --learning_rate 2e-06 --warmup_ratio 0.5 --num_train_epochs 20 --max_seq_length 560 --logging_steps 100 --save_strategy no --evaluation_strategy no --output_dir tr-finetune-demo --overwrite_output_dir --fp16 --use_flash true --seed 1 --run_name tr-finetune-demo --deepspeed deepspeed_config.json
[2023-02-09 12:38:17,346] [WARNING] [runner.py:186:fetch_hostfile] Unable to find hostfile, will proceed with training with local resources only.
[2023-02-09 12:38:17,590] [INFO] [runner.py:548:main] cmd = /home/dro/anaconda3/envs/biomedlm/bin/python -u -m deepspeed.launcher.launch --world_info=xxx --master_addr=127.0.0.1 --master_port=29500 --enable_each_rank_log=None run_seqcls_gpt.py --tokenizer_name stanford-crfm/pubmed_gpt_tokenizer --model_name_or_path /home/dro/guathwa/pubmedgpt/finetune/seqcls_tr_dro/stanford-crfm-pubmedgpt --train_file data/tr/train.csv --validation_file data/tr/dev.csv --test_file data/tr/test.csv --do_train --do_eval --do_predict --per_device_train_batch_size 1 --gradient_accumulation_steps 2 --learning_rate 2e-06 --warmup_ratio 0.5 --num_train_epochs 20 --max_seq_length 560 --logging_steps 100 --save_strategy no --evaluation_strategy no --output_dir tr-finetune-demo --overwrite_output_dir --fp16 --use_flash true --seed 1 --run_name tr-finetune-demo --deepspeed deepspeed_config.json
[2023-02-09 12:38:18,956] [INFO] [launch.py:142:main] WORLD INFO DICT: {'localhost': [0]}
[2023-02-09 12:38:18,956] [INFO] [launch.py:148:main] nnodes=1, num_local_procs=1, node_rank=0
[2023-02-09 12:38:18,956] [INFO] [launch.py:161:main] global_rank_mapping=defaultdict(<class 'list'>, {'localhost': [0]})
[2023-02-09 12:38:18,956] [INFO] [launch.py:162:main] dist_world_size=1
[2023-02-09 12:38:18,956] [INFO] [launch.py:164:main] Setting CUDA_VISIBLE_DEVICES=0
[2023-02-09 12:38:24,928] [INFO] [comm.py:657:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
Traceback (most recent call last):
  File "run_seqcls_gpt.py", line 634, in <module>
    main()
  File "run_seqcls_gpt.py", line 221, in main
    model_args, data_args, training_args = parser.parse_args_into_dataclasses()
  File "/home/dro/anaconda3/envs/biomedlm/lib/python3.8/site-packages/transformers/hf_argparser.py", line 341, in parse_args_into_dataclasses
    raise ValueError(f"Some specified arguments are not used by the HfArgumentParser: {remaining_args}")
ValueError: Some specified arguments are not used by the HfArgumentParser: ['--use_flash', 'true']
[2023-02-09 12:38:25,968] [INFO] [launch.py:318:sigkill_handler] Killing subprocess 78542
[2023-02-09 12:38:25,969] [ERROR] [launch.py:324:sigkill_handler] ['/home/dro/anaconda3/envs/biomedlm/bin/python', '-u', 'run_seqcls_gpt.py', '--local_rank=0', '--tokenizer_name', 'stanford-crfm/pubmed_gpt_tokenizer', '--model_name_or_path', '/home/dro/guathwa/pubmedgpt/finetune/seqcls_tr_dro/stanford-crfm-pubmedgpt', '--train_file', 'data/tr/train.csv', '--validation_file', 'data/tr/dev.csv', '--test_file', 'data/tr/test.csv', '--do_train', '--do_eval', '--do_predict', '--per_device_train_batch_size', '1', '--gradient_accumulation_steps', '2', '--learning_rate', '2e-06', '--warmup_ratio', '0.5', '--num_train_epochs', '20', '--max_seq_length', '560', '--logging_steps', '100', '--save_strategy', 'no', '--evaluation_strategy', 'no', '--output_dir', 'tr-finetune-demo', '--overwrite_output_dir', '--fp16', '--use_flash', 'true', '--seed', '1', '--run_name', 'tr-finetune-demo', '--deepspeed', 'deepspeed_config.json'] exits with return code = 1

J38 commented on June 25, 2024

By the way, I wasn't seeing any performance gain using flash attention; not sure if it just doesn't help or if it's a bug in my system ... the error you're reporting is because I forgot to push the updated code that has the flash attention option ... will try to push that soon!
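
For reference, a custom flag like --use_flash is normally exposed to HfArgumentParser through a dataclass field; here is a hypothetical sketch of that pattern (the actual field name and wiring in the updated run_seqcls_gpt.py may differ):

# hypothetical sketch: making HfArgumentParser accept a --use_flash flag
from dataclasses import dataclass, field
from transformers import HfArgumentParser, TrainingArguments

@dataclass
class ModelArguments:
    use_flash: bool = field(
        default=False,
        metadata={"help": "Use FlashAttention kernels if flash-attn is installed."},
    )

parser = HfArgumentParser((ModelArguments, TrainingArguments))
model_args, training_args = parser.parse_args_into_dataclasses(
    args=["--output_dir", "demo", "--use_flash", "true"]
)
print(model_args.use_flash)  # True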

J38 commented on June 25, 2024

Okay I pushed the updated code!

guathwa commented on June 25, 2024

Saw the updated code. I will close this issue. Thanks for the great help!

brando90 commented on June 25, 2024

Is there any code that works, e.g. a Colab? Thanks!
