fine tuning on seqcls task with deepspeed hit RuntimeError: a leaf Variable that requires grad is being used in an in-place operation. (biomedlm, 14 comments, closed)

guathwa commented on June 25, 2024
fine tuning on seqcls task with deepspeed hit RuntimeError: a leaf Variable that requires grad is being used in an in-place operation.
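
For context, this RuntimeError is PyTorch's generic complaint about an in-place update of a leaf tensor that requires grad; a minimal standalone repro of the error class (purely illustrative, not the biomedlm/DeepSpeed code path) is:

import torch

w = torch.zeros(3, requires_grad=True)  # leaf tensor tracked by autograd
w += 1.0  # RuntimeError: a leaf Variable that requires grad is being used in an in-place operation.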

Comments (14)

J38 commented on June 25, 2024

Hi, I'll work today on running a basic fine-tuning example and looking at the memory footprint, and I'll get back to you!

guathwa commented on June 25, 2024

Thanks!

J38 commented on June 25, 2024

Hi I'll get you a sample command in the next day or two, but this link here explains using deepspeed:

https://huggingface.co/docs/transformers/main_classes/deepspeed#deployment-with-one-gpu

You should be using the deepspeed launcher command, and make sure you've installed deepspeed.

I think if you only have 1 GPU you're going to need to try ZeRO-offload ... this is explained in that link I provided.

But I'll try to get this working on my own and let you know ...
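
For orientation, the ZeRO-Offload setup described in that link amounts to a small config block; here is a minimal sketch as a Python dict, using the stage-2 offload form from the HF docs (an assumption for illustration; the actual deepspeed_config.json used for this issue appears later in the thread):

ds_config = {
    "zero_optimization": {
        "stage": 2,
        "offload_optimizer": {"device": "cpu", "pin_memory": True},  # ZeRO-Offload: keep optimizer state in CPU RAM
    },
    "fp16": {"enabled": True},
    "train_batch_size": "auto",
}
# Write this out to a JSON file and pass it via --deepspeed, or hand the dict directly to
# transformers.TrainingArguments(deepspeed=ds_config).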

J38 commented on June 25, 2024

This is another good link:

huggingface/transformers#8771 (comment)

J38 commented on June 25, 2024

Over the next few days I'll try to get this working so we have a great working example of fine-tuning the model with 1 GPU!

guathwa commented on June 25, 2024

Thank you, I will read up and also try it out on my 1 GPU!

J38 commented on June 25, 2024

I've gotten the code running and it uses 20GB of GPU memory and 50GB of RAM. So as long as the machine with your A100 has plenty of RAM this could work with 1 GPU.

Set up environment:

# create conda environment
conda create -n biomedlm python=3.8.12 pytorch=1.12.1 torchdata cudatoolkit=11.6.0 -c pytorch -c nvidia

# activate conda environment
conda activate biomedlm

# install python dependencies
# note that flash-attn can take around 30 minutes to build, so it is normal for it to appear to do nothing for that long
pip install flash-attn
pip install numpy
pip install transformers==4.26.0 datasets==2.9.0 omegaconf wandb
pip install fairscale
pip install accelerate
# deepspeed itself is also needed for the deepspeed launcher used below
pip install deepspeed
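
As a quick sanity check before launching (my suggestion, not part of the original instructions), the key packages can be verified from Python:

# verify the pieces the fine-tuning command below relies on
import torch, transformers, datasets, deepspeed
print(torch.__version__, torch.cuda.is_available())    # expect 1.12.1 and True
print(transformers.__version__, datasets.__version__)  # expect 4.26.0 and 2.9.0
print(deepspeed.__version__)                            # needed by the deepspeed launcher
try:
    import flash_attn                                   # optional, only needed for --use_flash true
    print("flash-attn available")
except ImportError:
    print("flash-attn not installed")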

DeepSpeed config: deepspeed_config.json

{
  "optimizer": {
    "type": "AdamW",
    "params": {
      "lr": 2e-06,
      "betas": [
        0.9,
        0.999
      ],
      "eps": 1e-8,
      "weight_decay": 0.0
    }
  },

  "scheduler": {
    "type": "WarmupDecayLR",
    "params": {
      "total_num_steps": "auto",
      "warmup_max_lr": 2e-06,
      "warmup_num_steps": "auto"
    }
  },

  "zero_optimization": {
    "stage": 1,
    "allgather_partitions": true,
    "allgather_bucket_size": 5e8,
    "reduce_scatter": true,
    "reduce_bucket_size": 5e8,
    "overlap_comm": true,
    "contiguous_gradients": true,
    "cpu_offload": true
  },
  
  "train_batch_size": "auto",

  "fp16": {
   "enabled": true
  }

}
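
One note on the "auto" fields, as I understand the Hugging Face Trainer integration: train_batch_size is filled in from the training arguments as per-device batch size × gradient accumulation steps × number of GPUs, so with the command below it resolves to 2:

# how "train_batch_size": "auto" resolves for the command below (my reading of the HF/DeepSpeed integration)
per_device_train_batch_size = 1
gradient_accumulation_steps = 2
world_size = 1  # --num_gpus 1
print(per_device_train_batch_size * gradient_accumulation_steps * world_size)  # -> 2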

Command I ran in seqcls directory:

task=pubmedqa_hf ; datadir=data/$task ; export WANDB_PROJECT=biomedical-nlp-eval

deepspeed --num_gpus 1 --num_nodes 1 run_seqcls_gpt.py --tokenizer_name stanford-crfm/pubmed_gpt_tokenizer --model_name_or_path /path/to/model --train_file $datadir/train.json --validation_file $datadir/dev.json --test_file $datadir/test.json --do_train --do_eval --do_predict --per_device_train_batch_size 1 --gradient_accumulation_steps 2 --learning_rate 2e-06 --warmup_ratio 0.5 --num_train_epochs 20  --max_seq_length 560  --logging_steps 100 --save_strategy no --evaluation_strategy no --output_dir pubmedqa-finetune-demo --overwrite_output_dir --fp16 --use_flash true  --seed 1 --run_name pubmedqa-finetune-demo --deepspeed deepspeed_config.json 

Please let me know if you can get this working!

J38 commented on June 25, 2024

So to summarize: it looks like you can run the sequence classification with 1 GPU and 40GB of GPU memory (maybe even 20GB) ... but I do think you are going to need something like 50GB of machine RAM to take advantage of the CPU offloading.

guathwa commented on June 25, 2024

That's really great news! Thank you so much for your help! I will try it out and let you know. This machine has plenty of RAM too.

guathwa commented on June 25, 2024

Hi J38, I am happy to share that I was able to complete the training following your instructions, without using --use_flash true. If I include --use_flash true, it gives me the following error. I am still trying to troubleshoot the cause. If you have any clue, do let me know. Thanks.

(biomedlm) dro@dro-DGX-Station:~/guathwa/pubmedgpt/finetune/seqcls_tr_dro$ deepspeed --num_gpus 1 --num_nodes 1 run_seqcls_gpt.py --tokenizer_name stanford-crfm/pubmed_gpt_tokenizer --model_name_or_path /home/dro/guathwa/pubmedgpt/finetune/seqcls_tr_dro/stanford-crfm-pubmedgpt --train_file $datadir/train.csv --validation_file $datadir/dev.csv --test_file $datadir/test.csv --do_train --do_eval --do_predict --per_device_train_batch_size 1 --gradient_accumulation_steps 2 --learning_rate 2e-06 --warmup_ratio 0.5 --num_train_epochs 20 --max_seq_length 560 --logging_steps 100 --save_strategy no --evaluation_strategy no --output_dir tr-finetune-demo --overwrite_output_dir --fp16 --use_flash true --seed 1 --run_name tr-finetune-demo --deepspeed deepspeed_config.json
[2023-02-09 12:38:17,346] [WARNING] [runner.py:186:fetch_hostfile] Unable to find hostfile, will proceed with training with local resources only.
[2023-02-09 12:38:17,590] [INFO] [runner.py:548:main] cmd = /home/dro/anaconda3/envs/biomedlm/bin/python -u -m deepspeed.launcher.launch --world_info=xxx --master_addr=127.0.0.1 --master_port=29500 --enable_each_rank_log=None run_seqcls_gpt.py --tokenizer_name stanford-crfm/pubmed_gpt_tokenizer --model_name_or_path /home/dro/guathwa/pubmedgpt/finetune/seqcls_tr_dro/stanford-crfm-pubmedgpt --train_file data/tr/train.csv --validation_file data/tr/dev.csv --test_file data/tr/test.csv --do_train --do_eval --do_predict --per_device_train_batch_size 1 --gradient_accumulation_steps 2 --learning_rate 2e-06 --warmup_ratio 0.5 --num_train_epochs 20 --max_seq_length 560 --logging_steps 100 --save_strategy no --evaluation_strategy no --output_dir tr-finetune-demo --overwrite_output_dir --fp16 --use_flash true --seed 1 --run_name tr-finetune-demo --deepspeed deepspeed_config.json
[2023-02-09 12:38:18,956] [INFO] [launch.py:142:main] WORLD INFO DICT: {'localhost': [0]}
[2023-02-09 12:38:18,956] [INFO] [launch.py:148:main] nnodes=1, num_local_procs=1, node_rank=0
[2023-02-09 12:38:18,956] [INFO] [launch.py:161:main] global_rank_mapping=defaultdict(<class 'list'>, {'localhost': [0]})
[2023-02-09 12:38:18,956] [INFO] [launch.py:162:main] dist_world_size=1
[2023-02-09 12:38:18,956] [INFO] [launch.py:164:main] Setting CUDA_VISIBLE_DEVICES=0
[2023-02-09 12:38:24,928] [INFO] [comm.py:657:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
Traceback (most recent call last):
  File "run_seqcls_gpt.py", line 634, in <module>
    main()
  File "run_seqcls_gpt.py", line 221, in main
    model_args, data_args, training_args = parser.parse_args_into_dataclasses()
  File "/home/dro/anaconda3/envs/biomedlm/lib/python3.8/site-packages/transformers/hf_argparser.py", line 341, in parse_args_into_dataclasses
    raise ValueError(f"Some specified arguments are not used by the HfArgumentParser: {remaining_args}")
ValueError: Some specified arguments are not used by the HfArgumentParser: ['--use_flash', 'true']
[2023-02-09 12:38:25,968] [INFO] [launch.py:318:sigkill_handler] Killing subprocess 78542
[2023-02-09 12:38:25,969] [ERROR] [launch.py:324:sigkill_handler] ['/home/dro/anaconda3/envs/biomedlm/bin/python', '-u', 'run_seqcls_gpt.py', '--local_rank=0', '--tokenizer_name', 'stanford-crfm/pubmed_gpt_tokenizer', '--model_name_or_path', '/home/dro/guathwa/pubmedgpt/finetune/seqcls_tr_dro/stanford-crfm-pubmedgpt', '--train_file', 'data/tr/train.csv', '--validation_file', 'data/tr/dev.csv', '--test_file', 'data/tr/test.csv', '--do_train', '--do_eval', '--do_predict', '--per_device_train_batch_size', '1', '--gradient_accumulation_steps', '2', '--learning_rate', '2e-06', '--warmup_ratio', '0.5', '--num_train_epochs', '20', '--max_seq_length', '560', '--logging_steps', '100', '--save_strategy', 'no', '--evaluation_strategy', 'no', '--output_dir', 'tr-finetune-demo', '--overwrite_output_dir', '--fp16', '--use_flash', 'true', '--seed', '1', '--run_name', 'tr-finetune-demo', '--deepspeed', 'deepspeed_config.json'] exits with return code = 1

J38 commented on June 25, 2024

By the way, I wasn't seeing any performance gain using flash attention; not sure if it just doesn't help or if it's a bug in my system ... the error you're reporting is because I forgot to push the updated code that has the flash attention option ... will try to push that soon!
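
For reference, a custom flag like --use_flash is normally exposed to HfArgumentParser through a dataclass field; here is a hypothetical sketch of that pattern (the actual field name and wiring in the updated run_seqcls_gpt.py may differ):

# hypothetical sketch: making HfArgumentParser accept a --use_flash flag
from dataclasses import dataclass, field
from transformers import HfArgumentParser, TrainingArguments

@dataclass
class ModelArguments:
    use_flash: bool = field(
        default=False,
        metadata={"help": "Use FlashAttention kernels if flash-attn is installed."},
    )

parser = HfArgumentParser((ModelArguments, TrainingArguments))
model_args, training_args = parser.parse_args_into_dataclasses(
    args=["--output_dir", "demo", "--use_flash", "true"]
)
print(model_args.use_flash)  # True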

J38 commented on June 25, 2024

Okay I pushed the updated code!

guathwa commented on June 25, 2024

Saw the updated code. I will close this issue. Thanks for the great help!

brando90 commented on June 25, 2024

Is there any code that works, e.g. a Colab? Thanks!
