Comments (20)
I will take notes from these issues and update the documentation with some clear examples of fine-tuning on 1 GPU ... I think 1 GPU with cpu_offloading is going to be a common use case for a lot of users.
from biomedlm.
Do you want to fine-tune on MedQA or just evaluate a model?
from biomedlm.
I work with @manusikka on the class research project. We are looking to evaluate a model, establish a baseline, and confirm the reported result: "new state of the art for the MedQA task of 50.3%". Any guidance would be appreciated. Thanks.
from biomedlm.
num_devices = number of GPUs
checkpoint = file path of hugging face model checkpoint dir
These two settings are related and depend on number of GPUs and how much memory the GPUs have:
train_per_device_batch_size = examples per device
grad_accum = number of steps to accumulate gradient
batch_size = train_per_device_batch_size x num_devices x grad_accum
So for example if you want batch_size=8, you'd set
train_per_device_batch_size=1, num_devices=8, grad_accum=1
(assuming you have 8 GPUs)
If you want batch_size=32 you might do:
train_per_device_batch_size=1, num_devices=8, grad_accum=4
You could try train_per_device_batch_size=2, but you may run out of GPU memory.
lr = learning rate , for example 2e-06
num_train_epochs = number of epochs, for example 10
numerical_format = bf16
seed = random seed, set this differently for each experiment to something like 1,2, or 3
you can remove data_seed option
run_name = name for your experiment
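The batch-size relationship above can be checked with a few lines of Python. This is only a sketch of the arithmetic; the function names are hypothetical and are not part of the fine-tuning script.

```python
def effective_batch_size(per_device: int, num_devices: int, grad_accum: int) -> int:
    """batch_size = train_per_device_batch_size x num_devices x grad_accum."""
    return per_device * num_devices * grad_accum


def grad_accum_for(target_batch_size: int, per_device: int, num_devices: int) -> int:
    """Gradient accumulation steps needed to reach a target batch size."""
    per_step = per_device * num_devices
    if target_batch_size % per_step != 0:
        raise ValueError(
            f"target batch size {target_batch_size} is not a multiple of "
            f"per_device x num_devices = {per_step}"
        )
    return target_batch_size // per_step


# The two examples from the list above (8 GPUs):
print(effective_batch_size(per_device=1, num_devices=8, grad_accum=1))  # 8
print(effective_batch_size(per_device=1, num_devices=8, grad_accum=4))  # 32

# On a single GPU, the same batch_size=32 needs more accumulation steps:
print(grad_accum_for(target_batch_size=32, per_device=1, num_devices=1))  # 32
```

With fewer GPUs, raising grad_accum keeps the batch size the same at the cost of more steps per update.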
Let me know if that clarifies and if you have any other questions ...
One note: the 50.3% is an average with seed=1, seed=2, and seed=3 ... so any given experiment won't yield that exact number, and experiments on your machine will probably yield different results since randomness will be different ... so don't expect to fall on 50.3% exactly or even on average, but hopefully it should be close to that on average
from biomedlm.
Thank you @J38 for such a detailed explanation. It appears that many of the parameters you've mentioned are needed for training. I am confused: why are we training a model if we are only trying to run evaluation on an existing model? Or are we first building a model AND then running eval on it, all in one batch command?
Also, can you clarify a conceptual question and let me know if I am thinking about this correctly:
BioMedLM is a model that has been already trained on data and is saved to HuggingFace: https://huggingface.co/stanford-crfm/BioMedLM.
I should be able to just download the BioMedLM model and run evaluation on MedQA WITHOUT training, right?
For example, I would do something like this:
tokenizer = GPT2Tokenizer.from_pretrained("stanford-crfm/BioMedLM")
model = GPT2LMHeadModel.from_pretrained("stanford-crfm/BioMedLM").to(device)
input_ids = tokenizer.encode(
"A 20-year-old woman presents with menorrhagia for the past several years..... Which of the following is the most likely cause of this patient’s symptoms? A: Factor V Leiden ...", return_tensors="pt"
).to(device)
sample_output = model.generate(input_ids, do_sample=True, max_length=50, top_k=50)
#TODO: get output from tokenizer.decode(sample_output[0], skip_special_tokens=True)
Then, compare output to correct label and see if there is an exact match to the answer.
Is this evaluation method appropriate?
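For reference, the exact-match comparison described above could be sketched as follows. This is a naive illustration, not the evaluation used for the reported 50.3%; `extract_choice` and `accuracy` are hypothetical helpers, and the sketch assumes the generated text starts with the option letter.

```python
import re

# Matches an option letter at the very start of the text, e.g. "E: ...", "(C)", " B."
_CHOICE_AT_START = re.compile(r"^\s*\(?([A-E])\)?(?=[\s:.)]|$)")


def extract_choice(generated_text):
    """Return the leading option letter (A-E), or None if there is none."""
    match = _CHOICE_AT_START.match(generated_text)
    return match.group(1) if match else None


def accuracy(predictions, gold_labels):
    """Fraction of predictions that exactly match the gold option letter."""
    if len(predictions) != len(gold_labels):
        raise ValueError("predictions and gold_labels must have the same length")
    if not gold_labels:
        raise ValueError("no examples to score")
    correct = sum(p == g for p, g in zip(predictions, gold_labels))
    return correct / len(gold_labels)


print(extract_choice("E: Von Willebrand disease"))  # E
print(extract_choice("An"))                         # None
```

A free-form continuation often will not start with an option letter at all, so unparseable outputs count as wrong here.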
from biomedlm.
Also, I am trying to run evaluation on a MedQA question with the model, as in:
```python
question = ("A 20-year-old woman presents with menorrhagia for the past several years."
    "She says that her menses “have always been heavy”, and she has experienced easy bruising for as long as she can remember."
    "Family history is significant for her mother, who had similar problems with bruising easily. "
    "The patient's vital signs include: heart rate 98/min, respiratory rate 14/min, temperature 36.1°C (96.9°F),"
    " and blood pressure 110/87 mm Hg. Physical examination is unremarkable. "
    " Laboratory tests show the following: platelet count 200,000/mm3, PT 12 seconds,"
    " and PTT 43 seconds. Which of the following is the most likely cause of this patient’s symptoms?"
    "A: Factor V Leiden B: Hemophilia A C: Lupus anticoagulant D: Protein C deficiency E Von Willebrand disease"
)
input_ids = tokenizer.encode(
    question, return_tensors="pt"
).to(device)
sample_output = model.generate(input_ids, do_sample=True, max_length=50, top_k=50)
print("Output:\n" + 100 * "-")
print(tokenizer.decode(sample_output[0], skip_special_tokens=True))
```
Here is an output:
The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's attention_mask to obtain reliable results.
Setting pad_token_id to eos_token_id:28895 for open-end generation.
Input length of input_ids is 209, but max_length is set to 50. This can lead to unexpected behavior. You should consider increasing max_new_tokens.
Output:
A 20-year-old woman presents with menorrhagia for the past several years.She says that her menses “have always been heavy”, and she has experienced easy bruising for as long as she can remember.Family history is significant for her mother, who had similar problems with bruising easily. The patient's vital signs include: heart rate 98/min, respiratory rate 14/min, temperature 36.1°C (96.9°F), and blood pressure 110/87 mm Hg. Physical examination is unremarkable. Laboratory tests show the following: platelet count 200,000/mm3, PT 12 seconds, and PTT 43 seconds. Which of the following is the most likely cause of this patient’s symptoms?A: Factor V Leiden B: Hemophilia A C: Lupus anticoagulant D: Protein C deficiency E Von Willebrand disease An
Notice that the answer seems to be truncated (the very last "An").
Is there a way to use the above code snippet to display the answer to the multiple-choice MedQA question? Thanks!
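The three warnings in the output point at the likely cause of the truncation: the prompt is 209 tokens but max_length is 50, and no attention mask or pad token id is passed. A minimal sketch that addresses those warnings is below. It assumes `model` and `tokenizer` are the Hugging Face objects loaded earlier in this thread; `generate_continuation` is a hypothetical helper name, and greedy decoding is an assumption made here for repeatability.

```python
def generate_continuation(model, tokenizer, prompt, max_new_tokens=20):
    """Generate a continuation for `prompt` and return only the new text."""
    # Calling the tokenizer (rather than tokenizer.encode) also returns the
    # attention_mask, which the first warning asks for.
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

    output = model.generate(
        **inputs,
        # max_new_tokens counts tokens after the prompt, so a 209-token
        # prompt no longer exhausts the budget the way max_length=50 did.
        max_new_tokens=max_new_tokens,
        do_sample=False,
        # Set explicitly to avoid the "Setting pad_token_id" notice.
        pad_token_id=tokenizer.eos_token_id,
    )

    # Drop the echoed prompt and decode only the generated tokens.
    prompt_len = inputs["input_ids"].shape[1]
    return tokenizer.decode(output[0][prompt_len:], skip_special_tokens=True)
```

Usage would be `print(generate_continuation(model, tokenizer, question))`. This only fixes the truncation; it does not make the pretrained model answer multiple-choice questions reliably.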
from biomedlm.
I was able to run the following command in the terminal:
python -m torch.distributed.launch --nproc_per_node=1 --nnodes=1 --node_rank=0 \
  run_multiple_choice.py --tokenizer_name stanford-crfm/pubmed_gpt_tokenizer \
  --model_name_or_path /root/.cache/huggingface/hub/models--stanford-crfm--BioMedLM --stanford-crfm--BioMedLM \
  --train_file data/medqa_usmle_hf/train.json --validation_file data/medqa_usmle_hf/dev.json \
  --test_file data/medqa_usmle_hf/test.json --do_train --do_eval --do_predict \
  --per_device_train_batch_size 1 --per_device_eval_batch_size 1 --gradient_accumulation_steps 1 \
  --learning_rate 2e-06 --warmup_ratio 0.5 --num_train_epochs 10 --max_seq_length 512 \
  --bf16 --seed 1 --data_seed 1 --logging_first_step --logging_steps 20 \
  --save_strategy no --evaluation_strategy steps --eval_steps 500 --run_name alex \
  --output_dir trash/ \
  --overwrite_output_dir
I see all THREE flags: do_train, do_eval and do_predict. Should I be able to use just do_eval for my evaluation?
Where can I see the results from eval? Thank you.
from biomedlm.
I've found this tutorial on multi-choice inference: https://huggingface.co/docs/transformers/tasks/multiple_choice#inference
Are we supposed to train our BioMedLM on Multi-Choice task, before running inference, as in this example: https://huggingface.co/docs/transformers/tasks/multiple_choice#train ?
Thank you.
from biomedlm.
The results will be printed out after the training is complete. I think do_eval will just work for eval. That command runs fine-tuning for multiple choice, and at the end it prints out the results and puts .json files in the directory for the fine-tuned model.
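To inspect those .json files after a run, something like the following sketch could be used. It makes no assumption about the file names; it simply loads every .json file found in the output directory. `load_json_results` is a hypothetical helper, and `trash/` is the --output_dir from the command above.

```python
import json
from pathlib import Path


def load_json_results(output_dir):
    """Load every *.json file in `output_dir`, keyed by file name."""
    results = {}
    for path in sorted(Path(output_dir).glob("*.json")):
        try:
            with path.open(encoding="utf-8") as handle:
                results[path.name] = json.load(handle)
        except json.JSONDecodeError:
            # Skip files that are not valid JSON instead of failing the scan.
            continue
    return results


if __name__ == "__main__":
    for name, content in load_json_results("trash/").items():
        print(name)
        print(json.dumps(content, indent=2))
```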
from biomedlm.
Thank you, @J38. I appreciate your response.
I am running the following command on a single GPU (on https://colab.research.google.com/ using a Pro+ GPU):
task=medqa_usmle_hf
datadir=data/$task
outdir=runs/$task/GPT2
mkdir -p $outdir
python -m torch.distributed.launch --nproc_per_node=1 --nnodes=1 --node_rank=0 \
  run_multiple_choice.py --tokenizer_name stanford-crfm/pubmed_gpt_tokenizer \
  --model_name_or_path stanford-crfm/BioMedLM \
  --train_file data/medqa_usmle_hf/train.json --validation_file data/medqa_usmle_hf/dev.json \
  --test_file data/medqa_usmle_hf/test.json --do_train --do_eval --do_predict \
  --per_device_train_batch_size 1 --per_device_eval_batch_size 1 --gradient_accumulation_steps 1 \
  --learning_rate 2e-06 --warmup_ratio 0.5 --num_train_epochs 10 --max_seq_length 512 \
  --fp16 --seed 1 --data_seed 1 --logging_first_step --logging_steps 20 \
  --save_strategy no --evaluation_strategy steps --eval_steps 500 --run_name alex \
  --output_dir trash/ \
  --overwrite_output_dir
I've been experimenting with
export 'PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:128'
but getting the same error.
Do you have recommendations for parameters when running train/eval on single GPU?
Thanks
from biomedlm.
You're going to have to use cpu_offloading if you're trying to train this on a single GPU.
from biomedlm.
Here is a thread where I got it working on 1 GPU for sequence classification:
from biomedlm.
I think it may be sufficient to just update the deepspeed config to use cpu_offloading ... there is an example deepspeed config in that thread I shared in the previous comment.
from biomedlm.
What this will do is drop information to machine RAM allowing you to work with much larger models at the cost of running much more slowly. But it is the only option for a model this large when you don't have a lot of GPU memory ...
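As an illustration of what such a config might look like, here is a minimal sketch that writes a DeepSpeed config with ZeRO optimizer offloading to CPU. This is NOT the config from the referenced thread; the ZeRO stage, the AdamW optimizer and the "auto" values (which the Hugging Face Trainer integration fills in from the command-line arguments) are all assumptions to review before use.

```python
import json


def build_offload_config(stage=2):
    """Return a DeepSpeed config dict that offloads optimizer state to CPU RAM."""
    return {
        "fp16": {"enabled": "auto"},
        "zero_optimization": {
            "stage": stage,
            # Keep optimizer state in machine RAM instead of GPU memory.
            "offload_optimizer": {"device": "cpu", "pin_memory": True},
        },
        "optimizer": {
            "type": "AdamW",
            "params": {
                "lr": "auto",
                "betas": "auto",
                "eps": "auto",
                "weight_decay": "auto",
            },
        },
        "train_micro_batch_size_per_gpu": "auto",
        "gradient_accumulation_steps": "auto",
        "train_batch_size": "auto",
    }


def write_offload_config(path="deepspeed_config.json", stage=2):
    """Write the config to `path` so it can be passed via --deepspeed."""
    with open(path, "w", encoding="utf-8") as handle:
        json.dump(build_offload_config(stage), handle, indent=2)
    return path
```

Calling `write_offload_config()` produces a file that can be passed to the training script with `--deepspeed deepspeed_config.json`.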
from biomedlm.
You will need to use DeepSpeed rather than the torch distributed launch ... so I can see if I can get an example working for the multiple-choice code. It should be similar to what I posted for the sequence classification example.
from biomedlm.
@J38 Thank you for the guidance. We've just got deepspeed to work!
Here is the code in Jupyter Notebook:
!pip install fairscale
!pip install accelerate
!pip install deepspeed
Here is the command line that worked (but ran VERY slowly):
```shell
task=medqa_usmle_hf ; datadir=data/$task ; export WANDB_PROJECT=biomedical-nlp-eval
deepspeed --num_gpus 1 --num_nodes 1 run_multiple_choice.py --tokenizer_name stanford-crfm/pubmed_gpt_tokenizer --model_name_or_path stanford-crfm/BioMedLM --train_file $datadir/train.json --validation_file $datadir/dev.json --test_file $datadir/test.json --do_train --do_eval --do_predict --per_device_train_batch_size 1 --gradient_accumulation_steps 2 --learning_rate 2e-06 --warmup_ratio 0.5 --num_train_epochs 20 --max_seq_length 560 --logging_steps 100 --save_strategy no --evaluation_strategy no --output_dir medqa-finetune-demo --overwrite_output_dir --fp16 --seed 1 --run_name medqa-finetune-demo --deepspeed deepspeed_config.json
```
The deepspeed_config.json was taken from this thread: #9
from biomedlm.
Just to summarize, there are several ways to run a fine-tuning process, including:
- plain on 1 GPU
- torch.distributed on multiple GPUs
- deepspeed on multiple GPUs
If you use deepspeed, the deepspeed config will determine optimizer settings. So for instance that config sets the learning rate, so make sure you review the deepspeed config and set the training parameters the way you want for the experiment.
I see the --learning_rate 2e-06 in your command. It's possible deepspeed will just pick this up, but I would advise carefully reviewing the config to make sure all of the settings are what you want.
from biomedlm.
Now it happens the deepspeed config I showed had learning rate 2e-06 ... but just wanted to let you know that that config will influence the optimizer settings, because deepspeed executes the optimization.
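A quick way to check that the config and the command line agree on the learning rate is sketched below. `resolve_learning_rate` is a hypothetical helper; it assumes the config keeps the rate under optimizer.params.lr, and that an "auto" value (or no optimizer section) means the command-line value is used. How a real mismatch is handled depends on the library versions, so the sketch only flags it.

```python
import json


def resolve_learning_rate(ds_config, cli_lr):
    """Return the learning rate that should apply, or raise on a mismatch."""
    params = ds_config.get("optimizer", {}).get("params", {})
    config_lr = params.get("lr", "auto")
    if config_lr == "auto":
        # Nothing fixed in the config, so the command-line value applies.
        return float(cli_lr)
    if float(config_lr) != float(cli_lr):
        raise ValueError(
            f"deepspeed config sets lr={config_lr} but the command line "
            f"passes --learning_rate {cli_lr}; make them agree"
        )
    return float(config_lr)


def resolve_learning_rate_from_file(config_path, cli_lr):
    """Same check, reading the deepspeed config from a JSON file."""
    with open(config_path, encoding="utf-8") as handle:
        return resolve_learning_rate(json.load(handle), cli_lr)
```

For the command above, the check would be `resolve_learning_rate_from_file("deepspeed_config.json", 2e-06)`.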
from biomedlm.
It is expected to be really slow, sorry, but training a model this large on 1 GPU is going to take a bit of time vs. using multiple GPUs. I think 8 GPUs take 1.5h to fine tune on this set, so it will be substantially slower with 1 GPU and cpu_offloading.
from biomedlm.
The PubMedQA task should only take about 4 hours, but that is a much smaller training set ...
from biomedlm.