stanford-crfm commented on June 29, 2024

How to run the evaluator for MedQA-USMLE

from biomedlm.

Comments (20)

J38 commented on June 29, 2024

I will take notes from these issues and update the documentation with some clear examples of fine-tuning on 1 GPU. I think 1 GPU with cpu_offloading is going to be a common use case for a lot of users.

J38 commented on June 29, 2024

Do you want to fine-tune on MedQA or just run an evaluation of a model?

githubusera commented on June 29, 2024

I work with @manusikka on the class research project. We are looking to evaluate a model, establish a baseline, and confirm the reported result: "new state of the art for the MedQA task of 50.3%". Any guidance would be appreciated. Thanks.

J38 commented on June 29, 2024

num_devices = number of GPUs
checkpoint = file path of hugging face model checkpoint dir

These two settings are related and depend on the number of GPUs and how much memory each GPU has:

train_per_device_batch_size = examples per device
grad_accum = number of steps to accumulate gradient

batch_size = train_per_device_batch_size x num_devices x grad_accum

So for example if you want batch_size=8, you'd set
train_per_device_batch_size=1, num_devices=8, grad_accum=1

(assuming you have 8 GPUs)

If you want batch_size=32 you might do:

train_per_device_batch_size=1, num_devices=8, grad_accum=4

You could try train_per_device_batch_size=2, but you may run out of GPU memory.
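
The batch-size arithmetic above can be sketched in a few lines of Python. The function name `effective_batch_size` is made up for illustration and is not part of the BioMedLM scripts; it only multiplies the three settings described in this comment.

```python
def effective_batch_size(train_per_device_batch_size: int,
                         num_devices: int,
                         grad_accum: int) -> int:
    """Number of examples that contribute to one optimizer step."""
    return train_per_device_batch_size * num_devices * grad_accum


# The two configurations from the comment above (8 GPUs assumed).
print(effective_batch_size(1, 8, 1))  # 8
print(effective_batch_size(1, 8, 4))  # 32

# With a single GPU, gradient accumulation has to make up the difference.
print(effective_batch_size(1, 1, 8))  # 8
```

With fewer GPUs, raising grad_accum keeps the same batch_size while leaving per-device memory use unchanged, at the cost of more forward/backward passes per optimizer step.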

lr = learning rate, for example 2e-06
num_train_epochs = number of epochs, for example 10
numerical_format = bf16
seed = random seed; set this differently for each experiment, to something like 1, 2, or 3
data_seed = you can remove this option
run_name = name for your experiment

Let me know if that clarifies things and if you have any other questions.

One note: the 50.3% is an average over runs with seed=1, seed=2, and seed=3, so no single experiment will yield that exact number. Experiments on your machine will probably yield different results, since the randomness will be different. So don't expect to land on 50.3% exactly, even on average, but hopefully your average will be close to it.
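
As a rough sketch of how a multi-seed result like this could be summarized, the snippet below averages per-seed accuracies. The helper name `summarize_runs` is hypothetical, and the numbers are placeholders invented for illustration; they are not BioMedLM results.

```python
from statistics import mean, stdev


def summarize_runs(accuracy_by_seed: dict) -> tuple:
    """Return (mean, sample standard deviation) of per-seed accuracies."""
    scores = list(accuracy_by_seed.values())
    return mean(scores), stdev(scores)


# Placeholder accuracies keyed by seed. These are NOT real results;
# replace them with the test accuracy reported by each of your runs.
placeholder_runs = {1: 0.50, 2: 0.49, 3: 0.52}

avg, spread = summarize_runs(placeholder_runs)
print(f"mean accuracy = {avg:.3f}, std dev = {spread:.3f}")
```

Reporting the spread alongside the mean makes it easier to judge whether a single run that misses 50.3% is within normal seed-to-seed variation.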

githubusera commented on June 29, 2024

Thank you @J38 for such a detailed explanation. It appears that many of the parameters you've mentioned are needed for training. I am confused: why are we training a model if we are only trying to run an evaluation on an existing model? Or are we first building a model and then running eval on it, all in one batch command?

Also, can you clarify a conceptual question and let me know if I am thinking about this correctly?
BioMedLM is a model that has already been trained and is saved on Hugging Face: https://huggingface.co/stanford-crfm/BioMedLM.

I should be able to just download the BioMedLM model and run evaluation on MedQA WITHOUT training, right?
For example, I would do something like this:

import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

# Use the GPU when one is available.
device = "cuda" if torch.cuda.is_available() else "cpu"

tokenizer = GPT2Tokenizer.from_pretrained("stanford-crfm/BioMedLM")
model = GPT2LMHeadModel.from_pretrained("stanford-crfm/BioMedLM").to(device)
input_ids = tokenizer.encode(
    "A 20-year-old woman presents with menorrhagia for the past several years..... Which of the following is the most likely cause of this patient’s symptoms? A: Factor V Leiden ...", return_tensors="pt"
).to(device)

sample_output = model.generate(input_ids, do_sample=True, max_length=50, top_k=50)
# TODO: get output from tokenizer.decode(sample_output[0], skip_special_tokens=True)

Then compare the output to the correct label and check whether it is an exact match for the answer.
Is this evaluation method appropriate?
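
To make the exact-match idea in this question concrete, here is a minimal sketch in plain Python. It only illustrates the comparison step the question proposes and is not the evaluation method used by the BioMedLM scripts. The helper names `extract_choice` and `exact_match_accuracy`, the answer-letter pattern, and the sample strings are all assumptions for illustration.

```python
import re


def extract_choice(continuation: str, letters: str = "ABCDE"):
    """Return the first option letter followed by ':', '.', ')' or end of text.

    Pass only the text generated after the prompt, because the prompt
    itself already contains every option letter.
    """
    match = re.search(rf"\b([{letters}])(?=[:.)]|\s*$)", continuation)
    return match.group(1) if match else None


def exact_match_accuracy(continuations, gold_labels) -> float:
    """Fraction of examples whose extracted letter equals the gold label."""
    hits = sum(extract_choice(c) == g for c, g in zip(continuations, gold_labels))
    return hits / len(gold_labels)


# Made-up continuations, for illustration only.
generated = [" E: Von Willebrand disease", "An", "The answer is C."]
gold = ["E", "E", "B"]
print(exact_match_accuracy(generated, gold))
```

Free-form generation often yields text with no recognizable option letter (the second sample string), and every such case counts as wrong under this scheme.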

githubusera commented on June 29, 2024

Also, I am trying to evaluate the model on a single MedQA question, as in:
`
question = ("A 20-year-old woman presents with menorrhagia for the past several years."
"She says that her menses “have always been heavy”, and she has experienced easy bruising for as long as she can remember."
"Family history is significant for her mother, who had similar problems with bruising easily. "
"The patient's vital signs include: heart rate 98/min, respiratory rate 14/min, temperature 36.1°C (96.9°F),"
" and blood pressure 110/87 mm Hg. Physical examination is unremarkable. "
" Laboratory tests show the following: platelet count 200,000/mm3, PT 12 seconds,"
" and PTT 43 seconds. Which of the following is the most likely cause of this patient’s symptoms?"
"A: Factor V Leiden B: Hemophilia A C: Lupus anticoagulant D: Protein C deficiency E Von Willebrand disease"
)

input_ids = tokenizer.encode(
question, return_tensors="pt"
).to(device)

sample_output = model.generate(input_ids, do_sample=True, max_length=50, top_k=50)

print("Output:\n" + 100 * "-")
print(tokenizer.decode(sample_output[0], skip_special_tokens=True))
`

Here is the output:
The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:28895 for open-end generation.
Input length of input_ids is 209, but `max_length` is set to 50. This can lead to unexpected behavior. You should consider increasing `max_new_tokens`.
Output:

A 20-year-old woman presents with menorrhagia for the past several years.She says that her menses “have always been heavy”, and she has experienced easy bruising for as long as she can remember.Family history is significant for her mother, who had similar problems with bruising easily. The patient's vital signs include: heart rate 98/min, respiratory rate 14/min, temperature 36.1°C (96.9°F), and blood pressure 110/87 mm Hg. Physical examination is unremarkable. Laboratory tests show the following: platelet count 200,000/mm3, PT 12 seconds, and PTT 43 seconds. Which of the following is the most likely cause of this patient’s symptoms?A: Factor V Leiden B: Hemophilia A C: Lupus anticoagulant D: Protein C deficiency E Von Willebrand disease An

Notice that the answer seems to be truncated (the very last "An").
Is there a way to use the above code snippet to display the answer to the multiple-choice MedQA question? Thanks!
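
The warning in the output above hints at a likely cause of the truncation. In Hugging Face `generate`, `max_length` caps the prompt plus the generated tokens, while `max_new_tokens` caps only the generated tokens. The sketch below just does that arithmetic; the helper name `remaining_new_tokens` is hypothetical.

```python
def remaining_new_tokens(prompt_tokens: int, max_length: int) -> int:
    """Tokens left for the answer when max_length counts prompt + generation."""
    return max(0, max_length - prompt_tokens)


# Values from the warning above: a 209-token prompt with max_length=50.
print(remaining_new_tokens(209, 50))   # 0: the prompt alone exceeds the cap

# A cap of prompt length + 50 would leave room for 50 new tokens.
print(remaining_new_tokens(209, 259))  # 50
```

Passing `max_new_tokens=50` instead of `max_length=50` to `model.generate` is what the warning suggests. Whether the model then produces a usable answer letter without fine-tuning is a separate question.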

githubusera commented on June 29, 2024

I was able to run the following command in a terminal:

python -m torch.distributed.launch --nproc_per_node=1 --nnodes=1 --node_rank=0
run_multiple_choice.py --tokenizer_name stanford-crfm/pubmed_gpt_tokenizer --model_name_or_path
/root/.cache/huggingface/hub/models--stanford-crfm--BioMedLM --stanford-crfm--BioMedLM --train_file data/medqa_usmle_hf/train.json --validation_file data/medqa_usmle_hf/dev.json
--test_file data/medqa_usmle_hf/test.json --do_train --do_eval --do_predict --per_device_train_batch_size
1 --per_device_eval_batch_size 1 --gradient_accumulation_steps 1
--learning_rate 2e-06 --warmup_ratio 0.5 --num_train_epochs 10 --max_seq_length 512
--bf16 --seed 1 --data_seed 1 --logging_first_step --logging_steps 20
--save_strategy no --evaluation_strategy steps --eval_steps 500 --run_name alex
--output_dir trash/
--overwrite_output_dir

I see all three flags: --do_train, --do_eval, and --do_predict. Should I be able to use just --do_eval for my evaluation?
Where can I see the results from eval? Thank you.

githubusera commented on June 29, 2024

I've found this tutorial on multi-choice inference: https://huggingface.co/docs/transformers/tasks/multiple_choice#inference
Are we supposed to fine-tune BioMedLM on the multiple-choice task before running inference, as in this example: https://huggingface.co/docs/transformers/tasks/multiple_choice#train ?

Thank you.

J38 commented on June 29, 2024

The results will be printed out after the training is complete. I think do_eval will just work for eval. That command is running fine-tuning for multiple choice, and at the end prints out the results and puts .json files in the directory for the fine-tuned model.

githubusera commented on June 29, 2024

Thank you, @J38. I appreciate your response.

I am running the following command on a single GPU (on https://colab.research.google.com/ using a Pro+ GPU):
task=medqa_usmle_hf
datadir=data/$task
outdir=runs/$task/GPT2
mkdir -p $outdir

python -m torch.distributed.launch --nproc_per_node=1 --nnodes=1 --node_rank=0
run_multiple_choice.py --tokenizer_name stanford-crfm/pubmed_gpt_tokenizer --model_name_or_path
stanford-crfm/BioMedLM --train_file data/medqa_usmle_hf/train.json --validation_file data/medqa_usmle_hf/dev.json
--test_file data/medqa_usmle_hf/test.json --do_train --do_eval --do_predict --per_device_train_batch_size
1 --per_device_eval_batch_size 1 --gradient_accumulation_steps 1
--learning_rate 2e-06 --warmup_ratio 0.5 --num_train_epochs 10 --max_seq_length 512
--fp16 --seed 1 --data_seed 1 --logging_first_step --logging_steps 20
--save_strategy no --evaluation_strategy steps --eval_steps 500 --run_name alex
--output_dir trash/
--overwrite_output_dir

I am getting a GPU error:

I've been experimenting with
export 'PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:128'

but I am getting the same error.

Do you have recommendations for parameters when running train/eval on a single GPU?

Thanks

J38 commented on June 29, 2024

You're going to have to use cpu_offloading if you're trying to train this on a single GPU.

J38 commented on June 29, 2024

Here is a thread where I got it working on 1 GPU for sequence classification:

J38 commented on June 29, 2024

I think it may be sufficient to just update the deepspeed config to use cpu_offloading ... there is an example deepspeed config in that thread I shared in the previous comment.

J38 commented on June 29, 2024

What this does is offload data to the machine's RAM, which lets you work with much larger models at the cost of running much more slowly. But it is the only option for a model this large when you don't have a lot of GPU memory.
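
For readers who have not seen a DeepSpeed config with CPU offloading, the sketch below builds one as a Python dict and prints it as JSON. It is not the config from the thread referenced above. The ZeRO stage and the "auto" values (which rely on the Hugging Face Trainer integration to fill them in from the command line) are assumptions to review, and this sketch alone may not free enough memory for a model of this size.

```python
import json

# Assumed settings for illustration; review each one before training.
ds_config = {
    "fp16": {"enabled": "auto"},
    "zero_optimization": {
        "stage": 2,
        # Keep optimizer state in CPU RAM instead of GPU memory.
        "offload_optimizer": {"device": "cpu", "pin_memory": True},
    },
    "optimizer": {
        "type": "AdamW",
        # "auto" defers to the Trainer's command-line arguments.
        "params": {"lr": "auto", "weight_decay": "auto"},
    },
    "train_micro_batch_size_per_gpu": "auto",
    "gradient_accumulation_steps": "auto",
}

# Print the JSON; save it to a file to pass via --deepspeed.
config_text = json.dumps(ds_config, indent=2)
print(config_text)
```

Optimizer state is usually a large share of training memory, which is why moving it to CPU RAM helps, and also why each optimizer step becomes slower.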

J38 commented on June 29, 2024

You will need to use DeepSpeed rather than the torch distributed launch, so I will see if I can get an example working for the multiple-choice code. It should be similar to what I posted for the sequence classification example.

githubusera commented on June 29, 2024

@J38 Thank you for the guidance. We've just got deepspeed to work!

Here is the code in Jupyter Notebook:
!pip install fairscale
!pip install accelerate
!pip install deepspeed

Here is the command line that worked (but ran very slowly):
`task=medqa_usmle_hf ; datadir=data/$task ; export WANDB_PROJECT=biomedical-nlp-eval

deepspeed --num_gpus 1 --num_nodes 1 run_multiple_choice.py --tokenizer_name stanford-crfm/pubmed_gpt_tokenizer --model_name_or_path stanford-crfm/BioMedLM --train_file $datadir/train.json --validation_file $datadir/dev.json --test_file $datadir/test.json --do_train --do_eval --do_predict --per_device_train_batch_size 1 --gradient_accumulation_steps 2 --learning_rate 2e-06 --warmup_ratio 0.5 --num_train_epochs 20 --max_seq_length 560 --logging_steps 100 --save_strategy no --evaluation_strategy no --output_dir medqa-finetune-demo --overwrite_output_dir --fp16 --seed 1 --run_name medqa-finetune-demo --deepspeed deepspeed_config.json
`

The deepspeed_config.json was taken from this thread: #9

J38 commented on June 29, 2024

Just to summarize, there are several ways to run a fine-tuning process, including:

plain on 1 GPU
torch.distributed on multiple GPUs
deepspeed on multiple GPUs

If you use deepspeed, the deepspeed config will determine the optimizer settings. For instance, that config sets the learning rate, so make sure you review the deepspeed config and set the training parameters the way you want for the experiment.

I think the config may take precedence over the

--learning_rate 2e-06

in your command. It's possible deepspeed will just pick this up, but I would advise carefully reviewing the config to make sure all of the settings are what you want.

J38 commented on June 29, 2024

Now, it happens that the deepspeed config I showed had a learning rate of 2e-06, but I just wanted to let you know that the config will influence the optimizer settings, because deepspeed executes the optimization.
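
One way to act on this advice is to compare the learning rate in the deepspeed config with the one on the command line before launching. The sketch below is an assumption-laden illustration: the helper names are hypothetical, and it assumes the learning rate lives under `optimizer.params.lr` and that a missing optimizer section or the value "auto" means the config defers to the command line (as in the Hugging Face Trainer integration).

```python
import math


def config_learning_rate(ds_config: dict):
    """Return optimizer.params.lr from a DeepSpeed config dict, or None."""
    return ds_config.get("optimizer", {}).get("params", {}).get("lr")


def lr_matches(ds_config: dict, cli_lr: float) -> bool:
    """True when the config's learning rate cannot conflict with cli_lr."""
    lr = config_learning_rate(ds_config)
    # Assumption: no explicit value, or "auto", defers to the command line.
    if lr is None or lr == "auto":
        return True
    return math.isclose(float(lr), cli_lr, rel_tol=1e-9)


# Hypothetical config fragment with an explicit learning rate.
example_config = {"optimizer": {"type": "AdamW", "params": {"lr": 2e-06}}}

print(lr_matches(example_config, 2e-06))  # True
print(lr_matches(example_config, 5e-05))  # False
```

The same pattern extends to other settings that appear in both places, such as batch size and gradient accumulation steps.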

J38 commented on June 29, 2024

It is expected to be really slow, sorry. Training a model this large on 1 GPU is going to take a while compared with using multiple GPUs. I think 8 GPUs take 1.5h to fine-tune on this set, so it will be substantially slower with 1 GPU and cpu_offloading.

J38 commented on June 29, 2024

The PubMedQA task should only take about 4 hours, but that is a much smaller training set.
