
open-instruct's Issues

Cannot reproduce llama2-7b MMLU score

Hello! Thank you for this repo!
I am evaluating the model meta-llama/Llama-2-7b-hf on MMLU dataset.

The command is python -m eval.mmlu.run_eval --ntrain 5 --data_dir data/eval/mmlu --save_dir results/mmlu/llama2-7B-5shot --model_name_or_path meta-llama/Llama-2-7b-hf --tokenizer_name_or_path meta-llama/Llama-2-7b-hf --eval_batch_size 8.
But the result I got was 0.259.

When I change eval_batch_size to 1, the result is 0.459, which is exactly the value reported in the Llama 2 paper.

I'm wondering what's going on?

Thank you!
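Scores that change with eval_batch_size often come down to how padding is handled during batched scoring. Below is a hedged diagnostic sketch (not the repo's eval code; the model name and prompts are only illustrations) that compares the next-token prediction for the same prompts with and without batching:

    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    model_name = "meta-llama/Llama-2-7b-hf"
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    tokenizer.pad_token = tokenizer.eos_token   # LLaMA has no pad token by default
    tokenizer.padding_side = "left"             # left padding keeps the final position aligned
    model = AutoModelForCausalLM.from_pretrained(
        model_name, torch_dtype=torch.float16, device_map="auto"
    )

    prompts = ["The capital of France is", "Two plus two equals"]

    # Unbatched: score one prompt at a time.
    single = []
    for p in prompts:
        ids = tokenizer(p, return_tensors="pt").input_ids.to(model.device)
        single.append(model(ids).logits[0, -1].argmax().item())

    # Batched: prompts padded to a common length; the attention mask must be passed through.
    batch = tokenizer(prompts, return_tensors="pt", padding=True).to(model.device)
    batched = model(**batch).logits[:, -1, :].argmax(dim=-1).tolist()

    print(single, batched)  # these should agree if padding is handled correctly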

LICENSE.txt not visible unless you agree to the LICENSE on Huggingface

Hi, thanks for the effort to make this model accessible!

I am cross-posting the discussion that I put on Huggingface, because I realized that people may not be following the discussion on Huggingface.

https://huggingface.co/allenai/tulu-65b/discussions/1

I wanted to obtain access to this model, which requires me to agree to the terms and conditions:

I agree to abide by the terms of the license associated to this artifact, including domain and use-based restrictions

And, it seems that the terms and conditions are written inside LICENSE.txt:

This model is licensed under the AI model license given in LICENSE.txt along with the original Llama license (llama_license.txt).

The problem is that you cannot see LICENSE.txt unless you have already obtained access to the model. It seems the 232 people who have downloaded the model so far didn't mind, but could you consider putting the LICENSE on the model card or making it accessible by other means?

balanced_low_0 Related Issue During Inference

Hi, I noticed that in scripts/eval/bbh.sh, if two or more CUDA devices are available (for example, CUDA_VISIBLE_DEVICES=0,1), it throws an "Expected all tensors to be on the same device, but found at least two devices" error. If I use auto instead of balanced_low_0 for device_map, it works fine. Am I missing something? Thanks!

Errors happen when loading the recovered tokenizer

I want to run weight_diff.py to recover the fine-tuned model. Here is what I did:

  1. I have already downloaded the original LLaMA weights and converted them to Hugging Face format. This step should be fine, since I can successfully load model_raw and tokenizer_raw.
  2. I tried to download your diff weights with git clone https://huggingface.co/allenai/tulu-7b, but this raised an error: Possibly malformed smudge on Windows: see git lfs help smudge for more info. The reason might be that the files are too large.
  3. I tried another way: using allenai/tulu-7b directly as path_diff, so the code downloads the model itself. This works for model_recovered. However, tokenizer_recovered doesn't work and raises the following error:
    RecursionError: maximum recursion depth exceeded while getting the str of an object
    I have also hit this error before. According to this issue, it seems you are using an old version of the LLaMA tokenizer.
    Do you know how I should fix it? More specifically, can I load it as
    tokenizer_recovered = AutoTokenizer.from_pretrained("allenai/tulu-7b", unk_token="<unk>", bos_token="<s>", eos_token="</s>")

Loading it this way works, but I need to make sure I am using the same special tokens as you did during fine-tuning.

How to generate text?

Hi,

Thank you for your work.

I have attempted to generate text using allenai/open-instruct-stanford-alpaca-7b, but unfortunately, the output is not quite satisfactory. Could you kindly guide me on whether I made any mistakes and how I can improve the results?

code:

from transformers import AutoTokenizer, AutoModelForCausalLM


pretrained = "allenai/open-instruct-stanford-alpaca-7b"
model = AutoModelForCausalLM.from_pretrained(pretrained).cuda()
tokenizer = AutoTokenizer.from_pretrained(pretrained, use_fast=False)

def generate(prompt, **kwargs):
    inputs = tokenizer(prompt, return_tensors="pt")
    generate_ids = model.generate(inputs.input_ids.cuda(), **kwargs)
    response = tokenizer.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0]
    return response


prompt = "Below is an instruction that describes a task. Write a response that appropriately completes the request.\n\n### Instruction:\n{instruction}\n\n### Response:\n"
prompt = prompt.format(instruction="Describe a time when you had to make a difficult decision.")

print(prompt)
print(generate(prompt, max_length=256))

Input:

Below is an instruction that describes a task. Write a response that appropriately completes the request.

### Instruction:
Describe a time when you had to make a difficult decision.

### Response:

Output:

Below is an instruction that describes a task. Write a response that appropriately completes the request.

### Instruction:
Describe a time when you had to make a difficult decision.

### Response:
fnfnfnfnfnfnfnfnfnfnfnfnfnfnfnfnfnfnfnfnfnfnfnfnfnfnfnfnfnfnfnfnfnfnfnfnfnfnfnfnfnfnfnfnfnfnfnfnfnfnfnfnfnfnfnfnfnfnfnfnfnfnfnfnfnfnfnfnfnfnfnfnfnfnfnfnfnfnfnfnfnfnfnfnfnfnfnfnfnfnfnfnfnfnfnfnfnfnfnfnfnfnfn
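One thing worth checking: the evaluation scripts in this repo that pass --use_chat_format wrap prompts in <|user|>/<|assistant|> tags rather than the Alpaca template. Whether this particular checkpoint expects that template is an assumption here; a hedged sketch reusing the generate helper above:

    prompt = "<|user|>\nDescribe a time when you had to make a difficult decision.\n<|assistant|>\n"
    print(generate(prompt, max_new_tokens=256, do_sample=False))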

TypeError on import in run_eval.py

Traceback (most recent call last):
  File "/opt/conda/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/opt/conda/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "open-instruct/eval/mmlu/run_eval.py", line 11, in <module>
    from eval.utils import get_next_word_predictions, load_hf_lm_and_tokenizer, query_openai_chat_model
  File "open-instruct/eval/utils.py", line 10, in <module>
    from eval.dispatch_openai_requests import dispatch_openai_chat_requesets, dispatch_openai_prompt_requesets
  File "open-instruct/eval/dispatch_openai_requests.py", line 11, in <module>
    messages_list: list[list[dict[str,Any]]],
TypeError: 'type' object is not subscriptable

async def dispatch_openai_chat_requesets(
    messages_list: list[list[dict[str,Any]]],
    model: str,
    **completion_kwargs: Any,
) -> list[str]:

    async_responses = [
        openai.ChatCompletion.acreate(
            model=model,
            messages=x,
            **completion_kwargs,
        )
        for x in messages_list
    ]
    return await asyncio.gather(*async_responses)

Why does the TypeError occur at import time?
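The annotation list[list[dict[str, Any]]] uses built-in generic subscripting, which is only supported at runtime on Python 3.9+ (PEP 585); on the Python 3.8 interpreter shown in the traceback, evaluating the annotation at import time raises exactly this TypeError. A hedged sketch of two standard workarounds (the repo may prefer a different fix):

    # Option 1: defer annotation evaluation (this must be the first statement
    # in eval/dispatch_openai_requests.py).
    from __future__ import annotations

    # Option 2: use typing generics, which work on Python 3.8.
    from typing import Any, Dict, List

    async def dispatch_openai_chat_requesets(
        messages_list: List[List[Dict[str, Any]]],
        model: str,
        **completion_kwargs: Any,
    ) -> List[str]:
        ...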

Support for tuning on chat data

Hello,
thanks for sharing the code base.

Does the open-instruct code base support tuning on conversational data (e.g., ShareGPT)?

Fail to generate eval data

Hi, I ran scripts/prepare_eval_data.sh but can't find the function eval.creative_eval.get_gpt_outputs. Where can I find it?

Estimated timeline for human data release

Hi @yizhongw,
Thank you very much for this excellent release!

I was wondering what the estimated timeline is for releasing the human evaluation data and interface (the README mentions it's coming soon)? It would be very helpful for my research!

Thank you,
Kalpesh

How to pull the image

Sorry to bother you with this. I ran into the following problem when trying to pull the image.

Command:
~/project$ docker pull gcr.io/ai2-beaker-core/public/cl5erg1ebj67821o3200:latest

Output:
Error response from daemon: unauthorized: You don't have the needed permissions to perform this operation, and you may have invalid credentials. To authenticate your request, follow the steps in: https://cloud.google.com/container-registry/docs/advanced-authentication

I am quite new to GCR. Can anyone help me with this? Thank you very much!

`prepare_train_data.sh` fails

I get this error:

Splitting the ShareGPT dataset...
Traceback (most recent call last):
  File "scripts/split_sharegpt_conversations.py", line 117, in <module>
    main(args)
  File "scripts/split_sharegpt_conversations.py", line 96, in main
    tokenizer = transformers.AutoTokenizer.from_pretrained(
  File "XXX/venv/lib64/python3.8/site-packages/transformers/models/auto/tokenization_auto.py", line 643, in from_pretrained
    tokenizer_config = get_tokenizer_config(pretrained_model_name_or_path, **kwargs)
  File "XXX/venv/lib64/python3.8/site-packages/transformers/models/auto/tokenization_auto.py", line 487, in get_tokenizer_config
    resolved_config_file = cached_file(
  File "XXX/venv/lib64/python3.8/site-packages/transformers/utils/hub.py", line 417, in cached_file
    resolved_file = hf_hub_download(
  File "XXX/venv/lib64/python3.8/site-packages/huggingface_hub/utils/_validators.py", line 110, in _inner_fn
    validate_repo_id(arg_value)
  File "XXX/venv/lib64/python3.8/site-packages/huggingface_hub/utils/_validators.py", line 158, in validate_repo_id
    raise HFValidationError(
huggingface_hub.utils._validators.HFValidationError: Repo id must be in the form 'repo_name' or 'namespace/repo_name': '../hf_llama_models/7B/'. Use `repo_type` argument if needed.
Reformatting the datasets...
Processing super_ni data...
Processing self_instruct data...
Processing unnatural_instructions data...
Processing stanford_alpaca data...
Processing dolly data...
Processing oasst1 data...
Processing code_alpaca data...
Processing gpt4_alpaca data...
Processing baize data...
Processing sharegpt data...
Traceback (most recent call last):
  File "open_instruct/reformat_datasets.py", line 452, in <module>
    globals()[f"convert_{subfolder}_data"](os.path.join(args.raw_data_dir, subfolder), os.path.join(args.output_dir, subfolder))
  File "open_instruct/reformat_datasets.py", line 301, in convert_sharegpt_data
    with open(os.path.join(data_dir, "sharegpt_html_cleaned_and_split.json"), "r") as fin:
FileNotFoundError: [Errno 2] No such file or directory: 'data/raw_train/sharegpt/sharegpt_html_cleaned_and_split.json'
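The two tracebacks appear related: in the first, transformers cannot find '../hf_llama_models/7B/' as a local directory and so treats it as a Hub repo id, and because split_sharegpt_conversations.py crashes there, sharegpt_html_cleaned_and_split.json is never written, which likely causes the second FileNotFoundError. A hedged sketch of a pre-flight check (path copied from the traceback):

    import os
    import transformers

    tokenizer_path = "../hf_llama_models/7B/"
    if not os.path.isdir(tokenizer_path):
        raise FileNotFoundError(
            f"Converted LLaMA weights not found at {tokenizer_path}; "
            "convert/download them first or point the script at a valid local path."
        )
    tokenizer = transformers.AutoTokenizer.from_pretrained(tokenizer_path, use_fast=False)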

RecursionError when running weight diff script

I'm trying to use the weight diff script provided in the repo to rebuild one of the fine-tuned checkpoints, and I'm running into some issues. I started by setting up a new Conda environment (Python version 3.11.3), installed the requirements from weight-diff-requirements.txt using pip, downloaded the 7B-parameter LLaMA checkpoint as well as the allenai/open-instruct-sni-7B model diff from Huggingface, and then ran the weight_diff.py script as specified in README.md. This causes the following error:

  File "/gscratch/ark/rahuln/miniconda3/envs/weight-diff/lib/python3.11/site-packages/transformers/tokenization_utils_base.py", line 1155, in unk_token_id
    return self.convert_tokens_to_ids(self.unk_token)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/gscratch/ark/rahuln/miniconda3/envs/weight-diff/lib/python3.11/site-packages/transformers/tokenization_utils_fast.py", line 250, in convert_tokens_to_ids
    return self._convert_token_to_id_with_added_voc(tokens)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/gscratch/ark/rahuln/miniconda3/envs/weight-diff/lib/python3.11/site-packages/transformers/tokenization_utils_fast.py", line 257, in _convert_token_to_id_with_added_voc
    return self.unk_token_id
           ^^^^^^^^^^^^^^^^^
RecursionError: maximum recursion depth exceeded

Is this a known issue with weight_diff.py? Any suggestions on how to go about fixing it? Thanks!

Embedding resizing logic may be flawed

In the file finetune.py, lines 467-471 talk about resizing the embedding matrix of LLAMA when adding the padding token.

# We resize the embeddings only when necessary to avoid index errors. If you are creating a model from scratch

Issue: the vocab size inferred from the embedding matrix is 0, presumably due to safetensors. This means resizing will always happen if I use a variant of Llama 2 that I have modified myself and that may already contain the pad token.

However, for a model that is only available in safetensors format, does model.resize_token_embeddings(len(tokenizer)) even work?

This is likely a huggingface issue but I thought I would raise it here.

TIA.
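For reference, a hedged sketch of a conditional resize (not the repo's exact finetune.py logic; the checkpoint name is a placeholder). resize_token_embeddings operates on the in-memory embedding module after the weights, safetensors or otherwise, have been loaded into the model object:

    from transformers import AutoModelForCausalLM, AutoTokenizer

    checkpoint = "meta-llama/Llama-2-7b-hf"  # placeholder checkpoint
    model = AutoModelForCausalLM.from_pretrained(checkpoint)
    tokenizer = AutoTokenizer.from_pretrained(checkpoint)
    tokenizer.add_special_tokens({"pad_token": "<pad>"})

    # Only resize when the tokenizer is actually larger than the embedding matrix.
    embedding_size = model.get_input_embeddings().weight.shape[0]
    if len(tokenizer) > embedding_size:
        model.resize_token_embeddings(len(tokenizer))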

Potential code fix to StoppingCriteria during generation

I noticed that the current StoppingCriteria used during text generation is:

    def __call__(self, input_ids: torch.LongTensor, scores: torch.FloatTensor, **kwargs) -> bool:
        sequences_should_be_stopped = []
        for i in range(input_ids.shape[0]):
            for stop_sequence in self.stop_sequences:
                if input_ids[i][-len(stop_sequence):].tolist() == stop_sequence:
                    sequences_should_be_stopped.append(True)
                    break
            sequences_should_be_stopped.append(False)
        return all(sequences_should_be_stopped)

However, this makes all(sequences_should_be_stopped) always False: even when a stop sequence matches and the inner loop breaks, a False is still appended for that sequence, so the model keeps generating even after "\n" (the stop sequence used in the repo) has already been generated. Shouldn't something like the following be the correct way? (I know "\n" is checked afterwards against the generated ids, so the results aren't affected.)

    def __call__(self, input_ids: torch.LongTensor, scores: torch.FloatTensor, **kwargs) -> bool:
        batch_generation_should_be_stopped = []
        for i in range(input_ids.shape[0]):
            sequence_should_be_stopped = False
            for stop_sequence in self.stop_sequences:
                if input_ids[i][-len(stop_sequence):].tolist() == stop_sequence:
                    sequence_should_be_stopped = True
                    break
            batch_generation_should_be_stopped.append(sequence_should_be_stopped)
        return all(batch_generation_should_be_stopped)

Another thing I noticed: since the StoppingCriteria operates at the batch level, generation only halts when every sequence in the batch happens to generate "\n" at the same step, which is rare, so in most batches the model will keep generating until it hits the maximum generation length or the model's EOS token. So I'm wondering: wouldn't StoppingCriteria plus a batch size of 1 be more efficient for generation-based inference?
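For reference, a self-contained version of the proposed fix (the class name, constructor, and wiring are assumptions; the repo organizes this differently):

    import torch
    from transformers import StoppingCriteria

    class StopOnSequences(StoppingCriteria):
        """Stop only when every sequence in the batch ends with one of the stop sequences."""

        def __init__(self, stop_sequences):
            # stop_sequences: a list of token-id lists, e.g. the token ids for "\n".
            self.stop_sequences = stop_sequences

        def __call__(self, input_ids: torch.LongTensor, scores: torch.FloatTensor, **kwargs) -> bool:
            batch_should_stop = []
            for i in range(input_ids.shape[0]):
                ends_with_stop = any(
                    input_ids[i][-len(stop):].tolist() == stop for stop in self.stop_sequences
                )
                batch_should_stop.append(ends_with_stop)
            return all(batch_should_stop)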

Does LoRa finetuning give weight diff or actual weight?

Do we need to recover weights after LoRa fine-tuning or do we get the final weights?

Also, can you provide some information about which checkpoints are released as full recovered weights and which as diffs? It's not clear from "Some of the checkpoints are released as weight diffs to the base model (mostly for LLaMa 1)." in the doc here: https://github.com/allenai/open-instruct#weight-diff-script

Maybe you could add a column here: https://github.com/allenai/open-instruct#released-checkpoints specifying which checkpoints are diffs and which are full recovered weights. Thanks.

cc: @yizhongw @hamishivi @eltociear

max_seq_length and per_device_train_batch_size

Hi,

I wonder what max_seq_length and per_device_train_batch_size you are using for huggyllama/llama-7b in all experiments?

I'm trying to use max_seq_length=2048 and per_device_batch_size=2 on 4xA100 80GB gpus with deepspeed as you provided in [finetune_with_accelerate.sh](https://github.com/allenai/open-instruct/blob/main/scripts/finetune_with_accelerate.sh).

I constantly hit CUDA OOM errors.

Seeking your help.

LoRA finetuning

Hi,

Thanks for releasing this amazing repo! A script to finetune with LoRA is provided in the repo, yet results with LoRA finetuning are not mentioned in the paper. Does LoRA finetuning underperform full finetuning?

Best,
Mengzhou

Not loading cached datasets for preprocessing

Dear,

I am currently working with the script found at https://github.com/allenai/open-instruct/blob/main/scripts/finetune_with_hf_trainer.sh and utilizing 4 GPUs.

From the logs I observed:

Tokenizing and reformatting instruction data (num_proc=128): 100%|██████████| 2109561/2109561 [02:06<00:00, 16738.93 examples/s]
Tokenizing and reformatting instruction data (num_proc=128): 100%|██████████| 2109561/2109561 [02:38<00:00, 13336.92 examples/s] 
Tokenizing and reformatting instruction data (num_proc=128): 100%|██████████| 2109561/2109561 [03:10<00:00, 11098.11 examples/s] 
Filter: 100%|██████████| 2109561/2109561 [02:02<00:00, 17255.15 examples/s]
/usr/local/lib/python3.10/dist-packages/transformers/deepspeed.py:23: FutureWarning: transformers.deepspeed module is deprecated and will be removed in a future version. Please import deepspeed modules directly from transformers.integrations
  warnings.warn(
Filter: 100%|██████████| 2109561/2109561 [01:46<00:00, 19796.75 examples/s]
/usr/local/lib/python3.10/dist-packages/transformers/deepspeed.py:23: FutureWarning: transformers.deepspeed module is deprecated and will be removed in a future version. Please import deepspeed modules directly from transformers.integrations
  warnings.warn(
Filter: 100%|██████████| 2109561/2109561 [01:38<00:00, 21370.43 examples/s]

Although you explicitly let the main process run tokenization first in finetune_trainer.py, the remaining 3 processes didn't load the cached results. I've verified that the overwrite_cache option is turned off.

Have you run into the same situation?
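A hedged sketch of the pattern in question (variable names and paths are illustrative, not the repo's exact code): the main process runs map() first inside main_process_first and writes the Arrow cache, and the other ranks then re-run the same map() call expecting a cache hit. The cache is only reused when the fingerprint matches, i.e. an identically hashed function, identical arguments, and load_from_cache_file enabled:

    from datasets import load_dataset
    from transformers import TrainingArguments

    training_args = TrainingArguments(output_dir="out")            # illustrative
    raw_datasets = load_dataset("json", data_files="train.jsonl")  # illustrative path

    def encode_function(example):
        return example  # stand-in for the real tokenization function

    with training_args.main_process_first(desc="tokenizing instruction data"):
        lm_datasets = raw_datasets.map(
            encode_function,              # must hash identically on every rank for a cache hit
            batched=False,
            num_proc=128,                 # must match across ranks, or the cache fingerprint differs
            load_from_cache_file=True,    # i.e. overwrite_cache disabled
            desc="Tokenizing and reformatting instruction data",
        )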

Mismatch MMLU results with arxiv paper

I ran the MMLU eval with flan-v2-7b and stanford-alpaca-7b and got different results from those reported in the paper. Does the arXiv version have the latest numbers? I ran the eval script with the flags --ntrain 0 --eval_batch_size 2 --load_in_8bit --use_chat_format

$ cat outputs/mmlu/stanford-alpaca-7b-0shot/metrics.json
{"average_acc": 0.4154678820680815, "subcat_acc": {"math": 0.27725563909774437, "health": 0.44146341463414634, "physics": 0.3546875, "business": 0.5720823798627003, "biology": 0.45594713656387664, "chemistry": 0.3333333333333333, "computer science": 0.36650485436893204, "economics": 0.3719676549865229, "engineering": 0.35172413793103446, "philosophy": 0.3812127236580517, "other": 0.46266094420600856, "history": 0.4989247311827957, "geography": 0.5151515151515151, "politics": 0.5200617283950617, "psychology": 0.4675885911840968, "culture": 0.5843373493975904, "law": 0.3448667044809983}, "cat_acc": {"STEM": 0.341948310139165, "humanities": 0.39086078639744953, "social sciences": 0.4712382190445239, "other (business, health, misc.)": 0.4666872301048735}}
$ cat outputs/mmlu/flan-v2-7b-0shot/metrics.json
{"average_acc": 0.4544936618715283, "subcat_acc": {"math": 0.2932330827067669, "health": 0.4634146341463415, "physics": 0.340625, "business": 0.5926773455377574, "biology": 0.4801762114537445, "chemistry": 0.35313531353135313, "computer science": 0.39563106796116504, "economics": 0.4029649595687331, "engineering": 0.36551724137931035, "philosophy": 0.397117296222664, "other": 0.5570815450643777, "history": 0.6053763440860215, "geography": 0.5050505050505051, "politics": 0.5509259259259259, "psychology": 0.5522904062229905, "culture": 0.6114457831325302, "law": 0.38740782756664777}, "cat_acc": {"STEM": 0.3548707753479125, "humanities": 0.434643995749203, "social sciences": 0.5193370165745856, "other (business, health, misc.)": 0.5144972239358421}}

Fine-tuning Reproducibility

Hi, first of all, thanks a lot for your great contributions!

I'm trying to fine-tune llama-7b on alpaca-52k, but I'm facing a reproducibility issue on the MMLU benchmark.
I used finetune_with_accelerate.sh on two A100 40GB GPUs with the same hyperparameters as in the paper, as below:

export CUDA_VISIBLE_DEVICES=0,1

MODEL_SIZE=7B
NUM_GPUS=2
BATCH_SIZE_PER_GPU=2
TOTAL_BATCH_SIZE=128
GRADIENT_ACC_STEPS=$(($TOTAL_BATCH_SIZE/$NUM_GPUS/$BATCH_SIZE_PER_GPU))
echo "Training llama model ${MODEL_SIZE} using $NUM_GPUS GPUs, $BATCH_SIZE_PER_GPU batch size per GPU, $GRADIENT_ACC_STEPS gradient accumulation steps"

accelerate launch \
    --mixed_precision bf16 \
    --num_machines 1 \
    --num_processes $NUM_GPUS \
    --use_deepspeed \
    --deepspeed_config_file ds_configs/stage3_offloading_accelerate.conf \
    open_instruct/finetune.py \
    --model_name_or_path decapoda-research/llama-7b-hf \
    --tokenizer_name decapoda-research/llama-7b-hf \
    --use_slow_tokenizer \
    --train_file /datasets/open_instruct/stanford_alpaca/stanford_alpaca_data.jsonl\
    --use_flash_attn \
    --max_seq_length 2048 \
    --preprocessing_num_workers 16 \
    --per_device_train_batch_size $BATCH_SIZE_PER_GPU \
    --gradient_accumulation_steps $GRADIENT_ACC_STEPS \
    --learning_rate 2e-5 \
    --lr_scheduler_type linear \
    --warmup_ratio 0.03 \
    --weight_decay 0. \
    --num_train_epochs 2 \
    --output_dir /checkpoints/llama-7b-dm-alpaca52k \
    --with_tracking \
    --report_to tensorboard \
    --logging_steps 1

Below is the script for MMLU evaluation

export CUDA_VISIBLE_DEVICES=1

# Evaluating llama 7B model using 0 shot directly

python -m eval.mmlu.run_eval \
    --ntrain 0 \
    --data_dir /datasets/MMLU/data \
    --save_dir results/mmlu/llama-7B-0shot \
    --model_name_or_path /checkpoints/llama-7b-dm-alpaca52k \
    --tokenizer_name_or_path /checkpoints/llama-7b-dm-alpaca52k \
    --eval_batch_size 4 \
    --load_in_8bit \
    --use_chat_format

However, I got an average accuracy of 0.320, versus 0.415 in the paper.

I already checked that 0.415 was reproducible when I downloaded the officially released model allenai/open-instruct-stanford-alpaca-7b.
So the performance drop must come from my fine-tuning.

Can anyone give me tips on what I might be missing?
Thanks

Possible unreliability of MMLU estimates

From reading the source code, I think the MMLU evaluation method is as follows: compare the first character generated by the model with the true label. I have a question about this: although you append "Answer:" to the question, the model usually does not output one of the options directly. Below is an example of the related code.

    results = query_openai_chat_model(
        engine=args.openai_engine,
        instances=instances,
        batch_size=args.eval_batch_size if args.eval_batch_size else 10,
        output_path=os.path.join(args.save_dir, f"{subject}_openai_results.jsonl"),
        logit_bias={token_id: 100 for token_id in answer_choice_ids},
        max_tokens=1,   # Here
    )

Is my concern reasonable? Could you provide a relatively detailed explanation? It seems that there are no very reliable methods for extracting answers under unsupervised conditions in the field of instruction tuning.
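For context on the snippet above: both evaluation paths constrain the model to the option letters rather than relying on free-form output — the OpenAI path via logit_bias over the answer-choice token ids with max_tokens=1, and the Hugging Face path (via get_next_word_predictions, imported elsewhere in the eval code) by scoring the next token directly. A hedged sketch of that style of answer extraction (not the repo's exact code; the model name and prompt are placeholders):

    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    model_name = "meta-llama/Llama-2-7b-hf"  # placeholder
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16, device_map="auto")

    choices = ["A", "B", "C", "D"]
    choice_ids = [tokenizer.encode(c, add_special_tokens=False)[-1] for c in choices]

    prompt = "<question and options here>\nAnswer:"
    input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to(model.device)
    with torch.no_grad():
        next_token_logits = model(input_ids).logits[0, -1]
    # Compare only the logits of the option letters and take the most likely one.
    prediction = choices[int(torch.argmax(next_token_logits[choice_ids]))]
    print(prediction)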

Finetuning setup for 65b model

Hi, thank you for releasing the repo/checkpoints! Could you share your setup for finetuning the 65B model, e.g., the DeepSpeed config, number of nodes, etc.? Thanks!

Error with batch inference

../aten/src/ATen/native/cuda/Indexing.cu:1093: indexSelectSmallIndex: block: [25,0,0], thread: [127,0,0] Assertion srcIndex < srcSelectDimSize failed.
Error when generating completions for batch:
RuntimeError: CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.

The above error occurs when testing bbh and seems to be due to batch inference.

GPT-4 MMLU evaluation setup

Dear authors,

I enjoyed reading your work! I wonder whether the GPT-4 evaluation results described in the paper only evaluate 100 instances from each MMLU subject? I ask because it seems that only 100 instances per subject are used in your mmlu.sh evaluation script.

If this is true, why wasn't this discussed in the paper? That is, is it a convention to use only 100 instances per subject for GPT-4 evaluation?

Thanks

Hyperparameter details of different sizes

Hi! Thanks for your great work!

I noticed that the paper does not include the batch size for the different model sizes (7B, 13B, 30B, 65B), and the training script provided is for 7B. Could you please share the training scripts for the other models, or just the hyperparameter details?

Thanks a lot!

CUDA OOM for qlora

Hi, I'm trying your QLoRA script (https://github.com/allenai/open-instruct/blob/main/scripts/finetune_qlora_with_accelerate.sh) on a node with eight A100 80GB GPUs.

But I get the following error:

torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB (GPU 7; 79.20 GiB total capacity; 75.10 GiB already allocated; 279.56 MiB free; 77.88 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

Any suggestions?

Model weights after finetuning cannot be loaded by vLLM or Hugging Face

Dear authors, thank you for open-sourcing this great project. After I fine-tune my Llama2-7B model using finetune_with_accelerate.sh, I cannot load the model weights with vLLM or Hugging Face during inference.

It seems that the model is saved successfully:

100%|██████████| 1300/1300 [6:12:02<00:00, 15.33s/it]
12/14/2023 01:59:25 - INFO - __main__ - Step: 1300, LR: 1.8040265793910542e-05, Loss: 0.2729374170303345
tokenizer config file saved in output/llama2_sharegpt_7B/tokenizer_config.json
Special tokens file saved in output/llama2_sharegpt_7B/special_tokens_map.json
added tokens file saved in output/llama2_sharegpt_7B/added_tokens.json
Configuration saved in output/llama2_sharegpt_7B/config.json
Configuration saved in output/llama2_sharegpt_7B/generation_config.json
The model is bigger than the maximum size per checkpoint (5GB) and is going to be split in 3 checkpoint shards. You can find where each parameters has been saved in the index located at output/llama2_sharegpt_7B/model.safetensors.index.json.
100%|██████████| 1300/1300 [6:12:18<00:00, 17.18s/it]

But when I use vllm to load model weights, there is an error:

(lzy-rlhf) liuziyi@g0003:/paratera5-data/private/liuziyi/mygit/open-instruct$ bash scripts/eval/gsm.sh
Loading data...
Loading model and tokenizer...
2023-12-14 10:46:39,885 INFO worker.py:1673 -- Started a local Ray instance.
INFO 12-14 10:46:44 llm_engine.py:73] Initializing an LLM engine with config: model='/paratera5-data/private/liuziyi/mygit/open-instruct/output/llama2_sharegpt_7B', tokenizer='/paratera5-data/private/liuziyi/mygit/open-instruct/output/llama2_sharegpt_7B', tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=4096, download_dir=None, load_format=auto, tensor_parallel_size=8, quantization=None, seed=0)
INFO 12-14 10:46:44 tokenizer.py:32] For some LLaMA V1 models, initializing the fast tokenizer may take a long time. To reduce the initialization time, consider using 'hf-internal-testing/llama-tokenizer' instead of the original tokenizer.
Traceback (most recent call last):
  File "/ssd/apps/anaconda/2023.03/envs/lzy-rlhf/lib/python3.9/runpy.py", line 197, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/ssd/apps/anaconda/2023.03/envs/lzy-rlhf/lib/python3.9/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/paratera5-data/private/liuziyi/mygit/open-instruct/eval/gsm/run_eval.py", line 247, in <module>
    main(args)
  File "/paratera5-data/private/liuziyi/mygit/open-instruct/eval/gsm/run_eval.py", line 78, in main
    model = vllm.LLM(
  File "/ssd/apps/anaconda/2023.03/envs/lzy-rlhf/lib/python3.9/site-packages/vllm/entrypoints/llm.py", line 93, in __init__
    self.llm_engine = LLMEngine.from_engine_args(engine_args)
  File "/ssd/apps/anaconda/2023.03/envs/lzy-rlhf/lib/python3.9/site-packages/vllm/engine/llm_engine.py", line 246, in from_engine_args
    engine = cls(*engine_configs,
  File "/ssd/apps/anaconda/2023.03/envs/lzy-rlhf/lib/python3.9/site-packages/vllm/engine/llm_engine.py", line 107, in __init__
    self._init_workers_ray(placement_group)
  File "/ssd/apps/anaconda/2023.03/envs/lzy-rlhf/lib/python3.9/site-packages/vllm/engine/llm_engine.py", line 194, in _init_workers_ray
    self._run_workers(
  File "/ssd/apps/anaconda/2023.03/envs/lzy-rlhf/lib/python3.9/site-packages/vllm/engine/llm_engine.py", line 750, in _run_workers
    self._run_workers_in_batch(workers, method, *args, **kwargs))
  File "/ssd/apps/anaconda/2023.03/envs/lzy-rlhf/lib/python3.9/site-packages/vllm/engine/llm_engine.py", line 727, in _run_workers_in_batch
    all_outputs = ray.get(all_outputs)
  File "/ssd/apps/anaconda/2023.03/envs/lzy-rlhf/lib/python3.9/site-packages/ray/_private/auto_init_hook.py", line 24, in auto_init_wrapper
    return fn(*args, **kwargs)
  File "/ssd/apps/anaconda/2023.03/envs/lzy-rlhf/lib/python3.9/site-packages/ray/_private/client_mode_hook.py", line 103, in wrapper
    return func(*args, **kwargs)
  File "/ssd/apps/anaconda/2023.03/envs/lzy-rlhf/lib/python3.9/site-packages/ray/_private/worker.py", line 2563, in get
    raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(AssertionError): ray::RayWorkerVllm.execute_method() (pid=1415438, ip=10.232.14.3, actor_id=735b979085096463debd933f01000000, repr=<vllm.engine.ray_utils.RayWorkerVllm object at 0x14675a4a8a60>)
  File "/ssd/apps/anaconda/2023.03/envs/lzy-rlhf/lib/python3.9/site-packages/vllm/engine/ray_utils.py", line 31, in execute_method
    return executor(*args, **kwargs)
  File "/ssd/apps/anaconda/2023.03/envs/lzy-rlhf/lib/python3.9/site-packages/vllm/worker/worker.py", line 72, in load_model
    self.model_runner.load_model()
  File "/ssd/apps/anaconda/2023.03/envs/lzy-rlhf/lib/python3.9/site-packages/vllm/worker/model_runner.py", line 36, in load_model
    self.model = get_model(self.model_config)
  File "/ssd/apps/anaconda/2023.03/envs/lzy-rlhf/lib/python3.9/site-packages/vllm/model_executor/model_loader.py", line 98, in get_model
    model.load_weights(model_config.model, model_config.download_dir,
  File "/ssd/apps/anaconda/2023.03/envs/lzy-rlhf/lib/python3.9/site-packages/vllm/model_executor/models/llama.py", line 336, in load_weights
    weight_loader(param, loaded_weight)
  File "/ssd/apps/anaconda/2023.03/envs/lzy-rlhf/lib/python3.9/site-packages/vllm/model_executor/layers/vocab_parallel_embedding.py", line 80, in weight_loader
    assert loaded_weight.shape[parallel_dim] == self.num_embeddings
AssertionError

I would be very thankful if you could give me some insight into how to deal with this issue.
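A hedged diagnostic that often narrows this assertion down (output path copied from the log; not a confirmed root cause): check whether the pad token added during finetuning grew the saved vocab (e.g. 32000 -> 32001), since the failing check compares the saved embedding rows against the vocab size vLLM expects:

    from transformers import AutoConfig, AutoTokenizer

    out_dir = "output/llama2_sharegpt_7B"
    config = AutoConfig.from_pretrained(out_dir)
    tokenizer = AutoTokenizer.from_pretrained(out_dir)
    print("config.vocab_size:", config.vocab_size)
    print("len(tokenizer):   ", len(tokenizer))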

How to compute the reported EM for MMLU? And TydiQA on tulu7b is not reproducible

Hi,

I can run your eval code successfully! With tulu-7b on MMLU, I obtain the following results. In your paper you report EM rather than accuracy; may I ask which script or flag I should use to get the EM number?

Average accuracy 0.299 - math
Average accuracy 0.462 - health
Average accuracy 0.352 - physics
Average accuracy 0.622 - business
Average accuracy 0.480 - biology
Average accuracy 0.297 - chemistry
Average accuracy 0.434 - computer science
Average accuracy 0.380 - economics
Average accuracy 0.366 - engineering
Average accuracy 0.372 - philosophy
Average accuracy 0.527 - other
Average accuracy 0.575 - history
Average accuracy 0.540 - geography
Average accuracy 0.569 - politics
Average accuracy 0.513 - psychology
Average accuracy 0.605 - culture
Average accuracy 0.377 - law
Average accuracy 0.359 - STEM
Average accuracy 0.414 - humanities
Average accuracy 0.504 - social sciences
Average accuracy 0.507 - other (business, health, misc.)
Average accuracy: 0.444

RecursionError when fine-tuning

Hi, and thanks for the great work! I was facing the following error while loading the tokenizer from pre-trained weights in the open-instruct/finetune.py script (finetuning with accelerate):
RecursionError: maximum recursion depth exceeded

This issue is resolved when changing AutoTokenizer to LlamaTokenizer.

Also, this solution previously worked for the weight_diff script as well (issue #8).
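A minimal sketch of the workaround described above (the checkpoint path is illustrative; substitute whatever you are loading):

    from transformers import LlamaTokenizer

    # Loading the slow LlamaTokenizer explicitly avoids the unk_token recursion seen with
    # AutoTokenizer and older LLaMA tokenizer configs.
    tokenizer = LlamaTokenizer.from_pretrained("allenai/tulu-7b")
    print(tokenizer.bos_token, tokenizer.eos_token, tokenizer.unk_token)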

GPU budget for finetuning

Hi,

may I ask what GPU memory is required when using the same scripts as yours with the no_offload and offload DeepSpeed configs?

I use 4 GPUs with 48GB memory each and a batch size of 1, but it fails with both the offload and no_offload configs. BTW, I use the LoRA fine-tuning script.

Reproduction issue

I tried to reproduce your results for the vanilla llama-7b base model using your scripts, but there are differences.

The first screenshot shows the results from your paper:

[screenshot: paper results table]

And here are my results using your scripts:

[screenshot: my reproduced results]

My exact eval commands are

MODEL_NAME_OR_PATH="huggyllama/llama-7b"
SAVE_TAG="llama-7b"

# MMLU 0shot
python -m eval.mmlu.run_eval \
    --ntrain 0 \
    --data_dir data/eval/mmlu \
    --save_dir results/mmlu/0shot.$SAVE_TAG \
    --model_name_or_path $MODEL_NAME_OR_PATH \
    --tokenizer_name_or_path $MODEL_NAME_OR_PATH \
    --eval_batch_size 2 \
    --use_chat_format 2>&1 | tee logs/log.mmlu.0shot.$SAVE_TAG

# MMLU 5shot
python -m eval.mmlu.run_eval \
    --ntrain 5 \
    --data_dir data/eval/mmlu \
    --save_dir results/mmlu/5shot.$SAVE_TAG \
    --model_name_or_path $MODEL_NAME_OR_PATH \
    --tokenizer_name_or_path $MODEL_NAME_OR_PATH \
    --eval_batch_size 2 \
    --use_chat_format 2>&1 | tee logs/log.mmlu.5shot.$SAVE_TAG


# GSM cot evaluation
python -m eval.gsm.run_eval \
    --data_dir data/eval/gsm/ \
    --max_num_examples 200 \
    --save_dir results/gsm/cot.$SAVE_TAG \
    --model $MODEL_NAME_OR_PATH \
    --tokenizer $MODEL_NAME_OR_PATH \
    --eval_batch_size 20 \
    --n_shot 8 2>&1 | tee logs/log.gsm.cot.$SAVE_TAG

# GSM no-cot evaluation
python -m eval.gsm.run_eval \
    --data_dir data/eval/gsm/ \
    --max_num_examples 200 \
    --save_dir results/gsm/no-cot.$SAVE_TAG \
    --model $MODEL_NAME_OR_PATH \
    --tokenizer $MODEL_NAME_OR_PATH \
    --eval_batch_size 20 \
    --no_cot \
    --n_shot 8 2>&1 | tee logs/log.gsm.no-cot.$SAVE_TAG

# BBH cot
python -m eval.bbh.run_eval \
    --data_dir data/eval/bbh \
    --save_dir results/bbh/cot.$SAVE_TAG \
    --model $MODEL_NAME_OR_PATH \
    --tokenizer $MODEL_NAME_OR_PATH \
    --eval_batch_size 10 \
    --max_num_examples_per_task 40 \
    --use_chat_format 2>&1 | tee logs/log.bbh.cot.$SAVE_TAG

# BBH no-cot
python -m eval.bbh.run_eval \
    --data_dir data/eval/bbh \
    --save_dir results/bbh/cot.$SAVE_TAG \
    --model $MODEL_NAME_OR_PATH \
    --tokenizer $MODEL_NAME_OR_PATH \
    --eval_batch_size 10 \
    --max_num_examples_per_task 40 \
    --no_cot \
    --use_chat_format 2>&1 | tee logs/log.bbh.no-cot.$SAVE_TAG

# tydiqa with gold passage
python -m eval.tydiqa.run_eval \
    --data_dir data/eval/tydiqa/ \
    --n_shot 1 \
    --max_num_examples_per_lang 100 \
    --max_context_length 512 \
    --save_dir results/tydiqa/gp.$SAVE_TAG \
    --model $MODEL_NAME_OR_PATH \
    --tokenizer $MODEL_NAME_OR_PATH \
    --eval_batch_size 20 \
    --use_chat_format 2>&1 | tee logs/log.tydiqa.gp.$SAVE_TAG

# tydiqa no gold passage, closed-book qa
python -m eval.tydiqa.run_eval \
    --data_dir data/eval/tydiqa/ \
    --n_shot 1 \
    --max_num_examples_per_lang 100 \
    --max_context_length 512 \
    --save_dir results/tydiqa/cb.$SAVE_TAG \
    --model $MODEL_NAME_OR_PATH \
    --tokenizer $MODEL_NAME_OR_PATH \
    --eval_batch_size 80 \
    --no_context \
    --use_chat_format 2>&1 | tee logs/log.tydiqa.cb.$SAVE_TAG

# codex_humaneval
python -m eval.codex_humaneval.run_eval \
    --data_file data/eval/codex_humaneval/HumanEval.jsonl.gz \
    --eval_pass_at_ks 1 5 10 20 \
    --unbiased_sampling_size_n 20 \
    --temperature 0.1 \
    --save_dir results/codex_humaneval/$SAVE_TAG \
    --model $MODEL_NAME_OR_PATH \
    --tokenizer $MODEL_NAME_OR_PATH \
    --eval_batch_size 4 2>&1 | tee logs/log.codex_humaneval.$SAVE_TAG

Seeking your advice on whether there are any problems with my bash scripts :)

Llama2 checkpoints

Hello,
Do you plan to train and release llama2 checkpoints?
That's all, thanks!

mmlu run_eval.py: difference between run_eval.py results (average_acc: 27.4) and expected results (llama2-hf: 41.8)

There is a significant difference between the evaluation results obtained using the run_eval.py script and the expected results. The results I obtained are as follows:
{
"average_acc": 0.27474718701039735,
"subcat_acc": {
"math": 0.21898496240601503,
"health": 0.275,
"physics": 0.25,
"business": 0.33638443935926776,
"biology": 0.2753303964757709,
"chemistry": 0.18151815181518152,
"computer science": 0.308252427184466,
"economics": 0.24258760107816713,
"engineering": 0.2482758620689655,
"philosophy": 0.26192842942345923,
"other": 0.2944206008583691,
"history": 0.310752688172043,
"geography": 0.26262626262626265,
"politics": 0.2978395061728395,
"psychology": 0.29818496110630943,
"culture": 0.3253012048192771,
"law": 0.2762336925694838
},
"cat_acc": {
"STEM": 0.243870112657389,
"humanities": 0.2769394261424017,
"social sciences": 0.2853428664283393,
"other (business, health, misc.)": 0.2902529302899445
}
}

Could you please report your training cost?

It seems that training tulu takes a long time. I am now trying to train tulu based on llama-2-7b, but I have to use 8x A100 40GB, and I also needed to change --mixed_precision bf16 to --mixed_precision fp16. The output looks like this:

0%| | 2/12525 [03:18<344:16:10, 98.97s/it] 10/13/2023 06:31:56 - INFO - __main__ - Step: 2, LR: 0.0, Loss: 1.25099515914917

Is that 344 hours? Why does it take so long?

The alpaca eval is not working can't reproduce

Hey @yizhongw great work.

This isn't working, i.e., there is no such file in the repo:
python eval/alpaca_farm_eval.py --model <model> --batch_size 8

I guess you are pointing to the script below, but it is not working either:
python eval/alpaca_farm/run_eval.py --model openlm-research/open_llama_3b --batch_size 8

Also, if possible, please provide a tutorial for performing the alpaca eval.

Training script for human-mix and tulu mix?

The scripts/prepare_train_data.sh script doesn't create the mixtures. Do you just concatenate the data? Can you add the training scripts for the human-mix and tulu-mix data? finetune_with_accelerate.sh seems to be for individual datasets.

Tulu v2 Sankey Diagram

Hey, I made a quick Sankey diagram for Tulu v2 and thought it might be interesting to share with you. I did this because I noticed FLAN is reused by several different datasets, much like what people are doing with GSM8k, which could potentially cause data contamination. Unfortunately, I still cannot work out some of the detailed relationships correctly.

[Sankey diagram of Tulu v2 data sources]

Made using https://sankeymatic.com/build/
Script:

FLAN v2 [50000] Tulu v2
FLAN v2 CoT [50000] Tulu v2
oasst1 [7708] Tulu v2
ShareGPT [114046] Tulu v2
GPT4-Alpaca [20000] Tulu v2
Code-Alpaca [20022] Tulu v2
LIMA [1030] Tulu v2
Evol Instruct [30000] Tulu v2
Open-Orca [30000] Tulu v2
Hardcoded [140] Tulu v2
Science [7544] Tulu v2

FLAN v2 CoT [75000] Open-Orca
FLAN v2 niv [75000] Open-Orca
FLAN v2 t0 [75000] Open-Orca
FLAN v2 flan [75000] Open-Orca

FLAN v2 CoT [75000] FLAN v2
FLAN v2 niv [75000] FLAN v2
FLAN v2 t0 [75000] FLAN v2
FLAN v2 flan [75000] FLAN v2

Minor difference between results in the paper and my reproduced results

It looks like there is always a minor gap between your results and mine.
tydiQA:
tydiQA:
llama-7b: "average": {
"f1": 37.98539743712254,
"exact_match": 25.333333333333332
}
tulu-7b: "average": {
"f1": 43.70682780615583,
"exact_match": 30.0
}

GSM:
llama-7b:{
"exact_match": 0.08
}

Why can't I match the reported numbers? For llama-7b, I downloaded the llama-7b-hf model from Hugging Face at https://huggingface.co/yahma/llama-7b-hf/tree/main.
For tulu-7b, I downloaded https://huggingface.co/allenai/tulu-7b and ran: python scripts/weight_diff.py recover --path_raw ./llama-7b-hf --path_tuned ./tulu-7b-diff --path_diff ./tulu-7b

I modified nothing except the batch_size due to memory limits, but I don't think that should influence the output.

So can you tell me what's going wrong? I downloaded the code on 10.8.

Overview

Can you please provide a brief comparison of the three models you published yesterday:

  • open-instruct-sharegpt-65b
  • open-instruct-human-mix-65b
  • tulu-65b

contrasting their datasets and intended purposes?

full dataset of flan v2

Hi,

You offer a re-sampled version of FLAN v2 in prepare_train_data.py. May I ask where I can find the full FLAN v2 dataset? I see there are multiple versions on Hugging Face and am not sure which one you use.

about tulu v2 conversion

Hi, I have two questions regarding the tulu v2 dataset.

Q1. If I'm understanding correctly, reformat_datasets.py is used to reproduce the tulu dataset: it converts a bunch of different datasets and concatenates them into tulu v2. I can see random subsampling; have you considered a more curated approach to subsampling from these datasets? Would it help further increase performance? (Or perhaps this research question belongs in a different study.)

Q2. I am trying to convert the tulu v2 dataset into the ShareGPT multi-turn format to fit my existing code. I first download the 3 parquet files, turn them into jsonl, and then convert them into the ShareGPT format. However, I find the conversion extremely hard, whether by regex or even manually: somehow there are always JSON formatting issues that hf.dataset fails to handle. Here is a code snippet I tried to use:

def process_jsonl_file(input_file_path, output_file_path):
    # Define the regex patterns and replacements
    replacements = [
        (r'^"', ''),  # Remove leading quotation mark at the start of each line
        (r'\}\]"', '"}]}'),  # Replace '}]" with "}]},
        (r"\[{'role': 'user', 'content': \\", '{"conversations": [{"from": "human", "value": "'),  # Handle escape character
        (r"\[{'role': 'user', 'content': '", '{"conversations": [{"from": "human", "value": "'),
        (r'\}\n \{', '}, {'),  # Replace }\n { with }, {
        (r"\[{'role': 'system', 'content': '", '{"conversations": [{"from": "human", "value": "'),  # New pattern for 'system' role
    ]

def correct_json_format(json_string):
    # Correct common string format issues
    #corrected_string = json_string.replace("\\n", "\n")
    # Replace specific patterns
    corrected_string = json_string.replace("\"}\n {'role': 'user', 'content': '", "\"}, {\"from\": \"human\", \"value\": \"")
    corrected_string = corrected_string.replace("'}\n {'role': 'user', 'content': '", "\"}, {\"from\": \"human\", \"value\": \"")
    corrected_string = corrected_string.replace("'}\n {'role': 'user', 'content': \"", "\"}, {\"from\": \"human\", \"value\": \"")
    
    corrected_string = corrected_string.replace("'}\n {'role': 'assistant', 'content': \"", "\"}, {\"from\": \"gpt\", \"value\": \"")
    corrected_string = corrected_string.replace("'}\n {'role': 'assistant', 'content': '", "\"}, {\"from\": \"gpt\", \"value\": \"")
    corrected_string = corrected_string.replace("\"}\n {'role': 'assistant', 'content': '", "\"}, {\"from\": \"gpt\", \"value\": \"")

    # Return the corrected string
    return corrected_string

This is the target format (from UltraChat):

{"conversations": [{"from": "human", "value": "Are there any X?"}, {"from": "gpt", "value": "Yes, there are X"}, {"from": "human", "value": "That sounds great! Can you Y?"}, {"from": "gpt", "value": "Sure, here are Y"}]}
{"conversations": [{"from": "human", "value": "What percentage A?"}, {"from": "gpt", "value": "About 71%."}, {"from": "human", "value": "Wow, that's B"}, {"from": "gpt", "value": "Yes, it certainly is! "}]}

Am I doing this correctly? I'm completely lost at this point.

thank you for your help!
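A hedged sketch of doing the conversion with the datasets library instead of regex-editing stringified message lists (the hub dataset id is an assumption — you can also point load_dataset at the local parquet files; the role/content field names match the snippets above, and the output schema matches the UltraChat-style target):

    import json
    from datasets import load_dataset

    # Either the hub dataset or the local parquet files you downloaded:
    # ds = load_dataset("parquet", data_files="data/*.parquet", split="train")
    ds = load_dataset("allenai/tulu-v2-sft-mixture", split="train")

    # Simplification: fold system turns into the human side; adjust to your needs.
    role_map = {"user": "human", "system": "human", "assistant": "gpt"}

    with open("tulu_v2_sharegpt.jsonl", "w") as fout:
        for example in ds:
            conversations = [
                {"from": role_map[m["role"]], "value": m["content"]}
                for m in example["messages"]
            ]
            fout.write(json.dumps({"conversations": conversations}) + "\n")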
