
hsiehjackson / ruler


This repo contains the source code for RULER: What’s the Real Context Size of Your Long-Context Language Models?

License: Apache License 2.0

Dockerfile 0.84% Shell 7.27% Python 91.89%

ruler's Introduction

📏 RULER: What’s the Real Context Size of Your Long-Context Language Models?

This repository contains code for our paper RULER: What's the Real Context Size of Your Long-Context Language Models? RULER generates synthetic examples to evaluate long-context language models with configurable sequence length and task complexity. We benchmark 17 open-source models across 4 task categories (13 tasks in total), evaluating long-context capabilities beyond simple in-context recall. Here are our main results.

| Models | Claimed Length | Effective Length | 4K | 8K | 16K | 32K | 64K | 128K | Avg. | wAvg. (inc) | wAvg. (dec) |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Llama2 (7B) | 4K | | 85.6 | | | | | | | | |
| Gemini-1.5-pro | 1M | >128K | 96.7 | 95.8 | 96.0 | 95.9 | 95.9 | 94.4 | 95.8 | 95.5 (1st) | 96.1 (1st) |
| GPT-4-1106-preview | 128K | 64K | 96.6 | 96.3 | 95.2 | 93.2 | 87.0 | 81.2 | 91.6 | 89.0 (2nd) | 94.1 (2nd) |
| Llama3.1 (70B) | 128K | 64K | 96.5 | 95.8 | 95.4 | 94.8 | 88.4 | 66.6 | 89.6 | 85.5 (6th) | 93.7 (3rd) |
| Qwen2 (72B) | 128K | 32K | 96.9 | 96.1 | 94.9 | 94.1 | 79.8 | 53.7 | 85.9 | 79.6 (12th) | 92.3 (4th) |
| Command-R-plus (104B) | 128K | 32K | 95.6 | 95.2 | 94.2 | 92.0 | 84.3 | 63.1 | 87.4 | 82.7 (9th) | 92.1 (5th) |
| GLM4 (9B) | 1M | 64K | 94.7 | 92.8 | 92.1 | 89.9 | 86.7 | 83.1 | 89.9 | 88.0 (3rd) | 91.7 (6th) |
| Llama3.1 (8B) | 128K | 32K | 95.5 | 93.8 | 91.6 | 87.4 | 84.7 | 77.0 | 88.3 | 85.4 (7th) | 91.3 (7th) |
| Command-R (35B) | 128K | 32K | 93.8 | 93.3 | 92.4 | 89.5 | 84.9 | 76.0 | 88.3 | 85.5 (5th) | 91.1 (8th) |
| MegaBeam-Mistral (7B) | 512K | 32K | 93.8 | 92.5 | 92.0 | 89.2 | 83.7 | 83.7 | 89.1 | 87.3 (4th) | 91.0 (9th) |
| GradientAI/Llama3 (70B) | 1M | 16K | 95.1 | 94.4 | 90.8 | 85.4 | 80.9 | 72.1 | 86.5 | 82.6 (10th) | 90.3 (10th) |
| Mixtral-8x22B (39B/141B) | 64K | 32K | 95.6 | 94.9 | 93.4 | 90.9 | 84.7 | 31.7 | 81.9 | 73.5 (16th) | 90.3 (11th) |
| Yi (34B) | 200K | 32K | 93.3 | 92.2 | 91.3 | 87.5 | 83.2 | 77.3 | 87.5 | 84.8 (8th) | 90.1 (12th) |
| Phi3-mini (3.8B) | 128K | 32K | 92.2 | 91.5 | 90.7 | 87.5 | 80.6 | 66.7 | 84.8 | 80.9 (11th) | 88.7 (13th) |
| Phi3-medium (14B) | 128K | 32K | 93.3 | 93.2 | 91.1 | 86.8 | 78.6 | 46.1 | 81.5 | 74.8 (15th) | 88.3 (14th) |
| Mixtral-8x7B (12.9B/46.7B) | 32K | 32K | 94.9 | 92.1 | 92.5 | 85.9 | 72.4 | 44.5 | 80.4 | 72.8 (17th) | 87.9 (15th) |
| GradientAI/Llama3 (8B) | 1M | 16K | 92.8 | 90.3 | 85.7 | 79.9 | 76.3 | 69.5 | 82.4 | 78.5 (13th) | 86.3 (16th) |
| FILM-7B* (7B) | 32K | 32K | 92.8 | 88.2 | 88.1 | 86.9 | 70.1 | 27.1 | 75.5 | 66.4 (19th) | 84.7 (17th) |
| InternLM2.5 (7B) | 1M | 4K | 88.1 | 85.5 | 84.5 | 82.7 | 75.5 | 68.9 | 80.9 | 77.8 (14th) | 83.9 (18th) |
| Mistral (7B) | 32K | 16K | 93.6 | 91.2 | 87.2 | 75.4 | 49.0 | 13.8 | 68.4 | 55.6 (21st) | 81.2 (19th) |
| Mistral-Nemo | 128K | 16K | 87.8 | 87.2 | 87.7 | 69.0 | 46.8 | 19.0 | 66.2 | 54.7 (22nd) | 77.8 (20th) |
| GLM3 (6B) | 128K | 4K | 87.8 | 83.4 | 78.6 | 69.9 | 56.0 | 42.0 | 69.6 | 62.0 (20th) | 77.2 (21st) |
| LWM (7B) | 1M | <4K | 82.3 | 78.4 | 73.7 | 69.1 | 68.1 | 65.0 | 72.8 | 69.9 (18th) | 75.7 (22nd) |
| DBRX (36B/132B) | 32K | 8K | 95.1 | 93.8 | 83.6 | 63.1 | 2.4 | 0.0 | 56.3 | 38.0 (23rd) | 74.7 (23rd) |
| Qwen1.5 (72B) | 32K | 8K | 94.9 | 93.8 | 78.0 | 67.8 | 0.0 | 0.0 | 55.7 | 37.5 (24th) | 74.0 (24th) |
| Together (7B) | 32K | 4K | 88.2 | 81.1 | 69.4 | 63.0 | 0.0 | 0.0 | 50.3 | 33.8 (25th) | 66.7 (25th) |
| LongChat (7B) | 32K | <4K | 84.7 | 79.9 | 70.8 | 59.3 | 0.0 | 0.0 | 49.1 | 33.1 (26th) | 65.2 (26th) |
| LongAlpaca (13B) | 32K | <4K | 60.6 | 57.0 | 56.6 | 43.6 | 0.0 | 0.0 | 36.3 | 24.7 (27th) | 47.9 (27th) |
  • Despite achieving nearly perfect performance on the vanilla needle-in-a-haystack (NIAH) test, all models (except Gemini-1.5-pro) exhibit large degradation on RULER tasks as sequence length increases.
  • While all models claim a context size of 32K tokens or greater, only half of them can effectively handle a sequence length of 32K by exceeding a qualitative threshold: Llama2 (7B) performance at 4K (85.6%). Scores exceeding the threshold are underlined in the original README table.
  • Almost all models fall below the threshold before reaching their claimed context lengths.
  • Notes (FILM-7B)
    • The results were submitted by the authors of the FILM-7B paper. They use YaRN without further training for evaluation lengths exceeding 32K (64K and 128K).
    • They do not use the one-shot example for the CWE task.

💡 Requirements

  • Docker container: docker pull cphsieh/ruler:0.1.0
  • The requirements are listed in docker/Dockerfile and docker/requirements.txt. Use the following command to build the container based on NVIDIA's PyTorch container nvcr.io/nvidia/pytorch:23.08-py3.
cd docker/
DOCKER_BUILDKIT=1 docker build -f Dockerfile -t cphsieh/ruler:0.1.0 .

🔍 Evaluate long-context LMs

1. Download data

cd scripts/data/synthetic/json/
python download_paulgraham_essay.py
bash download_qa_dataset.sh

2. Download model

  • We download the models from Hugging Face.
  • The input template of each model is stored in scripts/data/template.py. Please add a new template if your model uses a different chat format (see the sketch below).
  • (Optional) If you are using TensorRT-LLM, please build your model engine based on their example scripts (e.g., Llama) with their Docker container.
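
For reference, here is a minimal sketch of what a new entry in scripts/data/template.py could look like. The my-chat key and the <|user|>/<|assistant|> tags are placeholders for whatever chat format your model actually expects; {task_template} is filled in by the data-preparation scripts.

# scripts/data/template.py (sketch): add a key to the existing Templates dict.
# 'my-chat' and the tags below are hypothetical; adapt them to your model's chat format.
Templates = {
    'base': "{task_template}",

    'my-chat': """<|user|>
{task_template}
<|assistant|>
""",
}

The matching MODEL_TEMPLATE_TYPE in scripts/config_models.sh would then be my-chat.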

3. Run evaluation pipeline

  • Setup run.sh
GPUS="" # number of GPUs
ROOT_DIR="" # the path that stores generated task samples and model predictions. 
MODEL_DIR="" # the path that contains individual model folders from Huggingface.
ENGINE_DIR="" # the path that contains individual engine folders from TensorRT-LLM.
  • Setup config_models.sh
case $MODEL_NAME in
    YOUR_HF_MODEL_NAME)
        MODEL_PATH=${MODEL_DIR}/YOUR_MODEL_FOLDER
        MODEL_TEMPLATE_TYPE="" # base, meta-chat, etc. defined in `scripts/data/template.py`
        MODEL_FRAMEWORK="" # hf or vllm
        ;;
    YOUR_TRTLLM_ENGINE_NAME)
        MODEL_PATH=${ENGINE_DIR}/YOUR_ENGINE_FOLDER
        MODEL_TEMPLATE_TYPE="" # base, meta-chat, etc. defined in `scripts/data/template.py`
        MODEL_FRAMEWORK="trtllm"
        ;;
    YOUR_OPENAI_MODEL_NAME)
        MODEL_PATH="" # OpenAI model name listed in https://platform.openai.com/docs/models/
        MODEL_TEMPLATE_TYPE="base"
        MODEL_FRAMEWORK="openai"
        TOKENIZER_PATH="cl100k_base"
        TOKENIZER_TYPE="openai"
        OPENAI_API_KEY="" # your OpenAI API key
        ;;
    YOUR_GEMINI_MODEL_NAME)
        MODEL_PATH="" # Gemini model name listed in https://ai.google.dev/gemini-api/docs/models/gemini
        MODEL_TEMPLATE_TYPE="base"
        MODEL_FRAMEWORK="gemini"
        TOKENIZER_PATH=$MODEL_PATH
        TOKENIZER_TYPE="gemini"
        GEMINI_API_KEY="" # your Gemini API key
        ;;
  • Start evaluation based on our default synthetic benchmark
bash run.sh YOUR_MODEL_NAME synthetic

🧠 (Optional) Customize task complexity

The tasks to be evaluated are listed in scripts/config_tasks.sh. The configuration of each task is defined in scripts/synthetic.yaml. The complexity of each task can be adjusted through the arguments described in detail below.

Category: Retrieval
Task: niah
  • type_haystack: repeat / essay / needle
    (repeat: repeated noise sentences; essay: Paul Graham essays; needle: distractor needles)
  • type_needle_k: words / numbers / uuids
  • type_needle_v: words / numbers / uuids
    (words: adjective-noun; numbers: 7 digits; uuids: 32 digits)
  • num_needle_k: int >= 1 (insert multiple needles into the haystack)
  • num_needle_v: int >= 1 (retrieve multiple values from a single key)
  • num_needle_q: int >= 1 (retrieve multiple values from multiple keys)

Category: Multi-hop Tracing
Task: variable_tracking
  • num_chains: int >= 1 (number of variable name-binding chains)
  • num_hops: int >= 1 (number of name-binding hops in each chain)

Category: Aggregation
Task: common_words_extraction
  • freq_cw: int >= 1 (frequency of common words)
  • freq_ucw: int >= 1 (frequency of uncommon words)
  • num_cw: int >= 1 (number of common words)

Category: Aggregation
Task: freq_words_extraction
  • alpha: float > 1.0 (parameter of the distribution used to draw synthetic words; reduce alpha to increase the difficulty of this task. Increasing the number of words to return also increases difficulty; we use 3 in our evaluations because models perform worse at short context sizes when more words must be returned.)

Category: Question Answering
Task: qa
  • dataset: squad or hotpotqa (the short-context QA dataset we build upon)

🚀 (Optional) Contribute a new synthetic task

1. Create a python script for data preparation

  • Add basic arguments (required) and complexity configurations in the Python script (see the skeleton sketched below).
  • Verify the script is reproducible given a tokenizer, a sequence length, and a random seed.
  • Save the script under the folder scripts/data/synthetic.
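
As a starting point, here is a hedged skeleton of such a script. The shared argument names mirror those accepted by the existing generators (e.g., qa.py); the my_task name, the --num_items flag, the field names in the output records, and the validation.jsonl layout are illustrative assumptions only.

# scripts/data/synthetic/my_task.py (hypothetical skeleton, not part of the repository)
import argparse
import json
import os
import random

parser = argparse.ArgumentParser()
# basic arguments shared with the existing data-preparation scripts
parser.add_argument("--save_dir", required=True)
parser.add_argument("--save_name", required=True)
parser.add_argument("--tokenizer_path", required=True)
parser.add_argument("--tokenizer_type", default="hf")
parser.add_argument("--max_seq_length", type=int, required=True)
parser.add_argument("--tokens_to_generate", type=int, required=True)
parser.add_argument("--num_samples", type=int, required=True)
parser.add_argument("--random_seed", type=int, default=42)
parser.add_argument("--template", type=str, required=True)
parser.add_argument("--remove_newline_tab", action="store_true")
# task-specific complexity configuration (made up for illustration)
parser.add_argument("--num_items", type=int, default=1)
args = parser.parse_args()

# reproducibility: the same tokenizer, sequence length, and seed should yield the same data
random.seed(args.random_seed)

def generate_samples():
    for index in range(args.num_samples):
        # build the context so that the tokenized prompt fits within args.max_seq_length
        input_text = args.template.format(context="...", query="...")
        yield {"index": index, "input": input_text, "outputs": ["..."]}

save_file = os.path.join(args.save_dir, args.save_name, "validation.jsonl")
os.makedirs(os.path.dirname(save_file), exist_ok=True)
with open(save_file, "w") as fout:
    for sample in generate_samples():
        fout.write(json.dumps(sample) + "\n")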

2. Add task template

  • Add template and tokens_to_generate in scripts/data/synthetic/constants.py.
  • Add answer_prefix to prevent the model from refusing to answer (see the sketch below).
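
A minimal sketch of such an entry, assuming the dict layout used by the existing tasks in scripts/data/synthetic/constants.py; the my_task name, prompt wording, and field values are placeholders.

TASKS = {
    # ... existing tasks ...
    'my_task': {
        'tokens_to_generate': 128,
        'template': "Read the following text carefully.\n{context}\nQuestion: {query}",
        'answer_prefix': " Answer:",  # nudges the model to answer instead of refusing
    },
}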

3. Add evaluation metric

  • Add the automatic metric to evaluate your task in scripts/eval/synthetic/constants.py (see the example below).
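
A hedged example, following the signature of the existing metrics in that file (a list of prediction strings and a list of reference lists, returning a 0-100 score); the exact-match logic itself is only illustrative.

# scripts/eval/synthetic/constants.py (sketch): hypothetical metric for a new task
def string_exact_match(preds, refs):
    # each ref is a list of acceptable answers; a prediction counts as correct
    # if it exactly matches any of them after lowercasing and stripping
    score = sum(
        max(1.0 if pred.strip().lower() == r.strip().lower() else 0.0 for r in ref)
        for pred, ref in zip(preds, refs)
    ) / len(preds) * 100
    return round(score, 2)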

4. Add required configurations

  • Define your task name and complexity configurations in scripts/synthetic.yaml.
  • Add your task name in scripts/config_tasks.sh.

🛠️ Limitations

While tasks in RULER are designed to be configurable, we only evaluate the above models with 13 task configurations. These tasks were selected because most models achieve good (some almost perfect) performance at short context sizes (<= 4K), which leaves ample room to observe degradation as we extend the input length. We did not include more complex tasks on which models perform worse even at short context sizes, and we did not stress test every model with more difficult task configurations. Although RULER covers four task categories, extends previous evaluation protocols, and provides a clean test bed for sanity-checking LMs with known upper-bound performance, it is by no means comprehensive, and it cannot replace more realistic tasks, which remain preferable. We welcome contributions of new tasks and/or new task categories to help evaluate long-context capabilities.

📝 Citation

@article{hsieh2024ruler,
  title={RULER: What's the Real Context Size of Your Long-Context Language Models?},
  author={Cheng-Ping Hsieh and Simeng Sun and Samuel Kriman and Shantanu Acharya and Dima Rekesh and Fei Jia and Yang Zhang and Boris Ginsburg},
  journal={arXiv preprint arXiv:2404.06654},
  year={2024}
}

Disclaimer: This project is strictly for research purposes, and not an official product from NVIDIA.

ruler's People

Contributors

chandler-bing, hsiehjackson, luc1an0-h3, luckyoungrous, viktorooreps, wangmerlyn, ying1123


ruler's Issues

Reproducing results 4k (LLaMA-2 7B chat, Mistral 7B Instruct v0.2)

Great work on the benchmark. Before benchmarking a model continually pre-trained with Infini-Attention, I wanted to do some sanity checks on the benchmark on reproducibility.

Experimental setup:

MODEL_PATH: meta-llama/Llama-2-7b-chat-hf [hf link](https://huggingface.co/meta-llama/Llama-2-7b-chat-hf)
MODEL_TEMPLATE_TYPE="meta-chat" # for mistral "mistral": "<s>[INST] {task_template} [/INST]",
MODEL_FRAMEWORK="vllm"
TEMPERATURE="0.0" # greedy
TOP_P="1.0"
TOP_K="32"
SEQ_LENGTHS=(
    4096
)
NUM_SAMPLES=500
REMOVE_NEWLINE_TAB=false
STOP_WORDS=""

The task parameters in synthetic.yaml are unedited from the current main branch.

Results

For LLaMA-2 7b chat this is the summary.csv (4k sequence length)

| Tasks | niah_single_1 | niah_single_2 | niah_single_3 | niah_multikey_1 | niah_multikey_2 | niah_multikey_3 | niah_multivalue | niah_multiquery | vt | cwe | fwe | qa_1 | qa_2 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Score | 100.0 | 100.0 | 99.6 | 98.4 | 91.8 | 89.2 | 98.65 | 97.75 | 65.36 | 92.98 | 80.4 | 64.6 | 41.6 |
| Nulls | 0/500 | 0/500 | 0/500 | 0/500 | 0/500 | 0/500 | 0/500 | 0/500 | 0/500 | 0/500 | 1/500 | 0/500 | 0/500 |

This gives an average of 80.02% at 4K, compared to the reported 85.6%.

For Mistral 7B Instruct v0.2 this is the summary.csv (4k sequence length):

| Tasks | niah_single_1 | niah_single_2 | niah_single_3 | niah_multikey_1 | niah_multikey_2 | niah_multikey_3 | niah_multivalue | niah_multiquery | vt | cwe | fwe | qa_1 | qa_2 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Score | 100.0 | 99.0 | 95.2 | 98.0 | 99.8 | 96.4 | 95.9 | 97.1 | 98.44 | 98.68 | 87.2 | 84.8 | 62.2 |
| Nulls | 0/500 | 0/500 | 0/500 | 0/500 | 0/500 | 0/500 | 0/500 | 0/500 | 0/500 | 0/500 | 0/500 | 0/500 | 0/500 |

This gives an average of 86.62% at 4K, compared to the reported 93.6%.

Questions

  • Do you have any idea how to reproduce the numbers? Are the parameters set correctly?
  • Could you post a pip freeze to encourage reproducibility? A pip install of the current requirements.txt, both in Apptainer and in a virtual environment, causes incompatibility issues, so I had to downgrade some libraries.

This will also help me in reproducing the Phi-3 128k results, as I also got around 54% average for 4k.
Thanks!

No Generated Output and JSON Serialization Error when calling llm directly in VLLMClient

Description:

I was unable to run the vllm server on my server, so I modified the VLLMClient class in the RULER/scripts/pred/client_wrappers.py file. Below is the modified code for the VLLMClient class:

class VLLMClient(Client):
    def _single_call(
        self,
        prompts,
        tokens_to_generate,
        temperature,
        top_p,
        top_k,
        random_seed,
        stop: List[str],
    ):
        request = {
            "prompt": prompts[0],
            "max_tokens": tokens_to_generate,
            "temperature": temperature,
            "top_k": top_k,
            "top_p": top_p,
            "stop": stop
        }

        from vllm import LLM, SamplingParams
        # Create a sampling params object.
        sampling_params = SamplingParams(temperature=0.0, top_p=0.95) #kept purposefully

        # Create an LLM.
        llm = LLM(model="meta-llama/Meta-Llama-3-8B-Instruct", gpu_memory_utilization=0.6)
        # Generate outputs containing the prompt, generated text, and other information.
        outputs = llm.generate(prompts, sampling_params)
        # Print the outputs.
        for output in outputs:
            prompt = output.prompt
            generated_text = output.outputs[0].text if output.outputs else None
            print(f"Generated text: {generated_text!r}")
        return outputs

When I run the command bash run.sh llama3 synthetic, I do not see any generated text and encounter the following error:

Generated text: ''
Traceback (most recent call last):
  File "/RULER/scripts/pred/call_api.py", line 280, in <module>
    main()
  File "/RULER/scripts/pred/call_api.py", line 276, in main
    fout.write(json.dumps(outputs_parallel[computed_idx]) + '\n')
  File "/anaconda3/envs/ruler/lib/python3.10/json/__init__.py", line 231, in dumps
    return _default_encoder.encode(obj)
  File "/anaconda3/envs/ruler/lib/python3.10/json/encoder.py", line 199, in encode
    chunks = self.iterencode(o, _one_shot=True)
  File "/anaconda3/envs/ruler/lib/python3.10/json/encoder.py", line 257, in iterencode
    return _iterencode(o, 0)
  File "/anaconda3/envs/ruler/lib/python3.10/json/encoder.py", line 179, in default
    raise TypeError(f'Object of type {o.__class__.__name__} '
TypeError: Object of type RequestOutput is not JSON serializable

I am trying to get the scores for llama3-8B and then I want to test on llama3-8B-1M context. Could you please help me resolve this issue?
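
For what it's worth, the traceback indicates that the modified _single_call returns vllm RequestOutput objects, which json.dumps cannot handle. A minimal, hypothetical workaround (not an official fix, and assuming the caller only needs the generated strings) would be to extract the text before returning:

from typing import List

def extract_texts(outputs) -> List[str]:
    # convert vllm RequestOutput objects into plain strings so call_api.py can
    # serialize them with json.dumps; each RequestOutput exposes .outputs[0].text
    return [out.outputs[0].text if out.outputs else "" for out in outputs]

# e.g. at the end of _single_call:
#     return extract_texts(llm.generate(prompts, sampling_params))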

How to test models with context lengths larger than 128K?

Hi @hsiehjackson ,

I tried using your repo to test HF models like gradientai/Llama-3-8B-Instruct-Gradient-1048k, but couldn't load the entire model on a single A100 GPU. I would like to use the accelerate library (or anything else) to load the model for experiments beyond 32K (currently I can test up to 32K on my GPU). Would love to hear how we can achieve this.

lost in the middle problem

In the NIAH task, do you address the lost-in-the-middle problem? That is, can we control the needle placement so that needles are inserted only in the middle, not at the beginning or end, since that seems to be where the difficulty of the problem lies?

Template for Yi?

Great job for this project!

Can we have the template you used for the evaluation of Yi?
Or did you simply use the meta-chat for Yi?

Please let me know for the sake of reproducibility!

pre_sample in qa code

I see `parser.add_argument("--pre_samples", type=int, default=0, help='number of samples are already generated')` used here for qa, and `input_text, answer = generate_input_output(index + args.pre_samples, used_docs)`. Why would you ever need this argument?

Time taken on 8 A100?

Hi,

Great work! I find this eval suite very handy. Just curious, how long would it take to perform the full evaluation of a 7B model with 8 A100s?

Gemini flash 1.5 results

Does anyone have the results for this model? It seems to hallucinate quite a lot in long-context prompts, even though it has a context size of a million tokens.

Thanks.

Hope to add qwen2-7b-chat results

Thanks for the great project. I hope you can add official qwen2-7b-chat results to compare with glm-4-9b-chat.
According to the Qwen tech report, Qwen2 is much better than GLM4 on long-context evaluation, but I doubt it...

Show Gemini Pro results

Hi,

your paper doesn’t show Gemini Pro results with 1M tokens.
Please test against this model.

niah.py hang with hf models

envs:

ubuntu 22.04
docker image ruler-0.1
model jamba-v0.1
A100 x4

# added model_config ....

Templates = {
    'base': "{task_template}",

    'jamba': """<|im_start|>system 
You are a helpful AI assistant.
<|im_end|> 
<|im_start|>user
{task_template}
<|im_end|> 
<|im_start|>assistant
""",

  case $MODEL_NAME in
      jamba)
          MODEL_PATH="/home/scratch.jianh_inf/.cache/huggingface/hub/models--lightblue--Jamba-v0.1-chat-multilingual/snapshots/38a2d5d2301ba642d1a48be1251a825022f78730"
          MODEL_TEMPLATE_TYPE="jamba"
          MODEL_FRAMEWORK="hf"
          ;;

bash run.sh jamba synthetic

This script has been hanging here for 12 hours.

logs

he/huggingface/hub/models--lightblue--Jamba-v0.1-chat-multilingual/snapshots/38a2d5d2301ba642d1a48be1251a825022f78730:hf:::::
+ IFS=:
+ read MODEL_PATH MODEL_TEMPLATE_TYPE MODEL_FRAMEWORK TOKENIZER_PATH TOKENIZER_TYPE OPENAI_API_KEY GEMINI_API_KEY AZURE_ID AZURE_SECRET AZURE_ENDPOINT
+ '[' -z /home/scratch.jianh_inf/.cache/huggingface/hub/models--lightblue--Jamba-v0.1-chat-multilingual/snapshots/38a2d5d2301ba642d1a48be1251a825022f78730 ']'
+ export OPENAI_API_KEY=
+ OPENAI_API_KEY=
+ export GEMINI_API_KEY=
+ GEMINI_API_KEY=
+ export AZURE_API_ID=
+ AZURE_API_ID=
+ export AZURE_API_SECRET=
+ AZURE_API_SECRET=
+ export AZURE_API_ENDPOINT=
+ AZURE_API_ENDPOINT=
+ source config_tasks.sh
++ NUM_SAMPLES=500
++ REMOVE_NEWLINE_TAB=false
++ STOP_WORDS=
++ '[' -z '' ']'
++ STOP_WORDS=
++ '[' false = false ']'
++ REMOVE_NEWLINE_TAB=
++ synthetic=("niah_single_1" "niah_single_2" "niah_single_3" "niah_multikey_1" "niah_multikey_2" "niah_multikey_3" "niah_multivalue" "niah_multiquery" "vt" "cwe" "fwe" "qa_1" "qa_2")
+ BENCHMARK=synthetic
+ declare -n TASKS=synthetic
+ '[' -z niah_single_1 ']'
+ '[' hf == vllm ']'
+ '[' hf == trtllm ']'
+ for MAX_SEQ_LENGTH in "${SEQ_LENGTHS[@]}"
+ RESULTS_DIR=/home/scratch.jianh_gpu/projects/RULER/jamba/synthetic/131072
+ DATA_DIR=/home/scratch.jianh_gpu/projects/RULER/jamba/synthetic/131072/data
+ PRED_DIR=/home/scratch.jianh_gpu/projects/RULER/jamba/synthetic/131072/pred
+ mkdir -p /home/scratch.jianh_gpu/projects/RULER/jamba/synthetic/131072/data
+ mkdir -p /home/scratch.jianh_gpu/projects/RULER/jamba/synthetic/131072/pred
+ for TASK in "${TASKS[@]}"
+ python data/prepare.py --save_dir /home/scratch.jianh_gpu/projects/RULER/jamba/synthetic/131072/data --benchmark synthetic --task niah_single_1 --tokenizer_path /home/scratch.jianh_inf/.cache/huggingface/hub/models--lightblue--Jamba-v0.1-chat-multilingual/snapshots/38a2d5d2301ba642d1a48be1251a825022f78730 --tokenizer_type hf --max_seq_length 131072 --model_template_type jamba --num_samples 500
python /home/scratch.jianh_gpu/projects/RULER/scripts/data/synthetic/niah.py         --save_dir  /home/scratch.jianh_gpu/projects/RULER/jamba/synthetic/131072/data         --save_name niah_single_1         --subset validation         --tokenizer_path /home/scratch.jianh_inf/.cache/huggingface/hub/models--lightblue--Jamba-v0.1-chat-multilingual/snapshots/38a2d5d2301ba642d1a48be1251a825022f78730         --tokenizer_type hf         --max_seq_length 131072         --tokens_to_generate 128         --num_samples 500         --random_seed 42         --type_haystack repeat --type_needle_k words --type_needle_v numbers --num_needle_k 1 --num_needle_v 1 --num_needle_q 1                           --template "<|im_start|>system
You are a helpful AI assistant.
<|im_end|>
<|im_start|>user
Some special magic {type_needle_v} are hidden within the following text. Make sure to memorize it. I will quiz you about the {type_needle_v} afterwards.
{context}
What are all the special magic {type_needle_v} for {query} mentioned in the provided text?
<|im_end|>
<|im_start|>assistant
 The special magic {type_needle_v} for {query} mentioned in the provided text are"

Error in hugging face links in README

For what it's worth, I've noticed that a few Hugging Face links in the introductory table of the README are wrong, most of the time missing the 't' of 'instruct'. Other than that, thank you for making this code available!

Is there a particular reason to not support batch processing?

Every 2.0s: nvidia-smi                                   d26b4303cee2: Tue Jul 16 21:03:36 2024

Tue Jul 16 21:03:36 2024
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.104.05             Driver Version: 535.104.05   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA A100-SXM4-40GB          Off | 00000000:00:04.0 Off |                    0 |
| N/A   43C    P0             149W / 400W |  20439MiB / 40960MiB |     52%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
+---------------------------------------------------------------------------------------+

I run LLaMA 2 on an A100 GPU (on Google Colab, so maybe the environment is not perfect) and get about 50% utilization.

I can try to implement batching myself, but I need some advice on what to avoid and what not to break.

dataset argument for qa.py not specified

In the sample command you specify for qa.py, you don't specify the --dataset argument (https://github.com/hsiehjackson/RULER/blob/main/scripts/data/synthetic/qa.py#L58),
and I am getting this error. Can you let me know what the dataset should be? I suppose you pass it somewhere when you run things end to end?

(long-context) vivekkaul@Viveks-MacBook-Pro synthetic % python qa.py \
    --save_dir=./ \
    --save_name=qa \
    --tokenizer_path=tokenizer.model \
    --tokenizer_type=hf \
    --max_seq_length=4096 \
    --tokens_to_generate=128 \
    --num_samples=10 \
    --template="Answer the question based on the given documents. Only give me the answer and do not output any other words.\n\nThe following are given documents.\n\n{context}\n\nAnswer the question based on the given documents. Only give me the answer and do not output any other words.\n\nQuestion: {query} Answer:"
usage: qa.py [-h] --save_dir SAVE_DIR --save_name SAVE_NAME [--subset SUBSET] --tokenizer_path TOKENIZER_PATH [--tokenizer_type TOKENIZER_TYPE]
             --max_seq_length MAX_SEQ_LENGTH --tokens_to_generate TOKENS_TO_GENERATE --num_samples NUM_SAMPLES [--pre_samples PRE_SAMPLES]
             [--random_seed RANDOM_SEED] --template TEMPLATE [--remove_newline_tab] --dataset DATASET
qa.py: error: the following arguments are required: --dataset

Prediction format during evals

Hi Authors,

I am trying hard to understand the post-processing format before the evaluation step.
For a particular case I am looking at, the prediction (LLM output) is a string, for eg:
'1. arthur 2. kilt 3. fire 4. meter 5. appliance 6. behalf 7. forest 8. activity 9. authenticity 10. ferret'
The corresponding ground-truth is
['appliance', 'meter', 'forest', 'ferret', 'kilt', 'behalf', 'fire', 'activity', 'arthur', 'authenticity']

When I do the string_match_all function, the output is 0.31. This does not look right.

Specifically https://github.com/hsiehjackson/RULER/blob/main/scripts/eval/synthetic/constants.py#L29 this line is doing a zip between a string and a list, which would be a character-wise zip.

Where am I going wrong?

Score is always 0.0, and it takes so long to prepare the dataset

Hi, I am following the instructions to run the synthetic benchmark.

I use the LLaMA-2-chat-hf model, and I specify the path in run.sh

GPUS="1" # GPU size for tensor_parallel.
ROOT_DIR="RULER/results" # the path that stores generated task samples and model predictions. 
MODEL_DIR="RULER/models" # the path that contains individual model folders from HUggingface.
ENGINE_DIR="" # the path that contains individual engine folders from TensorRT-LLM.

However, I found that prepare.py takes very long to run:

for TASK in "${TASKS[@]}"; do
        echo "Start prepare..."
        python data/prepare.py \
            --save_dir ${DATA_DIR} \
            --benchmark ${BENCHMARK} \
            --task ${TASK} \
            --tokenizer_path ${TOKENIZER_PATH} \
            --tokenizer_type ${TOKENIZER_TYPE} \
            --max_seq_length ${MAX_SEQ_LENGTH} \
            --model_template_type ${MODEL_TEMPLATE_TYPE} \
            --num_samples ${NUM_SAMPLES} \
            ${REMOVE_NEWLINE_TAB}
        echo "Start call api..."
        python pred/call_api.py \
            --data_dir ${DATA_DIR} \
            --save_dir ${PRED_DIR} \
            --benchmark ${BENCHMARK} \
            --task ${TASK} \
            --server_type ${MODEL_FRAMEWORK} \
            --model_name_or_path ${MODEL_PATH} \
            --temperature ${TEMPERATURE} \
            --top_k ${TOP_K} \
            --top_p ${TOP_P} \
            ${STOP_WORDS}
    done

And the evaluation score is always 0.0. For example, I use the hotpotQA task (qa_2), and it keeps outputting:

Prepare qa_2 with lines: 500 to RULER/results/llama2-7b-chat/synthetic/131072/data/qa_2/validation.jsonl
Used time: 5.8 minutes
Start call api...
Predict qa_2 
from RULER/results/llama2-7b-chat/synthetic/131072/data/qa_2/validation.jsonl
to RULER/results/llama2-7b-chat/synthetic/131072/pred/qa_2.jsonl
0it [00:00, ?it/s]
Used time: 0.0 minutes
Start evaluate...
Total tasks: ['qa_2']
Evaluate task qa_2...
100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 500/500 [00:00<00:00, 517943.20it/s]

=============================================

       0        1
0  Tasks     qa_2
1  Score      0.0
2  Nulls  500/500

Saved eval results to RULER/results/llama2-7b-chat/synthetic/131072/pred/summary-qa_2.csv

Saved submission results to RULER/results/llama2-7b-chat/synthetic/131072/pred/submission.csv

I'd appreciate any help on this.

prediction evaluation statistics

Using an HF model, I got the prediction text as "pred": {"text": [" The special magic number for wandering-age mentioned in the provided text is 8090293"]}. I suppose it should be only 8090293, without the other text?

I was using a local Hugging Face model with the parameters specified here
https://huggingface.co/microsoft/Phi-3-mini-128k-instruct and I was wondering if this post-processing is correct:

def process_data(data):
    prompt = data['input']
    messages = [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": data['input']}
    ]
    output = pipe(messages, **generation_args)
    assert len(output) == 1
    generated_text = output[0]['generated_text']
    print(len(generated_text))
    print(generated_text)
    if generated_text.startswith(prompt):
        generated_text = generated_text[len(prompt):]
    return {'text': [generated_text]}

Llama 3 rope theta

Thanks for the great work!

From the README:

The results are evaluated by changing rope_theta to 16M in here.

Can I know the reason for adjusting rope_theta here rather than directly using say dynamic rope scaling? Thanks!

Raw scores?

Hey authors, really nice work!

The paper shows scores that are averaged across tasks for each test. Are the full set of task scores per model available anywhere? Particularly, for Gemini, only the final averaged score is available on Github.

Also, any plans to test beyond 128k for Gemini? Given that the test doesn't saturate at 128k for Gemini, it seems important.

how do you take care of the presence of 'and' in the output in the evaluation

Sometimes the output can also contain "and", for example in the multi-value case. Should we account for that and evaluate the string match accordingly, or did you never face such an issue?

My model's output was

3728882, 7210606, 7120868, and 8606962

and the expected outputs are
['8606962', '7120868', '3728882', '7210606']

Why do you use partial match max metric for QA

Just wanted to know why we have https://github.com/hsiehjackson/RULER/blob/main/scripts/eval/synthetic/constants.py#L25.
Why is this different from string_match_all for QA specifically? Basically, if any of the predictions match the reference, it is OK? I didn't quite understand this well.

def string_match_part(preds, refs):
    score = sum([max([1.0 if r.lower() in pred.lower() else 0.0 for r in ref]) for pred, ref in zip(preds, refs)]) / len(preds) * 100
    return round(score, 2)
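
For illustration, a small made-up example of how the definition above behaves: a sample scores 1.0 as soon as any one of its reference strings appears inside the prediction.

preds = ["The answer is Paris, I think."]   # illustrative values, not from the benchmark
refs = [["Paris", "Paris, France"]]
print(string_match_part(preds, refs))       # 100.0, because "paris" occurs in the prediction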

When will the code be released?

Hello! I just read your impressive work! It solved one of my long-standing concerns: Do those "long-context" LLMs truly have real long context windows? I can't wait to test more models using your codebase!

questions about ICL code for variable tracking

Thanks for your work! I am a bit confused about the code here:
https://github.com/hsiehjackson/RULER/blob/main/scripts/data/synthetic/variable_tracking.py#L116

        print(f'internal {is_icl}')
        cutoff = template.index(TASKS['variable_tracking']['template'][:20])
        cutoff_ans = template.index(TASKS['variable_tracking']['answer_prefix'][:10])
        template = ' '.join(template[cutoff:cutoff_ans].split()[:-1]) + template[cutoff_ans:]

I had a few questions on the code:

  1. Why do you need to use cutoff and cutoff_ans? Is this to remove [INST] or the model template? Won't the model template change with every model? Secondly, why do you take answer_prefix[:10]? I don't understand the reason for this.
  2. From what I understand, in the first pass you generate one ICL example, and in doing so you use variables of length 3 (otherwise it is 5); in the second pass you generate the actual text in which to find the variables, right?
  3. https://github.com/hsiehjackson/RULER/blob/main/scripts/data/synthetic/variable_tracking.py#L128 why do we return vars[0] and not vars from generate_input_output?
  4. Why do you remove the last word after splitting here: ' '.join(template[cutoff:cutoff_ans].split()[:-1])?
  5. If there is no inserted model template and we only use the task template, do we not need this code?

Question about files nouns.list and verbs.list

Just wanted to know whether you have your own nounlist.txt and verblist.txt files or whether they are built into wonderwords; I can't seem to find them in the documentation.
nouns = wonderwords.random_word._get_words_from_text_file("nounlist.txt")
