
eval's Introduction

eval

Lint Build Release License

Python Library for Evaluation

MT-Bench / MT-Bench-Branch Testing Steps

# Optional: Use cloud-instance.sh (https://github.com/instructlab/instructlab/tree/main/scripts/infra) to launch and set up the instance
scripts/infra/cloud-instance.sh ec2 launch -t g5.4xlarge
scripts/infra/cloud-instance.sh ec2 setup-rh-devenv
scripts/infra/cloud-instance.sh ec2 install-rh-nvidia-drivers
scripts/infra/cloud-instance.sh ec2 ssh sudo reboot
scripts/infra/cloud-instance.sh ec2 ssh


# Regardless of how you set up your instance
git clone https://github.com/instructlab/taxonomy.git && pushd taxonomy && git branch rc && popd
git clone --bare https://github.com/instructlab/eval.git && git clone eval.git/ && cd eval && git remote add syncrepo ../eval.git
python3 -m venv venv
source venv/bin/activate
pip install -r requirements.txt
pip install -r requirements-dev.txt
pip install -e .
pip install vllm
python -m vllm.entrypoints.openai.api_server --model instructlab/granite-7b-lab --tensor-parallel-size 1

In another shell window

export INSTRUCTLAB_EVAL_FIRST_N_QUESTIONS=10 # Optional if you want to shorten run times
python3 tests/test_gen_answers.py
python3 tests/test_branch_gen_answers.py
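
If the test scripts cannot connect, it can help to first confirm the vLLM endpoint is reachable. A minimal sketch, assuming vLLM's default port 8000 and its OpenAI-compatible /v1 route:

# Quick sanity check against the served model (run from the venv created above).
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="NO_API_KEY")
resp = client.chat.completions.create(
    model="instructlab/granite-7b-lab",
    messages=[{"role": "user", "content": "Say hello."}],
    max_tokens=16,
)
print(resp.choices[0].message.content)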

Example output tree

eval_output/
├── mt_bench
│   └── model_answer
│       └── instructlab
│           └── granite-7b-lab.jsonl
└── mt_bench_branch
    ├── main
    │   ├── model_answer
    │   │   └── instructlab
    │   │       └── granite-7b-lab.jsonl
    │   ├── question.jsonl
    │   └── reference_answer
    │       └── instructlab
    │           └── granite-7b-lab.jsonl
    └── rc
        ├── model_answer
        │   └── instructlab
        │       └── granite-7b-lab.jsonl
        ├── question.jsonl
        └── reference_answer
            └── instructlab
                └── granite-7b-lab.jsonl

python3 tests/test_judge_answers.py
python3 tests/test_branch_judge_answers.py

Example output tree

eval_output/
├── mt_bench
│   ├── model_answer
│   │   └── instructlab
│   │       └── granite-7b-lab.jsonl
│   └── model_judgment
│       └── instructlab
│           └── granite-7b-lab_single.jsonl
└── mt_bench_branch
    ├── main
    │   ├── model_answer
    │   │   └── instructlab
    │   │       └── granite-7b-lab.jsonl
    │   ├── model_judgment
    │   │   └── instructlab
    │   │       └── granite-7b-lab_single.jsonl
    │   ├── question.jsonl
    │   └── reference_answer
    │       └── instructlab
    │           └── granite-7b-lab.jsonl
    └── rc
        ├── model_answer
        │   └── instructlab
        │       └── granite-7b-lab.jsonl
        ├── model_judgment
        │   └── instructlab
        │       └── granite-7b-lab_single.jsonl
        ├── question.jsonl
        └── reference_answer
            └── instructlab
                └── granite-7b-lab.jsonl
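
The *_single.jsonl judgment files are JSON Lines. A minimal sketch for eyeballing the judge scores; the "score" field name is an assumption, so inspect a record first:

import json
from pathlib import Path

judgment_file = Path(
    "eval_output/mt_bench/model_judgment/instructlab/granite-7b-lab_single.jsonl"
)

scores = []
with judgment_file.open() as f:
    for line in f:
        record = json.loads(line)
        if "score" in record:  # assumed field name
            scores.append(record["score"])

if scores:
    print(f"{len(scores)} judgments, average score {sum(scores) / len(scores):.2f}")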


eval's Issues

Allow usage of GPT-4 as judge model

FastChat has native support for calling the OpenAI API (when provided with a valid key) to use GPT-4 as the judge for MT-Bench assessments. This capability should be exposed so users can opt into it.
Users will need to provide their own valid OpenAI API key.

Error handling

The library currently doesn't have robust error handling. We should implement this with custom error classes and handle them appropriately on the CLI side.
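
A minimal sketch of what such an error hierarchy could look like (the class names are illustrative, not the library's actual API):

class EvalError(Exception):
    """Base class for errors raised by the eval library."""

    def __init__(self, message: str) -> None:
        super().__init__(message)
        self.message = message


class InvalidTasksDirError(EvalError):
    """Raised when a provided tasks/SDG directory is missing or malformed."""


class ModelServingError(EvalError):
    """Raised when the model endpoint cannot be reached."""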

mmlu isn't consuming multiple gpus

Steps to recreate:

Launch mmlu on an instance with multiple GPUs. Run:

ilab model evaluate --model models/instructlab/granite-7b-lab --benchmark mmlu

Only one GPU is consumed. Adjusting batch-size doesn't seem to have any effect.

current README out of sync with requirements.

After following the README installation instructions, I get this error:

(venv) [root@ilab-dev-8xa100 eval]# python3 tests/test_gen_answers.py
Traceback (most recent call last):
  File "/ilab-data/jkunstle/eval/tests/test_gen_answers.py", line 2, in <module>
    from instructlab.eval.mt_bench import MTBenchEvaluator
ModuleNotFoundError: No module named 'instructlab'

Need to reconcile imports.

Add `instructlab.eval.constants` objects, would like access to `ALL_MMLU_TASKS` and similar

It'd be useful if I could pass a constant to the evaluator to tell it to "do vanilla, all tasks".

This could be the default as well: by default, people would probably want the holistic score, but they could alternatively run fewer tasks.

# Proposed usage; ALL_MMLU_TASKS would come from the requested instructlab.eval.constants module.
import torch
from instructlab.eval.mmlu import MMLUEvaluator

evaluator = MMLUEvaluator(
    model_path,
    tasks=ALL_MMLU_TASKS,
    few_shots=2,  # TODO need to know this param
    batch_size=torch.cuda.device_count(),
)
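
For reference, one possible shape for such a constants module; the module path and the task names shown are illustrative only:

# instructlab/eval/constants.py (hypothetical)
# ALL_MMLU_TASKS would enumerate every MMLU subtask; only a few are shown here.
ALL_MMLU_TASKS = [
    "mmlu_abstract_algebra",
    "mmlu_anatomy",
    "mmlu_astronomy",
    # ...remaining MMLU subtasks
]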

question_id is not being stored consistently and is probably losing precision

Currently the question_id is generated as a string here, because it is too large for an int32:

https://github.com/instructlab/eval/blob/main/src/instructlab/eval/mt_bench_branch_generator.py#L71

But when it is returned through qa_pairs, question_ids come out looking like 7.969730277787438e+37.

Not sure yet whether the type is being lost in the dataframes or in the JSON parsing in:

https://github.com/instructlab/eval/blob/main/src/instructlab/eval/mt_bench_judgment.py

One fix would be to find a smaller consistent hash. Otherwise, we need to figure out the typing to keep it a str throughout.
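
A sketch of the first option, a smaller consistent hash kept as a string end to end (not the repo's current implementation):

import hashlib

def question_id(question_text: str) -> str:
    # A truncated SHA-256 digest stays a short string, so pandas/JSON round-trips
    # cannot coerce it into a lossy float the way large numeric IDs are today.
    return hashlib.sha256(question_text.encode("utf-8")).hexdigest()[:16]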

mmlu breaks logging of caller

If you put this into the instructlab CLI:

        logger.info(1)
        from instructlab.eval.mmlu import MMLUEvaluator
        logger.info(2)

you won't get 2 in the output.

The logger still seems to work fine within the eval library when logging info about mmlu, and it seems to be working fine for mt_bench in the instructlab CLI with instructlab/instructlab#1714.

This is the line that causes the behavior: https://github.com/instructlab/eval/blob/main/src/instructlab/eval/mmlu.py#L8

from lm_eval.evaluator import simple_evaluate

Tracing into lm_eval, the issue shows up here: https://github.com/EleutherAI/lm-evaluation-harness/blob/main/lm_eval/models/vllm_causallms.py#L23

from vllm import LLM, SamplingParams
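
One possible mitigation, not necessarily the fix adopted upstream, is to defer that import until evaluation actually runs, so importing instructlab.eval.mmlu does not trigger vllm's logging side effects in the caller:

def _simple_evaluate(*args, **kwargs):
    # Lazy import: vllm (pulled in via lm_eval) only touches logging when MMLU is run.
    from lm_eval.evaluator import simple_evaluate

    return simple_evaluate(*args, **kwargs)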

Eval needs debug logging

There isn't much debug logging in the eval library today and we need to be able to figure out what's going on in user environments beyond basic exception handling.
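
A minimal sketch of the kind of module-level debug logging that would help; the function shown is hypothetical:

import logging

logger = logging.getLogger(__name__)

def generate_answers(model_name: str, output_dir: str) -> None:
    # Hypothetical example: breadcrumbs like these make it possible to follow a run
    # in a user environment without attaching a debugger.
    logger.debug("generating answers: model=%s output_dir=%s", model_name, output_dir)
    ...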

Request: max_workers auto tuning

Currently, max-workers defaults to 16, and the CLI prints a recommendation on how to tune it for the hardware only after the run has started, which isn't ideal. Also, training calls evaluate and needs a more programmatic way to tune it. This request is to add an 'auto' option for max_workers that picks an appropriate value based on the available hardware configuration.

Related: instructlab/instructlab#2050
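
A sketch of one possible 'auto' resolution; the heuristic is illustrative only and would need tuning against real hardware:

import multiprocessing

def resolve_max_workers(max_workers) -> int:
    # Illustrative heuristic: cap concurrency at half the available CPUs, minimum 1.
    if max_workers == "auto":
        return max(1, multiprocessing.cpu_count() // 2)
    return int(max_workers)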

Auto detect mixtral type judge models and auto set merge_system_user_message

mt_bench currently takes a merge_system_user_message option, which is needed for mixtral-type judge models (e.g. prometheus). If mixtral can be auto-detected, the need for this setting can be removed.

An alternative approach might be to infer whether the judge is mixtral from the format of its results: if a result looks like it came from mixtral, assume it was a mixtral model and parse it accordingly.
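
A sketch of the simplest name-based detection; a more robust approach would inspect the results format as described above:

def needs_merged_system_user_message(judge_model_name: str) -> bool:
    # Naive detection based on the judge model's name/path only.
    name = judge_model_name.lower()
    return "mixtral" in name or "prometheus" in name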

Enable MMLU tests to be run on an already served model

Right now, the inferencing during MMLU is done within the lm-eval-harness library. lm-eval-harness can also run inference against OpenAI-API-compatible servers, similar to how models are served via ilab serve:

https://github.com/EleutherAI/lm-evaluation-harness?tab=readme-ov-file#model-apis-and-inference-servers

A fix for this issue would involve serving a model on an endpoint with either vLLM or llama-cpp and being able to pass that endpoint into MMLUEvaluator.run() or the MMLUEvaluator class to run MMLU tests.
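
A sketch of what that could look like with lm-eval-harness's OpenAI-compatible "local-completions" model type; the argument names follow the harness README linked above and should be verified against the installed version:

from lm_eval.evaluator import simple_evaluate

# Points lm-eval-harness at an already-running OpenAI-compatible server
# instead of letting it load the model itself. Values are illustrative.
results = simple_evaluate(
    model="local-completions",
    model_args=(
        "model=instructlab/granite-7b-lab,"
        "base_url=http://localhost:8000/v1/completions,"
        "num_concurrent=1,max_retries=3,tokenized_requests=False"
    ),
    tasks=["mmlu_abstract_algebra"],
    num_fewshot=5,
)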

API error handling ignores errors in results

When mt_bench gets an error in an OpenAI call, the current logic is to retry a few times before returning $ERROR and a score of -1. I am not sure what lm_eval does for mmlu yet. The -1 score is later filtered out of the results entirely. For mt_bench and mt_bench_branch, this means the eval produces only partial results, but that isn't conveyed to the user.

Potential Options to Fix:

  • Fail on any api error
  • Show the error rate in the results so the user can make a call on whether it's meaningful to them (see the sketch after this list)
  • Allow user to specify an acceptable error rate
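
A minimal sketch of the second option, surfacing the error rate next to the score; the -1 sentinel comes from the description above, and the qa_pairs field names are assumptions:

def summarize(qa_pairs: list) -> dict:
    # Treat a score of -1 as the sentinel the retry logic currently emits on API errors.
    errors = sum(1 for qa in qa_pairs if qa.get("score") == -1)
    scored = [qa["score"] for qa in qa_pairs if qa.get("score", -1) != -1]
    return {
        "average_score": sum(scored) / len(scored) if scored else None,
        "error_rate": errors / len(qa_pairs) if qa_pairs else 0.0,
    }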

`test_branch_gen_answers` does not fail when no model is being served

In trying to work on the eval CI, I noticed that the library doesn't raise any errors when there is no model being hosted at the requested port.

As I understand it, our current OpenAI error handling prints exceptions to stdout instead of failing, because of expected behavior with the API where it may be OK to fail temporarily while we keep retrying. However, we are currently catching openai.OpenAIError, which is more general than openai.APIConnectionError, the error we are actually seeing in this case (no model being served).

I believe we can fix this one of two ways:

  1. keep the general except clause and, if the error is openai.APIConnectionError, re-raise it, either immediately or after the max retries are exhausted (sketched after this list).
  2. if applicable, catch only the specific error that actually requires us to retry. If I remember correctly this is a rate-limiting issue, so I'd imagine there is a separate native openai exception type for that scenario, but more digging is needed to confirm it's the only scenario in which we'd want this retry behavior.
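
A minimal sketch of option 1, assuming the openai>=1.0 client; the retry count and call shape are illustrative:

import openai

def query_with_retries(client: openai.OpenAI, max_retries: int = 5, **request_kwargs):
    # Re-raise connection errors immediately: "no model being served" will never
    # succeed on retry. Other OpenAI errors keep the existing retry behavior.
    for attempt in range(max_retries):
        try:
            return client.chat.completions.create(**request_kwargs)
        except openai.APIConnectionError:
            raise
        except openai.OpenAIError:
            if attempt == max_retries - 1:
                raise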

MTBenchEvaluator and MMLUEvaluator should be/have static methods

Evaluator objects shouldn't be reused: once we've evaluated a checkpoint or model, we want to save the score and move on to the next. This motivates a reasonable design change, implementing something like:

class MMLUEvaluator(Evaluator):

    def __init__(self):
        # optional empty initialization
        ...

    @staticmethod
    def run(model, tasks, few_shot, batch):
        # evaluate a single model/checkpoint and return its score
        ...

Invalid sdg path with mmlu_branch prints stack trace and unobvious error

If you run:

ilab model evaluate --model models/instructlab/granite-7b-lab --benchmark mmlu_branch --sdg-path invalid --base-model models/instructlab/granite-7b-lab

You get:

Traceback (most recent call last):
  File "/home/ec2-user/instructlab/venv/bin/ilab", line 8, in <module>
    sys.exit(ilab())
             ^^^^^^
  File "/home/ec2-user/instructlab/venv/lib64/python3.11/site-packages/click/core.py", line 1157, in __call__
    return self.main(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ec2-user/instructlab/venv/lib64/python3.11/site-packages/click/core.py", line 1078, in main
    rv = self.invoke(ctx)
         ^^^^^^^^^^^^^^^^
  File "/home/ec2-user/instructlab/venv/lib64/python3.11/site-packages/click/core.py", line 1688, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ec2-user/instructlab/venv/lib64/python3.11/site-packages/click/core.py", line 1688, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ec2-user/instructlab/venv/lib64/python3.11/site-packages/click/core.py", line 1434, in invoke
    return ctx.invoke(self.callback, **ctx.params)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ec2-user/instructlab/venv/lib64/python3.11/site-packages/click/core.py", line 783, in invoke
    return __callback(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ec2-user/instructlab/venv/lib64/python3.11/site-packages/click/decorators.py", line 33, in new_func
    return f(get_current_context(), *args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ec2-user/instructlab/src/instructlab/utils.py", line 551, in wrapper
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/home/ec2-user/instructlab/src/instructlab/model/evaluate.py", line 594, in evaluate
    overall_score, individual_scores = evaluator.run()
                                       ^^^^^^^^^^^^^^^
  File "/home/ec2-user/eval/src/instructlab/eval/mmlu.py", line 215, in run
    results = run_mmlu(
              ^^^^^^^^^
  File "/home/ec2-user/eval/src/instructlab/eval/mmlu.py", line 89, in run_mmlu
    mmlu_output = simple_evaluate(
                  ^^^^^^^^^^^^^^^^
  File "/home/ec2-user/instructlab/venv/lib64/python3.11/site-packages/lm_eval/utils.py", line 395, in _wrapper
    return fn(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^
  File "/home/ec2-user/instructlab/venv/lib64/python3.11/site-packages/lm_eval/evaluator.py", line 221, in simple_evaluate
    task_dict = get_task_dict(tasks, task_manager)
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ec2-user/instructlab/venv/lib64/python3.11/site-packages/lm_eval/tasks/__init__.py", line 444, in get_task_dict
    task_name_from_string_dict = task_manager.load_task_or_group(
                                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ec2-user/instructlab/venv/lib64/python3.11/site-packages/lm_eval/tasks/__init__.py", line 287, in load_task_or_group
    collections.ChainMap(*map(self._load_individual_task_or_group, task_list))
  File "/home/ec2-user/instructlab/venv/lib64/python3.11/site-packages/lm_eval/tasks/__init__.py", line 181, in _load_individual_task_or_group
    subtask_list = self._get_tasklist(name_or_config)
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ec2-user/instructlab/venv/lib64/python3.11/site-packages/lm_eval/tasks/__init__.py", line 133, in _get_tasklist
    return self.task_index[name]["task"]
           ~~~~~~~~~~~~~~~^^^^^^
KeyError: 'mmlu_pr'

Need to return an appropriate EvalError so that a nice message is printed via https://github.com/instructlab/instructlab/pull/1616/files

Note: You'll get a similar error if the directory exists but doesn't have the right content inside it.
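
A sketch of the kind of wrapping that could produce that nicer message; the exception name mirrors the error-handling issue above and the helper is hypothetical:

class InvalidTasksDirError(Exception):
    """Illustrative stand-in for an eval-specific error type."""


def load_tasks_checked(run_harness, tasks_dir: str, **kwargs):
    # Wrap the harness call so a missing or malformed --sdg-path surfaces as a clear
    # message instead of a KeyError deep inside lm-eval-harness.
    try:
        return run_harness(tasks_dir=tasks_dir, **kwargs)
    except KeyError as exc:
        raise InvalidTasksDirError(
            f"Could not load MMLU branch tasks from {tasks_dir!r}: missing task {exc}. "
            "Check that the directory exists and contains the generated task files."
        ) from exc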

Ensure that SDG path data branch matches evaluation branch

We need to ensure the SDG path data we are receiving is being generated off the same branch we are passing for evaluation.

I have some sample data that indicates this is tracked in the SDG data in the form of origin_branch_name.

We need to meet with the SDG team and ensure that data will be there and can be consumed in a predictable way.
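
A sketch of the check we would want once that is confirmed; the origin_branch_name field is the one mentioned above, and the file layout is an assumption:

import json

def assert_branch_matches(sdg_jsonl_path: str, eval_branch: str) -> None:
    # Fail fast if the SDG data was generated from a different branch than the one under evaluation.
    with open(sdg_jsonl_path, encoding="utf-8") as f:
        for line in f:
            origin = json.loads(line).get("origin_branch_name")
            if origin is not None and origin != eval_branch:
                raise ValueError(
                    f"SDG data was generated from branch {origin!r}, "
                    f"but evaluation targets {eval_branch!r}"
                )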
