
eval's Introduction

eval

Lint Build Release License

Python Library for Evaluation

MT-Bench / MT-Bench-Branch Testing Steps

# Optional: Use cloud-instance.sh (https://github.com/instructlab/instructlab/tree/main/scripts/infra) to launch and set up the instance
scripts/infra/cloud-instance.sh ec2 launch -t g5.4xlarge
scripts/infra/cloud-instance.sh ec2 setup-rh-devenv
scripts/infra/cloud-instance.sh ec2 install-rh-nvidia-drivers
scripts/infra/cloud-instance.sh ec2 ssh sudo reboot
scripts/infra/cloud-instance.sh ec2 ssh


# Regardless of how you set up your instance
git clone https://github.com/instructlab/taxonomy.git && pushd taxonomy && git branch rc && popd
git clone --bare https://github.com/instructlab/eval.git && git clone eval.git/ && cd eval && git remote add syncrepo ../eval.git
python3 -m venv venv
source venv/bin/activate
pip install -r requirements.txt
pip install -r requirements-dev.txt
pip install -e .
pip install vllm
python -m vllm.entrypoints.openai.api_server --model instructlab/granite-7b-lab --tensor-parallel-size 1

In another shell window

export INSTRUCTLAB_EVAL_FIRST_N_QUESTIONS=10 # Optional if you want to shorten run times
python3 tests/test_gen_answers.py
python3 tests/test_branch_gen_answers.py
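
If the test scripts cannot connect, it can help to first confirm the vLLM endpoint is reachable. A minimal sketch, assuming vLLM's default port 8000 and its OpenAI-compatible /v1 route:

# Quick sanity check against the served model (run from the venv created above).
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="NO_API_KEY")
resp = client.chat.completions.create(
    model="instructlab/granite-7b-lab",
    messages=[{"role": "user", "content": "Say hello."}],
    max_tokens=16,
)
print(resp.choices[0].message.content)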

Example output tree

eval_output/
├── mt_bench
│   └── model_answer
│       └── instructlab
│           └── granite-7b-lab.jsonl
└── mt_bench_branch
    ├── main
    │   ├── model_answer
    │   │   └── instructlab
    │   │       └── granite-7b-lab.jsonl
    │   ├── question.jsonl
    │   └── reference_answer
    │       └── instructlab
    │           └── granite-7b-lab.jsonl
    └── rc
        ├── model_answer
        │   └── instructlab
        │       └── granite-7b-lab.jsonl
        ├── question.jsonl
        └── reference_answer
            └── instructlab
                └── granite-7b-lab.jsonl

python3 tests/test_judge_answers.py
python3 tests/test_branch_judge_answers.py

Example output tree

eval_output/
├── mt_bench
│   ├── model_answer
│   │   └── instructlab
│   │       └── granite-7b-lab.jsonl
│   └── model_judgment
│       └── instructlab
│           └── granite-7b-lab_single.jsonl
└── mt_bench_branch
    ├── main
    │   ├── model_answer
    │   │   └── instructlab
    │   │       └── granite-7b-lab.jsonl
    │   ├── model_judgment
    │   │   └── instructlab
    │   │       └── granite-7b-lab_single.jsonl
    │   ├── question.jsonl
    │   └── reference_answer
    │       └── instructlab
    │           └── granite-7b-lab.jsonl
    └── rc
        ├── model_answer
        │   └── instructlab
        │       └── granite-7b-lab.jsonl
        ├── model_judgment
        │   └── instructlab
        │       └── granite-7b-lab_single.jsonl
        ├── question.jsonl
        └── reference_answer
            └── instructlab
                └── granite-7b-lab.jsonl
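
The *_single.jsonl judgment files are JSON Lines. A minimal sketch for eyeballing the judge scores; the "score" field name is an assumption, so inspect a record first:

import json
from pathlib import Path

judgment_file = Path(
    "eval_output/mt_bench/model_judgment/instructlab/granite-7b-lab_single.jsonl"
)

scores = []
with judgment_file.open() as f:
    for line in f:
        record = json.loads(line)
        if "score" in record:  # assumed field name
            scores.append(record["score"])

if scores:
    print(f"{len(scores)} judgments, average score {sum(scores) / len(scores):.2f}")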


eval's Issues

Allow usage of GPT-4 as judge model

FastChat has native support for calling the OpenAI API (when provided with a valid key) to use GPT-4 as the judge for MT-Bench assessments. This capability should be exposed so users can opt into it.
Users will need to provide their own valid OpenAI API key.

Error handling

The library currently doesn't have robust error handling. We should implement this with custom error classes and handle them appropriately on the CLI side.
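
A minimal sketch of what such an error hierarchy could look like (the class names are illustrative, not the library's actual API):

class EvalError(Exception):
    """Base class for errors raised by the eval library."""

    def __init__(self, message: str) -> None:
        super().__init__(message)
        self.message = message


class InvalidTasksDirError(EvalError):
    """Raised when a provided tasks/SDG directory is missing or malformed."""


class ModelServingError(EvalError):
    """Raised when the model endpoint cannot be reached."""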

mmlu isn't consuming multiple gpus

Steps to recreate:

Launch mmlu on an instance with multiple GPUs. Run:

ilab model evaluate --model models/instructlab/granite-7b-lab --benchmark mmlu

Only one GPU is consumed. Adjusting batch-size doesn't seem to have any effect.

current README out of sync with requirements.

After following the README installation instructions, I get this error:

(venv) [root@ilab-dev-8xa100 eval]# python3 tests/test_gen_answers.py
Traceback (most recent call last):
  File "/ilab-data/jkunstle/eval/tests/test_gen_answers.py", line 2, in <module>
    from instructlab.eval.mt_bench import MTBenchEvaluator
ModuleNotFoundError: No module named 'instructlab'

Need to reconcile imports.

Add `instructlab.eval.constants` objects, would like access to `ALL_MMLU_TASKS` and similar

It'd be useful if I could pass a constant to the evaluator to tell it to "do vanilla, all tasks".

This could be the default as well: by default, people would probably want the holistic score, but they could alternatively run fewer tasks.

# Proposed usage; ALL_MMLU_TASKS would come from the requested instructlab.eval.constants module.
import torch
from instructlab.eval.mmlu import MMLUEvaluator

evaluator = MMLUEvaluator(
    model_path,
    tasks=ALL_MMLU_TASKS,
    few_shots=2,  # TODO need to know this param
    batch_size=torch.cuda.device_count(),
)
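
For reference, one possible shape for such a constants module; the module path and the task names shown are illustrative only:

# instructlab/eval/constants.py (hypothetical)
# ALL_MMLU_TASKS would enumerate every MMLU subtask; only a few are shown here.
ALL_MMLU_TASKS = [
    "mmlu_abstract_algebra",
    "mmlu_anatomy",
    "mmlu_astronomy",
    # ...remaining MMLU subtasks
]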

question_id is not being stored consistently and is probably losing precision

Currently the question_id is generated as a string here, because it is too large for an int32:

https://github.com/instructlab/eval/blob/main/src/instructlab/eval/mt_bench_branch_generator.py#L71

But when it is returned through qa_pairs, question_ids come out looking like 7.969730277787438e+37.

Not sure yet whether the type is being lost in the dataframes or in the JSON parsing in:

https://github.com/instructlab/eval/blob/main/src/instructlab/eval/mt_bench_judgment.py

One fix would be to find a smaller consistent hash. Otherwise, we need to figure out the typing to keep it a str throughout.
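
A sketch of the first option, a smaller consistent hash kept as a string end to end (not the repo's current implementation):

import hashlib

def question_id(question_text: str) -> str:
    # A truncated SHA-256 digest stays a short string, so pandas/JSON round-trips
    # cannot coerce it into a lossy float the way large numeric IDs are today.
    return hashlib.sha256(question_text.encode("utf-8")).hexdigest()[:16]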

mmlu breaks logging of caller

If you put this into the instructlab CLI:

        logger.info(1)
        from instructlab.eval.mmlu import MMLUEvaluator
        logger.info(2)

you won't get 2 in the output.

The logger still seems to work fine within the eval library when logging info about mmlu, and it seems to be working fine for mt_bench in the instructlab CLI with instructlab/instructlab#1714.

This is the line that causes the behavior: https://github.com/instructlab/eval/blob/main/src/instructlab/eval/mmlu.py#L8

from lm_eval.evaluator import simple_evaluate

Tracing into lm_eval, the issue shows up here: https://github.com/EleutherAI/lm-evaluation-harness/blob/main/lm_eval/models/vllm_causallms.py#L23

from vllm import LLM, SamplingParams
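
One possible mitigation, not necessarily the fix adopted upstream, is to defer that import until evaluation actually runs, so importing instructlab.eval.mmlu does not trigger vllm's logging side effects in the caller:

def _simple_evaluate(*args, **kwargs):
    # Lazy import: vllm (pulled in via lm_eval) only touches logging when MMLU is run.
    from lm_eval.evaluator import simple_evaluate

    return simple_evaluate(*args, **kwargs)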

Eval needs debug logging

There isn't much debug logging in the eval library today and we need to be able to figure out what's going on in user environments beyond basic exception handling.
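
A minimal sketch of the kind of module-level debug logging that would help; the function shown is hypothetical:

import logging

logger = logging.getLogger(__name__)

def generate_answers(model_name: str, output_dir: str) -> None:
    # Hypothetical example: breadcrumbs like these make it possible to follow a run
    # in a user environment without attaching a debugger.
    logger.debug("generating answers: model=%s output_dir=%s", model_name, output_dir)
    ...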

Request: max_workers auto tuning

Currently, max-workers defaults to 16, and the CLI prints a recommendation on how to tune it for the hardware only after the run has started, which isn't ideal. Also, training calls evaluate and needs a more programmatic way to tune it. This request is to add an 'auto' option for max_workers that picks an appropriate value based on the available hardware configuration.

Related: instructlab/instructlab#2050
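
A sketch of one possible 'auto' resolution; the heuristic is illustrative only and would need tuning against real hardware:

import multiprocessing

def resolve_max_workers(max_workers) -> int:
    # Illustrative heuristic: cap concurrency at half the available CPUs, minimum 1.
    if max_workers == "auto":
        return max(1, multiprocessing.cpu_count() // 2)
    return int(max_workers)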

Auto detect mixtral type judge models and auto set merge_system_user_message

mt_bench currently takes a merge_system_user_message option, which is needed for mixtral-type judge models (e.g. prometheus). If mixtral can be auto-detected, the need for this setting can be removed.

An alternative approach might be to infer whether the judge is mixtral from the format of its results: if a result looks like it came from mixtral, assume it was a mixtral model and parse it accordingly.
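
A sketch of the simplest name-based detection; a more robust approach would inspect the results format as described above:

def needs_merged_system_user_message(judge_model_name: str) -> bool:
    # Naive detection based on the judge model's name/path only.
    name = judge_model_name.lower()
    return "mixtral" in name or "prometheus" in name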

Enable MMLU tests to be run on an already served model

Right now, the inferencing during MMLU is done within the lm-eval-harness library. lm-eval-harness can also run inference against OpenAI-API-compatible servers, similar to how models are served via ilab serve:

https://github.com/EleutherAI/lm-evaluation-harness?tab=readme-ov-file#model-apis-and-inference-servers

A fix for this issue would involve serving a model on an endpoint with either vLLM or llama-cpp and being able to pass that endpoint into MMLUEvaluator.run() or the MMLUEvaluator class to run MMLU tests.
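
A sketch of what that could look like with lm-eval-harness's OpenAI-compatible "local-completions" model type; the argument names follow the harness README linked above and should be verified against the installed version:

from lm_eval.evaluator import simple_evaluate

# Points lm-eval-harness at an already-running OpenAI-compatible server
# instead of letting it load the model itself. Values are illustrative.
results = simple_evaluate(
    model="local-completions",
    model_args=(
        "model=instructlab/granite-7b-lab,"
        "base_url=http://localhost:8000/v1/completions,"
        "num_concurrent=1,max_retries=3,tokenized_requests=False"
    ),
    tasks=["mmlu_abstract_algebra"],
    num_fewshot=5,
)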

API error handling ignores errors in results

When mt_bench gets an error in an OpenAI call, the current logic is to retry a few times before returning $ERROR and a score of -1. I am not sure what lm_eval does for mmlu yet. The -1 score is later filtered out of the results entirely. For mt_bench and mt_bench_branch, this means the eval produces only partial results, but that isn't conveyed to the user.

Potential Options to Fix:

  • Fail on any api error
  • Show the error rate in the results so the user can make a call on whether it's meaningful to them (see the sketch after this list)
  • Allow user to specify an acceptable error rate
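
A minimal sketch of the second option, surfacing the error rate next to the score; the -1 sentinel comes from the description above, and the qa_pairs field names are assumptions:

def summarize(qa_pairs: list) -> dict:
    # Treat a score of -1 as the sentinel the retry logic currently emits on API errors.
    errors = sum(1 for qa in qa_pairs if qa.get("score") == -1)
    scored = [qa["score"] for qa in qa_pairs if qa.get("score", -1) != -1]
    return {
        "average_score": sum(scored) / len(scored) if scored else None,
        "error_rate": errors / len(qa_pairs) if qa_pairs else 0.0,
    }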

`test_branch_gen_answers` does not fail when no model is being served

In trying to work on the eval CI, I noticed that the library doesn't raise any errors when there is no model being hosted at the requested port.

As I understand it, our current OpenAI error handling prints exceptions to stdout instead of failing, because of expected behavior with the API where it may be OK to fail temporarily while we keep retrying. However, we are currently catching openai.OpenAIError, which is more general than openai.APIConnectionError, the error we are actually seeing in this case (no model being served).

I believe we can fix this one of two ways:

  1. keep the general except clause and, if the error is openai.APIConnectionError, re-raise it, either immediately or after the max retries are exhausted (sketched after this list).
  2. if applicable, catch only the specific error that actually requires us to retry. If I remember correctly this is a rate-limiting issue, so I'd imagine there is a separate native openai exception type for that scenario, but more digging is needed to confirm it's the only scenario in which we'd want this retry behavior.
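
A minimal sketch of option 1, assuming the openai>=1.0 client; the retry count and call shape are illustrative:

import openai

def query_with_retries(client: openai.OpenAI, max_retries: int = 5, **request_kwargs):
    # Re-raise connection errors immediately: "no model being served" will never
    # succeed on retry. Other OpenAI errors keep the existing retry behavior.
    for attempt in range(max_retries):
        try:
            return client.chat.completions.create(**request_kwargs)
        except openai.APIConnectionError:
            raise
        except openai.OpenAIError:
            if attempt == max_retries - 1:
                raise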

MTBenchEvaluator and MMLUEvaluator should be/have static methods

Evaluator objects shouldn't be reused: once we've evaluated a checkpoint or model, we want to save the score and move on to the next. This motivates a reasonable design change, implementing something like:

class MMLUEvaluator(Evaluator):

    def __init__(self):
        # optional empty initialization
        ...

    @staticmethod
    def run(model, tasks, few_shot, batch):
        # evaluate a single model/checkpoint and return its score
        ...

Invalid sdg path with mmlu_branch prints stack trace and unobvious error

If you run:

ilab model evaluate --model models/instructlab/granite-7b-lab --benchmark mmlu_branch --sdg-path invalid --base-model models/instructlab/granite-7b-lab

You get:

Traceback (most recent call last):
  File "/home/ec2-user/instructlab/venv/bin/ilab", line 8, in <module>
    sys.exit(ilab())
             ^^^^^^
  File "/home/ec2-user/instructlab/venv/lib64/python3.11/site-packages/click/core.py", line 1157, in __call__
    return self.main(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ec2-user/instructlab/venv/lib64/python3.11/site-packages/click/core.py", line 1078, in main
    rv = self.invoke(ctx)
         ^^^^^^^^^^^^^^^^
  File "/home/ec2-user/instructlab/venv/lib64/python3.11/site-packages/click/core.py", line 1688, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ec2-user/instructlab/venv/lib64/python3.11/site-packages/click/core.py", line 1688, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ec2-user/instructlab/venv/lib64/python3.11/site-packages/click/core.py", line 1434, in invoke
    return ctx.invoke(self.callback, **ctx.params)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ec2-user/instructlab/venv/lib64/python3.11/site-packages/click/core.py", line 783, in invoke
    return __callback(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ec2-user/instructlab/venv/lib64/python3.11/site-packages/click/decorators.py", line 33, in new_func
    return f(get_current_context(), *args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ec2-user/instructlab/src/instructlab/utils.py", line 551, in wrapper
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/home/ec2-user/instructlab/src/instructlab/model/evaluate.py", line 594, in evaluate
    overall_score, individual_scores = evaluator.run()
                                       ^^^^^^^^^^^^^^^
  File "/home/ec2-user/eval/src/instructlab/eval/mmlu.py", line 215, in run
    results = run_mmlu(
              ^^^^^^^^^
  File "/home/ec2-user/eval/src/instructlab/eval/mmlu.py", line 89, in run_mmlu
    mmlu_output = simple_evaluate(
                  ^^^^^^^^^^^^^^^^
  File "/home/ec2-user/instructlab/venv/lib64/python3.11/site-packages/lm_eval/utils.py", line 395, in _wrapper
    return fn(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^
  File "/home/ec2-user/instructlab/venv/lib64/python3.11/site-packages/lm_eval/evaluator.py", line 221, in simple_evaluate
    task_dict = get_task_dict(tasks, task_manager)
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ec2-user/instructlab/venv/lib64/python3.11/site-packages/lm_eval/tasks/__init__.py", line 444, in get_task_dict
    task_name_from_string_dict = task_manager.load_task_or_group(
                                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ec2-user/instructlab/venv/lib64/python3.11/site-packages/lm_eval/tasks/__init__.py", line 287, in load_task_or_group
    collections.ChainMap(*map(self._load_individual_task_or_group, task_list))
  File "/home/ec2-user/instructlab/venv/lib64/python3.11/site-packages/lm_eval/tasks/__init__.py", line 181, in _load_individual_task_or_group
    subtask_list = self._get_tasklist(name_or_config)
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ec2-user/instructlab/venv/lib64/python3.11/site-packages/lm_eval/tasks/__init__.py", line 133, in _get_tasklist
    return self.task_index[name]["task"]
           ~~~~~~~~~~~~~~~^^^^^^
KeyError: 'mmlu_pr'

Need to return an appropriate EvalError so that a nice message is printed via https://github.com/instructlab/instructlab/pull/1616/files

Note: You'll get a similar error if the directory exists but doesn't have the right content inside it.
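
A sketch of the kind of wrapping that could produce that nicer message; the exception name mirrors the error-handling issue above and the helper is hypothetical:

class InvalidTasksDirError(Exception):
    """Illustrative stand-in for an eval-specific error type."""


def load_tasks_checked(run_harness, tasks_dir: str, **kwargs):
    # Wrap the harness call so a missing or malformed --sdg-path surfaces as a clear
    # message instead of a KeyError deep inside lm-eval-harness.
    try:
        return run_harness(tasks_dir=tasks_dir, **kwargs)
    except KeyError as exc:
        raise InvalidTasksDirError(
            f"Could not load MMLU branch tasks from {tasks_dir!r}: missing task {exc}. "
            "Check that the directory exists and contains the generated task files."
        ) from exc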

Ensure that SDG path data branch matches evaluation branch

We need to ensure the SDG path data we are receiving is being generated off the same branch we are passing for evaluation.

I have some sample data that indicates this is tracked in the SDG data in the form of origin_branch_name.

We need to meet with the SDG team and ensure that data will be there and can be consumed in a predictable way.
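
A sketch of the check we would want once that is confirmed; the origin_branch_name field is the one mentioned above, and the file layout is an assumption:

import json

def assert_branch_matches(sdg_jsonl_path: str, eval_branch: str) -> None:
    # Fail fast if the SDG data was generated from a different branch than the one under evaluation.
    with open(sdg_jsonl_path, encoding="utf-8") as f:
        for line in f:
            origin = json.loads(line).get("origin_branch_name")
            if origin is not None and origin != eval_branch:
                raise ValueError(
                    f"SDG data was generated from branch {origin!r}, "
                    f"but evaluation targets {eval_branch!r}"
                )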
