
This project forked from babelcloud/llm-rgb


LLM Reasoning and Generation Benchmark. Evaluate LLMs in complex scenarios systematically.

Home Page: http://llm-rgb.babel.run

License: MIT License

JavaScript 3.87% TypeScript 82.48% CSS 12.93% HTML 0.72%


LLM Reasoning and Generation Benchmark

This repository contains a collection of detailed test cases (prompts) designed to evaluate the reasoning and generation capabilities of Large Language Models (LLMs) in complex scenarios. This benchmark is not intended to be a comprehensive test for LLMs. The project was initially developed internally at babel.cloud to assess how well LLMs understand context and comply with instructions.

Complex scenarios present three main challenges compared to chat or simple generations:

  1. Context Length: A single prompt may contain more than 8000 tokens (approximately 20K characters).
  2. Reasoning Depth: The generation of an answer may require multi-step reasoning.
  3. Instruction Compliance: The LLM may need to generate a response in a specific format, rather than in natural language.

Each test case is a generation task for an LLM, without involving multi-turn conversations. The complexity of each test case is assessed based on the following dimensions:

Context Length Difficulty: 1 - 3

The value is 1 if the prompt contains 2000 characters or fewer. If the prompt contains more than 2000 and at most 5000 characters, the value is 2. If it contains more than 5000 characters, the value is 3. A model's actual performance in this dimension depends on its result on each task together with that task's context length difficulty. It's not accurate to rate a model's ability at different context lengths solely based on the maximum context length that the model can handle.
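The thresholds above can be sketched as a small function. This is only an illustration: the function name is hypothetical, and it assumes that "characters" means the string's `length` (UTF-16 code units), which the project may count differently.

```typescript
// Context length difficulty: 1, 2, or 3, based on the prompt's character count.
type ContextLengthDifficulty = 1 | 2 | 3;

// Hypothetical helper, not part of the project's code.
function contextLengthDifficulty(prompt: string): ContextLengthDifficulty {
  // Assumption: characters are counted with String.length.
  const chars = prompt.length;
  if (chars <= 2000) return 1; // 2000 characters or fewer
  if (chars <= 5000) return 2; // more than 2000, at most 5000
  return 3;                    // more than 5000
}
```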

Reasoning Depth Difficulty: 1 - 4

The value is 1 if the answer can be inferred directly from the context, such as a knowledge base. If the answer requires reasoning, the value is 2, for example, "Who is considered the father of the iPhone and what is the last digit of his birth year?". If the answer requires reasoning with the provided context, the value is 4, such as writing a program using the provided context syntax.

Instruction Compliance Difficulty: 1 - 3

The value is 1 if the expected response is in natural language without any special requirements. If the expected response must follow a specific style, such as "YES or NO" or "Shell command only", the value is 2. If the expected response requires a structured format such as JSON or YAML, the value is 3.

The difficulty of each test case (Dn) is the sum of the three difficulties. Each test case includes a set of assertions to evaluate the LLM's output. The result of the assertions (Rn) is a decimal in the range [0, 1]. The final score of the test case (Sn) is calculated as Rn × Dn, where "n" is the test case number. The total score for each LLM is the sum of all test case scores (S1...Sn).
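As a rough sketch of this arithmetic, the snippet below computes Dn, Sn, and the total score. The interface and function names are hypothetical and are not taken from the project's code.

```typescript
// One test case's difficulty ratings and assertion result (hypothetical shape).
interface TestCaseResult {
  contextLength: number;         // 1 to 3
  reasoningDepth: number;        // 1 to 4
  instructionCompliance: number; // 1 to 3
  assertionResult: number;       // Rn, a decimal in [0, 1]
}

// Dn: the sum of the three difficulty values.
function difficulty(t: TestCaseResult): number {
  return t.contextLength + t.reasoningDepth + t.instructionCompliance;
}

// Sn = Rn × Dn.
function testCaseScore(t: TestCaseResult): number {
  if (t.assertionResult < 0 || t.assertionResult > 1) {
    throw new RangeError("assertionResult must be within [0, 1]");
  }
  return t.assertionResult * difficulty(t);
}

// Total score for one LLM: the sum of all test case scores.
function totalScore(results: TestCaseResult[]): number {
  return results.reduce((sum, t) => sum + testCaseScore(t), 0);
}
```

For example, a test case rated 3 + 4 + 3 has Dn = 10, so an assertion result of 0.5 contributes a score of 5.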

Score Table

The following tables show the evaluation results, executed on Oct. 22nd, 2023. We ran the evaluation 10 times and took the average scores. The full score across all 15 test cases is 100.
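Averaging across runs can be sketched as below. The function name is hypothetical, and any values passed in are illustrative rather than actual benchmark results.

```typescript
// Mean of one model's total scores across repeated evaluation runs.
// Hypothetical helper, not part of the project's code.
function averageScore(runTotals: number[]): number {
  if (runTotals.length === 0) {
    throw new Error("at least one run is required");
  }
  const sum = runTotals.reduce((acc, score) => acc + score, 0);
  return sum / runTotals.length;
}
```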

Score by Abilities

[Image: table of scores by ability]

Score by Testcases

[Image: table of scores by test case]

Evaluation Details

See the following links for the evaluation details behind the tables above: Result-1, Result-2, Result-3, Result-4, Result-5, Result-6, Result-7, Result-8, Result-9, Result-10.

The evaluated models and the exact versions used:

  1. GPT-4: openai:gpt-4-0613
  2. GPT-3.5: openai:gpt-3.5-turbo-16k-0613
  3. Claude2: anthropic:claude-2
  4. Minimax: minimax:abab5.5-chat
  5. Cohere: cohere:command
  6. Palm2: google:code-bison
  7. Baidu: baidu:ERNIE-Bot
  8. ChatGLM: zhipu:chatglm_pro
  9. Aliqwen: alibaba:qwen-plus-v1
  10. Llama2: meta:llama2-70b-v2-chat
  11. Baichuan2: baichuan:Baichuan2-53B

Quick Start

The testing tools used in this project are provided by promptfoo. To run evaluations, you need to fill in the LLM configurations in promptfooconfig.yaml. You should comment out any providers and test cases that you don't want to use.

npm install
npm run start

By default, test results are uploaded so that you can share a link to them. If you don't want to share the results, run:

npm run start:noshare

If you don't have a suitable environment to run the tests, you can use LLM-RGB Online.

If you want to run these tests against LLMs that are not currently listed, you can add custom webhook providers in the same way as the existing ones.
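A webhook endpoint for such a provider might handle requests along the lines of the sketch below. It assumes that promptfoo's webhook provider sends a JSON body with a `prompt` field and expects a JSON reply with an `output` field; verify this against the promptfoo documentation before relying on it. The function name and the `generate` callback are hypothetical stand-ins for your own model call, and the HTTP server around the handler is omitted.

```typescript
// Assumed request and response shapes for a webhook provider.
interface WebhookRequest {
  prompt: string;
}

interface WebhookResponse {
  output: string;
}

// Turns a raw JSON request body into a raw JSON response body.
// `generate` is a hypothetical callback that calls your LLM.
function handleWebhookBody(
  rawBody: string,
  generate: (prompt: string) => string
): string {
  const parsed: unknown = JSON.parse(rawBody);
  if (
    typeof parsed !== "object" ||
    parsed === null ||
    typeof (parsed as { prompt?: unknown }).prompt !== "string"
  ) {
    throw new Error("expected a JSON body with a string `prompt` field");
  }
  const request = parsed as WebhookRequest;
  const response: WebhookResponse = { output: generate(request.prompt) };
  return JSON.stringify(response);
}
```

Keeping the handler a pure function of the request body makes it easy to test without starting a server.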

Contribute Test Cases

We welcome contributions of test cases that can evaluate the reasoning and generation abilities of LLMs. Please refer to the existing test cases for the required files and formats.

Contributors

zhlmmc, fly88oj, bazinga-wang, vangie
