
🕹️ Benchmarks

A fully reproducible Performance Comparison of MLOps Engines, Frameworks, and Languages on Mainstream AI Models.


Table of Contents
  1. Quick glance at performance metrics for Llama-2-7B
  2. Getting started
  3. Usage
  4. Contribute
  5. Roadmap
  6. Introducing Prem Grant Program

📊 Quick glance at performance metrics for Llama-2-7B

Take a first glance at Llama-2-7B model performance metrics across different precisions and inference engines. Metric used: tokens/sec.

| Engine                 | float32       | float16       | int8          | int4           |
|------------------------|---------------|---------------|---------------|----------------|
| candle                 | -             | 36.78 ± 2.17  | -             | -              |
| llama.cpp              | -             | -             | 79.15 ± 1.20  | 100.90 ± 1.46  |
| ctranslate             | 35.23 ± 4.01  | 55.72 ± 16.66 | 35.73 ± 10.87 | -              |
| onnx                   | -             | 54.16 ± 3.15  | -             | -              |
| transformers (pytorch) | 43.79 ± 0.61  | 46.39 ± 0.28  | 6.98 ± 0.05   | 21.72 ± 0.11   |
| vllm                   | 90.78 ± 1.60  | 90.54 ± 2.22  | -             | 114.69 ± 11.20 |
| exllamav2              | -             | -             | 121.63 ± 0.74 | 130.16 ± 0.35  |
| ctransformers          | -             | -             | 76.75 ± 10.36 | 84.26 ± 5.79   |
| AutoGPTQ               | 42.01 ± 1.03  | 30.24 ± 0.41  | -             | -              |
| AutoAWQ                | -             | -             | -             | 109.20 ± 3.28  |
| DeepSpeed              | -             | 81.44 ± 8.13  | -             | -              |
| PyTorch Lightning      | 24.85 ± 0.07  | 44.56 ± 2.89  | 10.50 ± 0.12  | 24.83 ± 0.05   |
| Optimum Nvidia         | 110.36 ± 0.52 | 109.09 ± 4.26 | -             | -              |
| Nvidia TensorRT-LLM    | 55.19 ± 1.03  | 85.03 ± 0.62  | 167.66 ± 2.05 | 235.18 ± 3.20  |

(Data updated: 17th April 2024)

The benchmarks above were run on an A100 80GB GPU. You can find results for other devices, such as CPU and Metal, under the docs folder.

  • If you want more detailed information about each benchmark, you can find it in the respective benchmark folders.

  • If you want a side-by-side comparison of which inference engine supports which precisions and devices, check out the ml_engines.md file. Please note that this file is incomplete; a better comparison of engines will be added in later versions.

This repository can also be treated as a collection of hackable scripts that contain the code and the knowledge needed to run popular inference engines.

🚀 Getting Started

Welcome to our benchmarking repository! This organized structure is designed to simplify benchmark management and execution. Here's a quick guide to get you started:

  • Benchmark Organization: Each benchmark is uniquely identified as bench_name and resides in its dedicated folder, named bench_{bench_name}.

  • Benchmark Script (bench.sh): Within these benchmark folders, you'll find a common script named bench.sh. This script takes care of everything from setup and environment configuration to actual execution.

Benchmark Script Parameters

The bench.sh script supports the following key parameters, allowing for customization and flexibility:

  • prompt: Benchmark-specific prompt.
  • max_tokens: Maximum tokens for the benchmark.
  • repetitions: Number of benchmark repetitions.
  • log_file: File for storing benchmark logs.
  • device: Specify the device for benchmark execution (CPU, CUDA, Metal).
  • models_dir: Directory containing necessary model files.

Streamlined Execution

The overarching benchmark.sh script further simplifies the benchmark execution process:

  • File Download: It automatically downloads essential files required for benchmarking.
  • Folder Iteration: The script iterates through all benchmark folders in the repository, streamlining the process for multiple benchmarks; a minimal sketch of this loop is shown below.
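
The snippet below is only an approximation of that loop, not the actual benchmark.sh; the flag values are placeholders you would replace with your own:

# Sketch: run every bench_* folder's bench.sh with the same parameters.
# File downloads and error handling are omitted for brevity.
mkdir -p ./logs
for dir in bench_*/; do
  echo "Running ${dir%/}"
  bash "${dir}bench.sh" \
    --prompt "Explain what a benchmark is" \
    --max_tokens 256 \
    --repetitions 5 \
    --device cuda \
    --log_file "./logs/${dir%/}.log" \
    --models_dir ./models
done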

This approach lets you run the benchmarks you care about with minimal effort. To run a specific benchmark, navigate to the corresponding benchmark folder (e.g., bench_{bench_name}) and execute the bench.sh script with the required parameters.

📄 Usage

To utilize the benchmarking capabilities of this repository, follow these usage examples:

Run a Specific Benchmark

Navigate to the benchmark folder and execute the bench.sh script with the desired parameters:

./bench_{bench_name}/bench.sh --prompt <value> --max_tokens <value> --repetitions <value> --log_file <file_path> --device <cpu/cuda/metal> --models_dir <path_to_models>

Replace <value> with the specific values for your benchmark, and <file_path> and <path_to_models> with the appropriate file and directory paths.
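
For instance, a hypothetical run of the llama.cpp benchmark on CUDA could look like this (the folder name, prompt, and paths below are illustrative, not values taken from the repository):

./bench_llamacpp/bench.sh \
  --prompt "Explain what a benchmark is" \
  --max_tokens 512 \
  --repetitions 10 \
  --device cuda \
  --log_file ./logs/llamacpp.log \
  --models_dir ./models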

Run All Benchmarks Collectively

For a comprehensive execution of all benchmarks, use the overarching benchmark.sh script:

./benchmark.sh --prompt <value> --max_tokens <value> --repetitions <value> --log_file <file_path> --device <cpu/cuda/metal> --models_dir <path_to_models>

Again, customize the parameters according to your preferences, ensuring that <file_path> and <path_to_models> point to the correct locations.

Feel free to adjust the parameters as needed for your specific benchmarking requirements. Please note that running all the benchmarks collectively can require a lot of storage (around 500 GB), so make sure you have enough disk space before running them all at once.

🤝 Contribute

We welcome contributions to enhance and expand our benchmarking repository. If you'd like to contribute a new benchmark, follow these steps:

Creating a New Benchmark

1. Create a New Folder

Start by creating a new folder for your benchmark. Name it bench_{new_bench_name} for consistency.

mkdir bench_{new_bench_name}

2. Folder Structure

Inside the new benchmark folder, include the following structure:

bench_{new_bench_name}
├── bench.sh           # Benchmark script for setup and execution
├── requirements.txt   # Dependencies required for the benchmark
└── ...                # Any additional files needed for the benchmark

3. Benchmark Script (bench.sh):

The bench.sh script should handle setup, environment configuration, and the actual execution of the benchmark. Ensure it supports the parameters mentioned in the Benchmark Script Parameters section.
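
Below is a minimal skeleton of what such a script could look like. It is only a sketch of the expected argument handling; the default values and the setup/run steps are placeholders, not code taken from the existing benchmarks:

#!/bin/bash
# Sketch of a bench.sh: parse the standard parameters, then set up and run.
set -euo pipefail

# Placeholder defaults; real benchmarks may choose different ones.
PROMPT="Explain what a benchmark is"
MAX_TOKENS=512
REPETITIONS=10
LOG_FILE="benchmark.log"
DEVICE="cuda"
MODELS_DIR="./models"

while [ "$#" -gt 0 ]; do
  case "$1" in
    --prompt)      PROMPT="$2"; shift 2 ;;
    --max_tokens)  MAX_TOKENS="$2"; shift 2 ;;
    --repetitions) REPETITIONS="$2"; shift 2 ;;
    --log_file)    LOG_FILE="$2"; shift 2 ;;
    --device)      DEVICE="$2"; shift 2 ;;
    --models_dir)  MODELS_DIR="$2"; shift 2 ;;
    *) echo "Unknown option: $1" >&2; exit 1 ;;
  esac
done

# Placeholder steps: create a virtual environment, install requirements.txt,
# download/convert the model into $MODELS_DIR, then run the engine with
# $PROMPT, $MAX_TOKENS and $REPETITIONS, appending results to $LOG_FILE.
echo "Benchmarking on $DEVICE ($REPETITIONS repetitions, $MAX_TOKENS max tokens)" | tee -a "$LOG_FILE"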

Pre-commit Hooks

We use pre-commit hooks to maintain code quality and consistency.

1. Install Pre-commit: Ensure you have pre-commit installed

pip install pre-commit

2. Install Hooks: Run the following command to install the pre-commit hooks

pre-commit install

The existing pre-commit configuration will be used for automatic checks before each commit, ensuring code quality and adherence to defined standards.
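
You can also run all the configured hooks manually against the whole repository before opening a pull request:

pre-commit run --all-files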

🗾 Roadmap

In our upcoming versions, we will be adding support for the following:

  1. Add more metrics on memory consumption, i.e., how much RAM/GPU memory is consumed when the benchmarks run.
  2. Add support for more models. Upcoming versions will support popular LLMs like Mamba, Mistral, Mixtral, Phi2, etc.
  3. Add ways to understand and articulate how generation quality changes with framework and precision. We will try to add ways to measure how the generation quality of an LLM changes when we change the precision of the model or use a different inference engine.
  4. Add support for batching. Batching is very important when deploying LLMs, so upcoming versions will benchmark LLMs on batched inputs.

If you feel like there is something more to add, feel free to open an issue or a PR. We would be super happy to take contributions from the community.

🏆 Introducing Prem Grant Program


🌟 Exciting news, AI enthusiasts! Prem is thrilled to launch the Prem Grant Program, exclusively designed for forward-thinking AI startups ready to reshape the future. With this program, you get six months of free access to OpenAI, Anthropic, Cohere, Llama2, Mistral (or any other open-source model) APIs, opening doors to endless AI possibilities at zero cost. Enjoy free fine-tuning, seamless model deployment, and expert ML support. This is more than a grant; it's an invite to lead the AI revolution. Don't miss out – apply now and let's build the future of AI together with Prem! 🌟

Read more about the Prem Startup grant program here. You can directly apply to the program from here.

Contributors

actions-user, anindyadeep, biswaroop1547, filopedraz, nsosio, swarnimarun


Issues

Petals

Test with a private swarm (refer to premAI-io/dev-portal#69)

Questions

  • Where are the bottlenecks?
  • Are there no advantages to build_gpu? Is there a way to force using both the local GPU and the swarm (without connecting the local GPU to the swarm)?
  • Compare Petals with Deepspeed in a centralized scenario.

Burn Upgrade

Upgrade burn version from 0.9.0 to 0.10.0 in llama burn

Roadmap - Task List

Minor fixes:

  • Fix the supported-features table to better reflect reality.
  • Clean up the code and ensure names for user-exposed ENV variables are consistent.
  • Handle quantization information, and document in the README which quantization method is used across models and scripts (llama.cpp quantization vs tinygrad vs CTranslate2 vs GPTQ).
  • Provide more CLI options for running individual scripts with other models, for testing frameworks on systems with less memory.

Features (in order of priority):

  • Add custom model-runner code for running benchmarks, and provide hooks for directly reporting performance metrics as output.
  • Set up scripts for running benchmarks with a single command and getting proper performance reports.
  • Improve caching for models; some scripts currently re-download models, although this has already been fixed for others.
  • Simplify running on any specific platform (NVIDIA/Mac) with any supported model.
  • Auto-setup Rust, Python, Git, etc. for the user before running the benchmark (low priority).

Linter

Description

Most of the repo is in Python, so set up a linter accordingly. Check prem-daemon for reference.

AWQ Quantization

Description

Benchmarks for inference engines supporting this quantization method.

Two commands to run all the benchmarks

Description

The repo should expose two commands:

  • Command to run benchmarks on Mac (CPU/GPU when available)
  • Command to run benchmarks on NVIDIA GPUs

I should be able to clone the repo with git clone and run bash ./mac.sh or bash ./nvidia.sh. You can have multiple abstractions and CLIs exposed if you want, but this is the final objective.

Standard output should print the results in a consistent manner so they can be checked easily.
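
As a rough sketch of what such a wrapper could look like (the nvidia.sh name comes from this issue; the flags and defaults below are assumptions, reusing the existing parameter names):

#!/bin/bash
# Sketch of nvidia.sh: run every benchmark on CUDA with shared defaults
# and point the user at the results.
set -euo pipefail

mkdir -p ./logs
./benchmark.sh \
  --device cuda \
  --prompt "Explain what a benchmark is" \
  --max_tokens 512 \
  --repetitions 10 \
  --log_file ./logs/nvidia.log \
  --models_dir ./models

echo "Results written to ./logs/nvidia.log"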

PyTorch

PyTorch (Transformers): test multiple versions, e.g. 1.2.1 vs 2.1.0.
