aws-samples / foundation-model-benchmarking-tool

Foundation model benchmarking tool. Run any model on Amazon SageMaker and benchmark for performance across instance type and serving stack options.

License: MIT No Attribution

Jupyter Notebook 72.47% Python 27.10% Shell 0.43%
benchmarking foundation-models inferentia llama2 p4d sagemaker generative-ai benchmark bedrock llama3

foundation-model-benchmarking-tool's Issues

Try FMBench with instance count set to > 1 to see how scaling impacts latency and transactions per minute

It would be interesting to see the effect of scaling to multiple instances behind the same endpoint. How does inference latency change as the endpoint scales out (automatically; we could also add parameters for the scaling policy)? Can we support more transactions with auto-scaling instances while keeping latency below a threshold, and what are the cost implications of doing so? This needs to be fleshed out, but it is an interesting area.

This would also need to include support for the Inference Configuration feature that is now available with SageMaker.
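
As a rough illustration of what the scaling experiment could drive programmatically, here is a minimal sketch (endpoint and variant names are hypothetical placeholders) that registers a SageMaker endpoint variant with Application Auto Scaling and attaches a target-tracking policy on invocations per instance:

import boto3

# Hypothetical endpoint/variant names used only for illustration.
endpoint_name = "my-benchmark-endpoint"
variant_name = "AllTraffic"
resource_id = f"endpoint/{endpoint_name}/variant/{variant_name}"

autoscaling = boto3.client("application-autoscaling")

# Allow the variant to scale between 1 and 4 instances.
autoscaling.register_scalable_target(
    ServiceNamespace="sagemaker",
    ResourceId=resource_id,
    ScalableDimension="sagemaker:variant:DesiredInstanceCount",
    MinCapacity=1,
    MaxCapacity=4,
)

# Target-tracking policy: keep invocations per instance near a set value.
autoscaling.put_scaling_policy(
    PolicyName="fmbench-invocations-per-instance",
    ServiceNamespace="sagemaker",
    ResourceId=resource_id,
    ScalableDimension="sagemaker:variant:DesiredInstanceCount",
    PolicyType="TargetTrackingScaling",
    TargetTrackingScalingPolicyConfiguration={
        "TargetValue": 70.0,
        "PredefinedMetricSpecification": {
            "PredefinedMetricType": "SageMakerVariantInvocationsPerInstance"
        },
        "ScaleInCooldown": 300,
        "ScaleOutCooldown": 300,
    },
)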

Add support for models hosted on platforms other than SageMaker

While this tool was never designed to test models hosted anywhere other than SageMaker, technically there is nothing preventing it. Two things need to happen for this.

1/ The models have to be deployed on the platform of choice. This part can be externalized, meaning the deployment code in this repo does not deploy those models; they are deployed separately, outside of this tool.

2/ Add support for a bring-your-own inference script that knows how to query your endpoint. This script runs inferences against endpoints on platforms other than SageMaker, so at that point it does not matter whether the endpoint is on EKS or EC2.
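
A minimal sketch of what such a bring-your-own inference script contract could look like (the function names and payload shape below are assumptions, not the tool's actual interface):

import json
import urllib.request

def create_predictor(endpoint_url: str):
    """Return whatever handle is needed to call the endpoint (here, just the URL)."""
    return endpoint_url

def get_inference(predictor: str, payload: dict) -> dict:
    """POST the payload to a generic HTTP endpoint (EKS, EC2, etc.) and return the JSON response."""
    req = urllib.request.Request(
        predictor,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())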

The business summary plot in the report needs a disclaimer caption

Visualizations are powerful, so the message that it is the full serving stack that produces a particular benchmark result can get lost unless it is explicitly called out. Someone could come away with the impression that a given instance type always performs better, without considering that the result reflects the combination of instance type, inference container, and parameters; the results should not be taken out of context.

[Highest priority] Add support for reading and writing files (configuration, metrics, bring your own model scripts) to and from Amazon S3.

This includes adding support for S3 interaction, with all data and metrics accessible via your own S3 bucket. The goal of this issue is to abstract out the code in this repo so that you can bring your own script, your own model, and your own source data files, upload them to S3, and then expect this repo to run and generate the test results within an S3 bucket that you define. The aim is to have a folder in a bucket where you upload your source data files, a folder where you upload your bring-your-own model script, prompt template, and other model artifacts as needed, and then run this repo to generate test results within a programmatically generated 'data' folder containing metrics, per-chunk and inference results, deployed model configurations, and more.
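
A minimal sketch of the S3 interaction this could be built on (bucket, key, and local file names below are placeholders):

import boto3

s3 = boto3.client("s3")
bucket = "my-fmbench-bucket"  # placeholder bucket name

# Upload source data, a bring-your-own model script and a prompt template.
s3.upload_file("prompts/prompt_template.txt", bucket, "prompt_template/prompt_template.txt")
s3.upload_file("scripts/my_model_script.py", bucket, "scripts/my_model_script.py")

# Download a config file at run time.
s3.download_file(bucket, "configs/my-config.yml", "/tmp/my-config.yml")

# Write generated metrics back to the 'data' prefix.
s3.upload_file("results/metrics.csv", bucket, "data/metrics/metrics.csv")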

Need for integration of the FMBench tool with external third-party synthetic data generation tools for the benchmarking use case

  1. It would be good to have native integration in the FMBench config YAML file for pulling synthetically generated datasets from third-party tools.
  2. How to split these synthetically generated datasets for FMBench-based evaluation of FMs would be important functionality to have (see the sketch after this list).
  3. The final results (results.md) generated by this tool could include a visual comparison capability with other third-party tools, which could be used for holistic evaluation of FMs.
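
A minimal sketch of splitting a synthetically generated JSONL dataset into benchmarking and evaluation subsets (the file names and split ratio are assumptions):

import json
import random

# Hypothetical synthetic dataset: one JSON record per line.
with open("synthetic_dataset.jsonl") as f:
    records = [json.loads(line) for line in f]

random.seed(42)
random.shuffle(records)

# Hold out 20% of the records for accuracy-style evaluation,
# use the rest for load/latency benchmarking.
split = int(0.8 * len(records))
benchmark_records, eval_records = records[:split], records[split:]

for name, subset in [("benchmark.jsonl", benchmark_records), ("eval.jsonl", eval_records)]:
    with open(name, "w") as f:
        for rec in subset:
            f.write(json.dumps(rec) + "\n")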

Merge model accuracy metrics into this tool as well

The fact that we are running inference means that we can also measure the accuracy of those inferences, e.g. through ROUGE score, cosine similarity (to an expert-generated response), or other metrics. If we add that, then this tool can provide a complete benchmarking solution that includes accuracy as well as cost.
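
A minimal sketch of how such accuracy metrics could be computed per inference (assumes the rouge-score and scikit-learn packages; the reference and candidate strings are placeholders):

from rouge_score import rouge_scorer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

reference = "An expert generated response."    # ground truth
candidate = "The model's generated response."  # model output

# ROUGE-L between the model output and the expert reference.
scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)
rouge_l = scorer.score(reference, candidate)["rougeL"].fmeasure

# Cosine similarity over TF-IDF vectors (a simple embedding-free proxy).
tfidf = TfidfVectorizer().fit_transform([reference, candidate])
cos_sim = cosine_similarity(tfidf[0], tfidf[1])[0][0]

print(f"rougeL={rouge_l:.3f}, cosine_similarity={cos_sim:.3f}")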

config_filepath is incorrect

src/fmbench/config_filepath.txt
and
manifest.txt
both show the config files as being located in the configs directory, but they are now split up under subdirectories.

This causes
from fmbench.utils import *
in
src/fmbench/0_setup.ipynb

to fail with:

config file current -> configs/config-bedrock-claude.yml, None
---------------------------------------------------------------------------
FileNotFoundError                         Traceback (most recent call last)
Cell In[9], line 10
      8 import fmbench.scripts
      9 from pathlib import Path
---> 10 from fmbench.utils import *
     11 from fmbench.globals import *
     12 from typing import Dict, List, Optional

File ~/.fmbench/lib/python3.11/site-packages/fmbench/utils.py:11
      9 import unicodedata
     10 from pathlib import Path
---> 11 from fmbench import globals
     12 from fmbench import defaults
     13 from typing import Dict, List

File ~/.fmbench/lib/python3.11/site-packages/fmbench/globals.py:53
     51     CONFIG_FILE_CONTENT = response.text
     52 else:
---> 53     CONFIG_FILE_CONTENT = Path(CONFIG_FILE).read_text()
     55 # check if the file is still parameterized and if so replace the parameters with actual values
     56 # if the file is not parameterized then the following statements change nothing
     57 args = dict(region=session.region_name,
     58             role_arn=arn_string,
     59             write_bucket=f"{defaults.DEFAULT_BUCKET_WRITE}-{region_name}-{account_id}",
     60             read_bucket=f"{defaults.DEFAULT_BUCKET_READ}-{region_name}-{account_id}")

File /usr/lib/python3.11/pathlib.py:1058, in Path.read_text(self, encoding, errors)
   1054 """
   1055 Open the file in text mode, read it, and close the file.
   1056 """
   1057 encoding = io.text_encoding(encoding)
-> 1058 with self.open(mode='r', encoding=encoding, errors=errors) as f:
   1059     return f.read()

File /usr/lib/python3.11/pathlib.py:1044, in Path.open(self, mode, buffering, encoding, errors, newline)
   1042 if "b" not in mode:
   1043     encoding = io.text_encoding(encoding)
-> 1044 return io.open(self, mode, buffering, encoding, errors, newline)

FileNotFoundError: [Errno 2] No such file or directory: 'configs/config-bedrock-claude.yml'
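
A minimal sketch of one possible fix: resolve the config file name against the package's configs directory, including subdirectories, before reading it (the helper below is illustrative, not the actual fix):

from pathlib import Path

def resolve_config_path(config_file: str, configs_root: Path) -> Path:
    """Return the config path as-is if it exists, otherwise search subdirectories."""
    path = Path(config_file)
    if path.exists():
        return path
    # Config files may have moved into subdirectories under configs/.
    matches = list(configs_root.rglob(path.name))
    if not matches:
        raise FileNotFoundError(f"{config_file} not found under {configs_root}")
    return matches[0]

# Example: CONFIG_FILE_CONTENT = resolve_config_path(CONFIG_FILE, Path("configs")).read_text()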

Per-notebook run support via the pip-installable package

The need behind this issue is to have granular access to each notebook while the repo is also a package that can be installed via pip. This is for advanced users who want to change the code for different metrics and modifications, so that they can run each notebook one by one while still having the option to pip install the fmbt package.

Add support for different payload formats for bring-your-own datasets that might be needed for different inference containers

This tool currently supports the HF TGI container and the DJL DeepSpeed container on SageMaker; both use the same format, but in the future other containers might need a different payload format.

Goal: give the user full flexibility to bring their own payloads, or include code that generalizes payload generation irrespective of the container type the user uses. Two options for solving this issue:

1/ Have the user bring in their own payload.
2/ Define a generic function that converts the payload to the format required by the container type the user is deploying the model with and generating inferences from (see the sketch below).
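
A minimal sketch of option 2/, a generic payload builder keyed on container type (the container identifiers and parameter names here are assumptions):

def build_payload(prompt: str, container_type: str, max_new_tokens: int = 100) -> dict:
    """Convert a prompt into the payload shape expected by the serving container."""
    if container_type in ("huggingface-tgi", "djl-deepspeed"):
        # Both containers currently accept the same inputs/parameters shape.
        return {"inputs": prompt, "parameters": {"max_new_tokens": max_new_tokens}}
    if container_type == "some-other-container":  # hypothetical example
        return {"prompt": prompt, "max_tokens": max_new_tokens}
    raise ValueError(f"unsupported container type: {container_type}")

payload = build_payload("Summarize the following text ...", "huggingface-tgi")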

Assign cost per run for FMBT

Calculate the cost per config file run for this FMBT harness. This includes the model instance type, inference, cost per transaction, and so on, summed up into the entire run's total cost.

Emit live metrics

Emit live metrics so that they can be monitored through Grafana via a live dashboard. More information to come on this issue, but the goal is to give the user full flexibility to view metrics in the way that best suits their business and technological goals.

[TBD] --> Some sort of analytics pipeline emitting live results for different model configurations and different metrics, based on the needs of the user.
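
One way to do this, sketched below, would be to publish custom metrics to Amazon CloudWatch, which Grafana can read through its CloudWatch data source; the namespace and dimension names are placeholders:

import boto3

cloudwatch = boto3.client("cloudwatch")

def emit_latency_metric(endpoint_name: str, latency_seconds: float) -> None:
    """Publish a single inference latency data point as a custom CloudWatch metric."""
    cloudwatch.put_metric_data(
        Namespace="FMBench",  # placeholder namespace
        MetricData=[
            {
                "MetricName": "InferenceLatency",
                "Dimensions": [{"Name": "EndpointName", "Value": endpoint_name}],
                "Value": latency_seconds,
                "Unit": "Seconds",
            }
        ],
    )

emit_latency_metric("my-benchmark-endpoint", 1.42)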

Add code to determine the cost of running an entire experiment and include it in the final report

Add code to determine the cost of running an entire experiment and include it in the final report. This would only include the cost of running the SageMaker endpoints based on hourly public pricing (the cost of running this code on a notebook or an EC2 instance is trivial in comparison and can be ignored).

When running the entire benchmarking test, we can add a couple of lines to calculate the total cost incurred for the specific experiment, end to end, to answer simple questions like:

I ran the experiment for this config file and got the benchmarking results successfully in 'x' time. What is the cost incurred to run this experiment?
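
A minimal sketch of the calculation, assuming hourly on-demand prices are looked up from a static table (the prices shown are placeholders, not actual SageMaker pricing):

# Placeholder hourly prices (USD); real values should come from public SageMaker pricing.
HOURLY_PRICE_USD = {
    "ml.g5.xlarge": 1.50,
    "ml.g5.2xlarge": 2.00,
}

def experiment_cost(instance_type: str, instance_count: int, duration_seconds: float) -> float:
    """Cost of keeping the endpoint up for the duration of the experiment."""
    hours = duration_seconds / 3600
    return HOURLY_PRICE_USD[instance_type] * instance_count * hours

# Example: one ml.g5.2xlarge endpoint kept up for a 45-minute run.
print(f"${experiment_cost('ml.g5.2xlarge', 1, 45 * 60):.2f}")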

Containerize FMBench and provide instructions for running the container on EC2

Containerize FMBT and provide instructions for running the container on EC2: once all of the files are integrated via S3 and all the code is abstracted out in terms of generating metrics for any deployable model on SageMaker (including bring-your-own models/scripts), we want to be able to containerize this tool and run the container on EC2.

Goal: choose a specific config file and prompt, run the container with it, and generate results without any heavy lifting or development effort.

Provide config file for FLAN-T5 out of the box

FLAN-T5 XL is still used by multiple customers, so a comparison of this model across g5.xlarge and g5.2xlarge instances would be very useful; a config file for this should be provided.

Add support for a custom token counter

Currently only the Llama tokenizer is supported, but we want to allow users to bring their own token counting logic for different models. This way, regardless of model type or token counting methodology, the user should be able to get accurate results based on the token counter they use.

Goal: abstract out the repo and tool to the point where, no matter what token counter the user uses, they can bring it and run the container to get accurate test results.
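
A minimal sketch of a pluggable token counter, defaulting to a Hugging Face tokenizer but accepting any callable with the same signature (the tokenizer id is only an example):

from typing import Callable
from transformers import AutoTokenizer

def make_hf_token_counter(tokenizer_id: str) -> Callable[[str], int]:
    """Build a token counter from any Hugging Face tokenizer."""
    tokenizer = AutoTokenizer.from_pretrained(tokenizer_id)
    return lambda text: len(tokenizer.encode(text))

def whitespace_token_counter(text: str) -> int:
    """Crude fallback counter: whitespace-delimited tokens."""
    return len(text.split())

# The benchmarking code would only ever call count_tokens(text),
# so any user-supplied callable with this signature works.
count_tokens = make_hf_token_counter("gpt2")  # example tokenizer id
print(count_tokens("How many tokens is this prompt?"))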

Compare different models for the same dataset

There is nothing in FMBench that prevents different experiments in the same config file from using different models, but the generated report is not architected that way, i.e. it is created to compare the same model across serving stacks rather than to compare different models, so that would need to change. This has been requested by multiple customers; the idea is that if we find different models that are fit for the task, we then want to find the model and serving stack combination that provides the best price:performance.

Code cleanup needed to replace the notebooks with regular Python files

The work on this repo started as a skunkworks project done over the holidays in the winter of 2024, and at the time it was just a bunch of notebooks. Now that it has transformed into a formal open-source project with a ton of functionality and a bunch of roadmap items, the notebooks have become unwieldy!

We need to replace the notebooks with regular Python scripts, and there is also a whole bunch of code cleanup that needs to happen: replacing global imports, using type hints everywhere, optimizing functions, etc.; the list is long.

Assigning this to myself for now; I will create issues for specific items.

Add Bedrock benchmarking to this tool

Can we add Bedrock to this tool? While support for a bring-your-own inference script would enable that, we need to think through Bedrock-specific options such as provisioned throughput, auto-generated report formats, and whether we want to compare Bedrock and SageMaker side by side.
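
A minimal sketch of what a Bedrock inference call could look like from a bring-your-own inference script (the model id and request body shape are examples and vary by model family):

import json
import boto3

bedrock = boto3.client("bedrock-runtime")

def bedrock_inference(prompt: str, model_id: str = "anthropic.claude-3-sonnet-20240229-v1:0") -> str:
    """Invoke a Bedrock model; the body shown follows the Anthropic messages schema."""
    body = {
        "anthropic_version": "bedrock-2023-05-31",
        "max_tokens": 256,
        "messages": [{"role": "user", "content": prompt}],
    }
    response = bedrock.invoke_model(modelId=model_id, body=json.dumps(body))
    result = json.loads(response["body"].read())
    return result["content"][0]["text"]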

Support for custom datasets and custom inference scripts

For models other than Llama and Mistral (say, BERT) we need datasets other than LongBench, and these models have their own response formats.

  1. Add support for bring-your-own datasets by parameterizing the prompt template (see the sketch after this list).
  2. Add support for custom inference scripts.
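
A minimal sketch of 1/, parameterizing the prompt template so that any dataset whose records carry the expected fields can be dropped in (the template text and field names are examples):

from string import Template

# Example prompt template; the placeholders map to fields in the user's dataset.
prompt_template = Template("Context: $context\n\nQuestion: $question\n\nAnswer:")

record = {
    "context": "FMBench benchmarks foundation models on Amazon SageMaker.",
    "question": "What does FMBench do?",
}

prompt = prompt_template.substitute(record)
print(prompt)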
