
bentovllm's Introduction

Self-host LLMs with vLLM and BentoML

This is a BentoML example project, showing you how to serve and deploy open-source Large Language Models using vLLM, a high-throughput and memory-efficient inference engine.

See here for a full list of BentoML example projects.

💡 This example serves as a basis for advanced code customization, such as custom models, inference logic, or vLLM options. For simple LLM hosting with an OpenAI-compatible endpoint and no code to write, see OpenLLM.

Prerequisites

  • You have installed Python 3.8+ and pip. See the Python downloads page to learn more.
  • You have a basic understanding of key concepts in BentoML, such as Services. We recommend you read Quickstart first.
  • If you want to test the Service locally, you need an Nvidia GPU with at least 16 GB VRAM.
  • (Optional) We recommend you create a virtual environment for dependency isolation for this project (see the example commands after this list). See the Conda documentation or the Python documentation for details.
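
For example, using Python's built-in venv module (the environment name venv is just a convention):

python -m venv venv
source venv/bin/activate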

Install dependencies

git clone https://github.com/bentoml/BentoVLLM.git
cd BentoVLLM/mistral-7b-instruct
pip install -r requirements.txt && pip install -U "pydantic>=2.0"

Run the BentoML Service

We have defined a BentoML Service in service.py. Run bentoml serve in your project directory to start the Service.

$ bentoml serve .

2024-01-18T07:51:30+0800 [INFO] [cli] Starting production HTTP BentoServer from "service:VLLM" listening on http://localhost:3000 (Press CTRL+C to quit)
INFO 01-18 07:51:40 model_runner.py:501] Capturing the model for CUDA graphs. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI.
INFO 01-18 07:51:40 model_runner.py:505] CUDA graphs can take additional 1~3 GiB memory per GPU. If you are running out of memory, consider decreasing `gpu_memory_utilization` or enforcing eager mode.
INFO 01-18 07:51:46 model_runner.py:547] Graph capturing finished in 6 secs.

The server is now active at http://localhost:3000. You can interact with it using the Swagger UI or with the clients shown below.

CURL
curl -X 'POST' \
  'http://localhost:3000/generate' \
  -H 'accept: text/event-stream' \
  -H 'Content-Type: application/json' \
  -d '{
  "prompt": "Explain superconductors like I'\''m five years old",
  "tokens": null
}'
Python client
import bentoml

with bentoml.SyncHTTPClient("http://localhost:3000") as client:
    response_generator = client.generate(
        prompt="Explain superconductors like I'm five years old",
        tokens=None
    )
    for response in response_generator:
        print(response)
OpenAI-compatible endpoints

This Service uses the @openai_endpoints decorator to set up OpenAI-compatible endpoints (chat/completions and completions). This means your client can interact with the backend Service (in this case, the VLLM class) as if it were communicating directly with OpenAI's API. This utility does not affect your BentoML Service code, and you can use it for other LLMs as well.
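
In service.py, the decorator sits on top of the Service class, roughly as in the following abridged sketch (the decorator comes from this repository's bentovllm_openai/utils.py helper; its exact keyword arguments have changed across versions of the example, so treat served_model as illustrative):

import bentoml
from bentovllm_openai.utils import openai_endpoints

@openai_endpoints(served_model="mistralai/Mistral-7B-Instruct-v0.2")  # keyword name may differ by version
@bentoml.service(traffic={"timeout": 300}, resources={"gpu": 1})
class VLLM:
    ...  # engine setup and the generate endpoint as in service.py

On the client side, you can then use the official OpenAI Python client against the same server: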

from openai import OpenAI

client = OpenAI(base_url='http://localhost:3000/v1', api_key='na')

# List the available models
client.models.list()

chat_completion = client.chat.completions.create(
    model="mistralai/Mistral-7B-Instruct-v0.2",
    messages=[
        {
            "role": "user",
            "content": "Explain superconductors like I'm five years old"
        }
    ],
    stream=True,
)
for chunk in chat_completion:
    # Extract and print the content of the model's reply
    print(chunk.choices[0].delta.content or "", end="")

Note: If your Service is deployed with protected endpoints on BentoCloud, you need to set the environment variable OPENAI_API_KEY to your BentoCloud API key first.

export OPENAI_API_KEY={YOUR_BENTOCLOUD_API_TOKEN}

You can then replace the client in the above code snippet with the following line. Refer to Obtain the endpoint URL to learn how to retrieve it.

client = OpenAI(base_url='your_bentocloud_deployment_endpoint_url/v1')

For detailed explanations of the Service code, see vLLM inference.
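
If you want a quick feel for the structure before reading that guide, the core of service.py looks roughly like the following abridged sketch (engine arguments, resource requests, and the prompt template are illustrative and may differ from the actual file; the streaming pattern matches the vLLM version pinned in requirements.txt):

import uuid
from typing import AsyncGenerator, Optional

import bentoml

MAX_TOKENS = 1024
PROMPT_TEMPLATE = "<s>[INST] {user_prompt} [/INST] "

@bentoml.service(
    traffic={"timeout": 300},
    resources={"gpu": 1},
)
class VLLM:
    def __init__(self) -> None:
        from vllm import AsyncEngineArgs, AsyncLLMEngine

        # Create the async vLLM engine once when the Service starts.
        self.engine = AsyncLLMEngine.from_engine_args(
            AsyncEngineArgs(
                model="mistralai/Mistral-7B-Instruct-v0.2",
                max_model_len=MAX_TOKENS,
            )
        )

    @bentoml.api
    async def generate(
        self,
        prompt: str = "Explain superconductors like I'm five years old",
        tokens: Optional[int] = None,
    ) -> AsyncGenerator[str, None]:
        from vllm import SamplingParams

        # Submit the request and stream back only the text generated
        # since the previous iteration.
        stream = await self.engine.add_request(
            uuid.uuid4().hex,
            PROMPT_TEMPLATE.format(user_prompt=prompt),
            SamplingParams(max_tokens=tokens or MAX_TOKENS),
        )
        cursor = 0
        async for request_output in stream:
            text = request_output.outputs[0].text
            yield text[cursor:]
            cursor = len(text)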

Deploy to BentoCloud

After the Service is ready, you can deploy the application to BentoCloud for better management and scalability. Sign up if you haven't got a BentoCloud account.
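
Once you have an account, you can log in from the CLI with an API token created in the BentoCloud console (the token value below is a placeholder):

bentoml cloud login --api-token <your-api-token>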

Make sure you have logged in to BentoCloud, then run the following command to deploy it.

bentoml deploy .

Once the application is up and running on BentoCloud, you can access it via the exposed URL.

Note: For custom deployment in your own infrastructure, use BentoML to generate an OCI-compliant image.
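
A rough containerization workflow looks like this (the Bento tag is a placeholder; bentoml build prints the actual tag):

bentoml build
bentoml containerize <bento_name:version>
docker run --gpus all -p 3000:3000 <bento_name:version>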

Different LLM Models

Besides the mistral-7b-instruct model, this repository provides examples for other models in its subdirectories (for example, llama3-8b-instruct). See each subdirectory for the corresponding example code.

LLM tools integration examples

  • Every model directory contains code to add OpenAI-compatible endpoints to the BentoML Service.
  • outlines-integration/ contains code for integrating with outlines for structured generation.

bentovllm's People

Contributors

larme, bojiang, sherlock113, ssheng, aarnphm, lycheel1, frostming


bentovllm's Issues

VLLM is stuck on Outlines 0.0.34 and this sample requires 0.0.37

I am relatively new to VLLM and BentoML, but trying to get this to work fails with a range of issues.

INFO: pip is looking at multiple versions of vllm to determine which version is compatible with other requirements. This could take a while.
ERROR: Cannot install outlines==0.0.37 and vllm==0.4.0.post1 because these package versions have conflicting dependencies.

The conflict is caused by:
The user requested outlines==0.0.37
vllm 0.4.0.post1 depends on outlines==0.0.34

To fix this you could try to:

  1. loosen the range of package versions you've specified
  2. remove package versions to allow pip attempt to solve the dependency conflict

I then try to adapt the sample to use 0.0.34, updating the service.py as follows:

@bentoml.api
async def adapted(
    self,
    prompt: str = DEFAULT_USER_PROMPT,
    max_tokens: Annotated[int, Ge(128), Le(MAX_TOKENS)] = MAX_TOKENS,
    json_schema: t.Optional[str] = DEFAULT_SCHEMA,
) -> AsyncGenerator[str, None]:
    from vllm import SamplingParams
    from vllm.model_executor.guided_logits_processors import JSONLogitsProcessor

    SAMPLING_PARAM = SamplingParams(
        max_tokens=max_tokens,
        logits_processors=[JSONLogitsProcessor(json_schema, self.engine.engine)]
    )

    prompt = PROMPT_TEMPLATE.format(user_prompt=prompt)
    stream = await self.engine.add_request(uuid.uuid4().hex, prompt, SAMPLING_PARAM)

    # Standard Stuff
    cursor = 0
    async for request_output in stream:
        text = request_output.outputs[0].text
        yield text[cursor:]
        cursor = len(text)

But then I get this error:
| exceptiongroup.ExceptionGroup: unhandled errors in a TaskGroup (1 sub-exception)
+-+---------------- 1 ----------------
| Traceback (most recent call last):
| File "/home/aaron/.local/lib/python3.10/site-packages/starlette/responses.py", line 261, in wrap
| await func()
| File "/home/aaron/.local/lib/python3.10/site-packages/starlette/responses.py", line 250, in stream_response
| async for chunk in self.body_iterator:
| File "/home/aaron/.local/lib/python3.10/site-packages/_bentoml_sdk/io_models.py", line 183, in async_stream
| async for item in obj:
| File "/home/aaron/BentoVLLM/mistral-7b-instruct/service.py", line 96, in competitors
| logits_processors=[JSONLogitsProcessor(json_schema, self.engine.engine)]
| File "/home/aaron/.local/lib/python3.10/site-packages/vllm/model_executor/guided_logits_processors.py", line 154, in init
| super().init(regex_string, tokenizer)
| File "/home/aaron/.local/lib/python3.10/site-packages/vllm/model_executor/guided_logits_processors.py", line 116, in init
| tokenizer = self.adapt_tokenizer(tokenizer)
| File "/home/aaron/.local/lib/python3.10/site-packages/vllm/model_executor/guided_logits_processors.py", line 44, in adapt_tokenizer
| tokenizer.vocabulary = tokenizer.get_vocab()
| AttributeError: '_AsyncLLMEngine' object has no attribute 'get_vocab'

Any help is appreciated.

BentoVLLM Service Fails to Start on Linux Server Due to Pydantic Related Errors

Hello,

I attempted to run a BentoVLLM example on a Linux server, ensuring all prerequisites were met and following the installation guide accordingly.

However, when trying to run the BentoML Service, I encountered an error preventing the service from starting.

The error log suggests an issue related to pydantic, and updating the version does not resolve the issue. The detailed error message is as follows:

ubuntu@ip-172-31-55-53:./serving/BentoVLLM$ bentoml serve .
2024-03-12T10:06:30+0900 [INFO] [cli] Starting production HTTP BentoServer from "service:VLLM" listening on http://localhost:3000 (Press CTRL+C to quit)
INFO 03-12 10:06:35 llm_engine.py:87] Initializing an LLM engine with config: model='mistralai/Mistral-7B-Instruct-v0.2', tokenizer='mistralai/Mistral-7B-Instruct-v0.2', tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=1024, download_dir=None, load_format=auto, tensor_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, device_config=cuda, seed=0)
INFO 03-12 10:06:39 weight_utils.py:163] Using model weights format ['*.safetensors']
INFO 03-12 10:07:02 llm_engine.py:357] # GPU blocks: 2910, # CPU blocks: 2048
INFO 03-12 10:07:03 model_runner.py:684] Capturing the model for CUDA graphs. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI.
INFO 03-12 10:07:03 model_runner.py:688] CUDA graphs can take additional 1~3 GiB memory per GPU. If you are running out of memory, consider decreasing gpu_memory_utilization or enforcing eager mode. You can also reduce the max_num_seqs as needed to decrease memory usage.
INFO 03-12 10:07:08 model_runner.py:756] Graph capturing finished in 5 secs.

2024-03-12T10:07:09+0900 [ERROR] [entry_service:VLLM_OpenAI:1] Initializing service error
Traceback (most recent call last):
File "/home/ubuntu/.local/lib/python3.8/site-packages/pydantic/type_adapter.py", line 149, in init
core_schema = _getattr_no_parents(type, 'pydantic_core_schema')
File "/home/ubuntu/.local/lib/python3.8/site-packages/pydantic/type_adapter.py", line 94, in _getattr_no_parents
raise AttributeError(attribute)
AttributeError: pydantic_core_schema

During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/home/ubuntu/.local/lib/python3.8/site-packages/pydantic/_internal/_generate_schema.py", line 461, in _generate_schema
obj = _typing_extra.evaluate_fwd_ref(obj, globalns=self.types_namespace)
File "/home/ubuntu/.local/lib/python3.8/site-packages/pydantic/_internal/_typing_extra.py", line 414, in evaluate_fwd_ref
return ref._evaluate(globalns=globalns, localns=localns)
File "/usr/lib/python3.8/typing.py", line 518, in _evaluate
eval(self.forward_code, globalns, localns),
File "", line 1, in
NameError: name 'CompletionRequest' is not defined

The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/home/ubuntu/.local/lib/python3.8/site-packages/_bentoml_sdk/service/factory.py", line 220, in call
return self.inner()
File "/home/ubuntu/serving/BentoVLLM/bentovllm_openai/utils.py", line 116, in init
async def create_completion(request: "CompletionRequest", raw_request: Request):
File "/home/ubuntu/.local/lib/python3.8/site-packages/fastapi/routing.py", line 956, in decorator
self.add_api_route(
...
File "/home/ubuntu/.local/lib/python3.8/site-packages/pydantic/_internal/_generate_schema.py", line 463, in _generate_schema
raise PydanticUndefinedAnnotation.from_name_error(e) from e
pydantic.errors.PydanticUndefinedAnnotation: name 'CompletionRequest' is not defined

For further information visit https://errors.pydantic.dev/2.0.2/u/undefined-annotation

...

INFO 03-12 10:07:10 serving_chat.py:302] Using default chat template:
INFO 03-12 10:07:10 serving_chat.py:302] {{ bos_token }}{% for message in messages %}{% if (message['role'] == 'user') != (loop.index0 % 2 == 0) %}{{ raise_exception('Conversation roles must alternate user/assistant/user/assistant/...') }}{% endif %}{% if message['role'] == 'user' %}{{ '[INST] ' + message['content'] + ' [/INST]' }}{% elif message['role'] == 'assistant' %}{{ message['content'] + eos_token}}{% else %}{{ raise_exception('Only user and assistant roles are supported!') }}{% endif %}{% endfor %}
2024-03-12T10:07:10+0900 [ERROR] [entry_service:VLLM_OpenAI:1] Application startup failed. Exiting.

The Python dependencies for further debugging are as follows:

python 3.8.10
bentoml==1.2.6
pydantic==2.6.3
pydantic_core==2.16.3
pydantic-settings==2.2.1
fastapi==0.110.0

I would appreciate any advice or solutions to resolve this issue. Thank you in advance for your assistance.

Using BentoML v1.2.10, can't connect to the host network and fails to download the Debian image

Hello, here is my bentofile.yaml:

service: "service:VLLM"
labels:
  owner: bentoml-team
  stage: demo
include:
- "*.py"
- "bentovllm_openai/*.py"
python:
  requirements_txt: "./requirements.txt"
  lock_packages: false
docker:
    distro: debian
    python_version: "3.10"
    cuda_version: "12.1.1"

Adding the following:

docker:
    distro: debian
    python_version: "3.10"
    cuda_version: "12.1.1"
    network_mode: "host"

That doesn't work either; containerization doesn't work at all. There is just no documentation about how we could do this.

Upgrading to VLLM 0.4.1 - TypeError

I recently upgraded to vLLM 0.4.1 and now get the following error. This looks internal to Bento, not my service (which is basically the default Llama 3 sample). Here is my requirements.txt:

accelerate==0.29.3
bentoml>=1.2.12
packaging==24.0
torch==2.2.1
transformers==4.40.0
vllm==0.4.1

2024-04-25T00:29:30-0600 [ERROR] [entry_service:bentovllm-llama3-8b-insruct-service:1] Initializing service error
Traceback (most recent call last):
File "/home/admin/.local/lib/python3.10/site-packages/_bentoml_sdk/service/factory.py", line 230, in call
instance = self.inner()
File "/home/admin/BentoVLLM/llama3-8b-instruct/bentovllm_openai/utils.py", line 77, in init
self.openai_serving_completion = OpenAIServingCompletion(
TypeError: OpenAIServingCompletion.init() got an unexpected keyword argument 'served_model'
2024-04-25T00:29:30-0600 [ERROR] [entry_service:bentovllm-llama3-8b-insruct-service:1] Traceback (most recent call last):
File "/home/admin/.local/lib/python3.10/site-packages/starlette/routing.py", line 732, in lifespan
async with self.lifespan_context(app) as maybe_state:
File "/usr/lib/python3.10/contextlib.py", line 199, in aenter
return await anext(self.gen)
File "/home/admin/.local/lib/python3.10/site-packages/bentoml/_internal/server/base_app.py", line 74, in lifespan
on_startup()
File "/home/admin/.local/lib/python3.10/site-packages/_bentoml_impl/server/app.py", line 313, in create_instance
self._service_instance = self.service()
File "/home/admin/.local/lib/python3.10/site-packages/_bentoml_sdk/service/factory.py", line 230, in call
instance = self.inner()
File "/home/admin/BentoVLLM/llama3-8b-instruct/bentovllm_openai/utils.py", line 77, in init
self.openai_serving_completion = OpenAIServingCompletion(
TypeError: OpenAIServingCompletion.init() got an unexpected keyword argument 'served_model'

2024-04-25T00:29:30-0600 [ERROR] [entry_service:bentovllm-llama3-8b-insruct-service:1] Application startup failed. Exiting.
