
bentovllm's Introduction

Self-host LLMs with vLLM and BentoML

This is a BentoML example project, showing you how to serve and deploy open-source Large Language Models using vLLM, a high-throughput and memory-efficient inference engine.

See here for a full list of BentoML example projects.

💡 This example serves as a basis for advanced code customization, such as custom models, inference logic, or vLLM options. For simple LLM hosting with an OpenAI-compatible endpoint and no code to write, see OpenLLM.

Prerequisites

  • You have installed Python 3.8+ and pip. See the Python downloads page to learn more.
  • You have a basic understanding of key concepts in BentoML, such as Services. We recommend you read Quickstart first.
  • If you want to test the Service locally, you need an Nvidia GPU with at least 16 GB VRAM.
  • (Optional) We recommend you create a virtual environment for dependency isolation for this project (see the example commands after this list). See the Conda documentation or the Python documentation for details.
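
For example, using Python's built-in venv module (the environment name venv is just a convention):

python -m venv venv
source venv/bin/activate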

Install dependencies

git clone https://github.com/bentoml/BentoVLLM.git
cd BentoVLLM/mistral-7b-instruct
pip install -r requirements.txt && pip install -U "pydantic>=2.0"

Run the BentoML Service

We have defined a BentoML Service in service.py. Run bentoml serve in your project directory to start the Service.

$ bentoml serve .

2024-01-18T07:51:30+0800 [INFO] [cli] Starting production HTTP BentoServer from "service:VLLM" listening on http://localhost:3000 (Press CTRL+C to quit)
INFO 01-18 07:51:40 model_runner.py:501] Capturing the model for CUDA graphs. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI.
INFO 01-18 07:51:40 model_runner.py:505] CUDA graphs can take additional 1~3 GiB memory per GPU. If you are running out of memory, consider decreasing `gpu_memory_utilization` or enforcing eager mode.
INFO 01-18 07:51:46 model_runner.py:547] Graph capturing finished in 6 secs.

The server is now active at http://localhost:3000. You can interact with it using the Swagger UI or with the clients shown below.

CURL
curl -X 'POST' \
  'http://localhost:3000/generate' \
  -H 'accept: text/event-stream' \
  -H 'Content-Type: application/json' \
  -d '{
  "prompt": "Explain superconductors like I'\''m five years old",
  "tokens": null
}'
Python client
import bentoml

with bentoml.SyncHTTPClient("http://localhost:3000") as client:
    response_generator = client.generate(
        prompt="Explain superconductors like I'm five years old",
        tokens=None
    )
    for response in response_generator:
        print(response)
OpenAI-compatible endpoints

This Service uses the @openai_endpoints decorator to set up OpenAI-compatible endpoints (chat/completions and completions). This means your client can interact with the backend Service (in this case, the VLLM class) as if it were communicating directly with OpenAI's API. This utility does not affect your BentoML Service code, and you can use it for other LLMs as well.
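
In service.py, the decorator sits on top of the Service class, roughly as in the following abridged sketch (the decorator comes from this repository's bentovllm_openai/utils.py helper; its exact keyword arguments have changed across versions of the example, so treat served_model as illustrative):

import bentoml
from bentovllm_openai.utils import openai_endpoints

@openai_endpoints(served_model="mistralai/Mistral-7B-Instruct-v0.2")  # keyword name may differ by version
@bentoml.service(traffic={"timeout": 300}, resources={"gpu": 1})
class VLLM:
    ...  # engine setup and the generate endpoint as in service.py

On the client side, you can then use the official OpenAI Python client against the same server: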

from openai import OpenAI

client = OpenAI(base_url='http://localhost:3000/v1', api_key='na')

# List the available models
client.models.list()

chat_completion = client.chat.completions.create(
    model="mistralai/Mistral-7B-Instruct-v0.2",
    messages=[
        {
            "role": "user",
            "content": "Explain superconductors like I'm five years old"
        }
    ],
    stream=True,
)
for chunk in chat_completion:
    # Extract and print the content of the model's reply
    print(chunk.choices[0].delta.content or "", end="")

Note: If your Service is deployed with protected endpoints on BentoCloud, you need to set the environment variable OPENAI_API_KEY to your BentoCloud API key first.

export OPENAI_API_KEY={YOUR_BENTOCLOUD_API_TOKEN}

You can then replace the client in the above code snippet with the following line. Refer to Obtain the endpoint URL to learn how to retrieve it.

client = OpenAI(base_url='your_bentocloud_deployment_endpoint_url/v1')

For detailed explanations of the Service code, see vLLM inference.
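
If you want a quick feel for the structure before reading that guide, the core of service.py looks roughly like the following abridged sketch (engine arguments, resource requests, and the prompt template are illustrative and may differ from the actual file; the streaming pattern matches the vLLM version pinned in requirements.txt):

import uuid
from typing import AsyncGenerator, Optional

import bentoml

MAX_TOKENS = 1024
PROMPT_TEMPLATE = "<s>[INST] {user_prompt} [/INST] "

@bentoml.service(
    traffic={"timeout": 300},
    resources={"gpu": 1},
)
class VLLM:
    def __init__(self) -> None:
        from vllm import AsyncEngineArgs, AsyncLLMEngine

        # Create the async vLLM engine once when the Service starts.
        self.engine = AsyncLLMEngine.from_engine_args(
            AsyncEngineArgs(
                model="mistralai/Mistral-7B-Instruct-v0.2",
                max_model_len=MAX_TOKENS,
            )
        )

    @bentoml.api
    async def generate(
        self,
        prompt: str = "Explain superconductors like I'm five years old",
        tokens: Optional[int] = None,
    ) -> AsyncGenerator[str, None]:
        from vllm import SamplingParams

        # Submit the request and stream back only the text generated
        # since the previous iteration.
        stream = await self.engine.add_request(
            uuid.uuid4().hex,
            PROMPT_TEMPLATE.format(user_prompt=prompt),
            SamplingParams(max_tokens=tokens or MAX_TOKENS),
        )
        cursor = 0
        async for request_output in stream:
            text = request_output.outputs[0].text
            yield text[cursor:]
            cursor = len(text)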

Deploy to BentoCloud

After the Service is ready, you can deploy the application to BentoCloud for better management and scalability. Sign up if you haven't got a BentoCloud account.
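
Once you have an account, you can log in from the CLI with an API token created in the BentoCloud console (the token value below is a placeholder):

bentoml cloud login --api-token <your-api-token>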

Make sure you have logged in to BentoCloud, then run the following command to deploy it.

bentoml deploy .

Once the application is up and running on BentoCloud, you can access it via the exposed URL.

Note: For custom deployment in your own infrastructure, use BentoML to generate an OCI-compliant image.
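
A rough containerization workflow looks like this (the Bento tag is a placeholder; bentoml build prints the actual tag):

bentoml build
bentoml containerize <bento_name:version>
docker run --gpus all -p 3000:3000 <bento_name:version>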

Different LLM Models

Besides the mistral-7b-instruct model, this repository provides examples for other models in its subdirectories (for example, llama3-8b-instruct). See each subdirectory for the corresponding example code.

LLM tools integration examples

  • Every model directory contains code to add OpenAI-compatible endpoints to the BentoML Service.
  • outlines-integration/ contains code for integrating with outlines for structured generation.

bentovllm's People

Contributors

larme, bojiang, sherlock113, ssheng, aarnphm, lycheel1, frostming


bentovllm's Issues

VLLM is stuck on Outlines 0.0.34 and this sample requires 0.0.37

I am relatively new to VLLM and BentoML, but trying to get this to work fails with a range of issues.

INFO: pip is looking at multiple versions of vllm to determine which version is compatible with other requirements. This could take a while.
ERROR: Cannot install outlines==0.0.37 and vllm==0.4.0.post1 because these package versions have conflicting dependencies.

The conflict is caused by:
The user requested outlines==0.0.37
vllm 0.4.0.post1 depends on outlines==0.0.34

To fix this you could try to:

  1. loosen the range of package versions you've specified
  2. remove package versions to allow pip attempt to solve the dependency conflict

I then try to adapt the sample to use 0.0.34, updating the service.py as follows:

@bentoml.api
async def adapted(
    self,
    prompt: str = DEFAULT_USER_PROMPT,
    max_tokens: Annotated[int, Ge(128), Le(MAX_TOKENS)] = MAX_TOKENS,
    json_schema: t.Optional[str] = DEFAULT_SCHEMA,
) -> AsyncGenerator[str, None]:
    from vllm import SamplingParams
    from vllm.model_executor.guided_logits_processors import JSONLogitsProcessor

    SAMPLING_PARAM = SamplingParams(
        max_tokens=max_tokens,
        logits_processors=[JSONLogitsProcessor(json_schema, self.engine.engine)]
    )

    prompt = PROMPT_TEMPLATE.format(user_prompt=prompt)
    stream = await self.engine.add_request(uuid.uuid4().hex, prompt, SAMPLING_PARAM)

    # Standard Stuff
    cursor = 0
    async for request_output in stream:
        text = request_output.outputs[0].text
        yield text[cursor:]
        cursor = len(text)

But then I get this error:
| exceptiongroup.ExceptionGroup: unhandled errors in a TaskGroup (1 sub-exception)
+-+---------------- 1 ----------------
| Traceback (most recent call last):
| File "/home/aaron/.local/lib/python3.10/site-packages/starlette/responses.py", line 261, in wrap
| await func()
| File "/home/aaron/.local/lib/python3.10/site-packages/starlette/responses.py", line 250, in stream_response
| async for chunk in self.body_iterator:
| File "/home/aaron/.local/lib/python3.10/site-packages/_bentoml_sdk/io_models.py", line 183, in async_stream
| async for item in obj:
| File "/home/aaron/BentoVLLM/mistral-7b-instruct/service.py", line 96, in competitors
| logits_processors=[JSONLogitsProcessor(json_schema, self.engine.engine)]
| File "/home/aaron/.local/lib/python3.10/site-packages/vllm/model_executor/guided_logits_processors.py", line 154, in init
| super().init(regex_string, tokenizer)
| File "/home/aaron/.local/lib/python3.10/site-packages/vllm/model_executor/guided_logits_processors.py", line 116, in init
| tokenizer = self.adapt_tokenizer(tokenizer)
| File "/home/aaron/.local/lib/python3.10/site-packages/vllm/model_executor/guided_logits_processors.py", line 44, in adapt_tokenizer
| tokenizer.vocabulary = tokenizer.get_vocab()
| AttributeError: '_AsyncLLMEngine' object has no attribute 'get_vocab'

Any help is appreciated.

BentoVLLM Service Fails to Start on Linux Server Due to Pydantic Related Errors

Hello,

I attempted to run a BentoVLLM example on a Linux server, ensuring all prerequisites were met and following the installation guide accordingly.

However, when trying to run the BentoML Service, I encountered an error preventing the service from starting.

The error log suggests an issue related to pydantic, and updating the version does not resolve the issue. The detailed error message is as follows:

ubuntu@ip-172-31-55-53:./serving/BentoVLLM$ bentoml serve .
2024-03-12T10:06:30+0900 [INFO] [cli] Starting production HTTP BentoServer from "service:VLLM" listening on http://localhost:3000 (Press CTRL+C to quit)
INFO 03-12 10:06:35 llm_engine.py:87] Initializing an LLM engine with config: model='mistralai/Mistral-7B-Instruct-v0.2', tokenizer='mistralai/Mistral-7B-Instruct-v0.2', tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=1024, download_dir=None, load_format=auto, tensor_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, device_config=cuda, seed=0)
INFO 03-12 10:06:39 weight_utils.py:163] Using model weights format ['*.safetensors']
INFO 03-12 10:07:02 llm_engine.py:357] # GPU blocks: 2910, # CPU blocks: 2048
INFO 03-12 10:07:03 model_runner.py:684] Capturing the model for CUDA graphs. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI.
INFO 03-12 10:07:03 model_runner.py:688] CUDA graphs can take additional 1~3 GiB memory per GPU. If you are running out of memory, consider decreasing gpu_memory_utilization or enforcing eager mode. You can also reduce the max_num_seqs as needed to decrease memory usage.
INFO 03-12 10:07:08 model_runner.py:756] Graph capturing finished in 5 secs.

2024-03-12T10:07:09+0900 [ERROR] [entry_service:VLLM_OpenAI:1] Initializing service error
Traceback (most recent call last):
File "/home/ubuntu/.local/lib/python3.8/site-packages/pydantic/type_adapter.py", line 149, in init
core_schema = _getattr_no_parents(type, 'pydantic_core_schema')
File "/home/ubuntu/.local/lib/python3.8/site-packages/pydantic/type_adapter.py", line 94, in _getattr_no_parents
raise AttributeError(attribute)
AttributeError: pydantic_core_schema

During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/home/ubuntu/.local/lib/python3.8/site-packages/pydantic/_internal/_generate_schema.py", line 461, in _generate_schema
obj = _typing_extra.evaluate_fwd_ref(obj, globalns=self.types_namespace)
File "/home/ubuntu/.local/lib/python3.8/site-packages/pydantic/_internal/_typing_extra.py", line 414, in evaluate_fwd_ref
return ref._evaluate(globalns=globalns, localns=localns)
File "/usr/lib/python3.8/typing.py", line 518, in _evaluate
eval(self.forward_code, globalns, localns),
File "", line 1, in
NameError: name 'CompletionRequest' is not defined

The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/home/ubuntu/.local/lib/python3.8/site-packages/_bentoml_sdk/service/factory.py", line 220, in call
return self.inner()
File "/home/ubuntu/serving/BentoVLLM/bentovllm_openai/utils.py", line 116, in init
async def create_completion(request: "CompletionRequest", raw_request: Request):
File "/home/ubuntu/.local/lib/python3.8/site-packages/fastapi/routing.py", line 956, in decorator
self.add_api_route(
...
File "/home/ubuntu/.local/lib/python3.8/site-packages/pydantic/_internal/_generate_schema.py", line 463, in _generate_schema
raise PydanticUndefinedAnnotation.from_name_error(e) from e
pydantic.errors.PydanticUndefinedAnnotation: name 'CompletionRequest' is not defined

For further information visit https://errors.pydantic.dev/2.0.2/u/undefined-annotation

...

INFO 03-12 10:07:10 serving_chat.py:302] Using default chat template:
INFO 03-12 10:07:10 serving_chat.py:302] {{ bos_token }}{% for message in messages %}{% if (message['role'] == 'user') != (loop.index0 % 2 == 0) %}{{ raise_exception('Conversation roles must alternate user/assistant/user/assistant/...') }}{% endif %}{% if message['role'] == 'user' %}{{ '[INST] ' + message['content'] + ' [/INST]' }}{% elif message['role'] == 'assistant' %}{{ message['content'] + eos_token}}{% else %}{{ raise_exception('Only user and assistant roles are supported!') }}{% endif %}{% endfor %}
2024-03-12T10:07:10+0900 [ERROR] [entry_service:VLLM_OpenAI:1] Application startup failed. Exiting.

The Python dependencies for further debugging are as follows:

python 3.8.10
bentoml==1.2.6
pydantic==2.6.3
pydantic_core==2.16.3
pydantic-settings==2.2.1
fastapi==0.110.0

I would appreciate any advice or solutions to resolve this issue. Thank you in advance for your assistance.

Using BentoML v1.2.10, can't connect to the host network and fails to download the Debian image

Hello, here is my bentofile.yaml:

service: "service:VLLM"
labels:
  owner: bentoml-team
  stage: demo
include:
- "*.py"
- "bentovllm_openai/*.py"
python:
  requirements_txt: "./requirements.txt"
  lock_packages: false
docker:
    distro: debian
    python_version: "3.10"
    cuda_version: "12.1.1"

Adding the following:

docker:
    distro: debian
    python_version: "3.10"
    cuda_version: "12.1.1"
    network_mode: "host"

That doesn't work either; containerization doesn't work at all. There is just no documentation about how we could do this.

Upgrading to VLLM 0.4.1 - TypeError

I recently upgraded to vLLM 0.4.1 and now get the following error. This looks internal to Bento, not my service (which is basically the default Llama 3 sample). Here is my requirements.txt:

accelerate==0.29.3
bentoml>=1.2.12
packaging==24.0
torch==2.2.1
transformers==4.40.0
vllm==0.4.1

2024-04-25T00:29:30-0600 [ERROR] [entry_service:bentovllm-llama3-8b-insruct-service:1] Initializing service error
Traceback (most recent call last):
File "/home/admin/.local/lib/python3.10/site-packages/_bentoml_sdk/service/factory.py", line 230, in call
instance = self.inner()
File "/home/admin/BentoVLLM/llama3-8b-instruct/bentovllm_openai/utils.py", line 77, in init
self.openai_serving_completion = OpenAIServingCompletion(
TypeError: OpenAIServingCompletion.init() got an unexpected keyword argument 'served_model'
2024-04-25T00:29:30-0600 [ERROR] [entry_service:bentovllm-llama3-8b-insruct-service:1] Traceback (most recent call last):
File "/home/admin/.local/lib/python3.10/site-packages/starlette/routing.py", line 732, in lifespan
async with self.lifespan_context(app) as maybe_state:
File "/usr/lib/python3.10/contextlib.py", line 199, in aenter
return await anext(self.gen)
File "/home/admin/.local/lib/python3.10/site-packages/bentoml/_internal/server/base_app.py", line 74, in lifespan
on_startup()
File "/home/admin/.local/lib/python3.10/site-packages/_bentoml_impl/server/app.py", line 313, in create_instance
self._service_instance = self.service()
File "/home/admin/.local/lib/python3.10/site-packages/_bentoml_sdk/service/factory.py", line 230, in call
instance = self.inner()
File "/home/admin/BentoVLLM/llama3-8b-instruct/bentovllm_openai/utils.py", line 77, in init
self.openai_serving_completion = OpenAIServingCompletion(
TypeError: OpenAIServingCompletion.init() got an unexpected keyword argument 'served_model'

2024-04-25T00:29:30-0600 [ERROR] [entry_service:bentovllm-llama3-8b-insruct-service:1] Application startup failed. Exiting.
