xorbitsai / inference

Replace OpenAI GPT with another LLM in your app by changing a single line of code. Xinference gives you the freedom to use any LLM you need. With Xinference, you're empowered to run inference with any open-source language models, speech recognition models, and multimodal models, whether in the cloud, on-premises, or even on your laptop.

Home Page: https://inference.readthedocs.io

License: Apache License 2.0

Languages: Python 86.80%, HTML 0.13%, JavaScript 12.67%, CSS 0.17%, Dockerfile 0.23%
Topics: ggml, pytorch, chatglm, chatglm2, deployment, flan-t5, llm, wizardlm, artificial-intelligence, machine-learning

inference's Introduction


Xorbits Inference: Model Serving Made Easy πŸ€–


English | 中文介绍 | ζ—₯本θͺž


Xorbits Inference (Xinference) is a powerful and versatile library designed to serve language, speech recognition, and multimodal models. With Xorbits Inference, you can effortlessly deploy and serve your own or state-of-the-art built-in models using just a single command. Whether you are a researcher, developer, or data scientist, Xorbits Inference empowers you to unleash the full potential of cutting-edge AI models.

πŸ”₯ Hot Topics

Framework Enhancements

  • Support specifying worker and GPU indexes for launching models: #1195
  • Support SGLang backend: #1161
  • Support LoRA for LLM and image models: #1080
  • Support speech recognition model: #929
  • Metrics support: #906
  • Docker image: #855
  • Support multimodal: #829

New Models

Integrations

  • FastGPT: a knowledge-based platform built on LLMs that offers out-of-the-box data processing and model invocation capabilities, and allows workflow orchestration through Flow visualization.
  • Dify: an LLMOps platform that enables developers (and even non-developers) to quickly build useful applications based on large language models, ensuring they are visual, operable, and improvable.
  • Chatbox: a desktop client for multiple cutting-edge LLM models, available on Windows, Mac and Linux.

Key Features

🌟 Model Serving Made Easy: Simplify the process of serving large language, speech recognition, and multimodal models. You can set up and deploy your models for experimentation and production with a single command.

⚑️ State-of-the-Art Models: Experiment with cutting-edge built-in models using a single command. Inference provides access to state-of-the-art open-source models!

πŸ–₯ Heterogeneous Hardware Utilization: Make the most of your hardware resources with ggml. Xorbits Inference intelligently utilizes heterogeneous hardware, including GPUs and CPUs, to accelerate your model inference tasks.

βš™οΈ Flexible API and Interfaces: Offer multiple interfaces for interacting with your models, supporting OpenAI compatible RESTful API (including Function Calling API), RPC, CLI and WebUI for seamless model management and interaction.

🌐 Distributed Deployment: Excel in distributed deployment scenarios, allowing the seamless distribution of model inference across multiple devices or machines.

πŸ”Œ Built-in Integration with Third-Party Libraries: Xorbits Inference seamlessly integrates with popular third-party libraries including LangChain, LlamaIndex, Dify, and Chatbox.
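
Since the RESTful API is OpenAI-compatible, existing OpenAI client code can typically be pointed at a running Xinference server by changing only the base URL. A minimal sketch, assuming a local server on the default port 9997 and a placeholder model uid:

from openai import OpenAI

# Sketch only: base_url assumes a local Xinference server on the default port,
# and "my-model-uid" is a placeholder for the uid returned when you launch a model.
client = OpenAI(base_url="http://localhost:9997/v1", api_key="not-used")
response = client.chat.completions.create(
    model="my-model-uid",
    messages=[{"role": "user", "content": "What is the largest animal?"}],
)
print(response.choices[0].message.content)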

Why Xinference

Feature                                        | Xinference | FastChat | OpenLLM | RayLLM
OpenAI-Compatible RESTful API                  | βœ…         | βœ…       | βœ…      | βœ…
vLLM Integrations                              | βœ…         | βœ…       | βœ…      | βœ…
More Inference Engines (GGML, TensorRT)        | βœ…         | ❌       | βœ…      | βœ…
More Platforms (CPU, Metal)                    | βœ…         | βœ…       | ❌      | ❌
Multi-node Cluster Deployment                  | βœ…         | ❌       | ❌      | βœ…
Image Models (Text-to-Image)                   | βœ…         | βœ…       | ❌      | ❌
Text Embedding Models                          | βœ…         | ❌       | ❌      | ❌
Multimodal Models                              | βœ…         | ❌       | ❌      | ❌
Audio Models                                   | βœ…         | ❌       | ❌      | ❌
More OpenAI Functionalities (Function Calling) | βœ…         | ❌       | ❌      | ❌

Getting Started

Please give us a star before you begin, and you'll receive instant notifications for every new release on GitHub!

Jupyter Notebook

The lightest way to experience Xinference is to try our Jupyter Notebook on Google Colab.

Docker

Nvidia GPU users can start the Xinference server using the Xinference Docker image. Prior to executing the installation command, ensure that both Docker and CUDA are set up on your system.

docker run --name xinference -d -p 9997:9997 -e XINFERENCE_HOME=/data -v </on/your/host>:/data --gpus all xprobe/xinference:latest xinference-local -H 0.0.0.0

Quick Start

Install Xinference by using pip as follows. (For more options, see Installation page.)

pip install "xinference[all]"

To start a local instance of Xinference, run the following command:

$ xinference-local

Once Xinference is running, there are multiple ways you can try it: via the web UI, via cURL, via the command line, or via Xinference's Python client. Check out our docs for the guide.
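
For example, here is a minimal sketch using the Python client, assuming the server is at the default http://localhost:9997 and the built-in chatglm2 chat model (adjust the model name to one you actually want to serve; depending on your version, launch_model may require extra arguments such as model_format or quantization):

from xinference.client import RESTfulClient

client = RESTfulClient("http://localhost:9997")
model_uid = client.launch_model(model_name="chatglm2")  # returns the uid of the launched model
model = client.get_model(model_uid)                     # a handle for sending requests
print(model.chat("What is the largest animal?"))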

web UI

Getting involved

Platform      | Purpose
GitHub Issues | Reporting bugs and filing feature requests.
Slack         | Collaborating with other Xorbits users.
Twitter       | Staying up-to-date on new features.

Contributors


inference's Issues

ENH: support alpaca chinese

  1. we need to convert the Hugging Face format into ggml: tutorial.
  2. upload the generated model to our s3 bucket.
  3. add a class to plexar, specifying the download URL, system prompt, separator, etc.

FEAT: model group and load balancer

Is your feature request related to a problem? Please describe

Users may want to serve multiple replicas of a model.

Describe the solution you'd like

Let users specify a model and the desired number of replicas; Xinference will launch and manage them as a group. A load balancer needs to be created to distribute the incoming inference requests evenly across these model replicas.
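
As an illustration only (these names are not Xinference APIs), a round-robin dispatcher over a group of replica handles could look like this:

import itertools

# Hypothetical sketch: a group of launched model replicas behind a simple
# round-robin load balancer that spreads generate calls across them.
class ModelReplicaGroup:
    def __init__(self, replicas):
        self._replicas = list(replicas)
        self._cycle = itertools.cycle(self._replicas)

    def generate(self, prompt, **kwargs):
        replica = next(self._cycle)  # pick the next replica in turn
        return replica.generate(prompt, **kwargs)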

Describe alternatives you've considered

Additional context

FEAT: dashboard

Is your feature request related to a problem? Please describe

A dashboard will provide the necessary monitoring and performance metrics, enabling efficient management and optimization of our system.

Describe the solution you'd like

Below are the key features I envision for a dashboard:

  1. Resource Monitoring:
  • CPU: Real-time monitoring of CPU utilization across the distributed system nodes.
  • Memory: Tracking and visualization of memory usage for each node.
  • GPU: Monitoring GPU utilization, allowing us to identify bottlenecks or optimize resource allocation.
  • VRAM: Real-time monitoring of VRAM utilization for GPU-based inference.
  2. Performance Monitoring:
  • Model-Specific Metrics: For each deployed model, capture and display relevant metrics such as generate task queue length, number of tokens generated per second, and any other model-specific performance indicators.
  • Throughput and Latency: Measure the overall throughput and latency of the system, enabling us to identify any performance issues and assess system efficiency.
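
A sketch of how per-node resource metrics could be gathered, assuming the psutil and pynvml packages (illustrative only, not the dashboard implementation):

import psutil
import pynvml

def collect_node_metrics():
    # CPU and memory utilization for this node
    metrics = {
        "cpu_percent": psutil.cpu_percent(interval=1),
        "memory_percent": psutil.virtual_memory().percent,
        "gpus": [],
    }
    # Per-GPU utilization and VRAM usage via NVML
    pynvml.nvmlInit()
    try:
        for i in range(pynvml.nvmlDeviceGetCount()):
            handle = pynvml.nvmlDeviceGetHandleByIndex(i)
            util = pynvml.nvmlDeviceGetUtilizationRates(handle)
            mem = pynvml.nvmlDeviceGetMemoryInfo(handle)
            metrics["gpus"].append({
                "gpu_percent": util.gpu,
                "vram_used_mb": mem.used / 1024 ** 2,
                "vram_total_mb": mem.total / 1024 ** 2,
            })
    finally:
        pynvml.nvmlShutdown()
    return metrics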

Describe alternatives you've considered

Additional context

FEAT: support multi-model serving

Implement a multi-model serving framework. The following roles are needed.

Controller

The controller manages workers, allocates resources, launches models on workers, and provides interfaces for users to interact with models. It should contain the following components:

  1. worker manager: runs health checks on workers and gathers the workers' resource usage periodically
  2. model manager: schedules and launches models on workers and maintains the lifecycle of each model
  3. user interfaces: gradio, RESTful, ...

Worker

Executes operations on models according to the controller's commands.
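
A rough sketch of the two roles described above (illustrative interfaces only, not the actual code; load_model is a hypothetical loader):

class Worker:
    """Loads models and serves requests as instructed by the controller."""
    def __init__(self):
        self._models = {}

    def launch_model(self, model_uid, model_spec):
        self._models[model_uid] = load_model(model_spec)  # hypothetical loader

    def resource_usage(self):
        return {"num_models": len(self._models)}


class Controller:
    """Tracks workers, schedules model launches, and exposes user interfaces."""
    def __init__(self):
        self._workers = []

    def add_worker(self, worker):
        self._workers.append(worker)

    def launch_model(self, model_uid, model_spec):
        # pick the least-loaded worker and launch the model there
        worker = min(self._workers, key=lambda w: w.resource_usage()["num_models"])
        worker.launch_model(model_uid, model_spec)
        return worker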

FEAT: local deployment

Implement a shortcut for users to launch a model locally with a single command, like:

plexar model launch -n <built-in model name>

or

plexar model launch -p <path to custom model>

BUG: list index out of range on controller start

Traceback (most recent call last):
File "/Users/jon/Documents/miniconda3/envs/ml/lib/python3.10/site-packages/gradio/routes.py", line 437, in run_predict
output = await app.get_blocks().process_api(
File "/Users/jon/Documents/miniconda3/envs/ml/lib/python3.10/site-packages/gradio/blocks.py", line 1352, in process_api
result = await self.call_function(
File "/Users/jon/Documents/miniconda3/envs/ml/lib/python3.10/site-packages/gradio/blocks.py", line 1077, in call_function
prediction = await anyio.to_thread.run_sync(
File "/Users/jon/Documents/miniconda3/envs/ml/lib/python3.10/site-packages/anyio/to_thread.py", line 31, in run_sync
return await get_asynclib().run_sync_in_worker_thread(
File "/Users/jon/Documents/miniconda3/envs/ml/lib/python3.10/site-packages/anyio/_backends/_asyncio.py", line 937, in run_sync_in_worker_thread
return await future
File "/Users/jon/Documents/miniconda3/envs/ml/lib/python3.10/site-packages/anyio/_backends/_asyncio.py", line 867, in run
result = context.run(func, *args)
File "/Users/jon/Documents/repo/plexar/plexar/actor/gradio.py", line 141, in _refresh_models
return gr.Dropdown.update(value=launched[0], choices=launched)
IndexError: list index out of range
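
The failure happens because launched[0] is accessed while no models have been launched yet; a possible guard (a sketch, not necessarily the merged fix) would be:

# Sketch of a guard in _refresh_models: avoid indexing an empty list
# when no models have been launched yet.
if not launched:
    return gr.Dropdown.update(value=None, choices=[])
return gr.Dropdown.update(value=launched[0], choices=launched)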

BUG: Missing dependencies

Several dependencies, such as numpy, versioneer, and llama_cpp, cannot be found when installing with pip install -e .

FEAT: support stream generation

Currently, plexar.model.llm.core.LlamaCppModel.generate takes a prompt as the input and returns a completion.

We can optimize it by leveraging the argument stream provided by llama_cpp.llama.Llama.__call__.
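
For reference, a minimal sketch of streaming with llama_cpp (the model path is a placeholder):

from llama_cpp import Llama

# With stream=True, Llama.__call__ yields completion chunks instead of a single
# completion, so tokens can be forwarded to the caller as they are generated.
llm = Llama(model_path="/path/to/model.bin")
for chunk in llm("Once upon a time", max_tokens=128, stream=True):
    print(chunk["choices"][0]["text"], end="", flush=True)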

ENH: support baichuan-7b

  1. apply ggml quant: tutorial.
  2. upload the ggml model to our s3 bucket.
  3. add a class to plexar, specifying the download URL, system prompt, separator, etc.

ENH: handle worker lost

A worker should maintain a heartbeat connection with the controller. Each heartbeat should include information about the running models.

If the heartbeat is interrupted, the controller should cease scheduling models for that worker and label the models running on that worker as unavailable.

Once the heartbeat is restored, the controller should perform a health check using the information provided by the worker.
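
A sketch of the stale-worker detection described above (illustrative names and timeout; not the actual implementation):

import time

HEARTBEAT_TIMEOUT = 30  # seconds; assumed value for illustration

class HeartbeatTracker:
    def __init__(self):
        self._last_seen = {}  # worker address -> (timestamp, running model uids)

    def report(self, worker_address, running_models):
        # called whenever a heartbeat arrives from a worker
        self._last_seen[worker_address] = (time.time(), running_models)

    def lost_workers(self):
        # workers whose heartbeat has not been seen within the timeout
        now = time.time()
        return [
            addr for addr, (ts, _) in self._last_seen.items()
            if now - ts > HEARTBEAT_TIMEOUT
        ]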

ENH: let users install libs like llama-cpp-python

Currently it is hard to install llama-cpp-python with optimized CMake args. A better choice could be letting users install it themselves.

To do that, we should not import llama_cpp in the global namespace, but inside LlamaCppModel. ImportError should be captured and re-raised with an installation guide.
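
In code, the deferred import could look roughly like this (a sketch, not the actual LlamaCppModel):

class LlamaCppModel:
    def __init__(self, model_path):
        self._model_path = model_path
        self._llm = None

    def load(self):
        # import llama_cpp only when the model is loaded, and re-raise
        # ImportError with an installation hint
        try:
            from llama_cpp import Llama
        except ImportError as e:
            raise ImportError(
                "llama-cpp-python is required to run this model. "
                "Install it with: pip install llama-cpp-python"
            ) from e
        self._llm = Llama(model_path=self._model_path)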

ENH: support whisper

Whisper is an open-source model created by OpenAI.

The author of ggml provides a high-performance inference implementation using ggml called whisper.cpp.

It would be very cool to support serving Whisper and combining it with LLMs. Here's a demo:

https://twitter.com/ggerganov/status/1642115206544343040?s=20

Requirements

brew install portaudio
pip install sounddevice
pip install soundfile

Recording

In [1]: import sounddevice as sd

In [2]: myrecording = sd.rec(int(10 * 48000), samplerate=48000, channels=1)

In [3]: sd.play(myrecording)

Recording with Arbitrary Duration

https://github.com/spatialaudio/python-sounddevice/blob/0.4.6/examples/rec_unlimited.py

Invoke whisper

whisper: https://github.com/openai/whisper
whisper.cpp: https://github.com/ggerganov/whisper.cpp
whisper.cpp python bindings: https://github.com/aarnphm/whispercpp

In [1]: import sounddevice as sd

In [2]: myrecording = sd.rec(int(10 * 48000), samplerate=48000, channels=1)

In [3]: sd.play(myrecording)

In [5]: from whispercpp import Whisper
   ...:

In [6]: w = Whisper.from_pretrained("tiny")
whisper_init_from_file_no_state: loading model from '/Users/jon/.local/share/whispercpp/ggml-tiny.bin'
whisper_model_load: loading model
whisper_model_load: n_vocab       = 51865
whisper_model_load: n_audio_ctx   = 1500
whisper_model_load: n_audio_state = 384
whisper_model_load: n_audio_head  = 6
whisper_model_load: n_audio_layer = 4
whisper_model_load: n_text_ctx    = 448
whisper_model_load: n_text_state  = 384
whisper_model_load: n_text_head   = 6
whisper_model_load: n_text_layer  = 4
whisper_model_load: n_mels        = 80
whisper_model_load: f16           = 1
whisper_model_load: type          = 1
whisper_model_load: mem required  =  127.00 MB (+    3.00 MB per decoder)
whisper_model_load: adding 1608 extra tokens
whisper_model_load: model ctx     =   73.58 MB
whisper_model_load: model size    =   73.54 MB
whisper_init_state: kv self size  =    2.62 MB
whisper_init_state: kv cross size =    8.79 MB

In [7]: w.transcribe(myrecording)
Out[7]: ' [sad music] [sad music] [sad music] [sad music]'

BUG: too many clients

Describe the bug

When running model_ref.generate() in IPython, there seems to be a client created for every generated word, eventually leading to the following error:

gaierror: [Errno 8] nodename nor servname provided, or not known

To Reproduce

python -m plexar.deploy.cmdline supervisor -a localhost:9999 --log-level debug

python -m plexar.deploy.cmdline worker --supervisor-address localhost:9999 -a localhost:10000 --log-level debug

import sys
from plexar.client import Client
client = Client("localhost:9999")
model_uid = client.launch_model("wizardlm-v1.0",7,"ggmlv3","q4_0")
model_ref = client.get_model(model_uid)


async for c in await model_ref.generate("Once upon a time, there was a very old computer.", {'max_tokens': 512}): sys.stdout.write(c['choices'][0]['text'])

Expected behavior

First the warnings are printed: Actor caller has created too many clients ([some number] >= 100), the global router may not be set.

Then we have the gaierror after the [some number] exceeds 240.

ENH: builtin stop words

orca 3b sometimes generates something like: [answer]###[unexpected tokens].

To avoid it, we may consider adding built-in stop words for built-in models.
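
A small sketch of what applying built-in stop words could look like (illustrative helper; "###" is just an example stop word):

def apply_stop_words(text, stop_words=("###",)):
    # truncate generated text at the first occurrence of any stop word
    for stop in stop_words:
        idx = text.find(stop)
        if idx != -1:
            text = text[:idx]
    return text

# e.g. apply_stop_words("The answer is 42.###unexpected") -> "The answer is 42."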
