xorbitsai / inference

Replace OpenAI GPT with another LLM in your app by changing a single line of code. Xinference gives you the freedom to use any LLM you need. With Xinference, you're empowered to run inference with any open-source language models, speech recognition models, and multimodal models, whether in the cloud, on-premises, or even on your laptop.

Home Page: https://inference.readthedocs.io

License: Apache License 2.0

Languages: Python 86.80%, HTML 0.13%, JavaScript 12.67%, CSS 0.17%, Dockerfile 0.23%
Topics: ggml, pytorch, chatglm, chatglm2, deployment, flan-t5, llm, wizardlm, artificial-intelligence, machine-learning

inference's Introduction


Xorbits Inference: Model Serving Made Easy πŸ€–


English | 中文介绍 | ζ—₯本θͺž


Xorbits Inference (Xinference) is a powerful and versatile library designed to serve language, speech recognition, and multimodal models. With Xorbits Inference, you can effortlessly deploy and serve your own or state-of-the-art built-in models using just a single command. Whether you are a researcher, developer, or data scientist, Xorbits Inference empowers you to unleash the full potential of cutting-edge AI models.

πŸ”₯ Hot Topics

Framework Enhancements

  • Support specifying worker and GPU indexes for launching models: #1195
  • Support SGLang backend: #1161
  • Support LoRA for LLM and image models: #1080
  • Support speech recognition model: #929
  • Metrics support: #906
  • Docker image: #855
  • Support multimodal: #829

New Models

Integrations

  • FastGPT: a knowledge-based platform built on LLMs that offers out-of-the-box data processing and model invocation capabilities, and allows workflow orchestration through Flow visualization.
  • Dify: an LLMOps platform that enables developers (and even non-developers) to quickly build useful applications based on large language models, ensuring they are visual, operable, and improvable.
  • Chatbox: a desktop client for multiple cutting-edge LLM models, available on Windows, Mac and Linux.

Key Features

🌟 Model Serving Made Easy: Simplify the process of serving large language, speech recognition, and multimodal models. You can set up and deploy your models for experimentation and production with a single command.

⚑️ State-of-the-Art Models: Experiment with cutting-edge built-in models using a single command. Inference provides access to state-of-the-art open-source models!

πŸ–₯ Heterogeneous Hardware Utilization: Make the most of your hardware resources with ggml. Xorbits Inference intelligently utilizes heterogeneous hardware, including GPUs and CPUs, to accelerate your model inference tasks.

βš™οΈ Flexible API and Interfaces: Offer multiple interfaces for interacting with your models, supporting OpenAI compatible RESTful API (including Function Calling API), RPC, CLI and WebUI for seamless model management and interaction.

🌐 Distributed Deployment: Excel in distributed deployment scenarios, allowing the seamless distribution of model inference across multiple devices or machines.

πŸ”Œ Built-in Integration with Third-Party Libraries: Xorbits Inference seamlessly integrates with popular third-party libraries including LangChain, LlamaIndex, Dify, and Chatbox.
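
Since the RESTful API is OpenAI-compatible, existing OpenAI client code can typically be pointed at a running Xinference server by changing only the base URL. A minimal sketch, assuming a local server on the default port 9997 and a placeholder model uid:

from openai import OpenAI

# Sketch only: base_url assumes a local Xinference server on the default port,
# and "my-model-uid" is a placeholder for the uid returned when you launch a model.
client = OpenAI(base_url="http://localhost:9997/v1", api_key="not-used")
response = client.chat.completions.create(
    model="my-model-uid",
    messages=[{"role": "user", "content": "What is the largest animal?"}],
)
print(response.choices[0].message.content)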

Why Xinference

Feature                                        | Xinference | FastChat | OpenLLM | RayLLM
OpenAI-Compatible RESTful API                  | βœ…         | βœ…       | βœ…      | βœ…
vLLM Integrations                              | βœ…         | βœ…       | βœ…      | βœ…
More Inference Engines (GGML, TensorRT)        | βœ…         | ❌       | βœ…      | βœ…
More Platforms (CPU, Metal)                    | βœ…         | βœ…       | ❌      | ❌
Multi-node Cluster Deployment                  | βœ…         | ❌       | ❌      | βœ…
Image Models (Text-to-Image)                   | βœ…         | βœ…       | ❌      | ❌
Text Embedding Models                          | βœ…         | ❌       | ❌      | ❌
Multimodal Models                              | βœ…         | ❌       | ❌      | ❌
Audio Models                                   | βœ…         | ❌       | ❌      | ❌
More OpenAI Functionalities (Function Calling) | βœ…         | ❌       | ❌      | ❌

Getting Started

Please give us a star before you begin, and you'll receive instant notifications for every new release on GitHub!

Jupyter Notebook

The lightest way to experience Xinference is to try our Jupyter Notebook on Google Colab.

Docker

Nvidia GPU users can start the Xinference server using the Xinference Docker image. Prior to executing the installation command, ensure that both Docker and CUDA are set up on your system.

docker run --name xinference -d -p 9997:9997 -e XINFERENCE_HOME=/data -v </on/your/host>:/data --gpus all xprobe/xinference:latest xinference-local -H 0.0.0.0

Quick Start

Install Xinference by using pip as follows. (For more options, see Installation page.)

pip install "xinference[all]"

To start a local instance of Xinference, run the following command:

$ xinference-local

Once Xinference is running, there are multiple ways you can try it: via the web UI, via cURL, via the command line, or via Xinference's Python client. Check out our docs for the guide.
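
For example, here is a minimal sketch using the Python client, assuming the server is at the default http://localhost:9997 and the built-in chatglm2 chat model (adjust the model name to one you actually want to serve; depending on your version, launch_model may require extra arguments such as model_format or quantization):

from xinference.client import RESTfulClient

client = RESTfulClient("http://localhost:9997")
model_uid = client.launch_model(model_name="chatglm2")  # returns the uid of the launched model
model = client.get_model(model_uid)                     # a handle for sending requests
print(model.chat("What is the largest animal?"))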

web UI

Getting involved

Platform      | Purpose
GitHub Issues | Reporting bugs and filing feature requests.
Slack         | Collaborating with other Xorbits users.
Twitter       | Staying up-to-date on new features.

Contributors


inference's Issues

ENH: support alpaca chinese

  1. we need to convert the Hugging Face format into ggml: tutorial.
  2. upload the generated model to our s3 bucket.
  3. add a class to plexar, specifying the download URL, system prompt, separator, etc.

FEAT: model group and load balancer

Is your feature request related to a problem? Please describe

Users may want to serve multiple replicas of a model.

Describe the solution you'd like

Let users specify a model and the desired number of replicas; Xinference will launch and manage them as a group. A load balancer needs to be created to distribute the incoming inference requests evenly across these model replicas.
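
As an illustration only (these names are not Xinference APIs), a round-robin dispatcher over a group of replica handles could look like this:

import itertools

# Hypothetical sketch: a group of launched model replicas behind a simple
# round-robin load balancer that spreads generate calls across them.
class ModelReplicaGroup:
    def __init__(self, replicas):
        self._replicas = list(replicas)
        self._cycle = itertools.cycle(self._replicas)

    def generate(self, prompt, **kwargs):
        replica = next(self._cycle)  # pick the next replica in turn
        return replica.generate(prompt, **kwargs)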

Describe alternatives you've considered

Additional context

FEAT: dashboard

Is your feature request related to a problem? Please describe

A dashboard will provide the necessary monitoring and performance metrics, enabling efficient management and optimization of our system.

Describe the solution you'd like

Below are the key features I envision for a dashboard:

  1. Resource Monitoring:
  • CPU: Real-time monitoring of CPU utilization across the distributed system nodes.
  • Memory: Tracking and visualization of memory usage for each node.
  • GPU: Monitoring GPU utilization, allowing us to identify bottlenecks or optimize resource allocation.
  • VRAM: Real-time monitoring of VRAM utilization for GPU-based inference.
  2. Performance Monitoring:
  • Model-Specific Metrics: For each deployed model, capture and display relevant metrics such as generate task queue length, number of tokens generated per second, and any other model-specific performance indicators.
  • Throughput and Latency: Measure the overall throughput and latency of the system, enabling us to identify any performance issues and assess system efficiency.
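
A sketch of how per-node resource metrics could be gathered, assuming the psutil and pynvml packages (illustrative only, not the dashboard implementation):

import psutil
import pynvml

def collect_node_metrics():
    # CPU and memory utilization for this node
    metrics = {
        "cpu_percent": psutil.cpu_percent(interval=1),
        "memory_percent": psutil.virtual_memory().percent,
        "gpus": [],
    }
    # Per-GPU utilization and VRAM usage via NVML
    pynvml.nvmlInit()
    try:
        for i in range(pynvml.nvmlDeviceGetCount()):
            handle = pynvml.nvmlDeviceGetHandleByIndex(i)
            util = pynvml.nvmlDeviceGetUtilizationRates(handle)
            mem = pynvml.nvmlDeviceGetMemoryInfo(handle)
            metrics["gpus"].append({
                "gpu_percent": util.gpu,
                "vram_used_mb": mem.used / 1024 ** 2,
                "vram_total_mb": mem.total / 1024 ** 2,
            })
    finally:
        pynvml.nvmlShutdown()
    return metrics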

Describe alternatives you've considered

Additional context

FEAT: support multi-model serving

Implement a multi-model serving framework. The following roles are needed.

Controller

The controller manages workers, allocates resources, launches models on workers, and provides interfaces for users to interact with models. It should contain the following components:

  1. worker manager: runs health checks on workers and gathers the workers' resource usage periodically
  2. model manager: schedules and launches models on workers and maintains the lifecycle of each model
  3. user interfaces: gradio, RESTful, ...

Worker

Executes operations on models according to the controller's commands.
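
A rough sketch of the two roles described above (illustrative interfaces only, not the actual code; load_model is a hypothetical loader):

class Worker:
    """Loads models and serves requests as instructed by the controller."""
    def __init__(self):
        self._models = {}

    def launch_model(self, model_uid, model_spec):
        self._models[model_uid] = load_model(model_spec)  # hypothetical loader

    def resource_usage(self):
        return {"num_models": len(self._models)}


class Controller:
    """Tracks workers, schedules model launches, and exposes user interfaces."""
    def __init__(self):
        self._workers = []

    def add_worker(self, worker):
        self._workers.append(worker)

    def launch_model(self, model_uid, model_spec):
        # pick the least-loaded worker and launch the model there
        worker = min(self._workers, key=lambda w: w.resource_usage()["num_models"])
        worker.launch_model(model_uid, model_spec)
        return worker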

FEAT: local deployment

Implement a shortcut for users to launch a model locally with a single command, like:

plexar model launch -n <built-in model name>

or

plexar model launch -p <path to custom model>

BUG: list index out of range on controller start

Traceback (most recent call last):
File "/Users/jon/Documents/miniconda3/envs/ml/lib/python3.10/site-packages/gradio/routes.py", line 437, in run_predict
output = await app.get_blocks().process_api(
File "/Users/jon/Documents/miniconda3/envs/ml/lib/python3.10/site-packages/gradio/blocks.py", line 1352, in process_api
result = await self.call_function(
File "/Users/jon/Documents/miniconda3/envs/ml/lib/python3.10/site-packages/gradio/blocks.py", line 1077, in call_function
prediction = await anyio.to_thread.run_sync(
File "/Users/jon/Documents/miniconda3/envs/ml/lib/python3.10/site-packages/anyio/to_thread.py", line 31, in run_sync
return await get_asynclib().run_sync_in_worker_thread(
File "/Users/jon/Documents/miniconda3/envs/ml/lib/python3.10/site-packages/anyio/_backends/_asyncio.py", line 937, in run_sync_in_worker_thread
return await future
File "/Users/jon/Documents/miniconda3/envs/ml/lib/python3.10/site-packages/anyio/_backends/_asyncio.py", line 867, in run
result = context.run(func, *args)
File "/Users/jon/Documents/repo/plexar/plexar/actor/gradio.py", line 141, in _refresh_models
return gr.Dropdown.update(value=launched[0], choices=launched)
IndexError: list index out of range
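
The failure happens because launched[0] is accessed while no models have been launched yet; a possible guard (a sketch, not necessarily the merged fix) would be:

# Sketch of a guard in _refresh_models: avoid indexing an empty list
# when no models have been launched yet.
if not launched:
    return gr.Dropdown.update(value=None, choices=[])
return gr.Dropdown.update(value=launched[0], choices=launched)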

BUG: Missing dependencies

Several dependencies, such as numpy, versioneer, and llama_cpp, cannot be found when installing with pip install -e .

FEAT: support stream generation

Currently, plexar.model.llm.core.LlamaCppModel.generate takes a prompt as the input and returns a completion.

We can optimize it by leveraging the argument stream provided by llama_cpp.llama.Llama.__call__.
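
For reference, a minimal sketch of streaming with llama_cpp (the model path is a placeholder):

from llama_cpp import Llama

# With stream=True, Llama.__call__ yields completion chunks instead of a single
# completion, so tokens can be forwarded to the caller as they are generated.
llm = Llama(model_path="/path/to/model.bin")
for chunk in llm("Once upon a time", max_tokens=128, stream=True):
    print(chunk["choices"][0]["text"], end="", flush=True)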

ENH: support baichuan-7b

  1. apply ggml quant: tutorial.
  2. upload the ggml model to our s3 bucket.
  3. add a class to plexar, specifying the download URL, system prompt, separator, etc.

ENH: handle worker lost

A worker should maintain a heartbeat connection with the controller. Each heartbeat should include information about the running models.

If the heartbeat is interrupted, the controller should cease scheduling models for that worker and label the models running on that worker as unavailable.

Once the heartbeat is restored, the controller should perform a health check using the information provided by the worker.
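
A sketch of the stale-worker detection described above (illustrative names and timeout; not the actual implementation):

import time

HEARTBEAT_TIMEOUT = 30  # seconds; assumed value for illustration

class HeartbeatTracker:
    def __init__(self):
        self._last_seen = {}  # worker address -> (timestamp, running model uids)

    def report(self, worker_address, running_models):
        # called whenever a heartbeat arrives from a worker
        self._last_seen[worker_address] = (time.time(), running_models)

    def lost_workers(self):
        # workers whose heartbeat has not been seen within the timeout
        now = time.time()
        return [
            addr for addr, (ts, _) in self._last_seen.items()
            if now - ts > HEARTBEAT_TIMEOUT
        ]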

ENH: let users install libs like llama-cpp-python

Currently it is hard to install llama-cpp-python with optimized CMake args. A better choice could be letting users install it themselves.

To do that, we should not import llama_cpp in the global namespace, but inside LlamaCppModel. ImportError should be captured and re-raised with an installation guide.
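
In code, the deferred import could look roughly like this (a sketch, not the actual LlamaCppModel):

class LlamaCppModel:
    def __init__(self, model_path):
        self._model_path = model_path
        self._llm = None

    def load(self):
        # import llama_cpp only when the model is loaded, and re-raise
        # ImportError with an installation hint
        try:
            from llama_cpp import Llama
        except ImportError as e:
            raise ImportError(
                "llama-cpp-python is required to run this model. "
                "Install it with: pip install llama-cpp-python"
            ) from e
        self._llm = Llama(model_path=self._model_path)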

ENH: support whisper

Whisper is an open-source model created by OpenAI.

The author of ggml provides a high-performance inference implementation using ggml called whisper.cpp.

It would be very cool to support serving Whisper and combining it with LLMs. Here's a demo:

https://twitter.com/ggerganov/status/1642115206544343040?s=20

Requirements

brew install portaudio
pip install sounddevice
pip install soundfile

Recording

In [1]: import sounddevice as sd

In [2]: myrecording = sd.rec(int(10 * 48000), samplerate=48000, channels=1)

In [3]: sd.play(myrecording)

Recording with Arbitrary Duration

https://github.com/spatialaudio/python-sounddevice/blob/0.4.6/examples/rec_unlimited.py

Invoke whisper

whisper: https://github.com/openai/whisper
whisper.cpp: https://github.com/ggerganov/whisper.cpp
whisper.cpp python bindings: https://github.com/aarnphm/whispercpp

In [1]: import sounddevice as sd

In [2]: myrecording = sd.rec(int(10 * 48000), samplerate=48000, channels=1)

In [3]: sd.play(myrecording)

In [5]: from whispercpp import Whisper
   ...:

In [6]: w = Whisper.from_pretrained("tiny")
whisper_init_from_file_no_state: loading model from '/Users/jon/.local/share/whispercpp/ggml-tiny.bin'
whisper_model_load: loading model
whisper_model_load: n_vocab       = 51865
whisper_model_load: n_audio_ctx   = 1500
whisper_model_load: n_audio_state = 384
whisper_model_load: n_audio_head  = 6
whisper_model_load: n_audio_layer = 4
whisper_model_load: n_text_ctx    = 448
whisper_model_load: n_text_state  = 384
whisper_model_load: n_text_head   = 6
whisper_model_load: n_text_layer  = 4
whisper_model_load: n_mels        = 80
whisper_model_load: f16           = 1
whisper_model_load: type          = 1
whisper_model_load: mem required  =  127.00 MB (+    3.00 MB per decoder)
whisper_model_load: adding 1608 extra tokens
whisper_model_load: model ctx     =   73.58 MB
whisper_model_load: model size    =   73.54 MB
whisper_init_state: kv self size  =    2.62 MB
whisper_init_state: kv cross size =    8.79 MB

In [7]: w.transcribe(myrecording)
Out[7]: ' [sad music] [sad music] [sad music] [sad music]'

BUG: too many clients

Describe the bug

When running model_ref.generate() in IPython, there seems to be a client created for every generated word, eventually leading to the following error:

gaierror: [Errno 8] nodename nor servname provided, or not known

To Reproduce

python -m plexar.deploy.cmdline supervisor -a localhost:9999 --log-level debug

python -m plexar.deploy.cmdline worker --supervisor-address localhost:9999 -a localhost:10000 --log-level debug

import sys
from plexar.client import Client
client = Client("localhost:9999")
model_uid = client.launch_model("wizardlm-v1.0",7,"ggmlv3","q4_0")
model_ref = client.get_model(model_uid)


async for c in await model_ref.generate("Once upon a time, there was a very old computer.", {'max_tokens': 512}): sys.stdout.write(c['choices'][0]['text'])

Expected behavior

First the warnings are printed: Actor caller has created too many clients ([some number] >= 100), the global router may not be set.

Then we have the gaierror after the [some number] exceeds 240.

ENH: builtin stop words

orca 3b sometimes generates something like: [answer]###[unexpected tokens].

To avoid it, we may consider adding built-in stop words for built-in models.
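
A small sketch of what applying built-in stop words could look like (illustrative helper; "###" is just an example stop word):

def apply_stop_words(text, stop_words=("###",)):
    # truncate generated text at the first occurrence of any stop word
    for stop in stop_words:
        idx = text.find(stop)
        if idx != -1:
            text = text[:idx]
    return text

# e.g. apply_stop_words("The answer is 42.###unexpected") -> "The answer is 42."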
