Cria - Local llama OpenAI-compatible API

The objective is to serve a local LLaMA-2 model by mimicking the OpenAI API. The model runs on the GPU via the ggml-sys crate, compiled with the appropriate acceleration flags (cuBLAS or Metal).

Quickstart:

  1. Clone the project and its submodules:

    git clone git@github.com:AmineDiro/cria.git
    cd cria/
    git submodule update --init --recursive
  2. Build the project (I ❤️ cargo!).

    cargo b --release
    • For cuBLAS (NVIDIA GPU) acceleration, use
      cargo b --release --features cublas
    • For Metal (Apple Silicon) acceleration, use
      cargo b --release --features metal

      ❗ NOTE: If you have issues building for GPU, check out the Building with GPU issues section below.

  3. Download a quantized GGML .bin LLaMA-2 model (for example, llama-2-7b).

  4. Run the API, using the --use-gpu flag to offload model layers to your GPU (a quick sanity check is sketched right after these steps):

    ./target/release/cria llama-2 {MODEL_BIN_PATH} --use-gpu --gpu-layers 32
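
Once the server is running, you can sanity-check it from Python. This is a minimal sketch, assuming the server listens on http://localhost:3000 (the same address used in the completion example below) and that the /v1/models route from the roadmap is available:

import json

import urllib3

# List the models served by the local cria instance (assumed address and route, see above).
http = urllib3.PoolManager()
resp = http.request("GET", "http://localhost:3000/v1/models")
print(json.dumps(json.loads(resp.data), indent=2))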

Completion Example

You can use the openai Python client (see the sketch further below) or use the sseclient Python library directly to stream messages. Here is an example using sseclient and urllib3:
import json
import sys
import time

import sseclient
import urllib3

url = "http://localhost:3000/v1/completions"


http = urllib3.PoolManager()
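# POST the prompt; preload_content=False keeps the body unread so the
# response can be consumed as a stream of server-sent events.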
response = http.request(
    "POST",
    url,
    preload_content=False,
    headers={
        "Content-Type": "application/json",
    },
    body=json.dumps(
        {
            "prompt": "Morocco is a beautiful country situated in north africa.",
            "temperature": 0.1,
        }
    ),
)

client = sseclient.SSEClient(response)

s = time.perf_counter()
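# Print each generated token as its SSE event arrives.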
for event in client.events():
    chunk = json.loads(event.data)
    sys.stdout.write(chunk["choices"][0]["text"])
    sys.stdout.flush()
e = time.perf_counter()

print(f"\nGeneration from completion took {e - s:.2f} seconds!")

You can clearly see the generation streaming when running on my M1 GPU.
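
Since cria mimics the OpenAI API, you can also point the official openai Python client at it. The following is a minimal sketch rather than a tested recipe: it assumes the server listens on http://localhost:3000, that the API key is not validated (any placeholder works), and that the model name llama-2 from the quickstart is accepted.

# Sketch: stream a completion through the official openai client (v1.x API).
from openai import OpenAI

# base_url points at the local cria server; api_key is a placeholder (assumed unchecked).
client = OpenAI(base_url="http://localhost:3000/v1", api_key="not-needed")

stream = client.completions.create(
    model="llama-2",  # mirrors the model name passed on the CLI above
    prompt="Morocco is a beautiful country situated in north africa.",
    temperature=0.1,
    stream=True,
)
for chunk in stream:
    print(chunk.choices[0].text, end="", flush=True)
print()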

Building with GPU issues

I had some issues compiling the llm crate with CUDA support for my RTX 2070 Super (Turing architecture). After some debugging, I needed to provide nvcc with the correct GPU architecture: the ggml-sys crate currently only generates code for the compute_52 and compute_61 architectures, while Turing needs compute_75. Here is the set of changes to build.rs:

diff --git a/crates/ggml/sys/build.rs b/crates/ggml/sys/build.rs
index 3a6e841..ef1e1b0 100644
--- a/crates/ggml/sys/build.rs
+++ b/crates/ggml/sys/build.rs
@@ -330,8 +330,9 @@ fn enable_cublas(build: &mut cc::Build, out_dir: &Path) {
             .arg("--compile")
             .arg("-cudart")
             .arg("static")
-            .arg("--generate-code=arch=compute_52,code=[compute_52,sm_52]")
-            .arg("--generate-code=arch=compute_61,code=[compute_61,sm_61]")
+            .arg("--generate-code=arch=compute_75,code=[compute_75,sm_75]")
             .arg("-D_WINDOWS")
             .arg("-DNDEBUG")
             .arg("-DGGML_USE_CUBLAS")
@@ -361,8 +362,7 @@ fn enable_cublas(build: &mut cc::Build, out_dir: &Path) {
             .arg("-Illama-cpp/include/ggml")
             .arg("-mtune=native")
             .arg("-pthread")
-            .arg("--generate-code=arch=compute_52,code=[compute_52,sm_52]")
-            .arg("--generate-code=arch=compute_61,code=[compute_61,sm_61]")
+            .arg("--generate-code=arch=compute_75,code=[compute_75,sm_75]")
             .arg("-DGGML_USE_CUBLAS")
             .arg("-I/usr/local/cuda/include")
             .arg("-I/opt/cuda/include")

The only thing left to do is to point the llm dependency in Cargo.toml at the locally patched crate.
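
If you are unsure which compute capability to target, you can query it before editing build.rs. The sketch below shells out to nvidia-smi and assumes a reasonably recent NVIDIA driver whose nvidia-smi supports the compute_cap query field; for a Turing card such as the RTX 2070 Super it should report 7.5, matching the compute_75/sm_75 values in the diff above.

import subprocess

# Ask nvidia-smi for each GPU's compute capability (requires a driver whose
# nvidia-smi supports the "compute_cap" query field).
out = subprocess.run(
    ["nvidia-smi", "--query-gpu=compute_cap", "--format=csv,noheader"],
    capture_output=True, text=True, check=True,
)
for cap in out.stdout.strip().splitlines():
    major, minor = cap.strip().split(".")
    print(f"--generate-code=arch=compute_{major}{minor},code=[compute_{major}{minor},sm_{major}{minor}]")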

TODO/ Roadmap:

  • Run Llama.cpp on CPU using llm-chain
  • Run Llama.cpp on GPU using llm-chain
  • Implement /models route
  • Implement basic /completions route
  • Implement streaming completions SSE
  • Cleanup cargo features with llm
  • Support macOS Metal
  • Merge completions / completion_streaming routes in same endpoint
  • Implement /embeddings route
  • Set up good tracing
  • Better errors
  • Implement route /chat/completions
  • Implement streaming chat completions SSE
  • Metrics ??
  • Batching requests (à la io_uring):
    • For each request, put an entry in a ring-buffer queue: Entry(flume mpsc (resp_rx, resp_tx))
    • Spawn the model in a separate task that reads from the ring buffer, takes an entry, and puts each generated token into the response channel
    • Construct a stream from the flume resp_rx channel and return SSE(stream) to the user.

Routes

The API exposes OpenAI-style routes under /v1, as listed in the roadmap above: /v1/models, /v1/completions, /v1/chat/completions, and /v1/embeddings.
