Cria - Local llama OpenAI-compatible API

The objective is to serve a local LLaMA-2 model by mimicking the OpenAI API. The model runs on the GPU via the ggml-sys crate, compiled with the appropriate acceleration flags (cuBLAS or Metal).

Quickstart:

  1. Clone the project and its submodules:

    git clone git@github.com:AmineDiro/cria.git
    cd cria/
    git submodule update --init --recursive
  2. Build the project (I ❤️ cargo!).

    cargo b --release
    • For cuBLAS (NVIDIA GPU) acceleration, use
      cargo b --release --features cublas
    • For Metal (Apple Silicon) acceleration, use
      cargo b --release --features metal

      ❗ NOTE: If you have issues building for GPU, check out the Building with GPU issues section below.

  3. Download a quantized GGML .bin LLaMA-2 model (for example, llama-2-7b).

  4. Run the API, using the --use-gpu flag to offload model layers to your GPU (a quick sanity check is sketched right after these steps):

    ./target/release/cria llama-2 {MODEL_BIN_PATH} --use-gpu --gpu-layers 32
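
Once the server is running, you can sanity-check it from Python. This is a minimal sketch, assuming the server listens on http://localhost:3000 (the same address used in the completion example below) and that the /v1/models route from the roadmap is available:

import json

import urllib3

# List the models served by the local cria instance (assumed address and route, see above).
http = urllib3.PoolManager()
resp = http.request("GET", "http://localhost:3000/v1/models")
print(json.dumps(json.loads(resp.data), indent=2))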

Completion Example

You can use the openai Python client (see the sketch further below) or use the sseclient Python library directly to stream messages. Here is an example using sseclient and urllib3:
import json
import sys
import time

import sseclient
import urllib3

url = "http://localhost:3000/v1/completions"


http = urllib3.PoolManager()
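# POST the prompt; preload_content=False keeps the body unread so the
# response can be consumed as a stream of server-sent events.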
response = http.request(
    "POST",
    url,
    preload_content=False,
    headers={
        "Content-Type": "application/json",
    },
    body=json.dumps(
        {
            "prompt": "Morocco is a beautiful country situated in north africa.",
            "temperature": 0.1,
        }
    ),
)

client = sseclient.SSEClient(response)

s = time.perf_counter()
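# Print each generated token as its SSE event arrives.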
for event in client.events():
    chunk = json.loads(event.data)
    sys.stdout.write(chunk["choices"][0]["text"])
    sys.stdout.flush()
e = time.perf_counter()

print(f"\nGeneration from completion took {e - s:.2f} seconds!")

You can clearly see the generation streaming when running on my M1 GPU.
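
Since cria mimics the OpenAI API, you can also point the official openai Python client at it. The following is a minimal sketch rather than a tested recipe: it assumes the server listens on http://localhost:3000, that the API key is not validated (any placeholder works), and that the model name llama-2 from the quickstart is accepted.

# Sketch: stream a completion through the official openai client (v1.x API).
from openai import OpenAI

# base_url points at the local cria server; api_key is a placeholder (assumed unchecked).
client = OpenAI(base_url="http://localhost:3000/v1", api_key="not-needed")

stream = client.completions.create(
    model="llama-2",  # mirrors the model name passed on the CLI above
    prompt="Morocco is a beautiful country situated in north africa.",
    temperature=0.1,
    stream=True,
)
for chunk in stream:
    print(chunk.choices[0].text, end="", flush=True)
print()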

Building with GPU issues

I had some issues compiling the llm crate with CUDA support for my RTX 2070 Super (Turing architecture). After some debugging, I needed to provide nvcc with the correct GPU architecture: the ggml-sys crate currently only generates code for the compute_52 and compute_61 architectures, while Turing needs compute_75. Here is the set of changes to build.rs:

diff --git a/crates/ggml/sys/build.rs b/crates/ggml/sys/build.rs
index 3a6e841..ef1e1b0 100644
--- a/crates/ggml/sys/build.rs
+++ b/crates/ggml/sys/build.rs
@@ -330,8 +330,9 @@ fn enable_cublas(build: &mut cc::Build, out_dir: &Path) {
             .arg("--compile")
             .arg("-cudart")
             .arg("static")
-            .arg("--generate-code=arch=compute_52,code=[compute_52,sm_52]")
-            .arg("--generate-code=arch=compute_61,code=[compute_61,sm_61]")
+            .arg("--generate-code=arch=compute_75,code=[compute_75,sm_75]")
             .arg("-D_WINDOWS")
             .arg("-DNDEBUG")
             .arg("-DGGML_USE_CUBLAS")
@@ -361,8 +362,7 @@ fn enable_cublas(build: &mut cc::Build, out_dir: &Path) {
             .arg("-Illama-cpp/include/ggml")
             .arg("-mtune=native")
             .arg("-pthread")
-            .arg("--generate-code=arch=compute_52,code=[compute_52,sm_52]")
-            .arg("--generate-code=arch=compute_61,code=[compute_61,sm_61]")
+            .arg("--generate-code=arch=compute_75,code=[compute_75,sm_75]")
             .arg("-DGGML_USE_CUBLAS")
             .arg("-I/usr/local/cuda/include")
             .arg("-I/opt/cuda/include")

The only thing left to do is to point the llm dependency in Cargo.toml at the locally patched crate.
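
If you are unsure which compute capability to target, you can query it before editing build.rs. The sketch below shells out to nvidia-smi and assumes a reasonably recent NVIDIA driver whose nvidia-smi supports the compute_cap query field; for a Turing card such as the RTX 2070 Super it should report 7.5, matching the compute_75/sm_75 values in the diff above.

import subprocess

# Ask nvidia-smi for each GPU's compute capability (requires a driver whose
# nvidia-smi supports the "compute_cap" query field).
out = subprocess.run(
    ["nvidia-smi", "--query-gpu=compute_cap", "--format=csv,noheader"],
    capture_output=True, text=True, check=True,
)
for cap in out.stdout.strip().splitlines():
    major, minor = cap.strip().split(".")
    print(f"--generate-code=arch=compute_{major}{minor},code=[compute_{major}{minor},sm_{major}{minor}]")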

TODO/ Roadmap:

  • Run Llama.cpp on CPU using llm-chain
  • Run Llama.cpp on GPU using llm-chain
  • Implement /models route
  • Implement basic /completions route
  • Implement streaming completions SSE
  • Cleanup cargo features with llm
  • Support macOS Metal
  • Merge completions / completion_streaming routes in same endpoint
  • Implement /embeddings route
  • Set up good tracing
  • Better errors
  • Implement route /chat/completions
  • Implement streaming chat completions SSE
  • Metrics ??
  • Batching requests (à la io_uring):
    • For each request, put an entry in a ring-buffer queue: Entry(flume mpsc (resp_rx, resp_tx))
    • Spawn the model in a separate task that reads from the ring buffer, takes an entry, and puts each generated token into the response channel
    • Construct a stream from the flume resp_rx channel and return SSE(stream) to the user.

Routes

The API exposes OpenAI-style routes under /v1, as listed in the roadmap above: /v1/models, /v1/completions, /v1/chat/completions, and /v1/embeddings.
