janhq / nitro

An inference server on top of llama.cpp. OpenAI-compatible API, queue, & scaling. Embed a prod-ready, local inference engine in your apps. Powers Jan

Home Page: https://nitro.jan.ai/

License: GNU Affero General Public License v3.0

Shell 0.38% CMake 0.53% C++ 77.99% Batchfile 0.14% C 17.51% Makefile 0.13% TypeScript 3.31% JavaScript 0.02%
gguf llama2 llamacpp tensorrt-llm accelerated ai inference-engine openai-api stable-diffusion cuda

nitro's Introduction

Nitro - Embeddable AI


Documentation - API Reference - Changelog - Bug reports - Discord

โš ๏ธ Nitro is currently in Development: Expect breaking changes and bugs!

Features

  • Fast Inference: Built on top of the cutting-edge inference library llama.cpp, modified to be production ready.
  • Lightweight: Only 3MB, ideal for resource-sensitive environments.
  • Easily Embeddable: Simple integration into existing applications, offering flexibility.
  • Quick Setup: Approximately 10-second initialization for swift deployment.
  • Enhanced Web Framework: Incorporates the Drogon C++ framework to boost web service efficiency.

About Nitro

Nitro is a high-efficiency C++ inference engine for edge computing, powering Jan. It is lightweight and embeddable, ideal for product integration.

The zipped Nitro binary is only ~3 MB, with minimal to no dependencies (for example, CUDA is needed only if you use a GPU), making it well suited to any edge or server deployment 👍.

Read more about Nitro at https://nitro.jan.ai/

Repo Structure

.
├── controllers
├── docs
├── llama.cpp -> Upstream llama.cpp
├── nitro_deps -> Dependencies of the Nitro project as a sub-project
└── utils

Quickstart

Step 1: Install Nitro

  • For Linux and macOS

    curl -sfL https://raw.githubusercontent.com/janhq/nitro/main/install.sh | sudo /bin/bash -
  • For Windows

    powershell -Command "& { Invoke-WebRequest -Uri 'https://raw.githubusercontent.com/janhq/nitro/main/install.bat' -OutFile 'install.bat'; .\install.bat; Remove-Item -Path 'install.bat' }"

Step 2: Download a Model

mkdir model && cd model
wget -O llama-2-7b-model.gguf "https://huggingface.co/TheBloke/Llama-2-7B-Chat-GGUF/resolve/main/llama-2-7b-chat.Q5_K_M.gguf?download=true"

Step 3: Run Nitro server

nitro

Step 4: Load model

curl http://localhost:3928/inferences/llamacpp/loadmodel \
  -H 'Content-Type: application/json' \
  -d '{
    "llama_model_path": "/model/llama-2-7b-model.gguf",
    "ctx_len": 512,
    "ngl": 100,
  }'

Step 5: Making an Inference

curl http://localhost:3928/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [
      {
        "role": "user",
        "content": "Who won the world series in 2020?"
      }
    ]
  }'
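
If a model has been loaded, the server replies with an OpenAI-style chat completion object. An abridged, illustrative response (the field values are placeholders, not actual output) looks like:

{
  "choices": [
    {
      "message": {
        "role": "assistant",
        "content": "The Los Angeles Dodgers won the World Series in 2020."
      }
    }
  ]
}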

Table of parameters

Parameter Type Description
llama_model_path String The file path to the LLaMA model.
ngl Integer The number of GPU layers to use.
ctx_len Integer The context length for model operations.
embedding Boolean Whether to use embedding in the model.
n_parallel Integer The number of parallel operations.
cont_batching Boolean Whether to use continuous batching.
user_prompt String The prompt to use for the user.
ai_prompt String The prompt to use for the AI assistant.
system_prompt String The prompt to use for system rules.
pre_prompt String The prompt to use for internal configuration.
cpu_threads Integer The number of threads to use for inference (CPU mode only).
n_batch Integer The batch size for the prompt evaluation step.
caching_enabled Boolean Whether to enable prompt caching.
clean_cache_threshold Integer The number of chats that triggers a clean-cache action.
grp_attn_n Integer The group-attention factor in self-extend.
grp_attn_w Integer The group-attention width in self-extend.
mlock Boolean Whether to prevent the system from swapping the model to disk (macOS).
grammar_file String The path to a GBNF grammar file used to constrain sampling.
model_type String The model type to use: llm or embedding (default: llm).
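
As an example, a loadmodel call that sets several of these parameters might look like the following sketch (the parameter names come from the table above; the values are illustrative, not recommendations):

curl http://localhost:3928/inferences/llamacpp/loadmodel \
  -H 'Content-Type: application/json' \
  -d '{
    "llama_model_path": "/model/llama-2-7b-model.gguf",
    "ctx_len": 2048,
    "ngl": 100,
    "n_parallel": 2,
    "cont_batching": true,
    "cpu_threads": 6,
    "caching_enabled": true
  }'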

OPTIONAL: You can run Nitro on a different port, e.g. 5000 instead of the default 3928, by starting it manually in a terminal (see the example after this list):

./nitro 1 127.0.0.1 5000 ([thread_num] [host] [port] [uploads_folder_path])
  • thread_num : the number of threads for the Nitro web server
  • host : the host address, normally 127.0.0.1 (localhost) or 0.0.0.0 (all interfaces)
  • port : the port that Nitro listens on
  • uploads_folder_path : a custom path for file uploads in Drogon
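
For example, the following starts Nitro with 4 threads on port 5000 and sends a chat request there (a sketch; the thread count and port are arbitrary choices):

./nitro 4 127.0.0.1 5000

curl http://127.0.0.1:5000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"messages": [{"role": "user", "content": "Hello"}]}'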

Nitro server is compatible with the OpenAI format, so you can expect the same output as the OpenAI ChatGPT API.

Compile from source

To compile Nitro from source, see the Compile from source guide.

Download

  • Stable (Recommended): Windows (CPU, CUDA), macOS (Intel, M1/M2), Linux (CPU, CUDA)
  • Experimental (Nightly Build): GitHub Action artifacts

Download the latest version of Nitro at https://nitro.jan.ai/ or visit the GitHub Releases to download any previous release.

Nightly Build

A nightly build is a process where the software is built automatically every night, which helps detect and fix bugs early in the development cycle. The process for this project is defined in .github/workflows/build.yml.

You can join our Discord server here and go to the github-nitro channel to monitor the build process.

The nightly build is triggered at 2:00 AM UTC every day.

The nightly build can be downloaded from the URL posted in the Discord channel. Open the URL in a browser and download the build artifacts from there.

Manual Build

A manual build is a process where the software is built on demand by the developers, usually when a new feature is implemented or a bug is fixed. The process for this project is defined in .github/workflows/build.yml.

It is similar to the nightly build process, except that it is triggered manually by the developers.
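
Assuming you have the GitHub CLI installed and write access to the repository, a manual trigger might look like this (a sketch; it requires the workflow to expose a workflow_dispatch trigger):

# Trigger the build workflow manually via the GitHub CLI
gh workflow run build.yml --repo janhq/nitro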

Contact

  • For support, please file a GitHub ticket.
  • For questions, join our Discord here.
  • For long-form inquiries, please email [email protected].

Star History

Star History Chart

nitro's People

Contributors

0xsage, cameronng, dan-jan, dotieuthien, hahuyhoang411, henryh0x1, hiento09, hientominh, hiro-v, ikraduya, imtuyethan, innoobwetrust, jan-service-account, louis-jan, maurodruwel, mevemo, psugihara, shavit, tikikun, tohrnii, urmauur, vansangpfiev, wujjpp


nitro's Issues

chore: refactor jan-inference -> Nitro repo

Nitro, at the moment, encompasses:

  • llama-python-backend, llama.cpp
  • C++ server
  • Accelerated models (submodule?)
  • GGML models (submodule?)

The point is that we'll be adding more to it long term.

Load model failure should exit with code 1 instead of continuing to serve the HTTP server

[1] stderr: gguf_init_from_file: invalid magic number 0a8a0280
[1] 
[1] stderr: error loading model: llama_model_loader: failed to load model from /Users/louis/Library/Application Support/jan-electron/pytorch_model.bin
[1] 
[1] llama_load_model_from_file: failed to load model
[1] llama_init_from_gpt_params: error: failed to load model '/Users/louis/Library/Application Support/jan-electron/pytorch_model.bin'
[1] 
[1] stdout: 20231005 01:38:04.960344 UTC 4991698 INFO   - main.cc:27
[1] 20231005 01:38:04.971173 UTC 4991698 INFO  {"timestamp":1696469884,"level":"WARNING","function":"llamaCPP","line":1198,"message":"build info","build":1273,"commit":"99115f3"} - llamaCPP.h:108
[1] 20231005 01:38:04.971215 UTC 4991698 INFO  {"timestamp":1696469884,"level":"WARNING","function":"llamaCPP","line":1204,"message":"system info","n_threads":6,"total_threads":10,"system_info":"AVX = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 0 | NEON = 1 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 1 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 0 | SSSE3 = 0 | VSX = 0 | "} - llamaCPP.h:108
[1] 20231005 01:38:04.971447 UTC 4991698 INFO  {"timestamp":1696469884,"level":"ERROR","function":"loadModel","line":245,"message":"unable to load model","model":"/Users/louis/Library/Application Support/jan-electron/pytorch_model.bin"} - llamaCPP.h:108
[1] 20231005 01:38:04.971451 UTC 4991698 INFO  "Error loading the model" - llamaCPP.h:108
[1]       ___                                   ___           ___     
[1]      /__/        ___           ___        /  /\         /  /\    
[1]      \  \:\      /  /\         /  /\      /  /::\       /  /::\   
[1]       \  \:\    /  /:/        /  /:/     /  /:/\:\     /  /:/\:\  
[1]   _____\__\:\  /__/::\       /  /:/     /  /:/  \:\   /  /:/  \:\ 
[1]  /__/::::::::\ \__\/\:\__   /  /::\    /__/:/ /:/___ /__/:/ \__\:\
[1]  \  \:\~~\~~\/    \  \:\/\ /__/:/\:\   \  \:\/:::::/ \  \:\ /  /:/
[1]   \  \:\  ~~~      \__\::/ \__\/  \:\   \  \::/~~~~   \  \:\  /:/ 
[1]    \  \:\          /__/:/       \  \:\   \  \:\        \  \:\/:/  
[1]     \  \:\         \__\/         \__\/    \  \:\        \  \::/   
[1]      \__\/                                 \__\/         \__\/    
[1] 

bug: cuBlas build is currently not working

mkdir build
cd build
cmake .. -DLLAMA_CUBLAS=ON
cmake --build . --config Release

These commands are supposed to enable Nitro to run on NVIDIA GPUs, but the build currently segfaults.

Ship Nitro as a Binary

Nitro should be statically built and distributed as a binary

Tasks

  • Build drogon with llama cpp
  • Spin up Mac VM for testing
  • Target mac os, x86, metal supported binary
  • (clarify?) We have llm endpoint using ggml
  • Server can be configured using a config file

Success criteria

  • Nitro is a multi-platform binary
  • Runs a Drogon C++ server
  • Serves llama.cpp in Metal or CPU-only modes
  • Includes encoding / decoding
  • An architecture diagram showing how everything fits together

Add github action for nitro build

  • Use Github action in janhq
  • Artifacts: Github releases
  • Runner matrix for build status

Platform

  • Linux - amd64 - with/without CUDA
  • Mac - amd64 - without Metal
  • Mac - arm64 - with Metal

feat: Nitro speed up for 1st inference time after model loaded

Problem
The first request to the Nitro web server is slow, which is frustrating. Once the server is ready, a user should be able to get a quick result.

Success Criteria

  • The first user request should be fast
  • A mock request should be made right after the model loads, to warm it up
  • /health should return 500 while model warm-up is not yet done, and 200 once it is (see the polling sketch below). The process-exit case has already been handled

Additional context
None atm
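
Assuming the /health behavior described above, a client could wait for warm-up to finish with a simple polling loop (a sketch; the port and endpoint path follow the defaults used elsewhere in this README):

# Poll /health once per second until the server reports 200 (warm-up done)
until [ "$(curl -s -o /dev/null -w '%{http_code}' http://localhost:3928/health)" = "200" ]; do
  sleep 1
done
echo "Nitro is warmed up and ready"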

feat: Nitro should support docker image

Problem

  • The manual build steps in README.md are frustrating, especially when there are system-dependency bugs

Success Criteria

  • Dockerfile - related to #32
  • Prebuilt docker images

Additional context
None
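
Once a Dockerfile and prebuilt images exist, usage might look like the following (a sketch; the image name janhq/nitro and port mapping are assumptions, not a published image):

# Build the image from the repository root (assumes a Dockerfile is present)
docker build -t janhq/nitro .

# Run the server, exposing Nitro's default port
docker run -p 3928:3928 janhq/nitro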

Nitro has an installation script and configuration

Success Criteria

  • User should be able to configure Nitro and change defaults
  • User should be able to run a single script to install Nitro/OS dependencies
  • User should be able to deploy Nitro service
  • User should be able to integrate Nitro with Jan seamlessly

Rough spec

An installation path could look like the following

  1. Install dependencies: ./install.sh

    logs:

    # If gpu_mode:
    echo "Running Nitro on GPUs, checking dependencies"
    # install nvidia-smi
    ...

  2. Configure .env

    NITRO_PORT: 8000
    GPU_MODE: true

    # What other configs are possible for a good UX?
  3. Install the model(s) into a directory

    wget ... /models
  4. Run Nitro: run.sh
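
A minimal run.sh matching this spec might look like the following (a hypothetical sketch; NITRO_PORT and GPU_MODE come from the .env in step 2, and the sketch assumes .env uses KEY=VALUE lines, e.g. NITRO_PORT=8000):

#!/bin/bash
set -e

# Load NITRO_PORT, GPU_MODE, etc. from .env (assumes KEY=VALUE lines)
export $(grep -v '^#' .env | xargs)

# Start Nitro with 1 thread on the configured port, defaulting to 3928
./nitro 1 127.0.0.1 "${NITRO_PORT:-3928}"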

Epic: Refactor Nitro into a standalone inference service on top of Llama.cpp, compatible with Jan

Deliverable

  • janhq/nitro Github Repo #20
  • Nitro documentation with overview and installation janhq/jan#113
  • Stretch goal: endpoint /models returns a list of models that have been downloaded & are ready to be used

Owners

Big Picture

  • Jan can take in a Nitro server URL
  • Nitro can run on Apple Silicon (GGUF, can drop GGML)
  • Nitro can run on Nvidia GPUs (with llama.cpp)
  • We are a .cpp compatible server

Exclusions

  • Focus on Llama.cpp first; we will tackle TensorRT in a subsequent sprint (aligned with our DGX cluster arriving)
