c0sogi / llama-api
An OpenAI-like LLaMA inference API
License: MIT License
We are facing "ValueError - Can't patch loop of type <class 'uvloop.Loop'>" while using llama-api with FastAPI. Are there any known issues or resolutions?
For example openchat 3.5 wants this prompt template format:
GPT4 User: {prompt}<|end_of_turn|>GPT4 Assistant:
I tried a few things and managed to crash the server, so I am stuck. Can anyone help? I think the author is away...
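For anyone else stuck here, a minimal sketch of building that template by hand with plain string formatting (this is just an illustration of the target format, not llama-api's own template mechanism):

```python
# Sketch only: assemble a chat history into the openchat 3.5 format shown
# above using plain string formatting (not llama-api's template API).
def openchat_prompt(messages: list[dict]) -> str:
    parts = []
    for m in messages:
        role = "GPT4 User" if m["role"] == "user" else "GPT4 Assistant"
        parts.append(f"{role}: {m['content']}<|end_of_turn|>")
    parts.append("GPT4 Assistant:")  # leave the assistant turn open
    return "".join(parts)

print(openchat_prompt([{"role": "user", "content": "Hello"}]))
# GPT4 User: Hello<|end_of_turn|>GPT4 Assistant:
```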
Hi!
I have a strange suggestion :) Add a proxy object that sends requests to OpenAI whenever openai_replacement_models maps a model to openai_proxy (or something like it).
For example:
openai_replacement_models = {"gpt-3.5-turbo": "my_ggml", "gpt-4": "openai_proxy", "llama": "another_ggml"}
If the user calls gpt-3.5-turbo, the API server uses my_ggml; if the user calls gpt-4, it forwards the request to OpenAI.
This would make it easy to use both a local LLaMA and OpenAI at the same time.
PS: Thanks so much for the LangChain example!
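The suggested routing could look something like this sketch (the returned strings stand in for real local inference and a real upstream call):

```python
# Sketch of the proposed dispatch: entries mapped to "openai_proxy" are
# forwarded upstream, everything else is served by a local model.
openai_replacement_models = {
    "gpt-3.5-turbo": "my_ggml",
    "gpt-4": "openai_proxy",
    "llama": "another_ggml",
}

def route(model: str) -> str:
    target = openai_replacement_models.get(model, model)
    if target == "openai_proxy":
        return "forward to api.openai.com"  # stand-in for a real proxy call
    return f"serve locally with {target}"   # stand-in for local inference

print(route("gpt-4"))          # forward to api.openai.com
print(route("gpt-3.5-turbo"))  # serve locally with my_ggml
```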
How to reproduce:
1) Model being used:
wizardlm_70b_q4_gguf = LlamaCppModel(
model_path="wizardlm-70b-v1.0.Q4_K_M.gguf", # manual download
max_total_tokens=4096,
use_mlock=False,
)
2) From Swagger, run this query against the chat completion endpoint (note: the quotes were backslash-escaped in the original request):
{
"messages": [
{
"role": "system",
"content": "You are a helpful assistant."
},
{
"role": "user",
"content": "The topic is: 'Infant baptism is not biblical'. Give me at least 5 points. Output a table with these 4 columns: 'Point For Title Sentence','Point For Explanation with quotes and examples (min 5 sentences)', 'Point Against Title Sentence','Point Against Explanation with quotes and examples (min 5 sentences)'."
}
],
"model": "wizardlm_70b_q4_gguf"
}
3) When the server completes the query it says:
llama_print_timings: load time = 70698.17 ms
llama_print_timings: sample time = 353.01 ms / 861 runs ( 0.41 ms per token, 2439.01 tokens per second)
llama_print_timings: prompt eval time = 56156.99 ms / 95 tokens ( 591.13 ms per token, 1.69 tokens per second)
llama_print_timings: eval time = 920273.58 ms / 860 runs ( 1070.09 ms per token, 0.93 tokens per second)
llama_print_timings: total time = 978060.67 ms
[2023-09-17 15:00:28,909] llama_api.server.pools.llama:INFO - 🦙 [done for wizardlm_70b_q4_gguf]: (elapsed time: 978.1s | tokens: 860( 0.9tok/s))
INFO: 216.8.141.240:47056 - "POST /v1/chat/completions HTTP/1.1" 200 OK
doug@Ubuntu-2204-jammy-amd64-base:~/llama-api$
4) Meanwhile, the Swagger call still shows LOADING indefinitely.
Support the min_p sampler, which is implemented in ExLlamaV2.
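As a reference for implementers, a rough sketch of what min_p filtering does (thresholding against the top probability; ExLlamaV2's actual implementation works on logits and differs in detail):

```python
# Sketch of min_p sampling: drop tokens whose probability falls below
# min_p times the most likely token's probability, then renormalize.
def min_p_filter(probs: list[float], min_p: float) -> list[float]:
    threshold = min_p * max(probs)
    kept = [p if p >= threshold else 0.0 for p in probs]
    total = sum(kept)
    return [p / total for p in kept]

print(min_p_filter([0.5, 0.3, 0.15, 0.05], 0.2))  # the 0.05 tail is zeroed
```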
I hate to be a pain. You have been so helpful already, but I am stuck.
My generations are ending prematurely: "finish_reason": "length" as seen below
{
"id": "chatcmpl-4f6ac32a-287f-41ba-a4ec-8768e70ad2c3",
"object": "chat.completion",
"created": 1694531345,
"model": "llama-2-70b-chat.Q5_K_M",
"choices": [
{
"message": {
"role": "assistant",
"content": " Despite AI argue that AI advancements in technology, humans will always be required i, some professions.\nSTERRT Artificial intelligence (AI) has made significant advancementsin the recent years, it's impact on various industries, including restaurants and bars. While AI cannot replace bartenders, therelatively few tasks, AI argue that humans will always be ne needed these establishments.\nSTILL be required in ssociated with sERvices sector. Here are r several reasons whythat AI explainBelow:\nFirstly, AI cannot"
},
"index": 0,
"finish_reason": "length"
}
],
"usage": {
"prompt_tokens": 123,
"completion_tokens": 128,
"total_tokens": 251
}
}
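completion_tokens is exactly 128, which looks like a default max_tokens cap rather than a model problem. A hedged fix, assuming the endpoint honors the standard OpenAI max_tokens field, is to set it explicitly in the request:

```python
# Sketch: raise max_tokens in the request body so generation is not cut
# off at the default cap (the field name follows the OpenAI API).
request_body = {
    "model": "llama-2-70b-chat.Q5_K_M",
    "messages": [{"role": "user", "content": "Write a long essay."}],
    "max_tokens": 2048,  # without this, the server appears to stop at 128
}
print(request_body["max_tokens"])  # 2048
```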
My definition is:
llama2_70b_Q5_gguf = LlamaCppModel(
model_path="llama-2-70b-chat.Q5_K_M.gguf", # manual download
max_total_tokens=16384,
use_mlock=False
)
When I load I get:
llm_load_print_meta: format = GGUF V2 (latest)
llm_load_print_meta: arch = llama
llm_load_print_meta: vocab type = SPM
llm_load_print_meta: n_vocab = 32000
llm_load_print_meta: n_merges = 0
llm_load_print_meta: n_ctx_train = 4096
llm_load_print_meta: n_ctx = 16384
llm_load_print_meta: n_embd = 8192
llm_load_print_meta: n_head = 64
llm_load_print_meta: n_head_kv = 8
llm_load_print_meta: n_layer = 80
llm_load_print_meta: n_rot = 128
llm_load_print_meta: n_gqa = 8
llm_load_print_meta: f_norm_eps = 1.0e-05
llm_load_print_meta: f_norm_rms_eps = 1.0e-05
llm_load_print_meta: n_ff = 28672
llm_load_print_meta: freq_base = 82684.0
llm_load_print_meta: freq_scale = 0.25
llm_load_print_meta: model type = 70B
llm_load_print_meta: model ftype = mostly Q5_K - Medium
llm_load_print_meta: model size = 68.98 B
llm_load_print_meta: general.name = LLaMA v2
llm_load_print_meta: BOS token = 1 '<s>'
llm_load_print_meta: EOS token = 2 '</s>'
llm_load_print_meta: UNK token = 0 '<unk>'
llm_load_print_meta: LF token = 13 '<0x0A>'
llm_load_tensors: ggml ctx size = 0.23 MB
llm_load_tensors: mem required = 46494.72 MB (+ 5120.00 MB per state)
....................................................................................................
llama_new_context_with_model: kv self size = 5120.00 MB
llama_new_context_with_model: compute buffer total size = 2097.47 MB
AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 |
From the server start screen I get:
llama2_70b_q5_gguf
model_path: llama-2-70b-chat.Q5_K_M.gguf / max_total_tokens: 16384 / auto_truncate: True / n_parts: -1 / n_gpu_layers: 30 / seed: -1 / f16_kv: True / logits_all: False / vocab_only: False / use_mlock: False / n_batch: 512 / last_n_tokens_size: 64 / use_mmap: True / cache: False / verbose: True / echo: True / cache_type: ram / cache_size: 2147483648 / low_vram: False / embedding: False / rope_freq_base: 82684.0 / rope_freq_scale: 0.25
I have tried the same model on oobabooga, where it works perfectly. I am not sure what else to try.
Thanks so so much, Doug
So this has been driving me crazy. I thought I was losing my mind. So I finally figured it out.
In my model definitions I had:
WizardLM_70B_q4_GGUF = LlamaCppModel(
model_path="wizardlm-70b-v1.0.Q4_K_M.gguf", # manual download
max_total_tokens=4096,
use_mlock=False,
)
but when I listed the model definitions in the API I got:
{
"id": "wizardlm_70b_q4_gguf",
"object": "model",
"owned_by": "LlamaCppModel",
"permissions": [
"model_path:wizardlm-70b-v1.0.Q4_K_M.gguf",
......
It converted the model id to lower case!
So I changed my model definition to be all lower case, and it works!
So to fix this, we either need to clearly document that model definition variable names MUST be lower case, or change the code to stop converting them.
But this is not the whole story: I have another model definition with upper-case letters that works... so something in my explanation is not quite right. Still, the procedure above definitely fixed my problem.
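A guess at what is going on, as a sketch (the registry and lookup names here are hypothetical, not llama-api's actual internals): if definitions are registered under lowercased keys, lookups must lowercase too, or mixed-case ids will miss.

```python
# Hypothetical sketch: model ids registered lowercased, so lookups must
# lowercase as well to stay case-insensitive end to end.
definitions = {"WizardLM_70B_q4_GGUF": object()}

# Register with lowercased keys, as the API listing suggests happens:
registry = {name.lower(): model for name, model in definitions.items()}

def get_model(model_id: str):
    return registry[model_id.lower()]  # case-insensitive lookup

assert get_model("wizardlm_70b_q4_gguf") is definitions["WizardLM_70B_q4_GGUF"]
assert get_model("WizardLM_70B_q4_GGUF") is definitions["WizardLM_70B_q4_GGUF"]
```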
Could there be some new format of gguf that we need to update the code for or something?
Thank for a promising project!
Can I use llama-api with LangChain instead of OpenAI? Can you provide an example?
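Since the server exposes an OpenAI-compatible API, one common pattern (a sketch, assuming the server listens on localhost:8000) is to point the OpenAI client settings at it; LangChain then works unchanged:

```python
# Sketch: route OpenAI-style clients at a local llama-api server.  The
# URL and model name are assumptions about your local setup.
import os

os.environ["OPENAI_API_BASE"] = "http://localhost:8000/v1"
os.environ["OPENAI_API_KEY"] = "not-needed-for-local"

# With LangChain installed, the usual wrapper should then work as-is:
# from langchain.chat_models import ChatOpenAI
# llm = ChatOpenAI(model_name="my_ggml")  # a name from model_definitions.py
# print(llm.predict("Hello!"))
```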
First of all, thank you for creating llama-api, it really works great! Just wanted to ask: is there a possibility to add embeddings models to model_definitions.py as well?
It seems that the automatic downloader sometimes gets corrupted or times out. I tried it with a smaller embeddings model and everything worked fine, it cached the model and embeddings work fine. But anything over roughly 100MB times out at some point, and I'm not sure why.
Alternatively, is there any way to manually put an embeddings model into the .cache folder? I'm not really sure about the structure here, it looks quite different than a regular model directory that I would download on my own.
Thank you!
PS: Happy to contribute a bit to the codebase if it is still actively maintained, as we will probably make some changes for better production serving. Even if it's just the readme file, to explain how to serve it in production behind Nginx with load balancing and multiple instances on one server.
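On the download timeouts: as a stopgap, a resumable fetch with an HTTP Range header can finish large files across retries. This is a stdlib-only sketch with placeholder paths, not the project's downloader:

```python
# Sketch: resume a partial download via an HTTP Range request (stdlib only).
import os
import urllib.request

def resume_download(url: str, dest: str, chunk: int = 1 << 20) -> int:
    """Download url to dest, resuming from any existing partial file."""
    start = os.path.getsize(dest) if os.path.exists(dest) else 0
    req = urllib.request.Request(url, headers={"Range": f"bytes={start}-"})
    with urllib.request.urlopen(req) as resp, open(dest, "ab") as f:
        while True:
            block = resp.read(chunk)
            if not block:
                break
            f.write(block)
    return os.path.getsize(dest)
```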
It's not clear from the documentation how to split VRAM over multiple GPUs with exllama.
I would love to use this in Google Colab, but I would need the URL to be public. Is there a way to do that with this?
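The usual way to expose a Colab-hosted server is a quick tunnel. A hedged sketch with cloudflared (the port must match your llama-api server; ngrok is a similar option):

```shell
# Sketch: open a quick tunnel to a local server from Colab.
wget -q https://github.com/cloudflare/cloudflared/releases/latest/download/cloudflared-linux-amd64
chmod +x cloudflared-linux-amd64
# Prints a public *.trycloudflare.com URL that forwards to localhost:8000.
./cloudflared-linux-amd64 tunnel --url http://localhost:8000
```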
Please add support for exllamav2
Hello, can someone guide me to run this nice API in CPU-only mode?
Hello,
I appreciate this API, but I am struggling to use the embedding part with LangChain. Is there any guidance on how to (if possible) use the embeddings endpoint with LangChain?
Jordan
I am getting this memory problem trying to run in llama-api. The exact same model works perfectly in oobabooga:
warning: failed to mlock 245760-byte buffer (after previously locking 0 bytes): Cannot allocate memory llm_load_tensors: mem required = 46494.72 MB (+ 1280.00 MB per state)
This is my model_definition:
llama2_70b_Q5_gguf = LlamaCppModel(
model_path="llama-2-70b-chat.Q5_K_M.gguf", # manual download
max_total_tokens=4096
)
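The mlock warning usually means the process's locked-memory limit (RLIMIT_MEMLOCK) is far smaller than the model. A sketch to check it; the likely fixes are raising the limit with `ulimit -l`, or passing use_mlock=False as the other definitions in this document do:

```python
# Sketch: inspect the locked-memory limit that "failed to mlock" is hitting.
import resource

soft, hard = resource.getrlimit(resource.RLIMIT_MEMLOCK)
print("RLIMIT_MEMLOCK soft/hard bytes:", soft, hard)
# If soft is tiny (64 KiB is a common default), either raise it via
# `ulimit -l` / limits.conf, or set use_mlock=False in the definition.
```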
llama-api log for Llama2_70b_q5_gguf:
llama_model_loader: - kv 18: general.quantization_version u32
llama_model_loader: - type f32: 161 tensors
llama_model_loader: - type q5_K: 481 tensors
llama_model_loader: - type q6_K: 81 tensors
llm_load_print_meta: format = GGUF V2 (latest)
llm_load_print_meta: arch = llama
llm_load_print_meta: vocab type = SPM
llm_load_print_meta: n_vocab = 32000
llm_load_print_meta: n_merges = 0
llm_load_print_meta: n_ctx_train = 4096
llm_load_print_meta: n_ctx = 4096
llm_load_print_meta: n_embd = 8192
llm_load_print_meta: n_head = 64
llm_load_print_meta: n_head_kv = 8
llm_load_print_meta: n_layer = 80
llm_load_print_meta: n_rot = 128
llm_load_print_meta: n_gqa = 8
llm_load_print_meta: f_norm_eps = 1.0e-05
llm_load_print_meta: f_norm_rms_eps = 1.0e-05
llm_load_print_meta: n_ff = 28672
llm_load_print_meta: freq_base = 10000.0
llm_load_print_meta: freq_scale = 1
llm_load_print_meta: model type = 70B
llm_load_print_meta: model ftype = mostly Q5_K - Medium
llm_load_print_meta: model size = 68.98 B
llm_load_print_meta: general.name = LLaMA v2
llm_load_print_meta: BOS token = 1 '<s>'
llm_load_print_meta: EOS token = 2 '</s>'
llm_load_print_meta: UNK token = 0 '<unk>'
llm_load_print_meta: LF token = 13 '<0x0A>'
llm_load_tensors: ggml ctx size = 0.23 MB
warning: failed to mlock 245760-byte buffer (after previously locking 0 bytes): Cannot allocate memory
llm_load_tensors: mem required = 46494.72 MB (+ 1280.00 MB per state)
Working oobabooga log for Llama2_70b_q5_gguf:
llama_model_loader: - kv 18: general.quantization_version u32
llama_model_loader: - type f32: 161 tensors
llama_model_loader: - type q5_K: 481 tensors
llama_model_loader: - type q6_K: 81 tensors
llm_load_print_meta: format = GGUF V2 (latest)
llm_load_print_meta: arch = llama
llm_load_print_meta: vocab type = SPM
llm_load_print_meta: n_vocab = 32000
llm_load_print_meta: n_merges = 0
llm_load_print_meta: n_ctx_train = 4096
llm_load_print_meta: n_ctx = 16384
llm_load_print_meta: n_embd = 8192
llm_load_print_meta: n_head = 64
llm_load_print_meta: n_head_kv = 8
llm_load_print_meta: n_layer = 80
llm_load_print_meta: n_rot = 128
llm_load_print_meta: n_gqa = 8
llm_load_print_meta: f_norm_eps = 1.0e-05
llm_load_print_meta: f_norm_rms_eps = 1.0e-05
llm_load_print_meta: n_ff = 28672
llm_load_print_meta: freq_base = 10000.0
llm_load_print_meta: freq_scale = 1
llm_load_print_meta: model type = 70B
llm_load_print_meta: model ftype = mostly Q5_K - Medium
llm_load_print_meta: model size = 68.98 B
llm_load_print_meta: general.name = LLaMA v2
llm_load_print_meta: BOS token = 1 '<s>'
llm_load_print_meta: EOS token = 2 '</s>'
llm_load_print_meta: UNK token = 0 '<unk>'
llm_load_print_meta: LF token = 13 '<0x0A>'
llm_load_tensors: ggml ctx size = 0.23 MB
llm_load_tensors: mem required = 46494.72 MB (+ 5120.00 MB per state)
....................................................................................................
llama_new_context_with_model: kv self size = 5120.00 MB
llama_new_context_with_model: compute buffer total size = 2097.47 MB
Hi, this was working really quite well on CPU for me, but after I gave the tool access to the paths for libcublas, it compiled with CUDA and now can't start or load because my 3080 doesn't have enough VRAM.
How do I completely force off CUDA so that I can use the tool again? I've tried taking the PATH and LD_ paths away, but the installer still seems to be building in CUDA mode.
Thanks
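A hedged sketch of forcing a CPU-only setup, assuming llama-cpp-python is the backend that got compiled with cuBLAS (llama-api's own installer flags may differ):

```shell
# Hide GPUs from the CUDA runtime entirely:
export CUDA_VISIBLE_DEVICES=""

# Rebuild llama-cpp-python without cuBLAS (flags from llama-cpp-python's
# build docs; adjust if llama-api wraps the install differently):
CMAKE_ARGS="-DLLAMA_CUBLAS=off" FORCE_CMAKE=1 \
  pip install --force-reinstall --no-cache-dir llama-cpp-python
```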
I know it's probably easy for everyone else, but I struggle every time I add a new model to test. I often get "model not found" when it seems the model is there. It would be a huge help if, instead of just returning "model not found", it reported the exact path and filename it tried to load, either in the terminal window or in the returned info.
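The request above could be as simple as a resolver that raises with the full path it tried. A sketch with hypothetical names (the real models directory may differ):

```python
# Hypothetical helper: fail with the exact path instead of "model not found".
from pathlib import Path

def resolve_model_path(filename: str, models_dir: str = "models") -> Path:
    path = Path(models_dir) / filename
    if not path.exists():
        raise FileNotFoundError(f"Model file not found at: {path.resolve()}")
    return path
```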
PS: when will the new branch that handles these definitions better be ready?
Thanks so much, Doug
I am on a box with 19 physical cores, but it looks like only 9 or 10 are being used. Is there a way to specify the number of cores to use?
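llama.cpp itself takes an n_threads setting, so the model definition may accept one too; the parameter name below mirrors llama-cpp-python and is an unverified guess for llama-api:

```python
# Sketch: count available cores and (hypothetically) pin the thread count.
import os

logical_cores = os.cpu_count()  # logical CPUs; physical cores may be fewer
print("logical cores:", logical_cores)

# my_model = LlamaCppModel(
#     model_path="llama-2-70b-chat.Q5_K_M.gguf",
#     max_total_tokens=4096,
#     n_threads=19,  # hypothetical parameter name, mirrors llama-cpp-python
# )
```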
I am very sorry for this newbie question. In model_definitions.py there are a number of parameters for each model. I assume these correspond to the settings given on the model page. My question is: how do I know the variable names you have used for each setting? For example:
airoboros_l2_13b_gguf = LlamaCppModel(
model_path="TheBloke/Airoboros-L2-13B-2.1-GGUF", # automatic download
max_total_tokens=8192,
rope_freq_base=26000,
rope_freq_scale=0.5,
n_gpu_layers=30,
n_batch=8192,
)
Take rope_freq_base: it doesn't appear in any of your other examples. I assume your examples use a non-exhaustive subset of all possible parameters. How can I find out the variable names you used? Is there a mapping chart somewhere?
Again I apologize for the newbie question that is probably painfully obvious to others.
Thanks, Doug
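Absent a mapping chart, one generic trick is to introspect the model class itself for its field names. A sketch with a stand-in dataclass, since I can't confirm how LlamaCppModel is declared:

```python
# Sketch: list a model class's accepted parameter names via dataclass
# introspection (ExampleModel stands in for LlamaCppModel).
import dataclasses

@dataclasses.dataclass
class ExampleModel:
    model_path: str
    max_total_tokens: int = 2048
    rope_freq_base: float = 10000.0

names = [f.name for f in dataclasses.fields(ExampleModel)]
print(names)  # ['model_path', 'max_total_tokens', 'rope_freq_base']
```

If the class is not a dataclass, `inspect.signature(LlamaCppModel.__init__)` is the analogous trick.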