c0sogi / llama-api
An OpenAI-like LLaMA inference API
License: MIT License
We are facing "ValueError - Can't patch loop of type <class 'uvloop.Loop'>" while using llama-api with FastAPI. Are there any known issues or resolutions?
For example openchat 3.5 wants this prompt template format:
GPT4 User: {prompt}<|end_of_turn|>GPT4 Assistant:
I tried a few things and managed to crash the server, so I am stuck. Can anyone help? I think the author is away...
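For anyone else stuck here, a minimal sketch of building that template by hand with plain string formatting (this is just an illustration of the target format, not llama-api's own template mechanism):

```python
# Sketch only: assemble a chat history into the openchat 3.5 format shown
# above using plain string formatting (not llama-api's template API).
def openchat_prompt(messages: list[dict]) -> str:
    parts = []
    for m in messages:
        role = "GPT4 User" if m["role"] == "user" else "GPT4 Assistant"
        parts.append(f"{role}: {m['content']}<|end_of_turn|>")
    parts.append("GPT4 Assistant:")  # leave the assistant turn open
    return "".join(parts)

print(openchat_prompt([{"role": "user", "content": "Hello"}]))
# GPT4 User: Hello<|end_of_turn|>GPT4 Assistant:
```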
Hi!
I have a strange suggestion :) Add a proxy object that sends requests to OpenAI whenever openai_replacement_models maps a model to openai_proxy (or something like it).
For example:
openai_replacement_models = {"gpt-3.5-turbo": "my_ggml", "gpt-4": "openai_proxy", "llama": "another_ggml"}
If the user calls gpt-3.5-turbo, the API server uses my_ggml; if the user calls gpt-4, it forwards the request to OpenAI.
This would make it easy to use both a local LLaMA and OpenAI at the same time.
PS: Thanks so much for the LangChain example!
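The suggested routing could look something like this sketch (the returned strings stand in for real local inference and a real upstream call):

```python
# Sketch of the proposed dispatch: entries mapped to "openai_proxy" are
# forwarded upstream, everything else is served by a local model.
openai_replacement_models = {
    "gpt-3.5-turbo": "my_ggml",
    "gpt-4": "openai_proxy",
    "llama": "another_ggml",
}

def route(model: str) -> str:
    target = openai_replacement_models.get(model, model)
    if target == "openai_proxy":
        return "forward to api.openai.com"  # stand-in for a real proxy call
    return f"serve locally with {target}"   # stand-in for local inference

print(route("gpt-4"))          # forward to api.openai.com
print(route("gpt-3.5-turbo"))  # serve locally with my_ggml
```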
How to reproduce:
1) Model being used:
wizardlm_70b_q4_gguf = LlamaCppModel(
model_path="wizardlm-70b-v1.0.Q4_K_M.gguf", # manual download
max_total_tokens=4096,
use_mlock=False,
)
2) From Swagger, run this query against the chat completion endpoint (note: the quotes were backslash-escaped in the original request):
{
"messages": [
{
"role": "system",
"content": "You are a helpful assistant."
},
{
"role": "user",
"content": "The topic is: 'Infant baptism is not biblical'. Give me at least 5 points. Output a table with these 4 columns: 'Point For Title Sentence','Point For Explanation with quotes and examples (min 5 sentences)', 'Point Against Title Sentence','Point Against Explanation with quotes and examples (min 5 sentences)'."
}
],
"model": "wizardlm_70b_q4_gguf"
}
3) When the server completes the query it says:
llama_print_timings: load time = 70698.17 ms
llama_print_timings: sample time = 353.01 ms / 861 runs ( 0.41 ms per token, 2439.01 tokens per second)
llama_print_timings: prompt eval time = 56156.99 ms / 95 tokens ( 591.13 ms per token, 1.69 tokens per second)
llama_print_timings: eval time = 920273.58 ms / 860 runs ( 1070.09 ms per token, 0.93 tokens per second)
llama_print_timings: total time = 978060.67 ms
[2023-09-17 15:00:28,909] llama_api.server.pools.llama:INFO - 🦙 [done for wizardlm_70b_q4_gguf]: (elapsed time: 978.1s | tokens: 860( 0.9tok/s))
INFO: 216.8.141.240:47056 - "POST /v1/chat/completions HTTP/1.1" 200 OK
doug@Ubuntu-2204-jammy-amd64-base:~/llama-api$
4) Meanwhile, the Swagger call still shows LOADING indefinitely.
Support the min_p sampler, which is implemented in ExLlamaV2.
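As a reference for implementers, a rough sketch of what min_p filtering does (thresholding against the top probability; ExLlamaV2's actual implementation works on logits and differs in detail):

```python
# Sketch of min_p sampling: drop tokens whose probability falls below
# min_p times the most likely token's probability, then renormalize.
def min_p_filter(probs: list[float], min_p: float) -> list[float]:
    threshold = min_p * max(probs)
    kept = [p if p >= threshold else 0.0 for p in probs]
    total = sum(kept)
    return [p / total for p in kept]

print(min_p_filter([0.5, 0.3, 0.15, 0.05], 0.2))  # the 0.05 tail is zeroed
```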
I hate to be a pain. You have been so helpful already, but I am stuck.
My generations are ending prematurely: "finish_reason": "length" as seen below
{
"id": "chatcmpl-4f6ac32a-287f-41ba-a4ec-8768e70ad2c3",
"object": "chat.completion",
"created": 1694531345,
"model": "llama-2-70b-chat.Q5_K_M",
"choices": [
{
"message": {
"role": "assistant",
"content": " Despite AI argue that AI advancements in technology, humans will always be required i, some professions.\nSTERRT Artificial intelligence (AI) has made significant advancementsin the recent years, it's impact on various industries, including restaurants and bars. While AI cannot replace bartenders, therelatively few tasks, AI argue that humans will always be ne needed these establishments.\nSTILL be required in ssociated with sERvices sector. Here are r several reasons whythat AI explainBelow:\nFirstly, AI cannot"
},
"index": 0,
"finish_reason": "length"
}
],
"usage": {
"prompt_tokens": 123,
"completion_tokens": 128,
"total_tokens": 251
}
}
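completion_tokens is exactly 128, which looks like a default max_tokens cap rather than a model problem. A hedged fix, assuming the endpoint honors the standard OpenAI max_tokens field, is to set it explicitly in the request:

```python
# Sketch: raise max_tokens in the request body so generation is not cut
# off at the default cap (the field name follows the OpenAI API).
request_body = {
    "model": "llama-2-70b-chat.Q5_K_M",
    "messages": [{"role": "user", "content": "Write a long essay."}],
    "max_tokens": 2048,  # without this, the server appears to stop at 128
}
print(request_body["max_tokens"])  # 2048
```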
My definition is:
llama2_70b_Q5_gguf = LlamaCppModel(
model_path="llama-2-70b-chat.Q5_K_M.gguf", # manual download
max_total_tokens=16384,
use_mlock=False
)
When I load I get:
llm_load_print_meta: format = GGUF V2 (latest)
llm_load_print_meta: arch = llama
llm_load_print_meta: vocab type = SPM
llm_load_print_meta: n_vocab = 32000
llm_load_print_meta: n_merges = 0
llm_load_print_meta: n_ctx_train = 4096
llm_load_print_meta: n_ctx = 16384
llm_load_print_meta: n_embd = 8192
llm_load_print_meta: n_head = 64
llm_load_print_meta: n_head_kv = 8
llm_load_print_meta: n_layer = 80
llm_load_print_meta: n_rot = 128
llm_load_print_meta: n_gqa = 8
llm_load_print_meta: f_norm_eps = 1.0e-05
llm_load_print_meta: f_norm_rms_eps = 1.0e-05
llm_load_print_meta: n_ff = 28672
llm_load_print_meta: freq_base = 82684.0
llm_load_print_meta: freq_scale = 0.25
llm_load_print_meta: model type = 70B
llm_load_print_meta: model ftype = mostly Q5_K - Medium
llm_load_print_meta: model size = 68.98 B
llm_load_print_meta: general.name = LLaMA v2
llm_load_print_meta: BOS token = 1 '<s>'
llm_load_print_meta: EOS token = 2 '</s>'
llm_load_print_meta: UNK token = 0 '<unk>'
llm_load_print_meta: LF token = 13 '<0x0A>'
llm_load_tensors: ggml ctx size = 0.23 MB
llm_load_tensors: mem required = 46494.72 MB (+ 5120.00 MB per state)
....................................................................................................
llama_new_context_with_model: kv self size = 5120.00 MB
llama_new_context_with_model: compute buffer total size = 2097.47 MB
AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 |
From the server start screen I get:
llama2_70b_q5_gguf
model_path: llama-2-70b-chat.Q5_K_M.gguf / max_total_tokens: 16384 / auto_truncate: True / n_parts: -1 / n_gpu_layers: 30 / seed: -1 / f16_kv: True / logits_all: False / vocab_only: False / use_mlock: False / n_batch: 512 / last_n_tokens_size: 64 / use_mmap: True / cache: False / verbose: True / echo: True / cache_type: ram / cache_size: 2147483648 / low_vram: False / embedding: False / rope_freq_base: 82684.0 / rope_freq_scale: 0.25
I have tried the same model on oobabooga, where it works perfectly. I am not sure what else to try.
Thanks so so much, Doug
So this has been driving me crazy. I thought I was losing my mind. So I finally figured it out.
In my model definitions I had:
WizardLM_70B_q4_GGUF = LlamaCppModel(
model_path="wizardlm-70b-v1.0.Q4_K_M.gguf", # manual download
max_total_tokens=4096,
use_mlock=False,
)
but when I listed the model definitions in the API I got:
{
"id": "wizardlm_70b_q4_gguf",
"object": "model",
"owned_by": "LlamaCppModel",
"permissions": [
"model_path:wizardlm-70b-v1.0.Q4_K_M.gguf",
......
It converted the model id to lower case!
So I changed my model definition to be all lower case, and it works!
So to fix this, we either need to clearly document that model definition variable names MUST be lower case, or change the code to stop converting them.
But this is not the whole story: I have another model definition with upper-case letters that works... so something in my explanation is not quite right. Still, the procedure above definitely fixed my problem.
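A guess at what is going on, as a sketch (the registry and lookup names here are hypothetical, not llama-api's actual internals): if definitions are registered under lowercased keys, lookups must lowercase too, or mixed-case ids will miss.

```python
# Hypothetical sketch: model ids registered lowercased, so lookups must
# lowercase as well to stay case-insensitive end to end.
definitions = {"WizardLM_70B_q4_GGUF": object()}

# Register with lowercased keys, as the API listing suggests happens:
registry = {name.lower(): model for name, model in definitions.items()}

def get_model(model_id: str):
    return registry[model_id.lower()]  # case-insensitive lookup

assert get_model("wizardlm_70b_q4_gguf") is definitions["WizardLM_70B_q4_GGUF"]
assert get_model("WizardLM_70B_q4_GGUF") is definitions["WizardLM_70B_q4_GGUF"]
```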
Could there be some new format of gguf that we need to update the code for or something?
Thank for a promising project!
Can I use llama-api with LangChain instead of OpenAI? Can you provide an example?
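Since the server exposes an OpenAI-compatible API, one common pattern (a sketch, assuming the server listens on localhost:8000) is to point the OpenAI client settings at it; LangChain then works unchanged:

```python
# Sketch: route OpenAI-style clients at a local llama-api server.  The
# URL and model name are assumptions about your local setup.
import os

os.environ["OPENAI_API_BASE"] = "http://localhost:8000/v1"
os.environ["OPENAI_API_KEY"] = "not-needed-for-local"

# With LangChain installed, the usual wrapper should then work as-is:
# from langchain.chat_models import ChatOpenAI
# llm = ChatOpenAI(model_name="my_ggml")  # a name from model_definitions.py
# print(llm.predict("Hello!"))
```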
First of all, thank you for creating llama-api, it really works great! Just wanted to ask: is there a possibility to add embeddings models to model_definitions.py as well?
It seems that the automatic downloader sometimes gets corrupted or times out. I tried it with a smaller embeddings model and everything worked fine, it cached the model and embeddings work fine. But anything over roughly 100MB times out at some point, and I'm not sure why.
Alternatively, is there any way to manually put an embeddings model into the .cache folder? I'm not really sure about the structure here, it looks quite different than a regular model directory that I would download on my own.
Thank you!
PS: Happy to contribute a bit to the codebase if it is still actively maintained, as we will probably make some changes for better production serving. Even if it's just the readme file, to explain how to serve it in production behind Nginx with load balancing and multiple instances on one server.
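On the download timeouts: as a stopgap, a resumable fetch with an HTTP Range header can finish large files across retries. This is a stdlib-only sketch with placeholder paths, not the project's downloader:

```python
# Sketch: resume a partial download via an HTTP Range request (stdlib only).
import os
import urllib.request

def resume_download(url: str, dest: str, chunk: int = 1 << 20) -> int:
    """Download url to dest, resuming from any existing partial file."""
    start = os.path.getsize(dest) if os.path.exists(dest) else 0
    req = urllib.request.Request(url, headers={"Range": f"bytes={start}-"})
    with urllib.request.urlopen(req) as resp, open(dest, "ab") as f:
        while True:
            block = resp.read(chunk)
            if not block:
                break
            f.write(block)
    return os.path.getsize(dest)
```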
It's not clear from the documentation how to split VRAM over multiple GPUs with exllama.
I would love to use this in Google Colab, but I would need the URL to be public. Is there a way to do that with this?
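The usual way to expose a Colab-hosted server is a quick tunnel. A hedged sketch with cloudflared (the port must match your llama-api server; ngrok is a similar option):

```shell
# Sketch: open a quick tunnel to a local server from Colab.
wget -q https://github.com/cloudflare/cloudflared/releases/latest/download/cloudflared-linux-amd64
chmod +x cloudflared-linux-amd64
# Prints a public *.trycloudflare.com URL that forwards to localhost:8000.
./cloudflared-linux-amd64 tunnel --url http://localhost:8000
```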
Please add support for exllamav2
Hello, can someone guide me to run this nice API in CPU-only mode?
Hello,
I appreciate this API, but I am struggling to use the embedding part with LangChain. Is there any guidance on how to (if possible) use the embeddings endpoint with LangChain?
Jordan
I am getting this memory problem trying to run in llama-api. The exact same model works perfectly in oobabooga:
warning: failed to mlock 245760-byte buffer (after previously locking 0 bytes): Cannot allocate memory llm_load_tensors: mem required = 46494.72 MB (+ 1280.00 MB per state)
This is my model_definition:
llama2_70b_Q5_gguf = LlamaCppModel(
model_path="llama-2-70b-chat.Q5_K_M.gguf", # manual download
max_total_tokens=4096
)
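The mlock warning usually means the process's locked-memory limit (RLIMIT_MEMLOCK) is far smaller than the model. A sketch to check it; the likely fixes are raising the limit with `ulimit -l`, or passing use_mlock=False as the other definitions in this document do:

```python
# Sketch: inspect the locked-memory limit that "failed to mlock" is hitting.
import resource

soft, hard = resource.getrlimit(resource.RLIMIT_MEMLOCK)
print("RLIMIT_MEMLOCK soft/hard bytes:", soft, hard)
# If soft is tiny (64 KiB is a common default), either raise it via
# `ulimit -l` / limits.conf, or set use_mlock=False in the definition.
```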
llama-api log for Llama2_70b_q5_gguf:
llama_model_loader: - kv 18: general.quantization_version u32
llama_model_loader: - type f32: 161 tensors
llama_model_loader: - type q5_K: 481 tensors
llama_model_loader: - type q6_K: 81 tensors
llm_load_print_meta: format = GGUF V2 (latest)
llm_load_print_meta: arch = llama
llm_load_print_meta: vocab type = SPM
llm_load_print_meta: n_vocab = 32000
llm_load_print_meta: n_merges = 0
llm_load_print_meta: n_ctx_train = 4096
llm_load_print_meta: n_ctx = 4096
llm_load_print_meta: n_embd = 8192
llm_load_print_meta: n_head = 64
llm_load_print_meta: n_head_kv = 8
llm_load_print_meta: n_layer = 80
llm_load_print_meta: n_rot = 128
llm_load_print_meta: n_gqa = 8
llm_load_print_meta: f_norm_eps = 1.0e-05
llm_load_print_meta: f_norm_rms_eps = 1.0e-05
llm_load_print_meta: n_ff = 28672
llm_load_print_meta: freq_base = 10000.0
llm_load_print_meta: freq_scale = 1
llm_load_print_meta: model type = 70B
llm_load_print_meta: model ftype = mostly Q5_K - Medium
llm_load_print_meta: model size = 68.98 B
llm_load_print_meta: general.name = LLaMA v2
llm_load_print_meta: BOS token = 1 '<s>'
llm_load_print_meta: EOS token = 2 '</s>'
llm_load_print_meta: UNK token = 0 '<unk>'
llm_load_print_meta: LF token = 13 '<0x0A>'
llm_load_tensors: ggml ctx size = 0.23 MB
warning: failed to mlock 245760-byte buffer (after previously locking 0 bytes): Cannot allocate memory
llm_load_tensors: mem required = 46494.72 MB (+ 1280.00 MB per state)
Working oobabooga log for Llama2_70b_q5_gguf:
llama_model_loader: - kv 18: general.quantization_version u32
llama_model_loader: - type f32: 161 tensors
llama_model_loader: - type q5_K: 481 tensors
llama_model_loader: - type q6_K: 81 tensors
llm_load_print_meta: format = GGUF V2 (latest)
llm_load_print_meta: arch = llama
llm_load_print_meta: vocab type = SPM
llm_load_print_meta: n_vocab = 32000
llm_load_print_meta: n_merges = 0
llm_load_print_meta: n_ctx_train = 4096
llm_load_print_meta: n_ctx = 16384
llm_load_print_meta: n_embd = 8192
llm_load_print_meta: n_head = 64
llm_load_print_meta: n_head_kv = 8
llm_load_print_meta: n_layer = 80
llm_load_print_meta: n_rot = 128
llm_load_print_meta: n_gqa = 8
llm_load_print_meta: f_norm_eps = 1.0e-05
llm_load_print_meta: f_norm_rms_eps = 1.0e-05
llm_load_print_meta: n_ff = 28672
llm_load_print_meta: freq_base = 10000.0
llm_load_print_meta: freq_scale = 1
llm_load_print_meta: model type = 70B
llm_load_print_meta: model ftype = mostly Q5_K - Medium
llm_load_print_meta: model size = 68.98 B
llm_load_print_meta: general.name = LLaMA v2
llm_load_print_meta: BOS token = 1 '<s>'
llm_load_print_meta: EOS token = 2 '</s>'
llm_load_print_meta: UNK token = 0 '<unk>'
llm_load_print_meta: LF token = 13 '<0x0A>'
llm_load_tensors: ggml ctx size = 0.23 MB
llm_load_tensors: mem required = 46494.72 MB (+ 5120.00 MB per state)
....................................................................................................
llama_new_context_with_model: kv self size = 5120.00 MB
llama_new_context_with_model: compute buffer total size = 2097.47 MB
Hi, this was working really quite well on CPU for me, but after I gave the tool access to the paths for libcublas, it compiled with CUDA and now can't start or load because my 3080 doesn't have enough VRAM.
How do I completely force off CUDA so that I can use the tool again? I've tried taking the PATH and LD_ paths away, but the installer still seems to be building in CUDA mode.
Thanks
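A hedged sketch of forcing a CPU-only setup, assuming llama-cpp-python is the backend that got compiled with cuBLAS (llama-api's own installer flags may differ):

```shell
# Hide GPUs from the CUDA runtime entirely:
export CUDA_VISIBLE_DEVICES=""

# Rebuild llama-cpp-python without cuBLAS (flags from llama-cpp-python's
# build docs; adjust if llama-api wraps the install differently):
CMAKE_ARGS="-DLLAMA_CUBLAS=off" FORCE_CMAKE=1 \
  pip install --force-reinstall --no-cache-dir llama-cpp-python
```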
I know it's probably easy for everyone else, but I struggle every time I add a new model to test. I often get "model not found" when it seems the model is there. It would be a huge help if, instead of just returning "model not found", it reported the exact path and filename it tried to load, either in the terminal window or in the returned info.
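The request above could be as simple as a resolver that raises with the full path it tried. A sketch with hypothetical names (the real models directory may differ):

```python
# Hypothetical helper: fail with the exact path instead of "model not found".
from pathlib import Path

def resolve_model_path(filename: str, models_dir: str = "models") -> Path:
    path = Path(models_dir) / filename
    if not path.exists():
        raise FileNotFoundError(f"Model file not found at: {path.resolve()}")
    return path
```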
PS: when will the new branch that handles these definitions better be ready?
Thanks so much, Doug
I am on a box with 19 physical cores, but it looks like only 9 or 10 are being used. Is there a way to specify the number of cores to use?
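llama.cpp itself takes an n_threads setting, so the model definition may accept one too; the parameter name below mirrors llama-cpp-python and is an unverified guess for llama-api:

```python
# Sketch: count available cores and (hypothetically) pin the thread count.
import os

logical_cores = os.cpu_count()  # logical CPUs; physical cores may be fewer
print("logical cores:", logical_cores)

# my_model = LlamaCppModel(
#     model_path="llama-2-70b-chat.Q5_K_M.gguf",
#     max_total_tokens=4096,
#     n_threads=19,  # hypothetical parameter name, mirrors llama-cpp-python
# )
```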
I am very sorry for this newbie question. In model_definitions.py there are a number of parameters for each model. I assume these correspond to the settings given on the model page. My question is: how do I know the variable names you have used for each setting? For example:
airoboros_l2_13b_gguf = LlamaCppModel(
model_path="TheBloke/Airoboros-L2-13B-2.1-GGUF", # automatic download
max_total_tokens=8192,
rope_freq_base=26000,
rope_freq_scale=0.5,
n_gpu_layers=30,
n_batch=8192,
)
Take rope_freq_base: it doesn't appear in any of your other examples. I assume your examples use a non-exhaustive subset of all possible parameters. How can I find out the variable names you used? Is there a mapping chart somewhere?
Again I apologize for the newbie question that is probably painfully obvious to others.
Thanks, Doug
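Absent a mapping chart, one generic trick is to introspect the model class itself for its field names. A sketch with a stand-in dataclass, since I can't confirm how LlamaCppModel is declared:

```python
# Sketch: list a model class's accepted parameter names via dataclass
# introspection (ExampleModel stands in for LlamaCppModel).
import dataclasses

@dataclasses.dataclass
class ExampleModel:
    model_path: str
    max_total_tokens: int = 2048
    rope_freq_base: float = 10000.0

names = [f.name for f in dataclasses.fields(ExampleModel)]
print(names)  # ['model_path', 'max_total_tokens', 'rope_freq_base']
```

If the class is not a dataclass, `inspect.signature(LlamaCppModel.__init__)` is the analogous trick.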