
llama-api's People

Contributors

abr177, c0sogi

llama-api's Issues

FastAPI + llama-api issue

We are facing "ValueError - Can't patch loop of type <class 'uvloop.Loop'>" while using llama-api with FastAPI. Are there any known issues or resolutions?
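
A possible workaround, assuming the error comes from a library such as nest_asyncio refusing to patch uvloop's loop: launch uvicorn on the standard asyncio event loop instead of uvloop (untested sketch; the app import, host, and port are placeholders):

# hypothetical launcher -- forces uvicorn onto the stock asyncio loop so that
# loop-patching libraries (which cannot patch uvloop.Loop) have a loop they can patch
import uvicorn

from myapp import app  # replace with the actual FastAPI app import

if __name__ == "__main__":
    uvicorn.run(app, host="0.0.0.0", port=8000, loop="asyncio")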

How can I use a specific prompt template?

For example, OpenChat 3.5 expects this prompt template:

GPT4 User: {prompt}<|end_of_turn|>GPT4 Assistant:

I tried a few things and managed to crash the server, so I am stuck. Can anyone help? I think the author is away...
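
One thing that might work in the meantime, assuming the server exposes an OpenAI-compatible /v1/completions route alongside /v1/chat/completions: format the OpenChat turn markers by hand and send a plain completion request. A sketch, where the model id, host, and port are placeholders:

# build the OpenChat 3.5 prompt manually and call the plain completions endpoint
import requests

def openchat_prompt(user_message: str) -> str:
    # OpenChat 3.5 turn format
    return f"GPT4 User: {user_message}<|end_of_turn|>GPT4 Assistant:"

resp = requests.post(
    "http://localhost:8000/v1/completions",  # host/port are assumptions
    json={
        "model": "my_openchat_model",         # hypothetical id from model_definitions.py
        "prompt": openchat_prompt("Hello, who are you?"),
        "stop": ["<|end_of_turn|>"],
        "max_tokens": 256,
    },
    timeout=600,
)
print(resp.json()["choices"][0]["text"])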

Proxy to OpenAI

Hi!
I have a strange suggestion :) Add a proxy object that forwards requests to OpenAI whenever openai_replacement_models maps a model to openai_proxy (or something like it).

For example:
openai_replacement_models = {"gpt-3.5-turbo": "my_ggml", "gpt-4": "openai_proxy", "lllama": "another_ggml"}
If a user calls gpt-3.5-turbo, the API server uses my_ggml; if a user calls gpt-4, the request is sent to OpenAI.
This would make it easy to use both local llama models and OpenAI at the same time.
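
Roughly, the routing could look like this (just a sketch of the idea, not the project's actual code; openai_proxy is the made-up sentinel value and run_local_completion is a stand-in for the existing local path):

import os
import requests

openai_replacement_models = {"gpt-3.5-turbo": "my_ggml", "gpt-4": "openai_proxy", "lllama": "another_ggml"}

def run_local_completion(body: dict) -> dict:
    # stand-in for the server's existing local inference path
    raise NotImplementedError

def route_chat_completion(body: dict) -> dict:
    target = openai_replacement_models.get(body["model"], body["model"])
    if target == "openai_proxy":
        # forward the request unchanged to the real OpenAI API
        resp = requests.post(
            "https://api.openai.com/v1/chat/completions",
            headers={"Authorization": f"Bearer {os.environ['OPENAI_API_KEY']}"},
            json=body,
            timeout=600,
        )
        return resp.json()
    body["model"] = target  # substitute the local model id
    return run_local_completion(body)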

PS: Thanks so much for the LangChain example!

Long generations don't return data even though the server says 200 OK; the Swagger screen just says LOADING forever

How to reproduce:

1) Model being used:

wizardlm_70b_q4_gguf = LlamaCppModel(
    model_path="wizardlm-70b-v1.0.Q4_K_M.gguf",  # manual download
    max_total_tokens=4096,
    use_mlock=False,
)

2) From Swagger, run this query against the chat completion endpoint. Please note that in the actual request the inner quotes are escaped with backslashes (not shown below).
{
  "messages": [
    {
      "role": "system",
      "content": "You are a helpful assistant."
    },
    {
      "role": "user",
      "content": "The topic is: 'Infant baptism is not biblical'. Give me at least 5 points. Output a table with these 4 columns: 'Point For Title Sentence','Point For Explanation with quotes and examples (min 5 sentences)', 'Point Against Title Sentence','Point Against Explanation with quotes and examples (min 5 sentences)'."
    }
  ],
  "model": "wizardlm_70b_q4_gguf"
}

3) When the server completes the query it says:

llama_print_timings: load time = 70698.17 ms
llama_print_timings: sample time = 353.01 ms / 861 runs ( 0.41 ms per token, 2439.01 tokens per second)
llama_print_timings: prompt eval time = 56156.99 ms / 95 tokens ( 591.13 ms per token, 1.69 tokens per second)
llama_print_timings: eval time = 920273.58 ms / 860 runs ( 1070.09 ms per token, 0.93 tokens per second)
llama_print_timings: total time = 978060.67 ms
[2023-09-17 15:00:28,909] llama_api.server.pools.llama:INFO - 🦙 [done for wizardlm_70b_q4_gguf]: (elapsed time: 978.1s | tokens: 860( 0.9tok/s))
INFO: 216.8.141.240:47056 - "POST /v1/chat/completions HTTP/1.1" 200 OK
doug@Ubuntu-2204-jammy-amd64-base:~/llama-api$

4) The swagger call still says LOADING infinitely

[screenshot: the Swagger UI still showing the loading spinner]
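
To rule out the Swagger UI itself, the same request can be sent from a plain client with a long timeout; a minimal sketch, assuming the server listens on localhost:8000:

# send the same chat completion outside Swagger with a generous timeout,
# to check whether the response body actually arrives after ~16 minutes
import requests

payload = {
    "model": "wizardlm_70b_q4_gguf",
    "messages": [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "The topic is: 'Infant baptism is not biblical'. Give me at least 5 points."},
    ],
}
resp = requests.post(
    "http://localhost:8000/v1/chat/completions",  # host/port are assumptions
    json=payload,
    timeout=3600,  # the generation above took ~978 s, so allow plenty of headroom
)
print(resp.status_code)
print(resp.json()["choices"][0]["message"]["content"])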

Generation stops at 251 tokens - works fine on oobabooga

I hate to be a pain. You have been so helpful already, but I am stuck.

My generations are ending prematurely with "finish_reason": "length", as seen below:

{
  "id": "chatcmpl-4f6ac32a-287f-41ba-a4ec-8768e70ad2c3",
  "object": "chat.completion",
  "created": 1694531345,
  "model": "llama-2-70b-chat.Q5_K_M",
  "choices": [
    {
      "message": {
        "role": "assistant",
        "content": " Despite AI argue that AI advancements in technology, humans will always be required i, some professions.\nSTERRT Artificial intelligence (AI) has made significant advancementsin the recent years, it's impact on various industries, including restaurants and bars. While AI cannot replace bartenders, therelatively few tasks, AI argue that humans will always be ne needed these establishments.\nSTILL be required in ssociated with sERvices sector. Here are r several reasons whythat AI explainBelow:\nFirstly, AI cannot"
      },
      "index": 0,
      "finish_reason": "length"
    }
  ],
  "usage": {
    "prompt_tokens": 123,
    "completion_tokens": 128,
    "total_tokens": 251
  }
}

My definition is:

llama2_70b_Q5_gguf = LlamaCppModel(
    model_path="llama-2-70b-chat.Q5_K_M.gguf",  # manual download
    max_total_tokens=16384,
    use_mlock=False,
)

When I load I get:

llm_load_print_meta: format = GGUF V2 (latest)
llm_load_print_meta: arch = llama
llm_load_print_meta: vocab type = SPM
llm_load_print_meta: n_vocab = 32000
llm_load_print_meta: n_merges = 0
llm_load_print_meta: n_ctx_train = 4096
llm_load_print_meta: n_ctx = 16384
llm_load_print_meta: n_embd = 8192
llm_load_print_meta: n_head = 64
llm_load_print_meta: n_head_kv = 8
llm_load_print_meta: n_layer = 80
llm_load_print_meta: n_rot = 128
llm_load_print_meta: n_gqa = 8
llm_load_print_meta: f_norm_eps = 1.0e-05
llm_load_print_meta: f_norm_rms_eps = 1.0e-05
llm_load_print_meta: n_ff = 28672
llm_load_print_meta: freq_base = 82684.0
llm_load_print_meta: freq_scale = 0.25
llm_load_print_meta: model type = 70B
llm_load_print_meta: model ftype = mostly Q5_K - Medium
llm_load_print_meta: model size = 68.98 B
llm_load_print_meta: general.name = LLaMA v2
llm_load_print_meta: BOS token = 1 '<s>'
llm_load_print_meta: EOS token = 2 '</s>'
llm_load_print_meta: UNK token = 0 '<unk>'
llm_load_print_meta: LF token = 13 '<0x0A>'
llm_load_tensors: ggml ctx size = 0.23 MB
llm_load_tensors: mem required = 46494.72 MB (+ 5120.00 MB per state)
....................................................................................................
llama_new_context_with_model: kv self size = 5120.00 MB
llama_new_context_with_model: compute buffer total size = 2097.47 MB
AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 |

From the server start screen I get:

llama2_70b_q5_gguf
model_path: llama-2-70b-chat.Q5_K_M.gguf / max_total_tokens: 16384 / auto_truncate: True / n_parts: -1 / n_gpu_layers: 30 / seed: -1 / f16_kv: True / logits_all: False / vocab_only: False / use_mlock: False / n_batch: 512 / last_n_tokens_size: 64 / use_mmap: True / cache: False / verbose: True / echo: True / cache_type: ram / cache_size: 2147483648 / low_vram: False / embedding: False / rope_freq_base: 82684.0 / rope_freq_scale: 0.25

I have tried:

  1. Starting the server specifying the max tokens: python3 main.py --max-tokens-limit 4096
  2. I have set my ulimit to unlimited
  3. I have set max_total_tokens: 16384
  4. I tried setting the rope settings to be the same as oobabooga:
    rope_freq_base=10000,
    rope_freq_scale=1,
    BUT THESE SETTINGS WERE IGNORED.

The same model works perfectly on oobabooga.

I am not sure what else to try.
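
The only other idea I have, and it is untested, is that the cap might come from the request rather than the model definition, since completion_tokens is exactly 128; a request that sets max_tokens explicitly would rule that out (host, port, and the user message below are placeholders):

# same kind of request, but with max_tokens set explicitly
import requests

payload = {
    "model": "llama2_70b_q5_gguf",
    "max_tokens": 2048,  # explicit, instead of whatever default caps output at 128
    "messages": [
        {"role": "user", "content": "Will AI make bartenders obsolete? Discuss in detail."},
    ],
}
resp = requests.post("http://localhost:8000/v1/chat/completions", json=payload, timeout=3600)
print(resp.json()["usage"])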

Thanks so so much, Doug

BUG: I found the model path bug!

So this has been driving me crazy. I thought I was losing my mind. So I finally figured it out.

In my model definitions I had:

WizardLM_70B_q4_GGUF = LlamaCppModel(
    model_path="wizardlm-70b-v1.0.Q4_K_M.gguf",  # manual download
    max_total_tokens=4096,
    use_mlock=False,
)

but when I listed the model definitions in the API I got:

{
  "id": "wizardlm_70b_q4_gguf",
  "object": "model",
  "owned_by": "LlamaCppModel",
  "permissions": [
    "model_path:wizardlm-70b-v1.0.Q4_K_M.gguf",

......

It converted the model id to lower case!!!!!!!!!!
So I changed my model definition to be all lower case AND IT WORKS!

So, to fix this, either we need to clearly document that model definition variable names MUST be lower case, or the code should be changed so it does not convert them to lower case.

** But this is not the whole story. I have another model definition with upper-case letters that works... so something I am saying is not quite right. Still, the procedure above definitely fixed my problem.
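
For anyone hitting the same thing, the workaround boils down to keeping the two names in sync, e.g. lower-casing the id before putting it in the request (just an illustration, not the project's code):

# the server appears to lower-case model ids, so normalize before sending
definition_name = "WizardLM_70B_q4_GGUF"    # name as written in model_definitions.py
request_model_id = definition_name.lower()  # "wizardlm_70b_q4_gguf" -- what the API expects
print(request_model_id)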

Any way to define an embeddings model in model_definitions.py?

First of all, thank you for creating llama-api; it really works great! Just wanted to ask: is it possible to add embeddings models to model_definitions.py as well?

It seems that the automatic downloader sometimes gets corrupted or times out. I tried it with a smaller embeddings model and everything worked fine: it cached the model and embeddings work. But anything over roughly 100 MB times out at some point, and I'm not sure why.

Alternatively, is there any way to manually put an embeddings model into the .cache folder? I'm not really sure about the structure here; it looks quite different from a regular model directory that I would download on my own.
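
One thing I am considering, assuming the embeddings models come from the Hugging Face Hub: pre-download the model with huggingface_hub (which resumes interrupted downloads) and then point the definition at the local directory. I have not verified the exact .cache layout llama-api expects, so treat this as a sketch with placeholder names:

# pre-fetch an embedding model locally so the server's automatic downloader
# never has to stream the large files itself
from huggingface_hub import snapshot_download

local_dir = snapshot_download(
    repo_id="intfloat/e5-large-v2",     # hypothetical embedding model
    local_dir="./models/e5-large-v2",   # hypothetical target directory
)
print("model downloaded to:", local_dir)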

Thank you!

PS: Happy to contribute a bit to the codebase if it is still actively maintained, as we will probably make some changes for better production serving. Even if it's just the README explaining how to serve it in production behind Nginx with load balancing and multiple instances on one server.

exllama GPU split

It's not clear from the documentation how to split VRAM over multiple GPUs with exllama.

Usage of embedding through langchain

Hello,

I appreciate this API, but I am struggling to use the embedding part with LangChain. Is there any guidance on how to (if possible) use embeddings with LangChain?
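
For context, what I am attempting looks roughly like this, assuming the server exposes an OpenAI-compatible /v1/embeddings endpoint and that my base URL and model id are right:

# point LangChain's OpenAI embeddings wrapper at the local llama-api server
from langchain.embeddings import OpenAIEmbeddings

embeddings = OpenAIEmbeddings(
    openai_api_base="http://localhost:8000/v1",  # local server; host/port are assumptions
    openai_api_key="not-needed",                 # dummy key for a local server
    model="my_embedding_model",                  # hypothetical id from model_definitions.py
)
vector = embeddings.embed_query("Hello, world!")
print(len(vector))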

Jordan

warning: failed to mlock 245760-byte buffer (after previously locking 0 bytes): Cannot allocate memory llm_load_tensors: mem required = 46494.72 MB (+ 1280.00 MB per state)

I am getting this memory problem when trying to run the model in llama-api. The exact same model works perfectly in oobabooga.

warning: failed to mlock 245760-byte buffer (after previously locking 0 bytes): Cannot allocate memory llm_load_tensors: mem required = 46494.72 MB (+ 1280.00 MB per state)

This is my model_definition:

llama2_70b_Q5_gguf = LlamaCppModel(
    model_path="llama-2-70b-chat.Q5_K_M.gguf",  # manual download
    max_total_tokens=4096
)

llama-api output (Llama2_70b_q5_gguf):

llama_model_loader: - kv 18: general.quantization_version u32
llama_model_loader: - type f32: 161 tensors
llama_model_loader: - type q5_K: 481 tensors
llama_model_loader: - type q6_K: 81 tensors
llm_load_print_meta: format = GGUF V2 (latest)
llm_load_print_meta: arch = llama
llm_load_print_meta: vocab type = SPM
llm_load_print_meta: n_vocab = 32000
llm_load_print_meta: n_merges = 0
llm_load_print_meta: n_ctx_train = 4096
llm_load_print_meta: n_ctx = 4096
llm_load_print_meta: n_embd = 8192
llm_load_print_meta: n_head = 64
llm_load_print_meta: n_head_kv = 8
llm_load_print_meta: n_layer = 80
llm_load_print_meta: n_rot = 128
llm_load_print_meta: n_gqa = 8
llm_load_print_meta: f_norm_eps = 1.0e-05
llm_load_print_meta: f_norm_rms_eps = 1.0e-05
llm_load_print_meta: n_ff = 28672
llm_load_print_meta: freq_base = 10000.0
llm_load_print_meta: freq_scale = 1
llm_load_print_meta: model type = 70B
llm_load_print_meta: model ftype = mostly Q5_K - Medium
llm_load_print_meta: model size = 68.98 B
llm_load_print_meta: general.name = LLaMA v2
llm_load_print_meta: BOS token = 1 '<s>'
llm_load_print_meta: EOS token = 2 '</s>'
llm_load_print_meta: UNK token = 0 '<unk>'
llm_load_print_meta: LF token = 13 '<0x0A>'
llm_load_tensors: ggml ctx size = 0.23 MB
warning: failed to mlock 245760-byte buffer (after previously locking 0 bytes): Cannot allocate memory
llm_load_tensors: mem required = 46494.72 MB (+ 1280.00 MB per state)

Working output from oobabooga (Llama2_70b_q5_gguf):

llama_model_loader: - kv 18: general.quantization_version u32
llama_model_loader: - type f32: 161 tensors
llama_model_loader: - type q5_K: 481 tensors
llama_model_loader: - type q6_K: 81 tensors
llm_load_print_meta: format = GGUF V2 (latest)
llm_load_print_meta: arch = llama
llm_load_print_meta: vocab type = SPM
llm_load_print_meta: n_vocab = 32000
llm_load_print_meta: n_merges = 0
llm_load_print_meta: n_ctx_train = 4096
llm_load_print_meta: n_ctx = 16384
llm_load_print_meta: n_embd = 8192
llm_load_print_meta: n_head = 64
llm_load_print_meta: n_head_kv = 8
llm_load_print_meta: n_layer = 80
llm_load_print_meta: n_rot = 128
llm_load_print_meta: n_gqa = 8
llm_load_print_meta: f_norm_eps = 1.0e-05
llm_load_print_meta: f_norm_rms_eps = 1.0e-05
llm_load_print_meta: n_ff = 28672
llm_load_print_meta: freq_base = 10000.0
llm_load_print_meta: freq_scale = 1
llm_load_print_meta: model type = 70B
llm_load_print_meta: model ftype = mostly Q5_K - Medium
llm_load_print_meta: model size = 68.98 B
llm_load_print_meta: general.name = LLaMA v2
llm_load_print_meta: BOS token = 1 '<s>'
llm_load_print_meta: EOS token = 2 '</s>'
llm_load_print_meta: UNK token = 0 '<unk>'
llm_load_print_meta: LF token = 13 '<0x0A>'
llm_load_tensors: ggml ctx size = 0.23 MB
llm_load_tensors: mem required = 46494.72 MB (+ 5120.00 MB per state)
....................................................................................................
llama_new_context_with_model: kv self size = 5120.00 MB
llama_new_context_with_model: compute buffer total size = 2097.47 MB
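
For what it's worth, the warning itself only means the process is not allowed to lock that much memory (the RLIMIT_MEMLOCK limit); the model can still load without locked pages. Two things that may silence it, neither verified here: raise the limit (ulimit -l unlimited) before starting the server, or disable memory locking in the definition, e.g.:

# same definition, but with memory locking explicitly disabled so llama.cpp
# does not try to mlock the ~46 GB of weights
llama2_70b_Q5_gguf = LlamaCppModel(
    model_path="llama-2-70b-chat.Q5_K_M.gguf",  # manual download
    max_total_tokens=4096,
    use_mlock=False,  # avoid the "failed to mlock ... Cannot allocate memory" warning
)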

Stopped working after enabling CUDA

Hi, this was working quite well on CPU for me, but after I gave the tool access to the libcublas paths it compiled with CUDA support, and now it can't start or load models because my 3080 doesn't have enough VRAM.

How do I completely force CUDA off so that I can use the tool again? I've tried removing the PATH and LD_* entries, but the installer still seems to build in CUDA mode.
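
Two things that might force CPU-only operation again, neither of which I am certain about for this project: hide the GPUs from the process (CUDA_VISIBLE_DEVICES set to an empty string in the shell before starting the server) and/or set n_gpu_layers=0 in the model definition so nothing is offloaded. A sketch of the definition side (the definition name and model path are placeholders; n_gpu_layers does appear in the server's config dump, so it should be a valid field):

# keep everything on the CPU by offloading zero layers to the GPU
my_cpu_model = LlamaCppModel(              # hypothetical definition name
    model_path="some-model.Q4_K_M.gguf",   # placeholder path
    max_total_tokens=4096,
    n_gpu_layers=0,                        # do not offload any layers to the GPU
)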

Thanks

model_definitions.py

I know it's probably easy for everyone else, but I struggle every time I add a new model to test. I often get "model not found" even when the model seems to be there. It would be a huge help if, instead of just returning "model not found", the server reported the exact path and filename it is trying to load, either in the terminal window or in the returned error.
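
In the meantime, the only way I have found to see exactly which names (and paths) the server registered is to list the models through the API, which looks OpenAI-compatible; a quick sketch, with host and port assumed:

# list the model ids the server actually registered, to compare against
# the names used in model_definitions.py
import requests

models = requests.get("http://localhost:8000/v1/models", timeout=30).json()
for m in models.get("data", []):
    print(m["id"], m.get("permissions"))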

PS: when will the new branch that handles these definitions better be ready?

Thanks so much, Doug

Dumb question: definitions.py model parameters

I am very sorry for this newbie question. In definitions.py there are a number of parameters for each model. I assume these correspond to the settings given on the model page. My question is: how do I know the variable names you have used for each setting? For example:

airoboros_l2_13b_gguf = LlamaCppModel(
    model_path="TheBloke/Airoboros-L2-13B-2.1-GGUF",  # automatic download
    max_total_tokens=8192,
    rope_freq_base=26000,
    rope_freq_scale=0.5,
    n_gpu_layers=30,
    n_batch=8192,
)

rope_freq_base doesn't appear in any of your other examples. I assume the examples cover only a subset of the possible parameters. How can I find out which variable names you used? Is there a mapping chart somewhere?
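
In case it helps anyone else wondering the same thing, one way to see every accepted field name is to introspect the LlamaCppModel class from a Python shell; a sketch, where the import path is a guess and should be adjusted to whatever model_definitions.py imports:

# print the parameter names (and defaults) that LlamaCppModel accepts
import inspect

# adjust this import to match the one at the top of model_definitions.py
from llama_api.schemas.models import LlamaCppModel  # path is an assumption

for name, param in inspect.signature(LlamaCppModel).parameters.items():
    print(f"{name} = {param.default!r}")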

Again I apologize for the newbie question that is probably painfully obvious to others.

Thanks, Doug
