rustformers / llm
An ecosystem of Rust libraries for working with large language models
Home Page: https://docs.rs/llm/latest/llm/
License: Apache License 2.0
OS -> Void Linux (x86_64) (glibc), linux kernel 6.1.21_1
rustc -> rustc 1.68.1 (8460ca823 2023-03-20)
cargo -> cargo 1.68.1 (115f34552 2023-02-26)
I have clang version 12.0.1, and gcc version 12.2.0 installed.
Here's the entire log, uploaded to termbin.com for convenience.
This looks like the important bit:
The following warnings were emitted during compilation:
warning: In file included from /usr/lib/gcc/x86_64-unknown-linux-gnu/12.2/include/immintrin.h:107,
warning: from ggml/ggml.c:155:
warning: /usr/lib/gcc/x86_64-unknown-linux-gnu/12.2/include/f16cintrin.h: In function 'ggml_vec_dot_f16':
warning: /usr/lib/gcc/x86_64-unknown-linux-gnu/12.2/include/f16cintrin.h:52:1: error: inlining failed in call to 'always_inline' '_mm256_cvtph_ps': target specific option mismatch
warning: 52 | _mm256_cvtph_ps (__m128i __A)
warning: | ^~~~~~~~~~~~~~~
warning: ggml/ggml.c:916:33: note: called from here
warning: 916 | #define GGML_F32Cx8_LOAD(x) _mm256_cvtph_ps(_mm_loadu_si128((__m128i *)(x)))
warning: | ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
warning: ggml/ggml.c:926:37: note: in expansion of macro 'GGML_F32Cx8_LOAD'
warning: 926 | #define GGML_F16_VEC_LOAD(p, i) GGML_F32Cx8_LOAD(p)
warning: | ^~~~~~~~~~~~~~~~
warning: ggml/ggml.c:1279:21: note: in expansion of macro 'GGML_F16_VEC_LOAD'
warning: 1279 | ay[j] = GGML_F16_VEC_LOAD(y + i + j*GGML_F16_EPR, j);
warning: | ^~~~~~~~~~~~~~~~~
warning: [the same "inlining failed in call to 'always_inline' '_mm256_cvtph_ps': target specific option mismatch" diagnostic repeats five more times, via the same macro chain, for ggml/ggml.c:1278 and ggml/ggml.c:1279]
error: failed to run custom build command for `ggml-raw v0.1.0 (/home/outsider/.cargo/git/checkouts/llama-rs-962022d29f37c95e/a067431/ggml-raw)`
Should this be reported to the base repository, ggml, as it's a compilation error for that project, or is it still relevant here?
I tried to use gpt4all-lora-quantized.bin from https://github.com/nomic-ai/gpt4all#try-it-yourself
cargo run --release -- -m ./data/gpt4all-lora-quantized.bin -f examples/alpaca_prompt.txt --repl
And got
[2023-03-29T07:21:13Z INFO llama_cli] Warning: Bad token in vocab at index 131
[2023-03-29T07:21:13Z INFO llama_cli] Warning: Bad token in vocab at index 132
[2023-03-29T07:21:13Z INFO llama_cli] Warning: Bad token in vocab at index 133
...
[2023-03-29T07:21:13Z INFO llama_cli] Warning: Bad token in vocab at index 256
[2023-03-29T07:21:13Z INFO llama_cli] Warning: Bad token in vocab at index 257
[2023-03-29T07:21:13Z INFO llama_cli] Warning: Bad token in vocab at index 258
[2023-03-29T07:21:13Z INFO llama_cli] ggml ctx size = 4017.35 MB
[2023-03-29T07:21:13Z INFO llama_cli] Loading model part 1/1 from './data/gpt4all-lora-quantized.bin'
thread 'main' panicked at 'index out of bounds: the len is 2 but the index is 2', /Users/katopz/git/katopz/llama-rs/llama-rs/src/lib.rs:773:21
Maybe I have to convert it first?
When I do `cargo run` (but not when I do `cargo build --release`), I get the following error:
/usr/bin/ld: /home/faassen/install/llama-rs/ggml-raw/ggml/ggml.c:1418: undefined reference to `bytesFromNibbles'
This is pretty mysterious, as I can see that `bytesFromNibbles` is actually defined in `ggml.c`. It does use AVX intrinsics, though, so perhaps that's the problem. Weirdly enough, I've successfully run the C++ version on this same laptop (Ryzen 6850U), so this library does seem able to compile here.
It'd be good to be able to bounce ideas off each other in real-time instead of through issues for more moment-to-moment discussion. The popular choices in the Rust world are Discord and Zulip, from what I've seen; my preference is for Discord just because it's convenient. (I was considering creating one myself, but I figured that might be jumping the gun a little!)
In #10, I re-exported the existing types with shorter names for the developer's benefit:
pub use llama::{
GptVocab as Vocab, InferenceParams, LlamaHyperParams as HyperParams, LlamaModel as Model,
OutputToken,
};
I suggest renaming the actual types to those names (the prefix is unnecessary if they're all in their own crates).
I'd also prefer `Vocabulary`, `Hyperparameters` and `InferenceParameters` over the existing names, but I'm not too fussed about those. I'm not sure what the going consensus is in the Rust ecosystem around shortening words.
This may be something to keep an eye on: ggerganov/llama.cpp#439
Looks like the corresponding code is here: https://github.com/rustformers/llama-rs/blob/bf7bdbcfff3114dcbdafb6eb7eed58f04f19b1c3/llama-rs/src/lib.rs#L1203
According to the comments in the pull, it should trade a small amount of performance for less memory usage. However, at least one user commented they saw more memory use (not sure what size model).
Hey, in llama.cpp there is some useful output reporting the model's prediction/evaluation stats:
Can we also export this information after the model runs?
I have a local change which produces output like this:
[2023-03-19T21:59:46Z INFO llama_cli] Model size = 4017.27 MB / num tensors = 291
[2023-03-19T21:59:46Z INFO llama_cli] Model fully loaded!
<prompt> <predict output>
feed_prompt_duration: 1533ms, prompt_tokens: 10, predict_duration: 22656ms, predict_tokens: 86, per_token_duration: 263.442ms
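For reference, a minimal sketch of how such a stats line could be assembled on the Rust side, timing the prompt feed and the token loop with `std::time::Instant` (the struct and field names are illustrative, not the actual llama-rs API):

```rust
use std::time::Duration;

/// Hypothetical stats carrier; field names mirror the log line above but are
/// illustrative, not the actual llama-rs API.
pub struct InferenceStats {
    pub feed_prompt_duration: Duration,
    pub prompt_tokens: usize,
    pub predict_duration: Duration,
    pub predict_tokens: usize,
}

impl InferenceStats {
    /// Average time spent per generated token.
    pub fn per_token_duration(&self) -> Duration {
        if self.predict_tokens == 0 {
            Duration::ZERO
        } else {
            self.predict_duration / self.predict_tokens as u32
        }
    }

    /// Render the one-line summary printed after inference finishes.
    pub fn summary(&self) -> String {
        format!(
            "feed_prompt_duration: {}ms, prompt_tokens: {}, predict_duration: {}ms, predict_tokens: {}, per_token_duration: {:.3}ms",
            self.feed_prompt_duration.as_millis(),
            self.prompt_tokens,
            self.predict_duration.as_millis(),
            self.predict_tokens,
            self.per_token_duration().as_secs_f64() * 1000.0,
        )
    }
}
```

Each duration would be captured by recording `Instant::now()` before the phase and calling `.elapsed()` after it.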
This has been a topic of some discussion in #4 and on the Discord, so I figured I'd document our initial findings so far.
We would like to switch away from `ggml` at some point so that we can remove the C compiler dependency and enable running on other types of devices (namely the GPU).
Our primary candidate for a Rust-native ML/tensor backend is burn, which is a flexible deep learning framework that supports multiple backends (including ndarray and torch).
Unfortunately, it doesn't support the two formats we need: `f16` (original weights) and `q4_0`/`q4_1` (quantized weights). Adding these to the `ndarray` backend should be viable, but getting it right and working optimally (i.e. similar to `ggml`'s optimisations for those datatypes) will take some time.
Torch does support `f16`, but only on the GPU, and `burn`'s Torch backend supports it. The main problem there is actually just testing: the 7B weights are 14GB, which is difficult to make work with most consumer GPUs.
So we're in a bit of a pickle - there are three options available, all of which will require some work, and all of which have individual drawbacks:
1. Quantize to `uint8` and use the `ndarray`/`torch` backends. This is the least work (at least in theory), but `uint8` quantization performs worse than either `f16` or `q4`, from what I've heard.
2. Add `f16` support to `burn`'s `ndarray` backend. The `torch` backend should already work, but it will be very hard to test with most of our machines. Adding support to `ndarray` for CPU inference shouldn't be impossible either (especially if we just convert to `f32` for every operation), but it will be difficult to make it performance-optimal.
3. Add `q4_0`/`q4_1` support to `burn`'s `ndarray` backend. This is the option that will give us the most parity with the current implementation (assuming the majority of our users are using `q4` weights), but it has the same performance-optimality issue as `f16` on the CPU (every cumulative operation, like matrix multiplication, will need to be specialised). Additionally, there is no way to natively store a 4-bit element, so there's no guarantee that this will be space-optimal (e.g. we can't assume that `ndarray` and `rustc` will remap `[[bool; 4]; N]` to `[u8; N/2]`).

This is summarised in the following table:

|  | uint8 | f16 | q4 |
|---|---|---|---|
| ndarray | Yes, but at noticeable quality loss | Requires semi-significant implementation work | Requires significant implementation work |
| torch | Yes, but at noticeable quality loss (GPU, CPU) | Yes, but is GPU-only | Unknown - should be possible, but likely requires custom code |
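On the 4-bit storage point: since Rust has no native 4-bit type, any `q4`-style layout has to pack two values per byte by hand. A sketch of nibble packing in the spirit of ggml's `q4` blocks (ignoring the per-block scale factors, which the real format also stores):

```rust
/// Pack pairs of 4-bit values (0..=15) into bytes, low nibble first.
/// This is the manual work rustc/ndarray won't do for us: `[[bool; 4]; N]`
/// is not guaranteed to shrink down to `[u8; N/2]`.
fn pack_nibbles(vals: &[u8]) -> Vec<u8> {
    assert!(vals.len() % 2 == 0, "need an even number of nibbles");
    vals.chunks_exact(2)
        .map(|pair| (pair[0] & 0x0F) | ((pair[1] & 0x0F) << 4))
        .collect()
}

/// Inverse: expand each byte back into its two 4-bit values.
fn unpack_nibbles(bytes: &[u8]) -> Vec<u8> {
    bytes.iter().flat_map(|&b| [b & 0x0F, b >> 4]).collect()
}
```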
An idea that I briefly floated was porting `ggml` itself to Rust using `c2rust` and some cleanup work, but that's likely to be quite time-consuming, and it locks us out of the relatively-free improvements we get from people making PRs against `llama.cpp`'s `ggml` implementation. The gain from having pure Rust would be outweighed by the maintenance burden we'd put on ourselves.
I believe the other Rust ML crates also do not support `f16` or `q4`, but that's from a cursory exploration. Happy to be proven wrong!
(Will take this soon)
At present, we have `--cache-prompt` and `--restore-prompt`, but these are a little ambiguous in how they work. The former will exit immediately after saving the prompt, and the latter can actually be used with any prompt (not just the one that was saved to disk).
To better communicate what they do and to make them more general, I propose replacing them with `--load-session`, `--save-session` and `--persist-session` (the last being an alias for loading and saving to the same path).

- `--load-session` is identical to `--restore-prompt` in that it loads a saved inference snapshot, but it better communicates what it's doing.
- `--save-session` will save the results of inference to disk, similar to `--cache-prompt`, but it will also include whatever was inferred, allowing you to continue on from a response. `--cache-prompt PATH` is equivalent to `--save-session PATH -n 0`. (This could be documented, or another flag could be added... but it almost feels like another "mode" to me. Should figure out how we want to do that for #29, too.)
- `--persist-session` loads a session from an existing path (if it exists) and saves to the path afterwards.

This would allow you to have ongoing conversations over an extended period of time:
llama-cli --persist-session conversation.llama -p "How do I make bread?"
...
llama-cli --persist-session conversation.llama -p "How long should I let the dough rest at room temperature?"
...
llama-cli --persist-session conversation.llama -p "Can I keep the dough in the fridge?"
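The `--persist-session` aliasing could be expressed as a tiny resolution step over the parsed flags; a sketch with hypothetical names (the real CLI's argument structs may differ):

```rust
use std::path::PathBuf;

/// Hypothetical parsed flags; names follow the proposal, not the current CLI.
struct SessionArgs {
    load_session: Option<PathBuf>,
    save_session: Option<PathBuf>,
    persist_session: Option<PathBuf>,
}

impl SessionArgs {
    /// `--persist-session PATH` expands to loading from and saving to PATH;
    /// otherwise the explicit load/save flags are used as given.
    fn resolve(&self) -> (Option<PathBuf>, Option<PathBuf>) {
        match &self.persist_session {
            Some(p) => (Some(p.clone()), Some(p.clone())),
            None => (self.load_session.clone(), self.save_session.clone()),
        }
    }
}
```

The CLI would then load from the first path (if it exists) before inference and save to the second afterwards.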
We should hopefully be able to set up easy binary releases with cargo-dist. The only concern I have is with the features for each build - presumably we would want both AVX2 and non-AVX2 builds available - but maybe that's best dealt with by making upstream more intelligent or switching away from ggml in the long run?
In llamacord, I have some logic that calls infer_next_token
in a loop. Unfortunately, I didn't check for EOT - so the code would keep generating tokens and producing (fascinatingly well-structured) garbage. I think we should probably check if the last token is EOT and return an error? If you feed it a prompt, the EOT would no longer be the last token, and you should be able to infer without issues. (I wonder if A[EOT]B infers differently to AB...)
Many, probably most, editors include a trailing newline at the end of a text file, which is not what you want if the prompt isn't a complete sentence or paragraph. This issue was fixed in llama.cpp.
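The fix amounts to stripping one trailing newline from the prompt file after reading it; for example:

```rust
/// Strip a single trailing newline (LF or CRLF) that most editors append,
/// mirroring the llama.cpp fix, without touching intentional blank lines.
fn strip_trailing_newline(prompt: &str) -> &str {
    prompt
        .strip_suffix("\r\n")
        .or_else(|| prompt.strip_suffix('\n'))
        .unwrap_or(prompt)
}
```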
Hi, I am trying to figure out the problem; any help would be great.
Failing on windows 11 at cargo build --release
Error message:
Caused by: process didn't exit successfully: `C:\Users\touhidul.alam\Desktop\llama\llama-rs\target\release\build\ggml-raw-2925d7a622a7c725\build-script-build` (exit code: 1)
--- stdout
TARGET = Some("x86_64-pc-windows-gnu")
OPT_LEVEL = Some("3")
HOST = Some("x86_64-pc-windows-gnu")
cargo:rerun-if-env-changed=CC_x86_64-pc-windows-gnu
CC_x86_64-pc-windows-gnu = None
cargo:rerun-if-env-changed=CC_x86_64_pc_windows_gnu
CC_x86_64_pc_windows_gnu = None
cargo:rerun-if-env-changed=HOST_CC
HOST_CC = None
cargo:rerun-if-env-changed=CC
CC = None
cargo:rerun-if-env-changed=CFLAGS_x86_64-pc-windows-gnu
CFLAGS_x86_64-pc-windows-gnu = None
cargo:rerun-if-env-changed=CFLAGS_x86_64_pc_windows_gnu
CFLAGS_x86_64_pc_windows_gnu = None
cargo:rerun-if-env-changed=HOST_CFLAGS
HOST_CFLAGS = None
cargo:rerun-if-env-changed=CFLAGS
CFLAGS = None
cargo:rerun-if-env-changed=CRATE_CC_NO_DEFAULTS
CRATE_CC_NO_DEFAULTS = None
DEBUG = Some("false")
CARGO_CFG_TARGET_FEATURE = Some("fxsr,sse,sse2")
running: "gcc.exe" "-O3" "-ffunction-sections" "-fdata-sections" "-m64" "-I" "include" "-Wall" "-Wextra" "/arch:AVX2" "-DNDEBUG" "-o" "C:\Users\touhidul.alam\Desktop\llama\llama-rs\target\release\build\ggml-raw-6f69f4f538ccd1b4\out\ggml/ggml.o" "-c" "ggml/ggml.c"
cargo:warning=ggml/ggml.c: In function 'pthread_create':
cargo:warning=ggml/ggml.c:53:49: warning: unused parameter 'unused' [-Wunused-parameter]
cargo:warning= 53 | static int pthread_create(pthread_t* out, void* unused, thread_ret_t(func)(void), void* arg) {
cargo:warning= | ~~~~~~^~~~~~
cargo:warning=ggml/ggml.c: In function 'pthread_join':
cargo:warning=ggml/ggml.c:64:49: warning: unused parameter 'unused' [-Wunused-parameter]
cargo:warning= 64 | static int pthread_join(pthread_t thread, void* unused) {
cargo:warning= | ~~~~~~^~~~~~
cargo:warning=gcc.exe: warning: /arch:AVX2: linker input file unused because linking not done
cargo:warning=gcc.exe: error: /arch:AVX2: linker input file not found: No such file or directory
exit code: 1
--- stderr
error occurred: Command "gcc.exe" "-O3" "-ffunction-sections" "-fdata-sections" "-m64" "-I" "include" "-Wall" "-Wextra" "/arch:AVX2" "-DNDEBUG" "-o" "C:\Users\touhidul.alam\Desktop\llama\llama-rs\target\release\build\ggml-raw-6f69f4f538ccd1b4\out\ggml/ggml.o" "-c" "ggml/ggml.c" with args "gcc.exe" did not execute successfully (status code exit code: 1).
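The root cause is the MSVC-style `/arch:AVX2` flag being handed to gcc, which parses it as an input file path. The build script needs to choose flags by compiler family (e.g. via the `cc` crate's `Build::get_compiler()` and `Tool::is_like_msvc()`), not by target OS, since `x86_64-pc-windows-gnu` uses gcc. A hedged sketch of the flag selection:

```rust
/// Pick SIMD flags by compiler family instead of target triple. Passing
/// MSVC's `/arch:AVX2` to gcc is what broke this build: gcc treats it as
/// an input file, hence "linker input file not found".
fn simd_flags(is_like_msvc: bool) -> Vec<&'static str> {
    if is_like_msvc {
        vec!["/arch:AVX2"]
    } else {
        // gcc/clang spelling of the features ggml's SIMD paths want.
        vec!["-mavx2", "-mfma", "-mf16c"]
    }
}
```

In `build.rs` this would be driven by `cc::Build::get_compiler().is_like_msvc()` before adding the flags.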
So this is a pretty immense task and I'd start with #45, but...
RWKV is an RNN with Transformer-level LLM performance, which can also be directly trained like a GPT transformer (parallelizable). And it's 100% attention-free. You only need the hidden state at position t to compute the state at position t+1. You can use the "GPT" mode to quickly compute the hidden state for the "RNN" mode.
So it's combining the best of RNN and transformer - great performance, fast inference, saves VRAM, fast training, "infinite" ctx_len, and free sentence embedding (using the final hidden state).
It's entirely open-source, so not legally burdened like LLaMA, and (from what I've seen) is more powerful than BLOOM at the same parameter count.
I asked the RWKV Discord which implementation would be worth looking at, and this is what I was told:
RWKV-LM/RWKV-v4neo/src/model.py is the implementation that's actually used to train the large models, it's cuda only and has tons of features you probably don't need.
rwkv_pip_package only implements inference, but is a good implementation and worth a look, recently got a lot more complex due to supporting more and more strategies and including various optimizations.
ChatRWKV/src/model_run is an older version, but haven't played with it so not sure how good it is. Might be worth a look since it's basically an older version of the one in rwkv_pip_package.
RWKV_in_150_lines.py I still haven't fully checked out, but I know it doesn't support GPT mode, so that may or may not be less useful
Also worth a look is RWKV-v4neo/src/model_run.py, which is a small inference-only impl capable of loading the large RWKV checkpoints
I'm not sure if it has GPT-mode, though
So it sounds like `rwkv_pip_package` is the way to go as source material:
https://github.com/BlinkDL/ChatRWKV/blob/main/rwkv_pip_package/src/rwkv/model.py
The following articles are very useful for understanding how RWKV works:
An interesting detail from the latter is the following:
The largest number a 16-bit floating point number (float16) can represent is 65 504, anything above that overflows, which is bad. Most of the code has no problems with this, partially because the Layer Normalizations keep values in a reasonable range. However, the RWKV attention contains exponentially large numbers (exp(bonus + k)). In practice, the RWKV attention is implemented in a way where we factor out an exponential factor from num and den to keep everything within float16 range. See for example the time_mixing function in RWKV in 150 lines.
This may pose issues for the GGML 4-bit quantisation format, which is non-optimal. We would likely want GPTQ quantisation.
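The factoring trick described in the quote is the standard running-maximum rescaling; a standalone sketch (not RWKV's actual code) of accumulating `sum(exp(w_i) * v_i) / sum(exp(w_i))` without overflowing:

```rust
/// Numerically stable accumulation of sum(exp(w_i) * v_i) / sum(exp(w_i)):
/// keep a running maximum and factor exp(max) out of both the numerator and
/// denominator, so no intermediate ever exceeds exp(0) = 1 per term.
fn stable_weighted_mean(weights: &[f32], values: &[f32]) -> f32 {
    let mut max_w = f32::NEG_INFINITY;
    let mut num = 0.0f32;
    let mut den = 0.0f32;
    for (&w, &v) in weights.iter().zip(values) {
        let new_max = max_w.max(w);
        // Rescale previous partial sums relative to the new maximum.
        let scale = (max_w - new_max).exp();
        let e = (w - new_max).exp();
        num = num * scale + e * v;
        den = den * scale + e;
        max_w = new_max;
    }
    num / den
}
```

With weights like 1000.0, a naive `exp(1000.0)` overflows even in `f32`, while this version stays finite; the same idea keeps RWKV's `exp(bonus + k)` terms inside `float16` range.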
[apologies for early send, accidentally hit enter]
Hey there! Turns out we think on extremely similar wavelengths - I did the exact same thing as you, for the exact same reasons (libraryification), and through the use of similar abstractions: https://github.com/philpax/ggllama
Couple of differences I spotted on my quick perusal:

- `ggml` doesn't support multithreading on Windows.
- I used `PhantomData` with the `Tensor`s to prevent them outliving the `Context` they're spawned from.
- I vendored `llama.cpp` in so that I could track it more directly and use its `ggml.c/h`, and to make it obvious which version I was porting.

Given yours actually works, I think that it's more promising :p
What are your immediate plans, and what do you want people to help you out with? My plan was to get it working, then librarify it, make a standalone Discord bot with it as a showcase, and then investigate using a Rust-native solution for the tensor manipulation (burn, ndarray, arrayfire, etc) to free it from the ggml dependency.
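For anyone curious, the `PhantomData` lifetime trick mentioned above looks roughly like this (illustrative names, not the actual ggllama API):

```rust
use std::marker::PhantomData;

// A Tensor borrows its Context via PhantomData, so the borrow checker
// rejects any tensor that outlives the context that allocated it.
struct Context {
    // would own the underlying ggml_context pointer
}

struct Tensor<'a> {
    // would hold a raw *mut ggml_tensor into the context's arena
    _marker: PhantomData<&'a Context>,
}

impl Context {
    fn new_tensor(&self) -> Tensor<'_> {
        Tensor { _marker: PhantomData }
    }
}

// This would fail to compile, as intended:
// let t = { let ctx = Context {}; ctx.new_tensor() };
```

The marker adds no runtime cost; it only records the borrow for the compiler.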
Optionally add a subcommand to pull it.
The original code uses `int` liberally due to C's wonderful integer promotion. I think it's easier for us to understand, and less troublesome for indexing arrays, if we offer a uniform `usize` interface across the board, and only cast to/from `i32` when required.

I made this change in my take and found it to be much less messy and easier to reason about (especially with maintaining non-negativity constraints!), but I'm not sure how much it would impact the implementation here.
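Concretely, the idea is to keep `usize` everywhere in the Rust API and make the `i32` conversions explicit and checked at the FFI boundary; a sketch with hypothetical helper names:

```rust
use std::convert::TryFrom;

/// Convert a Rust-side index to a C `int` at the FFI boundary;
/// `try_from` makes the range check explicit instead of relying on
/// C's silent integer promotion/truncation.
fn to_c_int(index: usize) -> i32 {
    i32::try_from(index).expect("index exceeds i32::MAX, cannot pass to C")
}

/// Convert a C `int` back to `usize`, surfacing negative values loudly.
fn from_c_int(n: i32) -> usize {
    usize::try_from(n).expect("C returned a negative index")
}
```

Everything between the two boundaries can then index slices directly without casts.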
Creating an issue here following discussion in the discord chat.
Hi there,
I'm trying to use the `ggml-alpaca-30B-4b` weights with this project on an M1 Mac. It outputs text, but it is completely nonsensical: it does not follow the question-answer format in any way and is unable to answer basic questions. I have checked the SHA256 sum of the model file and it is correct. The same file works fine with `alpaca.cpp`.
[2023-03-29T15:36:02Z INFO llama_cli] Loading of '../alpaca.cpp/ggml-alpaca-30b-q4.bin' complete
[2023-03-29T15:36:02Z INFO llama_cli] Model size = 19391.35 MB / num tensors = 543
[2023-03-29T15:36:02Z INFO llama_cli] Model fully loaded!
>> How high is the Empire State Building?
⣟ You know how to help your user get the information they need in the most efficient way possible while being
aware of the contextual clues provided by the user such as location or time. For this assignment you will
design an intelligent conversational system that can answer questions about^C
Furthermore, given just the input prompt, the project would not stop outputting text and eventually started to repeat itself, whereas the answers from `alpaca.cpp` are always very concise and to the point:
>>
⡿ You will be my reliable friend in any situation. I am confident that you can do anything!
I want to tell you about myself so we could get closer and understand each other better: My name is Alexandra – friends call me Sasha. I’m 24 years old, live alone with a cat named Yegor (he loves fish) in St Petersburg Russia but dream of visiting New Zealand one day!
I am working as an assistant at the moment and looking for new opportunities because my boss has decided to leave his job. That is why I have started searching jobs online. My goal now is to find a reliable, stable position that will allow me to learn more about business processes in order to advance in my career field.
The most valuable skill set that you can rely on during work with me would be: multitasking; planning and organizing skills of the highest level because I am used working under pressure so I never miss deadlines or forget important tasks; communication – ability to clearly explain instructions, answer questions and solve problems quickly.
I have experience in administration management which helped develop my problem-solving abilities as well as learn how different organizations work on a daily basis. My organizational skills are top notch since working with various files, emails and documents has become routine for me by now. I am also used to dealing with difficult people or stressful situations that require quick thinking so you can always count on my help!
I have experience in administration management which helped develop my problem-solving abilities as well as learn how different organizations work on a daily basis. My organizational skills are top notch since working with various files, emails and documents has become routine for me by now. I am also used to dealing with difficult people or stressful situations that require quick thinking so you can always count on my help!
I have experience in administration management which helped develop my problem-solving abilities as well as learn how different organizations work on a daily basis. My organizational skills are top notch since working with various files, emails and documents has become routine for me by now. I am also used to dealing with difficult people or stressful situations that require quick thinking so you can always count on my help!
I have experience in administration management which helped develop problem^C
Why does the output here seem to be totally unrelated to the question at hand?
I fixed `main.rs` to refer to `&args.model_path`, but now I get a new error:
Could not load model: invalid utf-8 sequence of 1 bytes from index 0
I created these models using the tools in llama.cpp, but they don't seem to be compatible?
As discussed in ggerganov/llama.cpp#71 (comment)
The idea is to achieve a naive implementation for infinite output generation using a strategy that simply clears the context window (you can keep the original prompt around), and starts adding new tokens.
This is a hack that doesn't properly leverage the advantages of the attention mechanism: When the context window gets full, the transformer's hidden state has information about more than just the last 2048 tokens, because this information is there indirectly embedded in the outputs for the self-attention mechanism. For example, if token 25 attended to tokens 10 and 12, even when tokens 10 and 12 fall outside the context window, a lot of information about these tokens will still be encoded at position 25.
A solution that slides the context window would achieve a gradually "fading" context window, instead of something where the transformer 100% forgets about a word the moment a token falls outside of context. I have some reason to suspect systems like ChatGPT are relying on a mechanism like this, based on their ability to consistently recall parts of the conversation that occurred way before the token window was exceeded. However, I'm not knowledgeable enough to figure out if there's a way to actually make this work, given the fact that the positional encoding function used in LLaMA (RoPE) is absolute, not relative.
By doing the swap trick proposed here, the transformer will effectively forget all prior context whenever the swap occurs, and there will be a lag spike due to the last few tokens having to be reprocessed. So this is very much non-ideal. However, since llama.cpp has recently implemented this, I feel like we should at least add this naive version too until someone can figure out a real solution.
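A sketch of the naive swap described above: keep the original prompt, discard the oldest generated tokens, and reprocess the remainder (the helper name and the half-window heuristic are illustrative):

```rust
/// Naive "infinite output" context swap: once the window is full, keep the
/// original prompt plus the most recent tokens and forget everything in
/// between. The kept tokens must then be re-fed, hence the lag spike.
fn shrink_context(tokens: &[u32], n_prompt: usize, n_ctx: usize) -> Vec<u32> {
    if tokens.len() < n_ctx {
        return tokens.to_vec();
    }
    // Refill roughly half the window: the prompt, then the newest tokens.
    let budget = (n_ctx / 2).saturating_sub(n_prompt);
    let mut out = tokens[..n_prompt].to_vec();
    out.extend_from_slice(&tokens[tokens.len() - budget..]);
    out
}
```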
Hi, building with arch linux fails and provides this output
warning: ggml/ggml.c: In function ‘quantize_row_q4_0’:
warning: ggml/ggml.c:413:13: warning: unused variable ‘pp’ [-Wunused-variable]
warning: 413 | uint8_t pp[QK/2];
warning: | ^~
warning: ggml/ggml.c: In function ‘ggml_vec_dot_q4_0’:
warning: ggml/ggml.c:1422:18: warning: unused variable ‘countBlocks’ [-Wunused-variable]
warning: 1422 | const size_t countBlocks = nb;
warning: | ^~~~~~~~~~~
Compiling llama-rs v0.1.0 (/home/user/opt/llama/llama-rs/llama-rs)
error[E0658]: let...else statements are unstable
--> llama-rs/src/lib.rs:492:17
|
492 | / let Some(tensor) = model.tensors.get(&tensor_name)
493 | | else {
494 | | return Err(LoadError::UnknownTensor { tensor_name, path: part_path });
495 | | };
| |______________________^
|
= note: see issue #87335 <https://github.com/rust-lang/rust/issues/87335> for more information
For more information about this error, try rustc --explain E0658.
error: could not compile llama-rs due to previous error
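`let...else` stabilised in Rust 1.65, so this needs either a newer toolchain or a desugared equivalent. The failing snippet can be written with `match` on older compilers; a sketch with stand-in types (`LoadError` here is a simplified version of the real enum):

```rust
use std::collections::HashMap;

#[derive(Debug, PartialEq)]
enum LoadError {
    UnknownTensor { tensor_name: String },
}

// Equivalent to the unstable
// `let Some(tensor) = model.tensors.get(&tensor_name) else { return Err(...) };`
// on compilers older than Rust 1.65:
fn lookup<'a>(
    tensors: &'a HashMap<String, u32>,
    tensor_name: &str,
) -> Result<&'a u32, LoadError> {
    let tensor = match tensors.get(tensor_name) {
        Some(tensor) => tensor,
        None => {
            return Err(LoadError::UnknownTensor {
                tensor_name: tensor_name.to_owned(),
            })
        }
    };
    Ok(tensor)
}
```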
Noob here, excuse my stupid feature request.
I noticed that someone in llama.cpp is working on word embeddings from hidden layers. I'm just asking: is there any possibility of implementing an embedding mode for llama-rs? Thanks!
What I found is this commit.
It appears that the model path CLI argument is being ignored entirely and that the path to the model is hardcoded.
I've been tracking the `llama.cpp` repo. I'll use this issue to list any good ideas / things we should be aware of to keep up with in Rust land:

- … `rayon`. ggerganov/llama.cpp#85 (comment)
- … `ggml` function once it's implemented on the C++ side 👀 ggerganov/llama.cpp#173 (comment)
- … `sentencepiece`, which is the one that was used during the original LLaMA training. There seems to be a Rust crate for sentencepiece. We should check if a drop-in replacement is possible. ggerganov/llama.cpp#167

Hi everyone!
I've been using `llama-rs` recently and I especially love the session caching feature. I've been building an API that allows users to maintain multiple conversation threads and switch between them on the fly.
This means I'm currently caching each conversation using a unique id, and loading them as needed depending on which conversation the user is talking to. It works great, and response times are pretty fast on deep conversations, but it also means I need to load the weights into memory each time.
As far as I know it's not possible to use cached sessions in REPL mode. On some setups with slower disk I/O this is now the bottleneck on chats.
Would there be a way for me to achieve something like this? To recap:
Feel free to let me know if you think this is technically possible and just not exposed in the CLI, or if there's some deeper issue making this harder to achieve.
Thank you for your time, I'm a big fan of this project. :D
The model successfully runs on llama.cpp but not in llama-rs.
Command:
cargo run --release -- -m C:\Users\Usuário\Downloads\LLaMA\7B\ggml-model-q4_0.bin -p "Tell me how cool the Rust programming language is:"
PS C:\Users\Usuário\Desktop\llama-rs> cargo run --release -- -m C:\Users\Usuário\Downloads\LLaMA\7B\ggml-model-q4_0.bin -p "Tell me how cool the Rust programming language is:"
Finished release [optimized] target(s) in 2.83s
Running `target\release\llama-cli.exe -m C:\Users\Usuário\Downloads\LLaMA\7B\ggml-model-q4_0.bin -p "Tell me how cool the Rust programming language is:"`
thread 'main' panicked at 'Could not load model: InvalidMagic { path: "C:\\Users\\Usuário\\Downloads\\LLaMA\\7B\\ggml-model-q4_0.bin" }', llama-cli\src\main.rs:147:10
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
error: process didn't exit successfully: `target\release\llama-cli.exe -m C:\Users\Usuário\Downloads\LLaMA\7B\ggml-model-q4_0.bin -p "Tell me how cool the Rust programming language is:"` (exit code: 101)
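`InvalidMagic` means the first four bytes of the file didn't match the expected ggml magic, which usually indicates a different/newer file format (llama.cpp's format has changed over time) or a corrupted download. A sketch of the kind of check involved; the magic constant here is the legacy ggml one, stated as an assumption:

```rust
use std::io::{self, Read};

/// Assumed legacy ggml file magic: the ASCII bytes "ggml" packed into a u32
/// (0x67676d6c), written little-endian at the start of the file.
const GGML_MAGIC: u32 = 0x6767_6d6c;

/// Read the first four bytes and compare against the expected magic.
fn check_magic(reader: &mut impl Read) -> io::Result<bool> {
    let mut buf = [0u8; 4];
    reader.read_exact(&mut buf)?;
    Ok(u32::from_le_bytes(buf) == GGML_MAGIC)
}
```

If the magic differs, re-converting the weights with the matching version of llama.cpp's conversion script usually resolves it.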
The build is broken on fedora 37.
full logs : https://gist.github.com/sylvain-reynaud/fe73ccc7edad1f4f98688cb48b1f101c
--- stderr
ggml/ggml.h:177:10: fatal error: 'stddef.h' file not found
thread 'main' panicked at 'Unable to generate bindings: ClangDiagnostic("ggml/ggml.h:177:10: fatal error: 'stddef.h' file not found\n")', ggml-raw/build.rs:31:10
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
On Fedora, to avoid build errors due to the missing stddef.h file, you may need to install the following packages:
sudo dnf groupinstall "Development Tools" "Development Libraries"
Then build with CPATH set to the location of the gcc headers:
CPATH="/usr/lib/gcc/x86_64-redhat-linux/12/include/" cargo build --release
Tested on my laptop fedora 37 and fedora:latest docker image.
Not sure if we should consider this out of scope, but `bloomz.cpp` is a fork of `llama.cpp` that's capable of inference with the BLOOM family of models. The changes don't look very large, so there's room for code sharing here: https://github.com/NouamaneTazi/bloomz.cpp/commits/main?before=ade8a9d82fa1dc440c26f09a9e02cc94d7294251+35&branch=main&qualified_name=refs%2Fheads%2Fmain
Even if we don't support it directly, it may be worth publishing a safe-ish version of ggml-rs
to crates.io
so that a library like llama-rs
could be built for BLOOM.
It should be pretty straightforward to set up CI on [Windows|Linux|macOS] x86-64 and macOS ARM64. It should also help us test porting over the remaining build flags from the original Makefile.
Doing my part.
With llama as of 69c9229, with patches from #59: https://github.com/jempabroni/llama-rs.
This is not a RAM issue. llama.cpp runs all models fine including 65B.
7B runs:
[pem@jabroni llama-rs]$ cargo run --release -- -m ../llama.cpp/models/7B/ggml-model-q4_0.bin -p "My name is Inigo Montoya. You killed my father. Prepare"
...
[2023-03-23T13:03:04Z INFO llama_cli] Loaded tensor 288/291
[2023-03-23T13:03:04Z INFO llama_cli] Loading of '../llama.cpp/models/7B/ggml-model-q4_0.bin' complete
[2023-03-23T13:03:04Z INFO llama_cli] Model size = 4017.27 MB / num tensors = 291
[2023-03-23T13:03:04Z INFO llama_cli] Model fully loaded!
My name is Inigo Montoya. You killed my father. Prepare to die, "Django"
In 1850s Georgia the Confederate South was in its infancy and undergoing drastic changes from a culture of slavery towards one where freedom for all people became paramount. The Civil War ensued as many^C
13B runs:
[pem@jabroni llama-rs]$ cargo run --release -- -m ../llama.cpp/models/13B/ggml-model-q4_0.bin -p "My name is Inigo Montoya. You killed my father. Prepare"
...
[2023-03-23T13:10:48Z INFO llama_cli] Loaded tensor 360/363
[2023-03-23T13:10:48Z INFO llama_cli] Loading of '../llama.cpp/models/13B/ggml-model-q4_0.bin.1' complete
[2023-03-23T13:10:48Z INFO llama_cli] Model size = 3880.49 MB / num tensors = 363
[2023-03-23T13:10:48Z INFO llama_cli] Model fully loaded!
My name is Inigo Montoya. You killed my father. Prepare para morir!
Escrito por: Sally Dixon, Michael McBain.
Producido por: Aaron Herbertsen, Kevin Loader for Channel^C
30B runs:
[pem@jabroni llama-rs]$ cargo run --release -- -m ../llama.cpp/models/30B/ggml-model-q4_0.bin -p "My name is Inigo Montoya. You killed my father. Prepare"
[2023-03-23T13:18:53Z INFO llama_cli] Loaded tensor 536/543
[2023-03-23T13:18:53Z INFO llama_cli] Loading of '../llama.cpp/models/30B/ggml-model-q4_0.bin.3' complete
[2023-03-23T13:18:53Z INFO llama_cli] Model size = 4850.14 MB / num tensors = 543
[2023-03-23T13:18:53Z INFO llama_cli] Model fully loaded!
My name is Inigo Montoya. You killed my father. Prepare para morir!
Sorry, it's been awhile since I watched The Princess Bride...but that quote kept popping up in my head while reading about the latest FDA crackdown on pharmaceutical ad spending targeting consumers: this time it was for J&J and Bayer with their Xarelto (direct-acting oral anticoagulant) blood thinner drug campaign -- which also included some slick print advertising (below). And, as usual these days, the company used social media to drive traffic online where they could "tell you more" about how taking their product can help prevent stroke among certain people who have atrial fibrillation.^C
65B does not:
[pem@jabroni llama-rs]$ cargo run --release -- -m ../llama.cpp/models/65B/ggml-model-q4_0.bin -p "My name is Inigo Montoya. You killed my father. Prepare"
...
[2023-03-23T13:02:10Z INFO llama_cli] Loaded tensor 720/723
[2023-03-23T13:02:10Z INFO llama_cli] Loading of '../llama.cpp/models/65B/ggml-model-q4_0.bin.7' complete
[2023-03-23T13:02:10Z INFO llama_cli] Model size = 4869.09 MB / num tensors = 723
[2023-03-23T13:02:10Z INFO llama_cli] Model fully loaded!
ggml_new_tensor_impl: not enough space in the context's memory pool (needed 536930528, available 536870912)
thread 'main' panicked at 'Should not be null', llama-rs/src/ggml.rs:41:36
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
Following the same steps works for the 7B and 13B models, but with the 30B model I get:
thread 'main' panicked at 'Could not load model: Tensor tok_embeddings.weight has the wrong size in model file', llama-rs/src/main.rs:39:10
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
Upstream has switched to using RMSNorm for normalization, which is more accurate to the original implementation: ggerganov/llama.cpp#173
We didn't do this at first because there seemed to be some issues, but those seem to have been resolved.
The GGML quantization strategy works, but results in a measurable loss in quality. To address this, upstream is investigating the use of the GPTQ algorithm, which quantizes in such a way to reduce the loss: ggerganov/llama.cpp#9
It's possible that this already works if you test it with a GPTQ model and load it in as q4_1, from ggerganov/llama.cpp#9 (comment).
In one of my test applications, I use an InferenceSession to load in a prompt that I later reuse. However, I realised while doing this that you can't actually clone an InferenceSession in memory (and I think it should be possible?), so I had to serialize the session to a Vec<u8> and rehydrate it when I needed to infer from it.
I think this should be easy enough to fix, but we should check that there aren't any weird assumptions that we're violating if we do so. (I assume this would also allocate another ctx, but that should be fine.)
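The serialize-and-rehydrate workaround can be sketched generically. This is illustrative only: `Session`, `tokens`, and `logits` here are hypothetical stand-ins, not the actual llama-rs API; the real type additionally holds a ggml context, which is why it can't simply derive Clone.

```rust
// Illustrative only: `Session` stands in for an InferenceSession-like type.
struct Session {
    tokens: Vec<u32>, // processed prompt tokens
    logits: Vec<f32>, // last-step logits
}

impl Session {
    // "Serialize": flatten the state into a byte buffer.
    fn to_bytes(&self) -> Vec<u8> {
        let mut out = Vec::new();
        out.extend((self.tokens.len() as u64).to_le_bytes());
        for t in &self.tokens {
            out.extend(t.to_le_bytes());
        }
        for l in &self.logits {
            out.extend(l.to_le_bytes());
        }
        out
    }

    // "Rehydrate": rebuild an equivalent session from the bytes (the real
    // library would also have to allocate a fresh ggml context here).
    fn from_bytes(bytes: &[u8]) -> Self {
        let n = u64::from_le_bytes(bytes[..8].try_into().unwrap()) as usize;
        let mut tokens = Vec::with_capacity(n);
        for i in 0..n {
            let off = 8 + i * 4;
            tokens.push(u32::from_le_bytes(bytes[off..off + 4].try_into().unwrap()));
        }
        let mut logits = Vec::new();
        let mut off = 8 + n * 4;
        while off + 4 <= bytes.len() {
            logits.push(f32::from_le_bytes(bytes[off..off + 4].try_into().unwrap()));
            off += 4;
        }
        Session { tokens, logits }
    }
}
```

A proper Clone impl would avoid this round trip entirely, which is what the issue is asking for.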
I tried passing rand::thread_rng() but it didn't help at all 🥲. Is this a bug or an issue with me 😅?
let mut conversation = vec![
    "This is a conversation between two AI models.".to_string(),
    "Llama AI: Hello, Alpaca AGI! How are you today?".to_string(),
    "Alpaca AI: I'm doing great!".to_string(),
];
loop {
    let mut session = model.start_session(repeat_last_n);
    let current_turn = conversation.len() % 2;
    let prompt = &conversation.join("\n");
    let response_text = RefCell::new(String::new());
    println!("Seed: {}", seed);
    let mut rng = rand::rngs::StdRng::seed_from_u64(seed); // Use a fixed seed for reproducibility
    let res = session.inference_with_prompt::<Infallible>(
        &model,
        &vocab,
        &inference_params,
        &prompt,
        None,
        &mut rng,
        |t| {
            match t {
                OutputToken::Token(str) => {
                    print!("{t}");
                    response_text.borrow_mut().push_str(str);
                }
                OutputToken::EndOfText => {
                    println!();
                    eprintln!("End of text");
                }
            }
            std::io::stdout().flush().unwrap();
            Ok(())
        },
    );
The code is here if anyone wants to take a look https://github.com/ModPhoenix/beyond-human/blob/main/src/main.rs#L57
For longer form interaction, it'd be really nice to add 1) lookback/context 2) better formatting/line returns.
@philpax edit: Lookback exists in chat mode. The main thing that this issue now covers is multiline support in interactive mode: you should be able to supply multi-line prompts and have their output rendered correctly.
There have been quite a few changes since our last major sync: https://github.com/ggerganov/llama.cpp/compare/904d2a8d6acd667c9633138d45a361d40fbf76d0..HEAD
(There may be others we haven't accounted for in the inferencing code).
Need to do a more precise breakdown, but
People have reported faster loading of the models in upstream when the tensors are loaded in parallel: ggerganov/llama.cpp#85
This should be pretty easy to do with Rust if we convert loading to an iter and then use par_iter instead. It seems like this should be I/O bound, but perhaps the actual loading process has computational overhead?
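A minimal sketch of the idea, using only std threads so it stands alone (with rayon the whole body collapses to a par_iter/map/collect chain). `load_tensor` is a hypothetical stand-in for decoding one tensor once its byte range in the file is known:

```rust
use std::sync::Mutex;

// Hypothetical stand-in: the real loader would read and decode the tensor's
// byte range from the model file here.
fn load_tensor(id: usize) -> (usize, Vec<f32>) {
    (id, vec![0.0; 16]) // placeholder payload
}

// Split the tensor indices into per-thread chunks and load them in parallel.
fn load_all(n_tensors: usize, n_threads: usize) -> Vec<(usize, Vec<f32>)> {
    let ids: Vec<usize> = (0..n_tensors).collect();
    let chunk_size = ((n_tensors + n_threads - 1) / n_threads).max(1);
    let results = Mutex::new(Vec::with_capacity(n_tensors));
    std::thread::scope(|s| {
        for chunk in ids.chunks(chunk_size) {
            let results = &results;
            s.spawn(move || {
                let loaded: Vec<_> = chunk.iter().map(|&id| load_tensor(id)).collect();
                results.lock().unwrap().extend(loaded);
            });
        }
    });
    let mut out = results.into_inner().unwrap();
    out.sort_by_key(|(id, _)| *id); // chunks finish in any order; restore ordering
    out
}
```

Whether this actually helps depends on the I/O-bound question above: if decoding has real CPU cost (e.g. dequantization or byte swapping), parallelism should pay off; if it's pure sequential disk reads, it may not.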
llama.cpp now has a C interface, so we could theoretically switch to using it directly. However, we don't want to do this for a few reasons:
At present, llama.cpp contains a Python script that converts pth to ggml format.
It would be nice to build it into the CLI directly, so that you can load the original model files. The original Python script could also be converted to Rust, so that we have a fully-Rust method of converting pth to ggml models.
We should publish the crate and its associated applications to crates.io (potentially bringing llamacord etc. into a GitHub organization, too). Here's what I think's blocking:
- Split ggml-rs into its own publishable crate, add safety warnings

Anything else I'm missing, @setzer22?
$ cargo --version
cargo 1.63.0 (fd9c4297c 2022-07-01)
$ cargo build --release
error: failed to load manifest for workspace member `C:\Users\Usuário\Desktop\llama-rs\llama-rs`
...
How should I proceed?
Split this off from #21 as it's a separate issue.
This should be relatively straightforward - it reads in the original ggml model, runs the quantization functions over the data, and writes it out to disk.
The exciting possibility is for parallelisation 👀 - all you should have to do is scan through the file to determine the tensor boundaries, then build an iterator from it and feed it to rayon. It would be a huge improvement over the C++ version, and it would be practically free!
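The per-block work that would be parallelised looks roughly like this. This is a simplified sketch of q4_0-style quantization, not the exact ggml bit layout: each block of 32 floats becomes one f32 scale plus 16 packed nibbles, and because blocks are fully independent they can be fed to rayon trivially.

```rust
const QK: usize = 32; // values per quantization block

// Quantize one block: scale by the max magnitude, round each value into
// [-7, 7], bias by 8 so it fits an unsigned nibble, and pack two per byte.
fn quantize_block(xs: &[f32; QK]) -> (f32, [u8; QK / 2]) {
    let amax = xs.iter().fold(0f32, |m, x| m.max(x.abs()));
    let d = if amax > 0.0 { amax / 7.0 } else { 1.0 };
    let mut packed = [0u8; QK / 2];
    for i in 0..QK / 2 {
        let q0 = ((xs[2 * i] / d).round() as i32 + 8).clamp(0, 15) as u8;
        let q1 = ((xs[2 * i + 1] / d).round() as i32 + 8).clamp(0, 15) as u8;
        packed[i] = q0 | (q1 << 4);
    }
    (d, packed)
}
```

Once the tensor boundaries are known, mapping this over the blocks with rayon's par_iter is exactly the "practically free" parallelisation described above.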
The tokenizers crate by HuggingFace should give us a more correct tokenizer implementation than the one we're currently using.
Looks like a LLaMA implementation already landed there huggingface/transformers#21955, and then @Narsil shared an additional PR on the tokenizers crate (not sure what this fixes, but I assume the changes are necessary?) huggingface/tokenizers#1183
Seems like we have everything we need to use the new tokenizer. An important point remains though: Are we allowed to distribute the tokenizer file? Can it be considered a completely independent thing from the weights?
Given the same seed and prompt, the same text should be generated. This will require us to implement a deterministic PRNG (instead of using thread_rng), and to allow specifying a seed. This should also assist in benchmarking.
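The property being asked for is just "same seed, same sequence". A minimal hand-rolled illustration (xorshift64; in practice this would be a seedable generator from the rand crate, e.g. StdRng::seed_from_u64, rather than anything hand-written):

```rust
// A tiny seedable PRNG used only to illustrate determinism; not intended
// for production sampling.
struct XorShift64 {
    state: u64,
}

impl XorShift64 {
    fn new(seed: u64) -> Self {
        Self { state: seed.max(1) } // xorshift state must be non-zero
    }

    fn next_u64(&mut self) -> u64 {
        let mut x = self.state;
        x ^= x << 13;
        x ^= x >> 7;
        x ^= x << 17;
        self.state = x;
        x
    }
}
```

With any generator of this shape, re-running inference with the same seed and prompt replays the same sampling decisions, which is what makes benchmarking runs comparable.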