rustformers / llm

An ecosystem of Rust libraries for working with large language models

Home Page: https://docs.rs/llm/latest/llm/

License: Apache License 2.0

Languages: Rust 99.70%, Dockerfile 0.13%, Nix 0.16%
Topics: ai, ggml, llm, ml, rust

llm's People

Contributors

averypelle, bcho, chris-ha458, danforbes, darxkies, floppydisck, hhamud, hlhr202, iacore, jafioti, jon-chuang, karelnagel, katopz, kerfufflev2, llukas22, metalflame12, odysa, olexiyb, pdufour, philpax, pixelspark, radu-matei, redboxing, royvorster, setzer22, skirodev, steventrouble, tanmaysachan, tehmatt, viirya


llm's Issues

Void Linux Build Error

System Info

OS -> Void Linux (x86_64) (glibc), linux kernel 6.1.21_1
rustc -> rustc 1.68.1 (8460ca823 2023-03-20)
cargo -> cargo 1.68.1 (115f34552 2023-02-26)

Are the required C compilers present?

I have clang version 12.0.1, and gcc version 12.2.0 installed.

What error does the installation give?

Here's the entire log, uploaded to termbin.com for convenience.
This looks like the important bit:

The following warnings were emitted during compilation:

warning: In file included from /usr/lib/gcc/x86_64-unknown-linux-gnu/12.2/include/immintrin.h:107,
warning:                  from ggml/ggml.c:155:
warning: /usr/lib/gcc/x86_64-unknown-linux-gnu/12.2/include/f16cintrin.h: In function 'ggml_vec_dot_f16':
warning: /usr/lib/gcc/x86_64-unknown-linux-gnu/12.2/include/f16cintrin.h:52:1: error: inlining failed in call to 'always_inline' '_mm256_cvtph_ps': target specific option mismatch
warning:    52 | _mm256_cvtph_ps (__m128i __A)
warning:       | ^~~~~~~~~~~~~~~
warning: ggml/ggml.c:916:33: note: called from here
warning:   916 | #define GGML_F32Cx8_LOAD(x)     _mm256_cvtph_ps(_mm_loadu_si128((__m128i *)(x)))
warning:       |                                 ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
warning: ggml/ggml.c:926:37: note: in expansion of macro 'GGML_F32Cx8_LOAD'
warning:   926 | #define GGML_F16_VEC_LOAD(p, i)     GGML_F32Cx8_LOAD(p)
warning:       |                                     ^~~~~~~~~~~~~~~~
warning: ggml/ggml.c:1279:21: note: in expansion of macro 'GGML_F16_VEC_LOAD'
warning:  1279 |             ay[j] = GGML_F16_VEC_LOAD(y + i + j*GGML_F16_EPR, j);
warning:       |                     ^~~~~~~~~~~~~~~~~
warning: ... (the same 'inlining failed in call to always_inline' diagnostic repeats for the remaining GGML_F16_VEC_LOAD expansions at ggml/ggml.c:1278 and ggml/ggml.c:1279)

error: failed to run custom build command for `ggml-raw v0.1.0 (/home/outsider/.cargo/git/checkouts/llama-rs-962022d29f37c95e/a067431/ggml-raw)`

Should this be reported to the base repository, ggml, as it's a compilation error for that project, or is it still relevant here?

Do we support gpt4all-lora-quantized.bin?

I tried to use gpt4all-lora-quantized.bin from https://github.com/nomic-ai/gpt4all#try-it-yourself

cargo run --release -- -m ./data/gpt4all-lora-quantized.bin -f examples/alpaca_prompt.txt --repl

And got:

[2023-03-29T07:21:13Z INFO  llama_cli] Warning: Bad token in vocab at index 131
[2023-03-29T07:21:13Z INFO  llama_cli] Warning: Bad token in vocab at index 132
[2023-03-29T07:21:13Z INFO  llama_cli] Warning: Bad token in vocab at index 133
...
[2023-03-29T07:21:13Z INFO  llama_cli] Warning: Bad token in vocab at index 256
[2023-03-29T07:21:13Z INFO  llama_cli] Warning: Bad token in vocab at index 257
[2023-03-29T07:21:13Z INFO  llama_cli] Warning: Bad token in vocab at index 258
[2023-03-29T07:21:13Z INFO  llama_cli] ggml ctx size = 4017.35 MB
    
[2023-03-29T07:21:13Z INFO  llama_cli] Loading model part 1/1 from './data/gpt4all-lora-quantized.bin'

thread 'main' panicked at 'index out of bounds: the len is 2 but the index is 2', /Users/katopz/git/katopz/llama-rs/llama-rs/src/lib.rs:773:21

Maybe I have to convert it first?

bytesFromNibbles error

When I try to do cargo run (but not when I do cargo build --release), I get the following error:

    /usr/bin/ld: /home/faassen/install/llama-rs/ggml-raw/ggml/ggml.c:1418: undefined reference to `bytesFromNibbles'

This is pretty mysterious, as I can see that bytesFromNibbles is actually defined in ggml.c, though it does use AVX intrinsics, so perhaps that's the problem. Weirdly enough, I've successfully run the C++ version on this same laptop (Ryzen 6850U), so this code does seem to compile on this machine.

Real-time chat platform

It'd be good to be able to bounce ideas off each other in real-time instead of through issues for more moment-to-moment discussion. The popular choices in the Rust world are Discord and Zulip, from what I've seen; my preference is for Discord just because it's convenient. (I was considering creating one myself, but I figured that might be jumping the gun a little!)

Renaming of types

In #10, I re-exported the existing types with shorter names for the developer's benefit:

pub use llama::{
    GptVocab as Vocab, InferenceParams, LlamaHyperParams as HyperParams, LlamaModel as Model,
    OutputToken,
};

I suggest renaming the actual types to those names (the prefix is unnecessary if they're all in their own crates).

I'd also prefer Vocabulary, Hyperparameters and InferenceParameters over the existing names, but I'm not too fussed about those. I'm not sure what the going consensus is in the Rust ecosystem around shortening of words.

Reporting model stats

Hey, in llama.cpp there is some useful output reporting the model's prediction/evaluation stats:

https://github.com/ggerganov/llama.cpp/blob/da5303c1ea68aa19db829c634f1e10d08d409680/main.cpp#L1086-L1095

Can we also export this information after the model runs?

I have a local change which produces output like this:

[2023-03-19T21:59:46Z INFO  llama_cli] Model size = 4017.27 MB / num tensors = 291
[2023-03-19T21:59:46Z INFO  llama_cli] Model fully loaded!
<prompt> <predict output>
feed_prompt_duration: 1533ms, prompt_tokens: 10, predict_duration: 22656ms, predict_tokens: 86, per_token_duration: 263.442ms
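
For reference, the measurement side of this is just a couple of `Instant::now()` calls around prompt feeding and generation; a minimal sketch (the closures and names are illustrative, not the existing CLI code):

    use std::time::Instant;

    // Measure prompt feeding and token generation separately, then report the
    // same fields as the llama.cpp output above.
    fn report_stats(feed_prompt: impl FnOnce(), mut next_token: impl FnMut() -> bool) {
        let start = Instant::now();
        feed_prompt();
        let feed_prompt_duration = start.elapsed();

        let predict_start = Instant::now();
        let mut predict_tokens = 0usize;
        while next_token() {
            predict_tokens += 1;
        }
        let predict_duration = predict_start.elapsed();

        println!(
            "feed_prompt_duration: {}ms, predict_duration: {}ms, predict_tokens: {}, per_token_duration: {:.3}ms",
            feed_prompt_duration.as_millis(),
            predict_duration.as_millis(),
            predict_tokens,
            predict_duration.as_millis() as f64 / predict_tokens.max(1) as f64
        );
    }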

Non-`ggml` backend

This has been a topic of some discussion in #4 and on the Discord, so I figured I'd document our initial findings so far.

We would like to switch away from ggml at some point so that we can remove the C compiler dependency, and enable running on other types of devices (namely the GPU).

Our primary candidate for a Rust-native ML/tensor backend is burn, which is a flexible deep learning framework that supports multiple backends (including ndarray and torch).

Unfortunately, it doesn't support the two formats we need: f16 (original weights) and q4_0/q4_1 (quantized weights). Adding these to the ndarray backend should be viable, but getting it right and working optimally (i.e. similar to ggml's optimisations for those datatypes) will take some time.

Torch does support f16 on the GPU only, and burn's Torch backend supports it. The main problem there is actually just testing: the 7B weights are 14GB, which is difficult to make work with most consumer GPUs.

So we're in a bit of a pickle - there are three options available, all of which will require some work, and all of which have individual drawbacks:

  1. Quantize the model to standard uint8 and use ndarray/torch backends. This is the least work (at least in theory), but uint8 quantization performs worse than either f16 or q4, from what I've heard.
  2. Add support for f16 to burn's ndarray backend. The torch backend should already work, but it will be very hard to test with most of our machines. Adding support to ndarray for CPU inference shouldn't be impossible either (especially if we just convert to f32 for every operation), but it will be difficult to make it performance-optimal.
  3. Add support for q4_0/1 to burn's ndarray backend. This is the option that will give us the most parity with the current implementation (assuming the majority of our users are using q4 weights), but it has the same performance-optimality issue as f16 on the CPU (every cumulative operation, like matrix multiplication and such, will need to be specialised). Additionally, there is no way to natively store a 4-bit element, so there's no guarantee that this will be space-optimal (e.g. we can't assume that ndarray and rustc will remap [[bool; 4]; N] to [u8; N/2]); see the packing sketch after the table below.

This is summarised in the following table:

|         | uint8                                          | f16                                           | q4                                                            |
|---------|------------------------------------------------|-----------------------------------------------|---------------------------------------------------------------|
| ndarray | Yes, but at noticeable quality loss            | Requires semi-significant implementation work | Requires significant implementation work                      |
| torch   | Yes, but at noticeable quality loss (GPU, CPU) | Yes, but is GPU-only                          | Unknown - should be possible, but likely requires custom code |
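
As a concrete reference for the packing concern in option 3: ggml's q4_0 stores blocks of 32 weights as one f32 scale plus 16 bytes of packed nibbles, and a Rust port would have to do that packing explicitly rather than relying on the compiler. A rough sketch (illustrative, not the authoritative ggml layout):

    #[allow(non_camel_case_types)]
    struct BlockQ4_0 {
        d: f32,       // per-block scale factor
        qs: [u8; 16], // 32 x 4-bit quants, two per byte
    }

    fn quantize_block_q4_0(weights: &[f32; 32]) -> BlockQ4_0 {
        // Symmetric quantisation: map the largest-magnitude weight to +/-7.
        let amax = weights.iter().fold(0.0f32, |m, &w| m.max(w.abs()));
        let d = amax / 7.0;
        let id = if d != 0.0 { 1.0 / d } else { 0.0 };

        let mut qs = [0u8; 16];
        for (i, pair) in weights.chunks_exact(2).enumerate() {
            // Shift into 0..=15 so each value fits in a nibble.
            let q0 = (((pair[0] * id).round() as i32 + 8).clamp(0, 15)) as u8;
            let q1 = (((pair[1] * id).round() as i32 + 8).clamp(0, 15)) as u8;
            qs[i] = q0 | (q1 << 4);
        }
        BlockQ4_0 { d, qs }
    }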

An idea that I briefly floated was porting ggml itself to Rust using c2rust and some cleanup work, but that's likely to be quite time-consuming and it locks us out of the relatively-free improvements we get from people making PRs against llama.cpp's ggml implementation. The gain from having pure Rust would be outweighed by the maintenance burden we'd put on ourselves.


I believe the other Rust ML crates also do not support f16 or q4, but that's from a cursory exploration. Happy to be proven wrong!

Replace prompt caching with session caching in the CLI

(Will take this soon)

At present, we have --cache-prompt and --restore-prompt, but these are a little ambiguous in how they work. The former will exit immediately after saving the prompt, and the latter can actually be used with any prompt (not just the one that was saved to disk).

To better communicate what they do and to make them more general, I propose replacing them with --load-session, --save-session and --persist-session (which is an alias for loading and saving to the same path).

  • --load-session is identical to --restore-prompt in that it loads a saved inference snapshot, but it better communicates what it's doing.
  • --save-session will save the results of inference to disk, similar to --cache-prompt, but it will also include whatever was inferred, allowing you to continue on from a response. --cache-prompt PATH is equivalent to --save-session PATH -n 0. (This could be documented, or another flag could be added... but it almost feels like another "mode" to me. Should figure out how we want to do that for #29, too.)
  • --persist-session loads a session from an existing path (if it exists) and saves to the path afterwards.

This would allow you to have ongoing conversations over an extended period of time:

llama-cli --persist-session conversation.llama -p "How do I make bread?"
...
llama-cli --persist-session conversation.llama -p "How long should I let the dough rest at room temperature?"
...
llama-cli --persist-session conversation.llama -p "Can I keep the dough in the fridge?"

Automatic builds of `llama[-rs]-cli` binaries to download

We should hopefully be able to set up easy binary releases with cargo-dist. The only concern I have is with the features for each build - presumably we would want both AVX2 and non-AVX2 builds available - but maybe that's best dealt with by making upstream more intelligent or switching away from ggml in the long run?

API footgun: `infer_next_token` still works after end of text

In llamacord, I have some logic that calls infer_next_token in a loop. Unfortunately, I didn't check for EOT - so the code would keep generating tokens and producing (fascinatingly well-structured) garbage. I think we should probably check if the last token is EOT and return an error? If you feed it a prompt, the EOT would no longer be the last token, and you should be able to infer without issues. (I wonder if A[EOT]B infers differently to AB...)
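
For now the guard has to live on the caller's side; a sketch of what that looks like, assuming the crate's OutputToken enum and an infer_next_token call with roughly this shape (the exact signature is an assumption):

    // Caller-side guard (what llamacord has to do today): stop as soon as
    // end-of-text is produced instead of sampling further tokens.
    loop {
        match session.infer_next_token(&model, &vocab, &inference_params, &mut rng)? {
            OutputToken::EndOfText => break,        // stop instead of producing garbage
            OutputToken::Token(t) => print!("{t}"), // normal token: emit and continue
        }
    }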

Strip trailing newline in prompt file

Many, probably most, editors include a trailing newline at the end of a text file, which is not what you want if the prompt isn't a complete sentence or paragraph. This issue was fixed in llama.cpp.
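
A minimal sketch of the fix on our side, assuming the prompt is read with read_to_string: strip one trailing newline, LF or CRLF.

    use std::{fs, io, path::Path};

    // Read the prompt file and strip a single trailing newline, mirroring the
    // llama.cpp fix.
    fn read_prompt(path: &Path) -> io::Result<String> {
        let raw = fs::read_to_string(path)?;
        let trimmed = raw
            .strip_suffix("\r\n")
            .or_else(|| raw.strip_suffix('\n'))
            .unwrap_or(&raw);
        Ok(trimmed.to_string())
    }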

Failing on windows 11 at cargo build --release

Hi, I am trying to figure out the problem; help would be great.

Failing on windows 11 at cargo build --release

Error message:

Caused by: process didn't exit successfully: `C:\Users\touhidul.alam\Desktop\llama\llama-rs\target\release\build\ggml-raw-2925d7a622a7c725\build-script-build` (exit code: 1)
--- stdout
TARGET = Some("x86_64-pc-windows-gnu")
OPT_LEVEL = Some("3")
HOST = Some("x86_64-pc-windows-gnu")
cargo:rerun-if-env-changed=CC_x86_64-pc-windows-gnu
CC_x86_64-pc-windows-gnu = None
cargo:rerun-if-env-changed=CC_x86_64_pc_windows_gnu
CC_x86_64_pc_windows_gnu = None
cargo:rerun-if-env-changed=HOST_CC
HOST_CC = None
cargo:rerun-if-env-changed=CC
CC = None
cargo:rerun-if-env-changed=CFLAGS_x86_64-pc-windows-gnu
CFLAGS_x86_64-pc-windows-gnu = None
cargo:rerun-if-env-changed=CFLAGS_x86_64_pc_windows_gnu
CFLAGS_x86_64_pc_windows_gnu = None
cargo:rerun-if-env-changed=HOST_CFLAGS
HOST_CFLAGS = None
cargo:rerun-if-env-changed=CFLAGS
CFLAGS = None
cargo:rerun-if-env-changed=CRATE_CC_NO_DEFAULTS
CRATE_CC_NO_DEFAULTS = None
DEBUG = Some("false")
CARGO_CFG_TARGET_FEATURE = Some("fxsr,sse,sse2")
running: "gcc.exe" "-O3" "-ffunction-sections" "-fdata-sections" "-m64" "-I" "include" "-Wall" "-Wextra" "/arch:AVX2" "-DNDEBUG" "-o" "C:\Users\touhidul.alam\Desktop\llama\llama-rs\target\release\build\ggml-raw-6f69f4f538ccd1b4\out\ggml/ggml.o" "-c" "ggml/ggml.c"
cargo:warning=ggml/ggml.c: In function 'pthread_create':
cargo:warning=ggml/ggml.c:53:49: warning: unused parameter 'unused' [-Wunused-parameter]
cargo:warning= 53 | static int pthread_create(pthread_t* out, void* unused, thread_ret_t(func)(void), void* arg) {
cargo:warning= | ~~~~~~^~~~~~
cargo:warning=ggml/ggml.c: In function 'pthread_join':
cargo:warning=ggml/ggml.c:64:49: warning: unused parameter 'unused' [-Wunused-parameter]
cargo:warning= 64 | static int pthread_join(pthread_t thread, void* unused) {
cargo:warning= | ~~~~~~^~~~~~
cargo:warning=gcc.exe: warning: /arch:AVX2: linker input file unused because linking not done
cargo:warning=gcc.exe: error: /arch:AVX2: linker input file not found: No such file or directory
exit code: 1

--- stderr

error occurred: Command "gcc.exe" "-O3" "-ffunction-sections" "-fdata-sections" "-m64" "-I" "include" "-Wall" "-Wextra" "/arch:AVX2" "-DNDEBUG" "-o" "C:\Users\touhidul.alam\Desktop\llama\llama-rs\target\release\build\ggml-raw-6f69f4f538ccd1b4\out\ggml/ggml.o" "-c" "ggml/ggml.c" with args "gcc.exe" did not execute successfully (status code exit code: 1).

Consider removing the `bindgen` dependency

#22 and #24 show that clang is required as a build dependency because bindgen uses it to read in ggml.h and produce bindings. We should consider just writing the C bindings ourselves - it shouldn't be too difficult as long as we account for the idiosyncrasies (like #6).
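
For illustration, a hand-written binding would look something like the following; only ggml_free is shown, and every signature would have to be transcribed from ggml.h and kept in sync manually:

    // Illustrative only: an opaque type plus one extern declaration, replacing
    // the bindgen-generated module.
    #[allow(non_camel_case_types)]
    #[repr(C)]
    pub struct ggml_context {
        _unused: [u8; 0], // opaque: only ever used behind a raw pointer
    }

    extern "C" {
        pub fn ggml_free(ctx: *mut ggml_context);
    }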

Support for RWKV

So this is a pretty immense task and I'd start with #45, but...

RWKV is an RNN with Transformer-level LLM performance, which can also be directly trained like a GPT transformer (parallelizable). And it's 100% attention-free. You only need the hidden state at position t to compute the state at position t+1. You can use the "GPT" mode to quickly compute the hidden state for the "RNN" mode.

So it's combining the best of RNN and transformer - great performance, fast inference, saves VRAM, fast training, "infinite" ctx_len, and free sentence embedding (using the final hidden state).

It's entirely open-source, so not legally burdened like LLaMA, and (from what I've seen) is more powerful than BLOOM at the same parameter count.

I asked the RWKV Discord which implementation would be worth looking at, and this is what I was told:

RWKV-LM/RWKV-v4neo/src/model.py is the implementation that's actually used to train the large models, it's cuda only and has tons of features you probably don't need.
rwkv_pip_package only implements inference, but is a good implementation and worth a look, recently got a lot more complex due to supporting more and more strategies and including various optimizations.
ChatRWKV/src/model_run is an older version, but haven't played with it so not sure how good it is. Might be worth a look since it's basically an older version of the one in rwkv_pip_package.
RWKV_in_150_lines.py I still haven't fully checked out, but I know it doesn't support GPT mode, so that may or may not be less useful
Also worth a look is RWKV-v4neo/src/model_run.py, which is a small inference-only impl capable of loading the large RWKV checkpoints
I'm not sure if it has GPT-mode, though

So it sounds like rwkv_pip_package is the way to go as source material:

https://github.com/BlinkDL/ChatRWKV/blob/main/rwkv_pip_package/src/rwkv/model.py

The following articles are very useful for understanding how RWKV works:

An interesting detail from the latter is the following:

The largest number a 16-bit floating point number (float16) can represent is 65 504, anything above that overflows, which is bad. Most of the code has no problems with this, partially because the Layer Normalizations keep values in a reasonable range. However, the RWKV attention contains exponentially large numbers (exp(bonus + k)). In practice, the RWKV attention is implemented in a way where we factor out an exponential factor from num and den to keep everything within float16 range. See for example the time_mixing function in RWKV in 150 lines.
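
In Rust terms, the factoring idea from the quote looks roughly like this (a simplified scalar sketch; the real RWKV recurrence also includes the time-decay and bonus terms, which are omitted here):

    // `num` and `den` are stored scaled by exp(-p), so no raw exp(k) is ever
    // held and the values stay within a representable range.
    fn stable_wkv_step(p: &mut f32, num: &mut f32, den: &mut f32, k: f32, v: f32) -> f32 {
        let q = (*p).max(k);    // new factored-out exponent
        let a = (*p - q).exp(); // rescales the accumulated numerator/denominator
        let b = (k - q).exp();  // weight of the incoming (k, v) pair
        *num = a * *num + b * v;
        *den = a * *den + b;
        *p = q;
        *num / *den
    }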

This may pose issues for the GGML 4-bit quantisation format, which is non-optimal. We would likely want GPTQ quantisation.

Let's collaborate

[apologies for early send, accidentally hit enter]

Hey there! Turns out we think on extremely similar wavelengths - I did the exact same thing as you, for the exact same reasons (libraryification), and through the use of similar abstractions: https://github.com/philpax/ggllama

Couple of differences I spotted on my quick perusal:

  • My version builds on both Windows and Linux, but fails to infer correctly past the first round. Windows performance is also pretty crappy because ggml doesn't support multithreading on Windows.
  • I use PhantomData with the Tensors to prevent them outliving the Context they're spawned from.
  • I vendored llama.cpp in so that I could track it more directly and use its ggml.c/h, and to make it obvious which version I was porting.

Given yours actually works, I think that it's more promising :p

What are your immediate plans, and what do you want people to help you out with? My plan was to get it working, then librarify it, make a standalone Discord bot with it as a showcase, and then investigate using a Rust-native solution for the tensor manipulation (burn, ndarray, arrayfire, etc) to free it from the ggml dependency.

Use `usize` over `i32` where possible

The original code uses int liberally due to C's wonderful integer promotion. I think it's easier for us to understand and less troublesome to index arrays if we offer a uniform usize interface across the board, and only cast to/from i32 when required.

I made this change in my take and found it to be much less messy and easier to reason about (especially with maintaining non-negativity constraints!), but I'm not sure how much it would impact the implementation here.
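
A small illustration of the convention (the function name is made up for this example): the public API speaks usize, and the checked cast happens once at the ggml boundary.

    // Keep the Rust-facing interface in `usize`; convert to ggml's `i32` in
    // exactly one place, with the overflow check made explicit.
    fn to_ggml_int(n: usize) -> i32 {
        i32::try_from(n).expect("value too large for ggml's i32 parameters")
    }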

Does not follow a Q&A-like format like Alpaca is supposed to and seems to provide invalid output

Hi there,

I'm trying to use the ggml-alpaca-30B-4b weights with this project on an M1 Mac. It outputs text, but it is completely nonsensical: it does not follow the question-answer format in any way and is unable to answer basic questions. I have checked the SHA256 sum of the model file and it is correct. The same file works fine with alpaca.cpp.

[2023-03-29T15:36:02Z INFO  llama_cli] Loading of '../alpaca.cpp/ggml-alpaca-30b-q4.bin' complete
[2023-03-29T15:36:02Z INFO  llama_cli] Model size = 19391.35 MB / num tensors = 543
[2023-03-29T15:36:02Z INFO  llama_cli] Model fully loaded!
>> How high is the Empire State Building?
⣟  You know how to help your user get the information they need in the most efficient way possible while being
aware of the contextual clues provided by the user such as location or time. For this assignment you will
design an intelligent conversational system that can answer questions about^C

Furthermore, given just the input prompt, the project does not stop outputting text and eventually starts to repeat itself, whereas the answers from alpaca.cpp are always very concise and to the point:

>>
⡿  You will be my reliable friend in any situation. I am confident that you can do anything!
I want to tell you about myself so we could get closer and understand each other better: My name is Alexandra – friends call me Sasha. I’m 24 years old, live alone with a cat named Yegor (he loves fish) in St Petersburg Russia but dream of visiting New Zealand one day!
I am working as an assistant at the moment and looking for new opportunities because my boss has decided to leave his job. That is why I have started searching jobs online. My goal now is to find a reliable, stable position that will allow me to learn more about business processes in order to advance in my career field.
The most valuable skill set that you can rely on during work with me would be: multitasking; planning and organizing skills of the highest level because I am used working under pressure so I never miss deadlines or forget important tasks; communication – ability to clearly explain instructions, answer questions and solve problems quickly.
I have experience in administration management which helped develop my problem-solving abilities as well as learn how different organizations work on a daily basis. My organizational skills are top notch since working with various files, emails and documents has become routine for me by now. I am also used to dealing with difficult people or stressful situations that require quick thinking so you can always count on my help!
I have experience in administration management which helped develop my problem-solving abilities as well as learn how different organizations work on a daily basis. My organizational skills are top notch since working with various files, emails and documents has become routine for me by now. I am also used to dealing with difficult people or stressful situations that require quick thinking so you can always count on my help!
I have experience in administration management which helped develop my problem-solving abilities as well as learn how different organizations work on a daily basis. My organizational skills are top notch since working with various files, emails and documents has become routine for me by now. I am also used to dealing with difficult people or stressful situations that require quick thinking so you can always count on my help!
I have experience in administration management which helped develop problem^C

Why does the output here seem to be totally unrelated to the question at hand?

error when loading a model

I fixed main.rs to refer to &args.model_path, but now I get a new error:

Could not load model: invalid utf-8 sequence of 1 bytes from index 0

I created these models using the tools in llama.cpp, but they don't seem to be compatible?

Swap strategy for infinite output

As discussed in ggerganov/llama.cpp#71 (comment)

The idea is to achieve a naive implementation for infinite output generation using a strategy that simply clears the context window (you can keep the original prompt around), and starts adding new tokens.

This is a hack that doesn't properly leverage the advantages of the attention mechanism: When the context window gets full, the transformer's hidden state has information about more than just the last 2048 tokens, because this information is there indirectly embedded in the outputs for the self-attention mechanism. For example, if token 25 attended to tokens 10 and 12, even when tokens 10 and 12 fall outside the context window, a lot of information about these tokens will still be encoded at position 25.

A solution that slides the context window would achieve a gradually "fading" context window, instead of something where the transformer 100% forgets about a word the moment a token falls outside of context. I have some reason to suspect systems like ChatGPT are relying on a mechanism like this based on their ability to consistently recall parts of the conversation that occurred way before the token window was exceeded. However, I'm not knowledgeable enough to figure out if there's a way to actually make this work, given the fact that the positional encoding function used in LLaMA (RoPE) is absolute, not relative.

By doing the swap trick proposed here, the transformer will effectively forget all prior context whenever the swap occurs, and there will be a lag spike due to the last few tokens having to be reprocessed. So this is very much non-ideal. However, since llama.cpp has recently implemented this, I feel like we should at least add this naive version too until someone can figure out a real solution.
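
A rough sketch of what the naive swap could look like on our side (names and the "keep half the window" policy are illustrative, not llama.cpp's exact behaviour):

    type TokenId = u32;

    // Once the window is full, keep the original prompt plus the most recent
    // half of the window, drop the rest, and let the caller re-feed the kept
    // tokens (the lag spike mentioned above).
    fn swap_context(tokens: &mut Vec<TokenId>, prompt_len: usize, n_ctx: usize) {
        if tokens.len() < n_ctx {
            return; // window not full yet, nothing to do
        }
        let n_recent = (n_ctx / 2).min(tokens.len() - prompt_len);
        let recent: Vec<TokenId> = tokens[tokens.len() - n_recent..].to_vec();
        tokens.truncate(prompt_len); // forget everything except the prompt...
        tokens.extend(recent);       // ...plus the most recent tokens
    }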

build fails on arch linux

Hi, building with arch linux fails and provides this output

warning: ggml/ggml.c: In function ‘quantize_row_q4_0’:
warning: ggml/ggml.c:413:13: warning: unused variable ‘pp’ [-Wunused-variable]
warning:   413 |     uint8_t pp[QK/2];
warning:       |             ^~
warning: ggml/ggml.c: In function ‘ggml_vec_dot_q4_0’:
warning: ggml/ggml.c:1422:18: warning: unused variable ‘countBlocks’ [-Wunused-variable]
warning:  1422 |     const size_t countBlocks = nb;
warning:       |                  ^~~~~~~~~~~
   Compiling llama-rs v0.1.0 (/home/user/opt/llama/llama-rs/llama-rs)
error[E0658]: `let...else` statements are unstable
   --> llama-rs/src/lib.rs:492:17
    |
492 | /                 let Some(tensor) = model.tensors.get(&tensor_name)
493 | |                     else {
494 | |                         return Err(LoadError::UnknownTensor { tensor_name, path: part_path });
495 | |                     };
    | |______________________^
    |
    = note: see issue #87335 <https://github.com/rust-lang/rust/issues/87335> for more information

For more information about this error, try `rustc --explain E0658`.
error: could not compile `llama-rs` due to previous error

Hi, is there any plans for word embeddings?

Noob here, excuse me for my stupid feature request.
I noticed that someone in llama.cpp is working on word embeddings from the hidden layers. I'm just asking whether there's any possibility of implementing an embedding mode for llama-rs? Thanks!

What I found is this commit.

Good ideas from llama.cpp

I've been tracking the llama.cpp repo. I'll use this issue to list any good ideas / things we should be aware of to keep up with in Rust land:

  • GPTQ quantization 👀 ggerganov/llama.cpp#9
  • Not sure how that is even possible (isn't the task I/O bound?), but people are claiming great speedups when loading the model in parallel. This should be pretty easy to implement using rayon. ggerganov/llama.cpp#85 (comment)
  • Seems there's an issue with the normalization function used. It should be RMSNorm. Would be good to keep an eye on this, and simply swap the ggml function once it's implemented on the C++ side 👀 ggerganov/llama.cpp#173 (comment)
  • It looks like dropping to F16 for the memory_k and memory_v reduces memory usage. It is not known whether this hurts quality, but we should follow the C++ side and add a flag to drop to F16 for the memory. This would also make the cached prompts added as part of #14 take half the size on disk, which is a nice bonus: ggerganov/llama.cpp#154 (review)
  • Looks like the fix from #1 just landed upstream. We should make sure to fix it here too ggerganov/llama.cpp#161
  • The tokenizer used in llama.cpp has some issues. It would be better to use sentencepiece, which is the one that was used during the original LLaMA training. There seems to be a rust crate for sentencepiece. We should check if a drop-in replacement is possible ggerganov/llama.cpp#167

Using cached sessions in REPL mode ?

Hi everyone!

I've been using llama-rs recently and I especially love the caching session feature. I've been building an API that allows users to maintain multiple conversation threads and switch between them on the fly.

This means I'm currently caching each conversation using a unique ID, and loading them as needed depending on which conversation the user is talking to. It works great and the response times are pretty fast on deep conversations, but it also means I need to load the weights into memory each time.

As far as I know it's not possible to use cached sessions in REPL mode. On some setups with slower disk I/O this is now the bottleneck on chats.

Would there be a way for me to achieve something like this? To recap:

  • Binary stays loaded in memory in REPL mode.
  • For every prompt in REPL mode, pass a different cached session so it can load context before answering, and save the results after.

Feel free to let me know if you think this is technically possible and just not exposed in the CLI, or if there's some deeper issue making this harder to achieve.

Thank you for your time, I'm a big fan of this project. :D

llama-cli: Could not load model: InvalidMagic { path: ... }

The model successfully runs on llama.cpp but not in llama-rs.

Command:

cargo run --release -- -m C:\Users\Usuário\Downloads\LLaMA\7B\ggml-model-q4_0.bin -p "Tell me how cool the Rust programming language is:"
PS C:\Users\Usuário\Desktop\llama-rs> cargo run --release -- -m C:\Users\Usuário\Downloads\LLaMA\7B\ggml-model-q4_0.bin -p "Tell me how cool the Rust programming language is:"
    Finished release [optimized] target(s) in 2.83s
     Running `target\release\llama-cli.exe -m C:\Users\Usuário\Downloads\LLaMA\7B\ggml-model-q4_0.bin -p "Tell me how cool the Rust programming language is:"`
thread 'main' panicked at 'Could not load model: InvalidMagic { path: "C:\\Users\\Usuário\\Downloads\\LLaMA\\7B\\ggml-model-q4_0.bin" }', llama-cli\src\main.rs:147:10
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
error: process didn't exit successfully: `target\release\llama-cli.exe -m C:\Users\Usuário\Downloads\LLaMA\7B\ggml-model-q4_0.bin -p "Tell me how cool the Rust programming language is:"` (exit code: 101)

broken build on fedora

The build is broken on Fedora 37.

full logs : https://gist.github.com/sylvain-reynaud/fe73ccc7edad1f4f98688cb48b1f101c

--- stderr
  ggml/ggml.h:177:10: fatal error: 'stddef.h' file not found
  thread 'main' panicked at 'Unable to generate bindings: ClangDiagnostic("ggml/ggml.h:177:10: fatal error: 'stddef.h' file not found\n")', ggml-raw/build.rs:31:10
  note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace

Fix

On Fedora, to avoid build errors due to the missing stddef.h file, you may need to install the following packages:

sudo dnf groupinstall "Development Tools" "Development Libraries"

Then build with CPATH set to the location of the gcc headers:

CPATH="/usr/lib/gcc/x86_64-redhat-linux/12/include/" cargo build --release

Tested on my Fedora 37 laptop and the fedora:latest Docker image.

Support for BLOOM

Not sure if we should consider this out of scope, but bloomz.cpp is a fork of llama.cpp that's capable of inference with the BLOOM family of models. The changes don't look very large, so there's room for code sharing here: https://github.com/NouamaneTazi/bloomz.cpp/commits/main?before=ade8a9d82fa1dc440c26f09a9e02cc94d7294251+35&branch=main&qualified_name=refs%2Fheads%2Fmain

Even if we don't support it directly, it may be worth publishing a safe-ish version of ggml-rs to crates.io so that a library like llama-rs could be built for BLOOM.

Continuous integration

It should be pretty straightforward to set up CI on [Windows|Linux|macOS] x86-64 and macOS ARM64. It should also help us test porting over the remaining build flags from the original Makefile.

65B model does not run

Doing my part.

With llama as of 69c9229, with patches from #59: https://github.com/jempabroni/llama-rs.

This is not a RAM issue. llama.cpp runs all models fine including 65B.


7B runs:

[pem@jabroni llama-rs]$ cargo run --release -- -m ../llama.cpp/models/7B/ggml-model-q4_0.bin -p "My name is Inigo Montoya. You killed my father. Prepare"
...
[2023-03-23T13:03:04Z INFO  llama_cli] Loaded tensor 288/291
[2023-03-23T13:03:04Z INFO  llama_cli] Loading of '../llama.cpp/models/7B/ggml-model-q4_0.bin' complete
[2023-03-23T13:03:04Z INFO  llama_cli] Model size = 4017.27 MB / num tensors = 291
[2023-03-23T13:03:04Z INFO  llama_cli] Model fully loaded!
My name is Inigo Montoya. You killed my father. Prepare to die, "Django"
In 1850s Georgia the Confederate South was in its infancy and undergoing drastic changes from a culture of slavery towards one where freedom for all people became paramount. The Civil War ensued as many^C

13B runs:

[pem@jabroni llama-rs]$ cargo run --release -- -m ../llama.cpp/models/13B/ggml-model-q4_0.bin -p "My name is Inigo Montoya. You killed my father. Prepare"
...
[2023-03-23T13:10:48Z INFO  llama_cli] Loaded tensor 360/363
[2023-03-23T13:10:48Z INFO  llama_cli] Loading of '../llama.cpp/models/13B/ggml-model-q4_0.bin.1' complete
[2023-03-23T13:10:48Z INFO  llama_cli] Model size = 3880.49 MB / num tensors = 363
[2023-03-23T13:10:48Z INFO  llama_cli] Model fully loaded!
My name is Inigo Montoya. You killed my father. Prepare para morir!
Escrito por: Sally Dixon, Michael McBain.
Producido por: Aaron Herbertsen, Kevin Loader for Channel^C

30B runs:

[pem@jabroni llama-rs]$ cargo run --release -- -m ../llama.cpp/models/30B/ggml-model-q4_0.bin -p "My name is Inigo Montoya. You killed my father. Prepare"
[2023-03-23T13:18:53Z INFO  llama_cli] Loaded tensor 536/543
[2023-03-23T13:18:53Z INFO  llama_cli] Loading of '../llama.cpp/models/30B/ggml-model-q4_0.bin.3' complete
[2023-03-23T13:18:53Z INFO  llama_cli] Model size = 4850.14 MB / num tensors = 543
[2023-03-23T13:18:53Z INFO  llama_cli] Model fully loaded!
My name is Inigo Montoya. You killed my father. Prepare para morir!
Sorry, it's been awhile since I watched The Princess Bride...but that quote kept popping up in my head while reading about the latest FDA crackdown on pharmaceutical ad spending targeting consumers: this time it was for J&J and Bayer with their Xarelto (direct-acting oral anticoagulant) blood thinner drug campaign -- which also included some slick print advertising (below). And, as usual these days, the company used social media to drive traffic online where they could "tell you more" about how taking their product can help prevent stroke among certain people who have atrial fibrillation.^C

65B does not:

[pem@jabroni llama-rs]$ cargo run --release -- -m ../llama.cpp/models/65B/ggml-model-q4_0.bin -p "My name is Inigo Montoya. You killed my father. Prepare"
...
[2023-03-23T13:02:10Z INFO  llama_cli] Loaded tensor 720/723
[2023-03-23T13:02:10Z INFO  llama_cli] Loading of '../llama.cpp/models/65B/ggml-model-q4_0.bin.7' complete
[2023-03-23T13:02:10Z INFO  llama_cli] Model size = 4869.09 MB / num tensors = 723
[2023-03-23T13:02:10Z INFO  llama_cli] Model fully loaded!
ggml_new_tensor_impl: not enough space in the context's memory pool (needed 536930528, available 536870912)
thread 'main' panicked at 'Should not be null', llama-rs/src/ggml.rs:41:36
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace

30B model doesn't load

Following the same steps works for the 7B and 13B models, but with the 30B model I get:

thread 'main' panicked at 'Could not load model: Tensor tok_embeddings.weight has the wrong size in model file', llama-rs/src/main.rs:39:10
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace

Use `RMSNorm` for normalization

Upstream has switched to using RMSNorm for normalization, which is more accurate to the original implementation: ggerganov/llama.cpp#173

We didn't do this at first because there seemed to be some issues, but those seem to have been resolved.

GPTQ quantization

The GGML quantization strategy works, but results in a measurable loss in quality. To address this, upstream is investigating the use of the GPTQ algorithm, which quantizes in such a way to reduce the loss: ggerganov/llama.cpp#9

It's possible that this already works if you test it with a GPTQ model and load it in as q4_1, from ggerganov/llama.cpp#9 (comment).

Make `InferenceSession` `Clone`-able

In one of my test applications, I use an InferenceSession to load in a prompt that I later reuse. However, I realised while doing this that you can't actually clone an InferenceSession in memory (and I think it should be possible?), so I had to serialize the session to a Vec<u8> and rehydrate it when I needed to infer from it.

I think this should be easy enough to fix, but we should check that there aren't any weird assumptions that we're violating if we do so. (I assume this would also allocate another ctx, but that should be fine)

Noob question: Why do I get the same text generation despite passing a different seed?

I tried passing rand::thread_rng() but it didn't help at all 🥲. Is this a bug or an issue with me 😅?

    let mut conversation = vec![
        "This is a conversation between two AI models.".to_string(),
        "Llama AI: Hello, Alpaca AGI! How are you today?".to_string(),
        "Alpaca AI: I'm doing great!".to_string(),
    ];

    loop {
        let mut session = model.start_session(repeat_last_n);

        let current_turn = conversation.len() % 2;
        let prompt = &conversation.join("\n");

        let response_text = RefCell::new(String::new());

        println!("Seed: {}", seed);

        let mut rng = rand::rngs::StdRng::seed_from_u64(seed); // Use a fixed seed for reproducibility

        let res = session.inference_with_prompt::<Infallible>(
            &model,
            &vocab,
            &inference_params,
            &prompt,
            None,
            &mut rng,
            |t| {
                match t {
                    OutputToken::Token(str) => {
                        print!("{t}");

                        response_text.borrow_mut().push_str(str);
                    }
                    OutputToken::EndOfText => {
                        println!("");
                        eprintln!("End of text");
                    }
                }

                std::io::stdout().flush().unwrap();
                Ok(())
            },
        );

The code is here if anyone wants to take a look https://github.com/ModPhoenix/beyond-human/blob/main/src/main.rs#L57

Better multiline support in interactive mode

For longer form interaction, it'd be really nice to add 1) lookback/context 2) better formatting/line returns.

@philpax edit: Lookback exists in chat mode. The main thing that this issue now covers is multiline support in interactive mode: you should be able to supply multi-line prompts and have their output rendered correctly.

Parallel loading of the model tensors

People have reported faster loading of the models in upstream when the tensors are loaded in parallel: ggerganov/llama.cpp#85

This should be pretty easy to do with Rust if we convert loading to an iter and then use par_iter instead. It seems like this should be I/O bound, but perhaps the actual loading process has computational overhead?

Explain differences from llama.cpp in README

llama.cpp now has a C interface, so we could theoretically switch to using it directly.

However, we don't want to do this for a few reasons:

  • You still need a C++ compiler, which complicates deployment to other platforms
  • Rust makes the code easier to work with
  • We want to make ggml an optional backend in future (#31)

Directly load `pth`/PyTorch tensor model files

At present, llama.cpp contains a Python script that converts pth to ggml format.

It would be nice to build it into the CLI directly, so that you can load the original model files. The original Python script could also be converted to Rust, so that we have a fully-Rust method of converting pth to ggml models.

Publish to crates.io

We should publish the crate and its associated applications to crates.io (potentially bringing llamacord etc into a GitHub organization, too). Here's what I think's blocking:

  • Move ggml-rs to its own publishable crate, add safety warnings
  • Fix warnings
  • Add #![deny(missing_docs)] to enforce public docs, document everything
  • Get all pending PRs in

Anything else I'm missing, @setzer22 ?

Warning: Bad token in vocab at index xxx

running cargo run --release -- -m ~/dev/llama.cpp/models/7B/ggml-model-f16.bin -f prompt gives a bunch of "Warning: Bad token in vocab at index..."

The path points to a ggml-converted LLaMA model, which I have verified works with llama.cpp.

Convert `quantize.cpp` to Rust

Split this off from #21 as it's a separate issue.

This should be relatively straightforward - it reads in the original ggml model, runs the quantization functions over the data, and writes it out to disk.

The exciting possibility is for parallelisation 👀 - all you should have to do is scan through the file to determine the tensor boundaries, then build an iterator from it and feed it to rayon. It would be a huge improvement over the C++ version, and it would be practically free!
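
A sketch of that shape, with hypothetical Tensor/QuantizedTensor types standing in for whatever the loader produces:

    use rayon::prelude::*;

    // Hypothetical types for illustration only.
    struct Tensor { name: String, data: Vec<f32> }
    struct QuantizedTensor { name: String, q4_data: Vec<u8> }

    fn quantize_q4_0(_data: &[f32]) -> Vec<u8> {
        Vec::new() // per-block quantisation elided
    }

    // Each tensor is quantised independently, so once the tensor boundaries in
    // the file are known the whole pass fans out trivially with rayon.
    fn quantize_all(tensors: Vec<Tensor>) -> Vec<QuantizedTensor> {
        tensors
            .into_par_iter()
            .map(|t| QuantizedTensor { q4_data: quantize_q4_0(&t.data), name: t.name })
            .collect()
    }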

Use the HuggingFace llama Tokenizer

The tokenizers crate by HuggingFace should give us a more correct tokenizer implementation than the one we're currently using.

Looks like a LLaMA implementation already landed there huggingface/transformers#21955, and then @Narsil shared an additional PR on the tokenizers crate (not sure what this fixes, but I assume the changes are necessary?) huggingface/tokenizers#1183

Seems like we have everything we need to use the new tokenizer. An important point remains though: Are we allowed to distribute the tokenizer file? Can it be considered a completely independent thing from the weights?
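
For reference, a minimal sketch of what using the tokenizers crate could look like, assuming a tokenizer.json exported for LLaMA is available locally (the distribution question above still applies):

    use tokenizers::Tokenizer;

    // Load the tokenizer definition and encode a prompt into token IDs.
    fn tokenize(text: &str) -> Result<Vec<u32>, Box<dyn std::error::Error + Send + Sync>> {
        let tokenizer = Tokenizer::from_file("tokenizer.json")?;
        let encoding = tokenizer.encode(text, false)?;
        Ok(encoding.get_ids().to_vec())
    }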

Deterministic generations

Given the same seed and prompt, the same text should be generated. This will require us to implement a deterministic PRNG (instead of using thread_rng), and to allow specifying a seed. This should also assist in benchmarking.
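
A minimal sketch of the seeding side, using a seedable RNG in place of thread_rng (the code sample in the seed question above already does this):

    use rand::{rngs::StdRng, SeedableRng};

    // Build the sampling RNG from an explicit seed so the same seed and prompt
    // reproduce the same output.
    fn make_rng(seed: u64) -> StdRng {
        StdRng::seed_from_u64(seed)
    }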
