
Comments (59)

ggerganov avatar ggerganov commented on July 24, 2024 13

Infinite generation should now be supported. The current implementation works like this:

  • Keep generating until the context n_ctx (i.e. 2048) becomes full
  • When full, set n_past = n_keep, where n_keep is a user-provided parameter. By default, it is 0. It can be set to something in order to get a "static" prompt. See examples/chat.sh for how we make Bob's instructions a "static" prompt. You can observe the "static" prompt by adding the --verbose-prompt argument
  • These n_keep tokens are instantly available thanks to the KV cache, so no need to recompute anything so far
  • Next, we also pick half of the n_ctx - n_keep tokens that were generated last and insert them for re-inference. This is currently happening serially so there is a delay when it occurs

For example, this command should generate text forever:

make -j && ./main -m ./models/7B/ggml-model-q4_0.bin -p "I believe the meaning of life is" -c 512 -t 8 -n -1 --ignore-eos 
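A minimal sketch of that swap logic, in case it helps readers follow along (hypothetical names like context_swap and history; this is not the exact code in examples/main):

#include <vector>

// Sketch of the context swap described above (hypothetical helper, not repo code).
// 'history' holds all tokens seen so far, oldest first, and must contain at least
// (n_past - n_keep) / 2 entries; 'embd' is what will be fed to the next eval call.
void context_swap(int n_ctx, int n_keep, int &n_past,
                  const std::vector<int> &history,
                  std::vector<int> &embd)
{
    if (n_past < n_ctx) return;              // context not full yet, nothing to do

    const int n_left = n_past - n_keep;      // tokens that fall out of the window
    n_past = n_keep;                         // the first n_keep tokens stay in the KV cache

    // re-feed the last n_left/2 tokens so generation continues with recent context;
    // this re-evaluation is what causes the pause mentioned in later comments
    embd.insert(embd.begin(), history.end() - n_left / 2, history.end());
}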

from llama.cpp.

ggerganov avatar ggerganov commented on July 24, 2024 6

Hmm, I think yes - we need to shift the KV cache. I haven't implemented this yet in any of the ggml examples.
And when the context is full - stop increasing n_past.

from llama.cpp.

rabidcopy avatar rabidcopy commented on July 24, 2024 2

Yeah, the context swap is not free and causes a pause for me when it happens. Still beats the former behavior, or lack thereof. Long starting prompts and large contexts really start to eat into speed. I never thought I'd want to upgrade my CPU and RAM for CPU-based inference. Really looking forward to breakthroughs and "OH" moments. I think a useful strategy for now is to really condense starting prompts so you cover as much as possible with fewer tokens.

from llama.cpp.

ggerganov avatar ggerganov commented on July 24, 2024 2

@tjohnman

make clean
LLAMA_OPENBLAS=1 make

from llama.cpp.

eous avatar eous commented on July 24, 2024 1

Depending on how much memory you have you can increase the context size to get longer outputs. On a 64gb machine I was able to have a 12k context with the 7B model and 2k context with the 65B model. You can change it here

from llama.cpp.

j-f1 avatar j-f1 commented on July 24, 2024 1

e0213e0

from llama.cpp.

DKormann avatar DKormann commented on July 24, 2024 1

it seems there are two possible solutions:

swap idea:
while using the model, take half of the input tokens and start building inference over them in the background. When out of context size, swap to the second instance. The model will lose all memory of everything before the new context window.
-: this is computationally intensive
+: I am confident that this is technically possible

rolling context
the problem here is positional encoding. The original positional encoding for transformers is not able to do rolling context because the transformer is trained on fixed encodings. Something like the ALiBi encoding might work for this. We can't choose the encoding because Facebook chose RoPE during training, as mentioned. Rotary positional encoding sounds promising, but after skimming the paper it doesn't seem probable to me that RoPE would support something like that.
+: would be highly elegant.
+: as mentioned, the model could carry over information from past the context window.
-: It seems unlikely to me that it's possible (I'm not confident in that prediction)

from llama.cpp.

anzz1 avatar anzz1 commented on July 24, 2024 1

Please correct me if I'm wrong, since I have yet to fully grasp LLMs in general and I'm still very much in the learning phase.

If I understood correctly, basically the general idea of these models is to infer the next token by answering this question: "Given the previous token stream X, what should be the next token?" using probability analysis.

And that includes the end-of-stream token, correct? So when the calculation ends up predicting "stop here", it stops there.

Currently there is the option to ignore that token for the whole session since this commit 50fae10

What is your opinion on including "magic keywords" in the interactive mode to control the variables? That could be extended to other variables too, but what I'm specifically thinking about here is having a "continue" magic keyword. So in the normal operation mode (not overridden by --ignore-eos), after reaching the end-of-stream token and control being given back to the user, inputting "continue" wouldn't put that in the input stream but rather skip the eos token and continue from there, once. It'd be more flexible than having a global variable for the whole session.

While I can't be obviously certain, it seems to me that this is how openai's interactive chatgpt demo does it.

This would only solve the "reached end-of-stream token too early" problem though, while the other half is the need for a sliding context window to be able to have infinite output.

Again, please feel free to correct me if I understood something wrong.

from llama.cpp.

anzz1 avatar anzz1 commented on July 24, 2024 1

Can confirm that on the latest commit b391579

the infinite generation mode with main -m ./models/llama-13B-ggml/ggml-model-gptq4.bin -n -1 -c 2024 --ignore-eos --n_parts 1 --interactive-first works. Also, for the first time since the tokenizer change, I'm able to run it indefinitely without any crashes, so it seems that the segfault problem has also been fixed recently.

However, as my CPU is pretty underwhelming (i5-6600K, 4c/4t), after ~1800 words the generation slows to such a crawl that not a single token gets generated in several minutes. Note that I said words because there isn't a log of how many tokens were generated. I'm thinking of adding an option where interjecting with Ctrl+C would print debug info like the number of tokens generated so far, since with my current hardware I'm not able to do things like attach a debugger and run traces; the performance penalty from debugging tools is too big to do anything useful.

This PR also looks very interesting for the purposes of debugging. Especially the link between output length and speed degradation could be researched to understand the cause of the bottleneck better.

from llama.cpp.

matthew-mcallister avatar matthew-mcallister commented on July 24, 2024

Huh... I thought the context size was determined when the model was trained due to the positional encoding used. (I am only a layman.) But #78 is still useful for when you eventually hit the limit of your context, right?

from llama.cpp.

drewcrawford avatar drewcrawford commented on July 24, 2024

When trying large contexts, I often encounter

ggml_new_tensor_impl: not enough space in the context's memory pool (needed 702840624, available 701883745)

I played around a bit with increasing ctx_size but that did not work. I suspect underlying memory UB as the cause, as lldb seems to trap on some suspicious memory accesses.

from llama.cpp.

Vent3st avatar Vent3st commented on July 24, 2024

Depending on how much memory you have you can increase the context size to get longer outputs. On a 64gb machine I was able to have a 12k context with the 7B model and 2k context with the 65B model. You can change it here

Your link goes to this code snippet: if (!llama_model_load(params.model, model, vocab, 512)) { // TODO: set context from user input ?? Is this the correct place to change the context size?

from llama.cpp.

bitRAKE avatar bitRAKE commented on July 24, 2024

ggml_new_tensor_impl: not enough space in the context's memory pool (needed 704905456, available 704155676) Assertion failed: false, file ggml.c, line 2516

@drewcrawford for me, that error doesn't appear to be context size related. I've run the same prompt at different context sizes and they all fail.

from llama.cpp.

eous avatar eous commented on July 24, 2024

Typically if you get the "not enough space in the context" error, you tried setting the context too large, though on the larger models I have had to tweak this line too. The math around memory allocation in this thing doesn't scale perfectly on the larger models, and unfortunately my fork has diverged substantially from master and I'm too lazy to work on merging.

llama_model_load: loading model from 'models/13B/ggml-model-q4_0.bin' - please wait ...
llama_model_load: n_vocab = 32000
llama_model_load: n_ctx   = 8192
llama_model_load: n_embd  = 5120
llama_model_load: n_mult  = 256
llama_model_load: n_head  = 40
llama_model_load: n_layer = 40
llama_model_load: n_rot   = 128
llama_model_load: f16     = 2
llama_model_load: n_ff    = 13824
llama_model_load: n_parts = 2
llama_model_load: ggml ctx size = 20562.56 MB
llama_model_load: memory_size = 12800.00 MB, n_mem = 327680

8192 context size on quantized 13B model
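For reference, the memory_size values in logs like this follow directly from the model dimensions, assuming the K and V caches are stored as f32 (which is what these numbers imply). A small sanity-check calculation:

#include <cstdio>

int main() {
    // 13B log above: n_layer = 40, n_ctx = 8192, n_embd = 5120
    const long long n_layer = 40, n_ctx = 8192, n_embd = 5120;
    const long long kv_bytes = 2LL /* K and V */ * n_layer * n_ctx * n_embd * sizeof(float);
    printf("memory_size = %.2f MB\n", kv_bytes / 1024.0 / 1024.0);  // 12800.00 MB, matching the log
    return 0;
}

The 65B log further down (n_layer = 80, n_ctx = 2048, n_embd = 8192) gives 10240.00 MB with the same formula.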

from llama.cpp.

ggerganov avatar ggerganov commented on July 24, 2024

Yes, the math for computing necessary memory for the ggml buffers and graphs has to be updated.

from llama.cpp.

Piezoid avatar Piezoid commented on July 24, 2024

What's the best way to enable infinite output? Can we just shift-out the old contexts in K and V tensors (along the n_ctx dim) when they are full, or is there a better approach?

from llama.cpp.

bitRAKE avatar bitRAKE commented on July 24, 2024

The 65B model sometimes crashes with 2k, but usually works ...

llama_model_load: loading model from './models/65B/ggml-model-q4_0.bin' - please wait ...
llama_model_load: n_vocab = 32000
llama_model_load: n_ctx   = 2048
llama_model_load: n_embd  = 8192
llama_model_load: n_mult  = 256
llama_model_load: n_head  = 64
llama_model_load: n_layer = 80
llama_model_load: n_rot   = 128
llama_model_load: f16     = 2
llama_model_load: n_ff    = 22016
llama_model_load: n_parts = 8
llama_model_load: ggml ctx size = 49157.73 MB
llama_model_load: memory_size = 10240.00 MB, n_mem = 163840

The increased context doesn't increase memory use (still under 41GB) - until the context gets filled.

Yet, on the 7B model the calculation overflows at the start for 16k or greater. Usually, 15360 will work.
llama_model_load: ggml ctx size = 17592186044337.35 MB

from llama.cpp.

eous avatar eous commented on July 24, 2024

https://github.com/eous/llama.cpp/commit/e0213e08c03a3ac72cdec4596b872073b51655aa here is some easy stuff I pulled out of my local hackery if anyone wants to play with it

from llama.cpp.

eous avatar eous commented on July 24, 2024

Btw if anyone wants to slice/dice/refactor/cleanup/dissect/mixup/etc that changeset feel free, I don't need to be credited.

from llama.cpp.

LoganDark avatar LoganDark commented on July 24, 2024

https://github.com/eous/llama.cpp/commit/e0213e08c03a3ac72cdec4596b872073b51655aa here is some easy stuff I pulled out of my local hackery if anyone wants to play with it

It's only been 4 days and this is apparently already a dead link. Is that because this has already been added, or?

from llama.cpp.

Piezoid avatar Piezoid commented on July 24, 2024

I'm wondering what is happening when the context goes over the 2048 token limit (#274). The model seems to break down quickly, in unexpected ways.

As someone not initiated in the field, I asked ChatGPT, but I'm not sure if I should believe it:

If the KV memory contains more contexts during inference than were used during training, it can lead to a phenomenon known as "model fragmentation". Model fragmentation occurs when the model becomes unable to retrieve useful information from the KV memory because the stored contexts are too diverse or unrelated to the current context.
During training, the model learns to associate the input text with the corresponding output text by processing a set of training examples. If the KV memory contains more contexts during inference than were used during training, the model may encounter contexts that it has not seen before, which can lead to difficulties in retrieving relevant information from the memory. In other words, the model may not be able to generalize well to unseen contexts if the KV memory contains too many unrelated or diverse contexts.

From what I gather, an excessive number of retrieved values are summed up in the output, resulting in a signal loss. However, I find it surprising that the degradation is so fast and severe.


What are the strategies for rolling or pruning the contexts? As briefly discussed, the trivial one is a rolling window (FIFO) that discards tokens older than n_ctx. AFAIK, the contexts in the KV memory are not position (row) dependent, and we should be able to overwrite old contexts by wrapping around the index.

ChatGPT told me that another common strategy uses an LRU priority queue. Querying a context is not a binary but an analog process. I guess that some sort of threshold can be used to tell when a specific context is queried. Or maybe derive some sort of score using an exponential smoothing of the past Query · Key similarities.

I browsed some of the references provided by ChatGPT on this matter and they seemed to be mostly hallucinated, so I wouldn't trust it.
It would be very helpful if someone knowledgeable could offer their perspective.

from llama.cpp.

setzer22 avatar setzer22 commented on July 24, 2024

Hmm, I think yes - we need to shift the KV cache. I haven't implemented this yet in any of the ggml examples. And when the context is full - stop increasing n_past.

Hi @ggerganov! (and anyone else interested πŸ‘€) I wanted to reach out to you because I've been working on implementing the idea you mentioned about shifting the KV cache to support infinite output mode in LLaMA. I've run some experiments, and it seems like we might need to rethink the approach.

I implemented shifting for the KV cache (and I triple-checked everything), but after the window was shifted by a single token, the model started to output garbage. After a lot of testing and frustration, it hit me: positional encoding.

I realized that the embeddings stored in the memory_k and memory_v tensors indirectly store positionally encoded data. Therefore, shifting their contents messes with the semantics of those embeddings. While the code in llama_eval computes the values for cur_k and cur_v from the inpL before RoPE is applied, this is only true for the first layer. For every other layer, the contents of inpL are replaced by the output of the previous layer, meaning the tensors for every other layer already contain positional information in an indirect way.
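To make the positional-encoding point concrete, here is the standard RoPE identity (general math, not code from this repo): queries and keys are rotated by an angle proportional to their absolute positions, and only their dot product recovers a purely relative dependence,

\tilde{q}_m = R_{\Theta,m}\, q_m, \qquad \tilde{k}_n = R_{\Theta,n}\, k_n, \qquad \tilde{q}_m^{\top} \tilde{k}_n = q_m^{\top} R_{\Theta,\,n-m}\, k_n .

So even where the cached K is stored un-rotated and RoPE is re-applied at eval time, the outputs of earlier layers already mix in attention results computed at specific absolute positions, and moving those activations to a different slot without accounting for the rotation changes what they mean.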

I'm not entirely sure if my reasoning is correct, but my results seem to validate the idea that shifting the KV cache might not be enough. I'm reaching out to you (and anyone else who understands this) to see if you agree with my conclusions and to ask if you have any suggestions on how to proceed.

I think it would be really beneficial to have a way to slide the context window for LLaMA inference. This could be the key to unlocking a true ChatGPT clone with LLaMA. While having a fixed 2048 token window is a good start, being able to slide that window would enable the self-attention mechanism to remember much more than just the last 2048 tokens.

So anyway, sorry for the somewhat incoherent wall of text. My point is, I wanted to reach out because I'm out of ideas here, but I'm eager to hear your thoughts and any suggestions you (or others reading this) might have. πŸ˜„

from llama.cpp.

tjohnman avatar tjohnman commented on July 24, 2024

@setzer22 I only understand what you're saying in part, but reading your post made me think of something. In abstract terms, would it be possible in some way, instead of having a context that shifts, to have two contexts and do some sort of swap strategy?

I'm sure that carries its own issues. Due to my ignorance I can't be more specific, but I'm throwing the idea out there just in case it applies in some way or inspires you to think of something else.

from llama.cpp.

ggerganov avatar ggerganov commented on July 24, 2024

@setzer22
That's true - it's not as simple as we initially thought.
How does it work in the Python code? Maybe we get an idea from there

from llama.cpp.

setzer22 avatar setzer22 commented on July 24, 2024

to have two contexts and do some sort of swap strategy?

@tjohnman I'm unsure what you mean by a swap strategy here. My first guess is that swapping things out wouldn't work. It sounds similar to clearing the full context window, unless I'm misunderstanding something. 😅

That's true - it's not as simple as we initially thought.
How does it work in the Python code? Maybe we get an idea from there

@ggerganov I've only done a very quick look to the python code (assuming you mean https://github.com/facebookresearch/llama/), but I haven't seen anything referring to a sliding window, so I'm not sure if that's implemented there.

from llama.cpp.

tjohnman avatar tjohnman commented on July 24, 2024

@setzer22 Yes, it's a very naΓ―ve idea I was proposing in hopes that more knowledgeable minds would be inspired somehow by it. If (or until) the context window can be shifted properly somehow, maybe a good compromise could be to use a new, cleared context but carrying over information from the previous one.

Again, perhaps it's a very naΓ―ve solution, but what about:

  1. Save the last n tokens.
  2. Clear the context window.
  3. Prompt/prime it using the saved tokens.

Would something like this work?
Or perhaps something similar to this idea: https://twitter.com/miolini/status/1635559164297752577

Leveraging the model itself to summarize previous information and seeding a brand new context with it every time it fills up.

from llama.cpp.

Piezoid avatar Piezoid commented on July 24, 2024

@ggerganov I've only done a very quick look to the python code (assuming you mean https://github.com/facebookresearch/llama/), but I haven't seen anything referring to a sliding window, so I'm not sure if that's implemented there.

So far, I haven't seen it used in fb/llama. I've been searching for a reference implementation of RoPE-based inference with a sliding window, but I haven't had any luck finding one. It would be great to have one to learn from.

I'm still going over the RoPE paper and haven't quite figured out how it relates to the ggml implementation. But I think we'll need to do some shuffling for the i2 < ne2 dimension.
(Edit: with this "Position Information in Transformers" review, it's easier to see how all the pieces fit together.)

from llama.cpp.

theontho avatar theontho commented on July 24, 2024

How about looking at what https://github.com/oobabooga/text-generation-webui does? It can also do 4-bit llama with GPUs & CPUs and has an infinite chatbot mode too. It's way more work to set up than llama.cpp, and llama.cpp with llama-30b on an M1 Pro seems to perform better than a 3090 can with oobabooga, so I'm looking forward to llama.cpp getting this feature. @oobabooga is also pretty friendly.

from llama.cpp.

tjohnman avatar tjohnman commented on July 24, 2024

@DKormann Forgive me because I'm still a total layman when it comes to the terminology. When you say "building inference" you mean using those tokens for prediction? The idea I got in very simple terms:

  • You have two contexts.
  • When you get to 50% capacity on the first one, start filling up the second one in parallel.
  • When you get to 100% capacity on the first, switch to the second (which is now at 50%).
  • Create a new one.
  • Repeat.

This effectively means your context is half of the size, but you can keep going forever like this. Would this work?

from llama.cpp.

DKormann avatar DKormann commented on July 24, 2024

@tjohnman first off, I'm just a layman too :)
'Building inference' is just me trying to describe that you need to feed the previous tokens into the model again. I understand the approach you are suggesting. I want to stress that feeding 50% of the context size into the model again requires half of the computation again.
As soon as you swap to the next instance, that instance is already half full, right? So you will need to start filling up a new instance again. Every token will go through the model twice (apart from the tokens at the very beginning). So it will double the computational effort. I'm not sure how hard it would be to have a model operate on two contexts simultaneously.

It would be nice to fill the second context while waiting for user input. Maybe one could hide a lot of the extra cost from the user.

from llama.cpp.

tjohnman avatar tjohnman commented on July 24, 2024

Doubling costs is no joke. Unless this can be optimized somehow.

Building inference on the second context while waiting for input is not a bad idea, but if I compare the time I spend reading the output versus typing input, I don’t think that will be enough.

I can live with having the speed halved and RAM usage doubled when using the 7B model because it generates quite fast. But I don’t know if the effort of implementing it is worth it if the performance is going to be prohibitive in most use cases.

Perhaps someone with knowledge about the internals of the process can shed some light on whether this could be implemented in a saner way.

from llama.cpp.

rabidcopy avatar rabidcopy commented on July 24, 2024

Infinite generation should now be supported. The current implementation works like this:

  • Keep generating until the context n_ctx (i.e. 2048) becomes full
  • When full, set n_past = n_keep, where n_keep is a user-provided parameter. By default, it is 0. It can be set to something in order to get a "static" prompt. See examples/chat.sh for how we make Bob's instructions a "static" prompt
  • These n_keep tokens are instantly available thanks to the KV cache, so no need to recompute anything so far
  • Next, we also pick half of the n_ctx - n_keep tokens that were generated last and insert them for re-inference. This is currently happening serially so there is a delay when it occurs

For example, this command should generate text forever:

make -j && ./main -m ./models/7B/ggml-model-q4_0.bin -p "I believe the meaning of life is" -c 512 -t 8 -n -1 --ignore-eos 

This is great. Gonna test it out soon. One question: if --keep isn't specified or is 0, does it just use the initial prompt's size, or is the initial prompt discarded once n_ctx is reached? (I think the former is the case, unless I'm misinterpreting e2d490d#diff-2d3599a9fad195f2c3c60bd06691bc1815325b3560b5feda41a91fa71194e805R201?)

from llama.cpp.

ggerganov avatar ggerganov commented on July 24, 2024

If --keep 0 there is no "static" prompt. So when the context is full with n_ctx tokens, we will pick the second half of them [n_ctx/2, n_ctx] and use that as a new prompt. The initial prompt that has been provided will eventually disappear after one or more swaps / rotations of the context.

Currently, the new context is constructed as n_keep + last (n_ctx - n_keep)/2 tokens, but this can also become a user-provided parameter. For example, instead of always picking half of the tokens, we can pick a specific number of tokens or a percentage. It's easy to extend in such ways.
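To put concrete numbers on that (a worked example with assumed values, not output from the program): with -c 512 and --keep 100, a swap leaves n_keep + (n_ctx - n_keep)/2 = 100 + 206 = 306 tokens in the context, so 206 slots are free before the next swap. A tiny sanity check:

#include <cstdio>

int main() {
    const int n_ctx  = 512;    // -c
    const int n_keep = 100;    // --keep (assumed prompt size for this example)
    const int n_left = n_ctx - n_keep;       // 412 tokens subject to rotation
    const int n_new  = n_keep + n_left / 2;  // tokens in the context right after a swap
    printf("after swap: %d/%d used, %d free until the next swap\n",
           n_new, n_ctx, n_ctx - n_new);     // 306/512 used, 206 free
    return 0;
}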

from llama.cpp.

rabidcopy avatar rabidcopy commented on July 24, 2024

Thanks for the clarification. I think it would be nice to have an automatic value that sets n_keep to the prompt size; --keep -1, possibly? I may PR it later if I'm feeling confident about implementing it.

--keep 0 or not specified - no static prompt kept in context when n_ctx runs out
--keep 70 - keep 70 out of 140 tokens of the initial 140 token prompt when n_ctx runs out
--keep -1 - keep all 140 tokens of the initial 140 token prompt when n_ctx runs out

from llama.cpp.

ggerganov avatar ggerganov commented on July 24, 2024

It is currently supported by setting --keep to some very large value - larger than the number of tokens in the original prompt:

params.n_keep = std::min(params.n_keep, (int) embd_inp.size());

We can add --keep -1 to make it more explicit though
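A minimal sketch of what an explicit --keep -1 could look like on top of that clamp (hypothetical helper name; the actual change may differ):

#include <cstdint>
#include <vector>

// Hypothetical: resolve the --keep argument into an actual token count.
// A negative value (e.g. --keep -1) means "keep the entire initial prompt".
static int resolve_n_keep(int requested, const std::vector<int32_t> &embd_inp) {
    if (requested < 0 || requested > (int) embd_inp.size()) {
        return (int) embd_inp.size();
    }
    return requested;
}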

from llama.cpp.

anzz1 avatar anzz1 commented on July 24, 2024

Hmm I can't currently get this to work at all, in all the combinations I've tried with -n -1 it simply doesn't produce any output and keeps passing the control back to user in either interactive or instruct modes.

from llama.cpp.

rabidcopy avatar rabidcopy commented on July 24, 2024

Hmm I can't currently get this to work at all, in all the combinations I've tried with -n -1 it simply doesn't produce any output and keeps passing the control back to user in either interactive or instruct modes.

-n is n_predict, which is basically the max tokens that can be put out in one response. You want something like -c 2048 -n 2048 --keep 2048, I think, if you want endless output and your initial prompt to be remembered. Lower -n if you want particularly short messages and you aren't using a reverse prompt + chatbot template. If you set -n too low, you're just going to get a response using that amount of tokens and then be given back control in interactive mode.

from llama.cpp.

Green-Sky avatar Green-Sky commented on July 24, 2024

judging by 79b2b26, -n -1 was supposed to work?

from llama.cpp.

rabidcopy avatar rabidcopy commented on July 24, 2024

judging by 79b2b26, -n -1 was supposed to work?

Oh, I missed that part. Sorry. No idea then.

from llama.cpp.

Green-Sky avatar Green-Sky commented on July 24, 2024

Can confirm. -n -1 does what you describe, so probably a bug.

from llama.cpp.

rabidcopy avatar rabidcopy commented on July 24, 2024

Can also confirm. -n -1 is just handing back input. Try without setting -n at all; that works for me. It appears manually setting -1 for infinity isn't working. Edit: Wait, is this just defaulting to 128?

int32_t n_predict = 128; // new tokens to predict
fprintf(stderr, " -n N, --n_predict N number of tokens to predict (default: %d, -1 - infinity)\n", params.n_predict);

Guess the workaround is to set -n to an extremely high number.
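A minimal sketch of the intended semantics (hypothetical helper, not the actual fix that later landed): treat a negative n_predict as "no limit" rather than comparing against it directly, which is also why the huge-number workaround behaves the same way in practice.

#include <climits>

// Hypothetical: interpret -n -1 as "predict forever" instead of as a token count.
static int tokens_remaining(int n_predict, int n_generated) {
    if (n_predict < 0) {
        return INT_MAX;   // -1 => effectively unlimited
    }
    return n_predict - n_generated;
}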

from llama.cpp.

anzz1 avatar anzz1 commented on July 24, 2024

from llama.cpp.

Green-Sky avatar Green-Sky commented on July 24, 2024

the generation slows to such a crawl that not a single token gets generated in several minutes.

sounds like the context swap.

from llama.cpp.

ggerganov avatar ggerganov commented on July 24, 2024

@rabidcopy
You might want to try reducing the context to 512 for example. Will improve the performance at the cost of shorter memory.
Also make sure to link with OpenBLAS - it really helps during context swap / rotation

from llama.cpp.

rabidcopy avatar rabidcopy commented on July 24, 2024

@rabidcopy You might want to try reducing the context to 512 for example. Will improve the performance at the cost of shorter memory. Also make sure to link with OpenBLAS - it really helps during context swap / rotation

Oh man. I reinstalled Linux a while back and didn't realize I still had regular blas and not openblas. I've been compiling without OpenBLAS support this whole time. Thanks for the tip!

from llama.cpp.

tjohnman avatar tjohnman commented on July 24, 2024

@rabidcopy You might want to try reducing the context to 512 for example. Will improve the performance at the cost of shorter memory. Also make sure to link with OpenBLAS - it really helps during context swap / rotation

Oh man. I reinstalled Linux a while back and didn't realize I still had regular blas and not openblas. I've been compiling without OpenBLAS support this whole time. Thanks for the tip!

Do you need to pass any special flags to make or cmake to make this happen? I'm on Linux too, but no matter which packages I install, llama.cpp reports no BLAS support. Do I need my CPU to support it as well or something along those lines?

from llama.cpp.

tjohnman avatar tjohnman commented on July 24, 2024

@tjohnman

make clean
LLAMA_OPENBLAS=1 make

Thank you.
EDIT: I feel a bit foolish for not having figured this out by myself. It's right there in the Makefile...

from llama.cpp.

rabidcopy avatar rabidcopy commented on July 24, 2024

Actually, I'm having trouble compiling with OpenBLAS..

g++ -I. -I./examples -O3 -DNDEBUG -std=c++11 -fPIC -pthread examples/main/main.cpp ggml.o llama.o common.o -o main -lopenblas
/usr/bin/ld: ggml.o: in function `ggml_compute_forward_mul_mat_f16_f32':
ggml.c:(.text+0x3adb): undefined reference to `cblas_sgemm'
/usr/bin/ld: ggml.o: in function `ggml_compute_forward_mul_mat_q4_0_f32':
ggml.c:(.text+0x5fe9): undefined reference to `cblas_sgemm'
/usr/bin/ld: ggml.o: in function `ggml_compute_forward':
ggml.c:(.text+0xd055): undefined reference to `cblas_sgemm'
/usr/bin/ld: ggml.c:(.text+0xda9e): undefined reference to `cblas_sgemm'
collect2: error: ld returned 1 exit status
make: *** [Makefile:234: main] Error 1

Edit: Never mind, solved. If you're on Arch, install openblas-lapack and let it replace openblas, cblas, and lapack.

from llama.cpp.

sgoll avatar sgoll commented on July 24, 2024

Regarding the implementation in #71 (comment), one problem that I noticed is somewhat of a continuity (and consistency) degradation whenever the reset and restart happens, i.e. when the context length was exceeded and the prompt had to be reseeded (with or without n_keep).

Is this due to the internal states (self-attention) not matching up with the previous input anymore, or is it just because the model behaves exactly as if it had been fed with [n_keep from prompt] [last tokens from output] to begin with? In other words: is there some internal state that does not cope well with the sudden discontinuity in input tokens?

Not being an expert, neither in theory nor in implementation, I was just wondering if a different approach could also be taken:

I was thinking of some kind of round-robin approach at the input and output layers: instead of inputting tokens into one "line" with a maximum length of the context size that eventually has to be reset and restarted, could we let that "line" wrap around?

I do not simply mean shuffling around the next embd tokens fed into the model, but rather at the model layer itself. Greatly simplified, this is how I imagine the input at the first layer could look with regard to self-attention states (i.e. the tokens put into the model in previous eval calls):

Do this ([] marks the last input token):

[A] β†’ A[B] β†’ AB[C] β†’ ABC[D] β†’ [E]BCD β†’ E[F]CD β†’ …

Instead of the current approach (where A is kept around by n_keep and the last output D is fed into the model again via embd):

[A] β†’ A[B] β†’ AB[C] β†’ ABC[D] β†’ (restart) β†’ A[DE] β†’ ADE[F] β†’ …

If I understand it correctly, this would not have a positional issue (since existing state in the model stays where it was). But I have no idea if the model would be able to cope with this situation at all.

Essentially, this would be equivalent to the "shift everything to the left" approach when including all self-attention states but without the copying around that comes with it.
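For what it's worth, a minimal sketch of the wrap-around bookkeeping this would need at the KV-cache level (hypothetical names; it deliberately ignores the positional-encoding question discussed above):

// Hypothetical ring-buffer index for a wrap-around KV cache: new tokens
// overwrite the oldest slot instead of forcing a reset and re-evaluation.
struct kv_ring {
    int n_ctx;      // number of slots (context size)
    int head = 0;   // next slot to overwrite
    int used = 0;   // how many slots currently hold valid entries

    // slot into which the new token's K/V rows should be written
    int next_slot() {
        const int slot = head;
        head = (head + 1) % n_ctx;
        if (used < n_ctx) {
            used++;
        }
        return slot;
    }
};

Attention would then need to mask out unused slots; whether the model can cope with new tokens overwriting old slots while keeping their own positions is exactly the open question raised above.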

It may solve the problem that there is somewhat of a continuity loss when the reset and restart of the current implementation kicks in. I assume this is because it essentially restarts without self-attention states other than for the initial n_keep tokens.

Unfortunately, I lack the necessary background to do a plausibility check here but I hope that the general gist of my idea can be understood by someone with the knowledge to tell me if this would be something that could possibly work and whether a draft implementation would be worthwhile.

I apologize for my naivety in this post. I'm afraid I have only a rough idea how LLMs works.


PS: It seems some more details have been provided over at the llama-rs repository in rustformers/llm#77 (comment):

The idea … in #71 (comment) is … using a strategy that simply clears the context window (you can keep the original prompt around), and starts adding new tokens.

This is a hack that doesn't properly leverage the advantages of the attention mechanism: When the context window gets full, the transformer's hidden state has information about more than just the last 2048 tokens, because this information is there indirectly embedded in the outputs for the self-attention mechanism. For example, if token 25 attended to tokens 10 and 12, even when tokens 10 and 12 fall outside the context window, a lot of information about these tokens will still be encoded at position 25.

A solution that slides the context window would achieve a gradually "fading" context window, instead of something where the transformer 100% forgets about a word the moment a token falls outside of context. I have some reason to suspect systems like ChatGPT are relying on a mechanism like this based on their ability to consistently recall parts of the conversation that occurred way before the token window was exceeded. However, I'm not knowledgeable enough to figure out if there's a way to actually make this work, given the fact that the positional encoding function used in LLaMA (RoPE) is absolute, not relative.

By doing the swap trick proposed here, the transformer will effectively forget all prior context whenever the swap occurs, and there will be a lag spike due to the last few tokens having to be reprocessed. So this is very much non-ideal …

It may be that this is where I read what I thought about internal self-attention states, but I was unable to find it again until now. Then again, since this also doesn't provide a direct solution (but stresses the general idea that it would be useful to somehow preserve the implicit state from earlier inputs that is still stored in the model), I am not sure if this information is useful here 🙏

from llama.cpp.

ggerganov avatar ggerganov commented on July 24, 2024

@sgoll

The idea that you propose is interesting and I think I understand conceptually what you mean.
Like you, I am not sure if it makes total sense or if it can work at all. Probably with some modification of the KQ mask one could achieve this type of round-robin / sliding window context, but I don't see it at the moment.

The drawbacks of the existing approach are correctly outlined in the llama-rs comment.
It would be great to have something better implemented, but if I understand correctly, there is no publicly known method. ChatGPT could be doing something like this, but it might as well be doing context swap + summarization for all we know, no?

from llama.cpp.

setzer22 avatar setzer22 commented on July 24, 2024

there is no publicly known method. ChatGPT could be doing something like this, but it might as well be doing context swap + sumarization for all we know, no?

Agreed there. It's very hard to know anything for sure and OpenAI isn't going to tell anyone.

This is just a suspicion on my end based on interactions with the tool. One thing that makes me think they're not using something similar to the swap strategy implemented here is that there's never a clear point in the conversation where a lag spike occurs, but I'm also guessing there are ways to trick users by hiding the latency. They also seem to pull off other kinds of magic, like effortlessly reading through several pages of text and starting to generate right away in less than a second, so maybe their trick is just having super fast inference 🤷‍♂️

from llama.cpp.

ggerganov avatar ggerganov commented on July 24, 2024

The lack of latency with ChatGPT can be explained by the high memory bandwidth of GPUs. On a GPU, the memory throughput is much higher compared to CPU and since the prompt evaluation is memory bound, I can easily see a high-end GPU being 10-100x times faster than a CPU for large context swaps. I expect the CPU - GPU difference for single token inference to be much smaller though

from llama.cpp.

setzer22 avatar setzer22 commented on July 24, 2024

I see. That makes a lot of sense. And also explains why this project seems to be competitive in tokens/s with people running the oobabooga webui on GPU πŸ€”

from llama.cpp.

eshaanagarwal avatar eshaanagarwal commented on July 24, 2024

Hi, I am facing this issue while doing CPU inference using the GPT4ALL-1.3groovy model.
Can somebody please help with that?

from llama.cpp.

jboero avatar jboero commented on July 24, 2024

I know this is closed but I just wanted to leave my $0.02 experience in case others come along. I run a workstation with 566GB RAM and an Nvidia RTX 4070 Ti, usually using the server (./examples/server/). The 4070 Ti with 12GB is great for chat and responses with a reasonable model + prompt context size. If using completion instead of chat, I regularly have good luck running CPU mode with a prompt context size of 400,000+ (~300GB RAM with derived 13Bx8 models) for writing long responses or book chapters. Sometimes it takes multiple completions (hitting Start or forcing it to continue completion with hints), but it will write indefinitely up to the ctx size if you coax it a bit, like a choose-your-own-adventure.

Example:

<EPICSTORY...> and then Johnny finished the code and they lived happily ever after. <END> This seems like a perfectly logical stopping point, so Llama may finish even if it has more context and -1 predictions specified. What you can do is either add to this and re-prompt completion, or edit it and adjust the ending to something that clearly needs more explaining:

<EPICSTORY...> and then Johnny finished the code and they lived happily ever after until he found a bug and then

This will hint to the completion that it needs to keep going. It's great because you can sort of guide it or edit parts of the completion you would like adjusted.

from llama.cpp.

sprappcom avatar sprappcom commented on July 24, 2024

@jboero can you provide an example llama model you are using that can have a 400k context size? I thought the generated context is limited by the llama model you use. The max I know of is 192k, which I never get to touch because it's way above my 8GB VRAM and 64GB RAM.

how do you get 400k tokens generated?

from llama.cpp.

jboero avatar jboero commented on July 24, 2024

Sorry, I should have specified "prompt context", not model context, i.e. the --ctx-size arg. No, I haven't actually used a model with a 400k context.

-c N, --ctx-size N: Set the size of the prompt context. The default is 512, but LLaMA models were built with a context of 2048, which will provide better results for longer input/inference. The size may differ in other models; for example, baichuan models were built with a context of 4096.

from llama.cpp.

MrMBag avatar MrMBag commented on July 24, 2024

I realize I'm quite late to this party... but I'm able to get an infinite response running llama.cpp on my Raspberry Pi 4. When I load it up, I use these commands.

./main -m models/[your model] --color \
  --ctx-size 2048 \
  -n -1 \
  -ins -b 256 \
  --top_k 10000 \
  --temp 1.5 \
  --repeat_penalty 1.1 \
  --ignore-eos

I'm going to let you all know that I've been playing around with AI for literally the past 2 weeks, so I barely know what I'm doing, but I'm still learning. (In case you looked at those commands and gagged, laughed, or asked yourself, "What in God's holy name is this moron doing?") I'm kind of like the guy who, in an emergency situation, can lift a car off of someone, because in that moment I'm not thinking about all the reasons I can't lift a car. What I mean is, I think I got llama.cpp to work in the first place by brute force and ignorance, so I can't explain why it works; it just does for me.

So I hope that helps anyone who knows less than I do, or it opens doors for someone who knows more than I do.

from llama.cpp.

matti avatar matti commented on July 24, 2024

@MrMBag thank you, that really helped to understand.

from llama.cpp.
