Comments (7)
@ZackZikaiXiao Glad to be of help. I suggest Yannic Kilcher's comprehensive explanation of the paper; it's worth your time.
from infini-attention.
@ZackZikaiXiao If you read the README carefully, I explain there why I don't do the segmentation in-class. The paper follows the same segmentation logic as Transformer-XL, Memformer, etc. In-class segmentation has O(S^2) complexity, which doesn't match the O(S) complexity described in the paper.
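To make the distinction concrete, here is a minimal sketch of training-loop segmentation in the Transformer-XL style described above: the long sequence is split into fixed-size segments outside the model, and a recurrent memory state is carried across segment boundaries. The `model` callable and its `(segment, memory) -> (loss, memory)` signature are assumptions for illustration, not this repository's actual API.

```python
def train_step(model, tokens, segment_len):
    """Hypothetical sketch: segment a long sequence in the training
    loop and thread a memory state through the segments."""
    memory = None  # compressive memory, updated once per segment
    losses = []
    for start in range(0, len(tokens), segment_len):
        segment = tokens[start:start + segment_len]
        # the model only ever attends within one segment, reading
        # long-range context from the fixed-size memory state
        loss, memory = model(segment, memory)
        losses.append(loss)
    # total work is num_segments * segment_len = O(S), not O(S^2)
    return sum(losses) / len(losses)
```

Because each forward pass sees only `segment_len` tokens plus a fixed-size memory, the per-step cost is constant in the total sequence length, which is where the O(S) overall complexity comes from.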
Yeah, the author segments the input in the training loop, and I have read the Transformer-XL code. I'm still curious about the potential performance of segmenting the hidden states inside the attention class, since the code would be concise (it only replaces the attention class). Do you think segmenting inside attention would still give acceptable perplexity, complexity aside?
@ZackZikaiXiao I will reiterate: read the README carefully; I have tested both ways of doing it. The whole point of the paper is having fixed, bounded memory while using a huge global context. Doing the segmentation in-class consumes about the same memory as global SDPA attention, so you can't reach huge context lengths like 1M tokens, because it's impossible to fit that in any VRAM. There is really no point in segmenting in-class: memory-wise it's the same as global SDPA attention, and performance-wise it's worse. You can't match the performance of global softmax (which is non-linear) with linear kernels, at least with the existing research.
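A back-of-the-envelope comparison illustrates the memory argument above. The segment length and model dimension below are illustrative values, not the repository's actual configuration:

```python
def attn_matrix_elems(seq_len):
    # global softmax attention materializes a seq_len x seq_len matrix
    return seq_len * seq_len

def segmented_elems(segment_len, d_model):
    # with training-loop segmentation you only ever materialize one
    # segment's attention matrix plus a fixed d_model x d_model memory
    return segment_len * segment_len + d_model * d_model

S, seg, d = 1_000_000, 2048, 1024
print(attn_matrix_elems(S))      # 10^12 entries: infeasible for any VRAM
print(segmented_elems(seg, d))   # ~5.2M entries, fixed regardless of S
```

The segmented footprint is constant no matter how long the sequence grows, which is exactly the "fixed, bounded memory" property the comment refers to; in-class segmentation forfeits that because it still holds all segments' activations at once.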
Yes, having "fixed, bounded memory", "O(S) complexity", and "the code of Transformer-XL" provides clear evidence for segmenting in the training loop for the paper. I'm not very clear about what global softmax, non-linearity, and linear kernels mean: what is their relationship to segmenting? If it's possible to perform in-class segmentation, combined with PEFT and some memory structures (like rings or trees), or the DNC and NTMs mentioned in the README, what do you think about the feasibility of that approach?
@ZackZikaiXiao When I say "You can't achieve the same performance with global softmax (which is non-linear) with linear kernels, at least with the existing research", I'm trying to explain that if you do the segmentation in-class you have zero advantages: you don't save memory, and you lose performance, since normal SDPA attention is just better. It is possible to do it in-class; there are other implementations of the paper on GitHub that do it that way, but there is no point to it. This is not a better attention scheme, it's a series of math tricks to avoid keeping the whole global attention matrix in memory. We already have a memory structure in-class that is essentially what the paper calls "compressed memory". Let's leave the DNC and NTM suggestions out of this, because this conversation is going to get even more complicated. 🙃
I'm starting to understand what you mean. If segmenting happens within attention, we still need to concatenate all the fragments when the attention function returns, and the concatenated matrix is just as large as with SDPA, so it doesn't reduce VRAM. By segmenting during the training loop, only one set of compressive memory needs to be maintained. Thank you very much for your response!