
Comments (7)

jlamprou commented on June 16, 2024

@ZackZikaiXiao Glad to be of help. I suggest Yannic Kilcher's comprehensive explanation of the paper; it's worth your time.

jlamprou commented on June 16, 2024

@ZackZikaiXiao If you read the README carefully, I explain why I don't do the segmentation in-class. The paper follows the same segmentation logic as Transformer-XL, Memformers, etc. The complexity of in-class segmentation is O(S^2), which doesn't match the O(S) complexity described in the paper.
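
A minimal sketch of what training-loop segmentation looks like, assuming a PyTorch-style model with a hypothetical `reset_memory()` hook and an HF-style forward signature (this is not the repo's actual training code):

```python
import torch

def train_step(model, input_ids, labels, segment_len, optimizer):
    """Feed one long sequence to the model segment by segment.

    The segmentation happens here, outside the attention class; the
    compressive memory is carried across segments by the model itself.
    """
    model.reset_memory()  # hypothetical hook that zeroes the compressive memory
    total_loss, n_segments = 0.0, 0
    for start in range(0, input_ids.size(1), segment_len):
        seg_ids = input_ids[:, start:start + segment_len]
        seg_labels = labels[:, start:start + segment_len]
        outputs = model(seg_ids, labels=seg_labels)  # attention is only O(segment_len^2)
        outputs.loss.backward()                      # gradients accumulate across segments
        total_loss += outputs.loss.item()
        n_segments += 1
    optimizer.step()
    optimizer.zero_grad()
    return total_loss / n_segments
```

Because each forward pass only ever sees one segment, the quadratic attention cost and the activation memory are bounded by segment_len rather than by the full sequence.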

ZackZikaiXiao commented on June 16, 2024

Yeah, the authors segment the input in the training loop, and I have read the Transformer-XL code. I'm still curious about the potential performance of segmenting the hidden states inside the attention, since the code would be more concise (only the attention class gets replaced). Do you think segmenting inside the attention would still give acceptable perplexity, leaving complexity aside?

jlamprou commented on June 16, 2024

@ZackZikaiXiao I will reiterate: read the README carefully, I have tested both ways of doing it. The whole point of the paper is having a fixed, bounded memory while using a huge global context. Doing the segmentation in-class consumes about the same memory as global SDPA attention, so you can't reach huge context lengths like 1M tokens because it's impossible to fit in any VRAM. There is really no point in segmenting in-class: memory-wise it's the same as global SDPA attention, and performance-wise it's worse. You can't match the performance of global softmax (which is non-linear) with linear kernels, at least with the existing research.
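
To put rough numbers on the memory argument (illustrative figures only, assuming a hypothetical 32-layer model with d_model = 4096, 32 heads of size 128, and bf16 tensors):

```python
seq_len, d_model, n_layers, bytes_per = 1_000_000, 4096, 32, 2  # assumed sizes, bf16

# What keeping the full global context around costs: K and V for every token, every layer.
kv_cache_bytes = 2 * seq_len * d_model * n_layers * bytes_per
print(f"full-context KV: {kv_cache_bytes / 1e9:.0f} GB")    # ~524 GB

# What the compressive memory costs: one d_head x d_head matrix per head,
# per layer, regardless of how long the context is.
n_heads, d_head = 32, 128
memory_bytes = n_heads * d_head * d_head * n_layers * bytes_per
print(f"compressive memory: {memory_bytes / 1e6:.0f} MB")   # ~34 MB
```

The exact numbers depend on the model, but the scaling is the point: the full-context footprint grows with seq_len, while the compressive memory stays constant.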

ZackZikaiXiao commented on June 16, 2024

Yes, having "fixed, bounded memory", "O(S) complexity" and "the code of Transformer-XL" provides clear evidence that the paper segments in the training loop. I'm not very clear about what global softmax, non-linear, and linear kernels mean, or what their relationship with segmenting is. If it's possible to perform internal segmentation, combined with PEFT and some memory structures (like ring, tree) or DNC and NTMs as mentioned in the README, what do you think about the feasibility of this approach?

jlamprou commented on June 16, 2024

@ZackZikaiXiao When I say "You can't match the performance of global softmax (which is non-linear) with linear kernels, at least with the existing research", I'm trying to explain that if you do the segmentation in-class you have zero advantages: you don't save memory and you lose performance, since normal SDPA attention is just better. It's possible to do it in-class, and there are other implementations of the paper on GitHub that do it that way, but there is no point to it. This is not a better attention scheme, it's a series of math tricks to avoid keeping the whole global attention in memory. We already have a memory structure in-class that is essentially what the paper calls "compressive memory". Let's leave the DNC and NTM suggestions out of this because this conversation is gonna get even more complicated.🙃
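
For context, that in-class memory structure boils down to a linear-attention-style update and retrieval. A minimal sketch of the math, with simplified shapes and the ELU+1 kernel from the paper (not the repo's exact code):

```python
import torch
import torch.nn.functional as F

# q, k, v: [batch, heads, seg_len, d_head]
# memory:  [batch, heads, d_head, d_head], z: [batch, heads, d_head, 1]

def elu_plus_one(x):
    return F.elu(x) + 1.0

def retrieve(memory, z, q):
    # A_mem = sigma(Q) M / (sigma(Q) z): read the compressed past back out
    sigma_q = elu_plus_one(q)
    return (sigma_q @ memory) / (sigma_q @ z).clamp(min=1e-6)

def update(memory, z, k, v):
    # M <- M + sigma(K)^T V and z <- z + sum_t sigma(k_t): additive, fixed-size update
    sigma_k = elu_plus_one(k)
    memory = memory + sigma_k.transpose(-2, -1) @ v
    z = z + sigma_k.sum(dim=-2, keepdim=True).transpose(-2, -1)
    return memory, z
```

In the paper, the retrieved A_mem is then blended with the segment's local softmax attention through a learned gate, so each segment gets exact local attention plus a fixed-size summary of everything before it.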

ZackZikaiXiao commented on June 16, 2024

I'm starting to understand what you mean. If the segmenting happens inside the attention, we still need to concatenate all the fragments when the attention function returns, and this concatenated matrix is still as large as with SDPA, so it doesn't reduce VRAM. By segmenting in the training loop, only one set of compressive memory needs to be maintained. Thank you very much for your response!
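
A sketch of why the in-class variant doesn't save activation memory (hypothetical module and helper names, purely to illustrate the shape of the problem):

```python
import torch

def forward_in_class_segmented(attn, hidden_states, segment_len):
    # Even though attention is computed segment by segment, the per-segment
    # outputs are concatenated back to the full sequence length before
    # returning, so a [batch, seq_len, d_model] tensor is materialized per
    # layer either way, the same as with global SDPA.
    outputs = []
    memory = attn.new_memory(hidden_states)  # hypothetical helper: empty compressive memory
    for start in range(0, hidden_states.size(1), segment_len):
        seg = hidden_states[:, start:start + segment_len]
        out, memory = attn.segment_attention(seg, memory)  # hypothetical per-segment attention
        outputs.append(out)
    return torch.cat(outputs, dim=1)
```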
