
Comments (5)

oasidorshin commented on June 11, 2024

@lucidrains Thanks for the great answers! Yeah, I agree that the most correct way would be to create special masks that disable rotations for the CLS tokens, but that seems very complicated to do.

For people from the future: I just add both the CLS and memory tokens at the beginning, and it works quite well with RoPE. At least nothing breaks and it learns well, though I'm working not with text but with custom sequences. I will add more here if I find anything else.
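A minimal sketch of that setup, using x-transformers' documented `TransformerWrapper` / `Encoder` interface; the vocabulary size, hyperparameters, and the reserved CLS token id are illustrative assumptions, not values from the thread:

```python
import torch
from x_transformers import TransformerWrapper, Encoder

CLS_ID = 1  # hypothetical token id reserved for the CLS token

model = TransformerWrapper(
    num_tokens = 20000,              # illustrative vocab size
    max_seq_len = 1024,
    num_memory_tokens = 4,           # memory tokens, prepended internally
    attn_layers = Encoder(
        dim = 512,
        depth = 6,
        heads = 8,
        rotary_pos_emb = True        # RoPE, applied to all tokens as-is
    )
)

seq = torch.randint(2, 20000, (1, 511))                       # dummy data
seq = torch.cat((torch.full((1, 1), CLS_ID), seq), dim = -1)  # prepend CLS

embeddings = model(seq, return_embeddings = True)
cls_embedding = embeddings[:, 0]     # representation at the CLS slot
```

Note that the CLS token gets rotated like any other position here; the point of the comment is simply that training still works despite that.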


oasidorshin commented on June 11, 2024

@lucidrains Good idea about using the CLS token only at the penultimate layer, by the way; going to remember that.


lucidrains commented on June 11, 2024

@oasidorshin Sounds good. If you do discover that CLS tokens function well without much relative positional engineering, that is tweet-worthy.


lucidrains commented on June 11, 2024

@oasidorshin Hey Oleg! The answer is that no one knows; I haven't read any papers trying to make this work. It may turn out that the network just figures it out (the CLS token learns to ignore the rotations). You may be in a position to be the first to explore this and share your findings.

However, the correct way would be a month's work: fusing the rotation of queries and keys into the flash attention kernel. I imagine passing some hyperparameters that build a should-rotate mask, with rotations between CLS tokens and all other tokens omitted. You can also do this manually within Attention by breaking the CLS token off from the queries and keys and building up the pre-softmax attention matrix that way.
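A minimal sketch of the manual approach, assuming position 0 is the CLS token and a `rotate(t, freqs)` helper (hypothetical name) that applies the rotary embedding; this is not x-transformers' internal code:

```python
import torch

def attention_with_unrotated_cls(q, k, v, rotate, freqs):
    # q, k, v: (batch, heads, seq, dim_head); position 0 is the CLS token
    scale = q.shape[-1] ** -0.5

    # break the CLS token off the queries and keys
    q_cls, q_rest = q[..., :1, :], q[..., 1:, :]
    k_cls, k_rest = k[..., :1, :], k[..., 1:, :]

    # rotate only the non-CLS positions
    q_rot, k_rot = rotate(q_rest, freqs), rotate(k_rest, freqs)

    # token <-> token similarities use the rotated tensors (relative positions)
    sim_tok = torch.einsum('... i d, ... j d -> ... i j', q_rot, k_rot)

    # the CLS row and CLS column are computed from unrotated tensors, so no
    # rotation is ever applied between the CLS token and anything else
    cls_row = torch.einsum('... i d, ... j d -> ... i j', q_cls, k)       # (.., 1, n)
    cls_col = torch.einsum('... i d, ... j d -> ... i j', q_rest, k_cls)  # (.., n-1, 1)

    # assemble the full pre-softmax attention matrix
    sim = torch.cat((cls_row, torch.cat((cls_col, sim_tok), dim = -1)), dim = -2)
    attn = (sim * scale).softmax(dim = -1)
    return torch.einsum('... i j, ... j d -> ... i d', attn, v)
```

The fused-kernel version with a should-rotate mask would compute the same matrix without materializing it.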


lucidrains commented on June 11, 2024

@oasidorshin Another alternative is to just use the CLS token to pool the representations across all tokens at the penultimate layer through cross attention. I've seen this used in some vision transformers with success.
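A sketch of that pooling, in the spirit of the class-attention layers in CaiT; the module name and hyperparameters are made up for illustration:

```python
import torch
from torch import nn

class CLSCrossAttentionPool(nn.Module):
    """A learned CLS query cross-attends over all token representations."""
    def __init__(self, dim, heads = 8):
        super().__init__()
        self.cls = nn.Parameter(torch.randn(1, 1, dim))
        self.norm = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first = True)

    def forward(self, tokens):                      # tokens: (batch, seq, dim)
        cls = self.cls.expand(tokens.shape[0], -1, -1)
        tokens = self.norm(tokens)
        pooled, _ = self.attn(cls, tokens, tokens)  # CLS attends to all tokens
        return pooled.squeeze(1)                    # (batch, dim)
```

Since the CLS token only enters through this cross attention near the end, it never passes through the rotary self-attention layers, which sidesteps the rotation problem entirely.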

