Comments (5)
@lucidrains Thanks for the great answers! Yeah, I agree that the most correct way would be to create special masks that disable rotations for CLS tokens, but that seems very complicated to do.
For people from the future: I just add both CLS and memory tokens at the beginning, and it works quite well with RoPE. At least nothing breaks and the model learns well, though I'm working with custom sequences rather than text. I will add more if I find something else.
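For anyone who wants to try the same setup, here is a minimal sketch of prepending learned CLS and memory tokens before the attention layers. It is not code from this thread or from the library: the `PrependTokens` name, the initialization scale, the CLS-then-memory order, and the `True` = keep mask convention are all assumptions made for illustration.

```python
import torch
import torch.nn.functional as F
from torch import nn

class PrependTokens(nn.Module):
    """Hypothetical helper: prepends one CLS token and a few memory tokens."""

    def __init__(self, dim, num_memory=4):
        super().__init__()
        # learned tokens shared across the batch (init scale is an assumption)
        self.cls = nn.Parameter(torch.randn(1, 1, dim) * 0.02)
        self.memory = nn.Parameter(torch.randn(1, num_memory, dim) * 0.02)

    def forward(self, x, mask=None):
        # x: (batch, seq, dim); mask: (batch, seq) bool, True = keep (assumed)
        batch = x.shape[0]
        prefix = torch.cat((self.cls, self.memory), dim=1).expand(batch, -1, -1)
        x = torch.cat((prefix, x), dim=1)
        if mask is not None:
            # the prepended tokens are always attended to
            mask = F.pad(mask, (prefix.shape[1], 0), value=True)
        return x, mask
```

After the transformer, the CLS representation would be read from position 0 of the output. With this layout the rotary positions simply start counting at the CLS token, which is the "let the network figure it out" variant.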
from x-transformers.
@lucidrains Good idea about using CLS only at the penultimate layer btw, going to remember that
@oasidorshin sounds good, if you do discover that CLS tokens function well without much relative positional engineering, that is tweet worthy
@oasidorshin hey Oleg! so the answer to this is no one knows and i haven't read any papers trying to make this work. it may turn out to be the case that the network just figures it out (CLS token learns to ignore the rotations). you may be in a position to be the first to explore this and share your findings
however, the correct way would be a month's work fusing the rotation of queries and keys into the flash attention kernel. i imagine passing some hyperparameters that build a should-rotate mask, with rotations between CLS tokens and all other tokens omitted. you can also do this manually within Attention by breaking off the CLS token from queries and keys and building up the pre-softmax attention matrix that way
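The manual route could look roughly like the sketch below. It is an illustration under stated assumptions, not the library's Attention code: the function name `attn_logits_skip_cls`, the rotate-half RoPE convention, the base of 10000, and the assumption that CLS tokens sit at the start of the sequence are all choices made here. It skips flash attention entirely and just assembles the pre-softmax matrix block by block.

```python
import torch

def rotate_half(x):
    # pairs feature i with feature i + dim/2 (one common RoPE layout)
    x1, x2 = x.chunk(2, dim=-1)
    return torch.cat((-x2, x1), dim=-1)

def apply_rotary(x, freqs):
    return x * freqs.cos() + rotate_half(x) * freqs.sin()

def attn_logits_skip_cls(q, k, num_cls=1, base=10000.0):
    """Pre-softmax attention logits where CLS tokens are never rotated.

    q, k: (batch, heads, seq, dim_head), CLS tokens first, dim_head even.
    """
    dim = q.shape[-1]
    n = q.shape[-2] - num_cls
    inv_freq = 1.0 / (base ** (torch.arange(0, dim, 2, dtype=torch.float32, device=q.device) / dim))
    pos = torch.arange(n, dtype=torch.float32, device=q.device)
    freqs = torch.outer(pos, inv_freq)          # (n, dim / 2)
    freqs = torch.cat((freqs, freqs), dim=-1)   # (n, dim)

    # break the CLS tokens off from the rest of the sequence
    q_cls, q_seq = q[..., :num_cls, :], q[..., num_cls:, :]
    k_cls, k_seq = k[..., :num_cls, :], k[..., num_cls:, :]

    # rotations apply only between ordinary tokens
    q_rot, k_rot = apply_rotary(q_seq, freqs), apply_rotary(k_seq, freqs)

    def t(x):
        return x.transpose(-1, -2)

    # any block that involves a CLS token uses unrotated queries and keys
    top = torch.cat((q_cls @ t(k_cls), q_cls @ t(k_seq)), dim=-1)
    bottom = torch.cat((q_seq @ t(k_cls), q_rot @ t(k_rot)), dim=-1)
    return torch.cat((top, bottom), dim=-2) * dim ** -0.5
```

The result would then go through masking and softmax as usual. The cost is that the unrotated and rotated copies of the queries and keys are both needed, so this does not map directly onto a fused attention kernel.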
@oasidorshin another alternative is to just use the CLS token to pool the representations across all tokens at the penultimate layers through cross attention. i've seen this used in some vision transformers with success
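A rough sketch of that pooling idea, assuming a single learned CLS query that cross-attends over the token representations coming out of a late layer. The `CLSAttentionPool` module, its pre-norm, and the head count are hypothetical choices for illustration; only `nn.MultiheadAttention` and `nn.LayerNorm` are real PyTorch modules.

```python
import torch
from torch import nn

class CLSAttentionPool(nn.Module):
    """Hypothetical pooling head: a learned CLS query cross-attends to all tokens."""

    def __init__(self, dim, heads=4):
        super().__init__()
        self.cls = nn.Parameter(torch.randn(1, 1, dim) * 0.02)
        self.norm = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, tokens, key_padding_mask=None):
        # tokens: (batch, seq, dim), e.g. the output of the penultimate layer
        # key_padding_mask: (batch, seq) bool, True = ignore (PyTorch convention)
        query = self.cls.expand(tokens.shape[0], -1, -1)
        context = self.norm(tokens)
        pooled, _ = self.attn(query, context, context, key_padding_mask=key_padding_mask)
        return pooled.squeeze(1)  # (batch, dim)
```

Because the CLS query never passes through the rotary self-attention layers, the question of whether to rotate it does not come up in this variant.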