Sorry for asking here, but I couldn't find any answer to this both in papers and in th


Correct interaction between CLS token and RoPE about x-transformers HOT 5 CLOSED

oasidorshin commented on June 11, 2024 1

Correct interaction between CLS token and RoPE

from x-transformers.

Comments (5)

oasidorshin commented on June 11, 2024 1

@lucidrains Thanks for the great answers! Yeah, I agree that the most correct way would be to build special masks that disable rotations for the CLS tokens, but that seems very complicated to do.

For people from the future: I just prepend both the CLS and memory tokens to the sequence, and it works quite well with RoPE. At least nothing breaks and the model learns well, though I'm working with custom sequences rather than text. I will add more here if I find anything else.


oasidorshin commented on June 11, 2024 1

@lucidrains Good idea about using CLS only at the penultimate layer, by the way. I'm going to remember that.


lucidrains commented on June 11, 2024 1

@oasidorshin sounds good. if you do discover that CLS tokens function well without much relative positional engineering, that is tweet-worthy


lucidrains commented on June 11, 2024

@oasidorshin hey Oleg! so the answer to this is that no one knows, and i haven't read any papers trying to make this work. it may turn out that the network just figures it out (the CLS token learns to ignore the rotations). you may be in a position to be the first to explore this and share your findings

however, the correct way would be a month's work fusing the rotation of queries and keys into the flash attention kernel. i imagine passing some hyperparameters that build a should-rotate mask, with rotations between CLS tokens and all other tokens omitted. you can also do this manually within Attention by breaking off the CLS token from the queries and keys and building up the pre-softmax attention matrix that way
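
a rough sketch of the manual route described above, in plain PyTorch. this is not code from x-transformers: `apply_rope`, `cls_exempt_attention_logits` and `num_cls` are hypothetical names, and it assumes the CLS tokens sit at the front of the sequence, that RoPE uses the half-split convention, and that the first non-CLS token gets position 0. it builds the full pre-softmax matrix without a fused kernel, so it gives up the memory savings of flash attention.

```python
import torch


def rotate_half(x):
    # Split the last dim in two halves and map (x1, x2) -> (-x2, x1).
    x1, x2 = x.chunk(2, dim=-1)
    return torch.cat((-x2, x1), dim=-1)


def apply_rope(x, base=10000.0):
    # x: (batch, heads, seq, dim_head), with dim_head even.
    seq_len, dim = x.shape[-2], x.shape[-1]
    inv_freq = 1.0 / (base ** (torch.arange(0, dim, 2, dtype=torch.float32) / dim))
    pos = torch.arange(seq_len, dtype=torch.float32)
    freqs = torch.outer(pos, inv_freq)           # (seq, dim_head / 2)
    freqs = torch.cat((freqs, freqs), dim=-1)    # (seq, dim_head)
    return x * freqs.cos() + rotate_half(x) * freqs.sin()


def cls_exempt_attention_logits(q, k, num_cls=1):
    # q, k: (batch, heads, seq, dim_head); the first num_cls positions are CLS.
    scale = q.shape[-1] ** -0.5

    # Break the CLS tokens off from the rest of the queries and keys.
    q_cls, q_tok = q[..., :num_cls, :], q[..., num_cls:, :]
    k_cls, k_tok = k[..., :num_cls, :], k[..., num_cls:, :]

    # Only ordinary tokens are rotated (assumption: positions restart at 0).
    q_rot, k_rot = apply_rope(q_tok), apply_rope(k_tok)

    tok_tok = q_rot @ k_rot.transpose(-1, -2)    # rotated on both sides
    tok_cls = q_tok @ k_cls.transpose(-1, -2)    # no rotation on either side
    cls_all = q_cls @ k.transpose(-1, -2)        # CLS queries see unrotated keys

    # Stitch the blocks back into one (seq, seq) pre-softmax matrix.
    bottom = torch.cat((tok_cls, tok_tok), dim=-1)
    return torch.cat((cls_all, bottom), dim=-2) * scale
```

the result can then go through the usual masking, softmax, and weighted sum over values; values are never rotated in RoPE, so nothing changes on that side.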


lucidrains commented on June 11, 2024

@oasidorshin another alternative is to just use the CLS token to pool the representations across all tokens at the penultimate layer through cross attention. i've seen this used in some vision transformers with success
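
a minimal sketch of that pooling idea, assuming a single learned query that cross-attends over the encoder's token outputs. `CLSCrossAttentionPool` and its arguments are hypothetical names, not part of x-transformers. it uses PyTorch's `nn.MultiheadAttention`, and no rotary embedding is applied inside the pooling step, so the CLS query never interacts with RoPE directly.

```python
import torch
from torch import nn


class CLSCrossAttentionPool(nn.Module):
    """Pools a token sequence into one vector with a learned CLS query."""

    def __init__(self, dim, heads=4):
        super().__init__()
        # Learned CLS query, shared across the batch.
        self.cls_query = nn.Parameter(torch.randn(1, 1, dim) * 0.02)
        self.norm = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, tokens, key_padding_mask=None):
        # tokens: (batch, seq, dim), e.g. the output of the penultimate layer.
        # key_padding_mask: (batch, seq) bool, True marks padding to ignore.
        query = self.cls_query.expand(tokens.shape[0], -1, -1)
        context = self.norm(tokens)
        pooled, _ = self.attn(query, context, context,
                              key_padding_mask=key_padding_mask)
        return pooled.squeeze(1)  # (batch, dim)


pool = CLSCrossAttentionPool(dim=16, heads=4)
features = torch.randn(3, 10, 16)
summary = pool(features)  # shape: (3, 16)
```

since the query carries no position, the pooled vector depends on token order only through whatever positional information the earlier RoPE layers already wrote into the token representations.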

