Comments (3)
I agree, but it might get a bit complicated to use for the simpler cases.
What do you think about trying evmap and synchronizing only rarely? It could also allow for very fast reads, and writing only when needed (there's no need to update already existing cache entries). Performance would improve as we see more subwords, and once the cache reaches capacity it should run at full speed anyway.
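The "synchronize only rarely" idea could be sketched with just the standard library (this is not evmap's actual API, and `BatchedCache`, `Worker`, and the batch size of 64 are all made up for illustration): each worker buffers new entries privately and only merges them into the shared map once per batch, so the write lock is taken rarely instead of on every miss.

```rust
use std::collections::HashMap;
use std::sync::RwLock;

// Shared word -> token-ids cache behind a single RwLock.
struct BatchedCache {
    shared: RwLock<HashMap<String, Vec<u32>>>,
}

// Per-thread view: a private buffer of entries not yet merged.
struct Worker<'a> {
    cache: &'a BatchedCache,
    pending: HashMap<String, Vec<u32>>,
}

impl<'a> Worker<'a> {
    fn lookup(&self, key: &str) -> Option<Vec<u32>> {
        // Check the private buffer first, then the shared map.
        if let Some(v) = self.pending.get(key) {
            return Some(v.clone());
        }
        self.cache.shared.read().ok()?.get(key).cloned()
    }

    fn record(&mut self, key: String, value: Vec<u32>) {
        self.pending.insert(key, value);
        // Synchronize only rarely: flush once the batch is large enough.
        if self.pending.len() >= 64 {
            self.flush();
        }
    }

    fn flush(&mut self) {
        let mut shared = self.cache.shared.write().unwrap();
        // Existing entries never need updating, so only fill the gaps.
        for (k, v) in self.pending.drain() {
            shared.entry(k).or_insert(v);
        }
    }
}
```

The same split is what evmap gives for free (readers never block), at the cost of handing each thread its own handle.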
So I gave evmap a shot but haven't got it fully working yet. It's a little weird, actually. The read handles need to be cloned between threads, or at least wrapped in an Arc, maybe. See the cache-evmap branch. I don't think it compiles right now.
I'm not saying I'm ready to give up on evmap, but I will say RwLock is much simpler, and the only bottleneck is writing, since that blocks all readers. That would be solved by pre-filling the cache, though. And pre-filling the cache could be done during BpeTrainer::train, right?
I guess another improvement would be to automatically stop trying to write as soon as the cache is at capacity. That should be pretty easy.
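The RwLock approach with the capacity cutoff could look something like this (a minimal sketch, not the library's actual implementation; the `Cache` type and `try_write` choice are assumptions). Using `try_write` means a thread that can't get the lock simply drops the entry rather than blocking, which is fine since the value can always be recomputed:

```rust
use std::collections::HashMap;
use std::sync::RwLock;

// Read-through cache: inserts stop entirely once `capacity` is reached.
struct Cache {
    map: RwLock<HashMap<String, Vec<u32>>>,
    capacity: usize,
}

impl Cache {
    fn new(capacity: usize) -> Self {
        Cache { map: RwLock::new(HashMap::new()), capacity }
    }

    fn get(&self, key: &str) -> Option<Vec<u32>> {
        self.map.read().ok()?.get(key).cloned()
    }

    // Insert only while below capacity; try_write never blocks readers
    // waiting behind a contended lock, it just skips the insert.
    fn set(&self, key: String, value: Vec<u32>) {
        if let Ok(mut map) = self.map.try_write() {
            if map.len() < self.capacity {
                map.insert(key, value);
            }
        }
    }
}
```

Once the map is full, `set` becomes a cheap no-op, so writers stop competing with readers entirely.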
Ok, I see. evmap probably requires the cache to be handled one level higher, so that we can give each thread what it needs to manipulate it. This is definitely not something we want to do for now.
I think the cache should be optimized for the actual usage, so it's not really possible to do this during training, since the two can be completely different. Also, for the difference it might make, it shouldn't be a problem to have some kind of start-up period where the tokenizer isn't as fast as it might get. If someone using the library wants to optimize their use case, they can easily do so by encoding some text that fills the cache before starting to use it. Tokenizing 500 MB of text takes something like 10 s right now, so even if we needed that much data, this kind of start-up time shouldn't be a problem.