Comments (3)
I agree, but it might get a bit complicated to use for the simpler cases.
What do you think about trying evmap and synchronizing only rarely? It could also allow for very fast reads, and writing only when needed (there's no need to update already existing cache entries). Performance would improve as we see more subwords, and once the cache reaches capacity it should run at full speed anyway.
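The "synchronize only rarely" idea could be sketched with just the standard library (this is not evmap's actual API, and `BatchedCache`, `Worker`, and the batch size of 64 are all made up for illustration): each worker buffers new entries privately and only merges them into the shared map once per batch, so the write lock is taken rarely instead of on every miss.

```rust
use std::collections::HashMap;
use std::sync::RwLock;

// Shared word -> token-ids cache behind a single RwLock.
struct BatchedCache {
    shared: RwLock<HashMap<String, Vec<u32>>>,
}

// Per-thread view: a private buffer of entries not yet merged.
struct Worker<'a> {
    cache: &'a BatchedCache,
    pending: HashMap<String, Vec<u32>>,
}

impl<'a> Worker<'a> {
    fn lookup(&self, key: &str) -> Option<Vec<u32>> {
        // Check the private buffer first, then the shared map.
        if let Some(v) = self.pending.get(key) {
            return Some(v.clone());
        }
        self.cache.shared.read().ok()?.get(key).cloned()
    }

    fn record(&mut self, key: String, value: Vec<u32>) {
        self.pending.insert(key, value);
        // Synchronize only rarely: flush once the batch is large enough.
        if self.pending.len() >= 64 {
            self.flush();
        }
    }

    fn flush(&mut self) {
        let mut shared = self.cache.shared.write().unwrap();
        // Existing entries never need updating, so only fill the gaps.
        for (k, v) in self.pending.drain() {
            shared.entry(k).or_insert(v);
        }
    }
}
```

The same split is what evmap gives for free (readers never block), at the cost of handing each thread its own handle.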
So I gave evmap a shot but haven't got it fully working yet. It's a little weird, actually. The read handles need to be cloned between threads, or at least wrapped in an Arc, maybe. See the cache-evmap branch. I don't think it compiles right now.
I'm not saying I'm ready to give up on evmap, but I will say RwLock is much simpler, and the only bottleneck is writing, since that blocks all readers. That would be solved by pre-filling the cache, though. And pre-filling the cache could be done during BpeTrainer::train, right?
I guess another improvement would be to automatically stop trying to write as soon as the cache is at capacity. That should be pretty easy.
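The RwLock approach with the capacity cutoff could look something like this (a minimal sketch, not the library's actual implementation; the `Cache` type and `try_write` choice are assumptions). Using `try_write` means a thread that can't get the lock simply drops the entry rather than blocking, which is fine since the value can always be recomputed:

```rust
use std::collections::HashMap;
use std::sync::RwLock;

// Read-through cache: inserts stop entirely once `capacity` is reached.
struct Cache {
    map: RwLock<HashMap<String, Vec<u32>>>,
    capacity: usize,
}

impl Cache {
    fn new(capacity: usize) -> Self {
        Cache { map: RwLock::new(HashMap::new()), capacity }
    }

    fn get(&self, key: &str) -> Option<Vec<u32>> {
        self.map.read().ok()?.get(key).cloned()
    }

    // Insert only while below capacity; try_write never blocks readers
    // waiting behind a contended lock, it just skips the insert.
    fn set(&self, key: String, value: Vec<u32>) {
        if let Ok(mut map) = self.map.try_write() {
            if map.len() < self.capacity {
                map.insert(key, value);
            }
        }
    }
}
```

Once the map is full, `set` becomes a cheap no-op, so writers stop competing with readers entirely.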
Ok, I see. evmap probably requires the cache to be handled one level higher, so that we can give each thread what it needs to manipulate it. This is definitely not something we want to do for now.
I think the cache should be optimized for the actual usage, so it's not really possible to do this during training, since the two can be completely different. Also, for the difference it might make, it shouldn't be a problem to have some kind of start-up period where the tokenizer isn't as fast as it might get. If someone using the library wants to optimize their use case, they can easily do so by encoding some text that fills the cache before starting to use it. Tokenizing 500 MB of text takes something like 10 s right now, so even if we needed that much data, this kind of start-up time shouldn't be a problem.