huggingface / tokenizers

💥 Fast State-of-the-Art Tokenizers optimized for Research and Production

Home Page: https://huggingface.co/docs/tokenizers

License: Apache License 2.0

Rust 71.09% JavaScript 0.47% Python 20.44% Makefile 0.26% TypeScript 2.51% Jupyter Notebook 4.90% CSS 0.31% HTML 0.02%
nlp natural-language-processing natural-language-understanding language-model transformers bert gpt

tokenizers's Introduction




Provides an implementation of today's most used tokenizers, with a focus on performance and versatility.

Main features:

  • Train new vocabularies and tokenize, using today's most used tokenizers.
  • Extremely fast (both training and tokenization), thanks to the Rust implementation. Takes less than 20 seconds to tokenize a GB of text on a server's CPU.
  • Easy to use, but also extremely versatile.
  • Designed for research and production.
  • Normalization comes with alignments tracking. It's always possible to get the part of the original sentence that corresponds to a given token.
  • Does all the pre-processing: truncate, pad, and add the special tokens your model needs.

Bindings

We provide bindings to the following languages (more to come!):

  • Rust (original implementation)
  • Python
  • Node.js

Quick example using Python:

Choose your model between Byte-Pair Encoding, WordPiece or Unigram and instantiate a tokenizer:

from tokenizers import Tokenizer
from tokenizers.models import BPE

tokenizer = Tokenizer(BPE())

You can customize how pre-tokenization (e.g., splitting into words) is done:

from tokenizers.pre_tokenizers import Whitespace

tokenizer.pre_tokenizer = Whitespace()

Then training your tokenizer on a set of files just takes two lines of code:

from tokenizers.trainers import BpeTrainer

trainer = BpeTrainer(special_tokens=["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"])
tokenizer.train(files=["wiki.train.raw", "wiki.valid.raw", "wiki.test.raw"], trainer=trainer)

Once your tokenizer is trained, encode any text with just one line:

output = tokenizer.encode("Hello, y'all! How are you 😁 ?")
print(output.tokens)
# ["Hello", ",", "y", "'", "all", "!", "How", "are", "you", "[UNK]", "?"]

Check the documentation or the quicktour to learn more!

tokenizers's People

Contributors

ankane, arthurzucker, boyleconnor, chris-ha458, clmnt, dctelus, dependabot[bot], epwalsh, huitseeker, jc-louis, julien-c, kellymarchisio, lhoestq, ljos, lysandrejik, marcusgrass, mariosasko, mcpatate, mert-kurttutan, mfuntowicz, mikelui, mishig25, n1t0, narsil, pierrci, sebpuetz, seongbeomlee, sgugger, thomasw21, thomwolf


tokenizers's Issues

Compatibility with torchtext

Normally when using a custom tokenizer with torchtext fields, you can pass the tokenizer function to the Field constructor and then build a vocab attribute which keeps track of the stoi mapping.

TEXT = Field(sequential=True, tokenize=my_tokenizer_fn)
TEXT.build_vocab(train_data) # builds the stoi/itos mapping

Since 🤗 tokenizers build their own vocab mappings, what's the best way to use them with torchtext, for example to use one of their datasets? If you just did the above, the TEXT.vocab mappings wouldn't match the tokenizer mappings. Unfortunately I haven't seen a simple way of using custom mappings in torchtext. The best solution I've found so far is to follow the above procedure and then manually override the TEXT vocab with the tokenizer one. So that would look something like this:

from collections import defaultdict

from torchtext.data import Field
from torchtext.datasets import WikiText2
from tokenizers import BertWordPieceTokenizer

tokenizer = BertWordPieceTokenizer(...)
tokenizer_fn = lambda string: tokenizer.encode(string).tokens
TEXT = Field(sequential=True, tokenize=tokenizer_fn)
train, valid, test = WikiText2.splits(TEXT)
TEXT.build_vocab(train)

def set_vocab_mapping(vocab, tokenizer, unk_token='[UNK]'):
    stoi = defaultdict(lambda: tokenizer.token_to_id(unk_token))
    itos = []
    for i in range(tokenizer._tokenizer.get_vocab_size()):
        token = tokenizer.id_to_token(i)
        stoi[token] = i
        itos.append(token)
    vocab.stoi = stoi
    vocab.itos = itos

set_vocab_mapping(TEXT.vocab, tokenizer)

Is there a more straightforward way to do this? If not, it might be handy to have a helper function and/or example for others to reference since torchtext is so ubiquitous.

Tokenization Training not working?

I am on a Google Colab notebook.

I have a vocab.txt file (a screenshot of some sample content was attached above). The full file is 83 MB.

I am training a tokenizer on it:

from tokenizers import (ByteLevelBPETokenizer,
                            BPETokenizer,
                            SentencePieceBPETokenizer)

tokenizer = BPETokenizer()
tokenizer.train(["vocab.txt"],
                vocab_size=30000,
                min_frequency=2,
                special_tokens=["<unk>", "<pad>", "<cls>", "<sep>"],
                limit_alphabet=1000,
                show_progress=True)

and it just keeps running forever... I stopped it after 20 minutes.

Am I doing something wrong?

Node Doc/Typings/Autocompletion

We should be able to provide meaningful documentation and typings using some Typescript index.d.ts file. Needs more digging.

Special tokens are getting encoded.

I'm trying to use SentencePieceBPETokenizer with fastai. As you can see, encoding the special tokens individually returns the correct tokens, but encoding the same text with .encode() doesn't treat them as special tokens. Am I missing something, or is this a bug?

(Screenshot of the encoding results attached.)

Add BPE GPT2 benchmark with cache at capacity

Current benchmarks start with an empty cache which may or may not get filled by the time the benchmark is finished. So it would be useful to also have benchmarks that start with a full cache.

Enabling in-memory inputs for training a new tokenizer

Hi,
Thanks for the release!
I was wondering whether it's possible to feed the BPETokenizer.train() method input other than a list of file names. To be more specific, I'd like to feed it an in-memory data structure like a pandas Series or a list of lists (each representing a doc).
Is that possible without being forced to write to a .txt file?
Thx!
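
A minimal workaround sketch, assuming train() only accepts file paths: dump the in-memory documents (e.g. a pandas Series) to a temporary text file and train on that.

import tempfile

from tokenizers import BPETokenizer

docs = ["first document ...", "second document ..."]  # e.g. my_series.tolist()

# Write the in-memory corpus to a temporary file, one document per line
with tempfile.NamedTemporaryFile(mode="w", suffix=".txt", delete=False) as f:
    f.write("\n".join(docs))
    corpus_path = f.name

tokenizer = BPETokenizer()
tokenizer.train([corpus_path], vocab_size=30000, min_frequency=2)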

BertWordPieceTokenizer error

Hi, the following code from the main example doesn't work:

tokenizer = BertWordPieceTokenizer("bert-base-uncased-vocab.txt", lowercase=True)


Error:

Os { code: 2, kind: NotFound, message: "No such file or directory" }
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/username/.local/lib/python3.6/site-packages/tokenizers/implementations/bert_wordpiece.py", line 26, in __init__
    tokenizer = Tokenizer(WordPiece.from_files(vocab_file, unk_token=unk_token))
Exception: Error while initializing WordPiece

[MASK] token is missing in BERT wordpiece example

Hi,

in the current example script for BERT with wordpiece:

https://github.com/huggingface/tokenizers/blob/master/bindings/python/examples/train_bert_wordpiece.py

the [MASK] token is missing. This will lead to the following error message:

Traceback (most recent call last):                                                      
  File "create_pretraining_data.py", line 469, in <module>                                                                                                                     
    tf.app.run()                                                                                                                                                                         
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/platform/app.py", line 40, in run                                                                                  
    _run(main=main, argv=argv, flags_parser=_parse_flags_tolerate_undef)                          
  File "/usr/local/lib/python3.6/dist-packages/absl/app.py", line 299, in run                                                      
    _run_main(main, args)                                                                         
  File "/usr/local/lib/python3.6/dist-packages/absl/app.py", line 250, in _run_main                                                                            
    sys.exit(main(argv))                                       
  File "create_pretraining_data.py", line 462, in main                                                                                                                                   
    FLAGS.max_predictions_per_seq, output_files)                                                  
  File "create_pretraining_data.py", line 107, in write_instance_to_example_files
    input_ids = tokenizer.convert_tokens_to_ids(instance.tokens)                                                                             
  File "/mnt/histobert3/bert/tokenization.py", line 179, in convert_tokens_to_ids
    return convert_by_vocab(self.vocab, tokens)                                                                                                                                          
  File "/mnt/histobert3/bert/tokenization.py", line 140, in convert_by_vocab
    output.append(vocab[item])                                               
KeyError: '[MASK]'

This happens when trying to create the pre-training data for BERT. I checked the original BERT vocab files and it seems that these are the only "special" tokens:

[PAD]
[UNK]
[CLS]
[SEP]
[MASK]

So I think the train command should be:

tokenizer.train(
    files,
    vocab_size=10000,
    min_frequency=2,
    show_progress=True,
    special_tokens=['[PAD]', '[UNK]', '[CLS]', '[SEP]', '[MASK]' ],
    limit_alphabet=1000,
    wordpieces_prefix="##"
)

Support for multiple languages!

Hi,
Is adding support for multiple languages on the roadmap?
If yes, I would like to help implement this. Where should I start?

Why not use rayon?

Hey, to start off, congratulations on this successful release of a tokenizer written in Rust. It is indeed a great idea, and as a fellow Rust user I'm happy to see it in use in NLP. I was wondering why the core tokenization code does not use Rayon? The word tokenization seems embarrassingly parallel over the number of words, so it should be a free speedup on typical multi-core machines. Furthermore, it should be a relatively small code change, not a complete rewrite, I think.

Let me know your thoughts!

Offsets / Alignment

We want to provide alignment information during tokenization. This will allow users to retrieve subslices of the original string provided to the tokenizer.

Right now, the Encoding already knows about said offsets and deals with them partially during truncation and padding. We still need to provide alignment information during the normalization and pre-tokenization steps for this to work properly.
Each implementation of Normalizer and PreTokenizer will then have to provide these alignments.

JS / WebAssembly binding planned ?

I see your Node.js binding uses Neon, but have you considered WebAssembly? There are tools to compile Rust to it easily, so you would get browser compatibility and Node v13 support with a low impact on speed.

[Idea] Support "training" and freezing BPE Cache

I think we could see a big improvement on the BPE performance if we allowed pre-training / pre-filling the cache and then freezing it so that no more writes can occur. The cache training process could be really simple, like just compute word counts on a corpus and fill the cache with the most frequent words.

Then if / once the cache has been trained, there is no need to write to it again, which should greatly improve performance (especially in the multi-threaded case) since blocking write locks won't need to be acquired.
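
To make the idea concrete, here is a rough sketch in plain Python (not the library's API, and tokenize_word stands in for the actual BPE merge routine): count words over a corpus, pre-fill the cache with the output for the most frequent ones, then treat the cache as read-only.

from collections import Counter

def build_frozen_cache(corpus_lines, tokenize_word, cache_size=50000):
    # Count word frequencies over the corpus
    counts = Counter(word for line in corpus_lines for word in line.split())
    # Pre-fill the cache with the tokenization of the most frequent words;
    # afterwards the dict is only read, so no write locks are ever needed
    return {word: tokenize_word(word) for word, _ in counts.most_common(cache_size)}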

Rust documentation

The main README is completely out-of-date.

We also want to provide documentation in order to prepare for the crate release. This covers the Rust documentation only, not bindings.

cannot encode with custom BPE tokenizer

Hi, thanks for the library! I tried training a BPE tokenizer over custom corpus, following your examples.

In one notebook I run:

import tokenizers
tokenizer = tokenizers.Tokenizer(tokenizers.models.BPE.empty())
tokenizer.pre_tokenizer = tokenizers.pre_tokenizers.ByteLevel.new(add_prefix_space=True)
tokenizer.decoder = tokenizers.decoders.ByteLevel.new()
trainer = tokenizers.trainers.BpeTrainer.new(vocab_size=30000, min_frequency=5)
tokenizer.train(trainer, ["vocab.txt"])
tokenizer.model.save(folder=".", name="custom_bpe_tokenizer")

This works fine and I am able to encode text given that tokenizer.

However, if I try to reinitialize that tokenizer in another notebook:

from tokenizers import BPETokenizer
tk = BPETokenizer(vocab_file = 'custom_bpe_tokenizer-vocab.json',
                  merges_file = 'custom_bpe_tokenizer-merges.txt', )
tokenized = tk.encode("hello")

gives an error:

ExceptionTraceback (most recent call last)
<ipython-input-22-d176a55b6c42> in <module>
----> 1 tokenized = tk.encode("hello")

~/miniconda3/lib/python3.7/site-packages/tokenizers/implementations/base_tokenizer.py in encode(self, sequence, pair)
    118             An Encoding
    119         """
--> 120         return self._tokenizer.encode(sequence, pair)
    121 
    122     def encode_batch(self, sequences: List[Union[str, Tuple[str, str]]]) -> List[Encoding]:

Exception: Unk token `<unk>` not found in the vocabulary

I'm pretty sure "hello" is in my vocabulary and I tried with a few words from vocab.json and always get the same error. Any advice here?

Why Rust?

Hello team, congrats on this release!

I was wondering how the decision to implement this in Rust came about. Did you consider other languages, or was this based on the skill set already existing on the team?

Congrats again!

char positions <> token positions

Hi, great library!

I've got a question and request. Would it be possible to also return a map relating input char positions and output token positions?

Use case: calculating char span endpoints from token span endpoints for, e.g., SQuAD-type question answering.

Hope to hear from you,
Torsten
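
For reference, the offsets that encode() already returns can be inverted into such a map. A hedged sketch, assuming encoding.offsets holds one (start, end) character span per token:

def char_to_token(encoding, char_pos):
    # Return the index of the token whose character span contains char_pos,
    # or None if the position falls between tokens (e.g. on whitespace)
    for token_idx, (start, end) in enumerate(encoding.offsets):
        if start <= char_pos < end:
            return token_idx
    return None

output = tokenizer.encode("What is extractive question answering?")
print(char_to_token(output, 8))  # index of the token covering character 8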

Not able to import in python 3.5.2

Hi team,

I have tried to install tokenizers using both pip and source. However, getting below error when importing.

from tokenizers import BPETokenizer
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "NLP/NLP_venv/lib/python3.5/site-packages/tokenizers-0.1.1-py3.5-linux-x86_64.egg/tokenizers/__init__.py", line 10, in <module>
    from .implementations import (
  File "NLP/NLP_venv/lib/python3.5/site-packages/tokenizers-0.1.1-py3.5-linux-x86_64.egg/tokenizers/implementations/__init__.py", line 1, in <module>
    from .base_tokenizer import BaseTokenizer
  File "NLP/NLP_venv/lib/python3.5/site-packages/tokenizers-0.1.1-py3.5-linux-x86_64.egg/tokenizers/implementations/base_tokenizer.py", line 6
    _tokenizer: Tokenizer
              ^
SyntaxError: invalid syntax

Python Doc/Typings/Autocompletion

After some tests on all the different ways to go, it appears that we will have to start by

  • Laying out the package structure with relevant __init__.py files and imports
  • Providing .pyi files with typings
  • Documentation using docstrings in the .pyi files

All of these will be hand-written for now, and we'll think about automatically generating these later on.

Automatically loading vocab files

It would be nice if the vocab files were automatically downloaded when they don't already exist. It would also help to add a short note in the README so that folks know they need to manually download the vocab files. Specifically, when running the following line of code:

tokenizer = BertWordPieceTokenizer("bert-base-uncased-vocab.txt", lowercase=True)

which results in the following error if the vocab file doesn't exist:

Exception: Error while initializing WordPiece
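
Until that's built in, a hedged stop-gap sketch that fetches the vocab file when it's missing (the download URL below is an assumption and may differ):

import os
import urllib.request

from tokenizers import BertWordPieceTokenizer

# Assumed hosting location for the vocab file -- adjust if it differs
VOCAB_URL = "https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-uncased-vocab.txt"
VOCAB_PATH = "bert-base-uncased-vocab.txt"

if not os.path.exists(VOCAB_PATH):
    urllib.request.urlretrieve(VOCAB_URL, VOCAB_PATH)

tokenizer = BertWordPieceTokenizer(VOCAB_PATH, lowercase=True)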

[Python bindings] Add method in BaseTokenizer to get vocab size.

Feature request

Please add a method in BaseTokenizer to get vocab size. It is useful for specifying arguments to nn.Embedding layers in our external model pipeline. It can simply be:

def get_vocab_size(self, with_added_tokens: Optional[bool] = True) -> int:
    return self._tokenizer.get_vocab_size(with_added_tokens)

It should mirror the underlying Rust implementation. See: https://github.com/huggingface/tokenizers/blob/master/bindings/python/tokenizers/implementations/base_tokenizer.py

WordPiece trainer

Right now we can load a WordPiece model from a file, but we are not able to train one directly.

Python - Encoding.overflowing.original_str is not interpreted as str

Currently, the following code will crash as Encoding.overflowing.original_str is not recognised as python str

tokenizer.encode(encoding.overflowing.original_str)
>>> Exception: Input must be a list[str] or list[(str, str)]

type(encoding.overflowing.original_str)
>>> <class 'IndexableString'>

Issue with compile from source

For some reason compilation from source isn’t working for me.

I run: sudo python3 setup.py install

/usr/lib/python3.5/distutils/dist.py:261: UserWarning: Unknown distribution option: 'long_description_content_type'
  warnings.warn(msg)
running install
running bdist_egg
running egg_info
writing tokenizers.egg-info/PKG-INFO
writing top-level names to tokenizers.egg-info/top_level.txt
writing dependency_links to tokenizers.egg-info/dependency_links.txt
reading manifest file 'tokenizers.egg-info/SOURCES.txt'
reading manifest template 'MANIFEST.in'
warning: no previously-included files matching '*' found under directory 'tokenizers-lib/target'
writing manifest file 'tokenizers.egg-info/SOURCES.txt'
installing library code to build/bdist.linux-x86_64/egg
running install_lib
running build_py
running build_ext
running build_rust
error: Can not find Rust compiler

rustc is in PATH:
rustc --version outputs:

rustc 1.42.0-nightly (b5a3341f1 2020-01-20)

I also have setuptools-rust installed.

I’m on ubuntu 16.04

Any help appreciated :)

Limiting number of processes

It seems like the code is currently spawning an unbounded number of processes (is that correct?)

It would be really useful if we could limit that somehow. Especially on machines where we need to share with some other people.
Is it possible today?

Tokenizer does not respect truncation when given a pair

I have a custom Sentencepiece tokenizer with BERT post-processing, with padding and truncation enabled to a fixed max_seq_len, and it works as long as I don't encode a sentence pair:

encoding = tokenizer.encode('hello, how are you')
assert len(encoding.tokens) == max_seq_len

encoding = tokenizer.encode('hello, how are you', 'fine and you')
assert len(encoding.tokens) == max_seq_len # fails

I've had a look at the Rust codebase and everything looks like it should work. This behavior happens both in single encodings and batch encodings. Do you have any idea where the error might be? Is it user error?

Add Template post-processor

The RoBERTa tokenizer requires inputs (and pairs) to be wrapped in <s> ... </s> (the cls and eos tokens, respectively).

Tokenizer saving/loading

We need to provide a way to save and load tokenizers to/from files.
Things that need to be saved:

  • Each part (Normalizer, PreTokenizer, ..) and their options
  • Added tokens / special tokens
  • The model's vocabulary

We can approach this in multiple ways, but in the end, we would like to have a single self-contained file that represents a tokenizer. We will probably need to have some scripts to convert existing models to this new format.

incompatible with transformers 2.3.0 with tokenizers 0.0.11

$ pip install tokenizers --upgrade
ERROR: transformers 2.3.0 has requirement tokenizers==0.0.11, but you'll have tokenizers 0.1.1 which is incompatible.
Installing collected packages: tokenizers
  Found existing installation: tokenizers 0.0.11
    Uninstalling tokenizers-0.0.11:
      Successfully uninstalled tokenizers-0.0.11
Successfully installed tokenizers-0.1.1

Is this a problem?

Decoding to string

Hi, thanks for this awesome library!

I want to decode BPE back to actual text, so that I can calculate BLEU scores. When I use the tokenizer.decoder, I get a string without any whitespace. I understand I can use a pre_tokenizer to get whitespaces, but in that case the decoded output would be i can feel the mag i c , can you ? (or something similar, depending on the BPE model). How do I get the actual text through decoding, so that I can calculate BLEU scores like I normally would?

from tokenizers import Tokenizer, models, pre_tokenizers, decoders

# Load a BPE Model
vocab = "./scripts/vocab.json"
merges = "./path/to/merges.txt"
bpe = models.BPE.from_files(vocab, merges)

# Initialize a tokenizer
tokenizer = Tokenizer(bpe)

# Customize pre-tokenization and decoding
tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel.new(add_prefix_space=True)
tokenizer.decoder = decoders.ByteLevel.new()

# And then encode:
encoded = tokenizer.encode("i can feel the magic, can you?")

decoded = tokenizer.decode(encoded.ids)
print(encoded.tokens)
print(decoded)
>>> ['i', 'can', 'feel', 'the', 'mag', 'i', 'c', ',', 'can', 'you', '?']
>>> icanfeelthemagic,canyou?

`OnlyFirst` and `OnlySecond` truncation strategies

The current behavior for OnlyFirst and OnlySecond truncation strategies is not the one I would expect, and diverges from the current behavior in transformers:

TruncationStrategy::OnlyFirst | TruncationStrategy::OnlySecond => {
    let target = if params.strategy == TruncationStrategy::OnlyFirst {
        Ok(&mut encoding)
    } else if let Some(encoding) = pair_encoding.as_mut() {
        Ok(encoding)
    } else {
        Err(Box::new(Error::SecondSequenceNotProvided))
    }?;

    if target.get_ids().len() > params.max_length {
        target.truncate(params.max_length, params.stride);
    }
}

It currently takes only the first encoding (OnlyFirst) or the second one (OnlySecond), and then truncates it to make its length less than the desired max_length.

But this doesn't guarantee that the combined encodings have a length below max_length, which is the behavior I was expecting: those strategies should take the combined length into account when truncating only the first or the second encoding.

What do you think @n1t0 @mfuntowicz?

Different results compared with python BertWordPieceTokenizer

Hi Team,

Thanks for your great work! Recently I tested the BertWordPieceTokenizer from this repo on my multilingual dataset. Compared with the Python BertWordPieceTokenizer, the results differ in some cases; I found two types:

  • Punctuation Type:

    • Sentence: 台北‧08月20日
    • Python Results: '台', '北', '‧', '08', '月', '20', '日'
    • Rust Results: '台', '北', '‧', '##0', '##8', '月', '20', '日'
  • [UNK] Type:

    • Sentence: application remained denied“
    • Python Results: 'application', 'remained', 'denied', '[UNK]'
    • Rust Results: 'application', 'remained', '[UNK]'

Are these differences expected?

Best Wishes!

Missing serialization preventing multiprocessing

Hey,

I finally found some time to test the new tokenizers. Seems really promising! Great work!
We measured speedups of about 8x 🚀

As mentioned here deepset-ai/FARM#157, the only blocker for us right now:

❌ The Tokenizer objects can't be pickled and are therefore not usable with python's multiprocessing. As we make heavy use of multiprocessing during preprocessing, we can't really use them right now . Not sure how much of work is needed for fixing this, but for the XLM-R python tokenizer it was a very easy fix (huggingface/transformers#2414).

Related to #87
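
A common interim workaround, sketched below under assumed file names, is to construct the tokenizer inside each worker process via a pool initializer, so the Tokenizer object itself never has to be pickled:

from multiprocessing import Pool

from tokenizers import BertWordPieceTokenizer

_tokenizer = None

def _init_worker(vocab_path):
    # Build the tokenizer in each worker instead of pickling it across processes
    global _tokenizer
    _tokenizer = BertWordPieceTokenizer(vocab_path, lowercase=True)

def _encode(text):
    return _tokenizer.encode(text).ids

if __name__ == "__main__":
    texts = ["first text", "second text"]
    with Pool(processes=4, initializer=_init_worker, initargs=("bert-base-uncased-vocab.txt",)) as pool:
        print(pool.map(_encode, texts))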

Provide `original_str` in node bindings

When using the node bindings, there is no way for now to map back to the original string like it's possible in python or rust with output.original_str[output.offsets[1]]

Feature Request: Customizable Word Tokenizers - Spacy

spaCy has customizable word-level tokenizers with rules for multiple languages. I think porting that to Rust would add nicely to this package. Having a customizable, uniform word-level tokenization across platforms (client web, server) and languages would be beneficial. Currently I don't know of any clean way, or whether it's even possible, to write bindings for spaCy's Cython.

Spacy Tokenizer Code

https://github.com/explosion/spaCy/blob/master/spacy/tokenizer.pyx

Tokenizer exceptions for english

https://github.com/explosion/spaCy/blob/master/spacy/lang/en/tokenizer_exceptions.py

I can put in some time doing this.

ByteLevelBPETokenizer adding special character at the beginning of most of the words

I am trying to train ByteLevelBPETokenizer tokenizer. But the tokens that I get have a special character at the beginning of the word: Ġ (ord = 288). I don't know why this is happening but this is creating duplicate tokens in the tokenizer vocabulary.

# file contains a text/sentence on each line.
from tokenizers import ByteLevelBPETokenizer
tokenizer = ByteLevelBPETokenizer()
tokenizer.train(file)

# encode
output = tokenizer.encode("say what?")
output.tokens
>>> ['say', 'Ġwhat', '?'] # what is this weird thing against `what` ??

output = tokenizer.encode("what ever?")
output.tokens
>>> ['what', 'Ġever', '?']

Am I doing something wrong here?
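
For context, Ġ is how byte-level BPE marks a leading space, so 'what' and 'Ġwhat' are genuinely different symbols (without and with a preceding space) rather than duplicates. A hedged round-trip check, assuming the tokenizer's decoder is set to ByteLevel:

output = tokenizer.encode("what ever?")
print(output.tokens)                 # ['what', 'Ġever', '?'] -- Ġ encodes the space before "ever"
print(tokenizer.decode(output.ids))  # expected to give back "what ever?"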
