hugging-face-supporter / tftokenizers

Use Huggingface Transformer and Tokenizers as Tensorflow Reusable SavedModels

License: Apache License 2.0

Makefile 1.23% Shell 2.84% Python 95.93%
bert nlp tensorflow transformers natural-language-processing tensorflow-hub sentencepiece tokenizer tokenizers

tftokenizers's Introduction

tftokenizers

Converting Huggingface tokenizers to Tensorflow tokenizers. The main reason is to be able to bundle the tokenizer and model into one Reusable SavedModel, inspired by the Tensorflow Official Guide on tokenizers.


Source Code: https://github.com/Hugging-Face-Supporter/tftokenizers


Models we know work:

"bert-base-cased"
"bert-base-uncased"
"bert-base-multilingual-cased"
"bert-base-multilingual-uncased"
# Distilled
"distilbert-base-cased"
"distilbert-base-multilingual-cased"
"microsoft/MiniLM-L12-H384-uncased"
# Non-english
"KB/bert-base-swedish-cased"
"bert-base-chinese"

Examples

This is an example of how one can use a Huggingface model and tokenizer bundled together as a Reusable SavedModel, which yields the same results as using the model and tokenizer directly from Huggingface 🤗

import tensorflow as tf
from transformers import TFAutoModel
from tftokenizers import TFModel, TFAutoTokenizer

# Load base models from Huggingface
model_name = "bert-base-cased"
model = TFAutoModel.from_pretrained(model_name)

# Load converted TF tokenizer
tokenizer = TFAutoTokenizer.from_pretrained(model_name)

# Create a TF Reusable SavedModel
custom_model = TFModel(model=model, tokenizer=tokenizer)

# Tokenizer and model can handle `tf.Tensors` or regular strings
tf_string = tf.constant(["Hello from Tensorflow"])
s1 = "SponGE bob SQuarePants is an avenger"
s2 = "Huggingface to Tensorflow tokenizers"
s3 = "Hello, world!"

output = custom_model(tf_string)
output = custom_model([s1, s2, s3])

# We can now pass input as tensors
output = custom_model(
    inputs=tf.constant([s1, s2, s3], dtype=tf.string, name="inputs"),
)

# Save the model with bundled tokenizer
saved_name = "reusable_bert_tf"
tf.saved_model.save(custom_model, saved_name)

# Load the saved model
reloaded_model = tf.saved_model.load(saved_name)
output = reloaded_model([s1, s2, s3])
print(output)

Setup

git clone https://github.com/Hugging-Face-Supporter/tftokenizers.git
cd tftokenizers
poetry install
poetry shell

Run

To convert a Huggingface tokenizer to Tensorflow, first choose a model or tokenizer from the Huggingface hub to download.

NOTE

Currently only BERT models work with the converter.

Download

First download tokenizers from the hub by name. Either run the bash script to download multiple tokenizers, or download a single tokenizer with the python script.

The idea is to eventually automate the download and conversion steps entirely.

python tftokenizers/download.py -n bert-base-uncased
bash scripts/download_tokenizers.sh

Convert

Convert a downloaded tokenizer from Huggingface format to Tensorflow:

python tftokenizers/convert.py

Before Commit

make build

FAQ

How to know what tokenizer is used?

TL;DR

from transformers import AutoTokenizer

name = "bert-base-cased"
tokenizer = AutoTokenizer.from_pretrained(name)

# If the tokenizer is fast:
print(tokenizer.is_fast)
# Base tokenizer model
print(type(tokenizer.backend_tokenizer.model))
# Check if it is a SentencePiece tokenizer:
# should be `vocab.txt` or `vocab.json` if NOT a SentencePiece tokenizer,
# SentencePiece if "vocab_file" is "sentencepiece.bpe.model"
print(tokenizer.vocab_files_names)

# Else
# Find if the model is a SentencePiece model with
print(vars(tokenizer).get("spm_file", None))
# print(vars(tokenizer).get("sp_model", None))
๐Ÿ“ Read More: And the components of the tokenizers described [here](https://huggingface.co/docs/tokenizers/python/latest/components.html) as: - Normalizers - Pre tokenizers - [Models](https://huggingface.co/docs/tokenizers/python/latest/components.html#models) - PostProcessor - Decoders

When loading a tokenizer with Huggingface transformers, the model name is mapped to the matching model and tokenizer on the Huggingface Hub; if there is no match, it will try to find a folder with that name on your local computer.

Additionally, tokenizers from Huggingface are defined in multiple steps using the Huggingface tokenizers library. For those interested, you can look into how the composition of a tokenizer works from the different components of that library here. There is also a great guide documenting how the composition of tokenizers is done in this Medium article.
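
To see which components a given tokenizer is composed of, you can inspect its backend tokenizer directly. A minimal sketch (the class names in the comments are what we would expect for bert-base-cased, not guaranteed for other models):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
backend = tokenizer.backend_tokenizer

# Each component of the composed tokenizer can be inspected individually
print(type(backend.normalizer))      # e.g. normalizers.BertNormalizer
print(type(backend.pre_tokenizer))   # e.g. pre_tokenizers.BertPreTokenizer
print(type(backend.model))           # e.g. models.WordPiece
print(type(backend.post_processor))  # e.g. processors.TemplateProcessing
print(type(backend.decoder))         # e.g. decoders.WordPiece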

What tokenizers are used by what models?

๐Ÿ“ Read More: As stated in the section above, you will need to look at each model to inspect the type of tokenizer it is using, but in general there are just a few "base tokenizers / models". See [Huggingface documentation](https://huggingface.co/docs/transformers/tokenizer_summary) for explanation on how these "base tokenizers" are defined

Base Tokenizer Names and Model Implementations

SentencePiece tokenizers can either be BPE (rare if the tokenizer is fast) or Unigram (all Unigram tokenizers are SentencePiece)

BPE = tokenizers.models.BPE

Unigram = tokenizers.models.Unigram

WordPiece = tokenizers.models.WordPiece

  • Implemented by

    Bert WordPiece

  • Used by

    BERT, mBERT, miniLM, distilled versions of BERT
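
To check programmatically which base tokenizer model a checkpoint uses, you can compare the backend model against the classes from the tokenizers library. A minimal sketch for a fast BERT tokenizer:

from tokenizers.models import BPE, Unigram, WordPiece
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
model = tokenizer.backend_tokenizer.model

print(isinstance(model, WordPiece))       # True for BERT-style tokenizers
print(isinstance(model, (BPE, Unigram)))  # False for BERT-style tokenizers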

SentencePiece

SentencePiece is a method for creating sub-word tokenizations. It supports both BPE and Unigram.

SentencePiece is a separate library implemented in C++, with Python and Tensorflow bindings. The vocabulary is bundled into:

For fast models:

"vocab_file_names":

`sentencepiece.bpe.model` for "BPE" and
`spiece.model` for Unigram

For slow models:

"vocab_file_names":

'source_spm': 'source.spm',
'target_spm': 'target.spm',
'vocab': 'vocab.json'

"spm_files":

will be a single file or a list of files
...
  • Used by:

    Fast: T5 models
    Slow: facebook/m2m100_418M, facebook/wmt19-en-de
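
On the Tensorflow side, tensorflow_text ships a SentencepieceTokenizer that can load such a serialized model file directly. A minimal sketch, assuming a spiece.model file (e.g. from a T5 checkpoint) has already been downloaded:

import tensorflow as tf
import tensorflow_text as text

# Read the serialized SentencePiece model (assumed downloaded beforehand)
with open("spiece.model", "rb") as f:
    sp_model = f.read()

tokenizer = text.SentencepieceTokenizer(model=sp_model, out_type=tf.int32)
print(tokenizer.tokenize(["Hello from Tensorflow"]))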

How to implement the tokenizers from Huggingface to Tensorflow?

You will need to download the Huggingface tokenizer of your choice, determine the type of the tokenizer (is_fast, tokenizer type and vocab_files_names), and then map the tokenizer used to the Tensorflow-supported equivalent:

BPE and Unigram:

tensorflow/text#422

WordPiece:

tensorflow/text#116 tensorflow/text#414
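
For WordPiece-based models, the Tensorflow-side equivalent this project builds on is tensorflow_text's BertTokenizer. A minimal sketch, assuming a vocab.txt file for a cased BERT model is available locally:

import tensorflow as tf
import tensorflow_text as text

# BertTokenizer adds BERT's pre-tokenization (splitting sentences into words)
# on top of the WordPiece sub-word model
tokenizer = text.BertTokenizer("vocab.txt", lower_case=False, token_out_type=tf.int64)
tokens = tokenizer.tokenize(tf.constant(["Hello from Tensorflow"]))
print(tokens)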

What other ways are there to convert a tokenizer?

๐Ÿ“ Read More: With `tfokenizers` there are three ways to use the package:
import tensorflow as tf
import tensorflow_text as text
from transformers import AutoTokenizer, TFAutoModel
from transformers.utils.logging import set_verbosity_error

from tftokenizers.file import (
    get_filename_from_path,
    get_vocab_from_path,
    load_json
)
from tftokenizers.model import TFModel
from tftokenizers.tokenizer import TFAutoTokenizer, TFTokenizerBase

set_verbosity_error()
tf.get_logger().setLevel("ERROR")

pretrained_model_name = "bert-base-cased"


# a) by model_name
tf_tokenizer = TFAutoTokenizer.from_pretrained(pretrained_model_name)

# b) bundled with the model, similar to TFHub
model = TFAutoModel.from_pretrained(pretrained_model_name)
custom_model = TFModel(model=model, tokenizer=tf_tokenizer)

# c) from source, using the saved files of a transformers tokenizer
# Make sure you run download.py or the download script first
PATH = "saved_tokenizers/bert-base-uncased"
vocab = get_vocab_from_path(PATH)
vocab_path = get_filename_from_path(PATH, "vocab")

config = load_json(f"{PATH}/tokenizer_config.json")
tokenizer_spec = load_json(f"{PATH}/tokenizer.json")
special_tokens_map = load_json(f"{PATH}/special_tokens_map.json")

tokenizer_base_params = dict(lower_case=True, token_out_type=tf.int64)
tokenizer_base = text.BertTokenizer(vocab_path, **tokenizer_base_params)
custom_tokenizer = TFTokenizerBase(
    vocab_path=vocab_path,
    tokenizer_base=tokenizer_base,
    hf_spec=tokenizer_spec,
    config=config,
)

How to save Huggingface Tokenizer files locally?

๐Ÿ“ Read More:

To download the files used by Huggingface tokenizers, you can either download one by name

python tftokenizers/download.py -n KB/bert-base-swedish-cased

or download multiple

bash scripts/download_tokenizers.sh
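
Alternatively, if you already have a tokenizer loaded, the standard save_pretrained API from transformers writes the same files to disk. A small sketch:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")

# Writes tokenizer.json, tokenizer_config.json, special_tokens_map.json
# and the vocab file(s) into the target folder
tokenizer.save_pretrained("saved_tokenizers/bert-base-cased")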

WIP

  • Convert a BERT tokenizer from Huggingface to Tensorflow
  • Make a TF Reusable SavedModel with Tokenizer and Model in the same class. Emulate how the TF Hub example for BERT works.
  • Find methods for identifying the base tokenizer model and map those settings and special tokens to new tokenizers
  • Extend the tokenizers to more tokenizer types and identify them from a Huggingface model name
  • Document how others can use the library and document the different stages in the process
  • Improve the conversion pipeline (such as downloading and exporting files if not passed in or available locally)
  • model_max_length should be regulated. However, some newer models have the max_length for tokenizers at 1000_000_000
  • Support more tokenizers, starting with SentencePiece
  • Identify tokenizer conversion limitations
  • Support encoding of two sentences at a time Ref
  • Allow the tokenizers to be used for Masking (MLM) Ref

tftokenizers's People

Contributors

markussagen

tftokenizers's Issues

Describe how more tokenizers can be added

To make it easier to understand how new tokenizers can be added, provide some more details about the process. This will help with understanding how to approach #2.

Reduce tensorflow noise

Tensorflow is still quite prone to producing a lot of noise when printing things:

os.environ["TF_CPP_MIN_LOG_LEVEL"] = "1"  # Reduce the amount of console output from TF
import tensorflow as tf  # noqa: E402

Tokenizers do not convert tokens correctly

The following BERT-based tokenizers do not match when converting HF tokenizers to TF tokenizers:

pretrained_model_name = "albert-base-v2"
pretrained_model_name = "emilyalsentzer/Bio_ClinicalBERT"
pretrained_model_name = "microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract"
pretrained_model_name = "sentence-transformers/all-MiniLM-L12-v2"

Remove deprecated class `TFTokenizerBase`

Since we can now load a tokenizer directly based on the name of a Huggingface tokenizer, the old class is no longer needed and should be removed.

Along with the class, we can also clean up the tests

Can't reproduce the example in the Readme

Hi,

first of all, thanks for sharing this code!

When I copy-paste the example from the Examples section of the Readme, the line output = reloaded_model([s1, s2, s3]) causes the following error:

In [16]: output = reloaded_model([s1, s2, s3])
    ...: 
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-16-977dbd3103dd> in <cell line: 1>()
----> 1 output = reloaded_model([s1, s2, s3])

~/.cache/pypoetry/virtualenvs/tftokenizers-F6-m1WpK-py3.8/lib/python3.8/site-packages/tensorflow/python/saved_model/load.py in _call_attribute(instance, *args, **kwargs)
    699 
    700 def _call_attribute(instance, *args, **kwargs):
--> 701   return instance.__call__(*args, **kwargs)
    702 
    703 

~/.cache/pypoetry/virtualenvs/tftokenizers-F6-m1WpK-py3.8/lib/python3.8/site-packages/tensorflow/python/util/traceback_utils.py in error_handler(*args, **kwargs)
    151     except Exception as e:
    152       filtered_tb = _process_traceback_frames(e.__traceback__)
--> 153       raise e.with_traceback(filtered_tb) from None
    154     finally:
    155       del filtered_tb

~/.cache/pypoetry/virtualenvs/tftokenizers-F6-m1WpK-py3.8/lib/python3.8/site-packages/tensorflow/python/saved_model/function_deserialization.py in restored_function_body(*args, **kwargs)
    261     """Calls a restored function or raises an error if no matching function."""
    262     if not saved_function.concrete_functions:
--> 263       raise ValueError("Found zero restored functions for caller function.")
    264     # This is the format of function.graph.structured_input_signature. At this
    265     # point, the args and kwargs have already been canonicalized.

ValueError: Found zero restored functions for caller function.

Have you guys ever encountered that before?

Enable more tokenizers, such as SentencePiece

The current implementation relies on converting tokenizers to Tensorflow's BertTokenizer/WordpieceTokenizer.
Ideally, we would like to map these to work with more tokenizers, starting with SentencePiece.

Repository doesn't contain convert.py file

In the readme guide there's this step
[
Convert
Convert downloaded tokenizer from Huggingface format to Tensorflow

python tftokenizers/convert.py
]

but in the current version of the project there is no such file as convert.py

Use built-in functions to access special tokens and ids

The current implementation for mapping tokens to their ids caused some problems when there were new words containing "token" in them. Currently, we map from the vocab file all tokens containing the word "token". However, for (at least non-SentencePiece) tokenizers in Huggingface transformers, there are already two arguments for this:

  • tokenizer.all_special_tokens
  • tokenizer.all_special_ids

Let's test and replace our implementation with the officially supported vocab arguments

def map_special_tokens_to_ids(
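
For reference, the two attributes look like this for a BERT tokenizer (a small sketch; the exact tokens depend on the model):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")

# All registered special tokens and their matching vocabulary ids
print(tokenizer.all_special_tokens)  # e.g. ['[UNK]', '[SEP]', '[PAD]', '[CLS]', '[MASK]']
print(tokenizer.all_special_ids)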

BertTokenizer may not be the optimal choice for conversion

Tensorflow supports two (or three) different types of WordPiece tokenizers.
It could be worth testing the FastWordpiece tokenizer, since it can build the model directly from a vocab and claims to be faster, as mentioned:

But it will likely also require a bit more setup (https://www.tensorflow.org/text/guide/subwords_tokenizer#overview), as the WordpieceTokenizer only splits words, while the BertTokenizer splits whole sentences.
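
A minimal sketch of building a FastWordpieceTokenizer directly from a vocab (the tiny vocab below is made up for illustration; a real run would use a BERT model's vocab.txt):

import tensorflow as tf
import tensorflow_text as text

# Toy vocab for illustration only
vocab = ["[UNK]", "[CLS]", "[SEP]", "hello", "world", "##s"]

# By default FastWordpieceTokenizer also pre-tokenizes (splits on whitespace
# and punctuation), so it can take whole sentences as input
tokenizer = text.FastWordpieceTokenizer(vocab=vocab, token_out_type=tf.int64)
print(tokenizer.tokenize(["hello worlds"]))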

Goal

  • Compare the different tokenizers and see if they yield the same results
  • Compare if the new tokenizer can be saved as a Reusable SavedModel
  • Test if the models that previously failed now work #4

[Optional] Provide TFAutoModels for different head types

When creating a TF Reusable SavedModel, it would be great to not have to create a custom model for every model head.

The basic example currently covers a tokenizer and a base MLM model.
It would be great to provide a TFAutoModelForSequenceClassification that creates a reusable saved model with tokenizer and model. This could then be extended to every transformer head.

Can be done without major changes

  • "TFAutoModel"
  • "TFAutoModelForMultipleChoice"
  • "TFAutoModelForSequenceClassification"
  • "TFAutoModelForPreTraining"
  • "TFAutoModelForTokenClassification"
  • "TFAutoModelWithLMHead"

Require changes to tokenizer masking?

  • "TFAutoModelForCausalLM"
  • "TFAutoModelForMaskedLM"
  • "TFAutoModelForSeq2SeqLM"

Require tokenizer to accept multiple inputs

  • "TFAutoModelForQuestionAnswering"
  • "TFAutoModelForTableQuestionAnswering"

Tokenizer's `model_max_length` is not consistent

Most tokenizers define their max model length as 510 tokens or more, based on:

  • Model max token length minus the number of tokens needed to delimit a sentence (start and end)

Example

Most tokenizers follow this convention, but there are some that have nearly infinite length, with tokenizer.model_max_length=1000000000000000019884624838656

This means that when converting the tokenizer max length, most values in Tensorflow are assumed to be ints, but with a nearly infinite model length the value needs to be a tf.long or greater for the conversion not to fail.

Initially, the tokenizer's model_max_length was set dynamically, but it is now fixed to 510 tokens. This should be changed to reflect the actual tokenizers.
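
One possible fix is to clamp the reported value before handing it to Tensorflow. A sketch with a hypothetical helper; the 512 default and the sentinel threshold are assumptions:

from transformers import AutoTokenizer

DEFAULT_MAX_LENGTH = 512  # assumed fallback
SENTINEL = int(1e12)      # anything above this is clearly not a real limit

def resolve_max_length(name: str) -> int:
    # Hypothetical helper: treat absurdly large model_max_length as "unset"
    tokenizer = AutoTokenizer.from_pretrained(name)
    max_len = tokenizer.model_max_length
    return DEFAULT_MAX_LENGTH if max_len > SENTINEL else max_len

print(resolve_max_length("bert-base-cased"))  # 512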
