kennethenevoldsen / spacy-wrap

spaCy-wrap is a wrapper library for spaCy that lets you include existing fine-tuned transformers from Huggingface in your spaCy pipeline.

Home Page: https://KennethEnevoldsen.github.io/spacy-wrap/

License: MIT License

Python 100.00%

spacy-nlp spacy-extension natural-language-processing spacy-extensions spacy-transformers spacy-models transformers machine-learning language-model pytorch

spacy-wrap's Introduction

spaCy-wrap: For Wrapping fine-tuned transformers in spaCy pipelines

spaCy-wrap is a minimal library for wrapping fine-tuned transformers from the Huggingface model hub in your spaCy pipeline, so that existing models can be used within spaCy workflows.

As far as possible, it follows an API similar to that of spacy-transformers.

NOTE: Since the release of spaCy-wrap, Explosion has released spacy-huggingface-pipelines, which wraps the Huggingface pipeline rather than the transformer itself. There, token aggregation and conversion into spans happen inside the Huggingface pipeline, whereas in spaCy-wrap they happen on the logits of the model, which can sometimes lead to unfortunate differences in results. I generally recommend spacy-huggingface-pipelines for most use cases, but spaCy-wrap can be useful if you need to use the transformer output more directly.
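To make the phrase "aggregation at the logits" more concrete, here is a minimal sketch in plain Python. It is not spaCy-wrap's actual code: the helper names, the toy logits and the alignment are all hypothetical, and mean pooling is only one possible aggregation strategy. It averages the wordpiece logits that belong to each token and then picks the most probable label.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def mean_logits(rows):
    """Average several wordpiece logit vectors into one token-level vector."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def token_labels(wordpiece_logits, alignment, labels):
    """Pick one label per token from the averaged logits of its wordpieces.

    alignment[i] lists the wordpiece indices that belong to token i.
    """
    predictions = []
    for piece_ids in alignment:
        agg = mean_logits([wordpiece_logits[j] for j in piece_ids])
        probs = softmax(agg)
        best = max(range(len(labels)), key=probs.__getitem__)
        predictions.append(labels[best])
    return predictions


# Hypothetical data: two tokens, the second one split into two wordpieces.
labels = ["O", "B-PER"]
wordpiece_logits = [[2.0, 0.1], [0.2, 1.5], [0.9, 1.1]]
alignment = [[0], [1, 2]]

print(token_labels(wordpiece_logits, alignment, labels))
# ['O', 'B-PER']
```

A pipeline that aggregates after decoding each wordpiece separately can reach a different answer from the same logits, which is one way the two libraries may disagree.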

Installation

Installing spacy-wrap is simple using pip:

pip install spacy_wrap

Examples

The following examples show how you can quickly add a fine-tuned transformer model from the Huggingface model hub for text classification, token classification or named entity recognition.

Sequence Classification

In this example, we will use a model fine-tuned for sentiment classification on SST2. This model classifies whether a text is positive or negative. We will add this model to a blank English pipeline:

import spacy
import spacy_wrap

nlp = spacy.blank("en")

config = {
    "doc_extension_trf_data": "clf_trf_data",  # document extension for the forward pass
    "doc_extension_prediction": "sentiment",  # document extension for the prediction
    "model": {
        # the model name or path of huggingface model
        "name": "distilbert-base-uncased-finetuned-sst-2-english",  
    },
}

transformer = nlp.add_pipe("sequence_classification_transformer", config=config)

doc = nlp("spaCy is a wonderful tool")

print(doc.cats)
# {'NEGATIVE': 0.001, 'POSITIVE': 0.999}
print(doc._.sentiment)
# 'POSITIVE'
print(doc._.clf_trf_data)
# TransformerData(wordpieces=...

These pipelines can also easily be applied to multiple documents using nlp.pipe, as one would expect from a spaCy component:

docs = nlp.pipe(
    [
        "I hate wrapping my own models",
        "Isn't there a tool for this?!",
        "spacy-wrap is great for wrapping models",
    ]
)

for doc in docs:
    print(doc._.sentiment)
# 'NEGATIVE'
# 'NEGATIVE'
# 'POSITIVE'

More Examples

It is always nice to have more than one example. Here is another, where we add a hate speech detection model for Danish to a blank Danish pipeline:

import spacy
import spacy_wrap

nlp = spacy.blank("da")

config = {
    "doc_extension_trf_data": "clf_trf_data",  # document extension for the forward pass
    "doc_extension_prediction": "hate_speech",  # document extension for the prediction
    # choose custom labels
    "labels": ["Not hate Speech", "Hate speech"],
    "model": {
        "name": "DaNLP/da-bert-hatespeech-detection",  # the model name or path of huggingface model
    },
}

transformer = nlp.add_pipe("classification_transformer", config=config)

doc = nlp("Senile gamle idiot") # old senile idiot

doc._.clf_trf_data
# TransformerData(wordpieces=...
doc._.hate_speech
# "Hate speech"
doc._.hate_speech_prob
# {'prob': array([0.013, 0.987], dtype=float32), 'labels': ['Not hate Speech', 'Hate speech']}
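If you want to post-process the probabilities yourself, for instance to apply a stricter decision threshold, a small helper is enough. The sketch below assumes the extension holds a dictionary shaped like the output printed above; `confident_label` and its `threshold` parameter are hypothetical names, not part of spacy-wrap.

```python
def confident_label(prob, labels, threshold=0.5):
    """Return the most probable label, or None if its probability is below threshold."""
    best = max(range(len(labels)), key=lambda i: prob[i])
    if prob[best] < threshold:
        return None
    return labels[best]


# Values copied from the example output above.
labels = ["Not hate Speech", "Hate speech"]

print(confident_label([0.013, 0.987], labels))
# Hate speech
print(confident_label([0.45, 0.55], labels, threshold=0.9))
# None
```

With a real document you would pass `doc._.hate_speech_prob["prob"]` and `doc._.hate_speech_prob["labels"]`, assuming the keys match the output shown above.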

Token Classification

We can also wrap a model for token classification:

import spacy
import spacy_wrap
nlp = spacy.blank("en")

config = {
    "model": {"name": "vblagoje/bert-english-uncased-finetuned-pos"},
    # "predictions_to": ["pos"],  # optional, can be "pos", "tag" or "ents"
}

nlp.add_pipe("token_classification_transformer", config=config)

text = "My name is Wolfgang and I live in Berlin"

doc = nlp(text)
print(doc._.tok_clf_predictions)
# ['PRON', 'NOUN', 'AUX', 'PROPN', 'CCONJ', 'PRON', 'VERB', 'ADP', 'PROPN']

By default, spacy-wrap automatically detects whether the labels follow the universal POS tags. If so, it also assigns them to token.pos, similar to regular spaCy pipelines:

print(doc[0].pos_)
# 'PRON'
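One plausible way to detect this is to check whether every label belongs to the Universal Dependencies POS tag set. The sketch below is only an illustration of that idea; `follows_upos` is a hypothetical helper, and spacy-wrap's actual detection logic may differ.

```python
# The 17 universal POS tags defined by Universal Dependencies.
UPOS_TAGS = {
    "ADJ", "ADP", "ADV", "AUX", "CCONJ", "DET", "INTJ", "NOUN", "NUM",
    "PART", "PRON", "PROPN", "PUNCT", "SCONJ", "SYM", "VERB", "X",
}


def follows_upos(labels):
    """True if every label is a universal POS tag."""
    return all(label in UPOS_TAGS for label in labels)


# Predictions copied from the example above.
predictions = ["PRON", "NOUN", "AUX", "PROPN", "CCONJ", "PRON", "VERB", "ADP", "PROPN"]

print(follows_upos(predictions))
# True
print(follows_upos(["B-PER", "I-PER", "O"]))
# False
```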

Named Entity Recognition

In this example, we use a model fine-tuned for named entity recognition. spacy-wrap will in this case infer from the IOB tags that the model is intended for named entity recognition and assign it to doc.ents.

import spacy
import spacy_wrap
nlp = spacy.blank("en")

# specify model from the hub
config = {"model": {"name": "dslim/bert-base-NER"}, 
          "predictions_to": ["ents"]} # forced to be named entity recognition, if left out it will be estimated from the labels

# add it to the pipe
nlp.add_pipe("token_classification_transformer", config=config)

doc = nlp("My name is Wolfgang and I live in Berlin.")

print(doc.ents)
# (Wolfgang, Berlin)
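To illustrate what inferring named entities from IOB tags involves, here is a minimal sketch in plain Python. The helpers `looks_like_iob` and `iob_to_spans` are hypothetical and are not spacy-wrap's implementation; the tag sequence is made up to match the sentence above.

```python
def looks_like_iob(labels):
    """True if every label is 'O' or starts with 'B-' or 'I-'."""
    return all(label == "O" or label[:2] in ("B-", "I-") for label in labels)


def iob_to_spans(tags):
    """Convert per-token IOB tags into (start, end, label) spans, end exclusive.

    An 'I-' tag that does not continue the current span starts a new one.
    """
    spans = []
    start, label = None, None
    for i, tag in enumerate(tags):
        if tag.startswith("I-") and label == tag[2:]:
            continue  # still inside the current span
        if label is not None:
            spans.append((start, i, label))
            start, label = None, None
        if tag.startswith(("B-", "I-")):
            start, label = i, tag[2:]
    if label is not None:
        spans.append((start, len(tags), label))
    return spans


# Hypothetical tags for: My name is Wolfgang and I live in Berlin .
tags = ["O", "O", "O", "B-PER", "O", "O", "O", "O", "B-LOC", "O"]

print(looks_like_iob(tags))
# True
print(iob_to_spans(tags))
# [(3, 4, 'PER'), (8, 9, 'LOC')]
```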

📖 Documentation

Documentation
🔧 Installation	Installation instructions for spacy-wrap.
📰 News and changelog	New additions, changes and version history.
🎛 Documentation	The reference for spacy-wrap's API.

💬 Where to ask questions

Type
🚨 FAQ	FAQ
🚨 Bug Reports	GitHub Issue Tracker
🎁 Feature Requests & Ideas	GitHub Issue Tracker
👩‍💻 Usage Questions	GitHub Discussions
🗯 General Discussion	GitHub Discussions


spacy-wrap's Issues

Add license

Poetry dependency resolution fails. PyPi 1.0.0 version requires spaCy < 3.3.0

How to reproduce the behaviour

pyproject.toml contains

spacy = "^3.3.0"

Add spacy-wrap with

poetry add spacy_wrap

Info about spaCy

spaCy version: 3.3.0
Platform: Windows-10-10.0.19043-SP0
Python version: 3.9.10
Pipelines: en_core_web_sm (3.3.0), en_core_web_trf (3.3.0)

Issue

Dependency resolution fails with poetry

SolverProblemError

  Because spacy-wrap (1.0.0) depends on spacy (>=3.2.1,<3.3.0)
   and no versions of spacy-wrap match >1.0.0,<2.0.0, spacy-wrap (>=1.0.0,<2.0.0) requires spacy (>=3.2.1,<3.3.0).
  So, because spacytest depends on both spacy (^3.3.0) and spacy-wrap (^1.0.0), version solving failed.

setup.cfg still has

install_requires = 
	spacy_transformers>=1.1.4,<1.2.0
	spacy>=3.2.1,<3.3.0
	thinc>=8.0.13,<8.1.0
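The conflict can be checked by hand: Poetry's caret constraint ^3.3.0 means >=3.3.0,<4.0.0, which has no overlap with >=3.2.1,<3.3.0. The sketch below reproduces that arithmetic with hypothetical helpers `parse` and `intersect`; it only handles plain numeric versions and half-open ranges, not the full version specifier syntax.

```python
def parse(version):
    """Turn '3.2.1' into (3, 2, 1) so versions compare numerically."""
    return tuple(int(part) for part in version.split("."))


def intersect(a, b):
    """Intersect two half-open ranges [lower, upper); return None if empty."""
    lower = max(parse(a[0]), parse(b[0]))
    upper = min(parse(a[1]), parse(b[1]))
    return (lower, upper) if lower < upper else None


project_spacy = ("3.3.0", "4.0.0")     # spacy = "^3.3.0" in pyproject.toml
spacy_wrap_spacy = ("3.2.1", "3.3.0")  # spacy>=3.2.1,<3.3.0 in setup.cfg

print(intersect(project_spacy, spacy_wrap_spacy))
# None
```

Because the intersection is empty, no spaCy version can satisfy both constraints, which is what the solver reports.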

spacy-wrap fails in some cases on long sequences

How to reproduce the behaviour

import spacy
import spacy_wrap

nlp = spacy.blank("fr")
config = {"model": {"name": "Jean-Baptiste/camembert-ner"}}
nlp.add_pipe("token_classification_transformer", config=config)

text = """
brice hansemann vice president charge de l'instruction a meaux 77100
information ouverte contre : x se disant prazaru margarita, x se disant adi popa, x se disant chica florin, x se disant tavu anarena, x se disant adan alexandra, x se disant kati enesa, x se disant alin alexandro, x
pour : vols en reunion, tentative de vols en reunion, refus de se soumettre aux operations de prelevement externe necessaire a la realisation d'examen technique et scientifique de comparaison avec les traces et indices preleves lors d'une enquete judiciaire, vols en bande organisee, traite des etres humains en bande organisee, recels en bande organisee de bien provenant d'un delit, delaissement de mineurs
mission : voir commission rogatoire jointe.
"""

nlp(text)

results in

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "xxx/site-packages/spacy/language.py", line 1031, in __call__
    error_handler(name, proc, [doc], e)
  File "xxx/site-packages/spacy/util.py", line 1670, in raise_error
    raise e
  File "xxx/site-packages/spacy/language.py", line 1026, in __call__
    doc = proc(doc, **component_cfg.get(name, {}))  # type: ignore[call-arg]
  File "xxx/site-packages/spacy_wrap/pipeline_component_tok_clf.py", line 391, in __call__
    self.set_annotations([doc], outputs)
  File "xxx/site-packages/spacy_wrap/pipeline_component_tok_clf.py", line 208, in set_annotations
    iob_tags, iob_prob = self.convert_to_token_predictions(
  File "xxx/site-packages/spacy_wrap/pipeline_component_tok_clf.py", line 275, in convert_to_token_predictions
    agg_token_logits = agg(logits[align.data[:, 0]])
IndexError: index 177 is out of bounds for axis 0 with size 176

while

from transformers import AutoTokenizer, AutoModelForTokenClassification, pipeline
tokenizer = AutoTokenizer.from_pretrained("Jean-Baptiste/camembert-ner")
model = AutoModelForTokenClassification.from_pretrained("Jean-Baptiste/camembert-ner")
nlp = pipeline('ner', model=model, tokenizer=tokenizer, aggregation_strategy="simple")
text = """
brice hansemann vice president charge de l'instruction a meaux 77100
information ouverte contre : x se disant prazaru margarita, x se disant adi popa, x se disant chica florin, x se disant tavu anarena, x se disant adan alexandra, x se disant kati enesa, x se disant alin alexandro, x
pour : vols en reunion, tentative de vols en reunion, refus de se soumettre aux operations de prelevement externe necessaire a la realisation d'examen technique et scientifique de comparaison avec les traces et indices preleves lors d'une enquete judiciaire, vols en bande organisee, traite des etres humains en bande organisee, recels en bande organisee de bien provenant d'un delit, delaissement de mineurs
mission : voir commission rogatoire jointe.
"""
nlp(text)

works fine

Your Environment

Operating System: Linux-5.15.0-58-generic-x86_64-with-glibc2.31
Python Version Used: 3.9.16
spaCy Version Used: 3.4.4
spaCy-wrap Version Used: 1.4.0
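The traceback shows an alignment index (177) pointing past the end of the logits (176 rows). As a debugging aid, a check like the one below could locate such entries before indexing. It is a sketch with a hypothetical helper and toy data; it says nothing about why the alignment and the logits disagree in this case.

```python
def find_bad_alignments(alignment, n_wordpieces):
    """Return (token_index, wordpiece_index) pairs that fall outside the logits."""
    return [
        (token_index, piece_index)
        for token_index, piece_ids in enumerate(alignment)
        for piece_index in piece_ids
        if not 0 <= piece_index < n_wordpieces
    ]


# Toy alignment: the last token points at row 177, but only 176 rows exist.
alignment = [[0], [1, 2], [177]]

print(find_bad_alignments(alignment, 176))
# [(2, 177)]
```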

Different output from a NER model

Hi, I get a different output from the HF NER model when I use it with spaCy and spacy-wrap.

HF pipeline

from transformers import AutoTokenizer, AutoModelForTokenClassification, pipeline

model_path = 'asahi417/tner-xlm-roberta-base-ontonotes5'

tokenizer = AutoTokenizer.from_pretrained(model_path)
ort_model = AutoModelForTokenClassification.from_pretrained(model_path)

ner_pipeline = pipeline("ner", model=ort_model, tokenizer=tokenizer, aggregation_strategy="simple")

Text and output

text = '(a) keep it confidential throughout the duration of this Agreement and for 5 years after the expiry or termina tion of this Agreement'

# showing only the problematic ent
ner_pipeline(text)

[{'entity_group': 'date',
  'score': 0.9785325,
  'word': '5 years',
  'start': 75,
  'end': 83}]

Spacy pipeline:

import spacy
import spacy_wrap

nlp = spacy.blank("en")
config = {"model": {"name": "../models/tner-xlm-roberta-base-ontonotes5"}, "predictions_to": ["ents"]}
nlp.add_pipe("token_classification_transformer", config=config)

Spacy output:

doc = nlp(text)
# problematic ent with the label
5 date

As we can see, the HF pipeline returns "5 years", while spaCy returns only "5".

What could be the issue? Is it possible to get exactly the same outputs when using spacy-wrap?

Custom model for NER

Hello, thank you for setting this up.

The examples are amazing.

Is it possible to use this wrapper with a Named Entity Recognition model?

If that is the case, is it possible to add an example with a NER model from hugging face?

Following the example, I am trying to add this, but it is not working. I do not know why; maybe I should go back and learn how spaCy works.

import spacy
import spacy_wrap

nlp = spacy.blank("fr")

config = {
    "model": {
        "@architectures": "spacy-transformers.TransformerModel.v3",
        "name": "Jean-Baptiste/camembert-ner-with-dates",
        "tokenizer_config" : {"use_fast": False},
        "get_spans":  {"@span_getters": "spacy-transformers.doc_spans.v1"}
    }
}

transformer = nlp.add_pipe("ner", config=config)

Make ClassificationTransformer trainable

Following the comment on this issue.

Make it such that the ClassificationTransformer is able to be trained.

Worked on in #9

spaCy 3.5 Support

Hi Kenneth,
Any chance of bumping the spaCy dependency to <3.6? Poetry won't resolve my dependencies against spaCy 3.5.0 at present:

Because spacy-wrap (1.4.0) depends on spacy (>=3.2.1,<3.5.0)
 and no versions of spacy-wrap match >1.4,<2.0, spacy-wrap (>=1.4,<2.0) requires spacy (>=3.2.1,<3.5.0).

I'm not sure whether that is as simple as bumping the version number on your side or whether there are breaking changes, however.

IndexError when emojis in input

How to reproduce the behaviour

import spacy
import spacy_wrap
nlp = spacy.blank("en")

# specify model from the hub
config = {"model": {"name": "dslim/bert-base-NER"}}
# add it to the pipe
nlp.add_pipe("token_classification_transformer", config=config)

doc = nlp("My name is Wolfgang 🚀 and I live in Berlin.")

Your Environment

spacy-wrap Version Used: spacy_wrap-1.2.0-py2.py3-none-any.whl
Operating System: MacOS, Apple M1, Ventura 13.0.1
Python Version Used: 3.10.6
spaCy Version Used: 3.4.3
Environment Information: poetry
Torch version: 1.13.0-cp310
torch = {url = "https://files.pythonhosted.org/packages/79/b3/eaea3fc35d0466b9dae1e3f9db08467939347b3aaa53c0fd81953032db33/torch-1.13.0-cp310-none-macosx_11_0_arm64.whl"}

(Had to set torch manually because spacy-wrap fails to install on Mac with the default torch version. Specifically, the dependency nvidia-cublas-cu11 returns a RuntimeError with "Unable to find installation candidates for nvidia-cublas-cu11 (11.10.3.66)". No distributions available for Mac, see https://pypi.org/project/nvidia-cublas-cu11/11.10.3.66/#files )

Above example yields an IndexError in TokenClassificationTransformer.convert_to_token_predictions(data, aggregation_strategy, labels)

      305 logits = data.model_output.logits[0]
      306 for align in data.align:
      307     # aggregate the logits for each token
--> 308     agg_token_logits = agg(logits[align.data[:, 0]])
      309     token_probabilities_ = {
      310         "prob": softmax(agg_token_logits).round(decimals=3),
      311         "label": labels,
      312     }
      313     token_probabilities.append(token_probabilities_)

IndexError: index 0 is out of bounds for axis 1 with size 0

The issue is the same with other models; I also tried saattrupdan/nbailab-base-ner-scandi.

Tests with the transformers pipeline work with the same models.

from transformers import pipeline
pipe = pipeline(model="dslim/bert-base-NER")
res = pipe("My name is Wolfgang 🚀 and I live in Berlin.")
[e["word"] for e in res]
# ['Wolfgang', '🚀', 'Berlin']
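The error message suggests that one token (presumably the emoji) ends up with no aligned wordpieces, so indexing its alignment fails. One defensive option is to fall back to a default label for such tokens. The sketch below uses a hypothetical helper and toy data with a "first wordpiece" strategy; it is not a patch for spacy-wrap.

```python
def first_piece_labels(wordpiece_labels, alignment, fallback="O"):
    """Label each token by its first wordpiece; unaligned tokens get the fallback."""
    labels = []
    for piece_ids in alignment:
        if piece_ids:
            labels.append(wordpiece_labels[piece_ids[0]])
        else:
            labels.append(fallback)  # no wordpieces aligned to this token
    return labels


# Toy data: "Wolfgang" is split into two wordpieces, the emoji has none.
tokens = ["Wolfgang", "🚀", "and"]
wordpiece_labels = ["B-PER", "I-PER", "O"]
alignment = [[0, 1], [], [2]]

print(first_piece_labels(wordpiece_labels, alignment))
# ['B-PER', 'O', 'O']
```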
