
kennethenevoldsen / spacy-wrap


spaCy-wrap is a wrapper library for spaCy that lets you include existing fine-tuned transformers from Huggingface in your spaCy pipeline.

Home Page: https://KennethEnevoldsen.github.io/spacy-wrap/

License: MIT License


spacy-wrap's Introduction

spaCy-wrap: For Wrapping fine-tuned transformers in spaCy pipelines


spaCy-wrap is a minimal library for wrapping fine-tuned transformers from the Huggingface model hub in your spaCy pipeline, so that existing models can be used within spaCy workflows.

As far as possible, it follows an API similar to that of spacy-transformers.

NOTE: Since the release of spaCy-wrap, Explosion has released spacy-huggingface-pipelines, which wraps the Huggingface pipeline rather than the transformer itself. With that approach, token aggregation and conversion into spans happen inside the Huggingface pipeline, whereas in spaCy-wrap they happen at the logits of the model, which can sometimes lead to unfortunate differences in results. I generally recommend spacy-huggingface-pipelines for most use cases, but if you need more direct access to the transformer output, spaCy-wrap can still be useful.
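
To make "aggregation at the logits" concrete, here is a minimal sketch in plain Python. It is an illustration under assumptions, not spaCy-wrap's actual implementation: the names softmax and aggregate_token_logits, the mean-pooling strategy, and the toy logits are all invented for this example.

```python
import math


def softmax(logits):
    """Turn a list of logits into probabilities."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def aggregate_token_logits(wordpiece_logits, alignment, labels):
    """Mean-pool wordpiece logits per token, then pick the most likely label.

    wordpiece_logits: one list of logits per wordpiece (hypothetical values).
    alignment: for each token, the indices of the wordpieces it covers.
    """
    predictions = []
    for piece_ids in alignment:
        rows = [wordpiece_logits[i] for i in piece_ids]
        mean = [sum(col) / len(rows) for col in zip(*rows)]
        probs = softmax(mean)
        best = max(range(len(labels)), key=probs.__getitem__)
        predictions.append(labels[best])
    return predictions


labels = ["O", "B-PER"]
# Three wordpieces, e.g. "my", "wolf", "##gang" (toy logits, not model output)
logits = [[4.0, -2.0], [-1.0, 3.0], [0.5, 1.5]]
alignment = [[0], [1, 2]]  # token 0 -> piece 0, token 1 -> pieces 1 and 2

print(aggregate_token_logits(logits, alignment, labels))
# ['O', 'B-PER']
```

A pipeline-level approach would instead decide a label per wordpiece first and merge the decisions afterwards, which is one place where the two libraries can diverge.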

Installation

Installing spacy-wrap is simple using pip:

pip install spacy_wrap

Examples

The following examples show how to quickly add a fine-tuned transformer model from the Huggingface model hub for sequence (text) classification, token classification, or named entity recognition.

Sequence Classification

In this example, we will use a model fine-tuned for sentiment classification on SST2. This model classifies whether a text is positive or negative. We will add this model to a blank English pipeline:

import spacy
import spacy_wrap

nlp = spacy.blank("en")

config = {
    "doc_extension_trf_data": "clf_trf_data",  # document extension for the forward pass
    "doc_extension_prediction": "sentiment",  # document extension for the prediction
    "model": {
        # the model name or path of huggingface model
        "name": "distilbert-base-uncased-finetuned-sst-2-english",  
    },
}

transformer = nlp.add_pipe("sequence_classification_transformer", config=config)

doc = nlp("spaCy is a wonderful tool")

print(doc.cats)
# {'NEGATIVE': 0.001, 'POSITIVE': 0.999}
print(doc._.sentiment)
# 'POSITIVE'
print(doc._.clf_trf_data)
# TransformerData(wordpieces=...

These pipelines can also be applied to multiple documents using nlp.pipe, as one would expect from a spaCy component:

docs = nlp.pipe(
    [
        "I hate wrapping my own models",
        "Isn't there a tool for this?!",
        "spacy-wrap is great for wrapping models",
    ]
)

for doc in docs:
    print(doc._.sentiment)
# 'NEGATIVE'
# 'NEGATIVE'
# 'POSITIVE'

More Examples

It is always nice to have more than one example. Here is another one, where we add a hate speech detection model for Danish to a blank Danish pipeline:

import spacy
import spacy_wrap

nlp = spacy.blank("da")

config = {
    "doc_extension_trf_data": "clf_trf_data",  # document extension for the forward pass
    "doc_extension_prediction": "hate_speech",  # document extension for the prediction
    # choose custom labels
    "labels": ["Not hate Speech", "Hate speech"],
    "model": {
        "name": "DaNLP/da-bert-hatespeech-detection",  # the model name or path of huggingface model
    },
}

transformer = nlp.add_pipe("sequence_classification_transformer", config=config)

doc = nlp("Senile gamle idiot") # old senile idiot

doc._.clf_trf_data
# TransformerData(wordpieces=...
doc._.hate_speech
# "Hate speech"
doc._.hate_speech_prob
# {'prob': array([0.013, 0.987], dtype=float32), 'labels': ['Not hate Speech', 'Hate speech']}
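
The probability extension pairs an array of probabilities with the list of labels. A small helper can turn that into a label-to-score mapping. The sketch below uses a plain list in place of the NumPy array, and to_label_scores is a hypothetical helper name, not part of spacy-wrap.

```python
def to_label_scores(prediction):
    """Map each label to its probability, highest first."""
    pairs = zip(prediction["labels"], prediction["prob"])
    ranked = sorted(pairs, key=lambda pair: pair[1], reverse=True)
    return {label: float(prob) for label, prob in ranked}


# Same shape as the doc._.hate_speech_prob output shown above,
# with a plain list standing in for the NumPy array.
prediction = {
    "prob": [0.013, 0.987],
    "labels": ["Not hate Speech", "Hate speech"],
}

scores = to_label_scores(prediction)
print(scores)
# {'Hate speech': 0.987, 'Not hate Speech': 0.013}
```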

Token Classification

spaCy-wrap can also wrap models for token classification:

import spacy
import spacy_wrap
nlp = spacy.blank("en")

config = {"model": {"name": "vblagoje/bert-english-uncased-finetuned-pos"}, 
          # "predictions_to": ["pos"]  # optional, can be "pos", "tag" or "ents"
}

nlp.add_pipe("token_classification_transformer", config=config)

text = "My name is Wolfgang and I live in Berlin"

doc = nlp(text)
print(doc._.tok_clf_predictions)
# ['PRON', 'NOUN', 'AUX', 'PROPN', 'CCONJ', 'PRON', 'VERB', 'ADP', 'PROPN']

By default, spacy-wrap automatically detects whether the labels follow the universal POS tags. If so, it also assigns them to token.pos, as regular spaCy pipelines do:

print(doc[0].pos_)
# 'PRON'
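
One way such a detection could work is to compare the model's label set with the 17 Universal Dependencies POS tags. The sketch below is an assumption about the general idea, not spacy-wrap's actual logic; looks_like_upos is a hypothetical name.

```python
# The 17 Universal Dependencies POS tags.
UPOS_TAGS = {
    "ADJ", "ADP", "ADV", "AUX", "CCONJ", "DET", "INTJ", "NOUN", "NUM",
    "PART", "PRON", "PROPN", "PUNCT", "SCONJ", "SYM", "VERB", "X",
}


def looks_like_upos(labels):
    """Return True if every model label is a universal POS tag."""
    return bool(labels) and all(label in UPOS_TAGS for label in labels)


print(looks_like_upos(["PRON", "NOUN", "AUX", "PROPN"]))
# True
print(looks_like_upos(["B-PER", "I-PER", "O"]))
# False
```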

Named Entity Recognition

In this example, we use a model fine-tuned for named entity recognition. spacy-wrap will in this case infer from the IOB tags that the model is intended for named entity recognition and assign it to doc.ents.

import spacy
import spacy_wrap
nlp = spacy.blank("en")

# specify model from the hub
config = {"model": {"name": "dslim/bert-base-NER"}, 
          "predictions_to": ["ents"]} # forced to be named entity recognition, if left out it will be estimated from the labels

# add it to the pipe
nlp.add_pipe("token_classification_transformer", config=config)

doc = nlp("My name is Wolfgang and I live in Berlin.")

print(doc.ents)
# (Wolfgang, Berlin)
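
To illustrate what inferring entities from IOB tags involves, here is a minimal pure-Python sketch. It is not spacy-wrap's code: looks_like_iob and iob_to_spans are hypothetical helpers, and the tags are hand-written rather than model output.

```python
def looks_like_iob(labels):
    """Return True if the label set follows the IOB scheme (B-X, I-X, O)."""
    return all(label == "O" or label[:2] in ("B-", "I-") for label in labels)


def iob_to_spans(tags):
    """Convert per-token IOB tags into (start, end, label) token spans."""
    spans = []
    start, label = None, None
    for i, tag in enumerate(tags):
        continues = tag.startswith("I-") and tag[2:] == label
        if start is not None and not continues:
            spans.append((start, i, label))  # close the open entity
            start, label = None, None
        if tag.startswith("B-") or (tag.startswith("I-") and start is None):
            start, label = i, tag[2:]  # open a new entity
    if start is not None:
        spans.append((start, len(tags), label))
    return spans


tokens = ["My", "name", "is", "Wolfgang", "and", "I", "live", "in", "Berlin", "."]
tags = ["O", "O", "O", "B-PER", "O", "O", "O", "O", "B-LOC", "O"]

print(looks_like_iob(set(tags)))
# True
print([(" ".join(tokens[s:e]), label) for s, e, label in iob_to_spans(tags)])
# [('Wolfgang', 'PER'), ('Berlin', 'LOC')]
```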

📖 Documentation

Documentation
🔧 Installation: Installation instructions for spacy-wrap.
📰 News and changelog: New additions, changes and version history.
🎛 Documentation: The reference for spacy-wrap's API.

💬 Where to ask questions

Type
🚨 FAQ: FAQ
🚨 Bug Reports: GitHub Issue Tracker
🎁 Feature Requests & Ideas: GitHub Issue Tracker
👩‍💻 Usage Questions: GitHub Discussions
🗯 General Discussion: GitHub Discussions

spacy-wrap's People

Contributors

actions-user, adrianeboyd, dependabot[bot], kennethenevoldsen, pre-commit-ci[bot], willfrey

spacy-wrap's Issues

Poetry dependency resolution fails. PyPi 1.0.0 version requires spaCy < 3.3.0

How to reproduce the behaviour

pyproject.toml contains

spacy = "^3.3.0"

Add spacy-wrap with

poetry add spacy_wrap

Info about spaCy

  • spaCy version: 3.3.0
  • Platform: Windows-10-10.0.19043-SP0
  • Python version: 3.9.10
  • Pipelines: en_core_web_sm (3.3.0), en_core_web_trf (3.3.0)

Issue

Dependency resolution fails with poetry

SolverProblemError

  Because spacy-wrap (1.0.0) depends on spacy (>=3.2.1,<3.3.0)
   and no versions of spacy-wrap match >1.0.0,<2.0.0, spacy-wrap (>=1.0.0,<2.0.0) requires spacy (>=3.2.1,<3.3.0).
  So, because spacytest depends on both spacy (^3.3.0) and spacy-wrap (^1.0.0), version solving failed.

setup.cfg still has

install_requires = 
	spacy_transformers>=1.1.4,<1.2.0
	spacy>=3.2.1,<3.3.0
	thinc>=8.0.13,<8.1.0

spacy-wrap fails in some cases on long sequences

How to reproduce the behaviour

import spacy
import spacy_wrap

nlp = spacy.blank("fr")
config = {"model": {"name": "Jean-Baptiste/camembert-ner"}}
nlp.add_pipe("token_classification_transformer", config=config)

text = """
brice hansemann vice president charge de l'instruction a meaux 77100
information ouverte contre : x se disant prazaru margarita, x se disant adi popa, x se disant chica florin, x se disant tavu anarena, x se disant adan alexandra, x se disant kati enesa, x se disant alin alexandro, x
pour : vols en reunion, tentative de vols en reunion, refus de se soumettre aux operations de prelevement externe necessaire a la realisation d'examen technique et scientifique de comparaison avec les traces et indices preleves lors d'une enquete judiciaire, vols en bande organisee, traite des etres humains en bande organisee, recels en bande organisee de bien provenant d'un delit, delaissement de mineurs
mission : voir commission rogatoire jointe.
"""

nlp(text)

results in

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "xxx/site-packages/spacy/language.py", line 1031, in __call__
    error_handler(name, proc, [doc], e)
  File "xxx/site-packages/spacy/util.py", line 1670, in raise_error
    raise e
  File "xxx/site-packages/spacy/language.py", line 1026, in __call__
    doc = proc(doc, **component_cfg.get(name, {}))  # type: ignore[call-arg]
  File "xxx/site-packages/spacy_wrap/pipeline_component_tok_clf.py", line 391, in __call__
    self.set_annotations([doc], outputs)
  File "xxx/site-packages/spacy_wrap/pipeline_component_tok_clf.py", line 208, in set_annotations
    iob_tags, iob_prob = self.convert_to_token_predictions(
  File "xxx/site-packages/spacy_wrap/pipeline_component_tok_clf.py", line 275, in convert_to_token_predictions
    agg_token_logits = agg(logits[align.data[:, 0]])
IndexError: index 177 is out of bounds for axis 0 with size 176

while

from transformers import AutoTokenizer, AutoModelForTokenClassification, pipeline
tokenizer = AutoTokenizer.from_pretrained("Jean-Baptiste/camembert-ner")
model = AutoModelForTokenClassification.from_pretrained("Jean-Baptiste/camembert-ner")
nlp = pipeline('ner', model=model, tokenizer=tokenizer, aggregation_strategy="simple")
text = """
brice hansemann vice president charge de l'instruction a meaux 77100
information ouverte contre : x se disant prazaru margarita, x se disant adi popa, x se disant chica florin, x se disant tavu anarena, x se disant adan alexandra, x se disant kati enesa, x se disant alin alexandro, x
pour : vols en reunion, tentative de vols en reunion, refus de se soumettre aux operations de prelevement externe necessaire a la realisation d'examen technique et scientifique de comparaison avec les traces et indices preleves lors d'une enquete judiciaire, vols en bande organisee, traite des etres humains en bande organisee, recels en bande organisee de bien provenant d'un delit, delaissement de mineurs
mission : voir commission rogatoire jointe.
"""
nlp(text)

works fine

Your Environment

  • Operating System: Linux-5.15.0-58-generic-x86_64-with-glibc2.31
  • Python Version Used: 3.9.16
  • spaCy Version Used: 3.4.4
  • spaCy-wrap Version Used: 1.4.0
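
The traceback says the alignment refers to wordpiece index 177 while the logits only have 176 rows. A diagnostic along the following lines could help narrow down which token is affected. This is only a sketch on toy data: find_misaligned_tokens is a hypothetical helper, the alignment is represented as plain lists, and it does not explain the root cause.

```python
def find_misaligned_tokens(alignment, n_wordpieces):
    """Return (token index, bad wordpiece indices) for out-of-range alignments."""
    problems = []
    for token_i, piece_ids in enumerate(alignment):
        bad = [i for i in piece_ids if i >= n_wordpieces]
        if bad:
            problems.append((token_i, bad))
    return problems


# Toy data mirroring the traceback: 176 rows of logits, but one token
# is aligned to wordpiece 177.
alignment = [[0], [1, 2], [175], [177]]
print(find_misaligned_tokens(alignment, n_wordpieces=176))
# [(3, [177])]
```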

Different output from a NER model

Hi, I get a different output from the HF NER model when it is used with spaCy and spacy-wrap.

HF pipeline

from transformers import AutoTokenizer, AutoModelForTokenClassification, pipeline

model_path = 'asahi417/tner-xlm-roberta-base-ontonotes5'

tokenizer = AutoTokenizer.from_pretrained(model_path)
ort_model = AutoModelForTokenClassification.from_pretrained(model_path)

ner_pipeline = pipeline("ner", model=ort_model, tokenizer=tokenizer, aggregation_strategy="simple")

Text and output

text = '(a) keep it confidential throughout the duration of this Agreement and for 5 years after the expiry or termina tion of this Agreement'

# showing only the problematic ent
ner_pipeline(text)

[{'entity_group': 'date',
  'score': 0.9785325,
  'word': '5 years',
  'start': 75,
  'end': 83}]

Spacy pipeline:

import spacy
import spacy_wrap

nlp = spacy.blank("en")
config = {"model": {"name": "../models/tner-xlm-roberta-base-ontonotes5"}, "predictions_to": ["ents"]}
nlp.add_pipe("token_classification_transformer", config=config)

Spacy output:

doc = nlp(text)
for ent in doc.ents:
    print(ent.text, ent.label_)
# problematic ent with the label:
# 5 date

As we can see, the HF pipeline returns "5 years", while the spaCy pipeline returns only "5".

What can be the issue? Is it possible to get the exact same outputs while using spacy-wrap?

Custom model for NER

Hello, thank you for setting this up.

The examples are amazing.

Is it possible to use this wrapper with a Named Entity Recognition model?

If that is the case, is it possible to add an example with a NER model from hugging face?

Following the example, I am trying to add this, but it is not working and I do not know why. Maybe I should go back and learn how spaCy works.

import spacy
import spacy_wrap

nlp = spacy.blank("fr")

config = {
    "model": {
        "@architectures": "spacy-transformers.TransformerModel.v3",
        "name": "Jean-Baptiste/camembert-ner-with-dates",
        "tokenizer_config" : {"use_fast": False},
        "get_spans":  {"@span_getters": "spacy-transformers.doc_spans.v1"}
    }
}

transformer = nlp.add_pipe("ner", config=config)

spaCy 3.5 Support

Hi Kenneth,
Any chance of bumping the spaCy dependency to <3.6? Poetry won't resolve my dependencies against spaCy 3.5.0 at present

Because spacy-wrap (1.4.0) depends on spacy (>=3.2.1,<3.5.0)
 and no versions of spacy-wrap match >1.4,<2.0, spacy-wrap (>=1.4,<2.0) requires spacy (>=3.2.1,<3.5.0).

Not sure whether that is as simple as bumping the version number on your side or whether there are breaking changes, however.

IndexError when emojis in input

How to reproduce the behaviour

import spacy
import spacy_wrap
nlp = spacy.blank("en")

# specify model from the hub
config = {"model": {"name": "dslim/bert-base-NER"}}
# add it to the pipe
nlp.add_pipe("token_classification_transformer", config=config)

doc = nlp("My name is Wolfgang 🚀 and I live in Berlin.")

Your Environment

(Had to set torch manually because spacy-wrap fails to install on Mac with the default torch version. Specifically, the dependency nvidia-cublas-cu11 returns a RuntimeError with "Unable to find installation candidates for nvidia-cublas-cu11 (11.10.3.66)". No distributions available for Mac, see https://pypi.org/project/nvidia-cublas-cu11/11.10.3.66/#files )

Above example yields an IndexError in TokenClassificationTransformer.convert_to_token_predictions(data, aggregation_strategy, labels)

      305 logits = data.model_output.logits[0]
      306 for align in data.align:
      307     # aggregate the logits for each token
--> 308     agg_token_logits = agg(logits[align.data[:, 0]])
      309     token_probabilities_ = {
      310         "prob": softmax(agg_token_logits).round(decimals=3),
      311         "label": labels,
      312     }
      313     token_probabilities.append(token_probabilities_)

IndexError: index 0 is out of bounds for axis 1 with size 0

Issue is the same with other models. Tried saattrupdan/nbailab-base-ner-scandi.

Tests with the transformers pipeline work with the same models.

from transformers import pipeline
pipe = pipeline(model="dslim/bert-base-NER")
res = pipe("My name is Wolfgang 🚀 and I live in Berlin.")
[e["word"] for e in res]
# ['Wolfgang', '🚀', 'Berlin']
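
The error message ("axis 1 with size 0") suggests that the alignment for at least one token is empty. Whether the emoji token is the one left without wordpieces is an assumption. A defensive aggregation could skip such tokens instead of indexing into them, as in this sketch. It uses plain lists and a hypothetical predict_tokens helper with a hypothetical fallback label, and it is not the library's fix.

```python
def predict_tokens(wordpiece_logits, alignment, labels, fallback="O"):
    """Pick a label per token, using `fallback` for unaligned tokens."""
    predictions = []
    for piece_ids in alignment:
        if not piece_ids:  # token with no aligned wordpieces
            predictions.append(fallback)
            continue
        rows = [wordpiece_logits[i] for i in piece_ids]
        mean = [sum(col) / len(rows) for col in zip(*rows)]
        best = max(range(len(labels)), key=mean.__getitem__)
        predictions.append(labels[best])
    return predictions


labels = ["O", "B-PER"]
logits = [[2.0, -1.0], [-1.0, 3.0]]  # toy logits for two wordpieces
alignment = [[0], [1], []]  # the third token has no wordpieces

print(predict_tokens(logits, alignment, labels))
# ['O', 'B-PER', 'O']
```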
