tomaarsen / SpanMarkerNER


SpanMarker for Named Entity Recognition

Home Page: https://tomaarsen.github.io/SpanMarkerNER/

License: Apache License 2.0

Languages: Python 31.27%, Jupyter Notebook 68.49%, Makefile 0.11%, Batchfile 0.13%
Topics: ner, nlp, transformers, huggingface, spacy, spacy-extension

spanmarkerner's Introduction

SpanMarker for Named Entity Recognition

🤗 Models | 🛠️ Getting Started In Google Colab | 📄 Documentation | 📊 Thesis

SpanMarker is a framework for training powerful Named Entity Recognition models using familiar encoders such as BERT, RoBERTa and ELECTRA. Built on top of the 🤗 Transformers library, SpanMarker inherits a wide range of powerful functionality, such as easily loading and saving models, hyperparameter optimization, automatic logging in various tools, checkpointing, callbacks, mixed precision training, 8-bit inference, and more.

Based on the PL-Marker paper, SpanMarker breaks the mold through its accessibility and ease of use. Crucially, SpanMarker works out of the box with many common encoders such as bert-base-cased, roberta-large and bert-base-multilingual-cased, and automatically works with datasets using the IOB, IOB2, BIOES, BILOU or no label annotation scheme.
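
For illustration, a minimal custom dataset in the IOB2 scheme could be built as follows. This is only a sketch: the tokens and tags are made up, and only the "tokens"/"ner_tags" column convention comes from SpanMarker; depending on your version you may need integer tags plus a matching labels list rather than raw strings.

from datasets import Dataset

# A toy IOB2-annotated example; SpanMarker normalizes the annotation scheme internally.
train_dataset = Dataset.from_dict({
    "tokens": [["Amelia", "Earhart", "flew", "to", "Paris", "."]],
    "ner_tags": [["B-PER", "I-PER", "O", "O", "B-LOC", "O"]],
})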

Additionally, the SpanMarker library has been integrated with the Hugging Face Hub and the Hugging Face Inference API. See the SpanMarker documentation on Hugging Face or see all SpanMarker models on the Hugging Face Hub. Through the Inference API integration, users can test any SpanMarker model on the Hugging Face Hub for free using a widget on the model page. Furthermore, each public SpanMarker model offers a free API for fast prototyping and can be deployed to production using Hugging Face Inference Endpoints.
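
As a rough sketch, querying the free Inference API could look like the snippet below. The endpoint URL and JSON payload follow the generic Hugging Face Inference API conventions; the exact response fields for SpanMarker models may differ.

import requests

# Hypothetical example of calling the free Inference API for a SpanMarker model.
API_URL = "https://api-inference.huggingface.co/models/tomaarsen/span-marker-bert-base-fewnerd-fine-super"
headers = {"Authorization": "Bearer <your_hf_token>"}  # an Access Token from your Hugging Face account

response = requests.post(
    API_URL,
    headers=headers,
    json={"inputs": "Amelia Earhart flew her Lockheed Vega 5B across the Atlantic to Paris."},
)
print(response.json())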

[Screenshots: the Inference API widget on a model page, and the free Inference API under Deploy > Inference API on a model page]

Documentation

Feel free to have a look at the documentation.

Installation

You may install the span_marker Python module via pip like so:

pip install span_marker

Quick Start

Training

Please have a look at our Getting Started notebook for details on how SpanMarker is commonly used. It explains the following snippet in more detail. Alternatively, have a look at the training scripts that have been successfully used in the past.

Run this example in a hosted notebook: Open In Colab | Kaggle | Gradient | Open In SageMaker Studio Lab
from pathlib import Path
from datasets import load_dataset
from transformers import TrainingArguments
from span_marker import SpanMarkerModel, Trainer, SpanMarkerModelCardData


def main() -> None:
    # Load the dataset, ensure "tokens" and "ner_tags" columns, and get a list of labels
    dataset_id = "DFKI-SLT/few-nerd"
    dataset_name = "FewNERD"
    dataset = load_dataset(dataset_id, "supervised")
    dataset = dataset.remove_columns("ner_tags")
    dataset = dataset.rename_column("fine_ner_tags", "ner_tags")
    labels = dataset["train"].features["ner_tags"].feature.names
    # ['O', 'art-broadcastprogram', 'art-film', 'art-music', 'art-other', ...

    # Initialize a SpanMarker model using a pretrained BERT-style encoder
    encoder_id = "bert-base-cased"
    model_id = f"tomaarsen/span-marker-{encoder_id}-fewnerd-fine-super"
    model = SpanMarkerModel.from_pretrained(
        encoder_id,
        labels=labels,
        # SpanMarker hyperparameters:
        model_max_length=256,
        marker_max_length=128,
        entity_max_length=8,
        # Model card arguments
        model_card_data=SpanMarkerModelCardData(
            model_id=model_id,
            encoder_id=encoder_id,
            dataset_name=dataset_name,
            dataset_id=dataset_id,
            license="cc-by-sa-4.0",
            language="en",
        ),
    )

    # Prepare the 🤗 transformers training arguments
    output_dir = Path("models") / model_id
    args = TrainingArguments(
        output_dir=output_dir,
        # Training Hyperparameters:
        learning_rate=5e-5,
        per_device_train_batch_size=32,
        per_device_eval_batch_size=32,
        num_train_epochs=3,
        weight_decay=0.01,
        warmup_ratio=0.1,
        bf16=True,  # Replace `bf16` with `fp16` if your hardware can't use bf16.
        # Other Training parameters
        logging_first_step=True,
        logging_steps=50,
        evaluation_strategy="steps",
        save_strategy="steps",
        eval_steps=3000,
        save_total_limit=2,
        dataloader_num_workers=2,
    )

    # Initialize the trainer using our model, training args & dataset, and train
    trainer = Trainer(
        model=model,
        args=args,
        train_dataset=dataset["train"],
        eval_dataset=dataset["validation"],
    )
    trainer.train()

    # Compute & save the metrics on the test set
    metrics = trainer.evaluate(dataset["test"], metric_key_prefix="test")
    trainer.save_metrics("test", metrics)

    # Save the final checkpoint
    trainer.save_model(output_dir / "checkpoint-final")

if __name__ == "__main__":
    main()
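
If you'd like to share the trained model, it can be uploaded to the Hugging Face Hub afterwards. This is a minimal sketch, assuming you are logged in (e.g. via huggingface-cli login) and that the standard 🤗 Transformers push_to_hub method is available on SpanMarkerModel:

from span_marker import SpanMarkerModel

# Load the final checkpoint saved by the training script above and push it to the Hub.
model = SpanMarkerModel.from_pretrained(
    "models/tomaarsen/span-marker-bert-base-cased-fewnerd-fine-super/checkpoint-final"
)
model.push_to_hub("your-username/span-marker-bert-base-cased-fewnerd-fine-super")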

Inference

from span_marker import SpanMarkerModel

# Download from the 🤗 Hub
model = SpanMarkerModel.from_pretrained("tomaarsen/span-marker-bert-base-fewnerd-fine-super")
# Run inference
entities = model.predict("Amelia Earhart flew her single engine Lockheed Vega 5B across the Atlantic to Paris.")
[{'span': 'Amelia Earhart', 'label': 'person-other', 'score': 0.7659597396850586, 'char_start_index': 0, 'char_end_index': 14},
 {'span': 'Lockheed Vega 5B', 'label': 'product-airplane', 'score': 0.9725785851478577, 'char_start_index': 38, 'char_end_index': 54},
 {'span': 'Atlantic', 'label': 'location-bodiesofwater', 'score': 0.7587679028511047, 'char_start_index': 66, 'char_end_index': 74},
 {'span': 'Paris', 'label': 'location-GPE', 'score': 0.9892390966415405, 'char_start_index': 78, 'char_end_index': 83}]
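
Inference can also be run on a GPU and over multiple sentences at once. The sketch below reuses the model from the snippet above; passing a list of sentences to predict is shown here, although the exact predict signature may vary between versions.

import torch

# Move the model to a GPU if one is available (SpanMarkerModel is a regular torch.nn.Module).
if torch.cuda.is_available():
    model = model.cuda()

# Predict over several sentences at once (list input assumed to be supported by `predict`).
sentences = [
    "Amelia Earhart flew her single engine Lockheed Vega 5B across the Atlantic to Paris.",
    "Cleopatra VII was the last active ruler of the Ptolemaic Kingdom of Egypt.",
]
for entities in model.predict(sentences):
    print(entities)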

Pretrained Models

All models in this list contain train.py files that show the training scripts used to generate them. Additionally, all training scripts used are stored in the training_scripts directory. These trained models have Hosted Inference API widgets that you can use to experiment with the models on their Hugging Face model pages. Additionally, Hugging Face provides each model with a free API (Deploy > Inference API on the model page).

These models are further elaborated on in my thesis.

FewNERD

OntoNotes v5.0

  • tomaarsen/span-marker-roberta-large-ontonotes5 was trained in 3 hours on the OntoNotes v5.0 dataset, reaching a performance of 91.54 F1. For reference, the current strongest spaCy model (en_core_web_trf) reaches 89.8 F1. This SpanMarker model uses a roberta-large encoder under the hood.

CoNLL03

CoNLL++

MultiNERD

Using pretrained SpanMarker models with spaCy

All SpanMarker models on the Hugging Face Hub can also be easily used in spaCy. It's as simple as including 1 line to add the span_marker pipeline. See the Documentation or API Reference for more information.

import spacy

# Load the spaCy model with the span_marker pipeline component
nlp = spacy.load("en_core_web_sm", exclude=["ner"])
nlp.add_pipe("span_marker", config={"model": "tomaarsen/span-marker-roberta-large-ontonotes5"})

# Feed some text through the model to get a spacy Doc
text = """Cleopatra VII, also known as Cleopatra the Great, was the last active ruler of the \
Ptolemaic Kingdom of Egypt. She was born in 69 BCE and ruled Egypt from 51 BCE until her \
death in 30 BCE."""
doc = nlp(text)

# And look at the entities
print([(entity, entity.label_) for entity in doc.ents])
"""
[(Cleopatra VII, "PERSON"), (Cleopatra the Great, "PERSON"), (the Ptolemaic Kingdom of Egypt, "GPE"),
(69 BCE, "DATE"), (Egypt, "GPE"), (51 BCE, "DATE"), (30 BCE, "DATE")]
"""


Context

I developed this library as part of my thesis work at Argilla. Feel free to read my finished thesis here in this repository!

Changelog

See CHANGELOG.md for news on all SpanMarker versions.

License

See LICENSE for the current license.

spanmarkerner's People

Contributors

davidberenstein1957, tomaarsen


spanmarkerner's Issues

BERT-based models crash

Hi there. Thanks for the great library!

I have one issue regarding the usage of BERT-based models. I trained different models by fine-tuning them on my custom dataset (RoBERTa, LUKE, DeBERTa, XLM-RoBERTa, etc.).

I tried to do the same with BERT using the same code, but I get an error (also when using your code from the Getting Started part of the documentation).

I am using a dataset with this format:
{"tokens": ["(7)", "On", "specific", "query", "by", "the", "Bench", "about", "an", "entry", "of", "Rs.", "1,31,37,500", "on", "deposit", "side", "of", "Hongkong", "Bank", "account", "of", "which", "a", "photo", "copy", "is", "appearing", "at", "p.", "40", "of", "assessee's", "paper", "book,", "learned", "authorised", "representative", "submitted", "that", "it", "was", "related", "to", "loan", "from", "broker,", "Rahul", "&", "Co.", "on", "the", "basis", "of", "his", "submission", "a", "necessary", "mark", "is", "put", "by", "us", "on", "that", "photo", "copy."], "ner_tags": [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 7, 21, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 7, 21, 21, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]}

And I load it with this script:

import json
from datasets import load_dataset, Dataset, DatasetDict

def load_legal_ner():
    ret = {}
    for split_name in ['TRAIN', 'DEV']:
        data = []
        with open(f"./data/NER_{split_name}/NER_{split_name}_ALL_OT.jsonl", 'r') as reader:
            for line in reader:
                data.append(json.loads(line))
        ret[split_name.lower()] = Dataset.from_list(data)
    return DatasetDict(ret)

For every other model it works perfectly, but if I try to use a BERT-based model (e.g. bert-base-uncased, bert-base-cased, legal-bert, etc.) it crashes with different errors, always linked to the forward method (sometimes related to the normalization layer, sometimes to a matmul).

This is the traceback:

Cell In[8], line 28
     20 trainer = Trainer(
     21     model=model,
     22     args=args,
     23     train_dataset=dataset["train"],
     24     eval_dataset=dataset["dev"],
     25 )
     27 # Training is really simple using our Trainer!
---> 28 trainer.train()

File /opt/conda/lib/python3.10/site-packages/transformers/trainer.py:1537, in Trainer.train(self, resume_from_checkpoint, trial, ignore_keys_for_eval, **kwargs)
   1535         hf_hub_utils.enable_progress_bars()
   1536 else:
-> 1537     return inner_training_loop(
   1538         args=args,
   1539         resume_from_checkpoint=resume_from_checkpoint,
   1540         trial=trial,
   1541         ignore_keys_for_eval=ignore_keys_for_eval,
   1542     )

File /opt/conda/lib/python3.10/site-packages/transformers/trainer.py:1854, in Trainer._inner_training_loop(self, batch_size, args, resume_from_checkpoint, trial, ignore_keys_for_eval)
   1851     self.control = self.callback_handler.on_step_begin(args, self.state, self.control)
   1853 with self.accelerator.accumulate(model):
-> 1854     tr_loss_step = self.training_step(model, inputs)
   1856 if (
   1857     args.logging_nan_inf_filter
   1858     and not is_torch_tpu_available()
   1859     and (torch.isnan(tr_loss_step) or torch.isinf(tr_loss_step))
   1860 ):
   1861     # if loss is nan or inf simply add the average of previous logged losses
   1862     tr_loss += tr_loss / (1 + self.state.global_step - self._globalstep_last_logged)

File /opt/conda/lib/python3.10/site-packages/transformers/trainer.py:2723, in Trainer.training_step(self, model, inputs)
   2720     return loss_mb.reduce_mean().detach().to(self.args.device)
   2722 with self.compute_loss_context_manager():
-> 2723     loss = self.compute_loss(model, inputs)
   2725 if self.args.n_gpu > 1:
   2726     loss = loss.mean()  # mean() to average on multi-gpu parallel training

File /opt/conda/lib/python3.10/site-packages/transformers/trainer.py:2746, in Trainer.compute_loss(self, model, inputs, return_outputs)
   2744 else:
   2745     labels = None
-> 2746 outputs = model(**inputs)
   2747 # Save past state if it exists
   2748 # TODO: this needs to be fixed and made cleaner later.
   2749 if self.args.past_index >= 0:

File /opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py:1501, in Module._call_impl(self, *args, **kwargs)
   1496 # If we don't have any hooks, we want to skip the rest of the logic in
   1497 # this function, and just call forward.
   1498 if not (self._backward_hooks or self._backward_pre_hooks or self._forward_hooks or self._forward_pre_hooks
   1499         or _global_backward_pre_hooks or _global_backward_hooks
   1500         or _global_forward_hooks or _global_forward_pre_hooks):
-> 1501     return forward_call(*args, **kwargs)
   1502 # Do not call functions when jit is used
   1503 full_backward_hooks, non_full_backward_hooks = [], []

File /opt/conda/lib/python3.10/site-packages/span_marker/modeling.py:153, in SpanMarkerModel.forward(self, input_ids, attention_mask, position_ids, start_marker_indices, num_marker_pairs, labels, num_words, document_ids, sentence_ids, **kwargs)
    136 """Forward call of the SpanMarkerModel.
    137 
    138 Args:
   (...)
    150     SpanMarkerOutput: The output dataclass.
    151 """
    152 token_type_ids = torch.zeros_like(input_ids)
--> 153 outputs = self.encoder(
    154     input_ids,
    155     attention_mask=attention_mask,
    156     token_type_ids=token_type_ids,
    157     position_ids=position_ids,
    158 )
    159 last_hidden_state = outputs[0]
    160 last_hidden_state = self.dropout(last_hidden_state)

File /opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py:1501, in Module._call_impl(self, *args, **kwargs)
   1496 # If we don't have any hooks, we want to skip the rest of the logic in
   1497 # this function, and just call forward.
   1498 if not (self._backward_hooks or self._backward_pre_hooks or self._forward_hooks or self._forward_pre_hooks
   1499         or _global_backward_pre_hooks or _global_backward_hooks
   1500         or _global_forward_hooks or _global_forward_pre_hooks):
-> 1501     return forward_call(*args, **kwargs)
   1502 # Do not call functions when jit is used
   1503 full_backward_hooks, non_full_backward_hooks = [], []

File /opt/conda/lib/python3.10/site-packages/transformers/models/bert/modeling_bert.py:1013, in BertModel.forward(self, input_ids, attention_mask, token_type_ids, position_ids, head_mask, inputs_embeds, encoder_hidden_states, encoder_attention_mask, past_key_values, use_cache, output_attentions, output_hidden_states, return_dict)
   1004 head_mask = self.get_head_mask(head_mask, self.config.num_hidden_layers)
   1006 embedding_output = self.embeddings(
   1007     input_ids=input_ids,
   1008     position_ids=position_ids,
   (...)
   1011     past_key_values_length=past_key_values_length,
   1012 )
-> 1013 encoder_outputs = self.encoder(
   1014     embedding_output,
   1015     attention_mask=extended_attention_mask,
   1016     head_mask=head_mask,
   1017     encoder_hidden_states=encoder_hidden_states,
   1018     encoder_attention_mask=encoder_extended_attention_mask,
   1019     past_key_values=past_key_values,
   1020     use_cache=use_cache,
   1021     output_attentions=output_attentions,
   1022     output_hidden_states=output_hidden_states,
   1023     return_dict=return_dict,
   1024 )
   1025 sequence_output = encoder_outputs[0]
   1026 pooled_output = self.pooler(sequence_output) if self.pooler is not None else None

File /opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py:1501, in Module._call_impl(self, *args, **kwargs)
   1496 # If we don't have any hooks, we want to skip the rest of the logic in
   1497 # this function, and just call forward.
   1498 if not (self._backward_hooks or self._backward_pre_hooks or self._forward_hooks or self._forward_pre_hooks
   1499         or _global_backward_pre_hooks or _global_backward_hooks
   1500         or _global_forward_hooks or _global_forward_pre_hooks):
-> 1501     return forward_call(*args, **kwargs)
   1502 # Do not call functions when jit is used
   1503 full_backward_hooks, non_full_backward_hooks = [], []

File /opt/conda/lib/python3.10/site-packages/transformers/models/bert/modeling_bert.py:607, in BertEncoder.forward(self, hidden_states, attention_mask, head_mask, encoder_hidden_states, encoder_attention_mask, past_key_values, use_cache, output_attentions, output_hidden_states, return_dict)
    596     layer_outputs = self._gradient_checkpointing_func(
    597         layer_module.__call__,
    598         hidden_states,
   (...)
    604         output_attentions,
    605     )
    606 else:
--> 607     layer_outputs = layer_module(
    608         hidden_states,
    609         attention_mask,
    610         layer_head_mask,
    611         encoder_hidden_states,
    612         encoder_attention_mask,
    613         past_key_value,
    614         output_attentions,
    615     )
    617 hidden_states = layer_outputs[0]
    618 if use_cache:

File /opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py:1501, in Module._call_impl(self, *args, **kwargs)
   1496 # If we don't have any hooks, we want to skip the rest of the logic in
   1497 # this function, and just call forward.
   1498 if not (self._backward_hooks or self._backward_pre_hooks or self._forward_hooks or self._forward_pre_hooks
   1499         or _global_backward_pre_hooks or _global_backward_hooks
   1500         or _global_forward_hooks or _global_forward_pre_hooks):
-> 1501     return forward_call(*args, **kwargs)
   1502 # Do not call functions when jit is used
   1503 full_backward_hooks, non_full_backward_hooks = [], []

File /opt/conda/lib/python3.10/site-packages/transformers/models/bert/modeling_bert.py:497, in BertLayer.forward(self, hidden_states, attention_mask, head_mask, encoder_hidden_states, encoder_attention_mask, past_key_value, output_attentions)
    485 def forward(
    486     self,
    487     hidden_states: torch.Tensor,
   (...)
    494 ) -> Tuple[torch.Tensor]:
    495     # decoder uni-directional self-attention cached key/values tuple is at positions 1,2
    496     self_attn_past_key_value = past_key_value[:2] if past_key_value is not None else None
--> 497     self_attention_outputs = self.attention(
    498         hidden_states,
    499         attention_mask,
    500         head_mask,
    501         output_attentions=output_attentions,
    502         past_key_value=self_attn_past_key_value,
    503     )
    504     attention_output = self_attention_outputs[0]
    506     # if decoder, the last output is tuple of self-attn cache

File /opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py:1501, in Module._call_impl(self, *args, **kwargs)
   1496 # If we don't have any hooks, we want to skip the rest of the logic in
   1497 # this function, and just call forward.
   1498 if not (self._backward_hooks or self._backward_pre_hooks or self._forward_hooks or self._forward_pre_hooks
   1499         or _global_backward_pre_hooks or _global_backward_hooks
   1500         or _global_forward_hooks or _global_forward_pre_hooks):
-> 1501     return forward_call(*args, **kwargs)
   1502 # Do not call functions when jit is used
   1503 full_backward_hooks, non_full_backward_hooks = [], []

File /opt/conda/lib/python3.10/site-packages/transformers/models/bert/modeling_bert.py:436, in BertAttention.forward(self, hidden_states, attention_mask, head_mask, encoder_hidden_states, encoder_attention_mask, past_key_value, output_attentions)
    417 def forward(
    418     self,
    419     hidden_states: torch.Tensor,
   (...)
    425     output_attentions: Optional[bool] = False,
    426 ) -> Tuple[torch.Tensor]:
    427     self_outputs = self.self(
    428         hidden_states,
    429         attention_mask,
   (...)
    434         output_attentions,
    435     )
--> 436     attention_output = self.output(self_outputs[0], hidden_states)
    437     outputs = (attention_output,) + self_outputs[1:]  # add attentions if we output them
    438     return outputs

File /opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py:1501, in Module._call_impl(self, *args, **kwargs)
   1496 # If we don't have any hooks, we want to skip the rest of the logic in
   1497 # this function, and just call forward.
   1498 if not (self._backward_hooks or self._backward_pre_hooks or self._forward_hooks or self._forward_pre_hooks
   1499         or _global_backward_pre_hooks or _global_backward_hooks
   1500         or _global_forward_hooks or _global_forward_pre_hooks):
-> 1501     return forward_call(*args, **kwargs)
   1502 # Do not call functions when jit is used
   1503 full_backward_hooks, non_full_backward_hooks = [], []

File /opt/conda/lib/python3.10/site-packages/transformers/models/bert/modeling_bert.py:386, in BertSelfOutput.forward(self, hidden_states, input_tensor)
    385 def forward(self, hidden_states: torch.Tensor, input_tensor: torch.Tensor) -> torch.Tensor:
--> 386     hidden_states = self.dense(hidden_states)
    387     hidden_states = self.dropout(hidden_states)
    388     hidden_states = self.LayerNorm(hidden_states + input_tensor)

File /opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py:1501, in Module._call_impl(self, *args, **kwargs)
   1496 # If we don't have any hooks, we want to skip the rest of the logic in
   1497 # this function, and just call forward.
   1498 if not (self._backward_hooks or self._backward_pre_hooks or self._forward_hooks or self._forward_pre_hooks
   1499         or _global_backward_pre_hooks or _global_backward_hooks
   1500         or _global_forward_hooks or _global_forward_pre_hooks):
-> 1501     return forward_call(*args, **kwargs)
   1502 # Do not call functions when jit is used
   1503 full_backward_hooks, non_full_backward_hooks = [], []

File /opt/conda/lib/python3.10/site-packages/torch/nn/modules/linear.py:114, in Linear.forward(self, input)
    113 def forward(self, input: Tensor) -> Tensor:
--> 114     return F.linear(input, self.weight, self.bias)

RuntimeError: CUDA error: CUBLAS_STATUS_EXECUTION_FAILED when calling cublasLtMatmul with transpose_mat1 1 transpose_mat2 0 m 768 n 3072 k 768 mat1_ld 768 mat2_ld 768 result_ld 768 abcType 0 computeType 68 scaleType 0

Here is another traceback (same code):

RuntimeError                              Traceback (most recent call last)
Cell In[16], line 148
    139 trainer = Trainer(
    140     model=model,
    141     args=args,
   (...)
    144     compute_metrics=compute_f1
    145 )
    147 # Training is really simple using our Trainer!
--> 148 trainer.train()
    150 # ... and so is evaluating!
    151 metrics = trainer.evaluate()

File /opt/conda/lib/python3.10/site-packages/transformers/trainer.py:1537, in Trainer.train(self, resume_from_checkpoint, trial, ignore_keys_for_eval, **kwargs)
   1535         hf_hub_utils.enable_progress_bars()
   1536 else:
-> 1537     return inner_training_loop(
   1538         args=args,
   1539         resume_from_checkpoint=resume_from_checkpoint,
   1540         trial=trial,
   1541         ignore_keys_for_eval=ignore_keys_for_eval,
   1542     )

File /opt/conda/lib/python3.10/site-packages/transformers/trainer.py:1854, in Trainer._inner_training_loop(self, batch_size, args, resume_from_checkpoint, trial, ignore_keys_for_eval)
   1851     self.control = self.callback_handler.on_step_begin(args, self.state, self.control)
   1853 with self.accelerator.accumulate(model):
-> 1854     tr_loss_step = self.training_step(model, inputs)
   1856 if (
   1857     args.logging_nan_inf_filter
   1858     and not is_torch_tpu_available()
   1859     and (torch.isnan(tr_loss_step) or torch.isinf(tr_loss_step))
   1860 ):
   1861     # if loss is nan or inf simply add the average of previous logged losses
   1862     tr_loss += tr_loss / (1 + self.state.global_step - self._globalstep_last_logged)

File /opt/conda/lib/python3.10/site-packages/transformers/trainer.py:2723, in Trainer.training_step(self, model, inputs)
   2720     return loss_mb.reduce_mean().detach().to(self.args.device)
   2722 with self.compute_loss_context_manager():
-> 2723     loss = self.compute_loss(model, inputs)
   2725 if self.args.n_gpu > 1:
   2726     loss = loss.mean()  # mean() to average on multi-gpu parallel training

File /opt/conda/lib/python3.10/site-packages/transformers/trainer.py:2746, in Trainer.compute_loss(self, model, inputs, return_outputs)
   2744 else:
   2745     labels = None
-> 2746 outputs = model(**inputs)
   2747 # Save past state if it exists
   2748 # TODO: this needs to be fixed and made cleaner later.
   2749 if self.args.past_index >= 0:

File /opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py:1501, in Module._call_impl(self, *args, **kwargs)
   1496 # If we don't have any hooks, we want to skip the rest of the logic in
   1497 # this function, and just call forward.
   1498 if not (self._backward_hooks or self._backward_pre_hooks or self._forward_hooks or self._forward_pre_hooks
   1499         or _global_backward_pre_hooks or _global_backward_hooks
   1500         or _global_forward_hooks or _global_forward_pre_hooks):
-> 1501     return forward_call(*args, **kwargs)
   1502 # Do not call functions when jit is used
   1503 full_backward_hooks, non_full_backward_hooks = [], []

File /opt/conda/lib/python3.10/site-packages/span_marker/modeling.py:153, in SpanMarkerModel.forward(self, input_ids, attention_mask, position_ids, start_marker_indices, num_marker_pairs, labels, num_words, document_ids, sentence_ids, **kwargs)
    136 """Forward call of the SpanMarkerModel.
    137 
    138 Args:
   (...)
    150     SpanMarkerOutput: The output dataclass.
    151 """
    152 token_type_ids = torch.zeros_like(input_ids)
--> 153 outputs = self.encoder(
    154     input_ids,
    155     attention_mask=attention_mask,
    156     token_type_ids=token_type_ids,
    157     position_ids=position_ids,
    158 )
    159 last_hidden_state = outputs[0]
    160 last_hidden_state = self.dropout(last_hidden_state)

File /opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py:1501, in Module._call_impl(self, *args, **kwargs)
   1496 # If we don't have any hooks, we want to skip the rest of the logic in
   1497 # this function, and just call forward.
   1498 if not (self._backward_hooks or self._backward_pre_hooks or self._forward_hooks or self._forward_pre_hooks
   1499         or _global_backward_pre_hooks or _global_backward_hooks
   1500         or _global_forward_hooks or _global_forward_pre_hooks):
-> 1501     return forward_call(*args, **kwargs)
   1502 # Do not call functions when jit is used
   1503 full_backward_hooks, non_full_backward_hooks = [], []

File /opt/conda/lib/python3.10/site-packages/transformers/models/bert/modeling_bert.py:1013, in BertModel.forward(self, input_ids, attention_mask, token_type_ids, position_ids, head_mask, inputs_embeds, encoder_hidden_states, encoder_attention_mask, past_key_values, use_cache, output_attentions, output_hidden_states, return_dict)
   1004 head_mask = self.get_head_mask(head_mask, self.config.num_hidden_layers)
   1006 embedding_output = self.embeddings(
   1007     input_ids=input_ids,
   1008     position_ids=position_ids,
   (...)
   1011     past_key_values_length=past_key_values_length,
   1012 )
-> 1013 encoder_outputs = self.encoder(
   1014     embedding_output,
   1015     attention_mask=extended_attention_mask,
   1016     head_mask=head_mask,
   1017     encoder_hidden_states=encoder_hidden_states,
   1018     encoder_attention_mask=encoder_extended_attention_mask,
   1019     past_key_values=past_key_values,
   1020     use_cache=use_cache,
   1021     output_attentions=output_attentions,
   1022     output_hidden_states=output_hidden_states,
   1023     return_dict=return_dict,
   1024 )
   1025 sequence_output = encoder_outputs[0]
   1026 pooled_output = self.pooler(sequence_output) if self.pooler is not None else None

File /opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py:1501, in Module._call_impl(self, *args, **kwargs)
   1496 # If we don't have any hooks, we want to skip the rest of the logic in
   1497 # this function, and just call forward.
   1498 if not (self._backward_hooks or self._backward_pre_hooks or self._forward_hooks or self._forward_pre_hooks
   1499         or _global_backward_pre_hooks or _global_backward_hooks
   1500         or _global_forward_hooks or _global_forward_pre_hooks):
-> 1501     return forward_call(*args, **kwargs)
   1502 # Do not call functions when jit is used
   1503 full_backward_hooks, non_full_backward_hooks = [], []

File /opt/conda/lib/python3.10/site-packages/transformers/models/bert/modeling_bert.py:607, in BertEncoder.forward(self, hidden_states, attention_mask, head_mask, encoder_hidden_states, encoder_attention_mask, past_key_values, use_cache, output_attentions, output_hidden_states, return_dict)
    596     layer_outputs = self._gradient_checkpointing_func(
    597         layer_module.__call__,
    598         hidden_states,
   (...)
    604         output_attentions,
    605     )
    606 else:
--> 607     layer_outputs = layer_module(
    608         hidden_states,
    609         attention_mask,
    610         layer_head_mask,
    611         encoder_hidden_states,
    612         encoder_attention_mask,
    613         past_key_value,
    614         output_attentions,
    615     )
    617 hidden_states = layer_outputs[0]
    618 if use_cache:

File /opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py:1501, in Module._call_impl(self, *args, **kwargs)
   1496 # If we don't have any hooks, we want to skip the rest of the logic in
   1497 # this function, and just call forward.
   1498 if not (self._backward_hooks or self._backward_pre_hooks or self._forward_hooks or self._forward_pre_hooks
   1499         or _global_backward_pre_hooks or _global_backward_hooks
   1500         or _global_forward_hooks or _global_forward_pre_hooks):
-> 1501     return forward_call(*args, **kwargs)
   1502 # Do not call functions when jit is used
   1503 full_backward_hooks, non_full_backward_hooks = [], []

File /opt/conda/lib/python3.10/site-packages/transformers/models/bert/modeling_bert.py:539, in BertLayer.forward(self, hidden_states, attention_mask, head_mask, encoder_hidden_states, encoder_attention_mask, past_key_value, output_attentions)
    536     cross_attn_present_key_value = cross_attention_outputs[-1]
    537     present_key_value = present_key_value + cross_attn_present_key_value
--> 539 layer_output = apply_chunking_to_forward(
    540     self.feed_forward_chunk, self.chunk_size_feed_forward, self.seq_len_dim, attention_output
    541 )
    542 outputs = (layer_output,) + outputs
    544 # if decoder, return the attn key/values as the last output

File /opt/conda/lib/python3.10/site-packages/transformers/pytorch_utils.py:242, in apply_chunking_to_forward(forward_fn, chunk_size, chunk_dim, *input_tensors)
    239     # concatenate output at same dimension
    240     return torch.cat(output_chunks, dim=chunk_dim)
--> 242 return forward_fn(*input_tensors)

File /opt/conda/lib/python3.10/site-packages/transformers/models/bert/modeling_bert.py:552, in BertLayer.feed_forward_chunk(self, attention_output)
    550 def feed_forward_chunk(self, attention_output):
    551     intermediate_output = self.intermediate(attention_output)
--> 552     layer_output = self.output(intermediate_output, attention_output)
    553     return layer_output

File /opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py:1501, in Module._call_impl(self, *args, **kwargs)
   1496 # If we don't have any hooks, we want to skip the rest of the logic in
   1497 # this function, and just call forward.
   1498 if not (self._backward_hooks or self._backward_pre_hooks or self._forward_hooks or self._forward_pre_hooks
   1499         or _global_backward_pre_hooks or _global_backward_hooks
   1500         or _global_forward_hooks or _global_forward_pre_hooks):
-> 1501     return forward_call(*args, **kwargs)
   1502 # Do not call functions when jit is used
   1503 full_backward_hooks, non_full_backward_hooks = [], []

File /opt/conda/lib/python3.10/site-packages/transformers/models/bert/modeling_bert.py:466, in BertOutput.forward(self, hidden_states, input_tensor)
    464 hidden_states = self.dense(hidden_states)
    465 hidden_states = self.dropout(hidden_states)
--> 466 hidden_states = self.LayerNorm(hidden_states + input_tensor)
    467 return hidden_states

File /opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py:1501, in Module._call_impl(self, *args, **kwargs)
   1496 # If we don't have any hooks, we want to skip the rest of the logic in
   1497 # this function, and just call forward.
   1498 if not (self._backward_hooks or self._backward_pre_hooks or self._forward_hooks or self._forward_pre_hooks
   1499         or _global_backward_pre_hooks or _global_backward_hooks
   1500         or _global_forward_hooks or _global_forward_pre_hooks):
-> 1501     return forward_call(*args, **kwargs)
   1502 # Do not call functions when jit is used
   1503 full_backward_hooks, non_full_backward_hooks = [], []

File /opt/conda/lib/python3.10/site-packages/torch/nn/modules/normalization.py:190, in LayerNorm.forward(self, input)
    189 def forward(self, input: Tensor) -> Tensor:
--> 190     return F.layer_norm(
    191         input, self.normalized_shape, self.weight, self.bias, self.eps)

File /opt/conda/lib/python3.10/site-packages/torch/nn/functional.py:2515, in layer_norm(input, normalized_shape, weight, bias, eps)
   2511 if has_torch_function_variadic(input, weight, bias):
   2512     return handle_torch_function(
   2513         layer_norm, (input, weight, bias), input, normalized_shape, weight=weight, bias=bias, eps=eps
   2514     )
-> 2515 return torch.layer_norm(input, normalized_shape, weight, bias, eps, torch.backends.cudnn.enabled)

RuntimeError: CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions

Sometimes I also get this one:

/usr/local/src/pytorch/aten/src/ATen/native/cuda/Indexing.cu:1146: indexSelectLargeIndex: block: [313,0,0], thread: [0,0,0] Assertion `srcIndex < srcSelectDimSize` failed.

I know this probably isn't much to work on. Let me know if you have any advice for me.


transformers==4.36.0
span-marker==1.5.0
torch==2.0.0

Entity type

Hi,

Appreciate the amazing work. Is there a list of the entity types detected by the pretrained models?

Thanks.
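
For reference, the label set is stored in each model's configuration, so something along the following lines should list it. This is only a sketch and assumes the standard 🤗 Transformers id2label mapping is exposed on the SpanMarker config, which may differ per version.

from span_marker import SpanMarkerModel

# Hypothetical sketch: inspect the labels a pretrained SpanMarker model predicts.
model = SpanMarkerModel.from_pretrained("tomaarsen/span-marker-bert-base-fewnerd-fine-super")
print(sorted(set(model.config.id2label.values())))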

Are `<start>` and `<end>` ever used?

class SpanMarkerTokenizer:
    def __init__(self, tokenizer: PreTrainedTokenizer, config: SpanMarkerConfig, **kwargs) -> None:
        self.tokenizer = tokenizer
        self.config = config

        tokenizer.add_tokens(["<start>", "<end>"], special_tokens=True)
        self.start_marker_id, self.end_marker_id = self.tokenizer.convert_tokens_to_ids(["<start>", "<end>"])

Hi,

In the above code, you define the `<start>` and `<end>` tokens for the tokenizer, but I couldn't find where they are used. Are they potentially intended for relation extraction?

SpanMarker with document-level context gives an error (RuntimeError: CUDA error: device-side assert triggered)

Gives this error:

RuntimeError: CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

FineTune Code:

from datasets import load_dataset, Dataset
dataset = load_dataset("json", data_files=["output.jsonl"])
from span_marker import SpanMarkerModel
model = SpanMarkerModel.from_pretrained(
    "bert-base-uncased",  # Example encoder
    labels=['O','Degree','Years_of_Experience','Email_Address'
        'College_Name','Location','Designation','Graduation_Year','Skills','Name'
        'Companies_worked_at'],
    max_prev_context=2,
    max_next_context=2,
)
from transformers import TrainingArguments
args = TrainingArguments(
    output_dir="models/RUDYRDX-NER-1",
    learning_rate=1e-5,
    gradient_accumulation_steps=2,
    per_device_train_batch_size=4,
    per_device_eval_batch_size=4,
    num_train_epochs=1,
    evaluation_strategy="steps",
    save_strategy="steps",
    eval_steps=500,
    push_to_hub=False,
    logging_steps=50,
    fp16=True,
    warmup_ratio=0.1,
)
from span_marker import Trainer
trainer = Trainer(
    model=model,
    args=args,
    train_dataset=dataset['train'],
)
trainer.train() # error happens when this runs

Dataset Sample:

{"document_id": 0, "sentence_id": 0, "tokens": ["Govardhana", "K", "Senior", "Software", "Engineer", "Bengaluru", "Karnataka", "Karnataka", "-", "Email", "Indeed", ":", "indeed.com/r/Govardhana-K/", "b2de315d95905b68", "Total", "experience", "5", "Years", "6", "Months", "Cloud", "Lending", "Solutions", "INC", "4", "Month", "Salesforce", "Developer", "Oracle", "5", "Years", "2", "Month", "Core", "Java", "Developer", "Languages", "Core", "Java", "Go", "Lang", "Oracle", "PL-SQL", "programming", "Sales", "Force", "Developer", "APEX", "."], "ner_tags": ["Name", "Designation", "Designation", "Designation", "O", "O", "O", "O", "O", "O", "O", "O", "Email Address", "Email Address", "Email Address", "O", "O", "O", "O", "O", "O", "Companies worked at", "Companies worked at", "Companies worked at", "Companies worked at", "O", "O", "O", "O", "O", "Companies worked at", "Companies worked at", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O"]}
{"document_id": 0, "sentence_id": 1, "tokens": ["Designations", "&", "Promotions", "Willing", "relocate", ":", "Anywhere", "WORK", "EXPERIENCE", "Senior", "Software", "Engineer", "Cloud", "Lending", "Solutions", "-", "Bangalore", "Karnataka", "-", "January", "2018", "Present", "Present", "Senior", "Consultant", "Oracle", "-", "Bangalore", "Karnataka", "-", "November", "2016", "December", "2017", "Staff", "Consultant", "Oracle", "-", "Bangalore", "Karnataka", "-", "January", "2014", "October", "2016", "Associate", "Consultant", "Oracle", "-", "Bangalore", "Karnataka", "-", "November", "2012", "December", "2013", "EDUCATION", "B.E", "Computer", "Science", "Engineering", "Adithya", "Institute", "Technology", "-", "Tamil", "Nadu", "September", "2008", "June", "2012", "https", ":", "//www.indeed.com/r/Govardhana-K/b2de315d95905b68", "?", "isid=rex-download", "&", "ikw=download-top", "&", "co=IN", "https", ":", "//www.indeed.com/r/Govardhana-K/b2de315d95905b68", "?", "isid=rex-download", "&", "ikw=download-top", "&", "co=IN", "SKILLS", "APEX", "."], "ner_tags": ["Designation", "Designation", "Designation", "Designation", "Location", "Location", "O", "O", "O", "O", "O", "Email Address", "Email Address", "Email Address", "Email Address", "Email Address", "Email Address", "O", "O", "O", "O", "O", "O", "Companies worked at", "Companies worked at", "Companies worked at", "O", "O", "O", "O", "O", "Companies worked at", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "Companies worked at", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "Designation", "Designation", "Designation", "Companies worked at", "Companies worked at", "Companies worked at", "Companies worked at", "O", "O", "O", "O", "O", "O", "Designation", "Designation", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "Designation", "Designation", "Designation"]}
{"document_id": 0, "sentence_id": 2, "tokens": ["(", "Less", "1", "year", ")", "Data", "Structures", "(", "3", "years", ")", "FLEXCUBE", "(", "5", "years", ")", "Oracle", "(", "5", "years", ")", "Algorithms", "(", "3", "years", ")", "LINKS", "https", ":", "//www.linkedin.com/in/govardhana-k-61024944/", "ADDITIONAL", "INFORMATION", "Technical", "Proficiency", ":", "Languages", ":", "Core", "Java", "Go", "Lang", "Data", "Structures", "&", "Algorithms", "Oracle", "PL-SQL", "programming", "Sales", "Force", "APEX", "."], "ner_tags": ["Name", "Name", "Name", "Designation", "Designation", "Designation", "Designation", "Designation", "Designation", "Location", "Location", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "Email Address", "Email Address", "Email Address", "Email Address", "Email Address", "Email Address", "Email Address", "Email Address", "O", "Companies worked at", "Companies worked at", "O", "O", "O", "O", "O", "O", "Companies worked at", "Companies worked at", "O", "O", "O", "O", "O", "O", "O", "O", "O", "Companies worked at", "O", "O"]}
{"document_id": 0, "sentence_id": 3, "tokens": ["Tools", ":", "RADTool", "Jdeveloper", "NetBeans", "Eclipse", "SQL", "developer", "PL/SQL", "Developer", "WinSCP", "Putty", "Web", "Technologies", ":", "JavaScript", "XML", "HTML", "Webservice", "Operating", "Systems", ":", "Linux", "Windows", "Version", "control", "system", "SVN", "&", "Git-Hub", "Databases", ":", "Oracle", "Middleware", ":", "Web", "logic", "OC4J", "Product", "FLEXCUBE", ":", "Oracle", "FLEXCUBE", "Versions", "10.x", "11.x", "12.x", "https", ":", "//www.linkedin.com/in/govardhana-k-61024944/"], "ner_tags": ["Name", "Name", "Designation", "Designation", "Designation", "Location", "O", "O", "O", "O", "O", "O", "O", "Email Address", "Email Address", "Email Address", "Email Address", "Email Address", "O", "O", "O", "O", "O", "O", "Companies worked at", "Companies worked at", "Companies worked at", "O", "O", "O", "O", "O", "O", "Companies worked at", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "Companies worked at", "O", "O", "O", "O"]}

Hugging Face Space URL not working for FewNERD fine-tuned model

Hi here @tomaarsen!

Description

It seems that the Hugging Face Space linked as 🤗 Space is not working: it throws an HTTP 404 error, and it seems it is no longer in your Hugging Face Hub account.

So I just wanted to point that out in case you want to keep a running example besides the Free Inference API; otherwise I guess you're fine removing it!

`Trainer.preprocess_dataset` - `KeyError` with `datasets<2.6.0`

I ran the README snippet in a Kaggle notebook (Python 3.7) after running %pip install datasets span_marker transformers. I get a KeyError from trainer.train() / Trainer.preprocess_dataset (see below).

I was able to fix this by upgrading to datasets>=2.6.0. It seems that with the default Kaggle setup datasets==2.1.0 was installed when I ran %pip install datasets span_marker transformers.

I can't see an obvious reason for the change in datasets behaviour in the 2.6.0 release notes so maybe I'm on the wrong track. I thought you might like to know, in case there is a reason and you would like to add a constraint on datasets as a dependency.

Thanks for the package!

Traceback:

---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
/tmp/ipykernel_103/2689077115.py in <module>
     30 )
     31 
---> 32 trainer.train()
     33 trainer.save_model("my_span_marker_model/checkpoint-final")
     34 

/opt/conda/lib/python3.7/site-packages/transformers/trainer.py in train(self, resume_from_checkpoint, trial, ignore_keys_for_eval, **kwargs)
   1635             resume_from_checkpoint=resume_from_checkpoint,
   1636             trial=trial,
-> 1637             ignore_keys_for_eval=ignore_keys_for_eval,
   1638         )
   1639 

/opt/conda/lib/python3.7/site-packages/transformers/trainer.py in _inner_training_loop(self, batch_size, args, resume_from_checkpoint, trial, ignore_keys_for_eval)
   1643         self._train_batch_size = batch_size
   1644         # Data loader and number of training steps
-> 1645         train_dataloader = self.get_train_dataloader()
   1646 
   1647         # Setting up training control variables:

/opt/conda/lib/python3.7/site-packages/span_marker/trainer.py in get_train_dataloader(self)
    176     def get_train_dataloader(self) -> DataLoader:
    177         """Return the preprocessed training DataLoader."""
--> 178         self.train_dataset = self.preprocess_dataset(self.train_dataset, self.label_normalizer, self.tokenizer)
    179         return super().get_train_dataloader()
    180 

/opt/conda/lib/python3.7/site-packages/span_marker/trainer.py in preprocess_dataset(self, dataset, label_normalizer, tokenizer, dataset_name, is_evaluate)
    170             batched=True,
    171             remove_columns=dataset.column_names,
--> 172             desc=f"Tokenizing the {dataset_name} dataset",
    173         )
    174         return dataset

/opt/conda/lib/python3.7/site-packages/datasets/arrow_dataset.py in map(self, function, with_indices, with_rank, input_columns, batched, batch_size, drop_last_batch, remove_columns, keep_in_memory, load_from_cache_file, cache_file_name, writer_batch_size, features, disable_nullable, fn_kwargs, num_proc, suffix_template, new_fingerprint, desc)
   1971                 new_fingerprint=new_fingerprint,
   1972                 disable_tqdm=disable_tqdm,
-> 1973                 desc=desc,
   1974             )
   1975         else:

/opt/conda/lib/python3.7/site-packages/datasets/arrow_dataset.py in wrapper(*args, **kwargs)
    518             self: "Dataset" = kwargs.pop("self")
    519         # apply actual function
--> 520         out: Union["Dataset", "DatasetDict"] = func(self, *args, **kwargs)
    521         datasets: List["Dataset"] = list(out.values()) if isinstance(out, dict) else [out]
    522         for dataset in datasets:

/opt/conda/lib/python3.7/site-packages/datasets/arrow_dataset.py in wrapper(*args, **kwargs)
    485         }
    486         # apply actual function
--> 487         out: Union["Dataset", "DatasetDict"] = func(self, *args, **kwargs)
    488         datasets: List["Dataset"] = list(out.values()) if isinstance(out, dict) else [out]
    489         # re-apply format to the output

/opt/conda/lib/python3.7/site-packages/datasets/fingerprint.py in wrapper(*args, **kwargs)
    456             # Call actual function
    457 
--> 458             out = func(self, *args, **kwargs)
    459 
    460             # Update fingerprint of in-place transforms + update in-place history of transforms

/opt/conda/lib/python3.7/site-packages/datasets/arrow_dataset.py in _map_single(self, function, with_indices, with_rank, input_columns, batched, batch_size, drop_last_batch, remove_columns, keep_in_memory, load_from_cache_file, cache_file_name, writer_batch_size, features, disable_nullable, fn_kwargs, new_fingerprint, rank, offset, disable_tqdm, desc, cache_only)
   2341                                 indices,
   2342                                 check_same_num_examples=len(input_dataset.list_indexes()) > 0,
-> 2343                                 offset=offset,
   2344                             )
   2345                         except NumExamplesMismatchError:

/opt/conda/lib/python3.7/site-packages/datasets/arrow_dataset.py in apply_function_on_filtered_inputs(inputs, indices, check_same_num_examples, offset)
   2218             if with_rank:
   2219                 additional_args += (rank,)
-> 2220             processed_inputs = function(*fn_args, *additional_args, **fn_kwargs)
   2221             if update_data is None:
   2222                 # Check if the function returns updated examples

/opt/conda/lib/python3.7/site-packages/datasets/arrow_dataset.py in decorated(item, *args, **kwargs)
   1913                 )
   1914                 # Use the LazyDict internally, while mapping the function
-> 1915                 result = f(decorated_item, *args, **kwargs)
   1916                 # Return a standard dict
   1917                 return result.data if isinstance(result, LazyDict) else result

/opt/conda/lib/python3.7/site-packages/span_marker/trainer.py in <lambda>(batch)
    167         # Tokenize and add start/end markers
    168         dataset = dataset.map(
--> 169             lambda batch: tokenizer(batch["tokens"], labels=batch["ner_tags"], return_num_words=is_evaluate),
    170             batched=True,
    171             remove_columns=dataset.column_names,

/opt/conda/lib/python3.7/site-packages/datasets/arrow_dataset.py in __getitem__(self, key)
    123 class Batch(LazyDict):
    124     def __getitem__(self, key):
--> 125         values = super().__getitem__(key)
    126         if self.features and key in self.features:
    127             values = [

/opt/conda/lib/python3.7/collections/__init__.py in __getitem__(self, key)
   1025         if hasattr(self.__class__, "__missing__"):
   1026             return self.__class__.__missing__(self, key)
-> 1027         raise KeyError(key)
   1028     def __setitem__(self, key, item): self.data[key] = item
   1029     def __delitem__(self, key): del self.data[key]

KeyError: 'tokens'

Cannot train BILOU scheme with no singletons

The AutoLabelNormalizer infers the scheme based on the presence of all the tag prefixes - i.e. BILOU is assumed if there's at least one of each of ['B','I','L','O','U']. There doesn't appear to be any way of passing a specific LabelNormalizer to the trainer.

My issue with this is that my dataset contains only BILO - i.e., there are no singletons. But the nature of my problem means I need to use a scheme that has B and L tags.

Because my dataset has no U tags, SpanMarker errors out, since the set BILO doesn't match any of the cases that AutoLabelNormalizer checks. Perhaps allow a LabelNormalizer to be passed as an argument, or relax the conditions, e.g.:

    if (tags == set("BILOU")) or (tags == set("BILO")):
        return LabelNormalizerBILOU(config)

Issue with adding span_marker pipe within spaCy

ValueError Traceback (most recent call last)
in <cell line: 6>()
4 # Load the spaCy model
5 nlp = spacy.load("en_core_web_sm")
----> 6 nlp.add_pipe("span_marker", config={"model": "tomaarsen/span-marker-roberta-large-ontonotes5"})
7
8 # Feed some text through the model to get a spacy Doc

1 frames
/usr/local/lib/python3.10/dist-packages/spacy/language.py in create_pipe(self, factory_name, name, config, raw_config, validate)
658 lang_code=self.lang,
659 )
--> 660 raise ValueError(err)
661 pipe_meta = self.get_factory_meta(factory_name)
662 # This is unideal, but the alternative would mean you always need to

ValueError: [E002] Can't find factory for 'span_marker' for language English (en). This usually happens when spaCy calls nlp.create_pipe with a custom component name that's not registered on the current language class. If you're using a Transformer, make sure to install 'spacy-transformers'. If you're using a custom component, make sure you've added the decorator @Language.component (for function components) or @Language.factory (for class components).

Available factories: attribute_ruler, tok2vec, merge_noun_chunks, merge_entities, merge_subtokens, token_splitter, doc_cleaner, parser, beam_parser, lemmatizer, trainable_lemmatizer, entity_linker, ner, beam_ner, entity_ruler, tagger, morphologizer, senter, sentencizer, textcat, spancat, spancat_singlelabel, future_entity_ruler, span_ruler, textcat_multilabel, en.lemmatizer

Package versions:
spacy == 3.5.3 (did not work with 3.5.2 either)
span_marker == 1.2.1

How to make this work for overlapping entities?

Hi Tom, I was wondering how I could make this work for overlapping spans of different entity types, and how to extend this to relation extraction as well.
Any help or direction would be super helpful.

Unexpectedly (bad) predictions?

I'm just trying out the pretrained models accompanying this repo via HF Spaces and I'm seeing some weird results.

  • For some models there's a huge difference in quality when I include/exclude a period. E.g. the difference between the two sentences "This is James" vs. "This is James." (note the trailing period) sometimes causes models to fail to recognise "James" as a person.
  • The multilingual model was not able to recognise major cities in a straightforward Dutch sentence ("Ik woon in Leuven." - also tried it with "Amsterdam", and "Parijs").

Am I using the pretrained models wrong? Is it expecting different kinds of inputs?

I can elaborate and test more, but I figured I'd post this first. 😊

Thanks!

UnicodeDecodeError when importing span_marker

This error message appears when I try to import span_marker:

from span_marker import SpanMarkerModel

Traceback (most recent call last):
  File "/home/user/name_entity.py", line 9, in <module>
  File "/home/user/miniconda3/envs/span_marker_env/lib/python3.9/site-packages/span_marker/__init__.py", line 8, in <module>
    import torch
  File "/home/user/.local/lib/python3.9/site-packages/torch/__init__.py", line 778, in <module>
    _C._initExtension(manager_path())
  File "/home/user/.local/lib/python3.9/site-packages/torch/cuda/__init__.py", line 170, in <module>
    _lazy_call(_check_capability)
  File "/home/user/.local/lib/python3.9/site-packages/torch/cuda/__init__.py", line 168, in _lazy_call
    _queued_calls.append((callable, traceback.format_stack()))
  File "/home/user/miniconda3/envs/span_marker_env/lib/python3.9/traceback.py", line 197, in format_stack
    return format_list(extract_stack(f, limit=limit))
  File "/home/user/miniconda3/envs/span_marker_env/lib/python3.9/traceback.py", line 211, in extract_stack
    stack = StackSummary.extract(walk_stack(f), limit=limit)
  File "/home/user/miniconda3/envs/span_marker_env/lib/python3.9/traceback.py", line 366, in extract
    f.line
  File "/home/user/miniconda3/envs/span_marker_env/lib/python3.9/traceback.py", line 288, in line
    self._line = linecache.getline(self.filename, self.lineno).strip()
  File "/home/user/miniconda3/envs/span_marker_env/lib/python3.9/linecache.py", line 30, in getline
    lines = getlines(filename, module_globals)
  File "/home/user/miniconda3/envs/span_marker_env/lib/python3.9/linecache.py", line 46, in getlines
    return updatecache(filename, module_globals)
  File "/home/user/miniconda3/envs/span_marker_env/lib/python3.9/linecache.py", line 137, in updatecache
    lines = fp.readlines()
  File "/home/user/miniconda3/envs/span_marker_env/lib/python3.9/codecs.py", line 322, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe4 in position 346: invalid continuation byte

Can this be used with BetterTransformer?

I was hoping this would work for optimizing the underlying model.encoder, since it is independent(?) of the rest, but I'm getting a shape error like:

RuntimeError: shape '[1, 512]' is invalid for input of size 262144

It's basically saying it expected [512, 512] for attention_mask, which is weird because the attention_mask input is shaped [512, 512].

Here is the test code:

from span_marker import SpanMarkerModel
from optimum.bettertransformer import BetterTransformer
# Download from the 🤗 Hub
model = SpanMarkerModel.from_pretrained("tomaarsen/span-marker-bert-base-fewnerd-fine-super").eval()

# Run inference
entities = model.predict("Amelia Earhart flew her single engine Lockheed Vega 5B across the Atlantic to Paris.")
print(entities) # works
better_encoder = BetterTransformer.transform(model.encoder)
model.encoder=better_encoder 

# Run inference
entities = model.predict("Amelia Earhart flew her single engine Lockheed Vega 5B across the Atlantic to Paris.")
print(entities)
│ /Users/ceyda.1/miniconda/lib/python3.9/site-packages/span_marker/modeling.py:137 in forward      │
│                                                                                                  │
│   134 │   │   │   SpanMarkerOutput: The output dataclass.                                        │
│   135 │   │   """                                                                                │
│   136 │   │   token_type_ids = torch.zeros_like(input_ids)                                       │
│ ❱ 137 │   │   outputs = self.encoder(                                                            │
│   138 │   │   │   input_ids,                                                                     │
│   139 │   │   │   attention_mask=attention_mask,                                                 │
│   140 │   │   │   token_type_ids=token_type_ids,                                                 │
│                                                                                                  │
│ /Users/ceyda.1/miniconda/lib/python3.9/site-packages/torch/nn/modules/module.py:1501 in          │
│ _call_impl                                                                                       │
│                                                                                                  │
│   1498 │   │   if not (self._backward_hooks or self._backward_pre_hooks or self._forward_hooks   │
│   1499 │   │   │   │   or _global_backward_pre_hooks or _global_backward_hooks                   │
│   1500 │   │   │   │   or _global_forward_hooks or _global_forward_pre_hooks):                   │
│ ❱ 1501 │   │   │   return forward_call(*args, **kwargs)                                          │
│   1502 │   │   # Do not call functions when jit is used                                          │
│   1503 │   │   full_backward_hooks, non_full_backward_hooks = [], []                             │
│   1504 │   │   backward_pre_hooks = []                                                           │
│                                                                                                  │
│ /Users/ceyda.1/miniconda/lib/python3.9/site-packages/transformers/models/bert/modeling_bert.py:1 │
│ 020 in forward                                                                                   │
│                                                                                                  │
│   1017 │   │   │   inputs_embeds=inputs_embeds,                                                  │
│   1018 │   │   │   past_key_values_length=past_key_values_length,                                │
│   1019 │   │   )                                                                                 │
│ ❱ 1020 │   │   encoder_outputs = self.encoder(                                                   │
│   1021 │   │   │   embedding_output,                                                             │
│   1022 │   │   │   attention_mask=extended_attention_mask,                                       │
│   1023 │   │   │   head_mask=head_mask,                                                          │
│                                                                                                  │
│ /Users/ceyda.1/miniconda/lib/python3.9/site-packages/torch/nn/modules/module.py:1501 in          │
│ _call_impl                                                                                       │
│                                                                                                  │
│   1498 │   │   if not (self._backward_hooks or self._backward_pre_hooks or self._forward_hooks   │
│   1499 │   │   │   │   or _global_backward_pre_hooks or _global_backward_hooks                   │
│   1500 │   │   │   │   or _global_forward_hooks or _global_forward_pre_hooks):                   │
│ ❱ 1501 │   │   │   return forward_call(*args, **kwargs)                                          │
│   1502 │   │   # Do not call functions when jit is used                                          │
│   1503 │   │   full_backward_hooks, non_full_backward_hooks = [], []                             │
│   1504 │   │   backward_pre_hooks = []                                                           │
│                                                                                                  │
│ /Users/ceyda.1/miniconda/lib/python3.9/site-packages/transformers/models/bert/modeling_bert.py:6 │
│ 10 in forward                                                                                    │
│                                                                                                  │
│    607 │   │   │   │   │   encoder_attention_mask,                                               │
│    608 │   │   │   │   )                                                                         │
│    609 │   │   │   else:                                                                         │
│ ❱  610 │   │   │   │   layer_outputs = layer_module(                                             │
│    611 │   │   │   │   │   hidden_states,                                                        │
│    612 │   │   │   │   │   attention_mask,                                                       │
│    613 │   │   │   │   │   layer_head_mask,                                                      │
│                                                                                                  │
│ /Users/ceyda.1/miniconda/lib/python3.9/site-packages/torch/nn/modules/module.py:1501 in          │
│ _call_impl                                                                                       │
│                                                                                                  │
│   1498 │   │   if not (self._backward_hooks or self._backward_pre_hooks or self._forward_hooks   │
│   1499 │   │   │   │   or _global_backward_pre_hooks or _global_backward_hooks                   │
│   1500 │   │   │   │   or _global_forward_hooks or _global_forward_pre_hooks):                   │
│ ❱ 1501 │   │   │   return forward_call(*args, **kwargs)                                          │
│   1502 │   │   # Do not call functions when jit is used                                          │
│   1503 │   │   full_backward_hooks, non_full_backward_hooks = [], []                             │
│   1504 │   │   backward_pre_hooks = []                                                           │
│                                                                                                  │
│ /Users/ceyda.1/miniconda/lib/python3.9/site-packages/optimum/bettertransformer/models/encoder_mo │
│ dels.py:246 in forward                                                                           │
│                                                                                                  │
│    243 │   │   │   # attention mask comes in with values 0 and -inf. we convert to torch.nn.Tra  │
│    244 │   │   │   # 0->false->keep this token -inf->true->mask this token                       │
│    245 │   │   │   attention_mask = attention_mask.bool()                                        │
│ ❱  246 │   │   │   attention_mask = torch.reshape(attention_mask, (attention_mask.shape[0], att  │
│    247 │   │   │   hidden_states = torch._nested_tensor_from_mask(hidden_states, ~attention_mas  │
│    248 │   │   │   attention_mask = None                                                         │
│    249                                                                                           │
╰──────────────────────────────────────────────────────────────────────────────────────────────────╯
RuntimeError: shape '[1, 512]' is invalid for input of size 262144
  • transformers version: 4.29.2
  • Python version: 3.9.16
  • PyTorch version (GPU?): 2.0.1 (False)
  • optimum: 1.8.6

Integrate Entity Ruler with Span Marker model

Hi. I would like to combine the SpanMarker model's spaCy integration (https://spacy.io/universe/project/span_marker) with the entity ruler (to add some custom patterns), similar to the code below.
import spacy
import de_dep_news_trf

nlp = de_dep_news_trf.load()

patterns = [
    {"label": "PERIOD", "pattern": [{"LOWER": "monat"}]},
    {"label": "PER", "pattern": "Raluca"},
    {"label": "COLOR", "pattern": [{"LOWER": "blau"}]},
    {"label": "JOBTITLE", "pattern": [{"LOWER": {"REGEX": ".*(referent).*"}}]},
]

ruler = nlp.add_pipe("entity_ruler")
ruler.add_patterns(patterns)

span_marker_ruler = nlp.add_pipe("span_marker", config={"model": "tomaarsen/span-marker-mbert-base-multinerd"}, name="span_marker_ruler")

doc = nlp("Raluca ist referent, er mag die Farbe Blau und geht jeden Monat in die Berge.")

print([(ent.text, ent.label_) for ent in doc.ents])

Unfortunately, this mix doesn't work properly.
[screenshot of the actual output]

I would expect, however, an output like this:

[('Raluca', 'PER'), ('referent', 'JOBTITLE'), ('Blau', 'COLOR'), ('Monat', 'PERIOD')]

How can I integrate the entity ruler (with some customized entities and patterns added) with the SpanMarker model?

Do you have any ideas on how to solve this kind of issue?
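
One thing that may be worth trying (an untested sketch): add the entity_ruler after the span_marker component, so the statistical entities are written first and the ruler only adds pattern matches that do not overlap with them:

import spacy
import de_dep_news_trf

nlp = de_dep_news_trf.load()
nlp.add_pipe("span_marker", config={"model": "tomaarsen/span-marker-mbert-base-multinerd"})

# Place the ruler after span_marker; with overwrite_ents=False it keeps the
# existing entities and only adds non-overlapping pattern matches.
ruler = nlp.add_pipe("entity_ruler", after="span_marker", config={"overwrite_ents": False})
ruler.add_patterns([
    {"label": "PERIOD", "pattern": [{"LOWER": "monat"}]},
    {"label": "JOBTITLE", "pattern": [{"LOWER": {"REGEX": ".*(referent).*"}}]},
])

doc = nlp("Raluca ist referent, er mag die Farbe Blau und geht jeden Monat in die Berge.")
print([(ent.text, ent.label_) for ent in doc.ents])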

spaCy integration has no `.pipe()` method, so it falls back to calling `__call__()` on individual documents

Not sure what works better during inference (individual sentences, or longer segments in larger batches), but maybe something like this could work.

    def pipe(self, stream, batch_size=128, include_sent=None):
        """Predict entities for a stream of texts or spaCy Docs.

        Args:
            stream: an iterable of texts or spaCy Docs

        Yields:
            Doc: spaCy Doc with SpanMarker entities set
        """
        # Requires `import types` and `from spacy import util` at module level.
        if isinstance(stream, str):
            stream = [stream]

        if not isinstance(stream, types.GeneratorType):
            stream = self.nlp.pipe(stream, batch_size=batch_size)

        for docs in util.minibatch(stream, size=batch_size):
            batch_results = self.model.predict(docs)

            for doc, prediction in zip(docs, batch_results):
                yield self.post_process_batch(doc, prediction)

predict should return as many lists as there are inputs

When multiple inputs are passed, predict should return that many lists. Instead, it returns a single list.

from span_marker import SpanMarkerModel

inputs = ["Unknown", "Unknown", "Unknown"]
model = SpanMarkerModel.from_pretrained("model_name")
model.predict(inputs)

Output:
[]

Expected Output:
[[], [], []]

Pretrained tokenizer

Hey Tom, thank you for this amazing framework!

May I ask why you do not require any pre-tokenization (e.g., subword-tokenization, setting special tokens to -100) based on the chosen model using AutoTokenizer.from_pretrained()?
Might pre-tokenization improve or rather degrade model performance?

Best regards,
Daniel

Confusing error thrown when tokens is empty

When one of the elements in the training set is empty, it ends up throwing a confusing error:

Label normalizing the train dataset: 100%|██████████████████████████████████████████████████████████████████████| 8324/8324 [00:00<00:00, 34016.14 examples/s]
Tokenizing the train dataset:  96%|██████████████████████████████████████████████████████████████████████████▉   | 8000/8324 [00:04<00:00, 1665.71 examples/s]c:\code\span-marker-ner\span_marker\tokenizer.py:204: RuntimeWarning: All-NaN slice encountered
  num_words = int(np.nanmax(np.array(batch_encoding.word_ids(sample_idx), dtype=float))) + 1
Tokenizing the train dataset:  96%|██████████████████████████████████████████████████████████████████████████▉   | 8000/8324 [00:04<00:00, 1612.60 examples/s] 
This SpanMarker model will ignore 3.181189% of all annotated entities in the train dataset. This is caused by the SpanMarkerModel maximum entity length of 5 words and the maximum model input length of 256 tokens.
These are the frequencies of the missed entities due to maximum entity length out of 18798 total entities:
- 203 missed entities with 6 words (1.079902%)
- 81 missed entities with 7 words (0.430897%)
- 58 missed entities with 8 words (0.308543%)
- 29 missed entities with 9 words (0.154272%)
- 5 missed entities with 10 words (0.026599%)
- 9 missed entities with 11 words (0.047877%)
- 8 missed entities with 12 words (0.042558%)
- 1 missed entities with 13 words (0.005320%)
- 1 missed entities with 14 words (0.005320%)
- 1 missed entities with 15 words (0.005320%)
- 2 missed entities with 16 words (0.010639%)
- 1 missed entities with 17 words (0.005320%)
Additionally, a total of 199 (1.058623%) entities were missed due to the maximum input length.
Traceback (most recent call last):
  File "c:\code\span-marker-ner\demo_conll2002.py", line 83, in <module>
    main()
  File "c:\code\span-marker-ner\demo_conll2002.py", line 72, in main
    trainer.train()
  File "C:\Users\tom\.conda\envs\span-marker-ner\lib\site-packages\transformers\trainer.py", line 1553, in train
    return inner_training_loop(
  File "C:\Users\tom\.conda\envs\span-marker-ner\lib\site-packages\transformers\trainer.py", line 1567, in _inner_training_loop
    train_dataloader = self.get_train_dataloader()
  File "c:\code\span-marker-ner\span_marker\trainer.py", line 423, in get_train_dataloader
    self.train_dataset = self.preprocess_dataset(self.train_dataset, self.label_normalizer, self.tokenizer)
  File "c:\code\span-marker-ner\span_marker\trainer.py", line 241, in preprocess_dataset
    dataset = dataset.map(
  File "C:\Users\tom\.conda\envs\span-marker-ner\lib\site-packages\datasets\arrow_dataset.py", line 592, in wrapper
    out: Union["Dataset", "DatasetDict"] = func(self, *args, **kwargs)
  File "C:\Users\tom\.conda\envs\span-marker-ner\lib\site-packages\datasets\arrow_dataset.py", line 557, in wrapper
    out: Union["Dataset", "DatasetDict"] = func(self, *args, **kwargs)
  File "C:\Users\tom\.conda\envs\span-marker-ner\lib\site-packages\datasets\arrow_dataset.py", line 3097, in map
    for rank, done, content in Dataset._map_single(**dataset_kwargs):
  File "C:\Users\tom\.conda\envs\span-marker-ner\lib\site-packages\datasets\arrow_dataset.py", line 3474, in _map_single
    batch = apply_function_on_filtered_inputs(
  File "C:\Users\tom\.conda\envs\span-marker-ner\lib\site-packages\datasets\arrow_dataset.py", line 3353, in apply_function_on_filtered_inputs
    processed_inputs = function(*fn_args, *additional_args, **fn_kwargs)
  File "c:\code\span-marker-ner\span_marker\tokenizer.py", line 204, in __call__
    num_words = int(np.nanmax(np.array(batch_encoding.word_ids(sample_idx), dtype=float))) + 1
ValueError: cannot convert float NaN to integer

Perhaps a cleaner error can be designed here.
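
In the meantime, filtering empty samples out before training avoids the crash. A minimal sketch (the dataset id is only an example; use your own data):

from datasets import load_dataset

dataset = load_dataset("conll2002", "es")
# Drop samples whose "tokens" list is empty before handing the dataset to the Trainer.
dataset = dataset.filter(lambda sample: len(sample["tokens"]) > 0)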

  • Tom Aarsen

ValueError: Failed to concatenate on axis=1 because tables don't have the same number of rows

When I place a single word in the first index of the list, it leads to the above error. However, if I put it in any index other than the first, no error occurs.

from span_marker import SpanMarkerModel
model = SpanMarkerModel.from_pretrained("tomaarsen/span-marker-bert-base-fewnerd-fine-super")
entities = model.predict(['Avolon', 'Walmart - Milwaukee, WI'])  # error
entities = model.predict(['Walmart - Milwaukee, WI', 'Avolon'])  # no error
Can you please help me here @tomaarsen

spaCy integration ignores old entities

I think it might not be best practice to completely overwrite the previously obtained entities; maybe something like the code below would work better.

from spacy.util import filter_spans

doc.set_ents(filter_spans(list(doc.ents) + new_ents))

Note: (XLM-)RoBERTa-based SpanMarker models require text preprocessing

Hello!

This is a heads up that (XLM-)RoBERTa-based SpanMarker models require text to be preprocessed to separate punctuation from words:

# ✅
model.predict("He plays J. Robert Oppenheimer , an American theoretical physicist .")
# ❌
model.predict("He plays J. Robert Oppenheimer, an American theoretical physicist.")

# You can also supply a list of words directly: ✅
model.predict(["He", "plays", "J.", "Robert", "Oppenheimer", ",", "an", "American", "theoretical", "physicist", "."])

This is a consequence of the RoBERTa tokenizer distinguishing a comma glued to the preceding word (as in "Oppenheimer,") from a standalone comma (" ,") as different tokens, and the SpanMarker model is only familiar with the latter variant.

Another alternative is to use the spaCy integration, which preprocesses the text into words for you!

The (m)BERT-based SpanMarker models do not require this preprocessing.
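
For completeness, a minimal sketch of doing that word splitting yourself with spaCy (the model id below is a placeholder for any (XLM-)RoBERTa-based SpanMarker model):

import spacy
from span_marker import SpanMarkerModel

nlp = spacy.blank("en")  # a blank pipeline is enough for word-level tokenization
model = SpanMarkerModel.from_pretrained("a-roberta-based-span-marker-model")  # placeholder model id

text = "He plays J. Robert Oppenheimer, an American theoretical physicist."
words = [token.text for token in nlp(text)]  # ['He', 'plays', ..., 'Oppenheimer', ',', ...]
entities = model.predict(words)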

  • Tom Aarsen

Error loading SpanMarkerTokenizer

Hello Tom.

I have found a problem when loading the tokenizer directly from the SpanMarkerTokenizer class.
I have tested it with several repo ids and the response is the same in all of them.

System Info

Platform: MacOS Sonoma 14.0, M1 Pro
Python 3.11.5

transformers=4.35.0
span_marker=1.5.0
tokenizers=0.14.1

Error Response

tokenizer = SpanMarkerTokenizer.from_pretrained("tomaarsen/span-marker-mbert-base-multinerd")
Downloading (…)okenizer_config.json: 100%|███| 343/343 [00:00<00:00, 834kB/s]
Downloading (…)solve/main/vocab.txt: 100%|█| 996k/996k [00:00<00:00, 3.36MB/s
Downloading (…)/main/tokenizer.json: 100%|█| 2.92M/2.92M [00:00<00:00, 5.46MB
Downloading (…)in/added_tokens.json: 100%|█| 43.0/43.0 [00:00<00:00, 114kB/s]
Downloading (…)cial_tokens_map.json: 100%|███| 125/125 [00:00<00:00, 384kB/s]
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/polodealvarado/Desktop/github_projects/SpanMarkerNER/span_marker/tokenizer.py", line 285, in from_pretrained
    return cls(tokenizer, config=config, **kwargs)
  File "/Users/polodealvarado/Desktop/github_projects/SpanMarkerNER/span_marker/tokenizer.py", line 156, in __init__
    self.tokenizer.model_max_length, self.config.model_max_length or self.config.model_max_length_default
AttributeError: 'NoneType' object has no attribute 'model_max_length'

Expected behaviour

Get the tokenizer
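
Until this is fixed, passing the config explicitly may work around the missing-config issue (an untested sketch; it assumes SpanMarkerConfig can be imported from span_marker.configuration and that from_pretrained forwards the config keyword):

from span_marker.configuration import SpanMarkerConfig  # assumed import path
from span_marker.tokenizer import SpanMarkerTokenizer

repo_id = "tomaarsen/span-marker-mbert-base-multinerd"
config = SpanMarkerConfig.from_pretrained(repo_id)
tokenizer = SpanMarkerTokenizer.from_pretrained(repo_id, config=config)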

num_proc not specified in .map functions

In trainer.py, there are three .map functions where num_proc is not specified.
It should be possible to set this, as it speeds up tokenization, spreading, and label normalization by a significant amount.

Example from trainer.py row 227:

        with tokenizer.entity_tracker(split=dataset_name):
            dataset = dataset.map(
                tokenizer,
                batched=True,
                remove_columns=set(dataset.column_names) - set(self.OPTIONAL_COLUMNS),
                desc=f"Tokenizing the {dataset_name} dataset",
                fn_kwargs={"return_num_words": is_evaluate},
                num_proc=4, # Added this - should be specifiable
            )

This sped up tokenization by about 4 times.

Possible to load your own trained models with internet disabled?

Was wondering if there is a way to load a model in a kaggle notebook that I trained myself. There's currently a NER competition going on, and I wanted to try using the SpanMarker library to compete. Training went fine, but now to submit, I need to have the kaggle notebook have internet disabled. When trying to load my checkpoint, I get this error:

from span_marker import SpanMarkerModel

model_checkpoint = "/kaggle/input/pii-train-1-cp3000/Kaggle Checkpoints/checkpoint 3000"
model = SpanMarkerModel.from_pretrained(
    model_checkpoint,
    local_files_only=True,
    labels=[
        '1-EMAIL', '1-ID_NUM', '1-NAME_STUDENT', '1-PHONE_NUM', '1-STREET_ADDRESS',
        '1-URL_PERSONAL', '1-USERNAME', '2-ID_NUM', '2-NAME_STUDENT', '2-PHONE_NUM',
        '2-STREET_ADDRESS', '2-URL_PERSONAL', 'O',
    ],
)

OSError: We couldn't connect to 'https://huggingface.co/' to load this file, couldn't find it in the cached files and it looks like bert-base-uncased is not the path to a directory containing a file named config.json.
Checkout your internet connection or see how to run the library in offline mode at 'https://huggingface.co/docs/transformers/installation#offline-mode'.

Kaggle notebook here: https://www.kaggle.com/jdonnelly0804/pii-infer
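
For reference, one approach that may avoid the hub lookup entirely (a sketch, assuming you can run a session with internet once beforehand): resolve the checkpoint into a single self-contained directory, upload that directory as a Kaggle dataset, and load only from it in the offline notebook.

from transformers import AutoTokenizer
from span_marker import SpanMarkerModel

checkpoint_dir = "/kaggle/input/pii-train-1-cp3000/Kaggle Checkpoints/checkpoint 3000"

# With internet available: load the checkpoint once and save everything locally.
model = SpanMarkerModel.from_pretrained(checkpoint_dir)
model.save_pretrained("span-marker-pii-offline")  # hypothetical output directory name
# Also copy the encoder's tokenizer files, since the error suggests they are being
# fetched from "bert-base-uncased" on the Hub.
AutoTokenizer.from_pretrained("bert-base-uncased").save_pretrained("span-marker-pii-offline")

# In the offline submission notebook: load strictly from the uploaded directory.
model = SpanMarkerModel.from_pretrained("/kaggle/input/span-marker-pii-offline", local_files_only=True)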

inference time cpu vs gpu

I have used gte-tiny embeddings for my custom NER model and need to speed up the inference time.
Below are the stats for different batch sizes.

Batch Size | Average Inference Time (ms) - GPU | Average Inference Time (ms) - CPU
16         | 0.14945                           | 1.23388
32         | 0.28                              | 3.24456
64         | 0.51582                           | 6.57234
128        | 1.10669                           | 13.73319
256        | 2.24729                           | 28.236

Is there any specific method to enhance it? @tomaarsen
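
Not a SpanMarker-specific recommendation, but two generic PyTorch-level options that may help on GPU (a sketch; the model id is a placeholder and the gains depend on your hardware):

import torch
from span_marker import SpanMarkerModel

model = SpanMarkerModel.from_pretrained("path-to-your-gte-tiny-span-marker-model")  # placeholder
model = model.to("cuda").eval()

sentences = ["Amelia Earhart flew her single engine Lockheed Vega 5B across the Atlantic to Paris."] * 256

# Disable autograd bookkeeping and run the encoder in mixed precision.
with torch.inference_mode(), torch.autocast("cuda", dtype=torch.float16):
    entities = model.predict(sentences, batch_size=64)

Increasing the prediction batch size (as you are already measuring) and grouping inputs of similar length can also reduce padding overhead.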

deberta-v3 encoder error

Hi there!

I was playing around with your Google Colab and wanted to train on few-nerd with encoder_id = "microsoft/deberta-v3-large",

but when I reach the train part, it fails with RuntimeError: The size of tensor a (1024) must match the size of tensor b (512) at non-singleton dimension 2:


Tokenizing the train dataset: 100%
 131767/131767 [01:16<00:00, 1661.38 examples/s]
This SpanMarker model will ignore 0.339320% of all annotated entities in the train dataset. This is caused by the SpanMarkerModel maximum entity length of 8 words and the maximum model input length of 256 tokens.
These are the frequencies of the missed entities due to maximum entity length out of 340387 total entities:
- 486 missed entities with 9 words (0.142779%)
- 245 missed entities with 10 words (0.071977%)
- 119 missed entities with 11 words (0.034960%)
- 92 missed entities with 12 words (0.027028%)
- 57 missed entities with 13 words (0.016746%)
- 36 missed entities with 14 words (0.010576%)
- 17 missed entities with 15 words (0.004994%)
- 14 missed entities with 16 words (0.004113%)
- 10 missed entities with 17 words (0.002938%)
- 4 missed entities with 18 words (0.001175%)
- 5 missed entities with 19 words (0.001469%)
- 3 missed entities with 20 words (0.000881%)
- 4 missed entities with 21 words (0.001175%)
- 1 missed entities with 22 words (0.000294%)
- 2 missed entities with 23 words (0.000588%)
- 3 missed entities with 24 words (0.000881%)
- 2 missed entities with 25 words (0.000588%)
- 2 missed entities with 26 words (0.000588%)
- 1 missed entities with 27 words (0.000294%)
- 1 missed entities with 29 words (0.000294%)
Additionally, a total of 51 (0.014983%) entities were missed due to the maximum input length.
Spreading data between multiple samples: 100%
 131767/131767 [00:18<00:00, 7164.64 examples/s]
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
To disable this warning, you can either:
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
Cell In[8], line 1
----> 1 trainer.train()

File /shared/jupyter/.venv/lib/python3.10/site-packages/transformers/trainer.py:1537, in Trainer.train(self, resume_from_checkpoint, trial, ignore_keys_for_eval, **kwargs)
   1535         hf_hub_utils.enable_progress_bars()
   1536 else:
-> 1537     return inner_training_loop(
   1538         args=args,
   1539         resume_from_checkpoint=resume_from_checkpoint,
   1540         trial=trial,
   1541         ignore_keys_for_eval=ignore_keys_for_eval,
   1542     )

File /shared/jupyter/.venv/lib/python3.10/site-packages/transformers/trainer.py:1854, in Trainer._inner_training_loop(self, batch_size, args, resume_from_checkpoint, trial, ignore_keys_for_eval)
   1851     self.control = self.callback_handler.on_step_begin(args, self.state, self.control)
   1853 with self.accelerator.accumulate(model):
-> 1854     tr_loss_step = self.training_step(model, inputs)
   1856 if (
   1857     args.logging_nan_inf_filter
   1858     and not is_torch_tpu_available()
   1859     and (torch.isnan(tr_loss_step) or torch.isinf(tr_loss_step))
   1860 ):
   1861     # if loss is nan or inf simply add the average of previous logged losses
   1862     tr_loss += tr_loss / (1 + self.state.global_step - self._globalstep_last_logged)

File /shared/jupyter/.venv/lib/python3.10/site-packages/transformers/trainer.py:2735, in Trainer.training_step(self, model, inputs)
   2732     return loss_mb.reduce_mean().detach().to(self.args.device)
   2734 with self.compute_loss_context_manager():
-> 2735     loss = self.compute_loss(model, inputs)
   2737 if self.args.n_gpu > 1:
   2738     loss = loss.mean()  # mean() to average on multi-gpu parallel training

File /shared/jupyter/.venv/lib/python3.10/site-packages/transformers/trainer.py:2758, in Trainer.compute_loss(self, model, inputs, return_outputs)
   2756 else:
   2757     labels = None
-> 2758 outputs = model(**inputs)
   2759 # Save past state if it exists
   2760 # TODO: this needs to be fixed and made cleaner later.
   2761 if self.args.past_index >= 0:

File /shared/jupyter/.venv/lib/python3.10/site-packages/torch/nn/modules/module.py:1511, in Module._wrapped_call_impl(self, *args, **kwargs)
   1509     return self._compiled_call_impl(*args, **kwargs)  # type: ignore[misc]
   1510 else:
-> 1511     return self._call_impl(*args, **kwargs)

File /shared/jupyter/.venv/lib/python3.10/site-packages/torch/nn/modules/module.py:1520, in Module._call_impl(self, *args, **kwargs)
   1515 # If we don't have any hooks, we want to skip the rest of the logic in
   1516 # this function, and just call forward.
   1517 if not (self._backward_hooks or self._backward_pre_hooks or self._forward_hooks or self._forward_pre_hooks
   1518         or _global_backward_pre_hooks or _global_backward_hooks
   1519         or _global_forward_hooks or _global_forward_pre_hooks):
-> 1520     return forward_call(*args, **kwargs)
   1522 try:
   1523     result = None

File /shared/jupyter/.venv/lib/python3.10/site-packages/accelerate/utils/operations.py:687, in convert_outputs_to_fp32.<locals>.forward(*args, **kwargs)
    686 def forward(*args, **kwargs):
--> 687     return model_forward(*args, **kwargs)

File /shared/jupyter/.venv/lib/python3.10/site-packages/accelerate/utils/operations.py:675, in ConvertOutputsToFp32.__call__(self, *args, **kwargs)
    674 def __call__(self, *args, **kwargs):
--> 675     return convert_to_fp32(self.model_forward(*args, **kwargs))

File /shared/jupyter/.venv/lib/python3.10/site-packages/torch/amp/autocast_mode.py:16, in autocast_decorator.<locals>.decorate_autocast(*args, **kwargs)
     13 @functools.wraps(func)
     14 def decorate_autocast(*args, **kwargs):
     15     with autocast_instance:
---> 16         return func(*args, **kwargs)

File /shared/jupyter/.venv/lib/python3.10/site-packages/span_marker/modeling.py:153, in SpanMarkerModel.forward(self, input_ids, attention_mask, position_ids, start_marker_indices, num_marker_pairs, labels, num_words, document_ids, sentence_ids, **kwargs)
    136 """Forward call of the SpanMarkerModel.
    137 
    138 Args:
   (...)
    150     SpanMarkerOutput: The output dataclass.
    151 """
    152 token_type_ids = torch.zeros_like(input_ids)
--> 153 outputs = self.encoder(
    154     input_ids,
    155     attention_mask=attention_mask,
    156     token_type_ids=token_type_ids,
    157     position_ids=position_ids,
    158 )
    159 last_hidden_state = outputs[0]
    160 last_hidden_state = self.dropout(last_hidden_state)

File /shared/jupyter/.venv/lib/python3.10/site-packages/torch/nn/modules/module.py:1511, in Module._wrapped_call_impl(self, *args, **kwargs)
   1509     return self._compiled_call_impl(*args, **kwargs)  # type: ignore[misc]
   1510 else:
-> 1511     return self._call_impl(*args, **kwargs)

File /shared/jupyter/.venv/lib/python3.10/site-packages/torch/nn/modules/module.py:1520, in Module._call_impl(self, *args, **kwargs)
   1515 # If we don't have any hooks, we want to skip the rest of the logic in
   1516 # this function, and just call forward.
   1517 if not (self._backward_hooks or self._backward_pre_hooks or self._forward_hooks or self._forward_pre_hooks
   1518         or _global_backward_pre_hooks or _global_backward_hooks
   1519         or _global_forward_hooks or _global_forward_pre_hooks):
-> 1520     return forward_call(*args, **kwargs)
   1522 try:
   1523     result = None

File /shared/jupyter/.venv/lib/python3.10/site-packages/transformers/models/deberta_v2/modeling_deberta_v2.py:1062, in DebertaV2Model.forward(self, input_ids, attention_mask, token_type_ids, position_ids, inputs_embeds, output_attentions, output_hidden_states, return_dict)
   1059 if token_type_ids is None:
   1060     token_type_ids = torch.zeros(input_shape, dtype=torch.long, device=device)
-> 1062 embedding_output = self.embeddings(
   1063     input_ids=input_ids,
   1064     token_type_ids=token_type_ids,
   1065     position_ids=position_ids,
   1066     mask=attention_mask,
   1067     inputs_embeds=inputs_embeds,
   1068 )
   1070 encoder_outputs = self.encoder(
   1071     embedding_output,
   1072     attention_mask,
   (...)
   1075     return_dict=return_dict,
   1076 )
   1077 encoded_layers = encoder_outputs[1]

File /shared/jupyter/.venv/lib/python3.10/site-packages/torch/nn/modules/module.py:1511, in Module._wrapped_call_impl(self, *args, **kwargs)
   1509     return self._compiled_call_impl(*args, **kwargs)  # type: ignore[misc]
   1510 else:
-> 1511     return self._call_impl(*args, **kwargs)

File /shared/jupyter/.venv/lib/python3.10/site-packages/torch/nn/modules/module.py:1520, in Module._call_impl(self, *args, **kwargs)
   1515 # If we don't have any hooks, we want to skip the rest of the logic in
   1516 # this function, and just call forward.
   1517 if not (self._backward_hooks or self._backward_pre_hooks or self._forward_hooks or self._forward_pre_hooks
   1518         or _global_backward_pre_hooks or _global_backward_hooks
   1519         or _global_forward_hooks or _global_forward_pre_hooks):
-> 1520     return forward_call(*args, **kwargs)
   1522 try:
   1523     result = None

File /shared/jupyter/.venv/lib/python3.10/site-packages/transformers/models/deberta_v2/modeling_deberta_v2.py:900, in DebertaV2Embeddings.forward(self, input_ids, token_type_ids, position_ids, mask, inputs_embeds)
    897         mask = mask.unsqueeze(2)
    898     mask = mask.to(embeddings.dtype)
--> 900     embeddings = embeddings * mask
    902 embeddings = self.dropout(embeddings)
    903 return embeddings

RuntimeError: The size of tensor a (1024) must match the size of tensor b (512) at non-singleton dimension 2

Any ideas whether this encoder will work, and how to make it work? Thanks! There are no issues if I run it on roberta-large, for example.

SpanMarker with ONNX models

Hi @tomaarsen! Is an ONNX exporter planned? Have you tried using SpanMarker with ONNX models for inference?
I would be really curious to hear if you have experimented with that already! :-)

Choose class-candidates during inference

Really great library @tomaarsen, thank you for this great contribution!!

It would be really convenient to have the possibility to pass a list of class candidates to the predict method during inference. This would be useful for Retriever-Reader systems, where the Retriever (e.g. a SetFit model) returns text sequences for which the set of classes available to the Reader (e.g. SpanMarker) for extraction is already known, and you do not want to extract other classes.

E.g. a system like the one from https://lilianweng.github.io/posts/2020-10-29-odqa/:
[figure: Retriever-Reader architecture from the linked post]

I was thinking about modifications to the predict method like these:

def predict(self, ... , class_candidates: Optional[List[str]] = None):
    
    ...

    if class_candidates is not None:
        # convert class names to class ids
        label2id = self.config.label2id
        class_candidate_ids = [label2id[c] for c in class_candidates if c in label2id]

    for batch_start_idx in trange(0, len(dataset), batch_size, leave=True, disable=not show_progress_bar):
        
        ...
        # Computing probabilities based on the logits
        probs = output.logits.softmax(-1)

        # Mask everything except class-candidate probabilities
        if class_candidates is not None:
            mask = torch.zeros_like(probs)
            mask[:, :, class_candidate_ids] = 1
            probs = probs * mask

        # Get the labels and the corresponding probability scores
        scores, labels = probs.max(-1)

        ...

    return all_entities

I did not find time to have a deep dive, implement & test it, but I think this could be a useful feature.

implementing the relation identification part?

Hi,

Do you have any clues about implementing the relation identification part, which follows the entity recognition?

I am looking into the source code and thinking about implementing it myself. Any suggestions on how to do that would be useful.

Thanks

Amazing work, but is there a minimum number of tokens?

Hello,
it is really amazing work, but I wonder: is there any minimum number of tokens required?
When I tried inference on a short sentence or question, it just returned an empty JSON.

spaCy_integration `.pipe()` does not behave as expected

I have created a pipeline like so:

self.model = spacy.load(
    "en_core_web_md",
    disable=["tagger", "lemmatizer", "attribute_ruler", "ner"],
)
self.model.add_pipe(
    "span_marker",
    config={"model": span_marker_model_path, "batch_size": batch_size},
)

I call pipe() on a stream of documents:

for name, proc in self.model.pipeline:
        stream2 = proc.pipe(stream2)

The SpanMarker model in this pipeline performs inference on each doc in the stream as if it were a single sentence.

    def pipe(self, stream, batch_size=128):
        """Fill `doc.ents` and `span.label_` using the chosen SpanMarker model."""
        if isinstance(stream, str):
            stream = [stream]

        if not isinstance(stream, types.GeneratorType):
            stream = self.nlp.pipe(stream, batch_size=batch_size)

        for docs in minibatch(stream, size=batch_size):
            inputs = [[token.text if not token.is_space else "" for token in doc] for doc in docs]

            # use document-level context in the inference if the model was also trained that way
            if self.model.config.trained_with_document_context:
                inputs = self.convert_inputs_to_dataset(inputs)

            entities_list = self.model.predict(inputs, batch_size=self.batch_size)
            for doc, entities in zip(docs, entities_list):
                ents = []
                for entity in entities:
                    start = entity["word_start_index"]
                    end = entity["word_end_index"]
                    span = doc[start:end]
                    span.label_ = entity["label"]
                    ents.append(span)

                self.set_ents(doc, ents)

                yield doc

So it reaches the max sequence length pretty quickly and only annotates the first part of each document.

This is different from the behaviour I expected, where `__call__()` breaks the doc down into sentences and infers each sentence individually.

spaCy Integration - "detect an empty sentence"

Using your spaCy integration, the sentencizer will sometimes produce an empty sentence (using "en_core_web_sm"). This leads to the SpanMarkerTokenizer throwing an exception. Not sure how active this project still is, but this seems like an easy fix. Is there already a workaround for this? Would you like the code updated to include one? (I might be able to do this fix.)
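
Until the integration guards against this, a possible workaround (a sketch; it assumes the empty sentences come from runs of whitespace or newlines in the input) is to normalize the text before passing it to the pipeline:

import re
import spacy

nlp = spacy.load("en_core_web_sm")
nlp.add_pipe("span_marker", config={"model": "tomaarsen/span-marker-mbert-base-multinerd"})

raw_text = "First paragraph.\n\n\n\nSecond paragraph after many blank lines."
text = re.sub(r"\s+", " ", raw_text).strip()  # collapse whitespace that tends to yield empty sentences
doc = nlp(text)
print([(ent.text, ent.label_) for ent in doc.ents])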
