Zero and Few shot named entity & relationships recognition

Home Page: https://ibm.github.io/zshot

License: MIT License

Python 100.00%
few-shot few-shot-learning ned ner relation-extraction zero-shot zero-shot-learning spacy nlp ai

Zshot

Documentation: https://ibm.github.io/zshot

Source Code: https://github.com/IBM/zshot

Paper: https://aclanthology.org/2023.acl-demo.34/

Zshot is a highly customisable framework for performing Zero and Few shot named entity recognition.

It can be used to perform:

  • Mentions extraction: identify globally relevant mentions or mentions relevant to a given domain
  • Wikification: link textual mentions to entities in Wikipedia
  • Zero and Few Shot named entity recognition: perform NER from natural-language descriptions to generalize to unseen domains
  • Zero and Few Shot named relationship recognition
  • Visualization: Zero-shot NER and RE extraction

Requirements

  • Python 3.6+

  • spacy - Zshot relies on SpaCy for pipelining and visualization.

  • torch - PyTorch is required to run PyTorch models.

  • transformers - Required for pre-trained language models.

  • evaluate - Required for evaluation.

  • datasets - Required to evaluate over datasets (e.g.: OntoNotes).

Optional Dependencies

  • flair - Required to use the Flair mentions extractor and the TARS linker.
  • blink - Required to use Blink for linking to Wikipedia pages.

Installation

$ pip install zshot

Examples

Example Notebook
Installation and Visualization Open In Colab
Knowledge Extractor Open In Colab
Wikification Open In Colab
Custom Components Open In Colab
Evaluation Open In Colab

Zshot Approach

ZShot contains two different components, the mentions extractor and the linker.

Mentions Extractor

The mentions extractor detects the possible entities (a.k.a. mentions), which are then linked to a data source (e.g. Wikidata) by the linker.

Currently, 6 different mentions extractors are supported: SMXM, TARS, 2 based on SpaCy, and 2 based on Flair. The two versions for SpaCy and Flair are similar: one is based on Named Entity Recognition and Classification (NERC) and the other is based on linguistics (i.e. using Part-of-Speech (PoS) tagging and Dependency Parsing (DP)).

The NERC approach uses NERC models to detect all the entities that have to be linked. This approach depends on the model being used and on the entities it has been trained on. Depending on the use case and the target entities, it may not be the best approach, because entities that the NERC model does not recognize will not be linked.

The linguistic approach relies on the idea that mentions will usually be a syntagma or a noun. Therefore, this approach detects nouns that are included in a syntagma and that act as objects, subjects, etc. This approach does not depend on the entities the model was trained on (although its performance does depend on the model): a noun in a text is always a noun, regardless of the dataset the model has been trained on.

Linker

The linker links the detected entities to an existing set of labels. Some of the linkers, however, are end-to-end, i.e. they don't need the mentions extractor, as they detect and link the entities at the same time.

There are currently 4 linkers available; 2 of them are end-to-end and 2 are not:

Linker Name end-to-end Source Code Paper
Blink X Source Code Paper
GENRE X Source Code Paper
SMXM ✓ Source Code Paper
TARS ✓ Source Code Paper

Relations Extractor

The relations extractor extracts relations among the entities previously extracted by a linker.

Currently, there is only one Relation Extractor available:

Knowledge Extractor

The knowledge extractor performs the extraction and classification of named entities and the extraction of relations among them at the same time. A pipeline with this component doesn't need any mentions extractor, linker or relation extractor to work.

Currently, there is only one Knowledge Extractor available:

How to use it

  • Install requirements: pip install -r requirements.txt
  • Install a spacy pipeline to use it for mentions extraction: python -m spacy download en_core_web_sm
  • Create a file main.py with the pipeline configuration and entities definition (Wikipedia abstracts are usually a good starting point for descriptions):
import spacy

from zshot import PipelineConfig, displacy
from zshot.linker import LinkerRegen
from zshot.mentions_extractor import MentionsExtractorSpacy
from zshot.utils.data_models import Entity

nlp = spacy.load("en_core_web_sm")
nlp_config = PipelineConfig(
    mentions_extractor=MentionsExtractorSpacy(),
    linker=LinkerRegen(),
    entities=[
        Entity(name="Paris",
               description="Paris is located in northern central France, in a north-bending arc of the river Seine"),
        Entity(name="IBM",
               description="International Business Machines Corporation (IBM) is an American multinational technology corporation headquartered in Armonk, New York"),
        Entity(name="New York", description="New York is a city in U.S. state"),
        Entity(name="Florida", description="southeasternmost U.S. state"),
        Entity(name="American",
               description="American, something of, from, or related to the United States of America, commonly known as the United States or America"),
        Entity(name="Chemical formula",
               description="In chemistry, a chemical formula is a way of presenting information about the chemical proportions of atoms that constitute a particular chemical compound or molecule"),
        Entity(name="Acetamide",
               description="Acetamide (systematic name: ethanamide) is an organic compound with the formula CH3CONH2. It is the simplest amide derived from acetic acid. It finds some use as a plasticizer and as an industrial solvent."),
        Entity(name="Armonk",
               description="Armonk is a hamlet and census-designated place (CDP) in the town of North Castle, located in Westchester County, New York, United States."),
        Entity(name="Acetic Acid",
               description="Acetic acid, systematically named ethanoic acid, is an acidic, colourless liquid and organic compound with the chemical formula CH3COOH"),
        Entity(name="Industrial solvent",
               description="Acetamide (systematic name: ethanamide) is an organic compound with the formula CH3CONH2. It is the simplest amide derived from acetic acid. It finds some use as a plasticizer and as an industrial solvent."),
    ]
)
nlp.add_pipe("zshot", config=nlp_config, last=True)

text = "International Business Machines Corporation (IBM) is an American multinational technology corporation" \
       " headquartered in Armonk, New York, with operations in over 171 countries."

doc = nlp(text)
displacy.serve(doc, style="ent")

Run it

Run with

$ python main.py

Using the 'ent' visualizer
Serving on http://0.0.0.0:5000 ...

The script annotates the text using Zshot and uses Displacy to visualise the annotations.

Check it

Open your browser at http://127.0.0.1:5000 .

You will see the annotated sentence:

How to create a custom component

You can implement your own mentions_extractor or linker and use it with ZShot. To make this easier, ZShot provides base classes that you extend with your own code.

It is as simple as creating a new class that extends the base class (MentionsExtractor or Linker). You have to implement the predict method, which receives the SpaCy Documents and returns a list of zshot.utils.data_models.Span for each document.

This is a simple mentions_extractor that extracts as mentions all words that contain the letter s:

from typing import Iterable
import spacy
from spacy.tokens import Doc
from zshot import PipelineConfig
from zshot.utils.data_models import Span
from zshot.mentions_extractor import MentionsExtractor

class SimpleMentionExtractor(MentionsExtractor):
    def predict(self, docs: Iterable[Doc], batch_size=None):
        spans = [[Span(tok.idx, tok.idx + len(tok)) for tok in doc if "s" in tok.text] for doc in docs]
        return spans

new_nlp = spacy.load("en_core_web_sm")

config = PipelineConfig(
    mentions_extractor=SimpleMentionExtractor()
)
new_nlp.add_pipe("zshot", config=config, last=True)
text_acetamide = "CH2O2 is a chemical compound similar to Acetamide used in International Business " \
        "Machines Corporation (IBM)."

doc = new_nlp(text_acetamide)
print(doc._.mentions)

>>> [is, similar, used, Business, Machines]

How to evaluate ZShot

Evaluation is an important process for improving model performance, which is why ZShot lets you evaluate its components on two predefined datasets, OntoNotes and MedMentions, in a zero-shot version in which the entities of the test and validation splits do not appear in the train set.

The evaluation package contains all the functionality needed to evaluate the ZShot components. The main function is zshot.evaluation.zshot_evaluate.evaluate, which takes the SpaCy nlp model and the dataset to evaluate as input. It returns a str containing a table with the results of the evaluation. For instance, the evaluation of the TARS linker in ZShot on the OntoNotes validation set would be:

import spacy

from zshot import PipelineConfig
from zshot.linker import LinkerTARS
from zshot.evaluation.dataset import load_ontonotes_zs
from zshot.evaluation.zshot_evaluate import evaluate, prettify_evaluate_report
from zshot.evaluation.metrics.seqeval.seqeval import Seqeval

ontonotes_zs = load_ontonotes_zs('validation')


nlp = spacy.blank("en")
nlp_config = PipelineConfig(
    linker=LinkerTARS(),
    entities=ontonotes_zs.entities
)

nlp.add_pipe("zshot", config=nlp_config, last=True)

evaluation = evaluate(nlp, ontonotes_zs, metric=Seqeval())
prettify_evaluate_report(evaluation)

Citation

@inproceedings{picco-etal-2023-zshot,
    title = "Zshot: An Open-source Framework for Zero-Shot Named Entity Recognition and Relation Extraction",
    author = "Picco, Gabriele  and
      Martinez Galindo, Marcos  and
      Purpura, Alberto  and
      Fuchs, Leopold  and
      Lopez, Vanessa  and
      Hoang, Thanh Lam",
    booktitle = "Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 3: System Demonstrations)",
    month = jul,
    year = "2023",
    address = "Toronto, Canada",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2023.acl-demo.34",
    doi = "10.18653/v1/2023.acl-demo.34",
    pages = "357--368",
    abstract = "The Zero-Shot Learning (ZSL) task pertains to the identification of entities or relations in texts that were not seen during training. ZSL has emerged as a critical research area due to the scarcity of labeled data in specific domains, and its applications have grown significantly in recent years. With the advent of large pretrained language models, several novel methods have been proposed, resulting in substantial improvements in ZSL performance. There is a growing demand, both in the research community and industry, for a comprehensive ZSL framework that facilitates the development and accessibility of the latest methods and pretrained models.In this study, we propose a novel ZSL framework called Zshot that aims to address the aforementioned challenges. Our primary objective is to provide a platform that allows researchers to compare different state-of-the-art ZSL methods with standard benchmark datasets. Additionally, we have designed our framework to support the industry with readily available APIs for production under the standard SpaCy NLP pipeline. Our API is extendible and evaluable, moreover, we include numerous enhancements such as boosting the accuracy with pipeline ensembling and visualization utilities available as a SpaCy extension.",
}

zshot's Issues

[Bug] Error on using LinkerBlink in Google Colab

Summary

Describe the bug
I tried using LinkerBlink as the linker on Google Colab. I also installed the blink module using the pip command. I restarted the instance and tried again, but I am getting the same error:

Exception Traceback (most recent call last)
in
1 extractor = MentionsExtractorSpacy(ExtractorType.POS)
----> 2 linker = LinkerBlink()

/usr/local/lib/python3.7/dist-packages/zshot/linker/linker_blink.py in __init__(self, index)
65
66 if not pkgutil.find_loader("blink"):
---> 67 raise Exception("Blink module not installed. You need to install blink in order to use the Blink Linker."
68 "Install it with: pip install -e git+https://github.com/facebookresearch/BLINK.git#egg"
69 "=BLINK")

Exception: Blink module not installed. You need to install blink in order to use the Blink Linker.Install it with: pip install -e git+https://github.com/facebookresearch/BLINK.git#egg=BLINK

This error does not come up when you use LinkerTARS

To Reproduce
Steps to reproduce the behavior:

  1. pip install zshot, spacy[transformers], Blink (!pip install -e git+https://github.com/facebookresearch/BLINK.git#egg=BLINK)
  2. Restart the instance
  3. Set up the code as per the instructions in the usage manual and use linker = LinkerBlink(). The mentions extractor is ExtractorType.POS.

Expected behavior
The linker is supposed to load the Blink linker.

Adding SMXM model as a mentions extractor

Scenario summary

  • SMXM model is used right now as linker, which allows using descriptions of entities to improve the zero-shot linking.
  • The user may want to extract only certain mentions or mentions related to a topic.

Features proposed:

  • Add a new set of entities in the configuration for the mentions extractor.
  • Use the SMXM model as Mentions Extractor to be able to use descriptions for mentions.

[Bug] Conversion of spans to ents not working with large entities

Summary

Describe the bug
The linkers save the raw predictions in doc._.spans. When a span contains multiple tokens, the conversion to spacy.Span used to store it in the doc.ents field does not work correctly: it converts only the first and last tokens.

To Reproduce

import spacy

from zshot import Zshot, PipelineConfig
from zshot.utils.data_models import Entity
from zshot.linker import LinkerSMXM

nlp = spacy.blank('es')

config = PipelineConfig(
    entities=[
        Entity(name="company", description="The name of a company"),
        Entity(name="location", description="A physical location"),
        Entity(name="chemical compound", description="any substance composed of identical molecules consisting of atoms of two or more chemical elements.")
    ], 
    linker=LinkerSMXM()
)
nlp.add_pipe("zshot", config=config, last=True)


text_acetamide = "CH2O2 is a chemical compound similar to Acetamide used in International Business " \
        "Machines Corporation (IBM) to create new materials that act like PAGs."

doc = nlp(text_acetamide)
print(doc._.spans)
print(doc.ents)

Result:

[
    chemical compound, 0, 5, 0.589776873588562,
    chemical compound, 40, 49, 0.5900241732597351, 
    company, 58, 101, 0.8631868958473206, 
    company, 103, 106, 0.697799563407898
]
(
    CH2O2, 
    Acetamide, 
    International, 
    Corporation, 
    IBM
)

As can be seen above, the third span goes from character 58 to 101, which is 'International Business Machines Corporation', but in doc.ents only the tokens 'International' and 'Corporation' are stored.

Expected behavior
The expected result should be:

(
    CH2O2, 
    Acetamide, 
    International, 
    Business,
    Machines,
    Corporation, 
    IBM
)

Or:

(
    CH2O2, 
    Acetamide, 
    International Business Machines Corporation, 
    IBM
)

Where all the tokens of the span have been detected.
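
The expected behavior can be illustrated without SpaCy. The following is a minimal sketch, not the ZShot implementation: tokens_in_char_span and whitespace_tokens are hypothetical helpers, and tokens are modelled as plain (start offset, text) pairs produced by whitespace splitting. It shows that a character span should map to every token it covers, not only the first and the last one.

```python
from typing import List, Tuple

Token = Tuple[int, str]  # (character offset, token text); a stand-in for a SpaCy token


def whitespace_tokens(text: str) -> List[Token]:
    """Split on single spaces and remember where each token starts."""
    tokens, pos = [], 0
    for word in text.split(" "):
        if word:
            tokens.append((pos, word))
        pos += len(word) + 1  # +1 for the space that was removed by split
    return tokens


def tokens_in_char_span(tokens: List[Token], start: int, end: int) -> List[str]:
    """Return every token that lies fully inside the character range [start, end)."""
    return [text for idx, text in tokens if idx >= start and idx + len(text) <= end]


text = "Acetamide used in International Business Machines Corporation (IBM)"
mention = "International Business Machines Corporation"
start = text.index(mention)
end = start + len(mention)

covered = tokens_in_char_span(whitespace_tokens(text), start, end)
print(covered)
```

All four tokens of the company name are returned, which corresponds to the second expected result above once they are merged into a single entity.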

[Bug] Add Softmax to LinkerRegen scores

Summary

For evaluation, ensembling and user purposes, the linkers should return a valid and understandable score (for instance from 0 to 1). All of them should return comparable scores. LinkerSMXM uses Softmax, so LinkerRegen should also use Softmax for the scores.
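
As an illustration of the normalisation being requested, here is a minimal softmax sketch in plain Python. The raw scores are made-up values, and how LinkerRegen actually produces its scores is not shown in this issue, so treat this only as a sketch of the idea.

```python
import math
from typing import List


def softmax(scores: List[float]) -> List[float]:
    """Map arbitrary real-valued scores to values in (0, 1) that sum to 1."""
    if not scores:
        return []
    shift = max(scores)  # subtracting the maximum avoids overflow in exp
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


raw_scores = [-0.4, -2.3, -5.1]  # hypothetical unnormalised scores, one per candidate label
probs = softmax(raw_scores)
print(probs)
```

After normalisation the ordering of the candidates is preserved, and the values can be compared with the scores of other linkers that already return probabilities.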

[Bug] flair POS mentions extractor not working

Summary

With torch>=2.0 there is an existing bug in flair that makes flair POS mentions extractor fail: flairNLP/flair#3187

Describe the bug
The POS models for SequenceTagger in the Huggingface Hub are not updated for torch>=2.0

E       AttributeError: 'dict' object has no attribute 'embedding_length'

To Reproduce

import spacy

from zshot import PipelineConfig
from zshot.mentions_extractor import MentionsExtractorFlair
from zshot.mentions_extractor.mentions_extractor_flair import ExtractorType

nlp = spacy.blank("en")

config_zshot = PipelineConfig(mentions_extractor=MentionsExtractorFlair(ExtractorType.POS))
nlp.add_pipe("zshot", config=config_zshot, last=True)
assert "zshot" in nlp.pipe_names
doc = nlp("Test example text")

Expected behavior
It should work without errors.

[Bug] TARS Models not working with transformers > 4.31

Summary

With recent updates in transformers TARS Linker and MentionsExtractor are not working:

RuntimeError: Error(s) in loading state_dict for RobertaModel:
           	Unexpected key(s) in state_dict: "embeddings.position_ids".

Add vocabulary exact matching for entities

Scenario summary

The Entity class has a vocabulary field that is not currently used.

Proposed solution

Add exact string matching that uses the values defined in the entity vocabulary to identify entities.
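
A minimal sketch of what such matching could look like is given below. VocabEntity and match_vocabulary are hypothetical names used only for illustration; VocabEntity is a simplified stand-in for the real Entity class, and the word-boundary handling is one possible design choice rather than the project's.

```python
import re
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class VocabEntity:
    """Simplified stand-in for an entity with a list of surface forms."""
    name: str
    vocabulary: List[str] = field(default_factory=list)


def match_vocabulary(text: str, entities: List[VocabEntity]) -> List[Tuple[int, int, str]]:
    """Return (start, end, entity name) for every exact vocabulary hit in text."""
    matches = []
    for entity in entities:
        for term in entity.vocabulary:
            # Require non-word characters (or text boundaries) around the term,
            # so "IBM" does not match inside a longer word.
            pattern = r"(?<!\w)" + re.escape(term) + r"(?!\w)"
            for m in re.finditer(pattern, text):
                matches.append((m.start(), m.end(), entity.name))
    return sorted(matches)


entities = [
    VocabEntity("company", ["IBM", "International Business Machines Corporation"]),
    VocabEntity("location", ["Armonk"]),
]
text = "IBM is headquartered in Armonk."
found = match_vocabulary(text, entities)
print([(text[s:e], name) for s, e, name in found])
```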

Reduce T5 model size and enhance performance

Scenario summary

Current inference with T5 models is slow.

Proposed solution

Investigate and implement a solution to reduce model size and speed up inference. Some of the ideas to consider:

[Bug]

Summary

Describe the bug

ImportError                               Traceback (most recent call last)

<ipython-input-2-893897ebf072> in <cell line: 7>()
      5 from zshot.mentions_extractor import MentionsExtractorSpacy
      6 from zshot.linker import LinkerRegen
----> 7 from zshot.linker.linker_regen.utils import load_wikipedia_trie, spans_to_wikipedia, \
      8                                             load_dbpedia_trie, spans_to_dbpedia

ImportError: cannot import name 'spans_to_wikipedia' from 'zshot.linker.linker_regen.utils' (/usr/local/lib/python3.10/dist-packages/zshot/linker/linker_regen/utils.py)

To Reproduce
Run the current version of the Wikification example python notebook from the docs homepage. It looks like that function was removed from the utils file in a recent commit.

Expected behavior
It should run without an error

Few shot learning

Hi, thanks for the great software! I have a dataset of custom entities, and I want to use Zshot to detect them. How can I do few-shot learning with your tool? Also, is there a way to have my own linker that searches my own data? Would you mind giving me a code snippet for the few-shot case?
thanks!

[Bug] For information only: AppData use of @lru_cache decorator causes Error with Python3.7

Summary

AppData module uses @lru_cache decorator without ( ) which apparently is not compatible with Python < 3.8 and this causes an Error when importing config.py which uses AppData.
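
For context, support for the bare @lru_cache form (without parentheses) was added to functools in Python 3.8; on Python 3.7 the decorator has to be called. The sketch below shows the called form, which works on both versions. The function app_data_dir and its return value are hypothetical and only stand in for whatever AppData caches.

```python
from functools import lru_cache


@lru_cache(maxsize=None)  # called form: accepted by Python 3.7 as well as 3.8+
def app_data_dir(app_name: str) -> str:
    """Hypothetical cached lookup; the real AppData logic is not shown here."""
    return "/tmp/" + app_name


first = app_data_dir("zshot")
second = app_data_dir("zshot")  # served from the cache
print(app_data_dir.cache_info())
```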

To Reproduce
import zshot

Expected behavior
It should be imported without error.

Fix
It may be worth stating that Python >= 3.8 is required until AppData fixes the problem.

Remark
I am not a CS so my analysis of this problem could be totally wrong. If so, I would be thankful for anyone giving me a by-pass when working with Python 3.7

Thanks

Best regards

Jerome

[Bug] Mentions are overridden with Entities when both have first and last item in common

Summary

When using both mentions and entities in PipelineConfig, if they have the same first and last element the hash will be the same, thus the mentions will be overridden with the entities.

To Reproduce

import spacy
from zshot import Zshot, PipelineConfig
from zshot.utils.data_models import Entity

nlp = spacy.blank("en")

nlp_config = PipelineConfig(
    mentions=[
        Entity(name="first entity", description="First Entity"),
        Entity(name="second entity", description="Second Entity"),
        Entity(name="third entity", description="Third Entity")
    ],
    entities=[
        Entity(name="first entity", description="First Entity"),
        Entity(name="other second entity", description="Different Second Entity"),
        Entity(name="third entity", description="Third Entity")    
    ]
)
nlp.add_pipe("zshot", config=nlp_config, last=True)

print(nlp.get_pipe('zshot').mentions)

This will print:

[Entity(name='first entity', description='First Entity', vocabulary=None),
 Entity(name='other second entity', description='Different Second Entity', vocabulary=None),
 Entity(name='third entity', description='Third Entity', vocabulary=None)]

Expected behavior
It should print:

[Entity(name='first entity', description='First Entity', vocabulary=None),
 Entity(name='second entity', description='Second Entity', vocabulary=None),
 Entity(name='third entity', description='Third Entity', vocabulary=None)]
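
The collision described in this issue can be reproduced in isolation. The sketch below is an assumption about the mechanism, not the ZShot code: weak_key and full_key are hypothetical key functions, and entities are modelled as (name, description) tuples. A key built only from the first and last items cannot tell the two lists apart, while a key built from every item can.

```python
from typing import Dict, List, Tuple

Item = Tuple[str, str]  # (name, description); a stand-in for an Entity


def weak_key(items: List[Item]) -> Tuple[Item, Item]:
    """Hypothetical key that only inspects the first and last item."""
    return (items[0], items[-1])


def full_key(items: List[Item]) -> Tuple[Item, ...]:
    """Key built from every item, so a change in the middle is visible."""
    return tuple(items)


mentions = [
    ("first entity", "First Entity"),
    ("second entity", "Second Entity"),
    ("third entity", "Third Entity"),
]
entities = [
    ("first entity", "First Entity"),
    ("other second entity", "Different Second Entity"),
    ("third entity", "Third Entity"),
]

cache: Dict[tuple, List[Item]] = {}
cache[weak_key(mentions)] = mentions
cache[weak_key(entities)] = entities  # same key: silently replaces the mentions
collided = cache[weak_key(mentions)]

safe_cache: Dict[tuple, List[Item]] = {}
safe_cache[full_key(mentions)] = mentions
safe_cache[full_key(entities)] = entities  # distinct key: both lists survive
print(collided[1][0], "|", safe_cache[full_key(mentions)][1][0])
```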

Add device option to the Pipeline configuration

Scenario summary

Currently most of the models run on GPU if available, but we need a standard, common way to manage the device.

Proposed solution

Add device option to the Pipeline configuration

[Bug] Correctly render relations visualisation in notebooks

Summary

Describe the bug
The relations visualisation should render correctly in notebooks, but the render function displays the HTML as a string.

To Reproduce
Steps to reproduce the behavior:

import spacy
from zshot import PipelineConfig, displacy
from zshot.linker import LinkerSMXM
from zshot.utils.data_models import Entity, Relation
from zshot.relation_extractor import RelationsExtractorZSRC

config = PipelineConfig(
 entities=[
    Entity(name="company", description="The name of a company"),
    Entity(name="location", description="A physical location"),
    Entity(name="chemical compound", description="Any of a large class of chemical compounds in which one or more \
           atoms of carbon are covalently linked to atoms of other elements, most commonly hydrogen, oxygen, or nitrogen")
 ], 
 relations=[
    Relation(name="acronym", description="Is the acronym of"),
    Relation(name="parent", description="Is the parent of someone")
 ], 
 linker=LinkerSMXM(),
 relations_extractor=RelationsExtractorZSRC(),
)
nlp = spacy.blank("en")
nlp.add_pipe("zshot", config=config, last=True)

text = "CH2O2 is a chemical compound similar to Acetamide used in International \
Business Machines Corporation (IBM) to create new materials that act like PAGs."
doc = nlp(text)

Expected behavior

Correctly render the HTML in the notebook

Refactor mentions extractor

Summary

The entities linked are stored in the ._.spans field, using zshot.utils.data_models.span.Span. However, the mentions are stored in the ._.mentions field using spacy.tokens.Span instead. Refactor this to use the zshot.utils.data_models.span.Span as in the linkers.

[Bug] ZShot Displacy doesn't return the markup

Summary

The displacy.render method should return the rendered HTML markup, but currently isn't returning anything

Describe the bug
The displacy.render method should return the rendered HTML markup when it is not running in a Jupyter Notebook, but currently it doesn't return anything.

To Reproduce

import spacy

from zshot.utils.data_models import Entity
from zshot.linker import LinkerSMXM
from zshot import PipelineConfig
from zshot.utils.displacy import displacy

nlp = spacy.blank('en')
config = PipelineConfig(
    entities=[
        Entity(name="company", description="The name of a company"),
        Entity(name="location", description="A physical location"),
        Entity(name="chemical compound", description="any substance composed of identical molecules consisting of atoms of two or more chemical elements."),
        Entity(name="organic compound", description="Any of a large class of chemical compounds in which one or more atoms of carbon are covalently linked to atoms of other elements, most commonly hydrogen, oxygen, or nitrogen. The few carbon-containing compounds not classified as organic include carbides, carbonates, and cyanides. See chemical compound.")
    ], 
    linker=LinkerSMXM()
)
nlp.add_pipe("zshot", config=config, last=True)

text_acetamide = "CH2O2 is a chemical compound similar to Acetamide used in International Business " \
        "Machines Corporation (IBM) to create new materials that act like PAGs."

doc = nlp(text_acetamide)
res = displacy.render(doc, style="ent", jupyter=False)
print(res is None)
True

Expected behavior
It should return the HTML markup of the visualization, as happens when the same pipeline is rendered with SpaCy's own displacy:

from spacy import displacy

nlp = spacy.blank('en')
config = PipelineConfig(
    entities=[
        Entity(name="company", description="The name of a company"),
        Entity(name="location", description="A physical location"),
        Entity(name="chemical compound", description="any substance composed of identical molecules consisting of atoms of two or more chemical elements."),
        Entity(name="organic compound", description="Any of a large class of chemical compounds in which one or more atoms of carbon are covalently linked to atoms of other elements, most commonly hydrogen, oxygen, or nitrogen. The few carbon-containing compounds not classified as organic include carbides, carbonates, and cyanides. See chemical compound.")
    ], 
    linker=LinkerSMXM()
)
nlp.add_pipe("zshot", config=config, last=True)

text_acetamide = "CH2O2 is a chemical compound similar to Acetamide used in International Business " \
        "Machines Corporation (IBM) to create new materials that act like PAGs."

doc = nlp(text_acetamide)
res = displacy.render(doc, style="ent", jupyter=False)
print(res is None)
False

[Bug] Error while visualizing results without entities extracted

Summary

Describe the bug
IndexError while visualizing results without entities extracted in the "rel" mode:

File ~/zshot/utils/displacy/relations_render.py:27, in parse_rels(doc)
     25         tokens_span.append((filtered_spans[idx - 1].end, span.start, None))
     26     tokens_span.append((span.start, span.end, span))
---> 27 if filtered_spans[-1].end < len(doc.text):
     28     tokens_span.append((filtered_spans[-1].end, len(doc.text), None))
     30 words = []

IndexError: list index out of range

To Reproduce

import spacy

from zshot import PipelineConfig, displacy
from zshot.relation_extractor import RelationsExtractorZSRC
from zshot.utils.data_models import Relation

nlp = spacy.load("en_core_web_sm")
nlp_config = PipelineConfig(
    relations_extractor=RelationsExtractorZSRC(thr=0.1),
    relations=[
        Relation(name='located in', description="If something like a person, a building, or a company is located in a particular place, like a city, country of any other physical location, it is present or has been built there")
    ]
)
nlp.add_pipe("zshot", config=nlp_config, last=True)

text = "IBM headquarters are located in Armonk."
doc = nlp(text)
displacy.render(doc, style='rel')

Expected behavior
Show sentence without entities and relations
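
The traceback shows the last element of an empty list being accessed. A minimal sketch of a guard follows; split_by_spans is a hypothetical re-creation of the segment-splitting step, written without SpaCy and assuming non-overlapping spans given as (start, end, label) character offsets. It is not the code in relations_render.py.

```python
from typing import List, Optional, Tuple

Segment = Tuple[int, int, Optional[str]]  # (start, end, label or None for plain text)


def split_by_spans(text: str, spans: List[Tuple[int, int, str]]) -> List[Segment]:
    """Split text into labelled and unlabelled character segments."""
    if not spans:
        # Nothing was extracted: the whole text is one unlabelled segment.
        return [(0, len(text), None)] if text else []
    segments: List[Segment] = []
    cursor = 0
    for start, end, label in sorted(spans):
        if start > cursor:
            segments.append((cursor, start, None))  # plain text before the span
        segments.append((start, end, label))
        cursor = end
    if cursor < len(text):
        segments.append((cursor, len(text), None))  # trailing plain text
    return segments


text = "IBM headquarters are located in Armonk."
print(split_by_spans(text, []))
print(split_by_spans(text, [(0, 3, "company")]))
```

With the early return in place, a document without extracted entities yields a single plain-text segment instead of raising IndexError.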

[Bug] Evaluation result is being overwritten

Summary

In the evaluation, we update a result dict with the metrics. However, if both linker and mentions_extractor are defined the result will be overwritten:

if nlp.get_pipe("zshot").linker:
    pipe = LinkerPipeline(nlp, batch_size)
    result.update(
        {
            field_name: {
                'linker': linker_evaluator.compute(pipe, dataset[split].select([1, 4]), metric=metric)
            }
        }
    )

if nlp.get_pipe("zshot").mentions_extractor:
    pipe = MentionsExtractorPipeline(nlp, batch_size)
    result.update(
        {
            field_name: {
                'mentions_extractor': mentions_extractor_evaluator.compute(pipe, dataset[split].select([1, 4]),
                                                                           metric=metric)
            }
        }
    )
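
The overwrite comes from dict.update replacing the whole value stored under field_name rather than merging the nested dictionaries. A minimal sketch of the problem and of one possible fix is shown below; the key "ontonotes" and the metric values are made up for illustration.

```python
field_name = "ontonotes"  # hypothetical dataset key

# Behaviour described in the issue: the second update replaces the nested dict.
result = {}
result.update({field_name: {"linker": {"f1": 0.5}}})
result.update({field_name: {"mentions_extractor": {"f1": 0.7}}})
print(result)  # the 'linker' entry is gone

# One possible fix: create the nested dict once, then add keys to it.
fixed = {}
fixed.setdefault(field_name, {})["linker"] = {"f1": 0.5}
fixed.setdefault(field_name, {})["mentions_extractor"] = {"f1": 0.7}
print(fixed)  # both entries are kept
```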

Extend Regen linker to support Wikification providing precomputed prefix-tree

Scenario summary

The Regen linker can be used to link to any knowledge graph, but currently, to link to Wikipedia, users have to pre-compute the prefix index used for the constrained beam search.

Proposed solution

  • Provide a simple utility to create a prefix tree for a given KG
  • Add a prefix tree configuration option to the Regen linker
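
To make the idea of a prefix tree for constrained decoding concrete, here is a minimal sketch. PrefixTree is a hypothetical class, not the utility proposed in this issue, and the token ids are made up. Each KG label is stored as a sequence of token ids, and at every decoding step the tree returns the ids that may follow the tokens generated so far.

```python
from typing import Dict, Iterable, List


class PrefixTree:
    """Trie over token-id sequences, stored as nested dictionaries."""

    def __init__(self, sequences: Iterable[List[int]] = ()):
        self.root: Dict[int, dict] = {}
        for sequence in sequences:
            self.add(sequence)

    def add(self, sequence: List[int]) -> None:
        """Insert one tokenised label into the tree."""
        node = self.root
        for token_id in sequence:
            node = node.setdefault(token_id, {})

    def allowed_next(self, prefix: List[int]) -> List[int]:
        """Token ids that can extend prefix; empty if prefix is not in the tree."""
        node = self.root
        for token_id in prefix:
            if token_id not in node:
                return []
            node = node[token_id]
        return sorted(node)


# Three hypothetical labels, already tokenised; 2 plays the role of an end token.
tree = PrefixTree([[5, 7, 2], [5, 8, 2], [9, 2]])
print(tree.allowed_next([]), tree.allowed_next([5]), tree.allowed_next([5, 7, 2]))
```

A function like allowed_next is the kind of callback a constrained beam search consults at each step. Note that it returns an empty list once a label is complete or when the prefix leaves the tree, a case the decoder has to handle.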

Add LinkerEnsemble

Scenario summary

Add a linker ensemble that allows combining different linkers and different descriptions to improve performance.

Proposed solution

Implement LinkerEnsemble, which takes as input the list of linkers to use, the strategy (one of: max, count) and the threshold (used to decide which entities to keep).

It groups the entities by name, builds combinations of them, runs each linker on every resulting set of entities, and finally merges the results.

Example:

import spacy
from zshot import PipelineConfig
from zshot.linker import LinkerSMXM, LinkerTARS
from zshot.linker.linker_ensemble import LinkerEnsemble
from zshot.utils.data_models import Entity
from zshot import displacy

nlp = spacy.blank("en")

config = PipelineConfig(
    entities=[
        Entity(name="fruits", description="The sweet and fleshy product of a tree or other plant."),
        Entity(name="fruits", description="Names of fruits such as banana, oranges"),
        Entity(name="vitamin", description="A nutrient that the body needs in small amounts to function " \
                                           "and stay healthy"),
        Entity(name="vitamin", description="Vitamins are substances that our bodies need to develop and " \
                                           "function normally")
    ],
    linker=LinkerEnsemble(
        linkers=[
            LinkerSMXM(),
            LinkerTARS(),
        ],
        threshold=0.25
    )
)

nlp.add_pipe("zshot", config=config, last=True)
# annotate a piece of text
doc = nlp('Apple or oranges have a lot of vitamin C.')

# Visualize the result
displacy.render(doc, style='ent')
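
The max and count strategies can be sketched independently of the linkers. The function below is an interpretation, not the LinkerEnsemble implementation: it assumes that max keeps the label with the highest single score, that count keeps the label predicted most often and scores it by its share of the votes, and that a label is dropped when its score falls below the threshold. The name ensemble and the prediction format are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Optional, Tuple

Prediction = Tuple[str, float]  # (label, score) for one mention from one linker run


def ensemble(predictions: List[Prediction], strategy: str = "max",
             threshold: float = 0.5) -> Optional[str]:
    """Combine several predictions for the same mention into one label, or None."""
    if not predictions:
        return None
    scores_by_label: Dict[str, List[float]] = defaultdict(list)
    for label, score in predictions:
        scores_by_label[label].append(score)
    if strategy == "max":
        # Highest single score seen for each label.
        ranked = {label: max(scores) for label, scores in scores_by_label.items()}
    elif strategy == "count":
        # Share of the votes each label received.
        ranked = {label: len(scores) / len(predictions)
                  for label, scores in scores_by_label.items()}
    else:
        raise ValueError("strategy must be 'max' or 'count'")
    best = max(ranked, key=ranked.get)
    return best if ranked[best] >= threshold else None


votes = [("fruits", 0.9), ("vitamin", 0.4), ("vitamin", 0.45)]
print(ensemble(votes, "max"), ensemble(votes, "count"))
```

In this made-up example the two strategies disagree: max follows the single confident prediction, while count follows the majority.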

[Bug] ValueError: `prefix_allowed_tokens_fn` with new version of `transformers` package

Summary

Describe the bug
With transformers==4.37.2 installed, I am getting this error:

File ~/workspace/mlops-talk-llm-kg/venv/lib/python3.11/site-packages/transformers/generation/logits_process.py:1235, in PrefixConstrainedLogitsProcessor.__call__(self, input_ids, scores)
   1233         prefix_allowed_tokens = self._prefix_allowed_tokens_fn(batch_id, sent)
   1234         if len(prefix_allowed_tokens) == 0:
-> 1235             raise ValueError(
   1236                 f"`prefix_allowed_tokens_fn` returned an empty list for batch ID {batch_id}."
   1237                 f"This means that the constraint is unsatisfiable. Please check your implementation"
   1238                 f"of `prefix_allowed_tokens_fn` "
   1239             )
   1240         mask[batch_id * self._num_beams + beam_id, prefix_allowed_tokens] = 0
   1242 return scores + mask

ValueError: `prefix_allowed_tokens_fn` returned an empty list for batch ID 0.This means that the constraint is unsatisfiable. Please check your implementationof `prefix_allowed_tokens_fn` 

If I roll back to the version of transformers in the Readme's Google Colabs, transformers==4.35.2, I don't have any issues.

To Reproduce

import spacy
from zshot import PipelineConfig, displacy
from zshot.linker import LinkerRegen
from zshot.mentions_extractor import MentionsExtractorSpacy
from zshot.utils.data_models import Entity

nlp = spacy.load('en_core_web_sm')

# zero shot definition of entities
nlp_config = PipelineConfig(
    mentions_extractor=MentionsExtractorSpacy(),
    linker=LinkerRegen(),
    entities=[
        Entity(name='Paris',
               description='Paris is located in northern central France, in a north-bending arc of the river Seine'),
        Entity(name='IBM',
               description='International Business Machines Corporation (IBM) is an American multinational technology corporation headquartered in Armonk, New York'),
        Entity(name='New York', description='New York is a city in U.S. state'),
        Entity(name='Florida', description='southeasternmost U.S. state'),
        Entity(name='American',
               description='American, something of, from, or related to the United States of America, commonly known as the United States or America'),
        Entity(name='Chemical formula',
               description='In chemistry, a chemical formula is a way of presenting information about the chemical proportions of atoms that constitute a particular chemical compound or molecule'),
        Entity(name='Acetamide',
               description='Acetamide (systematic name: ethanamide) is an organic compound with the formula CH3CONH2. It is the simplest amide derived from acetic acid. It finds some use as a plasticizer and as an industrial solvent.'),
        Entity(name='Armonk',
               description='Armonk is a hamlet and census-designated place (CDP) in the town of North Castle, located in Westchester County, New York, United States.'),
        Entity(name='Acetic Acid',
               description='Acetic acid, systematically named ethanoic acid, is an acidic, colourless liquid and organic compound with the chemical formula CH3COOH'),
        Entity(name='Industrial solvent',
               description='Acetamide (systematic name: ethanamide) is an organic compound with the formula CH3CONH2. It is the simplest amide derived from acetic acid. It finds some use as a plasticizer and as an industrial solvent.'),
    ]
)
nlp.add_pipe('zshot', config=nlp_config, last=True)

text = 'International Business Machines Corporation (IBM) is an American multinational technology corporation' \
        ' headquartered in Armonk, New York, with operations in over 171 countries.'

doc = nlp(text)
displacy.render(doc, style="ent", jupyter=True)

Expected behavior
The pipeline should run without raising an error.

TypeError: serve() got an unexpected keyword argument 'jupyter'

Hello
I am trying to run the example for Zero-Shot Entity Recognition and got the following error in main.py:

2023-01-01 09:44:01.465884: I tensorflow/core/util/util.cc:169] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
Traceback (most recent call last):
  File "main.py", line 39, in <module>
    displacy.serve(doc, style="ent")
  File "/content/zshot/zshot/utils/displacy/displacy.py", line 36, in serve
    return displacy._call_displacy(docs, style, "serve", options=options, **kwargs)
  File "/content/zshot/zshot/utils/displacy/displacy.py", line 76, in _call_displacy
    return disp(docs, style=style, options=options, jupyter=jupyter, **kwargs)
TypeError: serve() got an unexpected keyword argument 'jupyter'

Any help please?
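The traceback shows the wrapper forwarding jupyter=jupyter to whichever display function is called, while serve() does not accept that keyword. One possible guard, sketched here with stand-in functions rather than spaCy's or zshot's real ones, is to drop keyword arguments the target function does not declare. The `call_supported` helper is hypothetical.

```python
import inspect


def serve(docs, style="dep", options=None):
    """Stand-in for a display function without a `jupyter` parameter."""
    return f"serve:{style}"


def render(docs, style="dep", options=None, jupyter=None):
    """Stand-in for a display function that does accept `jupyter`."""
    return f"render:{style}:jupyter={jupyter}"


def call_supported(func, *args, **kwargs):
    """Call func, dropping keyword arguments it does not declare."""
    params = inspect.signature(func).parameters
    accepts_var_kw = any(
        p.kind is inspect.Parameter.VAR_KEYWORD for p in params.values()
    )
    if not accepts_var_kw:
        # Keep only the keywords that appear in the function signature.
        kwargs = {k: v for k, v in kwargs.items() if k in params}
    return func(*args, **kwargs)


print(call_supported(serve, ["doc"], style="ent", jupyter=False))
print(call_supported(render, ["doc"], style="ent", jupyter=True))
```

The same wrapper call then works for both functions, because jupyter is passed only where it is supported.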

[Bug] Can't import zshot_evaluate

Summary

Bug when trying to import a function from zshot_evaluate

Describe the bug
Importing from zshot_evaluate raises the following error:

TypeError: 'type' object is not subscriptable

in

list[PrettyTable].

To Reproduce

from zshot.evaluation.zshot_evaluate import evaluate
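Subscripting the built-in list in an annotation is only supported from Python 3.9 onwards, so this error is consistent with running on an older interpreter. A minimal sketch of a backwards-compatible annotation, using a stand-in `PrettyTable` class and a hypothetical `build_tables` function:

```python
from typing import List


class PrettyTable:
    """Stand-in for prettytable.PrettyTable so the sketch has no dependencies."""

    def __init__(self, title: str) -> None:
        self.title = title


def build_tables(titles: List[str]) -> List[PrettyTable]:
    # typing.List is accepted by older interpreters, unlike list[PrettyTable].
    return [PrettyTable(t) for t in titles]


tables = build_tables(["ner", "re"])
print([t.title for t in tables])
```

On Python 3.7 and 3.8, adding `from __future__ import annotations` at the top of the module is another option, since annotations are then no longer evaluated at definition time.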

[Bug] ignore_verifications deprecated in datasets 2.9.1

Summary

Describe the bug
The parameter ignore_verifications is deprecated in datasets 2.9.1 and will be removed in 3.0.0.

Proposed solution
Replace the parameter ignore_verifications with verification_mode='no_checks'.
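The replacement can be sketched as a small keyword-argument migration. The helper name `migrate_verification_kwargs` and the dataset name are hypothetical. Only the ignore_verifications=True case is mapped; for False the key is dropped so the library default applies.

```python
from typing import Any, Dict


def migrate_verification_kwargs(kwargs: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy of kwargs without the deprecated parameter (hypothetical helper)."""
    migrated = dict(kwargs)
    if migrated.pop("ignore_verifications", False):
        # ignore_verifications=True becomes verification_mode='no_checks'.
        migrated.setdefault("verification_mode", "no_checks")
    return migrated


old = {"path": "some_dataset", "ignore_verifications": True}
print(migrate_verification_kwargs(old))
```

The migrated dictionary can then be passed on as keyword arguments to the dataset loading call in place of the old one.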

[Bug] TypeError when printing DatasetWithEntities

Summary

A TypeError is raised when printing a DatasetWithEntities.

Describe the bug
Printing, displaying or taking the representation of a DatasetWithEntities raises the following error:

TypeError: 'Entity' object is not subscriptable

To Reproduce

from zshot.evaluation.dataset import load_ontonotes

dataset = load_ontonotes()
print(dataset)

Expected behavior
Printing should show a summary of the DatasetDict with the splits, columns and number of rows per split.

Add support for torch 2.0

Scenario summary

Apple Silicon users cannot run zshot because the torch dependency is pinned strictly below 2.0 ("torch>=1,<2").

Proposed solution

Update torch to 2.0 and release a new wheel on PyPI 🚀
