
davidberenstein1957 / crosslingual-coreference

A multi-lingual approach to AllenNLP CoReference Resolution along with a wrapper for spaCy.

License: MIT License

Python 100.00%
natural-language-processing python spacy nlp coreference-resolution coreference hacktoberfest

crosslingual-coreference's Introduction

Hi there 👋

From failing to study medicine ➡️ BSc industrial engineer ➡️ MSc computer scientist.
Life can be strange, so better enjoy it.
I'm sure I do by: 👨🏽‍🍳 cooking, 👨🏽‍💻 coding, 🏆 committing.

Conference slides 📖

employers 👨🏽‍💻

  • Argilla (2022-current) - data annotation and monitoring for enterprise NLP
  • Pandora Intelligence (2020-2022) - an independent intelligence company, specialized in security risks

open source ⭐️

maintainer 🤓

contributions 🫱🏾‍🫲🏼

volunteering 🌍

  • Bonfari - small- to medium-scale sustainable projects in Gambia 🇬🇲
  • 510 (Red Cross) - occasional projects to improve humanitarian aid with data

Contacts

Gmail LinkedIn Twitter


crosslingual-coreference's Issues

Retrieving cluster heads without replacing corefs

I am interested in extracting the cluster heads, with something like doc._.coref_cluster_heads, without producing the resolved text. It could also be a separate function whose output potentially acts as input to replace_corefs.
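A minimal sketch of what this could look like today, assuming predict() returns a dict with a "cluster_heads" key (as shown in the character-ranges issue further down); this is not an existing extension, just a workaround idea:

import crosslingual_coreference
from crosslingual_coreference import Predictor

predictor = Predictor(language="en_core_web_sm", device=-1, model_name="minilm")
result = predictor.predict(
    "Do not forget about Momofuku Ando! He created instant noodles in Osaka."
)

# Read only the heads and their token spans; skip the reconstituted text entirely.
cluster_heads = result.get("cluster_heads", {})  # e.g. {'Momofuku Ando': [4, 5], ...}
print(cluster_heads)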

Which language model is used for minilm?

I am using the following code snippet for coreference resolution

predictor = Predictor(language="en_core_web_sm", device=-1, model_name="minilm")

While checking the source code below,

"minilm": {
        "url": (
            "https://storage.googleapis.com/pandora-intelligence/models/crosslingual-coreference/minilm/model.tar.gz"
        ),
        "f1_score_ontonotes": 74,
        "file_extension": ".tar.gz",
    },

it seems that the language model used here is https://storage.googleapis.com/pandora-intelligence/models/crosslingual-coreference/minilm/model.tar.gz

Is this the same model that I can find on https://huggingface.co/models, such as
https://huggingface.co/microsoft/Multilingual-MiniLM-L12-H384/tree/main,
or is it some other Hugging Face model?

Comparatively high prediction time for the first predict() call

I am using the minilm model with language 'en_core_web_sm'.
When comparing prediction times for predictor.predict(text), the first call is always somewhat slower than the following calls.
Suppose that after creating a predictor object, I call predict as follows:

predictor.predict(text) ---> first call
predictor.predict(text) ---> second call
predictor.predict(text) ---> third call

The first call takes noticeably longer (about 0.2 sec) than the subsequent prediction calls (about 0.05 sec).
Could you please help me understand why this initial call takes more time?
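A hedged illustration of one common explanation (an assumption, not a confirmed cause): the first call tends to pay one-off costs such as lazy model and tokenizer initialization, so issuing a throwaway warm-up prediction moves that cost out of the timed path.

import time

from crosslingual_coreference import Predictor

predictor = Predictor(language="en_core_web_sm", device=-1, model_name="minilm")
predictor.predict("Warm-up sentence.")  # absorbs one-off initialization cost

text = "Momofuku Ando created instant noodles in Osaka."
start = time.perf_counter()
predictor.predict(text)  # subsequent calls should show the steady-state latency
print(f"prediction took {time.perf_counter() - start:.3f}s")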

Why does this package need to install google-cloud auth, storage, API, etc.?

Hi,

after installing the library I saw that google-api-core 2.10.1, google-auth 2.12.0, google-cloud-core 2.3.2, and google-cloud-storage 1.44.0 had been installed as well. These packages can indeed be found in the poetry.lock file.

Is there a reason (that I'm missing) why this library needs these packages?

Thanks

Local run

Is there a way to run this fully locally?
For example: first download all the data locally, and then run it locally via Docker.
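A hedged sketch of pre-fetching the model archive so a container can run without outbound network access. The URL is the minilm entry quoted earlier on this page; where the package expects the file on disk afterwards is an assumption you would need to verify.

import requests

MODEL_URL = (
    "https://storage.googleapis.com/pandora-intelligence/models/"
    "crosslingual-coreference/minilm/model.tar.gz"
)

# Stream the archive to disk; mount or copy it into the Docker image afterwards.
with requests.get(MODEL_URL, stream=True, timeout=60) as response:
    response.raise_for_status()
    with open("model.tar.gz", "wb") as fh:
        for chunk in response.iter_content(chunk_size=1 << 20):
            fh.write(chunk)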

Error when using coref as a spaCy pipeline

Hi all,
while trying to run a spaCy test

import spacy
import crosslingual_coreference

text = """
    Do not forget about Momofuku Ando!
    He created instant noodles in Osaka.
    At that location, Nissin was founded.
    Many students survived by eating these noodles, but they don't even know him."""

# use any model that has internal spacy embeddings
nlp = spacy.load('en_core_web_sm')
nlp.add_pipe(
    "xx_coref", config={"chunk_size": 2500, "chunk_overlap": 2, "device": 0})

doc = nlp(text)

print(doc._.coref_clusters)
print(doc._.resolved_text)

I encountered the following issue:

[nltk_data] Downloading package omw-1.4 to
[nltk_data]     /home/user/nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!
Traceback (most recent call last):
  File "/home/user/test_coref/test.py", line 12, in <module>
    nlp.add_pipe(
  File "/home/user/test_coref/.venv/lib/python3.9/site-packages/spacy/language.py", line 792, in add_pipe
    pipe_component = self.create_pipe(
  File "/home/user/test_coref/.venv/lib/python3.9/site-packages/spacy/language.py", line 674, in create_pipe
    resolved = registry.resolve(cfg, validate=validate)
  File "/home/user/test_coref/.venv/lib/python3.9/site-packages/thinc/config.py", line 746, in resolve
    resolved, _ = cls._make(
  File "/home/user/test_coref/.venv/lib/python3.9/site-packages/thinc/config.py", line 795, in _make
    filled, _, resolved = cls._fill(
  File "/home/user/test_coref/.venv/lib/python3.9/site-packages/thinc/config.py", line 867, in _fill
    getter_result = getter(*args, **kwargs)
  File "/home/user/test_coref/.venv/lib/python3.9/site-packages/crosslingual_coreference/__init__.py", line 33, in make_crosslingual_coreference
    return SpacyPredictor(
  File "/home/user/test_coref/.venv/lib/python3.9/site-packages/crosslingual_coreference/CrossLingualPredictorSpacy.py", line 18, in __init__
    super().__init__(language, device, model_name, chunk_size, chunk_overlap)
  File "/home/user/test_coref/.venv/lib/python3.9/site-packages/crosslingual_coreference/CrossLingualPredictor.py", line 55, in __init__
    self.set_coref_model()
  File "/home/user/test_coref/.venv/lib/python3.9/site-packages/crosslingual_coreference/CrossLingualPredictor.py", line 85, in set_coref_model
    self.predictor = Predictor.from_path(self.filename, language=self.language, cuda_device=self.device)
  File "/home/user/test_coref/.venv/lib/python3.9/site-packages/allennlp/predictors/predictor.py", line 366, in from_path
    load_archive(archive_path, cuda_device=cuda_device, overrides=overrides),
  File "/home/user/test_coref/.venv/lib/python3.9/site-packages/allennlp/models/archival.py", line 232, in load_archive
    dataset_reader, validation_dataset_reader = _load_dataset_readers(
  File "/home/user/test_coref/.venv/lib/python3.9/site-packages/allennlp/models/archival.py", line 268, in _load_dataset_readers
    dataset_reader = DatasetReader.from_params(
  File "/home/user/test_coref/.venv/lib/python3.9/site-packages/allennlp/common/from_params.py", line 604, in from_params
    return retyped_subclass.from_params(
  File "/home/user/test_coref/.venv/lib/python3.9/site-packages/allennlp/common/from_params.py", line 636, in from_params
    kwargs = create_kwargs(constructor_to_inspect, cls, params, **extras)
  File "/home/user/test_coref/.venv/lib/python3.9/site-packages/allennlp/common/from_params.py", line 206, in create_kwargs
    constructed_arg = pop_and_construct_arg(
  File "/home/user/test_coref/.venv/lib/python3.9/site-packages/allennlp/common/from_params.py", line 314, in pop_and_construct_arg
    return construct_arg(class_name, name, popped_params, annotation, default, **extras)
  File "/home/user/test_coref/.venv/lib/python3.9/site-packages/allennlp/common/from_params.py", line 394, in construct_arg
    value_dict[key] = construct_arg(
  File "/home/user/test_coref/.venv/lib/python3.9/site-packages/allennlp/common/from_params.py", line 348, in construct_arg
    result = annotation.from_params(params=popped_params, **subextras)
  File "/home/user/test_coref/.venv/lib/python3.9/site-packages/allennlp/common/from_params.py", line 604, in from_params
    return retyped_subclass.from_params(
  File "/home/user/test_coref/.venv/lib/python3.9/site-packages/allennlp/common/from_params.py", line 638, in from_params
    return constructor_to_call(**kwargs)  # type: ignore
  File "/home/user/test_coref/.venv/lib/python3.9/site-packages/allennlp/data/token_indexers/pretrained_transformer_mismatched_indexer.py", line 58, in __init__
    self._matched_indexer = PretrainedTransformerIndexer(
  File "/home/user/test_coref/.venv/lib/python3.9/site-packages/allennlp/data/token_indexers/pretrained_transformer_indexer.py", line 56, in __init__
    self._allennlp_tokenizer = PretrainedTransformerTokenizer(
  File "/home/user/test_coref/.venv/lib/python3.9/site-packages/allennlp/data/tokenizers/pretrained_transformer_tokenizer.py", line 72, in __init__
    self.tokenizer = cached_transformers.get_tokenizer(
  File "/home/user/test_coref/.venv/lib/python3.9/site-packages/allennlp/common/cached_transformers.py", line 204, in get_tokenizer
    tokenizer = transformers.AutoTokenizer.from_pretrained(
  File "/home/user/test_coref/.venv/lib/python3.9/site-packages/transformers/models/auto/tokenization_auto.py", line 546, in from_pretrained
    return tokenizer_class_fast.from_pretrained(pretrained_model_name_or_path, *inputs, **kwargs)
  File "/home/user/test_coref/.venv/lib/python3.9/site-packages/transformers/tokenization_utils_base.py", line 1788, in from_pretrained
    return cls._from_pretrained(
  File "/home/user/test_coref/.venv/lib/python3.9/site-packages/transformers/tokenization_utils_base.py", line 1923, in _from_pretrained
    tokenizer = cls(*init_inputs, **init_kwargs)
  File "/home/user/test_coref/.venv/lib/python3.9/site-packages/transformers/models/xlm_roberta/tokenization_xlm_roberta_fast.py", line 140, in __init__
    super().__init__(
  File "/home/user/test_coref/.venv/lib/python3.9/site-packages/transformers/tokenization_utils_fast.py", line 110, in __init__
    fast_tokenizer = TokenizerFast.from_file(fast_tokenizer_file)
Exception: EOF while parsing a list at line 1 column 4920583

Here's what I have installed (pulled by poetry add crosslingual-coreference or pip install crosslingual-coreference):

(.venv) user@host$ pip freeze
aiohttp==3.8.1
aiosignal==1.2.0
allennlp==2.9.3
allennlp-models==2.9.3
async-timeout==4.0.2
attrs==21.4.0
base58==2.1.1
blis==0.7.7
boto3==1.23.5
botocore==1.26.5
cached-path==1.1.2
cachetools==5.1.0
catalogue==2.0.7
certifi==2022.5.18.1
charset-normalizer==2.0.12
click==8.0.4
conllu==4.4.1
crosslingual-coreference==0.2.4
cymem==2.0.6
datasets==2.2.1
dill==0.3.5.1
docker-pycreds==0.4.0
en-core-web-sm @ https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.2.0/en_core_web_sm-3.2.0-py3-none-any.whl
en-core-web-trf @ https://github.com/explosion/spacy-models/releases/download/en_core_web_trf-3.2.0/en_core_web_trf-3.2.0-py3-none-any.whl
fairscale==0.4.6
filelock==3.6.0
frozenlist==1.3.0
fsspec==2022.5.0
ftfy==6.1.1
gitdb==4.0.9
GitPython==3.1.27
google-api-core==2.8.0
google-auth==2.6.6
google-cloud-core==2.3.0
google-cloud-storage==2.3.0
google-crc32c==1.3.0
google-resumable-media==2.3.3
googleapis-common-protos==1.56.1
h5py==3.6.0
huggingface-hub==0.5.1
idna==3.3
iniconfig==1.1.1
Jinja2==3.1.2
jmespath==1.0.0
joblib==1.1.0
jsonnet==0.18.0
langcodes==3.3.0
lmdb==1.3.0
MarkupSafe==2.1.1
more-itertools==8.13.0
multidict==6.0.2
multiprocess==0.70.12.2
murmurhash==1.0.7
nltk==3.7
numpy==1.22.4
packaging==21.3
pandas==1.4.2
pathtools==0.1.2
pathy==0.6.1
Pillow==9.1.1
pluggy==1.0.0
preshed==3.0.6
promise==2.3
protobuf==3.20.1
psutil==5.9.1
py==1.11.0
py-rouge==1.1
pyarrow==8.0.0
pyasn1==0.4.8
pyasn1-modules==0.2.8
pydantic==1.8.2
pyparsing==3.0.9
pytest==7.1.2
python-dateutil==2.8.2
pytz==2022.1
PyYAML==6.0
regex==2022.4.24
requests==2.27.1
responses==0.18.0
rsa==4.8
s3transfer==0.5.2
sacremoses==0.0.53
scikit-learn==1.1.1
scipy==1.6.1
sentence-transformers==2.2.0
sentencepiece==0.1.96
sentry-sdk==1.5.12
setproctitle==1.2.3
shortuuid==1.0.9
six==1.16.0
smart-open==5.2.1
smmap==5.0.0
spacy==3.2.4
spacy-alignments==0.8.5
spacy-legacy==3.0.9
spacy-loggers==1.0.2
spacy-sentence-bert==0.1.2
spacy-transformers==1.1.5
srsly==2.4.3
tensorboardX==2.5
termcolor==1.1.0
thinc==8.0.16
threadpoolctl==3.1.0
tokenizers==0.12.1
tomli==2.0.1
torch==1.10.2
torchaudio==0.10.2
torchvision==0.11.3
tqdm==4.64.0
transformers==4.17.0
typer==0.4.1
typing-extensions==4.2.0
urllib3==1.26.9
wandb==0.12.16
wasabi==0.9.1
wcwidth==0.2.5
word2number==1.1
xxhash==3.0.0
yarl==1.7.2

Do you have any recommendations?
Is there an installation step missing?

Thanks in advance!
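One hedged diagnostic, based on the assumption that "EOF while parsing" from TokenizerFast.from_file usually points at a truncated or partially downloaded file rather than a code bug: check that the cached model archive extracts cleanly and re-download it if it does not. The path below is hypothetical and needs to be adjusted to wherever the package stored model.tar.gz.

import tarfile

archive_path = "model.tar.gz"  # hypothetical path to the cached archive

try:
    with tarfile.open(archive_path, "r:gz") as archive:
        archive.getmembers()  # forces a full read and fails on truncated files
    print("archive looks intact")
except (tarfile.ReadError, EOFError, OSError) as exc:
    print(f"archive appears corrupted ({exc}); delete it and download it again")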

Support for Spacy 3.4.0

Hi, I would like to use this nice package with Dutch language models that only work with spaCy 3.4.0+. How difficult would it be to support spaCy 3.4.0?

feat: look into ONNX-enhanced transformer embeddings

Creating embeddings takes roughly 50% of the inference time. allennlp/modules/token_embedders/pretrained_transformer_embedder.py holds the logic for creating these embeddings. Make sure we can call them in a faster way.
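A minimal export sketch, not the package's implementation: convert the Hugging Face encoder to ONNX so the embedding forward pass can later be served through onnxruntime. The model name is a placeholder assumption (xlm-roberta-base, matching the XLM-R tokenizer seen in the traceback above); swap in the actual backbone.

import torch
from transformers import AutoModel, AutoTokenizer

model_name = "xlm-roberta-base"  # placeholder assumption, not the confirmed backbone
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)
model.config.return_dict = False  # trace-friendly tuple outputs
model.eval()

dummy = tokenizer("warm-up text", return_tensors="pt")
torch.onnx.export(
    model,
    (dummy["input_ids"], dummy["attention_mask"]),
    "encoder.onnx",
    input_names=["input_ids", "attention_mask"],
    output_names=["last_hidden_state"],
    dynamic_axes={
        "input_ids": {0: "batch", 1: "sequence"},
        "attention_mask": {0: "batch", 1: "sequence"},
        "last_hidden_state": {0: "batch", 1: "sequence"},
    },
    opset_version=14,
)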

HTTPSConnectionPool(host='storage.googleapis.com', port=443): Max retries exceeded with url: /pandora-intelligence/models/crosslingual-coreference/minilm/model.tar.gz

Python 3.8.13
Spacy - 3.1.0
en_core_web_sm-3.1.0
crosslingual_coreference - 0.2.8

requests.exceptions.SSLError: HTTPSConnectionPool(host='storage.googleapis.com', port=443): Max retries exceeded with url: /pandora-intelligence/models/crosslingual-coreference/minilm/model.tar.gz (Caused by SSLError(SSLCertVerificationError(1, '[SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: unable to get local issuer certificate (_ssl.c:1131)')))
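A hedged workaround sketch, assuming the cause is that the local Python environment cannot locate a CA bundle (one common source of CERTIFICATE_VERIFY_FAILED); it points requests at the certifi bundle before the model download runs, and may not apply to other setups.

import os

import certifi

# Tell requests (used for the model download) where to find CA certificates.
os.environ["REQUESTS_CA_BUNDLE"] = certifi.where()

from crosslingual_coreference import Predictor

predictor = Predictor(language="en_core_web_sm", device=-1, model_name="minilm")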

[Errno 101] Network is unreachable

Hello, when I try to run the code below

predictor = Predictor(
    language="en_core_web_sm", device=1, model_name="info_xlm"
)

I get the following error:

ConnectionError: HTTPSConnectionPool(host='cdn-lfs.huggingface.co', port=443): Max retries exceeded with url: /microsoft/infoxlm-base/cfc8146abe2a0488e9e2a0c56de7952f7c11ab059eca145a0a727afce0db2865 (Caused by NewConnectionError('<urllib3.connection.HTTPSConnection object at 0x7ff90cba1a00>: Failed to establish a new connection: [Errno 101] Network is unreachable'))

Is this URL still valid, and if not, what should I use instead?

installation failed: 'notebook' has no attribute 'nbextensions'

I tried to install it in a venv:
(.spaCy) PS C:\Users\joajo\Documents> pip --version
pip 23.2.1 from C:\Users\joajo\Documents.spaCy\Lib\site-packages\pip (python 3.11)

 AttributeError: module 'notebook' has no attribute 'nbextensions'
  [end of output]

note: This error originates from a subprocess, and is likely not a problem with pip.
error: metadata-generation-failed

× Encountered error while generating package metadata.
╰─> See above for output.

note: This is an issue with the package mentioned above, not pip.
hint: See above for details.

How do we get character ranges of the clusters?

Right now, when we call predictor.predict() we get the clusters as a list of lists, and the cluster heads along with their token indices. Is it possible to:

  • Get the cluster heads as character ranges? That is, instead of the token/word indices in 'cluster_heads': {'Momofuku Ando': [4, 5], 'Osaka': [12, 12], 'instant noodles': [9, 10], 'Many students': [22, 23], 'Nissin': [18, 18]}, can we get character ranges?
  • Alternatively, can we get a separate variable that maps the token indices to tokens? Something like ['Do', 'not', 'forget', 'about', ...]. I tried looking at how the text is tokenized but couldn't work that out from the code. For my application I need to check whether a coreference appears in a particular character range, and I would like to do that accurately, with character ranges being the most reliable way (see the sketch after this list).
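A minimal sketch, not part of the package API: it assumes the predictor's token indices line up with the spaCy pipeline's tokenization, and maps an inclusive token-index span to a character range.

import spacy

nlp = spacy.load("en_core_web_sm")
text = "Do not forget about Momofuku Ando! He created instant noodles in Osaka."
doc = nlp(text)

tokens = [token.text for token in doc]  # maps token indices to token strings

def token_span_to_char_range(doc, start, end):
    """Convert an inclusive token-index span [start, end] to character offsets."""
    span = doc[start : end + 1]
    return span.start_char, span.end_char

print(tokens[:6])                           # ['Do', 'not', 'forget', 'about', 'Momofuku', 'Ando']
print(token_span_to_char_range(doc, 4, 5))  # character range covering "Momofuku Ando"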
