
nicola-decao / efficient-autoregressive-el


PyTorch implementation of Highly Parallel Autoregressive Entity Linking with Discriminative Correction

Home Page: https://arxiv.org/abs/2109.03792

License: MIT License

Languages: Jupyter Notebook 26.74%, Python 73.26%
Topics: natural-language-processing, nlp, pytorch, entity-linking, entity-disambiguation

efficient-autoregressive-el's Introduction

Highly Parallel Autoregressive Entity Linking
with Discriminative Correction

Overview

This repository contains the PyTorch implementation of [1](https://arxiv.org/abs/2109.03792).

Here are the links to the pre-processed data used in this work (i.e., the training, validation, and test splits of AIDA, as well as the KB with the entities) and to the released model.

Dependencies

  • python>=3.8
  • pytorch>=1.7
  • pytorch_lightning>=1.3
  • transformers>=4.0

Structure

  • src: The source code of the model. src/data contains a dataset class for Entity Linking. src/model contains the three classes that implement our EL model: one for the Entity Disambiguation part, one for the (autoregressive) Entity Linking part, and one for the entire model (which also contains the training and validation loops).
  • notebooks: Example code for loading our Entity Linking model, evaluating it on AIDA, and running inference on a test document.

Usage

Please have a look at the notebooks folder to see how to load our Entity Linking model, evaluate it on AIDA, and run inference on a test document.

Here is a minimal example that demonstrates how to use our model:

from src.model.efficient_el import EfficientEL
from IPython.display import Markdown
from src.utils import get_markdown

# loading the model on GPU and setting the threshold to the
# optimal value (based on AIDA validation set)
model = EfficientEL.load_from_checkpoint("../models/model.ckpt").eval().cuda()
model.hparams.threshold = -3.2

# loading the KB with the entities
model.generate_global_trie()

# document which we want to apply EL on
s = """CRICKET - LEICESTERSHIRE TAKE OVER AT TOP AFTER INNINGS VICTORY . LONDON 1996-08-30 \
West Indian all-rounder Phil Simmons took four for 38 on Friday as Leicestershire beat Somerset \
by an innings and 39 runs in two days to take over at the head of the county championship ."""

# getting spans from the model and converting the result into Markdown for visualization
Markdown(
    get_markdown(
        [s],
        [[(span[0], span[1], span[2][0][0]) for span in spans]
         for spans in model.sample([s])]
    )[0]
)

Which will generate:

CRICKET - LEICESTERSHIRE TAKE OVER AT TOP AFTER INNINGS VICTORY . LONDON 1996-08-30 West Indian all-rounder Phil Simmons took four for 38 on Friday as Leicestershire beat Somerset by an innings and 39 runs in two days to take over at the head of the county championship .
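If you need the raw spans instead of the rendered Markdown, you can inspect the output of model.sample directly. A minimal sketch (the format, a list per document of (start_char, end_char, candidates) tuples with candidates sorted by log-score, is the same one unpacked by the comprehension above):

# print the top-ranked entity for each predicted span
for start, end, candidates in model.sample([s])[0]:
    entity, score = candidates[0]
    print(f"{s[start:end]!r} -> {entity} ({score:.2f})")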

Please cite [1] in your work when using this library in your experiments.

Training

To train our model you can run the following command:

python scripts/train.py --gpus ${NUM_GPUS} --accelerator ddp --batch_size 32
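The training script also accepts explicit dataset paths. The invocation below is adapted from a user report in the issues further down (flag names unverified against the current scripts; note that newer pytorch_lightning versions spell the DDP flag --strategy rather than --accelerator):

python scripts/train.py --gpus 1 --accelerator ddp --batch_size 32 --num_workers 1 --train_data_path ./aida_train_dataset.jsonl --dev_data_path ./aida_val_dataset.jsonl --test_data_path ./aida_test_dataset.jsonl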

Feedback

For questions and comments, feel free to contact Nicola De Cao.

License

MIT

Citation

[1] Nicola De Cao, Wilker Aziz, and Ivan Titov. 2021.
Highly Parallel Autoregressive Entity Linking with Discriminative Correction.
In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 7662–7669.
https://doi.org/10.18653/v1/2021.emnlp-main.604

BibTeX format:

@inproceedings{de-cao-etal-2021-highly,
    title = "Highly Parallel Autoregressive Entity Linking with Discriminative Correction",
    author = "De Cao, Nicola  and
      Aziz, Wilker  and
      Titov, Ivan",
    booktitle = "Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing",
    month = nov,
    year = "2021",
    address = "Online and Punta Cana, Dominican Republic",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2021.emnlp-main.604",
    doi = "10.18653/v1/2021.emnlp-main.604",
    pages = "7662--7669",
}


efficient-autoregressive-el's Issues

Very different results from GENRE

I want to use end2end EL in production, so it needs to be fast.
GENRE is accurate enough for use, but not fast enough, so I've been trying this model instead (assuming that it's on par with GENRE in accuracy).

I tried a few sentences:
"I visited Bologne while I was in Italy",
'In a televised meeting across a small table from his defence minister, Vladimir Putin ordered his forces to hold back from storming the Azovstal steel plant.',
"While in Memphis he did not go to the blues clubs that he would have visited if he was back home, instead he visited the hall of Ramses II"

and GENRE linked most of them correctly, while the efficient model failed on almost all the names, predicting "Breda" for "Bologne", "Miami" for "Memphis", and "Rolf Ekéus" for "Ramses II".

Is the model in https://mega.nz/folder/l4RhnIxL#_oYvidq2qyDIw1sT-KeMQA untrained?

Any idea why the model is linking random entities that start with the same letter as the name in the text, instead of an entity with the same or similar name?

Mentions positioning with GPT2-tokenizer

Hello,

This is really great work and the latency is truly better than other models.

However I do have some questions, I was hoping you could give me some insights.

Using the GPT2 tokenizer, the position of each mention usually corresponds to position_of_the_mention + 1 in the AIDA data you are using (except when the mention starts at the first word of the text).
Example from the aida_test_dataset:
[screenshot of a tokenized example omitted]
You can see that the JAPAN position within the list should be [4,6] instead of [5,7], and that the Rugby Union mention should be [0,5] instead of [0,6]. So I was simply wondering: is that normal?
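For reference, token-to-character alignment can be inspected with the offset mapping of a fast tokenizer. A minimal sketch using HuggingFace's GPT2TokenizerFast on an illustrative sentence, independent of this repo's own preprocessing:

from transformers import GPT2TokenizerFast

tok = GPT2TokenizerFast.from_pretrained("gpt2")
text = "RUGBY UNION - JAPAN TO FACE WALES"  # illustrative, not taken from AIDA
enc = tok(text, return_offsets_mapping=True)

# each token id together with the character span it covers in the text
for token_id, (start, end) in zip(enc["input_ids"], enc["offset_mapping"]):
    print(token_id, repr(text[start:end]), (start, end))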

Also, if it's not too much trouble, could you explain what happens with entities.json if the training datasets don't have a "candidates" key?

Finally, about mentions.json: there are a ton of '!' entries. Should I replicate that in my own mentions.json?

Thank you for your help and for this repo!

trainer.test(model) produces an error when resuming from checkpoint.

Hey,

Thank you for sharing your code, it has been straightforward to train a model on my own data.

When I attempt to run trainer.test(model) with the command: python ./scripts/trainer.py --resume_from_checkpoint /path/to/checkpoint/ --gpus=1 I receive the following error:

Testing: 0it [00:00, ?it/s]Traceback (most recent call last):
  File "scripts/train.py", line 47, in <module>
    trainer.test(model)
  File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 705, in test
    results = self._run(model)
  File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 922, in _run
    self._dispatch()
  File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 986, in _dispatch
    self.accelerator.start_evaluating(self)
  File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/accelerators/accelerator.py", line 95, in start_evaluating
    self.training_type_plugin.start_evaluating(trainer)
  File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/plugins/training_type/training_type_plugin.py", line 165, in start_evaluating
    self._results = trainer.run_stage()
  File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 997, in run_stage
    return self._run_evaluate()
  File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1083, in _run_evaluate
    eval_loop_results = self._evaluation_loop.run()
  File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/loops/base.py", line 111, in run
    self.advance(*args, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/loops/dataloader/evaluation_loop.py", line 110, in advance
    dl_outputs = self.epoch_loop.run(
  File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/loops/base.py", line 111, in run
    self.advance(*args, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/loops/epoch/evaluation_epoch_loop.py", line 111, in advance
    output = self.evaluation_step(batch, batch_idx, dataloader_idx)
  File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/loops/epoch/evaluation_epoch_loop.py", line 154, in evaluation_step
    output = self.trainer.accelerator.test_step(step_kwargs)
  File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/accelerators/accelerator.py", line 226, in test_step
    return self.training_type_plugin.test_step(*step_kwargs.values())
  File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/plugins/training_type/training_type_plugin.py", line 181, in test_step
    return self.model.test_step(*args, **kwargs)
  File "/code/efficient-autoregressive-EL/src/model/efficient_el.py", line 450, in test_step
    self.log_dict(metrics)
  File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/core/lightning.py", line 507, in log_dict
    self.log(
  File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/core/lightning.py", line 446, in log
    results.log(
  File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/trainer/connectors/logger_connector/result.py", line 472, in log
    raise MisconfigurationException(
pytorch_lightning.utilities.exceptions.MisconfigurationException: You called `self.log(micro_f1, ...)` twice in `test_step` with different arguments. This is not allowed

Am I using the trainer's test method incorrectly?

Thank you,

Martin

Chinese entity linking

Hello, thank you very much for your work. I would like to ask if the algorithm can be applied to Chinese entity linking.

Start and end position indexes in dataset

Hello! It's an honour to read your code!
How do you generate the start and end position indexes in the datasets? I found that they do not correctly locate the mentions.
Looking forward to your reply!

{"id": "947testa 947testa", "input": "CRICKET - LEICESTERSHIRE TAKE OVER AT TOP AFTER INNINGS VICTORY . LONDON 1996-08-30 West Indian all-rounder Phil Simmons took four for 38 on Friday as Leicestershire beat Somerset by an innings and 39 runs in two days to take over at the head of the county championship . Their stay on top , though , may be short-lived as title rivals Essex , Derbyshire and Surrey all closed in on victory while Kent made up for lost time in their rain-affected match against Nottinghamshire . After bowling Somerset out for 83 on the opening morning at Grace Road , Leicestershire extended their first innings by 94 runs before being bowled out for 296 with England discard Andy Caddick taking three for 83 . Trailing by 213 , Somerset got a solid start to their second innings before Simmons stepped in to bundle them out for 174 . Essex , however , look certain to regain their top spot after Nasser Hussain and Peter Such gave them a firm grip on their match against Yorkshire at Headingley . Hussain , considered surplus to England 's one-day requirements , struck 158 , his first championship century of the season , as Essex reached 372 and took a first innings lead of 82 . By the close Yorkshire had turned that into a 37-run advantage but off-spinner Such had scuttled their hopes , taking four for 24 in 48 balls and leaving them hanging on 119 for five and praying for rain . At the Oval , Surrey captain Chris Lewis , another man dumped by England , continued to silence his critics as he followed his four for 45 on Thursday with 80 not out on Friday in the match against Warwickshire . He was well backed by England hopeful Mark Butcher who made 70 as Surrey closed on 429 for seven , a lead of 234 . Derbyshire kept up the hunt for their first championship title since 1936 by reducing Worcestershire to 133 for five in their second innings , still 100 runs away from avoiding an innings defeat . Australian Tom Moody took six for 82 but Chris Adams , 123 , and Tim O'Gorman , 109 , took Derbyshire to 471 and a first innings lead of 233 . After the frustration of seeing the opening day of their match badly affected by the weather , Kent stepped up a gear to dismiss Nottinghamshire for 214 . They were held up by a gritty 84 from Paul Johnson but ex-England fast bowler Martin McCague took four for 55 . 
By stumps Kent had reached 108 for three .", "anchors": [[5, 10, "Leicestershire County Cricket Club"], [24, 25, "London"], [31, 32, "West Indies cricket team"], [36, 37, "Phil Simmons"], [45, 48, "Leicestershire County Cricket Club"], [50, 50, "Somerset County Cricket Club"], [86, 86, "Essex County Cricket Club"], [88, 90, "Derbyshire County Cricket Club"], [92, 92, "Surrey County Cricket Club"], [99, 99, "Kent County Cricket Club"], [112, 113, "Nottinghamshire County Cricket Club"], [117, 117, "Somerset County Cricket Club"], [126, 127, "Grace Road"], [129, 132, "Leicestershire County Cricket Club"], [148, 148, "England cricket team"], [150, 153, "Andrew Caddick"], [164, 164, "Somerset County Cricket Club"], [174, 174, "Phil Simmons"], [184, 184, "Essex County Cricket Club"], [196, 198, "Nasser Hussain"], [200, 201, "Peter Such"], [211, 211, "Yorkshire County Cricket Club"], [213, 215, "Headingley"], [217, 217, "Nasser Hussain"], [222, 222, "England cricket team"], [242, 242, "Essex County Cricket Club"], [257, 257, "Yorkshire County Cricket Club"], [272, 272, "Peter Such"], [302, 302, "The Oval"], [304, 304, "Surrey County Cricket Club"], [306, 307, "Chris Lewis (cricketer)"], [313, 313, "England cricket team"], [339, 342, "Warwickshire County Cricket Club"], [349, 349, "England cricket team"], [351, 352, "Mark Butcher"], [357, 357, "Surrey County Cricket Club"], [369, 371, "Derbyshire County Cricket Club"], [385, 388, "Worcestershire County Cricket Club"], [408, 408, "Australia"], [409, 410, "Tom Moody"], [416, 417, "Chris Adams (cricketer)"], [431, 433, "Derbyshire County Cricket Club"], [462, 462, "Kent County Cricket Club"], [469, 470, "Nottinghamshire County Cricket Club"], [483, 484, "Paul Johnson (cricketer)"], [492, 494, "Martin McCague"], [503, 503, "Kent County Cricket Club"]]}
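A quick way to sanity-check anchors like these is to index into the whitespace-split input. A minimal sketch (whether the indexes are meant as whitespace-token positions, and whether the end index is inclusive, is exactly what is in question here):

import json

with open("aida_test_dataset.jsonl") as f:  # file name as used elsewhere in these issues
    example = json.loads(f.readline())

words = example["input"].split()
for start, end, entity in example["anchors"]:
    # end treated as inclusive here; exclusive indexing is the other possibility
    print(entity, "->", " ".join(words[start:end + 1]))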

Error in beam search method with certain examples

Hey Nic,

Just having some trouble trying to get this model running on something other than the example.
I see that there may be a potential bug.

The example provided by the repo about cricket is working on my local machine.

I have also given an input sentence from a news article:

When Fabien needed to have a decayed tooth removed in May, his dentist told him that he would have to wait up to three years to have it done on the NHS. In disbelief, the 27-year-old from Edinburgh rang 50 dental practices but without any luck. He had no choice but to go private.

which results in output:

[[(5, 11, [('Fabrizio Ravanelli', -2.598557949066162), ('Bernard Tapie', -3.688422918319702), ('Bernard F. Fisher', -9.065004348754883), ('Fabrizio Brienza', -9.414182662963867), ('Fabrizio Guidi', -17.98088264465332)]), (188, 197, [('Edinburgh', -2.414118766784668), ('Edmonton', -3.213533401489258), ('Scotland', -3.43752121925354), ('Atlanta', -4.544315338134766), ('York', -5.995124340057373)])]]

however, this sentence:
Boris Johnson believed to have overruled ministers unwilling to compromise on post-Brexit immigration as forecourt queues mount.

results in the following error:

Traceback (most recent call last):
  File "/home/samin/src/efficient-autoregressive-EL/test.py", line 23, in <module>
    output = model.sample([s])
  File "/home/samin/src/efficient-autoregressive-EL/src/model/efficient_el.py", line 455, in sample
    spans = self.forward_beam_search(batch)
  File "/home/samin/src/efficient-autoregressive-EL/src/model/efficient_el.py", line 292, in forward_beam_search
    tokens, scores_el = self.entity_linking.forward_beam_search(
  File "/home/samin/src/efficient-autoregressive-EL/src/model/entity_linking.py", line 321, in forward_beam_search
    tokens, lm_scores, all_contexts = beam_search(
  File "/home/samin/src/efficient-autoregressive-EL/src/beam_search.py", line 78, in beam_search
    hidden = tile(hidden, beam_width, dim=0)  # [layers, B*beam_width, H_dec]
  File "/home/samin/src/efficient-autoregressive-EL/src/beam_search.py", line 20, in tile
    return [tile(e, count, dim=dim) for e in x]
  File "/home/samin/src/efficient-autoregressive-EL/src/beam_search.py", line 20, in <listcomp>
    return [tile(e, count, dim=dim) for e in x]
  File "/home/samin/src/efficient-autoregressive-EL/src/beam_search.py", line 30, in tile
    x.view(batch, -1)
RuntimeError: cannot reshape tensor of 0 elements into shape [0, -1] because the unspecified dimension size -1 can be any value and is ambiguous
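The final error reproduces independently of the model whenever a zero-element tensor is reshaped with an inferred dimension, which suggests the beam search is being handed a batch with no candidate spans. A minimal sketch:

import torch

x = torch.empty(0, 4)   # e.g. a batch in which no mention spans were detected
x.view(x.size(0), -1)   # RuntimeError: cannot reshape tensor of 0 elements into shape [0, -1] ...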

I was wondering if you have seen this issue before?

Incorrect linking for Person

Hi,
Thank you very much for sharing your code.
I was wondering why I get "Bill Clinton" for this example "Mr Obama was born in Hawaii"? The location is correctly linked, but not the person. I tested it with several other examples and the same results (wrong person linking). I was wondering if there is a parameter or setting I am missing? I am using the exact code in the example (readme) of the repo.
Thank you!

Providing candidates

I wonder how to provide candidates.

The example notebook doesn't show that functionality. Therefore "Phil Simmons" gets wrongly linked to Philip Walton. However, in the file aida_val_dataset.jsonl the only candidate for "Phil Simmons" is "Phil Simmons".

Does the model make use of the pre-computed candidate sets when trainer.test() is run?

How to build "mentions.json"

I'm new to EL, so could you please tell me what mentions.json is used for and how to build it? Thanks :)

Cuda Out Of Memory when training on large datasets

I get the following error when training on GPUs:

  File "/home/ubuntu/anaconda3/envs/pytorch_p38/lib/python3.8/site-packages/transformers/models/longformer/modeling_longformer.py", line 830, in _sliding_chunks_query_key_matmul
    diagonal_chunked_attention_scores = self._pad_and_transpose_last_two_dims(
  File "/home/ubuntu/anaconda3/envs/pytorch_p38/lib/python3.8/site-packages/transformers/models/longformer/modeling_longformer.py", line 713, in _pad_and_transpose_last_two_dims
    hidden_states_padded = nn.functional.pad(
  File "/home/ubuntu/anaconda3/envs/pytorch_p38/lib/python3.8/site-packages/torch/nn/functional.py", line 4174, in _pad
    return _VF.constant_pad_nd(input, pad, value)
RuntimeError: CUDA out of memory. Tried to allocate 364.00 MiB (GPU 0; 11.17 GiB total capacity; 10.44 GiB already allocated; 97.44 MiB free; 10.63 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

This has happened on multiple GPU setups, 6xK80 (12gb) and 1x A5000 (24gb).
Monitoring the training on the latter, it looks like there is a steady increase in GPU memory used as the training steps increase.

I think that there is a memory leak somewhere in the code, which may not have been revealed with the preprocessed aida dataset on account of its size? The dataset I am using currently is ~20gb in jsonl form.

My research tells me that there is some form of gradient accumulation, as described in the first passage here.

I have been looking for potential places where the gradients have been accumulated and was wondering if there are any lines in the training or validation step that could result in this form of gradient accumulation.

I will try to run a few experiments to try to stabilize the GPU RAM usage, but in the meantime, I thought I would post this as an issue.
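For what it's worth, one classic source of this symptom in PyTorch training loops is storing loss tensors without detaching them, which keeps each step's autograd graph alive. An illustrative sketch, not a confirmed diagnosis of this repo's code:

import torch

w = torch.randn(10, requires_grad=True)
losses = []
for step in range(1000):
    loss = (w * float(step)).sum()
    loss.backward()
    # losses.append(loss)        # leaks: every stored loss retains its graph
    losses.append(loss.item())   # safe: stores a plain Python float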

Thank you for your time!

EDIT:
The following invocation on an A5000:

python3 train.py --gpus 1 --strategy ddp --batch_size 16 --num_workers 1 --train_data_path ./aida_train_dataset.jsonl --dev_data_path ./aida_val_dataset.jsonl --test_data_path ./aida_test_dataset.jsonl

where batch_size was 16 (32 leads to OOMs); the following OOM occurs at step 1449 as the loss gradually decreases:

line 91, in backward
    model.backward(closure_loss, optimizer, *args, **kwargs)
  File "/opt/conda/lib/python3.7/site-packages/pytorch_lightning/core/lightning.py", line 1444, in backwa
rd
    loss.backward(*args, **kwargs)
  File "/opt/conda/lib/python3.7/site-packages/torch/_tensor.py", line 255, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
  File "/opt/conda/lib/python3.7/site-packages/torch/autograd/__init__.py", line 149, in backward
    allow_unreachable=True, accumulate_grad=True)  # allow_unreachable flag
RuntimeError: CUDA out of memory. Tried to allocate 1.94 GiB (GPU 0; 23.69 GiB total capacity; 20.27 GiB already allocated; 858.62 MiB free; 21.43 GiB reserved in total by PyTorch)

Question about the pre-processed AIDA data

Hi.

Thank you for open-sourcing your code. This is a great EL system.

I have one question about your pre-processed AIDA data. I guess that in each example the field anchors contains the list of entity mentions, and that for each mention you have the word start index, the word end index, and the Wikipedia name?

If so, are there any misalignment issues in the preprocessed data? For instance, in the first example in the training set ("EU rejects German ..."), there is (12, 14, Brussels); however, the tokens from 12 to 14 are BRUSSELS 1996-08-22 The. Is this a misalignment, or did I misunderstand something?

Thank you.
