shtoshni / fast-coref
Code for the CRAC 2021 paper "On Generalization in Coreference Resolution" (Best short paper award)
If I have a fine-tuned model and a paragraph of text containing multiple sentences, how can I run inference on that paragraph? In other words, how can I predict coreference clusters on arbitrary text?
I get the following error message from the scorer:
Using CoNLL scorer
[...]
.//../../coref_resources/reference-coreference-scorers/scorer.pl
Found too many repeated mentions (> 10) in the response, so refusing to score. Please fix the output.
This message occurs when a span is assigned to multiple clusters, e.g.:
doc 0 14 hemzelf VNW[pers,pron,obl,nadr,3m,ev,masc] *))))) - - - - * (484)|(484)
Interestingly, hemzelf (himself) is a word made up of two morphemes that are also words on their own, so this might be the reason this happens. In this case the same cluster is assigned twice, but there are also cases where different clusters are assigned to the same span.
The error messages from coval are a bit more informative and also tell you which clusters contain duplicate spans.
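Until the duplicate-assignment bug is fixed upstream, one way to catch this before scoring is to screen the predicted clusters for spans that appear more than once. A minimal standalone sketch (not part of the repo), assuming clusters are lists of inclusive `(start, end)` token spans:

```python
from collections import Counter

def find_duplicate_spans(clusters):
    """Return spans that occur more than once across all clusters,
    which is what makes the CoNLL scorer refuse to score."""
    counts = Counter(span for cluster in clusters for span in cluster)
    return sorted(span for span, n in counts.items() if n > 1)

# Example: (5, 5) is assigned to two clusters, like 'hemzelf' above.
clusters = [[(0, 1), (5, 5)], [(3, 3), (5, 5)]]
print(find_duplicate_spans(clusters))  # [(5, 5)]
```

Running this over each document before writing the response file pinpoints the offending spans without waiting for the Perl scorer to fail.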
I cannot find the .pth models (or anything else) in the Google Drive link at
https://drive.google.com/drive/folders/1270pP1JIYLleLH7rkRyXyHV2p0C7rX_8?usp=sharing
I would appreciate it if you could re-upload them or share them some other way. Thank you so much.
Hi,
I am using your library for coreference resolution. Currently, I am using a few hacks to get the coreference-resolved output. It would be great if you could provide code that replaces pronouns with their coreferent mentions (cluster elements).
Thanks
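For reference, a minimal sketch of such post-processing, assuming the clusters are given as lists of inclusive `(start, end)` token spans over a token list. The representative-mention heuristic (first non-pronoun mention in the cluster) and the pronoun list are my own assumptions, not part of the repo:

```python
PRONOUNS = {"he", "she", "it", "they", "him", "her", "them",
            "his", "hers", "its", "their", "himself", "herself"}

def replace_pronouns(tokens, clusters):
    """Replace single-token pronoun mentions with the first
    non-pronoun mention of their cluster."""
    out = list(tokens)
    for cluster in clusters:
        # Pick the first mention that is not a lone pronoun as antecedent.
        antecedent = None
        for start, end in cluster:
            text = " ".join(tokens[start:end + 1])
            if text.lower() not in PRONOUNS:
                antecedent = text
                break
        if antecedent is None:
            continue  # cluster consists only of pronouns
        for start, end in cluster:
            if start == end and tokens[start].lower() in PRONOUNS:
                out[start] = antecedent
    return out

tokens = ["Musk", "founded", "SpaceX", ".", "He", "is", "its", "CEO", "."]
clusters = [[(0, 0), (4, 4)], [(2, 2), (6, 6)]]
print(" ".join(replace_pronouns(tokens, clusters)))
# Musk founded SpaceX . Musk is SpaceX CEO .
```

A real implementation would also handle multi-token pronouns and possessive morphology ("its" → "SpaceX's"), which this sketch deliberately skips.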
From a coreference dataset:
(Tune: ``A Dream is a Wish Your Heart Makes'') Alice in Wonderland - Alice sings, while the garden of live flowers judge her. (Tune: ``All in the Golden Afternoon'') Peter Pan - Michael watches while Wendy assists Peter with his shadow.
The decoded tokens give:
( Tune : ` ` A Dream is a Wish Your Heart Makes '' ) Alice in Wonderland - Alice sings, while the garden of live flowers judge her. ( Tune : ` ` All in the Golden Afternoon '' ) Peter Pan - Michael watches while Wendy assists Peter with his shadow.
(Note the extra spaces around each `:`, backtick, and parenthesis.)
The reference indexes are correct within the decoded sentence, but it is impossible to recover the real character indexes in the source sentence.
For example:
'William de Alwis (1842--1916) was a Ceylonese artist and entomologist. With his brother George (dates unknown), William made a lasting contribution to the knowledge of the lepidoptera, (butterflies and moths) of Ceylon.'
And the reference text (decoded token text) is 'William de Alwis ( 1842 - -1916 )', which cannot be found in the source sentence.
I guess this is all caused by the spaCy tokenization, which runs first.
Thanks in advance for any help,
Have a great day
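One workaround I can suggest (my own sketch, not repo code): since only whitespace differs between the source text and the decoded tokens, you can recover character offsets by scanning the source left to right and locating each token in order. This sidesteps the detokenization problem entirely:

```python
def token_char_offsets(source, tokens):
    """Map each token to its (start, end) character offsets in `source`
    by scanning left-to-right. Assumes tokens appear in order and each
    token string is a contiguous substring of the source."""
    offsets, pos = [], 0
    for tok in tokens:
        # Skip whitespace the tokenizer may have inserted or normalized.
        while pos < len(source) and source[pos].isspace():
            pos += 1
        start = source.find(tok, pos)
        if start == -1:
            raise ValueError(f"token {tok!r} not found after position {pos}")
        pos = start + len(tok)
        offsets.append((start, pos))
    return offsets

src = "William de Alwis (1842--1916) was an artist."
toks = ["William", "de", "Alwis", "(", "1842", "-", "-1916", ")",
        "was", "an", "artist", "."]
offs = token_char_offsets(src, toks)
print(src[offs[4][0]:offs[4][1]])  # 1842
```

Given these offsets, a mention over token indices `(i, j)` maps to the character span `(offs[i][0], offs[j][1])` in the original text. This breaks if the tokenizer rewrites characters (e.g. quote normalization), in which case a character-level alignment is needed instead.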
Hi,
I downloaded the PreCo data here, which only contains training and dev data. From your paper, it seems that you held out 500 docs from the training data as your dev set and treated the original dev data as the test set. Did you randomly sample the 500 docs, or did you take the last 500 docs of the training data as your dev set?
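Whatever the original split was, if you need to reproduce a comparable setup yourself, a seeded shuffle makes the held-out set deterministic across runs (this is a generic sketch, not the repo's actual split procedure):

```python
import random

def split_train_dev(docs, dev_size=500, seed=42):
    """Reproducibly hold out `dev_size` documents as a dev set."""
    rng = random.Random(seed)          # fixed seed => same split every run
    indices = list(range(len(docs)))
    rng.shuffle(indices)
    dev_idx = set(indices[:dev_size])
    train = [d for i, d in enumerate(docs) if i not in dev_idx]
    dev = [d for i, d in enumerate(docs) if i in dev_idx]
    return train, dev
```

Recording the seed alongside the split is enough to make results comparable even if the exact document IDs of the original dev set are unknown.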
Hi, thanks so much for this, great work!
I'm working on coreference resolution for Polish, and I have data in .conllu format. I would love to see how to implement .conllu preprocessing in your pipeline!
Hi,
I am using the Colab notebook provided by you. Here is the sample text that I provided:
doc = """Elon Reeve Musk FRS (born June 28, 1971) is a business magnate and investor. He is the founder, CEO, and Chief Engineer at SpaceX; angel investor, CEO, and Product Architect of Tesla, Inc.; founder of The Boring Company; and co-founder of Neuralink and OpenAI. With an estimated net worth of around US$203 billion as of June 2022,[4] Musk is the wealthiest person in the world according to both the Bloomberg Billionaires Index and Forbes' real-time billionaires list.[5][6]
Musk was born to White South African parents in Pretoria, where he grew up. He briefly attended the University of Pretoria before moving to Canada at age 17, acquiring citizenship through his Canadian-born mother. He matriculated at Queen's University and transferred to the University of Pennsylvania two years later, where he received bachelor's degrees in Economics and Physics. He moved to California in 1995 to attend Stanford University but decided instead to pursue a business career, co-founding the web software company Zip2 with his brother Kimbal. The startup was acquired by Compaq for $307 million in 1999. The same year, Musk co-founded online bank X.com, which merged with Confinity in 2000 to form PayPal. The company was bought by eBay in 2002 for $1.5 billion.
In 2002, Musk founded SpaceX, an aerospace manufacturer and space transport services company, of which he serves as CEO and Chief Engineer. In 2004, he was an early investor in electric vehicle manufacturer Tesla Motors, Inc. (now Tesla, Inc.). He became its chairman and product architect, eventually assuming the position of CEO in 2008. In 2006, he helped create SolarCity, a solar energy company that was later acquired by Tesla and became Tesla Energy. In 2015, he co-founded OpenAI, a nonprofit research company promoting friendly artificial intelligence (AI). In 2016, he co-founded Neuralink, a neurotechnology company focused on developing brain–computer interfaces, and founded The Boring Company, a tunnel construction company. He agreed to purchase the major American social networking service Twitter in 2022 for $44 billion. Musk has proposed the Hyperloop, a high-speed vactrain transportation system, and is the president of the Musk Foundation, an organization which donates to scientific research and education.
Musk has been criticized for making unscientific and controversial statements, such as spreading misinformation about the COVID-19 pandemic. In 2018, he was sued by the US Securities and Exchange Commission (SEC) for falsely tweeting that he had secured funding for a private takeover of Tesla; he settled with the SEC but did not admit guilt, and he temporarily stepped down from his Tesla chairmanship. In 2019, he won a defamation case brought against him by a British caver who had advised in the Tham Luang cave rescue
"""
And I get the following non-singleton clusters:
[((0, 12), 'Elon Reeve Musk FRS ( born June 28 , 1971 )'), ((21, 21), 'He'), ((84, 84), 'Musk'), ((115, 115), 'Musk'), ((128, 128), 'he'), ((132, 132), 'He'), ((151, 151), 'his'), ((157, 157), 'He'), ((178, 178), 'he'), ((189, 189), 'He'), ((218, 218), 'his'), ((241, 241), 'Musk'), ((282, 282), 'Musk'), ((297, 297), 'he'), ((308, 308), 'he'), ((330, 330), 'He'), ((350, 350), 'he'), ((374, 374), 'he'), ((396, 396), 'he'), ((427, 427), 'He'), ((445, 445), 'Musk'), ((484, 484), 'Musk'), ((513, 513), 'he'), ((530, 530), 'he'), ((541, 541), 'he'), ((553, 553), 'he'), ((558, 558), 'his'), ((566, 566), 'he'), ((573, 573), 'him')]
[((32, 32), 'SpaceX'), ((284, 303), 'SpaceX , an aerospace manufacturer and space transport services company , of which he serves as CEO and Chief Engineer')]
[((211, 216), 'the web software company Zip2'), ((223, 224), 'The startup')]
[((235, 235), '1999'), ((237, 239), 'The same year')]
[((245, 260), 'online bank X.com , which merged with Confinity in 2000 to form PayPal'), ((262, 263), 'The company')]
[((269, 269), '2002'), ((280, 280), '2002')]
[((314, 328), 'electric vehicle manufacturer Tesla Motors , Inc. ( now Tesla , Inc. )'), ((332, 332), 'its'), ((365, 365), 'Tesla'), ((539, 539), 'Tesla'), ((559, 559), 'Tesla')]
[((517, 525), 'the US Securities and Exchange Commission ( SEC )'), ((544, 545), 'the SEC')]
The tokenized list is this:
['Elon', 'Reeve', 'Musk', 'FRS', '(', 'born', 'June', '28', ',', '1971', ')', 'is', 'a', 'business', 'magnate', 'and', 'investor', '.', 'He', 'is', 'the', 'founder', ',', 'CEO', ',', 'and', 'Chief', 'Engineer', 'at', 'SpaceX', ';', 'angel', 'investor', ',', 'CEO', ',', 'and', 'Product', 'Architect', 'of', 'Tesla', ',', 'Inc.', ';', 'founder', 'of', 'The', 'Boring', 'Company', ';', 'and', 'co', '-', 'founder', 'of', 'Neuralink', 'and', 'OpenAI', '.', 'With', 'an', 'estimated', 'net', 'worth', 'of', 'around', 'US$', '203', 'billion', 'as', 'of', 'June', '2022,[4', ']', 'Musk', 'is', 'the', 'wealthiest', 'person', 'in', 'the', 'world', 'according', 'to', 'both', 'the', 'Bloomberg', 'Billionaires', 'Index', 'and', 'Forbes', "'", 'real', '-', 'time', 'billionaires', 'list.[5][6', ']', '\n\n', 'Musk', 'was', 'born', 'to', 'White', 'South', 'African', 'parents', 'in', 'Pretoria', ',', 'where', 'he', 'grew', 'up', '.', 'He', 'briefly', 'attended', 'the', 'University', 'of', 'Pretoria', 'before', 'moving', 'to', 'Canada', 'at', 'age', '17', ',', 'acquiring', 'citizenship', 'through', 'his', 'Canadian', '-', 'born', 'mother', '.', 'He', 'matriculated', 'at', 'Queen', "'s", 'University', 'and', 'transferred', 'to', 'the', 'University', 'of', 'Pennsylvania', 'two', 'years', 'later', ',', 'where', 'he', 'received', 'bachelor', "'s", 'degrees', 'in', 'Economics', 'and', 'Physics', '.', 'He', 'moved', 'to', 'California', 'in', '1995', 'to', 'attend', 'Stanford', 'University', 'but', 'decided', 'instead', 'to', 'pursue', 'a', 'business', 'career', ',', 'co', '-', 'founding', 'the', 'web', 'software', 'company', 'Zip2', 'with', 'his', 'brother', 'Kimbal', '.', 'The', 'startup', 'was', 'acquired', 'by', 'Compaq', 'for', '$', '307', 'million', 'in', '1999', '.', 'The', 'same', 'year', ',', 'Musk', 'co', '-', 'founded', 'online', 'bank', 'X.com', ',', 'which', 'merged', 'with', 'Confinity', 'in', '2000', 'to', 'form', 'PayPal', '.', 'The', 'company', 'was', 'bought', 'by', 'eBay', 
'in', '2002', 'for', '$', '1.5', 'billion', '.', '\n\n', 'In', '2002', ',', 'Musk', 'founded', 'SpaceX', ',', 'an', 'aerospace', 'manufacturer', 'and', 'space', 'transport', 'services', 'company', ',', 'of', 'which', 'he', 'serves', 'as', 'CEO', 'and', 'Chief', 'Engineer', '.', 'In', '2004', ',', 'he', 'was', 'an', 'early', 'investor', 'in', 'electric', 'vehicle', 'manufacturer', 'Tesla', 'Motors', ',', 'Inc.', '(', 'now', 'Tesla', ',', 'Inc.', ')', '.', 'He', 'became', 'its', 'chairman', 'and', 'product', 'architect', ',', 'eventually', 'assuming', 'the', 'position', 'of', 'CEO', 'in', '2008', '.', 'In', '2006', ',', 'he', 'helped', 'create', 'SolarCity', ',', 'a', 'solar', 'energy', 'company', 'that', 'was', 'later', 'acquired', 'by', 'Tesla', 'and', 'became', 'Tesla', 'Energy', '.', 'In', '2015', ',', 'he', 'co', '-', 'founded', 'OpenAI', ',', 'a', 'nonprofit', 'research', 'company', 'promoting', 'friendly', 'artificial', 'intelligence', '(', 'AI', ')', '.', 'In', '2016', ',', 'he', 'co', '-', 'founded', 'Neuralink', ',', 'a', 'neurotechnology', 'company', 'focused', 'on', 'developing', 'brain', '–', 'computer', 'interfaces', ',', 'and', 'founded', 'The', 'Boring', 'Company', ',', 'a', 'tunnel', 'construction', 'company', '.', 'He', 'agreed', 'to', 'purchase', 'the', 'major', 'American', 'social', 'networking', 'service', 'Twitter', 'in', '2022', 'for', '$', '44', 'billion', '.', 'Musk', 'has', 'proposed', 'the', 'Hyperloop', ',', 'a', 'high', '-', 'speed', 'vactrain', 'transportation', 'system', ',', 'and', 'is', 'the', 'president', 'of', 'the', 'Musk', 'Foundation', ',', 'an', 'organization', 'which', 'donates', 'to', 'scientific', 'research', 'and', 'education', '.', '\n\n', 'Musk', 'has', 'been', 'criticized', 'for', 'making', 'unscientific', 'and', 'controversial', 'statements', ',', 'such', 'as', 'spreading', 'misinformation', 'about', 'the', 'COVID-19', 'pandemic', '.', 'In', '2018', ',', 'he', 'was', 'sued', 'by', 'the', 'US', 'Securities', 'and', 
'Exchange', 'Commission', '(', 'SEC', ')', 'for', 'falsely', 'tweeting', 'that', 'he', 'had', 'secured', 'funding', 'for', 'a', 'private', 'takeover', 'of', 'Tesla', ';', 'he', 'settled', 'with', 'the', 'SEC', 'but', 'did', 'not', 'admit', 'guilt', ',', 'and', 'he', 'temporarily', 'stepped', 'down', 'from', 'his', 'Tesla', 'chairmanship', '.', 'In', '2019', ',', 'he', 'won', 'a', 'defamation', 'case', 'brought', 'against', 'him', 'by', 'a', 'British', 'caver', 'who', 'had', 'advised', 'in', 'the', 'Tham', 'Luang', 'cave', 'rescue', '\n ']
As can be seen, the 2nd item in the first cluster refers to 'he' with an index of 21, whereas its index in the orig_tokens list is 18.
Can you please explain why there is this misalignment?
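My understanding (an assumption worth confirming against the notebook) is that the predicted indices are over the model's subword tokens, not over orig_tokens: special tokens and subword splits shift everything after them. The repo keeps a subtoken-to-token map for this purpose — elsewhere in the code `subtoken_map[ment_start]` and `subtoken_map[ment_end]` are used to realign spans. A toy sketch of the conversion:

```python
def to_orig_span(subtoken_map, ment_start, ment_end):
    """Convert a mention span over subword tokens to a span over the
    original tokens using a subtoken -> original-token index map."""
    return subtoken_map[ment_start], subtoken_map[ment_end]

# Toy example: suppose "FRS" is split into two subwords, so every index
# after it is shifted by one; the map realigns predicted spans.
orig_tokens = ["Elon", "Reeve", "Musk", "FRS", "("]
subtoken_map = [0, 1, 2, 3, 3, 4]  # subwords 3 and 4 both map to "FRS"
print(to_orig_span(subtoken_map, 4, 5))  # (3, 4)
```

If the notebook exposes the document's subtoken_map, applying it to the predicted `(21, 21)` span should yield the expected index 18.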
Thanks for your great work, and I have a few questions about reproducing the results.
I followed your steps under "Install Requirements", downloaded your pre-trained models, and processed the data from your Google Drive.
The code ran smoothly, but the pre-trained models behave somewhat strangely; I hope to get your help.
I tested the downloaded pre-trained models from your Google Drive on OntoNotes and LitBank, but the latter gives strange results.
My run command is
python main.py experiment=litbank paths.model_dir=../models/onto_best/ model/doc_encoder/transformer=longformer_ontonotes override_encoder=True train=False
The F1 score (58.7) is pretty low, but the more interesting thing is that the Oracle F-score is 0.825. If I understand correctly, this is the upper bound of the F1 score given the mention detection results. I wonder if there have been some changes in the Hugging Face models, but they are hard to track.
BTW, I also tried the command without override_encoder, but the result is less than 10 points.
[2022-05-10 02:59:40,882][HYDRA] Test
[2022-05-10 02:59:40,882][HYDRA] Dataset: LitBank
[2022-05-10 02:59:40,882][HYDRA] Dataset: litbank, Cluster Threshold: 1
[2022-05-10 02:59:40,882][HYDRA] Evaluating on 10 examples
[2022-05-10 02:59:56,783][HYDRA] F-score: 58.7 , MUC: 80.7, Bcub: 60.8, CEAFE: 34.6
[2022-05-10 02:59:56,785][HYDRA] Oracle F-score: 0.825
[2022-05-10 02:59:56,785][HYDRA] /home/ec2-user/git/incremental-coref/fast-coref/models/onto_best/litbank/test.log.jsonl
[2022-05-10 02:59:56,785][HYDRA] Inference time: 15.83
[2022-05-10 02:59:56,785][HYDRA] Max inference memory: 3.5 GB
[2022-05-10 02:59:56,786][HYDRA] Final performance summary at /home/ec2-user/git/incremental-coref/fast-coref/models/onto_best/litbank/perf.json
[2022-05-10 02:59:56,786][HYDRA] Performance summary file: /home/ec2-user/git/incremental-coref/fast-coref/models/onto_best/perf.json
On this line:
I get an error that the parameter lru_list is unexpected. I fixed it by adding **ignore as an extra parameter to initialize_memory() in base_memory.py; is this a proper fix, or should it be fixed another way?
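For clarity, this is the pattern being described — a catch-all keyword parameter that silently swallows arguments the method does not use. The class and signature below are a simplified illustration, not the actual code in base_memory.py:

```python
class BaseMemory:
    def initialize_memory(self, mem=None, **ignore):
        # **ignore accepts keyword arguments (e.g. lru_list) that some
        # callers pass but this implementation does not use.
        return mem if mem is not None else []

memory = BaseMemory()
print(memory.initialize_memory(mem=[1, 2], lru_list=[0]))  # [1, 2]
```

It unblocks the call, but a cleaner fix is usually to make the caller and callee signatures agree, since silently discarding arguments can hide real bugs (e.g. an LRU list that is never actually used).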
A question concerning the evaluation metrics mentioned in the papers; of course, I might have misunderstood how this works.
I have read that CEAF works under the restriction that "each key entity should be mapped to exactly one reference entity and vice versa". From what I understand, fast-coref might return overlapping chains that refer to different entities, therefore not respecting this restriction.
Would I be correct in assuming that CEAF would not give valid results on fast-coref output?
Hi,
I am using your GitHub notebook for coreference resolution. Would it be possible for you to provide code that replaces pronouns with the coreference outputs (cluster elements)?
Thanks
Can this repo be used for conversational coreference?
Tiny fix needed here:
it needs to append (subtoken_map[ment_start], subtoken_map[ment_end])
In the file: src/model/mention_proposal/utils.py
Line 9: sort_scores = ment_starts + 1e-5 * ment_starts
Should it be + 1e-5 * ment_ends, so that shorter mentions (starting at the same index) come first? Or is it just a coding convenience?
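The suspected typo is easy to check in isolation: with `ment_starts + 1e-5 * ment_starts` the tie-break term is constant across mentions sharing a start, whereas `ment_starts + 1e-5 * ment_ends` orders same-start mentions shortest-first. A toy demonstration (plain Python, standing in for the tensor version):

```python
def sort_key_buggy(start, end):
    return start + 1e-5 * start   # end plays no role: ties stay unbroken

def sort_key_fixed(start, end):
    return start + 1e-5 * end     # same start => shorter mention first

mentions = [(3, 9), (3, 4), (1, 2)]
print(sorted(mentions, key=lambda m: sort_key_fixed(*m)))
# [(1, 2), (3, 4), (3, 9)]
```

With the buggy key, `(3, 9)` and `(3, 4)` get identical scores, so their relative order is left to the sorting backend rather than chosen deliberately.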
I am modifying a few clusters from the litbank dataset. How do I create the jsonlines for my custom litbank dataset like the one you have in your google drive?
And once I have the jsonlines, I should then run the following, right?
python main.py experiment=litbank trainer.label_smoothing_wt=0.0
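For the jsonlines part, a sketch of writing and reading such a file. The field names below follow the common "minified OntoNotes" convention many coref codebases use (doc_key / sentences / clusters) — they are my assumption here, so verify them against one of the files in the Google Drive before training:

```python
import json, os, tempfile

def write_jsonlines(path, docs):
    """Write one JSON document per line."""
    with open(path, "w") as f:
        for doc in docs:
            f.write(json.dumps(doc) + "\n")

def read_jsonlines(path):
    """Read a jsonlines file back into a list of dicts."""
    with open(path) as f:
        return [json.loads(line) for line in f]

doc = {
    "doc_key": "my_litbank_doc_0",
    "sentences": [["Alice", "met", "Bob", "."]],
    "clusters": [[[0, 0]], [[2, 2]]],  # clusters of [start, end] token spans
}
path = os.path.join(tempfile.mkdtemp(), "train.jsonlines")
write_jsonlines(path, [doc])
```

Matching the exact schema of the provided files (including any speaker or subtoken fields) matters more than the writer itself; diffing your output against one provided document is a quick sanity check.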
The PreCo dataset (https://preschool-lab.github.io/PreCo/) seems to be unavailable: no email was received for many days after filling in the details on the page. Though I know this is not an issue with this repository, I could not find the dataset anywhere and thought maybe the authors or the community could help me. Can someone point me to a version of the dataset available elsewhere?
Thank you
Hi @shtoshni , thanks for the great work.
I'm new to this field, but coreference resolution is one of the intermediate steps in my research. My data are different from the publicly available datasets, which means I need to annotate the data and fine-tune an existing model to fit them.
I was wondering if this model can be fine-tuned on a custom dataset. If yes, could you provide any tutorial/suggestions on how to fine-tune the model, and on what the data should look like? Any suggestions would greatly save me the time of reading the code.
Hi @shtoshni,
I am trying to fine-tune this model to a custom dataset as I mentioned in issue #11.
Before doing so, I used the datasets you provided (litbank and quizbowl) to simulate this process. However, the (CoNLL) F1 score decreased a lot after I fine-tuned the model. Do you have any idea why this happens?
Here are the experiment details:
[joint_best] = longformer_coreference_joint encoder + joint_best model
[ontonotes_best] = longformer_coreference_ontonotes encoder + ontonotes_best model
[Pretrained encoder + initial model] = pretrained longformer_coreference_joint encoder + initial model
We didn't fine-tune the doc encoder in any experiments (config.model.doc_encoder.finetune=false).
All the configs remain almost the same except for the ones we presented in the hyperparameters column.
As you can see in No.1, I used the joint_best model and evaluated it on the litbank (split No.0) dataset. The CoNLL F1 score was exactly what we were expecting.
In No.2, I used only 1 litbank record to fine-tune the model. The final F1 score was still good, which I think proves that the code is working fine. (However, the F1 score during training decreased a lot, and I don't know why.)
Then, in No.3, I fine-tuned the joint_best model on the litbank_0 dataset with different settings, and the final performance decreased a lot. I made a few assumptions about this decrease, such as overfitting (as I was using litbank to fine-tune the joint_best model) and a wrong gradient update direction (as I used the config eval_per_k_steps=10).
To test the wrong-gradient-direction assumption, I increased eval_per_k_steps to 180 and 400, but the results remained almost the same (see No.4 and No.5).
Considering the litbank dataset was used to train the joint_best model, I tried to fine-tune the model with a different dataset to avoid overfitting (see No.6-11). I used the quizbowl dataset and split the 400 test records into train/dev/test for fine-tuning, like the litbank dataset. This time, the result improved after fine-tuning (No.6 vs No.9). But the F1 scores were quite low.
I also tried to train the model from scratch (see No.12-13). In these two experiments, we were using the pre-trained longformer_coreference_joint as an encoder and didn't fine-tune it. The results were very close to No.3-5.
In short, there are two questions:
I fine-tuned the pre-trained model on litbank. The model performance was much lower than your best model's. I was wondering why this performance gap happens. Is there anything I missed or misconfigured?
And I was also wondering why the results computed by the evaluation metrics you implemented in your program differ from the CoNLL metrics, for example:
[2022-08-01 22:39:40,706][HYDRA] Test
[2022-08-01 22:39:40,706][HYDRA] Dataset: LitBank
[2022-08-01 22:39:40,706][HYDRA] Dataset: litbank, Cluster Threshold: 1
[2022-08-01 22:39:40,706][HYDRA] Evaluating on 10 examples
[2022-08-01 22:40:57,327][HYDRA] F-score: 74.0 , MUC: 86.6, Bcub: 74.7, CEAFE: 60.8
Using CoNLL scorer
...
[2022-08-01 22:40:58,602][HYDRA] (CoNLL) F-score : 61.8, MUC: 86.6, Bcub: 68.1, CEAFE: 30.6
...
[2022-08-01 22:40:58,602][HYDRA] Oracle F-score: 0.875
Hi @shtoshni, thank you for this project.
I was wondering if there are any future plans to organize this project as a Python package?