amazon-science / refined

ReFinED is an efficient and accurate entity linking (EL) system.

License: Other

Python 99.79% Shell 0.21%
entity-extraction entity-linking entity-resolution nlp pytorch

refined's Issues

Instructions for training a new model

Hi team, thanks for sharing your great work.
Could you please share step-by-step instructions for preparing the Wikipedia dataset and training a new Wikipedia model?
Best,

How do I add new entities to the EL system?

Thanks for the fantastic paper and this repository. The code certainly lives up to the claims made in the paper and quickly processes a piece of text!

Is there a way to add new entities to the EL system? Especially if one can't calculate the entity-mention prior?

Thanks!

Code for Entity_Linking

Hi, congratulations on your work. I am interested in experimenting with the entity linking part of this paper. Any details on when we can expect the code for it?

How to optimize GPU usage?

I am running the model on a batch of 500 articles and it takes around 10 to 15 GB of GPU memory.
Is there a way to optimize the GPU usage?
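Not an official fix, but a minimal sketch of one workaround, assuming the Refined.from_pretrained / process_text API referenced elsewhere in these issues (model name and chunking are illustrative): linking the articles in small chunks instead of a single 500-article batch keeps peak GPU memory bounded by the chunk size rather than by the full batch.

from refined.inference.processor import Refined

# Illustrative model/entity-set choice; substitute the model you actually use.
refined = Refined.from_pretrained(model_name="wikipedia_model_with_numbers",
                                  entity_set="wikipedia")

articles = []  # placeholder: fill with the 500 article texts

all_spans = []
for article in articles:
    # Linking one article (or a small chunk) at a time trades some throughput
    # for a much smaller peak GPU memory footprint than one large batch.
    all_spans.append(refined.process_text(article))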

Inefficient Process for Adding New Entities in ReFinED

When trying to add a dozen more entities by running preprocess_all.py, the process requires downloading over 100GB of data, which is highly inefficient for such a small addition.

This model cannot be considered to have zero-shot capabilities until there is a streamlined, bloat-free script for adding new entities into the system.

Steps to Reproduce:

  1. Clone the repository and set up the environment as per the documentation.
  2. Attempt to add a dozen new entities by running preprocess_all.py.
  3. Observe the data download requirements and inefficiency.

Expected Behavior:

There should be a lightweight and efficient process for adding new entities without requiring extensive data downloads.

Actual Behavior:

Adding new entities requires downloading over 100GB of data, making the process highly inefficient and cumbersome.

Environment:

Google Colab
Operating System: Linux
Python Version: 3.10

Severity:

High - This issue severely impacts the usability and efficiency of adding new entities to the system and needs immediate attention.

Training dataset

Can you please share the processed Wikipedia training dataset?

Upload Multilingual Model

Dear developers,
I was curious whether you plan to upload the multilingual mReFinED model for inference and fine-tuning.
Looking forward to your reply.

Best,
Cristian

Issue with loading Additional Entities

I have tried to load additional entities as per the README by running preprocess_all. Everything appears to run fine; however, when I try to load the ReFinED model afterwards with something like:

refined = Refined(
    model_file_or_model=data_dir + "/wikipedia_model_with_numbers/model.pt",
    model_config_file_or_model_config=data_dir + "/wikipedia_model_with_numbers/config.json",
    entity_set="wikidata",
    data_dir=data_dir,
    use_precomputed_descriptions=False,
    download_files=False,
    preprocessor=preprocessor,
)

I get an error like:

Traceback (most recent call last):
  File "/home/azureuser/Hafnia/email_ee/email_refined.py", line 91, in <module>
    refined = Refined(
  File "/home/azureuser/ReFinED/src/refined/inference/processor.py", line 100, in __init__
    self.model = RefinedModel.from_pretrained(
  File "/home/azureuser/ReFinED/src/refined/model_components/refined_model.py", line 643, in from_pretrained
    model.load_state_dict(checkpoint, strict=False)
  File "/home/azureuser/.pyenv/versions/venv3108/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1671, in load_state_dict
    raise RuntimeError('Error(s) in loading state_dict for {}:\n\t{}'.format(
RuntimeError: Error(s) in loading state_dict for RefinedModel:
        size mismatch for entity_typing.linear.weight: copying a param with shape torch.Size([1369, 768]) from checkpoint, the shape in current model is torch.Size([1447, 768]).
        size mismatch for entity_typing.linear.bias: copying a param with shape torch.Size([1369]) from checkpoint, the shape in current model is torch.Size([1447]).
        size mismatch for entity_disambiguation.classifier.weight: copying a param with shape torch.Size([1, 1372]) from checkpoint, the shape in current model is torch.Size([1, 1450]).

To the best of my understanding, this is because the number of classes in the Wikidata dump has changed since the original model was trained (class_to_label.json now has 1446 entries). Is there any way to accommodate this without completely retraining the model?
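For what it's worth, a quick way to confirm this diagnosis is to compare the class count in the regenerated class_to_label.json with the shapes stored in the released checkpoint; a minimal sketch, assuming the flat state_dict layout visible in the traceback (file paths are illustrative):

import json
import torch

# Typing-head shape stored in the released checkpoint (1369 classes in the report above).
checkpoint = torch.load("wikipedia_model_with_numbers/model.pt", map_location="cpu")
print(checkpoint["entity_typing.linear.weight"].shape)

# Number of classes produced by the new preprocessing run (1446 in the report above).
with open("wikipedia_data/class_to_label.json") as f:
    print(len(json.load(f)))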

Approach for limited labelled data

Is there a script or a roadmap I can use to try zero-shot inference with new entities? I already have annotated data, but not much (around 3000 examples, and unbalanced). I could maybe fine-tune, but training from scratch would not be possible for me given I don't have a big dataset. What approach should I follow?

cannot find the file "chosen_classes.txt"

I am trying to add additional entities without retraining. The script preprocess_all.py fails:
Ran out of 'useful' classes to select. So using number the 153 chosen classes. Note that this is not expected to happen. It likely indicates that the Wikidata dump or Wikipedia was dump was not downloaded and parsed correctly.
Traceback (most recent call last):
  File "/mnt/nlu/users/yasser_hifny/gkqa/refined/ReFinED/src/refined/offline_data_generation/preprocess_all.py", line 364, in <module>
    main()
  File "/mnt/nlu/users/yasser_hifny/gkqa/refined/ReFinED/src/refined/offline_data_generation/preprocess_all.py", line 244, in main
    select_classes(resources_dir=OUTPUT_PATH, is_test=debug)
  File "/mnt/nlu/users/yasser_hifny/gkqa/refined/ReFinED/src/refined/offline_data_generation/class_selection.py", line 152, in select_classes
    os.rename(os.path.join(resources_dir, 'chosen_classes.txt.part'),
FileNotFoundError: [Errno 2] No such file or directory: 'data/chosen_classes.txt.part' -> 'data/chosen_classes.txt'

I am not able to find the file "chosen_classes.txt" in the original data folder:

additional_data:

datasets:

roberta-base:
config.json  merges.txt  pytorch_model.bin  vocab.json

wikipedia_data:
class_to_idx.json  class_to_label.json  descriptions_tns.pt  human_qcodes.json  nltk_sentence_splitter_english.pickle  pem.lmdb  qcode_to_class_tns_6269457-138.np  qcode_to_idx.lmdb  qcode_to_wiki.lmdb  subclasses.lmdb

wikipedia_model:
config.json  model.pt

wikipedia_model_with_numbers:
config.json  model.pt

How can I find it? Thanks in advance.

Support for AWS inferentia

Hi team, thanks for sharing the great work here.

Will there be an integration with AWS Inferentia?

thanks!

Missing Wikipedia_data folder after running preprocess_all?

When I ran preprocess_all, it completed successfully with what I think should be all the folders in ./organised_data_dir:

  • additional data
  • datasets
  • roberta-base
  • wikidata_data

However, I don't see the models or data files for the Wikipedia data. Where are they stored/created, or did something not get processed during the preprocess_all script?

Thank you!

Early stopping in the preprocessing step

In class_selection.py (https://github.com/amazon-science/ReFinED/blob/main/src/refined/offline_data_generation/class_selection.py#L147), when an article contains no entity spans, the preprocessing step (https://github.com/amazon-science/ReFinED/blob/main/src/refined/offline_data_generation/preprocess_all.py#L244) stops immediately instead of iterating over all of the training data (wikipedia_links_aligned.json).
If this is indeed an issue, let me know and I will send a PR with a fix.

Same wikipedia entity title for all top k candidates

Hello, I found what I believe is an issue with the method retrieving the top-k candidate entities. When asking for top_k_predicted_entities from a span object, all candidates show the same Wikipedia entity title (equal to the title of the selected, top-scoring entity), even though the Wikidata IDs actually point to different entities.

For example, when asking for the top-k predicted entities for "Barack Obama", I get:

[(Entity(wikidata_entity_id=Q76, wikipedia_entity_title=Barack Obama), 1.0),
 (Entity(wikipedia_entity_title=Barack Obama), 0.0),
 (Entity(wikidata_entity_id=Q649593, wikipedia_entity_title=Barack Obama), 0.0),
 (Entity(wikidata_entity_id=Q16847466, wikipedia_entity_title=Barack Obama), 0.0),
 (Entity(wikidata_entity_id=Q4858115, wikipedia_entity_title=Barack Obama), 0.0),
 (Entity(wikidata_entity_id=Q3526570, wikipedia_entity_title=Barack Obama), 0.0),
 (Entity(wikidata_entity_id=Q50303833, wikipedia_entity_title=Barack Obama), 0.0),
 (Entity(wikidata_entity_id=Q8564528, wikipedia_entity_title=Barack Obama), 0.0),
 (Entity(wikidata_entity_id=Q2935433, wikipedia_entity_title=Barack Obama), 0.0),
 (Entity(wikidata_entity_id=Q4858123, wikipedia_entity_title=Barack Obama), 0.0),
 (Entity(wikidata_entity_id=Q4858105, wikipedia_entity_title=Barack Obama), 0.0),
 (Entity(wikidata_entity_id=Q1379733, wikipedia_entity_title=Barack Obama), 0.0),
 (Entity(wikidata_entity_id=Q45578, wikipedia_entity_title=Barack Obama), 0.0),
 (Entity(wikidata_entity_id=Q5842038, wikipedia_entity_title=Barack Obama), 0.0)]

even though, for example, id=Q16847466 corresponds to the Wikipedia article "Efforts to impeach Barack Obama".

Am I missing something?
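For context, a minimal snippet that exercises the same call, assuming the Refined.from_pretrained / process_text API referenced elsewhere in these issues (model name and input sentence are illustrative):

from refined.inference.processor import Refined

refined = Refined.from_pretrained(model_name="wikipedia_model_with_numbers",
                                  entity_set="wikipedia")

spans = refined.process_text("Barack Obama was the 44th President of the United States.")
for span in spans:
    # Each candidate tuple should carry its own Wikipedia title, but the output
    # above shows the top-scoring entity's title repeated for every candidate.
    print(span.top_k_predicted_entities)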

TypeError: cannot pickle 'Environment' object

Hi, I am using Python 3.12.2 and torch 2.2.2 on macOS 12.7.4.
The ReFinED version is 1.0.
When trying to fine-tune with the following command, the error below occurred:

python src/refined/training/fine_tune/fine_tune.py --experiment_name test

TypeError: cannot pickle 'Environment' object

Could you tell me what could be done to work around this issue?

Thanks.

Will you release the KBED source code?

Thanks for your awesome work "Improving Entity Disambiguation by Reasoning over a Knowledge Base" (KBED). I am trying to reproduce the results; will you release the KBED source code?

Some questions about training dataset

Great work!

I executed the following command and obtained the data file named wikipedia_links_aligned_spans.json in the folder ~/.cache/refined/datasets.

python3 src/refined/training/train/train.py --experiment_name test

I have two questions regarding this file:

  • Is wikipedia_links_aligned_spans.json the training data?
  • If so, which fields are used for training? I found three fields in wikipedia_links_aligned_spans.json: hyperlinks_clean, hyperlinks, and predicted_spans. I'm not familiar with these three fields and I'm unsure how to proceed with obtaining the training data.

Thanks !

Weird runtime variations - are there any caching effects?

Dear Tom,

First of all, thank you for publishing this awesome and easy-to-use entity linker.

I've been running experiments with ReFinED for a while but only started using it on GPU a few days ago. I noticed some weird variations in the runtime on GPU (maybe they were there on CPU as well, and I didn't pay close attention to the runtime before, but I think I would have noticed):
If I run ReFinED over a benchmark for the first time (or for the first time after linking over several other benchmarks), it takes quite a while (in fact at least as long as on my CPU-only machine: 76s for the Wiki-Fair benchmark). If I run it again immediately on the same benchmark it is lightning fast and links the whole thing in 4s.

Is there any caching used that might explain this behavior? If so, can I disable it to get comparable runtime measurements?

The loading of the model does not count towards my time measurement. The model is loaded before the measurement is started:

self.refined = Refined.from_pretrained(model_name=model_name, entity_set=entity_set)

I'm using ReFinED from inside the ELEVANT entity linking evaluation tool with the AIDA model and the 33M entity set.
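Not an answer to the caching question, but for comparable GPU timings it usually helps to synchronize with the device before reading the clock; a minimal measurement sketch (model name and benchmark loading are placeholders):

import time
import torch
from refined.inference.processor import Refined

refined = Refined.from_pretrained(model_name="aida_model", entity_set="wikipedia")  # illustrative

texts = []  # placeholder: the benchmark documents

torch.cuda.synchronize()  # make sure any prior GPU work (e.g. warm-up) has finished
start = time.perf_counter()
for text in texts:
    refined.process_text(text)
torch.cuda.synchronize()  # wait for queued CUDA kernels before stopping the clock
print(f"linking took {time.perf_counter() - start:.2f} s")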

Thanks in advance,
Natalie
