amazon-science / refined

ReFinED is an efficient and accurate entity linking (EL) system.

License: Other

Python 99.79% Shell 0.21%
entity-extraction entity-linking entity-resolution nlp pytorch

refined's Issues

Instructions for training a new model

Hi team, thanks for sharing your great work.
Could you please share step-by-step instructions for preparing the Wikipedia dataset and training a new Wikipedia model?
Best,

How do I add new entities to the EL system?

Thanks for the fantastic paper and this repository. The code certainly lives up to the claims made in the paper and quickly processes a piece of text!

Is there a way to add new entities to the EL system? Especially if one can't calculate the entity-mention prior?

Thanks!

Code for Entity_Linking

Hi, congratulations on your work. I am interested in experimenting with the entity linking part of this paper. Any details on when we can expect the code for it?

How to optimize GPU usage?

I am running the model on a batch of 500 articles and it takes around 10 to 15 GB of GPU memory.
Is there a way to optimize the GPU usage?
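Not an official fix, but a minimal sketch of one workaround, assuming the Refined.from_pretrained / process_text API referenced elsewhere in these issues (model name and chunking are illustrative): linking the articles in small chunks instead of a single 500-article batch keeps peak GPU memory bounded by the chunk size rather than by the full batch.

from refined.inference.processor import Refined

# Illustrative model/entity-set choice; substitute the model you actually use.
refined = Refined.from_pretrained(model_name="wikipedia_model_with_numbers",
                                  entity_set="wikipedia")

articles = []  # placeholder: fill with the 500 article texts

all_spans = []
for article in articles:
    # Linking one article (or a small chunk) at a time trades some throughput
    # for a much smaller peak GPU memory footprint than one large batch.
    all_spans.append(refined.process_text(article))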

Inefficient Process for Adding New Entities in ReFinED

When trying to add a dozen more entities by running preprocess_all.py, the process requires downloading over 100GB of data, which is highly inefficient for such a small addition.

This model cannot be considered to have zero-shot capabilities until there is a streamlined, bloat-free script for adding new entities into the system.

Steps to Reproduce:

  1. Clone the repository and set up the environment as per the documentation.
  2. Attempt to add a dozen new entities by running preprocess_all.py.
  3. Observe the data download requirements and inefficiency.

Expected Behavior:

There should be a lightweight and efficient process for adding new entities without requiring extensive data downloads.

Actual Behavior:

Adding new entities requires downloading over 100GB of data, making the process highly inefficient and cumbersome.

Environment:

Google Colab
Operating System: Linux
Python Version: 3.10

Severity:

High - This issue severely impacts the usability and efficiency of adding new entities to the system and needs immediate attention.

Training dataset

Can you please share the processed Wikipedia training dataset?

Upload Multilingual Model

Dear developers,
I was curious whether you plan to upload the multilingual mReFinED model for inference and fine-tuning.
Looking forward to your reply.

Best,
Cristian

Issue with loading Additional Entities

I have tried to load additional entities as per the README by running preprocess_all. Everything appears to run fine; however, when I try to load the ReFinED model afterwards with something like:

refined = Refined(
    model_file_or_model=data_dir + "/wikipedia_model_with_numbers/model.pt",
    model_config_file_or_model_config=data_dir + "/wikipedia_model_with_numbers/config.json",
    entity_set="wikidata",
    data_dir=data_dir,
    use_precomputed_descriptions=False,
    download_files=False,
    preprocessor=preprocessor,
)

I get an error like:

Traceback (most recent call last):
  File "/home/azureuser/Hafnia/email_ee/email_refined.py", line 91, in <module>
    refined = Refined(
  File "/home/azureuser/ReFinED/src/refined/inference/processor.py", line 100, in __init__
    self.model = RefinedModel.from_pretrained(
  File "/home/azureuser/ReFinED/src/refined/model_components/refined_model.py", line 643, in from_pretrained
    model.load_state_dict(checkpoint, strict=False)
  File "/home/azureuser/.pyenv/versions/venv3108/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1671, in load_state_dict
    raise RuntimeError('Error(s) in loading state_dict for {}:\n\t{}'.format(
RuntimeError: Error(s) in loading state_dict for RefinedModel:
        size mismatch for entity_typing.linear.weight: copying a param with shape torch.Size([1369, 768]) from checkpoint, the shape in current model is torch.Size([1447, 768]).
        size mismatch for entity_typing.linear.bias: copying a param with shape torch.Size([1369]) from checkpoint, the shape in current model is torch.Size([1447]).
        size mismatch for entity_disambiguation.classifier.weight: copying a param with shape torch.Size([1, 1372]) from checkpoint, the shape in current model is torch.Size([1, 1450]).

To the best of my understanding, this is because the number of classes in the Wikidata dump has changed since the original model was trained (class_to_label.json now has 1446 entries). Is there any way to accommodate this without completely retraining the model?
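For what it's worth, a quick way to confirm this diagnosis is to compare the class count in the regenerated class_to_label.json with the shapes stored in the released checkpoint; a minimal sketch, assuming the flat state_dict layout visible in the traceback (file paths are illustrative):

import json
import torch

# Typing-head shape stored in the released checkpoint (1369 classes in the report above).
checkpoint = torch.load("wikipedia_model_with_numbers/model.pt", map_location="cpu")
print(checkpoint["entity_typing.linear.weight"].shape)

# Number of classes produced by the new preprocessing run (1446 in the report above).
with open("wikipedia_data/class_to_label.json") as f:
    print(len(json.load(f)))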

Approach for limited labelled data

Is there a script or a roadmap I can use to try zero-shot inference with new entities? I already have annotated data, but not much (around 3000 examples, and unbalanced). I could maybe fine-tune, but training from scratch would not be possible for me given I don't have a big dataset. What approach should I follow?

cannot find the file "chosen_classes.txt"

I am trying to add additional entities without retraining. The script preprocess_all.py fails:
Ran out of 'useful' classes to select. So using number the 153 chosen classes. Note that this is not expected to happen. It likely indicates that the Wikidata dump or Wikipedia was dump was not downloaded and parsed correctly.
Traceback (most recent call last):
  File "/mnt/nlu/users/yasser_hifny/gkqa/refined/ReFinED/src/refined/offline_data_generation/preprocess_all.py", line 364, in <module>
    main()
  File "/mnt/nlu/users/yasser_hifny/gkqa/refined/ReFinED/src/refined/offline_data_generation/preprocess_all.py", line 244, in main
    select_classes(resources_dir=OUTPUT_PATH, is_test=debug)
  File "/mnt/nlu/users/yasser_hifny/gkqa/refined/ReFinED/src/refined/offline_data_generation/class_selection.py", line 152, in select_classes
    os.rename(os.path.join(resources_dir, 'chosen_classes.txt.part'),
FileNotFoundError: [Errno 2] No such file or directory: 'data/chosen_classes.txt.part' -> 'data/chosen_classes.txt'

I am not able to find the file "chosen_classes.txt" in the original data folder:

additional_data:

datasets:

roberta-base:
config.json  merges.txt  pytorch_model.bin  vocab.json

wikipedia_data:
class_to_idx.json  class_to_label.json  descriptions_tns.pt  human_qcodes.json  nltk_sentence_splitter_english.pickle  pem.lmdb  qcode_to_class_tns_6269457-138.np  qcode_to_idx.lmdb  qcode_to_wiki.lmdb  subclasses.lmdb

wikipedia_model:
config.json  model.pt

wikipedia_model_with_numbers:
config.json  model.pt

How can I find it? Thanks in advance.

Support for AWS inferentia

Hi team, thanks for sharing the great work here.

Will there be an integration with AWS Inferentia?

thanks!

Missing Wikipedia_data folder after running preprocess_all?

When I ran preprocess_all, it completed successfully with what I think should be all the folders in ./organised_data_dir:

  • additional data
  • datasets
  • roberta-base
  • wikidata_data

However, I don't see the models or data files for the Wikipedia data. Where are they stored/created, or did something not get processed during the preprocess_all script?

Thank you!

Early stopping in the preprocessing step

In class_selection.py (https://github.com/amazon-science/ReFinED/blob/main/src/refined/offline_data_generation/class_selection.py#L147), when an article contains no entity spans, the preprocessing step (https://github.com/amazon-science/ReFinED/blob/main/src/refined/offline_data_generation/preprocess_all.py#L244) stops immediately instead of iterating over all of the training data (wikipedia_links_aligned.json).
If this is indeed an issue, let me know and I will send a PR with a fix.

Same wikipedia entity title for all top k candidates

Hello, I found what I believe is an issue with the method retrieving the top-k candidate entities. When asking for top_k_predicted_entities from a span object, all candidates show the same Wikipedia entity title (equal to the title of the selected, top-scoring entity), even though the Wikidata IDs actually point to different entities.

For example, when asking for the top-k predicted entities for "Barack Obama", I get:

[(Entity(wikidata_entity_id=Q76, wikipedia_entity_title=Barack Obama), 1.0),
 (Entity(wikipedia_entity_title=Barack Obama), 0.0),
 (Entity(wikidata_entity_id=Q649593, wikipedia_entity_title=Barack Obama), 0.0),
 (Entity(wikidata_entity_id=Q16847466, wikipedia_entity_title=Barack Obama), 0.0),
 (Entity(wikidata_entity_id=Q4858115, wikipedia_entity_title=Barack Obama), 0.0),
 (Entity(wikidata_entity_id=Q3526570, wikipedia_entity_title=Barack Obama), 0.0),
 (Entity(wikidata_entity_id=Q50303833, wikipedia_entity_title=Barack Obama), 0.0),
 (Entity(wikidata_entity_id=Q8564528, wikipedia_entity_title=Barack Obama), 0.0),
 (Entity(wikidata_entity_id=Q2935433, wikipedia_entity_title=Barack Obama), 0.0),
 (Entity(wikidata_entity_id=Q4858123, wikipedia_entity_title=Barack Obama), 0.0),
 (Entity(wikidata_entity_id=Q4858105, wikipedia_entity_title=Barack Obama), 0.0),
 (Entity(wikidata_entity_id=Q1379733, wikipedia_entity_title=Barack Obama), 0.0),
 (Entity(wikidata_entity_id=Q45578, wikipedia_entity_title=Barack Obama), 0.0),
 (Entity(wikidata_entity_id=Q5842038, wikipedia_entity_title=Barack Obama), 0.0)]

even though, for example, id=Q16847466 corresponds to the Wikipedia article "Efforts to impeach Barack Obama".

Am I missing something?
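For context, a minimal snippet that exercises the same call, assuming the Refined.from_pretrained / process_text API referenced elsewhere in these issues (model name and input sentence are illustrative):

from refined.inference.processor import Refined

refined = Refined.from_pretrained(model_name="wikipedia_model_with_numbers",
                                  entity_set="wikipedia")

spans = refined.process_text("Barack Obama was the 44th President of the United States.")
for span in spans:
    # Each candidate tuple should carry its own Wikipedia title, but the output
    # above shows the top-scoring entity's title repeated for every candidate.
    print(span.top_k_predicted_entities)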

TypeError: cannot pickle 'Environment' object

Hi, I am using Python 3.12.2 and torch 2.2.2 on macOS 12.7.4.
The ReFinED version is 1.0.
When trying to fine-tune with the following command, the error below occurred:

python src/refined/training/fine_tune/fine_tune.py --experiment_name test

TypeError: cannot pickle 'Environment' object

Could you tell me what could be done to work around this issue?

Thanks.

Will you release the KBED source code?

Thanks for your awesome work "Improving Entity Disambiguation by Reasoning over a Knowledge Base" (KBED). I am trying to reproduce the results; will you release the KBED source code?

Some questions about training dataset

Great work!

I executed the following command and obtained the data file named wikipedia_links_aligned_spans.json in the folder ~/.cache/refined/datasets.

python3 src/refined/training/train/train.py --experiment_name test

I have two questions regarding this file:

  • Is wikipedia_links_aligned_spans.json the training data?
  • If so, which fields are used for training? I found three fields in wikipedia_links_aligned_spans.json: hyperlinks_clean, hyperlinks, and predicted_spans. I'm not familiar with these three fields and I'm unsure how to proceed with obtaining the training data.

Thanks !

Weird runtime variations - are there any caching effects?

Dear Tom,

First of all, thank you for publishing this awesome and easy-to-use entity linker.

I've been running experiments with ReFinED for a while but only started using it on GPU a few days ago. I noticed some weird variations in the runtime on GPU (maybe they were there on CPU as well, and I didn't pay close attention to the runtime before, but I think I would have noticed):
If I run ReFinED over a benchmark for the first time (or for the first time after linking over several other benchmarks), it takes quite a while (in fact at least as long as on my CPU-only machine: 76s for the Wiki-Fair benchmark). If I run it again immediately on the same benchmark it is lightning fast and links the whole thing in 4s.

Is there any caching used that might explain this behavior? If so, can I disable it to get comparable runtime measurements?

The loading of the model does not count towards my time measurement. The model is loaded before the measurement is started:

self.refined = Refined.from_pretrained(model_name=model_name, entity_set=entity_set)

I'm using ReFinED from inside the ELEVANT entity linking evaluation tool with the AIDA model and the 33M entity set.
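Not an answer to the caching question, but for comparable GPU timings it usually helps to synchronize with the device before reading the clock; a minimal measurement sketch (model name and benchmark loading are placeholders):

import time
import torch
from refined.inference.processor import Refined

refined = Refined.from_pretrained(model_name="aida_model", entity_set="wikipedia")  # illustrative

texts = []  # placeholder: the benchmark documents

torch.cuda.synchronize()  # make sure any prior GPU work (e.g. warm-up) has finished
start = time.perf_counter()
for text in texts:
    refined.process_text(text)
torch.cuda.synchronize()  # wait for queued CUDA kernels before stopping the clock
print(f"linking took {time.perf_counter() - start:.2f} s")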

Thanks in advance,
Natalie
