amazon-science / ReFinED
ReFinED is an efficient and accurate entity linking (EL) system.
License: Other
Hi team, thanks for sharing your great work.
Could you please share step-by-step instructions for preparing the Wikipedia dataset and training a new Wikipedia model?
Best,
Thanks for the fantastic paper and this repository. The code certainly lives up to the claims made in the paper and quickly processes a piece of text!
Is there a way to add new entities to the EL system? Especially if one can't calculate the entity-mention prior?
Thanks!
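For context, the basic inference call being praised here looks roughly like this (following the repository README; the model name and example text are just one possible choice):

```python
from refined.inference.processor import Refined

# Load a pretrained model; "wikipedia_model_with_numbers" and the "wikipedia"
# entity set are options shown in the README.
refined = Refined.from_pretrained(model_name="wikipedia_model_with_numbers",
                                  entity_set="wikipedia")

# process_text returns the detected spans with their linked entities.
spans = refined.process_text("England won the FIFA World Cup in 1966.")
print(spans)
```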
Hi, congratulations on your work. I am interested in experimenting with the entity linking part of this paper. Any details on when we can expect the code for it?
I am running the model on a batch of 500 articles and it takes around 10 to 15 GB of GPU memory.
Is there a way to optimize the GPU usage?
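Not an official answer, just a sketch of one way to keep peak GPU memory lower by processing the articles in smaller chunks and releasing cached memory in between (this assumes the single-document process_text API; the chunk size is arbitrary):

```python
import torch
from refined.inference.processor import Refined

refined = Refined.from_pretrained(model_name="wikipedia_model_with_numbers",
                                  entity_set="wikipedia")

def link_in_chunks(documents, chunk_size=50):
    """Process documents chunk by chunk instead of all 500 at once (sketch only)."""
    all_spans = []
    for start in range(0, len(documents), chunk_size):
        for doc in documents[start:start + chunk_size]:
            all_spans.append(refined.process_text(doc))
        # Release cached, unused GPU memory back to the driver between chunks.
        torch.cuda.empty_cache()
    return all_spans
```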
Hello,
there seems to be an issue with running this library on a Windows system.
strftime("%s") should be replaced with an uppercase %S, as per https://stackoverflow.com/questions/41607854/python-the-code-strftimes-errors.
The error happens in resource_management/aws.py, line 49.
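A minimal illustration of the suggested fix (the epoch-timestamp line is only relevant if "%s" was originally meant to produce Unix time):

```python
from datetime import datetime

# Lowercase "%s" is a platform-specific extension and raises a ValueError
# ("Invalid format string") on Windows; uppercase "%S" (seconds, 00-59) is portable.
print(datetime.now().strftime("%S"))

# Portable way to get the Unix epoch timestamp that "%s" produces on Linux:
print(int(datetime.now().timestamp()))
```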
Thanks for a great paper. Looking forward to trying the tool.
When trying to add a dozen more entities by running preprocess_all.py, the process requires downloading over 100GB of data, which is highly inefficient for such a small addition.
This model cannot be considered to have zero-shot capabilities until there is a streamlined, bloat-free script for adding new entities into the system.
Steps to Reproduce:
Expected Behavior:
There should be a lightweight and efficient process for adding new entities without requiring extensive data downloads.
Actual Behavior:
Adding new entities requires downloading over 100GB of data, making the process highly inefficient and cumbersome.
Environment:
Google Colab
Operating System: Linux
Python Version: 3.10
Severity:
High - This issue severely impacts the usability and efficiency of adding new entities to the system and needs immediate attention.
Is there any way to train an entity linking model on top of an existing ReFinED model using a custom knowledge base?
Can you please share the processed Wikipedia training dataset?
Dear developers,
I was curious whether you plan to upload the multilingual mReFinED model for inference and fine-tuning.
Looking forward to your reply.
Best,
Cristian
I have tried to load additional entities as per the README by running preprocess_all. Everything appears to run fine; however, when I try to load the ReFinED model afterwards with something like:
refined = Refined(
    model_file_or_model=data_dir + "/wikipedia_model_with_numbers/model.pt",
    model_config_file_or_model_config=data_dir + "/wikipedia_model_with_numbers/config.json",
    entity_set="wikidata",
    data_dir=data_dir,
    use_precomputed_descriptions=False,
    download_files=False,
    preprocessor=preprocessor,
)
I get an error like:
Traceback (most recent call last):
File "/home/azureuser/Hafnia/email_ee/email_refined.py", line 91, in <module>
refined = Refined(
File "/home/azureuser/ReFinED/src/refined/inference/processor.py", line 100, in __init__
self.model = RefinedModel.from_pretrained(
File "/home/azureuser/ReFinED/src/refined/model_components/refined_model.py", line 643, in from_pretrained
model.load_state_dict(checkpoint, strict=False)
File "/home/azureuser/.pyenv/versions/venv3108/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1671, in load_state_dict
raise RuntimeError('Error(s) in loading state_dict for {}:\n\t{}'.format(
RuntimeError: Error(s) in loading state_dict for RefinedModel:
size mismatch for entity_typing.linear.weight: copying a param with shape torch.Size([1369, 768]) from checkpoint, the shape in current model is torch.Size([1447, 768]).
size mismatch for entity_typing.linear.bias: copying a param with shape torch.Size([1369]) from checkpoint, the shape in current model is torch.Size([1447]).
size mismatch for entity_disambiguation.classifier.weight: copying a param with shape torch.Size([1, 1372]) from checkpoint, the shape in current model is torch.Size([1, 1450]).
To the best of my understanding, this is because the number of classes in the Wikidata dump has changed since the original model was trained (class_to_label.json now has 1446 entries). Is there any way to accommodate this without completely retraining the model?
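Not a confirmed solution, but a common workaround for this kind of size mismatch is to load only the checkpoint tensors whose shapes still match the current model and fine-tune the resized heads afterwards. A generic sketch (load_compatible_weights is a hypothetical helper, not part of ReFinED):

```python
import torch
from torch import nn

def load_compatible_weights(model: nn.Module, checkpoint_path: str) -> None:
    """Load only the checkpoint tensors whose shapes match the current model (sketch)."""
    checkpoint = torch.load(checkpoint_path, map_location="cpu")
    model_state = model.state_dict()
    compatible = {name: tensor for name, tensor in checkpoint.items()
                  if name in model_state and tensor.shape == model_state[name].shape}
    skipped = sorted(set(checkpoint) - set(compatible))
    print(f"Skipping {len(skipped)} mismatched or unknown parameters: {skipped}")
    # The skipped heads (e.g. entity_typing.linear.*) keep their fresh initialisation
    # and would need fine-tuning rather than a full retraining run.
    model.load_state_dict(compatible, strict=False)
```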
Is there a script or a roadmap that I can use to try zero-shot inference with new entities? I already have annotated data, but not much (around 3,000 examples, unbalanced). I could maybe fine-tune, but training from scratch would not be possible for me given that I don't have a big dataset. What approach should I follow?
I am trying to add additional entities without retraining. The script preprocess_all.py fails:
Ran out of 'useful' classes to select, so using the 153 chosen classes. Note that this is not expected to happen. It likely indicates that the Wikidata dump or the Wikipedia dump was not downloaded and parsed correctly.
Traceback (most recent call last):
  File "/mnt/nlu/users/yasser_hifny/gkqa/refined/ReFinED/src/refined/offline_data_generation/preprocess_all.py", line 364, in <module>
    main()
  File "/mnt/nlu/users/yasser_hifny/gkqa/refined/ReFinED/src/refined/offline_data_generation/preprocess_all.py", line 244, in main
    select_classes(resources_dir=OUTPUT_PATH, is_test=debug)
  File "/mnt/nlu/users/yasser_hifny/gkqa/refined/ReFinED/src/refined/offline_data_generation/class_selection.py", line 152, in select_classes
    os.rename(os.path.join(resources_dir, 'chosen_classes.txt.part'),
FileNotFoundError: [Errno 2] No such file or directory: 'data/chosen_classes.txt.part' -> 'data/chosen_classes.txt'
I am not able to find the file "chosen_classes.txt" in the original data folder:
```
additional_data:
datasets:
roberta-base:
  config.json  merges.txt  pytorch_model.bin  vocab.json
wikipedia_data:
  class_to_idx.json    descriptions_tns.pt  nltk_sentence_splitter_english.pickle  qcode_to_class_tns_6269457-138.np  qcode_to_wiki.lmdb
  class_to_label.json  human_qcodes.json    pem.lmdb  qcode_to_idx.lmdb  subclasses.lmdb
wikipedia_model:
  config.json  model.pt
wikipedia_model_with_numbers:
  config.json  model.pt
```
How can I find it? Thanks in advance.
Hi team, thanks for sharing the great work here.
Will there be an integration with AWS Inferentia?
thanks!
The process_text_batch function is not implemented as of now. When can we expect it to be implemented? Is there any documentation that we can follow to implement it ourselves?
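In the meantime, a naive stand-in that simply loops over documents with the existing single-document API is easy to write; it gives no batching speed-up, but matches the interface one might expect (a sketch only, not the planned implementation):

```python
from typing import Iterable, List
from refined.inference.processor import Refined

def process_text_batch_naive(refined: Refined, texts: Iterable[str]) -> List[list]:
    """Hypothetical stand-in for process_text_batch: one process_text call per document."""
    return [refined.process_text(text) for text in texts]
```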
When I ran preprocess_all, it completed successfully, and the ./organised_data_dir appears to contain all the expected folders.
However, I don't see the models or data files for the Wikipedia data. Where are they stored/created, or did something not process during the preprocess_all script?
Thank you!
As a fix, you can remove it from requirements.txt and replace it with boto3, which is needed to run demo.py.
demo.py is still downloading, so I might run into issues with this down the line. Just a quick fix that might help others.
In class_selection.py (https://github.com/amazon-science/ReFinED/blob/main/src/refined/offline_data_generation/class_selection.py#L147), when there is no entity span in an article, the preprocessing step (https://github.com/amazon-science/ReFinED/blob/main/src/refined/offline_data_generation/preprocess_all.py#L244) stops immediately instead of iterating over all of your training data (wikipedia_links_aligned.json).
If this is the issue, feel free to tell me and I will send a PR with the fix.
Hello, I found what I believe is an issue with the method retrieving the top-k candidate entities. When asking for top_k_predicted_entities from a span object, all candidates show the same Wikipedia entity title (equal to the title of the selected, top-scoring entity), even though the Wikidata IDs actually point to different entities.
To show an example, when asking for top k predicted entities for "Barack Obama", I get:
[(Entity(wikidata_entity_id=Q76, wikipedia_entity_title=Barack Obama), 1.0),
(Entity(wikipedia_entity_title=Barack Obama), 0.0),
(Entity(wikidata_entity_id=Q649593, wikipedia_entity_title=Barack Obama), 0.0),
(Entity(wikidata_entity_id=Q16847466, wikipedia_entity_title=Barack Obama), 0.0),
(Entity(wikidata_entity_id=Q4858115, wikipedia_entity_title=Barack Obama), 0.0),
(Entity(wikidata_entity_id=Q3526570, wikipedia_entity_title=Barack Obama), 0.0),
(Entity(wikidata_entity_id=Q50303833, wikipedia_entity_title=Barack Obama), 0.0),
(Entity(wikidata_entity_id=Q8564528, wikipedia_entity_title=Barack Obama), 0.0),
(Entity(wikidata_entity_id=Q2935433, wikipedia_entity_title=Barack Obama), 0.0),
(Entity(wikidata_entity_id=Q4858123, wikipedia_entity_title=Barack Obama), 0.0),
(Entity(wikidata_entity_id=Q4858105, wikipedia_entity_title=Barack Obama), 0.0),
(Entity(wikidata_entity_id=Q1379733, wikipedia_entity_title=Barack Obama), 0.0),
(Entity(wikidata_entity_id=Q45578, wikipedia_entity_title=Barack Obama), 0.0),
(Entity(wikidata_entity_id=Q5842038, wikipedia_entity_title=Barack Obama), 0.0)]
even though, for example, id=Q16847466 corresponds to the Wikipedia article "Efforts to impeach Barack Obama".
Am I missing something?
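For reference, a minimal sketch of how the candidates above can be retrieved (assuming the standard from_pretrained entry point and the top_k_predicted_entities attribute mentioned above):

```python
from refined.inference.processor import Refined

refined = Refined.from_pretrained(model_name="wikipedia_model_with_numbers",
                                  entity_set="wikipedia")
spans = refined.process_text("Barack Obama was the 44th president of the United States.")
for span in spans:
    # Each candidate is an (Entity, score) pair; per the report above, they all
    # display the same wikipedia_entity_title even when the Wikidata IDs differ.
    print(span.top_k_predicted_entities)
```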
When will you release the code?
Hi, I use python 3.12.2 and torch 2.2.2 on macOS 12.7.4.
The ReFinED version is 1.0.
When trying fine-tuning, the following error happened:
( python src/refined/training/fine_tune/fine_tune.py --experiment_name test )
TypeError: cannot pickle 'Environment' object
Could you tell me what can be done to work around this issue?
Thanks.
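"cannot pickle 'Environment' object" usually means an object wrapping a native handle (most likely an LMDB Environment here, though that is an assumption) is being handed to a worker process, which requires pickling; on macOS, multiprocessing defaults to the spawn start method, so this surfaces more often. A generic illustration of the usual workaround, keeping data loading in the main process, is below; whether fine_tune.py exposes such an option is not confirmed:

```python
from torch.utils.data import DataLoader, Dataset

class ToyDataset(Dataset):
    """Stand-in dataset purely for illustration."""
    def __len__(self) -> int:
        return 4

    def __getitem__(self, idx: int) -> int:
        return idx

# num_workers=0 keeps loading in the main process, so nothing has to be pickled
# and objects holding native handles (file descriptors, LMDB environments) are safe.
loader = DataLoader(ToyDataset(), batch_size=2, num_workers=0)
```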
Thanks for your awesome work "Improving Entity Disambiguation by Reasoning over a Knowledge Base" (KBED). I am trying to reproduce the results; will you release the KBED source code?
In the LookupsInferenceOnly class in data_lookups.py, the format of self.pem is Mapping[str, List[Tuple[str, float]]]. But in preprocess_all.py, when we use it to download data and build pem.lmdb, the format is Mapping[str, Mapping[str, float]]. This inconsistency results in errors when loading the data.
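If the mismatch really is just the container type, a small conversion between the two formats would reconcile them; a hypothetical helper (not part of the codebase):

```python
from typing import Dict, List, Tuple

def pem_dict_to_list(pem: Dict[str, Dict[str, float]]) -> Dict[str, List[Tuple[str, float]]]:
    """Convert surface form -> {qcode: prob} into surface form -> [(qcode, prob), ...]."""
    return {
        surface_form: sorted(candidates.items(), key=lambda item: item[1], reverse=True)
        for surface_form, candidates in pem.items()
    }
```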
Great work!
I executed the following command and obtained the data file wikipedia_links_aligned_spans.json in the folder ~/.cache/refined/datasets:
python3 src/refined/training/train/train.py --experiment_name test
I have two questions regarding this file:
1. Is wikipedia_links_aligned_spans.json the training data?
2. There are three fields in wikipedia_links_aligned_spans.json, which are hyperlinks_clean, hyperlinks, and predicted_spans. I'm not familiar with these three fields, and I'm unsure how to proceed with obtaining the training data.
Thanks!
Dear Tom,
First of all, thank you for publishing this awesome and easy-to-use entity linker.
I've been running experiments with ReFinED for a while but only started using it on GPU a few days ago. I noticed some weird variations in the runtime on GPU (maybe they were there on CPU as well, and I didn't pay close attention to the runtime before, but I think I would have noticed):
If I run ReFinED over a benchmark for the first time (or for the first time after linking over several other benchmarks), it takes quite a while (in fact at least as long as on my CPU-only machine: 76s for the Wiki-Fair benchmark). If I run it again immediately on the same benchmark it is lightning fast and links the whole thing in 4s.
Is there any caching used that might explain this behavior? If so, can I disable it to get comparable runtime measurements?
The loading of the model does not count towards my time measurement. The model is loaded before the measurement is started:
self.refined = Refined.from_pretrained(model_name=model_name, entity_set=entity_set)
I'm using ReFinED from inside the ELEVANT entity linking evaluation tool with the AIDA model and the 33M entity set.
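A generic sketch of how one-time warm-up costs (CUDA initialisation, lazy loading of resources) could be separated from steady-state timing, using a hypothetical document; whether ReFinED itself caches per-benchmark results is a separate question:

```python
import time
from refined.inference.processor import Refined

refined = Refined.from_pretrained(model_name="aida_model", entity_set="wikipedia")

text = "Angela Merkel visited Paris in 2019."  # hypothetical benchmark document

refined.process_text(text)  # warm-up run, excluded from the measurement

start = time.perf_counter()
for _ in range(100):
    refined.process_text(text)
print(f"steady state: {(time.perf_counter() - start) / 100:.4f} s per document")
```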
Thanks in advance,
Natalie