GLADIS

GLADIS: A General and Large Acronym Disambiguation Benchmark (long paper at EACL 2023), released under the CC0-1.0 license.

In this paper, we propose a new benchmark named GLADIS and an acronym linking system named AcroBERT. We have a demo on Hugging Face, and the example below shows the result for a sentence containing the acronym "NCBI":

(screenshot of the Hugging Face demo result)

To accelerate research on acronym disambiguation, we constructed a new benchmark named GLADIS and pre-trained AcroBERT. The table below shows the main components of the benchmark.

Component | Source | Description
Acronym Dictionary | Pile (MIT license), Wikidata, UMLS | 1.6 million acronyms and 6.4 million long forms
Three Datasets | WikilinksNED Unseen, SciAD (CC BY-NC-SA 4.0), MedMentions (CC0 1.0) | three AD datasets covering the general, scientific, and biomedical domains
A Pre-training Corpus | Pile (MIT license) | 160 million sentences with acronyms
AcroBERT | BERT-based model | the first pre-trained language model for general acronym disambiguation


Usage

AcroBERT performs end-to-end acronym linking. Given a sentence, our framework first recognizes acronyms using MadDog (CC BY-NC-SA 4.0), and then disambiguates them using AcroBERT:

from inference.acrobert import acronym_linker

# input sentence with acronyms, the maximum length is 400 sub-tokens
sentence = "This new genome assembly and the annotation are tagged as a RefSeq genome by NCBI."

# mode = ['acrobert', 'pop']
# AcroBERT is more accurate, while the 'pop' method is faster but less accurate.
results = acronym_linker(sentence, mode='acrobert')
print(results)

## expected output: [('NCBI', 'National Center for Biotechnology Information')]
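
The returned (acronym, long form) pairs can also be used to expand acronyms directly in the text. The snippet below is a minimal sketch built on the documented output format; the replace-based expansion is our own illustration, not part of the library.

from inference.acrobert import acronym_linker

sentence = "The NCBI provides access to biomedical and genomic information."

# acronym_linker returns a list of (acronym, long form) pairs,
# e.g. [('NCBI', 'National Center for Biotechnology Information')]
expanded = sentence
for acronym, long_form in acronym_linker(sentence, mode='acrobert'):
    # illustrative expansion: write the long form next to the acronym
    expanded = expanded.replace(acronym, f"{long_form} ({acronym})", 1)

print(expanded)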

Preparation

The new benchmark constructed in this paper is located in input/dataset. The acronym dictionary is expected to be stored at input/dataset/acronym_kb.json, which includes 1.5M acronyms and 6.4M long forms. However, due to the upload size limit, you have to download the dictionary (together with the AcroBERT model) from this link: dictionary and model. After downloading, decompress the archive and put the two files into input/
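
As a quick sanity check after decompressing, you can open the dictionary directly. The snippet below is a minimal sketch; it assumes the JSON file maps each acronym to its candidate long forms (the exact value format may differ) and uses the path described above.

import json

# path as described above; adjust if you keep the file under input/dataset/
with open("input/acronym_kb.json", encoding="utf-8") as f:
    acronym_kb = json.load(f)

print(len(acronym_kb))          # number of acronyms, roughly 1.5M
print(acronym_kb.get("NCBI"))   # candidate long forms (assumed format)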

Reproduction

Training

First, you can use -help to show the available arguments:

python train.py -help

Once the data preparation and environment setup are complete, you can train the model via acrobert.py:

python acrobert.py -pre_train_path ../input/pre_train_sample.txt

The entire pre-training takes about two weeks on a single NVIDIA Tesla V100S PCIe 32 GB.

Quick Reproduction

We provide a one-line command to reproduce the scores in Table A1, which is the easiest result to reproduce; you can see the scores after only a few minutes. The required test sets are in /evaluation/test_set, where you can find three evaluation sets, and the corresponding dictionaries are in /evaluation/dict. We also provide the AcroBERT model file at /input/acrobert.pt. The scores can then be obtained with the following command:

python acrobert.py -mode evaluation

Finally, you should see the following results:

F1: [88.8, 58.0, 67.5], ACC: [93.7, 72.0, 65.3]

Finetuning

python acrobert.py -mode finetuning -learning_rate 1e-6 -hard_neg_numbers 1 -lr_decay 0.6

Citation

@inproceedings{chen2023gladis,
  title={GLADIS: A General and Large Acronym Disambiguation Benchmark},
  author={Chen, Lihu and Varoquaux, Ga{\"e}l and Suchanek, Fabian M},
  booktitle={EACL 2023-The 17th Conference of the European Chapter of the Association for Computational Linguistics},
  year={2023}
}

Acknowledgements

This work was partially funded by ANR-20-CHIA0012-01 (“NoRDF”).

Issues

Loading Model from State Dictionary throws Unexpected Keys Error

Hi, I tried using the AcroBERT model on my machine and ran into an error.

I ran the installation as described in the repo and downloaded the model and the dictionary from the given link.

When I try to run the acrobert.py script in the .\inference\ folder like this:

GLADIS\inference> py .\acrobert.py

it loads the dictionary:

2024-04-26 16:32:02,565 : utils.py : load_acronym_kb : INFO : loaded acronym dictionary successfully, in total there are [1542845] acronyms
2024-04-26 16:32:02,565 : acrobert.py : <module> : INFO : running .\acrobert.py

but then throws:

RuntimeError: Error(s) in loading state_dict for AcronymBERT: Unexpected key(s) in state_dict: "model.bert.embeddings.position_ids".

I'm running Python 3.11.4 on Windows 11.
I have installed all requirements and updated all my modules.
I have re-downloaded the repo and the model files multiple times.

Am I doing something wrong?
Do you have an idea what might cause this?
Does it behave the same way on your machine?

Any help is appreciated!

Best,
Jasper
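
One common cause of this particular mismatch (offered here as a suggestion, not a confirmed fix from the maintainers) is a transformers version difference: newer releases no longer register position_ids as a buffer in the BERT embeddings, so a checkpoint saved with an older version carries an extra key. A minimal workaround sketch, assuming acrobert.pt stores a plain state dict and `model` is the AcronymBERT instance built by acrobert.py:

import torch

def load_acrobert_checkpoint(model, path="input/acrobert.pt"):
    # load the checkpoint on CPU; the path matches the README download location
    state_dict = torch.load(path, map_location="cpu")
    # drop the stale buffer key that newer transformers versions no longer expect
    state_dict.pop("model.bert.embeddings.position_ids", None)
    # strict=False tolerates any remaining harmless mismatches
    model.load_state_dict(state_dict, strict=False)
    return model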
