izuna385 / entity-linking-tutorial

Bi-encoder-based entity linking tutorial. You can run an experiment in only 5 minutes. Experiments on a Colab Pro GPU are also supported.

Home Page: https://medium.com/nerd-for-tech/building-bi-encoder-based-entity-linking-system-with-transformer-6c111d86500

Python 100.00%
bert allennlp natural-language-processing entity-linking named-entity-disambiguation approximate-nearest-neighbor-search

entity-linking-tutorial's Introduction

Entity-Linking-Tutorial

  • In this tutorial, we will implement a bi-encoder-based entity disambiguation system using the BC5CDR dataset and data from the MeSH knowledge base.

  • We will compare surface-form-based candidate generation with bi-encoder-based candidate generation to see what the bi-encoder model contributes to entity linking.

Docs for English

Docs for Japanese

Tutorial with Colab-Pro.

See here.

Environment Setup

  • First, create a base environment with conda.
# If you are not using Colab Pro, create the environment with conda.
$ conda create -n allennlp python=3.7
$ conda activate allennlp
$ pip install -r requirements.txt

Preprocessing

  • First, download the preprocessed files from here and unzip them.

  • Second, download the BC5CDR dataset to ./dataset/ and unzip it.

  • Place CDR_DevelopmentSet.PubTator.txt, CDR_TestSet.PubTator.txt and CDR_TrainingSet.PubTator.txt directly under ./dataset/.

  • Then, run python3 BC5CDRpreprocess.py and python3 preprocess_mesh.py.
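
For orientation, here is a minimal sketch of reading one document in the PubTator layout that the CDR_*.PubTator.txt files use: a `|t|` title line, an `|a|` abstract line, then tab-separated annotation rows. This is not the code in BC5CDRpreprocess.py. The `Mention` class, the `parse_pubtator` function, and the toy document below are illustrative assumptions.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Mention:
    pmid: str
    start: int          # character offset into title + " " + abstract
    end: int
    text: str
    entity_type: str    # "Chemical" or "Disease" in BC5CDR
    mesh_id: str


def parse_pubtator(block: str) -> Tuple[str, str, List[Mention]]:
    """Parse one PubTator document block into (title, abstract, mentions)."""
    title, abstract, mentions = "", "", []
    for line in block.strip().splitlines():
        head = line.split("|", 2)
        if len(head) == 3 and head[1] == "t":
            title = head[2]
        elif len(head) == 3 and head[1] == "a":
            abstract = head[2]
        else:
            fields = line.split("\t")
            # Mention rows have at least six fields with numeric offsets;
            # relation rows (e.g. "CID") are shorter and are skipped here.
            if len(fields) >= 6 and fields[1].isdigit():
                mentions.append(
                    Mention(fields[0], int(fields[1]), int(fields[2]),
                            fields[3], fields[4], fields[5])
                )
    return title, abstract, mentions


# Toy document written for this example (not taken from the dataset).
SAMPLE = (
    "0000001|t|Aspirin induced asthma.\n"
    "0000001|a|A toy abstract used only for illustration.\n"
    "0000001\t0\t7\tAspirin\tChemical\tD001241\n"
    "0000001\t16\t22\tasthma\tDisease\tD001249\n"
    "0000001\tCID\tD001241\tD001249\n"
)

title, abstract, mentions = parse_pubtator(SAMPLE)
full_text = title + " " + abstract
for m in mentions:
    print(m.mesh_id, m.entity_type, repr(full_text[m.start:m.end]))
```

Each mention row carries the gold MeSH ID, which is the label the entity linking model is trained to recover.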

Models and Scoring

Models

  • Surface-Candidate based

    biencoder

  • ANN-search based

    entire_biencoder
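
To make the difference between the two candidate generators concrete, here is a minimal sketch. It assumes that surface-form candidates come from string similarity against entity names, and that ANN-search candidates come from ranking all entity vectors against the mention vector. The function names, the toy knowledge base with its `E1`..`E3` ids, and the hand-written vectors (stand-ins for encoder outputs) are all hypothetical and are not taken from this repository.

```python
import difflib

import numpy as np

# Toy knowledge base: hypothetical entity id -> canonical name.
KB = {"E1": "aspirin", "E2": "asthma", "E3": "acetaminophen"}


def surface_candidates(mention, kb, k=5):
    """Return up to k entity ids whose names look most like the mention string."""
    name_to_id = {name: eid for eid, name in kb.items()}
    close = difflib.get_close_matches(
        mention.lower(), list(name_to_id), n=k, cutoff=0.0
    )
    return [name_to_id[name] for name in close]


def dense_candidates(mention_vec, entity_vecs, entity_ids, k=5):
    """Return the k entity ids with the largest inner product (exhaustive search)."""
    scores = entity_vecs @ mention_vec
    top = np.argsort(-scores)[:k]
    return [entity_ids[i] for i in top]


# Surface form: a misspelled mention still matches by characters.
print(surface_candidates("aspirine", KB, k=1))

# Dense retrieval: placeholder 2-d vectors standing in for encoder outputs.
entity_ids = ["E1", "E2", "E3"]
entity_vecs = np.array([[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]])
mention_vec = np.array([0.9, 0.1])
print(dense_candidates(mention_vec, entity_vecs, entity_ids, k=2))
```

The surface-form generator can only retrieve entities whose names resemble the mention string, while the dense generator depends entirely on how well the encoder places mentions near their entities.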

Scoring

  • Default: Dot product between mention and predicted entity.

    scoring

  • L2-distance and cosine similarity are also supported.
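
The three scoring options can be sketched in NumPy as follows. This is a minimal illustration, not the repository's scoring module. It assumes a single mention vector `m` of shape `(d,)` and an entity matrix `e` of shape `(n, d)`, and the function names are made up for this example. L2 distance is negated so that, for all three functions, a higher score means a better match.

```python
import numpy as np


def dot_score(m, e):
    """Inner product between the mention and each entity."""
    return e @ m


def l2_score(m, e):
    """Negative Euclidean distance, so that larger is better."""
    return -np.linalg.norm(e - m, axis=1)


def cosine_score(m, e):
    """Cosine similarity; a small epsilon guards against zero vectors."""
    return (e @ m) / (np.linalg.norm(e, axis=1) * np.linalg.norm(m) + 1e-12)


m = np.array([1.0, 0.0])
e = np.array([[3.0, 0.0],   # same direction as m, larger norm
              [1.0, 0.0],   # identical to m
              [0.0, 2.0]])  # orthogonal to m

print("dot   :", dot_score(m, e))
print("l2    :", l2_score(m, e))
print("cosine:", cosine_score(m, e))
```

On this toy input the dot product prefers the long vector in row 0, L2 distance prefers the exact match in row 1, and cosine similarity does not distinguish between the two.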

Experiment and Evaluation

$ rm -r serialization_dir # Remove results of a previous run, e.g. if you ran `python3 main.py -debug` for debugging.
$ python3 main.py

Parameters

Here we note only the parameters that are critical for training and evaluation. For further details, see parameters.py.

| Parameter Name | Description | Default |
| --- | --- | --- |
| batch_size_for_train | Batch size during training. A larger batch gives the encoder more negative examples from which to learn to pick the correct answer. | 16 |
| lr | Learning rate. | 1e-5 |
| max_candidates_num | Number of candidates generated for each mention from its surface form. | 5 |
| search_method_for_faiss | Whether to use cosine similarity (cossim), inner product (indexflatip), or L2 distance (indexflatl2) for approximate nearest neighbor search. | indexflatip |
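
The note on batch_size_for_train says that a larger batch exposes the encoder to more negative examples. One common way for a bi-encoder to get this effect is in-batch negatives, where every other gold entity in the batch serves as a negative for a given mention. The sketch below illustrates that idea under the assumption that training works this way. It is not taken from this repository, and `in_batch_loss` is a hypothetical name.

```python
import numpy as np


def in_batch_loss(mention_vecs, entity_vecs):
    """Mean cross-entropy where the gold entity of row i is column i.

    mention_vecs, entity_vecs: arrays of shape (batch_size, dim).
    """
    logits = mention_vecs @ entity_vecs.T                  # (B, B) score matrix
    logits = logits - logits.max(axis=1, keepdims=True)    # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_probs)))


# With uninformative (all-zero) encodings every entity in the batch is equally
# likely, so the loss equals log(batch_size): more negatives, harder task.
for batch_size in (16, 48):
    zeros = np.zeros((batch_size, 8))
    print(batch_size, round(in_batch_loss(zeros, zeros), 4))

# Well-separated encodings drive the loss towards zero.
separated = 10.0 * np.eye(4)
print(in_batch_loss(separated, separated))
```

Under this assumption, moving from batch size 16 to 48 triples the number of entities each mention has to be distinguished from at every training step.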

Result

  • Surface-Candidate based recall

    Generated Candidates Num 5 10 20
    dev_recall 76.80 79.91 80.92
    test_recall 74.35 77.14 78.25

batch_size_for_train: 16

  • Surface-Candidate based acc.

    Generated Candidates Num 5 10 20
    dev_acc 59.85 52.56 47.23
    test_acc 58.51 51.38 45.69
  • ANN-search Based

    (Generated Candidates Num: 50 (Fixed))

    Recall@X 1 (Acc.) 5 10 50
    dev_recall 21.58 42.28 50.48 67.11
    test_recall 21.50 40.29 47.95 64.52

batch_size_for_train: 48

  • Surface-Candidate based acc.

    Generated Candidates Num 5 10 20
    dev_acc 72.39 68.21 65.40
    test_acc 70.95 66.87 63.72
  • ANN-search Based

    (Generated Candidates Num: 50 (Fixed))

    Recall@X 1 (Acc.) 5 10 50
    dev_recall 58.86 74.33 78.14 83.10
    test_recall 57.66 73.14 76.73 81.39
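
As a reading aid for the tables above, here is a minimal sketch of how Recall@X can be computed. It assumes Recall@X is the percentage of mentions whose gold entity appears among the top X ranked candidates, so that Recall@1 coincides with accuracy. The function name and the toy rankings are illustrative and do not come from the repository's evaluation code.

```python
def recall_at_k(ranked_candidates, gold_ids, k):
    """Percentage of mentions whose gold id is among the first k candidates."""
    hits = sum(
        1 for cands, gold in zip(ranked_candidates, gold_ids) if gold in cands[:k]
    )
    return 100.0 * hits / len(gold_ids)


# Toy rankings for three mentions (hypothetical entity ids).
ranked = [
    ["A", "B", "C"],   # gold "A" is ranked first
    ["B", "A", "C"],   # gold "A" is ranked second
    ["C", "B", "D"],   # gold "X" was never retrieved
]
gold = ["A", "A", "X"]

for k in (1, 2, 3):
    print(f"Recall@{k}: {recall_at_k(ranked, gold, k):.2f}")
```

A mention whose gold entity is missing from the candidate list, like the third one here, can never be linked correctly, so candidate recall is an upper bound on accuracy.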

LICENSE

MIT

entity-linking-tutorial's Issues

Different Recall&Accuracy Results

Hi,

I ran your code successfully, but some of my results differ from yours, especially for batch_size: 48. I used your parameters, yet the differences in accuracy and recall are large. I wonder why; maybe you changed other parameters?

Also, the BioBERT model didn't run; I got an error.

Have a nice day.

===PARAMETERS===
debug False
debug_data_num 200
dataset bc5cdr
dataset_dir ./dataset/
serialization_dir ./serialization_dir/
preprocessed_doc_dir ./preprocessed_doc_dir/
kb_dir ./mesh/
cached_instance False
lr 1e-05
weight_decay 0
beta1 0.9
beta2 0.999
epsilon 1e-08
amsgrad False
word_embedding_dropout 0.1
cuda_devices 0
scoring_function_for_model indexflatip
num_epochs 10
patience 10
batch_size_for_train 48
batch_size_for_eval 48 or 16, I tried this
bert_name bert-base-uncased
max_context_len 50
max_mention_len 12
max_canonical_len 12
max_def_len 36
model_for_training biencoder I tried other models
candidates_dataset ./candidates.pkl
max_candidates_num 10
search_method_for_faiss indexflatip
how_many_top_hits_preserved 50
===PARAMETERS END===

image

BioBERT error:
image

Generate candidates.pkl

How can I create a candidates.pkl file for another dataset? Your candidatesgenerator.py code only loads the candidates.pkl given in the parameters; it does not generate it, I guess. Could you help me with this problem?

Thanks

Transformers: Dll load failed

Hi,

I ran this code on Colab, and I also want to try it on my own computer. I installed all the required packages. The preprocessing step ran without errors, but main.py raised an error: dll load failed module not found, in tokenizer.py (line 4: from transformers import AutoTokenizer, AutoModel). I looked into this error but haven't been able to fix it yet. What do you suggest?

Have a nice day
image
