bowdbeg / edge-oriented-graph

This project forked from fenchri/edge-oriented-graph

Source code for "Connecting the Dots: Document-level Neural Relation Extraction with Edge-oriented Graphs" in EMNLP 2019

License: Other

Python 93.79% Shell 4.63% C 1.58%

Edge-oriented Graph

Source code for the paper "Connecting the Dots: Document-level Neural Relation Extraction with Edge-oriented Graphs" in EMNLP 2019.

Requirements

  • Python 3.5+
  • PyTorch 1.1.0
  • tqdm
  • matplotlib
  • recordtype
  • orderedyamlload

$ pip3 install -r requirements.txt

Environment

Results are reproducible when using a seed equal to 0 and the following settings: GK210GL Tesla K80 GPU, Ubuntu 16.04.
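
Fixing the seed usually means seeding every random number generator that training touches. The following is a minimal sketch of that idea, not the code used in this repository: `set_seed` is a hypothetical helper, and the cuDNN settings are an assumption about what a deterministic run needs. NumPy and PyTorch are seeded only if they are installed.

```python
import random


def set_seed(seed=0):
    """Seed the available RNGs (hypothetical helper, not taken from this repo)."""
    random.seed(seed)
    try:
        import numpy as np
        np.random.seed(seed)
    except ImportError:
        pass  # NumPy is not installed, nothing to seed
    try:
        import torch
        torch.manual_seed(seed)
        if torch.cuda.is_available():
            torch.cuda.manual_seed_all(seed)
        # Assumption: deterministic cuDNN kernels are wanted (can slow training).
        torch.backends.cudnn.deterministic = True
        torch.backends.cudnn.benchmark = False
    except ImportError:
        pass  # PyTorch is not installed, nothing to seed


# Re-seeding with the same value replays the same random sequence.
set_seed(0)
first = random.random()
set_seed(0)
second = random.random()
print(first == second)
```

Even with a fixed seed, results may still vary across GPU models and library versions, which is why the hardware and OS are listed above.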

Datasets & Pre-processing

Download the datasets

$ mkdir data && cd data
$ wget https://biocreative.bioinformatics.udel.edu/media/store/files/2016/CDR_Data.zip && unzip CDR_Data.zip && mv CDR_Data CDR
$ wget https://bitbucket.org/alexwuhkucs/gda-extraction/get/fd4a7409365e.zip && unzip fd4a7409365e.zip && mv alexwuhkucs-gda-extraction-fd4a7409365e GDA
$ cd ..

Download the GENIA Tagger and Sentence Splitter:

$ cd data_processing
$ mkdir common && cd common
$ wget http://www.nactem.ac.uk/y-matsu/geniass/geniass-1.00.tar.gz && tar xvzf geniass-1.00.tar.gz
$ cd geniass/ && make && cd ..
$ git clone https://github.com/bornabesic/genia-tagger-py.git
$ cd genia-tagger-py && make
$ cd ../../

In order to process the datasets, they should first be transformed into the PubTator format. Then run the processing scripts as follows:

$ sh process_cdr.sh
$ sh process_gda.sh
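
For readers unfamiliar with the PubTator format mentioned above, here is a rough sketch of how one document in that format can be read. It is not the repository's processing code: `parse_pubtator`, the `Mention` field names, and the toy document are all made up for illustration. It assumes title and abstract lines of the form `PMID|t|...` and `PMID|a|...`, tab-separated mention lines with six columns, and tab-separated relation lines with four columns.

```python
from collections import namedtuple

# Field names are illustrative, not taken from the repository's scripts.
Mention = namedtuple("Mention", "start end text type concept_id")

# Toy document (not real data): title, abstract, two mentions, one relation.
SAMPLE = (
    "1|t|Drug X induces disease Y.\n"
    "1|a|A toy abstract.\n"
    "1\t0\t6\tDrug X\tChemical\tC001\n"
    "1\t15\t24\tdisease Y\tDisease\tD001\n"
    "1\tCID\tC001\tD001\n"
)


def parse_pubtator(block):
    """Parse the lines of one PubTator document into a dictionary."""
    doc = {"pmid": None, "title": "", "abstract": "", "mentions": [], "relations": []}
    for line in block.strip().splitlines():
        head = line.split("|", 2)
        if len(head) == 3 and head[0].isdigit() and head[1] in ("t", "a"):
            # "PMID|t|title text" or "PMID|a|abstract text"
            doc["pmid"] = head[0]
            doc["title" if head[1] == "t" else "abstract"] = head[2]
            continue
        cols = line.split("\t")
        if len(cols) >= 6:
            # PMID, start, end, mention text, entity type, concept ID
            doc["mentions"].append(
                Mention(int(cols[1]), int(cols[2]), cols[3], cols[4], cols[5])
            )
        elif len(cols) == 4:
            # PMID, relation type, first concept ID, second concept ID
            doc["relations"].append(tuple(cols[1:]))
    return doc


doc = parse_pubtator(SAMPLE)
# Assumption: offsets index into the title and abstract joined by one space.
text = doc["title"] + " " + doc["abstract"]
for mention in doc["mentions"]:
    print(mention.type, mention.concept_id, text[mention.start:mention.end])
print(doc["relations"])
```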

To get the data statistics, run:

$ python3 statistics.py --data ../data/CDR/processed/train.data

This will additionally generate the gold-annotation file in the same folder with suffix .gold.

Usage

Run the main script for training and testing as follows. Use --gpu -1 for CPU mode.

CDR dataset: Train the model on the training set and evaluate on the dev set in order to identify the best training epoch. For testing, retrain the model on the union of train and dev (train+dev_filter.data) until the best epoch and evaluate on the test set.

GDA dataset: Simply train the model on the training set and evaluate on the dev set. Test the saved model on the test set.

To enable the early stopping criterion, use the --early_stop option. If early stopping is not triggered during training, training runs until the maximum epoch (specified in the config file).

Otherwise, to train up to a specific epoch, use the --epoch epochNumber option without --early_stop. When both options are given, --epoch sets the maximum epoch at which training stops.

For example, in the CDR dataset:

$ cd src/
$ python3 eog.py --config ../configs/parameters_cdr.yaml --train --gpu 0 --early_stop       # using early stopping
$ python3 eog.py --config ../configs/parameters_cdr.yaml --train --gpu 0 --epoch 15         # train until the 15th epoch *without* early stopping
$ python3 eog.py --config ../configs/parameters_cdr.yaml --train --gpu 0 --epoch 15 --early_stop  # set both early stop and max epoch

$ python3 eog.py --config ../configs/parameters_cdr.yaml --test --gpu 0
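
The interaction between early stopping and the maximum epoch can be sketched as below. This is a generic patience-based criterion written for illustration, not the logic implemented in eog.py: the function name, the `patience` parameter, and the dev scores are all made up, and the list lookup stands in for "train one epoch, then evaluate on the dev set".

```python
def train_with_early_stopping(dev_scores, max_epoch, patience=None):
    """Return (best_epoch, last_epoch) for a list of per-epoch dev scores.

    patience=None disables early stopping, so training always runs for
    max_epoch epochs. Otherwise training stops once the dev score has not
    improved for `patience` consecutive epochs.
    """
    best_score, best_epoch, bad_epochs = float("-inf"), 0, 0
    for epoch in range(1, max_epoch + 1):
        score = dev_scores[epoch - 1]  # stand-in for train + evaluate
        if score > best_score:
            best_score, best_epoch, bad_epochs = score, epoch, 0
        else:
            bad_epochs += 1
        if patience is not None and bad_epochs >= patience:
            return best_epoch, epoch  # early stopping triggered
    return best_epoch, max_epoch  # maximum epoch reached


# Illustrative dev scores, not results from the model.
scores = [0.50, 0.55, 0.60, 0.58, 0.59, 0.57, 0.61]
print(train_with_early_stopping(scores, max_epoch=7, patience=3))
print(train_with_early_stopping(scores, max_epoch=7))
print(train_with_early_stopping(scores, max_epoch=5, patience=3))
```

In the first call the criterion fires before the maximum epoch; in the last call the maximum epoch is reached first, which mirrors the case where early stopping is enabled but never triggered.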

All necessary parameters can be stored in the YAML files inside the configs directory. The following parameters can also be given directly on the command line:

usage: eog.py [-h] --config CONFIG [--train] [--test] [--gpu GPU]
              [--walks WALKS] [--window WINDOW] [--edges [EDGES [EDGES ...]]]
              [--types TYPES] [--context CONTEXT] [--dist DIST] [--example]
              [--seed SEED] [--early_stop] [--epoch EPOCH]

optional arguments:
  -h, --help            show this help message and exit
  --config CONFIG       Yaml parameter file
  --train               Training mode - model is saved
  --test                Testing mode - needs a model to load
  --gpu GPU             GPU number
  --walks WALKS         Number of walk iterations
  --window WINDOW       Window for training (empty processes the whole
                        document, 1 processes 1 sentence at a time, etc)
  --edges [EDGES [EDGES ...]]
                        Edge types
  --types TYPES         Include node types (Boolean)
  --context CONTEXT     Include MM context (Boolean)
  --dist DIST           Include distance (Boolean)
  --example             Show example
  --seed SEED           Fixed random seed number
  --early_stop          Use early stopping
  --epoch EPOCH         Maximum training epoch
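
One way to combine a YAML config with command-line overrides is sketched below. The parser is reconstructed from the help text above and is not copied from eog.py: the argument types, the `str2bool` helper, the `merge` function, the edge-type values, and the config keys are assumptions. The config is a plain dictionary here so the example runs without a YAML library.

```python
import argparse


def str2bool(value):
    """Assumption: Boolean options accept strings such as 'true' or 'false'."""
    if value.lower() in ("true", "1", "yes"):
        return True
    if value.lower() in ("false", "0", "no"):
        return False
    raise argparse.ArgumentTypeError("expected a Boolean, got %r" % value)


def build_parser():
    # SUPPRESS leaves options that were not given out of the namespace,
    # so they cannot overwrite values coming from the config file.
    parser = argparse.ArgumentParser(prog="eog.py",
                                     argument_default=argparse.SUPPRESS)
    parser.add_argument("--config", required=True, help="Yaml parameter file")
    parser.add_argument("--train", action="store_true")
    parser.add_argument("--test", action="store_true")
    parser.add_argument("--gpu", type=int, help="GPU number (-1 for CPU)")
    parser.add_argument("--walks", type=int)
    parser.add_argument("--window", type=int)
    parser.add_argument("--edges", nargs="*")
    parser.add_argument("--types", type=str2bool)
    parser.add_argument("--context", type=str2bool)
    parser.add_argument("--dist", type=str2bool)
    parser.add_argument("--example", action="store_true")
    parser.add_argument("--seed", type=int)
    parser.add_argument("--early_stop", action="store_true")
    parser.add_argument("--epoch", type=int)
    return parser


def merge(config, args):
    """Values given on the command line override those from the config."""
    merged = dict(config)
    merged.update(vars(args))
    return merged


# Stand-in for the parsed YAML file; the keys are hypothetical.
config = {"gpu": 0, "walks": 4, "early_stop": False}
args = build_parser().parse_args(
    ["--config", "c.yaml", "--train", "--gpu", "-1", "--edges", "MM", "ME"]
)
print(merge(config, args))
```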

Evaluation

To evaluate, first generate the gold-annotation file (the .gold file produced by statistics.py), then run the evaluation script as follows:

$ cd evaluation/
$ python3 evaluate.py --pred path_to_predictions_file --gold ../data/CDR/processed/test.gold --label 1:CDR:2
$ python3 evaluate.py --pred path_to_predictions_file --gold ../data/GDA/processed/test.gold --label 1:GDA:2
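
The scores reported by an evaluation of this kind are precision, recall, and F1 over predicted and gold relations. The sketch below shows how these are typically computed. It does not reproduce evaluate.py or its file formats: `prf` and the toy (document, argument 1, argument 2) triples are made up for illustration.

```python
def prf(pred, gold):
    """Precision, recall, and F1 over sets of predicted and gold relations."""
    pred, gold = set(pred), set(gold)
    true_positives = len(pred & gold)
    precision = true_positives / len(pred) if pred else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if true_positives == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Toy relations (not real data): two of three predictions are correct.
gold = {("doc1", "C1", "D1"), ("doc1", "C2", "D1"), ("doc2", "C3", "D2")}
pred = {("doc1", "C1", "D1"), ("doc2", "C3", "D2"), ("doc2", "C3", "D9")}

precision, recall, f1 = prf(pred, gold)
print("P = %.2f, R = %.2f, F1 = %.2f" % (precision, recall, f1))
```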

Results

Below are the results in terms of F1-score for the CDR and GDA datasets.
The results of CDR are the same as reported in the paper and can be reproduced using the provided source code with a similar environment.
For the GDA dataset, due to a sentence-splitting error, some inter-sentence pairs were missed.
The updated GDA results are given below. Intra-sentence and overall performance are similar to those in the paper, but inter-sentence performance differs.

CDR                    Dev F1 (%)                 Test F1 (%)
Model                  Overall    Intra    Inter  Overall    Intra    Inter
EoG          L = 8       63.57    68.25    46.68    63.62    68.25    50.94
EoG (Full)   L = 2       58.90    66.49    40.20    57.66    66.52    39.41
EoG (NoInf)  -           50.07    57.58    33.93    49.24    60.12    30.64
EoG (Sent)   L = 4       57.56    65.47        -    55.22    65.29        -

GDA                    Dev F1 (%)                 Test F1 (%)
Model                  Overall    Intra    Inter  Overall    Intra    Inter
EoG          L = 16      78.61    83.00    46.92    80.18    84.74    45.66
EoG (Full)   L = 4       77.89    82.26    54.26    79.94    84.60    54.77
EoG (NoInf)  -           71.60    77.19    45.34    73.73    79.22    47.06
EoG (Sent)   L = 16      72.17    78.10        -    73.05    78.83        -

Citation

If you found this code useful and plan to use it, please cite the following paper =)

@inproceedings{christopoulou2019connecting,
    title = "Connecting the Dots: Document-level Neural Relation Extraction with Edge-oriented Graphs",
    author = "Christopoulou, Fenia and Miwa, Makoto and Ananiadou, Sophia",
    booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP)",
    year = "2019",
    publisher = "Association for Computational Linguistics",
    pages = "4927--4938"
}
