dhlab-epfl / linkedbooksdeepreferenceparsing

A deep learning architecture for reference mining from literature in the arts and humanities.

Home Page: https://www.frontiersin.org/articles/10.3389/frma.2018.00021/full

License: MIT License

Jupyter Notebook 84.64% Python 15.36%

deep-learning crf crf-model annotations dataset annotations-dataset annotated-references venice footnotes citations

linkedbooksdeepreferenceparsing's Introduction

Deep Reference Parsing

This repository contains the code for the following article:

@article{alves_deep_2018,
  author  = {Rodrigues Alves, Danny and Colavizza, Giovanni and Kaplan, Frédéric},
  title   = {{Deep Reference Mining from Scholarly Literature in the Arts and Humanities}},
  journal = {Frontiers in Research Metrics and Analytics},
  volume  = {3},
  number  = {21},
  year    = {2018},
  doi     = {10.3389/frma.2018.00021}
}

Task definition

We focus on the task of reference mining, instantiated into three tasks: reference components detection (task 1), reference typology detection (task 2) and reference span detection (task 3).

Sequence: G. Ostrogorsky, History of the Byzantine State, Rutgers University Press, 1986.
Task 1: author author title title title title title publisher publisher publisher year
Task 2: b-secondary i-secondary ... e-secondary
Task 3: b-r i-r ... e-r
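
The begin/inside/end scheme used for tasks 2 and 3 can be illustrated with a minimal sketch. The helper name `span_tags`, the whitespace tokenisation of the example sequence, and the handling of one-token spans are assumptions made for illustration; they are not taken from the repository code.

```python
def span_tags(length, label):
    """Return begin/inside/end tags for a span of `length` tokens."""
    if length < 1:
        raise ValueError("a span needs at least one token")
    if length == 1:
        # Assumption: a one-token span carries only the begin tag.
        return [f"b-{label}"]
    return [f"b-{label}"] + [f"i-{label}"] * (length - 2) + [f"e-{label}"]


# Assumption: the example sequence is pre-tokenised by whitespace here.
tokens = ("G. Ostrogorsky , History of the Byzantine State , "
          "Rutgers University Press , 1986 .").split()

task2 = span_tags(len(tokens), "secondary")  # reference typology
task3 = span_tags(len(tokens), "r")          # reference span

for token, t2, t3 in zip(tokens, task2, task3):
    print(token, t2, t3)
```

Task 1 tags cannot be derived this way because they depend on the content of each token, which is what the models are trained to predict.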

Contents

- LICENSE: MIT.
- README.md: this file.
- dataset/
  - train: train split, CoNLL format.
  - test: test split, CoNLL format.
  - validation: validation split, CoNLL format.
- compressed dataset: compressed dataset.
- data facts: a Python notebook to explore the dataset (number of references, tag distributions).
- crf_baseline: CRF baseline implementation details.
- keras: Keras implementation details.
- tensorflow: TensorFlow implementation details.

Dataset

Example of a dataset entry (beginning of the validation dataset, first line/sequence), in the format `Token Task1tag Task2tag Task3tag`:

-DOCSTART- -X- -X- o

C author b-secondary b-r
. author i-secondary i-r
Agnoletti author i-secondary i-r
, author i-secondary i-r
Treviso title i-secondary i-r
e title i-secondary i-r
le title i-secondary i-r
sue title i-secondary i-r
pievi title i-secondary i-r
. title i-secondary i-r
Illustrazione title i-secondary i-r
storica title i-secondary i-r
, title i-secondary i-r
Treviso publicationplace i-secondary i-r
1898 year i-secondary i-r
, year i-secondary i-r
2 publicationspecifications i-secondary i-r
v publicationspecifications e-secondary i-r
. publicationspecifications e-secondary e-r
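
A file in this format could be read along the following lines. This is a minimal sketch, not the repository's loader: it assumes columns are separated by whitespace, that blank lines separate sequences (the usual CoNLL convention), and that `-DOCSTART-` lines are markers rather than tokens. The function name `read_sequences` is hypothetical.

```python
from typing import Iterable, List, Tuple

Row = Tuple[str, str, str, str]  # token, task 1 tag, task 2 tag, task 3 tag


def read_sequences(lines: Iterable[str]) -> List[List[Row]]:
    """Group four-column rows into sequences separated by blank lines."""
    sequences: List[List[Row]] = []
    current: List[Row] = []
    for raw in lines:
        line = raw.strip()
        # Blank lines and document markers close the current sequence.
        if not line or line.startswith("-DOCSTART-"):
            if current:
                sequences.append(current)
                current = []
            continue
        fields = line.split()
        if len(fields) != 4:
            raise ValueError(f"expected 4 columns, got {len(fields)}: {line!r}")
        current.append((fields[0], fields[1], fields[2], fields[3]))
    if current:
        sequences.append(current)
    return sequences


# First rows of the entry shown above.
sample = [
    "-DOCSTART- -X- -X- o",
    "",
    "C author b-secondary b-r",
    ". author i-secondary i-r",
    "Agnoletti author i-secondary i-r",
]
sequences = read_sequences(sample)
print([row[0] for row in sequences[0]])
```

To read a real split, pass an open file handle instead of the `sample` list.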

Pre-trained word vectors can be downloaded from Zenodo:

Implementations

CRF baseline

See internal readme for details.

Keras

See internal readme for details.

TensorFlow

See internal readme for details.

This implementation borrows from Guillaume Genthial's Sequence Tagging with Tensorflow.

linkedbooksdeepreferenceparsing's Issues

Foldering

Excellent code Danny!

Can you please put all the Keras code into a keras/ folder, and add a README there with the details on how to use it?

The general structure we will have is:
data/
keras/
tensorflow/
...

Within each a README with details. The word embeddings I will store in a separate location.

Thanks

Incomplete list of dependencies

Hi @Giovanni1085 many thanks for publishing this code, and your very useful paper.

Unfortunately, I'm having several issues getting it to run on my machine. I suspect much of this is caused by not having a complete list of dependencies. Would you consider adding a more comprehensive list in a requirements.txt?

I'll document the specific problems in separate issues.

Update TODO list

Remember to update the TODO list by striking through something done and adding/editing as needed.

IndexError: list index out of range

When running python main_threeTasks.py (from ./crf_baseline) I get the following error:

/home/matthew/Documents/wellcome/LinkedBooksDeepReferenceParsing/crf_baseline/build/virtualenv/lib/python3.7/site-packages/sklearn/externals/joblib/__init__.py:15: DeprecationWarning: sklearn.externals.joblib is deprecated in 0.21 and will be removed in 0.23. Please import this functionality directly from joblib, which can be installed with: pip install joblib. If this warning is raised when loading pickled models, you may need to re-serialize those models with scikit-learn 0.21+.
  warnings.warn(msg, category=DeprecationWarning)
Traceback (most recent call last):
  File "main_threeTasks.py", line 22, in <module>
    X_train_w, train_t1, train_t2, train_t3 = load_data("../dataset/clean_train.txt")
  File "/home/matthew/Documents/wellcome/LinkedBooksDeepReferenceParsing/crf_baseline/code/utils.py", line 50, in load_data
    tags4.append(w[4])
IndexError: list index out of range
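
The failing statement, `w[4]`, needs at least five fields, while the dataset example in the README has four columns per line. A quick way to test whether the column count is the cause is to count fields per line, as in the sketch below. It assumes `w` is a whitespace split of one input line, which cannot be confirmed from the traceback alone; `column_counts` is a hypothetical helper, not part of the repository.

```python
from collections import Counter


def column_counts(lines):
    """Count non-blank lines by their number of whitespace-separated fields."""
    counts = Counter()
    for line in lines:
        fields = line.split()
        if fields:
            counts[len(fields)] += 1
    return counts


# Rows copied from the README example; replace with an open file to check a split.
sample = [
    "-DOCSTART- -X- -X- o",
    "",
    "C author b-secondary b-r",
    ". author i-secondary i-r",
]
counts = column_counts(sample)

# Any line with fewer than five fields would make an access to index 4 fail.
too_short = sum(n for width, n in counts.items() if width < 5)
print(dict(counts), too_short)
```

If every line of the input file reports four fields, the loader and the file disagree about the number of columns.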

AttributeError: 'Model' object has no attribute 'output_layers'

When running python keras/main_multiTaskLearning.py I run into AttributeError: 'Model' object has no attribute 'output_layers'.

Full traceback included below:

(virtualenv) matthew@xps15 ~/Documents/wellcome/LinkedBooksDeepReferenceParsing (master) $ python keras/main_multiTaskLearning.py
WARNING: Logging before flag parsing goes to stderr.
W0720 20:58:02.091192 140559446904960 deprecation_wrapper.py:119] From keras/main_multiTaskLearning.py:6: The name tf.set_random_seed is deprecated. Please use tf.compat.v1.set_random_seed instead.

Using TensorFlow backend.
Number of  entries:  828394
Individual entries:  57099
Number of labels:  27
Number of labels:  10
Number of labels:  4
{1: 'abbreviation', 2: 'archivalreference', 3: 'archive_lib', 4: 'attachment', 5: 'author', 6: 'box', 7: 'cartulation', 8: 'column', 9: 'conjunction', 10: 'date', 11: 'filza', 12: 'folder', 13: 'foliation', 14: 'numbered_ref', 15: 'o', 16: 'pagination', 17: 'publicationnumber-year', 18: 'publicationplace', 19: 'publicationspecifications', 20: 'publisher', 21: 'ref', 22: 'registry', 23: 'series', 24: 'title', 25: 'tomo', 26: 'volume', 27: 'year'}
{1: 'b-meta-annotation', 2: 'b-primary', 3: 'b-secondary', 4: 'e-meta-annotation', 5: 'e-primary', 6: 'e-secondary', 7: 'i-meta-annotation', 8: 'i-primary', 9: 'i-secondary', 10: 'o'}
{1: 'b-r', 2: 'e-r', 3: 'i-r', 4: 'o'}
Maximum sequence length - general : 73
Maximum sequence length - data    : 73
Maximum sequence length - general : 73
Maximum sequence length - data    : 30
Maximum sequence length - general : 73
Maximum sequence length - data    : 35
Maximum sequence length - labels : 73
Maximum sequence length - labels : 73
Maximum sequence length - labels : 73
Maximum sequence length - labels : 73
Maximum sequence length - labels : 73
Maximum sequence length - labels : 73
Maximum sequence length - labels : 73
Maximum sequence length - labels : 73
Maximum sequence length - labels : 73
Maximum number of words in a sequence  : 73
Maximum number of characters in a word : 54
====== multi_task start ======
W0720 20:58:23.178692 140559446904960 deprecation_wrapper.py:119] From /home/matthew/Documents/wellcome/LinkedBooksDeepReferenceParsing/build/virtualenv/lib/python3.7/site-packages/keras/backend/tensorflow_backend.py:74: The name tf.get_default_graph is deprecated. Please use tf.compat.v1.get_default_graph instead.

W0720 20:58:23.178883 140559446904960 deprecation_wrapper.py:119] From /home/matthew/Documents/wellcome/LinkedBooksDeepReferenceParsing/build/virtualenv/lib/python3.7/site-packages/keras/backend/tensorflow_backend.py:517: The name tf.placeholder is deprecated. Please use tf.compat.v1.placeholder instead.

W0720 20:58:48.890198 140559446904960 deprecation_wrapper.py:119] From /home/matthew/Documents/wellcome/LinkedBooksDeepReferenceParsing/build/virtualenv/lib/python3.7/site-packages/keras/backend/tensorflow_backend.py:4138: The name tf.random_uniform is deprecated. Please use tf.random.uniform instead.

W0720 20:58:48.894599 140559446904960 deprecation_wrapper.py:119] From /home/matthew/Documents/wellcome/LinkedBooksDeepReferenceParsing/build/virtualenv/lib/python3.7/site-packages/keras/backend/tensorflow_backend.py:174: The name tf.get_default_session is deprecated. Please use tf.compat.v1.get_default_session instead.

2019-07-20 20:58:48.894895: I tensorflow/core/platform/cpu_feature_guard.cc:142] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
2019-07-20 20:58:48.915630: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 2904000000 Hz
2019-07-20 20:58:48.916541: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x55eca50c1120 executing computations on platform Host. Devices:
2019-07-20 20:58:48.916563: I tensorflow/compiler/xla/service/service.cc:175]   StreamExecutor device (0): <undefined>, <undefined>
2019-07-20 20:58:48.925211: W tensorflow/compiler/jit/mark_for_compilation_pass.cc:1412] (One-time warning): Not using XLA:CPU for cluster because envvar TF_XLA_FLAGS=--tf_xla_cpu_global_jit was not set.  If you want XLA:CPU, either set that envvar, or use experimental_jit_scope to enable XLA:CPU.  To confirm that XLA is active, pass --vmodule=xla_compilation_cache=1 (as a proper command-line flag, not via TF_XLA_FLAGS) or set the envvar XLA_FLAGS=--xla_hlo_profile.
W0720 20:58:49.350232 140559446904960 deprecation.py:506] From /home/matthew/Documents/wellcome/LinkedBooksDeepReferenceParsing/build/virtualenv/lib/python3.7/site-packages/keras/backend/tensorflow_backend.py:3445: calling dropout (from tensorflow.python.ops.nn_ops) with keep_prob is deprecated and will be removed in a future version.
Instructions for updating:
Please use `rate` instead of `keep_prob`. Rate should be set to `rate = 1 - keep_prob`.
/home/matthew/Documents/wellcome/LinkedBooksDeepReferenceParsing/build/virtualenv/lib/python3.7/site-packages/keras_contrib-2.0.8-py3.7.egg/keras_contrib/layers/crf.py:346: UserWarning: CRF.loss_function is deprecated and it might be removed in the future. Please use losses.crf_loss instead.
/home/matthew/Documents/wellcome/LinkedBooksDeepReferenceParsing/build/virtualenv/lib/python3.7/site-packages/keras_contrib-2.0.8-py3.7.egg/keras_contrib/layers/crf.py:363: UserWarning: CRF.viterbi_acc is deprecated and it might be removed in the future. Please use metrics.viterbi_acc instead.
W0720 20:58:50.320957 140559446904960 deprecation_wrapper.py:119] From /home/matthew/Documents/wellcome/LinkedBooksDeepReferenceParsing/build/virtualenv/lib/python3.7/site-packages/keras/optimizers.py:790: The name tf.train.Optimizer is deprecated. Please use tf.compat.v1.train.Optimizer instead.

W0720 20:58:50.371632 140559446904960 deprecation.py:323] From /home/matthew/Documents/wellcome/LinkedBooksDeepReferenceParsing/build/virtualenv/lib/python3.7/site-packages/tensorflow/python/ops/math_ops.py:2403: add_dispatch_support.<locals>.wrapper (from tensorflow.python.ops.array_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use tf.where in 2.0, which has the same broadcast rule as np.where
Traceback (most recent call last):
  File "keras/main_multiTaskLearning.py", line 87, in <module>
    gen_confusion_matrix=True, early_stopping_patience=5
  File "/home/matthew/Documents/wellcome/LinkedBooksDeepReferenceParsing/keras/code/models.py", line 177, in BiLSTM_model
    hist = model.fit(X_train, y_train, validation_data=[X_test, y_test], epochs=nbr_epochs, batch_size=batch_size, callbacks=callbacks, verbose=2)
  File "/home/matthew/Documents/wellcome/LinkedBooksDeepReferenceParsing/build/virtualenv/lib/python3.7/site-packages/keras/engine/training.py", line 1039, in fit
    validation_steps=validation_steps)
  File "/home/matthew/Documents/wellcome/LinkedBooksDeepReferenceParsing/build/virtualenv/lib/python3.7/site-packages/keras/engine/training_arrays.py", line 127, in fit_loop
    callbacks.on_train_begin()
  File "/home/matthew/Documents/wellcome/LinkedBooksDeepReferenceParsing/build/virtualenv/lib/python3.7/site-packages/keras/callbacks.py", line 132, in on_train_begin
    callback.on_train_begin(logs)
  File "/home/matthew/Documents/wellcome/LinkedBooksDeepReferenceParsing/keras/code/utils.py", line 381, in on_train_begin
    if len(self.model.output_layers) > 1:
AttributeError: 'Model' object has no attribute 'output_layers'

I'm running python 3.7.0 with the following package versions:

absl-py==0.7.1
astor==0.8.0
cycler==0.10.0
gast==0.2.2
google-pasta==0.1.7
grpcio==1.22.0
h5py==2.9.0
joblib==0.13.2
Keras==2.2.4
Keras-Applications==1.0.8
keras-contrib==2.0.8
Keras-Preprocessing==1.1.0
kiwisolver==1.1.0
Markdown==3.1.1
matplotlib==3.1.1
numpy==1.16.4
protobuf==3.9.0
pyparsing==2.4.0
python-crfsuite==0.9.6
python-dateutil==2.8.0
PyYAML==5.1.1
scikit-learn==0.21.2
scipy==1.3.0
six==1.12.0
sklearn-crfsuite==0.3.6
tabulate==0.8.3
tensorboard==1.14.0
tensorflow==1.14.0
tensorflow-estimator==1.14.0
termcolor==1.1.0
tqdm==4.32.2
Werkzeug==0.15.5
wrapt==1.11.2

Note that my versions do not match those required in the README.md. However, I tried installing scikit-learn==0.19.1 (for example) and immediately ran into installation issues due to version incompatibilities, which I think would mostly be solved with a requirements.txt, hence #3.
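
The traceback stops at `len(self.model.output_layers)` in the callback. One possible workaround, sketched below, is to count outputs without depending on a single attribute name, falling back to the `outputs` list that Keras `Model` objects expose. This is an untested assumption about what the callback needs, not a fix verified against this repository; `count_outputs` and the two stand-in classes are hypothetical.

```python
def count_outputs(model):
    """Return the number of model outputs, trying known attribute names in order."""
    for name in ("output_layers", "outputs"):
        value = getattr(model, name, None)
        if value is not None:
            return len(value)
    raise AttributeError("model exposes neither 'output_layers' nor 'outputs'")


class OldStyleModel:
    """Hypothetical stand-in that has the attribute the callback expects."""
    output_layers = ["task1", "task2", "task3"]


class NewStyleModel:
    """Hypothetical stand-in that only exposes `outputs`."""
    outputs = ["task1", "task2", "task3"]


print(count_outputs(OldStyleModel()), count_outputs(NewStyleModel()))
```

In the callback, the check would then read `if count_outputs(self.model) > 1:`. Pinning the Keras version the code was written for is the other obvious route, which is what the requirements.txt request is about.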

Tokeniser

Hi @Giovanni1085 do you have any information about the tokeniser that was used on the training texts?
