https://competitions.codalab.org/competitions/20900
Welcome to the largest expertly annotated corpus for complex definition extraction in free text. Pardon our dust - this data is associated with SemEval 2020 Task 6 (DeftEval) and we are releasing the full dataset on the SemEval conference schedule. Train and dev data are available, and test data will become available after the completion of the SemEval evaluation period on 2 Feb 2020. You can source the complete text from the corresponding textbooks at https://cnx.org.
The most recent version of the corpus was updated on 04 SEPT 2019.
For more information regarding the annotation, schema, or general characteristics of the corpus, please see our paper here.
We are currently releasing annotated data using a CoNLL 2003-like format with the following structure:
TOKEN TXT_SOURCE_FILE START_CHAR END_CHAR TAG TAG_ID ROOT_ID RELATION
Character indices are derived from the brat standoff format. Tags follow a BIO format with the tag schema outlined in the paper.
The entire dataset of textbook sentences with annotations is available for use under the CC BY-NC-SA 4.0 license. Contact the authors for information on commercial use.
If you use the DEFT corpus in your publication, please cite this paper:
@inproceedings{spala-etal-2019-deft,
title = "{DEFT}: A corpus for definition extraction in free- and semi-structured text",
author = "Spala, Sasha and
Miller, Nicholas A. and
Yang, Yiming and
Dernoncourt, Franck and
Dockhorn, Carl",
booktitle = "Proceedings of the 13th Linguistic Annotation Workshop",
month = aug,
year = "2019",
address = "Florence, Italy",
publisher = "Association for Computational Linguistics",
url = "https://www.aclweb.org/anthology/W19-4015",
pages = "124--131",
abstract = "Definition extraction has been a popular topic in NLP research for well more than a decade, but has been historically limited to well-defined, structured, and narrow conditions. In reality, natural language is messy, and messy data requires both complex solutions and data that reflects that reality. In this paper, we present a robust English corpus and annotation schema that allows us to explore the less straightforward examples of term-definition structures in free and semi-structured text.",
}
My Commands --mode training --experiment car_mb --collection mb --local_model models/pytorch_car.tar.gz --local_tokenizer models/bert-large-uncased-vocab.txt --batch_size 16 --data_path data --predict_path data/predictions/predict.car_mb --model_path models/saved.car_mb --eval_steps 1000 --device cuda
--mode training --experiment msmarco_mb --collection mb --local_model models/pytorch_car.tar.gz --local_tokenizer models/bert-large-uncased-vocab.txt --batch_size 16 --data_path data --predict_path data/predictions/predict.car_mb --model_path models/saved.car_mb --eval_steps 1000 --device cuda
--mode training --experiment mb --collection mb --local_model models/bert-large-uncased.tar.gz --local_tokenizer models/bert-large-uncased-vocab.txt --batch_size 16 --data_path data --predict_path data/predictions/predict.mb --model_path models/saved.mb --eval_steps 1000 --device cuda