fciannella / deft_corpus

This project forked from adobe-research/deft_corpus

The Definition Extraction From Text corpus and relevant formatting scripts

License: Other

Python 82.98% Jupyter Notebook 17.02%

Welcome to the DEFT corpus!

https://competitions.codalab.org/competitions/20900

Welcome to the largest expertly annotated corpus for complex definition extraction in free text. Pardon our dust - this data is associated with SemEval 2020 Task 6 (DeftEval) and we are releasing the full dataset on the SemEval conference schedule. Train and dev data are available, and test data will become available after the completion of the SemEval evaluation period on 2 Feb 2020. You can source the complete text from the corresponding textbooks at https://cnx.org.

The most recent version of the corpus was updated on 04 SEPT 2019.

For more information regarding the annotation, schema, or general characteristics of the corpus, please see our paper (https://www.aclweb.org/anthology/W19-4015).

Data Format

We are currently releasing annotated data using a CoNLL 2003-like format with the following structure:

TOKEN TXT_SOURCE_FILE START_CHAR END_CHAR TAG TAG_ID ROOT_ID RELATION

Character indices are derived from the brat standoff format. Tags follow a BIO format with the tag schema outlined in the paper.
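
As a rough illustration of the column layout above, here is a minimal Python sketch of a reader for this format. It assumes that columns are tab-separated and that blank lines separate sentences, as in CoNLL 2003; neither is stated here, so check the actual data files first. The names DeftToken, parse_deft, and SAMPLE are hypothetical, and the sample rows (file name, offsets, tags, IDs, relation labels) are invented for illustration and are not taken from the corpus.

```python
from typing import Iterable, Iterator, List, NamedTuple


class DeftToken(NamedTuple):
    """One row of the 8-column format (hypothetical container, not part of the corpus tooling)."""
    token: str
    source_file: str
    start_char: int
    end_char: int
    tag: str
    tag_id: str
    root_id: str
    relation: str


def parse_deft(lines: Iterable[str]) -> Iterator[List[DeftToken]]:
    """Yield one list of DeftToken per sentence.

    Assumes tab-separated columns and blank lines between sentences.
    """
    sentence: List[DeftToken] = []
    for line_no, raw in enumerate(lines, start=1):
        line = raw.rstrip("\n")
        if not line.strip():
            # A blank line closes the current sentence, if there is one.
            if sentence:
                yield sentence
                sentence = []
            continue
        fields = [field.strip() for field in line.split("\t")]
        if len(fields) != 8:
            raise ValueError(
                f"line {line_no}: expected 8 columns, found {len(fields)}"
            )
        token, source_file, start, end, tag, tag_id, root_id, relation = fields
        sentence.append(
            DeftToken(token, source_file, int(start), int(end),
                      tag, tag_id, root_id, relation)
        )
    # Flush the last sentence when the input does not end with a blank line.
    if sentence:
        yield sentence


# Invented sample rows; every value here is a placeholder.
SAMPLE = (
    "Osmosis\texample.txt\t0\t7\tB-Term\tT1\tT2\trel\n"
    "is\texample.txt\t8\t10\tO\t-1\t-1\t0\n"
    "\n"
    "Next\texample.txt\t11\t15\tO\t-1\t-1\t0\n"
)

sentences = list(parse_deft(SAMPLE.splitlines()))
for sent in sentences:
    print([(tok.token, tok.tag) for tok in sent])
```

To read a real file, pass an open file handle instead of the sample lines, for example `parse_deft(open(path, encoding="utf-8"))`, where `path` points at one of the released data files.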

Licensing Information

The entire dataset of textbook sentences with annotations is available for use under the CC BY-NC-SA 4.0 license. Contact the authors for information on commercial use.

Citation

If you use the DEFT corpus in your publication, please cite this paper:

@inproceedings{spala-etal-2019-deft,
    title = "{DEFT}: A corpus for definition extraction in free- and semi-structured text",
    author = "Spala, Sasha  and
      Miller, Nicholas A.  and
      Yang, Yiming  and
      Dernoncourt, Franck  and
      Dockhorn, Carl",
    booktitle = "Proceedings of the 13th Linguistic Annotation Workshop",
    month = aug,
    year = "2019",
    address = "Florence, Italy",
    publisher = "Association for Computational Linguistics",
    url = "https://www.aclweb.org/anthology/W19-4015",
    pages = "124--131",
    abstract = "Definition extraction has been a popular topic in NLP research for well more than a decade, but has been historically limited to well-defined, structured, and narrow conditions. In reality, natural language is messy, and messy data requires both complex solutions and data that reflects that reality. In this paper, we present a robust English corpus and annotation schema that allows us to explore the less straightforward examples of term-definition structures in free and semi-structured text.",
}

My Commands

--mode training --experiment car_mb --collection mb --local_model models/pytorch_car.tar.gz --local_tokenizer models/bert-large-uncased-vocab.txt --batch_size 16 --data_path data --predict_path data/predictions/predict.car_mb --model_path models/saved.car_mb --eval_steps 1000 --device cuda

--mode training --experiment msmarco_mb --collection mb --local_model models/pytorch_car.tar.gz --local_tokenizer models/bert-large-uncased-vocab.txt --batch_size 16 --data_path data --predict_path data/predictions/predict.car_mb --model_path models/saved.car_mb --eval_steps 1000 --device cuda

--mode training --experiment mb --collection mb --local_model models/bert-large-uncased.tar.gz --local_tokenizer models/bert-large-uncased-vocab.txt --batch_size 16 --data_path data --predict_path data/predictions/predict.mb --model_path models/saved.mb --eval_steps 1000 --device cuda
