
aakhundov / sequence-labeling


Accompanying repository of our NLP paper "Sequence Labeling: A Practical Approach" (arXiv: 1808.03926).

License: Apache License 2.0

Python 100.00%
nlp machine-learning sequence-labeling practical-machine-learning bilstm-crf embeddings


Sequence Labeling: A Practical Approach

Accompanying repository of the paper "Sequence Labeling: A Practical Approach". The code works well with TensorFlow 1.3 and 1.4; performance drops substantially under version 1.5 and above. The resources within the repo are structured as follows:

  • convert - the folder with several scripts for converting the standard sequence labeling datasets (including those mentioned in the paper) to the standardized format recognizable by the code. The instructions on converting each particular standard dataset are in the heading comments of the respective conversion script. The recognizable format is:

    • two files with the data - train.txt and val.txt - each containing the respective part of the dataset (training and validation/development set). Each line of these files should contain the space-separated list of tokens of one sentence, separated by a tab from the space-separated list of the corresponding labels.
    • one file with the set of all available labels - labels.txt - containing one label per line. The labels in the file should come in alphabetical order. The exception is the null-label (e.g. "O") in multi-token labeling tasks (e.g. the IOB or IOBES tagging scheme used for NER or Chunking), which should be on the last line.
  • The files with the word embeddings should be copied into the sub-folders of the data/embeddings folder. Each sub-folder corresponds to one of the supported embedding types - GloVe, Polyglot, or Senna - and contains its own specific instructions.

  • train.py trains and saves a sequence labeling model, given the path to the folder with the input data in the recognizable format (see above). The specification of the command-line arguments can be obtained by running the script with the "-h" flag. The training results are written to a sub-folder of the "results" folder named after the input folder plus a timestamp. These results can then be used for evaluation on and/or annotation of new data.

  • evaluate.py evaluates the trained model, given the path to the training results and the name of the data file (within the original data folder) to evaluate on (e.g. "test.txt" containing the testing set). The specification of the command-line arguments can be obtained by running the script with the "-h" flag.

  • annotate.py annotates new data, given the path to the training results (containing the trained model to be used for the annotation) and the path to the data to be annotated in the recognizable format (see above; labels are not required).

  • logs folder contains the detailed results of the experiments on the eight standard datasets mentioned in the paper. Each dataset folder contains six sub-folders corresponding to six different scenarios used in the ablation studies (with or without byte embeddings, word embeddings, and CRF layer). Each scenario folder, in turn, contains the following three files: training_log.txt (detailed log of the model training with the validation results after each epoch), evaluation_log.txt (the results of the evaluation performed using the official CoNLL evaluation script), and labeled_test_set.zip (the test set with the predicted and ground truth labels, ready to be evaluated by the above-mentioned CoNLL evaluation script; for privacy reasons, all tokens in the file are replaced with a "W" token).
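
To make the recognizable data format described above more concrete, here is a minimal Python sketch of how one line of train.txt / val.txt and the contents of labels.txt could be produced and read back. The helper names (format_example, parse_example, order_labels) are hypothetical and are not taken from the repository; the sketch also assumes that "alphabetical order" means Python's default string ordering and that the null-label is "O".

```python
def format_example(tokens, labels):
    """Build one data line: space-separated tokens, a tab, space-separated labels."""
    if len(tokens) != len(labels):
        raise ValueError("each token needs exactly one label")
    return " ".join(tokens) + "\t" + " ".join(labels)


def parse_example(line):
    """Split one data line into (tokens, labels).

    Labels may be missing (e.g. data meant only for annotation),
    in which case an empty label list is returned.
    """
    tokens_part, _, labels_part = line.rstrip("\n").partition("\t")
    tokens = tokens_part.split(" ")
    labels = labels_part.split(" ") if labels_part else []
    return tokens, labels


def order_labels(labels, null_label="O"):
    """Sort labels alphabetically and move the null-label (assumed "O") to the end."""
    ordered = sorted(set(labels))
    if null_label in ordered:
        ordered.remove(null_label)
        ordered.append(null_label)
    return ordered


if __name__ == "__main__":
    line = format_example(["John", "lives", "in", "Berlin"],
                          ["S-PER", "O", "O", "S-LOC"])
    print(repr(line))
    print(parse_example(line))
    # One label per line, as expected in labels.txt
    print("\n".join(order_labels(["O", "S-PER", "S-LOC", "B-LOC", "E-LOC"])))
```

Writing the output of format_example line by line to train.txt and val.txt, and the output of order_labels to labels.txt, would yield a folder in the layout described in the list above; the actual conversion scripts in the convert folder remain the reference for real datasets.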
