
pytorch-sequence-tagger

This repo builds a model for sequence tagging (POS tagging, NER, etc.). The implementation follows the experiments in [1]. The model is a combination of an embedding layer, a Bi-LSTM, and a CRF. The architecture is shown below.

[architecture diagram: embedding → Bi-LSTM → CRF]

The difference between the architecture above and the one implemented in this exercise is that a linear projection is added before the CRF layer to map the Bi-LSTM output to the size of the tag set.

  • The embedding used by this architecture is a combination of word embedding and character embedding. Two types of character embedding are available in this repository: an LSTM-based one from [3] and a CNN-based one from [2]. According to [2], a convolutional layer combined with max pooling is able to capture morphological features across characters. In [3], although the model employed is different (an LSTM extracts the character-level features), the character-level representations help make the model more robust when handling OOV words in low-resource languages.
  • The sentence encoder used in this exercise is a Bi-LSTM, which is close to a go-to architecture for NLP problems. It works by traversing the sentence in forward and backward directions and combining the vectors produced by the two passes.
  • Thanks to the Bi-LSTM, at every timestep t we obtain a hidden state h_t that combines the current word with its surrounding context. However, tag decisions that rely on the Bi-LSTM output alone are still local. A CRF is therefore employed in this exercise to mitigate such local decisions by scoring whole tag sequences.
  • Weight and bias initialization is applied to the RNN, embedding, and linear layers. Most of them use Xavier initialization, a choice based on empirical observation while tuning the model. A minimal sketch of the overall stack appears after this list.
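
Below is a minimal, hedged PyTorch sketch of that stack: combined word/character embeddings feed a Bi-LSTM, a linear projection maps its output to the tag-set size, and a CRF scores the sequence. Class and parameter names are illustrative, and the CRF here comes from the third-party pytorch-crf package rather than the repo's own implementation.

import torch
import torch.nn as nn
from torchcrf import CRF  # pip install pytorch-crf (an assumption; the repo may ship its own CRF)

class CharCNNEmbedding(nn.Module):
    """Character embedding via convolution + max pooling, as in [2]."""
    def __init__(self, num_chars, char_dim=30, out_dim=50, kernel=3):
        super().__init__()
        self.embed = nn.Embedding(num_chars, char_dim, padding_idx=0)
        self.conv = nn.Conv1d(char_dim, out_dim, kernel, padding=kernel // 2)

    def forward(self, chars):  # chars: (batch, seq_len, max_word_len)
        b, s, w = chars.shape
        x = self.embed(chars.view(b * s, w)).transpose(1, 2)  # (b*s, char_dim, w)
        x = torch.relu(self.conv(x)).max(dim=2).values        # max pooling over characters
        return x.view(b, s, -1)                               # (batch, seq_len, out_dim)

class BiLSTMCRFTagger(nn.Module):
    def __init__(self, num_words, num_chars, num_tags,
                 word_dim=100, char_out=50, hidden=200):
        super().__init__()
        self.word_embed = nn.Embedding(num_words, word_dim, padding_idx=0)
        self.char_embed = CharCNNEmbedding(num_chars, out_dim=char_out)
        self.lstm = nn.LSTM(word_dim + char_out, hidden,
                            batch_first=True, bidirectional=True)
        # Linear projection before the CRF, mapping the Bi-LSTM output
        # to the size of the tag set (the addition noted above).
        self.proj = nn.Linear(2 * hidden, num_tags)
        nn.init.xavier_uniform_(self.proj.weight)  # Xavier init, as described
        self.crf = CRF(num_tags, batch_first=True)

    def emissions(self, words, chars):
        x = torch.cat([self.word_embed(words), self.char_embed(chars)], dim=-1)
        h, _ = self.lstm(x)   # h_t combines word t with its left/right context
        return self.proj(h)   # (batch, seq_len, num_tags)

    def loss(self, words, chars, tags, mask):
        # Negative log-likelihood over whole tag sequences, so decisions
        # are no longer purely local to each timestep.
        return -self.crf(self.emissions(words, chars), tags, mask=mask)

    def decode(self, words, chars, mask):
        return self.crf.decode(self.emissions(words, chars), mask=mask)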

How to Run?

Data Preparation

Within the same folder as this README, put the data inside the data/ folder. Your data must follow the <dataset_name>.[dev|train|test].csv naming format, for example conll.dev.csv. All three splits must exist before you run the training phase.

Below is an example of the data format:

sentence,tags
CRICKET - LEICESTERSHIRE TAKE OVER AT TOP AFTER INNINGS VICTORY ., O O I-ORG O O O O O O O O
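
A hedged sketch of reading this format into (tokens, tags) pairs; the column names follow the example header above and the function name is illustrative, not the repo's actual loader.

import csv

def read_tagged_csv(path):
    # Each row pairs a whitespace-tokenized sentence with its tag sequence.
    with open(path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            tokens = row["sentence"].split()
            tags = row["tags"].split()
            assert len(tokens) == len(tags), "expected one tag per token"
            yield tokens, tags

# Usage: pairs = list(read_tagged_csv("data/conll.train.csv"))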

Installation

  1. Ensure you have Python 3.6 installed on your device.
  2. Run pip install -r requirements.txt to install the dependencies.

Training and Validation

Run python train.py with the following arguments:

optional arguments:
  -h, --help            show this help message and exit
  --batch_size BATCH_SIZE
                        Number of batch in single iteration
  --dataset_path DATASET_PATH
                        Path prefix for the dataset. Ex: data/conll
  --max_epochs MAX_EPOCHS
                        Number of epochs
  --model_config MODEL_CONFIG
                        Location of model config
  --learning_rate LEARNING_RATE
                        Learning rate size
  --log_interval LOG_INTERVAL
                        Print loss for every N steps
  --save_interval SAVE_INTERVAL
                        Save model for every N steps
  --restore_nth_model RESTORE_NTH_MODEL
                        Restore the nth model saved on model_path. The valid
                        values are string `latest`, and numbers (1, 2, 3, ...)
  --early_stopping_patience EARLY_STOPPING_PATIENCE
                        The number of patience required for early stopping
  --num_retained_models NUM_RETAINED_MODELS
                        Number of models retained for checkpoint
  --model_path MODEL_PATH
                        Location to save the model
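
An illustrative invocation (the paths and values here are examples, not shipped defaults):

  python train.py --dataset_path data/conll --model_config config/cnn.yml --batch_size 32 --max_epochs 50 --model_path checkpoints/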

Configuration

To run the training phase, one needs to prepare a configuration file with the following format:

embedding:
  embedding_type: ["lstm"|"cnn"]
  word_embedding_size: <int>
  char_embedding_size: <int>
  embedding_dropout: <float>
  output_size: <int>
  char_lstm_layers: <int>
  char_lstm_dropout: <float>
  pretrained: [null|<pretrained alias from torchtext.vocab, see https://github.com/pytorch/text/blob/master/torchtext/vocab.py>]
tagger:
  hidden_size: <int>
  layer_size: <int>
  rnn_dropout: <float>

Examples can be seen under config/.
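
For instance, a configuration using the CNN character encoder might look like the following (the values are illustrative, not the repo's shipped defaults):

embedding:
  embedding_type: "cnn"
  word_embedding_size: 100
  char_embedding_size: 30
  embedding_dropout: 0.5
  output_size: 150
  char_lstm_layers: 1
  char_lstm_dropout: 0.0
  pretrained: null
tagger:
  hidden_size: 200
  layer_size: 1
  rnn_dropout: 0.5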

References

  1. https://arxiv.org/pdf/1603.01354.pdf
  2. http://aclweb.org/anthology/D15-1025
  3. https://arxiv.org/pdf/1805.12291.pdf
