
tanl's Introduction

TANL: Structured Prediction as Translation between Augmented Natural Languages

Code for the paper "Structured Prediction as Translation between Augmented Natural Languages" (ICLR 2021) and fine-tuned multi-task model.

If you use this code, please cite the paper using the BibTeX entry below.

@inproceedings{tanl,
    title={Structured Prediction as Translation between Augmented Natural Languages},
    author={Giovanni Paolini and Ben Athiwaratkun and Jason Krone and Jie Ma and Alessandro Achille and Rishita Anubhai and Cicero Nogueira dos Santos and Bing Xiang and Stefano Soatto},
    booktitle={9th International Conference on Learning Representations, {ICLR} 2021},
    year={2021},
}

Requirements

  • Python 3.6+
  • PyTorch (tested with version 1.7.1)
  • Transformers (tested with version 4.0.0)
  • NetworkX (tested with version 2.5, only used in coreference resolution)

You can install all required Python packages with pip install -r requirements.txt

Datasets

By default, datasets are expected to be in data/DATASET_NAME. Dataset-specific code is in datasets.py.

The CoNLL04 and ADE datasets (joint entity and relation extraction) in the correct format can be downloaded using https://github.com/markus-eberts/spert/blob/master/scripts/fetch_datasets.sh. For other datasets, we provide sample processing code which does not necessarily match the format of publicly available versions (we do not plan to adapt the code to load datasets in other formats).
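
As a quick sanity check of the expected layout, the following minimal sketch (not part of the repository; the file names inside each directory depend on the dataset-specific loader in datasets.py) lists the contents of a dataset directory:

from pathlib import Path

dataset_name = "conll04"
dataset_dir = Path("data") / dataset_name  # default location: data/DATASET_NAME
if not dataset_dir.is_dir():
    raise FileNotFoundError(f"expected dataset files under {dataset_dir}")
print(sorted(p.name for p in dataset_dir.iterdir()))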

Running the code

Use the following command: python run.py JOB

The JOB argument refers to a section of the config file, which by default is config.ini. A sample config file is provided, with settings that allow for faster training and lower memory usage than the settings used to obtain the final results in the paper.

For example, to replicate the paper's results on CoNLL04, include the following section in the config file:

[conll04_final]
datasets = conll04
model_name_or_path = t5-base
num_train_epochs = 200
max_seq_length = 256
max_seq_length_eval = 512
train_split = train,dev
per_device_train_batch_size = 8
per_device_eval_batch_size = 16
do_train = True
do_eval = False
do_predict = True
episodes = 1-10
num_beams = 8

Then run python run.py conll04_final. Note that the final results will differ slightly from the ones reported in the paper, due to small code changes and randomness.

Config arguments can be overwritten by command line arguments. For example: python run.py conll04_final --num_train_epochs 50.

Additional details

If do_train = True, the model is trained on the given train split (e.g., 'train') of the given datasets. The final weights and intermediate checkpoints are written to a directory such as experiments/conll04_final-t5-base-ep200-len256-b8-train, with one subdirectory per episode. Results in JSON format are also saved there.

In every episode, the model is trained on a different (random) permutation of the training set. The random seed is given by the episode number, so that every episode always produces exactly the same model.
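
As a rough illustration (a sketch only, not the repository's actual training code), the per-episode shuffling can be thought of as follows, with the episode number acting as the seed:

import random

def episode_permutation(num_examples, episode):
    # Shuffle the training indices with the episode number as the seed,
    # so the same episode always yields the same permutation.
    indices = list(range(num_examples))
    random.Random(episode).shuffle(indices)
    return indices

print(episode_permutation(5, episode=1))  # deterministic for a given episode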

Once a model is trained, it is possible to evaluate it without training again. For this, set do_train = False or (more easily) provide the -e command-line argument: python run.py conll04_final -e.

If do_eval = True, the model is evaluated on the 'dev' split. If do_predict = True, the model is evaluated on the 'test' split.

Arguments

The following are the most important command-line arguments for the run.py script. Run python run.py -h for the full list.

  • -c CONFIG_FILE: specify config file to use (default is config.ini)
  • -e: only run evaluation (overwrites the setting do_train in the config file)
  • -a: also evaluate intermediate checkpoints, in addition to the final model
  • -v: print results for each evaluation run
  • -g GPU: specify which GPU to use for evaluation

The following are the most important arguments for the config file. See the sample config file to understand the format.

  • datasets (str): comma-separated list of datasets for training
  • eval_datasets (str): comma-separated list of datasets for evaluation (default is the same as for training)
  • model_name_or_path (str): path to pretrained model or model identifier from huggingface.co/models (e.g. t5-base)
  • do_train (bool): whether to run training (default is False)
  • do_eval (bool): whether to run evaluation on the dev set (default is False)
  • do_predict (bool): whether to run evaluation on the test set (default is False)
  • train_split (str): comma-separated list of data splits for training (default is train)
  • num_train_epochs (int): number of train epochs
  • learning_rate (float): initial learning rate (default is 5e-4)
  • train_subset (float > 0 and <=1): portion of training data to effectively use during training (default is 1, i.e., use all training data)
  • per_device_train_batch_size (int): batch size per GPU during training (default is 8)
  • per_device_eval_batch_size (int): batch size during evaluation (default is 8; only one GPU is used for evaluation)
  • max_seq_length (int): maximum input sequence length after tokenization; longer sequences are truncated
  • max_output_seq_length (int): maximum output sequence length (default is max_seq_length)
  • max_seq_length_eval (int): maximum input sequence length for evaluation (default is max_seq_length)
  • max_output_seq_length_eval (int): maximum output sequence length for evaluation (default is max_output_seq_length or max_seq_length_eval or max_seq_length; see the sketch after this list)
  • episodes (str): episodes to run (default is 0; an interval can be specified, such as 1-4; the episode number is used as the random seed)
  • num_beams (int): number of beams for beam search during generation (default is 1)
  • multitask (bool): if True, the name of the dataset is prepended to each input sentence (default is False)
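
The fallback chain for the evaluation output length can be summarized with the following sketch (illustrative only; the actual resolution happens in arguments.py, and the function name here is made up):

from types import SimpleNamespace

def resolve_max_output_seq_length_eval(args):
    # Use the first value that is explicitly set, in this order.
    return (args.max_output_seq_length_eval
            or args.max_output_seq_length
            or args.max_seq_length_eval
            or args.max_seq_length)

# e.g. with only max_seq_length_eval and max_seq_length set:
args = SimpleNamespace(max_output_seq_length_eval=None, max_output_seq_length=None,
                       max_seq_length_eval=512, max_seq_length=256)
print(resolve_max_output_seq_length_eval(args))  # 512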

See arguments.py and transformers.TrainingArguments for additional config arguments.

Fine-tuned multi-task model

The weights of our multi-task model (released under the CC BY 4.0 license) can be downloaded here: https://tanl.s3.amazonaws.com/tanl-multitask.zip

Extract the zip file in the experiments/ directory. This will create a subdirectory called multitask-t5-base-ep50-len512-b8-train,dev-overlap96. For example, to test the multi-task model on the CoNLL04 dataset, run python run.py multitask -e --eval_datasets conll04.

Note that:

  • the multitask job is defined in config.ini;
  • the -e flag is used to skip training and run evaluation only;
  • the name of the subdirectory containing the weights is compatible with the definition of the multitask job.

The multi-task model was fine-tuned as described in the paper. The results differ slightly from what is reported in the paper due to small code changes.

Licenses

The code of this repository is released under the Apache 2.0 license. The weights of the fine-tuned multi-task model are released under the CC BY 4.0 license.

tanl's People

Contributors

amazon-auto, giove91, jasonkrone


tanl's Issues

Bug in augment_sentence function

Hi, there seems to be a small bug in the augment_sentence function in utils.py. When the root of the entity tree is an entity with tags, those tags are not added to the output. For example, when I run the code below:

from utils import augment_sentence

# first example: entities plus a relation (works as expected)
tokens = ['Tolkien', 'was', 'born', 'here']
augmentations = [
    ([('person',), ('born in', 'here')], 0, 1),
    ([('location',)], 3, 4),
]

# second example, from the test set of CoNLL03 NER: a single entity spanning
# the whole sentence (this overwrites the variables above and is the failing case)
tokens = ['Premier', 'league']
augmentations = [([('miscellaneous',)], 0, 2)]

begin_entity_token = "["
sep_token = "|"
relation_sep_token = "="
end_entity_token = "]"

augmented_output = augment_sentence(tokens, augmentations, begin_entity_token, sep_token, relation_sep_token, end_entity_token)
print(augmented_output)

It prints Premier league instead of [ Premier league | miscellaneous ]. This happens because on line 124 of utils.py, the value of the root of the entity tree is reset to an empty list. My quick fix is to initialize the start index of the root as -1, i.e., to change line 103 in utils.py to

root = (None, -1, len(tokens))   # this node represents the entire sentence
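
With this change, a quick check (a minimal sketch, assuming the patched utils.py is importable) gives the expected output:

from utils import augment_sentence

tokens = ['Premier', 'league']
augmentations = [([('miscellaneous',)], 0, 2)]
output = augment_sentence(tokens, augmentations, '[', '|', '=', ']')
assert output == '[ Premier league | miscellaneous ]'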

It would be great if someone could let me know if I am correct on this. Thanks!

reproduce on other datasets

Since you mentioned that "for other datasets, we provide sample processing code which does not necessarily match the format of publicly available versions (we do not plan to adapt the code to load datasets in other formats)", I'd like to know how I can reproduce the paper's results on the other datasets.

Inquiry Regarding ACE2005-Event Data for TANL

Hi,
Could you kindly share the ACE2005 dataset for event extraction (the event trigger and event argument datasets), or provide guidance on how I might obtain access to it?
Thanks!

About performance on tacred

Hi,

Thanks for sharing the code. I tried to reproduce the results on TACRED, but the F1 score on the test set is only 67.67.

The config I used is listed below.

[tacred]
datasets = tacred
multitask = False
model_name_or_path = t5-base
num_train_epochs = 10
max_seq_length = 256
train_split = train
per_device_train_batch_size = 16
do_train = True
do_eval = True
do_predict = True

I ran the code with

CUDA_VISIBLE_DEVICES=0,1 nohup python3 -m torch.distributed.launch --nproc_per_node=2 run.py tacred > result.log 2>&1 &

May I ask what might be going wrong? Thank you.

Regards,
Yiming

The format of Multiwoz dataset

Hi Giovanni,

Nice work, and thanks for sharing. I am reproducing the results on the DST task. However, I found that the format of the MultiWOZ 2.1 data processed with the script from https://github.com/jasonwu0731/trade-dst does not match what your code expects. May I ask if you performed an additional preprocessing step? If so, would you mind sharing the script?

Sincerely,
Yan

Episode numbers in few-shot experiment

Hi,
Thank you for sharing the code! I'm trying to reproduce the results on FewRel 1.0, and I'm wondering how many episodes and how many queries are used in the 1-shot and 5-shot cases, respectively.

Thanks.

Ace2005EventExtraction Dataset

Hi,

I've followed the instructions per section A.5 of the paper using this github repo: https://github.com/nlpcl-lab/ace2005-preprocessing/tree/96c8fd4b5a8c87dd6a265d5c14f4d8b8eb9b7fbe

which gives me train/dev/test.json files for ace2005.

However, looking at tanl/datasets.py (https://github.com/amazon-research/tanl/blob/2bd8052f0ff6df3b8fd04d7da1469d73f8639099/datasets.py#L1165), I cannot find a way to run ACE2005. I currently receive the following error when attempting to train on ace2005:

FileNotFoundError: [Errno 2] No such file or directory: 'data/ace2005event/ace2005event_types.json'

Does anyone have advice on how to obtain the files needed to train ACE2005 event extraction, besides the train/dev/test.json files?

Thanks,

About data files used for the FewRel dataset

Hi! I'm wondering how to prepare the data files for the FewRel dataset.
Do we use the full train_wiki.json from https://github.com/thunlp/FewRel/tree/master/data as the training split for meta-training, and the full val_wiki.json for evaluation (support & query)? I'm confused because I notice that the fewrel_meta config also specifies do_eval = True. In that case, what dev split would the code use?
Would appreciate any guidance on this!

ATIS and SNIPS Dataset Source

Hi,

Would you mind providing some instructions on where you found and how you preprocessed the ATIS and SNIPS datasets?

I found some .tsv files here for train/dev/test, but the format is not exactly what tanl/datasets.py expects.

Thanks,

CoNLL2012 Datasets in .json format

datasets.py expects .json files for the CoNLL2012 dataset. However, after searching online, I cannot find any preprocessing tool that yields .json files for CoNLL2012.

Would the authors be able to provide a way to preprocess the CoNLL2012 dataset so that it can be used for training?

Thanks,
