
NAACL 2019: Submodular optimization-based diverse paraphrasing and its effectiveness in data augmentation

License: Apache License 2.0

Python 100.00%
paper paraphrase-generation submodular-optimization diversity naacl2019 data-augmentation diverse-decoding natural-language-generation

dips's Introduction

Submodular optimization-based diverse paraphrasing and its effectiveness in data augmentation

Source code for NAACL 2019 paper: Submodular optimization-based diverse paraphrasing and its effectiveness in data augmentation

  • (Figure) Overview of DiPS during decoding to generate k paraphrases. At each time step, a set of N sequences V(t) is used to determine k < N sequences (X*) via submodular maximization. The figure illustrates the motivation behind each submodular component. Please see Section 4 in the paper for details.
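For intuition, here is a minimal sketch (not the repository's implementation) of the standard greedy algorithm for monotone submodular maximization that the selection step relies on: starting from the empty set, repeatedly add the candidate with the largest marginal gain under an objective F until k sequences are chosen. The distinct_unigrams objective below is a hypothetical stand-in; DiPS combines several fidelity and diversity components (see Section 4 of the paper).

from typing import Callable, List, Sequence

def greedy_submodular_select(
        candidates: List[Sequence[str]],                # the N candidate sequences V(t)
        k: int,                                         # number of sequences to keep (k < N)
        score: Callable[[List[Sequence[str]]], float],  # submodular objective F(X)
) -> List[Sequence[str]]:
    """Greedily build X* by repeatedly adding the candidate with the largest marginal gain."""
    selected: List[Sequence[str]] = []
    remaining = list(candidates)
    while len(selected) < k and remaining:
        gains = [score(selected + [c]) - score(selected) for c in remaining]
        best = max(range(len(remaining)), key=lambda i: gains[i])
        selected.append(remaining.pop(best))
    return selected

# A toy (hypothetical) objective: reward covering distinct unigrams across the selected set.
def distinct_unigrams(selected: List[Sequence[str]]) -> float:
    return float(len({tok for seq in selected for tok in seq}))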

Also on GEM/NL-Augmenter 🦎 → 🐍

Dependencies

  • Compatible with Python 3.6
  • Dependencies can be installed using requirements.txt

Dataset

Download the Quora and Twitter paraphrase datasets used in the paper.

Extract and place them in the data directory. Path: data/<dataset-folder-name>. A sample dataset folder might look like data/quora/<train/test/val>/<src.txt/tgt.txt>.
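As a quick sanity check of this layout before training, a small helper along these lines can be used (not part of the repository; the data/quora path and split names simply follow the example above):

import os

def check_dataset(root="data/quora"):
    """Verify that each split contains parallel src.txt / tgt.txt files of equal length."""
    for split in ("train", "val", "test"):
        src_path = os.path.join(root, split, "src.txt")
        tgt_path = os.path.join(root, split, "tgt.txt")
        for path in (src_path, tgt_path):
            if not os.path.isfile(path):
                raise FileNotFoundError("Missing " + path)
        with open(src_path) as f_src, open(tgt_path) as f_tgt:
            n_src = sum(1 for _ in f_src)
            n_tgt = sum(1 for _ in f_tgt)
        assert n_src == n_tgt, "%s: %d source vs %d target lines" % (split, n_src, n_tgt)
        print("%s: %d parallel pairs" % (split, n_src))

check_dataset()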

Setup:

To get the project's source code, clone the GitHub repository:

$ git clone https://github.com/malllabiisc/DiPS

Install VirtualEnv using the following (optional):

$ [sudo] pip install virtualenv

Create and activate your virtual environment (optional):

$ virtualenv -p python3 venv
$ source venv/bin/activate

Install all the required packages:

$ pip install -r requirements.txt

Install the submodopt package by running the following command from the root directory of the repository:

$ cd ./packages/submodopt
$ python setup.py install
$ cd ../../
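Note: on newer pip/setuptools versions, running pip install . from packages/submodopt should work as an equivalent alternative to the setup.py install call above.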

Training the sequence-to-sequence model

python -m src.main -mode train -gpu 0 -use_attn -bidirectional -dataset quora -run_name <run_name>

Create a dictionary for submodular subset selection. This is used for the semantic similarity component (L2).

To use trained embeddings -

python -m src.create_dict -model trained -run_name <run_name> -gpu 0

To use pretrained word2vec embeddings -

python -m src.create_dict -model pretrained -run_name <run_name> -gpu 0

This will generate the word2vec.pickle file in data/embeddings.
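For reference, a rough sketch of how such a pickle could be built from pretrained word2vec vectors with gensim is shown below. This is an assumption about the file's contents, not the logic of src.create_dict; the GoogleNews path and the vocabulary list are placeholders.

import pickle
from gensim.models import KeyedVectors

# Hypothetical reconstruction: map each vocabulary word to its pretrained vector.
kv = KeyedVectors.load_word2vec_format(
    "data/embeddings/GoogleNews-vectors-negative300.bin", binary=True)
vocab_words = ["what", "is", "the", "best", "way"]  # placeholder vocabulary
word2vec = {w: kv[w] for w in vocab_words if w in kv}

with open("data/embeddings/word2vec.pickle", "wb") as f:
    pickle.dump(word2vec, f)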

Decoding using submodularity

python -m src.main -mode decode -selec submod -run_name <run_name> -beam_width 10 -gpu 0
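Decoding writes result files such as the results_submod_*.npy files mentioned in the issues below. Assuming they store a pickled Python object with the generated paraphrases (an assumption; the exact format is not documented here), they can be inspected roughly as follows:

import numpy as np

# File name taken from an issue below; the structure of the loaded object is an assumption.
results = np.load("results_submod_src_0.5_1.0_1.0_1.0_1.0.npy", allow_pickle=True)
print(type(results))
print(results[0])  # e.g. the paraphrase set for the first source sentence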

Citation

Please cite the following paper if you find this work relevant to your application:

@inproceedings{dips2019,
    title = "Submodular Optimization-based Diverse Paraphrasing and its Effectiveness in Data Augmentation",
    author = "Kumar, Ashutosh  and
      Bhattamishra, Satwik  and
      Bhandari, Manik  and
      Talukdar, Partha",
    booktitle = "Proceedings of the 2019 Conference of the North {A}merican Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers)",
    month = jun,
    year = "2019",
    address = "Minneapolis, Minnesota",
    publisher = "Association for Computational Linguistics",
    url = "https://www.aclweb.org/anthology/N19-1363",
    pages = "3609--3619"
}

For any clarification, comments, or suggestions, please create an issue or contact [email protected] or Satwik Bhattamishra.

dips's People

Contributors

ashutoshml, dependabot[bot], parthatalukdar, satwik77


dips's Issues

Creating new paraphrases

Thank you for uploading this code!

I just finished training the model and would like to know how I could use it to generate new sentences from unseen seed sentences.

Say I have a text file src.txt containing 3 sentences, and I would like to generate 2 or 3 paraphrases for each of those sentences. I saw that the decode mode of main.py requires both a src.txt and a tgt.txt file, but is it possible to do this without the tgt.txt? Many thanks in advance.
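One possible workaround (an assumption about the decode pipeline, not an officially supported mode) is to supply a dummy tgt.txt, e.g. a copy of src.txt, so that the data loader finds both files; the target side should then only affect reported scores, not the generated paraphrases.

import shutil

# Assumption: decode only needs tgt.txt to exist; its contents are not used for generation.
# "data/custom/test" is a placeholder path for the unseen sentences.
shutil.copyfile("data/custom/test/src.txt", "data/custom/test/tgt.txt")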

What is the maximum number of paraphrases that can be generated for a single input sentence?

Hi.
I'm trying to generate 500 paraphrases for each input sentence, but I got "Error in Submod: attempt to get argmax of an empty sequence", and I found that only 50 paraphrases were generated for each sentence in the output file. I'm wondering if you can tell me the maximum number of paraphrases that can be generated for a single input sentence, or how I can get more paraphrases using DiPS.
Thanks.
Merry Christmas and Happy New Year!

About the maximum word length set to 20

In the paper and the implementation, sentences in the dataset are truncated to a maximum length of 20 tokens (words), but I'm not sure why you set this condition.

Could you tell me more about this setting, e.g. why you chose this limit, and whether it is based on a paper you referenced or on empirical findings?

Thanks.

How to measure BLEU and other metrics

Hi,

I am trying to reproduce the experiments of Section 5 in the paper using the evaluation scripts in src/evaluation/, but I am struggling to reproduce the results of Tables 4 and 5.

The model trained on Twitter gave almost the same results (BLEU: 47-56, lambda: 0.5-1); however, the model trained on Quora-Div had a BLEU of 21-26 (lambda: 0.5-1), which is lower than the paper's result (35.1).

Here are the outputs of src/evaluation/get_bleu_score.py on the decoding results for the Quora-Div test data.

# lambda = 0.5
results_submod_src_0.5_1.0_1.0_1.0_1.0.npy : (0.2133832613042971, [0.5514513662938855, 0.2794985741291888, 0.16517278760476983, 0.10313060915605508], 0.9426644403664957, 0.9442470265324794, 516031, 546500)
# lambda = 1.0
results_submod_src_1.0_1.0_1.0_1.0_1.0.npy : (0.2677655774938234, [0.6150636401998018, 0.3489407079357134, 0.22080072282389715, 0.14503831657417837], 0.92996373353237, 0.9323055809698079, 509505, 546500)

In my setup, the seq2seq model was trained on the Quora-Div dataset with the same hyperparameter settings as in the supplementary material, the word2vec dictionary was created using trained embeddings, and each decoding run used beam=10.

Could you provide more details about the experimental procedure, in particular how BLEU is calculated for each reference/candidates pair (e.g. the average over all candidates, or the maximum among them)?
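For illustration only (this is not the repository's get_bleu_score.py), the two aggregation choices mentioned above could be computed per source sentence with NLTK's sentence-level BLEU:

from typing import List, Tuple
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

def bleu_per_source(reference: str, candidates: List[str]) -> Tuple[float, float]:
    """Return (mean, max) sentence-level BLEU of the candidate paraphrases against one reference."""
    smooth = SmoothingFunction().method1
    ref_tokens = [reference.split()]
    scores = [sentence_bleu(ref_tokens, cand.split(), smoothing_function=smooth)
              for cand in candidates]
    return sum(scores) / len(scores), max(scores)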

I would also like to know how you measured the METEOR and TERp scores in the paper (the libraries or other open-source software you used to calculate them).

Thanks

Word embedding file

I downloaded the Google word vectors .bin file and placed it inside data, but I encountered a "No such file or directory: 'data/embeddings/word2vec.pickle'" error while trying to train the model.

Question: Using a pre-trained encoder?

Is there any reason or limitation why one could not use BERT or other Transformer-based encoders for:

  1. Word vector generation, instead of word2vec?
  2. Using the encoder directly and training only the decoder part?

Best

RuntimeError: cuDNN error: CUDNN_STATUS_EXECUTION_FAILED

Hi, firstly thank you very much for the code and detailed explanation.

I am running this line of code:
! python -m src.main -mode train -gpu 0 -use_attn -bidirectional -dataset twitter -run_name DiPS

and I am getting following error:
Loading Word2Vec
Word2vec Loaded
Time Taken : 0.0016974012056986492
2020-06-07 22:22:12,866 | INFO | main | Training and Validation data loading..
2020-06-07 22:22:13,243 | INFO | main | Training and Validation data loaded!
2020-06-07 22:22:13,243 | INFO | main | Creating vocab ...
2020-06-07 22:22:30,206 | INFO | main | Vocab created with number of words = 18865
2020-06-07 22:22:30,207 | INFO | main | Saving Vocabulary file
2020-06-07 22:22:30,217 | INFO | main | Vocabulary file saved in Model/DiPS/vocab.p
2020-06-07 22:22:30,219 | INFO | main | Checkpoint found with epoch num 30
2020-06-07 22:22:30,297 | INFO | main | Building Encoder RNN..
Traceback (most recent call last):
File "/usr/lib/python3.6/runpy.py", line 193, in _run_module_as_main
"main", mod_spec)
File "/usr/lib/python3.6/runpy.py", line 85, in _run_code
exec(code, run_globals)
File "/content/drive/My Drive/Colab Notebooks/DiPS/src/main.py", line 473, in
main()
File "/content/drive/My Drive/Colab Notebooks/DiPS/src/main.py", line 415, in main
model = s2s(args, voc, device, logger)
File "/content/drive/My Drive/Colab Notebooks/DiPS/src/model.py", line 44, in init
self.config.bidirectional).to(device)
File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 386, in to
return self._apply(convert)
File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 193, in _apply
module._apply(fn)
File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/rnn.py", line 127, in _apply
self.flatten_parameters()
File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/rnn.py", line 123, in flatten_parameters
self.batch_first, bool(self.bidirectional))
RuntimeError: cuDNN error: CUDNN_STATUS_EXECUTION_FAILED

Actually, I am using Colab for training. I trained for 12 hours and it went up to 74 epochs with the defaults, although in my Drive I could only see the checkpoint file for the 30th epoch. So I restarted it last night and it resumed from the 30th epoch, but for some reason I had to stop after 4 epochs, so I stopped the code and switched off. Now, when I run the whole thing the same way again today, I am unable to proceed with training and get this error. Please help; I am just a newbie in this field.

How to test on custom dataset

@ashutoshml I am trying to make this run on my custom dataset. I have some questions, which I have put in the src.txt file. I wanted to know what I should keep in the tgt.txt file. Thanks in advance for the help :D

Can I get the pre-trained word-embedding file?

Hello,

Can I get the pre-trained word embedding file for the model?
I am facing GPU compatibility issues; the code itself works fine, it is just a hardware problem. For that reason, I am asking for a pretrained file.

Can this method be used for other languages?

Hi, thank you for sharing the code!
I want to know whether this code can be used for Chinese.
If I want to use the code for Chinese data augmentation, which parts should I change?
Thank you!
