
NAACL 2019: Submodular optimization-based diverse paraphrasing and its effectiveness in data augmentation

License: Apache License 2.0

Python 100.00%
paper paraphrase-generation submodular-optimization diversity naacl2019 data-augmentation diverse-decoding natural-language-generation

dips's Introduction

Submodular optimization-based diverse paraphrasing and its effectiveness in data augmentation

Source code for NAACL 2019 paper: Submodular optimization-based diverse paraphrasing and its effectiveness in data augmentation

  • (Figure) Overview of DiPS during decoding to generate k paraphrases. At each time step, a set of N sequences V(t) is used to determine k < N sequences (X*) via submodular maximization. The figure illustrates the motivation behind each submodular component. Please see Section 4 in the paper for details.
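For intuition, here is a minimal sketch (not the repository's implementation) of the standard greedy algorithm for monotone submodular maximization that the selection step relies on: starting from the empty set, repeatedly add the candidate with the largest marginal gain under an objective F until k sequences are chosen. The distinct_unigrams objective below is a hypothetical stand-in; DiPS combines several fidelity and diversity components (see Section 4 of the paper).

from typing import Callable, List, Sequence

def greedy_submodular_select(
        candidates: List[Sequence[str]],                # the N candidate sequences V(t)
        k: int,                                         # number of sequences to keep (k < N)
        score: Callable[[List[Sequence[str]]], float],  # submodular objective F(X)
) -> List[Sequence[str]]:
    """Greedily build X* by repeatedly adding the candidate with the largest marginal gain."""
    selected: List[Sequence[str]] = []
    remaining = list(candidates)
    while len(selected) < k and remaining:
        gains = [score(selected + [c]) - score(selected) for c in remaining]
        best = max(range(len(remaining)), key=lambda i: gains[i])
        selected.append(remaining.pop(best))
    return selected

# A toy (hypothetical) objective: reward covering distinct unigrams across the selected set.
def distinct_unigrams(selected: List[Sequence[str]]) -> float:
    return float(len({tok for seq in selected for tok in seq}))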

Also on GEM/NL-Augmenter 🦎 → 🐍

Dependencies

  • Compatible with Python 3.6
  • Dependencies can be installed using requirements.txt

Dataset

Download the Quora and Twitter paraphrase datasets used in the paper.

Extract and place them in the data directory. Path: data/<dataset-folder-name>. A sample dataset folder might look like data/quora/<train/test/val>/<src.txt/tgt.txt>.
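As a quick sanity check of this layout before training, a small helper along these lines can be used (not part of the repository; the data/quora path and split names simply follow the example above):

import os

def check_dataset(root="data/quora"):
    """Verify that each split contains parallel src.txt / tgt.txt files of equal length."""
    for split in ("train", "val", "test"):
        src_path = os.path.join(root, split, "src.txt")
        tgt_path = os.path.join(root, split, "tgt.txt")
        for path in (src_path, tgt_path):
            if not os.path.isfile(path):
                raise FileNotFoundError("Missing " + path)
        with open(src_path) as f_src, open(tgt_path) as f_tgt:
            n_src = sum(1 for _ in f_src)
            n_tgt = sum(1 for _ in f_tgt)
        assert n_src == n_tgt, "%s: %d source vs %d target lines" % (split, n_src, n_tgt)
        print("%s: %d parallel pairs" % (split, n_src))

check_dataset()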

Setup:

To get the project's source code, clone the GitHub repository:

$ git clone https://github.com/malllabiisc/DiPS

Install VirtualEnv using the following (optional):

$ [sudo] pip install virtualenv

Create and activate your virtual environment (optional):

$ virtualenv -p python3 venv
$ source venv/bin/activate

Install all the required packages:

$ pip install -r requirements.txt

Install the submodopt package by running the following command from the root directory of the repository:

$ cd ./packages/submodopt
$ python setup.py install
$ cd ../../
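Note: on newer pip/setuptools versions, running pip install . from packages/submodopt should work as an equivalent alternative to the setup.py install call above.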

Training the sequence-to-sequence model

python -m src.main -mode train -gpu 0 -use_attn -bidirectional -dataset quora -run_name <run_name>

Create a dictionary for submodular subset selection. This is used for the semantic similarity component (L2).

To use trained embeddings -

python -m src.create_dict -model trained -run_name <run_name> -gpu 0

To use pretrained word2vec embeddings -

python -m src.create_dict -model pretrained -run_name <run_name> -gpu 0

This will generate the word2vec.pickle file in data/embeddings.
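For reference, a rough sketch of how such a pickle could be built from pretrained word2vec vectors with gensim is shown below. This is an assumption about the file's contents, not the logic of src.create_dict; the GoogleNews path and the vocabulary list are placeholders.

import pickle
from gensim.models import KeyedVectors

# Hypothetical reconstruction: map each vocabulary word to its pretrained vector.
kv = KeyedVectors.load_word2vec_format(
    "data/embeddings/GoogleNews-vectors-negative300.bin", binary=True)
vocab_words = ["what", "is", "the", "best", "way"]  # placeholder vocabulary
word2vec = {w: kv[w] for w in vocab_words if w in kv}

with open("data/embeddings/word2vec.pickle", "wb") as f:
    pickle.dump(word2vec, f)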

Decoding using submodularity

python -m src.main -mode decode -selec submod -run_name <run_name> -beam_width 10 -gpu 0
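Decoding writes result files such as the results_submod_*.npy files mentioned in the issues below. Assuming they store a pickled Python object with the generated paraphrases (an assumption; the exact format is not documented here), they can be inspected roughly as follows:

import numpy as np

# File name taken from an issue below; the structure of the loaded object is an assumption.
results = np.load("results_submod_src_0.5_1.0_1.0_1.0_1.0.npy", allow_pickle=True)
print(type(results))
print(results[0])  # e.g. the paraphrase set for the first source sentence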

Citation

Please cite the following paper if you find this work relevant to your application:

@inproceedings{dips2019,
    title = "Submodular Optimization-based Diverse Paraphrasing and its Effectiveness in Data Augmentation",
    author = "Kumar, Ashutosh  and
      Bhattamishra, Satwik  and
      Bhandari, Manik  and
      Talukdar, Partha",
    booktitle = "Proceedings of the 2019 Conference of the North {A}merican Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers)",
    month = jun,
    year = "2019",
    address = "Minneapolis, Minnesota",
    publisher = "Association for Computational Linguistics",
    url = "https://www.aclweb.org/anthology/N19-1363",
    pages = "3609--3619"
}

For any clarification, comments, or suggestions, please create an issue or contact [email protected] or Satwik Bhattamishra.

dips's People

Contributors

ashutoshml, dependabot[bot], parthatalukdar, satwik77


dips's Issues

Creating new paraphrases

Thank you for uploading this code!

I just finished training the model and would like to know how I could use it to generate new sentences from unseen seed sentences.

Say I have a text file src.txt containing 3 sentences, and I would like to generate 2 or 3 paraphrases for each of those sentences. I saw that the decode mode of main.py requires both a src.txt and a tgt.txt file, but is it possible to do this without the tgt.txt? Many thanks in advance.
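One possible workaround (an assumption about the decode pipeline, not an officially supported mode) is to supply a dummy tgt.txt, e.g. a copy of src.txt, so that the data loader finds both files; the target side should then only affect reported scores, not the generated paraphrases.

import shutil

# Assumption: decode only needs tgt.txt to exist; its contents are not used for generation.
# "data/custom/test" is a placeholder path for the unseen sentences.
shutil.copyfile("data/custom/test/src.txt", "data/custom/test/tgt.txt")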

What is the maximum number of paraphrases that can be generated for a single input sentence?

Hi.
I'm trying to generate 500 paraphrases for each input sentence, but I got "Error in Submod: attempt to get argmax of an empty sequence", and I found that only 50 paraphrases were generated for each sentence in the output file. I'm wondering if you can tell me the maximum number of paraphrases that can be generated for a single input sentence, or how I can get more paraphrases using DiPS.
Thanks.
Merry Christmas and Happy New Year!

About the maximum word length set to 20

In the paper and the implementation, sentences in the dataset are truncated to a maximum length of 20 tokens (words), but I'm not sure why you set this condition.

Could you tell me more about this setting, e.g. why you chose this limit, and whether it is based on a paper you referenced or on empirical findings?

Thanks.

How to measure BLEU and other metrics

Hi,

I am trying to reproduce the experiments of Section 5 in the paper using the evaluation scripts in src/evaluation/, but I am struggling to reproduce the results of Tables 4 and 5.

The model trained on Twitter gave almost the same results (BLEU: 47-56, lambda: 0.5-1); however, the model trained on Quora-Div had a BLEU of 21-26 (lambda: 0.5-1), which is lower than the paper's result (35.1).

Here are the outputs of src/evaluation/get_bleu_score.py on the decoding results for the Quora-Div test data.

# lambda = 0.5
results_submod_src_0.5_1.0_1.0_1.0_1.0.npy : (0.2133832613042971, [0.5514513662938855, 0.2794985741291888, 0.16517278760476983, 0.10313060915605508], 0.9426644403664957, 0.9442470265324794, 516031, 546500)
# lambda = 1.0
results_submod_src_1.0_1.0_1.0_1.0_1.0.npy : (0.2677655774938234, [0.6150636401998018, 0.3489407079357134, 0.22080072282389715, 0.14503831657417837], 0.92996373353237, 0.9323055809698079, 509505, 546500)

In my setup, the seq2seq model was trained on the Quora-Div dataset with the same hyperparameter settings as in the supplementary material, the word2vec dictionary was created using trained embeddings, and each decoding run used beam=10.

Could you provide more details about the experimental procedure, in particular how BLEU is calculated for each reference/candidates pair (e.g. the average over all candidates, or the maximum among them)?
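For illustration only (this is not the repository's get_bleu_score.py), the two aggregation choices mentioned above could be computed per source sentence with NLTK's sentence-level BLEU:

from typing import List, Tuple
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

def bleu_per_source(reference: str, candidates: List[str]) -> Tuple[float, float]:
    """Return (mean, max) sentence-level BLEU of the candidate paraphrases against one reference."""
    smooth = SmoothingFunction().method1
    ref_tokens = [reference.split()]
    scores = [sentence_bleu(ref_tokens, cand.split(), smoothing_function=smooth)
              for cand in candidates]
    return sum(scores) / len(scores), max(scores)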

I would also like to know how you measured the METEOR and TERp scores in the paper (the libraries or other open-source software you used to calculate them).

Thanks

Word embedding file

I downloaded the Google word vectors .bin file and placed it inside data, but I encountered a "No such file or directory: 'data/embeddings/word2vec.pickle'" error while trying to train the model.

Question: Using a pre-trained encoder?

Is there any reason or limitation why one could not use BERT or other Transformer-based encoders for:

  1. Word vector generation, instead of word2vec?
  2. Using the encoder directly and training only the decoder part?

Best

RuntimeError: cuDNN error: CUDNN_STATUS_EXECUTION_FAILED

Hi, firstly thank you very much for the code and detailed explanation.

I am running this line of code:
! python -m src.main -mode train -gpu 0 -use_attn -bidirectional -dataset twitter -run_name DiPS

and I am getting following error:
Loading Word2Vec
Word2vec Loaded
Time Taken : 0.0016974012056986492
2020-06-07 22:22:12,866 | INFO | main | Training and Validation data loading..
2020-06-07 22:22:13,243 | INFO | main | Training and Validation data loaded!
2020-06-07 22:22:13,243 | INFO | main | Creating vocab ...
2020-06-07 22:22:30,206 | INFO | main | Vocab created with number of words = 18865
2020-06-07 22:22:30,207 | INFO | main | Saving Vocabulary file
2020-06-07 22:22:30,217 | INFO | main | Vocabulary file saved in Model/DiPS/vocab.p
2020-06-07 22:22:30,219 | INFO | main | Checkpoint found with epoch num 30
2020-06-07 22:22:30,297 | INFO | main | Building Encoder RNN..
Traceback (most recent call last):
File "/usr/lib/python3.6/runpy.py", line 193, in _run_module_as_main
"main", mod_spec)
File "/usr/lib/python3.6/runpy.py", line 85, in _run_code
exec(code, run_globals)
File "/content/drive/My Drive/Colab Notebooks/DiPS/src/main.py", line 473, in
main()
File "/content/drive/My Drive/Colab Notebooks/DiPS/src/main.py", line 415, in main
model = s2s(args, voc, device, logger)
File "/content/drive/My Drive/Colab Notebooks/DiPS/src/model.py", line 44, in init
self.config.bidirectional).to(device)
File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 386, in to
return self._apply(convert)
File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 193, in _apply
module._apply(fn)
File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/rnn.py", line 127, in _apply
self.flatten_parameters()
File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/rnn.py", line 123, in flatten_parameters
self.batch_first, bool(self.bidirectional))
RuntimeError: cuDNN error: CUDNN_STATUS_EXECUTION_FAILED

Actually, I am using Colab for training. I trained for 12 hours and it went up to 74 epochs with the defaults, although in my Drive I could only see the checkpoint file for the 30th epoch. So I restarted it last night and it resumed from the 30th epoch, but for some reason I had to stop after 4 epochs, so I stopped the code and switched off. Now, when I run the whole thing the same way again today, I am unable to proceed with training and get this error. Please help; I am just a newbie in this field.

How to test on custom dataset

@ashutoshml I am trying to make this run on my custom dataset. I have some questions, which I have put in the src.txt file. I wanted to know what I should keep in the tgt.txt file. Thanks in advance for the help :D

Can I get the pre-trained word-embedding file?

Hello,

Can I get the pre-trained word embedding file for the model?
I am facing GPU compatibility issues; the code itself works fine, it is just a hardware problem. For that reason, I am asking for a pretrained file.

Can this method be used for other languages?

Hi, thank you for sharing the code!
I want to know whether this code can be used for Chinese.
If I want to use the code for Chinese data augmentation, which parts should I change?
Thank you!
