- data: store the preprocessed data
- baseline: store the results adapted from https://github.com/lijuncen/Sentiment-and-Style-Transfer
- pretrain_w2v: store GloVe 100d pretrained embeddings
- evaluation: store the external evaluation methods
- style_word_alignment: a folder that stores two methods of generating style word alignments: 1) a self-attention based method 2) a statistics-based method (called the stanford method in the folder)
- fader_network: an implementation of fader network
- two_decoder: an implementation of two decoder method
- cross_aligned: an implementation of the cross-aligned method
- multiple_attr_rewrite: an implementation of multiple-attribute-text-rewriting
- delete_only: an implementation of delete only method from Juncen Li et al.
- delete_retrieve_generate: an implementation of full method from Juncen Li et al.
- dynamic_mask: an implementation of dynamic generating mask method
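The statistics-based alignment in style_word_alignment follows the salience heuristic from Juncen Li et al.: a word is a style marker for an attribute when its smoothed frequency ratio between the two corpora exceeds a threshold. A minimal sketch (the smoothing value and threshold below are illustrative, not the repo's settings):

```python
from collections import Counter

def style_words(corpus_v, corpus_other, lam=1.0, threshold=5.0):
    """Words whose smoothed frequency ratio between the two corpora
    exceeds `threshold` are treated as style markers:
        salience(w) = (count(w, v) + lam) / (count(w, other) + lam)
    """
    count_v = Counter(w for sent in corpus_v for w in sent.split())
    count_o = Counter(w for sent in corpus_other for w in sent.split())
    return {
        w for w in count_v
        if (count_v[w] + lam) / (count_o[w] + lam) > threshold
    }

positive = ["the food was great", "great service and great staff"]
negative = ["the food was terrible", "terrible service"]
markers = style_words(positive, negative, lam=0.5, threshold=3.0)
```

Words that appear frequently in one corpus but rarely in the other (like "great" above) surface as style markers; the repo's implementation additionally handles n-grams.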
- Raw Data adapted from https://github.com/lijuncen/Sentiment-and-Style-Transfer
- Trim data in the training and dev sets; build the vocabulary from the training set
- Contains three datasets
- amazon review
- trim length: max length 20, delete duplicates
- trim word frequency: min word frequency 3; words with frequency < 3 are mapped to <UNK>
- train: 253807 negative reviews, 255524 positive reviews
- dev: 942 negative reviews, 914 positive reviews
- test: 500 negative reviews, 500 positive reviews
- vocabulary of 24363 words, including <PAD>, <BOS>, <EOS>, <UNK>
- yelp review
- trim length: max length 15, delete duplicates
- trim word frequency: min word frequency 5; words with frequency < 5 are mapped to <UNK>
- train: 157769 negative reviews, 222859 positive reviews
- dev: 1926 negative reviews, 1922 positive reviews
- test: 500 negative reviews, 500 positive reviews
- vocabulary of 9344 words, including <PAD>, <BOS>, <EOS>, <UNK>
- captions
- trim length: max length 20
- retain all training words
- train: 6000 Humorous sentences (label 0), 6000 Romantic sentences (label 1)
- dev: 500 Humorous sentences (label 0), 500 Romantic sentences (label 1)
- test: 300 Factual sentences
- vocabulary of 8983 words, including <PAD>, <BOS>, <EOS>, <UNK>
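The trimming and vocabulary rules above can be sketched as follows (function and variable names are illustrative, not the repo's actual preprocessing code):

```python
from collections import Counter

PAD, BOS, EOS, UNK = "<PAD>", "<BOS>", "<EOS>", "<UNK>"

def preprocess(train_sentences, max_len, min_freq):
    # keep sentences up to max_len tokens, dropping exact duplicates
    seen, kept = set(), []
    for s in train_sentences:
        toks = s.split()
        if len(toks) <= max_len and s not in seen:
            seen.add(s)
            kept.append(toks)
    # vocabulary: special tokens + words at or above the frequency cutoff
    freq = Counter(w for toks in kept for w in toks)
    vocab = [PAD, BOS, EOS, UNK] + sorted(w for w, c in freq.items() if c >= min_freq)
    known = set(vocab)
    # words below the cutoff become <UNK>
    data = [[w if w in known else UNK for w in toks] for toks in kept]
    return data, vocab

data, vocab = preprocess(["the food was good", "the food was good", "so good"], 15, 2)
```

With the yelp settings this would be called with max_len=15 and min_freq=5; captions skip the frequency cutoff and retain all training words.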
- Human reference
- every corpus contains 500 human reference outputs
- filename explanation:
- reference.0.input: the inputs that are style 0 (except for captions, whose inputs are neutral)
- reference.0.humanout: the human reference outputs that transfer reference.0.input to style 1 sentences
- vice versa for style 1
- Transfer ability
- pretrain a style classifier with fastText
- installation
- wget https://github.com/facebookresearch/fastText/archive/v0.2.0.zip
- unzip v0.2.0.zip
- cd fastText-0.2.0
- make
- train model: go to the folder eval/transfer_ability and run bash makemodel.sh
- results:
- amazon: dev: precision 0.823 recall 0.822; test: precision 0.809 recall 0.808
- yelp: dev: precision 0.973 recall 0.973; test: precision 0.973 recall 0.973
- caption: dev: precision 0.764 recall 0.764
- test result:
- ./fastText-0.2.0/fasttext test $PATHTOMODEL $TESTFILE
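fastText's supervised mode reads one example per line, prefixed with a `__label__<class>` marker. A hypothetical helper for building such a training file from the two review sets (the file names and label scheme below are assumptions; makemodel.sh is the repo's actual pipeline):

```python
def to_fasttext_lines(sentences, label):
    # fastText supervised mode expects "__label__<class> <text>" per line
    return [f"__label__{label} {s}" for s in sentences]

negative = ["slow service .", "the food was cold ."]
positive = ["great place , friendly staff ."]
train_lines = to_fasttext_lines(negative, 0) + to_fasttext_lines(positive, 1)

# write train_lines to e.g. train.txt, then train and evaluate with:
#   ./fastText-0.2.0/fasttext supervised -input train.txt -output model
#   ./fastText-0.2.0/fasttext test model.bin test.txt
```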
- python usage interface: use the function Transferability in calculate_transfer.py
- bleu score: use the BLEU function in calculate_bleu.py
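For reference, BLEU can be sketched from scratch as below; this is a minimal unsmoothed version, and the BLEU function in calculate_bleu.py may differ in smoothing and tokenization:

```python
import math
from collections import Counter

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def bleu(candidate, references, max_n=4):
    """BLEU for one tokenized candidate against tokenized references:
    geometric mean of clipped n-gram precisions times a brevity penalty."""
    precisions = []
    for n in range(1, max_n + 1):
        cand = Counter(ngrams(candidate, n))
        if not cand:
            return 0.0
        # clip counts by the maximum count over all references
        clip = Counter()
        for ref in references:
            for g, c in Counter(ngrams(ref, n)).items():
                clip[g] = max(clip[g], c)
        matched = sum(min(c, clip[g]) for g, c in cand.items())
        if matched == 0:
            return 0.0
        precisions.append(matched / sum(cand.values()))
    # brevity penalty against the closest reference length
    c = len(candidate)
    r = min((len(ref) for ref in references), key=lambda rl: abs(rl - c))
    bp = 1.0 if c > r else math.exp(1 - r / c)
    return bp * math.exp(sum(math.log(p) for p in precisions) / max_n)
```

An identical candidate and reference score 1.0; a candidate sharing no n-grams with any reference scores 0.0.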
- fluency score: use pretrained BERT as a language model
- a note on a segfault: in a conda environment, I installed pytorch with `conda install` (as described on the pytorch website) and fastText with `pip install .` from its git clone; that resulted in a segfault when doing `import fastText` and `import torch`
Reason:
- pytorch is compiled with gcc 4.9.2
- conda's default gcc is 4.8.5
Fix:
- install gcc-4.9 in conda (e.g. `conda install -c serge-sans-paille gcc_49`)
- install pytorch with `conda install` (in my case, `conda install pytorch torchvision cuda90 -c pytorch`)
- install fastText with the gcc-4.9 compiler: `CC=gcc-4.9 pip install .` in the fastText git clone