sca's People

Contributors

adamjosephjensen, alexeib, alvations, apeterswu, colesbury, edunov, fuzihaofzh, halilakin, higgsfield, hitvoice, hmc-cs-mdrissi, huihuifan, jhcross, liezl200, linkerr, louismartin, maggiemeow, michaelauli, myleott, pengli09, pipibjc, rutyrinott, sai-prasanna, stephenroller, teng-li, teslacool, theweiho, xianxl, ynd, zrachel

sca's Issues

setups for reproducing IWSLT14 De-en

Hi,

I want to reproduce your result on IWSLT14 De-En, but I can't reach 35.78; my best result is 34.25. I would like to ask about some details of the setup:

  1. Do you use shared embeddings? I don't. If yes, what vocabulary size do you use?
  2. For the language model, I use
    python ~/fairseq/train.py \
        ~/de2en/lmofde \
        --task language_modeling \
        --arch transformer_lm_iwslt \
        --optimizer adam \
        --adam-betas '(0.9, 0.98)' \
        --clip-norm 0.0 \
        --lr-scheduler inverse_sqrt \
        --warmup-init-lr 1e-07 \
        --warmup-updates 4000 \
        --lr 0.0005 \
        --min-lr 1e-09 \
        --dropout 0.1 \
        --weight-decay 0.0 \
        --criterion label_smoothed_cross_entropy \
        --label-smoothing 0.1 \
        --max-tokens 4096 \
        --tokens-per-sample 4096 \
        --save-dir $dir \
        --update-freq 16 \
        --no-epoch-checkpoints \
        --log-format simple \
        --log-interval 1000
    for both the De and En language models. I train each language model until convergence and use the best checkpoint for NMT. Do you have any suggestions for my settings?
  3. For NMT, I use
    python ~/SCA/train.py \
        $DATA_PATH \
        --task lm_translation \
        --arch transformer_iwslt_de_en \
        --optimizer adam \
        --adam-betas '(0.9, 0.98)' \
        --clip-norm 0.0 \
        --lr-scheduler inverse_sqrt \
        --warmup-init-lr 1e-07 \
        --warmup-updates 4000 \
        --lr 0.0009 \
        --min-lr 1e-09 \
        --dropout 0.3 \
        --weight-decay 0.0 \
        --criterion label_smoothed_cross_entropy \
        --label-smoothing 0.1 \
        --max-tokens 2048 \
        --update-freq 2 \
        --save-dir $SAVE_DIR \
        --tradeoff $i \
        --load-lm \
        --seed 200 \
        --no-epoch-checkpoints \
        --log-format simple \
        --log-interval 1000
    For i (the tradeoff), I use 0.1, 0.15, and 0.2; the best result is obtained with 0.15. When you calculate the BLEU score, do you use the best checkpoint or an averaged checkpoint (and if averaged, over how many epochs)? Do you have any other suggestions? (See the averaging sketch below.)
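
(For reference, a minimal sketch of checkpoint averaging with fairseq's helper script; the paths and the number of checkpoints are placeholders, and it assumes epoch checkpoints are kept, i.e. --no-epoch-checkpoints is dropped:)

    python ~/fairseq/scripts/average_checkpoints.py \
        --inputs $SAVE_DIR \
        --num-epoch-checkpoints 10 \
        --output $SAVE_DIR/checkpoint_avg.pt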

*inf* value for ppl

Below is the last part of the log from one of my language-model training runs.

Why does the ppl report an inf value?

| epoch 043 | loss 2837.172 | ppl inf | wps 22040 | ups 4 | wpb 5116.162 | bsz 4.997 | num_updates 39302 | lr 2.5e-05 | gnorm 17885.347 | clip 1.000 | oom 0.000 | wall 9326 | train_wall 8854
| epoch 043 | valid on 'valid' subset | loss 2322.051 | ppl inf | num_updates 39302 | best_loss 2322.05
| epoch 044 | loss 2819.246 | ppl inf | wps 22042 | ups 4 | wpb 5116.162 | bsz 4.997 | num_updates 40216 | lr 2.5e-05 | gnorm 19552.845 | clip 1.000 | oom 0.000 | wall 9543 | train_wall 9060
| epoch 044 | valid on 'valid' subset | loss 2272.617 | ppl inf | num_updates 40216 | best_loss 2272.62
| epoch 045 | loss 2802.761 | ppl inf | wps 22039 | ups 4 | wpb 5116.162 | bsz 4.997 | num_updates 41130 | lr 2.5e-05 | gnorm 354250.108 | clip 1.000 | oom 0.000 | wall 9761 | train_wall 9266
| epoch 045 | valid on 'valid' subset | loss 2269.807 | ppl inf | num_updates 41130 | best_loss 2269.81
| epoch 046 | loss 2782.943 | ppl inf | wps 22041 | ups 4 | wpb 5116.162 | bsz 4.997 | num_updates 42044 | lr 2.5e-05 | gnorm 30559.840 | clip 1.000 | oom 0.000 | wall 9978 | train_wall 9472
| epoch 046 | valid on 'valid' subset | loss 2250.028 | ppl inf | num_updates 42044 | best_loss 2250.03
| epoch 047 | loss 2769.120 | ppl inf | wps 22042 | ups 4 | wpb 5116.162 | bsz 4.997 | num_updates 42958 | lr 2.5e-05 | gnorm 18006.640 | clip 1.000 | oom 0.000 | wall 10196 | train_wall 9678
| epoch 047 | valid on 'valid' subset | loss 2268.014 | ppl inf | num_updates 42958 | best_loss 2250.03
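
(For reference, a minimal sketch of why the displayed ppl overflows, assuming fairseq reports perplexity as 2 raised to the logged loss and falls back to inf on overflow:)

    # a typical per-token loss exponentiates to a finite perplexity
    python -c "import math; print(math.pow(2, 5.0))"
    # a loss in the thousands, as in the log above, overflows a double-precision float
    python -c "import math; print(math.pow(2, 2837.172))"   # OverflowError -> shown as inf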

what is the meaning of "I shift a sentence twice in decoder input"

In README.md, it is written:
I shift a sentence twice in decoder input, so the shortest sentence length after bpe should be no less than 2.

What does that mean exactly?

If I use a "standard" set of preprocessed data created by fairseq-preprocess, I get this error when trying to train the LM.

$ python3 ../train.py ./runtime/default/tmp/training/data_generated   --task language_modeling --arch transformer_lm --optimizer adam --adam-betas '(0.9, 0.98)' --clip-norm 0.0   --lr-scheduler inverse_sqrt --warmup-init-lr 1e-07 --warmup-updates 4000   --lr 0.0005 --min-lr 1e-09   --dropout 0.1 --weight-decay 0.0 --criterion label_smoothed_cross_entropy --label-smoothing 0.1   --max-tokens 4096  --tokens-per-sample 4096  --save-dir ./SAVEDIR --update-freq 16
Namespace(adam_betas='(0.9, 0.98)', adam_eps=1e-08, adaptive_input=False, adaptive_input_cutoff=None, adaptive_input_factor=4, adaptive_softmax_cutoff=None, adaptive_softmax_dropout=0, adaptive_softmax_factor=4, arch='transformer_lm', attention_dropout=0.0, bucket_cap_mb=150, char_embedder_highway_layers=2, character_embedding_dim=4, character_embeddings=False, character_filters='[(1, 64), (2, 128), (3, 192), (4, 256), (5, 256), (6, 256), (7, 256)]', clip_norm=0.0, criterion='label_smoothed_cross_entropy', data='./runtime/default/tmp/training/data_generated', ddp_backend='c10d', decoder_attention_heads=8, decoder_embed_dim=512, decoder_ffn_embed_dim=2048, decoder_input_dim=512, decoder_layers=6, decoder_learned_pos=False, decoder_normalize_before=True, decoder_output_dim=512, device_id=0, distributed_backend='nccl', distributed_init_method=None, distributed_port=-1, distributed_rank=0, distributed_world_size=1, dropout=0.1, fix_batches_to_gpus=False, fp16=False, fp16_init_scale=128, fp16_scale_tolerance=0.0, fp16_scale_window=None, future_target=False, keep_interval_updates=-1, label_smoothing=0.1, log_format=None, log_interval=1000, lr=[0.0005], lr_scheduler='inverse_sqrt', lr_shrink=0.1, max_epoch=0, max_sentences=None, max_sentences_valid=None, max_tokens=4096, max_update=0, min_loss_scale=0.0001, min_lr=1e-09, momentum=0.99, no_epoch_checkpoints=False, no_progress_bar=False, no_save=False, no_token_positional_embeddings=False, optimizer='adam', optimizer_overrides='{}', output_dictionary_size=-1, past_target=False, raw_text=False, relu_dropout=0.0, reset_lr_scheduler=False, reset_optimizer=False, restore_file='checkpoint_last.pt', sample_break_mode=None, save_dir='./SAVEDIR', save_interval=1, save_interval_updates=0, seed=1, self_target=False, sentence_avg=False, share_decoder_input_output_embed=False, skip_invalid_size_inputs_valid_test=False, task='language_modeling', tie_adaptive_proj=False, tie_adaptive_weights=False, tokens_per_sample=4096, train_subset='train', update_freq=[16], valid_subset='valid', validate_interval=1, warmup_init_lr=1e-07, warmup_updates=4000, weight_decay=0.0)
Traceback (most recent call last):
  File "/home/nicola/workspace/SoftContextualDataAugmentation/fairseq/data/dictionary.py", line 169, in load
    return cls.load(fd)
  File "/home/nicola/workspace/SoftContextualDataAugmentation/fairseq/data/dictionary.py", line 183, in load
    count = int(line[idx+1:])
ValueError: invalid literal for int() with base 10: "'<Lua_Heritage>'\n"

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "../train.py", line 431, in <module>
    main(args)
  File "../train.py", line 36, in main
    task = tasks.setup_task(args)
  File "/home/nicola/workspace/SoftContextualDataAugmentation/fairseq/tasks/__init__.py", line 19, in setup_task
    return TASK_REGISTRY[args.task].setup_task(args)
  File "/home/nicola/workspace/SoftContextualDataAugmentation/fairseq/tasks/language_modeling.py", line 94, in setup_task
    dictionary = Dictionary.load(os.path.join(args.data, 'dict.txt'))
  File "/home/nicola/workspace/SoftContextualDataAugmentation/fairseq/data/dictionary.py", line 177, in load
    "rebuild the dataset".format(f))
Exception: Incorrect encoding detected in ./runtime/default/tmp/training/data_generated/dict.txt, please rebuild the dataset

Can you help me?
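
(For reference, the loader in fairseq/data/dictionary.py expects every line of dict.txt to be "<symbol> <integer count>"; a hedged one-liner, using the path from the command above, to spot lines that do not match that shape:)

    awk 'NF != 2 || $2 !~ /^[0-9]+$/ {print NR": "$0}' ./runtime/default/tmp/training/data_generated/dict.txt | head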

Traceback (most recent call last):
  File "train.py", line 431, in <module>
    main(args)
  File "train.py", line 77, in main
    if args.load_lm:
AttributeError: 'Namespace' object has no attribute 'load_lm'
When training the language model, I used the script you provided, with arch=transformer_lm.
Why are there still errors?
Besides, I don't quite understand this operation:
src=en
tgt=ru
for l in $src $tgt; do
    srcdir=${src}2${tgt}
    tgtdir=lmof${l}
    mkdir -p $tgtdir
    cp $srcdir/dict.${l}.txt $tgtdir/dict.txt
    cp $srcdir/train.${src}-${tgt}.${l}.bin $tgtdir/train.bin
    cp $srcdir/train.${src}-${tgt}.${l}.idx $tgtdir/train.idx
    cp $srcdir/valid.${src}-${tgt}.${l}.bin $tgtdir/valid.bin
    cp $srcdir/valid.${src}-${tgt}.${l}.idx $tgtdir/valid.idx
done
I didn't use the script you mentioned.

Error training language model

I'm very sorry to disturb you, but I haven't solved this problem yet.
I encountered the following problem when training the language model:

Traceback (most recent call last):
  File "train.py", line 431, in <module>
    main(args)
  File "train.py", line 77, in main
    if args.load_lm:
AttributeError: 'Namespace' object has no attribute 'load_lm'
I use the same steps as you, with arch=transformer_lm:
python train.py $DATA --task language_modeling --arch $ARCH \
    --optimizer adam --adam-betas '(0.9, 0.98)' --clip-norm 0.0 \
    --lr-scheduler inverse_sqrt --warmup-init-lr 1e-07 --warmup-updates 4000 \
    --lr 0.0005 --min-lr 1e-09 \
    --dropout 0.1 --weight-decay 0.0 --criterion label_smoothed_cross_entropy --label-smoothing 0.1 \
    --max-tokens 4096 --tokens-per-sample 4096 --save-dir $SAVE --update-freq 16

I noticed what you said:
I have modified this fairseq repo's dataloader, you'd better train language models with standard fairseq repo.
but I don't know how to use standard fairseq. I can't train with the current fairseq version because of a problem with the PyTorch version. Later I downloaded the code of fairseq 0.6.0, and there are problems with it as well. I really cannot train the language model; can you tell me the steps you used to train it?
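
(For reference, a hedged sketch of the same LM training run with an upstream fairseq release instead of this repo's train.py; the hyperparameters are copied from the script above, while the fairseq version and the $DATA/$SAVE paths are placeholders:)

    pip install fairseq==0.6.2   # assumption: any 0.6.x release that matches your PyTorch version
    fairseq-train $DATA \
        --task language_modeling --arch transformer_lm \
        --optimizer adam --adam-betas '(0.9, 0.98)' --clip-norm 0.0 \
        --lr-scheduler inverse_sqrt --warmup-init-lr 1e-07 --warmup-updates 4000 \
        --lr 0.0005 --min-lr 1e-09 --dropout 0.1 --weight-decay 0.0 \
        --criterion label_smoothed_cross_entropy --label-smoothing 0.1 \
        --max-tokens 4096 --tokens-per-sample 4096 --save-dir $SAVE --update-freq 16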

use of the language model layers during inference

I need a few clarifications.

Please confirm and/or comment about the following claims related to your software:

  • during training of the transformer_lmnmt architecture, the parameters related to the source and target lm decoders (i.e. the lower layers of the entire architecture) are not trained

  • during inference with transformer_lmnmt architecture, the source and target lm decoders are active in the sense that the input tokens go through these layers before traversing the transformer encoder and decoder

  • the forward step of inference is essentially the same as the forward step of training

If any of the previous claims is wrong, please explain the right process to me.

If I am totally right, I have a further question.
Have you ever tried to infer the translation without the source and target lm layers, i.e. using a standard transformer? What results did you get?
If you did not try, what is your feeling about such an experiment?
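
(For reference, a hedged way to see which sub-module parameter groups a trained checkpoint actually stores; the 'model' key is standard fairseq, while the checkpoint path is just the one used elsewhere in these issues:)

    python -c "import torch; sd = torch.load('engine_lm/checkpoint_best.pt', map_location='cpu')['model']; print(sorted({k.split('.')[0] for k in sd}))"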

optimization of SCA

After reading your paper, which is undoubtedly very interesting, I took a deep look into your code; I must admit that it is very well organized. Thank you so much for your work.

Before starting my experimentation with it, I would like to know your suggestions about how to optimize the parameters of the system and of the training:

  • is there any better configuration of the lm and lmnmt architectures than the defaults?
  • for both lm and lmnmt training, what is the best configuration of the training parameters (learning rate, warmup updates, etc.)?

Should I pay particular attention to any aspect of the training to avoid bad performance?

Which ‘arch’ should I choose when training the language model

Hi, tesla,

When I use the code to train the language model following your script, there was an error:

Traceback (most recent call last):
  File "train.py", line 431, in <module>
    main(args)
  File "train.py", line 77, in main
    if args.load_lm:
AttributeError: 'Namespace' object has no attribute 'load_lm'

Then I deleted lines 77-78 and continued training, but still hit an error:

Traceback (most recent call last):
  File "train.py", line 431, in <module>
    main(args)
  File "train.py", line 42, in main
    model = task.build_model(args)
  File "/data/experiment/sca/fairseq/tasks/language_modeling.py", line 118, in build_model
    model = super().build_model(args)
  File "/data/experiment/sca/fairseq/tasks/fairseq_task.py", line 131, in build_model
    return models.build_model(args, self)
  File "/data/experiment/sca/fairseq/models/__init__.py", line 34, in build_model
    return ARCH_MODEL_REGISTRY[args.arch].build_model(args, task)
  File "/data/experiment/sca/fairseq/models/transformer.py", line 111, in build_model
    src_dict, tgt_dict = task.source_dictionary, task.target_dictionary
  File "/data/experiment/sca/fairseq/tasks/fairseq_task.py", line 201, in source_dictionary
    raise NotImplementedError
NotImplementedError

The following is my script:

num=0
data_bin=/data/experiment/sca/data/lmofch
save_dir=model-ch
dropout=0.1
arch=transformer
max_tokens=4096
criterion=label_smoothed_cross_entropy
label_smoothing=0.1
lrscheduler=inverse_sqrt

CUDA_VISIBLE_DEVICES=$num python train.py $data_bin \
                        --task language_modeling \
                        --arch $arch \
                        --optimizer adam \
                        --adam-betas '(0.9, 0.98)' \
                        --clip-norm 0.0 \
                        --lr-scheduler $lrscheduler \
                        --warmup-init-lr 1e-07 \
                        --warmup-updates 4000 \
                        --lr 0.0005 \
                        --min-lr 1e-09 \
                        --dropout $dropout \
                        --weight-decay 0.0 \
                        --criterion $criterion \
                        --label-smoothing $label_smoothing \
                        --max-tokens $max_tokens \
                        --tokens-per-sample 4096 \
                        --save-dir $save_dir \
                        --update-freq 16

So I wonder whether choosing the wrong arch is what brings about the error.
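
(For reference, a hedged way to list which registered architectures look like decoder-only language models in this code base; the registry import path is taken from the traceback above, and the command assumes it is run from the repo root:)

    python -c "from fairseq.models import ARCH_MODEL_REGISTRY; print(sorted(a for a in ARCH_MODEL_REGISTRY if 'lm' in a))"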

problem with interactive.py

I followed the instructions to preprocess and train an engine with your code, both with and without srclm and tgtlm, and I succeeded: I trained two models, one with srclm and tgtlm and one without.

Then I tried to translate with each of the two models, but in both cases I failed.
Here are the two commands I used:

echo "ciao ciao ciao" | python3 ../interactive.py --remove-bpe REMOVE_BPE --raw-text --path engine_nolm/checkpoint_best.pt  --src-no-lm --tgt-no-lm --load-nmt --task lm_translation  data_generated 

echo "ciao ciao ciao" | python3 ../interactive.py --remove-bpe REMOVE_BPE --raw-text --path engine_lm/checkpoint_best.pt --task lm_translation data_generated --src-no-lm --tgt-no-lm --load-srclm-file lm_sl/checkpoint_best.pt  --load-tgtlm-file lm_tl/checkpoint_best.pt --load-nmt-file engine_lm/checkpoint_best.pt

What's wrong?
What is the correct command to activate both the source and target LMs, and what is the command to disable them?

Problem of training efficiency

Dear author, thanks very much for your work. I have a question: in my experiment, the training speed of lm-nmt is much slower than that of the pure NMT model. Is this expected?

Pull Request into fairseq

@teslacool

I am really impressed by your work, and I think it would be very useful for everyone (and for me in particular) to have it inside fairseq.
Do you intend to make a pull request?
If you prefer, I volunteer to do it.

Can anybody reproduce the results of the paper?

This is outstanding work on using data augmentation in NMT.
I noticed that your experiment is based on fairseq.
In your paper, you used the big transformer as the baseline for WMT14 En-De, reported as 28.4 BLEU, but in fairseq's paper it reaches 29.3 BLEU.
Is your baseline model different from Facebook's?

multilingual engine with SCA

@teslacool

I would like to use your software in a multilingual environment.

In practice, I would like to train one system for translating from English into Spanish and Italian.
I already have this system working using a standard transformer architecture.
To do this, I followed the quite standard procedure of adding a language flag to the source text to trigger the right target translation (into Spanish or Italian).

In the same way, I can also train one system for translating from Spanish or Italian into English. In this case, no language flags are used; I simply concatenate the Spanish-English and Italian-English training data and let the network do all the work.
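
(For reference, a hedged sketch of the language-flag preprocessing described above; the tag strings and file names are placeholders, not something prescribed by this repo:)

    # prepend a target-language token to each English source line, then concatenate the corpora
    sed 's/^/<2es> /' train.en-es.en > train.tagged.en-es.en
    sed 's/^/<2it> /' train.en-it.en > train.tagged.en-it.en
    cat train.tagged.en-es.en train.tagged.en-it.en > train.multi.src
    cat train.en-es.es        train.en-it.it        > train.multi.tgt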

I would like to know your idea about applying a similar strategy with a lm_translation task (i.e. a transformer plus LM).
In the first case (en->{es,it}), the source LM would contain the language flag, and only English tokens, while the target LM would contain both Spanish and Italian words.
In the second case ({es,it}->en), the source LM would contain both Spanish and Italian words, while the target LM would be "standard".
Would the LMs be strong enough to "distinguish" between Spanish and Italian tokens?
Could the presence of the language flag disturb the quality of the LMs?

Do you see other approaches for creating a SCA multilingual engine (en->{es,it} or {es,it}->en)?

Any suggestions or comments are very welcome.

no improvement with huge data

I am using your software to create a large-sized system.

My setting includes:

  • an encoder/decoder architecture similar to the transformer_big:
    • encoder_embed_dim=decoder_embed_dim=1024
    • encoder_ffn_embed_dim=decoder_ffn_embed_dim=4096
    • encoder_input_dim=decoder_input_dim=1024
    • encoder_output_dim=decoder_output_dim=1024
    • encoder_layers=decoder_layers=6
    • encoder_attention_heads=decoder_attention_heads=16
  • source and target language models:
    • lmdecoder_embed_dim=1024
    • lmdecoder_ffn_embed_dim=2048
    • lmdecoder_input_dim=1024
    • lmdecoder_output_dim=1024
    • lmdecoder_layers=4
    • lmdecoder_attention_heads=8

for a total of about 410M parameters.

The system was trained on a huge corpus having more than 1G words in each language.

Unfortunately and disappointingly, the performance of this system is slightly worse than that of the corresponding system without the LMs, which has about 200M parameters.

I saw that you ran your experiments showing a consistent improvement of 1 BLEU point on a smaller task (you train on only 4.5M sentence pairs, i.e. fewer than 100M words).

Did you run experiments on larger data sets?

What is your feeling about the use of LMs on such a big data set (more than 1G words)?

Do you think I got some setting of my system wrong?

Any comment or tip for improvement is welcome.

Why do you want to train two language models?

Don't you use p(x) instead of x?
So I think that only the source-side language model needs to be trained; what is the target-side language model used for? Please answer, thank you very much.
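
(For reference, my reading of the paper, stated as an assumption rather than an authoritative summary: at randomly chosen positions, a token's embedding is replaced by the expectation of embeddings under a language-model distribution,

    e_j = sum_{w in V} P_LM(w | context_j) * E[w]

and because positions are replaced in both the source and the target sentence, one LM per language is needed.)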

not able to train a LM with SCA

I am trying to use your script ./train.py (instead of the official fairseq-train) to train the language models.

I run such a command

and I get this error:

Traceback (most recent call last):
  File "../../../code/SCA//train.py", line 437, in <module>
    main(args)
  File "../../../code/SCA//train.py", line 89, in main
    trainer.dummy_train_step([dummy_batch])
  File "/home/nicola/workspace/SCA/code/SCA/fairseq/trainer.py", line 335, in dummy_train_step
    self.train_step(dummy_batch, dummy_batch=True)
  File "/home/nicola/workspace/SCA/code/SCA/fairseq/trainer.py", line 188, in train_step
    ignore_grad
  File "/home/nicola/workspace/SCA/code/SCA/fairseq/tasks/fairseq_task.py", line 169, in train_step
    loss, sample_size, logging_output = criterion(model, sample)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 489, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/nicola/workspace/SCA/code/SCA/fairseq/criterions/adaptive_loss.py", line 41, in forward
    assert hasattr(model.decoder, 'adaptive_softmax') and model.decoder.adaptive_softmax is not None
AssertionError

Note that the same parameters work well when I use fairseq-train.
The data bins were generated with fairseq-preprocess following your documentation.
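
(For reference, the assertion in adaptive_loss.py above fires when the criterion is adaptive_loss but the decoder was built without an adaptive softmax; a hedged sketch of the two consistent combinations, with placeholder paths and example cutoff values:)

    # 1) use a criterion that does not require an adaptive softmax
    python $SCA/train.py $DATA --task language_modeling --arch transformer_lm \
        --optimizer adam --adam-betas '(0.9, 0.98)' --clip-norm 0.0 \
        --lr-scheduler inverse_sqrt --warmup-init-lr 1e-07 --warmup-updates 4000 \
        --lr 0.0005 --min-lr 1e-09 --dropout 0.1 --weight-decay 0.0 \
        --criterion label_smoothed_cross_entropy --label-smoothing 0.1 \
        --max-tokens 4096 --tokens-per-sample 4096 --save-dir $SAVE --update-freq 16

    # 2) or keep adaptive_loss and give the decoder an adaptive softmax
    #    (the cutoff thresholds below are only example values; on multi-GPU runs,
    #     --ddp-backend no_c10d may also be needed)
    python $SCA/train.py $DATA --task language_modeling --arch transformer_lm \
        --criterion adaptive_loss --adaptive-softmax-cutoff 10000,50000,200000 \
        --optimizer adam --lr 0.0005 --lr-scheduler inverse_sqrt \
        --warmup-init-lr 1e-07 --warmup-updates 4000 --min-lr 1e-09 --dropout 0.1 \
        --max-tokens 4096 --tokens-per-sample 4096 --save-dir $SAVE --update-freq 16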
