
Toolkit for efficient experimentation with Speech Recognition, Text2Speech and NLP

Home Page: https://nvidia.github.io/OpenSeq2Seq

License: Apache License 2.0

Languages: Python 91.14%, C++ 5.62%, Shell 2.05%, Jupyter Notebook 0.56%, Perl 0.29%, Makefile 0.14%, Starlark 0.11%, SWIG 0.08%
Topics: neural-machine-translation, multi-gpu, deep-learning, sequence-to-sequence, seq2seq, multi-node, speech-recognition, speech-to-text, mixed-precision, float16

openseq2seq's Introduction


OpenSeq2Seq

OpenSeq2Seq: toolkit for distributed and mixed precision training of sequence-to-sequence models

OpenSeq2Seq's main goal is to allow researchers to explore various sequence-to-sequence models as effectively as possible. This efficiency is achieved by fully supporting distributed and mixed-precision training. OpenSeq2Seq is built on TensorFlow and provides all the necessary building blocks for training encoder-decoder models for neural machine translation, automatic speech recognition, speech synthesis, and language modeling.

Documentation and installation instructions

https://nvidia.github.io/OpenSeq2Seq/

Features

  1. Models for:
    1. Neural Machine Translation
    2. Automatic Speech Recognition
    3. Speech Synthesis
    4. Language Modeling
    5. NLP tasks (sentiment analysis)
  2. Data-parallel distributed training
    1. Multi-GPU
    2. Multi-node
  3. Mixed precision training for NVIDIA Volta/Turing GPUs

Software Requirements

  1. Python >= 3.5
  2. TensorFlow >= 1.10
  3. CUDA >= 9.0, cuDNN >= 7.0
  4. Horovod >= 0.13 (using Horovod is not required, but is highly recommended for multi-GPU setup)

Acknowledgments

The speech-to-text workflow uses some parts of the Mozilla DeepSpeech project.

The beam search decoder with language-model re-scoring (in decoders) is based on Baidu DeepSpeech.

The text-to-text workflow uses some functions from Tensor2Tensor and the Neural Machine Translation (seq2seq) Tutorial.

Disclaimer

This is a research project, not an official NVIDIA product.

Related resources

Paper

If you use OpenSeq2Seq, please cite this paper:

@misc{openseq2seq,
    title={Mixed-Precision Training for NLP and Speech Recognition with OpenSeq2Seq},
    author={Oleksii Kuchaiev and Boris Ginsburg and Igor Gitman and Vitaly Lavrukhin and Jason Li and Huyen Nguyen and Carl Case and Paulius Micikevicius},
    year={2018},
    eprint={1805.10387},
    archivePrefix={arXiv},
    primaryClass={cs.CL}
}

openseq2seq's People

Contributors

arnav1993k, blisc, borisgin, edresson, edwardhdlu, gabriellin, gioannides, giranntu, ka-bu, kipok, louischen1992, madrugado, matanhs, mvankeirsbilck, nluehr, okuchaiev, otstrel, raymondnie, samikama, shujian2015, siddharthbhatnagar, the01, trentlo, trevor-m, vahidoox, virajkarandikar, vsl9, vsuthichai, xravitejax, yrebryk

openseq2seq's Issues

Overwriting when updating base params with train, eval, and infer params

In the current implementation, scalar parameters are overwritten when the base params are updated with the train/eval/infer params, as intended. However, this raises an issue for nested parameters: the entire nested dictionary is replaced with whatever is in the train/eval/infer params. For example:
If my config is

base_params = {
    "random_seed": 0,
    "data_layer_params": {
        "num_features": 10
    }
}
train_params = {
    "data_layer_params": {
        "dataset": "train.csv"
    }
}

ideally I would want the merged train params to be

merged_params = {
    "random_seed": 0,
    "data_layer_params": {
        "num_features": 10,
        "dataset": "train.csv"
    }
}

but currently I get

merged_params = {
    "random_seed": 0,
    "data_layer_params": {
        "dataset": "train.csv"
    }
}

For experimentation it would be helpful to keep num_features in the base params rather than adding it to train_params, so that it can be changed easily via the command line instead of creating multiple config files.
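A minimal sketch of the kind of recursive merge that would give the desired behavior (the helper name deep_update is hypothetical, not part of OpenSeq2Seq):

def deep_update(base, overrides):
    """Recursively merge `overrides` into `base` instead of replacing
    nested dictionaries wholesale; returns a new merged dict."""
    merged = dict(base)
    for key, value in overrides.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            # Recurse into nested dicts so keys like "num_features" survive.
            merged[key] = deep_update(merged[key], value)
        else:
            # Scalars and non-dict values are overwritten as before.
            merged[key] = value
    return merged

merged_params = deep_update(base_params, train_params)
# merged_params["data_layer_params"] now contains both "num_features" and "dataset".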

throughput scaling issues

I'm attempting to benchmark throughput on transformer-big on the following:

  • AWS p3.16xlarge (8 gpus per node)
  • Horovod 0.13.10
  • OpenMPI 3.1.1
  • TensorFlow 1.9.0
  • CUDA 9.0
  • FP32

I'm benchmarking for 100 steps -- 10 to 109, skipping the first 0 to 9 steps. Here are some results. It seems to plateau at 8 gpus and then doesn't scale any further. I'm primarily interested in getting the throughput samples per second to scale well. Any thoughts?

Nodes  GPUs  Steps  Global Batch Size  Per-GPU Batch Size  Seconds/Step  Objects/Sec  Samples/Sec
1      2     100    256                128                 0.712         21489.658    360
1      4     100    512                128                 0.877         34967.614    583
1      8     100    1024               128                 1.287         47788.866    795
2      16    100    2048               128                 2.906         42235.197    704
3      24    100    3072               128                 3.972         46429.704    773
4      32    100    4096               128                 5.09          48363.986    804

Question : GNMT model faster on single GPU than on 4 GPUs?

Hello
I have been experimenting with the en-de-gnmt-like-4GPUs.py recipe. I have noticed that the time per step is smaller with 1 GPU than with 4 GPUs; the step time is approximately 40% slower on 4 GPUs than on 1 GPU. I use the default batch size (32) in both cases.

Here is an example.
After training ~36 hours with 4 GPUs:
Global step 73968: loss = 1.0840, time per step = 0:00:1.390

However, after training ~36 hours with 1 GPU:
Global step 138432: loss = 1.2411, time per step = 0:00:0.789

I have a few questions:

  1. Isn't the step time expected to be smaller in the 4-GPU case?
  2. However, after 36 hours the loss is smaller in the 4-GPU case even though the step time is higher.
  3. I'm wondering if that's because objects per second is higher in the multi-GPU case. But if that is the case, shouldn't the step time decrease?

Thanks for clarification.

iter_size does not properly work with sparse gradients

Error:

*** Building graph in Horovod rank: 0
Traceback (most recent call last):
File "run.py", line 254, in
main()
File "run.py", line 233, in main
train_model.compile()
File "/home/okuchaiev/repos/Work/OpenSeq2Seq/open_seq2seq/models/model.py", line 396, in compile
skip_update_ph=self.skip_update_ph,
File "/home/okuchaiev/repos/Work/OpenSeq2Seq/open_seq2seq/optimizers/optimizers.py", line 194, in optimize_loss
expected_shape=grad.shape,
AttributeError: 'IndexedSlices' object has no attribute 'shape'

You can reproduce on dev0.4 with:
python run.py --config_file=example_configs/text2text/nmt-reversal-RR.py --mode=train_eval

tf.data.Dataset.shard is doing split in different way as assumed by OS2S

tf.data's shard does an "interleaved" split: shard k gets every element whose index i satisfies i % num_workers == k. OpenSeq2Seq, however, assumes internally that the data is split "contiguously", meaning that all data is divided into equal sequential parts and the first part goes to the first dataset, the second to the second, and so on. This will break multi-GPU inference when the tf.data.Dataset.shard approach is used. We should make OS2S either support both options or switch to the tf.data version, since that's what we expect users to do when they create new datasets.
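A small plain-Python sketch, just to make the two indexing schemes explicit (the helper names are made up for illustration):

def interleaved_shard(data, num_workers, worker_id):
    # What tf.data.Dataset.shard does: worker k gets every element whose
    # index i satisfies i % num_workers == k.
    return [x for i, x in enumerate(data) if i % num_workers == worker_id]

def contiguous_shard(data, num_workers, worker_id):
    # What OpenSeq2Seq currently assumes: equal sequential chunks.
    shard_size = len(data) // num_workers
    start = worker_id * shard_size
    return data[start:start + shard_size]

data = list(range(8))
print(interleaved_shard(data, 2, 0))  # [0, 2, 4, 6]
print(contiguous_shard(data, 2, 0))   # [0, 1, 2, 3]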

Encoding issue with python 2 when evaluating convs2s

When evaluation starts for convs2s on Python 2, the following encoding error is given:

*** Running evaluation on a validation set:
*** *****EVAL Source[0]: Gut@@ ach : Incre@@ ased safety for pedestri@@ ans
main()
File "run.py", line 236, in main
train(train_model, eval_model, debug_port=args.debug_port)
File "/opt/OpenSeq2Seq/OpenSeq2Seq/open_seq2seq/utils/funcs.py", line 126, in train
fetches_vals = sess.run(fetches, feed_dict)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/monitored_session.py", line 567, in run
run_metadata=run_metadata)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/monitored_session.py", line 1043, in run
run_metadata=run_metadata)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/monitored_session.py", line 1134, in run
raise six.reraise(*original_exc_info)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/monitored_session.py", line 1119, in run
return self._sess.run(*args, **kwargs)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/monitored_session.py", line 1199, in run
run_metadata=run_metadata))
File "/opt/OpenSeq2Seq/OpenSeq2Seq/open_seq2seq/utils/hooks.py", line 190, in after_run
self._model, run_context.session, mode="eval", compute_loss=True,
File "/opt/OpenSeq2Seq/OpenSeq2Seq/open_seq2seq/utils/utils.py", line 223, in get_results_for_epoch
model, sess, compute_loss, mode, verbose,
File "/opt/OpenSeq2Seq/OpenSeq2Seq/open_seq2seq/utils/utils.py", line 177, in iterate_data
results_per_batch.append(model.evaluate(inputs, outputs))
File "/opt/OpenSeq2Seq/OpenSeq2Seq/open_seq2seq/models/text2text.py", line 173, in evaluate
offset=4,
File "/opt/OpenSeq2Seq/OpenSeq2Seq/open_seq2seq/utils/utils.py", line 331, in deco_print
print(start + " " * offset + line, end=end)
File "/opt/OpenSeq2Seq/OpenSeq2Seq/open_seq2seq/utils/utils.py", line 283, in write
self.log.write(msg)
UnicodeEncodeError: 'ascii' codec can't encode character u'\xfc' in position 71: ordinal not in range(128)
*** *****EVAL Target[0]: Gut@@ ach : Noch mehr Sicherheit für Fußgän@@ ger

TensorFlow RNN cells don't have a regularizer parameter and thus are difficult to use in MP mode

The problem is that our MP approach relies on regularization being applied by setting a regularizer parameter at the layer or variable level. In the case of RNNs this does not seem to be possible, because for some reason there is no such parameter and the variables are created deep inside TF code. A workaround would be to write something like this:

for var in rnn_cell.trainable_variables:
  if var.dtype.base_dtype == tf.float16:
    # Manually register the (variable, regularizer) pair; the MP
    # regularization machinery reads this collection.
    tf.add_to_collection('REGULARIZATION_FUNCTIONS', (var, self.params["regularizer"]))

but this seems like a hacky way of fixing the problem, and it relies on mp_regularizer_wrapper continuing to work the way it does now.

text2text.py data pipeline inefficiency

There are some inefficiencies caused by applying shard after cache in the text2text.py build_graph method. Rewriting the code so that shard is called before map and cache results in a pretty sizable decrease in time per step when training the big transformer on WMT14 en-de. Happy to provide a pull request. On 4 nodes, 32 GPUs, batch size 128, iter size 16, mixed-precision training, 16 GB Voltas, I've seen the time per step drop from 13.9 s without the fix to 6.6 s with the following change in the dataset pipeline.

  def build_graph(self):
    _sources = tf.data.TextLineDataset(self.source_file)

    if self._num_workers > 1:
      _sources = _sources.shard(num_shards=self._num_workers, index=self._worker_id)

    _sources = _sources.map(
        lambda line: tf.py_func(func=self._src_token_to_id, inp=[line],
                                Tout=[tf.int32], stateful=False),
        num_parallel_calls=self._map_parallel_calls) \
      .map(lambda tokens: (tokens, tf.size(tokens)),
           num_parallel_calls=self._map_parallel_calls)

    _targets = tf.data.TextLineDataset(self.target_file)

    if self._num_workers > 1:
      _targets = _targets.shard(num_shards=self._num_workers, index=self._worker_id)

    _targets = _targets.map(
        lambda line: tf.py_func(func=self._tgt_token_to_id, inp=[line],
                                Tout=[tf.int32], stateful=False),
        num_parallel_calls=self._map_parallel_calls) \
      .map(lambda tokens: (tokens, tf.size(tokens)),
           num_parallel_calls=self._map_parallel_calls)

    _src_tgt_dataset = tf.data.Dataset.zip((_sources, _targets)).filter(
        lambda t1, t2: tf.logical_and(tf.less_equal(t1[1], self.max_len),
                                      tf.less_equal(t2[1], self.max_len))
    ).cache()

#204

Multi-GPU evaluation without horovod is sequential

Meaning that the second GPU will start computation only when the first one has finished. This is an artifact of the new multi-GPU implementation. It's probably not too difficult to fix (we just need to pass tensors from all copies of the model to sess.run), but it would make the code much more complicated, since Horovod and non-Horovod processing would need to differ. We don't prioritize solving this issue since everything works correctly with Horovod, and that's the recommended way to do multi-GPU anyway because only Horovod has NCCL support.

performance issue ~30mins loading WMT dataset, transformer

I'm attempting to train a transformer-big model with 16gpus across 2 nodes (8 gpus each) on AWS with Horovod and experiencing a performance issue when the run.py script starts up. With the tensorflow timeline, I've narrowed it down to what I believe is the IteratorGetNext op. I suspect it's spending a lot of time trying to load and preprocess the dataset. This is on an EBS backed instance.


In inference mode _get_num_objects_per_step throws exception

  1. Ideally it should not be called at all if the benchmark flag isn't set
  2. In inference mode the targets are actually the output generated by the model, yet right now it expects to get them from the data layer?

WARNING:tensorflow:From /home/okuchaiev/repos/OpenSeq2Seq/open_seq2seq/parts/transformer/beam_search.py:421: calling reduce_logsumexp (from tensorflow.python.ops.math_ops) with keep_dims is deprecated and will be removed in a future version.
Instructions for updating:
keep_dims is deprecated, use keepdims instead
*** Inference Mode. Loss part of graph isn't built.
Traceback (most recent call last):
File "/home/okuchaiev/repos/OpenSeq2Seq/run.py", line 253, in
main()
File "/home/okuchaiev/repos/OpenSeq2Seq/run.py", line 242, in main
infer_model.compile()
File "/home/okuchaiev/repos/OpenSeq2Seq/open_seq2seq/models/model.py", line 350, in compile
for worker_id in range(self.num_gpus)]
File "/home/okuchaiev/repos/OpenSeq2Seq/open_seq2seq/models/model.py", line 350, in
for worker_id in range(self.num_gpus)]
File "/home/okuchaiev/repos/OpenSeq2Seq/open_seq2seq/models/text2text.py", line 223, in _get_num_objects_per_step
num_tokens += tf.reduce_sum(data_layer.input_tensors['target_tensors'][1])
KeyError: 'target_tensors'

Any WaveNet related plans?

Hello, everyone.

Thank you for your amazing work in this repo. Could you please shed some light on whether you plan to do a WaveNet implementation/integration with nv-wavenet?

Suggestions about decoder codes with gnmt attention

First of all, thanks for your great work!

Here is just a small suggestion about improving the code readability of RNNDecoderWithAttention and BeamSearchDecoderWithAttention when gnmt or gnmt_v2 attention is enabled, because they confused me when I read them for the first time :)

There are several branches in these two classes that deal with gnmt attention. I understand branching is necessary to handle different cases, but some of them do quite misleading things, like:

if self.params['attention_type'].startswith('gnmt'):
  residual_connections = False
  wrap_to_multi_rnn = False
else:
  residual_connections = self.params['decoder_use_skip_connections']
  wrap_to_multi_rnn = True

In the case above the branch force-sets residual_connections and wrap_to_multi_rnn to False; however, it does in fact use residual connections below:

attentive_decoder_cell = GNMTAttentionMultiCell(
    attention_cell, self._add_residual_wrapper(self._decoder_cells),
    use_new_attention=(self.params['attention_type'] == 'gnmt_v2'))

Also, this forcibly overrides the hyperparameter in the config file, which adds to the confusion (I guess gnmt attention will always use residual connections even if this is set to False in the config file?):

"decoder_use_skip_connections": True,

Besides, there is redundant/inconsistent code for creating residual connections: either via create_rnn_cell or via self._add_residual_wrapper.

Again, thanks for your great work, and looking forward to new models :)

Evaluation numbers are not exact when using num_gpus > 1

This happens because all batches are split across GPUs in random order (because we use subsequent calls to tf.data.Iterator().get_next()). Thus, the last batch might be processed incorrectly when the dataset size is not divisible by the batch size.
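One common workaround is to pad the evaluation set to a multiple of the global batch size and mask out the padded samples when aggregating metrics; a minimal NumPy sketch under that assumption (not OpenSeq2Seq code, names are illustrative):

import numpy as np

def pad_to_global_batch(samples, global_batch_size):
    """Repeat the last sample so every GPU receives full, equal-sized batches;
    return the padded array and a mask marking the real samples."""
    n = len(samples)
    remainder = n % global_batch_size
    pad = 0 if remainder == 0 else global_batch_size - remainder
    padded = np.concatenate([samples, np.repeat(samples[-1:], pad, axis=0)])
    mask = np.concatenate([np.ones(n, dtype=bool), np.zeros(pad, dtype=bool)])
    return padded, mask

# Metrics are then computed only over entries where mask is True, so the
# duplicated padding samples never affect the final evaluation numbers.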

Performance: training time of the transformer model is too long in mixed-precision mode

When I train the transformer in mixed-precision mode, the time per step is very long:
(1) transformer_big.py: two GPUs (V100), all parameters default, time per step: 13 s;
(2) tensor2tensor (big model): two GPUs (V100), all parameters default, time per step: 0.3 s.
As presented in the mixed-precision training paper (https://arxiv.org/pdf/1710.03740.pdf), the training time should be shorter than in FP32 mode. So why?

Re-write the "ParallelTextDataLayer" so that shuffling is fast.

1) Let's have a standalone script which will deterministically split a file into N shards (default: 100), so we'll have train.tok.clean.BPE.en_XX_of_N and train.tok.clean.BPE.de_XX_of_N.
2) Have two tf.data objects with the file lists (for src and tgt) and zip them together with tf.data.Dataset.zip (see the sketch after this list).
3) Then fully shuffle only the dataset of filename pairs.
4) Parse the shuffled datasets and create batches much like "ParallelTextDataLayer" already does.
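A rough sketch of steps 1) and 2), assuming a round-robin split and shard names like train.tok.clean.BPE.en_00_of_100 (the helper split_file and the exact naming are illustrative, not existing OpenSeq2Seq code):

import tensorflow as tf

def split_file(path, num_shards=100):
    """Step 1: deterministically split `path` into num_shards files,
    writing line i to shard i % num_shards (same rule for src and tgt,
    so line alignment between the two languages is preserved)."""
    with open(path) as f:
        lines = f.readlines()
    shard_paths = []
    for k in range(num_shards):
        shard_path = "{}_{:02d}_of_{}".format(path, k, num_shards)
        with open(shard_path, "w") as out:
            out.writelines(lines[k::num_shards])
        shard_paths.append(shard_path)
    return shard_paths

# Steps 2) and 3): zip the source and target shard lists and fully shuffle
# only the (small) dataset of filename pairs; parsing happens afterwards.
src_shards = split_file("train.tok.clean.BPE.en")
tgt_shards = split_file("train.tok.clean.BPE.de")
file_pairs = tf.data.Dataset.zip((
    tf.data.Dataset.from_tensor_slices(src_shards),
    tf.data.Dataset.from_tensor_slices(tgt_shards),
)).shuffle(buffer_size=len(src_shards))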

Expose global training step inside hooks.py to model.maybe_print_logs, and finalize_evaluation

For tacotron, I am currently saving images and wav files to disk as opposed to logging them in TensorBoard. It would be nice to have access to the current step so I can save the file under a name such as eval_sample_step_100.wav.

It would be nice to change inside utils/hooks.py from
dict_to_log = self._model.maybe_print_logs(input_values, output_values)
to
dict_to_log = self._model.maybe_print_logs(input_values, output_values, step)

and from
dict_to_log = self._model.finalize_evaluation(results_per_batch)
to
dict_to_log = self._model.finalize_evaluation(results_per_batch, step)

The downside would be that all models would have to be changed to accept this additional parameter. Alternatively, we could hide it inside input_values or results_per_batch, but this doesn't seem as nice.

Fix all deprecation warnings

Such as:

Instructions for updating:
Use tf.initializers.variance_scaling instead with distribution=uniform to get equivalent behavior.

ValueError: Internal error occurred mkl_fft

After following the installation instructions for OpenSeq2Seq and running the first test:

python run.py --config_file=example_configs/speech2text/ds2_toy_config.py --mode=train_eval

I received a "ValueError: Internal error occurred" related to mkl_fft, causing the test to fail.

As a temporary fix found in IntelPython/mkl_fft#11, the test ran properly after adding the following line at the top of OpenSeq2Seq/open_seq2seq/data/speech2text.py to disable the MKL FFT optimization.

np.fft.restore_all()

Additional Details:

  • Tensorflow version 1.10.0
  • CUDA 9.2
  • cuDNN 7.3
  • Python 3.6

Beam search decoder does not work in mixed precision/float16

Traceback (most recent call last):
File "/home/okuchaiev/repos/OpenSeq2Seq/run.py", line 226, in
main()
File "/home/okuchaiev/repos/OpenSeq2Seq/run.py", line 220, in main
hvd=hvd)
File "/home/okuchaiev/repos/OpenSeq2Seq/open_seq2seq/utils/model_builders.py", line 197, in create_encoder_decoder_loss_model
hvd=hvd,
File "/home/okuchaiev/repos/OpenSeq2Seq/open_seq2seq/models/seq2seq.py", line 49, in init
hvd=hvd)
File "/home/okuchaiev/repos/OpenSeq2Seq/open_seq2seq/models/model.py", line 177, in init
gpu_id=gpu_ind,
File "/home/okuchaiev/repos/OpenSeq2Seq/open_seq2seq/models/seq2seq.py", line 82, in _build_forward_pass_graph
decoder_output = self.decoder.decode(input_dict=decoder_input)
File "/home/okuchaiev/repos/OpenSeq2Seq/open_seq2seq/decoders/decoder.py", line 113, in decode
return self._decode(self._cast_types(input_dict))
File "/home/okuchaiev/repos/OpenSeq2Seq/open_seq2seq/decoders/rnn_decoders.py", line 449, in _decode
output_time_major=time_major,
File "/usr/local/lib/python3.5/dist-packages/tensorflow/contrib/seq2seq/python/ops/decoder.py", line 201, in dynamic_decode
initial_finished, initial_inputs, initial_state = decoder.initialize()
File "/usr/local/lib/python3.5/dist-packages/tensorflow/contrib/seq2seq/python/ops/beam_search_decoder.py", line 308, in initialize
dtype=nest.flatten(self._initial_cell_state)[0].dtype)
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/ops/array_ops.py", line 2446, in one_hot
"dtype parameter {1}".format(on_dtype, dtype))
TypeError: dtype <dtype: 'float32'> of on_value does not match dtype parameter <dtype: 'float16'>

race condition with logdir while launching distributed training job with horovod

There may be a race condition when launching a distributed training job with --enable_logs turned on. On the master node with 8 GPUs, I'm noticing that I seem to randomly get the error "Log directory is not empty. If you want to continue learning, you should provide "--continue_learning" flag", accompanied by one or more GPUs not being acquired. I believe it has to do with the rank 0 process creating the logdir and the other ranks checking for its existence and for a checkpoint within. The logging directory is created first, and the other ranks then try to find a checkpoint; they fail, an IOError is raised and caught, the checkpoint variable is never set when returned from check_logdir, and finally the process seems to just hang, never acquiring its GPU. Horovod shuts down moments later with an error.

checkpoint = check_logdir(args, base_config, restore_best_checkpoint)

if os.path.isdir(logdir) and os.listdir(logdir) != []:

Adding one line of code here fixes the problem for me. Happy to submit a PR.

  # Check logdir and create it if necessary
  if hvd is None or hvd.rank() == 0:
    checkpoint = check_logdir(args, base_config, restore_best_checkpoint)

Slow CTC beam search decoder

The current CTC beam search decoder runs on a single CPU core, so it can become a bottleneck for speech recognition inference.
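One possible mitigation (not an existing OpenSeq2Seq feature) is to decode independent utterances on separate CPU cores with a process pool; a minimal sketch assuming a hypothetical per-utterance decoder ctc_beam_search_decode:

from multiprocessing import Pool

def ctc_beam_search_decode(logits):
    # Placeholder for the actual single-utterance beam search decoder
    # (currently single-threaded in OpenSeq2Seq).
    raise NotImplementedError

def decode_batch(batch_logits, num_workers=8):
    """Decode each utterance of the batch in a separate worker process."""
    with Pool(processes=num_workers) as pool:
        return pool.map(ctc_beam_search_decode, batch_logits)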

BeamSearchHelperTest.test_shape_list fails with TF 1.9rc2

TF 1.9 seems to improve static shape inference to pick up constant expressions within the graph. In consequence, all elements returned by beam_search._shape_list(x) are now int values. This causes the assert on shape[1] in BeamSearchHelperTest.test_shape_list to fail.

Allow RNN decoders to tie embedding and projection weights

The RNNDecoderWithAttention and BeamSearchRNNDecoderWithAttention classes should be able to use the transpose of self._dec_emb_w in self._output_projection_layer. This will probably require changing self._output_projection_layer to a function which does a simple matmul, instead of using tf.layers.Dense.
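A minimal TF1-style sketch of such a tied projection, assuming the decoder hidden size equals the embedding dimension (the function itself is illustrative, not the proposed implementation):

import tensorflow as tf

def tied_output_projection(decoder_outputs, dec_emb_w):
    """Project decoder outputs to vocabulary logits by reusing the transposed
    embedding matrix instead of a separate tf.layers.Dense.

    decoder_outputs: [batch, time, hidden]
    dec_emb_w:       [vocab_size, hidden] decoder embedding matrix
    """
    # tensordot over the hidden dimension yields [batch, time, vocab_size].
    return tf.tensordot(decoder_outputs, tf.transpose(dec_emb_w), axes=1)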

Add support for iter_size - virtual large batch

iter_size should do the following:

  1. if iter_size=1 or not set then behave as before
  2. if iter_size > 1 then we should accumulate gradients on every worker and broadcast/sync every iter_size steps.

This would effectively mean that "algorithmic batch size" = batch_size_per_gpu * num_gpus * iter_size.
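A minimal TF1-style sketch of this kind of gradient accumulation (illustrative only, not the actual optimizers.py implementation; `opt` is any tf.train.Optimizer and `loss` the training loss):

import tensorflow as tf

def accumulate_and_apply(opt, loss, iter_size):
    """Accumulate gradients over iter_size steps, then apply and reset."""
    tvars = tf.trainable_variables()
    grads = tf.gradients(loss, tvars)
    # One non-trainable accumulator per trainable variable.
    accums = [tf.Variable(tf.zeros_like(v), trainable=False) for v in tvars]

    accum_op = tf.group(*[a.assign_add(g / iter_size)
                          for a, g in zip(accums, grads)])
    apply_op = opt.apply_gradients(zip(accums, tvars))
    with tf.control_dependencies([apply_op]):
        reset_op = tf.group(*[a.assign(tf.zeros_like(a)) for a in accums])
    return accum_op, reset_op

# Run accum_op every step and reset_op every iter_size-th step (it applies the
# averaged gradients, then zeroes the accumulators). With Horovod, the
# allreduce would wrap the accumulated gradients so the broadcast/sync also
# happens only every iter_size steps.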

Unify helper functions and layer building interface

By now, each model has a very similar (in terms of functionality) "layers" parameter that builds layers based on the specified configuration. I think we should unify this interface, probably based on the most general CNNEncoder approach. We should also unify all helper functions in parts, since some of them do very similar things.

Error while scaling the number of nodes - training on CPU nodes

I'm trying to run the transformer big model with the default config file while modifying the number of compute nodes. I'm not using any GPU; I'm experimenting with scaling the CPU nodes. I run into a resource deadlock error as I try to scale up the nodes. Any insights on the error would be helpful.
This happens after the shuffle buffer has been filled.

Below is the warning and error segments from the log file.

WARNING: One or more tensors were submitted to be reduced, gathered or broadcasted by subset of ranks and are waiting for remainder of ranks for more than 60 seconds. This may indicate that different ranks are trying to submit different tensors or that only subset of ranks is submitting tensors, which will cause deadlock.
Stalled ops:
Loss_Optimization/DistributedLazyAdamOptimizer_Allreduce/HorovodAllgather_Loss_Optimization_gradients_concat_0 [missing ranks: 25, 27, 60, 67, 103, 115]
Loss_Optimization/DistributedLazyAdamOptimizer_Allreduce/HorovodAllreduce_Loss_Optimization_gradients_ForwardPass_transformer_encoder_encode_layer_0_self_attention_layer_normalization_mul_1_grad_tuple_control_dependency_1_0 [missing ranks: 25, 27, 60, 67, 103, 115]
Loss_Optimization/DistributedLazyAdamOptimizer_Allreduce/HorovodAllreduce_Loss_Optimization_gradients_ForwardPass_transformer_encoder_encode_layer_0_self_attention_layer_normalization_add_1_grad_tuple_control_dependency_1_0 [missing ranks: 25, 27, 60, 67, 103, 115]

python:348666 terminated with signal 11 at PC=2aaaec522f14 SP=2aaaed31ce10. Backtrace:
/cm/shared/apps/intel/compilers_and_libraries_2018.2.199/linux/mpi/intel64/lib/libmpi.so.12(+0x29af14)[0x2aaaec522f14]
/cm/shared/apps/intel/compilers_and_libraries_2018.2.199/linux/mpi/intel64/lib/libmpi.so.12(+0xe77af)[0x2aaaec36f7af]
/cm/shared/apps/intel/compilers_and_libraries_2018.2.199/linux/mpi/intel64/lib/libmpi.so.12(PMPI_Allgatherv+0xd7c)[0x2aaaec373aec]
/home/srinivas/anaconda2/lib/python2.7/site-packages/horovod/common/mpi_lib.so(+0x1eac1)[0x2aaaebaadac1]
/home/srinivas/anaconda2/lib/python2.7/site-packages/horovod/common/mpi_lib.so(+0x2304d)[0x2aaaebab204d]
/home/srinivas/anaconda2/lib/python2.7/site-packages/horovod/common/mpi_lib.so(+0x23eaf)[0x2aaaebab2eaf]
/cm/local/apps/gcc/7.2.0/lib64/libstdc++.so.6(+0xb9cff)[0x2aaac5e0bcff]
/lib64/libpthread.so.0(+0x7e25)[0x2aaaab0b3e25]
/lib64/libc.so.6(clone+0x6d)[0x2aaaabac934d]
/cm/shared/apps/intel/compilers_and_libraries_2018.2.199/linux/mpi/intel64/lib/libmpi.so.12(+0x29af14)[0x2aaaec522f14]
/cm/shared/apps/intel/compilers_and_libraries_2018.2.199/linux/mpi/intel64/lib/libmpi.so.12(+0xe77af)[0x2aaaec36f7af]
/cm/shared/apps/intel/compilers_and_libraries_2018.2.199/linux/mpi/intel64/lib/libmpi.so.12(PMPI_Allgatherv+0xd7c)[0x2aaaec373aec]
/home/srinivas/anaconda2/lib/python2.7/site-packages/horovod/common/mpi_lib.so(+0x1eac1)[0x2aaaebaadac1]
/home/srinivas/anaconda2/lib/python2.7/site-packages/horovod/common/mpi_lib.so(+0x2304d)[0x2aaaebab204d]
/home/srinivas/anaconda2/lib/python2.7/site-packages/horovod/common/mpi_lib.so(+0x23eaf)[0x2aaaebab2eaf]
/cm/local/apps/gcc/7.2.0/lib64/libstdc++.so.6(+0xb9cff)[0x2aaac5e0bcff]
/lib64/libpthread.so.0(+0x7e25)[0x2aaaab0b3e25]
/lib64/libc.so.6(clone+0x6d)[0x2aaaabac934d]
terminate called after throwing an instance of 'std::system_error'
what(): Resource deadlock avoided

Thanks,
srini
