
Toolkit for efficient experimentation with Speech Recognition, Text2Speech and NLP

Home Page: https://nvidia.github.io/OpenSeq2Seq

License: Apache License 2.0

Languages: Python 91.14%, C++ 5.62%, Shell 2.05%, Jupyter Notebook 0.56%, Perl 0.29%, Makefile 0.14%, Starlark 0.11%, SWIG 0.08%
Topics: neural-machine-translation, multi-gpu, deep-learning, sequence-to-sequence, seq2seq, multi-node, speech-recognition, speech-to-text, mixed-precision, float16

openseq2seq's Introduction


OpenSeq2Seq

OpenSeq2Seq: toolkit for distributed and mixed precision training of sequence-to-sequence models

OpenSeq2Seq's main goal is to allow researchers to explore various sequence-to-sequence models as effectively as possible. This efficiency is achieved by fully supporting distributed and mixed-precision training. OpenSeq2Seq is built on TensorFlow and provides all the necessary building blocks for training encoder-decoder models for neural machine translation, automatic speech recognition, speech synthesis, and language modeling.

Documentation and installation instructions

https://nvidia.github.io/OpenSeq2Seq/

Features

  1. Models for:
    1. Neural Machine Translation
    2. Automatic Speech Recognition
    3. Speech Synthesis
    4. Language Modeling
    5. NLP tasks (sentiment analysis)
  2. Data-parallel distributed training
    1. Multi-GPU
    2. Multi-node
  3. Mixed precision training for NVIDIA Volta/Turing GPUs

Software Requirements

  1. Python >= 3.5
  2. TensorFlow >= 1.10
  3. CUDA >= 9.0, cuDNN >= 7.0
  4. Horovod >= 0.13 (using Horovod is not required, but is highly recommended for multi-GPU setup)

Acknowledgments

The speech-to-text workflow uses some parts of the Mozilla DeepSpeech project.

The beam search decoder with language-model re-scoring (in decoders) is based on Baidu DeepSpeech.

The text-to-text workflow uses some functions from Tensor2Tensor and the Neural Machine Translation (seq2seq) Tutorial.

Disclaimer

This is a research project, not an official NVIDIA product.

Related resources

Paper

If you use OpenSeq2Seq, please cite this paper:

@misc{openseq2seq,
    title={Mixed-Precision Training for NLP and Speech Recognition with OpenSeq2Seq},
    author={Oleksii Kuchaiev and Boris Ginsburg and Igor Gitman and Vitaly Lavrukhin and Jason Li and Huyen Nguyen and Carl Case and Paulius Micikevicius},
    year={2018},
    eprint={1805.10387},
    archivePrefix={arXiv},
    primaryClass={cs.CL}
}

openseq2seq's People

Contributors

arnav1993k, blisc, borisgin, edresson, edwardhdlu, gabriellin, gioannides, giranntu, ka-bu, kipok, louischen1992, madrugado, matanhs, mvankeirsbilck, nluehr, okuchaiev, otstrel, raymondnie, samikama, shujian2015, siddharthbhatnagar, the01, trentlo, trevor-m, vahidoox, virajkarandikar, vsl9, vsuthichai, xravitejax, yrebryk

openseq2seq's Issues

Overwriting when updating base params with train, eval, and infer params

In the current implementation, scalar parameters are overwritten when the base params are updated with the train/eval/infer params, as intended. However, this raises an issue for nested parameters: the entire nested dictionary is replaced with whatever is in the train/eval/infer params. For example:
If my config is

base_params = {
    "random_seed": 0,
    "data_layer_params": {
        "num_features": 10
    }
}
train_params = {
    "data_layer_params": {
        "dataset": "train.csv"
    }
}

ideally I would want the merged train params to be

merged_params = {
    "random_seed": 0,
    "data_layer_params": {
        "num_features": 10,
        "dataset": "train.csv"
    }
}

but currently I get

merged_params = {
    "random_seed": 0,
    "data_layer_params": {
        "dataset": "train.csv"
    }
}

For experimentation it would be helpful to keep num_features in the base params rather than adding it to train_params, so that it can be changed easily via the command line instead of creating multiple config files.
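A minimal sketch of the kind of recursive merge that would give the desired behavior (the helper name deep_update is hypothetical, not part of OpenSeq2Seq):

def deep_update(base, overrides):
    """Recursively merge `overrides` into `base` instead of replacing
    nested dictionaries wholesale; returns a new merged dict."""
    merged = dict(base)
    for key, value in overrides.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            # Recurse into nested dicts so keys like "num_features" survive.
            merged[key] = deep_update(merged[key], value)
        else:
            # Scalars and non-dict values are overwritten as before.
            merged[key] = value
    return merged

merged_params = deep_update(base_params, train_params)
# merged_params["data_layer_params"] now contains both "num_features" and "dataset".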

throughput scaling issues

I'm attempting to benchmark throughput on transformer-big on the following:

  • AWS p3.16xlarge (8 gpus per node)
  • Horovod 0.13.10
  • OpenMPI 3.1.1
  • TensorFlow 1.9.0
  • CUDA 9.0
  • FP32

I'm benchmarking for 100 steps -- 10 to 109, skipping the first 0 to 9 steps. Here are some results. It seems to plateau at 8 gpus and then doesn't scale any further. I'm primarily interested in getting the throughput samples per second to scale well. Any thoughts?

Nodes  GPUs  Steps  Global Batch Size  Per-GPU Batch Size  Seconds/Step  Objects/Sec  Samples/Sec
1      2     100    256                128                 0.712         21489.658    360
1      4     100    512                128                 0.877         34967.614    583
1      8     100    1024               128                 1.287         47788.866    795
2      16    100    2048               128                 2.906         42235.197    704
3      24    100    3072               128                 3.972         46429.704    773
4      32    100    4096               128                 5.09          48363.986    804

Question : GNMT model faster on single GPU than on 4 GPUs?

Hello
I have been experimenting with the en-de-gnmt-like-4GPUs.py recipe. I have noticed that the time per step is smaller with 1 GPU than with 4 GPUs; the step time is approximately 40% slower on 4 GPUs than on 1 GPU. I use the default batch size (32) in both cases.

Here is an example.
After training ~36 hours with 4 GPUs:
Global step 73968: loss = 1.0840, time per step = 0:00:1.390

However, after training ~36 hours with 1 GPU:
Global step 138432: loss = 1.2411, time per step = 0:00:0.789

I have a few questions:

  1. Isn't the step time expected to be smaller in the 4-GPU case?
  2. However, after 36 hours the loss is smaller in the 4-GPU case even though the step time is higher.
  3. I'm wondering if that's because objects per second is higher in the multi-GPU case. But if that is the case, shouldn't the step time decrease?

Thanks for clarification.

iter_size does not properly work with sparse gradients

Error:

*** Building graph in Horovod rank: 0
Traceback (most recent call last):
File "run.py", line 254, in
main()
File "run.py", line 233, in main
train_model.compile()
File "/home/okuchaiev/repos/Work/OpenSeq2Seq/open_seq2seq/models/model.py", line 396, in compile
skip_update_ph=self.skip_update_ph,
File "/home/okuchaiev/repos/Work/OpenSeq2Seq/open_seq2seq/optimizers/optimizers.py", line 194, in optimize_loss
expected_shape=grad.shape,
AttributeError: 'IndexedSlices' object has no attribute 'shape'

You can reproduce on dev0.4 with:
python run.py --config_file=example_configs/text2text/nmt-reversal-RR.py --mode=train_eval

tf.data.Dataset.shard is doing split in different way as assumed by OS2S

tf.data's shard does an "interleaved" split: shard k gets every element whose index i satisfies i % num_workers == k. OpenSeq2Seq, however, assumes internally that the data is split "contiguously", meaning that all data is divided into equal sequential parts and the first part goes to the first dataset, the second to the second, and so on. This will break multi-GPU inference when the tf.data.Dataset.shard approach is used. We should make OS2S either support both options or switch to the tf.data version, since that's what we expect users to do when they create new datasets.
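A small plain-Python sketch, just to make the two indexing schemes explicit (the helper names are made up for illustration):

def interleaved_shard(data, num_workers, worker_id):
    # What tf.data.Dataset.shard does: worker k gets every element whose
    # index i satisfies i % num_workers == k.
    return [x for i, x in enumerate(data) if i % num_workers == worker_id]

def contiguous_shard(data, num_workers, worker_id):
    # What OpenSeq2Seq currently assumes: equal sequential chunks.
    shard_size = len(data) // num_workers
    start = worker_id * shard_size
    return data[start:start + shard_size]

data = list(range(8))
print(interleaved_shard(data, 2, 0))  # [0, 2, 4, 6]
print(contiguous_shard(data, 2, 0))   # [0, 1, 2, 3]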

Encoding issue with python 2 when evaluating convs2s

When evaluation starts for convs2s on Python 2, the following encoding error is given:

*** Running evaluation on a validation set:
*** *****EVAL Source[0]: Gut@@ ach : Incre@@ ased safety for pedestri@@ ans
main()
File "run.py", line 236, in main
train(train_model, eval_model, debug_port=args.debug_port)
File "/opt/OpenSeq2Seq/OpenSeq2Seq/open_seq2seq/utils/funcs.py", line 126, in train
fetches_vals = sess.run(fetches, feed_dict)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/monitored_session.py", line 567, in run
run_metadata=run_metadata)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/monitored_session.py", line 1043, in run
run_metadata=run_metadata)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/monitored_session.py", line 1134, in run
raise six.reraise(*original_exc_info)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/monitored_session.py", line 1119, in run
return self._sess.run(*args, **kwargs)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/monitored_session.py", line 1199, in run
run_metadata=run_metadata))
File "/opt/OpenSeq2Seq/OpenSeq2Seq/open_seq2seq/utils/hooks.py", line 190, in after_run
self._model, run_context.session, mode="eval", compute_loss=True,
File "/opt/OpenSeq2Seq/OpenSeq2Seq/open_seq2seq/utils/utils.py", line 223, in get_results_for_epoch
model, sess, compute_loss, mode, verbose,
File "/opt/OpenSeq2Seq/OpenSeq2Seq/open_seq2seq/utils/utils.py", line 177, in iterate_data
results_per_batch.append(model.evaluate(inputs, outputs))
File "/opt/OpenSeq2Seq/OpenSeq2Seq/open_seq2seq/models/text2text.py", line 173, in evaluate
offset=4,
File "/opt/OpenSeq2Seq/OpenSeq2Seq/open_seq2seq/utils/utils.py", line 331, in deco_print
print(start + " " * offset + line, end=end)
File "/opt/OpenSeq2Seq/OpenSeq2Seq/open_seq2seq/utils/utils.py", line 283, in write
self.log.write(msg)
UnicodeEncodeError: 'ascii' codec can't encode character u'\xfc' in position 71: ordinal not in range(128)
*** *****EVAL Target[0]: Gut@@ ach : Noch mehr Sicherheit für Fußgän@@ ger

TensorFlow RNN cells don't have a regularizer parameter and thus are difficult to use in MP mode

The problem is that our MP approach relies on regularization being applied by setting a regularizer parameter at the layer or variable level. In the case of RNNs this does not seem to be possible, because for some reason there is no such parameter and the variables are created deep inside TF code. A workaround would be to write something like this:

for var in rnn_cell.trainable_variables:
  if var.dtype.base_dtype == tf.float16:
    # Manually register the (variable, regularizer) pair; the MP
    # regularization machinery reads this collection.
    tf.add_to_collection('REGULARIZATION_FUNCTIONS', (var, self.params["regularizer"]))

but this seems like a hacky way of fixing the problem, and it relies on mp_regularizer_wrapper continuing to work the way it does now.

text2text.py data pipeline inefficiency

There are some inefficiencies caused by applying shard after cache in the text2text.py build_graph method. Rewriting the code so that shard is called before map and cache results in a pretty sizable decrease in time per step when training the big transformer on WMT14 en-de. Happy to provide a pull request. On 4 nodes, 32 GPUs, batch size 128, iter size 16, mixed-precision training, 16 GB Voltas, I've seen the time per step drop from 13.9 s without the fix to 6.6 s with the following change in the dataset pipeline.

  def build_graph(self):
    _sources = tf.data.TextLineDataset(self.source_file)

    if self._num_workers > 1:
      _sources = _sources.shard(num_shards=self._num_workers, index=self._worker_id)

    _sources = _sources.map(
        lambda line: tf.py_func(func=self._src_token_to_id, inp=[line],
                                Tout=[tf.int32], stateful=False),
        num_parallel_calls=self._map_parallel_calls) \
      .map(lambda tokens: (tokens, tf.size(tokens)),
           num_parallel_calls=self._map_parallel_calls)

    _targets = tf.data.TextLineDataset(self.target_file)

    if self._num_workers > 1:
      _targets = _targets.shard(num_shards=self._num_workers, index=self._worker_id)

    _targets = _targets.map(
        lambda line: tf.py_func(func=self._tgt_token_to_id, inp=[line],
                                Tout=[tf.int32], stateful=False),
        num_parallel_calls=self._map_parallel_calls) \
      .map(lambda tokens: (tokens, tf.size(tokens)),
           num_parallel_calls=self._map_parallel_calls)

    _src_tgt_dataset = tf.data.Dataset.zip((_sources, _targets)).filter(
        lambda t1, t2: tf.logical_and(tf.less_equal(t1[1], self.max_len),
                                      tf.less_equal(t2[1], self.max_len))
    ).cache()

#204

Multi-GPU evaluation without horovod is sequential

Meaning that the second GPU will start computation only when the first one has finished. This is an artifact of the new multi-GPU implementation. It's probably not too difficult to fix (we just need to pass tensors from all copies of the model to sess.run), but it would make the code much more complicated, since Horovod and non-Horovod processing would need to differ. We don't prioritize solving this issue since everything works correctly with Horovod, and that's the recommended way to do multi-GPU anyway because only Horovod has NCCL support.

performance issue ~30mins loading WMT dataset, transformer

I'm attempting to train a transformer-big model with 16gpus across 2 nodes (8 gpus each) on AWS with Horovod and experiencing a performance issue when the run.py script starts up. With the tensorflow timeline, I've narrowed it down to what I believe is the IteratorGetNext op. I suspect it's spending a lot of time trying to load and preprocess the dataset. This is on an EBS backed instance.


In inference mode _get_num_objects_per_step throws exception

  1. Ideally it should not be called at all if the benchmark flag isn't set
  2. In inference mode the targets are actually the output generated by the model, yet right now it expects to get them from the data layer?

WARNING:tensorflow:From /home/okuchaiev/repos/OpenSeq2Seq/open_seq2seq/parts/transformer/beam_search.py:421: calling reduce_logsumexp (from tensorflow.python.ops.math_ops) with keep_dims is deprecated and will be removed in a future version.
Instructions for updating:
keep_dims is deprecated, use keepdims instead
*** Inference Mode. Loss part of graph isn't built.
Traceback (most recent call last):
File "/home/okuchaiev/repos/OpenSeq2Seq/run.py", line 253, in
main()
File "/home/okuchaiev/repos/OpenSeq2Seq/run.py", line 242, in main
infer_model.compile()
File "/home/okuchaiev/repos/OpenSeq2Seq/open_seq2seq/models/model.py", line 350, in compile
for worker_id in range(self.num_gpus)]
File "/home/okuchaiev/repos/OpenSeq2Seq/open_seq2seq/models/model.py", line 350, in
for worker_id in range(self.num_gpus)]
File "/home/okuchaiev/repos/OpenSeq2Seq/open_seq2seq/models/text2text.py", line 223, in _get_num_objects_per_step
num_tokens += tf.reduce_sum(data_layer.input_tensors['target_tensors'][1])
KeyError: 'target_tensors'

Any WaveNet related plans?

Hello, everyone.

Thank you for your amazing work in this repo. Could you please shed some light on whether you plan to do a WaveNet implementation/integration with nv-wavenet?

Suggestions about decoder codes with gnmt attention

First of all, thanks for your great work!

Here is just a small suggestion about improving the code readability of RNNDecoderWithAttention and BeamSearchDecoderWithAttention when gnmt or gnmt_v2 attention is enabled, because they confused me when I read them for the first time :)

There are several branches in these two classes that deal with gnmt attention. I understand branching is necessary to handle different cases, but some of them do quite misleading things, like:

if self.params['attention_type'].startswith('gnmt'):
  residual_connections = False
  wrap_to_multi_rnn = False
else:
  residual_connections = self.params['decoder_use_skip_connections']
  wrap_to_multi_rnn = True

In the case above the branch force-sets residual_connections and wrap_to_multi_rnn to False; however, it does in fact use residual connections below:

attentive_decoder_cell = GNMTAttentionMultiCell(
    attention_cell, self._add_residual_wrapper(self._decoder_cells),
    use_new_attention=(self.params['attention_type'] == 'gnmt_v2'))

Also, this forcibly overrides the hyperparameter in the config file, which adds to the confusion (I guess gnmt attention will always use residual connections even if this is set to False in the config file?):

"decoder_use_skip_connections": True,

Besides, there is redundant/inconsistent code for creating residual connections: either via create_rnn_cell or via self._add_residual_wrapper.

Again, thanks for your great work, and looking forward to new models :)

Evaluation numbers are not exact when using num_gpus > 1

This happens because all batches are split across GPUs in random order (because we use subsequent calls to tf.data.Iterator().get_next()). Thus, the last batch might be processed incorrectly when the dataset size is not divisible by the batch size.
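One common workaround is to pad the evaluation set to a multiple of the global batch size and mask out the padded samples when aggregating metrics; a minimal NumPy sketch under that assumption (not OpenSeq2Seq code, names are illustrative):

import numpy as np

def pad_to_global_batch(samples, global_batch_size):
    """Repeat the last sample so every GPU receives full, equal-sized batches;
    return the padded array and a mask marking the real samples."""
    n = len(samples)
    remainder = n % global_batch_size
    pad = 0 if remainder == 0 else global_batch_size - remainder
    padded = np.concatenate([samples, np.repeat(samples[-1:], pad, axis=0)])
    mask = np.concatenate([np.ones(n, dtype=bool), np.zeros(pad, dtype=bool)])
    return padded, mask

# Metrics are then computed only over entries where mask is True, so the
# duplicated padding samples never affect the final evaluation numbers.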

Performance: training time of the transformer model is too long in mixed-precision mode

When I train the transformer in mixed-precision mode, the time per step is very long:
(1) transformer_big.py: two GPUs (V100), all parameters default, time per step: 13 s;
(2) tensor2tensor (big model): two GPUs (V100), all parameters default, time per step: 0.3 s.
As presented in the mixed-precision training paper (https://arxiv.org/pdf/1710.03740.pdf), the training time should be shorter than in FP32 mode. So why?

Re-write the "ParallelTextDataLayer" so that shuffling is fast.

1) Let's have a standalone script which will deterministically split a file into N shards (default: 100), so we'll have train.tok.clean.BPE.en_XX_of_N and train.tok.clean.BPE.de_XX_of_N.
2) Have two tf.data objects with the file lists (for src and tgt) and zip them together with tf.data.Dataset.zip (see the sketch after this list).
3) Then fully shuffle only the dataset of filename pairs.
4) Parse the shuffled datasets and create batches much like "ParallelTextDataLayer" already does.
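A rough sketch of steps 1) and 2), assuming a round-robin split and shard names like train.tok.clean.BPE.en_00_of_100 (the helper split_file and the exact naming are illustrative, not existing OpenSeq2Seq code):

import tensorflow as tf

def split_file(path, num_shards=100):
    """Step 1: deterministically split `path` into num_shards files,
    writing line i to shard i % num_shards (same rule for src and tgt,
    so line alignment between the two languages is preserved)."""
    with open(path) as f:
        lines = f.readlines()
    shard_paths = []
    for k in range(num_shards):
        shard_path = "{}_{:02d}_of_{}".format(path, k, num_shards)
        with open(shard_path, "w") as out:
            out.writelines(lines[k::num_shards])
        shard_paths.append(shard_path)
    return shard_paths

# Steps 2) and 3): zip the source and target shard lists and fully shuffle
# only the (small) dataset of filename pairs; parsing happens afterwards.
src_shards = split_file("train.tok.clean.BPE.en")
tgt_shards = split_file("train.tok.clean.BPE.de")
file_pairs = tf.data.Dataset.zip((
    tf.data.Dataset.from_tensor_slices(src_shards),
    tf.data.Dataset.from_tensor_slices(tgt_shards),
)).shuffle(buffer_size=len(src_shards))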

Expose global training step inside hooks.py to model.maybe_print_logs, and finalize_evaluation

For tacotron, I am currently saving images and wav files to disk as opposed to logging them in TensorBoard. It would be nice to have access to the current step so I can save the file under a name such as eval_sample_step_100.wav.

It would be nice to change inside utils/hooks.py from
dict_to_log = self._model.maybe_print_logs(input_values, output_values)
to
dict_to_log = self._model.maybe_print_logs(input_values, output_values, step)

and from
dict_to_log = self._model.finalize_evaluation(results_per_batch)
to
dict_to_log = self._model.finalize_evaluation(results_per_batch, step)

The downside would be that all models would have to be changed to accept this additional parameter. Alternatively, we could hide it inside input_values or results_per_batch, but this doesn't seem as nice.

Fix all deprecation warnings

Such as:

Instructions for updating:
Use tf.initializers.variance_scaling instead with distribution=uniform to get equivalent behavior.

ValueError: Internal error occurred mkl_fft

After following the installation instructions for OpenSeq2Seq and running the first test:

python run.py --config_file=example_configs/speech2text/ds2_toy_config.py --mode=train_eval

I received a "ValueError: Internal error occurred" related to mkl_fft, causing the test to fail.

As a temporary fix found in IntelPython/mkl_fft#11, the test ran properly after adding the following line at the top of OpenSeq2Seq/open_seq2seq/data/speech2text.py to disable the MKL FFT optimization.

np.fft.restore_all()

Additional Details:

  • Tensorflow version 1.10.0
  • CUDA 9.2
  • cuDNN 7.3
  • Python 3.6

Beam search decoder does not work in mixed precision/float16

Traceback (most recent call last):
File "/home/okuchaiev/repos/OpenSeq2Seq/run.py", line 226, in
main()
File "/home/okuchaiev/repos/OpenSeq2Seq/run.py", line 220, in main
hvd=hvd)
File "/home/okuchaiev/repos/OpenSeq2Seq/open_seq2seq/utils/model_builders.py", line 197, in create_encoder_decoder_loss_model
hvd=hvd,
File "/home/okuchaiev/repos/OpenSeq2Seq/open_seq2seq/models/seq2seq.py", line 49, in init
hvd=hvd)
File "/home/okuchaiev/repos/OpenSeq2Seq/open_seq2seq/models/model.py", line 177, in init
gpu_id=gpu_ind,
File "/home/okuchaiev/repos/OpenSeq2Seq/open_seq2seq/models/seq2seq.py", line 82, in _build_forward_pass_graph
decoder_output = self.decoder.decode(input_dict=decoder_input)
File "/home/okuchaiev/repos/OpenSeq2Seq/open_seq2seq/decoders/decoder.py", line 113, in decode
return self._decode(self._cast_types(input_dict))
File "/home/okuchaiev/repos/OpenSeq2Seq/open_seq2seq/decoders/rnn_decoders.py", line 449, in _decode
output_time_major=time_major,
File "/usr/local/lib/python3.5/dist-packages/tensorflow/contrib/seq2seq/python/ops/decoder.py", line 201, in dynamic_decode
initial_finished, initial_inputs, initial_state = decoder.initialize()
File "/usr/local/lib/python3.5/dist-packages/tensorflow/contrib/seq2seq/python/ops/beam_search_decoder.py", line 308, in initialize
dtype=nest.flatten(self._initial_cell_state)[0].dtype)
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/ops/array_ops.py", line 2446, in one_hot
"dtype parameter {1}".format(on_dtype, dtype))
TypeError: dtype <dtype: 'float32'> of on_value does not match dtype parameter <dtype: 'float16'>

race condition with logdir while launching distributed training job with horovod

There may be a race condition when launching a distributed training job with --enable_logs turned on. On the master node with 8 GPUs, I'm noticing that I seem to randomly get the error "Log directory is not empty. If you want to continue learning, you should provide "--continue_learning" flag", accompanied by one or more GPUs not being acquired. I believe it has to do with the rank 0 process creating the logdir and the other ranks checking for its existence and for a checkpoint within. The logging directory is created first, and the other ranks then try to find a checkpoint; they fail, an IOError is raised and caught, the checkpoint variable is never set when returned from check_logdir, and finally the process seems to just hang, never acquiring its GPU. Horovod shuts down moments later with an error.

checkpoint = check_logdir(args, base_config, restore_best_checkpoint)

if os.path.isdir(logdir) and os.listdir(logdir) != []:

Adding one line of code here fixes the problem for me. Happy to submit a PR.

  # Check logdir and create it if necessary
  if hvd is None or hvd.rank() == 0:
    checkpoint = check_logdir(args, base_config, restore_best_checkpoint)

Slow CTC beam search decoder

The current CTC beam search decoder runs on a single CPU core, so it can become a bottleneck for speech recognition inference.
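One possible mitigation (not an existing OpenSeq2Seq feature) is to decode independent utterances on separate CPU cores with a process pool; a minimal sketch assuming a hypothetical per-utterance decoder ctc_beam_search_decode:

from multiprocessing import Pool

def ctc_beam_search_decode(logits):
    # Placeholder for the actual single-utterance beam search decoder
    # (currently single-threaded in OpenSeq2Seq).
    raise NotImplementedError

def decode_batch(batch_logits, num_workers=8):
    """Decode each utterance of the batch in a separate worker process."""
    with Pool(processes=num_workers) as pool:
        return pool.map(ctc_beam_search_decode, batch_logits)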

BeamSearchHelperTest.test_shape_list fails with TF 1.9rc2

TF 1.9 seems to improve static shape inference to pick up constant expressions within the graph. In consequence, all elements returned by beam_search._shape_list(x) are now int values. This causes the assert on shape[1] in BeamSearchHelperTest.test_shape_list to fail.

Allow RNN decoders to tie embedding and projection weights

The RNNDecoderWithAttention and BeamSearchRNNDecoderWithAttention classes should be able to use the transpose of self._dec_emb_w in self._output_projection_layer. This will probably require changing self._output_projection_layer to a function which does a simple matmul, instead of using tf.layers.Dense.
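A minimal TF1-style sketch of such a tied projection, assuming the decoder hidden size equals the embedding dimension (the function itself is illustrative, not the proposed implementation):

import tensorflow as tf

def tied_output_projection(decoder_outputs, dec_emb_w):
    """Project decoder outputs to vocabulary logits by reusing the transposed
    embedding matrix instead of a separate tf.layers.Dense.

    decoder_outputs: [batch, time, hidden]
    dec_emb_w:       [vocab_size, hidden] decoder embedding matrix
    """
    # tensordot over the hidden dimension yields [batch, time, vocab_size].
    return tf.tensordot(decoder_outputs, tf.transpose(dec_emb_w), axes=1)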

Add support for iter_size - virtual large batch

iter_size should do the following:

  1. if iter_size=1 or not set then behave as before
  2. if iter_size > 1 then we should accumulate gradients on every worker and broadcast/sync every iter_size steps.

This would effectively mean that "algorithmic batch size" = batch_size_per_gpu * num_gpus * iter_size.
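A minimal TF1-style sketch of this kind of gradient accumulation (illustrative only, not the actual optimizers.py implementation; `opt` is any tf.train.Optimizer and `loss` the training loss):

import tensorflow as tf

def accumulate_and_apply(opt, loss, iter_size):
    """Accumulate gradients over iter_size steps, then apply and reset."""
    tvars = tf.trainable_variables()
    grads = tf.gradients(loss, tvars)
    # One non-trainable accumulator per trainable variable.
    accums = [tf.Variable(tf.zeros_like(v), trainable=False) for v in tvars]

    accum_op = tf.group(*[a.assign_add(g / iter_size)
                          for a, g in zip(accums, grads)])
    apply_op = opt.apply_gradients(zip(accums, tvars))
    with tf.control_dependencies([apply_op]):
        reset_op = tf.group(*[a.assign(tf.zeros_like(a)) for a in accums])
    return accum_op, reset_op

# Run accum_op every step and reset_op every iter_size-th step (it applies the
# averaged gradients, then zeroes the accumulators). With Horovod, the
# allreduce would wrap the accumulated gradients so the broadcast/sync also
# happens only every iter_size steps.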

Unify helper functions and layer building interface

By now, each model has a very similar (in terms of functionality) "layers" parameter that builds layers based on the specified configuration. I think we should unify this interface, probably based on the most general CNNEncoder approach. We should also unify all helper functions in parts, since some of them do very similar things.

Error while scaling the number of nodes - training on CPU nodes

I'm trying to run the transformer big model with the default config file while modifying the number of compute nodes. I'm not using any GPU; I'm experimenting with scaling the CPU nodes. I run into a resource deadlock error as I try to scale up the nodes. Any insights on the error would be helpful.
This happens after the shuffle buffer has been filled.

Below is the warning and error segments from the log file.

WARNING: One or more tensors were submitted to be reduced, gathered or broadcasted by subset of ranks and are waiting for remainder of ranks for more than 60 seconds. This may indicate that different ranks are trying to submit different tensors or that only subset of ranks is submitting tensors, which will cause deadlock.
Stalled ops:
Loss_Optimization/DistributedLazyAdamOptimizer_Allreduce/HorovodAllgather_Loss_Optimization_gradients_concat_0 [missing ranks: 25, 27, 60, 67, 103, 115]
Loss_Optimization/DistributedLazyAdamOptimizer_Allreduce/HorovodAllreduce_Loss_Optimization_gradients_ForwardPass_transformer_encoder_encode_layer_0_self_attention_layer_normalization_mul_1_grad_tuple_control_dependency_1_0 [missing ranks: 25, 27, 60, 67, 103, 115]
Loss_Optimization/DistributedLazyAdamOptimizer_Allreduce/HorovodAllreduce_Loss_Optimization_gradients_ForwardPass_transformer_encoder_encode_layer_0_self_attention_layer_normalization_add_1_grad_tuple_control_dependency_1_0 [missing ranks: 25, 27, 60, 67, 103, 115]

python:348666 terminated with signal 11 at PC=2aaaec522f14 SP=2aaaed31ce10. Backtrace:
/cm/shared/apps/intel/compilers_and_libraries_2018.2.199/linux/mpi/intel64/lib/libmpi.so.12(+0x29af14)[0x2aaaec522f14]
/cm/shared/apps/intel/compilers_and_libraries_2018.2.199/linux/mpi/intel64/lib/libmpi.so.12(+0xe77af)[0x2aaaec36f7af]
/cm/shared/apps/intel/compilers_and_libraries_2018.2.199/linux/mpi/intel64/lib/libmpi.so.12(PMPI_Allgatherv+0xd7c)[0x2aaaec373aec]
/home/srinivas/anaconda2/lib/python2.7/site-packages/horovod/common/mpi_lib.so(+0x1eac1)[0x2aaaebaadac1]
/home/srinivas/anaconda2/lib/python2.7/site-packages/horovod/common/mpi_lib.so(+0x2304d)[0x2aaaebab204d]
/home/srinivas/anaconda2/lib/python2.7/site-packages/horovod/common/mpi_lib.so(+0x23eaf)[0x2aaaebab2eaf]
/cm/local/apps/gcc/7.2.0/lib64/libstdc++.so.6(+0xb9cff)[0x2aaac5e0bcff]
/lib64/libpthread.so.0(+0x7e25)[0x2aaaab0b3e25]
/lib64/libc.so.6(clone+0x6d)[0x2aaaabac934d]
/cm/shared/apps/intel/compilers_and_libraries_2018.2.199/linux/mpi/intel64/lib/libmpi.so.12(+0x29af14)[0x2aaaec522f14]
/cm/shared/apps/intel/compilers_and_libraries_2018.2.199/linux/mpi/intel64/lib/libmpi.so.12(+0xe77af)[0x2aaaec36f7af]
/cm/shared/apps/intel/compilers_and_libraries_2018.2.199/linux/mpi/intel64/lib/libmpi.so.12(PMPI_Allgatherv+0xd7c)[0x2aaaec373aec]
/home/srinivas/anaconda2/lib/python2.7/site-packages/horovod/common/mpi_lib.so(+0x1eac1)[0x2aaaebaadac1]
/home/srinivas/anaconda2/lib/python2.7/site-packages/horovod/common/mpi_lib.so(+0x2304d)[0x2aaaebab204d]
/home/srinivas/anaconda2/lib/python2.7/site-packages/horovod/common/mpi_lib.so(+0x23eaf)[0x2aaaebab2eaf]
/cm/local/apps/gcc/7.2.0/lib64/libstdc++.so.6(+0xb9cff)[0x2aaac5e0bcff]
/lib64/libpthread.so.0(+0x7e25)[0x2aaaab0b3e25]
/lib64/libc.so.6(clone+0x6d)[0x2aaaabac934d]
terminate called after throwing an instance of 'std::system_error'
what(): Resource deadlock avoided

Thanks,
srini
