athena-team / athena

An open-source implementation of a sequence-to-sequence based speech processing engine

Home Page: https://athena-team.readthedocs.io

License: Apache License 2.0

Python 28.24% Makefile 0.07% C++ 70.75% Shell 0.48% Dockerfile 0.07% CMake 0.16% C 0.23%
speech-recognition asr transformer tensorflow ctc unsupervised-learning sequence-to-sequence deployment wfst speaker-recognition

athena's Introduction

Athena

Athena is an open-source implementation of an end-to-end speech processing engine. Our vision is to empower both industrial applications and academic research on end-to-end models for speech processing. To make speech processing available to everyone, we're also releasing example implementations and recipes on some open-source datasets for various tasks (Automatic Speech Recognition, Speech Synthesis, Voice Activity Detection, Wake Word Spotting, etc.).

All of our models are implemented in TensorFlow>=2.0.1. For ease of use, we provide a Kaldi-free, Pythonic feature extractor with Athena_transform.

Key Features

  • Hybrid Attention/CTC based end-to-end and streaming methods (ASR)
  • Text-to-Speech (FastSpeech/FastSpeech2/Transformer)
  • Voice activity detection (VAD)
  • Keyword spotting with end-to-end and streaming methods (KWS)
  • ASR unsupervised pre-training (MPC)
  • Multi-GPU training on one machine or across multiple machines with Horovod
  • WFST creation and WFST-based decoding with C++
  • Deployment with TensorFlow C++ (local server)

Versions

What's new

Discussion & Communication

We have set up a WeChat group for discussion. If you want to join, please scan the QR code and the administrator will invite you to the group.

1) Table of Contents

2) Installation

Athena can be installed on top of either TensorFlow 2.3 or TensorFlow 2.8.

  • Athena-v2.0 installed based on TensorFlow 2.3:
pip install tensorflow-gpu==2.3.0
pip install -r requirements.txt
python setup.py bdist_wheel sdist
python -m pip install --ignore-installed dist/athena-2.0*.whl

  • Athena-v2.0 installed based on TensorFlow 2.8:
pip install tensorflow-gpu==2.8.0
pip install -r requirements.txt
python setup.tf2.8.py bdist_wheel sdist
python -m pip install --ignore-installed dist/athena-2.0*.whl

3) Results

3.1) ASR

The performance of some of the models is shown below:

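| Model | LM | HKUST dev (CER%) | AISHELL1 dev (CER%) | AISHELL1 test (CER%) | LibriSpeech dev_clean (WER%) | LibriSpeech dev_other (WER%) | LibriSpeech test_clean (WER%) | LibriSpeech test_other (WER%) | Giga dev (WER%) | Giga test (WER%) | MISP (CER%) |
|---|---|---|---|---|---|---|---|---|---|---|---|
| transformer | w | 21.64 | - | 5.13 | - | - | - | - | - | 11.70 | - |
| transformer | w/o | 21.87 | - | 5.22 | 3.84 | - | 3.96 | 9.70 | - | - | - |
| transformer-u2 | w | - | - | - | - | - | - | - | - | - | - |
| transformer-u2 | w/o | - | - | 6.38 | - | - | - | - | - | - | - |
| conformer | w | 21.33 | - | 4.95 | - | - | - | - | - | - | 50.50 |
| conformer | w/o | 21.59 | - | 5.04 | - | - | - | - | - | - | - |
| conformer-u2 | w | - | - | - | - | - | - | - | - | - | - |
| conformer-u2 | w/o | - | - | 6.29 | - | - | - | - | - | - | - |
| conformer-CTC | w | - | - | - | - | - | - | - | - | - | - |
| conformer-CTC | w/o | - | - | 6.60 | - | - | - | - | - | - | - |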

To compare with other published results, see wer_are_we.md.

For more details on U2, see the ASR readme.
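
The CER% and WER% figures reported above are edit-distance error rates. As a rough illustration only (this is not Athena's scoring code, and the function names `edit_distance` and `error_rate` are hypothetical), the metric can be sketched as the Levenshtein distance between hypothesis and reference, divided by the reference length:

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two token sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def error_rate(ref_tokens, hyp_tokens):
    """Error rate in percent: edits / reference length * 100."""
    if not ref_tokens:
        raise ValueError("reference must not be empty")
    return 100.0 * edit_distance(ref_tokens, hyp_tokens) / len(ref_tokens)


# CER works on characters, WER on whitespace-separated words.
cer = error_rate(list("abcd"), list("abed"))
wer = error_rate("the cat sat".split(), "the cat sat down".split())
```

Here `cer` counts one substitution over four reference characters, and `wer` counts one inserted word over three reference words.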

3.2) TTS

Currently supported TTS tasks are LJSpeech and the Chinese Standard Mandarin Speech Corpus (data_baker). Supported models are shown in the table below. (Note: HiFiGAN is trained based on TensorFlowTTS.)

The performance of Athena-TTS is shown below:

| Training Data | Acoustic Model | Vocoder | Audio Demo |
|---|---|---|---|
| data_baker | Tacotron2 | GL | audio_demo |
| data_baker | Transformer_tts | GL | audio_demo |
| data_baker | Fastspeech | GL | audio_demo |
| data_baker | Fastspeech2 | GL | audio_demo |
| data_baker | Fastspeech2 | HiFiGAN | audio_demo |
| ljspeech | Tacotron2 | GL | audio_demo |

For more details, see the TTS readme.

3.3) VAD

| Task | Model Name | Training Data | Input Segment | Frame Error Rate |
|---|---|---|---|---|
| VAD | DNN | Google Speech Commands Dataset V2 | 0.21s | 8.49% |
| VAD | MarbleNet | Google Speech Commands Dataset V2 | 0.63s | 2.50% |

For more details, see the VAD readme.

3.4) KWS

The performance on the MISP2021 task1 dataset is shown below:

| KWS Type | Model | Model Detail | Data | Loss | Dev | Eval |
|---|---|---|---|---|---|---|
| Streaming | CNN-DNN | 2 Conv+3 Dense | 60h pos+200h neg | CE | 0.314 | / |
| E2E | CRNN | 2 Conv+2 biGRU | 60h pos+200h neg | CE | 0.209 | / |
| E2E | CRNN | Conv+5 biLSTM | 60h pos+200h neg | CE | 0.186 | / |
| E2E | CRNN | Conv+5 biLSTM | 170h pos+530h neg | CE | 0.178 | / |
| E2E | A-Transformer | Conv+4 encoders+1 Dense | 170h pos+530h neg | CE&Focal | 0.109 | 0.106 |
| E2E | A-Conformer | Conv+4 encoders+1 Dense | 170h pos+530h neg | CE&Focal | 0.105 | 0.116 |
| E2E | AV-Transformer | 2 Conv+4 AV-encoders+1Dense | A(170h pos+530h neg)+V(Far 124h) | CE | 0.132 | / |

For more details, see the KWS readme.

3.5) CTC-Alignment

The CTC alignment result for one utterance is shown below; the output of the CTC alignment is delayed in time:


For more details, see the Alignment readme.

3.6) Deploy

Athena-V2.0 deployment only supports ASR. All experiments are conducted on a 2.10 GHz CPU machine with 104 logical cores. We evaluate the performance on the AISHELL dataset. The results are shown below:

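| Logic Core | Decoder Type | Beamsize | RTF | Character Accuracy |
|---|---|---|---|---|
| 1 | BeamSearch | 1 | 0.0881 | 92.65% |
| 1 | BeamSearch | 10 | 0.2534 | 93.07% |
| 1 | BeamSearch | 20 | 0.4537 | 93.06% |
| 10 | BeamSearch | 1 | 0.04792 | 92.65% |
| 10 | BeamSearch | 10 | 0.1135 | 93.07% |
| 10 | BeamSearch | 20 | 0.1746 | 93.06% |
| 1 | CTC Prefix BeamSearch | 1 | 0.0543 | 93.60% |
| 1 | CTC Prefix BeamSearch | 10 | 0.06 | 93.60% |
| 1 | CTC Prefix BeamSearch | 20 | 0.0903 | 93.60% |
| 10 | CTC Prefix BeamSearch | 1 | 0.0283 | 93.60% |
| 10 | CTC Prefix BeamSearch | 10 | 0.038 | 93.60% |
| 10 | CTC Prefix BeamSearch | 20 | 0.0641 | 93.60% |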

For more details, see the Runtime readme.
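
RTF (real-time factor) is conventionally the decoding time divided by the audio duration, so values below 1 mean faster than real time. The sketch below uses RTF values copied from the results above for beam size 10; the helper names are hypothetical, and the 10-second utterance is a made-up example rather than a measurement.

```python
def real_time_factor(processing_seconds, audio_seconds):
    """RTF = time spent decoding / duration of the decoded audio."""
    if audio_seconds <= 0:
        raise ValueError("audio duration must be positive")
    return processing_seconds / audio_seconds


def speedup(rtf_before, rtf_after):
    """How many times faster the second configuration is."""
    return rtf_before / rtf_after


# RTF values from the deployment results (beam size 10).
beam_search = {1: 0.2534, 10: 0.1135}  # logical cores -> RTF
ctc_prefix = {1: 0.06, 10: 0.038}      # logical cores -> RTF

# At RTF 0.2534, a 10-second utterance takes roughly 2.5 s to decode.
decode_seconds = beam_search[1] * 10.0

core_gain = speedup(beam_search[1], beam_search[10])   # 1 -> 10 cores
decoder_gain = speedup(beam_search[1], ctc_prefix[1])  # same core count
```

On these numbers, going from 1 to 10 cores gives a bit more than a 2x gain for BeamSearch, while switching to CTC Prefix BeamSearch on a single core gives a bit more than 4x.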

4) Run demo

A quick way to try Athena is as follows:

cd athena
source tools/env.sh
#ASR test
# Batch decoding test
python athena/run_demo.py --inference_type asr --saved_model_dir examples/asr/aishell/models/freeze_prefix_beam-20220620 --wav_list test.lst
# One wav test
python athena/run_demo.py --inference_type asr --saved_model_dir examples/asr/aishell/models/freeze_prefix_beam-20220620 --wav_dir aishell/wav/test/S0764/BAC009S0764W0121.wav

#TTS test
python athena/run_demo.py --inference_type tts --text_csv examples/tts/data_baker/test/test.csv --saved_model_dir athena-model-zoo/tts/data_baker/saved_model  

Pre-trained models are available at: Athena-model-zoo

More examples can be found at:

ASR examples

TTS examples

VAD examples

KWS examples

Alignment examples

C++ Decoder

Server

5) Supported Model architectures and reference

Athena-v2.0 supports these architectures:

| Model Name | Task | Referenced Papers |
|---|---|---|
| Transformer | ASR | Dong L, Xu S, Xu B. Speech-transformer: a no-recurrence sequence-to-sequence model for speech recognition[C]//2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2018: 5884-5888. |
| Conformer | ASR | Gulati A, Qin J, Chiu C C, et al. Conformer: Convolution-augmented transformer for speech recognition[J]. arXiv preprint arXiv:2005.08100, 2020. |
| Transformer-U2 | ASR | Yao Z, Wu D, Wang X, et al. Wenet: Production oriented streaming and non-streaming end-to-end speech recognition toolkit[J]. arXiv preprint arXiv:2102.01547, 2021. |
| Conformer-U2 | ASR | Yao Z, Wu D, Wang X, et al. Wenet: Production oriented streaming and non-streaming end-to-end speech recognition toolkit[J]. arXiv preprint arXiv:2102.01547, 2021. |
| AV_Transformer | ASR | |
| AV_Conformer | ASR | |
| Fastspeech | TTS | Ren Y, Ruan Y, Tan X, et al. Fastspeech: Fast, robust and controllable text to speech[J]. Advances in Neural Information Processing Systems, 2019, 32. |
| Fastspeech2 | TTS | Ren Y, Hu C, Tan X, et al. Fastspeech 2: Fast and high-quality end-to-end text to speech[J]. arXiv preprint arXiv:2006.04558, 2020. |
| Tacotron2 | TTS | Shen J, Pang R, Weiss R J, et al. Natural tts synthesis by conditioning wavenet on mel spectrogram predictions[C]//2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2018: 4779-4783. |
| TTS_Transformer | TTS | Li N, Liu S, Liu Y, et al. Neural speech synthesis with transformer network[C]//Proceedings of the AAAI Conference on Artificial Intelligence. 2019, 33(01): 6706-6713. |
| Marblenet | VAD | Jia F, Majumdar S, Ginsburg B. Marblenet: Deep 1d time-channel separable convolutional neural network for voice activity detection[C]//ICASSP 2021-2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2021: 6818-6822. |
| DNN | VAD | Tashev I, Mirsamadi S. DNN-based causal voice activity detector[C]//Information Theory and Applications Workshop. 2016. |
| CNN-DNN, CRNN, A-Transformer, A-Conformer, AV-Transformer | KWS | Xu Y, Sun J, Han Y, et al. Audio-Visual Wake Word Spotting System for MISP Challenge 2021[C]//ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2022: 9246-9250. |

6) Directory Structure

Below is the basic directory structure for Athena

|-- Athena
|   |-- data  # - root directory for input-related operations
|   |   |-- datasets  # custom datasets for ASR, TTS and pre-training
|   |-- layers  # some layers
|   |-- models  # some models
|   |-- tools # contains various tools, e.g. decoding tools
|   |-- transform # custom featurizer based on C++
|   |   |-- feats
|   |   |   |-- ops # c++ code on tensorflow ops
|   |-- utils # utils, e.g. checkpoint, learning_rate, metric, etc.
|-- docker
|-- docs  # docs
|-- examples  # example scripts for ASR, TTS, etc
|   |-- asr  # each subdirectory contains a data preparation script and a run script for the task
|   |   |-- aishell
|   |   |-- hkust
|   |   |-- librispeech
|   |   |-- gigaspeech
|   |   |-- misp
|   |-- kws  # wake word spotting examples
|   |   |-- misp
|   |   |-- xtxt
|   |   |-- yesno
|   |-- tts  # TTS examples
|   |   |-- data_baker
|   |   |-- ljspeech
|   |-- vad  # VAD example
|       |-- google_dataset_v2
|-- tools  # need to source env.sh before training

7) Acknowledgement

We want to thank ESPnet, WeNet, TensorFlowTTS, NeMo, etc. These great projects have given us many references and much inspiration!

athena's People

Contributors

chenguoguo, cookingbear, dependabot[bot], garygao99, huang17, hyx100e, jianweisun007, leeyouxie, leixiaoning, neneluo, shuaijiang, shuaijiangke, some-random, studyself, teapoly, tjadamlee, trellixvulnteam, zouwei02

athena's Issues

How to Transfer Learning?

Hi, I wonder how to use a pretrained model on a new dataset. The new data is totally different.
Am I supposed to use the MPC model? If so, do I need to recalculate CMVN for the new dataset?
Or can I skip the MPC stage and fine-tune the later models (i.e., SpeechTransformer and RNNLM) with the new data?
Please give me some guidance. Thanks a lot.

A little bug in decode code while restoring ckpt

Checkpoints are saved starting from index 1 by default, whereas the restoring code reads ckpt files starting from index 0. In athena/decode_main.py, line 55 could be changed to `ckpt_path = p.ckpt + 'ckpt-' + str(idx + 1)`.
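
To make the off-by-one concrete, here is a minimal sketch of mapping a zero-based loop index to a one-based checkpoint name. The helper `checkpoint_path` and the directory `ckpts/demo` are hypothetical; only the `ckpt-` prefix and the `idx + 1` fix come from the report above.

```python
def checkpoint_path(ckpt_dir, idx, first_index=1):
    """Map a zero-based position to an on-disk checkpoint prefix.

    `first_index` is the number given to the first saved checkpoint;
    the report above says saving starts at 1.
    """
    return f"{ckpt_dir.rstrip('/')}/ckpt-{idx + first_index}"


# Hypothetical directory, used only for illustration.
paths = [checkpoint_path("ckpts/demo/", idx) for idx in range(3)]
```

With the default `first_index=1`, the list starts at `ckpt-1` and never asks for a nonexistent `ckpt-0`.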

Error of decoding stage

Hi,
When I was running the decoding stage, I got such error message:

<<<
Traceback (most recent call last):
File "athena/decode_main.py", line 87, in <module>
decode(jsonfile, n=5, log_file='nohup.out')
File "athena/decode_main.py", line 65, in decode
v = tf.reduce_mean(tf.concat(v,axis=0),axis=0)
File "/8T_raid/user/venv_athena/lib/python3.5/site-packages/tensorflow_core/python/util/dispatch.py", line 180, in wrapper
return target(*args, **kwargs)
File "/8T_raid/user/venv_athena/lib/python3.5/site-packages/tensorflow_core/python/ops/array_ops.py", line 1431, in concat
return gen_array_ops.concat_v2(values=values, axis=axis, name=name)
File "/8T_raid/user/venv_athena/lib/python3.5/site-packages/tensorflow_core/python/ops/gen_array_ops.py", line 1249, in concat_v2
_six.raise_from(_core._status_to_exception(e.code, message), None)
File "<string>", line 3, in raise_from
tensorflow.python.framework.errors_impl.InvalidArgumentError: OpKernel 'ConcatV2' has constraint on attr 'T' not in
NodeDef '[N=0, Tidx=DT_INT32]', KernelDef: 'op: "ConcatV2" device_type: "GPU" constraint { name: "T" allowed_values
{ list { type: DT_INT32 } } } host_memory_arg: "values" host_memory_arg: "axis" host_memory_arg: "output"' [Op:ConcatV2] name: concat

A bug in rnnlm.json in librispeech example

"dataset_builder": "language_dataset",
"num_data_threads": 1,
"trainset_config":{
"data_csv":"/nfs/cold_project/zhangruixiong/athena_librispeech/examples/asr/librispeech/data/train-speaker-id.trans.csv",
"input_text_config":{"type":"vocab", "model":"/nfs/cold_project/zhangruixiong/athena_librispeech/examples/asr/aishell/data/vocab"},
"output_text_config":{"type":"vocab", "model":"/nfs/cold_project/zhangruixiong/athena_librispeech/examples/asr/aishell/data/vocab"}
},
"devset_config":{
"data_csv":"/nfs/cold_project/zhangruixiong/athena_librispeech/examples/asr/librispeech/data/test-clean-speaker-id.trans.csv",
"input_text_config":{"type":"vocab", "model":"/nfs/cold_project/zhangruixiong/athena_librispeech/examples/asr/aishell/data/vocab"},
"output_text_config":{"type":"vocab", "model":"/nfs/cold_project/zhangruixiong/athena_librispeech/examples/asr/aishell/data/vocab"}
}
}
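
The issue gives no description, but one thing that stands out in the fragment is that the LibriSpeech config points its vocab `model` entries at an `aishell` directory. Assuming that is the bug being reported, a small checker along these lines could flag paths that belong to a different example. The function names and the shortened paths are hypothetical, and lists inside the config are not traversed in this sketch.

```python
def example_name(path, marker="examples/asr/"):
    """Return the path component right after `marker`, or None."""
    _, found, rest = path.partition(marker)
    return rest.split("/", 1)[0] if found else None


def find_foreign_paths(node, expected, prefix=""):
    """Collect (key path, value) pairs that point at another example."""
    hits = []
    if isinstance(node, dict):
        for key, value in node.items():
            hits += find_foreign_paths(value, expected, f"{prefix}/{key}")
    elif isinstance(node, str):
        name = example_name(node)
        if name is not None and name != expected:
            hits.append((prefix, node))
    return hits


# Shortened, made-up paths that mirror the shape of the config above.
config = {
    "trainset_config": {
        "data_csv": "examples/asr/librispeech/data/train.csv",
        "input_text_config": {
            "type": "vocab",
            "model": "examples/asr/aishell/data/vocab",
        },
    }
}
hits = find_foreign_paths(config, expected="librispeech")
```

Here `hits` contains the single `model` entry under `input_text_config`, since it is the only path whose example directory is not `librispeech`.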

Accelerate decoding

Beam search with CTC joint decoding is really slow; we need to accelerate it. Two solutions off the top of my head:

1. Split the test set into smaller pieces, then divide-and-conquer with Horovod
2. Use tf.function-compatible code to rewrite the CTC joint decoding part
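
The first idea, splitting the test set across workers, could look roughly like the sketch below. It is plain Python and not taken from Athena's code; in a real Horovod run, `rank` and `size` would presumably come from `hvd.rank()` and `hvd.size()`. The `shard` helper and the utterance IDs are hypothetical.

```python
def shard(items, rank, size):
    """Return the round-robin slice of `items` owned by worker `rank`."""
    if size <= 0 or not 0 <= rank < size:
        raise ValueError("need size > 0 and 0 <= rank < size")
    return items[rank::size]


# Ten made-up utterance IDs split across four workers.
utterances = [f"utt{i:03d}" for i in range(10)]
shards = [shard(utterances, rank, 4) for rank in range(4)]
```

Each utterance lands in exactly one shard, so per-worker results can be concatenated afterwards to score the full test set.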

@cookingbear please look into this

redefine SpeechTransformer2

@leixiaoning use the new approach to implement the SpeechTransformer2 call function, in which we forward twice: the first forward pass generates the predicted target, and the second forward pass integrates the ground truth, so that we can enable scheduled sampling
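
A toy sketch of the two-pass idea, in plain Python with no model involved: the first pass yields predicted tokens, and the decoder input for the second pass mixes them with the ground truth. All names are hypothetical, and treating the rate as the probability of keeping the ground-truth token is an assumption; Athena may define it the other way round.

```python
import random


def mix_targets(ground_truth, predicted, keep_truth_prob, rng):
    """Pick, per position, the ground-truth or the predicted token."""
    if len(ground_truth) != len(predicted):
        raise ValueError("sequences must have the same length")
    return [truth if rng.random() < keep_truth_prob else pred
            for truth, pred in zip(ground_truth, predicted)]


rng = random.Random(0)
truth = [5, 8, 2, 9]
first_pass = [5, 7, 2, 4]  # stand-in for the first forward pass output
second_pass_input = mix_targets(truth, first_pass, 0.9, rng)
```

With a probability of 1.0 this reduces to ordinary teacher forcing, and with 0.0 the second pass sees only the model's own predictions.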

CUDA_ERROR_ILLEGAL_ADDRESS: an illegal memory access was encountered

Hi, I use mtl_transformer_sp.json for the fine-tuning stage. It finished epoch 0, but an error occurs at some point in epoch 1.
BTW, when I run it without "speed_permutation": [0.9, 1.0, 1.1], it works.

Here's the command:
horovodrun -np 4 -H localhost:4 python athena/horovod_main.py examples/asr/seewo/configs/mtl_transformer_sp.json
Here's the config:
`
{
"batch_size":24,
"num_epochs":50,
"sorta_epoch":1,
"ckpt":"examples/asr/mine/ckpts/mtl_transformer_ctc_sp/",
"summary_dir":"examples/asr/mine/ckpts/mtl_transformer_ctc_sp/event",

"solver_gpu":[0],
"solver_config":{
"clip_norm":100,
"log_interval":10,
"enable_tf_function":true
},

"model":"mtl_transformer_ctc",
"num_classes": null,
"pretrained_model": "examples/asr/mine/configs/mpc.json",
"model_config":{
"model":"speech_transformer",
"model_config":{
"return_encoder_output":true,
"num_filters":512,
"d_model":512,
"num_heads":8,
"num_encoder_layers":12,
"num_decoder_layers":6,
"dff":1280,
"rate":0.1,
"label_smoothing_rate":0.0,
"schedual_sampling_rate":0.9
},
"mtl_weight":0.5
},

"decode_config":{
"beam_search":true,
"beam_size":10,
"ctc_weight":0.5,
"lm_weight":0.7,
"lm_type": "rnn",
"lm_path":"examples/asr/mine/configs/rnnlm.json"
},

"optimizer":"warmup_adam",
"optimizer_config":{
"d_model":512,
"warmup_steps":25000,
"k":1.0
},

"dataset_builder": "speech_recognition_dataset",
"num_data_threads": 1,
"trainset_config":{
"data_csv": "examples/asr/mine/data/train.csv",
"audio_config":{"type":"Fbank", "filterbank_channel_count":40},
"cmvn_file":"examples/asr/mine/data/cmvn",
"text_config": {"type":"vocab", "model":"examples/asr/mine/data/vocab"},
"speed_permutation": [0.9, 1.0, 1.1],
"input_length_range":[10, 8000]
},
"devset_config":{
"data_csv": "examples/asr/mine/data/dev.csv",
"audio_config":{"type":"Fbank", "filterbank_channel_count":40},
"cmvn_file":"examples/asr/mine/data/cmvn",
"text_config": {"type":"vocab", "model":"examples/asr/mine/data/vocab"},
"input_length_range":[10, 8000]
},
"testset_config":{
"data_csv": "examples/asr/mine/data/dev.csv",
"audio_config":{"type":"Fbank", "filterbank_channel_count":40},
"cmvn_file":"examples/asr/mine/data/cmvn",
"text_config": {"type":"vocab", "model":"examples/asr/mine/data/vocab"}
}
}

`
And here's the log:

[1,0]:INFO:absl:global_steps: 15314 learning_rate: 1.7122e-04 loss: 3.6860 Accuracy: 0.8822 CTCAccuracy: 0.7879 sec/iter: 0.7007
[1,0]:INFO:absl:global_steps: 15324 learning_rate: 1.7133e-04 loss: 9.4535 Accuracy: 0.8708 CTCAccuracy: 0.7742 sec/iter: 0.6727
[1,0]:INFO:absl:global_steps: 15334 learning_rate: 1.7144e-04 loss: 5.3108 Accuracy: 0.8724 CTCAccuracy: 0.7839 sec/iter: 0.6807
[1,0]:INFO:absl:global_steps: 15344 learning_rate: 1.7155e-04 loss: 21.0516 Accuracy: 0.8447 CTCAccuracy: 0.7498 sec/iter: 0.6419
[1,0]:INFO:absl:global_steps: 15354 learning_rate: 1.7166e-04 loss: 11.5386 Accuracy: 0.8252 CTCAccuracy: 0.7421 sec/iter: 0.6913
[1,0]:INFO:absl:global_steps: 15364 learning_rate: 1.7177e-04 loss: 9.2761 Accuracy: 0.8529 CTCAccuracy: 0.7660 sec/iter: 0.7714
[1,0]:INFO:absl:global_steps: 15374 learning_rate: 1.7189e-04 loss: 6.6673 Accuracy: 0.8661 CTCAccuracy: 0.7800 sec/iter: 0.8484
[1,0]:INFO:absl:global_steps: 15384 learning_rate: 1.7200e-04 loss: 8.5238 Accuracy: 0.8668 CTCAccuracy: 0.7932 sec/iter: 0.8024
[1,0]:INFO:absl:global_steps: 15394 learning_rate: 1.7211e-04 loss: 10.3194 Accuracy: 0.8655 CTCAccuracy: 0.7854 sec/iter: 0.6500
[1,0]:INFO:absl:global_steps: 15404 learning_rate: 1.7222e-04 loss: 7.6797 Accuracy: 0.8471 CTCAccuracy: 0.7623 sec/iter: 0.6240
[1,1]:2020-03-28 02:04:52.312873: E tensorflow/stream_executor/cuda/cuda_event.cc:29] Error polling for event status: failed to query event: CUDA_ERROR_ILLEGAL_ADDRESS: an illegal memory access was encountered
[1,1]:2020-03-28 02:04:52.312922: F tensorflow/core/common_runtime/gpu/gpu_event_mgr.cc:273] Unexpected Event status: 1
[1,1]:[ef53b0d505e4:40836] *** Process received signal ***
[1,1]:[ef53b0d505e4:40836] Signal: Aborted (6)
[1,1]:[ef53b0d505e4:40836] Signal code: (-6)
[1,1]:[ef53b0d505e4:40836] [ 0] [1,1]:/lib/x86_64-linux-gnu/libc.so.6(+0x3ef20)[0x7f223a69ef20]
[1,1]:[ef53b0d505e4:40836] [ 1] [1,1]:/lib/x86_64-linux-gnu/libc.so.6(gsignal+0xc7)[0x7f223a69ee97]
[1,1]:[ef53b0d505e4:40836] [ 2] [1,1]:/lib/x86_64-linux-gnu/libc.so.6(abort+0x141)[0x7f223a6a0801]
[1,1]:[ef53b0d505e4:40836] [ 3] [1,1]:/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/_pywrap_tensorflow_internal.so(+0x88d59b4)[0x7f2187b219b4]
[1,1]:[ef53b0d505e4:40836] [ 4] [1,1]:/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/_pywrap_tensorflow_internal.so(_ZN10tensorflow8EventMgr10PollEventsEbPN4absl13InlinedVectorINS0_5InUseELm4ESaIS3_EEE+0x207)[0x7f2187a8d357]
[1,1]:[ef53b0d505e4:40836] [ 5] [1,1]:/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/_pywrap_tensorflow_internal.so(_ZN10tensorflow8EventMgr8PollLoopEv+0x9f)[0x7f2187a8dbef]
[1,1]:[ef53b0d505e4:40836] [ 6] [1,1]:/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/../libtensorflow_framework.so.2(_ZN5Eigen15ThreadPoolTemplIN10tensorflow6thread16EigenEnvironmentEE10WorkerLoopEi+0x281)[0x7f217e5718b1]
[1,1]:[ef53b0d505e4:40836] [ 7] [1,1]:/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/../libtensorflow_framework.so.2(_ZNSt17_Function_handlerIFvvEZN10tensorflow6thread16EigenEnvironment12CreateThreadESt8functionIS0_EEUlvE_E9_M_invokeERKSt9_Any_data+0x48)[0x7f217e56efa8]
[1,1]:[ef53b0d505e4:40836] [ 8] [1,1]:/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/../libtensorflow_framework.so.2(+0x167b7cf)[0x7f217ebc87cf]
[1,1]:[ef53b0d505e4:40836] [ 9] /lib/x86_64-linux-gnu/libpthread.so.0(+0x76db)[0x7f223a4486db]
[1,1]:[ef53b0d505e4:40836] [10] [1,1]:/lib/x86_64-linux-gnu/libc.so.6(clone+0x3f)[0x7f223a78188f]
[1,1]:[ef53b0d505e4:40836] *** End of error message ***

Primary job terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.


mpirun noticed that process rank 1 with PID 0 on node ef53b0d505e4 exited on signal 6 (Aborted).
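
As a side note on the log, the `learning_rate` values are consistent with the standard Transformer warm-up schedule, lr = k * d_model^-0.5 * min(step^-0.5, step * warmup_steps^-1.5), evaluated with the `optimizer_config` from the config above (`d_model` 512, `warmup_steps` 25000, `k` 1.0). Whether `warmup_adam` implements exactly this formula is an assumption based on that match, not on Athena's source, and the function name below is hypothetical.

```python
def warmup_lr(step, d_model=512, warmup_steps=25000, k=1.0):
    """Transformer-style warm-up learning rate at a given global step."""
    if step <= 0:
        raise ValueError("step must be positive")
    return k * d_model ** -0.5 * min(step ** -0.5,
                                     step * warmup_steps ** -1.5)


# Two global_steps values taken from the log above.
for step in (15314, 15404):
    print(step, f"{warmup_lr(step):.4e}")
```

Both steps are still inside the 25000-step warm-up, so the rate is growing linearly with the step count, which matches the slow upward drift of `learning_rate` in the log.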

aishell scripts

Update the aishell examples.

  1. We should start with downloading the data in prepare_data.py
  2. When running run.sh, we get some results; please write those results down in the README
  3. ...

aishell decode error: tensorflow.python.framework.errors_impl.InvalidArgumentError: OpKernel 'ConcatV2'....

When I run the aishell decoding scripts independently, I get the error message below:


......
None
best_wer_checkpoint:
[]
Traceback (most recent call last):
File "athena/decode_main.py", line 87, in <module>
decode(jsonfile, n=5, log_file='nohup.out')
File "athena/decode_main.py", line 65, in decode
v = tf.reduce_mean(tf.concat(v,axis=0),axis=0)
File "/home/bh/anaconda3/lib/python3.7/site-packages/tensorflow_core/python/util/dispatch.py", line 180, in wrapper
return target(*args, **kwargs)
File "/home/bh/anaconda3/lib/python3.7/site-packages/tensorflow_core/python/ops/array_ops.py", line 1517, in concat
return gen_array_ops.concat_v2(values=values, axis=axis, name=name)
File "/home/bh/anaconda3/lib/python3.7/site-packages/tensorflow_core/python/ops/gen_array_ops.py", line 1118, in concat_v2
_ops.raise_from_not_ok_status(e, name)
File "/home/bh/anaconda3/lib/python3.7/site-packages/tensorflow_core/python/framework/ops.py", line 6606, in raise_from_not_ok_status
six.raise_from(core._status_to_exception(e.code, message), None)
File "<string>", line 3, in raise_from
tensorflow.python.framework.errors_impl.InvalidArgumentError: OpKernel 'ConcatV2' has constraint on attr 'T' not in NodeDef '[N=0, Tidx=DT_INT32]', KernelDef: 'op: "ConcatV2" device_type: "GPU" constraint { name: "T" allowed_values { list { type: DT_INT32 } } } host_memory_arg: "values" host_memory_arg: "axis" host_memory_arg: "output"' [Op:ConcatV2] name: concat
WARNING:tensorflow:Unresolved object in checkpoint: (root).optimizer.iter
......
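
The `N=0` in the error, together with the empty `best_wer_checkpoint: []` list printed just before the traceback, suggests that the concat was called on an empty list, i.e. no checkpoints were selected for averaging. That is an inference from the log, not a confirmed diagnosis. A defensive sketch of the averaging step, written in plain Python rather than TensorFlow and using hypothetical names:

```python
def average_variable(values):
    """Average one variable element-wise across checkpoints.

    `values` holds one flat list of floats per selected checkpoint.
    """
    if not values:
        raise ValueError(
            "no checkpoints selected for averaging; check that training "
            "wrote checkpoints and that the metric log is not empty")
    count = len(values)
    return [sum(column) / count for column in zip(*values)]


avg = average_variable([[1.0, 2.0], [3.0, 4.0]])
```

Failing early with a clear message makes the empty-list case easier to spot than the `ConcatV2` kernel error above.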

incorrect lm_path in aishell example config

lm_path in decode_config is incorrect in examples/asr/aishell/configs/mtl_transformer_sp.json.

examples/asr/aishell/rnnlm.json => examples/asr/aishell/configs/rnnlm.json

error from: pip install -r requirements

environment: GCC=8.3.0, python=3.7.4
Error:
(venv_athena) (base) root@12e6d012d4d5:~/luxy/athena# pip install -r requirements.txt
Looking in indexes: https://pypi.tuna.tsinghua.edu.cn/simple
Requirement already satisfied: tensorflow-gpu==2.0.1 in /root/luxy/venv_athena/lib/python3.7/site-packages (from -r requirements.txt (line 1)) (2.0.1)
Requirement already satisfied: sox in /root/luxy/venv_athena/lib/python3.7/site-packages (from -r requirements.txt (line 2)) (1.3.7)
Requirement already satisfied: absl-py in /root/luxy/venv_athena/lib/python3.7/site-packages (from -r requirements.txt (line 3)) (0.9.0)
Requirement already satisfied: yapf in /root/luxy/venv_athena/lib/python3.7/site-packages (from -r requirements.txt (line 4)) (0.29.0)
Requirement already satisfied: pylint in /root/luxy/venv_athena/lib/python3.7/site-packages (from -r requirements.txt (line 5)) (2.4.4)
Requirement already satisfied: flake8 in /root/luxy/venv_athena/lib/python3.7/site-packages (from -r requirements.txt (line 6)) (3.7.9)
Collecting horovod
Using cached https://pypi.tuna.tsinghua.edu.cn/packages/c0/31/dae1f224a284ccaf0fd700565a53658bfba9c3d5964719305953e72a11e0/horovod-0.19.1.tar.gz (2.9 MB)
Collecting tqdm
Using cached https://pypi.tuna.tsinghua.edu.cn/packages/4a/1c/6359be64e8301b84160f6f6f7936bbfaaa5e9a4eab6cbc681db07600b949/tqdm-4.45.0-py2.py3-none-any.whl (60 kB)
Collecting sentencepiece
Using cached https://pypi.tuna.tsinghua.edu.cn/packages/11/e0/1264990c559fb945cfb6664742001608e1ed8359eeec6722830ae085062b/sentencepiece-0.1.85-cp37-cp37m-manylinux1_x86_64.whl (1.0 MB)
Processing /root/.cache/pip/wheels/6e/d3/47/7582e7e63ee9127f4773adeb8dcd8490771c063e2607354ba0/librosa-0.7.2-py3-none-any.whl
Collecting kenlm
Using cached https://pypi.tuna.tsinghua.edu.cn/packages/57/54/0cc492b8d7aceb17a9164c6e6b9c9afc2c73706bb39324e8f6fa02f7134a/kenlm-0.tar.gz (1.4 MB)
Processing /root/.cache/pip/wheels/95/1a/6d/75355e7a5c76ed48e2d6cde3b95c4828e83274b93f5392ac96/jieba-0.42.1-py3-none-any.whl
Collecting pandas
Using cached https://pypi.tuna.tsinghua.edu.cn/packages/4a/6a/94b219b8ea0f2d580169e85ed1edc0163743f55aaeca8a44c2e8fc1e344e/pandas-1.0.3-cp37-cp37m-manylinux1_x86_64.whl (10.0 MB)
Requirement already satisfied: keras-applications>=1.0.8 in /root/luxy/venv_athena/lib/python3.7/site-packages (from tensorflow-gpu==2.0.1->-r requirements.txt (line 1)) (1.0.8)
Requirement already satisfied: protobuf>=3.6.1 in /root/luxy/venv_athena/lib/python3.7/site-packages (from tensorflow-gpu==2.0.1->-r requirements.txt (line 1)) (3.11.3)
Requirement already satisfied: tensorflow-estimator<2.1.0,>=2.0.0 in /root/luxy/venv_athena/lib/python3.7/site-packages (from tensorflow-gpu==2.0.1->-r requirements.txt (line 1)) (2.0.1)
Requirement already satisfied: termcolor>=1.1.0 in /root/luxy/venv_athena/lib/python3.7/site-packages (from tensorflow-gpu==2.0.1->-r requirements.txt (line 1)) (1.1.0)
Requirement already satisfied: tensorboard<2.1.0,>=2.0.0 in /root/luxy/venv_athena/lib/python3.7/site-packages (from tensorflow-gpu==2.0.1->-r requirements.txt (line 1)) (2.0.2)
Requirement already satisfied: google-pasta>=0.1.6 in /root/luxy/venv_athena/lib/python3.7/site-packages (from tensorflow-gpu==2.0.1->-r requirements.txt (line 1)) (0.2.0)
Requirement already satisfied: wrapt>=1.11.1 in /root/luxy/venv_athena/lib/python3.7/site-packages (from tensorflow-gpu==2.0.1->-r requirements.txt (line 1)) (1.12.1)
Requirement already satisfied: grpcio>=1.8.6 in /root/luxy/venv_athena/lib/python3.7/site-packages (from tensorflow-gpu==2.0.1->-r requirements.txt (line 1)) (1.28.1)
Requirement already satisfied: gast==0.2.2 in /root/luxy/venv_athena/lib/python3.7/site-packages (from tensorflow-gpu==2.0.1->-r requirements.txt (line 1)) (0.2.2)
Requirement already satisfied: numpy<2.0,>=1.16.0 in /root/luxy/venv_athena/lib/python3.7/site-packages (from tensorflow-gpu==2.0.1->-r requirements.txt (line 1)) (1.18.2)
Requirement already satisfied: six>=1.10.0 in /root/luxy/venv_athena/lib/python3.7/site-packages (from tensorflow-gpu==2.0.1->-r requirements.txt (line 1)) (1.14.0)
Requirement already satisfied: opt-einsum>=2.3.2 in /root/luxy/venv_athena/lib/python3.7/site-packages (from tensorflow-gpu==2.0.1->-r requirements.txt (line 1)) (3.2.0)
Requirement already satisfied: astor>=0.6.0 in /root/luxy/venv_athena/lib/python3.7/site-packages (from tensorflow-gpu==2.0.1->-r requirements.txt (line 1)) (0.8.1)
Requirement already satisfied: wheel>=0.26; python_version >= "3" in /root/luxy/venv_athena/lib/python3.7/site-packages (from tensorflow-gpu==2.0.1->-r requirements.txt (line 1)) (0.34.2)
Requirement already satisfied: keras-preprocessing>=1.0.5 in /root/luxy/venv_athena/lib/python3.7/site-packages (from tensorflow-gpu==2.0.1->-r requirements.txt (line 1)) (1.1.0)
Requirement already satisfied: mccabe<0.7,>=0.6 in /root/luxy/venv_athena/lib/python3.7/site-packages (from pylint->-r requirements.txt (line 5)) (0.6.1)
Requirement already satisfied: isort<5,>=4.2.5 in /root/luxy/venv_athena/lib/python3.7/site-packages (from pylint->-r requirements.txt (line 5)) (4.3.21)
Requirement already satisfied: astroid<2.4,>=2.3.0 in /root/luxy/venv_athena/lib/python3.7/site-packages (from pylint->-r requirements.txt (line 5)) (2.3.3)
Requirement already satisfied: pycodestyle<2.6.0,>=2.5.0 in /root/luxy/venv_athena/lib/python3.7/site-packages (from flake8->-r requirements.txt (line 6)) (2.5.0)
Requirement already satisfied: pyflakes<2.2.0,>=2.1.0 in /root/luxy/venv_athena/lib/python3.7/site-packages (from flake8->-r requirements.txt (line 6)) (2.1.1)
Requirement already satisfied: entrypoints<0.4.0,>=0.3.0 in /root/luxy/venv_athena/lib/python3.7/site-packages (from flake8->-r requirements.txt (line 6)) (0.3)
Requirement already satisfied: cloudpickle in /root/luxy/venv_athena/lib/python3.7/site-packages (from horovod->-r requirements.txt (line 7)) (1.3.0)
Requirement already satisfied: psutil in /root/luxy/venv_athena/lib/python3.7/site-packages (from horovod->-r requirements.txt (line 7)) (5.7.0)
Requirement already satisfied: pyyaml in /root/luxy/venv_athena/lib/python3.7/site-packages (from horovod->-r requirements.txt (line 7)) (5.3.1)
Requirement already satisfied: cffi>=1.4.0 in /root/luxy/venv_athena/lib/python3.7/site-packages (from horovod->-r requirements.txt (line 7)) (1.14.0)
Processing /root/.cache/pip/wheels/ad/c3/72/f5733d5e4abc9a637c9f6834a1a29429b4cd57b30a4585f91a/resampy-0.2.2-py3-none-any.whl
Processing /root/.cache/pip/wheels/0a/af/f6/aa7eefaad4a35a4f78adbfa0c2a99c53fda489e48132b037e4/audioread-2.1.8-py3-none-any.whl
Collecting decorator>=3.0.0
Using cached https://pypi.tuna.tsinghua.edu.cn/packages/ed/1b/72a1821152d07cf1d8b6fce298aeb06a7eb90f4d6d41acec9861e7cc6df0/decorator-4.4.2-py2.py3-none-any.whl (9.2 kB)
Collecting numba>=0.43.0
Using cached https://pypi.tuna.tsinghua.edu.cn/packages/a6/91/3af4fcbe6f9c05f5d04d08b955f635fc9e3388b751a7f0af18e71809e10a/numba-0.48.0-cp37-cp37m-manylinux1_x86_64.whl (2.5 MB)
Collecting soundfile>=0.9.0
Using cached https://pypi.tuna.tsinghua.edu.cn/packages/eb/f2/3cbbbf3b96fb9fa91582c438b574cff3f45b29c772f94c400e2c99ef5db9/SoundFile-0.10.3.post1-py2.py3-none-any.whl (21 kB)
Collecting scikit-learn!=0.19.0,>=0.14.0
Using cached https://pypi.tuna.tsinghua.edu.cn/packages/41/b6/126263db075fbcc79107749f906ec1c7639f69d2d017807c6574792e517e/scikit_learn-0.22.2.post1-cp37-cp37m-manylinux1_x86_64.whl (7.1 MB)
Collecting scipy>=1.0.0
Using cached https://pypi.tuna.tsinghua.edu.cn/packages/dd/82/c1fe128f3526b128cfd185580ba40d01371c5d299fcf7f77968e22dfcc2e/scipy-1.4.1-cp37-cp37m-manylinux1_x86_64.whl (26.1 MB)
Collecting joblib>=0.12
Using cached https://pypi.tuna.tsinghua.edu.cn/packages/28/5c/cf6a2b65a321c4a209efcdf64c2689efae2cb62661f8f6f4bb28547cf1bf/joblib-0.14.1-py2.py3-none-any.whl (294 kB)
Collecting pytz>=2017.2
Using cached https://pypi.tuna.tsinghua.edu.cn/packages/e7/f9/f0b53f88060247251bf481fa6ea62cd0d25bf1b11a87888e53ce5b7c8ad2/pytz-2019.3-py2.py3-none-any.whl (509 kB)
Collecting python-dateutil>=2.6.1
Using cached https://pypi.tuna.tsinghua.edu.cn/packages/d4/70/d60450c3dd48ef87586924207ae8907090de0b306af2bce5d134d78615cb/python_dateutil-2.8.1-py2.py3-none-any.whl (227 kB)
Requirement already satisfied: h5py in /root/luxy/venv_athena/lib/python3.7/site-packages (from keras-applications>=1.0.8->tensorflow-gpu==2.0.1->-r requirements.txt (line 1)) (2.10.0)
Requirement already satisfied: setuptools in /root/luxy/venv_athena/lib/python3.7/site-packages (from protobuf>=3.6.1->tensorflow-gpu==2.0.1->-r requirements.txt (line 1)) (40.8.0)
Requirement already satisfied: requests<3,>=2.21.0 in /root/luxy/venv_athena/lib/python3.7/site-packages (from tensorboard<2.1.0,>=2.0.0->tensorflow-gpu==2.0.1->-r requirements.txt (line 1)) (2.23.0)
Requirement already satisfied: google-auth<2,>=1.6.3 in /root/luxy/venv_athena/lib/python3.7/site-packages (from tensorboard<2.1.0,>=2.0.0->tensorflow-gpu==2.0.1->-r requirements.txt (line 1)) (1.13.1)
Requirement already satisfied: markdown>=2.6.8 in /root/luxy/venv_athena/lib/python3.7/site-packages (from tensorboard<2.1.0,>=2.0.0->tensorflow-gpu==2.0.1->-r requirements.txt (line 1)) (3.2.1)
Requirement already satisfied: werkzeug>=0.11.15 in /root/luxy/venv_athena/lib/python3.7/site-packages (from tensorboard<2.1.0,>=2.0.0->tensorflow-gpu==2.0.1->-r requirements.txt (line 1)) (1.0.1)
Requirement already satisfied: google-auth-oauthlib<0.5,>=0.4.1 in /root/luxy/venv_athena/lib/python3.7/site-packages (from tensorboard<2.1.0,>=2.0.0->tensorflow-gpu==2.0.1->-r requirements.txt (line 1)) (0.4.1)
Requirement already satisfied: lazy-object-proxy==1.4.* in /root/luxy/venv_athena/lib/python3.7/site-packages (from astroid<2.4,>=2.3.0->pylint->-r requirements.txt (line 5)) (1.4.3)
Requirement already satisfied: typed-ast<1.5,>=1.4.0; implementation_name == "cpython" and python_version < "3.8" in /root/luxy/venv_athena/lib/python3.7/site-packages (from astroid<2.4,>=2.3.0->pylint->-r requirements.txt (line 5)) (1.4.1)
Requirement already satisfied: pycparser in /root/luxy/venv_athena/lib/python3.7/site-packages (from cffi>=1.4.0->horovod->-r requirements.txt (line 7)) (2.20)
Collecting llvmlite<0.32.0,>=0.31.0dev0
Using cached https://pypi.tuna.tsinghua.edu.cn/packages/a0/10/d02c0ac683fc47ecda3426249509cf771d748b6a2c0e9d5ebbee76a7b80a/llvmlite-0.31.0-cp37-cp37m-manylinux1_x86_64.whl (20.2 MB)
Requirement already satisfied: urllib3!=1.25.0,!=1.25.1,<1.26,>=1.21.1 in /root/luxy/venv_athena/lib/python3.7/site-packages (from requests<3,>=2.21.0->tensorboard<2.1.0,>=2.0.0->tensorflow-gpu==2.0.1->-r requirements.txt (line 1)) (1.25.8)
Requirement already satisfied: idna<3,>=2.5 in /root/luxy/venv_athena/lib/python3.7/site-packages (from requests<3,>=2.21.0->tensorboard<2.1.0,>=2.0.0->tensorflow-gpu==2.0.1->-r requirements.txt (line 1)) (2.9)
Requirement already satisfied: certifi>=2017.4.17 in /root/luxy/venv_athena/lib/python3.7/site-packages (from requests<3,>=2.21.0->tensorboard<2.1.0,>=2.0.0->tensorflow-gpu==2.0.1->-r requirements.txt (line 1)) (2020.4.5.1)
Requirement already satisfied: chardet<4,>=3.0.2 in /root/luxy/venv_athena/lib/python3.7/site-packages (from requests<3,>=2.21.0->tensorboard<2.1.0,>=2.0.0->tensorflow-gpu==2.0.1->-r requirements.txt (line 1)) (3.0.4)
Requirement already satisfied: pyasn1-modules>=0.2.1 in /root/luxy/venv_athena/lib/python3.7/site-packages (from google-auth<2,>=1.6.3->tensorboard<2.1.0,>=2.0.0->tensorflow-gpu==2.0.1->-r requirements.txt (line 1)) (0.2.8)
Requirement already satisfied: rsa<4.1,>=3.1.4 in /root/luxy/venv_athena/lib/python3.7/site-packages (from google-auth<2,>=1.6.3->tensorboard<2.1.0,>=2.0.0->tensorflow-gpu==2.0.1->-r requirements.txt (line 1)) (4.0)
Requirement already satisfied: cachetools<5.0,>=2.0.0 in /root/luxy/venv_athena/lib/python3.7/site-packages (from google-auth<2,>=1.6.3->tensorboard<2.1.0,>=2.0.0->tensorflow-gpu==2.0.1->-r requirements.txt (line 1)) (4.0.0)
Requirement already satisfied: requests-oauthlib>=0.7.0 in /root/luxy/venv_athena/lib/python3.7/site-packages (from google-auth-oauthlib<0.5,>=0.4.1->tensorboard<2.1.0,>=2.0.0->tensorflow-gpu==2.0.1->-r requirements.txt (line 1)) (1.3.0)
Requirement already satisfied: pyasn1<0.5.0,>=0.4.6 in /root/luxy/venv_athena/lib/python3.7/site-packages (from pyasn1-modules>=0.2.1->google-auth<2,>=1.6.3->tensorboard<2.1.0,>=2.0.0->tensorflow-gpu==2.0.1->-r requirements.txt (line 1)) (0.4.8)
Requirement already satisfied: oauthlib>=3.0.0 in /root/luxy/venv_athena/lib/python3.7/site-packages (from requests-oauthlib>=0.7.0->google-auth-oauthlib<0.5,>=0.4.1->tensorboard<2.1.0,>=2.0.0->tensorflow-gpu==2.0.1->-r requirements.txt (line 1)) (3.1.0)
Building wheels for collected packages: horovod, kenlm
Building wheel for horovod (setup.py) ... error
ERROR: Command errored out with exit status 1:
command: /root/luxy/venv_athena/bin/python -u -c 'import sys, setuptools, tokenize; sys.argv[0] = '"'"'/tmp/pip-install-j394y9lx/horovod/setup.py'"'"'; file='"'"'/tmp/pip-install-j394y9lx/horovod/setup.py'"'"';f=getattr(tokenize, '"'"'open'"'"', open)(file);code=f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, file, '"'"'exec'"'"'))' bdist_wheel -d /tmp/pip-wheel-f6z4f15k
cwd: /tmp/pip-install-j394y9lx/horovod/
Complete output (190 lines):
running bdist_wheel
running build
running build_py
creating build
creating build/lib.linux-x86_64-3.7
creating build/lib.linux-x86_64-3.7/horovod
copying horovod/init.py -> build/lib.linux-x86_64-3.7/horovod
creating build/lib.linux-x86_64-3.7/horovod/mxnet
copying horovod/mxnet/mpi_ops.py -> build/lib.linux-x86_64-3.7/horovod/mxnet
copying horovod/mxnet/init.py -> build/lib.linux-x86_64-3.7/horovod/mxnet
creating build/lib.linux-x86_64-3.7/horovod/tensorflow
copying horovod/tensorflow/compression.py -> build/lib.linux-x86_64-3.7/horovod/tensorflow
copying horovod/tensorflow/util.py -> build/lib.linux-x86_64-3.7/horovod/tensorflow
copying horovod/tensorflow/mpi_ops.py -> build/lib.linux-x86_64-3.7/horovod/tensorflow
copying horovod/tensorflow/init.py -> build/lib.linux-x86_64-3.7/horovod/tensorflow
creating build/lib.linux-x86_64-3.7/horovod/common
copying horovod/common/util.py -> build/lib.linux-x86_64-3.7/horovod/common
copying horovod/common/basics.py -> build/lib.linux-x86_64-3.7/horovod/common
copying horovod/common/init.py -> build/lib.linux-x86_64-3.7/horovod/common
creating build/lib.linux-x86_64-3.7/horovod/run
copying horovod/run/task_fn.py -> build/lib.linux-x86_64-3.7/horovod/run
copying horovod/run/run.py -> build/lib.linux-x86_64-3.7/horovod/run
copying horovod/run/gloo_run.py -> build/lib.linux-x86_64-3.7/horovod/run
copying horovod/run/mpi_run.py -> build/lib.linux-x86_64-3.7/horovod/run
copying horovod/run/run_task.py -> build/lib.linux-x86_64-3.7/horovod/run
copying horovod/run/init.py -> build/lib.linux-x86_64-3.7/horovod/run
creating build/lib.linux-x86_64-3.7/horovod/spark
copying horovod/spark/init.py -> build/lib.linux-x86_64-3.7/horovod/spark
creating build/lib.linux-x86_64-3.7/horovod/_keras
copying horovod/_keras/callbacks.py -> build/lib.linux-x86_64-3.7/horovod/_keras
copying horovod/_keras/init.py -> build/lib.linux-x86_64-3.7/horovod/_keras
creating build/lib.linux-x86_64-3.7/horovod/keras
copying horovod/keras/callbacks.py -> build/lib.linux-x86_64-3.7/horovod/keras
copying horovod/keras/init.py -> build/lib.linux-x86_64-3.7/horovod/keras
creating build/lib.linux-x86_64-3.7/horovod/torch
copying horovod/torch/compression.py -> build/lib.linux-x86_64-3.7/horovod/torch
copying horovod/torch/mpi_ops.py -> build/lib.linux-x86_64-3.7/horovod/torch
copying horovod/torch/init.py -> build/lib.linux-x86_64-3.7/horovod/torch
creating build/lib.linux-x86_64-3.7/horovod/tensorflow/keras
copying horovod/tensorflow/keras/callbacks.py -> build/lib.linux-x86_64-3.7/horovod/tensorflow/keras
copying horovod/tensorflow/keras/init.py -> build/lib.linux-x86_64-3.7/horovod/tensorflow/keras
creating build/lib.linux-x86_64-3.7/horovod/run/task
copying horovod/run/task/task_service.py -> build/lib.linux-x86_64-3.7/horovod/run/task
copying horovod/run/task/init.py -> build/lib.linux-x86_64-3.7/horovod/run/task
creating build/lib.linux-x86_64-3.7/horovod/run/util
copying horovod/run/util/cache.py -> build/lib.linux-x86_64-3.7/horovod/run/util
copying horovod/run/util/network.py -> build/lib.linux-x86_64-3.7/horovod/run/util
copying horovod/run/util/threads.py -> build/lib.linux-x86_64-3.7/horovod/run/util
copying horovod/run/util/init.py -> build/lib.linux-x86_64-3.7/horovod/run/util
creating build/lib.linux-x86_64-3.7/horovod/run/common
copying horovod/run/common/init.py -> build/lib.linux-x86_64-3.7/horovod/run/common
creating build/lib.linux-x86_64-3.7/horovod/run/http
copying horovod/run/http/http_server.py -> build/lib.linux-x86_64-3.7/horovod/run/http
copying horovod/run/http/init.py -> build/lib.linux-x86_64-3.7/horovod/run/http
copying horovod/run/http/http_client.py -> build/lib.linux-x86_64-3.7/horovod/run/http
creating build/lib.linux-x86_64-3.7/horovod/run/driver
copying horovod/run/driver/driver_service.py -> build/lib.linux-x86_64-3.7/horovod/run/driver
copying horovod/run/driver/init.py -> build/lib.linux-x86_64-3.7/horovod/run/driver
creating build/lib.linux-x86_64-3.7/horovod/run/common/util
copying horovod/run/common/util/network.py -> build/lib.linux-x86_64-3.7/horovod/run/common/util
copying horovod/run/common/util/timeout.py -> build/lib.linux-x86_64-3.7/horovod/run/common/util
copying horovod/run/common/util/config_parser.py -> build/lib.linux-x86_64-3.7/horovod/run/common/util
copying horovod/run/common/util/codec.py -> build/lib.linux-x86_64-3.7/horovod/run/common/util
copying horovod/run/common/util/safe_shell_exec.py -> build/lib.linux-x86_64-3.7/horovod/run/common/util
copying horovod/run/common/util/host_hash.py -> build/lib.linux-x86_64-3.7/horovod/run/common/util
copying horovod/run/common/util/env.py -> build/lib.linux-x86_64-3.7/horovod/run/common/util
copying horovod/run/common/util/secret.py -> build/lib.linux-x86_64-3.7/horovod/run/common/util
copying horovod/run/common/util/settings.py -> build/lib.linux-x86_64-3.7/horovod/run/common/util
copying horovod/run/common/util/init.py -> build/lib.linux-x86_64-3.7/horovod/run/common/util
creating build/lib.linux-x86_64-3.7/horovod/run/common/service
copying horovod/run/common/service/task_service.py -> build/lib.linux-x86_64-3.7/horovod/run/common/service
copying horovod/run/common/service/driver_service.py -> build/lib.linux-x86_64-3.7/horovod/run/common/service
copying horovod/run/common/service/init.py -> build/lib.linux-x86_64-3.7/horovod/run/common/service
creating build/lib.linux-x86_64-3.7/horovod/spark/task
copying horovod/spark/task/task_service.py -> build/lib.linux-x86_64-3.7/horovod/spark/task
copying horovod/spark/task/mpirun_exec_fn.py -> build/lib.linux-x86_64-3.7/horovod/spark/task
copying horovod/spark/task/task_info.py -> build/lib.linux-x86_64-3.7/horovod/spark/task
copying horovod/spark/task/init.py -> build/lib.linux-x86_64-3.7/horovod/spark/task
creating build/lib.linux-x86_64-3.7/horovod/spark/common
copying horovod/spark/common/params.py -> build/lib.linux-x86_64-3.7/horovod/spark/common
copying horovod/spark/common/serialization.py -> build/lib.linux-x86_64-3.7/horovod/spark/common
copying horovod/spark/common/cache.py -> build/lib.linux-x86_64-3.7/horovod/spark/common
copying horovod/spark/common/util.py -> build/lib.linux-x86_64-3.7/horovod/spark/common
copying horovod/spark/common/constants.py -> build/lib.linux-x86_64-3.7/horovod/spark/common
copying horovod/spark/common/store.py -> build/lib.linux-x86_64-3.7/horovod/spark/common
copying horovod/spark/common/backend.py -> build/lib.linux-x86_64-3.7/horovod/spark/common
copying horovod/spark/common/_namedtuple_fix.py -> build/lib.linux-x86_64-3.7/horovod/spark/common
copying horovod/spark/common/estimator.py -> build/lib.linux-x86_64-3.7/horovod/spark/common
copying horovod/spark/common/init.py -> build/lib.linux-x86_64-3.7/horovod/spark/common
creating build/lib.linux-x86_64-3.7/horovod/spark/keras
copying horovod/spark/keras/optimizer.py -> build/lib.linux-x86_64-3.7/horovod/spark/keras
copying horovod/spark/keras/tensorflow.py -> build/lib.linux-x86_64-3.7/horovod/spark/keras
copying horovod/spark/keras/util.py -> build/lib.linux-x86_64-3.7/horovod/spark/keras
copying horovod/spark/keras/remote.py -> build/lib.linux-x86_64-3.7/horovod/spark/keras
copying horovod/spark/keras/bare.py -> build/lib.linux-x86_64-3.7/horovod/spark/keras
copying horovod/spark/keras/estimator.py -> build/lib.linux-x86_64-3.7/horovod/spark/keras
copying horovod/spark/keras/init.py -> build/lib.linux-x86_64-3.7/horovod/spark/keras
creating build/lib.linux-x86_64-3.7/horovod/spark/torch
copying horovod/spark/torch/util.py -> build/lib.linux-x86_64-3.7/horovod/spark/torch
copying horovod/spark/torch/remote.py -> build/lib.linux-x86_64-3.7/horovod/spark/torch
copying horovod/spark/torch/estimator.py -> build/lib.linux-x86_64-3.7/horovod/spark/torch
copying horovod/spark/torch/init.py -> build/lib.linux-x86_64-3.7/horovod/spark/torch
creating build/lib.linux-x86_64-3.7/horovod/spark/driver
copying horovod/spark/driver/job_id.py -> build/lib.linux-x86_64-3.7/horovod/spark/driver
copying horovod/spark/driver/driver_service.py -> build/lib.linux-x86_64-3.7/horovod/spark/driver
copying horovod/spark/driver/mpirun_rsh.py -> build/lib.linux-x86_64-3.7/horovod/spark/driver
copying horovod/spark/driver/init.py -> build/lib.linux-x86_64-3.7/horovod/spark/driver
creating build/lib.linux-x86_64-3.7/horovod/torch/mpi_lib_impl
copying horovod/torch/mpi_lib_impl/init.py -> build/lib.linux-x86_64-3.7/horovod/torch/mpi_lib_impl
creating build/lib.linux-x86_64-3.7/horovod/torch/mpi_lib
copying horovod/torch/mpi_lib/init.py -> build/lib.linux-x86_64-3.7/horovod/torch/mpi_lib
running build_ext
gcc -Wsign-compare -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes -fPIC -I/include -fPIC -std=c++11 -fPIC -O2 -Wall -fassociative-math -ffast-math -ftree-vectorize -funsafe-math-optimizations -mf16c -mavx -mfma -I/root/luxy/venv_athena/include -I/root/anaconda3/include/python3.7m -c build/temp.linux-x86_64-3.7/test_compile/test_cpp_flags.cc -o build/temp.linux-x86_64-3.7/test_compile/test_cpp_flags.o
cc1plus: warning: command line option ‘-Wstrict-prototypes’ is valid for C/ObjC but not for C++
gcc -pthread -shared -B /root/anaconda3/compiler_compat -L/root/anaconda3/lib -Wl,-rpath=/root/anaconda3/lib -Wl,--no-as-needed -Wl,--sysroot=/ -Wl,-rpath,/lib -L/lib -fPIC -I/include build/temp.linux-x86_64-3.7/test_compile/test_cpp_flags.o -o build/temp.linux-x86_64-3.7/test_compile/test_cpp_flags.so
gcc -Wsign-compare -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes -fPIC -I/include -fPIC -I/root/luxy/venv_athena/include -I/root/anaconda3/include/python3.7m -c build/temp.linux-x86_64-3.7/test_compile/test_link_flags.cc -o build/temp.linux-x86_64-3.7/test_compile/test_link_flags.o
cc1plus: warning: command line option ‘-Wstrict-prototypes’ is valid for C/ObjC but not for C++
gcc -pthread -shared -B /root/anaconda3/compiler_compat -L/root/anaconda3/lib -Wl,-rpath=/root/anaconda3/lib -Wl,--no-as-needed -Wl,--sysroot=/ -Wl,-rpath,/lib -L/lib -fPIC -I/include -Wl,--version-script=horovod.lds build/temp.linux-x86_64-3.7/test_compile/test_link_flags.o -o build/temp.linux-x86_64-3.7/test_compile/test_link_flags.so
INFO: Cannot find CMake, will skip compiling Horovod with Gloo.
Traceback (most recent call last):
File "/tmp/pip-install-j394y9lx/horovod/setup.py", line 341, in get_mpi_flags
shlex.split(show_command), universal_newlines=True).strip()
File "/root/anaconda3/lib/python3.7/subprocess.py", line 395, in check_output
**kwargs).stdout
File "/root/anaconda3/lib/python3.7/subprocess.py", line 472, in run
with Popen(*popenargs, **kwargs) as process:
File "/root/anaconda3/lib/python3.7/subprocess.py", line 775, in init
restore_signals, start_new_session)
File "/root/anaconda3/lib/python3.7/subprocess.py", line 1522, in _execute_child
raise child_exception_type(errno_num, err_msg, err_filename)
FileNotFoundError: [Errno 2] No such file or directory: 'mpicxx': 'mpicxx'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "/tmp/pip-install-j394y9lx/horovod/setup.py", line 622, in get_common_options
mpi_flags = get_mpi_flags()
File "/tmp/pip-install-j394y9lx/horovod/setup.py", line 354, in get_mpi_flags
'%s' % (show_command, traceback.format_exc()))
distutils.errors.DistutilsPlatformError: mpicxx -show failed (see error below), is MPI in $PATH?
Note: If your version of MPI has a custom command to show compilation flags, please specify it with the HOROVOD_MPICXX_SHOW environment variable.

Traceback (most recent call last):
File "/tmp/pip-install-j394y9lx/horovod/setup.py", line 341, in get_mpi_flags
shlex.split(show_command), universal_newlines=True).strip()
File "/root/anaconda3/lib/python3.7/subprocess.py", line 395, in check_output
**kwargs).stdout
File "/root/anaconda3/lib/python3.7/subprocess.py", line 472, in run
with Popen(*popenargs, **kwargs) as process:
File "/root/anaconda3/lib/python3.7/subprocess.py", line 775, in init
restore_signals, start_new_session)
File "/root/anaconda3/lib/python3.7/subprocess.py", line 1522, in _execute_child
raise child_exception_type(errno_num, err_msg, err_filename)
FileNotFoundError: [Errno 2] No such file or directory: 'mpicxx': 'mpicxx'

INFO: Cannot find MPI compilation flags, will skip compiling with MPI.
Traceback (most recent call last):
File "", line 1, in
File "/tmp/pip-install-j394y9lx/horovod/setup.py", line 1566, in
scripts=['bin/horovodrun'])
File "/root/luxy/venv_athena/lib/python3.7/site-packages/setuptools/init.py", line 145, in setup
return distutils.core.setup(**attrs)
File "/root/anaconda3/lib/python3.7/distutils/core.py", line 148, in setup
dist.run_commands()
File "/root/anaconda3/lib/python3.7/distutils/dist.py", line 966, in run_commands
self.run_command(cmd)
File "/root/anaconda3/lib/python3.7/distutils/dist.py", line 985, in run_command
cmd_obj.run()
File "/root/luxy/venv_athena/lib/python3.7/site-packages/wheel/bdist_wheel.py", line 223, in run
self.run_command('build')
File "/root/anaconda3/lib/python3.7/distutils/cmd.py", line 313, in run_command
self.distribution.run_command(command)
File "/root/anaconda3/lib/python3.7/distutils/dist.py", line 985, in run_command
cmd_obj.run()
File "/root/anaconda3/lib/python3.7/distutils/command/build.py", line 135, in run
self.run_command(cmd_name)
File "/root/anaconda3/lib/python3.7/distutils/cmd.py", line 313, in run_command
self.distribution.run_command(command)
File "/root/anaconda3/lib/python3.7/distutils/dist.py", line 985, in run_command
cmd_obj.run()
File "/root/luxy/venv_athena/lib/python3.7/site-packages/setuptools/command/build_ext.py", line 78, in run
_build_ext.run(self)
File "/root/anaconda3/lib/python3.7/distutils/command/build_ext.py", line 340, in run
self.build_extensions()
File "/tmp/pip-install-j394y9lx/horovod/setup.py", line 1457, in build_extensions
options = get_common_options(self)
File "/tmp/pip-install-j394y9lx/horovod/setup.py", line 635, in get_common_options
raise RuntimeError('One of Gloo or MPI are required for Horovod to run. Check the logs above for more info.')
RuntimeError: One of Gloo or MPI are required for Horovod to run. Check the logs above for more info.

ERROR: Failed building wheel for horovod
Running setup.py clean for horovod
Building wheel for kenlm (setup.py) ... \

How can I solve it?
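
The log above reports two separate misses during the Horovod build: CMake was not found (so Gloo is skipped) and `mpicxx` was not found (so MPI is skipped). With neither available, `setup.py` raises the `RuntimeError`. Below is a minimal pre-flight sketch, assuming both tools are expected on `PATH`. The helper name `missing_horovod_build_tools` is hypothetical and is not part of Horovod or Athena.

```python
import shutil


def missing_horovod_build_tools(which=shutil.which):
    """Return the back ends the Horovod build would skip, keyed by back end name.

    Mirrors the two INFO lines in the log: Gloo needs `cmake`, MPI needs `mpicxx`.
    `which` is injectable so the check can be exercised without changing PATH.
    """
    required = {"gloo": "cmake", "mpi": "mpicxx"}
    return {backend: tool for backend, tool in required.items() if which(tool) is None}


missing = missing_horovod_build_tools()
if len(missing) == 2:
    print("Neither cmake nor mpicxx is on PATH; the horovod wheel build is expected to fail.")
for backend, tool in missing.items():
    print(f"{backend}: '{tool}' not found on PATH")
```

Since the error message says one of Gloo or MPI is required, making either `cmake` or `mpicxx` available before rerunning `pip install -r requirements.txt` would be the first thing to try.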

cuDNN failed to initialize

806c16f

Hello, genius developers,
I am a Shenlan student. I installed Athena according to the instructions, with no errors reported along the way, and I used the following code to verify that the translation model trains correctly.
However, when I train the ASR model in the examples/asr/aishell_sub/ directory, an error is reported as soon as it reaches the fine-tuning stage.

Traceback (most recent call last):
File "athena/main.py", line 172, in
train(json_file, BaseSolver, 1, 0)
File "athena/main.py", line 117, in train
p, model, optimizer, checkpointer = build_model_from_jsonfile(jsonfile)
File "athena/main.py", line 105, in build_model_from_jsonfile
solver.evaluate_step(model.prepare_samples(iter(dataset).next()))
File "/home/hanzl/work/learn/athena/athena/solver.py", line 95, in evaluate_step
logits = self.model(samples, training=False)
...
...
File "/home/hanzl/work/learn/athena/venv_athena/lib/python3.6/site-packages/tensorflow_core/python/eager/execute.py", line 67, in quick_execute
six.raise_from(core._status_to_exception(e.code, message), None)
File "", line 3, in raise_from
tensorflow.python.framework.errors_impl.UnknownError: Failed to get convolution algorithm. This is probably because cuDNN failed to initialize, so try looking to see if a warning log message was printed above. [Op:Conv2D]

The message points me to the log output above, where I found the following information:

2020-04-06 14:48:17.829495: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1304] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 10283 MB memory) -> physical GPU (device: 2, name: GeForce RTX 2080 Ti, pci bus id: 0000:b1:00.0, compute capability: 7.5)
INFO:absl:trying to restore from : examples/asr/aishell/ckpts/mtl_transformer_ctc/
2020-04-06 14:48:21.230124: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
2020-04-06 14:48:22.942435: E tensorflow/stream_executor/cuda/cuda_dnn.cc:319] Loaded runtime CuDNN library: 7.5.0 but source was compiled with: 7.6.0. CuDNN library major and minor version needs to match or have higher minor version in case of CuDNN 7.0 or later version. If using a binary install, upgrade your CuDNN library. If building from sources, make sure the library loaded at runtime is compatible with the version specified during compile configuration.

Can anyone tell me why this is and how should I solve it? I look forward to your reply, thank you very much!
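
The error line in the log states the compatibility rule itself: for cuDNN 7.0 or later, the runtime library must have the same major version and an equal or higher minor version than the one TensorFlow was compiled with. Here the runtime is 7.5.0 and the build expects 7.6.0. The sketch below only parses that message and applies the quoted rule; the function names are hypothetical helpers, not TensorFlow APIs.

```python
import re


def parse_cudnn_versions(log_line):
    """Extract (loaded, compiled) version tuples from the cuDNN mismatch message."""
    match = re.search(
        r"Loaded runtime CuDNN library: (\d+)\.(\d+)\.(\d+) "
        r"but source was compiled with: (\d+)\.(\d+)\.(\d+)",
        log_line,
    )
    if match is None:
        return None
    numbers = tuple(int(group) for group in match.groups())
    return numbers[:3], numbers[3:]


def cudnn_is_compatible(loaded, compiled):
    """Rule quoted in the log: same major version, runtime minor >= compiled minor."""
    return loaded[0] == compiled[0] and loaded[1] >= compiled[1]


line = "Loaded runtime CuDNN library: 7.5.0 but source was compiled with: 7.6.0."
loaded, compiled = parse_cudnn_versions(line)
print(loaded, compiled, cudnn_is_compatible(loaded, compiled))
```

As the log itself suggests for a binary install, upgrading the cuDNN library to at least the compiled-against minor version is the direction to look in.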

'python examples/asr/aishell/local/prepare_data.py' failed

$ python examples/asr/aishell/local/prepare_data.py
Traceback (most recent call last):
File "examples/asr/aishell/local/prepare_data.py", line 25, in
from athena import get_wave_file_length
File "/workspace/users/lpp/source/athena/athena/init.py", line 18, in
from .data import SpeechRecognitionDatasetBuilder
File "/workspace/users/lpp/source/athena/athena/data/init.py", line 18, in
from .datasets.speech_recognition import SpeechRecognitionDatasetBuilder
File "/workspace/users/lpp/source/athena/athena/data/datasets/speech_recognition.py", line 22, in
from athena.transform import AudioFeaturizer
File "/workspace/users/lpp/source/athena/athena/transform/init.py", line 16, in
from athena.transform import audio_featurizer
File "/workspace/users/lpp/source/athena/athena/transform/audio_featurizer.py", line 19, in
from athena.transform import feats
File "/workspace/users/lpp/source/athena/athena/transform/feats/init.py", line 16, in
from athena.transform.feats.read_wav import ReadWav
File "/workspace/users/lpp/source/athena/athena/transform/feats/read_wav.py", line 21, in
from athena.transform.feats.ops import py_x_ops
File "/workspace/users/lpp/source/athena/athena/transform/feats/ops/py_x_ops.py", line 28, in
spectrum = gen_x_ops.spectrum
AttributeError: module '5fa89fc3154996733eabb433e18fa62f' has no attribute 'spectrum'

I would appreciate any help.

assert len(gpus) > len(visible_gpu_idx)

Wed Apr 1 08:54:43 2020[0]:name: GeForce GTX 1070 major: 6 minor: 1 memoryClockRate(GHz): 1.759
Wed Apr 1 08:54:43 2020[0]:pciBusID: 0000:01:00.0
Wed Apr 1 08:54:43 2020[0]:2020-04-01 08:54:43.171558: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.0
Wed Apr 1 08:54:43 2020[0]:2020-04-01 08:54:43.173728: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10.0
Wed Apr 1 08:54:43 2020[0]:2020-04-01 08:54:43.175027: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcufft.so.10.0
Wed Apr 1 08:54:43 2020[0]:2020-04-01 08:54:43.175832: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcurand.so.10.0
Wed Apr 1 08:54:43 2020[0]:2020-04-01 08:54:43.178045: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusolver.so.10.0
Wed Apr 1 08:54:43 2020[0]:2020-04-01 08:54:43.179579: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusparse.so.10.0
Wed Apr 1 08:54:43 2020[0]:2020-04-01 08:54:43.183873: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
Wed Apr 1 08:54:43 2020[0]:2020-04-01 08:54:43.183987: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1006] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
Wed Apr 1 08:54:43 2020[0]:2020-04-01 08:54:43.184595: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1006] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
Wed Apr 1 08:54:43 2020[0]:2020-04-01 08:54:43.185004: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1746] Adding visible gpu devices: 0
Wed Apr 1 08:54:43 2020[0]:Traceback (most recent call last):
Wed Apr 1 08:54:43 2020[0]: File "athena/main.py", line 171, in
Wed Apr 1 08:54:43 2020[0]: BaseSolver.initialize_devices(p.solver_gpu)
Wed Apr 1 08:54:43 2020[0]: File "/mnt/3T/mygits/ASR-NLP/FrameWorks/athena/athena/solver.py", line 54, in initialize_devices
Wed Apr 1 08:54:43 2020[0]: assert len(gpus) > len(visible_gpu_idx)
Wed Apr 1 08:54:43 2020[0]:AssertionError
Process 0 exit with status code 1.
Traceback (most recent call last):
File "/home/wcl/anaconda3/bin/horovodrun", line 21, in
run_commandline()
File "/home/wcl/anaconda3/lib/python3.7/site-packages/horovod/run/run.py", line 876, in run_commandline
_run(args)
File "/home/wcl/anaconda3/lib/python3.7/site-packages/horovod/run/run.py", line 844, in _run
_launch_job(args, remote_host_names, settings, common_intfs, command)
File "/home/wcl/anaconda3/lib/python3.7/site-packages/horovod/run/run.py", line 867, in _launch_job
gloo_run(settings, remote_host_names, common_intfs, env, driver_ip, command)
File "/home/wcl/anaconda3/lib/python3.7/site-packages/horovod/run/gloo_run.py", line 287, in gloo_run
_launch_jobs(settings, env, host_alloc_plan, remote_host_names, run_command)
File "/home/wcl/anaconda3/lib/python3.7/site-packages/horovod/run/gloo_run.py", line 259, in _launch_jobs
.format(name=name, code=exit_code))
RuntimeError: Gloo job detected that one or more processes exited with non-zero status, thus causing the job to be terminated. The first process to do so was:

I'm now trying a single GPU with Horovod (hvd) turned off.
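
The traceback shows the strict check `assert len(gpus) > len(visible_gpu_idx)` failing while TensorFlow reports a single visible device. A strict `>` cannot pass when the number of requested indices equals the number of GPUs found. The sketch below illustrates this in plain Python, assuming one requested index (for example `solver_gpu` set to `[0]`, which the post does not state). `check_gpu_request` is a hypothetical helper, not Athena's code.

```python
def check_gpu_request(num_gpus, visible_gpu_idx, strict=True):
    """Return True if the requested GPU indices pass the count check.

    strict=True reproduces the comparison from the traceback (`>`).
    strict=False uses `>=` and also verifies that every index is in range.
    """
    if strict:
        return num_gpus > len(visible_gpu_idx)
    in_range = all(0 <= idx < num_gpus for idx in visible_gpu_idx)
    return num_gpus >= len(visible_gpu_idx) and in_range


# One physical GPU, one requested index: the strict check rejects it.
print(check_gpu_request(1, [0], strict=True))
# The relaxed check accepts the same request.
print(check_gpu_request(1, [0], strict=False))
```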

save log files during training and decoding

Currently, users have to save logs with shell redirection, which is easy to forget, yet the log is critical because it is used to extract the n-best checkpoints during decoding. A mechanism that saves logs automatically may therefore be needed.
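
One possible shape for such a mechanism is a file handler attached at start-up, so that records are written to disk without any redirection. This is a minimal sketch using the standard-library `logging` module rather than Athena's own logging setup. The function names, the `train.log` filename and the logger name are assumptions made for illustration.

```python
import logging
import os
import tempfile


def setup_file_logging(log_dir, filename="train.log", name="athena_example"):
    """Attach a file handler so every record is also written to log_dir/filename."""
    os.makedirs(log_dir, exist_ok=True)
    log_path = os.path.join(log_dir, filename)
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    handler = logging.FileHandler(log_path, mode="a", encoding="utf-8")
    handler.setFormatter(logging.Formatter("%(asctime)s %(levelname)s %(message)s"))
    logger.addHandler(handler)
    return logger, log_path


def close_file_logging(logger):
    """Flush and detach all handlers so the log file is complete on disk."""
    for handler in list(logger.handlers):
        handler.close()
        logger.removeHandler(handler)


# Demo in a temporary directory so nothing is left behind.
with tempfile.TemporaryDirectory() as tmp_dir:
    logger, log_path = setup_file_logging(tmp_dir)
    logger.info("global_steps: %d loss: %.4f", 10, 1.2345)
    close_file_logging(logger)
    with open(log_path, encoding="utf-8") as log_file:
        print(log_file.read().strip())
```

Appending (`mode="a"`) keeps earlier runs in the same file, which matters if checkpoint selection later reads the whole training history.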

decoding output

We have to do a couple of things regarding decoding output:

  1. Add configuration to write file to disk;

  2. Compare WER/CER with standard tools.

Assigning tasks to myself for now.
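
For reference when comparing against standard tools, WER and CER are both the Levenshtein edit distance between reference and hypothesis divided by the reference length, computed over words and characters respectively. The following is a small self-contained sketch of that definition. It is not Athena's scoring code, and details such as text normalization or how spaces are treated for CER are assumptions that real tools may handle differently.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two token sequences (two-row dynamic programming)."""
    prev = list(range(len(hyp) + 1))
    for i, ref_tok in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_tok in enumerate(hyp, start=1):
            cost = 0 if ref_tok == hyp_tok else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def error_rate(ref_tokens, hyp_tokens):
    """Edit distance normalized by the reference length."""
    if not ref_tokens:
        raise ValueError("reference must not be empty")
    return edit_distance(ref_tokens, hyp_tokens) / len(ref_tokens)


def wer(ref, hyp):
    """Word error rate over whitespace-separated tokens."""
    return error_rate(ref.split(), hyp.split())


def cer(ref, hyp):
    """Character error rate; spaces are dropped here (an assumption)."""
    return error_rate(list(ref.replace(" ", "")), list(hyp.replace(" ", "")))


print(wer("the cat sat", "the cat sit"))
print(cer("abcd", "abed"))
```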

Problem: Ran out of memory

Hi all,

I used commit b7b2d91 and trained the aishell example. GPU memory usage kept growing as global_steps increased, and training eventually ran out of memory.

os: ubuntu 16.04
tensorflow: 2.0.1

Thank you in advance.

Validation scripts

We need scripts to validate things such as the data directory structure. Otherwise we won't know if a certain step fails. I'll assign it to myself for now, but it may take some time before I get back to this.
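
A validation step of this kind could start as a check that the expected files exist and are non-empty before training begins. The sketch below assumes a required-file list of `train.csv`, `dev.csv`, `test.csv` and `cmvn`, taken from the aishell paths that appear in the hparams log elsewhere on this page. The helper name and the exact list are illustrative, not an existing Athena script.

```python
import os
import tempfile


def find_missing_files(data_dir, required):
    """Return the required relative paths that are absent or empty under data_dir."""
    missing = []
    for rel_path in required:
        full_path = os.path.join(data_dir, rel_path)
        if not os.path.isfile(full_path) or os.path.getsize(full_path) == 0:
            missing.append(rel_path)
    return missing


# Assumed layout for illustration only.
REQUIRED = ("train.csv", "dev.csv", "test.csv", "cmvn")

# Demo: a temporary data directory where only train.csv has been prepared.
with tempfile.TemporaryDirectory() as data_dir:
    with open(os.path.join(data_dir, "train.csv"), "w", encoding="utf-8") as f:
        f.write("wav_filename\twav_length_ms\ttranscript\n")
    print(find_missing_files(data_dir, REQUIRED))
```

Running such a check between pipeline stages would let a script stop with a clear message instead of failing later inside training.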

An installation issue

Hi,

When I was running the installation step:
pip3.7 install -r requirements.txt

I got the errors shown below (the full output is too long to paste, so I have pasted some screenshots here). I would be very grateful if you could have a look. Is it because I didn't configure the Horovod environment correctly, or because I didn't install MPI successfully?

Requirement already satisfied: pyasn1>=0.1.3 in /home/pc21/venv_athena/lib/python3.7/site-packages (from rsa<4.1,>=3.1.4->google-auth<2,>=1.6.3->tensorboard<2.1.0,>=2.0.0->tensorflow-gpu==2.0.1->-r requirements.txt (line 1)) (0.4.8)
Building wheels for collected packages: horovod, librosa, kenlm, jieba, psutil, pyyaml, audioread, resampy
Building wheel for horovod (setup.py) ... error
ERROR: Command errored out with exit status 1:
command: /home/pc21/venv_athena/bin/python3.7 -u -c 'import sys, setuptools, tokenize; sys.argv[0] = '"'"'/tmp/pip-install-xp_is8ug/horovod/setup.py'"'"'; file='"'"'/tmp/pip-install-xp_is8ug/horovod/setup.py'"'"';f=getattr(tokenize, '"'"'open'"'"', open)(file);code=f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, file, '"'"'exec'"'"'))' bdist_wheel -d /tmp/pip-wheel-4e373ylc
cwd: /tmp/pip-install-xp_is8ug/horovod/

After many lines:

x86_64-linux-gnu-gcc -pthread -shared -Wl,-O1 -Wl,-Bsymbolic-functions -Wl,-Bsymbolic-functions -Wl,-z,relro -Wl,-Bsymbolic-functions -Wl,-z,relro -g -fdebug-prefix-map=/build/python3.7-1t2gIN/python3.7-3.7.0~b3=. -fstack-protector-strong -Wformat -Werror=format-security -Wdate-time -D_FORTIFY_SOURCE=2 -Wl,--version-script=horovod.lds build/temp.linux-x86_64-3.7/test_compile/test_link_flags.o -o build/temp.linux-x86_64-3.7/test_compile/test_link_flags.so
INFO: Cannot find CMake, will skip compiling Horovod with Gloo.
Traceback (most recent call last):
File "/tmp/pip-install-xp_is8ug/horovod/setup.py", line 341, in get_mpi_flags
shlex.split(show_command), universal_newlines=True).strip()

Then:

File "/usr/lib/python3.7/subprocess.py", line 453, in run
with Popen(*popenargs, **kwargs) as process:
File "/usr/lib/python3.7/subprocess.py", line 756, in init
restore_signals, start_new_session)
File "/usr/lib/python3.7/subprocess.py", line 1499, in _execute_child
raise child_exception_type(errno_num, err_msg, err_filename)
FileNotFoundError: [Errno 2] No such file or directory: 'mpicxx': 'mpicxx'

INFO: Cannot find MPI compilation flags, will skip compiling with MPI.

raise RuntimeError('One of Gloo or MPI are required for Horovod to run. Check the logs above for more info.')
RuntimeError: One of Gloo or MPI are required for Horovod to run. Check the logs above for more info.

Then:

ERROR: Failed building wheel for horovod
Running setup.py clean for horovod
Building wheel for librosa (setup.py) ... done
Created wheel for librosa: filename=librosa-0.7.2-py3-none-any.whl size=1612883 sha256=862bd06e9c89bd1f80e6e702b637d5ddf048093d4fd402073d13aebbacdc0799
Stored in directory: /tmp/pip-ephem-wheel-cache-x3z9ygim/wheels/18/9e/42/3224f85730f92fa2925f0b4fb6ef7f9c5431a64dfc77b95b39
Building wheel for kenlm (setup.py) ... error
ERROR: Command errored out with exit status 1:

Mixed-precision training

Hi. We want to use Athena in mixed-precision mode, because our GPUs don't have much memory and we want to speed up training.

I changed the code to use the mixed-precision training feature in TF2. It works for MPC (stage 1).
But in the fine-tuning stage, the loss becomes NaN from the very beginning.
I tried to debug it and found that the PositionalEncoding in speech_transformer.py always returns NaN.

        input_labels = layers.Input(shape=data_descriptions.sample_shape["output"], dtype=tf.int32)
        inner = layers.Embedding(self.num_class, d_model)(input_labels)
        inner = PositionalEncoding(d_model, scale=True)(inner) #it returns NaN
        inner = layers.Dropout(self.hparams.rate)(inner)
        self.y_net = tf.keras.Model(inputs=input_labels, outputs=inner, name="y_net")

Could anyone help? Thanks a lot.

horovod terminated.

Pretraining
Wed Apr  8 12:05:13 2020[0]<stderr>:INFO:absl:
Wed Apr  8 12:05:13 2020[0]<stderr>:================================= Parse Parameter ==============================
Wed Apr  8 12:05:13 2020[0]<stderr>:
Wed Apr  8 12:05:13 2020[0]<stderr>:INFO:absl:hparams: [('batch_size', 64), ('ckpt', 'examples/asr/aishell/ckpts/mpc'), ('cls', 'main'), ('dataset_builder', 'speech_dataset'), ('decode_config', None), ('devset_config', {'data_csv': 'examples/asr/aishell/data/dev.csv', 'audio_config': {'type': 'Fbank', 'filterbank_channel_count': 40}, 'cmvn_file': 'examples/asr/aishell/data/cmvn', 'input_length_range': [10, 8000]}), ('model', 'mpc'), ('model_config', {'return_encoder_output': False, 'num_filters': 512, 'd_model': 512, 'num_heads': 8, 'num_encoder_layers': 12, 'dff': 1280, 'rate': 0.1, 'chunk_size': 1, 'keep_probability': 0.8}), ('num_classes', 40), ('num_data_threads', 1), ('num_epochs', 250), ('optimizer', 'warmup_adam'), ('optimizer_config', {'d_model': 512, 'warmup_steps': 2000, 'k': 0.3}), ('pretrained_model', None), ('solver_config', {'clip_norm': 100, 'log_interval': 10, 'enable_tf_function': True}), ('solver_gpu', [0]), ('sorta_epoch', 1), ('summary_dir', 'examples/asr/aishell/ckpts/mpc/event'), ('testset_config', {'data_csv': 'examples/asr/aishell/data/test.csv', 'audio_config': {'type': 'Fbank', 'filterbank_channel_count': 40}, 'cmvn_file': 'examples/asr/aishell/data/cmvn', 'input_length_range': [10, 8000]}), ('trainset_config', {'data_csv': 'examples/asr/aishell/data/train.csv', 'audio_config': {'type': 'Fbank', 'filterbank_channel_count': 40}, 'cmvn_file': 'examples/asr/aishell/data/cmvn', 'input_length_range': [10, 8000]})]
Wed Apr  8 12:05:13 2020[0]<stderr>:INFO:absl:
Wed Apr  8 12:05:13 2020[0]<stderr>:================================= HorovodSolver Init ==============================
Wed Apr  8 12:05:13 2020[0]<stderr>:
Wed Apr  8 12:05:13 2020[0]<stderr>:INFO:absl:
Wed Apr  8 12:05:13 2020[0]<stderr>:================================= Start Train ==============================
Wed Apr  8 12:05:13 2020[0]<stderr>:
Wed Apr  8 12:05:13 2020[0]<stderr>:INFO:absl:
Wed Apr  8 12:05:13 2020[0]<stderr>:======================= build model from json file ======================
Wed Apr  8 12:05:13 2020[0]<stderr>:
Wed Apr  8 12:05:13 2020[0]<stderr>:INFO:absl:hparams: [('batch_size', 64), ('ckpt', 'examples/asr/aishell/ckpts/mpc'), ('cls', 'main'), ('dataset_builder', 'speech_dataset'), ('decode_config', None), ('devset_config', {'data_csv': 'examples/asr/aishell/data/dev.csv', 'audio_config': {'type': 'Fbank', 'filterbank_channel_count': 40}, 'cmvn_file': 'examples/asr/aishell/data/cmvn', 'input_length_range': [10, 8000]}), ('model', 'mpc'), ('model_config', {'return_encoder_output': False, 'num_filters': 512, 'd_model': 512, 'num_heads': 8, 'num_encoder_layers': 12, 'dff': 1280, 'rate': 0.1, 'chunk_size': 1, 'keep_probability': 0.8}), ('num_classes', 40), ('num_data_threads', 1), ('num_epochs', 250), ('optimizer', 'warmup_adam'), ('optimizer_config', {'d_model': 512, 'warmup_steps': 2000, 'k': 0.3}), ('pretrained_model', None), ('solver_config', {'clip_norm': 100, 'log_interval': 10, 'enable_tf_function': True}), ('solver_gpu', [0]), ('sorta_epoch', 1), ('summary_dir', 'examples/asr/aishell/ckpts/mpc/event'), ('testset_config', {'data_csv': 'examples/asr/aishell/data/test.csv', 'audio_config': {'type': 'Fbank', 'filterbank_channel_count': 40}, 'cmvn_file': 'examples/asr/aishell/data/cmvn', 'input_length_range': [10, 8000]}), ('trainset_config', {'data_csv': 'examples/asr/aishell/data/train.csv', 'audio_config': {'type': 'Fbank', 'filterbank_channel_count': 40}, 'cmvn_file': 'examples/asr/aishell/data/cmvn', 'input_length_range': [10, 8000]})]
Wed Apr  8 12:05:13 2020[0]<stderr>:INFO:absl:hparams: [('audio_config', {'type': 'Fbank', 'filterbank_channel_count': 40}), ('cls', <class 'athena.data.datasets.speech_set.SpeechDatasetBuilder'>), ('cmvn_file', 'examples/asr/aishell/data/cmvn'), ('data_csv', 'examples/asr/aishell/data/train.csv'), ('input_length_range', [10, 8000])]
Wed Apr  8 12:05:13 2020[0]<stdout>:Fbank params:  [('channel', 1), ('cls', <class 'athena.transform.feats.fbank.Fbank'>), ('delta_delta', False), ('dither', 0.0), ('filterbank_channel_count', 40), ('frame_length', 0.01), ('global_mean', [0.0]), ('global_variance', [1.000001]), ('is_fbank', True), ('local_cmvn', False), ('lower_frequency_limit', 60), ('order', 2), ('output_type', 1), ('preEph_coeff', 0.97), ('raw_energy', 1), ('remove_dc_offset', True), ('snip_edges', 1), ('type', 'Fbank'), ('upper_frequency_limit', 0), ('window', 2), ('window_length', 0.025), ('window_type', 'povey')]
Wed Apr  8 12:05:13 2020[0]<stderr>:INFO:absl:Successfully load cmvn file examples/asr/aishell/data/cmvn
Wed Apr  8 12:05:13 2020[0]<stderr>:INFO:absl:Loading data from examples/asr/aishell/data/train.csv
Wed Apr  8 12:05:14 2020[0]<stderr>:2020-04-08 12:05:14.813653: I tensorflow/core/platform/cpu_feature_guard.cc:142] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
Wed Apr  8 12:05:14 2020[0]<stderr>:2020-04-08 12:05:14.828699: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 2399915000 Hz
Wed Apr  8 12:05:14 2020[0]<stderr>:2020-04-08 12:05:14.833653: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x55aa3912c090 executing computations on platform Host. Devices:
Wed Apr  8 12:05:14 2020[0]<stderr>:2020-04-08 12:05:14.833693: I tensorflow/compiler/xla/service/service.cc:175]   StreamExecutor device (0): Host, Default Version
Wed Apr  8 12:05:15 2020[0]<stdout>:Model: "x_net"
Wed Apr  8 12:05:15 2020[0]<stdout>:_________________________________________________________________
Wed Apr  8 12:05:15 2020[0]<stdout>:Layer (type)                 Output Shape              Param #
Wed Apr  8 12:05:15 2020[0]<stdout>:=================================================================
Wed Apr  8 12:05:15 2020[0]<stdout>:input_1 (InputLayer)         [(None, None, 40, 1)]     0
Wed Apr  8 12:05:15 2020[0]<stdout>:_________________________________________________________________
Wed Apr  8 12:05:15 2020[0]<stdout>:conv2d (Conv2D)              (None, None, 20, 512)     4608
Wed Apr  8 12:05:15 2020[0]<stdout>:_________________________________________________________________
Wed Apr  8 12:05:15 2020[0]<stdout>:batch_normalization (BatchNo (None, None, 20, 512)     2048
Wed Apr  8 12:05:15 2020[0]<stdout>:_________________________________________________________________
Wed Apr  8 12:05:15 2020[0]<stdout>:tf_op_layer_Relu6 (TensorFlo [(None, None, 20, 512)]   0
Wed Apr  8 12:05:15 2020[0]<stdout>:_________________________________________________________________
Wed Apr  8 12:05:15 2020[0]<stdout>:conv2d_1 (Conv2D)            (None, None, 10, 512)     2359296
Wed Apr  8 12:05:15 2020[0]<stdout>:_________________________________________________________________
Wed Apr  8 12:05:15 2020[0]<stdout>:batch_normalization_1 (Batch (None, None, 10, 512)     2048
Wed Apr  8 12:05:15 2020[0]<stdout>:_________________________________________________________________
Wed Apr  8 12:05:15 2020[0]<stdout>:tf_op_layer_Relu6_1 (TensorF [(None, None, 10, 512)]   0
Wed Apr  8 12:05:15 2020[0]<stdout>:_________________________________________________________________
Wed Apr  8 12:05:15 2020[0]<stdout>:reshape (Reshape)            (None, None, 5120)        0
Wed Apr  8 12:05:15 2020[0]<stdout>:_________________________________________________________________
Wed Apr  8 12:05:15 2020[0]<stdout>:dense (Dense)                (None, None, 512)         2621952
Wed Apr  8 12:05:15 2020[0]<stdout>:_________________________________________________________________
Wed Apr  8 12:05:15 2020[0]<stdout>:positional_encoding (Positio (None, None, 512)         0
Wed Apr  8 12:05:15 2020[0]<stdout>:_________________________________________________________________
Wed Apr  8 12:05:15 2020[0]<stdout>:dropout (Dropout)            (None, None, 512)         0
Wed Apr  8 12:05:15 2020[0]<stdout>:=================================================================
Wed Apr  8 12:05:15 2020[0]<stdout>:Total params: 4,989,952
Wed Apr  8 12:05:15 2020[0]<stdout>:Trainable params: 4,987,904
Wed Apr  8 12:05:15 2020[0]<stdout>:Non-trainable params: 2,048
Wed Apr  8 12:05:15 2020[0]<stdout>:_________________________________________________________________
Wed Apr  8 12:05:15 2020[0]<stdout>:None
Wed Apr  8 12:05:16 2020[0]<stderr>:INFO:absl:trying to restore from : examples/asr/aishell/ckpts/mpc
Wed Apr  8 12:05:20 2020[0]<stderr>:INFO:absl:hparams: [('audio_config', {'type': 'Fbank', 'filterbank_channel_count': 40}), ('cls', <class 'athena.data.datasets.speech_set.SpeechDatasetBuilder'>), ('cmvn_file', 'examples/asr/aishell/data/cmvn'), ('data_csv', 'examples/asr/aishell/data/train.csv'), ('input_length_range', [10, 8000])]
Wed Apr  8 12:05:20 2020[0]<stdout>:Fbank params:  [('channel', 1), ('cls', <class 'athena.transform.feats.fbank.Fbank'>), ('delta_delta', False), ('dither', 0.0), ('filterbank_channel_count', 40), ('frame_length', 0.01), ('global_mean', [0.0]), ('global_variance', [1.000001]), ('is_fbank', True), ('local_cmvn', False), ('lower_frequency_limit', 60), ('order', 2), ('output_type', 1), ('preEph_coeff', 0.97), ('raw_energy', 1), ('remove_dc_offset', True), ('snip_edges', 1), ('type', 'Fbank'), ('upper_frequency_limit', 0), ('window', 2), ('window_length', 0.025), ('window_type', 'povey')]
Wed Apr  8 12:05:21 2020[0]<stderr>:INFO:absl:Successfully load cmvn file examples/asr/aishell/data/cmvn
Wed Apr  8 12:05:21 2020[0]<stderr>:INFO:absl:Loading data from examples/asr/aishell/data/train.csv
Wed Apr  8 12:05:22 2020[0]<stderr>:INFO:absl:Creates the sub-dataset which is the 0 part of 1
Wed Apr  8 12:05:22 2020[0]<stderr>:INFO:absl:
Wed Apr  8 12:05:22 2020[0]<stderr>:>>>>> start training in epoch 0===============================
Wed Apr  8 12:05:22 2020[0]<stderr>:
Wed Apr  8 12:05:22 2020[0]<stderr>:INFO:absl:please be patient, enable tf.function, it takes time ...
Process 0 exit with status code 249.
Traceback (most recent call last):
  File "/home/users/xiongxinlei/opt/anaconda2/envs/athena/bin/horovodrun", line 21, in <module>
    run_commandline()
  File "/home/users/xiongxinlei/opt/anaconda2/envs/athena/lib/python3.7/site-packages/horovod/run/run.py", line 876, in run_commandline
    _run(args)
  File "/home/users/xiongxinlei/opt/anaconda2/envs/athena/lib/python3.7/site-packages/horovod/run/run.py", line 844, in _run
    _launch_job(args, remote_host_names, settings, common_intfs, command)
  File "/home/users/xiongxinlei/opt/anaconda2/envs/athena/lib/python3.7/site-packages/horovod/run/run.py", line 867, in _launch_job
    gloo_run(settings, remote_host_names, common_intfs, env, driver_ip, command)
  File "/home/users/xiongxinlei/opt/anaconda2/envs/athena/lib/python3.7/site-packages/horovod/run/gloo_run.py", line 287, in gloo_run
    _launch_jobs(settings, env, host_alloc_plan, remote_host_names, run_command)
  File "/home/users/xiongxinlei/opt/anaconda2/envs/athena/lib/python3.7/site-packages/horovod/run/gloo_run.py", line 259, in _launch_jobs
    .format(name=name, code=exit_code))
RuntimeError: Gloo job detected that one or more processes exited with non-zero status, thus causing the job to be terminated. The first process to do so was:
Process name: 0
Exit code: 249

Deadlock seems to occur when using multiple threads

Here's the mtl_transformer.json for the fine-tuning stage. I set num_data_threads to 32, and training hangs at the end of the first training epoch.
horovodrun -np 4 -H localhost:4 python athena/horovod_main.py examples/asr/hkust/configs/mtl_transformer.json

{
  "batch_size":24,
  "num_epochs":50,
  "sorta_epoch":1,
  "ckpt":"examples/asr/hkust/ckpts/mtl_transformer_ctc/",
  "summary_dir":"examples/asr/hkust/ckpts/mtl_transformer_ctc/event",

  "solver_gpu":[0],
  "solver_config":{
    "clip_norm":100,
    "log_interval":10,
    "enable_tf_function":true
  },

  "model":"mtl_transformer_ctc",
  "num_classes": null,
  "pretrained_model": "examples/asr/hkust/configs/mpc.json",
  "model_config":{
    "model":"speech_transformer",
    "model_config":{
      "return_encoder_output":true,
      "num_filters":512,
      "d_model":512,
      "num_heads":8,
      "num_encoder_layers":12,
      "num_decoder_layers":6,
      "dff":1280,
      "rate":0.1,
      "label_smoothing_rate":0.0,
      "schedual_sampling_rate":0.9
    },
    "mtl_weight":0.5
  },

  "decode_config":{
    "beam_search":true,
    "beam_size":10,
    "ctc_weight":0.5,
    "lm_type":"ngram",
    "lm_weight":0.3,
    "lm_path":"examples/asr/hkust/data/5gram.arpa"
  },

  "optimizer":"warmup_adam",
  "optimizer_config":{
    "d_model":512,
    "warmup_steps":25000,
    "k":1.0
  },

  "dataset_builder": "speech_recognition_dataset",
  "num_data_threads": 12,
  "trainset_config":{
    "data_csv": "examples/asr/hkust/data/train.csv",
    "audio_config":{"type":"Fbank", "filterbank_channel_count":40},
    "cmvn_file":"examples/asr/hkust/data/cmvn",
    "text_config": {"type":"vocab", "model":"examples/asr/hkust/data/vocab"},
    "input_length_range":[10, 8000]
  },
  "devset_config":{
    "data_csv": "examples/asr/hkust/data/dev.mini.csv",
    "audio_config":{"type":"Fbank", "filterbank_channel_count":40},
    "cmvn_file":"examples/asr/hkust/data/cmvn",
    "text_config": {"type":"vocab", "model":"examples/asr/hkust/data/vocab"},
    "input_length_range":[10, 8000]
  },
  "testset_config":{
    "data_csv": "examples/asr/hkust/data/dev.mini.csv",
    "audio_config":{"type":"Fbank", "filterbank_channel_count":40},
    "cmvn_file":"examples/asr/hkust/data/cmvn",
    "text_config": {"type":"vocab", "model":"examples/asr/hkust/data/vocab"}
  }
}
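Since the description mentions 32 data threads while the pasted config carries a different value, it may be worth confirming which value the file on disk really holds before reproducing the hang. A minimal sketch using only the standard `json` module; the helper name is hypothetical and the sample string is a cut-down stand-in for the real file:

```python
import json

def read_num_data_threads(config_text):
    """Return the num_data_threads value in a JSON config, or None if absent."""
    config = json.loads(config_text)
    return config.get("num_data_threads")

# Cut-down stand-in for mtl_transformer.json.
sample = '{"batch_size": 24, "num_data_threads": 12}'
print(read_num_data_threads(sample))  # prints 12
```

For the real file, the text can be read with `open(path).read()` and passed to the same helper.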

Here's part of the strace log of one of the processes:

restart_syscall(<... resuming interrupted futex ...>) = -1 ETIMEDOUT (Connection timed out)
futex(0xa74540, FUTEX_WAKE_PRIVATE, 1) = 0
futex(0xa745ac, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, {tv_sec=1585236534, tv_nsec=448417000}, 0xffffffff) = -1 ETIMEDOUT (Connection timed out)
futex(0xa74540, FUTEX_WAKE_PRIVATE, 1) = 0
futex(0xa745a8, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, {tv_sec=1585236534, tv_nsec=453560000}, 0xffffffff) = -1 ETIMEDOUT (Connection timed out)
futex(0xa74540, FUTEX_WAKE_PRIVATE, 1) = 0
futex(0xa745ac, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, {tv_sec=1585236534, tv_nsec=458726000}, 0xffffffff) = -1 ETIMEDOUT (Connection timed out)
futex(0xa74540, FUTEX_WAKE_PRIVATE, 1) = 0
futex(0xa745a8, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, {tv_sec=1585236534, tv_nsec=463892000}, 0xffffffff) = -1 ETIMEDOUT (Connection timed out)
futex(0xa74540, FUTEX_WAKE_PRIVATE, 1) = 0
futex(0xa745ac, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, {tv_sec=1585236534, tv_nsec=469063000}, 0xffffffff) = -1 ETIMEDOUT (Connection timed out)
futex(0xa74540, FUTEX_WAKE_PRIVATE, 1) = 0
futex(0xa745a8, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, {tv_sec=1585236534, tv_nsec=474251000}, 0xffffffff) = -1 ETIMEDOUT (Connection timed out)
futex(0xa74540, FUTEX_WAKE_PRIVATE, 1) = 0
futex(0xa745ac, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, {tv_sec=1585236534, tv_nsec=479418000}, 0xffffffff) = -1 ETIMEDOUT (Connection timed out)
futex(0xa74540, FUTEX_WAKE_PRIVATE, 1) = 0
futex(0xa745a8, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, {tv_sec=1585236534, tv_nsec=484583000}, 0xffffffff) = -1 ETIMEDOUT (Connection timed out)
futex(0xa74540, FUTEX_WAKE_PRIVATE, 1) = 0
futex(0xa745ac, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, {tv_sec=1585236534, tv_nsec=489748000}, 0xffffffff) = -1 ETIMEDOUT (Connection timed out)
futex(0xa74540, FUTEX_WAKE_PRIVATE, 1) = 0
futex(0xa745a8, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, {tv_sec=1585236534, tv_nsec=494910000}, 0xffffffff) = -1 ETIMEDOUT (Connection timed out)
futex(0xa745ac, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0xa74540, FUTEX_WAKE_PRIVATE, 1) = 0
futex(0xa745a8, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, {tv_sec=1585236534, tv_nsec=500123000}, 0xffffffff) = 0
futex(0xa745a8, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, {tv_sec=1585236534, tv_nsec=500123000}, 0xffffffff) = -1 ETIMEDOUT (Connection timed out)
futex(0xa74540, FUTEX_WAKE_PRIVATE, 1) = 0
futex(0xa745ac, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, {tv_sec=1585236534, tv_nsec=505351000}, 0xffffffff) = 0
futex(0xa745ac, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, {tv_sec=1585236534, tv_nsec=505351000}, 0xffffffff) = -1 ETIMEDOUT (Connection timed out)
futex(0xa74540, FUTEX_WAKE_PRIVATE, 1) = 0
futex(0xa745a8, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, {tv_sec=1585236534, tv_nsec=510428000}, 0xffffffff) = 0
futex(0xa745a8, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, {tv_sec=1585236534, tv_nsec=510428000}, 0xffffffff) = -1 ETIMEDOUT (Connection timed out)
futex(0xa74540, FUTEX_WAKE_PRIVATE, 1) = 0
futex(0xa745ac, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, {tv_sec=1585236534, tv_nsec=515504000}, 0xffffffff) = 0
futex(0xa745ac, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, {tv_sec=1585236534, tv_nsec=515504000}, 0xffffffff) = -1 ETIMEDOUT (Connection timed out)
futex(0xa74540, FUTEX_WAKE_PRIVATE, 1) = 0
futex(0xa745a8, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, {tv_sec=1585236534, tv_nsec=520612000}, 0xffffffff) = 0
futex(0xa745a8, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, {tv_sec=1585236534, tv_nsec=520612000}, 0xffffffff) = -1 ETIMEDOUT (Connection timed out)
futex(0xa74540, FUTEX_WAKE_PRIVATE, 1) = 0
futex(0xa745ac, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, {tv_sec=1585236534, tv_nsec=525693000}, 0xffffffff) = 0
futex(0xa745ac, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, {tv_sec=1585236534, tv_nsec=525693000}, 0xffffffff) = -1 ETIMEDOUT (Connection timed out)
futex(0xa74540, FUTEX_WAKE_PRIVATE, 1) = 0
futex(0xa745a8, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, {tv_sec=1585236534, tv_nsec=530785000}, 0xffffffff) = -1 ETIMEDOUT (Connection timed out)
futex(0xa74540, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0xa745ac, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, {tv_sec=1585236534, tv_nsec=535930000}, 0xffffffff) = -1 ETIMEDOUT (Connection timed out)
futex(0xa74540, FUTEX_WAKE_PRIVATE, 1) = 0
futex(0xa745a8, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, {tv_sec=1585236534, tv_nsec=541074000}, 0xffffffff) = -1 ETIMEDOUT (Connection timed out)
futex(0xa74540, FUTEX_WAKE_PRIVATE, 1) = 0
futex(0xa745ac, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, {tv_sec=1585236534, tv_nsec=546212000}, 0xffffffff) = -1 ETIMEDOUT (Connection timed out)
futex(0xa74540, FUTEX_WAKE_PRIVATE, 1) = 0
futex(0xa745a8, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, {tv_sec=1585236534, tv_nsec=551297000}, 0xffffffff) = -1 ETIMEDOUT (Connection timed out)
futex(0xa74540, FUTEX_WAKE_PRIVATE, 1) = 0
futex(0xa745ac, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, {tv_sec=1585236534, tv_nsec=556410000}, 0xffffffff) = -1 ETIMEDOUT (Connection timed out)
futex(0xa74540, FUTEX_WAKE_PRIVATE, 1) = 0
futex(0xa745a8, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, {tv_sec=1585236534, tv_nsec=561516000}, 0xffffffff) = -1 ETIMEDOUT (Connection timed out)
futex(0xa74540, FUTEX_WAKE_PRIVATE, 1) = 0
futex(0xa745ac, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, {tv_sec=1585236534, tv_nsec=566631000}, 0xffffffff) = -1 ETIMEDOUT (Connection timed out)
futex(0xa74540, FUTEX_WAKE_PRIVATE, 1) = 0
futex(0xa745a8, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, {tv_sec=1585236534, tv_nsec=571733000}, 0xffffffff) = -1 ETIMEDOUT (Connection timed out)
futex(0xa74540, FUTEX_WAKE_PRIVATE, 1) = 0
futex(0xa745ac, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, {tv_sec=1585236534, tv_nsec=576836000}, 0xffffffff) = -1 ETIMEDOUT (Connection timed out)
futex(0xa74540, FUTEX_WAKE_PRIVATE, 1) = 0
futex(0xa745a8, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, {tv_sec=1585236534, tv_nsec=581946000}, 0xffffffff) = -1 ETIMEDOUT (Connection timed out)
futex(0xa74540, FUTEX_WAKE_PRIVATE, 1) = 0
futex(0xa745ac, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, {tv_sec=1585236534, tv_nsec=587075000}, 0xffffffff) = -1 ETIMEDOUT (Connection timed out)
futex(0xa74540, FUTEX_WAKE_PRIVATE, 1) = 0
futex(0xa745a8, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, {tv_sec=1585236534, tv_nsec=592220000}, 0xffffffff) = -1 ETIMEDOUT (Connection timed out)
futex(0xa74540, FUTEX_WAKE_PRIVATE, 1) = 0
futex(0xa745ac, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, {tv_sec=1585236534, tv_nsec=597350000}, 0xffffffff) = -1 ETIMEDOUT (Connection timed out)
futex(0xa74540, FUTEX_WAKE_PRIVATE, 1) = 0
futex(0xa745a8, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, {tv_sec=1585236534, tv_nsec=602524000}, 0xffffffff) = -1 ETIMEDOUT (Connection timed out)
futex(0xa74540, FUTEX_WAKE_PRIVATE, 1) = 0
futex(0xa745ac, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, {tv_sec=1585236534, tv_nsec=607693000}, 0xffffffff) = -1 ETIMEDOUT (Connection timed out)
futex(0xa74540, FUTEX_WAKE_PRIVATE, 1) = 0
futex(0xa745a8, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, {tv_sec=1585236534, tv_nsec=612850000}, 0xffffffff) = -1 ETIMEDOUT (Connection timed out)
futex(0xa74540, FUTEX_WAKE_PRIVATE, 1) = 0
futex(0xa745ac, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, {tv_sec=1585236534, tv_nsec=617997000}, 0xffffffff) = -1 ETIMEDOUT (Connection timed out)
futex(0xa74540, FUTEX_WAKE_PRIVATE, 1) = 0
futex(0xa745a8, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, {tv_sec=1585236534, tv_nsec=623167000}, 0xffffffff) = -1 ETIMEDOUT (Connection timed out)
futex(0xa74540, FUTEX_WAKE_PRIVATE, 1) = 0
futex(0xa745a8, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, {tv_sec=1585236534, tv_nsec=628321000}, 0xffffffff) = -1 ETIMEDOUT (Connection timed out)
futex(0xa74540, FUTEX_WAKE_PRIVATE, 1) = 0
futex(0xa745ac, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, {tv_sec=1585236534, tv_nsec=633475000}, 0xffffffff) = -1 ETIMEDOUT (Connection timed out)
futex(0xa74540, FUTEX_WAKE_PRIVATE, 1) = 0
futex(0xa745a8, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, {tv_sec=1585236534, tv_nsec=638630000}, 0xffffffff) = -1 ETIMEDOUT (Connection timed out)
futex(0xa74540, FUTEX_WAKE_PRIVATE, 1) = 0
futex(0xa745ac, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, {tv_sec=1585236534, tv_nsec=643783000}, 0xffffffff) = -1 ETIMEDOUT (Connection timed out)
futex(0xa74540, FUTEX_WAKE_PRIVATE, 1) = 0
futex(0xa745ac, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, {tv_sec=1585236534, tv_nsec=648937000}, 0xffffffff) = -1 ETIMEDOUT (Connection timed out)
futex(0xa74540, FUTEX_WAKE_PRIVATE, 1) = 0
futex(0xa745a8, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, {tv_sec=1585236534, tv_nsec=654094000}, 0xffffffff) = -1 ETIMEDOUT (Connection timed out)
futex(0xa74540, FUTEX_WAKE_PRIVATE, 1) = 0
futex(0xa745a8, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, {tv_sec=1585236534, tv_nsec=659251000}, 0xffffffff) = -1 ETIMEDOUT (Connection timed out)
futex(0xa74540, FUTEX_WAKE_PRIVATE, 1) = 0
futex(0xa745ac, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, {tv_sec=1585236534, tv_nsec=664403000}, 0xffffffff) = -1 ETIMEDOUT (Connection timed out)
futex(0xa74540, FUTEX_WAKE_PRIVATE, 1) = 0
futex(0xa745a8, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, {tv_sec=1585236534, tv_nsec=669553000}, 0xffffffff) = -1 ETIMEDOUT (Connection timed out)
futex(0xa74540, FUTEX_WAKE_PRIVATE, 1) = 0
futex(0xa745ac, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, {tv_sec=1585236534, tv_nsec=674704000}, 0xffffffff) = -1 ETIMEDOUT (Connection timed out)
futex(0xa74540, FUTEX_WAKE_PRIVATE, 1) = 0
futex(0xa745a8, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, {tv_sec=1585236534, tv_nsec=679851000}, 0xffffffff) = -1 ETIMEDOUT (Connection timed out)
futex(0xa74540, FUTEX_WAKE_PRIVATE, 1) = 0
futex(0xa745ac, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, {tv_sec=1585236534, tv_nsec=684998000}, 0xffffffff) = -1 ETIMEDOUT (Connection timed out)
futex(0xa74540, FUTEX_WAKE_PRIVATE, 1) = 0
futex(0xa745ac, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, {tv_sec=1585236534, tv_nsec=690118000}, 0xffffffff) = -1 ETIMEDOUT (Connection timed out)
futex(0xa74540, FUTEX_WAKE_PRIVATE, 1) = 0
futex(0xa745a8, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, {tv_sec=1585236534, tv_nsec=695268000}, 0xffffffff) = -1 ETIMEDOUT (Connection timed out)
futex(0xa74540, FUTEX_WAKE_PRIVATE, 1) = 0
futex(0xa745ac, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, {tv_sec=1585236534, tv_nsec=700402000}, 0xffffffff) = -1 ETIMEDOUT (Connection timed out)
futex(0xa74540, FUTEX_WAKE_PRIVATE, 1) = 0
futex(0xa745ac, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, {tv_sec=1585236534, tv_nsec=705515000}, 0xffffffff) = -1 ETIMEDOUT (Connection timed out)
futex(0xa74540, FUTEX_WAKE_PRIVATE, 1) = 0
futex(0xa745a8, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, {tv_sec=1585236534, tv_nsec=710628000}, 0xffffffff) = -1 ETIMEDOUT (Connection timed out)
futex(0xa74540, FUTEX_WAKE_PRIVATE, 1) = 0
futex(0xa745ac, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, {tv_sec=1585236534, tv_nsec=715747000}, 0xffffffff) = -1 ETIMEDOUT (Connection timed out)
futex(0xa74540, FUTEX_WAKE_PRIVATE, 1) = 0
futex(0xa745a8, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, {tv_sec=1585236534, tv_nsec=720839000}, 0xffffffff) = -1 ETIMEDOUT (Connection timed out)
futex(0xa74540, FUTEX_WAKE_PRIVATE, 1) = 0
futex(0xa745ac, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, {tv_sec=1585236534, tv_nsec=725960000}, 0xffffffff) = -1 ETIMEDOUT (Connection timed out)
futex(0xa74540, FUTEX_WAKE_PRIVATE, 1) = 0
futex(0xa745a8, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, {tv_sec=1585236534, tv_nsec=731070000}, 0xffffffff) = -1 ETIMEDOUT (Connection timed out)
futex(0xa74540, FUTEX_WAKE_PRIVATE, 1) = 0
futex(0xa745ac, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, {tv_sec=1585236534, tv_nsec=736199000}, 0xffffffff) = -1 ETIMEDOUT (Connection timed out)
futex(0xa74540, FUTEX_WAKE_PRIVATE, 1) = 0
futex(0xa745a8, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, {tv_sec=1585236534, tv_nsec=741361000}, 0xffffffff) = -1 ETIMEDOUT (Connection timed out)
futex(0xa74540, FUTEX_WAKE_PRIVATE, 1) = 0
futex(0xa745ac, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, {tv_sec=1585236534, tv_nsec=746488000}, 0xffffffff) = -1 ETIMEDOUT (Connection timed out)
futex(0xa74540, FUTEX_WAKE_PRIVATE, 1) = 0
futex(0xa745a8, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, {tv_sec=1585236534, tv_nsec=751589000}, 0xffffffff) = -1 ETIMEDOUT (Connection timed out)
futex(0xa74540, FUTEX_WAKE_PRIVATE, 1) = 0
futex(0xa745ac, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, {tv_sec=1585236534, tv_nsec=756712000}, 0xffffffff) = -1 ETIMEDOUT (Connection timed out)
futex(0xa74540, FUTEX_WAKE_PRIVATE, 1) = 0
futex(0xa745a8, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, {tv_sec=1585236534, tv_nsec=761827000}, 0xffffffff) = -1 ETIMEDOUT (Connection timed out)
futex(0xa74540, FUTEX_WAKE_PRIVATE, 1) = 0
futex(0xa745ac, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, {tv_sec=1585236534, tv_nsec=766948000}, 0xffffffff) = -1 ETIMEDOUT (Connection timed out)
futex(0xa74540, FUTEX_WAKE_PRIVATE, 1) = 0
futex(0xa745a8, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, {tv_sec=1585236534, tv_nsec=772084000}, 0xffffffff) = -1 ETIMEDOUT (Connection timed out)
futex(0xa74540, FUTEX_WAKE_PRIVATE, 1) = 0
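The excerpt above is mostly timed futex waits that expire about 5 ms apart (judging by the `tv_nsec` values), alternating between two addresses. A small tally can make that pattern easier to see in a longer capture. This is a sketch only: the regular expression is written against the line shapes shown here and may need adjusting for other strace output, and the function name is hypothetical.

```python
import re
from collections import Counter

# Captures the futex address, the first operation flag, the return value,
# and the errno name when one is printed.
FUTEX_RE = re.compile(
    r"futex\((0x[0-9a-f]+), (FUTEX_\w+)[^)]*\)\s*=\s*(-?\d+)(?: (\w+))?"
)

def summarize_futex_calls(lines):
    """Count (address, operation, outcome) triples in strace output lines."""
    counts = Counter()
    for line in lines:
        match = FUTEX_RE.search(line)
        if match is None:
            continue  # not a futex call, e.g. restart_syscall
        addr, op, ret, err = match.groups()
        outcome = err if err else "ret=" + ret
        counts[(addr, op, outcome)] += 1
    return counts

sample = [
    "restart_syscall(<... resuming interrupted futex ...>) = -1 ETIMEDOUT (Connection timed out)",
    "futex(0xa74540, FUTEX_WAKE_PRIVATE, 1) = 0",
    "futex(0xa745ac, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, "
    "{tv_sec=1585236534, tv_nsec=448417000}, 0xffffffff) = -1 ETIMEDOUT (Connection timed out)",
    "futex(0xa74540, FUTEX_WAKE_PRIVATE, 1) = 0",
]
for key, n in summarize_futex_calls(sample).most_common():
    print(n, key)
```

A tally dominated by timed-out waits shows a thread that keeps waking to re-check something. On its own it does not identify which thread holds the resource, so traces from the other processes would still be needed.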

A small bug in the aishell example

In examples/asr/aishell/local/prepare_data.py, line 47 reads: if not gfile.Exists(os.path.join(dataset_dir, subset)). I think the variable dataset_dir should be replaced with audio_dir.
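The effect of checking the wrong base directory can be reproduced in isolation. The sketch below assumes a hypothetical layout in which the audio subsets live one level below the dataset root (the `wav` folder name is an assumption), and uses `os.path.exists` in place of `gfile.Exists` so that it runs without TensorFlow.

```python
import os
import tempfile

def missing_subsets(base_dir, subsets):
    """Return the subsets that have no directory under base_dir."""
    return [s for s in subsets
            if not os.path.exists(os.path.join(base_dir, s))]

subsets = ["train", "dev", "test"]
with tempfile.TemporaryDirectory() as dataset_dir:
    # Hypothetical layout: audio lives one level below the dataset root.
    audio_dir = os.path.join(dataset_dir, "wav")
    for subset in subsets:
        os.makedirs(os.path.join(audio_dir, subset))

    wrong = missing_subsets(dataset_dir, subsets)  # checks the dataset root
    right = missing_subsets(audio_dir, subsets)    # checks the audio folder

print(wrong)  # every subset is reported missing
print(right)  # nothing is reported missing
```

Under that layout, the check against `dataset_dir` reports existing subsets as missing, which is the behaviour the proposed change to `audio_dir` would avoid.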

Lots of read -1 errors dumped during aishell CTC training

Quite a lot of read errors were dumped during the training process. However, they seemed to have no impact on the training. Have you ever seen such errors? Is anything wrong with my training set?

....
[1,7]:INFO:absl:perform batch_wise_shuffle with batch_size 16
[1,2]:INFO:absl:please be patient, enable tf.function, it takes time ...
[1,0]:INFO:absl:please be patient, enable tf.function, it takes time ...
[1,7]:INFO:absl:please be patient, enable tf.function, it takes time ...
[1,7]:WARNING:absl:the length of logits is shorter than that of labels
[1,3]:WARNING:absl:the length of logits is shorter than that of labels
[1,0]:WARNING:absl:the length of logits is shorter than that of labels
[1,2]:WARNING:absl:the length of logits is shorter than that of labels
[1,5]:WARNING:absl:the length of logits is shorter than that of labels
[1,6]:WARNING:absl:the length of logits is shorter than that of labels
[1,1]:WARNING:absl:the length of logits is shorter than that of labels
[1,4]:WARNING:absl:the length of logits is shorter than that of labels
[1,7]:WARNING:absl:the length of logits is shorter than that of labels
[1,5]:WARNING:absl:the length of logits is shorter than that of labels
[1,2]:WARNING:absl:the length of logits is shorter than that of labels
[1,3]:WARNING:absl:the length of logits is shorter than that of labels
[1,6]:WARNING:absl:the length of logits is shorter than that of labels
[1,0]:WARNING:absl:the length of logits is shorter than that of labels
[1,4]:WARNING:absl:the length of logits is shorter than that of labels
[1,1]:WARNING:absl:the length of logits is shorter than that of labels
[1,0]:[373f86b536f2:00117] Read -1, expected 5393, errno = 1
[1,0]:[373f86b536f2:00117] Read -1, expected 6345, errno = 1
[1,0]:[373f86b536f2:00117] Read -1, expected 5873, errno = 1
[1,6]:[373f86b536f2:00123] Read -1, expected 5184, errno = 1
[1,4]:[373f86b536f2:00121] Read -1, expected 5184, errno = 1
[1,5]:[373f86b536f2:00122] Read -1, expected 5440, errno = 1
[1,0]:[373f86b536f2:00117] Read -1, expected 4864, errno = 1
[1,7]:[373f86b536f2:00124] Read -1, expected 4928, errno = 1
[1,4]:[373f86b536f2:00121] Read -1, expected 1002048, errno = 1
[1,7]:[373f86b536f2:00124] Read -1, expected 1002048, errno = 1
[1,0]:[373f86b536f2:00117] Read -1, expected 1002048, errno = 1
[1,1]:[373f86b536f2:00118] Read -1, expected 1002048, errno = 1
[1,5]:[373f86b536f2:00122] Read -1, expected 1002048, errno = 1
[1,6]:[373f86b536f2:00123] Read -1, expected 1002048, errno = 1
[1,3]:[373f86b536f2:00120] Read -1, expected 1002048, errno = 1
[1,2]:[373f86b536f2:00119] Read -1, expected 1002048, errno = 1
[1,7]:[373f86b536f2:00124] Read -1, expected 1002048, errno = 1
[1,4]:[373f86b536f2:00121] Read -1, expected 1002048, errno = 1
[1,5]:[373f86b536f2:00122] Read -1, expected 1002048, errno = 1
[1,6]:[373f86b536f2:00123] Read -1, expected 1002048, errno = 1
[1,0]:[373f86b536f2:00117] Read -1, expected 1002048, errno = 1
[1,3]:[373f86b536f2:00120] Read -1, expected 1002048, errno = 1
[1,1]:[373f86b536f2:00118] Read -1, expected 1002048, errno = 1
[1,2]:[373f86b536f2:00119] Read -1, expected 1002048, errno = 1
[1,3]:[373f86b536f2:00120] Read -1, expected 1002048, errno = 1
[1,4]:[373f86b536f2:00121] Read -1, expected 1002048, errno = 1
[1,7]:[373f86b536f2:00124] Read -1, expected 131072, errno = 1
[1,7]:[373f86b536f2:00124] Read -1, expected 20480, errno = 1
[1,0]:INFO:absl:global_steps: 38522 learning_rate: 2.2517e-04 loss: 1.5734 CTCAccuracy: 0.9167 Accuracy: 0.9323
[1,7]:[373f86b536f2:00124] Read -1, expected 5248, errno = 1
...
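On Linux, errno 1 is EPERM ("Operation not permitted"), so these messages report a refused operation, not a short or corrupt read. That would be consistent with a restriction in the runtime environment, not a problem with the training set, though the log alone does not prove it. A minimal sketch for decoding the errno and counting messages per rank; the regular expression is written against the line shape shown above and the function name is hypothetical.

```python
import errno
import re
from collections import Counter

READ_ERR_RE = re.compile(
    r"\[(\d+),(\d+)\]:\[[^\]]+\] Read (-?\d+), expected (\d+), errno = (\d+)"
)

def tally_read_errors(lines):
    """Count 'Read -1' messages per rank and per symbolic errno name."""
    per_rank = Counter()
    per_errno = Counter()
    for line in lines:
        match = READ_ERR_RE.search(line)
        if match is None:
            continue  # ordinary training log line
        rank = int(match.group(2))
        code = int(match.group(5))
        per_rank[rank] += 1
        # Fall back to the raw number if the code has no symbolic name.
        per_errno[errno.errorcode.get(code, str(code))] += 1
    return per_rank, per_errno

sample = [
    "[1,0]:[373f86b536f2:00117] Read -1, expected 5393, errno = 1",
    "[1,6]:[373f86b536f2:00123] Read -1, expected 5184, errno = 1",
    "[1,0]:INFO:absl:global_steps: 38522 learning_rate: 2.2517e-04 loss: 1.5734",
    "[1,0]:[373f86b536f2:00117] Read -1, expected 4864, errno = 1",
]
per_rank, per_errno = tally_read_errors(sample)
print(dict(per_rank))
print(dict(per_errno))
```

If every rank reports the same errno at a similar rate, the cause is more likely shared by all processes than tied to particular utterances.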

call() got an unexpected keyword argument 'training'

384 [1,0]:WARNING:tensorflow:Entity <bound method TensorFlowOpLayer._defun_call of <tensorflow.python.eager.function.TfMethodTarget object at 0x7fda720b4b70>> could not be transformed and will be ex ecuted as-is. Please report this to the AutoGraph team. When filing the bug, set the verbosity to 10 (on Linux, export AUTOGRAPH_VERBOSITY=10) and attach the full output. Cause: converting <bound meth od TensorFlowOpLayer._defun_call of <tensorflow.python.eager.function.TfMethodTarget object at 0x7fda720b4b70>>: AssertionError: Bad argument number for Name: 3, expecting 4
385 [1,0]:2020-04-05 01:25:48.780970: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcublas.so.10.0
386 [1,1]:WARNING:tensorflow:Entity <bound method TensorFlowOpLayer._defun_call of <tensorflow.python.eager.function.TfMethodTarget object at 0x7f28792be7f0>> could not be transformed and will be ex ecuted as-is. Please report this to the AutoGraph team. When filing the bug, set the verbosity to 10 (on Linux, export AUTOGRAPH_VERBOSITY=10) and attach the full output. Cause: converting <bound meth od TensorFlowOpLayer._defun_call of <tensorflow.python.eager.function.TfMethodTarget object at 0x7f28792be7f0>>: AssertionError: Bad argument number for Name: 3, expecting 4
387 [1,1]:WARNING:tensorflow:Entity <bound method TensorFlowOpLayer._defun_call of <tensorflow.python.eager.function.TfMethodTarget object at 0x7f28792be7f0>> could not be transformed and will be ex ecuted as-is. Please report this to the AutoGraph team. When filing the bug, set the verbosity to 10 (on Linux, export AUTOGRAPH_VERBOSITY=10) and attach the full output. Cause: converting <bound meth od TensorFlowOpLayer._defun_call of <tensorflow.python.eager.function.TfMethodTarget object at 0x7f28792be7f0>>: AssertionError: Bad argument number for Name: 3, expecting 4
388 [1,1]:2020-04-05 01:25:48.903736: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcublas.so.10.0
389 [1,1]:Traceback (most recent call last):
390 [1,1]: File "athena/horovod_main.py", line 42, in
391 [1,1]: train(json_file, HorovodSolver, hvd.size(), hvd.rank())
392 [1,1]: File "/qssd/athena/athena/main.py", line 117, in train
393 [1,1]: p, model, optimizer, checkpointer = build_model_from_jsonfile(jsonfile)
394 [1,1]: File "/qssd/athena/athena/main.py", line 105, in build_model_from_jsonfile
395 [1,1]: solver.evaluate_step(model.prepare_samples(iter(dataset).next()))
396 [1,1]: File "/qssd/athena/athena/solver.py", line 96, in evaluate_step
397 [1,1]: logits = self.model(samples, training=False)
398 [1,1]: File "/home/quinnqiu/foundation/Q_athena/lib/python3.6/site-packages/tensorflow/python/keras/engine/base_layer.py", line 712, in __call__
399 [1,1]: outputs = self.call(inputs, *args, **kwargs)
400 [1,1]: File "/qssd/athena/athena/models/mtl_seq2seq.py", line 69, in call
401 [1,1]: self.ctc_logits = self.decoder(encoder_output, training=training)
402 [1,1]: File "/home/quinnqiu/foundation/Q_athena/lib/python3.6/site-packages/tensorflow/python/keras/engine/base_layer.py", line 712, in __call__
403 [1,1]: outputs = self.call(inputs, *args, **kwargs)
404 [1,1]:TypeError: call() got an unexpected keyword argument 'training'
405 [1,0]:Traceback (most recent call last):
406 [1,0]: File "athena/horovod_main.py", line 42, in <module>
407 [1,0]: train(json_file, HorovodSolver, hvd.size(), hvd.rank())
408 [1,0]: File "/qssd/athena/athena/main.py", line 117, in train
409 [1,0]: p, model, optimizer, checkpointer = build_model_from_jsonfile(jsonfile)
410 [1,0]: File "/qssd/athena/athena/main.py", line 105, in build_model_from_jsonfile
411 [1,0]: solver.evaluate_step(model.prepare_samples(iter(dataset).next()))
412 [1,0]: File "/qssd/athena/athena/solver.py", line 96, in evaluate_step
413 [1,0]: logits = self.model(samples, training=False)
414 [1,0]: File "/home/quinnqiu/foundation/Q_athena/lib/python3.6/site-packages/tensorflow/python/keras/engine/base_layer.py", line 712, in __call__
415 [1,0]: outputs = self.call(inputs, *args, **kwargs)
416 [1,0]: File "/qssd/athena/athena/models/mtl_seq2seq.py", line 69, in call
417 [1,0]: self.ctc_logits = self.decoder(encoder_output, training=training)
418 [1,0]: File "/home/quinnqiu/foundation/Q_athena/lib/python3.6/site-packages/tensorflow/python/keras/engine/base_layer.py", line 712, in __call__
419 [1,0]: outputs = self.call(inputs, *args, **kwargs)
420 [1,0]:TypeError: call() got an unexpected keyword argument 'training'

The installation itself completed successfully. Environment:
TensorFlow 2.0.0b0
CUDA 10.0.0

error from kenlm in "pip install -r requirements.txt"

Hi, thanks for your previous suggestion. I have deleted "horovod" from the requirements.txt file,
but I then ran into another problem. The detailed error:


Building wheel for kenlm (setup.py) ... error
ERROR: Command errored out with exit status 1:
command: /home/luxy/demo/venv_athena/bin/python3 -u -c 'import sys, setuptools, tokenize; sys.argv[0] = '"'"'/tmp/pip-install-zq8e3scd/kenlm/setup.py'"'"'; file='"'"'/tmp/pip-install-zq8e3scd/kenlm/setup.py'"'"';f=getattr(tokenize, '"'"'open'"'"', open)(file);code=f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, file, '"'"'exec'"'"'))' bdist_wheel -d /tmp/pip-wheel-vn7hxo83
cwd: /tmp/pip-install-zq8e3scd/kenlm/
Complete output (12 lines):
running bdist_wheel
running build
running build_ext
building 'kenlm' extension
creating build/temp.linux-x86_64-3.6
creating build/temp.linux-x86_64-3.6/util
creating build/temp.linux-x86_64-3.6/lm
creating build/temp.linux-x86_64-3.6/util/double-conversion
creating build/temp.linux-x86_64-3.6/python
x86_64-linux-gnu-gcc -pthread -DNDEBUG -g -fwrapv -O2 -Wall -g -fstack-protector-strong -Wformat -Werror=format-security -Wdate-time -D_FORTIFY_SOURCE=2 -fPIC -I. -I/home/luxy/demo/venv_athena/include -I/usr/include/python3.6m -c util/exception.cc -o build/temp.linux-x86_64-3.6/util/exception.o -O3 -DNDEBUG -DKENLM_MAX_ORDER=6 -std=c++11
x86_64-linux-gnu-gcc: error trying to exec 'cc1plus': execvp: No such file or directory
error: command 'x86_64-linux-gnu-gcc' failed with exit status 1

ERROR: Failed building wheel for kenlm
Running setup.py clean for kenlm
Failed to build kenlm
Installing collected packages: kenlm, jieba, pytz, python-dateutil, pandas
Running setup.py install for kenlm ... error
ERROR: Command errored out with exit status 1:
command: /home/luxy/demo/venv_athena/bin/python3 -u -c 'import sys, setuptools, tokenize; sys.argv[0] = '"'"'/tmp/pip-install-zq8e3scd/kenlm/setup.py'"'"'; file='"'"'/tmp/pip-install-zq8e3scd/kenlm/setup.py'"'"';f=getattr(tokenize, '"'"'open'"'"', open)(file);code=f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, file, '"'"'exec'"'"'))' install --record /tmp/pip-record-ahtuow06/install-record.txt --single-version-externally-managed --compile --install-headers /home/luxy/demo/venv_athena/include/site/python3.6/kenlm
cwd: /tmp/pip-install-zq8e3scd/kenlm/
Complete output (12 lines):
running install
running build
running build_ext
building 'kenlm' extension
creating build/temp.linux-x86_64-3.6
creating build/temp.linux-x86_64-3.6/util
creating build/temp.linux-x86_64-3.6/lm
creating build/temp.linux-x86_64-3.6/util/double-conversion
creating build/temp.linux-x86_64-3.6/python
x86_64-linux-gnu-gcc -pthread -DNDEBUG -g -fwrapv -O2 -Wall -g -fstack-protector-strong -Wformat -Werror=format-security -Wdate-time -D_FORTIFY_SOURCE=2 -fPIC -I. -I/home/luxy/demo/venv_athena/include -I/usr/include/python3.6m -c util/exception.cc -o build/temp.linux-x86_64-3.6/util/exception.o -O3 -DNDEBUG -DKENLM_MAX_ORDER=6 -std=c++11
x86_64-linux-gnu-gcc: error trying to exec 'cc1plus': execvp: No such file or directory
error: command 'x86_64-linux-gnu-gcc' failed with exit status 1
----------------------------------------
ERROR: Command errored out with exit status 1: /home/luxy/demo/venv_athena/bin/python3 -u -c 'import sys, setuptools, tokenize; sys.argv[0] = '"'"'/tmp/pip-install-zq8e3scd/kenlm/setup.py'"'"'; file='"'"'/tmp/pip-install-zq8e3scd/kenlm/setup.py'"'"';f=getattr(tokenize, '"'"'open'"'"', open)(file);code=f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, file, '"'"'exec'"'"'))' install --record /tmp/pip-record-ahtuow06/install-record.txt --single-version-externally-managed --compile --install-headers /home/luxy/demo/venv_athena/include/site/python3.6/kenlm Check the logs for full command output.


How can I solve it? Thank you very much.
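
The build above stops at `error trying to exec 'cc1plus'`, which usually means the C++ front end of GCC is not installed (on Debian and Ubuntu it ships with the `g++` package). The following is a minimal sketch for checking this before running pip. The helper names are hypothetical, and it assumes that `gcc -print-prog-name=cc1plus` prints a path when the program is found and echoes the bare name `cc1plus` when it is not.

```python
import os
import subprocess


def cc1plus_found(prog_name_output):
    """Interpret the output of `gcc -print-prog-name=cc1plus`.

    A result with a directory component means gcc located the C++
    front end; a bare "cc1plus" means it could not find it.
    """
    path = prog_name_output.strip()
    return os.path.dirname(path) != ""


def query_gcc(gcc="gcc"):
    """Ask the given gcc driver where its C++ front end lives."""
    result = subprocess.run(
        [gcc, "-print-prog-name=cc1plus"],
        stdout=subprocess.PIPE,
        universal_newlines=True,
        check=True,
    )
    return result.stdout
```

If `cc1plus_found(query_gcc())` is false, installing the distribution's C++ compiler package (for example `g++` or `build-essential` on Ubuntu) and rerunning `pip install -r requirements.txt` is the usual remedy.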

asr decoding is too slow.

The current decoding is much too slow: it takes almost 5 seconds to decode a single utterance. I found that there is an ongoing pull request for decoding optimization. When will it be merged into the master branch?

integrate https://github.com/athena-team/athena-decoder

Start a new branch to integrate athena-decoder into the athena project. @godjealous

  1. Currently, we can consider athena-decoder a separate project and install it using pip.
  2. In athena, we only use the interface provided by athena-decoder.
  3. Evaluate the decoder from two aspects: the speed and the CER.

In the future, we may update athena-decoder and athena simultaneously.
@cookingbear please help hanyang to support this. Thanks.

thchs30 decode error

When I run thchs30 (following the aishell example), I get an error during decoding, although the fine-tuning and language model training steps succeed.
Error message:

 ERROR:tensorflow:
Object was never used (type <class 'tensorflow.python.ops.tensor_array_ops.TensorArray'>):
<tensorflow.python.ops.tensor_array_ops.TensorArray object at 0x7f77941ae910>
If you want to mark it as used call its "mark_used()" method.
It was originally created here:
  File "athena/decode_main.py", line 89, in <module>
    decode(jsonfile, n=5, log_file='nohup_thchs30.out')
  File "athena/decode_main.py", line 73, in decode
    solver.decode(dataset_builder.as_dataset(batch_size=1))
  File "/raid/BH/mitom/athena/athena/solver.py", line 149, in decode
    predictions = self.model.decode(samples, self.hparams, lm_model=self.lm_model)
  File "/raid/BH/mitom/athena/athena/models/mtl_seq2seq.py", line 109, in decode
    history_predictions.write(0, last_predictions)
  File "/home/bh/anaconda3/lib/python3.7/site-packages/tensorflow_core/python/util/tf_should_use.py", line 237, in wrapped
    error_in_function=error_in_function)

TIMIT open wav: wave.Error: file does not start with RIFF id

When I run prepare_data.py for TIMIT, the function get_wave_file_length(wav_file) raises the error below:
Traceback (most recent call last):
  File " - examples/asr/timit/local/prepare_data.py", line 116, in <module>
    processor(DATASET_DIR, SUBSET, True, OUTPUT_DIR)
  File " - examples/asr/timit/local/prepare_data.py", line 100, in processor
    convert_audio_and_split_transcript(dataset_dir, subset, subset_csv)
  File "- /examples/asr/timit/local/prepare_data.py", line 59, in convert_audio_and_split_transcript
    files_size_dict[wav_file] = get_wave_file_length(wav_file)
  File "- /athena/utils/misc.py", line 103, in get_wave_file_length
    with wave.open(wave_file) as wav_file:
  File "/usr/lib/python3.5/wave.py", line 499, in open
    return Wave_read(f)
  File "/usr/lib/python3.5/wave.py", line 163, in __init__
    self.initfp(f)
  File "/usr/lib/python3.5/wave.py", line 130, in initfp
    raise Error('file does not start with RIFF id')
wave.Error: file does not start with RIFF id

Does it not support RIFF?
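
Python's `wave` module reads only RIFF/WAVE files, so this error means the first four bytes of the file are not `RIFF`. The original TIMIT distribution stores audio as NIST SPHERE files that carry a `.wav` extension and begin with `NIST_1A`, which is one likely cause here (an assumption, since the files themselves are not shown). A minimal sketch for checking the header follows; `classify_header` and `sniff_audio_file` are hypothetical helpers, not part of Athena.

```python
def classify_header(magic):
    """Classify an audio file from its first 12 bytes."""
    if magic[:4] == b"RIFF" and magic[8:12] == b"WAVE":
        return "riff-wave"      # readable by the standard wave module
    if magic[:7] == b"NIST_1A":
        return "nist-sphere"    # needs conversion before wave.open works
    return "unknown"


def sniff_audio_file(path):
    """Read the leading bytes of `path` and classify them."""
    with open(path, "rb") as f:
        return classify_header(f.read(12))
```

If the files turn out to be SPHERE, converting them to RIFF WAV first (for example with a tool such as sph2pipe or SoX) should let `wave.open` read them.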

install error: Running setup.py install for horovod ... error

Hi,
My install environment:
VMware 15.0, Ubuntu 18.04, Python 3.6.9

Error:
Running setup.py install for horovod ... error
ERROR: Command errored out with exit status 1:
command: /home/luxury/luxy/venv_athena/bin/python3 -u -c 'import sys, setuptools, tokenize; sys.argv[0] = '"'"'/tmp/pip-install-s_f3vxnr/horovod/setup.py'"'"'; file='"'"'/tmp/pip-install-s_f3vxnr/horovod/setup.py'"'"';f=getattr(tokenize, '"'"'open'"'"', open)(file);code=f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, file, '"'"'exec'"'"'))' install --record /tmp/pip-record-uelars4k/install-record.txt --single-version-externally-managed --compile --install-headers /home/luxury/luxy/venv_athena/include/site/python3.6/horovod
cwd: /tmp/pip-install-s_f3vxnr/horovod/
Complete output (190 lines):
running install
running build
running build_py
creating build
creating build/lib.linux-x86_64-3.6
creating build/lib.linux-x86_64-3.6/horovod
copying horovod/init.py -> build/lib.linux-x86_64-3.6/horovod
creating build/lib.linux-x86_64-3.6/horovod/keras
copying horovod/keras/callbacks.py -> build/lib.linux-x86_64-3.6/horovod/keras
copying horovod/keras/init.py -> build/lib.linux-x86_64-3.6/horovod/keras
creating build/lib.linux-x86_64-3.6/horovod/tensorflow
copying horovod/tensorflow/util.py -> build/lib.linux-x86_64-3.6/horovod/tensorflow
copying horovod/tensorflow/mpi_ops.py -> build/lib.linux-x86_64-3.6/horovod/tensorflow
copying horovod/tensorflow/init.py -> build/lib.linux-x86_64-3.6/horovod/tensorflow
copying horovod/tensorflow/compression.py -> build/lib.linux-x86_64-3.6/horovod/tensorflow
creating build/lib.linux-x86_64-3.6/horovod/run
copying horovod/run/run_task.py -> build/lib.linux-x86_64-3.6/horovod/run
copying horovod/run/gloo_run.py -> build/lib.linux-x86_64-3.6/horovod/run
copying horovod/run/init.py -> build/lib.linux-x86_64-3.6/horovod/run
copying horovod/run/run.py -> build/lib.linux-x86_64-3.6/horovod/run
copying horovod/run/mpi_run.py -> build/lib.linux-x86_64-3.6/horovod/run
copying horovod/run/task_fn.py -> build/lib.linux-x86_64-3.6/horovod/run
creating build/lib.linux-x86_64-3.6/horovod/spark
copying horovod/spark/init.py -> build/lib.linux-x86_64-3.6/horovod/spark
creating build/lib.linux-x86_64-3.6/horovod/common
copying horovod/common/util.py -> build/lib.linux-x86_64-3.6/horovod/common
copying horovod/common/init.py -> build/lib.linux-x86_64-3.6/horovod/common
copying horovod/common/basics.py -> build/lib.linux-x86_64-3.6/horovod/common
creating build/lib.linux-x86_64-3.6/horovod/mxnet
copying horovod/mxnet/mpi_ops.py -> build/lib.linux-x86_64-3.6/horovod/mxnet
copying horovod/mxnet/init.py -> build/lib.linux-x86_64-3.6/horovod/mxnet
creating build/lib.linux-x86_64-3.6/horovod/_keras
copying horovod/_keras/callbacks.py -> build/lib.linux-x86_64-3.6/horovod/_keras
copying horovod/_keras/init.py -> build/lib.linux-x86_64-3.6/horovod/_keras
creating build/lib.linux-x86_64-3.6/horovod/torch
copying horovod/torch/mpi_ops.py -> build/lib.linux-x86_64-3.6/horovod/torch
copying horovod/torch/init.py -> build/lib.linux-x86_64-3.6/horovod/torch
copying horovod/torch/compression.py -> build/lib.linux-x86_64-3.6/horovod/torch
creating build/lib.linux-x86_64-3.6/horovod/tensorflow/keras
copying horovod/tensorflow/keras/callbacks.py -> build/lib.linux-x86_64-3.6/horovod/tensorflow/keras
copying horovod/tensorflow/keras/init.py -> build/lib.linux-x86_64-3.6/horovod/tensorflow/keras
creating build/lib.linux-x86_64-3.6/horovod/run/task
copying horovod/run/task/task_service.py -> build/lib.linux-x86_64-3.6/horovod/run/task
copying horovod/run/task/init.py -> build/lib.linux-x86_64-3.6/horovod/run/task
creating build/lib.linux-x86_64-3.6/horovod/run/http
copying horovod/run/http/http_client.py -> build/lib.linux-x86_64-3.6/horovod/run/http
copying horovod/run/http/init.py -> build/lib.linux-x86_64-3.6/horovod/run/http
copying horovod/run/http/http_server.py -> build/lib.linux-x86_64-3.6/horovod/run/http
creating build/lib.linux-x86_64-3.6/horovod/run/common
copying horovod/run/common/init.py -> build/lib.linux-x86_64-3.6/horovod/run/common
creating build/lib.linux-x86_64-3.6/horovod/run/util
copying horovod/run/util/network.py -> build/lib.linux-x86_64-3.6/horovod/run/util
copying horovod/run/util/init.py -> build/lib.linux-x86_64-3.6/horovod/run/util
copying horovod/run/util/cache.py -> build/lib.linux-x86_64-3.6/horovod/run/util
copying horovod/run/util/threads.py -> build/lib.linux-x86_64-3.6/horovod/run/util
creating build/lib.linux-x86_64-3.6/horovod/run/driver
copying horovod/run/driver/driver_service.py -> build/lib.linux-x86_64-3.6/horovod/run/driver
copying horovod/run/driver/init.py -> build/lib.linux-x86_64-3.6/horovod/run/driver
creating build/lib.linux-x86_64-3.6/horovod/run/common/util
copying horovod/run/common/util/timeout.py -> build/lib.linux-x86_64-3.6/horovod/run/common/util
copying horovod/run/common/util/config_parser.py -> build/lib.linux-x86_64-3.6/horovod/run/common/util
copying horovod/run/common/util/secret.py -> build/lib.linux-x86_64-3.6/horovod/run/common/util
copying horovod/run/common/util/network.py -> build/lib.linux-x86_64-3.6/horovod/run/common/util
copying horovod/run/common/util/safe_shell_exec.py -> build/lib.linux-x86_64-3.6/horovod/run/common/util
copying horovod/run/common/util/settings.py -> build/lib.linux-x86_64-3.6/horovod/run/common/util
copying horovod/run/common/util/codec.py -> build/lib.linux-x86_64-3.6/horovod/run/common/util
copying horovod/run/common/util/init.py -> build/lib.linux-x86_64-3.6/horovod/run/common/util
copying horovod/run/common/util/host_hash.py -> build/lib.linux-x86_64-3.6/horovod/run/common/util
copying horovod/run/common/util/env.py -> build/lib.linux-x86_64-3.6/horovod/run/common/util
creating build/lib.linux-x86_64-3.6/horovod/run/common/service
copying horovod/run/common/service/task_service.py -> build/lib.linux-x86_64-3.6/horovod/run/common/service
copying horovod/run/common/service/driver_service.py -> build/lib.linux-x86_64-3.6/horovod/run/common/service
copying horovod/run/common/service/init.py -> build/lib.linux-x86_64-3.6/horovod/run/common/service
creating build/lib.linux-x86_64-3.6/horovod/spark/task
copying horovod/spark/task/task_info.py -> build/lib.linux-x86_64-3.6/horovod/spark/task
copying horovod/spark/task/task_service.py -> build/lib.linux-x86_64-3.6/horovod/spark/task
copying horovod/spark/task/mpirun_exec_fn.py -> build/lib.linux-x86_64-3.6/horovod/spark/task
copying horovod/spark/task/init.py -> build/lib.linux-x86_64-3.6/horovod/spark/task
creating build/lib.linux-x86_64-3.6/horovod/spark/keras
copying horovod/spark/keras/estimator.py -> build/lib.linux-x86_64-3.6/horovod/spark/keras
copying horovod/spark/keras/tensorflow.py -> build/lib.linux-x86_64-3.6/horovod/spark/keras
copying horovod/spark/keras/util.py -> build/lib.linux-x86_64-3.6/horovod/spark/keras
copying horovod/spark/keras/bare.py -> build/lib.linux-x86_64-3.6/horovod/spark/keras
copying horovod/spark/keras/init.py -> build/lib.linux-x86_64-3.6/horovod/spark/keras
copying horovod/spark/keras/remote.py -> build/lib.linux-x86_64-3.6/horovod/spark/keras
copying horovod/spark/keras/optimizer.py -> build/lib.linux-x86_64-3.6/horovod/spark/keras
creating build/lib.linux-x86_64-3.6/horovod/spark/common
copying horovod/spark/common/estimator.py -> build/lib.linux-x86_64-3.6/horovod/spark/common
copying horovod/spark/common/util.py -> build/lib.linux-x86_64-3.6/horovod/spark/common
copying horovod/spark/common/store.py -> build/lib.linux-x86_64-3.6/horovod/spark/common
copying horovod/spark/common/_namedtuple_fix.py -> build/lib.linux-x86_64-3.6/horovod/spark/common
copying horovod/spark/common/constants.py -> build/lib.linux-x86_64-3.6/horovod/spark/common
copying horovod/spark/common/params.py -> build/lib.linux-x86_64-3.6/horovod/spark/common
copying horovod/spark/common/init.py -> build/lib.linux-x86_64-3.6/horovod/spark/common
copying horovod/spark/common/cache.py -> build/lib.linux-x86_64-3.6/horovod/spark/common
copying horovod/spark/common/serialization.py -> build/lib.linux-x86_64-3.6/horovod/spark/common
copying horovod/spark/common/backend.py -> build/lib.linux-x86_64-3.6/horovod/spark/common
creating build/lib.linux-x86_64-3.6/horovod/spark/driver
copying horovod/spark/driver/driver_service.py -> build/lib.linux-x86_64-3.6/horovod/spark/driver
copying horovod/spark/driver/job_id.py -> build/lib.linux-x86_64-3.6/horovod/spark/driver
copying horovod/spark/driver/init.py -> build/lib.linux-x86_64-3.6/horovod/spark/driver
copying horovod/spark/driver/mpirun_rsh.py -> build/lib.linux-x86_64-3.6/horovod/spark/driver
creating build/lib.linux-x86_64-3.6/horovod/spark/torch
copying horovod/spark/torch/estimator.py -> build/lib.linux-x86_64-3.6/horovod/spark/torch
copying horovod/spark/torch/util.py -> build/lib.linux-x86_64-3.6/horovod/spark/torch
copying horovod/spark/torch/init.py -> build/lib.linux-x86_64-3.6/horovod/spark/torch
copying horovod/spark/torch/remote.py -> build/lib.linux-x86_64-3.6/horovod/spark/torch
creating build/lib.linux-x86_64-3.6/horovod/torch/mpi_lib
copying horovod/torch/mpi_lib/init.py -> build/lib.linux-x86_64-3.6/horovod/torch/mpi_lib
creating build/lib.linux-x86_64-3.6/horovod/torch/mpi_lib_impl
copying horovod/torch/mpi_lib_impl/init.py -> build/lib.linux-x86_64-3.6/horovod/torch/mpi_lib_impl
running build_ext
x86_64-linux-gnu-gcc -pthread -DNDEBUG -g -fwrapv -O2 -Wall -g -fstack-protector-strong -Wformat -Werror=format-security -Wdate-time -D_FORTIFY_SOURCE=2 -fPIC -std=c++11 -fPIC -O2 -Wall -fassociative-math -ffast-math -ftree-vectorize -funsafe-math-optimizations -mf16c -mavx -mfma -I/home/luxury/luxy/venv_athena/include -I/usr/include/python3.6m -c build/temp.linux-x86_64-3.6/test_compile/test_cpp_flags.cc -o build/temp.linux-x86_64-3.6/test_compile/test_cpp_flags.o
x86_64-linux-gnu-gcc -pthread -shared -Wl,-O1 -Wl,-Bsymbolic-functions -Wl,-Bsymbolic-functions -Wl,-z,relro -Wl,-Bsymbolic-functions -Wl,-z,relro -g -fstack-protector-strong -Wformat -Werror=format-security -Wdate-time -D_FORTIFY_SOURCE=2 build/temp.linux-x86_64-3.6/test_compile/test_cpp_flags.o -o build/temp.linux-x86_64-3.6/test_compile/test_cpp_flags.so
x86_64-linux-gnu-gcc -pthread -DNDEBUG -g -fwrapv -O2 -Wall -g -fstack-protector-strong -Wformat -Werror=format-security -Wdate-time -D_FORTIFY_SOURCE=2 -fPIC -I/home/luxury/luxy/venv_athena/include -I/usr/include/python3.6m -c build/temp.linux-x86_64-3.6/test_compile/test_link_flags.cc -o build/temp.linux-x86_64-3.6/test_compile/test_link_flags.o
x86_64-linux-gnu-gcc -pthread -shared -Wl,-O1 -Wl,-Bsymbolic-functions -Wl,-Bsymbolic-functions -Wl,-z,relro -Wl,-Bsymbolic-functions -Wl,-z,relro -g -fstack-protector-strong -Wformat -Werror=format-security -Wdate-time -D_FORTIFY_SOURCE=2 -Wl,--version-script=horovod.lds build/temp.linux-x86_64-3.6/test_compile/test_link_flags.o -o build/temp.linux-x86_64-3.6/test_compile/test_link_flags.so
INFO: Cannot find CMake, will skip compiling Horovod with Gloo.
Traceback (most recent call last):
  File "/tmp/pip-install-s_f3vxnr/horovod/setup.py", line 341, in get_mpi_flags
    shlex.split(show_command), universal_newlines=True).strip()
  File "/usr/lib/python3.6/subprocess.py", line 356, in check_output
    **kwargs).stdout
  File "/usr/lib/python3.6/subprocess.py", line 423, in run
    with Popen(*popenargs, **kwargs) as process:
  File "/usr/lib/python3.6/subprocess.py", line 729, in __init__
    restore_signals, start_new_session)
  File "/usr/lib/python3.6/subprocess.py", line 1364, in _execute_child
    raise child_exception_type(errno_num, err_msg, err_filename)
FileNotFoundError: [Errno 2] No such file or directory: 'mpicxx': 'mpicxx'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/tmp/pip-install-s_f3vxnr/horovod/setup.py", line 622, in get_common_options
    mpi_flags = get_mpi_flags()
  File "/tmp/pip-install-s_f3vxnr/horovod/setup.py", line 354, in get_mpi_flags
    '%s' % (show_command, traceback.format_exc()))
distutils.errors.DistutilsPlatformError: mpicxx -show failed (see error below), is MPI in $PATH?
Note: If your version of MPI has a custom command to show compilation flags, please specify it with the HOROVOD_MPICXX_SHOW environment variable.

Traceback (most recent call last):
  File "/tmp/pip-install-s_f3vxnr/horovod/setup.py", line 341, in get_mpi_flags
    shlex.split(show_command), universal_newlines=True).strip()
  File "/usr/lib/python3.6/subprocess.py", line 356, in check_output
    **kwargs).stdout
  File "/usr/lib/python3.6/subprocess.py", line 423, in run
    with Popen(*popenargs, **kwargs) as process:
  File "/usr/lib/python3.6/subprocess.py", line 729, in __init__
    restore_signals, start_new_session)
  File "/usr/lib/python3.6/subprocess.py", line 1364, in _execute_child
    raise child_exception_type(errno_num, err_msg, err_filename)
FileNotFoundError: [Errno 2] No such file or directory: 'mpicxx': 'mpicxx'


INFO: Cannot find MPI compilation flags, will skip compiling with MPI.
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/tmp/pip-install-s_f3vxnr/horovod/setup.py", line 1566, in <module>
    scripts=['bin/horovodrun'])
  File "/home/luxury/luxy/venv_athena/lib/python3.6/site-packages/setuptools/__init__.py", line 144, in setup
    return distutils.core.setup(**attrs)
  File "/usr/lib/python3.6/distutils/core.py", line 148, in setup
    dist.run_commands()
  File "/usr/lib/python3.6/distutils/dist.py", line 955, in run_commands
    self.run_command(cmd)
  File "/usr/lib/python3.6/distutils/dist.py", line 974, in run_command
    cmd_obj.run()
  File "/home/luxury/luxy/venv_athena/lib/python3.6/site-packages/setuptools/command/install.py", line 61, in run
    return orig.install.run(self)
  File "/usr/lib/python3.6/distutils/command/install.py", line 589, in run
    self.run_command('build')
  File "/usr/lib/python3.6/distutils/cmd.py", line 313, in run_command
    self.distribution.run_command(command)
  File "/usr/lib/python3.6/distutils/dist.py", line 974, in run_command
    cmd_obj.run()
  File "/usr/lib/python3.6/distutils/command/build.py", line 135, in run
    self.run_command(cmd_name)
  File "/usr/lib/python3.6/distutils/cmd.py", line 313, in run_command
    self.distribution.run_command(command)
  File "/usr/lib/python3.6/distutils/dist.py", line 974, in run_command
    cmd_obj.run()
  File "/home/luxury/luxy/venv_athena/lib/python3.6/site-packages/setuptools/command/build_ext.py", line 87, in run
    _build_ext.run(self)
  File "/usr/lib/python3.6/distutils/command/build_ext.py", line 339, in run
    self.build_extensions()
  File "/tmp/pip-install-s_f3vxnr/horovod/setup.py", line 1457, in build_extensions
    options = get_common_options(self)
  File "/tmp/pip-install-s_f3vxnr/horovod/setup.py", line 635, in get_common_options
    raise RuntimeError('One of Gloo or MPI are required for Horovod to run. Check the logs above for more info.')
RuntimeError: One of Gloo or MPI are required for Horovod to run. Check the logs above for more info.
----------------------------------------
ERROR: Command errored out with exit status 1: /home/luxury/luxy/venv_athena/bin/python3 -u -c 'import sys, setuptools, tokenize; sys.argv[0] = '"'"'/tmp/pip-install-s_f3vxnr/horovod/setup.py'"'"'; file='"'"'/tmp/pip-install-s_f3vxnr/horovod/setup.py'"'"';f=getattr(tokenize, '"'"'open'"'"', open)(file);code=f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, file, '"'"'exec'"'"'))' install --record /tmp/pip-record-uelars4k/install-record.txt --single-version-externally-managed --compile --install-headers /home/luxury/luxy/venv_athena/include/site/python3.6/horovod Check the logs for full command output.

How to solve it?
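
According to the log above, the build skipped Gloo because CMake was not found and skipped MPI because `mpicxx` was not on `PATH`, and Horovod requires at least one of the two. The following is a small preflight sketch that reports which backend could be built. The function name and the injectable `which` parameter are illustrative assumptions, and the tool-to-backend mapping is taken only from the messages in this log.

```python
import shutil


def buildable_horovod_backends(which=shutil.which):
    """Return the Horovod backends whose build tool is on PATH.

    `which` defaults to shutil.which and can be replaced for testing.
    """
    backends = []
    if which("cmake"):      # the log skips Gloo when CMake is missing
        backends.append("gloo")
    if which("mpicxx"):     # the log skips MPI when mpicxx is missing
        backends.append("mpi")
    return backends
```

An empty result corresponds to the RuntimeError shown above. Installing CMake, or an MPI implementation that provides `mpicxx` (such as Open MPI), before retrying the install should allow at least one backend to build.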

python script encoding suggestion

The Python script examples/asr/aishell/local/prepare_data.py
fails to run in a Chinese-encoded environment.
I suggest adding the following lines at the top of the script to avoid this problem:

 #!/usr/bin/python
 # -*- coding: utf-8 -*-

CMVN values get very large, and the loss of MPC is NaN

Hi, after using our own data (10000 h) to calculate the CMVN, I find that the var is quite large, and the loss of the MPC stage is NaN from the very beginning. Does anyone have any idea?

speaker mean var
global [41.53281, 50.375763, 53.979485, 55.0042, 55.01829, 55.294025, 55.537567, 55.66456, 55.297874, 54.5779, 54.301113, 54.04989, 53.69498, 53.33427, 53.045273, 52.622414, 52.82048, 53.421753, 53.865902, 53.84243, 53.45809, 52.944252, 53.01819, 53.3373, 53.863102, 54.33686, 54.909252, 55.299534, 55.040947, 54.846294, 54.568024, 53.713165, 52.363804, 49.36468, 45.567574, 45.262226, 44.527786, 45.601566, 45.84235, 45.581955] [-1176.7283, -1758.992, -2028.2766, -2111.5051, -2113.496, -2133.8308, -2152.609, -2164.494, -2136.68, -2079.5215, -2057.8743, -2038.9774, -2011.5557, -1983.8772, -1962.4875, -1930.8735, -1947.2842, -1993.6705, -2028.1543, -2025.6558, -1996.1725, -1956.0061, -1962.5137, -1986.1327, -2026.9241, -2065.188, -2109.8545, -2142.8325, -2122.1377, -2107.376, -2086.5679, -2019.9905, -1916.543, -1694.3958, -1440.5925, -1426.0725, -1364.4426, -1428.7397, -1445.1039, -1433.9104]
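
A variance can never be negative, so the negative values in the second list point to a numerical problem in how the statistics were accumulated rather than to the data itself. Taking the square root of a negative variance during normalization would also produce NaN. One possible cause, which is an assumption because the accumulation code is not shown, is precision loss when a squared mean is subtracted from a very large sum of squares. The following is a minimal sketch of Welford's algorithm, which avoids that subtraction; `RunningStats` is a hypothetical name, not an Athena class.

```python
class RunningStats:
    """Per-dimension running mean and variance (Welford's algorithm)."""

    def __init__(self, dim):
        self.count = 0
        self.mean = [0.0] * dim
        self.m2 = [0.0] * dim   # sum of squared deviations from the mean

    def update(self, frame):
        """Fold one feature frame (a sequence of `dim` floats) into the stats."""
        self.count += 1
        for i, x in enumerate(frame):
            delta = x - self.mean[i]
            self.mean[i] += delta / self.count
            # Uses the already-updated mean; this product is never negative.
            self.m2[i] += delta * (x - self.mean[i])

    def variance(self):
        """Population variance per dimension."""
        if self.count == 0:
            raise ValueError("no frames have been added")
        return [m2 / self.count for m2 in self.m2]
```

Because each increment to `m2` is a product of two terms with the same sign, the resulting variance stays non-negative regardless of how many frames are accumulated.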

how to speed up cmvn computation


It seems the CMVN computation stage runs as a single process. It takes days for our dataset, which is around 17000 hours with clips of mostly 60 seconds each.

Is it possible to run it in parallel?
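
Mean and variance can be computed from per-shard partial sums that are added together at the end, so the work parallelizes naturally. The following is a minimal sketch of that idea, not Athena's implementation. All function names are hypothetical, and the shards are assumed to be in-memory lists of feature frames (a real pipeline would more likely pass lists of file paths and extract features inside the worker).

```python
from concurrent.futures import ProcessPoolExecutor


def partial_stats(frames):
    """Return (count, per-dim sum, per-dim sum of squares) for one shard."""
    count, total, total_sq = 0, [], []
    for frame in frames:
        if count == 0:
            total = [0.0] * len(frame)
            total_sq = [0.0] * len(frame)
        count += 1
        for i, x in enumerate(frame):
            total[i] += x
            total_sq[i] += x * x
    return count, total, total_sq


def merge_stats(parts):
    """Add up the partial results from all shards."""
    count, total, total_sq = 0, [], []
    for n, s, sq in parts:
        if n == 0:
            continue            # skip empty shards
        if count == 0:
            total, total_sq = list(s), list(sq)
        else:
            total = [a + b for a, b in zip(total, s)]
            total_sq = [a + b for a, b in zip(total_sq, sq)]
        count += n
    return count, total, total_sq


def finalize(count, total, total_sq):
    """Turn merged sums into per-dimension mean and variance."""
    if count == 0:
        raise ValueError("no frames were processed")
    mean = [s / count for s in total]
    # Clamp at zero to guard against small negative rounding errors.
    var = [max(sq / count - m * m, 0.0) for sq, m in zip(total_sq, mean)]
    return mean, var


def compute_parallel(shards, workers=4):
    """Process shards in separate worker processes and merge the results."""
    with ProcessPoolExecutor(max_workers=workers) as pool:
        parts = list(pool.map(partial_stats, shards))
    return finalize(*merge_stats(parts))
```

Python floats are double precision, which keeps the sums of squares accurate at this scale; an implementation that accumulates in single precision would need more care.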

Error when running on a pure CPU machine

Traceback (most recent call last):
  File "athena/main.py", line 171, in <module>
    BaseSolver.initialize_devices(p.solver_gpu)
  File "/media/runyu/D/works/algorithm/ASR/athena/athena/solver.py", line 54, in initialize_devices
    assert len(gpus) > len(visible_gpu_idx)
AssertionError

The error occurs in the following code:
    @staticmethod
    def initialize_devices(visible_gpu_idx=None):
        """ initialize hvd devices, should be called firstly """
        gpus = tf.config.experimental.list_physical_devices("GPU")
        for gpu in gpus:
            tf.config.experimental.set_memory_growth(gpu, True)
        if gpus is not None:
            assert len(gpus) > len(visible_gpu_idx)
            for idx in visible_gpu_idx:
                tf.config.experimental.set_visible_devices(gpus[idx], "GPU")

The reason is that the value of gpus is "[]", not "None", when running on a pure CPU machine,
so "assert len(gpus) > len(visible_gpu_idx)" is still executed, but visible_gpu_idx is "None" and has no len() at this point.
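
One way to avoid this failure is to separate the device selection from the TensorFlow calls. Note that the traceback shows an AssertionError, which suggests `visible_gpu_idx` was a list on this run (calling `len(None)` would raise a TypeError instead); the sketch handles both cases. This is a minimal sketch, not Athena's actual fix: `select_visible_gpus` is a hypothetical helper, and falling back to CPU when no GPU is present is an assumed policy.

```python
def select_visible_gpus(gpus, visible_gpu_idx=None):
    """Pick which physical GPUs to expose (hypothetical helper).

    gpus: list of physical GPU devices, empty on CPU-only machines
    visible_gpu_idx: indices requested in the config, or None
    """
    if not gpus:
        # CPU-only machine: nothing to select, whatever the config says.
        return []
    if visible_gpu_idx is None:
        # No explicit request: keep every GPU visible.
        return list(gpus)
    for idx in visible_gpu_idx:
        if not 0 <= idx < len(gpus):
            raise ValueError(
                "GPU index %d out of range for %d GPUs" % (idx, len(gpus))
            )
    return [gpus[idx] for idx in visible_gpu_idx]
```

When the returned list is non-empty, it could be passed in a single call to `tf.config.experimental.set_visible_devices(selected, "GPU")`, which accepts a list of devices.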

horovod.tensorflow

When I typed import horovod.tensorflow, an error like the following occurred:
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/adddisk/zhangjin/projects/horovod/horovod/tensorflow/__init__.py", line 25, in <module>
    check_extension('horovod.tensorflow', 'HOROVOD_WITH_TENSORFLOW', __file__, 'mpi_lib')
  File "/adddisk/zhangjin/projects/horovod/horovod/common/util.py", line 51, in check_extension
    'Horovod with %s=1 to debug the build error.' % (ext_name, ext_env_var))
ImportError: Extension horovod.tensorflow has not been built. If this is not expected, reinstall Horovod with HOROVOD_WITH_TENSORFLOW=1 to debug the build error.

docker

Building some docker environment

Memory usage optimization

I had a discussion with @tjadamlee, and we think the GPU memory usage of our default examples is way too high, which is blocking a lot of users. We should tune the parameters (e.g., batch size) as well as the model structures to keep the memory usage under 8 GB for the default example setups.

@Some-random could you please take the lead and cut all examples' memory usage to under 8 GB, while preserving performance as much as possible?
