Comments (11)
(1) Are you using a Volta GPU? (Only Volta has the Tensor Cores used in mixed precision.)
(2) Which versions of CUDA and TensorFlow do you have?
from openseq2seq.
1) Yes. We use V100; all the numbers above were measured on V100.
2) CUDA 9.0. We tried both TensorFlow 1.8 and 1.9; there was no difference.
When we use the default Transformer-based model (https://github.com/tensorflow/tensor2tensor), we get the following speed:
4.3 global_steps/s with 4 V100 GPUs.
With OpenSeq2Seq (default configuration), for the same base model we get:
1 step / 13 s with 2 V100 GPUs in mixed mode (GPU utilization: 1%),
1 step / 0.33 s with 2 V100 GPUs in fp32 mode (GPU utilization: 90%), and
1 step / 19 s with 2 V100 GPUs in fp16 mode (GPU utilization: 1%).
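Converting the reported timings to a common unit makes the gap explicit. A quick sketch, using only the figures reported above:

```python
# Convert the reported timings to steps/sec to compare modes directly.
# Figures are the ones measured above (2x V100, OpenSeq2Seq defaults).
timings = {"mixed": 13.0, "fp32": 0.33, "fp16": 19.0}  # seconds per step

rates = {mode: 1.0 / s for mode, s in timings.items()}   # steps per second
slowdown = timings["mixed"] / timings["fp32"]            # mixed vs. fp32

print({mode: round(r, 2) for mode, r in rates.items()})  # fp32 ~3.03 steps/s
print(round(slowdown, 1))                                # mixed is ~39.4x slower
```

So mixed mode here is not merely failing to help; it is roughly 39x slower than fp32, which points at something structurally wrong (work falling off the GPU) rather than a tuning issue.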
A GPU utilization of 1% means that most of the work falls on the CPU. This happens because public TensorFlow + CUDA 9.0 does not have batched GEMM in float16 integrated.
This is why we require CUDA 9.1 (see https://nvidia.github.io/OpenSeq2Seq/html/mixed-precision.html) and TF built with this PR included.
I would recommend just using NVIDIA's TensorFlow container (18.07-py3), which you can get for free here: https://ngc.nvidia.com/registry/nvidia-tensorflow . It contains cuBLAS, CUDA, cuDNN, and a TF version tested to work nicely together, plus occasionally some GPU improvements that aren't in TF upstream yet. This way you don't need to worry about details like the above.
@Zrachel and @dingsiyu were you able to get speedups using mixed precision?
I am not able to get speedups using mixed precision. After upgrading CUDA from 9.0 to 9.2, we have another problem, i.e.:
System information:
(1) OS Platform and Distribution: CentOS 6.3
(2) TensorFlow installed from: conda
(3) TensorFlow version: v1.8.0 (it already contains the code TF_CALL_half(REGISTER_BATCH_MATMUL_GPU);)
(4) Python version: 3.6
(5) CUDA version: 9.2
(6) cuDNN version: 7.1.4
(7) GPU model: V100
Exact command to reproduce:

import tensorflow as tf

with tf.device("/gpu:0"):
    a = tf.random_normal(dtype=tf.float16, shape=[5, 2, 3], name='a')
    b = tf.random_normal(dtype=tf.float16, shape=[5, 3, 2], name='b')
    c = tf.matmul(a, b)

sess = tf.Session(config=tf.ConfigProto(log_device_placement=True,
                                        allow_soft_placement=False))
print(sess.run(c).shape)
Describe the problem:
The fp16 matrix multiplication runs on the CPU, so we cannot speed up mixed-precision training.
Logs:
InvalidArgumentError (see above for traceback): Cannot assign a device for operation 'MatMul_1': Could not satisfy explicit device specification '/device:GPU:0' because no supported kernel for GPU devices is available.
Registered kernels:
device='GPU'; T in [DT_DOUBLE]
device='GPU'; T in [DT_FLOAT]
device='GPU'; T in [DT_COMPLEX128]
device='GPU'; T in [DT_COMPLEX64]
device='CPU'; T in [DT_INT32]
device='CPU'; T in [DT_HALF]
device='CPU'; T in [DT_DOUBLE]
device='CPU'; T in [DT_FLOAT]
device='CPU'; T in [DT_COMPLEX128]
device='CPU'; T in [DT_COMPLEX64]
[[Node: MatMul_1 = BatchMatMul[T=DT_HALF, adj_x=false, adj_y=false, _device="/device:GPU:0"](a_1, b_1)]]
So, what can we do?
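Reading the registered-kernels table in that error as data (a small illustrative sketch, not part of the original report) shows exactly why the explicit GPU placement fails:

```python
# The registered-kernels table from the error above, transcribed as data.
# This makes the failure mechanical: BatchMatMul with DT_HALF exists on CPU only.
registered = [
    ("GPU", "DT_DOUBLE"), ("GPU", "DT_FLOAT"),
    ("GPU", "DT_COMPLEX128"), ("GPU", "DT_COMPLEX64"),
    ("CPU", "DT_INT32"), ("CPU", "DT_HALF"), ("CPU", "DT_DOUBLE"),
    ("CPU", "DT_FLOAT"), ("CPU", "DT_COMPLEX128"), ("CPU", "DT_COMPLEX64"),
]

half_devices = {device for device, dtype in registered if dtype == "DT_HALF"}
print(half_devices)  # {'CPU'} -> no GPU kernel, hence the placement error
```

With allow_soft_placement=False, TensorFlow refuses to silently fall back to the CPU kernel, which is why the error surfaces here instead of the op quietly running slow.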
Can you please try NVIDIA's TensorFlow container (18.07-py3)? You can get it here for free: https://ngc.nvidia.com/registry/nvidia-tensorflow
I am not sure why upstream TF still doesn't have batched gemm in fp16 ...
Thank you @okuchaiev. We cannot access this website. Are there other ways (like Google Drive) to get this container?
The website seems up and running for me: https://ngc.nvidia.com . NVIDIA's TF containers are available only from there; it requires registration, but it is quick and free.
We have fixed the problem that FP16 matmul could not run on the GPU, but there is almost no speedup:
System information:
(1) OS Platform and Distribution: CentOS 6.3
(2) TensorFlow installed from: conda
(3) TensorFlow version: v1.9.0
(4) Python version: 3.6
(5) CUDA version: 9.2
(6) cuDNN version: 7.1.4
(7) GPU model: V100 (number: 2)
Model: OpenSeq2Seq transformer_big.py
FP32: batch_size = 128, 2 GPUs (V100), time per step = 0.34 s
Mixed: batch_size = 128, 2 GPUs (V100), time per step = 0.33 s
The speeds of FP32 and mixed are almost identical. Why doesn't mixed mode speed up the Transformer model?
@dingsiyu
I tested "transformer-big.py" and I get the following (note the increase in global_step/sec):
This is using NVIDIA's TF containers, 2 GPUs, OpenSeq2Seq from the master branch, and no Horovod.
One thing I noticed is that my FP32 model reports around 0.424 s per step, while mixed is closer to 0.33 s (same as yours).
Can you please double-check that your FP32 model is actually FP32 and not mixed?
Did you make any changes to the model config or use a different dataset?
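A quick back-of-the-envelope sketch of what these timings imply, using only the step times quoted in this comment:

```python
# Speedup implied by the step times reported in this comment (2x V100).
fp32_step, mixed_step = 0.424, 0.33   # seconds per step
speedup = fp32_step / mixed_step

print(f"{speedup:.2f}x")  # ~1.28x faster in mixed mode
```

By contrast, an fp32 run that already reports ~0.33 s per step would show essentially no gain from switching to mixed, which is why checking the actual dtype of the "FP32" run matters here.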
@okuchaiev
I tested "transformer_big.py" many times last weekend, and the FP32 speed always stayed at 0.34 s per step. I have checked the model config and the dataset; the only change I made was the 'dtype' parameter, from tf.float32 to "mixed", when comparing FP32 and mixed.
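For reference, the change described above amounts to a small edit in the config dict. A minimal sketch; the key names below follow the OpenSeq2Seq documentation and should be verified against your checkout:

```python
# Excerpt of an OpenSeq2Seq config such as transformer_big.py (sketch only).
base_params = {
    # "dtype": tf.float32,      # fp32 baseline
    "dtype": "mixed",           # fp16 compute with fp32 master weights
    "loss_scaling": "Backoff",  # automatic loss scaling, recommended with "mixed"
}
```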
But I do not know whether our system configurations are the same; I am not sure whether CUDA 9.2 affects the speed of FP32 mode.
So could you test "transformer_big.py" on the newest NVIDIA TF container, which may have the same configuration as mine?
Thanks a lot!