Comments (11)
(1) Are you using a Volta GPU? (Only Volta has the Tensor Cores used in mixed precision.)
(2) Which versions of CUDA and TensorFlow do you have?
from openseq2seq.
1) Yes. We use V100; all the numbers above were measured on V100.
2) CUDA 9.0. We tried both TensorFlow 1.8 and 1.9; there was no difference.
When we use the default Transformer-based model (https://github.com/tensorflow/tensor2tensor), we get the following speed:
4.3 global_steps/s with 4 V100 GPUs.
With OpenSeq2Seq (default configuration), for the same base model we get:
1 step / 13 s with 2 V100 GPUs in mixed mode (GPU utilization: 1%),
1 step / 0.33 s with 2 V100 GPUs in fp32 mode (GPU utilization: 90%), and
1 step / 19 s with 2 V100 GPUs in fp16 mode (GPU utilization: 1%).
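Converting the reported timings to a common unit makes the gap explicit. A quick sketch, using only the figures reported above:

```python
# Convert the reported timings to steps/sec to compare modes directly.
# Figures are the ones measured above (2x V100, OpenSeq2Seq defaults).
timings = {"mixed": 13.0, "fp32": 0.33, "fp16": 19.0}  # seconds per step

rates = {mode: 1.0 / s for mode, s in timings.items()}   # steps per second
slowdown = timings["mixed"] / timings["fp32"]            # mixed vs. fp32

print({mode: round(r, 2) for mode, r in rates.items()})  # fp32 ~3.03 steps/s
print(round(slowdown, 1))                                # mixed is ~39.4x slower
```

So mixed mode here is not merely failing to help; it is roughly 39x slower than fp32, which points at something structurally wrong (work falling off the GPU) rather than a tuning issue.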
A GPU utilization of 1% means that most of the work falls on the CPU. This happens because public TensorFlow + CUDA 9.0 does not have batched GEMM in float16 integrated.
This is why we require CUDA 9.1 (see https://nvidia.github.io/OpenSeq2Seq/html/mixed-precision.html) and TF built with this PR included.
I would recommend just using NVIDIA's TensorFlow container (18.07-py3), which you can get for free here: https://ngc.nvidia.com/registry/nvidia-tensorflow . It contains cuBLAS, CUDA, cuDNN, and a TF version tested to work nicely together, plus occasionally some GPU improvements that aren't in TF upstream yet. This way you don't need to worry about details like the above.
@Zrachel and @dingsiyu were you able to get speedups using mixed precision?
I am not able to get speedups using mixed precision. After upgrading CUDA from 9.0 to 9.2, we have another problem, i.e.:
System information:
(1) OS Platform and Distribution: CentOS 6.3
(2) TensorFlow installed from: conda
(3) TensorFlow version: v1.8.0 (it already contains the code TF_CALL_half(REGISTER_BATCH_MATMUL_GPU);)
(4) Python version: 3.6
(5) CUDA version: 9.2
(6) cuDNN version: 7.1.4
(7) GPU model: V100
Exact command to reproduce:

import tensorflow as tf

with tf.device("/gpu:0"):
    a = tf.random_normal(dtype=tf.float16, shape=[5, 2, 3], name='a')
    b = tf.random_normal(dtype=tf.float16, shape=[5, 3, 2], name='b')
    c = tf.matmul(a, b)

sess = tf.Session(config=tf.ConfigProto(log_device_placement=True,
                                        allow_soft_placement=False))
print(sess.run(c).shape)
Describe the problem:
The fp16 matrix multiplication runs on the CPU, so we cannot speed up mixed-precision training.
Logs:
InvalidArgumentError (see above for traceback): Cannot assign a device for operation 'MatMul_1': Could not satisfy explicit device specification '/device:GPU:0' because no supported kernel for GPU devices is available.
Registered kernels:
device='GPU'; T in [DT_DOUBLE]
device='GPU'; T in [DT_FLOAT]
device='GPU'; T in [DT_COMPLEX128]
device='GPU'; T in [DT_COMPLEX64]
device='CPU'; T in [DT_INT32]
device='CPU'; T in [DT_HALF]
device='CPU'; T in [DT_DOUBLE]
device='CPU'; T in [DT_FLOAT]
device='CPU'; T in [DT_COMPLEX128]
device='CPU'; T in [DT_COMPLEX64]
[[Node: MatMul_1 = BatchMatMul[T=DT_HALF, adj_x=false, adj_y=false, _device="/device:GPU:0"](a_1, b_1)]]
So, what can we do?
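Reading the registered-kernels table in that error as data (a small illustrative sketch, not part of the original report) shows exactly why the explicit GPU placement fails:

```python
# The registered-kernels table from the error above, transcribed as data.
# This makes the failure mechanical: BatchMatMul with DT_HALF exists on CPU only.
registered = [
    ("GPU", "DT_DOUBLE"), ("GPU", "DT_FLOAT"),
    ("GPU", "DT_COMPLEX128"), ("GPU", "DT_COMPLEX64"),
    ("CPU", "DT_INT32"), ("CPU", "DT_HALF"), ("CPU", "DT_DOUBLE"),
    ("CPU", "DT_FLOAT"), ("CPU", "DT_COMPLEX128"), ("CPU", "DT_COMPLEX64"),
]

half_devices = {device for device, dtype in registered if dtype == "DT_HALF"}
print(half_devices)  # {'CPU'} -> no GPU kernel, hence the placement error
```

With allow_soft_placement=False, TensorFlow refuses to silently fall back to the CPU kernel, which is why the error surfaces here instead of the op quietly running slow.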
Can you please try NVIDIA's TensorFlow container (18.07-py3)? You can get it here for free: https://ngc.nvidia.com/registry/nvidia-tensorflow
I am not sure why upstream TF still doesn't have batched gemm in fp16 ...
Thank you @okuchaiev. We cannot access this website. Are there other ways (like Google Drive) to get this container?
The website seems up and running for me: https://ngc.nvidia.com . NVIDIA's TF containers are available only from there; it requires registration, but it is quick and free.
We have fixed the problem that FP16 matmul could not run on the GPU, but there is almost no speedup:
System information:
(1) OS Platform and Distribution: CentOS 6.3
(2) TensorFlow installed from: conda
(3) TensorFlow version: v1.9.0
(4) Python version: 3.6
(5) CUDA version: 9.2
(6) cuDNN version: 7.1.4
(7) GPU model: V100 (number: 2)
Model: OpenSeq2Seq transformer_big.py
FP32: batch_size = 128, 2 GPUs (V100), time per step = 0.34 s
Mixed: batch_size = 128, 2 GPUs (V100), time per step = 0.33 s
The speeds of FP32 and mixed are almost identical. Why doesn't mixed mode speed up the Transformer model?
@dingsiyu
I tested "transformer-big.py" and I get the following (note the increase in global_step/sec):
This is using NVIDIA's TF containers, 2 GPUs, OpenSeq2Seq from the master branch, and no Horovod.
One thing I noticed is that my FP32 model reports around 0.424 s per step, while mixed is closer to 0.33 s (same as yours).
Can you please double-check that your FP32 model is actually FP32 and not mixed?
Did you make any changes to the model config or use a different dataset?
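A quick back-of-the-envelope sketch of what these timings imply, using only the step times quoted in this comment:

```python
# Speedup implied by the step times reported in this comment (2x V100).
fp32_step, mixed_step = 0.424, 0.33   # seconds per step
speedup = fp32_step / mixed_step

print(f"{speedup:.2f}x")  # ~1.28x faster in mixed mode
```

By contrast, an fp32 run that already reports ~0.33 s per step would show essentially no gain from switching to mixed, which is why checking the actual dtype of the "FP32" run matters here.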
@okuchaiev
I tested "transformer_big.py" many times last weekend, and the FP32 speed always stayed at 0.34 s per step. I have checked the model config and the dataset; the only change I made was the 'dtype' parameter, from tf.float32 to "mixed", when comparing FP32 and mixed.
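For reference, the change described above amounts to a small edit in the config dict. A minimal sketch; the key names below follow the OpenSeq2Seq documentation and should be verified against your checkout:

```python
# Excerpt of an OpenSeq2Seq config such as transformer_big.py (sketch only).
base_params = {
    # "dtype": tf.float32,      # fp32 baseline
    "dtype": "mixed",           # fp16 compute with fp32 master weights
    "loss_scaling": "Backoff",  # automatic loss scaling, recommended with "mixed"
}
```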
But I do not know whether our system configurations are the same; I am not sure whether CUDA 9.2 affects the speed of FP32 mode.
So could you test "transformer_big.py" on the newest NVIDIA TF container, which may have the same configuration as mine?
Thanks a lot!