
Comments (11)

okuchaiev avatar okuchaiev commented on May 13, 2024

(1) Are you using a Volta GPU? (Only Volta has the Tensor Cores used in mixed precision.)
(2) Which versions of CUDA and TensorFlow do you have?

from openseq2seq.

Zrachel avatar Zrachel commented on May 13, 2024

1). Yes, we use V100s; all the numbers above were measured on V100.
2). CUDA 9.0. We tried both TensorFlow 1.8 and 1.9, with no difference.

When we use the default Transformer-based model (https://github.com/tensorflow/tensor2tensor), we get the following speed:
4.3 global_steps/s with 4 V100 GPUs

With OpenSeq2Seq (default configuration), we get
1 step / 13 s with 2 V100 GPUs in mixed mode (GPU utilization: 1%),
1 step / 0.33 s with 2 V100 GPUs in fp32 mode (GPU utilization: 90%), and
1 step / 19 s with 2 V100 GPUs in fp16 mode (GPU utilization: 1%)
for the same base model.
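To put the reported numbers in perspective, here is a quick back-of-the-envelope comparison (step times taken from the measurements above):

```python
# Step times reported above (seconds per training step, 2x V100).
step_time = {
    "fp32": 0.33,   # ~90% GPU utilization
    "mixed": 13.0,  # ~1% GPU utilization
    "fp16": 19.0,   # ~1% GPU utilization
}

# Slowdown of each reduced-precision mode relative to fp32.
for mode in ("mixed", "fp16"):
    slowdown = step_time[mode] / step_time["fp32"]
    print(f"{mode}: {slowdown:.0f}x slower than fp32")
# → mixed: 39x slower than fp32
# → fp16: 58x slower than fp32
```

A ~40-60x slowdown combined with 1% GPU utilization strongly suggests the fp16 ops are not running on the GPU at all, which is exactly the diagnosis below.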


okuchaiev avatar okuchaiev commented on May 13, 2024

GPU utilization of 1% means that most of the work falls on the CPU. This happens because public TensorFlow + CUDA 9.0 does not have batched GEMM in float16 integrated.
This is why we require CUDA 9.1 (see https://nvidia.github.io/OpenSeq2Seq/html/mixed-precision.html) and TF built with this PR included.

I would recommend simply using NVIDIA's TensorFlow container (18.07-py3), which you can get for free here: https://ngc.nvidia.com/registry/nvidia-tensorflow . It contains cuBLAS, CUDA, cuDNN, and a TF version all tested to work nicely with each other, plus occasionally some GPU improvements that aren't in upstream TF yet. This way you don't need to worry about details like the above.
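For reference, pulling and running an NGC container generally looks like the following (a sketch; the exact image tag should be checked against the registry, and you need a free NGC account with an API key first):

```shell
# Log in to the NGC registry (the username is literally "$oauthtoken";
# the password is your NGC API key), then pull the TF container above.
docker login nvcr.io
docker pull nvcr.io/nvidia/tensorflow:18.07-py3

# Run it with GPU access (nvidia-docker, for Docker versions of that era):
nvidia-docker run -it --rm nvcr.io/nvidia/tensorflow:18.07-py3
```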


okuchaiev avatar okuchaiev commented on May 13, 2024

@Zrachel and @dingsiyu were you able to get speedups using mixed precision?


dingsiyu avatar dingsiyu commented on May 13, 2024

I am not able to get speedups using mixed precision. After upgrading CUDA from 9.0 to 9.2, we hit another problem, i.e.:

System information :
(1)OS Platform and Distribution : centos6.3
(2)TensorFlow installed from: conda
(3)TensorFlow version (use command below): v1.8.0(it already has the code TF_CALL_half(REGISTER_BATCH_MATMUL_GPU);)
(4)Python version: 3.6
(5)CUDA version: 9.2
(6)cuDNN version: 7.1.4
(7)GPU model : V100

Exact command to reproduce:

import tensorflow as tf

with tf.device("/gpu:0"):
    a = tf.random_normal(dtype=tf.float16, shape=[5, 2, 3], name='a')
    b = tf.random_normal(dtype=tf.float16, shape=[5, 3, 2], name='b')
    c = tf.matmul(a, b)

sess = tf.Session(config=tf.ConfigProto(log_device_placement=True,
                                        allow_soft_placement=False))
print(sess.run(c).shape)

Describe the problem:
The fp16 matrix multiplication runs on the CPU, so we are not able to speed up mixed-precision training.

logs
InvalidArgumentError (see above for traceback): Cannot assign a device for operation 'MatMul_1': Could not satisfy explicit device specification '/device:GPU:0' because no supported kernel for GPU devices is available.
Registered kernels:
device='GPU'; T in [DT_DOUBLE]
device='GPU'; T in [DT_FLOAT]
device='GPU'; T in [DT_COMPLEX128]
device='GPU'; T in [DT_COMPLEX64]
device='CPU'; T in [DT_INT32]
device='CPU'; T in [DT_HALF]
device='CPU'; T in [DT_DOUBLE]
device='CPU'; T in [DT_FLOAT]
device='CPU'; T in [DT_COMPLEX128]
device='CPU'; T in [DT_COMPLEX64]

[[Node: MatMul_1 = BatchMatMul[T=DT_HALF, adj_x=false, adj_y=false, _device="/device:GPU:0"](a_1, b_1)]]

So, what can we do?


okuchaiev avatar okuchaiev commented on May 13, 2024

Can you please try NVIDIA's TensorFlow container (18.07-py3)? You can get it here for free: https://ngc.nvidia.com/registry/nvidia-tensorflow

I am not sure why upstream TF still doesn't have batched gemm in fp16 ...


Zrachel avatar Zrachel commented on May 13, 2024

Thank you @okuchaiev . We cannot access this website. Is there any other way (like Google Drive) to get this container?


okuchaiev avatar okuchaiev commented on May 13, 2024

The website seems up and running for me: https://ngc.nvidia.com . NVIDIA's TF containers are available only from there; it requires registration, but it is quick and free.


dingsiyu avatar dingsiyu commented on May 13, 2024

We have fixed the problem that the FP16 matmul could not run on the GPU, but we see almost no speedup:

System information :
(1)OS Platform and Distribution : centos6.3
(2)TensorFlow installed from: conda
(3)TensorFlow version (use command below): v1.9.0
(4)Python version: 3.6
(5)CUDA version: 9.2
(6)cuDNN version: 7.1.4
(7)GPU model : V100(number : 2)

model: OpenSeq2Seq --- transformer_big.py
FP32: batch_size = 128, 2 GPUs (V100), time per step = 0.34 s
mixed: batch_size = 128, 2 GPUs (V100), time per step = 0.33 s

The speeds of FP32 and mixed are almost identical. Why does mixed mode not speed up the Transformer model?


okuchaiev avatar okuchaiev commented on May 13, 2024

@dingsiyu
I tested "transformer-big.py" and I get the following (note increase in global_step/sec):
[screenshot from 2018-08-10: training log excerpt showing the increase in global_step/sec]

This is using NVIDIA's TF containers, 2 GPUs, OpenSeq2Seq from master branch and not using Horovod.

One thing I noticed is that my FP32 model reports around 0.424 s per step, while mixed is closer to 0.33 s (the same as yours).
Can you please double-check that your FP32 model is actually FP32 and not mixed?
Did you make any changes to the model config, or use a different dataset?


dingsiyu avatar dingsiyu commented on May 13, 2024

@okuchaiev
I tested "transformer-big.py" many times last weekend, and the speed of FP32 mode always stays at 0.34 s per step. I have also checked the model config and the dataset; when comparing FP32 and mixed, the only change I made was the 'dtype' parameter, from tf.float32 to mixed.
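For reference, in OpenSeq2Seq that switch is a single entry in the config file's base_params dict. A minimal sketch of the relevant fragment (all other parameters omitted; in the fp32 variant the value is tf.float32 instead of the string "mixed"):

```python
# Excerpt of an OpenSeq2Seq config such as transformer-big.py.
# Only the dtype entry is shown; everything else stays unchanged.
base_params = {
    # "dtype": tf.float32,  # fp32 run
    "dtype": "mixed",       # mixed-precision run: fp16 compute,
                            # fp32 master weights and loss scaling
}
```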

But I do not know whether our system configurations are the same; I am not sure whether CUDA 9.2 affects the speed of FP32 mode.

So could you test "transformer-big.py" on the newest NVIDIA TF container, which may have the same configuration as mine?

Thanks a lot!

