
Benchmark on Deep Learning Frameworks and GPUs

The performance of popular deep learning frameworks and GPUs is compared, including the effect of adjusting the floating-point precision (the new Volta architecture allows a performance boost by utilizing half/mixed-precision calculations).

Deep Learning Frameworks

Note: Docker images available from NVIDIA GPU Cloud were used so as to make benchmarking controlled and repeatable by anyone.

  • PyTorch 0.3.0

    • docker pull nvcr.io/nvidia/pytorch:17.12
  • PyTorch 1.0.0 (CUDA 10.0, cuDNN 7.4.2)

    • docker pull nvcr.io/nvidia/pytorch:19.01-py3 (note: requires login API key to NGC registry)
  • Caffe2 0.8.1

    • docker pull nvcr.io/nvidia/caffe2:17.12
  • TensorFlow 1.4.0 (note: this is TensorFlow 1.4.0 compiled against CUDA 9 and cuDNN 7)

    • docker pull nvcr.io/nvidia/tensorflow:17.12
  • TensorFlow 1.5.0

  • TensorFlow 1.12.0 (CUDA 10.0, cuDNN 7.4.2)

    • docker pull nvcr.io/nvidia/tensorflow:19.01-py3 (note: requires login API key to NGC registry)
  • MXNet 1.0.0 (anyone interested?)

    • docker pull nvcr.io/nvidia/mxnet:17.12
  • CNTK (anyone interested?)

    • docker pull nvcr.io/nvidia/cntk:17.12

GPUs

| Model | Architecture | Memory | CUDA Cores | Tensor Cores | F32 TFLOPS | F16 TFLOPS* | Retail | Cloud |
|---|---|---|---|---|---|---|---|---|
| Tesla V100 | Volta | 16GB HBM2 | 5120 | 640 | 15.7 | 125 | N/A | $3.06/hr (p3.2xlarge) |
| Titan V | Volta | 12GB HBM2 | 5120 | 640 | 15 | 110 | $2999 | N/A |
| 1080 Ti | Pascal | 11GB GDDR5X | 3584 | 0 | 11 | N/A | $699 | N/A |
| 2080 Ti | Turing | 11GB GDDR6 | 4352 | 544 | 13.4 | 56.9 | $1299 | N/A |

*: F16 (half precision) TFLOPS on Tensor Cores.

CUDA / cuDNN

  • CUDA 9.0.176
  • cuDNN 7.0.0.5
  • NVIDIA driver 387.34

Except where noted.

Networks

  • VGG16
  • Resnet152
  • Densenet161
  • Any others you might be interested in?

Benchmark Results

PyTorch 0.3.0

The results are based on running the models with images of size 224 x 224 x 3 and a batch size of 16. "Eval" shows the duration of a single forward pass averaged over 20 passes. "Train" shows the duration of a forward/backward pair averaged over 20 runs. In both scenarios, 20 warm-up runs are performed first and are not counted toward the measured numbers.
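The measurement loop described above can be sketched as follows. This is a minimal illustration, not the repository's actual benchmark.py; the model and input are placeholders supplied by the caller:

```python
import time

import torch

def benchmark(model, x, train=False, warmup=20, runs=20):
    """Average duration in ms of a forward pass (eval) or a
    forward+backward pair (train), after `warmup` uncounted runs."""
    model.train(train)
    for i in range(warmup + runs):
        if i == warmup:
            if torch.cuda.is_available():
                torch.cuda.synchronize()  # ensure warm-up work has finished
            start = time.time()
        if train:
            model.zero_grad()
            model(x).sum().backward()
        else:
            with torch.no_grad():
                model(x)
    if torch.cuda.is_available():
        torch.cuda.synchronize()  # drain queued GPU work before stopping the clock
    return (time.time() - start) / runs * 1e3
```

The explicit synchronize calls matter on a GPU: CUDA kernel launches are asynchronous, so without them the wall-clock numbers would only measure launch overhead.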

Titan V gets a significant speed-up at half precision by utilizing its Tensor Cores, while the 1080 Ti gets only a small speed-up from half-precision computation. Numbers from a V100 on an Amazon p3 instance are also shown; it is faster than the Titan V, and its speed-up at half precision is similar to the Titan V's.
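The half-precision rows come from casting both the model weights and the inputs to fp16. A minimal sketch of what that looks like in PyTorch (not the repository's exact code; full mixed-precision training would typically keep batch-norm statistics and master weights in fp32, which this sketch does not do):

```python
import torch

def to_half(model, x):
    # Cast all weights and the input tensor to fp16.
    return model.half(), x.half()
```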

32-bit

| Model | vgg16 eval | vgg16 train | resnet152 eval | resnet152 train | densenet161 eval | densenet161 train |
|---|---|---|---|---|---|---|
| Titan V | 31.3ms | 108.8ms | 48.9ms | 180.2ms | 52.4ms | 174.1ms |
| 1080 Ti | 39.3ms | 131.9ms | 57.8ms | 206.4ms | 62.9ms | 211.9ms |
| V100 (Amazon p3, CUDA 9.0.176, cuDNN 7.0.0.3) | 26.2ms | 83.5ms | 38.7ms | 136.5ms | 48.3ms | 142.5ms |
| 2080 Ti | 30.5ms | 102.9ms | 41.9ms | 157.0ms | 47.3ms | 160.0ms |

16-bit

| Model | vgg16 eval | vgg16 train | resnet152 eval | resnet152 train | densenet161 eval | densenet161 train |
|---|---|---|---|---|---|---|
| Titan V | 14.7ms | 74.1ms | 26.1ms | 115.9ms | 32.2ms | 118.9ms |
| 1080 Ti | 33.5ms | 117.6ms | 46.9ms | 193.5ms | 50.1ms | 191.0ms |
| V100 (Amazon p3, CUDA 9.0.176, cuDNN 7.0.0.3) | 12.6ms | 58.8ms | 21.7ms | 92.9ms | 35.7ms | 102.3ms |
| 2080 Ti | 23.6ms | 99.3ms | 31.3ms | 133.0ms | 35.5ms | 135.8ms |

PyTorch 1.0.0

32-bit

| Model | vgg16 eval | vgg16 train | resnet152 eval | resnet152 train | densenet161 eval | densenet161 train |
|---|---|---|---|---|---|---|
| 2080 Ti (CUDA 10.0.130, cuDNN 7.4.2.24) | 28.0ms | 95.5ms | 41.8ms | 142.5ms | 45.4ms | 148.4ms |

16-bit

| Model | vgg16 eval | vgg16 train | resnet152 eval | resnet152 train | densenet161 eval | densenet161 train |
|---|---|---|---|---|---|---|
| 2080 Ti (CUDA 10.0.130, cuDNN 7.4.2.24) | 19.1ms | 68.1ms | 25.0ms | 98.6ms | 30.1ms | 110.8ms |

TensorFlow 1.4.0

32-bit

| Model | vgg16 eval | vgg16 train | resnet152 eval | resnet152 train |
|---|---|---|---|---|
| Titan V | 31.8ms | 157.2ms | 50.3ms | 269.8ms |
| 1080 Ti | 43.4ms | 131.3ms | 69.6ms | 300.6ms |
| 2080 Ti | 31.3ms | 99.4ms | 43.2ms | 187.7ms |

16-bit

| Model | vgg16 eval | vgg16 train | resnet152 eval | resnet152 train |
|---|---|---|---|---|
| Titan V | 16.1ms | 96.7ms | 28.4ms | 193.3ms |
| 1080 Ti | 38.6ms | 121.1ms | 53.9ms | 257.0ms |
| 2080 Ti | 24.9ms | 81.8ms | 31.9ms | 155.5ms |

TensorFlow 1.5.0

32-bit

| Model | vgg16 eval | vgg16 train | resnet152 eval | resnet152 train |
|---|---|---|---|---|
| V100 | 24.0ms | 71.7ms | 39.4ms | 199.8ms |

16-bit

| Model | vgg16 eval | vgg16 train | resnet152 eval | resnet152 train |
|---|---|---|---|---|
| V100 | 13.6ms | 49.4ms | 22.6ms | 147.4ms |

TensorFlow 1.12.0

32-bit

| Model | vgg16 eval | vgg16 train | resnet152 eval | resnet152 train |
|---|---|---|---|---|
| 2080 Ti (CUDA 10.0.130, cuDNN 7.4.2.24) | 28.8ms | 90.8ms | 43.6ms | 191.0ms |

16-bit

| Model | vgg16 eval | vgg16 train | resnet152 eval | resnet152 train |
|---|---|---|---|---|
| 2080 Ti (CUDA 10.0.130, cuDNN 7.4.2.24) | 18.7ms | 58.6ms | 25.8ms | 133.5ms |

Caffe2 0.8.1

32-bit

| Model | vgg16 eval | vgg16 train | resnet152 eval | resnet152 train |
|---|---|---|---|---|
| Titan V | 57.5ms | 185.4ms | 74.4ms | 214.1ms |
| 1080 Ti | 47.0ms | 158.9ms | 77.9ms | 223.9ms |

16-bit

| Model | vgg16 eval | vgg16 train | resnet152 eval | resnet152 train |
|---|---|---|---|---|
| Titan V | 41.6ms | 156.1ms | 56.9ms | 172.7ms |
| 1080 Ti | 40.1ms | 137.8ms | 61.7ms | 184.1ms |

Comparison Graphs

Comparison of Titan V vs 1080 Ti, PyTorch 0.3.0 vs TensorFlow 1.4.0 vs Caffe2 0.8.1, and FP32 vs FP16 in terms of images processed per second:

(Graphs: vgg16 eval, vgg16 train, resnet152 eval, resnet152 train)

Contributors

  • Yusaku Sako
  • Bartosz Ludwiczuk (thank you for supplying the V100 numbers)


deep-learning-benchmark's Issues

Cite the benchmark results

Hi! Thanks for your wonderful work. I hope to cite the training-speed results in my research on GPU devices. Is there a paper or report for this work that I can cite?

TensorFlow 1.5

Actually, to use fp16 Tensor Cores you need CUDA 9 and TensorFlow 1.5, which was released today.
From the release notes:

Add support for CUBLAS_TENSOR_OP_MATH in fp16 GEMM

Any chance we could see a retest with the new TF soon?

About the TensorFlow benchmark in half precision

Hi, @u39kun. Thank you for your work!

When I checked the way you implement the TensorFlow benchmark here, I found a comment in the function get_variable(), like the following:

  def get_variable(self, name, shape, dtype, cast_dtype, *args, **kwargs):
    # TODO(reedwm): Currently variables and gradients are transferred to other
    # devices and machines as type `dtype`, not `cast_dtype`. In particular,
    # this means in fp16 mode, variables are transferred as fp32 values, not
    # fp16 values, which uses extra bandwidth.

Do you mean that currently, in fp16 mode, the variables are still stored and transferred as float32 while the computation runs as float16? So only GPU bandwidth is affected, and the speed stays the same?

Minibatch size when going to mixed precision

Thank you for excellent data points!

Can you estimate the potential increase in minibatch size when going to mixed precision?

NVIDIA claims memory usage should go down, but isn't specific.

In my experiments with a Titan V (using TensorFlow and a home-grown implementation of the Transformer model), I can only increase the batch size by about 10%, which is much less than I expected.

Thanks!
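One factor behind smaller-than-expected gains: casting to fp16 halves parameter and activation storage, but optimizer state and any fp32 master copies of the weights do not shrink. The parameter-storage part can be checked directly; a minimal sketch with a placeholder model:

```python
import torch

def param_bytes(model):
    # Total bytes occupied by the model's parameters.
    return sum(p.numel() * p.element_size() for p in model.parameters())

model = torch.nn.Linear(10, 10)
fp32_bytes = param_bytes(model)
fp16_bytes = param_bytes(model.half())  # .half() converts in place and returns the module
```

Here fp16_bytes is exactly half of fp32_bytes, yet total memory at train time also includes activations, gradients, and optimizer state, which is why the achievable batch-size increase can be much less than 2x.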

Running the benchmark with rtx 2080 Ti

Hi,
I have been using your benchmark to run tests and comparisons between 10-series cards. Now that I have received an RTX 2080 Ti, I get the following when trying to run the benchmark:

Running the benchmark $sudo python3 benchmark.py

running benchmark for frameworks ['pytorch', 'tensorflow', 'caffe2']
cuda version= None
cudnn version= 7201
/home/bizon/benchmark/deep-learning-benchmark-master/frameworks/pytorch/models.py:17: UserWarning: volatile was removed and now has no effect. Use with torch.no_grad(): instead.
self.eval_input = torch.autograd.Variable(x, volatile=True).cuda() if precision == 'fp32'
Segmentation fault

The benchmark has run fine with all the other cards, and the MNIST benchmark also runs perfectly. I would like to test all the new RTX cards to see their performance.

Your help here would be really appreciated.
Thanks in advance
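Incidentally, the UserWarning in the log above is unrelated to the crash: the volatile flag was removed in PyTorch 0.4, and the modern replacement is torch.no_grad(). A minimal sketch with a stand-in model (the segfault itself is more likely a CUDA/driver mismatch, since Turing cards generally need a CUDA 10 build):

```python
import torch

model = torch.nn.Linear(8, 4)   # stand-in for the benchmarked networks
x = torch.randn(16, 8)

# PyTorch <= 0.3:  out = model(torch.autograd.Variable(x, volatile=True))
# PyTorch >= 0.4:
with torch.no_grad():
    out = model(x)              # no autograd graph is built
```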
