tensorflow / benchmarks Goto Github PK

View Code? Open in Web Editor NEW

1.1K 176.0 628.0 3.13 MB

A benchmark framework for Tensorflow

License: Apache License 2.0

Python 97.78% Shell 1.74% Roff 0.48%

benchmarks's People

Contributors

Stargazers

Watchers

Forkers

annarev yuefengz vulcanoio amano-ginji newgenesys joseroubert08 aniruddh-sk alexxnica kryndex solertis vjpai ml-lab elarnon leotam bwry caicloud dl-yc patrick-mcclure shrutiramesh88 grseb9s deepakn94 hexopensource tfboyd i8run cw-delli-bird rodgeliao staugust mohsenmalmir shamoya txdao soralove yongfu-li zqcr visualmemory renato2099 wencongxiao yepman0620 pan463194277 qilewuqiong songvy forkedreposbak cclauss pfcm dinneo shasure zhaojp-frank azrael417 wangdeyu k9sret algoskynet chanyilin seahailang searchingmnist juanerolon chron0werx xiaojingyi supernlogn medivhna hiccup11 xmaxk eggplant60 lynnapan waltersharpwei arg0 zhreshold djangopeng chenghuige yhyu13 carylorrk gabrielcc2 lengjiaxu david20120720 jlewi bindung simplejian zmoon111 mahmoud-abuzaina alumilu reedwm karkadad gt758215 danjust nbalov u39kun ehsanmaag weichienc cymotiffany zhoushaojun honglinwen zhaoguangze elisej agoniii jiyangchen x666633 tjingrant malikmk hiroki-kyoto harsv bherta ourobouros

benchmarks's Issues

Very slow in running AlexNet with Cifar10 dataset

Hi, authors , the speed I achieved on AlexNet with Cifar10 dataset is only ~7000 images/sec using a TITAN X Pascal GPU. May I know what is the speed you have achieved, and is there any setting to achieve better performance? The command I used is:
python tf_cnn_benchmarks.py --learning_rate 0.01 --num_gpus 1 --model alexnet --batch_size 1024 --data_name cifar10 --num_batches 100 --data_dir ~/data/tensorflow/cifar-10-batches-py

The tested version of TensorFlow is 1.2.1.
Thanks!

DenseNet in this benchmark has "naive" implementaiton, will underperform

At the time of writing, DenseNet is implemented in this benchmark with what is described as the "naive" implementation in Memory-Efficient Implementation of DenseNets. This will under perform compared to the ideal implementation because recursive tf.concat will lead to excessive memory usage.

Details are in the upstream tf issuse tensorflow/tensorflow#12948.

[Question] Distributed Training Benchmark

Hi!
I have a quick question on running distributed training benchmark, especially on the meaning of different parameters:

the distributed_replicated parameter resembles "synchronous data parallel training" in the literature, meaning the computation graph is replicated on each worker, and after each machine performs a forward pass with a batch of images, their gradients are averaged and optimized on parameter servers - is this understanding correct? If not, what would be the correct --variable_update option for "synchronous training"? How is "distributed replicated" different from "parameter server"?
the --num_gpus means how many GPUs to use on a single machine, NOT the global GPU count - is that correct?
Should I assign each machine a new task index, if I want to run synchronous data parallel training across machines?

I understand some of the questions may be very basic, but it would be great if quick answers can be provided. Thanks.

Long delays on Titan V

Did anybody experience long (3-4 minutes) delay when starting a task with TensorFlow?

I'm trying to evaluate Titan V with the great tensorflow benchmarks repo (https://github.com/tensorflow/benchmarks/), but during initialization, after this line:
2017-12-22 10:39:03.469975: I tensorflow/core/common_runtime/gpu/gpu_device.cc:983] Creating TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 10744 MB memory) -> physical GPU (device: 0, name: Graphics Device, pci bus id: 0000:01:00.0, compute capability: 7.0)

and after
Running warm-up
, execution seems to stop for 3 minutes.

May be related: when using FP16, which is expected to work well, I get a lot of:
2017-12-22 10:34:58.755828: E tensorflow/core/grappler/optimizers/constant_folding.cc:1272] Unexpected type half

When I perform the same test on the same computer on 1080 Ti, there's no delay!

Used:
Titan V, TensorFlow master, TensorFlow benchmarks master, CUDA 9.0, Driver 387.34, Ubuntu 16.04.
TensorFlow's Compute capability is hopefully 6.1 & 7.0, but I don't know any way to check it. Also, I wanted to attach logs (nvidia-bug-report.log.gz), but I couldn't find a way.

Thanks!

I posted my question also here, as I am not sure if it's a benchmark or a V-issue:
https://devtalk.nvidia.com/default/topic/1027804/cuda-programming-and-performance/titan-v-tensorflow-performance/post/5228518/#5228518

Test Multi Gpu Acceleration ratio,is this Normal?

time python tf_cnn_benchmarks.py --local_parameter_device=cpu --num_gpus=2 --num_intra_threads=7 --num_inter_threads=7
--batch_size=32 --model=vgg16 --variable_update=parameter_server
TensorFlow: 1.4
Model: vgg16
Mode: training
SingleSess: False
Batch size: 64 global
32 per device
Devices: ['/gpu:0', '/gpu:1']
Data format: NCHW
Optimizer: sgd
Variables: parameter_server

total images/sec: 146.38
real 0m59.740s

time python tf_cnn_benchmarks.py --local_parameter_device=cpu --num_gpus=1 --num_intra_threads=7 --num_inter_threads=7
--batch_size=64 --model=vgg16 --variable_update=parameter_server
TensorFlow: 1.4
Model: vgg16
Mode: training
SingleSess: False
Batch size: 64 global
64 per device
Devices: ['/gpu:0']
Data format: NCHW
Optimizer: sgd
Variables: parameter_server

total images/sec: 82.04
real 1m37.090s

///////////////////////////////////////////////////////////////////////////////////////////////////////////////////////
time python tf_cnn_benchmarks.py --local_parameter_device=cpu --num_gpus=2 --num_intra_threads=7 --num_inter_threads=7
--batch_size=32 --model=vgg16 --variable_update=replicated --use_nccl=True
TensorFlow: 1.4
Model: vgg16
Mode: training
SingleSess: False
Batch size: 64 global
32 per device
Devices: ['/gpu:0', '/gpu:1']
Data format: NCHW
Optimizer: sgd
Variables: replicated
AllReduce: nccl

total images/sec: 114.40
real 1m13.169s

time python tf_cnn_benchmarks.py --local_parameter_device=cpu --num_gpus=1 --num_intra_threads=7 --num_inter_threads=7
--batch_size=64 --model=vgg16 --variable_update=replicated --use_nccl=True
TensorFlow: 1.4
Model: vgg16
Mode: training
SingleSess: False
Batch size: 64 global
64 per device
Devices: ['/gpu:0']
Data format: NCHW
Optimizer: sgd
Variables: replicated
AllReduce: nccl

total images/sec: 92.60
real 1m26.327s

Need for a better focus on details.

This issue can be taken as a feature-request or a request related to documentation.
The high-performance benchmarking example is a good effort.
However the code is very fused (combining distributed and multi-gpu examples in the same setting !!). Moreover the code is not properly documented and there is little to no information available on different aspects related to StagingArea ops and how to use them.

It would be worthwhile if efforts can be made to improve the related documentation and improve the code clarity. We are currently working on a very high performance training code but are quite crippled by these debilitating drawbacks.

[Feature request] adding support for "iter_size" like hyperparameters (Caffe)

Hi, thanks a lot for sharing this awesome project.

I wonder if the code currently support the Caffe "iter_size" like hyperparameter? That is, accumulating gradients for "iter_size" number of batches and then apply the gradient. By using this hyperparameter, one can emulate the training with larger batch_size without distributed training. When the bathc_size is set to, let's say 64, and iter_size set to ITER_SIZE, then the effective batch_size will be 64*ITER_SIZE since all the gradients in ITER_SIZE batches are accumulated.

Is this doable in current code? Is there any plan for supporting this feature?

Thank you.

Benchmark crashes when trying to run VGG models

The benchmark crashes when I try to run the VGG models with the following stack trace.

Traceback (most recent call last):
  File "tf_cnn_benchmarks.py", line 1333, in <module>
    tf.app.run()
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/platform/app.py", line 48, in run
    _sys.exit(main(_sys.argv[:1] + flags_passthrough))
  File "tf_cnn_benchmarks.py", line 1329, in main
    bench.run()
  File "tf_cnn_benchmarks.py", line 879, in run
    self._benchmark_cnn()
  File "tf_cnn_benchmarks.py", line 919, in _benchmark_cnn
    (enqueue_ops, fetches) = self._build_model()
  File "tf_cnn_benchmarks.py", line 1088, in _build_model
    gpu_grad_stage_ops)
  File "tf_cnn_benchmarks.py", line 1238, in add_forward_pass_and_gradients
    self.model_conf.add_inference(network)
  File "/root/benchmarks/scripts/tf_cnn_benchmarks/vgg_model.py", line 71, in add_inference
    _construct_vgg(cnn, [2, 2, 3, 3, 3])
  File "/root/benchmarks/scripts/tf_cnn_benchmarks/vgg_model.py", line 51, in _construct_vgg
    cnn.dropout()
  File "tf_cnn_benchmarks.py", line 543, in dropout
    dropout = core_layers.dropout(input_layer, keep_prob_tensor)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/layers/core.py", line 301, in dropout
    layer = Dropout(rate, noise_shape=noise_shape, seed=seed, name=name)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/layers/core.py", line 247, in __init__
    self.rate = min(1., max(0., rate))
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 578, in __nonzero__
    raise TypeError("Using a `tf.Tensor` as a Python `bool` is not allowed. "
TypeError: Using a `tf.Tensor` as a Python `bool` is not allowed. Use `if t is not None:` instead of `if t:` to test if a tensor is defined, and use TensorFlow ops such as tf.cond to execute subgraphs conditioned on the value of a tensor.

Command:

python tf_cnn_benchmarks.py --model vgg16 --num_gpus 1 --batch_size 32

Default MaxPoolingOp only supports NHWC

In Docker container.
CPU-only.
Docker image: tensorflow/tensorflow:latest (65e150502892)
Updated TensorFlow with pip install -U tf-nightly to fix issue #80 .
In container cloned the benchmarks.
Start benchmarks with # python tf_cnn_benchmarks.py --batch_size=32 --model=resnet50
Output:

TensorFlow:  1.5
Model:       resnet50
Mode:        training
SingleSess:  False
Batch size:  32 global
             32 per device
Devices:     ['/gpu:0']
Data format: NCHW
Optimizer:   sgd
Variables:   parameter_server
==========
Generating model
2017-11-06 03:49:44.378230: I tensorflow/core/platform/cpu_feature_guard.cc:137] Your CPU supports instructions that this TensorFlow binary was not compiled to use: SSE4.1 SSE4.2 AVX AVX2 FMA
Running warm up
2017-11-06 03:49:46.244696: E tensorflow/core/common_runtime/executor.cc:651] Executor failed to create kernel. Invalid argument: Default MaxPoolingOp only supports NHWC.
	 [[Node: v/tower_0/cg/mpool0/MaxPool = MaxPool[T=DT_FLOAT, data_format="NCHW", ksize=[1, 1, 3, 3], padding="VALID", strides=[1, 1, 2, 2], _device="/job:localhost/replica:0/task:0/device:CPU:0"](v/tower_0/cg/conv0/Relu)]]
Traceback (most recent call last):
  File "tf_cnn_benchmarks.py", line 54, in <module>
    tf.app.run()
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/platform/app.py", line 48, in run
    _sys.exit(main(_sys.argv[:1] + flags_passthrough))
  File "tf_cnn_benchmarks.py", line 50, in main
    bench.run()
  File "/root/benchmarks/scripts/tf_cnn_benchmarks/benchmark_cnn.py", line 916, in run
    return self._benchmark_cnn()
  File "/root/benchmarks/scripts/tf_cnn_benchmarks/benchmark_cnn.py", line 1155, in _benchmark_cnn
    fetch_summary)
  File "/root/benchmarks/scripts/tf_cnn_benchmarks/benchmark_cnn.py", line 530, in benchmark_one_step
    results = sess.run(fetches, options=run_options, run_metadata=run_metadata)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 889, in run
    run_metadata_ptr)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1120, in _run
    feed_dict_tensor, options, run_metadata)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1317, in _do_run
    options, run_metadata)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1336, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.InvalidArgumentError: Default MaxPoolingOp only supports NHWC.
	 [[Node: v/tower_0/cg/mpool0/MaxPool = MaxPool[T=DT_FLOAT, data_format="NCHW", ksize=[1, 1, 3, 3], padding="VALID", strides=[1, 1, 2, 2], _device="/job:localhost/replica:0/task:0/device:CPU:0"](v/tower_0/cg/conv0/Relu)]]

Caused by op u'v/tower_0/cg/mpool0/MaxPool', defined at:
  File "tf_cnn_benchmarks.py", line 54, in <module>
    tf.app.run()
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/platform/app.py", line 48, in run
    _sys.exit(main(_sys.argv[:1] + flags_passthrough))
  File "tf_cnn_benchmarks.py", line 50, in main
    bench.run()
  File "/root/benchmarks/scripts/tf_cnn_benchmarks/benchmark_cnn.py", line 916, in run
    return self._benchmark_cnn()
  File "/root/benchmarks/scripts/tf_cnn_benchmarks/benchmark_cnn.py", line 1010, in _benchmark_cnn
    (image_producer_ops, enqueue_ops, fetches) = self._build_model()
  File "/root/benchmarks/scripts/tf_cnn_benchmarks/benchmark_cnn.py", line 1260, in _build_model
    gpu_compute_stage_ops, gpu_grad_stage_ops)
  File "/root/benchmarks/scripts/tf_cnn_benchmarks/benchmark_cnn.py", line 1566, in add_forward_pass_and_gradients
    self.model.add_inference(network)
  File "/root/benchmarks/scripts/tf_cnn_benchmarks/models/resnet_model.py", line 210, in add_inference
    cnn.mpool(3, 3, 2, 2)
  File "/root/benchmarks/scripts/tf_cnn_benchmarks/convnet_builder.py", line 273, in mpool
    d_height, d_width, mode, input_layer, num_channels_in)
  File "/root/benchmarks/scripts/tf_cnn_benchmarks/convnet_builder.py", line 250, in _pool
    name=name)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/layers/pooling.py", line 429, in max_pooling2d
    return layer.apply(inputs)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/layers/base.py", line 728, in apply
    return self.__call__(inputs, *args, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/layers/base.py", line 618, in __call__
    outputs = self.call(inputs, *args, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/layers/pooling.py", line 273, in call
    data_format=utils.convert_data_format(self.data_format, 4))
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/nn_ops.py", line 1958, in max_pool
    name=name)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/gen_nn_ops.py", line 2806, in _max_pool
    data_format=data_format, name=name)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/op_def_library.py", line 787, in _apply_op_helper
    op_def=op_def)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 3073, in create_op
    op_def=op_def)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 1524, in __init__
    self._traceback = self._graph._extract_stack()  # pylint: disable=protected-access

InvalidArgumentError (see above for traceback): Default MaxPoolingOp only supports NHWC.
	 [[Node: v/tower_0/cg/mpool0/MaxPool = MaxPool[T=DT_FLOAT, data_format="NCHW", ksize=[1, 1, 3, 3], padding="VALID", strides=[1, 1, 2, 2], _device="/job:localhost/replica:0/task:0/device:CPU:0"](v/tower_0/cg/conv0/Relu)]]

Exception in thread Thread-2:
Traceback (most recent call last):
  File "/usr/lib/python2.7/threading.py", line 801, in __bootstrap_inner
    self.run()
  File "/root/benchmarks/scripts/tf_cnn_benchmarks/benchmark_cnn.py", line 417, in run
    global_step_val, = self.sess.run([self.global_step_op])
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 889, in run
    run_metadata_ptr)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1047, in _run
    raise RuntimeError('Attempted to use a closed Session.')
RuntimeError: Attempted to use a closed Session.

global_step should be protected by a lock?

I have met stuck problem when running tf_cnn_benchmarks.py in distributed mode, I think global_step should be protected by a lock in this line.

L2 weight decay implementation in TF benchmarks

The existing implementation use a L2 regularization on all model variables (including batch norm variables and biases). It's quite different from TF slim models which usually regularizes only conv2d weights, and it hurts quite a lot the training to include all variables.

Maybe we should implement a regularization loss using the common TF api. Something like:

weight_decay = self.params.weight_decay
if rel_device_num == 0 and weight_decay:
    # Regularization losses in the present name scope.
    nm_sc = tf.contrib.framework.get_name_scope()
    reg_losses = tf.losses.get_regularization_losses(nm_sc)
    # TODO: fp16 convertion???
    reg_loss = tf.add_n(reg_losses, name='total_regularization_loss')

More generally, I think it would add a lot of value if this benchmark repo could actually reproduce SOTA training on ImageNet for common architectures.

FP16 support in the benchmark

Hi @tfboyd, I saw the benchmark has --use_fp16 flag now. So does the benchmark and the latest TensorFlow support FP16 now? Can we do the test on Volta GPUs? Thanks.

error validating "STDIN": error validating data: unexpected type: object;

when try

python render_template.py template.yaml.jinja | kubectl create -f -

in alexnet, it fails

error validating "STDIN": error validating data: unexpected type: object; if you choose to ignore these errors, turn validation off with --validate=false

if --validate=false added, output is

NAME READY STATUS RESTARTS AGE alexnet-ps-0-c73m6 0/1 ContainerCreating 0 7s alexnet-ps-1-pm9s6 0/1 ContainerCreating 0 7s alexnet-worker-0-pzbbq 0/1 ContainerCreating 0 7s

Why reshape after bias_add ?

Notice codes HERE

        biased = tf.reshape(
            tf.nn.bias_add(
                conv, biases, data_format=self.data_format),
            conv.get_shape())

I think the output shape of bias_add is exactly the same as conv.get_shape().
So why bother to reshape?
Then I add some slim analyzer codes to print the total_ops
total_ops = slim.model_analyzer.analyze_ops(sess.graph)
I notice that total_ops is different with reshape or not (Run Training)

With reshape:
TOTAL_OPS = 4856506501
Without reshape:
TOTAL_OPS = 3989465117

Of course the performance is also different.

No performance improved on batch 128 ?

I run the script followed this:

python tf_cnn_benchmarks.py --local_parameter_device=gpu --num_gpus=4 --batch_size=128 --model=resnet50 --variable_update=replicated --nodistortions --nccl True --trace_file ~/timeline.json

But there is no improvement at all. The speed is equal to the one on batch 64:

Step Img/sec loss
1 images/sec: 725.2 +/- 0.0 (jitter = 0.0) 7.463
10 images/sec: 736.7 +/- 1.4 (jitter = 2.9) 7.180
20 images/sec: 731.7 +/- 2.4 (jitter = 6.7) 7.048
30 images/sec: 723.7 +/- 2.6 (jitter = 19.3) 6.971
40 images/sec: 719.0 +/- 2.4 (jitter = 15.9) 6.929
50 images/sec: 716.0 +/- 2.1 (jitter = 12.2) 6.898

What's the reason? And why is the speed slower and slower when the step is bigger ?

GPU long idle period that makes the train loop slow

Is there any method to let the gpu be at full power?

Besides, when I run the script on one single GPU:
python tf_cnn_benchmarks.py local_parameter_device=cpu --num_gpus= 1 --batch_size=64 --model=resnet50 --variable_update=parameter_server --optimizer=sgd

I got the timeline timeline_benchmark_origin.json.txt when the script had no change and timeline_benchmark_changed.json.txt when this line was replaced by with tf.control_dependencies([]):.

The performance is significantly improved when the operation global_step.assign_add has no dependencies. But the improvement is only useful on the special step and only useful on one GPU:

Starting real work at step 10 at time Wed Jun 28 09:49:39 2017
Done warm up
Step    Img/sec loss
1       images/sec: 544.0 +/- 0.0 (jitter = 0.0)        6.776
10      images/sec: 207.7 +/- 33.2 (jitter = 2.1)       5.985
20      images/sec: 201.4 +/- 17.0 (jitter = 1.8)       5.563
30      images/sec: 199.3 +/- 11.4 (jitter = 1.4)       5.343
40      images/sec: 198.2 +/- 8.6 (jitter = 1.0)        5.255
50      images/sec: 197.5 +/- 6.9 (jitter = 1.0)        5.196
60      images/sec: 197.0 +/- 5.8 (jitter = 1.0)        5.167
70      images/sec: 196.6 +/- 5.0 (jitter = 0.9)        5.145
80      images/sec: 196.3 +/- 4.3 (jitter = 0.9)        5.125
90      images/sec: 196.0 +/- 3.9 (jitter = 1.0)        5.111
Finishing real work at step 109 at time Wed Jun 28 09:50:11 2017
----------------------------------------------------------------
total images/sec: 192.95
----------------------------------------------------------------

What's the reason ?

Confused total images/sec in distributed training

I think the log of distributed training always makes user confused, for instance, total images/sec. I have read and checked the implementation of benchmark_cnn.py, and the evaluation of images_per_sec is as below:

images_per_sec = (num_workers*batch_size) / average_wall_time
average_wall_time = elapsed_time / num_steps
num_steps = global_step_watcher.num_steps()
elapsed_time = global_step_watcher.elapsed_time()

It's obvious that the num_steps is the sum of all workers, but elapsed_time is just the time cost of each worker and num_workers == 1.

I attached my command and results in the following. I added some code to print the extra outputs such as num_worker, num_steps and average_wall_time .

Run command

python tf_cnn_benchmarks.py \
--device=cpu --mkl=True --num_inter_threads=1 \
--num_intra_threads=16 --data_format=NHWC \
--forward_only=True --kmp_blocktime=0 \
--batch_size=32 --model=inception3 \
--worker_hosts=... --ps_hosts=... \
--job_name=... --task_index=...

Test results

1 ps 1 worker, image_per_sec = 22.30

TensorFlow:  1.4
Model:       inception3
Mode:        forward-only
SingleSess:  False
Batch size:  32 global
             32 per device
Devices:     ['/job:worker/task:0/cpu:0']
Data format: NHWC
Optimizer:   sgd
Variables:   parameter_server
Sync:        True
==========
Generating model
Running warm up
Done warm up
Step	Img/sec	loss	top_1_accuracy	top_5_accuracy
1	images/sec: 23.5 +/- 0.0 (jitter = 0.0)	0.000	0.000	0.000
10	images/sec: 22.7 +/- 0.2 (jitter = 0.6)	0.000	0.000	0.000
20	images/sec: 22.7 +/- 0.1 (jitter = 0.7)	0.000	0.000	0.000
30	images/sec: 22.6 +/- 0.1 (jitter = 0.6)	0.000	0.000	0.000
40	images/sec: 22.7 +/- 0.1 (jitter = 0.5)	0.000	0.000	0.000
50	images/sec: 22.5 +/- 0.1 (jitter = 0.5)	0.000	0.000	0.031
60	images/sec: 22.4 +/- 0.1 (jitter = 0.5)	0.000	0.000	0.000
70	images/sec: 22.4 +/- 0.1 (jitter = 0.5)	0.000	0.000	0.000
80	images/sec: 22.4 +/- 0.1 (jitter = 0.6)	0.000	0.000	0.000
90	images/sec: 22.4 +/- 0.1 (jitter = 0.5)	0.000	0.000	0.000
100	images/sec: 22.4 +/- 0.1 (jitter = 0.5)	0.000	0.000	0.000
----------------------------------------------------------------
num_workers:	1
num_steps:	99
average_wall_time:	1.43500526264
total images/sec: 22.30
----------------------------------------------------------------

1 ps 2 worker, image_per_sec = 90.16?

TensorFlow:  1.4
Model:       inception3
Mode:        forward-only
SingleSess:  False
Batch size:  32 global
             32 per device
Devices:     ['/job:worker/task:1/cpu:0']
Data format: NHWC
Optimizer:   sgd
Variables:   parameter_server
Sync:        True
==========
Generating model
Running warm up
Done warm up
Step	Img/sec	loss	top_1_accuracy	top_5_accuracy
1	images/sec: 23.0 +/- 0.0 (jitter = 0.0)	0.000	0.000	0.000
10	images/sec: 22.4 +/- 0.2 (jitter = 0.6)	0.000	0.000	0.000
20	images/sec: 22.4 +/- 0.1 (jitter = 0.5)	0.000	0.000	0.000
30	images/sec: 22.5 +/- 0.1 (jitter = 0.5)	0.000	0.000	0.000
40	images/sec: 22.6 +/- 0.1 (jitter = 0.4)	0.000	0.000	0.000
50	images/sec: 22.6 +/- 0.1 (jitter = 0.5)	0.000	0.000	0.031
60	images/sec: 22.5 +/- 0.1 (jitter = 0.5)	0.000	0.000	0.000
70	images/sec: 22.4 +/- 0.1 (jitter = 0.5)	0.000	0.000	0.000
80	images/sec: 22.4 +/- 0.1 (jitter = 0.5)	0.000	0.000	0.000
90	images/sec: 22.4 +/- 0.1 (jitter = 0.5)	0.000	0.000	0.000
100	images/sec: 22.5 +/- 0.1 (jitter = 0.5)	0.000	0.000	0.000
----------------------------------------------------------------
num_workers:	1
num_steps:	199
average_wall_time:	0.709418054801
total images/sec: 45.11
----------------------------------------------------------------
TensorFlow:  1.4
Model:       inception3
Mode:        forward-only
SingleSess:  False
Batch size:  32 global
             32 per device
Devices:     ['/job:worker/task:0/cpu:0']
Data format: NHWC
Optimizer:   sgd
Variables:   parameter_server
Sync:        True
==========
Generating model
Running warm up
Done warm up
Step	Img/sec	loss	top_1_accuracy	top_5_accuracy
1	images/sec: 23.1 +/- 0.0 (jitter = 0.0)	0.000	0.000	0.000
10	images/sec: 22.4 +/- 0.2 (jitter = 0.6)	0.000	0.000	0.000
20	images/sec: 22.5 +/- 0.1 (jitter = 0.5)	0.000	0.000	0.000
30	images/sec: 22.5 +/- 0.1 (jitter = 0.5)	0.000	0.000	0.000
40	images/sec: 22.6 +/- 0.1 (jitter = 0.4)	0.000	0.000	0.000
50	images/sec: 22.6 +/- 0.1 (jitter = 0.4)	0.000	0.000	0.031
60	images/sec: 22.5 +/- 0.1 (jitter = 0.5)	0.000	0.000	0.000
70	images/sec: 22.5 +/- 0.1 (jitter = 0.4)	0.000	0.000	0.000
80	images/sec: 22.5 +/- 0.1 (jitter = 0.5)	0.000	0.000	0.000
90	images/sec: 22.5 +/- 0.1 (jitter = 0.5)	0.000	0.000	0.000
100	images/sec: 22.5 +/- 0.1 (jitter = 0.5)	0.000	0.000	0.000
----------------------------------------------------------------
num_workers:	1
num_steps:	199
average_wall_time:	0.710339195165
total images/sec: 45.05
----------------------------------------------------------------

Is the calculation of image_per_sec for 1 ps 2 worker correct?

ImportError: cannot import name interleave_ops

After I pull and merge the latest commit, I got the ImportError. I attached the error log as below:

Traceback (most recent call last):
  File "tf_cnn_benchmarks.py", line 26, in <module>
    import benchmark_cnn
  File ".../benchmarks/scripts/tf_cnn_benchmarks/benchmark_cnn.py", line 43, in <module>
    import datasets
  File ".../benchmarks/scripts/tf_cnn_benchmarks/datasets.py", line 28, in <module>
    import preprocessing
  File ".../benchmarks/scripts/tf_cnn_benchmarks/preprocessing.py", line 23, in <module>
    from tensorflow.contrib.data.python.ops import interleave_ops
ImportError: cannot import name interleave_ops

I reviewed the commit and found the latest merged code causing the error. You can find my comment here, and the code here

how the sync_queues_and_barrier works when trains in distributed_replicated

When I use --distributed_replicated to train, I confuse how do the PS ensures that it have got all the gradient to average and start next step. I trace the code, and find the server creating the queue_ops operations and entering the sync_queues to the queue_ops after updating the gradient to the variable. But I can't find how the server ensures the synchronization with multi workers. It is strange to set the element in the sync_queues as false but never check and function it.

What factors affect the speedup？

the benchmark of google tell the speedup is 59，but we test only 35，How can we improve the speedup？

Running distributed_all_reduce in only CPU mode

I am running distributed Tensorflow with GRPC protocol on only CPUs. I enabled distributed_all_reduce type of variable update with 'all_reduce_spec = xring':

I am wondering, if this mode is supposed to work for CPU only distributed runs. If yes, then does it need a different controller process in addition to workers.

I am getting errors such as:
Unknown device: /job:worker/replica:0/task:2/device:CPU:0 all devices: CPU:0, /job:worker/replica:0/task:0/cpu:0, /job:worker/replica:0/task:0/device:CPU:0

How to run the benchmarks on Intel's Xeon Phi?

I have built tensorflow from scratch and have installed it on the Intel's Xeon Phi.
The Xeon Phi does not have graphics card, so how do I run the inceeptionv3 benchmark (tf_cnn_benchmarks.py) on it's CPU?
Can someone provide me the command with the appropriate parameters in order to get it running on the CPU?

Thank you and congrats ppwwyyxx

@ppwwyyxx

I saw your Stanford submission to the dawn benchmark, and want to say thank you for all the work you do on TensorPack. Opening a github issue seemed odd but an easy way to mention you in hopes you see it.

No module named 'data_generator'

When trying to run fresh copy of tensorflow/benchmarks for Keras I am getting the following error:

python3 run_benchmark.py
Using TensorFlow backend.
ModuleNotFoundError: No module named 'data_generator'

Google does not aware of such module as 'data_generator', well, at least I can't explain to it what this module is about.

What do I do wrong?

Global step is 2 * len(workers) * local_step in distributed training

In single worker training, the global step is the same as local step

2017-12-07 19:24:42,741 INFO 35 global step 94840
2017-12-07 19:24:44,858 INFO 35 step 94840 batch 128 train_time 0.197075128555 images/sec: 546.0 +/- 0.3 (jitter = 66.0) 3.512 0.422 0.648

But in distributed training, the global step is 2 * len(workers) * local_step, and I think it should be len(workers) * local_step?

4 ps 4 workers (--num_gpus=4 --summary_verbosity=1 --model=resnet50 --variable_update=distributed_replicated --num_batches=5005000 --batch_group_size=2 --local_parameter_device=gpu --batch_size=32)

2017-12-07 19:34:21,283 INFO 31 global step 381280
2017-12-07 19:34:21,942 INFO 31 47650 images/sec: 198.1 +/- 0.1 (jitter = 18.9) 2.359 0.617 0.781

And images/sec is much slower compared to single worker

ImportError: Cannot import name 'batching'

Hello All,

I've installed tensorflow for GPU and all the dependencies, and I'm trying to run the benchmark by simply cloning this repository and using the suggested command to run the Inception V3 Model:
python tf_cnn_benchmarks.py --num_gpus=1 --batch_size=32 --model=inception3 --variable_update=parameter_server

However I'm getting the following error:

from tensorflow.contrib.data.python.ops import batching
ImportError: Cannot import name 'batching'

I'm guessing it can't find the 'batching' script due to some path issues, but I'm not sure

I found the 'batching' script it's looking for here:
https://github.com/tensorflow/tensorflow/blob/master/tensorflow/contrib/data/python/ops/batching.py

I can just place it in that directory, but I trust that this is not the correct approach, and more errors may follow.

Is there something I'm missing ? Is all I need is just clone the repo. then run the script or are there other steps I'm missing ? I'm also getting the same error on both python3.4 & python2.7 on my Ubuntu 14 machine.

Thanks in advance.

Python 3.x environment import error

I cloned this project and want to run it, and I install tensorflow using pip install tensorflow==1.4.0rc1,
but I will get import error in Python 3 env.

[root@hp tf_cnn_benchmarks]$ pwd
/tmp/benchmarks/scripts/tf_cnn_benchmarks
[root@hp tf_cnn_benchmarks]$ ~/miniconda3/bin/pip freeze | grep -w tensorflow
tensorflow==1.4.0rc1
tensorflow-tensorboard==0.4.0rc1
[root@hp tf_cnn_benchmarks]$ ~/miniconda3/bin/python tf_cnn_benchmarks.py 
Traceback (most recent call last):
  File "tf_cnn_benchmarks.py", line 26, in <module>
    import benchmark_cnn
  File "/tmp/benchmarks/scripts/tf_cnn_benchmarks/benchmark_cnn.py", line 46, in <module>
    from models import model_config
  File "/tmp/benchmarks/scripts/tf_cnn_benchmarks/models/model_config.py", line 19, in <module>
    import alexnet_model
ImportError: No module named 'alexnet_model'

But I test it in Python 2 env, it's ok, and it have result .

[root@hp tf_cnn_benchmarks]$ ~/miniconda2/bin/pip freeze | grep tensorflow
tensorflow==1.4.0rc1
tensorflow-tensorboard==0.4.0rc1
[root@hp tf_cnn_benchmarks]$ ~/miniconda2/bin/python tf_cnn_benchmarks.py
TensorFlow:  1.4
Model:       trivial
Mode:        training
... ...

Someone meet this similar problems ? Thanks !

put fc layers in vgg/alexnet on single gpu/worker?

although we have replicated strategy for single/multi machine training, it's maybe better to put fc layers on single device to avoid too much large weights transmission.

CNN Benchmark for Multi-GPU uses weak scaling.

tf_cnn_benchmarks.py uses weak scaling:

self.batch_size = self.model_conf.get_batch_size() * FLAGS.num_gpus

So that means you're solving a different problem wrt 1 GPU if you are using multiple GPUs. Is this intended?

key not found in checkpoint in distributed mode of tensorflow

when I run the cnn_benchmark function of tf_cnn_benchmark , everything looks fine and checkpoint file is successfully stored on train_dir .But when i run the eval function ,the exception occurs.

……
2017-07-26 15:30:52.072950: W tensorflow/core/framework/op_kernel.cc:1152] Not found: Key v3/incept_v3_e0/conv84/batchnorm84/moving_variance not found in checkpoint
2017-07-26 15:30:52.073198: W tensorflow/core/framework/op_kernel.cc:1152] Not found: Key v3/incept_v3_e0/conv85/batchnorm85/beta not found in checkpoint
2017-07-26 15:30:52.073278: W tensorflow/core/framework/op_kernel.cc:1152] Not found: Key v3/incept_v3_e0/conv87/batchnorm87/moving_mean not found in checkpoint
2017-07-26 15:30:52.073406: W tensorflow/core/framework/op_kernel.cc:1152] Not found: Key v3/incept_v3_e0/conv85/batchnorm85/moving_mean not found in checkpoint
2017-07-26 15:30:52.073536: W tensorflow/core/framework/op_kernel.cc:1152] Not found: Key v3/incept_v3_e0/conv85/conv2d/kernel not found in checkpoint
2017-07-26 15:30:52.073577: W tensorflow/core/framework/op_kernel.cc:1152] Not found: Key v3/incept_v3_e0/conv88/batchnorm88/beta not found in checkpoint
2017-07-26 15:30:52.073661: W tensorflow/core/framework/op_kernel.cc:1152] Not found: Key v3/incept_v3_e0/conv87/batchnorm87/moving_variance not found in checkpoint
2017-07-26 15:30:52.073738: W tensorflow/core/framework/op_kernel.cc:1152] Not found: Key v3/incept_v3_e0/conv85/batchnorm85/moving_variance not found in checkpoint
2017-07-26 15:30:52.073810: W tensorflow/core/framework/op_kernel.cc:1152] Not found: Key v3/incept_v3_e0/conv88/conv2d/kernel not found in checkpoint
2017-07-26 15:30:52.073863: W tensorflow/core/framework/op_kernel.cc:1152] Not found: Key v3/incept_v3_e0/conv88/batchnorm88/moving_variance not found in checkpoint
2017-07-26 15:30:52.073957: W tensorflow/core/framework/op_kernel.cc:1152] Not found: Key v3/incept_v3_e0/conv84/batchnorm84/moving_mean not found in checkpoint
2017-07-26 15:30:52.074055: W tensorflow/core/framework/op_kernel.cc:1152] Not found: Key v3/incept_v3_e0/conv87/batchnorm87/gamma not found in checkpoint
2017-07-26 15:30:52.074110: W tensorflow/core/framework/op_kernel.cc:1152] Not found: Key v3/incept_v3_e0/conv84/batchnorm84/beta not found in checkpoint
2017-07-26 15:30:52.074348: W tensorflow/core/framework/op_kernel.cc:1152] Not found: Key v3/incept_v3_e0/conv87/batchnorm87/beta not found in checkpoint
2017-07-26 15:30:52.074395: W tensorflow/core/framework/op_kernel.cc:1152] Not found: Key v3/incept_v3_e0/conv86/batchnorm86/moving_variance not found in checkpoint
2017-07-26 15:30:52.074757: W tensorflow/core/framework/op_kernel.cc:1152] Not found: Key v3/incept_v3_e0/conv88/batchnorm88/moving_mean not found in checkpoint
2017-07-26 15:30:52.074770: W tensorflow/core/framework/op_kernel.cc:1152] Not found: Key v3/incept_v3_e0/conv88/batchnorm88/gamma not found in checkpoint
2017-07-26 15:30:52.074843: W tensorflow/core/framework/op_kernel.cc:1152] Not found: Key v3/incept_v3_e0/conv86/batchnorm86/moving_mean not found in checkpoint

raceback (most recent call last):
  File "tf_cnn_benchmarks.py", line 1348, in <module>
    tf.app.run()
  File "/usr/lib/python2.7/site-packages/tensorflow/python/platform/app.py", line 48, in run
    _sys.exit(main(_sys.argv[:1] + flags_passthrough))
  File "tf_cnn_benchmarks.py", line 1344, in main
    bench.run()
  File "tf_cnn_benchmarks.py", line 885, in run
    self._eval_cnn()
  File "tf_cnn_benchmarks.py", line 901, in _eval_cnn
    global_step = load_checkpoint(saver, sess, FLAGS.train_dir)
  File "tf_cnn_benchmarks.py", line 717, in load_checkpoint
    saver.restore(sess, model_checkpoint_path)
  File "/usr/lib/python2.7/site-packages/tensorflow/python/training/saver.py", line 1457, in restore
    {self.saver_def.filename_tensor_name: save_path})
  File "/usr/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 778, in run
    run_metadata_ptr)
  File "/usr/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 982, in _run
    feed_dict_string, options, run_metadata)
  File "/usr/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 1032, in _do_run
    target_list, options, run_metadata)
  File "/usr/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 1052, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.NotFoundError: Key v0/incept_v3_d0/conv73/batchnorm73/gamma not found in checkpoint
	 [[Node: save/RestoreV2_369 = RestoreV2[dtypes=[DT_FLOAT], _device="/job:localhost/replica:0/task:0/cpu:0"](_recv_save/Const_0, save/RestoreV2_369/tensor_names, save/RestoreV2_369/shape_and_slices)]]
	 [[Node: save/RestoreV2_809/_2199 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/gpu:1", send_device="/job:localhost/replica:0/task:0/cpu:0", send_device_incarnation=1, tensor_name="edge_15876_save/RestoreV2_809", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/gpu:1"]()]]

Caused by op u'save/RestoreV2_369', defined at:
  File "tf_cnn_benchmarks.py", line 1348, in <module>
    tf.app.run()
  File "/usr/lib/python2.7/site-packages/tensorflow/python/platform/app.py", line 48, in run
    _sys.exit(main(_sys.argv[:1] + flags_passthrough))
  File "tf_cnn_benchmarks.py", line 1344, in main
    bench.run()
  File "tf_cnn_benchmarks.py", line 885, in run
    self._eval_cnn()
  File "tf_cnn_benchmarks.py", line 892, in _eval_cnn
    saver = tf.train.Saver(tf.global_variables())
  File "/usr/lib/python2.7/site-packages/tensorflow/python/training/saver.py", line 1056, in __init__
    self.build()
  File "/usr/lib/python2.7/site-packages/tensorflow/python/training/saver.py", line 1086, in build
    restore_sequentially=self._restore_sequentially)
  File "/usr/lib/python2.7/site-packages/tensorflow/python/training/saver.py", line 691, in build
    restore_sequentially, reshape)
  File "/usr/lib/python2.7/site-packages/tensorflow/python/training/saver.py", line 407, in _AddRestoreOps
    tensors = self.restore_op(filename_tensor, saveable, preferred_shard)
  File "/usr/lib/python2.7/site-packages/tensorflow/python/training/saver.py", line 247, in restore_op
    [spec.tensor.dtype])[0])
  File "/usr/lib/python2.7/site-packages/tensorflow/python/ops/gen_io_ops.py", line 669, in restore_v2
    dtypes=dtypes, name=name)
  File "/usr/lib/python2.7/site-packages/tensorflow/python/framework/op_def_library.py", line 768, in apply_op
    op_def=op_def)
  File "/usr/lib/python2.7/site-packages/tensorflow/python/framework/ops.py", line 2336, in create_op
    original_op=self._default_original_op, op_def=op_def)
  File "/usr/lib/python2.7/site-packages/tensorflow/python/framework/ops.py", line 1228, in __init__
    self._traceback = _extract_stack()

NotFoundError (see above for traceback): Key v0/incept_v3_d0/conv73/batchnorm73/gamma not found in checkpoint
	 [[Node: save/RestoreV2_369 = RestoreV2[dtypes=[DT_FLOAT], _device="/job:localhost/replica:0/task:0/cpu:0"](_recv_save/Const_0, save/RestoreV2_369/tensor_names, save/RestoreV2_369/shape_and_slices)]]
	 [[Node: save/RestoreV2_809/_2199 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/gpu:1", send_device="/job:localhost/replica:0/task:0/cpu:0", send_device_incarnation=1, tensor_name="edge_15876_save/RestoreV2_809", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/gpu:1"]()]]

my train worker script

python tf_cnn_benchmarks.py  --train_dir /home/sk/test/train_dir --variable_update distributed_replicated --model inception3 --batch_size 8 --ps_hosts=127.0.0.1:13555 --worker_hosts=127.0.0.1:13600 --job_name=worker --task_index=0 --num_gpus 4  --local_parameter_device cpu

parameter script

python tf_cnn_benchmarks.py  --train_dir /home/sk/test/train_dir --variable_update distributed_replicated --model inception3 --batch_size 8 --ps_hosts=127.0.0.1:13555 --worker_hosts=127.0.0.1:13600 --job_name=ps --task_index=0 --num_gpus 0  --local_parameter_device cpu

eval script

python tf_cnn_benchmarks.py  --train_dir /home/sk/test/train_dir --variable_update replicated --model inception3 --batch_size 8 --num_gpus 4 --eval

ll ~/test/train_dir/

total 126348
-rw-rw-r-- 1       143 Jul 26 15:24 checkpoint
-rw-rw-r-- 1       23760967 Jul 26 15:23 graph.pbtxt
-rw-rw-r-- 1       95277612 Jul 26 15:24 model.ckpt-110.data-00000-of-00001
-rw-rw-r-- 1       9461 Jul 26 15:24 model.ckpt-110.index
-rw-rw-r-- 1       10317639 Jul 26 15:24 model.ckpt-110.meta

besides ,I used to run train method in stand-alone mode （ --variable_update replicated ），and the eval function worked well , so I don't know why it doesn't works in distributed_replicated mode. any one who can helps me ? thanks a lot ..

How can I start a benchmark with `distributed_all_reduce` ?

My Env:
TensorFlow: 1.3
CUDA: 8.0
cuDNN: 6.0

I notice an update for distributed_all_reduce so I want to have a try. But I'm not sure what value should controller_host takes...
My args are:

--variable_update=distributed_all_reduce
--all_reduce_spec=pscpu:32k:xring

and I start 3 processes with args:
FIRST:

--job_name=worker
--worker_hosts=127.0.0.1:50001,127.0.0.1:50002
--task_index=0

SECONDE:

--job_name=worker
--worker_hosts=127.0.0.1:50001,127.0.0.1:50002
--task_index=1

THIRD:

--job_name=controller
--controller_host=??
--task_index=0

When I put 127.0.0.1:50000 or 127.0.0.1:50001 on controller_host, I got:

TensorFlow:  1.3
Model:       resnet50
Mode:        training
SingleSess:  True
Batch size:  128 global
             64 per device
Devices:     ['job:worker/task0/gpu:0', 'job:worker/task1/gpu:0']
Data format: NCHW
Optimizer:   sgd
Variables:   distributed_all_reduce
AllReduce:   pscpu:32k:xring
Sync:        True
==========
Generating model
WARNING:tensorflow:From /home/zzy/workspace/benchmarks/scripts/tf_cnn_benchmarks/preprocessing.py:486: __init__ (from tensorflow.contrib.data.python.ops.readers) is deprecated and will be removed in a future version.
Instructions for updating:
Use `tf.data.TFRecordDataset`.
WARNING:tensorflow:From /home/zzy/workspace/benchmarks/scripts/tf_cnn_benchmarks/preprocessing.py:487: range (from tensorflow.contrib.data.python.ops.dataset_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use `tf.data.Dataset.range()`.
WARNING:tensorflow:From /home/zzy/workspace/benchmarks/scripts/tf_cnn_benchmarks/preprocessing.py:489: zip (from tensorflow.contrib.data.python.ops.dataset_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use `tf.data.Dataset.zip()`.
2017-10-10 14:03:34.183287: E tensorflow/core/common_runtime/session.cc:69] Not found: No session factory registered for the given session options: {target: "127.0.0.1:50001" config: intra_op_parallelism_threads: 1 gpu_options { force_gpu_compatible: true } allow_soft_placement: true} Registered factories are {DIRECT_SESSION, GRPC_SESSION}.
Traceback (most recent call last):
  File "/home/zzy/workspace/benchmarks/scripts/tf_cnn_benchmarks/tf_cnn_benchmarks.py", line 46, in <module>
    tf.app.run()
  File "/home/zzy/anaconda2/envs/tf-1.3/lib/python2.7/site-packages/tensorflow/python/platform/app.py", line 48, in run
    _sys.exit(main(_sys.argv[:1] + flags_passthrough))
  File "/home/zzy/workspace/benchmarks/scripts/tf_cnn_benchmarks/tf_cnn_benchmarks.py", line 42, in main
    bench.run()
  File "/home/zzy/workspace/benchmarks/scripts/tf_cnn_benchmarks/benchmark_cnn.py", line 892, in run
    return self._benchmark_cnn()
  File "/home/zzy/workspace/benchmarks/scripts/tf_cnn_benchmarks/benchmark_cnn.py", line 1068, in _benchmark_cnn
    start_standard_services=start_standard_services) as sess:
  File "/home/zzy/anaconda2/envs/tf-1.3/lib/python2.7/contextlib.py", line 17, in __enter__
    return self.gen.next()
  File "/home/zzy/anaconda2/envs/tf-1.3/lib/python2.7/site-packages/tensorflow/python/training/supervisor.py", line 964, in managed_session
    self.stop(close_summary_writer=close_summary_writer)
  File "/home/zzy/anaconda2/envs/tf-1.3/lib/python2.7/site-packages/tensorflow/python/training/supervisor.py", line 792, in stop
    stop_grace_period_secs=self._stop_grace_secs)
  File "/home/zzy/anaconda2/envs/tf-1.3/lib/python2.7/site-packages/tensorflow/python/training/coordinator.py", line 389, in join
    six.reraise(*self._exc_info_to_raise)
  File "/home/zzy/anaconda2/envs/tf-1.3/lib/python2.7/site-packages/tensorflow/python/training/supervisor.py", line 953, in managed_session
    start_standard_services=start_standard_services)
  File "/home/zzy/anaconda2/envs/tf-1.3/lib/python2.7/site-packages/tensorflow/python/training/supervisor.py", line 708, in prepare_or_wait_for_session
    init_feed_dict=self._init_feed_dict, init_fn=self._init_fn)
  File "/home/zzy/anaconda2/envs/tf-1.3/lib/python2.7/site-packages/tensorflow/python/training/session_manager.py", line 273, in prepare_session
    config=config)
  File "/home/zzy/anaconda2/envs/tf-1.3/lib/python2.7/site-packages/tensorflow/python/training/session_manager.py", line 178, in _restore_checkpoint
    sess = session.Session(self._target, graph=self._graph, config=config)
  File "/home/zzy/anaconda2/envs/tf-1.3/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 1482, in __init__
    super(Session, self).__init__(target, graph, config=config)
  File "/home/zzy/anaconda2/envs/tf-1.3/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 622, in __init__
    self._session = tf_session.TF_NewDeprecatedSession(opts, status)
  File "/home/zzy/anaconda2/envs/tf-1.3/lib/python2.7/site-packages/tensorflow/python/framework/errors_impl.py", line 473, in __exit__
    c_api.TF_GetCode(self.status.status))
tensorflow.python.framework.errors_impl.NotFoundError: No session factory registered for the given session options: {target: "127.0.0.1:50001" config: intra_op_parallelism_threads: 1 gpu_options { force_gpu_compatible: true } allow_soft_placement: true} Registered factories are {DIRECT_SESSION, GRPC_SESSION}.

Multi-node benchmark issue

Did anyone run the benchmark on multi-node (without gpu)?

The detailed information:
centOS 7.2
git checkout r1.3
bazel build --config=mkl --copt=-DEIGEN_USE_VML -s -c opt //tensorflow/tools/pip_package:build_pip_package

Used commands are as fellows:

python tf_cnn_benchmarks.py --local_parameter_device=cpu --batch_size=32 --model=alexnet --variable_update=distributed_replicated --job_name=worker --ps_hosts=192.192.1.1:50000 --worker_hosts=192.192.1.1:50001 --task_index=0

python tf_cnn_benchmarks.py --local_parameter_device=cpu --batch_size=32 --model=alexnet --variable_update=distributed_replicated --job_name=ps --ps_hosts=192.192.1.1:50000 --worker_hosts=192.192.1.1:50001 --task_index=0

-- PS message --
Running parameter server 0

-- Worker error message --
Running warm up
Exception in thread Thread-1:
Traceback (most recent call last):
File "/usr/lib64/python2.7/threading.py", line 811, in __bootstrap_inner
self.run()
File "tf_cnn_benchmarks.py", line 232, in run
global_step_val, = self.sess.run([self.global_step_op])
File "/usr/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 895, in run
run_metadata_ptr)
File "/usr/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 1051, in _run
raise RuntimeError('Attempted to use a closed Session.')
RuntimeError: Attempted to use a closed Session.

Traceback (most recent call last):
File "tf_cnn_benchmarks.py", line 1345, in
tf.app.run()
File "/usr/lib/python2.7/site-packages/tensorflow/python/platform/app.py", line 48, in run
_sys.exit(main(_sys.argv[:1] + flags_passthrough))
File "tf_cnn_benchmarks.py", line 1341, in main
bench.run()
File "tf_cnn_benchmarks.py", line 884, in run
self._benchmark_cnn()
File "tf_cnn_benchmarks.py", line 1026, in _benchmark_cnn
self.trace_filename, fetch_summary)
File "tf_cnn_benchmarks.py", line 660, in benchmark_one_step
results = sess.run(fetches, options=run_options, run_metadata=run_metadata)
File "/usr/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 895, in run
run_metadata_ptr)
File "/usr/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 1124, in _run
feed_dict_tensor, options, run_metadata)
File "/usr/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 1321, in _do_run
options, run_metadata)
File "/usr/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 1340, in _do_call
raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.InvalidArgumentError: tensor_in must be 4-dimensional
[[Node: v0/tower_0/gradients/v0/tower_0/mpool2/MaxPool_grad/MaxPoolGrad = _MklMaxPoolGrad[T=DT_FLOAT, _kernel="MklOp", data_format="NCHW", ksize=[1, 1, 3, 3], padding="VALID", strides=[1, 1, 2, 2], workspace_enabled=true, _device="/job:worker/replica:0/task:0/cpu:0"](v0/tower_0/conv4/Relu, v0/tower_0/mpool2/MaxPool, v0/tower_0/gradients/v0/tower_0/Reshape_grad/Reshape, v0/tower_0/mpool2/MaxPool:1, DMT/_57, DMT/_58, v0/tower_0/gradients/v0/tower_0/Reshape_grad/Reshape:1, v0/tower_0/mpool2/MaxPool:3)]]

Caused by op u'v0/tower_0/gradients/v0/tower_0/mpool2/MaxPool_grad/MaxPoolGrad', defined at:
File "tf_cnn_benchmarks.py", line 1345, in
tf.app.run()
File "/usr/lib/python2.7/site-packages/tensorflow/python/platform/app.py", line 48, in run
_sys.exit(main(_sys.argv[:1] + flags_passthrough))
File "tf_cnn_benchmarks.py", line 1341, in main
bench.run()
File "tf_cnn_benchmarks.py", line 884, in run
self._benchmark_cnn()
File "tf_cnn_benchmarks.py", line 924, in _benchmark_cnn
(enqueue_ops, fetches) = self._build_model()
File "tf_cnn_benchmarks.py", line 1095, in _build_model
gpu_grad_stage_ops)
File "tf_cnn_benchmarks.py", line 1262, in add_forward_pass_and_gradients
grads = tf.gradients(loss, params, aggregation_method=aggmeth)
File "/usr/lib/python2.7/site-packages/tensorflow/python/ops/gradients_impl.py", line 542, in gradients
grad_scope, op, func_call, lambda: grad_fn(op, *out_grads))
File "/usr/lib/python2.7/site-packages/tensorflow/python/ops/gradients_impl.py", line 348, in _MaybeCompile
return grad_fn() # Exit early
File "/usr/lib/python2.7/site-packages/tensorflow/python/ops/gradients_impl.py", line 542, in
grad_scope, op, func_call, lambda: grad_fn(op, *out_grads))
File "/usr/lib/python2.7/site-packages/tensorflow/python/ops/nn_grad.py", line 526, in _MaxPoolGrad
data_format=op.get_attr("data_format"))
File "/usr/lib/python2.7/site-packages/tensorflow/python/ops/gen_nn_ops.py", line 1754, in _max_pool_grad
data_format=data_format, name=name)
File "/usr/lib/python2.7/site-packages/tensorflow/python/framework/op_def_library.py", line 767, in apply_op
op_def=op_def)
File "/usr/lib/python2.7/site-packages/tensorflow/python/framework/ops.py", line 2628, in create_op
original_op=self._default_original_op, op_def=op_def)
File "/usr/lib/python2.7/site-packages/tensorflow/python/framework/ops.py", line 1204, in init
self._traceback = self._graph._extract_stack() # pylint: disable=protected-access

...which was originally created as op u'v0/tower_0/mpool2/MaxPool', defined at:
File "tf_cnn_benchmarks.py", line 1345, in
tf.app.run()
[elided 4 identical lines from previous traceback]
File "tf_cnn_benchmarks.py", line 1095, in _build_model
gpu_grad_stage_ops)
File "tf_cnn_benchmarks.py", line 1245, in add_forward_pass_and_gradients
self.model_conf.add_inference(network)
File "/home/tina/tensorflow/benchmarks/scripts/tf_cnn_benchmarks/alexnet_model.py", line 42, in add_inference
cnn.mpool(3, 3, 2, 2)
File "tf_cnn_benchmarks.py", line 372, in mpool
name=name)
File "/usr/lib/python2.7/site-packages/tensorflow/python/layers/pooling.py", line 426, in max_pooling2d
return layer.apply(inputs)
File "/usr/lib/python2.7/site-packages/tensorflow/python/layers/base.py", line 503, in apply
return self.call(inputs, *args, **kwargs)
File "/usr/lib/python2.7/site-packages/tensorflow/python/layers/base.py", line 450, in call
outputs = self.call(inputs, *args, **kwargs)
File "/usr/lib/python2.7/site-packages/tensorflow/python/layers/pooling.py", line 276, in call
data_format=utils.convert_data_format(self.data_format, 4))
File "/usr/lib/python2.7/site-packages/tensorflow/python/ops/nn_ops.py", line 1772, in max_pool
name=name)
File "/usr/lib/python2.7/site-packages/tensorflow/python/ops/gen_nn_ops.py", line 1607, in _max_pool
data_format=data_format, name=name)

InvalidArgumentError (see above for traceback): tensor_in must be 4-dimensional
[[Node: v0/tower_0/gradients/v0/tower_0/mpool2/MaxPool_grad/MaxPoolGrad = _MklMaxPoolGrad[T=DT_FLOAT, _kernel="MklOp", data_format="NCHW", ksize=[1, 1, 3, 3], padding="VALID", strides=[1, 1, 2, 2], workspace_enabled=true, _device="/job:worker/replica:0/task:0/cpu:0"](v0/tower_0/conv4/Relu, v0/tower_0/mpool2/MaxPool, v0/tower_0/gradients/v0/tower_0/Reshape_grad/Reshape, v0/tower_0/mpool2/MaxPool:1, DMT/_57, DMT/_58, v0/tower_0/gradients/v0/tower_0/Reshape_grad/Reshape:1, v0/tower_0/mpool2/MaxPool:3)]]

Set argument zero_debias_moving_mean=True of tf.contrib.layers.batch_norm will get an error

If I set zero_debias_moving_mean=True (see here)，and when update the variable moving_mean , the argument zero_debias will be set to True to call assign_moving_average (https://github.com/tensorflow/tensorflow/blob/master/tensorflow/python/training/moving_averages.py), and here calls line 173 which use unbiased_var.op.name, but now unbiased_var is a instance of class variable_mgr.StagedModelVariable and do not has argument "op".

tf_cnn_benchmarks.py - UnparsedFlagAccessError

System information
Have I written custom code (as opposed to using a stock example script provided in TensorFlow): no
OS Platform and Distribution (e.g., Linux Ubuntu 16.04):Linux Ubuntu 16.04
TensorFlow installed from (source or binary): binary and source (tested on multiple installations)
TensorFlow version (use command below): v1.4.0
Python version: Python 3.5.2 and 2.7.12
Bazel version (if compiling from source): 0.54
GCC/Compiler version (if compiling from source): 5.4.0
CUDA/cuDNN version: 9.0 & 8.0 / 6.0 & 7.0
GPU model and memory: gtx 1080 / gtx 1080ti
Exact command to reproduce: python tf_cnn_benchmarks.py --num_gpus=1 --batch_size=8 --model=resnet50 --variable_update=parameter_server

Describe the problem
When executing above command (after commenting out from tensorflow.contrib.data.python.ops import interleave_ops in preprocessing.py following #80) I am getting the following UnparsedFlagAccessError:

python3 tf_cnn_benchmarks.py --num_gpus=1 --batch_size=8 --model=resnet50 --variable_update=parameter_server
Traceback (most recent call last):
File "tf_cnn_benchmarks.py", line 47, in
tf.app.run()
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/platform/app.py", line 48, in run
_sys.exit(main(_sys.argv[:1] + flags_passthrough))
File "tf_cnn_benchmarks.py", line 35, in main
params = benchmark_cnn.make_params_from_flags()
File "/opt/benchmarks/scripts/tf_cnn_benchmarks/benchmark_cnn.py", line 666, in make_params_from_flags
flag_values = {name: getattr(FLAGS, name) for name in _DEFAULT_PARAMS.keys()}
File "/opt/benchmarks/scripts/tf_cnn_benchmarks/benchmark_cnn.py", line 666, in
flag_values = {name: getattr(FLAGS, name) for name in _DEFAULT_PARAMS.keys()}
File "/usr/local/lib/python3.5/dist-packages/absl/flags/_flagvalues.py", line 488, in getattr
raise _exceptions.UnparsedFlagAccessError(error_message)
absl.flags._exceptions.UnparsedFlagAccessError: Trying to access flag --trace_file before flags were parsed.

This can be reproduced by everyone using the NGC Tensorflow container (nvcr.io/nvidia/tensorflow:17.12), cloning this rep into the container and running tf_cnn_benchmarks.py.

staged_vars=True slower than staged_vars=False

My Environment:
GPU: K80
OS: redhat7.2
Tensorflow: 1.2

When I run tf_cnn_benchmarks.py, I find it's slower when I enable staged_vars.
In theory, it should be faster when staged_vars is enabled because the main computation is not blocked by memcpy(HtoD) at the beginning of each step.

Here's my running:

python tf_cnn_benchmarks \
--batch_size=32 \
--model=resnet50 \
--data_name=imagenet \
--data_dir=/export1/ImageNet \
--learning_rate=0.1 \
--weight_decay=None \
--num_gpus=2 \
--local_parameter_device=cpu \
--variable_update=parameter_server \
--staged_vars=True \
--use_nccl=False

**Then I enable the trace_file to see the timeline and I find that one of the GPU (GPU:0 in my case) starts its first convolution layer at 40ms, which is 35ms late after GPU:1 starts its first convolution layer (at 5ms)

But when staged_vars is disabled, all GPUs start first convolution layer at the same time (near 10 ms)**

Here is the logs:

staged_vars = True

TensorFlow:  1.2
Model:       resnet50
Mode:        training
Batch size:  64 global
             32 per device
Devices:     ['/gpu:0', '/gpu:1']
Data format: NCHW
Optimizer:   sgd
Variables:   parameter_server
Staged vars: True
==========
Generating model
2017-07-24 19:32:36.452301: I tensorflow/core/common_runtime/gpu/gpu_device.cc:940] Found device 0 with properties: 
name: Tesla K80
major: 3 minor: 7 memoryClockRate (GHz) 0.8235
pciBusID 0000:06:00.0
Total memory: 11.17GiB
Free memory: 11.11GiB
2017-07-24 19:32:36.740267: W tensorflow/stream_executor/cuda/cuda_driver.cc:523] A non-primary context 0xb982000 exists before initializing the StreamExecutor. We haven't verified StreamExecutor works with that.
2017-07-24 19:32:36.742268: I tensorflow/core/common_runtime/gpu/gpu_device.cc:940] Found device 1 with properties: 
name: Tesla K80
major: 3 minor: 7 memoryClockRate (GHz) 0.8235
pciBusID 0000:07:00.0
Total memory: 11.17GiB
Free memory: 11.11GiB
2017-07-24 19:32:36.742961: I tensorflow/core/common_runtime/gpu/gpu_device.cc:961] DMA: 0 1 
2017-07-24 19:32:36.742976: I tensorflow/core/common_runtime/gpu/gpu_device.cc:971] 0:   Y Y 
2017-07-24 19:32:36.742981: I tensorflow/core/common_runtime/gpu/gpu_device.cc:971] 1:   Y Y 
2017-07-24 19:32:36.742998: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1030] Creating TensorFlow device (/gpu:0) -> (device: 0, name: Tesla K80, pci bus id: 0000:06:00.0)
2017-07-24 19:32:36.743006: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1030] Creating TensorFlow device (/gpu:1) -> (device: 1, name: Tesla K80, pci bus id: 0000:07:00.0)
2017-07-24 19:32:37.881169: I tensorflow/compiler/xla/service/platform_util.cc:58] platform CUDA present with 2 visible devices
2017-07-24 19:32:37.881225: I tensorflow/compiler/xla/service/platform_util.cc:58] platform Host present with 48 visible devices
2017-07-24 19:32:37.884137: I tensorflow/compiler/xla/service/service.cc:198] XLA service 0xcb3e400 executing computations on platform Host. Devices:
2017-07-24 19:32:37.884154: I tensorflow/compiler/xla/service/service.cc:206]   StreamExecutor device (0): <undefined>, <undefined>
2017-07-24 19:32:37.885052: I tensorflow/compiler/xla/service/platform_util.cc:58] platform CUDA present with 2 visible devices
2017-07-24 19:32:37.885070: I tensorflow/compiler/xla/service/platform_util.cc:58] platform Host present with 48 visible devices
2017-07-24 19:32:37.888532: I tensorflow/compiler/xla/service/service.cc:198] XLA service 0xcb3ef40 executing computations on platform CUDA. Devices:
2017-07-24 19:32:37.888550: I tensorflow/compiler/xla/service/service.cc:206]   StreamExecutor device (0): Tesla K80, Compute Capability 3.7
2017-07-24 19:32:37.888556: I tensorflow/compiler/xla/service/service.cc:206]   StreamExecutor device (1): Tesla K80, Compute Capability 3.7
Running warm up
Done warm up
Step	Img/sec	loss
Starting real work at step 10 at time Mon Jul 24 19:32:51 2017
1	images/sec: 86.8 +/- 0.0 (jitter = 0.0)	8.526
10	images/sec: 94.2 +/- 1.9 (jitter = 1.9)	8.512
20	images/sec: 94.4 +/- 1.3 (jitter = 1.9)	8.088
30	images/sec: 94.2 +/- 1.1 (jitter = 2.7)	8.015
40	images/sec: 92.0 +/- 1.0 (jitter = 5.6)	8.086
50	images/sec: 91.1 +/- 0.9 (jitter = 9.6)	7.599
60	images/sec: 90.9 +/- 0.8 (jitter = 9.9)	7.262
70	images/sec: 90.3 +/- 0.8 (jitter = 8.6)	7.153
80	images/sec: 89.7 +/- 0.7 (jitter = 7.7)	7.462
90	images/sec: 89.7 +/- 0.6 (jitter = 9.1)	7.907
Finishing real work at step 109 at time Mon Jul 24 19:34:02 2017
100	images/sec: 89.4 +/- 0.6 (jitter = 9.0)	7.357
----------------------------------------------------------------
total images/sec: 89.06
----------------------------------------------------------------

staged_vars - False

TensorFlow:  1.2
Model:       resnet50
Mode:        training
Batch size:  64 global
             32 per device
Devices:     ['/gpu:0', '/gpu:1']
Data format: NCHW
Optimizer:   sgd
Variables:   parameter_server
==========
Generating model
2017-07-24 19:35:11.740304: I tensorflow/core/common_runtime/gpu/gpu_device.cc:940] Found device 0 with properties: 
name: Tesla K80
major: 3 minor: 7 memoryClockRate (GHz) 0.8235
pciBusID 0000:06:00.0
Total memory: 11.17GiB
Free memory: 11.11GiB
2017-07-24 19:35:12.024853: W tensorflow/stream_executor/cuda/cuda_driver.cc:523] A non-primary context 0xcebe000 exists before initializing the StreamExecutor. We haven't verified StreamExecutor works with that.
2017-07-24 19:35:12.026114: I tensorflow/core/common_runtime/gpu/gpu_device.cc:940] Found device 1 with properties: 
name: Tesla K80
major: 3 minor: 7 memoryClockRate (GHz) 0.8235
pciBusID 0000:07:00.0
Total memory: 11.17GiB
Free memory: 11.11GiB
2017-07-24 19:35:12.026573: I tensorflow/core/common_runtime/gpu/gpu_device.cc:961] DMA: 0 1 
2017-07-24 19:35:12.026588: I tensorflow/core/common_runtime/gpu/gpu_device.cc:971] 0:   Y Y 
2017-07-24 19:35:12.026593: I tensorflow/core/common_runtime/gpu/gpu_device.cc:971] 1:   Y Y 
2017-07-24 19:35:12.026618: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1030] Creating TensorFlow device (/gpu:0) -> (device: 0, name: Tesla K80, pci bus id: 0000:06:00.0)
2017-07-24 19:35:12.026630: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1030] Creating TensorFlow device (/gpu:1) -> (device: 1, name: Tesla K80, pci bus id: 0000:07:00.0)
2017-07-24 19:35:13.009850: I tensorflow/compiler/xla/service/platform_util.cc:58] platform CUDA present with 2 visible devices
2017-07-24 19:35:13.010104: I tensorflow/compiler/xla/service/platform_util.cc:58] platform Host present with 48 visible devices
2017-07-24 19:35:13.013000: I tensorflow/compiler/xla/service/service.cc:198] XLA service 0xb5dc400 executing computations on platform Host. Devices:
2017-07-24 19:35:13.013018: I tensorflow/compiler/xla/service/service.cc:206]   StreamExecutor device (0): <undefined>, <undefined>
2017-07-24 19:35:13.013708: I tensorflow/compiler/xla/service/platform_util.cc:58] platform CUDA present with 2 visible devices
2017-07-24 19:35:13.013725: I tensorflow/compiler/xla/service/platform_util.cc:58] platform Host present with 48 visible devices
2017-07-24 19:35:13.016627: I tensorflow/compiler/xla/service/service.cc:198] XLA service 0xb5dc1c0 executing computations on platform CUDA. Devices:
2017-07-24 19:35:13.016642: I tensorflow/compiler/xla/service/service.cc:206]   StreamExecutor device (0): Tesla K80, Compute Capability 3.7
2017-07-24 19:35:13.016648: I tensorflow/compiler/xla/service/service.cc:206]   StreamExecutor device (1): Tesla K80, Compute Capability 3.7
Running warm up
Done warm up
Step	Img/sec	loss
Starting real work at step 10 at time Mon Jul 24 19:35:24 2017
1	images/sec: 99.4 +/- 0.0 (jitter = 0.0)	7.449
10	images/sec: 94.9 +/- 1.7 (jitter = 0.6)	7.422
20	images/sec: 95.7 +/- 1.1 (jitter = 1.0)	7.673
30	images/sec: 96.0 +/- 0.9 (jitter = 1.3)	7.421
40	images/sec: 96.1 +/- 0.8 (jitter = 1.5)	7.639
50	images/sec: 95.3 +/- 0.7 (jitter = 1.7)	7.910
60	images/sec: 95.5 +/- 0.7 (jitter = 1.7)	7.359
70	images/sec: 95.8 +/- 0.6 (jitter = 1.6)	7.666
80	images/sec: 95.6 +/- 0.5 (jitter = 1.8)	7.383
90	images/sec: 95.9 +/- 0.5 (jitter = 1.6)	7.437
Finishing real work at step 109 at time Mon Jul 24 19:36:31 2017
100	images/sec: 95.2 +/- 0.5 (jitter = 1.7)	7.441
----------------------------------------------------------------
total images/sec: 95.20
----------------------------------------------------------------

Attribute error: Assignment not allowed (no field "force_gpu_compatible" in protocol message object)

I would like to use the benchmark on CPU to realize the baseline performance. However, this error showed up when generating the model. This is the command I used:

python tf_cnn_benchmarks.py --device cpu --train_dir /path/to/my/dir --eval-dir /path/to/my/dir --model alexnet

and here is the traceback:

Traceback (most recent call last):
File "tf_cnn_benchmarks.py", line 1333, in
tf.app.run()
File "/usr/lib/python2.7/site-packages/tensorflow/python/platform/app.py", line 48, in run
_sys.exit(main(_sys.argv[:1] + flags_passthrough))
File "tf_cnn_benchmarks.py", line 1329, in main
bench.run()
File "tf_cnn_benchmarks.py", line 879, in run
self._benchmark_cnn()
File "tf_cnn_benchmarks.py", line 969, in _benchmark_cnn
config=create_config_proto(),
File "tf_cnn_benchmarks.py", line 627, in create_config_proto
config.gpu_options.force_gpu_compatible = FLAGS.force_gpu_compatible

It seems like my TensorFlow configuration doesn't have this field. Would it be caused by incompatible TensorFlow version? I directly cloned both the TensorFlow and benchmarks from GitHub.

num_warmup_batches = 0 is ignored

In your code https://github.com/tensorflow/benchmarks/blob/master/scripts/tf_cnn_benchmarks/tf_cnn_benchmarks.py#L734

self.num_warmup_batches = FLAGS.num_warmup_batches if (
        FLAGS.num_warmup_batches) else max(10, min_autotune_warmup)

if I set FLAGS.num_warmup_batches with 0 it goes on the else branch and it sets the num_warmup_batches to 10.

Is this intended behaviour? I think what you want to do is

self.num_warmup_batches = FLAGS.num_warmup_batches if (
        FLAGS.num_warmup_batches is not None) else max(10, min_autotune_warmup)

Feedback for important spots in the code & improving clarity

Thanks for implementing a high performance example! I wouldn't have been able to understand the suggested setup without it.

Based on the performance docs I think this code is to help teach people to write efficient training scripts. I personally ran into some difficulty finding where some of the important sections are, and those important sections are fairly hard coded specifically to bounding box classification.

For those reasons I think it might be broadly beneficial to:

Make the key sections clearer and separate the data distribution parts from the imagenet/bounding box specific parts.
Create a short high level overview list and one line explanation of each key location, including these examples, could be added to the doc and the readme that would be very helpful.

For instance, one of the most important files is preprocessing.py, particularly the parse_example_proto and minibatch functions.

To help make the code a bit cleaner I'd like to give an example of how a better separation of concerns could work. For instance, minibatch could be updated to something like the following where no bounding box specifics actually occur, and those steps could be passed to the function parameters. A second iteration beyond this example would be better (perhaps split into 3 functions so there are no function parameters?), and this code is untested:

    @staticmethod
    def tfrecord_minibatch(
                           tfrecord_path_glob_pattern,
                           gpu_device_count,
                           batch_size,
                           parse_example_proto_fn=None,
                           preprocessing_fn=None,
                           create_data_and_label_op_lists_fn=None,
                           random_seed=301,
                           parallelism=64,
                           buffer_size=10000):
        """TODO: High Performance Distributed Training Batches - Adapt for broader use cases
        """
        with tf.name_scope('batch_processing'):
            # Split the data among all the GPU devices
            if batch_size % gpu_device_count != 0:
                raise ValueError(
                    ('batch_size must be a multiple of gpu_device_count: '
                     'batch_size %d, gpu_device_count: %d') %
                    (batch_size, gpu_device_count))
                batch_size_per_device = batch_size // gpu_device_count
            images = [[] for i in range(gpu_device_count)]
            labels = [[] for i in range(gpu_device_count)]
            record_input = data_flow_ops.RecordInput(
                file_pattern=tfrecord_path_glob_pattern,
                seed=random_seed,
                parallelism=parallelism,
                buffer_size=buffer_size,
                batch_size=batch_size,
                name='record_input')
            records = record_input.get_yield_op()
            records = tf.split(records, batch_size, 0)
            records = [tf.reshape(record, []) for record in records]
            feature_op_dicts = []
            preprocessed_data_ops = []
            for device_i in xrange(batch_size):
                protobuf = records[i]
                feature_op_dict = parse_example_proto_fn(protobuf)
                preprocessed_data = None
                if preprocessing_fn is not None:
                    # thread_id should be distortion_method_id, and calculated later
                    preprocessed_data = preprocessing_fn(feature_op_dict, i)
                device_index = i % gpu_device_count
                feature_op_dict[device_index].append(image)
                labels[device_index].append(label_index)
            label_index_batch = [None] * gpu_device_count
            return create_data_and_label_op_lists_fn(feature_op_dicts, preprocessed_data_ops, gpu_device_count)

I'm not totally sure I got all the variables & code changes right, but stuff like subset from the original code could also be better named and I think tfrecord_path_glob_pattern is a bit better. A gpu might really be some other kind of tensor processing unit, so there might still be a better name but hopefully it is clear there are several of these inside a single computer/server.

Regardless, I appreciate your consideration, and thank you for putting this up, it is a valuable learning tool!

How to run the benchmark in the distributed mode?

Hi,

I followed the instructions from the [performance page]{https://www.tensorflow.org/performance/performance_models}, and run on two EC2 p2.8xlarge instances, using the same benchmark hash (Benchmark GitHub hash: 9165a70).

# Run the following commands on host_0 (10.0.0.1):
python tf_cnn_benchmarks.py --local_parameter_device=gpu --num_gpus=8 \
--batch_size=64 --model=resnet50 --variable_update=distributed_replicated \
--job_name=worker --ps_hosts=10.0.0.1:50000,10.0.0.2:50000 \
--worker_hosts=10.0.0.1:50001,10.0.0.2:50001 --task_index=0

python tf_cnn_benchmarks.py --local_parameter_device=gpu --num_gpus=8 \
--batch_size=64 --model=resnet50 --variable_update=distributed_replicated \
--job_name=ps --ps_hosts=10.0.0.1:50000,10.0.0.2:50000 \
--worker_hosts=10.0.0.1:50001,10.0.0.2:50001 --task_index=0

# Run the following commands on host_1 (10.0.0.2):
python tf_cnn_benchmarks.py --local_parameter_device=gpu --num_gpus=8 \
--batch_size=64 --model=resnet50 --variable_update=distributed_replicated \
--job_name=worker --ps_hosts=10.0.0.1:50000,10.0.0.2:50000 \
--worker_hosts=10.0.0.1:50001,10.0.0.2:50001 --task_index=1

python tf_cnn_benchmarks.py --local_parameter_device=gpu --num_gpus=8 \
--batch_size=64 --model=resnet50 --variable_update=distributed_replicated \
--job_name=ps --ps_hosts=10.0.0.1:50000,10.0.0.2:50000 \
--worker_hosts=10.0.0.1:50001,10.0.0.2:50001 --task_index=1

However, the worker failed with:

Generating model
save variable global_step:0
save variable ps_var/v0/conv0/conv2d/kernel:0
save variable ps_var/v0/conv0/biases:0
save variable ps_var/v0/conv1/conv2d/kernel:0
save variable ps_var/v0/conv1/biases:0
save variable ps_var/v0/conv2/conv2d/kernel:0
save variable ps_var/v0/conv2/biases:0
save variable ps_var/v0/conv3/conv2d/kernel:0
save variable ps_var/v0/conv3/biases:0
save variable ps_var/v0/conv4/conv2d/kernel:0
save variable ps_var/v0/conv4/biases:0
save variable ps_var/v0/affine0/weights:0
save variable ps_var/v0/affine0/biases:0
save variable ps_var/v0/affine1/weights:0
save variable ps_var/v0/affine1/biases:0
save variable ps_var/v0/affine2/weights:0
save variable ps_var/v0/affine2/biases:0
Traceback (most recent call last):
  File "/home/ec2-user/benchmarks/scripts/tf_cnn_benchmarks/tf_cnn_benchmarks.py", line 1096, in <module>
    tf.app.run()
  File "/usr/lib/python2.7/dist-packages/tensorflow/python/platform/app.py", line 48, in run
    _sys.exit(main(_sys.argv[:1] + flags_passthrough))
  File "/home/ec2-user/benchmarks/scripts/tf_cnn_benchmarks/tf_cnn_benchmarks.py", line 1092, in main
    bench.run()
  File "/home/ec2-user/benchmarks/scripts/tf_cnn_benchmarks/tf_cnn_benchmarks.py", line 573, in run
    self._benchmark_cnn()
  File "/home/ec2-user/benchmarks/scripts/tf_cnn_benchmarks/tf_cnn_benchmarks.py", line 674, in _benchmark_cnn
    start_standard_services=start_standard_services) as sess:
  File "/usr/lib64/python2.7/contextlib.py", line 17, in __enter__
    return self.gen.next()
  File "/usr/lib/python2.7/dist-packages/tensorflow/python/training/supervisor.py", line 964, in managed_session
    self.stop(close_summary_writer=close_summary_writer)
  File "/usr/lib/python2.7/dist-packages/tensorflow/python/training/supervisor.py", line 792, in stop
    stop_grace_period_secs=self._stop_grace_secs)
  File "/usr/lib/python2.7/dist-packages/tensorflow/python/training/coordinator.py", line 389, in join
    six.reraise(*self._exc_info_to_raise)
  File "/usr/lib/python2.7/dist-packages/tensorflow/python/training/supervisor.py", line 953, in managed_session
    start_standard_services=start_standard_services)
  File "/usr/lib/python2.7/dist-packages/tensorflow/python/training/supervisor.py", line 708, in prepare_or_wait_for_session
    init_feed_dict=self._init_feed_dict, init_fn=self._init_fn)
  File "/usr/lib/python2.7/dist-packages/tensorflow/python/training/session_manager.py", line 279, in prepare_session
    sess.run(init_op, feed_dict=init_feed_dict)
  File "/usr/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 895, in run
    run_metadata_ptr)
  File "/usr/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1124, in _run
    feed_dict_tensor, options, run_metadata)
  File "/usr/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1321, in _do_run
    options, run_metadata)
  File "/usr/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1340, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.ResourceExhaustedError: OOM when allocating tensor with shape[3,3,384,256]
         [[Node: v0/conv4/conv2d/kernel/Initializer/random_uniform/RandomUniform = RandomUniform[T=DT_INT32, _class=["loc:@v0/conv4/conv2d/kernel"], dtype=DT_FLOAT, seed=1234, seed2=132, _device="/job:worker/replica:0/task:0/gpu:0"](v0/conv4/conv2d/kernel/Initializer/random_uniform/shape)]]
         [[Node: v0/conv2/biases/Initializer/Const_S21 = _Recv[client_terminated=false, recv_device="/job:ps/replica:0/task:0/gpu:0", send_device="/job:worker/replica:0/task:0/gpu:0", send_device_incarnation=-2694717678558735913, tensor_name="edge_53_v0/conv2/biases/Initializer/Const", tensor_type=DT_FLOAT, _device="/job:ps/replica:0/task:0/gpu:0"]()]]

Caused by op u'v0/conv4/conv2d/kernel/Initializer/random_uniform/RandomUniform', defined at:
  File "/home/ec2-user/benchmarks/scripts/tf_cnn_benchmarks/tf_cnn_benchmarks.py", line 1096, in <module>
    tf.app.run()
  File "/usr/lib/python2.7/dist-packages/tensorflow/python/platform/app.py", line 48, in run
    _sys.exit(main(_sys.argv[:1] + flags_passthrough))
  File "/home/ec2-user/benchmarks/scripts/tf_cnn_benchmarks/tf_cnn_benchmarks.py", line 1092, in main
    bench.run()
  File "/home/ec2-user/benchmarks/scripts/tf_cnn_benchmarks/tf_cnn_benchmarks.py", line 573, in run
    self._benchmark_cnn()
  File "/home/ec2-user/benchmarks/scripts/tf_cnn_benchmarks/tf_cnn_benchmarks.py", line 620, in _benchmark_cnn
    (enqueue_ops, fetches) = self._build_model()
  File "/home/ec2-user/benchmarks/scripts/tf_cnn_benchmarks/tf_cnn_benchmarks.py", line 791, in _build_model
    gpu_grad_stage_ops)
  File "/home/ec2-user/benchmarks/scripts/tf_cnn_benchmarks/tf_cnn_benchmarks.py", line 952, in add_forward_pass_and_gradients
    self.model.add_inference(network)
  File "/home/ec2-user/benchmarks/scripts/tf_cnn_benchmarks/alexnet_model.py", line 42, in add_inference
    cnn.conv(256, 3, 3)
  File "/home/ec2-user/benchmarks/scripts/tf_cnn_benchmarks/convnet_builder.py", line 103, in conv
    use_bias=False)
  File "/usr/lib/python2.7/dist-packages/tensorflow/python/layers/convolutional.py", line 551, in conv2d
    return layer.apply(inputs)
  File "/usr/lib/python2.7/dist-packages/tensorflow/python/layers/base.py", line 503, in apply
    return self.__call__(inputs, *args, **kwargs)
  File "/usr/lib/python2.7/dist-packages/tensorflow/python/layers/base.py", line 443, in __call__
    self.build(input_shapes[0])
  File "/usr/lib/python2.7/dist-packages/tensorflow/python/layers/convolutional.py", line 137, in build
    dtype=self.dtype)
  File "/usr/lib/python2.7/dist-packages/tensorflow/python/layers/base.py", line 383, in add_variable
    trainable=trainable and self.trainable)
  File "/usr/lib/python2.7/dist-packages/tensorflow/python/ops/variable_scope.py", line 1065, in get_variable
    use_resource=use_resource, custom_getter=custom_getter)
  File "/usr/lib/python2.7/dist-packages/tensorflow/python/ops/variable_scope.py", line 962, in get_variable
    use_resource=use_resource, custom_getter=custom_getter)
  File "/usr/lib/python2.7/dist-packages/tensorflow/python/ops/variable_scope.py", line 360, in get_variable
    validate_shape=validate_shape, use_resource=use_resource)
  File "/home/ec2-user/benchmarks/scripts/tf_cnn_benchmarks/variable_mgr.py", line 84, in __call__
    return getter(name, *args, **kwargs)
  File "/usr/lib/python2.7/dist-packages/tensorflow/python/ops/variable_scope.py", line 352, in _true_getter
    use_resource=use_resource)
  File "/usr/lib/python2.7/dist-packages/tensorflow/python/ops/variable_scope.py", line 725, in _get_single_variable
    validate_shape=validate_shape)
  File "/usr/lib/python2.7/dist-packages/tensorflow/python/ops/variables.py", line 199, in __init__
    expected_shape=expected_shape)
  File "/usr/lib/python2.7/dist-packages/tensorflow/python/ops/variables.py", line 277, in _init_from_args
    initial_value(), name="initial_value", dtype=dtype)
  File "/usr/lib/python2.7/dist-packages/tensorflow/python/ops/variable_scope.py", line 701, in <lambda>
    shape.as_list(), dtype=dtype, partition_info=partition_info)
  File "/usr/lib/python2.7/dist-packages/tensorflow/python/ops/init_ops.py", line 441, in __call__
    dtype, seed=self.seed)
  File "/usr/lib/python2.7/dist-packages/tensorflow/python/ops/random_ops.py", line 240, in random_uniform
    shape, dtype, seed=seed1, seed2=seed2)
  File "/usr/lib/python2.7/dist-packages/tensorflow/python/ops/gen_random_ops.py", line 247, in _random_uniform
    seed=seed, seed2=seed2, name=name)
  File "/usr/lib/python2.7/dist-packages/tensorflow/python/framework/op_def_library.py", line 767, in apply_op
    op_def=op_def)
  File "/usr/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 2630, in create_op
    original_op=self._default_original_op, op_def=op_def)
  File "/usr/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 1204, in __init__
    self._traceback = self._graph._extract_stack()  # pylint: disable=protected-access

ResourceExhaustedError (see above for traceback): OOM when allocating tensor with shape[3,3,384,256]
         [[Node: v0/conv4/conv2d/kernel/Initializer/random_uniform/RandomUniform = RandomUniform[T=DT_INT32, _class=["loc:@v0/conv4/conv2d/kernel"], dtype=DT_FLOAT, seed=1234, seed2=132, _device="/job:worker/replica:0/task:0/gpu:0"](v0/conv4/conv2d/kernel/Initializer/random_uniform/shape)]]
         [[Node: v0/conv2/biases/Initializer/Const_S21 = _Recv[client_terminated=false, recv_device="/job:ps/replica:0/task:0/gpu:0", send_device="/job:worker/replica:0/task:0/gpu:0", send_device_incarnation=-2694717678558735913, tensor_name="edge_53_v0/conv2/biases/Initializer/Const", tensor_type=DT_FLOAT, _device="/job:ps/replica:0/task:0/gpu:0"]()]]

It seems each TF process will allocate all of available GPU memory, so the worker cannot get any memory if I start the parameter server command first.

Likewise, if I run worker first, then the parameter server cannot get any memory.

Image summary is lost after ds_iterator.get_next()

Hi, thanks for the wonderful code!

I found the summaries added during image preprocessing are lost after calling ds_iterator.get_next().

I identify this issues by adding tmp = tf.get_collection(key=tf.GraphKeys.SUMMARIES) right after ds_iterator.get_next(). The returned value of tmp is an enmpty list. However, when adding the same line at the end of def preprocess(self, raw_image), the returned value of tmp is the summary proto for the processed images, which is normal.

I, therefore, guess that the summaries added before calling Dataset iterator will not be retained. Is this true? If so, do we have a way to circumvent this issue?

Many thanks.

Dead link to Dockerfile.alexnet_distributed_test in readme

https://github.com/tensorflow/benchmarks/blob/master/models/Dockerfile.alexnet_distributed_test returns 404.

Probably needed to remove link as part of this PR: e172b60

Set fused=False of batch norm layer get confusing result

I want to compare the speed with fused batch norm and non-fused batch norm, but when I set fused=False of batch_norm layer and data_format=NCHW， the speed of trianing is very very much slower (only about one-six) than the fused batch norm model. What's wrong with the non-fused model？

Distributed performance on better GPUs?

Thanks very much for publishing the code. With this benchmark I've seen very good GPU utilization with single-machine multi-GPU training, however I found that distributed training doesn't scale very well.

The published distributed benchmark performance were only on K80s, so the communication overhead might be less of a problem there. However TitanX/M40 is about twice faster than it, and P100 is about 4x faster, and V100 would be ..

In more details:
Tensorflow version: commit d101472296f88 compiled manually (with -march=native)
Python 2.7, cuda 8.0.44, cudnn 5.1
GPU: 4 Tesla M40s per machine
Latency between the two machines: 0.06~0.08ms given by ping
Bandwidth: 9.3Gbit/s given by iperf

Speed numbers (all with resnet50, batch 64 per GPU):
Single machine: (variable_update=parameter_server)
1GPU: 111 im/s -> 4GPU: 432 im/s
Two machines (variable_update=distributed_replicated):
2x4=8GPU: only 561 im/s

Hope to see some more improvements on it!

Instructions for replicating the benchmarks in the OSDI '16 paper

I am trying to replicate the results in the OSDI 2016 paper. Unfortunately, it's not clear from the documentation what set of arguments to pass to scripts/tf_cnn_benchmarks/tf_cnn_benchmarks.py and the source documentation does not specify it either.

Are the parameters and exact commands used in that paper available anywhere?

Wild performance difference in single node single GPU (non-)/distributed training

Distributed training:

# ps
CUDA_VISIBLE_DEVICES= \
python tf_cnn_benchmarks.py \
  --job_name=ps --ps_hosts=10.0.0.1:5000 \
  --worker_hosts=10.0.0.1:5001 --task_index=0

# worker
python tf_cnn_benchmarks.py \
  --job_name=worker --ps_hosts=10.0.0.1:5000 \
  --worker_hosts=10.0.0.1:5001 --task_index=0

Result:

TensorFlow:  1.1
Model:       trivial
Mode:        training
Batch size:  32 global
             32 per device
Devices:     ['/job:worker/task:0/gpu:0']
Data format: NCHW
Optimizer:   sgd
Variables:   parameter_server
Sync:        True
==========
Generating model
Running warm up
Done warm up
Step	Img/sec	loss
Starting real work at step 10 at time Sat May  6 03:59:06 2017
1	images/sec: 238.2 +/- 0.0 (jitter = 0.0)	7.089
10	images/sec: 233.7 +/- 2.4 (jitter = 7.9)	7.088
20	images/sec: 233.6 +/- 1.5 (jitter = 7.9)	7.086
30	images/sec: 232.0 +/- 1.2 (jitter = 8.5)	7.084
40	images/sec: 234.0 +/- 1.2 (jitter = 8.5)	7.082
50	images/sec: 234.0 +/- 1.2 (jitter = 9.8)	7.080
60	images/sec: 233.7 +/- 1.0 (jitter = 8.7)	7.079
70	images/sec: 234.1 +/- 1.0 (jitter = 8.6)	7.077
80	images/sec: 234.0 +/- 0.9 (jitter = 9.1)	7.075
90	images/sec: 234.1 +/- 0.9 (jitter = 9.2)	7.073
Finishing real work at step 109 at time Sat May  6 03:59:20 2017
100	images/sec: 234.0 +/- 0.8 (jitter = 8.6)	7.071
----------------------------------------------------------------
total images/sec: 233.41
----------------------------------------------------------------

Non-distributed training:

python tf_cnn_benchmarks.py

Result:

TensorFlow:  1.1
Model:       trivial
Mode:        training
Batch size:  32 global
             32 per device
Devices:     ['/gpu:0']
Data format: NCHW
Optimizer:   sgd
Variables:   parameter_server
==========
Generating model
Running warm up
Done warm up
Step	Img/sec	loss
1	images/sec: 3521.1 +/- 0.0 (jitter = 0.0)	7.089
10	images/sec: 3517.2 +/- 14.3 (jitter = 65.2)	7.087
20	images/sec: 3513.5 +/- 10.0 (jitter = 67.1)	7.085
Starting real work at step 31 at time Sat May  6 04:02:33 2017
30	images/sec: 3504.6 +/- 8.3 (jitter = 52.2)	7.084
40	images/sec: 3503.8 +/- 6.8 (jitter = 49.9)	7.082
50	images/sec: 3509.8 +/- 5.9 (jitter = 47.5)	7.080
60	images/sec: 3505.6 +/- 5.3 (jitter = 54.0)	7.078
70	images/sec: 3502.7 +/- 4.8 (jitter = 50.0)	7.076
80	images/sec: 3502.7 +/- 4.5 (jitter = 50.0)	7.074
90	images/sec: 3503.3 +/- 4.2 (jitter = 50.0)	7.072
100	images/sec: 3503.1 +/- 3.9 (jitter = 46.9)	7.070
Finishing real work at step 113 at time Sat May  6 04:02:34 2017
----------------------------------------------------------------
total images/sec: 3488.15
----------------------------------------------------------------

Nearly 15x (14.9x) performance difference is observed. Please do correct me if I did something terribly wrong; it's just the result is pretty unexpected to me.

NCCL-1.3.5 and tensorflow-0.13.0 can't import all_reduce from tensorflow

Hello
My OS configuration
CUDA8
CUDNN6
NCCL 1.3.5

When I trying to testing TF benchmark on my machine.I found that python can't import all_reduce from "tensorflow.contrib.all_reduce.python"

tensorflow/stream_executor/dso_loader.cc:135] successfully opened CUDA library libcublas.so.8.0 locally
I tensorflow/stream_executor/dso_loader.cc:135] successfully opened CUDA library libcudnn.so.6 locally
I tensorflow/stream_executor/dso_loader.cc:135] successfully opened CUDA library libcufft.so.8.0 locally
I tensorflow/stream_executor/dso_loader.cc:135] successfully opened CUDA library libcuda.so.1 locally
I tensorflow/stream_executor/dso_loader.cc:135] successfully opened CUDA library libcurand.so.8.0 locally
Traceback (most recent call last):
File "tf_cnn_benchmarks.py", line 26, in
import benchmark_cnn
File "/home/gin/WORK/tensorflow/benchmarks-master/scripts/tf_cnn_benchmarks/benchmark_cnn.py", line 44, in
import variable_mgr
File "/home/gin/WORK/tensorflow/benchmarks-master/scripts/tf_cnn_benchmarks/variable_mgr.py", line 29, in
from tensorflow.contrib.all_reduce.python import all_reduce
ImportError: No module named all_reduce.python

Did anyone happen this error before?

Benchmark hangs for non syntetic data

I tried to run

# VGG16 training ImageNet with 8 GPUs using arguments that optimize for
# Google Compute Engine.
python tf_cnn_benchmarks.py --local_parameter_device=cpu --num_gpus=1 \
--batch_size=32 --model=vgg16 --data_dir=/home/ubuntu/flowers \
--variable_update=parameter_server --nodistortions

And the data dir has the TF Records inside, generated with bazel as in the models/inception/data tutorial

-rw-rwx--- 1  40 May 11 11:43 labels.txt
drwxrwx--- 7 4096 May 12 11:45 train
-rw-rwx--- 1  102419300 May 11 11:43 train-00000-of-00002
-rw-rwx--- 1   99116804 May 11 11:43 train-00001-of-00002
drwxrwx--- 7  4096 May 12 11:45 validation
-rw-rwx--- 1  16058779 May 11 11:43 validation-00000-of-00002
-rw-rwx--- 1  15919237 May 11 11:43 validation-00001-of-00002

And it hangs like this:

TensorFlow:  1.1
Model:       vgg16
Mode:        training
Batch size:  32 global
             32.0 per device
Devices:     ['/gpu:0']
Data format: NCHW
Optimizer:   sgd
Variables:   parameter_server
==========
Generating model
2017-05-12 11:57:30.357629: I tensorflow/core/common_runtime/gpu/gpu_device.cc:900] Found device 0 with properties:
....
pciBusID 0002:01:00.0
Total memory: 15.89GiB
Free memory: 15.61GiB
2017-05-12 11:57:30.357680: I tensorflow/core/common_runtime/gpu/gpu_device.cc:921] DMA: 0
2017-05-12 11:57:30.357690: I tensorflow/core/common_runtime/gpu/gpu_device.cc:931] 0:   Y
2017-05-12 11:57:30.357707: I tensorflow/core/common_runtime/gpu/gpu_device.cc:990] Creating TensorFlow device (/gpu:0) -> (device: 0, name: Tesla P100-SXM2-16GB, pci bus id: 0002:01:00.0)

But for syntatic data it works. Any idea how to fix this?

Running distributed in CPU mode

Run Tensorflow on K8S:
workerCmdArgs = "cd /opt/benchmarks/scripts/tf_cnn_benchmarks/;CUDA_VISIBLE_DEVICES='' python tf_cnn_benchmarks.py --local_parameter_device=cpu --model=alexnet --variable_update=parameter_server"
psCmdArgs = "cd /opt/benchmarks/scripts/tf_cnn_benchmarks/;CUDA_VISIBLE_DEVICES='' python tf_cnn_benchmarks.py --local_parameter_device=cpu --model=alexnet --variable_update=parameter_server"

Got error:
2017-11-01 11:41:49.570889: E tensorflow/core/common_runtime/executor.cc:643] Executor failed to create kernel. Invalid argument: Default MaxPoolingOp only supports NHWC.
[[Node: v/tower_0/cg/mpool0/MaxPool = MaxPoolT=DT_FLOAT, data_format="NCHW", ksize=[1, 1, 3, 3], padding="VALID", strides=[1, 1, 2, 2], _device="/job:worker/replica:0/task:0/device:CPU:0"]]

'/device/GPU:0' vs '/gpu:0'

Thanks for such wonderful example code. It helps me a lot to get familiar with TensorFlow.

However, I came across an interesting phenomenon when I visualizing my computation graph via TensorBoard. It's not a problem, but I just don't understand why it should be like this.

My computer has 1 CPU and 2 GPUs installed, but I get 5 devices in my computation graph, which could be easily observed through TensorBoard: (highlighted by red square)

I tried to modify this line

device_name = self.ps_devices[device_index]

device_name = '/device:' + self.ps_devices[device_index].upper()[1:]

to replace '/gpu:0' by '/device/GPU:0'; but the results are not changed. I'm so confused. Could you please explain a little bit on:

what' the difference between '/device/GPU:0' and '/gpu:0'?
or are they actually interchangeable with each other? If so, how can I replace '/gpu:0' by '/device/GPU:0'?

Thank you very much for your time and kind help!

The deployment-related settings are as below for reference:
--num_gpus=2
--local_parameter_device='gpu'
--device='gpu'
--winograd_nonfused=True
--sync_on_finish=False
--staged_vars=False
--force_gpu_compatible=True
--variable_update='parameter_server'
--use_nccl=True
--job_name=''
--ps_hosts=''
--task_index=0
--server_protocol='grpc'
--cross_replica_sync=True

Running tf_cnn_benchmarks.py

Hello,

I have copy the benchmarks folder under tensorflow directory.

(tensorflow) root@P50:/opt/DL/tensorflow# ls -all
total 28
drwxr-xr-x 6 root root 4096 oct 22 13:00 .
drwxr-xr-x 5 root root 4096 oct 22 16:53 ..
drwxr-xr-x 8 root root 4096 oct 22 13:00 benchmarks
drwxr-xr-x 2 root root 4096 oct 22 12:53 bin
drwxr-xr-x 2 root root 4096 oct 22 12:50 include
drwxr-xr-x 3 root root 4096 oct 22 12:50 lib
-rw-r--r-- 1 root root 60 oct 22 12:50 pip-selfcheck.json

When trying to run tf_cnn_benchmark I am getting this error:

(tensorflow) root@P50:/opt/DL/tensorflow/benchmarks/scripts# python3 tf_cnn_benchmarks.py --local_parameter_device=cpu --num_gpus=1 --batch_size=16 --model=inception3 --data_dir=/opt/DL/imagenet/datasets/ --variable_update=parameter_server --nodistortions
Traceback (most recent call last):
File "tf_cnn_benchmarks.py", line 26, in
import benchmark_cnn
File "/opt/DL/tensorflow/benchmarks/scripts/benchmark_cnn.py", line 41, in
import cnn_util
File "/opt/DL/tensorflow/benchmarks/scripts/cnn_util.py", line 40
print log
^
SyntaxError: Missing parentheses in call to 'print'
(tensorflow) root@P50:/opt/DL/tensorflow/benchmarks/scripts#

Do I need to do something else before running the benchmark?

Thank you,
Florin