tensorflow / benchmarks
A benchmark framework for TensorFlow
License: Apache License 2.0
Hi authors, the speed I achieved on AlexNet with the CIFAR-10 dataset is only ~7000 images/sec using a TITAN X Pascal GPU. May I know what speed you have achieved, and is there any setting to achieve better performance? The command I used is:
python tf_cnn_benchmarks.py --learning_rate 0.01 --num_gpus 1 --model alexnet --batch_size 1024 --data_name cifar10 --num_batches 100 --data_dir ~/data/tensorflow/cifar-10-batches-py
The tested version of TensorFlow is 1.2.1.
Thanks!
At the time of writing, DenseNet is implemented in this benchmark with what is described as the "naive" implementation in Memory-Efficient Implementation of DenseNets. This will underperform compared to the ideal implementation because recursive tf.concat will lead to excessive memory usage.
Details are in the upstream TensorFlow issue tensorflow/tensorflow#12948.
Hi!
I have a quick question on running distributed training benchmark, especially on the meaning of different parameters:
The distributed_replicated parameter resembles "synchronous data parallel training" in the literature, meaning the computation graph is replicated on each worker, and after each machine performs a forward pass with a batch of images, their gradients are averaged and applied on the parameter servers. Is this understanding correct? If not, what would be the correct --variable_update option for "synchronous training"? How is "distributed replicated" different from "parameter server"?
The --num_gpus flag means how many GPUs to use on a single machine, NOT the global GPU count. Is that correct?
Should I assign each machine a new task index if I want to run synchronous data parallel training across machines?
I understand some of the questions may be very basic, but it would be great if quick answers can be provided. Thanks.
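To make the terms in this question concrete, here is a minimal pure-Python sketch of one synchronous data-parallel step. It is not taken from tf_cnn_benchmarks: the function names are hypothetical, and it assumes, as the question supposes, that --num_gpus counts GPUs per machine.

```python
def global_batch_size(batch_size_per_gpu, num_gpus_per_worker, num_workers):
    """Images consumed per synchronous step.

    Assumption: --num_gpus is a per-machine count, as the question supposes.
    """
    return batch_size_per_gpu * num_gpus_per_worker * num_workers


def average_gradients(per_replica_grads):
    """Average gradients element-wise across replicas (one list per replica)."""
    num_replicas = len(per_replica_grads)
    return [sum(vals) / num_replicas for vals in zip(*per_replica_grads)]


def sgd_step(params, avg_grads, learning_rate):
    """Apply a single SGD update using the averaged gradients."""
    return [p - learning_rate * g for p, g in zip(params, avg_grads)]


# Two replicas, each holding gradients for the same two parameters.
grads = [[0.25, -0.5], [0.75, 0.5]]
avg = average_gradients(grads)
params = sgd_step([1.0, 1.0], avg, learning_rate=0.5)
print(avg, params)
print(global_batch_size(32, 4, 4))
```

In a synchronous scheme every replica waits for the averaged update before starting the next step, which is the property the question is asking about.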
Did anybody experience long (3-4 minutes) delay when starting a task with TensorFlow?
I'm trying to evaluate Titan V with the great tensorflow benchmarks repo (https://github.com/tensorflow/benchmarks/), but during initialization, after this line:
2017-12-22 10:39:03.469975: I tensorflow/core/common_runtime/gpu/gpu_device.cc:983] Creating TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 10744 MB memory) -> physical GPU (device: 0, name: Graphics Device, pci bus id: 0000:01:00.0, compute capability: 7.0)
and after the "Running warm-up" message, execution seems to stop for 3 minutes.
Possibly related: when using FP16, which is expected to work well, I get a lot of messages like:
2017-12-22 10:34:58.755828: E tensorflow/core/grappler/optimizers/constant_folding.cc:1272] Unexpected type half
When I perform the same test on the same computer on 1080 Ti, there's no delay!
Used:
Titan V, TensorFlow master, TensorFlow benchmarks master, CUDA 9.0, Driver 387.34, Ubuntu 16.04.
TensorFlow was hopefully built for compute capabilities 6.1 and 7.0, but I don't know of any way to check this. Also, I wanted to attach logs (nvidia-bug-report.log.gz), but I couldn't find a way.
Thanks!
I also posted my question here, as I am not sure whether it's a benchmark issue or a Titan V issue:
https://devtalk.nvidia.com/default/topic/1027804/cuda-programming-and-performance/titan-v-tensorflow-performance/post/5228518/#5228518
time python tf_cnn_benchmarks.py --local_parameter_device=cpu --num_gpus=2 --num_intra_threads=7 --num_inter_threads=7 --batch_size=32 --model=vgg16 --variable_update=parameter_server
TensorFlow: 1.4
Model: vgg16
Mode: training
SingleSess: False
Batch size: 64 global
32 per device
Devices: ['/gpu:0', '/gpu:1']
Data format: NCHW
Optimizer: sgd
Variables: parameter_server
total images/sec: 146.38
real 0m59.740s
time python tf_cnn_benchmarks.py --local_parameter_device=cpu --num_gpus=1 --num_intra_threads=7 --num_inter_threads=7 --batch_size=64 --model=vgg16 --variable_update=parameter_server
TensorFlow: 1.4
Model: vgg16
Mode: training
SingleSess: False
Batch size: 64 global
64 per device
Devices: ['/gpu:0']
Data format: NCHW
Optimizer: sgd
Variables: parameter_server
total images/sec: 82.04
real 1m37.090s
///////////////////////////////////////////////////////////////////////////////////////////////////////////////////////
time python tf_cnn_benchmarks.py --local_parameter_device=cpu --num_gpus=2 --num_intra_threads=7 --num_inter_threads=7 --batch_size=32 --model=vgg16 --variable_update=replicated --use_nccl=True
TensorFlow: 1.4
Model: vgg16
Mode: training
SingleSess: False
Batch size: 64 global
32 per device
Devices: ['/gpu:0', '/gpu:1']
Data format: NCHW
Optimizer: sgd
Variables: replicated
AllReduce: nccl
total images/sec: 114.40
real 1m13.169s
time python tf_cnn_benchmarks.py --local_parameter_device=cpu --num_gpus=1 --num_intra_threads=7 --num_inter_threads=7 --batch_size=64 --model=vgg16 --variable_update=replicated --use_nccl=True
TensorFlow: 1.4
Model: vgg16
Mode: training
SingleSess: False
Batch size: 64 global
64 per device
Devices: ['/gpu:0']
Data format: NCHW
Optimizer: sgd
Variables: replicated
AllReduce: nccl
total images/sec: 92.60
real 1m26.327s
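The four runs above all use a global batch of 64, so they can be compared directly. A small sketch of that arithmetic follows. The helper names are hypothetical, and the rates are copied from the "total images/sec" lines above.

```python
def speedup(multi_gpu_rate, single_gpu_rate):
    """Throughput ratio of the multi-GPU run over the single-GPU run."""
    return multi_gpu_rate / single_gpu_rate


def scaling_efficiency(multi_gpu_rate, single_gpu_rate, num_gpus):
    """Fraction of ideal linear scaling that was actually achieved."""
    return speedup(multi_gpu_rate, single_gpu_rate) / num_gpus


# total images/sec values copied from the runs above (VGG16, global batch 64)
results = {
    "parameter_server": {"1_gpu": 82.04, "2_gpu": 146.38},
    "replicated_nccl": {"1_gpu": 92.60, "2_gpu": 114.40},
}

for name, rates in sorted(results.items()):
    s = speedup(rates["2_gpu"], rates["1_gpu"])
    e = scaling_efficiency(rates["2_gpu"], rates["1_gpu"], num_gpus=2)
    print("%s: speedup %.2fx, efficiency %.0f%%" % (name, s, 100 * e))
```

On these numbers the parameter_server runs scale noticeably better from one to two GPUs than the replicated runs with NCCL, even though the replicated single-GPU run is the faster of the two baselines.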
This issue can be taken as a feature-request or a request related to documentation.
The high-performance benchmarking example is a good effort.
However, the code is very entangled (it combines the distributed and multi-GPU examples in the same setting). Moreover, the code is not properly documented, and there is little to no information available on the StagingArea ops and how to use them.
It would be worthwhile to improve the related documentation and the clarity of the code. We are currently working on very high-performance training code but are held back considerably by these drawbacks.
Hi, thanks a lot for sharing this awesome project.
I wonder if the code currently supports a hyperparameter like Caffe's "iter_size", that is, accumulating gradients for "iter_size" batches and then applying them. With this hyperparameter, one can emulate training with a larger batch_size without distributed training. When batch_size is set to, say, 64 and iter_size is set to ITER_SIZE, the effective batch_size is 64*ITER_SIZE, since the gradients of all ITER_SIZE batches are accumulated.
Is this doable in current code? Is there any plan for supporting this feature?
Thank you.
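For readers unfamiliar with the idea, here is a minimal pure-Python sketch of gradient accumulation on a one-parameter least-squares model. It is not code from this repository. All names are hypothetical, and it assumes equally sized mini-batches and that the accumulated gradients are averaged before the update.

```python
def grad_mse(w, batch):
    """Gradient of the mean squared error of y ~ w * x over one batch."""
    return sum(2.0 * (w * x - y) * x for x, y in batch) / len(batch)


def accumulated_step(w, mini_batches, learning_rate):
    """Accumulate gradients over all mini-batches, then apply a single update."""
    accum = 0.0
    for batch in mini_batches:
        accum += grad_mse(w, batch)
    avg_grad = accum / len(mini_batches)  # average so the step matches one big batch
    return w - learning_rate * avg_grad


data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0), (4.0, 8.0)]
mini_batches = [data[0:2], data[2:4]]  # iter_size = 2, batch_size = 2

w_accum = accumulated_step(0.0, mini_batches, learning_rate=0.01)
w_big = 0.0 - 0.01 * grad_mse(0.0, data)  # one step on the full batch of 4
print(w_accum, w_big)
```

With equal-size mini-batches the two updates coincide, which is the sense in which accumulation emulates a larger batch. Layers whose behaviour depends on the batch itself, such as batch normalization, would still see the small batches.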
The benchmark crashes when I try to run the VGG models with the following stack trace.
Traceback (most recent call last):
File "tf_cnn_benchmarks.py", line 1333, in <module>
tf.app.run()
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/platform/app.py", line 48, in run
_sys.exit(main(_sys.argv[:1] + flags_passthrough))
File "tf_cnn_benchmarks.py", line 1329, in main
bench.run()
File "tf_cnn_benchmarks.py", line 879, in run
self._benchmark_cnn()
File "tf_cnn_benchmarks.py", line 919, in _benchmark_cnn
(enqueue_ops, fetches) = self._build_model()
File "tf_cnn_benchmarks.py", line 1088, in _build_model
gpu_grad_stage_ops)
File "tf_cnn_benchmarks.py", line 1238, in add_forward_pass_and_gradients
self.model_conf.add_inference(network)
File "/root/benchmarks/scripts/tf_cnn_benchmarks/vgg_model.py", line 71, in add_inference
_construct_vgg(cnn, [2, 2, 3, 3, 3])
File "/root/benchmarks/scripts/tf_cnn_benchmarks/vgg_model.py", line 51, in _construct_vgg
cnn.dropout()
File "tf_cnn_benchmarks.py", line 543, in dropout
dropout = core_layers.dropout(input_layer, keep_prob_tensor)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/layers/core.py", line 301, in dropout
layer = Dropout(rate, noise_shape=noise_shape, seed=seed, name=name)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/layers/core.py", line 247, in __init__
self.rate = min(1., max(0., rate))
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 578, in __nonzero__
raise TypeError("Using a `tf.Tensor` as a Python `bool` is not allowed. "
TypeError: Using a `tf.Tensor` as a Python `bool` is not allowed. Use `if t is not None:` instead of `if t:` to test if a tensor is defined, and use TensorFlow ops such as tf.cond to execute subgraphs conditioned on the value of a tensor.
Command:
python tf_cnn_benchmarks.py --model vgg16 --num_gpus 1 --batch_size 32
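The last frames of the traceback show the expression min(1., max(0., rate)) being evaluated with a tensor as rate. The sketch below reproduces that failure mode in plain Python. FakeTensor is a hypothetical stand-in, not the TensorFlow class, and only mimics the "cannot be used as a bool" behaviour.

```python
class FakeTensor(object):
    """Hypothetical stand-in for a symbolic tensor; not the TensorFlow class."""

    def __gt__(self, other):
        return FakeTensor()  # a comparison yields another symbolic value

    __lt__ = __gt__

    def __bool__(self):  # Python 3 hook; Python 2 uses __nonzero__
        raise TypeError("Using a tensor as a Python bool is not allowed.")

    __nonzero__ = __bool__


def clamp_rate(rate):
    """Same expression as the failing line in the traceback."""
    return min(1., max(0., rate))


print(clamp_rate(0.5))  # a plain Python float is clamped without problems

try:
    clamp_rate(FakeTensor())
    raised = False
except TypeError as err:
    raised = True
    print("TypeError:", err)
```

The built-in max() has to turn the result of a comparison into True or False, which is exactly what a symbolic value refuses to do. Passing a Python float for the rate, or a layer version that accepts tensors, would avoid the comparison.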
In Docker container.
CPU-only.
Docker image: tensorflow/tensorflow:latest (65e150502892)
Updated TensorFlow with pip install -U tf-nightly to fix issue #80.
Cloned the benchmarks inside the container.
Started the benchmark with:
# python tf_cnn_benchmarks.py --batch_size=32 --model=resnet50
Output:
TensorFlow: 1.5
Model: resnet50
Mode: training
SingleSess: False
Batch size: 32 global
32 per device
Devices: ['/gpu:0']
Data format: NCHW
Optimizer: sgd
Variables: parameter_server
==========
Generating model
2017-11-06 03:49:44.378230: I tensorflow/core/platform/cpu_feature_guard.cc:137] Your CPU supports instructions that this TensorFlow binary was not compiled to use: SSE4.1 SSE4.2 AVX AVX2 FMA
Running warm up
2017-11-06 03:49:46.244696: E tensorflow/core/common_runtime/executor.cc:651] Executor failed to create kernel. Invalid argument: Default MaxPoolingOp only supports NHWC.
[[Node: v/tower_0/cg/mpool0/MaxPool = MaxPool[T=DT_FLOAT, data_format="NCHW", ksize=[1, 1, 3, 3], padding="VALID", strides=[1, 1, 2, 2], _device="/job:localhost/replica:0/task:0/device:CPU:0"](v/tower_0/cg/conv0/Relu)]]
Traceback (most recent call last):
File "tf_cnn_benchmarks.py", line 54, in <module>
tf.app.run()
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/platform/app.py", line 48, in run
_sys.exit(main(_sys.argv[:1] + flags_passthrough))
File "tf_cnn_benchmarks.py", line 50, in main
bench.run()
File "/root/benchmarks/scripts/tf_cnn_benchmarks/benchmark_cnn.py", line 916, in run
return self._benchmark_cnn()
File "/root/benchmarks/scripts/tf_cnn_benchmarks/benchmark_cnn.py", line 1155, in _benchmark_cnn
fetch_summary)
File "/root/benchmarks/scripts/tf_cnn_benchmarks/benchmark_cnn.py", line 530, in benchmark_one_step
results = sess.run(fetches, options=run_options, run_metadata=run_metadata)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 889, in run
run_metadata_ptr)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1120, in _run
feed_dict_tensor, options, run_metadata)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1317, in _do_run
options, run_metadata)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1336, in _do_call
raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.InvalidArgumentError: Default MaxPoolingOp only supports NHWC.
[[Node: v/tower_0/cg/mpool0/MaxPool = MaxPool[T=DT_FLOAT, data_format="NCHW", ksize=[1, 1, 3, 3], padding="VALID", strides=[1, 1, 2, 2], _device="/job:localhost/replica:0/task:0/device:CPU:0"](v/tower_0/cg/conv0/Relu)]]
Caused by op u'v/tower_0/cg/mpool0/MaxPool', defined at:
File "tf_cnn_benchmarks.py", line 54, in <module>
tf.app.run()
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/platform/app.py", line 48, in run
_sys.exit(main(_sys.argv[:1] + flags_passthrough))
File "tf_cnn_benchmarks.py", line 50, in main
bench.run()
File "/root/benchmarks/scripts/tf_cnn_benchmarks/benchmark_cnn.py", line 916, in run
return self._benchmark_cnn()
File "/root/benchmarks/scripts/tf_cnn_benchmarks/benchmark_cnn.py", line 1010, in _benchmark_cnn
(image_producer_ops, enqueue_ops, fetches) = self._build_model()
File "/root/benchmarks/scripts/tf_cnn_benchmarks/benchmark_cnn.py", line 1260, in _build_model
gpu_compute_stage_ops, gpu_grad_stage_ops)
File "/root/benchmarks/scripts/tf_cnn_benchmarks/benchmark_cnn.py", line 1566, in add_forward_pass_and_gradients
self.model.add_inference(network)
File "/root/benchmarks/scripts/tf_cnn_benchmarks/models/resnet_model.py", line 210, in add_inference
cnn.mpool(3, 3, 2, 2)
File "/root/benchmarks/scripts/tf_cnn_benchmarks/convnet_builder.py", line 273, in mpool
d_height, d_width, mode, input_layer, num_channels_in)
File "/root/benchmarks/scripts/tf_cnn_benchmarks/convnet_builder.py", line 250, in _pool
name=name)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/layers/pooling.py", line 429, in max_pooling2d
return layer.apply(inputs)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/layers/base.py", line 728, in apply
return self.__call__(inputs, *args, **kwargs)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/layers/base.py", line 618, in __call__
outputs = self.call(inputs, *args, **kwargs)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/layers/pooling.py", line 273, in call
data_format=utils.convert_data_format(self.data_format, 4))
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/nn_ops.py", line 1958, in max_pool
name=name)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/gen_nn_ops.py", line 2806, in _max_pool
data_format=data_format, name=name)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/op_def_library.py", line 787, in _apply_op_helper
op_def=op_def)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 3073, in create_op
op_def=op_def)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 1524, in __init__
self._traceback = self._graph._extract_stack() # pylint: disable=protected-access
InvalidArgumentError (see above for traceback): Default MaxPoolingOp only supports NHWC.
[[Node: v/tower_0/cg/mpool0/MaxPool = MaxPool[T=DT_FLOAT, data_format="NCHW", ksize=[1, 1, 3, 3], padding="VALID", strides=[1, 1, 2, 2], _device="/job:localhost/replica:0/task:0/device:CPU:0"](v/tower_0/cg/conv0/Relu)]]
Exception in thread Thread-2:
Traceback (most recent call last):
File "/usr/lib/python2.7/threading.py", line 801, in __bootstrap_inner
self.run()
File "/root/benchmarks/scripts/tf_cnn_benchmarks/benchmark_cnn.py", line 417, in run
global_step_val, = self.sess.run([self.global_step_op])
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 889, in run
run_metadata_ptr)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1047, in _run
raise RuntimeError('Attempted to use a closed Session.')
RuntimeError: Attempted to use a closed Session.
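The error above comes from a MaxPool node with data_format="NCHW" that was placed on CPU:0. A minimal sketch of choosing the layout from the device follows. The helper names are hypothetical. The --device and --data_format flags are the ones used in another command on this page, and whether they are available in the reporter's checkout is an assumption.

```python
def pick_data_format(device):
    """Use NHWC on CPU, where the default MaxPool kernel requires it; NCHW otherwise."""
    return "NHWC" if device == "cpu" else "NCHW"


def nchw_to_nhwc_shape(shape):
    """Reorder a (batch, channels, height, width) shape to (batch, height, width, channels)."""
    n, c, h, w = shape
    return (n, h, w, c)


def build_command(model, batch_size, device):
    """Assemble a benchmark command line with a layout that matches the device."""
    return ("python tf_cnn_benchmarks.py --device=%s --data_format=%s "
            "--batch_size=%d --model=%s"
            % (device, pick_data_format(device), batch_size, model))


print(build_command("resnet50", 32, "cpu"))
print(nchw_to_nhwc_shape((32, 3, 224, 224)))
```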
I ran into a hang when running tf_cnn_benchmarks.py in distributed mode. I think global_step should be protected by a lock in this line.
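As a generic illustration of the suggestion, and not a fix for the benchmark itself, a shared step counter guarded by a lock could look like the sketch below. StepCounter is a hypothetical class, and whether the reported hang is actually caused by unsynchronized access is not established here.

```python
import threading


class StepCounter(object):
    """Hypothetical shared counter; every access goes through one lock."""

    def __init__(self):
        self._lock = threading.Lock()
        self._value = 0

    def increment(self):
        with self._lock:
            self._value += 1
            return self._value

    def value(self):
        with self._lock:
            return self._value


counter = StepCounter()


def worker(num_steps):
    for _ in range(num_steps):
        counter.increment()


threads = [threading.Thread(target=worker, args=(1000,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(counter.value())
```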
The existing implementation uses L2 regularization on all model variables (including batch norm variables and biases). This is quite different from TF-Slim models, which usually regularize only conv2d weights, and including all variables hurts training quite a lot.
Maybe we should implement the regularization loss using the common TF API. Something like:
weight_decay = self.params.weight_decay
if rel_device_num == 0 and weight_decay:
  # Regularization losses in the present name scope.
  nm_sc = tf.contrib.framework.get_name_scope()
  reg_losses = tf.losses.get_regularization_losses(nm_sc)
  # TODO: fp16 conversion???
  reg_loss = tf.add_n(reg_losses, name='total_regularization_loss')
More generally, I think it would add a lot of value if this benchmark repo could actually reproduce SOTA training on ImageNet for common architectures.
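The selective regularization proposed above can be sketched without TensorFlow. The filter rule and the variable names below are hypothetical, modelled on names that appear in logs elsewhere on this page. The factor of one half matches the convention of tf.nn.l2_loss.

```python
def should_regularize(var_name):
    """Skip batch norm variables and biases; keep everything else (assumed rule)."""
    lowered = var_name.lower()
    if "batchnorm" in lowered:
        return False
    if lowered.endswith("biases") or lowered.endswith("bias"):
        return False
    return True


def l2_loss(values):
    """sum(v ** 2) / 2, the same convention as tf.nn.l2_loss."""
    return sum(v * v for v in values) / 2.0


def regularization_loss(variables, weight_decay):
    """Weight decay applied only to the variables selected by should_regularize."""
    selected = [vals for name, vals in variables.items() if should_regularize(name)]
    return weight_decay * sum(l2_loss(vals) for vals in selected)


variables = {  # hypothetical names and values
    "v/cg/conv0/conv2d/kernel": [1.0, 2.0],
    "v/cg/conv0/biases": [3.0],
    "v/cg/conv0/batchnorm0/gamma": [4.0],
}
loss = regularization_loss(variables, weight_decay=0.5)
print(loss)
```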
Hi @tfboyd, I saw the benchmark has a --use_fp16 flag now. So do the benchmark and the latest TensorFlow support FP16 now? Can we run the test on Volta GPUs? Thanks.
When I try
python render_template.py template.yaml.jinja | kubectl create -f -
for alexnet, it fails with:
error validating "STDIN": error validating data: unexpected type: object; if you choose to ignore these errors, turn validation off with --validate=false
If --validate=false is added, the output is:
NAME                     READY   STATUS              RESTARTS   AGE
alexnet-ps-0-c73m6       0/1     ContainerCreating   0          7s
alexnet-ps-1-pm9s6       0/1     ContainerCreating   0          7s
alexnet-worker-0-pzbbq   0/1     ContainerCreating   0          7s
Note the code here:
biased = tf.reshape(
    tf.nn.bias_add(
        conv, biases, data_format=self.data_format),
    conv.get_shape())
I think the output shape of bias_add is exactly the same as conv.get_shape(), so why bother to reshape?
Then I added some slim analyzer code to print total_ops:
total_ops = slim.model_analyzer.analyze_ops(sess.graph)
I noticed that total_ops differs depending on whether the reshape is present (when running training):
With reshape:
TOTAL_OPS = 4856506501
Without reshape:
TOTAL_OPS = 3989465117
Of course the performance is also different.
I ran the script as follows:
python tf_cnn_benchmarks.py --local_parameter_device=gpu --num_gpus=4 --batch_size=128 --model=resnet50 --variable_update=replicated --nodistortions --nccl True --trace_file ~/timeline.json
But there is no improvement at all. The speed is equal to the one on batch 64:
Step Img/sec loss
1 images/sec: 725.2 +/- 0.0 (jitter = 0.0) 7.463
10 images/sec: 736.7 +/- 1.4 (jitter = 2.9) 7.180
20 images/sec: 731.7 +/- 2.4 (jitter = 6.7) 7.048
30 images/sec: 723.7 +/- 2.6 (jitter = 19.3) 6.971
40 images/sec: 719.0 +/- 2.4 (jitter = 15.9) 6.929
50 images/sec: 716.0 +/- 2.1 (jitter = 12.2) 6.898
What is the reason? And why does the speed keep dropping as the step count grows?
Is there any way to keep the GPU at full power?
Besides, when I ran the script on a single GPU:
python tf_cnn_benchmarks.py --local_parameter_device=cpu --num_gpus=1 --batch_size=64 --model=resnet50 --variable_update=parameter_server --optimizer=sgd
I got the timeline timeline_benchmark_origin.json.txt with the unchanged script, and timeline_benchmark_changed.json.txt after this line was replaced by with tf.control_dependencies([]):.
Performance improves significantly when the global_step.assign_add operation has no dependencies. But the improvement only shows up on a particular step, and only on one GPU:
Starting real work at step 10 at time Wed Jun 28 09:49:39 2017
Done warm up
Step Img/sec loss
1 images/sec: 544.0 +/- 0.0 (jitter = 0.0) 6.776
10 images/sec: 207.7 +/- 33.2 (jitter = 2.1) 5.985
20 images/sec: 201.4 +/- 17.0 (jitter = 1.8) 5.563
30 images/sec: 199.3 +/- 11.4 (jitter = 1.4) 5.343
40 images/sec: 198.2 +/- 8.6 (jitter = 1.0) 5.255
50 images/sec: 197.5 +/- 6.9 (jitter = 1.0) 5.196
60 images/sec: 197.0 +/- 5.8 (jitter = 1.0) 5.167
70 images/sec: 196.6 +/- 5.0 (jitter = 0.9) 5.145
80 images/sec: 196.3 +/- 4.3 (jitter = 0.9) 5.125
90 images/sec: 196.0 +/- 3.9 (jitter = 1.0) 5.111
Finishing real work at step 109 at time Wed Jun 28 09:50:11 2017
----------------------------------------------------------------
total images/sec: 192.95
----------------------------------------------------------------
What's the reason ?
I think the log of distributed training is confusing for users, for instance total images/sec. I have read and checked the implementation in benchmark_cnn.py, and images_per_sec is computed as below:
images_per_sec = (num_workers*batch_size) / average_wall_time
average_wall_time = elapsed_time / num_steps
num_steps = global_step_watcher.num_steps()
elapsed_time = global_step_watcher.elapsed_time()
It's clear that num_steps is the sum over all workers, but elapsed_time is just the time spent by a single worker, and num_workers == 1.
I attached my command and results below. I added some code to print extra outputs such as num_workers, num_steps and average_wall_time.
python tf_cnn_benchmarks.py \
--device=cpu --mkl=True --num_inter_threads=1 \
--num_intra_threads=16 --data_format=NHWC \
--forward_only=True --kmp_blocktime=0 \
--batch_size=32 --model=inception3 \
--worker_hosts=... --ps_hosts=... \
--job_name=... --task_index=...
1 ps 1 worker, image_per_sec = 22.30
TensorFlow: 1.4
Model: inception3
Mode: forward-only
SingleSess: False
Batch size: 32 global
32 per device
Devices: ['/job:worker/task:0/cpu:0']
Data format: NHWC
Optimizer: sgd
Variables: parameter_server
Sync: True
==========
Generating model
Running warm up
Done warm up
Step Img/sec loss top_1_accuracy top_5_accuracy
1 images/sec: 23.5 +/- 0.0 (jitter = 0.0) 0.000 0.000 0.000
10 images/sec: 22.7 +/- 0.2 (jitter = 0.6) 0.000 0.000 0.000
20 images/sec: 22.7 +/- 0.1 (jitter = 0.7) 0.000 0.000 0.000
30 images/sec: 22.6 +/- 0.1 (jitter = 0.6) 0.000 0.000 0.000
40 images/sec: 22.7 +/- 0.1 (jitter = 0.5) 0.000 0.000 0.000
50 images/sec: 22.5 +/- 0.1 (jitter = 0.5) 0.000 0.000 0.031
60 images/sec: 22.4 +/- 0.1 (jitter = 0.5) 0.000 0.000 0.000
70 images/sec: 22.4 +/- 0.1 (jitter = 0.5) 0.000 0.000 0.000
80 images/sec: 22.4 +/- 0.1 (jitter = 0.6) 0.000 0.000 0.000
90 images/sec: 22.4 +/- 0.1 (jitter = 0.5) 0.000 0.000 0.000
100 images/sec: 22.4 +/- 0.1 (jitter = 0.5) 0.000 0.000 0.000
----------------------------------------------------------------
num_workers: 1
num_steps: 99
average_wall_time: 1.43500526264
total images/sec: 22.30
----------------------------------------------------------------
1 ps 2 worker, image_per_sec = 90.16?
TensorFlow: 1.4
Model: inception3
Mode: forward-only
SingleSess: False
Batch size: 32 global
32 per device
Devices: ['/job:worker/task:1/cpu:0']
Data format: NHWC
Optimizer: sgd
Variables: parameter_server
Sync: True
==========
Generating model
Running warm up
Done warm up
Step Img/sec loss top_1_accuracy top_5_accuracy
1 images/sec: 23.0 +/- 0.0 (jitter = 0.0) 0.000 0.000 0.000
10 images/sec: 22.4 +/- 0.2 (jitter = 0.6) 0.000 0.000 0.000
20 images/sec: 22.4 +/- 0.1 (jitter = 0.5) 0.000 0.000 0.000
30 images/sec: 22.5 +/- 0.1 (jitter = 0.5) 0.000 0.000 0.000
40 images/sec: 22.6 +/- 0.1 (jitter = 0.4) 0.000 0.000 0.000
50 images/sec: 22.6 +/- 0.1 (jitter = 0.5) 0.000 0.000 0.031
60 images/sec: 22.5 +/- 0.1 (jitter = 0.5) 0.000 0.000 0.000
70 images/sec: 22.4 +/- 0.1 (jitter = 0.5) 0.000 0.000 0.000
80 images/sec: 22.4 +/- 0.1 (jitter = 0.5) 0.000 0.000 0.000
90 images/sec: 22.4 +/- 0.1 (jitter = 0.5) 0.000 0.000 0.000
100 images/sec: 22.5 +/- 0.1 (jitter = 0.5) 0.000 0.000 0.000
----------------------------------------------------------------
num_workers: 1
num_steps: 199
average_wall_time: 0.709418054801
total images/sec: 45.11
----------------------------------------------------------------
TensorFlow: 1.4
Model: inception3
Mode: forward-only
SingleSess: False
Batch size: 32 global
32 per device
Devices: ['/job:worker/task:0/cpu:0']
Data format: NHWC
Optimizer: sgd
Variables: parameter_server
Sync: True
==========
Generating model
Running warm up
Done warm up
Step Img/sec loss top_1_accuracy top_5_accuracy
1 images/sec: 23.1 +/- 0.0 (jitter = 0.0) 0.000 0.000 0.000
10 images/sec: 22.4 +/- 0.2 (jitter = 0.6) 0.000 0.000 0.000
20 images/sec: 22.5 +/- 0.1 (jitter = 0.5) 0.000 0.000 0.000
30 images/sec: 22.5 +/- 0.1 (jitter = 0.5) 0.000 0.000 0.000
40 images/sec: 22.6 +/- 0.1 (jitter = 0.4) 0.000 0.000 0.000
50 images/sec: 22.6 +/- 0.1 (jitter = 0.4) 0.000 0.000 0.031
60 images/sec: 22.5 +/- 0.1 (jitter = 0.5) 0.000 0.000 0.000
70 images/sec: 22.5 +/- 0.1 (jitter = 0.4) 0.000 0.000 0.000
80 images/sec: 22.5 +/- 0.1 (jitter = 0.5) 0.000 0.000 0.000
90 images/sec: 22.5 +/- 0.1 (jitter = 0.5) 0.000 0.000 0.000
100 images/sec: 22.5 +/- 0.1 (jitter = 0.5) 0.000 0.000 0.000
----------------------------------------------------------------
num_workers: 1
num_steps: 199
average_wall_time: 0.710339195165
total images/sec: 45.05
----------------------------------------------------------------
Is the calculation of image_per_sec for 1 ps 2 worker correct?
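The figures in the logs above can be reproduced from the quoted formula. The sketch below only redoes that arithmetic with the printed num_workers, batch size and average_wall_time values. Reading the 90.16 figure as the sum of the two per-worker totals is an interpretation, not something the logs state.

```python
def images_per_sec(num_workers, batch_size, average_wall_time):
    """Formula quoted from benchmark_cnn.py in the issue text above."""
    return (num_workers * batch_size) / average_wall_time


# Values printed in the logs above (num_workers is reported as 1 in every run).
single = images_per_sec(1, 32, 1.43500526264)       # 1 ps, 1 worker
worker_1 = images_per_sec(1, 32, 0.709418054801)    # 1 ps, 2 workers, task 1
worker_0 = images_per_sec(1, 32, 0.710339195165)    # 1 ps, 2 workers, task 0
print("%.2f %.2f %.2f" % (single, worker_1, worker_0))

# Adding the two per-worker totals gives the figure questioned in the heading.
summed = round(worker_1, 2) + round(worker_0, 2)
print("%.2f" % summed)

# Elapsed time implied by each log: average_wall_time * num_steps.
elapsed_single = 1.43500526264 * 99
elapsed_two_workers = 0.709418054801 * 199
print("%.1f s vs %.1f s" % (elapsed_single, elapsed_two_workers))
```

The implied elapsed times are nearly identical while num_steps roughly doubles. That is consistent with the poster's reading that num_steps counts steps from both workers, so each worker's total already looks like the aggregate throughput, and adding the two totals would count it twice.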
After I pulled and merged the latest commit, I got an ImportError. I attached the error log below:
Traceback (most recent call last):
File "tf_cnn_benchmarks.py", line 26, in <module>
import benchmark_cnn
File ".../benchmarks/scripts/tf_cnn_benchmarks/benchmark_cnn.py", line 43, in <module>
import datasets
File ".../benchmarks/scripts/tf_cnn_benchmarks/datasets.py", line 28, in <module>
import preprocessing
File ".../benchmarks/scripts/tf_cnn_benchmarks/preprocessing.py", line 23, in <module>
from tensorflow.contrib.data.python.ops import interleave_ops
ImportError: cannot import name interleave_ops
I reviewed the commit and found the latest merged code causing the error. You can find my comment here, and the code here
When I train with --variable_update=distributed_replicated, I am confused about how the PS ensures that it has received all the gradients to average before starting the next step. I traced the code and found the server creating the queue_ops operations and enqueuing into the sync_queues after applying the gradients to the variables. But I can't find how the server ensures synchronization across multiple workers. It seems strange to put False elements into the sync_queues but never check or act on them.
Google's benchmark reports a speedup of 59, but in our tests we only get 35. How can we improve the speedup?
I am running distributed TensorFlow with the gRPC protocol on CPUs only. I enabled the distributed_all_reduce type of variable update with 'all_reduce_spec = xring'.
I am wondering whether this mode is supposed to work for CPU-only distributed runs. If yes, does it need a separate controller process in addition to the workers?
I am getting errors such as:
Unknown device: /job:worker/replica:0/task:2/device:CPU:0 all devices: CPU:0, /job:worker/replica:0/task:0/cpu:0, /job:worker/replica:0/task:0/device:CPU:0
I have built TensorFlow from scratch and installed it on an Intel Xeon Phi.
The Xeon Phi does not have a graphics card, so how do I run the Inception v3 benchmark (tf_cnn_benchmarks.py) on its CPU?
Can someone provide the command with the appropriate parameters to get it running on the CPU?
I saw your Stanford submission to the dawn benchmark, and want to say thank you for all the work you do on TensorPack. Opening a github issue seemed odd but an easy way to mention you in hopes you see it.
When trying to run a fresh copy of tensorflow/benchmarks for Keras, I get the following error:
python3 run_benchmark.py
Using TensorFlow backend.
ModuleNotFoundError: No module named 'data_generator'
A Google search does not turn up any module named 'data_generator'; at least, I could not get it to tell me what this module is about.
What am I doing wrong?
In single worker training, the global step is the same as local step
2017-12-07 19:24:42,741 INFO 35 global step 94840
2017-12-07 19:24:44,858 INFO 35 step 94840 batch 128 train_time 0.197075128555 images/sec: 546.0 +/- 0.3 (jitter = 66.0) 3.512 0.422 0.648
But in distributed training, the global step is 2 * len(workers) * local_step, and I think it should be len(workers) * local_step?
4 ps 4 workers (--num_gpus=4 --summary_verbosity=1 --model=resnet50 --variable_update=distributed_replicated --num_batches=5005000 --batch_group_size=2 --local_parameter_device=gpu --batch_size=32)
2017-12-07 19:34:21,283 INFO 31 global step 381280
2017-12-07 19:34:21,942 INFO 31 47650 images/sec: 198.1 +/- 0.1 (jitter = 18.9) 2.359 0.617 0.781
And images/sec is much slower compared to single worker
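The ratio mentioned above can be checked from the two log lines. This sketch only redoes the arithmetic. Attributing the extra factor of 2 to --batch_group_size=2 from the command is one candidate explanation and is not confirmed by anything on this page.

```python
def steps_ratio(global_step, local_step):
    """How many global steps were counted per local step."""
    return global_step / float(local_step)


# Values from the distributed log lines above (printed about 0.66 s apart).
ratio = steps_ratio(381280, 47650)

num_workers = 4
batch_group_size = 2  # from the command above; its role here is an assumption
candidate = num_workers * batch_group_size

print("observed ratio: %.3f, num_workers * batch_group_size: %d" % (ratio, candidate))
```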
Hello All,
I've installed tensorflow for GPU and all the dependencies, and I'm trying to run the benchmark by simply cloning this repository and using the suggested command to run the Inception V3 Model:
python tf_cnn_benchmarks.py --num_gpus=1 --batch_size=32 --model=inception3 --variable_update=parameter_server
However I'm getting the following error:
from tensorflow.contrib.data.python.ops import batching
ImportError: Cannot import name 'batching'
I'm guessing it can't find the 'batching' script due to some path issues, but I'm not sure
I found the 'batching' script it's looking for here:
https://github.com/tensorflow/tensorflow/blob/master/tensorflow/contrib/data/python/ops/batching.py
I can just place it in that directory, but I trust that this is not the correct approach, and more errors may follow.
Is there something I'm missing? Is cloning the repo and running the script all I need to do, or are there other steps? I get the same error on both Python 3.4 and Python 2.7 on my Ubuntu 14 machine.
Thanks in advance.
I cloned this project and want to run it. I installed TensorFlow using pip install tensorflow==1.4.0rc1, but I get an import error in a Python 3 environment.
[root@hp tf_cnn_benchmarks]$ pwd
/tmp/benchmarks/scripts/tf_cnn_benchmarks
[root@hp tf_cnn_benchmarks]$ ~/miniconda3/bin/pip freeze | grep -w tensorflow
tensorflow==1.4.0rc1
tensorflow-tensorboard==0.4.0rc1
[root@hp tf_cnn_benchmarks]$ ~/miniconda3/bin/python tf_cnn_benchmarks.py
Traceback (most recent call last):
File "tf_cnn_benchmarks.py", line 26, in <module>
import benchmark_cnn
File "/tmp/benchmarks/scripts/tf_cnn_benchmarks/benchmark_cnn.py", line 46, in <module>
from models import model_config
File "/tmp/benchmarks/scripts/tf_cnn_benchmarks/models/model_config.py", line 19, in <module>
import alexnet_model
ImportError: No module named 'alexnet_model'
But when I test it in a Python 2 environment, it works and produces results.
[root@hp tf_cnn_benchmarks]$ ~/miniconda2/bin/pip freeze | grep tensorflow
tensorflow==1.4.0rc1
tensorflow-tensorboard==0.4.0rc1
[root@hp tf_cnn_benchmarks]$ ~/miniconda2/bin/python tf_cnn_benchmarks.py
TensorFlow: 1.4
Model: trivial
Mode: training
... ...
Has anyone met a similar problem? Thanks!
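The traceback shows a plain import alexnet_model executed from inside the models package. Python 3 no longer resolves such implicit relative imports, which would explain why the same checkout works under Python 2. The sketch below reproduces the behaviour with a throwaway package. The names demo_models and alexnet_demo are hypothetical and unrelated to the repository.

```python
import importlib
import os
import sys
import tempfile


def write(path, text):
    with open(path, "w") as f:
        f.write(text)


root = tempfile.mkdtemp()
pkg = os.path.join(root, "demo_models")  # hypothetical package name
os.mkdir(pkg)
write(os.path.join(pkg, "__init__.py"), "")
write(os.path.join(pkg, "alexnet_demo.py"), "NAME = 'alexnet'\n")
# Implicit relative import: resolved by Python 2, not by Python 3.
write(os.path.join(pkg, "implicit_config.py"), "import alexnet_demo\n")
# Explicit relative import: resolved by both.
write(os.path.join(pkg, "explicit_config.py"),
      "from . import alexnet_demo\nNAME = alexnet_demo.NAME\n")

sys.path.insert(0, root)
importlib.invalidate_caches()

try:
    importlib.import_module("demo_models.implicit_config")
    implicit_ok = True
except ImportError:
    implicit_ok = False

explicit = importlib.import_module("demo_models.explicit_config")
print(implicit_ok, explicit.NAME)
```

Under Python 3 the implicit form fails with the same kind of "No module named" error as in the report, while the explicit from . import form succeeds.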
Although we have the replicated strategy for single/multi-machine training, it may be better to put the fc layers on a single device to avoid transmitting very large weights.
tf_cnn_benchmarks.py uses weak scaling:
self.batch_size = self.model_conf.get_batch_size() * FLAGS.num_gpus
That means that with multiple GPUs you are solving a different problem than with 1 GPU. Is this intended?
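The distinction raised above can be written out in a few lines. Both helpers are hypothetical. The first mirrors the quoted line, and the second shows the strong-scaling alternative, in which the global batch stays fixed and is split across GPUs.

```python
def weak_scaling_global_batch(per_gpu_batch, num_gpus):
    """Mirrors the quoted line: the global batch grows with the GPU count."""
    return per_gpu_batch * num_gpus


def strong_scaling_per_gpu_batch(global_batch, num_gpus):
    """Keep the global batch fixed and split it evenly across GPUs."""
    if global_batch % num_gpus != 0:
        raise ValueError("global batch must be divisible by the number of GPUs")
    return global_batch // num_gpus


for gpus in (1, 2, 4):
    print("gpus=%d weak: global=%d | strong: per_gpu=%d"
          % (gpus,
             weak_scaling_global_batch(64, gpus),
             strong_scaling_per_gpu_batch(64, gpus)))
```

To compare against a 1-GPU run on the same problem, one option is to lower --batch_size by hand so the global batch stays constant, as the VGG16 runs earlier on this page do with 64 on one GPU and 32 on each of two GPUs.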
When I run the cnn_benchmark function of tf_cnn_benchmarks, everything looks fine and the checkpoint file is successfully stored in train_dir. But when I run the eval function, an exception occurs.
……
2017-07-26 15:30:52.072950: W tensorflow/core/framework/op_kernel.cc:1152] Not found: Key v3/incept_v3_e0/conv84/batchnorm84/moving_variance not found in checkpoint
2017-07-26 15:30:52.073198: W tensorflow/core/framework/op_kernel.cc:1152] Not found: Key v3/incept_v3_e0/conv85/batchnorm85/beta not found in checkpoint
2017-07-26 15:30:52.073278: W tensorflow/core/framework/op_kernel.cc:1152] Not found: Key v3/incept_v3_e0/conv87/batchnorm87/moving_mean not found in checkpoint
2017-07-26 15:30:52.073406: W tensorflow/core/framework/op_kernel.cc:1152] Not found: Key v3/incept_v3_e0/conv85/batchnorm85/moving_mean not found in checkpoint
2017-07-26 15:30:52.073536: W tensorflow/core/framework/op_kernel.cc:1152] Not found: Key v3/incept_v3_e0/conv85/conv2d/kernel not found in checkpoint
2017-07-26 15:30:52.073577: W tensorflow/core/framework/op_kernel.cc:1152] Not found: Key v3/incept_v3_e0/conv88/batchnorm88/beta not found in checkpoint
2017-07-26 15:30:52.073661: W tensorflow/core/framework/op_kernel.cc:1152] Not found: Key v3/incept_v3_e0/conv87/batchnorm87/moving_variance not found in checkpoint
2017-07-26 15:30:52.073738: W tensorflow/core/framework/op_kernel.cc:1152] Not found: Key v3/incept_v3_e0/conv85/batchnorm85/moving_variance not found in checkpoint
2017-07-26 15:30:52.073810: W tensorflow/core/framework/op_kernel.cc:1152] Not found: Key v3/incept_v3_e0/conv88/conv2d/kernel not found in checkpoint
2017-07-26 15:30:52.073863: W tensorflow/core/framework/op_kernel.cc:1152] Not found: Key v3/incept_v3_e0/conv88/batchnorm88/moving_variance not found in checkpoint
2017-07-26 15:30:52.073957: W tensorflow/core/framework/op_kernel.cc:1152] Not found: Key v3/incept_v3_e0/conv84/batchnorm84/moving_mean not found in checkpoint
2017-07-26 15:30:52.074055: W tensorflow/core/framework/op_kernel.cc:1152] Not found: Key v3/incept_v3_e0/conv87/batchnorm87/gamma not found in checkpoint
2017-07-26 15:30:52.074110: W tensorflow/core/framework/op_kernel.cc:1152] Not found: Key v3/incept_v3_e0/conv84/batchnorm84/beta not found in checkpoint
2017-07-26 15:30:52.074348: W tensorflow/core/framework/op_kernel.cc:1152] Not found: Key v3/incept_v3_e0/conv87/batchnorm87/beta not found in checkpoint
2017-07-26 15:30:52.074395: W tensorflow/core/framework/op_kernel.cc:1152] Not found: Key v3/incept_v3_e0/conv86/batchnorm86/moving_variance not found in checkpoint
2017-07-26 15:30:52.074757: W tensorflow/core/framework/op_kernel.cc:1152] Not found: Key v3/incept_v3_e0/conv88/batchnorm88/moving_mean not found in checkpoint
2017-07-26 15:30:52.074770: W tensorflow/core/framework/op_kernel.cc:1152] Not found: Key v3/incept_v3_e0/conv88/batchnorm88/gamma not found in checkpoint
2017-07-26 15:30:52.074843: W tensorflow/core/framework/op_kernel.cc:1152] Not found: Key v3/incept_v3_e0/conv86/batchnorm86/moving_mean not found in checkpoint
Traceback (most recent call last):
File "tf_cnn_benchmarks.py", line 1348, in <module>
tf.app.run()
File "/usr/lib/python2.7/site-packages/tensorflow/python/platform/app.py", line 48, in run
_sys.exit(main(_sys.argv[:1] + flags_passthrough))
File "tf_cnn_benchmarks.py", line 1344, in main
bench.run()
File "tf_cnn_benchmarks.py", line 885, in run
self._eval_cnn()
File "tf_cnn_benchmarks.py", line 901, in _eval_cnn
global_step = load_checkpoint(saver, sess, FLAGS.train_dir)
File "tf_cnn_benchmarks.py", line 717, in load_checkpoint
saver.restore(sess, model_checkpoint_path)
File "/usr/lib/python2.7/site-packages/tensorflow/python/training/saver.py", line 1457, in restore
{self.saver_def.filename_tensor_name: save_path})
File "/usr/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 778, in run
run_metadata_ptr)
File "/usr/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 982, in _run
feed_dict_string, options, run_metadata)
File "/usr/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 1032, in _do_run
target_list, options, run_metadata)
File "/usr/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 1052, in _do_call
raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.NotFoundError: Key v0/incept_v3_d0/conv73/batchnorm73/gamma not found in checkpoint
[[Node: save/RestoreV2_369 = RestoreV2[dtypes=[DT_FLOAT], _device="/job:localhost/replica:0/task:0/cpu:0"](_recv_save/Const_0, save/RestoreV2_369/tensor_names, save/RestoreV2_369/shape_and_slices)]]
[[Node: save/RestoreV2_809/_2199 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/gpu:1", send_device="/job:localhost/replica:0/task:0/cpu:0", send_device_incarnation=1, tensor_name="edge_15876_save/RestoreV2_809", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/gpu:1"]()]]
Caused by op u'save/RestoreV2_369', defined at:
File "tf_cnn_benchmarks.py", line 1348, in <module>
tf.app.run()
File "/usr/lib/python2.7/site-packages/tensorflow/python/platform/app.py", line 48, in run
_sys.exit(main(_sys.argv[:1] + flags_passthrough))
File "tf_cnn_benchmarks.py", line 1344, in main
bench.run()
File "tf_cnn_benchmarks.py", line 885, in run
self._eval_cnn()
File "tf_cnn_benchmarks.py", line 892, in _eval_cnn
saver = tf.train.Saver(tf.global_variables())
File "/usr/lib/python2.7/site-packages/tensorflow/python/training/saver.py", line 1056, in __init__
self.build()
File "/usr/lib/python2.7/site-packages/tensorflow/python/training/saver.py", line 1086, in build
restore_sequentially=self._restore_sequentially)
File "/usr/lib/python2.7/site-packages/tensorflow/python/training/saver.py", line 691, in build
restore_sequentially, reshape)
File "/usr/lib/python2.7/site-packages/tensorflow/python/training/saver.py", line 407, in _AddRestoreOps
tensors = self.restore_op(filename_tensor, saveable, preferred_shard)
File "/usr/lib/python2.7/site-packages/tensorflow/python/training/saver.py", line 247, in restore_op
[spec.tensor.dtype])[0])
File "/usr/lib/python2.7/site-packages/tensorflow/python/ops/gen_io_ops.py", line 669, in restore_v2
dtypes=dtypes, name=name)
File "/usr/lib/python2.7/site-packages/tensorflow/python/framework/op_def_library.py", line 768, in apply_op
op_def=op_def)
File "/usr/lib/python2.7/site-packages/tensorflow/python/framework/ops.py", line 2336, in create_op
original_op=self._default_original_op, op_def=op_def)
File "/usr/lib/python2.7/site-packages/tensorflow/python/framework/ops.py", line 1228, in __init__
self._traceback = _extract_stack()
NotFoundError (see above for traceback): Key v0/incept_v3_d0/conv73/batchnorm73/gamma not found in checkpoint
[[Node: save/RestoreV2_369 = RestoreV2[dtypes=[DT_FLOAT], _device="/job:localhost/replica:0/task:0/cpu:0"](_recv_save/Const_0, save/RestoreV2_369/tensor_names, save/RestoreV2_369/shape_and_slices)]]
[[Node: save/RestoreV2_809/_2199 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/gpu:1", send_device="/job:localhost/replica:0/task:0/cpu:0", send_device_incarnation=1, tensor_name="edge_15876_save/RestoreV2_809", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/gpu:1"]()]]
My training worker script:
python tf_cnn_benchmarks.py --train_dir /home/sk/test/train_dir --variable_update distributed_replicated --model inception3 --batch_size 8 --ps_hosts=127.0.0.1:13555 --worker_hosts=127.0.0.1:13600 --job_name=worker --task_index=0 --num_gpus 4 --local_parameter_device cpu
Parameter server script:
python tf_cnn_benchmarks.py --train_dir /home/sk/test/train_dir --variable_update distributed_replicated --model inception3 --batch_size 8 --ps_hosts=127.0.0.1:13555 --worker_hosts=127.0.0.1:13600 --job_name=ps --task_index=0 --num_gpus 0 --local_parameter_device cpu
Eval script:
python tf_cnn_benchmarks.py --train_dir /home/sk/test/train_dir --variable_update replicated --model inception3 --batch_size 8 --num_gpus 4 --eval
ll ~/test/train_dir/
total 126348
-rw-rw-r-- 1 143 Jul 26 15:24 checkpoint
-rw-rw-r-- 1 23760967 Jul 26 15:23 graph.pbtxt
-rw-rw-r-- 1 95277612 Jul 26 15:24 model.ckpt-110.data-00000-of-00001
-rw-rw-r-- 1 9461 Jul 26 15:24 model.ckpt-110.index
-rw-rw-r-- 1 10317639 Jul 26 15:24 model.ckpt-110.meta
Also, when I trained in stand-alone mode (--variable_update replicated), the eval function worked well, so I don't know why it doesn't work in distributed_replicated mode. Can anyone help me? Thanks a lot.
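One way to narrow down an error like this is to compare the variable names the eval graph asks for with the keys actually stored in the checkpoint. The sketch below is illustrative only: `missing_keys` and `same_suffix` are hypothetical helpers, and the checkpoint key in the example is a made-up placeholder, not the contents of the checkpoint listed above. In practice the key list would come from a checkpoint reader and the graph names from the variables passed to the Saver.

```python
def missing_keys(graph_names, checkpoint_keys):
    """Return graph variable names that have no exact match in the checkpoint."""
    available = set(checkpoint_keys)
    return sorted(name for name in graph_names if name not in available)


def same_suffix(name, checkpoint_keys):
    """Return checkpoint keys equal to `name` once the first scope component is dropped.

    A non-empty result hints that only the leading scope (e.g. 'v0/') differs.
    """
    suffix = name.split('/', 1)[-1]
    return sorted(k for k in checkpoint_keys if k.split('/', 1)[-1] == suffix)


# Names taken from the error messages above.
graph_names = [
    'v0/incept_v3_d0/conv73/batchnorm73/gamma',
    'v3/incept_v3_e0/conv85/batchnorm85/beta',
]
# Placeholder key for illustration; NOT read from a real checkpoint.
checkpoint_keys = ['placeholder_scope/incept_v3_d0/conv73/batchnorm73/gamma']

print(missing_keys(graph_names, checkpoint_keys))
print(same_suffix(graph_names[0], checkpoint_keys))
```

If the second call returns matches while the first reports the name as missing, the graph and the checkpoint probably disagree only on the outer variable scope, which is worth checking against the --variable_update mode used for training and for eval.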
My Env:
TensorFlow: 1.3
CUDA: 8.0
cuDNN: 6.0
I noticed an update adding distributed_all_reduce, so I wanted to give it a try. But I'm not sure what value controller_host should take.
My args are:
--variable_update=distributed_all_reduce
--all_reduce_spec=pscpu:32k:xring
I start three processes with these args:
FIRST:
--job_name=worker
--worker_hosts=127.0.0.1:50001,127.0.0.1:50002
--task_index=0
SECOND:
--job_name=worker
--worker_hosts=127.0.0.1:50001,127.0.0.1:50002
--task_index=1
THIRD:
--job_name=controller
--controller_host=??
--task_index=0
When I set controller_host to 127.0.0.1:50000 or 127.0.0.1:50001, I get:
TensorFlow: 1.3
Model: resnet50
Mode: training
SingleSess: True
Batch size: 128 global
64 per device
Devices: ['job:worker/task0/gpu:0', 'job:worker/task1/gpu:0']
Data format: NCHW
Optimizer: sgd
Variables: distributed_all_reduce
AllReduce: pscpu:32k:xring
Sync: True
==========
Generating model
WARNING:tensorflow:From /home/zzy/workspace/benchmarks/scripts/tf_cnn_benchmarks/preprocessing.py:486: __init__ (from tensorflow.contrib.data.python.ops.readers) is deprecated and will be removed in a future version.
Instructions for updating:
Use `tf.data.TFRecordDataset`.
WARNING:tensorflow:From /home/zzy/workspace/benchmarks/scripts/tf_cnn_benchmarks/preprocessing.py:487: range (from tensorflow.contrib.data.python.ops.dataset_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use `tf.data.Dataset.range()`.
WARNING:tensorflow:From /home/zzy/workspace/benchmarks/scripts/tf_cnn_benchmarks/preprocessing.py:489: zip (from tensorflow.contrib.data.python.ops.dataset_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use `tf.data.Dataset.zip()`.
2017-10-10 14:03:34.183287: E tensorflow/core/common_runtime/session.cc:69] Not found: No session factory registered for the given session options: {target: "127.0.0.1:50001" config: intra_op_parallelism_threads: 1 gpu_options { force_gpu_compatible: true } allow_soft_placement: true} Registered factories are {DIRECT_SESSION, GRPC_SESSION}.
Traceback (most recent call last):
File "/home/zzy/workspace/benchmarks/scripts/tf_cnn_benchmarks/tf_cnn_benchmarks.py", line 46, in <module>
tf.app.run()
File "/home/zzy/anaconda2/envs/tf-1.3/lib/python2.7/site-packages/tensorflow/python/platform/app.py", line 48, in run
_sys.exit(main(_sys.argv[:1] + flags_passthrough))
File "/home/zzy/workspace/benchmarks/scripts/tf_cnn_benchmarks/tf_cnn_benchmarks.py", line 42, in main
bench.run()
File "/home/zzy/workspace/benchmarks/scripts/tf_cnn_benchmarks/benchmark_cnn.py", line 892, in run
return self._benchmark_cnn()
File "/home/zzy/workspace/benchmarks/scripts/tf_cnn_benchmarks/benchmark_cnn.py", line 1068, in _benchmark_cnn
start_standard_services=start_standard_services) as sess:
File "/home/zzy/anaconda2/envs/tf-1.3/lib/python2.7/contextlib.py", line 17, in __enter__
return self.gen.next()
File "/home/zzy/anaconda2/envs/tf-1.3/lib/python2.7/site-packages/tensorflow/python/training/supervisor.py", line 964, in managed_session
self.stop(close_summary_writer=close_summary_writer)
File "/home/zzy/anaconda2/envs/tf-1.3/lib/python2.7/site-packages/tensorflow/python/training/supervisor.py", line 792, in stop
stop_grace_period_secs=self._stop_grace_secs)
File "/home/zzy/anaconda2/envs/tf-1.3/lib/python2.7/site-packages/tensorflow/python/training/coordinator.py", line 389, in join
six.reraise(*self._exc_info_to_raise)
File "/home/zzy/anaconda2/envs/tf-1.3/lib/python2.7/site-packages/tensorflow/python/training/supervisor.py", line 953, in managed_session
start_standard_services=start_standard_services)
File "/home/zzy/anaconda2/envs/tf-1.3/lib/python2.7/site-packages/tensorflow/python/training/supervisor.py", line 708, in prepare_or_wait_for_session
init_feed_dict=self._init_feed_dict, init_fn=self._init_fn)
File "/home/zzy/anaconda2/envs/tf-1.3/lib/python2.7/site-packages/tensorflow/python/training/session_manager.py", line 273, in prepare_session
config=config)
File "/home/zzy/anaconda2/envs/tf-1.3/lib/python2.7/site-packages/tensorflow/python/training/session_manager.py", line 178, in _restore_checkpoint
sess = session.Session(self._target, graph=self._graph, config=config)
File "/home/zzy/anaconda2/envs/tf-1.3/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 1482, in __init__
super(Session, self).__init__(target, graph, config=config)
File "/home/zzy/anaconda2/envs/tf-1.3/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 622, in __init__
self._session = tf_session.TF_NewDeprecatedSession(opts, status)
File "/home/zzy/anaconda2/envs/tf-1.3/lib/python2.7/site-packages/tensorflow/python/framework/errors_impl.py", line 473, in __exit__
c_api.TF_GetCode(self.status.status))
tensorflow.python.framework.errors_impl.NotFoundError: No session factory registered for the given session options: {target: "127.0.0.1:50001" config: intra_op_parallelism_threads: 1 gpu_options { force_gpu_compatible: true } allow_soft_placement: true} Registered factories are {DIRECT_SESSION, GRPC_SESSION}.
Has anyone run the benchmark on multiple nodes (without GPUs)?
Detailed information:
centOS 7.2
git checkout r1.3
bazel build --config=mkl --copt=-DEIGEN_USE_VML -s -c opt //tensorflow/tools/pip_package:build_pip_package
The commands used are as follows:
-- PS message --
Running parameter server 0
-- Worker error message --
Running warm up
Exception in thread Thread-1:
Traceback (most recent call last):
File "/usr/lib64/python2.7/threading.py", line 811, in __bootstrap_inner
self.run()
File "tf_cnn_benchmarks.py", line 232, in run
global_step_val, = self.sess.run([self.global_step_op])
File "/usr/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 895, in run
run_metadata_ptr)
File "/usr/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 1051, in _run
raise RuntimeError('Attempted to use a closed Session.')
RuntimeError: Attempted to use a closed Session.
Traceback (most recent call last):
File "tf_cnn_benchmarks.py", line 1345, in <module>
tf.app.run()
File "/usr/lib/python2.7/site-packages/tensorflow/python/platform/app.py", line 48, in run
_sys.exit(main(_sys.argv[:1] + flags_passthrough))
File "tf_cnn_benchmarks.py", line 1341, in main
bench.run()
File "tf_cnn_benchmarks.py", line 884, in run
self._benchmark_cnn()
File "tf_cnn_benchmarks.py", line 1026, in _benchmark_cnn
self.trace_filename, fetch_summary)
File "tf_cnn_benchmarks.py", line 660, in benchmark_one_step
results = sess.run(fetches, options=run_options, run_metadata=run_metadata)
File "/usr/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 895, in run
run_metadata_ptr)
File "/usr/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 1124, in _run
feed_dict_tensor, options, run_metadata)
File "/usr/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 1321, in _do_run
options, run_metadata)
File "/usr/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 1340, in _do_call
raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.InvalidArgumentError: tensor_in must be 4-dimensional
[[Node: v0/tower_0/gradients/v0/tower_0/mpool2/MaxPool_grad/MaxPoolGrad = _MklMaxPoolGrad[T=DT_FLOAT, _kernel="MklOp", data_format="NCHW", ksize=[1, 1, 3, 3], padding="VALID", strides=[1, 1, 2, 2], workspace_enabled=true, _device="/job:worker/replica:0/task:0/cpu:0"](v0/tower_0/conv4/Relu, v0/tower_0/mpool2/MaxPool, v0/tower_0/gradients/v0/tower_0/Reshape_grad/Reshape, v0/tower_0/mpool2/MaxPool:1, DMT/_57, DMT/_58, v0/tower_0/gradients/v0/tower_0/Reshape_grad/Reshape:1, v0/tower_0/mpool2/MaxPool:3)]]
Caused by op u'v0/tower_0/gradients/v0/tower_0/mpool2/MaxPool_grad/MaxPoolGrad', defined at:
File "tf_cnn_benchmarks.py", line 1345, in <module>
tf.app.run()
File "/usr/lib/python2.7/site-packages/tensorflow/python/platform/app.py", line 48, in run
_sys.exit(main(_sys.argv[:1] + flags_passthrough))
File "tf_cnn_benchmarks.py", line 1341, in main
bench.run()
File "tf_cnn_benchmarks.py", line 884, in run
self._benchmark_cnn()
File "tf_cnn_benchmarks.py", line 924, in _benchmark_cnn
(enqueue_ops, fetches) = self._build_model()
File "tf_cnn_benchmarks.py", line 1095, in _build_model
gpu_grad_stage_ops)
File "tf_cnn_benchmarks.py", line 1262, in add_forward_pass_and_gradients
grads = tf.gradients(loss, params, aggregation_method=aggmeth)
File "/usr/lib/python2.7/site-packages/tensorflow/python/ops/gradients_impl.py", line 542, in gradients
grad_scope, op, func_call, lambda: grad_fn(op, *out_grads))
File "/usr/lib/python2.7/site-packages/tensorflow/python/ops/gradients_impl.py", line 348, in _MaybeCompile
return grad_fn() # Exit early
File "/usr/lib/python2.7/site-packages/tensorflow/python/ops/gradients_impl.py", line 542, in <lambda>
grad_scope, op, func_call, lambda: grad_fn(op, *out_grads))
File "/usr/lib/python2.7/site-packages/tensorflow/python/ops/nn_grad.py", line 526, in _MaxPoolGrad
data_format=op.get_attr("data_format"))
File "/usr/lib/python2.7/site-packages/tensorflow/python/ops/gen_nn_ops.py", line 1754, in _max_pool_grad
data_format=data_format, name=name)
File "/usr/lib/python2.7/site-packages/tensorflow/python/framework/op_def_library.py", line 767, in apply_op
op_def=op_def)
File "/usr/lib/python2.7/site-packages/tensorflow/python/framework/ops.py", line 2628, in create_op
original_op=self._default_original_op, op_def=op_def)
File "/usr/lib/python2.7/site-packages/tensorflow/python/framework/ops.py", line 1204, in __init__
self._traceback = self._graph._extract_stack() # pylint: disable=protected-access
...which was originally created as op u'v0/tower_0/mpool2/MaxPool', defined at:
File "tf_cnn_benchmarks.py", line 1345, in <module>
tf.app.run()
[elided 4 identical lines from previous traceback]
File "tf_cnn_benchmarks.py", line 1095, in _build_model
gpu_grad_stage_ops)
File "tf_cnn_benchmarks.py", line 1245, in add_forward_pass_and_gradients
self.model_conf.add_inference(network)
File "/home/tina/tensorflow/benchmarks/scripts/tf_cnn_benchmarks/alexnet_model.py", line 42, in add_inference
cnn.mpool(3, 3, 2, 2)
File "tf_cnn_benchmarks.py", line 372, in mpool
name=name)
File "/usr/lib/python2.7/site-packages/tensorflow/python/layers/pooling.py", line 426, in max_pooling2d
return layer.apply(inputs)
File "/usr/lib/python2.7/site-packages/tensorflow/python/layers/base.py", line 503, in apply
return self.__call__(inputs, *args, **kwargs)
File "/usr/lib/python2.7/site-packages/tensorflow/python/layers/base.py", line 450, in __call__
outputs = self.call(inputs, *args, **kwargs)
File "/usr/lib/python2.7/site-packages/tensorflow/python/layers/pooling.py", line 276, in call
data_format=utils.convert_data_format(self.data_format, 4))
File "/usr/lib/python2.7/site-packages/tensorflow/python/ops/nn_ops.py", line 1772, in max_pool
name=name)
File "/usr/lib/python2.7/site-packages/tensorflow/python/ops/gen_nn_ops.py", line 1607, in _max_pool
data_format=data_format, name=name)
InvalidArgumentError (see above for traceback): tensor_in must be 4-dimensional
[[Node: v0/tower_0/gradients/v0/tower_0/mpool2/MaxPool_grad/MaxPoolGrad = _MklMaxPoolGrad[T=DT_FLOAT, _kernel="MklOp", data_format="NCHW", ksize=[1, 1, 3, 3], padding="VALID", strides=[1, 1, 2, 2], workspace_enabled=true, _device="/job:worker/replica:0/task:0/cpu:0"](v0/tower_0/conv4/Relu, v0/tower_0/mpool2/MaxPool, v0/tower_0/gradients/v0/tower_0/Reshape_grad/Reshape, v0/tower_0/mpool2/MaxPool:1, DMT/_57, DMT/_58, v0/tower_0/gradients/v0/tower_0/Reshape_grad/Reshape:1, v0/tower_0/mpool2/MaxPool:3)]]
If I set zero_debias_moving_mean=True (see here), then when the moving_mean variable is updated, assign_moving_average (https://github.com/tensorflow/tensorflow/blob/master/tensorflow/python/training/moving_averages.py) is called with zero_debias set to True. That path reaches line 173, which uses unbiased_var.op.name, but unbiased_var is now an instance of class variable_mgr.StagedModelVariable, which does not have an "op" attribute.
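The failure described here is a wrapper object that lacks an attribute the caller expects to find on a real variable. As a sketch of one possible direction (not the benchmark's actual classes: `FakeOp`, `FakeVariable`, and `DelegatingWrapper` are hypothetical stand-ins), a wrapper can forward unknown attribute lookups to the variable it wraps:

```python
class FakeOp(object):
    """Stand-in for the op object behind a variable."""

    def __init__(self, name):
        self.name = name


class FakeVariable(object):
    """Stand-in for a real variable that exposes `.op.name`."""

    def __init__(self, name):
        self.op = FakeOp(name)


class DelegatingWrapper(object):
    """Wraps a variable and forwards any attribute it does not define itself."""

    def __init__(self, real_var):
        self.real_var = real_var

    def __getattr__(self, attr):
        # __getattr__ runs only when normal lookup fails, so `real_var`
        # itself is resolved through the instance dict, not through here.
        return getattr(self.real_var, attr)


wrapped = DelegatingWrapper(FakeVariable('moving_mean'))
print(wrapped.op.name)
```

Whether forwarding is appropriate for the staged variable path depends on details of variable_mgr that are not shown in this report, so this should be read as an illustration of the mismatch rather than a proposed patch.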
System information
Have I written custom code (as opposed to using a stock example script provided in TensorFlow): no
OS Platform and Distribution (e.g., Linux Ubuntu 16.04):Linux Ubuntu 16.04
TensorFlow installed from (source or binary): binary and source (tested on multiple installations)
TensorFlow version (use command below): v1.4.0
Python version: Python 3.5.2 and 2.7.12
Bazel version (if compiling from source): 0.54
GCC/Compiler version (if compiling from source): 5.4.0
CUDA/cuDNN version: 9.0 & 8.0 / 6.0 & 7.0
GPU model and memory: gtx 1080 / gtx 1080ti
Exact command to reproduce: python tf_cnn_benchmarks.py --num_gpus=1 --batch_size=8 --model=resnet50 --variable_update=parameter_server
Describe the problem
When executing the above command (after commenting out the line from tensorflow.contrib.data.python.ops import interleave_ops in preprocessing.py, following #80), I get the following UnparsedFlagAccessError:
python3 tf_cnn_benchmarks.py --num_gpus=1 --batch_size=8 --model=resnet50 --variable_update=parameter_server
Traceback (most recent call last):
File "tf_cnn_benchmarks.py", line 47, in <module>
tf.app.run()
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/platform/app.py", line 48, in run
_sys.exit(main(_sys.argv[:1] + flags_passthrough))
File "tf_cnn_benchmarks.py", line 35, in main
params = benchmark_cnn.make_params_from_flags()
File "/opt/benchmarks/scripts/tf_cnn_benchmarks/benchmark_cnn.py", line 666, in make_params_from_flags
flag_values = {name: getattr(FLAGS, name) for name in _DEFAULT_PARAMS.keys()}
File "/opt/benchmarks/scripts/tf_cnn_benchmarks/benchmark_cnn.py", line 666, in <dictcomp>
flag_values = {name: getattr(FLAGS, name) for name in _DEFAULT_PARAMS.keys()}
File "/usr/local/lib/python3.5/dist-packages/absl/flags/_flagvalues.py", line 488, in __getattr__
raise _exceptions.UnparsedFlagAccessError(error_message)
absl.flags._exceptions.UnparsedFlagAccessError: Trying to access flag --trace_file before flags were parsed.
This can be reproduced by anyone using the NGC TensorFlow container (nvcr.io/nvidia/tensorflow:17.12) by cloning this repo into the container and running tf_cnn_benchmarks.py.
My Environment:
GPU: K80
OS: redhat7.2
Tensorflow: 1.2
When I run tf_cnn_benchmarks.py, I find it's slower when I enable staged_vars.
In theory, it should be faster when staged_vars is enabled because the main computation is not blocked by memcpy(HtoD) at the beginning of each step.
Here is the command I ran:
python tf_cnn_benchmarks.py \
--batch_size=32 \
--model=resnet50 \
--data_name=imagenet \
--data_dir=/export1/ImageNet \
--learning_rate=0.1 \
--weight_decay=None \
--num_gpus=2 \
--local_parameter_device=cpu \
--variable_update=parameter_server \
--staged_vars=True \
--use_nccl=False
Then I enabled trace_file to view the timeline and found that one of the GPUs (GPU:0 in my case) starts its first convolution layer at 40 ms, which is 35 ms after GPU:1 starts its first convolution layer (at 5 ms).
But when staged_vars is disabled, all GPUs start the first convolution layer at the same time (near 10 ms).
Here are the logs:
staged_vars = True
TensorFlow: 1.2
Model: resnet50
Mode: training
Batch size: 64 global
32 per device
Devices: ['/gpu:0', '/gpu:1']
Data format: NCHW
Optimizer: sgd
Variables: parameter_server
Staged vars: True
==========
Generating model
2017-07-24 19:32:36.452301: I tensorflow/core/common_runtime/gpu/gpu_device.cc:940] Found device 0 with properties:
name: Tesla K80
major: 3 minor: 7 memoryClockRate (GHz) 0.8235
pciBusID 0000:06:00.0
Total memory: 11.17GiB
Free memory: 11.11GiB
2017-07-24 19:32:36.740267: W tensorflow/stream_executor/cuda/cuda_driver.cc:523] A non-primary context 0xb982000 exists before initializing the StreamExecutor. We haven't verified StreamExecutor works with that.
2017-07-24 19:32:36.742268: I tensorflow/core/common_runtime/gpu/gpu_device.cc:940] Found device 1 with properties:
name: Tesla K80
major: 3 minor: 7 memoryClockRate (GHz) 0.8235
pciBusID 0000:07:00.0
Total memory: 11.17GiB
Free memory: 11.11GiB
2017-07-24 19:32:36.742961: I tensorflow/core/common_runtime/gpu/gpu_device.cc:961] DMA: 0 1
2017-07-24 19:32:36.742976: I tensorflow/core/common_runtime/gpu/gpu_device.cc:971] 0: Y Y
2017-07-24 19:32:36.742981: I tensorflow/core/common_runtime/gpu/gpu_device.cc:971] 1: Y Y
2017-07-24 19:32:36.742998: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1030] Creating TensorFlow device (/gpu:0) -> (device: 0, name: Tesla K80, pci bus id: 0000:06:00.0)
2017-07-24 19:32:36.743006: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1030] Creating TensorFlow device (/gpu:1) -> (device: 1, name: Tesla K80, pci bus id: 0000:07:00.0)
2017-07-24 19:32:37.881169: I tensorflow/compiler/xla/service/platform_util.cc:58] platform CUDA present with 2 visible devices
2017-07-24 19:32:37.881225: I tensorflow/compiler/xla/service/platform_util.cc:58] platform Host present with 48 visible devices
2017-07-24 19:32:37.884137: I tensorflow/compiler/xla/service/service.cc:198] XLA service 0xcb3e400 executing computations on platform Host. Devices:
2017-07-24 19:32:37.884154: I tensorflow/compiler/xla/service/service.cc:206] StreamExecutor device (0): <undefined>, <undefined>
2017-07-24 19:32:37.885052: I tensorflow/compiler/xla/service/platform_util.cc:58] platform CUDA present with 2 visible devices
2017-07-24 19:32:37.885070: I tensorflow/compiler/xla/service/platform_util.cc:58] platform Host present with 48 visible devices
2017-07-24 19:32:37.888532: I tensorflow/compiler/xla/service/service.cc:198] XLA service 0xcb3ef40 executing computations on platform CUDA. Devices:
2017-07-24 19:32:37.888550: I tensorflow/compiler/xla/service/service.cc:206] StreamExecutor device (0): Tesla K80, Compute Capability 3.7
2017-07-24 19:32:37.888556: I tensorflow/compiler/xla/service/service.cc:206] StreamExecutor device (1): Tesla K80, Compute Capability 3.7
Running warm up
Done warm up
Step Img/sec loss
Starting real work at step 10 at time Mon Jul 24 19:32:51 2017
1 images/sec: 86.8 +/- 0.0 (jitter = 0.0) 8.526
10 images/sec: 94.2 +/- 1.9 (jitter = 1.9) 8.512
20 images/sec: 94.4 +/- 1.3 (jitter = 1.9) 8.088
30 images/sec: 94.2 +/- 1.1 (jitter = 2.7) 8.015
40 images/sec: 92.0 +/- 1.0 (jitter = 5.6) 8.086
50 images/sec: 91.1 +/- 0.9 (jitter = 9.6) 7.599
60 images/sec: 90.9 +/- 0.8 (jitter = 9.9) 7.262
70 images/sec: 90.3 +/- 0.8 (jitter = 8.6) 7.153
80 images/sec: 89.7 +/- 0.7 (jitter = 7.7) 7.462
90 images/sec: 89.7 +/- 0.6 (jitter = 9.1) 7.907
Finishing real work at step 109 at time Mon Jul 24 19:34:02 2017
100 images/sec: 89.4 +/- 0.6 (jitter = 9.0) 7.357
----------------------------------------------------------------
total images/sec: 89.06
----------------------------------------------------------------
staged_vars = False
TensorFlow: 1.2
Model: resnet50
Mode: training
Batch size: 64 global
32 per device
Devices: ['/gpu:0', '/gpu:1']
Data format: NCHW
Optimizer: sgd
Variables: parameter_server
==========
Generating model
2017-07-24 19:35:11.740304: I tensorflow/core/common_runtime/gpu/gpu_device.cc:940] Found device 0 with properties:
name: Tesla K80
major: 3 minor: 7 memoryClockRate (GHz) 0.8235
pciBusID 0000:06:00.0
Total memory: 11.17GiB
Free memory: 11.11GiB
2017-07-24 19:35:12.024853: W tensorflow/stream_executor/cuda/cuda_driver.cc:523] A non-primary context 0xcebe000 exists before initializing the StreamExecutor. We haven't verified StreamExecutor works with that.
2017-07-24 19:35:12.026114: I tensorflow/core/common_runtime/gpu/gpu_device.cc:940] Found device 1 with properties:
name: Tesla K80
major: 3 minor: 7 memoryClockRate (GHz) 0.8235
pciBusID 0000:07:00.0
Total memory: 11.17GiB
Free memory: 11.11GiB
2017-07-24 19:35:12.026573: I tensorflow/core/common_runtime/gpu/gpu_device.cc:961] DMA: 0 1
2017-07-24 19:35:12.026588: I tensorflow/core/common_runtime/gpu/gpu_device.cc:971] 0: Y Y
2017-07-24 19:35:12.026593: I tensorflow/core/common_runtime/gpu/gpu_device.cc:971] 1: Y Y
2017-07-24 19:35:12.026618: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1030] Creating TensorFlow device (/gpu:0) -> (device: 0, name: Tesla K80, pci bus id: 0000:06:00.0)
2017-07-24 19:35:12.026630: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1030] Creating TensorFlow device (/gpu:1) -> (device: 1, name: Tesla K80, pci bus id: 0000:07:00.0)
2017-07-24 19:35:13.009850: I tensorflow/compiler/xla/service/platform_util.cc:58] platform CUDA present with 2 visible devices
2017-07-24 19:35:13.010104: I tensorflow/compiler/xla/service/platform_util.cc:58] platform Host present with 48 visible devices
2017-07-24 19:35:13.013000: I tensorflow/compiler/xla/service/service.cc:198] XLA service 0xb5dc400 executing computations on platform Host. Devices:
2017-07-24 19:35:13.013018: I tensorflow/compiler/xla/service/service.cc:206] StreamExecutor device (0): <undefined>, <undefined>
2017-07-24 19:35:13.013708: I tensorflow/compiler/xla/service/platform_util.cc:58] platform CUDA present with 2 visible devices
2017-07-24 19:35:13.013725: I tensorflow/compiler/xla/service/platform_util.cc:58] platform Host present with 48 visible devices
2017-07-24 19:35:13.016627: I tensorflow/compiler/xla/service/service.cc:198] XLA service 0xb5dc1c0 executing computations on platform CUDA. Devices:
2017-07-24 19:35:13.016642: I tensorflow/compiler/xla/service/service.cc:206] StreamExecutor device (0): Tesla K80, Compute Capability 3.7
2017-07-24 19:35:13.016648: I tensorflow/compiler/xla/service/service.cc:206] StreamExecutor device (1): Tesla K80, Compute Capability 3.7
Running warm up
Done warm up
Step Img/sec loss
Starting real work at step 10 at time Mon Jul 24 19:35:24 2017
1 images/sec: 99.4 +/- 0.0 (jitter = 0.0) 7.449
10 images/sec: 94.9 +/- 1.7 (jitter = 0.6) 7.422
20 images/sec: 95.7 +/- 1.1 (jitter = 1.0) 7.673
30 images/sec: 96.0 +/- 0.9 (jitter = 1.3) 7.421
40 images/sec: 96.1 +/- 0.8 (jitter = 1.5) 7.639
50 images/sec: 95.3 +/- 0.7 (jitter = 1.7) 7.910
60 images/sec: 95.5 +/- 0.7 (jitter = 1.7) 7.359
70 images/sec: 95.8 +/- 0.6 (jitter = 1.6) 7.666
80 images/sec: 95.6 +/- 0.5 (jitter = 1.8) 7.383
90 images/sec: 95.9 +/- 0.5 (jitter = 1.6) 7.437
Finishing real work at step 109 at time Mon Jul 24 19:36:31 2017
100 images/sec: 95.2 +/- 0.5 (jitter = 1.7) 7.441
----------------------------------------------------------------
total images/sec: 95.20
----------------------------------------------------------------
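To put the two runs side by side, the relative slowdown can be computed from the "total images/sec" lines in the logs above. This is plain arithmetic on the two reported totals, based on a single run of each configuration, so it says nothing about run-to-run variance.

```python
staged = 89.06    # total images/sec with staged_vars=True (from the log above)
unstaged = 95.20  # total images/sec with staged_vars=False (from the log above)

# Relative throughput loss of the staged run, as a percentage of the unstaged run.
slowdown_pct = (unstaged - staged) / unstaged * 100.0
print('staged_vars run is %.1f%% slower' % slowdown_pct)
```

The staged run also reports noticeably higher jitter (up to 9.9 versus at most 1.8), which is consistent with the uneven GPU start times seen in the timeline.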
I would like to run the benchmark on CPU to measure baseline performance. However, this error showed up when generating the model. This is the command I used:
python tf_cnn_benchmarks.py --device cpu --train_dir /path/to/my/dir --eval-dir /path/to/my/dir --model alexnet
and here is the traceback:
Traceback (most recent call last):
File "tf_cnn_benchmarks.py", line 1333, in <module>
tf.app.run()
File "/usr/lib/python2.7/site-packages/tensorflow/python/platform/app.py", line 48, in run
_sys.exit(main(_sys.argv[:1] + flags_passthrough))
File "tf_cnn_benchmarks.py", line 1329, in main
bench.run()
File "tf_cnn_benchmarks.py", line 879, in run
self._benchmark_cnn()
File "tf_cnn_benchmarks.py", line 969, in _benchmark_cnn
config=create_config_proto(),
File "tf_cnn_benchmarks.py", line 627, in create_config_proto
config.gpu_options.force_gpu_compatible = FLAGS.force_gpu_compatible
It seems that my TensorFlow configuration doesn't have this field. Could this be caused by an incompatible TensorFlow version? I cloned both TensorFlow and benchmarks directly from GitHub.
In your code https://github.com/tensorflow/benchmarks/blob/master/scripts/tf_cnn_benchmarks/tf_cnn_benchmarks.py#L734
self.num_warmup_batches = FLAGS.num_warmup_batches if (
FLAGS.num_warmup_batches) else max(10, min_autotune_warmup)
If I set FLAGS.num_warmup_batches to 0, the else branch is taken and num_warmup_batches is set to 10.
Is this the intended behaviour? I think what you want is:
self.num_warmup_batches = FLAGS.num_warmup_batches if (
FLAGS.num_warmup_batches is not None) else max(10, min_autotune_warmup)
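The difference between the two conditions can be seen in isolation. The following is a minimal sketch, not the benchmark's code: `pick_warmup_truthy` and `pick_warmup_explicit` are hypothetical names, and their arguments stand in for the flag value and the autotune minimum.

```python
def pick_warmup_truthy(num_warmup_batches, min_autotune_warmup=0):
    # Mirrors the current expression: 0 is falsy, so it falls through to the default.
    return num_warmup_batches if num_warmup_batches else max(10, min_autotune_warmup)


def pick_warmup_explicit(num_warmup_batches, min_autotune_warmup=0):
    # Only an unset value (None) selects the default; an explicit 0 is kept.
    if num_warmup_batches is not None:
        return num_warmup_batches
    return max(10, min_autotune_warmup)


print(pick_warmup_truthy(0))       # 10: the requested 0 is overridden
print(pick_warmup_explicit(0))     # 0: the requested 0 is respected
print(pick_warmup_explicit(None))  # 10: unset falls back to the default
```

This assumes the flag's default is None. If the default were 0, the explicit check alone could not tell "unset" apart from "zero", and the default would need to change as well.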
Thanks for implementing a high performance example! I wouldn't have been able to understand the suggested setup without it.
Based on the performance docs I think this code is to help teach people to write efficient training scripts. I personally ran into some difficulty finding where some of the important sections are, and those important sections are fairly hard coded specifically to bounding box classification.
For those reasons I think it might be broadly beneficial to:
For instance, one of the most important files is preprocessing.py, particularly the parse_example_proto and minibatch functions.
To help make the code a bit cleaner I'd like to give an example of how a better separation of concerns could work. For instance, minibatch could be updated to something like the following where no bounding box specifics actually occur, and those steps could be passed to the function parameters. A second iteration beyond this example would be better (perhaps split into 3 functions so there are no function parameters?), and this code is untested:
# Assumes: import tensorflow as tf
#          from tensorflow.python.ops import data_flow_ops
@staticmethod
def tfrecord_minibatch(
    tfrecord_path_glob_pattern,
    gpu_device_count,
    batch_size,
    parse_example_proto_fn=None,
    preprocessing_fn=None,
    create_data_and_label_op_lists_fn=None,
    random_seed=301,
    parallelism=64,
    buffer_size=10000):
  """TODO: High Performance Distributed Training Batches - Adapt for broader use cases
  """
  with tf.name_scope('batch_processing'):
    # Split the data among all the GPU devices
    if batch_size % gpu_device_count != 0:
      raise ValueError(
          ('batch_size must be a multiple of gpu_device_count: '
           'batch_size %d, gpu_device_count: %d') %
          (batch_size, gpu_device_count))
    record_input = data_flow_ops.RecordInput(
        file_pattern=tfrecord_path_glob_pattern,
        seed=random_seed,
        parallelism=parallelism,
        buffer_size=buffer_size,
        batch_size=batch_size,
        name='record_input')
    records = record_input.get_yield_op()
    records = tf.split(records, batch_size, 0)
    records = [tf.reshape(record, []) for record in records]
    # One list of parsed features and one of preprocessed data per device.
    feature_op_dicts = [[] for _ in range(gpu_device_count)]
    preprocessed_data_ops = [[] for _ in range(gpu_device_count)]
    for i in range(batch_size):
      protobuf = records[i]
      feature_op_dict = parse_example_proto_fn(protobuf)
      # Assign records to devices round-robin.
      device_index = i % gpu_device_count
      feature_op_dicts[device_index].append(feature_op_dict)
      if preprocessing_fn is not None:
        # thread_id should be distortion_method_id, and calculated later
        preprocessed_data_ops[device_index].append(
            preprocessing_fn(feature_op_dict, i))
    return create_data_and_label_op_lists_fn(
        feature_op_dicts, preprocessed_data_ops, gpu_device_count)
I'm not totally sure I got all the variables & code changes right, but stuff like subset from the original code could also be better named, and I think tfrecord_path_glob_pattern is a bit better. A gpu might really be some other kind of tensor processing unit, so there might still be a better name, but hopefully it is clear there are several of these inside a single computer/server.
Regardless, I appreciate your consideration, and thank you for putting this up, it is a valuable learning tool!
Hi,
I followed the instructions from the performance page (https://www.tensorflow.org/performance/performance_models) and ran on two EC2 p2.8xlarge instances, using the same benchmark hash (Benchmark GitHub hash: 9165a70).
# Run the following commands on host_0 (10.0.0.1):
python tf_cnn_benchmarks.py --local_parameter_device=gpu --num_gpus=8 \
--batch_size=64 --model=resnet50 --variable_update=distributed_replicated \
--job_name=worker --ps_hosts=10.0.0.1:50000,10.0.0.2:50000 \
--worker_hosts=10.0.0.1:50001,10.0.0.2:50001 --task_index=0
python tf_cnn_benchmarks.py --local_parameter_device=gpu --num_gpus=8 \
--batch_size=64 --model=resnet50 --variable_update=distributed_replicated \
--job_name=ps --ps_hosts=10.0.0.1:50000,10.0.0.2:50000 \
--worker_hosts=10.0.0.1:50001,10.0.0.2:50001 --task_index=0
# Run the following commands on host_1 (10.0.0.2):
python tf_cnn_benchmarks.py --local_parameter_device=gpu --num_gpus=8 \
--batch_size=64 --model=resnet50 --variable_update=distributed_replicated \
--job_name=worker --ps_hosts=10.0.0.1:50000,10.0.0.2:50000 \
--worker_hosts=10.0.0.1:50001,10.0.0.2:50001 --task_index=1
python tf_cnn_benchmarks.py --local_parameter_device=gpu --num_gpus=8 \
--batch_size=64 --model=resnet50 --variable_update=distributed_replicated \
--job_name=ps --ps_hosts=10.0.0.1:50000,10.0.0.2:50000 \
--worker_hosts=10.0.0.1:50001,10.0.0.2:50001 --task_index=1
However, the worker failed with:
Generating model
save variable global_step:0
save variable ps_var/v0/conv0/conv2d/kernel:0
save variable ps_var/v0/conv0/biases:0
save variable ps_var/v0/conv1/conv2d/kernel:0
save variable ps_var/v0/conv1/biases:0
save variable ps_var/v0/conv2/conv2d/kernel:0
save variable ps_var/v0/conv2/biases:0
save variable ps_var/v0/conv3/conv2d/kernel:0
save variable ps_var/v0/conv3/biases:0
save variable ps_var/v0/conv4/conv2d/kernel:0
save variable ps_var/v0/conv4/biases:0
save variable ps_var/v0/affine0/weights:0
save variable ps_var/v0/affine0/biases:0
save variable ps_var/v0/affine1/weights:0
save variable ps_var/v0/affine1/biases:0
save variable ps_var/v0/affine2/weights:0
save variable ps_var/v0/affine2/biases:0
Traceback (most recent call last):
File "/home/ec2-user/benchmarks/scripts/tf_cnn_benchmarks/tf_cnn_benchmarks.py", line 1096, in <module>
tf.app.run()
File "/usr/lib/python2.7/dist-packages/tensorflow/python/platform/app.py", line 48, in run
_sys.exit(main(_sys.argv[:1] + flags_passthrough))
File "/home/ec2-user/benchmarks/scripts/tf_cnn_benchmarks/tf_cnn_benchmarks.py", line 1092, in main
bench.run()
File "/home/ec2-user/benchmarks/scripts/tf_cnn_benchmarks/tf_cnn_benchmarks.py", line 573, in run
self._benchmark_cnn()
File "/home/ec2-user/benchmarks/scripts/tf_cnn_benchmarks/tf_cnn_benchmarks.py", line 674, in _benchmark_cnn
start_standard_services=start_standard_services) as sess:
File "/usr/lib64/python2.7/contextlib.py", line 17, in __enter__
return self.gen.next()
File "/usr/lib/python2.7/dist-packages/tensorflow/python/training/supervisor.py", line 964, in managed_session
self.stop(close_summary_writer=close_summary_writer)
File "/usr/lib/python2.7/dist-packages/tensorflow/python/training/supervisor.py", line 792, in stop
stop_grace_period_secs=self._stop_grace_secs)
File "/usr/lib/python2.7/dist-packages/tensorflow/python/training/coordinator.py", line 389, in join
six.reraise(*self._exc_info_to_raise)
File "/usr/lib/python2.7/dist-packages/tensorflow/python/training/supervisor.py", line 953, in managed_session
start_standard_services=start_standard_services)
File "/usr/lib/python2.7/dist-packages/tensorflow/python/training/supervisor.py", line 708, in prepare_or_wait_for_session
init_feed_dict=self._init_feed_dict, init_fn=self._init_fn)
File "/usr/lib/python2.7/dist-packages/tensorflow/python/training/session_manager.py", line 279, in prepare_session
sess.run(init_op, feed_dict=init_feed_dict)
File "/usr/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 895, in run
run_metadata_ptr)
File "/usr/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1124, in _run
feed_dict_tensor, options, run_metadata)
File "/usr/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1321, in _do_run
options, run_metadata)
File "/usr/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1340, in _do_call
raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.ResourceExhaustedError: OOM when allocating tensor with shape[3,3,384,256]
[[Node: v0/conv4/conv2d/kernel/Initializer/random_uniform/RandomUniform = RandomUniform[T=DT_INT32, _class=["loc:@v0/conv4/conv2d/kernel"], dtype=DT_FLOAT, seed=1234, seed2=132, _device="/job:worker/replica:0/task:0/gpu:0"](v0/conv4/conv2d/kernel/Initializer/random_uniform/shape)]]
[[Node: v0/conv2/biases/Initializer/Const_S21 = _Recv[client_terminated=false, recv_device="/job:ps/replica:0/task:0/gpu:0", send_device="/job:worker/replica:0/task:0/gpu:0", send_device_incarnation=-2694717678558735913, tensor_name="edge_53_v0/conv2/biases/Initializer/Const", tensor_type=DT_FLOAT, _device="/job:ps/replica:0/task:0/gpu:0"]()]]
Caused by op u'v0/conv4/conv2d/kernel/Initializer/random_uniform/RandomUniform', defined at:
File "/home/ec2-user/benchmarks/scripts/tf_cnn_benchmarks/tf_cnn_benchmarks.py", line 1096, in <module>
tf.app.run()
File "/usr/lib/python2.7/dist-packages/tensorflow/python/platform/app.py", line 48, in run
_sys.exit(main(_sys.argv[:1] + flags_passthrough))
File "/home/ec2-user/benchmarks/scripts/tf_cnn_benchmarks/tf_cnn_benchmarks.py", line 1092, in main
bench.run()
File "/home/ec2-user/benchmarks/scripts/tf_cnn_benchmarks/tf_cnn_benchmarks.py", line 573, in run
self._benchmark_cnn()
File "/home/ec2-user/benchmarks/scripts/tf_cnn_benchmarks/tf_cnn_benchmarks.py", line 620, in _benchmark_cnn
(enqueue_ops, fetches) = self._build_model()
File "/home/ec2-user/benchmarks/scripts/tf_cnn_benchmarks/tf_cnn_benchmarks.py", line 791, in _build_model
gpu_grad_stage_ops)
File "/home/ec2-user/benchmarks/scripts/tf_cnn_benchmarks/tf_cnn_benchmarks.py", line 952, in add_forward_pass_and_gradients
self.model.add_inference(network)
File "/home/ec2-user/benchmarks/scripts/tf_cnn_benchmarks/alexnet_model.py", line 42, in add_inference
cnn.conv(256, 3, 3)
File "/home/ec2-user/benchmarks/scripts/tf_cnn_benchmarks/convnet_builder.py", line 103, in conv
use_bias=False)
File "/usr/lib/python2.7/dist-packages/tensorflow/python/layers/convolutional.py", line 551, in conv2d
return layer.apply(inputs)
File "/usr/lib/python2.7/dist-packages/tensorflow/python/layers/base.py", line 503, in apply
return self.__call__(inputs, *args, **kwargs)
File "/usr/lib/python2.7/dist-packages/tensorflow/python/layers/base.py", line 443, in __call__
self.build(input_shapes[0])
File "/usr/lib/python2.7/dist-packages/tensorflow/python/layers/convolutional.py", line 137, in build
dtype=self.dtype)
File "/usr/lib/python2.7/dist-packages/tensorflow/python/layers/base.py", line 383, in add_variable
trainable=trainable and self.trainable)
File "/usr/lib/python2.7/dist-packages/tensorflow/python/ops/variable_scope.py", line 1065, in get_variable
use_resource=use_resource, custom_getter=custom_getter)
File "/usr/lib/python2.7/dist-packages/tensorflow/python/ops/variable_scope.py", line 962, in get_variable
use_resource=use_resource, custom_getter=custom_getter)
File "/usr/lib/python2.7/dist-packages/tensorflow/python/ops/variable_scope.py", line 360, in get_variable
validate_shape=validate_shape, use_resource=use_resource)
File "/home/ec2-user/benchmarks/scripts/tf_cnn_benchmarks/variable_mgr.py", line 84, in __call__
return getter(name, *args, **kwargs)
File "/usr/lib/python2.7/dist-packages/tensorflow/python/ops/variable_scope.py", line 352, in _true_getter
use_resource=use_resource)
File "/usr/lib/python2.7/dist-packages/tensorflow/python/ops/variable_scope.py", line 725, in _get_single_variable
validate_shape=validate_shape)
File "/usr/lib/python2.7/dist-packages/tensorflow/python/ops/variables.py", line 199, in __init__
expected_shape=expected_shape)
File "/usr/lib/python2.7/dist-packages/tensorflow/python/ops/variables.py", line 277, in _init_from_args
initial_value(), name="initial_value", dtype=dtype)
File "/usr/lib/python2.7/dist-packages/tensorflow/python/ops/variable_scope.py", line 701, in <lambda>
shape.as_list(), dtype=dtype, partition_info=partition_info)
File "/usr/lib/python2.7/dist-packages/tensorflow/python/ops/init_ops.py", line 441, in __call__
dtype, seed=self.seed)
File "/usr/lib/python2.7/dist-packages/tensorflow/python/ops/random_ops.py", line 240, in random_uniform
shape, dtype, seed=seed1, seed2=seed2)
File "/usr/lib/python2.7/dist-packages/tensorflow/python/ops/gen_random_ops.py", line 247, in _random_uniform
seed=seed, seed2=seed2, name=name)
File "/usr/lib/python2.7/dist-packages/tensorflow/python/framework/op_def_library.py", line 767, in apply_op
op_def=op_def)
File "/usr/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 2630, in create_op
original_op=self._default_original_op, op_def=op_def)
File "/usr/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 1204, in __init__
self._traceback = self._graph._extract_stack() # pylint: disable=protected-access
ResourceExhaustedError (see above for traceback): OOM when allocating tensor with shape[3,3,384,256]
[[Node: v0/conv4/conv2d/kernel/Initializer/random_uniform/RandomUniform = RandomUniform[T=DT_INT32, _class=["loc:@v0/conv4/conv2d/kernel"], dtype=DT_FLOAT, seed=1234, seed2=132, _device="/job:worker/replica:0/task:0/gpu:0"](v0/conv4/conv2d/kernel/Initializer/random_uniform/shape)]]
[[Node: v0/conv2/biases/Initializer/Const_S21 = _Recv[client_terminated=false, recv_device="/job:ps/replica:0/task:0/gpu:0", send_device="/job:worker/replica:0/task:0/gpu:0", send_device_incarnation=-2694717678558735913, tensor_name="edge_53_v0/conv2/biases/Initializer/Const", tensor_type=DT_FLOAT, _device="/job:ps/replica:0/task:0/gpu:0"]()]]
It seems each TF process allocates all available GPU memory, so the worker cannot get any memory if I start the parameter server command first.
Likewise, if I run the worker first, the parameter server cannot get any memory.
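One possible workaround is to hide the GPUs from the parameter server process, the way the `CUDA_VISIBLE_DEVICES=` prefix is used on a ps command elsewhere on this page. The sketch below only builds the command line and environment for each process. The helper name `build_launch` is hypothetical, and it assumes the parameter server does not need a GPU, which may not hold for every `--local_parameter_device` setting.

```python
import os


def build_launch(job_name, task_index, ps_hosts, worker_hosts, base_env=None):
    """Return (argv, env) for one tf_cnn_benchmarks process.

    Assumption: parameter servers do not need a GPU, so all GPUs are hidden
    from them by setting CUDA_VISIBLE_DEVICES to an empty string.
    """
    env = dict(os.environ if base_env is None else base_env)
    if job_name == "ps":
        # An empty value makes no CUDA device visible to this process.
        env["CUDA_VISIBLE_DEVICES"] = ""
    argv = [
        "python", "tf_cnn_benchmarks.py",
        "--job_name=%s" % job_name,
        "--ps_hosts=%s" % ",".join(ps_hosts),
        "--worker_hosts=%s" % ",".join(worker_hosts),
        "--task_index=%d" % task_index,
    ]
    return argv, env
```

The returned pair can be passed to `subprocess.Popen(argv, env=env)`; the remaining benchmark flags from the commands above would be appended to `argv` unchanged.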
Hi, thanks for the wonderful code!
I found that the summaries added during image preprocessing are lost after calling ds_iterator.get_next().
I identified this issue by adding tmp = tf.get_collection(key=tf.GraphKeys.SUMMARIES) right after ds_iterator.get_next(). The returned value of tmp is an empty list. However, when adding the same line at the end of def preprocess(self, raw_image), the returned value of tmp is the summary proto for the processed images, which is normal.
I therefore guess that summaries added before calling the Dataset iterator are not retained. Is this true? If so, is there a way to work around this issue?
Many thanks.
https://github.com/tensorflow/benchmarks/blob/master/models/Dockerfile.alexnet_distributed_test returns 404.
Probably needed to remove link as part of this PR: e172b60
I want to compare the speed of fused batch norm against non-fused batch norm, but when I set fused=False on the batch_norm layer with data_format=NCHW, training is very much slower (only about one-sixth the speed) than with the fused batch norm model. What's wrong with the non-fused model?
Thanks very much for publishing the code. With this benchmark I've seen very good GPU utilization with single-machine multi-GPU training, however I found that distributed training doesn't scale very well.
The published distributed benchmark numbers were only for K80s, so the communication overhead might be less of a problem there. However, a TitanX/M40 is about twice as fast as a K80, a P100 is about 4x faster, and a V100 would be ...
In more detail:
Tensorflow version: commit d101472296f88 compiled manually (with -march=native)
Python 2.7, cuda 8.0.44, cudnn 5.1
GPU: 4 Tesla M40s per machine
Latency between the two machines: 0.06~0.08ms given by ping
Bandwidth: 9.3Gbit/s given by iperf
Speed numbers (all with resnet50, batch 64 per GPU):
Single machine: (variable_update=parameter_server)
1GPU: 111 im/s -> 4GPU: 432 im/s
Two machines (variable_update=distributed_replicated):
2x4=8GPU: only 561 im/s
Hope to see some more improvements on it!
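To put the numbers above in perspective, scaling efficiency can be computed as measured throughput divided by ideal linear throughput (single-GPU rate times GPU count). This is a small sketch using only the figures reported in this post; the function name is illustrative.

```python
def scaling_efficiency(measured_images_per_sec, single_gpu_images_per_sec, num_gpus):
    """Fraction of ideal linear scaling that was actually achieved."""
    ideal = single_gpu_images_per_sec * num_gpus
    return measured_images_per_sec / ideal


single_gpu = 111.0  # im/s on 1 GPU, from the post

# Single machine, 4 GPUs: 432 im/s
print("4 GPUs: %.1f%%" % (100 * scaling_efficiency(432.0, single_gpu, 4)))
# Two machines, 2x4 = 8 GPUs: 561 im/s
print("8 GPUs: %.1f%%" % (100 * scaling_efficiency(561.0, single_gpu, 8)))
```

With these inputs the single-machine run lands at roughly 97% of linear scaling, while the two-machine run drops to roughly 63%.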
I am trying to replicate the results in the OSDI 2016 paper. Unfortunately, it's not clear from the documentation what set of arguments to pass to scripts/tf_cnn_benchmarks/tf_cnn_benchmarks.py, and the source documentation does not specify it either.
Are the parameters and exact commands used in that paper available anywhere?
Distributed training:
# ps
CUDA_VISIBLE_DEVICES= \
python tf_cnn_benchmarks.py \
--job_name=ps --ps_hosts=10.0.0.1:5000 \
--worker_hosts=10.0.0.1:5001 --task_index=0
# worker
python tf_cnn_benchmarks.py \
--job_name=worker --ps_hosts=10.0.0.1:5000 \
--worker_hosts=10.0.0.1:5001 --task_index=0
Result:
TensorFlow: 1.1
Model: trivial
Mode: training
Batch size: 32 global
32 per device
Devices: ['/job:worker/task:0/gpu:0']
Data format: NCHW
Optimizer: sgd
Variables: parameter_server
Sync: True
==========
Generating model
Running warm up
Done warm up
Step Img/sec loss
Starting real work at step 10 at time Sat May 6 03:59:06 2017
1 images/sec: 238.2 +/- 0.0 (jitter = 0.0) 7.089
10 images/sec: 233.7 +/- 2.4 (jitter = 7.9) 7.088
20 images/sec: 233.6 +/- 1.5 (jitter = 7.9) 7.086
30 images/sec: 232.0 +/- 1.2 (jitter = 8.5) 7.084
40 images/sec: 234.0 +/- 1.2 (jitter = 8.5) 7.082
50 images/sec: 234.0 +/- 1.2 (jitter = 9.8) 7.080
60 images/sec: 233.7 +/- 1.0 (jitter = 8.7) 7.079
70 images/sec: 234.1 +/- 1.0 (jitter = 8.6) 7.077
80 images/sec: 234.0 +/- 0.9 (jitter = 9.1) 7.075
90 images/sec: 234.1 +/- 0.9 (jitter = 9.2) 7.073
Finishing real work at step 109 at time Sat May 6 03:59:20 2017
100 images/sec: 234.0 +/- 0.8 (jitter = 8.6) 7.071
----------------------------------------------------------------
total images/sec: 233.41
----------------------------------------------------------------
Non-distributed training:
python tf_cnn_benchmarks.py
Result:
TensorFlow: 1.1
Model: trivial
Mode: training
Batch size: 32 global
32 per device
Devices: ['/gpu:0']
Data format: NCHW
Optimizer: sgd
Variables: parameter_server
==========
Generating model
Running warm up
Done warm up
Step Img/sec loss
1 images/sec: 3521.1 +/- 0.0 (jitter = 0.0) 7.089
10 images/sec: 3517.2 +/- 14.3 (jitter = 65.2) 7.087
20 images/sec: 3513.5 +/- 10.0 (jitter = 67.1) 7.085
Starting real work at step 31 at time Sat May 6 04:02:33 2017
30 images/sec: 3504.6 +/- 8.3 (jitter = 52.2) 7.084
40 images/sec: 3503.8 +/- 6.8 (jitter = 49.9) 7.082
50 images/sec: 3509.8 +/- 5.9 (jitter = 47.5) 7.080
60 images/sec: 3505.6 +/- 5.3 (jitter = 54.0) 7.078
70 images/sec: 3502.7 +/- 4.8 (jitter = 50.0) 7.076
80 images/sec: 3502.7 +/- 4.5 (jitter = 50.0) 7.074
90 images/sec: 3503.3 +/- 4.2 (jitter = 50.0) 7.072
100 images/sec: 3503.1 +/- 3.9 (jitter = 46.9) 7.070
Finishing real work at step 113 at time Sat May 6 04:02:34 2017
----------------------------------------------------------------
total images/sec: 3488.15
----------------------------------------------------------------
Nearly 15x (14.9x) performance difference is observed. Please do correct me if I did something terribly wrong; it's just the result is pretty unexpected to me.
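The ratio quoted above can be reproduced directly from the two log summaries. The sketch below pulls the `total images/sec` value out of each log with a regular expression and divides them; the helper name and regex are illustrative, and only the two summary lines from the logs above are used as input.

```python
import re

# Matches the summary line printed at the end of a benchmark run.
TOTAL_RE = re.compile(r"total images/sec:\s*([0-9.]+)")


def total_images_per_sec(log_text):
    """Extract the final throughput figure from a benchmark log."""
    match = TOTAL_RE.search(log_text)
    if match is None:
        raise ValueError("no 'total images/sec' line found")
    return float(match.group(1))


distributed_log = "total images/sec: 233.41"
local_log = "total images/sec: 3488.15"

slowdown = total_images_per_sec(local_log) / total_images_per_sec(distributed_log)
print("non-distributed is %.1fx faster" % slowdown)
```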
Hello
My OS configuration
CUDA8
CUDNN6
NCCL 1.3.5
When I tried to run the TF benchmark on my machine, I found that Python can't import all_reduce from "tensorflow.contrib.all_reduce.python":
tensorflow/stream_executor/dso_loader.cc:135] successfully opened CUDA library libcublas.so.8.0 locally
I tensorflow/stream_executor/dso_loader.cc:135] successfully opened CUDA library libcudnn.so.6 locally
I tensorflow/stream_executor/dso_loader.cc:135] successfully opened CUDA library libcufft.so.8.0 locally
I tensorflow/stream_executor/dso_loader.cc:135] successfully opened CUDA library libcuda.so.1 locally
I tensorflow/stream_executor/dso_loader.cc:135] successfully opened CUDA library libcurand.so.8.0 locally
Traceback (most recent call last):
File "tf_cnn_benchmarks.py", line 26, in <module>
import benchmark_cnn
File "/home/gin/WORK/tensorflow/benchmarks-master/scripts/tf_cnn_benchmarks/benchmark_cnn.py", line 44, in <module>
import variable_mgr
File "/home/gin/WORK/tensorflow/benchmarks-master/scripts/tf_cnn_benchmarks/variable_mgr.py", line 29, in <module>
from tensorflow.contrib.all_reduce.python import all_reduce
ImportError: No module named all_reduce.python
Has anyone run into this error before?
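A quick way to confirm whether the installed TensorFlow actually ships the module that variable_mgr.py imports is to probe for it before running the benchmark. This is a generic sketch: `module_available` is a hypothetical helper, and the guess that the installed TensorFlow is older than what the benchmark code expects is an assumption, not something the traceback proves.

```python
import importlib


def module_available(dotted_name):
    """Return True if the module can be imported, False on ImportError."""
    try:
        importlib.import_module(dotted_name)
    except ImportError:
        return False
    return True


# Example probe for the module named in the traceback:
# module_available("tensorflow.contrib.all_reduce.python.all_reduce")
```

If the probe returns False, comparing the installed TensorFlow version against the version the benchmark checkout targets would be a reasonable next step.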
I tried to run
# VGG16 training ImageNet with 8 GPUs using arguments that optimize for
# Google Compute Engine.
python tf_cnn_benchmarks.py --local_parameter_device=cpu --num_gpus=1 \
--batch_size=32 --model=vgg16 --data_dir=/home/ubuntu/flowers \
--variable_update=parameter_server --nodistortions
And the data dir has the TF Records inside, generated with bazel as in the models/inception/data tutorial
-rw-rwx--- 1 40 May 11 11:43 labels.txt
drwxrwx--- 7 4096 May 12 11:45 train
-rw-rwx--- 1 102419300 May 11 11:43 train-00000-of-00002
-rw-rwx--- 1 99116804 May 11 11:43 train-00001-of-00002
drwxrwx--- 7 4096 May 12 11:45 validation
-rw-rwx--- 1 16058779 May 11 11:43 validation-00000-of-00002
-rw-rwx--- 1 15919237 May 11 11:43 validation-00001-of-00002
And it hangs like this:
TensorFlow: 1.1
Model: vgg16
Mode: training
Batch size: 32 global
32.0 per device
Devices: ['/gpu:0']
Data format: NCHW
Optimizer: sgd
Variables: parameter_server
==========
Generating model
2017-05-12 11:57:30.357629: I tensorflow/core/common_runtime/gpu/gpu_device.cc:900] Found device 0 with properties:
....
pciBusID 0002:01:00.0
Total memory: 15.89GiB
Free memory: 15.61GiB
2017-05-12 11:57:30.357680: I tensorflow/core/common_runtime/gpu/gpu_device.cc:921] DMA: 0
2017-05-12 11:57:30.357690: I tensorflow/core/common_runtime/gpu/gpu_device.cc:931] 0: Y
2017-05-12 11:57:30.357707: I tensorflow/core/common_runtime/gpu/gpu_device.cc:990] Creating TensorFlow device (/gpu:0) -> (device: 0, name: Tesla P100-SXM2-16GB, pci bus id: 0002:01:00.0)
But with synthetic data it works. Any idea how to fix this?
Run Tensorflow on K8S:
workerCmdArgs = "cd /opt/benchmarks/scripts/tf_cnn_benchmarks/;CUDA_VISIBLE_DEVICES='' python tf_cnn_benchmarks.py --local_parameter_device=cpu --model=alexnet --variable_update=parameter_server"
psCmdArgs = "cd /opt/benchmarks/scripts/tf_cnn_benchmarks/;CUDA_VISIBLE_DEVICES='' python tf_cnn_benchmarks.py --local_parameter_device=cpu --model=alexnet --variable_update=parameter_server"
Got error:
2017-11-01 11:41:49.570889: E tensorflow/core/common_runtime/executor.cc:643] Executor failed to create kernel. Invalid argument: Default MaxPoolingOp only supports NHWC.
[[Node: v/tower_0/cg/mpool0/MaxPool = MaxPoolT=DT_FLOAT, data_format="NCHW", ksize=[1, 1, 3, 3], padding="VALID", strides=[1, 1, 2, 2], _device="/job:worker/replica:0/task:0/device:CPU:0"]]
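The error says the CPU max-pooling kernel only handles NHWC, while the graph was built with data_format="NCHW". Since the commands above set CUDA_VISIBLE_DEVICES='' and therefore run on CPU, one thing to try is requesting NHWC explicitly. The sketch below picks the format from the environment value; the helper names are hypothetical, and the `--data_format` flag is assumed from the "Data format:" line the benchmark prints, so check it against your checkout.

```python
def pick_data_format(cuda_visible_devices):
    """Return 'NHWC' when GPUs are explicitly hidden, else 'NCHW'.

    cuda_visible_devices is the value of the CUDA_VISIBLE_DEVICES
    environment variable, or None if it is unset.
    """
    if cuda_visible_devices is not None and cuda_visible_devices.strip() == "":
        return "NHWC"
    return "NCHW"


def with_device_flags(base_args, cuda_visible_devices):
    """Append data-format (and CPU device) flags to a benchmark argument list."""
    data_format = pick_data_format(cuda_visible_devices)
    args = list(base_args) + ["--data_format=%s" % data_format]
    if data_format == "NHWC":
        # Assumption: running on CPU, so also select the CPU device.
        args.append("--device=cpu")
    return args
```

For the Kubernetes commands above, `with_device_flags(["--model=alexnet"], "")` would add `--data_format=NHWC` and `--device=cpu`.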
Thanks for such wonderful example code. It helps me a lot to get familiar with TensorFlow.
However, I came across an interesting phenomenon when visualizing my computation graph in TensorBoard. It's not a problem, but I just don't understand why it should be like this.
My computer has 1 CPU and 2 GPUs installed, but I get 5 devices in my computation graph, which could be easily observed through TensorBoard: (highlighted by red square)
I tried to modify this line
as
device_name = '/device:' + self.ps_devices[device_index].upper()[1:]
to replace '/gpu:0' with '/device:GPU:0', but the results did not change. I'm so confused. Could you please explain a little bit on:
Thank you very much for your time and kind help!
The deployment-related settings are as below for reference:
--num_gpus=2
--local_parameter_device='gpu'
--device='gpu'
--winograd_nonfused=True
--sync_on_finish=False
--staged_vars=False
--force_gpu_compatible=True
--variable_update='parameter_server'
--use_nccl=True
--job_name=''
--ps_hosts=''
--task_index=0
--server_protocol='grpc'
--cross_replica_sync=True
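For reference, the string transformation in the modified line can be checked in isolation. The sketch below applies the same upper()[1:] slicing to a device string; the function name is hypothetical. If, as seems likely, TensorFlow treats '/gpu:0' as shorthand for '/device:GPU:0', both spellings name the same device, which would explain why the graph did not change.

```python
def to_canonical_device(legacy_name):
    """Convert a legacy device string such as '/gpu:0' to '/device:GPU:0'.

    Uses the same upper()[1:] slicing as the modified line in the issue:
    upper() gives '/GPU:0' and [1:] drops the leading slash.
    """
    return "/device:" + legacy_name.upper()[1:]


for name in ("/gpu:0", "/gpu:1", "/cpu:0"):
    print(name, "->", to_canonical_device(name))
```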
Hello,
I have copied the benchmarks folder under the tensorflow directory.
(tensorflow) root@P50:/opt/DL/tensorflow# ls -all
total 28
drwxr-xr-x 6 root root 4096 oct 22 13:00 .
drwxr-xr-x 5 root root 4096 oct 22 16:53 ..
drwxr-xr-x 8 root root 4096 oct 22 13:00 benchmarks
drwxr-xr-x 2 root root 4096 oct 22 12:53 bin
drwxr-xr-x 2 root root 4096 oct 22 12:50 include
drwxr-xr-x 3 root root 4096 oct 22 12:50 lib
-rw-r--r-- 1 root root 60 oct 22 12:50 pip-selfcheck.json
When trying to run tf_cnn_benchmark I am getting this error:
(tensorflow) root@P50:/opt/DL/tensorflow/benchmarks/scripts# python3 tf_cnn_benchmarks.py --local_parameter_device=cpu --num_gpus=1 --batch_size=16 --model=inception3 --data_dir=/opt/DL/imagenet/datasets/ --variable_update=parameter_server --nodistortions
Traceback (most recent call last):
File "tf_cnn_benchmarks.py", line 26, in <module>
import benchmark_cnn
File "/opt/DL/tensorflow/benchmarks/scripts/benchmark_cnn.py", line 41, in <module>
import cnn_util
File "/opt/DL/tensorflow/benchmarks/scripts/cnn_util.py", line 40
print log
^
SyntaxError: Missing parentheses in call to 'print'
(tensorflow) root@P50:/opt/DL/tensorflow/benchmarks/scripts#
Do I need to do something else before running the benchmark?
Thank you,
Florin
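The SyntaxError comes from a Python 2 print statement (`print log`) being parsed by python3. Either run the script with a Python 2 interpreter or change the statement to the function-call form, which both interpreters accept for a single argument. A minimal sketch follows; the function name `log_fn` is an assumption about how cnn_util.py wraps the call, not taken from the traceback.

```python
def log_fn(log):
    # print(log) is valid in Python 3, and in Python 2 it prints the
    # parenthesized single value, so the same line works under both.
    print(log)


log_fn("Generating model")
```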