awslabs / deeplearning-benchmark

License: Apache License 2.0

Python 87.10% Shell 7.19% Scala 3.57% Java 1.54% Starlark 0.60%

deeplearning-benchmark's Introduction

Scalability Comparison Scripts for Deep Learning Frameworks

This repository contains scripts that compare the scalability of deep learning frameworks.

The scripts train Inception v3 and AlexNet using synchronous stochastic gradient descent (SGD). To run the comparison in a reasonable time, we run a few tens of iterations of SGD and compute the throughput as images processed per second.
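The throughput figure described above can be sketched as follows. This is only an illustration, assuming that each GPU processes one mini-batch per iteration; the function name and its arguments are hypothetical and are not taken from the repository's scripts.

```python
def images_per_second(iterations, batch_size_per_gpu, num_gpus, elapsed_seconds):
    """Return throughput in images/sec for a synchronous SGD run.

    Assumes every GPU consumes one mini-batch per iteration.
    """
    if elapsed_seconds <= 0:
        raise ValueError("elapsed_seconds must be positive")
    total_images = iterations * batch_size_per_gpu * num_gpus
    return total_images / float(elapsed_seconds)


# Example with made-up numbers: 30 iterations, batch of 32 per GPU, 8 GPUs, 60 s.
print(images_per_second(30, 32, 8, 60.0))
```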

Comparisons can be done on clusters created with AWS CloudFormation using the Amazon Deep Learning AMI.

### To run comparisons in a deep learning cluster created with CloudFormation

Step 1: Create a deep learning cluster using CloudFormation.

Step 2: Log in to the master instance using SSH, including the -A option to enable SSH agent forwarding. Example: ssh -A masternode

Step 3: Run the following command: git clone https://github.com/awslabs/deeplearning-benchmark.git && cd deeplearning-benchmark/benchmark/ && bash runscalabilitytest.sh

The runscalabilitytest.sh script runs scalability tests and records the throughput as images/sec in CSV files under 'csv_*' directories. Each line in a CSV file contains a key-value pair, where the key is the number of GPUs the test was run on and the value is the number of images processed per second. The script also plots this data in an SVG file named comparison_graph.svg.
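As a rough sketch of how such a CSV file could be post-processed, the snippet below reads GPU-count/throughput pairs and computes scaling efficiency relative to ideal linear scaling. It assumes the key and value are separated by a comma; the sample numbers and function names are made up for illustration and do not come from the repository.

```python
import csv
import io


def read_throughput(csv_text):
    """Parse 'num_gpus,images_per_sec' lines into a {num_gpus: rate} dict."""
    result = {}
    for row in csv.reader(io.StringIO(csv_text)):
        if len(row) < 2:
            continue  # skip blank or malformed lines
        result[int(row[0])] = float(row[1])
    return result


def scaling_efficiency(throughput):
    """Measured rate divided by ideal linear scaling from the smallest GPU count."""
    base_gpus = min(throughput)
    per_gpu_rate = throughput[base_gpus] / base_gpus
    return {n: rate / (n * per_gpu_rate) for n, rate in sorted(throughput.items())}


# Made-up sample data in the assumed format.
sample = "1,100.0\n2,190.0\n4,360.0\n"
data = read_throughput(sample)
print(scaling_efficiency(data))
```

An efficiency of 1.0 means the throughput grew in proportion to the number of GPUs; lower values indicate communication or synchronization overhead.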

Note: The following mini-batch sizes are used by default:

| | P2 Instance | G2 Instance |
|---|---|---|
| Inception v3 | 32 | 8 |
| AlexNet | 512 | 128 |

The mini-batch size can be changed using the --models switch. For example, to run Inception v3 with a batch size of 16 and AlexNet with a batch size of 256, run the following: bash runscalabilitytest.sh --models "Inceptionv3:16,Alexnet:256".
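The value passed to --models is a comma-separated list of NAME:BATCH pairs. The sketch below only illustrates that format; it is not how runscalabilitytest.sh parses the switch, and the function name is hypothetical.

```python
def parse_models(spec):
    """Split 'Name:batch,Name:batch' into a {name: batch_size} dict."""
    models = {}
    for item in spec.split(","):
        name, _, batch = item.strip().partition(":")
        if not name or not batch:
            raise ValueError("expected NAME:BATCH, got %r" % item)
        models[name] = int(batch)
    return models


print(parse_models("Inceptionv3:16,Alexnet:256"))
```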

To train across multiple machines, the scripts use parameter servers to update parameters. A single machine can achieve better performance without parameter servers. For simplicity, however, these scripts do not switch to separately optimized code for single-machine tests, because we are interested only in distributed performance across multiple machines. This should not affect the results for distributed training.

deeplearning-benchmark's Issues

AssertionError in How to Retrain a Trained Model on the Flowers Data of Inception

I just downloaded all the files from
https://github.com/awslabs/deeplearning-benchmark/tree/master/tensorflow/inception
and followed these steps:
`# Build the model. Note that we need to make sure the TensorFlow is ready to
# use before this as this command will not build TensorFlow.
bazel build inception/flowers_train

# Path to the downloaded Inception-v3 model.
MODEL_PATH="${INCEPTION_MODEL_DIR}/model.ckpt-157585"

# Directory where the flowers data resides.
FLOWERS_DATA_DIR=/tmp/flowers-data/

# Directory where to save the checkpoint and events files.
TRAIN_DIR=/tmp/flowers_train/

# Run the fine-tuning on the flowers data set starting from the pre-trained
# Imagenet-v3 model.
bazel-bin/inception/flowers_train \
  --train_dir="${TRAIN_DIR}" \
  --data_dir="${FLOWERS_DATA_DIR}" \
  --pretrained_model_checkpoint_path="${MODEL_PATH}" \
  --fine_tune=True \
  --initial_learning_rate=0.001 \
  --input_queue_memory_factor=1`

but it showed

`bazel-bin/inception/flowers_train \
  --train_dir="${TRAIN_DIR}" \
  --data_dir="${FLOWERS_DATA_DIR}" \
  --pretrained_model_checkpoint_path="${MODEL_PATH}" \
  --fine_tune=True \
  --initial_learning_rate=0.001 \
  --input_queue_memory_factor=1
I tensorflow/stream_executor/dso_loader.cc:135] successfully opened CUDA library libcublas.so.8.0 locally
I tensorflow/stream_executor/dso_loader.cc:135] successfully opened CUDA library libcudnn.so.5 locally
I tensorflow/stream_executor/dso_loader.cc:135] successfully opened CUDA library libcufft.so.8.0 locally
I tensorflow/stream_executor/dso_loader.cc:135] successfully opened CUDA library libcuda.so.1 locally
I tensorflow/stream_executor/dso_loader.cc:135] successfully opened CUDA library libcurand.so.8.0 locally
W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE3 instructions, but these are available on your machine and could speed up CPU computations.
W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE4.1 instructions, but these are available on your machine and could speed up CPU computations.
W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE4.2 instructions, but these are available on your machine and could speed up CPU computations.
W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use AVX instructions, but these are available on your machine and could speed up CPU computations.
I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:910] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
I tensorflow/core/common_runtime/gpu/gpu_device.cc:885] Found device 0 with properties:
name: GeForce GTX 1070
major: 6 minor: 1 memoryClockRate (GHz) 1.835
pciBusID 0000:01:00.0
Total memory: 7.92GiB
Free memory: 7.40GiB
I tensorflow/core/common_runtime/gpu/gpu_device.cc:906] DMA: 0
I tensorflow/core/common_runtime/gpu/gpu_device.cc:916] 0: Y
I tensorflow/core/common_runtime/gpu/gpu_device.cc:975] Creating TensorFlow device (/gpu:0) -> (device: 0, name: GeForce GTX 1070, pci bus id: 0000:01:00.0)
Traceback (most recent call last):
  File "/home/lunasdejavu/Downloads/InceptionNet/bazel-bin/inception/flowers_train.runfiles/inception/inception/flowers_train.py", line 41, in <module>
    tf.app.run()
  File "/home/lunasdejavu/anaconda2/lib/python2.7/site-packages/tensorflow/python/platform/app.py", line 44, in run
    _sys.exit(main(_sys.argv[:1] + flags_passthrough))
  File "/home/lunasdejavu/Downloads/InceptionNet/bazel-bin/inception/flowers_train.runfiles/inception/inception/flowers_train.py", line 37, in main
    inception_train.train(dataset)
  File "/home/lunasdejavu/Downloads/InceptionNet/bazel-bin/inception/flowers_train.runfiles/inception/inception/inception_train.py", line 321, in train
    assert tf.gfile.Exists(FLAGS.pretrained_model_checkpoint_path)
AssertionError
`
My environment is gcc version 5.4.0 20160609 (Ubuntu 5.4.0-6ubuntu1~16.04.4)
64 bit
tensorflow 1.0.0 with GPU
CUDA 8.0, cuDNN 5.1
python 2.7
I am really new to TensorFlow and Python. Can someone help me fix this problem, please?
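The assertion on the last line of the traceback fails when `tf.gfile.Exists` finds nothing at the path given by `--pretrained_model_checkpoint_path`. One plausible cause, which this thread does not confirm, is that `INCEPTION_MODEL_DIR` was never set in the shell, so `MODEL_PATH` expands to `/model.ckpt-157585`. The standard-library sketch below rebuilds the path the same way and reports whether it exists; the helper names are hypothetical.

```python
import os


def checkpoint_path(env=os.environ, name="model.ckpt-157585"):
    """Build the checkpoint path the way the shell snippet does.

    An unset INCEPTION_MODEL_DIR expands to an empty string, as in the shell.
    """
    model_dir = env.get("INCEPTION_MODEL_DIR", "")
    return model_dir + "/" + name


def explain(path):
    """Describe whether anything exists at the given path."""
    if os.path.exists(path):
        return "found: " + path
    return "missing: " + path + " (check INCEPTION_MODEL_DIR and the file name)"


print(explain(checkpoint_path()))
```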

ImportError: No module named object_detection.data_decoders

When I run tf_cnn_benchmarks.py, I am getting the following error:

Traceback (most recent call last):
  File "tf_cnn_benchmarks.py", line 27, in <module>
    import benchmark_cnn
  File "/root/benchmarks/scripts/tf_cnn_benchmarks/benchmark_cnn.py", line 51, in <module>
    import datasets
  File "/root/benchmarks/scripts/tf_cnn_benchmarks/datasets.py", line 31, in <module>
    import preprocessing
  File "/root/benchmarks/scripts/tf_cnn_benchmarks/preprocessing.py", line 31, in <module>
    from object_detection.data_decoders import tf_example_decoder
ImportError: No module named object_detection.data_decoders
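The traceback shows `preprocessing.py` importing `object_detection`, which Python cannot find on its module search path. As far as I know, that package ships with the TensorFlow models repository rather than with TensorFlow itself, so its directory has to be added to `PYTHONPATH`. The sketch below (Python 3 only; the helper name is hypothetical) checks whether a module can be located without importing the benchmark code.

```python
import importlib.util


def module_available(name):
    """Return True if `name` can be located on the current module search path."""
    try:
        return importlib.util.find_spec(name) is not None
    except ModuleNotFoundError:
        # Raised when a parent package of a dotted name is itself missing.
        return False


for mod in ("object_detection", "object_detection.data_decoders"):
    print(mod, "->", "available" if module_available(mod) else "NOT found")
```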

Duplicate of benchmark_runner.py and benchmark_driver.py

Which one is needed?

python3 cannot parse log from executed command

Running benchmark_runner.py with python3 caused the following error:

Traceback (most recent call last):
  File "benchmark_runner.py", line 61, in <module>
    framework=args.framework
  File "/home/ubuntu/efs/deeplearning-benchmark/utils/metrics_manager.py", line 154, in benchmark
    result.parse_log()
  File "/home/ubuntu/efs/deeplearning-benchmark/utils/metrics_manager.py", line 88, in parse_log
    metric = re.findall(pattern, self.log_file)
  File "/home/ubuntu/anaconda3/lib/python3.6/re.py", line 222, in findall
    return _compile(pattern, flags).findall(string)
TypeError: cannot use a string pattern on a bytes-like object

Python 2 works fine. There is something wrong with the regular expression handling.
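The error message indicates that a `str` pattern is being applied to a `bytes` object. In Python 3 the two types cannot be mixed in `re` calls, and output captured from a subprocess is `bytes` unless it is decoded. A minimal sketch of one possible fix follows, assuming the log is UTF-8 text; the function name and the sample log lines are made up and are not taken from `metrics_manager.py`.

```python
import re


def find_metrics(pattern, log):
    """Run a text regex over a log that may arrive as bytes or str."""
    if isinstance(log, bytes):
        # Decode first so the str pattern and the searched text have the same type.
        log = log.decode("utf-8", errors="replace")
    return re.findall(pattern, log)


# Made-up log content standing in for the captured command output.
raw = b"Epoch[0] Speed: 123.4 samples/sec\nEpoch[1] Speed: 130.1 samples/sec\n"
print(find_metrics(r"Speed: ([\d.]+)", raw))
```

The alternative is to compile the pattern as bytes (for example `rb"..."`), but decoding once keeps the rest of the parsing code working with ordinary strings.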

Python3 object of type map has no len()

ERROR:root:Fatal error in run_benchmark
  Traceback (most recent call last):
    File "benchmark_driver.py", line 67, in <module>
      run_benchmark(args)
    File "benchmark_driver.py", line 45, in run_benchmark
      framework=args.framework
    File "/home/ubuntu/workspace/chai-dl-benchmark-fork/utils/metrics_manager.py", line 156, in benchmark
      result.parse_log()
    File "/home/ubuntu/workspace/chai-dl-benchmark-fork/utils/metrics_manager.py", line 94, in parse_log
      metric=metric
    File "/home/ubuntu/workspace/chai-dl-benchmark-fork/utils/metrics_manager.py", line 23, in compute
      return 1.0 * sum(metric) / len(metric)
TypeError: object of type 'map' has no len()

Python 2 works fine, but Python 3 does not.
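In Python 3, `map` returns a lazy iterator instead of a list, so `len()` cannot be applied to its result. A minimal sketch of a version-independent average follows; the function name is hypothetical and this is not the repository's `compute` implementation.

```python
def average(values):
    """Mean of any iterable; materialize it first so len() works on Python 3."""
    values = list(values)
    if not values:
        raise ValueError("no values to average")
    return 1.0 * sum(values) / len(values)


# map() yields an iterator on Python 3 and a list on Python 2; both work here.
print(average(map(float, ["123.4", "130.1"])))
```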

utils.errors.MetricPatternError: Can not locate provided metric pattern.

The command:
python benchmark_driver.py --task-name resnet50_cifar10_symbolic --num-gpus 1

The result:
INFO:root:Executing Command: python image_classification/image_classification.py --model resnet50_v1 --dataset cifar10 --mode symbolic --gpus 1 --epochs 20 --log-interval 50 --kvstore device
Traceback (most recent call last):
  File "benchmark_driver.py", line 59, in <module>
    framework=args.framework
  File "/root/benchmark/utils/metrics_manager.py", line 150, in benchmark
    result.parse_log()
  File "/root/benchmark/utils/metrics_manager.py", line 86, in parse_log
    raise utils.errors.MetricPatternError("Can not locate provided metric pattern.")
utils.errors.MetricPatternError: Can not locate provided metric pattern.

I used my own device rather than AWS. What's the reason for this problem?
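The traceback shows `parse_log` raising the error, which suggests that the metric regular expression matched nothing in the captured log. One possibility, not confirmed in this thread, is that the training command failed before it printed any metric line. The sketch below uses a local stand-in for the exception class and a made-up pattern to show how including the tail of the log in the error makes the underlying failure visible.

```python
import re


class MetricPatternError(Exception):
    """Local stand-in for utils.errors.MetricPatternError (illustration only)."""


def extract_metric(pattern, log_text):
    """Return all numeric matches, or raise with the last log lines attached."""
    matches = re.findall(pattern, log_text)
    if not matches:
        tail = "\n".join(log_text.splitlines()[-5:])
        raise MetricPatternError(
            "Can not locate provided metric pattern. Last log lines:\n" + tail
        )
    return [float(m) for m in matches]


# Made-up log line and pattern.
print(extract_metric(r"Speed: ([\d.]+)", "Epoch[0] Speed: 10.5 samples/sec\n"))
```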