
keras-multi-gpu's People

Contributors

bzamecnik, pasky


keras-multi-gpu's Issues

Argument constraints will be removed from Optimizer.get_updates() in Keras 2.0.7

We call Optimizer.get_updates() with self.constraints as an argument from the overridden Model._make_train_function(). However, in master (to be released in 2.0.7) this argument has been removed. Although there is a legacy interface adapter, it fails with:

  File "keras/keras/engine/training.py", line 1412, in fit
    self._make_train_function()
  File "rossum-multi-gpu/data_parallel_model.py", line 161, in _make_train_function
    self.constraints,
AttributeError: 'DataParallelModel' object has no attribute 'constraints'
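For reference, in Keras <= 2.0.6 the signature was get_updates(params, constraints, loss), while from 2.0.7 on it is get_updates(loss, params). A minimal sketch of how the overridden _make_train_function() could handle both versions (an illustration, not the repo's actual fix; it assumes the usual self._collected_trainable_weights and self.total_loss attributes of Keras' Model):

from distutils.version import StrictVersion
import keras

# Sketch only: pick the get_updates() call based on the installed Keras version.
if StrictVersion(keras.__version__) >= StrictVersion('2.0.7'):
    # Keras >= 2.0.7: the constraints argument was removed, get_updates(loss, params)
    training_updates = self.optimizer.get_updates(
        loss=self.total_loss, params=self._collected_trainable_weights)
else:
    # Keras <= 2.0.6: get_updates(params, constraints, loss)
    training_updates = self.optimizer.get_updates(
        self._collected_trainable_weights, self.constraints, self.total_loss)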

Possible to use NCCL for optimized inter-GPU communication?

NCCL claims to provide optimized collective operations for multi-GPU communication. It's available via TensorFlow as well. In our case we could use:

  • all-gather for gradient averaging (sum of gradients normalized by number of replicas)
  • broadcast for propagating weights
  • all-scatter for providing input slices to replicas

We could use the TF NCCL operation tf.contrib.nccl.all_sum. It is an all-reduce with sum reduction, i.e. a reduce followed by a broadcast of the result, and we can use it for gradient averaging (see the sketch below). Since the summed gradients end up on all devices, the weights can be located and updated on all devices and do not need to be broadcast.

An all-scatter operation is not provided in tf.contrib.nccl; instead, we could use the TF queue mechanism.
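A minimal sketch of gradient averaging via tf.contrib.nccl.all_sum (TF 1.x). The function and the name tower_grads are illustrative assumptions, not code from this repo; tower_grads[i] is assumed to hold the gradients computed on GPU i, in the same variable order:

import tensorflow as tf
from tensorflow.contrib import nccl

def nccl_average_gradients(tower_grads, num_gpus):
    # tower_grads[i][j]: gradient of variable j computed on GPU i.
    averaged = [[] for _ in range(num_gpus)]
    # All-reduce the j-th gradient across all GPUs, then divide by the
    # number of replicas to get the average on every device.
    for grads_for_var in zip(*tower_grads):
        summed = nccl.all_sum(list(grads_for_var))  # one output tensor per GPU
        for gpu_index, grad_sum in enumerate(summed):
            with tf.device(grad_sum.device):
                averaged[gpu_index].append(grad_sum / float(num_gpus))
    return averaged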

Can't convert Operation 'StagingArea_put' to Tensor

When running with tensorflow-1.12.0 and Keras-2.2.4:

CUDA_VISIBLE_DEVICES=3 python keras_staging_area_cifar10.py

I get the following error:

training pipelined model:
Traceback (most recent call last):
  File "keras_staging_area_cifar10.py", line 73, in <module>
    callbacks=[staging_area_callback, gauge])
  File "/home/bzamecnik/.virtualenvs/rossum/local/lib/python2.7/site-packages/keras/engine/training.py", line 1010, in fit
    self._make_train_function()
  File "/home/bzamecnik/.virtualenvs/rossum/local/lib/python2.7/site-packages/keras/engine/training.py", line 519, in _make_train_function
    **self._function_kwargs)
  File "/home/bzamecnik/.virtualenvs/rossum/local/lib/python2.7/site-packages/keras/backend/tensorflow_backend.py", line 2744, in function
    return Function(inputs, outputs, updates=updates, **kwargs)
  File "/home/bzamecnik/.virtualenvs/rossum/local/lib/python2.7/site-packages/keras/backend/tensorflow_backend.py", line 2567, in __init__
    self.fetches = [tf.identity(x) for x in self.fetches]
  File "/home/bzamecnik/.virtualenvs/rossum/local/lib/python2.7/site-packages/tensorflow/python/ops/array_ops.py", line 81, in identity
    return gen_array_ops.identity(input, name=name)
  File "/home/bzamecnik/.virtualenvs/rossum/local/lib/python2.7/site-packages/tensorflow/python/ops/gen_array_ops.py", line 3454, in identity
    "Identity", input=input, name=name)
  File "/home/bzamecnik/.virtualenvs/rossum/local/lib/python2.7/site-packages/tensorflow/python/framework/op_def_library.py", line 513, in _apply_op_helper
    raise err
TypeError: Can't convert Operation 'StagingArea_put' to Tensor (target dtype=None, name=u'input', as_ref=False)

Similar: https://stackoverflow.com/questions/47750300/tensorflow-cant-convert-operation-to-tensor

The cause seems to be that the StagingArea put operation (a tf.Operation, not a Tensor) gets wrapped in tf.identity():

 # (since the outputs of fetches are never returned).
   2566         # This requires us to wrap fetches in `identity` ops.
-> 2567         self.fetches = [tf.identity(x) for x in self.fetches]
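One possible workaround (an untested assumption, not a verified fix) would be to wrap only Tensor fetches in tf.identity() and pass Operation fetches, such as the StagingArea put, through unchanged, e.g. along these lines:

import tensorflow as tf

def wrap_fetches(fetches):
    # Only Tensors can be wrapped in tf.identity(); Operations (like
    # StagingArea_put) have no output value and must be passed through.
    return [tf.identity(f) if isinstance(f, tf.Tensor) else f for f in fetches]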

a thought

hey guys,

first I wanna say that it's so nice to see people sharing their thoughts and work like this.

I just wanted to ask, w.r.t. the Keras distributed tests: are you scaling the batch size with the number of GPUs? Keras just splits the given batch size across the cards, so for a batch size of 256 on 4 cards the real batch size is 64 per card. (I honestly think this should be changed, but c'est la vie.)

So this may be why you see less efficiency on the cards; a quick sketch of the scaling is below.
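For example (numbers are illustrative, not from the repo's benchmarks):

# Keep the per-GPU batch size constant by scaling the global batch size
# with the number of GPUs; multi_gpu_model splits the global batch.
n_gpus = 4
per_gpu_batch_size = 64
global_batch_size = per_gpu_batch_size * n_gpus  # 256 in total, 64 per card
# parallel_model.fit(x_train, y_train, batch_size=global_batch_size, ...)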

here's a plot from my tests, that shows quasilinear speedups on EC2 instances.

[pasted image, 2017-11-13 05:31 PM: plot showing quasilinear speedups on EC2 instances]

hope this helps!!

How to use multi-gpu in Keras with shared weights applications model

System information

  • Linux Ubuntu 16.04
  • TensorFlow backend
  • TensorFlow version: 1.10.0

I want to use Keras with multiple GPUs together with the applications models (such as VGG16), but I get an error.

With a single GPU it works correctly, but with multiple GPUs it fails.
The code looks like this:

import numpy as np
import tensorflow as tf
import keras

with tf.device('/cpu:0'):
    input1 = keras.layers.Input(config.input_shape)
    input2 = keras.layers.Input(config.input_shape)
    sub_model = keras.applications.VGG16(include_top=False, weights=config.VGG_MODEL_PATH,
                                         input_shape=config.input_shape)
    # the same VGG16 sub-model is applied to both inputs, so the weights are shared
    output1 = sub_model(input1)
    output2 = sub_model(input2)
    model = keras.Model(inputs=[input1, input2], outputs=[output1, output2])

parallel_model = keras.utils.multi_gpu_model(model, gpus=3)
parallel_model.compile('sgd', loss=['mse', 'mse'])
parallel_model.fit([np.random.random([10, 128, 128, 3]), np.random.random([10, 128, 128, 3])],
                   [np.random.random([10, 4, 4, 512]), np.random.random([10, 4, 4, 512])])

The error message is

Traceback (most recent call last):
  File "/data00/home/liangdong.tony/PycharmProject/RetrievalCCWebVideo/AE/demo.py", line 145, in <module>
    parallel_model = keras.utils.multi_gpu_model(model, gpus=3)
  File "/data00/home/liangdong.tony/.local/lib/python2.7/site-packages/keras/utils/training_utils.py", line 177, in multi_gpu_model
    return Model(model.inputs, merged)
  File "/data00/home/liangdong.tony/.local/lib/python2.7/site-packages/keras/legacy/interfaces.py", line 87, in wrapper
    return func(*args, **kwargs)
  File "/data00/home/liangdong.tony/.local/lib/python2.7/site-packages/keras/engine/topology.py", line 1811, in __init__
    'Layer names: ', all_names)
RuntimeError: ('The name "vgg16" is used 2 times in the model. All layer names should be unique. Layer names: ', ['input_1', 'input_2', 'lambda_1', 'lambda_2', 'lambda_3', 'lambda_4', 'lambda_5', 'lambda_6', 'model_1', 'vgg16', 'vgg16'])

In short, I want to use VGG16 as a backbone. I have two inputs which are both fed into the same VGG16 sub-model, so the weights are shared between them.
Do you have any suggestions?
Thank you, looking forward to your reply!

tensorflow : 'NoneType' object has no attribute 'update'

I tried running all your examples on 8 NVIDIA V100 GPUs, however I get this error across all of them:

  File "/opt/conda/lib/python3.5/copy.py", line 182, in deepcopy
    y = _reconstruct(x, rv, 1, memo)
  File "/opt/conda/lib/python3.5/copy.py", line 297, in _reconstruct
    state = deepcopy(state, memo)
  File "/opt/conda/lib/python3.5/copy.py", line 155, in deepcopy
    y = copier(x, memo)
  File "/opt/conda/lib/python3.5/copy.py", line 243, in _deepcopy_dict
    y[deepcopy(key, memo)] = deepcopy(value, memo)
  File "/opt/conda/lib/python3.5/copy.py", line 182, in deepcopy
    y = _reconstruct(x, rv, 1, memo)
  File "/opt/conda/lib/python3.5/copy.py", line 306, in _reconstruct
    y.__dict__.update(state)
AttributeError: 'NoneType' object has no attribute 'update'

This is in a Docker environment with the following versions:

  • Tensorflow 1.4.0
  • Keras 2.1.2
  • Python 3.5.4
  • CUDA 9.0
Could this be a version compatibility or GPU device compatibility issue? Thanks for any pointers.

Shape [-1] has negative dimensions

Running on 2 GPUs (GTX 1070):

CUDA_VISIBLE_DEVICES=0,1 python data_parallel_mnist_cnn.py
Train on 60000 samples, validate on 10000 samples
Epoch 1/10
2017-08-10 14:55:47.483599: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE4.2 instructions, but these are available on your machine and could speed up CPU computations.
2017-08-10 14:55:47.483631: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use AVX instructions, but these are available on your machine and could speed up CPU computations.
2017-08-10 14:55:48.831409: W tensorflow/core/framework/op_kernel.cc:1148] Invalid argument: Shape [-1,-1] has negative dimensions
2017-08-10 14:55:48.831460: E tensorflow/core/common_runtime/executor.cc:644] Executor failed to create kernel. Invalid argument: Shape [-1,-1] has negative dimensions
	 [[Node: replica_1_1/model_1_target = Placeholder[dtype=DT_FLOAT, shape=[?,?], _device="/job:localhost/replica:0/task:0/cpu:0"]()]]
2017-08-10 14:55:48.849021: W tensorflow/core/framework/op_kernel.cc:1148] Invalid argument: Shape [-1,-1] has negative dimensions
2017-08-10 14:55:48.849064: E tensorflow/core/common_runtime/executor.cc:644] Executor failed to create kernel. Invalid argument: Shape [-1,-1] has negative dimensions
	 [[Node: replica_0_1/model_1_target = Placeholder[dtype=DT_FLOAT, shape=[?,?], _device="/job:localhost/replica:0/task:0/cpu:0"]()]]
2017-08-10 14:55:48.865190: W tensorflow/core/framework/op_kernel.cc:1148] Invalid argument: Shape [-1] has negative dimensions
2017-08-10 14:55:48.865233: E tensorflow/core/common_runtime/executor.cc:644] Executor failed to create kernel. Invalid argument: Shape [-1] has negative dimensions
	 [[Node: replica_0_1/model_1_sample_weights = Placeholder[dtype=DT_FLOAT, shape=[?], _device="/job:localhost/replica:0/task:0/cpu:0"]()]]
Traceback (most recent call last):
  File "/Users/bzamecnik/anaconda/lib/python3.4/site-packages/tensorflow/python/client/session.py", line 1139, in _do_call
    return fn(*args)
  File "/Users/bzamecnik/anaconda/lib/python3.4/site-packages/tensorflow/python/client/session.py", line 1121, in _run_fn
    status, run_metadata)
  File "/Users/bzamecnik/anaconda/lib/python3.4/contextlib.py", line 66, in __exit__
    next(self.gen)
  File "/Users/bzamecnik/anaconda/lib/python3.4/site-packages/tensorflow/python/framework/errors_impl.py", line 466, in raise_exception_on_not_ok_status
    pywrap_tensorflow.TF_GetCode(status))
tensorflow.python.framework.errors_impl.InvalidArgumentError: Shape [-1] has negative dimensions
	 [[Node: replica_0_1/model_1_sample_weights = Placeholder[dtype=DT_FLOAT, shape=[?], _device="/job:localhost/replica:0/task:0/cpu:0"]()]]

loss stuck when using multi_gpu

I'm trying to use make_parallel() with Keras Xception and a generator which yields two classes, batch_size=2.

When using one GPU without make_parallel, the model gets to loss=0, acc=1 in 2 epochs.
However, when using multi-GPU with gpus=2, the model gets stuck at acc=0.5 with loss=8.0591.

I'm guessing this is related somehow to the loss aggregation being collected only from one GPU instead of both, but I am not sure why.

When trying to train 4 classes with batch_size=4, the training gets to acc=0.97 after 11 epochs, while a single GPU gets to acc=1 within 2 epochs.

Any idea?
