Comments (6)
Hi @mingxingtan, first of all, thanks for sharing the EfficientDet code. Could you say when you expect the current code to support training with multiple GPUs? If that will not be added soon, do you have any suggestions on how it could be implemented? I am currently studying TensorFlow in more depth, and I still have no idea how difficult it would be for me to do it myself.
Thanks,
Mario
from automl.
I would also like to know how to train with multiple GPUs.
Same question. It seems the code only supports single-GPU training.
I've tried following this tutorial on doing distributed training with estimator. I changed the run config to use a MirroredStrategy like so:
strategy = tf.distribute.MirroredStrategy()
run_config = tf.estimator.tpu.RunConfig(
    cluster=tpu_cluster_resolver,
    evaluation_master=FLAGS.eval_master,
    model_dir=FLAGS.model_dir,
    log_step_count_steps=FLAGS.iterations_per_loop,
    session_config=config_proto,
    tpu_config=tpu_config,
    train_distribute=strategy,
)
However, when I ran the code I got this stack trace:
Traceback (most recent call last):
File "main.py", line 394, in <module>
tf.app.run(main)
File "/usr/lib/python3/dist-packages/tensorflow_core/python/platform/app.py", line 40, in run
_run(main=main, argv=argv, flags_parser=_parse_flags_tolerate_undef)
File "/home/lucas/.local/lib/python3.6/site-packages/absl/app.py", line 299, in run
_run_main(main, args)
File "/home/lucas/.local/lib/python3.6/site-packages/absl/app.py", line 250, in _run_main
sys.exit(main(argv))
File "main.py", line 362, in main
steps=int(FLAGS.num_examples_per_epoch / FLAGS.train_batch_size))
File "/usr/lib/python3/dist-packages/tensorflow_estimator/python/estimator/tpu/tpu_estimator.py", line 3035, in train
rendezvous.raise_errors()
File "/usr/lib/python3/dist-packages/tensorflow_estimator/python/estimator/tpu/error_handling.py", line 136, in raise_errors
six.reraise(typ, value, traceback)
File "/home/lucas/.local/lib/python3.6/site-packages/six.py", line 703, in reraise
raise value
File "/usr/lib/python3/dist-packages/tensorflow_estimator/python/estimator/tpu/tpu_estimator.py", line 3030, in train
saving_listeners=saving_listeners)
File "/usr/lib/python3/dist-packages/tensorflow_estimator/python/estimator/estimator.py", line 370, in train
loss = self._train_model(input_fn, hooks, saving_listeners)
File "/usr/lib/python3/dist-packages/tensorflow_estimator/python/estimator/estimator.py", line 1159, in _train_model
return self._train_model_distributed(input_fn, hooks, saving_listeners)
File "/usr/lib/python3/dist-packages/tensorflow_estimator/python/estimator/estimator.py", line 1222, in _train_model_distributed
self._config._train_distribute, input_fn, hooks, saving_listeners)
File "/usr/lib/python3/dist-packages/tensorflow_estimator/python/estimator/estimator.py", line 1258, in _actual_train_model_distributed
input_fn, ModeKeys.TRAIN, strategy)
File "/usr/lib/python3/dist-packages/tensorflow_estimator/python/estimator/estimator.py", line 1012, in _get_iterator_from_input_fn
lambda input_context: self._call_input_fn(input_fn, mode,
File "/usr/lib/python3/dist-packages/tensorflow_core/python/distribute/distribute_lib.py", line 1050, in make_input_fn_iterator
input_fn, replication_mode)
File "/usr/lib/python3/dist-packages/tensorflow_core/python/distribute/distribute_lib.py", line 577, in make_input_fn_iterator
input_fn, replication_mode=replication_mode)
File "/usr/lib/python3/dist-packages/tensorflow_core/python/distribute/mirrored_strategy.py", line 552, in _make_input_fn_iterator
self._container_strategy())
File "/usr/lib/python3/dist-packages/tensorflow_core/python/distribute/input_lib.py", line 719, in __init__
result = input_fn(ctx)
File "/usr/lib/python3/dist-packages/tensorflow_estimator/python/estimator/estimator.py", line 1013, in <lambda>
input_context))
I thought that might be because dataloader.py's InputReader.call function didn't take an input context, but I fixed that as well, following the guide, and got the same stack trace.
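For anyone unsure what "taking an input context" amounts to, here is a rough sketch of the idea rather than the repo's actual code. `tf.distribute.InputContext` exposes `num_input_pipelines` and `input_pipeline_id`; the stand-in below is a plain namedtuple so the sketch runs without TensorFlow. `FakeInputContext`, `shard_files`, and the file names are all hypothetical.

```python
from collections import namedtuple

# Hypothetical stand-in exposing two attributes that
# tf.distribute.InputContext also has.
FakeInputContext = namedtuple(
    "FakeInputContext", ["num_input_pipelines", "input_pipeline_id"])


def shard_files(filenames, input_context=None):
    """Return the files that one input pipeline should read.

    With no context (single-device training) every file is kept. Otherwise
    every num_input_pipelines-th file is kept, starting at input_pipeline_id,
    which is the same selection rule as tf.data.Dataset.shard.
    """
    filenames = list(filenames)
    if input_context is None:
        return filenames
    start = input_context.input_pipeline_id
    step = input_context.num_input_pipelines
    return filenames[start::step]


# Hypothetical file names, split across two input pipelines.
files = ["train-{:02d}.tfrecord".format(i) for i in range(6)]
shard0 = shard_files(files, FakeInputContext(2, 0))
shard1 = shard_files(files, FakeInputContext(2, 1))
print(shard0)
print(shard1)
```

Since I got the same stack trace with and without this change, the input context is probably not the root cause here.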
Tried again on TF 2.1 and got a different error:
File "main.py", line 394, in <module>
tf.app.run(main)
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/platform/app.py", line 40, in run
_run(main=main, argv=argv, flags_parser=_parse_flags_tolerate_undef)
File "/usr/local/lib/python3.6/dist-packages/absl/app.py", line 299, in run
_run_main(main, args)
File "/usr/local/lib/python3.6/dist-packages/absl/app.py", line 250, in _run_main
sys.exit(main(argv))
File "main.py", line 362, in main
steps=int(FLAGS.num_examples_per_epoch / FLAGS.train_batch_size))
File "/usr/local/lib/python3.6/dist-packages/tensorflow_estimator/python/estimator/tpu/tpu_estimator.py", line 3054, in train
rendezvous.raise_errors()
File "/usr/local/lib/python3.6/dist-packages/tensorflow_estimator/python/estimator/tpu/error_handling.py", line 149, in raise_errors
six.reraise(typ, value, traceback)
File "/usr/local/lib/python3.6/dist-packages/six.py", line 703, in reraise
raise value
File "/usr/local/lib/python3.6/dist-packages/tensorflow_estimator/python/estimator/tpu/tpu_estimator.py", line 3049, in train
saving_listeners=saving_listeners)
File "/usr/local/lib/python3.6/dist-packages/tensorflow_estimator/python/estimator/estimator.py", line 376, in train
loss = self._train_model(input_fn, hooks, saving_listeners)
File "/usr/local/lib/python3.6/dist-packages/tensorflow_estimator/python/estimator/estimator.py", line 1171, in _train_model
return self._train_model_distributed(input_fn, hooks, saving_listeners)
File "/usr/local/lib/python3.6/dist-packages/tensorflow_estimator/python/estimator/estimator.py", line 1234, in _train_model_distributed
self._config._train_distribute, input_fn, hooks, saving_listeners)
File "/usr/local/lib/python3.6/dist-packages/tensorflow_estimator/python/estimator/estimator.py", line 1314, in _actual_train_model_distributed
self.config))
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/distribute/distribute_lib.py", line 2095, in call_for_each_replica
return self._call_for_each_replica(fn, args, kwargs)
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/distribute/mirrored_strategy.py", line 763, in _call_for_each_replica
fn, args, kwargs)
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/distribute/mirrored_strategy.py", line 201, in _call_for_each_replica
coord.join(threads)
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/training/coordinator.py", line 389, in join
six.reraise(*self._exc_info_to_raise)
File "/usr/local/lib/python3.6/dist-packages/six.py", line 703, in reraise
raise value
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/training/coordinator.py", line 297, in stop_on_exception
yield
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/distribute/mirrored_strategy.py", line 986, in run
self.main_result = self.main_fn(*self.main_args, **self.main_kwargs)
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/autograph/impl/api.py", line 265, in wrapper
raise e.ag_error_metadata.to_exception(e)
tensorflow.python.autograph.pyct.error_utils.KeyError: in user code:
/usr/local/lib/python3.6/dist-packages/tensorflow_estimator/python/estimator/tpu/tpu_estimator.py:2884 _call_model_fn *
return super(TPUEstimator, self)._call_model_fn(features, labels, mode,
/usr/local/lib/python3.6/dist-packages/tensorflow_estimator/python/estimator/estimator.py:1161 _call_model_fn *
model_fn_results = self._model_fn(features=features, **kwargs)
/usr/local/lib/python3.6/dist-packages/tensorflow_estimator/python/estimator/tpu/tpu_estimator.py:3144 _model_fn *
estimator_spec = model_fn_wrapper.call_without_tpu(
/usr/local/lib/python3.6/dist-packages/tensorflow_estimator/python/estimator/tpu/tpu_estimator.py:1678 call_without_tpu *
return self._call_model_fn(features, labels, is_export_mode=is_export_mode)
/usr/local/lib/python3.6/dist-packages/tensorflow_estimator/python/estimator/tpu/tpu_estimator.py:2011 _call_model_fn *
estimator_spec = self._model_fn(features=features, **kwargs)
/efficientdet/det_model_fn.py:604 efficientdet_model_fn *
return _model_fn(
/efficientdet/det_model_fn.py:402 _model_outputs *
return model(features, config=hparams_config.Config(params))
/efficientdet/efficientdet_arch.py:543 efficientdet *
if not config and not model_name:
/usr/local/lib/python3.6/dist-packages/tensorflow/python/autograph/operators/logical.py:28 not_
if tensor_util.is_tensor(a):
/usr/local/lib/python3.6/dist-packages/tensorflow/python/framework/tensor_util.py:1000 is_tensor
getattr(x, "is_tensor_like", False))
/efficientdet/hparams_config.py:48 __getattr__
return self.__dict__[k]
KeyError: 'is_tensor_like'
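Reading the last frames of this traceback, a likely cause (my interpretation, not something confirmed in this thread) is that `__getattr__` in hparams_config.py looks the name up with `self.__dict__[k]`, which raises KeyError for a missing attribute. `getattr(x, "is_tensor_like", False)` only falls back to its default on AttributeError, so the KeyError propagates. A minimal pure-Python sketch of that behaviour follows; `DictLookupConfig` and `SafeConfig` are hypothetical stand-ins, not the repo's actual class:

```python
class DictLookupConfig:
    """Hypothetical stand-in mimicking the lookup seen in the traceback."""

    def __init__(self, **kwargs):
        self.__dict__.update(kwargs)

    def __getattr__(self, k):
        # Only called when normal attribute lookup fails.
        # A missing key raises KeyError, not AttributeError.
        return self.__dict__[k]


class SafeConfig(DictLookupConfig):
    """Same lookup, but missing names raise AttributeError as getattr() expects."""

    def __getattr__(self, k):
        try:
            return self.__dict__[k]
        except KeyError:
            raise AttributeError(k) from None


def probe(obj):
    """Mirror the getattr-with-default call shown in tensor_util.is_tensor."""
    try:
        return getattr(obj, "is_tensor_like", False)
    except KeyError as err:
        return "KeyError: {}".format(err)


broken_result = probe(DictLookupConfig(name="efficientdet-d0"))
safe_result = probe(SafeConfig(name="efficientdet-d0"))
print(broken_result)  # the default is never used; KeyError escapes getattr
print(safe_result)    # the default False is returned
```

If this is indeed the cause, raising AttributeError from `__getattr__` would be one way around it, but I have not verified that against the repo.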
Hi @LucasSloan, fsx950223 has an open PR about this and, in another issue in this repo, he said that he was able to train with multiple GPUs. You can find his code in his fork of this repo; I haven't tried it yet. He uses Horovod in his implementation.