Comments (4)
Adding the argument find_unused_parameters=True, as suggested in the error message, to the MMDistributedDataParallel call in trainer.py gives the error below.
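The modified line, for reference (a minimal sketch; MMDistributedDataParallel is the distributed wrapper that trainer.py already imports, and the surrounding setup code is unchanged):

```python
# trainer.py: wrap the model for distributed training,
# now with unused-parameter detection enabled
model = MMDistributedDataParallel(model.cuda(), find_unused_parameters=True)
```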
python -m torch.distributed.launch --master_port=9900 --nproc_per_node=1 train.py --config ./config/cfg_kitti_fm.py --work_dir logs --gpus "0,1"
./config/cfg_kitti_fm.py
cfg is Config (path: ./config/cfg_kitti_fm.py): {'DEPTH_LAYERS': 18, 'POSE_LAYERS': 18, 'FRAME_IDS': [0, -1, 1], 'IMGS_PER_GPU': 1, 'HEIGHT': 192, 'WIDTH': 640, 'data': {'name': 'kitti', 'split': 'exp', 'height': 192, 'width': 640, 'frame_ids': [0, -1, 1], 'in_path': '../data/kitti-raw', 'gt_depth_path': '../easy2ride_pipeline/monodepth2/splits/eigen/gt_depths.npz', 'png': True, 'stereo_scale': False}, 'model': {'name': 'mono_fm', 'depth_num_layers': 18, 'pose_num_layers': 18, 'extractor_num_layers': 50, 'frame_ids': [0, -1, 1], 'imgs_per_gpu': 1, 'height': 192, 'width': 640, 'scales': [0, 1, 2, 3], 'min_depth': 0.1, 'max_depth': 100.0, 'depth_pretrained_path': None, 'pose_pretrained_path': None, 'extractor_pretrained_path': '/home/e2r/Downloads/autoencoder.pth', 'automask': True, 'disp_norm': True, 'perception_weight': 0.001, 'smoothness_weight': 0.001}, 'resume_from': None, 'finetune': None, 'total_epochs': 40, 'imgs_per_gpu': 1, 'learning_rate': 0.0001, 'workers_per_gpu': 4, 'validate': True, 'optimizer': {'type': 'Adam', 'lr': 0.0001, 'weight_decay': 0}, 'optimizer_config': {'grad_clip': {'max_norm': 35, 'norm_type': 2}}, 'lr_config': {'policy': 'step', 'warmup': 'linear', 'warmup_iters': 500, 'warmup_ratio': 0.3333333333333333, 'step': [20, 30], 'gamma': 0.5}, 'checkpoint_config': {'interval': 1}, 'log_config': {'interval': 50, 'hooks': [{'type': 'TextLoggerHook'}]}, 'dist_params': {'backend': 'nccl'}, 'log_level': 'INFO', 'load_from': None, 'workflow': [('train', 1)], 'work_dir': 'logs', 'gpus': [0, 1]}
2020-11-20 14:57:12,848 - INFO - Distributed training: True
2020-11-20 14:57:12,848 - INFO - Set random seed to 1024
/home/e2r/anaconda3/envs/e2r/lib/python3.7/site-packages/torch/nn/parallel/distributed.py:364: UserWarning: Single-Process Multi-GPU is not the recommended mode for DDP. In this mode, each DDP instance operates on multiple devices and creates multiple module replicas within one process. The overhead of scatter/gather and GIL contention in every forward pass can slow down training. Please consider using one DDP instance per device or per module replica by explicitly setting device_ids or CUDA_VISIBLE_DEVICES.
"Single-Process Multi-GPU is not the recommended mode for "
cfg work dir is logs
validate........................
2020-11-20 14:57:20,817 - INFO - Start running, host: e2r@e2r-Super-Server, work_dir: /home/e2r/Desktop/e2r/featdepth/logs
2020-11-20 14:57:20,817 - INFO - workflow: [('train', 1)], max: 40 epochs
/home/e2r/anaconda3/envs/e2r/lib/python3.7/site-packages/torch/nn/functional.py:3384: UserWarning: Default grid_sample and affine_grid behavior has changed to align_corners=False since 1.3.0. Please specify align_corners=True if the old behavior is desired. See the documentation of grid_sample for details.
warnings.warn("Default grid_sample and affine_grid behavior has changed "
/home/e2r/anaconda3/envs/e2r/lib/python3.7/site-packages/torch/nn/parallel/_functions.py:61: UserWarning: Was asked to gather along dimension 0, but all input tensors were scalars; will instead unsqueeze and return a vector.
warnings.warn('Was asked to gather along dimension 0, but all '
Traceback (most recent call last):
File "train.py", line 105, in <module>
main()
File "train.py", line 101, in main
logger=logger)
File "/home/e2r/Desktop/e2r/featdepth/mono/apis/trainer.py", line 68, in train_mono
_dist_train(model, dataset_train, dataset_val, cfg, validate=validate)
File "/home/e2r/Desktop/e2r/featdepth/mono/apis/trainer.py", line 177, in _dist_train
runner.run(data_loaders, cfg.workflow, cfg.total_epochs)
File "/home/e2r/anaconda3/envs/e2r/lib/python3.7/site-packages/mmcv/runner/runner.py", line 380, in run
epoch_runner(data_loaders[i], **kwargs)
File "/home/e2r/anaconda3/envs/e2r/lib/python3.7/site-packages/mmcv/runner/runner.py", line 285, in train
self.call_hook('after_train_iter')
File "/home/e2r/anaconda3/envs/e2r/lib/python3.7/site-packages/mmcv/runner/runner.py", line 241, in call_hook
getattr(hook, fn_name)(self)
File "/home/e2r/Desktop/e2r/featdepth/mono/core/utils/dist_utils.py", line 56, in after_train_iter
runner.outputs['loss'].backward()
File "/home/e2r/anaconda3/envs/e2r/lib/python3.7/site-packages/torch/tensor.py", line 185, in backward
torch.autograd.backward(self, gradient, retain_graph, create_graph)
File "/home/e2r/anaconda3/envs/e2r/lib/python3.7/site-packages/torch/autograd/__init__.py", line 127, in backward
allow_unreachable=True) # allow_unreachable flag
RuntimeError: CUDA error: an illegal memory access was encountered
Exception raised from launch_vectorized_kernel at /opt/conda/conda-bld/pytorch_1595629403081/work/aten/src/ATen/native/cuda/CUDALoops.cuh:146 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x4d (0x7feb7c69177d in /home/e2r/anaconda3/envs/e2r/lib/python3.7/site-packages/torch/lib/libc10.so)
frame #1: void at::native::gpu_kernel_impl<__nv_hdl_wrapper_t<false, false, __nv_dl_tag<void (*)(at::TensorIterator&, __nv_hdl_wrapper_t<false, true, __nv_dl_tag<void (*)(at::TensorIterator&), &at::native::mul_kernel_cuda, 5u>, float (float, float)> const&), &(void at::native::gpu_kernel_with_scalars<__nv_hdl_wrapper_t<false, true, __nv_dl_tag<void (*)(at::TensorIterator&), &at::native::mul_kernel_cuda, 5u>, float (float, float)> >(at::TensorIterator&, __nv_hdl_wrapper_t<false, true, __nv_dl_tag<void (*)(at::TensorIterator&), &at::native::mul_kernel_cuda, 5u>, float (float, float)> const&)), 2u>, float (float), __nv_hdl_wrapper_t<false, true, __nv_dl_tag<void (*)(at::TensorIterator&), &at::native::mul_kernel_cuda, 5u>, float (float, float)> const, float> >(at::TensorIterator&, __nv_hdl_wrapper_t<false, false, __nv_dl_tag<void (*)(at::TensorIterator&, __nv_hdl_wrapper_t<false, true, __nv_dl_tag<void (*)(at::TensorIterator&), &at::native::mul_kernel_cuda, 5u>, float (float, float)> const&), &(void at::native::gpu_kernel_with_scalars<__nv_hdl_wrapper_t<false, true, __nv_dl_tag<void (*)(at::TensorIterator&), &at::native::mul_kernel_cuda, 5u>, float (float, float)> >(at::TensorIterator&, __nv_hdl_wrapper_t<false, true, __nv_dl_tag<void (*)(at::TensorIterator&), &at::native::mul_kernel_cuda, 5u>, float (float, float)> const&)), 2u>, float (float), __nv_hdl_wrapper_t<false, true, __nv_dl_tag<void (*)(at::TensorIterator&), &at::native::mul_kernel_cuda, 5u>, float (float, float)> const, float> const&) + 0x5cb (0x7feb1d621c0b in /home/e2r/anaconda3/envs/e2r/lib/python3.7/site-packages/torch/lib/libtorch_cuda.so)
frame #2: void at::native::gpu_kernel<__nv_hdl_wrapper_t<false, false, __nv_dl_tag<void (*)(at::TensorIterator&, __nv_hdl_wrapper_t<false, true, __nv_dl_tag<void (*)(at::TensorIterator&), &at::native::mul_kernel_cuda, 5u>, float (float, float)> const&), &(void at::native::gpu_kernel_with_scalars<__nv_hdl_wrapper_t<false, true, __nv_dl_tag<void (*)(at::TensorIterator&), &at::native::mul_kernel_cuda, 5u>, float (float, float)> >(at::TensorIterator&, __nv_hdl_wrapper_t<false, true, __nv_dl_tag<void (*)(at::TensorIterator&), &at::native::mul_kernel_cuda, 5u>, float (float, float)> const&)), 2u>, float (float), __nv_hdl_wrapper_t<false, true, __nv_dl_tag<void (*)(at::TensorIterator&), &at::native::mul_kernel_cuda, 5u>, float (float, float)> const, float> >(at::TensorIterator&, __nv_hdl_wrapper_t<false, false, __nv_dl_tag<void (*)(at::TensorIterator&, __nv_hdl_wrapper_t<false, true, __nv_dl_tag<void (*)(at::TensorIterator&), &at::native::mul_kernel_cuda, 5u>, float (float, float)> const&), &(void at::native::gpu_kernel_with_scalars<__nv_hdl_wrapper_t<false, true, __nv_dl_tag<void (*)(at::TensorIterator&), &at::native::mul_kernel_cuda, 5u>, float (float, float)> >(at::TensorIterator&, __nv_hdl_wrapper_t<false, true, __nv_dl_tag<void (*)(at::TensorIterator&), &at::native::mul_kernel_cuda, 5u>, float (float, float)> const&)), 2u>, float (float), __nv_hdl_wrapper_t<false, true, __nv_dl_tag<void (*)(at::TensorIterator&), &at::native::mul_kernel_cuda, 5u>, float (float, float)> const, float> const&) + 0x11b (0x7feb1d62369b in /home/e2r/anaconda3/envs/e2r/lib/python3.7/site-packages/torch/lib/libtorch_cuda.so)
frame #3: void at::native::gpu_kernel_with_scalars<__nv_hdl_wrapper_t<false, true, __nv_dl_tag<void (*)(at::TensorIterator&), &at::native::mul_kernel_cuda, 5u>, float (float, float)> >(at::TensorIterator&, __nv_hdl_wrapper_t<false, true, __nv_dl_tag<void (*)(at::TensorIterator&), &at::native::mul_kernel_cuda, 5u>, float (float, float)> const&) + 0x3a7 (0x7feb1d6269c7 in /home/e2r/anaconda3/envs/e2r/lib/python3.7/site-packages/torch/lib/libtorch_cuda.so)
frame #4: at::native::mul_kernel_cuda(at::TensorIterator&) + 0x167 (0x7feb1d5883b7 in /home/e2r/anaconda3/envs/e2r/lib/python3.7/site-packages/torch/lib/libtorch_cuda.so)
frame #5: <unknown function> + 0x92ab03 (0x7feb4e3f0b03 in /home/e2r/anaconda3/envs/e2r/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so)
frame #6: at::native::mul_out(at::Tensor&, at::Tensor const&, at::Tensor const&) + 0x41 (0x7feb4e3e3181 in /home/e2r/anaconda3/envs/e2r/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so)
frame #7: <unknown function> + 0xabbf9c (0x7feb55041f9c in /home/e2r/anaconda3/envs/e2r/lib/python3.7/site-packages/torch/lib/libtorch_python.so)
frame #8: c10d::Reducer::mark_variable_ready_dense(c10d::Reducer::VariableIndex) + 0x87 (0x7feb5503e497 in /home/e2r/anaconda3/envs/e2r/lib/python3.7/site-packages/torch/lib/libtorch_python.so)
frame #9: c10d::Reducer::mark_variable_ready(c10d::Reducer::VariableIndex) + 0x111 (0x7feb55042fe1 in /home/e2r/anaconda3/envs/e2r/lib/python3.7/site-packages/torch/lib/libtorch_python.so)
frame #10: c10d::Reducer::autograd_hook(c10d::Reducer::VariableIndex) + 0xeb (0x7feb55043bdb in /home/e2r/anaconda3/envs/e2r/lib/python3.7/site-packages/torch/lib/libtorch_python.so)
frame #11: <unknown function> + 0xabdd16 (0x7feb55043d16 in /home/e2r/anaconda3/envs/e2r/lib/python3.7/site-packages/torch/lib/libtorch_python.so)
frame #12: <unknown function> + 0xac4dc6 (0x7feb5504adc6 in /home/e2r/anaconda3/envs/e2r/lib/python3.7/site-packages/torch/lib/libtorch_python.so)
frame #13: torch::autograd::Engine::evaluate_function(std::shared_ptr<torch::autograd::GraphTask>&, torch::autograd::Node*, torch::autograd::InputBuffer&, std::shared_ptr<torch::autograd::ReadyQueue> const&) + 0x4dd (0x7feb50b9193d in /home/e2r/anaconda3/envs/e2r/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so)
frame #14: torch::autograd::Engine::thread_main(std::shared_ptr<torch::autograd::GraphTask> const&) + 0x451 (0x7feb50b93401 in /home/e2r/anaconda3/envs/e2r/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so)
frame #15: torch::autograd::Engine::thread_init(int, std::shared_ptr<torch::autograd::ReadyQueue> const&, bool) + 0x89 (0x7feb50b8b579 in /home/e2r/anaconda3/envs/e2r/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so)
frame #16: torch::autograd::python::PythonEngine::thread_init(int, std::shared_ptr<torch::autograd::ReadyQueue> const&, bool) + 0x4a (0x7feb54ab099a in /home/e2r/anaconda3/envs/e2r/lib/python3.7/site-packages/torch/lib/libtorch_python.so)
frame #17: <unknown function> + 0xc8163 (0x7feb870a4163 in /home/e2r/anaconda3/envs/e2r/lib/python3.7/site-packages/torch/lib/../../../.././libstdc++.so.6)
frame #18: <unknown function> + 0x76db (0x7feb8d5d36db in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #19: clone + 0x3f (0x7feb8d2fca3f in /lib/x86_64-linux-gnu/libc.so.6)
terminate called after throwing an instance of 'c10::Error'
what(): CUDA error: an illegal memory access was encountered
Exception raised from create_event_internal at /opt/conda/conda-bld/pytorch_1595629403081/work/c10/cuda/CUDACachingAllocator.cpp:687 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x4d (0x7feb7c69177d in /home/e2r/anaconda3/envs/e2r/lib/python3.7/site-packages/torch/lib/libc10.so)
frame #1: c10::cuda::CUDACachingAllocator::raw_delete(void*) + 0xb5d (0x7feb7c8e1d9d in /home/e2r/anaconda3/envs/e2r/lib/python3.7/site-packages/torch/lib/libc10_cuda.so)
frame #2: c10::TensorImpl::release_resources() + 0x4d (0x7feb7c67db1d in /home/e2r/anaconda3/envs/e2r/lib/python3.7/site-packages/torch/lib/libc10.so)
frame #3: std::vector<c10d::Reducer::Bucket, std::allocator<c10d::Reducer::Bucket> >::~vector() + 0x25a (0x7feb5504b70a in /home/e2r/anaconda3/envs/e2r/lib/python3.7/site-packages/torch/lib/libtorch_python.so)
frame #4: c10d::Reducer::~Reducer() + 0x2a3 (0x7feb550409f3 in /home/e2r/anaconda3/envs/e2r/lib/python3.7/site-packages/torch/lib/libtorch_python.so)
frame #5: std::_Sp_counted_ptr<c10d::Reducer*, (__gnu_cxx::_Lock_policy)2>::_M_dispose() + 0x12 (0x7feb5501f172 in /home/e2r/anaconda3/envs/e2r/lib/python3.7/site-packages/torch/lib/libtorch_python.so)
frame #6: std::_Sp_counted_base<(__gnu_cxx::_Lock_policy)2>::_M_release() + 0x46 (0x7feb547e3346 in /home/e2r/anaconda3/envs/e2r/lib/python3.7/site-packages/torch/lib/libtorch_python.so)
frame #7: <unknown function> + 0xa99efb (0x7feb5501fefb in /home/e2r/anaconda3/envs/e2r/lib/python3.7/site-packages/torch/lib/libtorch_python.so)
frame #8: <unknown function> + 0x26519e (0x7feb547eb19e in /home/e2r/anaconda3/envs/e2r/lib/python3.7/site-packages/torch/lib/libtorch_python.so)
frame #9: <unknown function> + 0x26676e (0x7feb547ec76e in /home/e2r/anaconda3/envs/e2r/lib/python3.7/site-packages/torch/lib/libtorch_python.so)
frame #10: <unknown function> + 0x10e978 (0x5607dec85978 in /home/e2r/anaconda3/envs/e2r/bin/python)
frame #11: <unknown function> + 0x1a3100 (0x5607ded1a100 in /home/e2r/anaconda3/envs/e2r/bin/python)
frame #12: <unknown function> + 0x10e888 (0x5607dec85888 in /home/e2r/anaconda3/envs/e2r/bin/python)
frame #13: <unknown function> + 0x1a3100 (0x5607ded1a100 in /home/e2r/anaconda3/envs/e2r/bin/python)
frame #14: <unknown function> + 0xfdfc8 (0x5607dec74fc8 in /home/e2r/anaconda3/envs/e2r/bin/python)
frame #15: <unknown function> + 0x10f147 (0x5607dec86147 in /home/e2r/anaconda3/envs/e2r/bin/python)
frame #16: <unknown function> + 0x10f15d (0x5607dec8615d in /home/e2r/anaconda3/envs/e2r/bin/python)
frame #17: <unknown function> + 0x10f15d (0x5607dec8615d in /home/e2r/anaconda3/envs/e2r/bin/python)
frame #18: <unknown function> + 0x10f15d (0x5607dec8615d in /home/e2r/anaconda3/envs/e2r/bin/python)
frame #19: PyDict_SetItem + 0x502 (0x5607decdb172 in /home/e2r/anaconda3/envs/e2r/bin/python)
frame #20: PyDict_SetItemString + 0x4f (0x5607decdbc4f in /home/e2r/anaconda3/envs/e2r/bin/python)
frame #21: PyImport_Cleanup + 0xa0 (0x5607ded20760 in /home/e2r/anaconda3/envs/e2r/bin/python)
frame #22: Py_FinalizeEx + 0x67 (0x5607ded9b817 in /home/e2r/anaconda3/envs/e2r/bin/python)
frame #23: <unknown function> + 0x2373d3 (0x5607dedae3d3 in /home/e2r/anaconda3/envs/e2r/bin/python)
frame #24: _Py_UnixMain + 0x3c (0x5607dedae6fc in /home/e2r/anaconda3/envs/e2r/bin/python)
frame #25: __libc_start_main + 0xe7 (0x7feb8d1fcb97 in /lib/x86_64-linux-gnu/libc.so.6)
frame #26: <unknown function> + 0x1dc3c0 (0x5607ded533c0 in /home/e2r/anaconda3/envs/e2r/bin/python)
Traceback (most recent call last):
File "/home/e2r/anaconda3/envs/e2r/lib/python3.7/runpy.py", line 193, in _run_module_as_main
"__main__", mod_spec)
File "/home/e2r/anaconda3/envs/e2r/lib/python3.7/runpy.py", line 85, in _run_code
exec(code, run_globals)
File "/home/e2r/anaconda3/envs/e2r/lib/python3.7/site-packages/torch/distributed/launch.py", line 261, in <module>
main()
File "/home/e2r/anaconda3/envs/e2r/lib/python3.7/site-packages/torch/distributed/launch.py", line 257, in main
cmd=cmd)
subprocess.CalledProcessError: Command '['/home/e2r/anaconda3/envs/e2r/bin/python', '-u', 'train.py', '--local_rank=0', '--config', './config/cfg_kitti_fm.py', '--work_dir', 'logs', '--gpus', '0,1']' died with <Signals.SIGABRT: 6>.
from featdepth.
In your command, '--nproc_per_node=1' tells the launcher to start only one process (i.e., use one GPU), but you set '--gpus "0,1"', which requires 2 GPUs.
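A consistent invocation would match the number of launched processes to the number of requested GPUs, for example (untested suggestion; port, config, and work_dir are taken from your command, and the single-GPU variant assumes train.py accepts a single GPU id):

```sh
# one process per GPU, two GPUs
python -m torch.distributed.launch --master_port=9900 --nproc_per_node=2 train.py \
    --config ./config/cfg_kitti_fm.py --work_dir logs --gpus "0,1"

# or keep a single process and request only one GPU
python -m torch.distributed.launch --master_port=9900 --nproc_per_node=1 train.py \
    --config ./config/cfg_kitti_fm.py --work_dir logs --gpus "0"
```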
from featdepth.
Did you fix your problem? Can I close this issue now?
from featdepth.
Hi, I think I ran into the same problem here.
My command is
python3 -m torch.distributed.launch --master_port=9900 --nproc_per_node=1 train.py --config config/cfg_kitti_fm.py --work_dir /data/featdepth_logs
and my error log is
/usr/local/lib/python3.8/dist-packages/torch/distributed/launch.py:178: FutureWarning: The module torch.distributed.launch is deprecated
and will be removed in future. Use torchrun.
Note that --use_env is set by default in torchrun.
If your script expects `--local_rank` argument to be set, please
change it to read from `os.environ['LOCAL_RANK']` instead. See
https://pytorch.org/docs/stable/distributed.html#launch-utility for
further instructions
warnings.warn(
config/cfg_kitti_fm.py
cfg is Config (path: config/cfg_kitti_fm.py): {'DEPTH_LAYERS': 50, 'POSE_LAYERS': 18, 'FRAME_IDS': [0, -1, 1, 's'], 'IMGS_PER_GPU': 2, 'HEIGHT': 320, 'WIDTH': 1024, 'data': {'name': 'kitti', 'split': 'exp', 'height': 320, 'width': 1024, 'frame_ids': [0, -1, 1, 's'], 'in_path': '/data/kitti_data', 'gt_depth_path': '/data//kitti_data/gt_depths.npz', 'png': False, 'stereo_scale': True}, 'model': {'name': 'mono_fm', 'depth_num_layers': 50, 'pose_num_layers': 18, 'frame_ids': [0, -1, 1, 's'], 'imgs_per_gpu': 2, 'height': 320, 'width': 1024, 'scales': [0, 1, 2, 3], 'min_depth': 0.1, 'max_depth': 100.0, 'depth_pretrained_path': '/data/weights/resnet50.pth', 'pose_pretrained_path': '/data/weights/resnet18.pth', 'extractor_pretrained_path': '/data/autoencoder.pth', 'automask': False, 'disp_norm': False, 'perception_weight': 0.001, 'smoothness_weight': 0.001}, 'resume_from': None, 'finetune': None, 'total_epochs': 40, 'imgs_per_gpu': 2, 'learning_rate': 0.0001, 'workers_per_gpu': 4, 'validate': True, 'optimizer': {'type': 'Adam', 'lr': 0.0001, 'weight_decay': 0}, 'optimizer_config': {'grad_clip': {'max_norm': 35, 'norm_type': 2}}, 'lr_config': {'policy': 'step', 'warmup': 'linear', 'warmup_iters': 500, 'warmup_ratio': 0.3333333333333333, 'step': [20, 30], 'gamma': 0.5}, 'checkpoint_config': {'interval': 1}, 'log_config': {'interval': 50, 'hooks': [{'type': 'TextLoggerHook'}]}, 'dist_params': {'backend': 'nccl'}, 'log_level': 'INFO', 'load_from': None, 'workflow': [('train', 1)], 'work_dir': '/data/featdepth_logs', 'gpus': [0]}
2021-11-13 16:12:56,698 - INFO - Distributed training: True
2021-11-13 16:12:56,699 - INFO - Set random seed to 1024
/usr/local/lib/python3.8/dist-packages/torchvision/transforms/transforms.py:287: UserWarning: Argument interpolation should be of type InterpolationMode instead of int. Please, use InterpolationMode enum.
warnings.warn(
cfg work dir is /data/featdepth_logs
validate........................
2021-11-13 16:13:04,115 - INFO - Start running, host: root@5efa5949cbf5, work_dir: /data/featdepth_logs
2021-11-13 16:13:04,116 - INFO - workflow: [('train', 1)], max: 40 epochs
/usr/local/lib/python3.8/dist-packages/torch/nn/functional.py:4003: UserWarning: Default grid_sample and affine_grid behavior has changed to align_corners=False since 1.3.0. Please specify align_corners=True if the old behavior is desired. See the documentation of grid_sample for details.
warnings.warn(
Traceback (most recent call last):
File "train.py", line 103, in <module>
main()
File "train.py", line 93, in main
train_mono(model,
File "/home/FeatDepth/mono/apis/trainer.py", line 68, in train_mono
_dist_train(model, dataset_train, dataset_val, cfg, validate=validate)
File "/home/FeatDepth/mono/apis/trainer.py", line 177, in _dist_train
runner.run(data_loaders, cfg.workflow, cfg.total_epochs)
File "/usr/local/lib/python3.8/dist-packages/mmcv/runner/runner.py", line 380, in run
epoch_runner(data_loaders[i], **kwargs)
File "/usr/local/lib/python3.8/dist-packages/mmcv/runner/runner.py", line 277, in train
outputs = self.batch_processor(
File "/home/FeatDepth/mono/apis/trainer.py", line 29, in batch_processor
model_out, losses = model(data)
File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 1102, in _call_impl
return forward_call(*input, **kwargs)
File "/usr/local/lib/python3.8/dist-packages/torch/nn/parallel/distributed.py", line 873, in forward
if torch.is_grad_enabled() and self.reducer._rebuild_buckets():
RuntimeError: Expected to have finished reduction in the prior iteration before starting a new one. This error indicates that your module has parameters that were not used in producing loss. You can enable unused parameter detection by passing the keyword argument `find_unused_parameters=True` to `torch.nn.parallel.DistributedDataParallel`, and by
making sure all `forward` function outputs participate in calculating loss.
If you already have done the above, then the distributed data parallel module wasn't able to locate the output tensors in the return value of your module's `forward` function. Please include the loss function and the structure of the return value of `forward` of your module when reporting this issue (e.g. list, dict, iterable).
Parameter indices which did not receive grad for rank 0: 159 160 265 266
In addition, you can set the environment variable TORCH_DISTRIBUTED_DEBUG to either INFO or DETAIL to print out information about which particular parameters did not receive gradient on this rank as part of this error
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 5459) of binary: /usr/bin/python3
Traceback (most recent call last):
File "/usr/lib/python3.8/runpy.py", line 194, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/usr/lib/python3.8/runpy.py", line 87, in _run_code
exec(code, run_globals)
File "/usr/local/lib/python3.8/dist-packages/torch/distributed/launch.py", line 193, in <module>
main()
File "/usr/local/lib/python3.8/dist-packages/torch/distributed/launch.py", line 189, in main
launch(args)
File "/usr/local/lib/python3.8/dist-packages/torch/distributed/launch.py", line 174, in launch
run(args)
File "/usr/local/lib/python3.8/dist-packages/torch/distributed/run.py", line 710, in run
elastic_launch(
File "/usr/local/lib/python3.8/dist-packages/torch/distributed/launcher/api.py", line 131, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/usr/local/lib/python3.8/dist-packages/torch/distributed/launcher/api.py", line 259, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
train.py FAILED
------------------------------------------------------------
Failures:
<NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
time : 2021-11-13_16:13:14
host : 5efa5949cbf5
rank : 0 (local_rank: 0)
exitcode : 1 (pid: 5459)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
I'm running in a Docker environment with 4 GPUs, but it does not work even in the single-GPU setting. Please help.
from featdepth.