
RuntimeError: merge_sort: failed to synchronize: cudaErrorIllegalAddress: an illegal memory access was encountered (slidr, closed)

jaycheney commented on September 4, 2024
RuntimeError: merge_sort: failed to synchronize: cudaErrorIllegalAddress: an illegal memory access was encountered

from slidr.

Comments (13)

CSautier commented on September 4, 2024

Hi, have you been modifying the code? Because Line 124 in "SLidR/pretrain/lightning_trainer.py" isn't supposed to be
"k = one_hot_P @ output_points[batch["pairing_points"]]".
This error is typical of incorrect indexing. It can happen in the superpixel indices, the pairing of the points, or the creation of the sparse matrices.
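
To rule out the indexing hypothesis, a bounds check before the failing line can help. The following is only a minimal sketch: `check_index_range` is a hypothetical helper that is not part of SLidR, and the numbers in the demo calls are illustrative. Note that CUDA reports errors asynchronously, so the line shown in a traceback is not always the one at fault.

```python
def check_index_range(name, lowest, highest, size):
    """Raise IndexError unless every index in [lowest, highest] lies in [0, size)."""
    if lowest < 0 or highest >= size:
        raise IndexError(
            f"{name}: indices span [{lowest}, {highest}] "
            f"but only {size} rows are available"
        )


# Assumed usage with the tensors named in the traceback (not run here):
#   pairing = batch["pairing_points"]
#   check_index_range("pairing_points", int(pairing.min()), int(pairing.max()),
#                     output_points.shape[0])

# Illustrative values: 44516 rows, so valid indices are 0..44515.
check_index_range("pairing_points", 0, 44515, 44516)  # passes silently
try:
    check_index_range("pairing_points", 0, 44516, 44516)  # one past the end
except IndexError as err:
    print(err)
```

If a check like this passes for every batch, the indices are probably fine and the cause lies elsewhere.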

jaycheney commented on September 4, 2024

Thank you for your reply. I didn't modify the original code (only some comments), but when I run pretrain.py with slidr_minkunet.yaml, the error RuntimeError: merge_sort: failed to synchronize: cudaErrorIllegalAddress: an illegal memory access was encountered still occurs. I checked k = one_hot_P @ output_points[batch["pairing_points"]]: the shapes are one_hot_P (3600, 45771), output_points (44516, 64) and pairing_points (45771,).
However, when I run pretrain.py with config/slidr_voxelnet.yaml, there is no problem.

BTW, could you kindly share, via email, the code used to fine-tune object detection models from OpenPCDet? I'm still reproducing the results of SLidR, and it would help me a lot.
Originally posted by @CSautier in #3 (comment)

ZhengLeon commented on September 4, 2024

I ran into this problem too. Did you solve it? @JakeVander

andrewcaunes commented on September 4, 2024

Same problem here.

modifierT commented on September 4, 2024

Same problem here.

CSautier commented on September 4, 2024

Could you please tell me exactly what you are running? I will try to add a Dockerfile to set up a working environment, and see whether I can either reproduce the issue or specify an environment in which it does not occur. However, I'm not sure a single Dockerfile will suit every configuration, since MinkowskiEngine can be particularly painful to compile.
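
For anyone reporting their setup, something like the sketch below could collect the relevant versions in one go. It uses only the standard library (`importlib.metadata`, Python 3.8+). The function name `installed_versions` is hypothetical, and the distribution names in the list are assumptions that may differ depending on how each package was installed.

```python
from importlib import metadata


def installed_versions(packages):
    """Map each distribution name to its installed version, or None if absent."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


# Assumed distribution names; adjust them to match your environment.
PACKAGES = ["torch", "MinkowskiEngine", "pytorch-lightning", "spconv"]

for name, version in installed_versions(PACKAGES).items():
    print(f"{name} == {version if version is not None else 'not installed'}")
```

The GPU model and CUDA toolkit version are not covered by this and still need to be reported separately.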

andrewcaunes commented on September 4, 2024

I managed to fix this error but don't remember exactly how, sorry.
I'm pretty sure it was by changing the PyTorch version, and I ended up using these:
pytorch=1.12.1
MinkowskiEngine=0.5.4
along with an NVIDIA V100 with cuda=12.0

Thanks for the amazing work, by the way!

modifierT commented on September 4, 2024

Thanks for reopening the issue!
I ran
python pretrain.py --cfg config/slidr_minkunet.yaml
in a terminal and got
RuntimeError: merge_sort: failed to synchronize: cudaErrorIllegalAddress: an illegal memory access was encountered
almost the same as 👇

This problem occurs when I run pretrain.py. I tried many approaches but could not resolve it. Could you help me?

I am running on 1 GPU with Ubuntu 18.04, cuDNN 8, CUDA 11.1, and the other requirements as listed in requirements.txt.

Training: -1it [00:00, ?it/s]
Training: 0%| | 0/7033 [00:00<00:00, 22671.91it/s]
Epoch 0: 0%| | 0/7033 [00:00<00:01, 3584.88it/s] Traceback (most recent call last):
File "pretrain.py", line 61, in <module>
main()
File "pretrain.py", line 57, in main
trainer.fit(module, dm)
File "/anaconda3/envs/SLidR/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 553, in fit
self._run(model)
File "/anaconda3/envs/SLidR/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 918, in _run
self._dispatch()
File "/anaconda3/envs/SLidR/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 986, in _dispatch
self.accelerator.start_training(self)
File "/anaconda3/envs/SLidR/lib/python3.7/site-packages/pytorch_lightning/accelerators/accelerator.py", line 92, in start_training
self.training_type_plugin.start_training(trainer)
File "/anaconda3/envs/SLidR/lib/python3.7/site-packages/pytorch_lightning/plugins/training_type/training_type_plugin.py", line 161, in start_training
self._results = trainer.run_stage()
File "/anaconda3/envs/SLidR/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 996, in run_stage
return self._run_train()
File "/anaconda3/envs/SLidR/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 1045, in _run_train
self.fit_loop.run()
File "/anaconda3/envs/SLidR/lib/python3.7/site-packages/pytorch_lightning/loops/base.py", line 111, in run
self.advance(*args, **kwargs)
File "/anaconda3/envs/SLidR/lib/python3.7/site-packages/pytorch_lightning/loops/fit_loop.py", line 200, in advance
epoch_output = self.epoch_loop.run(train_dataloader)
File "/anaconda3/envs/SLidR/lib/python3.7/site-packages/pytorch_lightning/loops/base.py", line 111, in run
self.advance(*args, **kwargs)
File "/anaconda3/envs/SLidR/lib/python3.7/site-packages/pytorch_lightning/loops/epoch/training_epoch_loop.py", line 131, in advance
batch_output = self.batch_loop.run(batch, self.iteration_count, self._dataloader_idx)
File "/anaconda3/envs/SLidR/lib/python3.7/site-packages/pytorch_lightning/loops/batch/training_batch_loop.py", line 100, in run
super().run(batch, batch_idx, dataloader_idx)
File "/anaconda3/envs/SLidR/lib/python3.7/site-packages/pytorch_lightning/loops/base.py", line 111, in run
self.advance(*args, **kwargs)
File "/anaconda3/envs/SLidR/lib/python3.7/site-packages/pytorch_lightning/loops/batch/training_batch_loop.py", line 147, in advance
result = self._run_optimization(batch_idx, split_batch, opt_idx, optimizer)
File "/anaconda3/envs/SLidR/lib/python3.7/site-packages/pytorch_lightning/loops/batch/training_batch_loop.py", line 201, in _run_optimization
self._optimizer_step(optimizer, opt_idx, batch_idx, closure)
File "/anaconda3/envs/SLidR/lib/python3.7/site-packages/pytorch_lightning/loops/batch/training_batch_loop.py", line 402, in _optimizer_step
using_lbfgs=is_lbfgs,
File "/anaconda3/envs/SLidR/lib/python3.7/site-packages/pytorch_lightning/core/lightning.py", line 1593, in optimizer_step
optimizer.step(closure=optimizer_closure)
File "/anaconda3/envs/SLidR/lib/python3.7/site-packages/pytorch_lightning/core/optimizer.py", line 209, in step
self.__optimizer_step(*args, closure=closure, profiler_name=profiler_name, **kwargs)
File "/anaconda3/envs/SLidR/lib/python3.7/site-packages/pytorch_lightning/core/optimizer.py", line 129, in __optimizer_step
trainer.accelerator.optimizer_step(optimizer, self._optimizer_idx, lambda_closure=closure, **kwargs)
File "/anaconda3/envs/SLidR/lib/python3.7/site-packages/pytorch_lightning/accelerators/accelerator.py", line 296, in optimizer_step
self.run_optimizer_step(optimizer, opt_idx, lambda_closure, **kwargs)
File "/anaconda3/envs/SLidR/lib/python3.7/site-packages/pytorch_lightning/accelerators/accelerator.py", line 303, in run_optimizer_step
self.training_type_plugin.optimizer_step(optimizer, lambda_closure=lambda_closure, **kwargs)
File "/anaconda3/envs/SLidR/lib/python3.7/site-packages/pytorch_lightning/plugins/training_type/training_type_plugin.py", line 226, in optimizer_step
optimizer.step(closure=lambda_closure, **kwargs)
File "/anaconda3/envs/SLidR/lib/python3.7/site-packages/torch/optim/lr_scheduler.py", line 65, in wrapper
return wrapped(*args, **kwargs)
File "/anaconda3/envs/SLidR/lib/python3.7/site-packages/torch/optim/optimizer.py", line 89, in wrapper
return func(*args, **kwargs)
File "/anaconda3/envs/SLidR/lib/python3.7/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
return func(*args, **kwargs)
File "/anaconda3/envs/SLidR/lib/python3.7/site-packages/torch/optim/sgd.py", line 87, in step
loss = closure()
File "/anaconda3/envs/SLidR/lib/python3.7/site-packages/pytorch_lightning/loops/batch/training_batch_loop.py", line 235, in _training_step_and_backward_closure
result = self.training_step_and_backward(split_batch, batch_idx, opt_idx, optimizer, hiddens)
File "/anaconda3/envs/SLidR/lib/python3.7/site-packages/pytorch_lightning/loops/batch/training_batch_loop.py", line 533, in training_step_and_backward
result = self._training_step(split_batch, batch_idx, opt_idx, hiddens)
File "/anaconda3/envs/SLidR/lib/python3.7/site-packages/pytorch_lightning/loops/batch/training_batch_loop.py", line 306, in _training_step
training_step_output = self.trainer.accelerator.training_step(step_kwargs)
File "/anaconda3/envs/SLidR/lib/python3.7/site-packages/pytorch_lightning/accelerators/accelerator.py", line 193, in training_step
return self.training_type_plugin.training_step(*step_kwargs.values())
File "/anaconda3/envs/SLidR/lib/python3.7/site-packages/pytorch_lightning/plugins/training_type/ddp.py", line 386, in training_step
return self.model(*args, **kwargs)
File "/anaconda3/envs/SLidR/lib/python3.7/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
result = self.forward(*input, **kwargs)
File "/anaconda3/envs/SLidR/lib/python3.7/site-packages/torch/nn/parallel/distributed.py", line 705, in forward
output = self.module(*inputs[0], **kwargs[0])
File "/anaconda3/envs/SLidR/lib/python3.7/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
result = self.forward(*input, **kwargs)
File "/anaconda3/envs/SLidR/lib/python3.7/site-packages/pytorch_lightning/overrides/base.py", line 82, in forward
output = self.module.training_step(*inputs, **kwargs)
File "/user/SLidR/pretrain/lightning_trainer.py", line 62, in training_step
for loss in self.losses
File "/user/SLidR/pretrain/lightning_trainer.py", line 62, in
for loss in self.losses
File "/user/SLidR/pretrain/lightning_trainer.py", line 124, in loss_superpixels_average
k = one_hot_P @ output_points[batch["pairing_points"]]
RuntimeError: merge_sort: failed to synchronize: cudaErrorIllegalAddress: an illegal memory access was encountered

Epoch 0: 0%| | 0/7033 [00:37<74:08:49, 37.95s/it]

in which

k = one_hot_P @ output_points[pairing_points]
one_hot_P: torch.Size([7200, 104372])
output_points[pairing_points]: torch.Size([104372, 64])
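
As a sanity check on the shapes above, the small sketch below confirms that the two operands are compatible for a matrix product, which suggests the failure is not a plain dimension mismatch. `matmul_shape` is a hypothetical helper written for this illustration only.

```python
def matmul_shape(left, right):
    """Return the result shape of a 2-D matrix product, or raise on mismatch."""
    if left[1] != right[0]:
        raise ValueError(f"inner dimensions differ: {left[1]} vs {right[0]}")
    return (left[0], right[1])


# Shapes reported above for one_hot_P and output_points[pairing_points].
print(matmul_shape((7200, 104372), (104372, 64)))  # (7200, 64)
```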

This problem does not occur when running python pretrain.py --cfg config/slidr_voxelnet.yaml

My environment is:

torch == 1.10.0+cu113
numpy == 1.24.1
MinkowskiEngine == 0.5.4
nuscenes-devkit == 1.1.10
pytorch_lightning == 1.4.0
multiprocess == 0.70.15
scikit-image == 0.21.0
torchvision == 0.11.1+cu113
spconv == 2.3.6
torchmetrics == 0.4.0

along with an NVIDIA A800

CSautier commented on September 4, 2024

Ok, I was actually able to reproduce the issue now. I'll see what I can do.

CSautier commented on September 4, 2024

The problem is indeed a compatibility issue between MinkowskiEngine and some versions of PyTorch + CUDA (see for instance NVIDIA/MinkowskiEngine#299).

I was able to run the code again using CUDA 11.3, torch 1.12.0, cuDNN 8, the latest commit of MinkowskiEngine, and pytorch_lightning 1.6.0.
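
To compare an environment against this combination, a small check along these lines might be useful. It is only a sketch: the dict and function names are hypothetical, only the two pinned Python packages are compared, and matching versions do not guarantee that the error goes away.

```python
# Versions reported as working in this thread (hypothetical dict name).
REPORTED_WORKING = {"torch": "1.12.0", "pytorch_lightning": "1.6.0"}


def base_version(version):
    """Drop a local build suffix such as '+cu113'."""
    return version.split("+", 1)[0]


def version_mismatches(installed, expected=REPORTED_WORKING):
    """Return {package: (installed, expected)} for every package that differs."""
    mismatches = {}
    for package, wanted in expected.items():
        found = installed.get(package)
        if found is None or base_version(found) != wanted:
            mismatches[package] = (found, wanted)
    return mismatches


# The failing environment posted earlier in the thread.
failing_env = {"torch": "1.10.0+cu113", "pytorch_lightning": "1.4.0"}
print(version_mismatches(failing_env))
```

An empty result means the two pinned packages match; the CUDA, cuDNN and MinkowskiEngine versions still have to be checked by hand.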

I will add a Dockerfile with this config to the repo.

modifierT commented on September 4, 2024

It works! Thanks a lot!

CSautier commented on September 4, 2024

Ok, I'm closing the issue. For future reference, if problems with MinkowskiEngine reappear, it could be easier to switch to torchsparse see for instance.
That would require some rewriting, especially in the dataloader, and would break compatibility with published results' weights.

Eaphan commented on September 4, 2024

I also get this error when using MinkowskiEngine at tag v0.5.4.
After checking out commit 02fc608bea4c0549b0a7b00ca1bf15dee4a0b228 and reinstalling MinkowskiEngine, the error disappears.
