Possible bottleneck? (assem-vc, open, 3 comments)

Vadim2S commented on June 16, 2024

from assem-vc.

Comments (3)

wookladin commented on June 16, 2024

Hi. In my case, I tried using distributed_backend='ddp' as that warning recommended.
However, a multi-GPU training error occurs in the following situations:

  • when the first GPU (i.e. ID 0) is not included in the GPUs list. For example:
    python synthesizer_trainer.py -g 1,2,3
  • when the GPUs list is not sequential. For example:
    python synthesizer_trainer.py -g 0,2,3

For details on the issue mentioned above, see Lightning-AI/pytorch-lightning#4171

This error comes from pytorch-lightning itself and can be resolved by upgrading it.
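As a minimal illustration (the helper name is hypothetical, not part of assem-vc), the two failing configurations above can be detected with a quick check: on the affected pytorch-lightning versions, DDP only worked when the GPU list started at 0 and was consecutive.

```python
def gpu_list_is_safe(gpu_ids):
    """Hypothetical check: return True only if the GPU ID list is
    sequential and starts at 0, which is what the affected
    pytorch-lightning versions required for DDP."""
    return gpu_ids == list(range(len(gpu_ids)))

print(gpu_list_is_safe([0, 1, 2]))  # True: sequential from 0
print(gpu_list_is_safe([1, 2, 3]))  # False: GPU 0 not included
print(gpu_list_is_safe([0, 2, 3]))  # False: list is not sequential
```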

As the warning said, using DDP together with num_workers>0 makes initialization and training faster.
If you want a speed-up in the current setting:

  1. Change accelerator='None' to 'ddp' in synthesizer_trainer.py and cotatron_trainer.py.
  2. After that, if you want to use GPUs 1, 2, and 4,
    launch with CUDA_VISIBLE_DEVICES=1,2,4 python3 synthesizer_trainer.py instead of passing the GPU option -g 1,2,4.
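The CUDA_VISIBLE_DEVICES workaround in step 2 can be sketched as below. The helper name is hypothetical; the point is that CUDA renumbers the visible GPUs as 0..N-1 inside the process, so the non-sequential-ID problem in the -g option never arises.

```python
import os

def select_gpus(gpu_ids):
    """Hypothetical launcher helper: expose only the requested GPUs via
    CUDA_VISIBLE_DEVICES (must be set before CUDA initializes). The
    process then sees them renumbered as 0..N-1, so a plain device
    count can be passed to the trainer instead of raw IDs."""
    os.environ["CUDA_VISIBLE_DEVICES"] = ",".join(str(i) for i in gpu_ids)
    return len(gpu_ids)

n_gpus = select_gpus([1, 2, 4])
print(os.environ["CUDA_VISIBLE_DEVICES"], n_gpus)  # prints "1,2,4 3"
```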


wookladin commented on June 16, 2024

In order to completely solve this problem, we need to upgrade the PyTorch Lightning module. However, there are conflicts between pl versions, so we plan to check them carefully.
Thank you for sharing the issue!


Vadim2S commented on June 16, 2024

Unfortunately, accelerator='ddp' is not stable for me; accelerator='None' is OK. Here is the traceback:

File "/home/assem-vc/synthesizer_trainer.py", line 85, in <module>
    main(args)
File "/home/assem-vc/synthesizer_trainer.py", line 64, in main
    trainer.fit(model)
File "/usr/local/lib/python3.8/dist-packages/pytorch_lightning/trainer/trainer.py", line 439, in fit
    results = self.accelerator_backend.train()
File "/usr/local/lib/python3.8/dist-packages/pytorch_lightning/accelerators/ddp_accelerator.py", line 146, in train
    results = self.ddp_train(process_idx=self.task_idx, model=model)
File "/usr/local/lib/python3.8/dist-packages/pytorch_lightning/accelerators/ddp_accelerator.py", line 279, in ddp_train
    results = self.train_or_test()
File "/usr/local/lib/python3.8/dist-packages/pytorch_lightning/accelerators/accelerator.py", line 66, in train_or_test
    results = self.trainer.train()
File "/usr/local/lib/python3.8/dist-packages/pytorch_lightning/trainer/trainer.py", line 482, in train
    self.train_loop.run_training_epoch()
File "/usr/local/lib/python3.8/dist-packages/pytorch_lightning/trainer/training_loop.py", line 541, in run_training_epoch
    batch_output = self.run_training_batch(batch, batch_idx, dataloader_idx)
File "/usr/local/lib/python3.8/dist-packages/pytorch_lightning/trainer/training_loop.py", line 704, in run_training_batch
    self.accumulated_loss.append(opt_closure_result.loss)
File "/usr/local/lib/python3.8/dist-packages/pytorch_lightning/trainer/supporters.py", line 64, in append
    x = x.to(self.memory)
RuntimeError: CUDA error: the launch timed out and was terminated

Exception in thread Thread-22:
Traceback (most recent call last):
  File "/usr/lib/python3.8/threading.py", line 932, in _bootstrap_inner
    self.run()
  File "/usr/lib/python3.8/threading.py", line 870, in run
    self._target(*self._args, **self._kwargs)
  File "/usr/local/lib/python3.8/dist-packages/torch/utils/data/_utils/pin_memory.py", line 25, in _pin_memory_loop
    r = in_queue.get(timeout=MP_STATUS_CHECK_INTERVAL)
  File "/usr/lib/python3.8/multiprocessing/queues.py", line 116, in get
terminate called after throwing an instance of 'c10::Error'
  what(): CUDA error: the launch timed out and was terminated (insert_events at /pytorch/c10/cuda/CUDACachingAllocator.cpp:764)
frame #0: c10::Error::Error(c10::SourceLocation, std::string const&) + 0x33 (0x7f3f13358193 in /usr/local/lib/python3.8/dist-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0x17f66 (0x7f3f13595f66 in /usr/local/lib/python3.8/dist-packages/torch/lib/libc10_cuda.so)
frame #2: <unknown function> + 0x19cbd (0x7f3f13597cbd in /usr/local/lib/python3.8/dist-packages/torch/lib/libc10_cuda.so)
frame #3: c10::TensorImpl::release_resources() + 0x4d (0x7f3f1334863d in /usr/local/lib/python3.8/dist-packages/torch/lib/libc10.so)
frame #4: c10d::Reducer::~Reducer() + 0x449 (0x7f3eff7e9b89 in /usr/local/lib/python3.8/dist-packages/torch/lib/libtorch_python.so)
frame #5: std::_Sp_counted_ptr<c10d::Reducer*, (__gnu_cxx::_Lock_policy)2>::_M_dispose() + 0x12 (0x7f3eff7cb592 in /usr/local/lib/python3.8/dist-packages/torch/lib/libtorch_python.so)
frame #6: std::_Sp_counted_base<(__gnu_cxx::_Lock_policy)2>::_M_release() + 0x46 (0x7f3eff034e56 in /usr/local/lib/python3.8/dist-packages/torch/lib/libtorch_python.so)
frame #7: <unknown function> + 0x9e813b (0x7f3eff7cc13b in /usr/local/lib/python3.8/dist-packages/torch/lib/libtorch_python.so)
frame #8: <unknown function> + 0x293f30 (0x7f3eff077f30 in /usr/local/lib/python3.8/dist-packages/torch/lib/libtorch_python.so)
frame #9: <unknown function> + 0x2951ce (0x7f3eff0791ce in /usr/local/lib/python3.8/dist-packages/torch/lib/libtorch_python.so)
frame #10: /usr/bin/python3() [0x5d1ca7]
frame #11: /usr/bin/python3() [0x5a605d]
frame #12: /usr/bin/python3() [0x5d1ca7]
frame #13: /usr/bin/python3() [0x5a3132]
frame #14: /usr/bin/python3() [0x4ef828]
frame #15: _PyGC_CollectNoFail + 0x2f (0x6715cf in /usr/bin/python3)
frame #16: PyImport_Cleanup + 0x244 (0x683bf4 in /usr/bin/python3)
frame #17: Py_FinalizeEx + 0x7f (0x67eaef in /usr/bin/python3)
frame #18: Py_RunMain + 0x32d (0x6b624d in /usr/bin/python3)
frame #19: Py_BytesMain + 0x2d (0x6b64bd in /usr/bin/python3)
frame #20: __libc_start_main + 0xf3 (0x7f3f1f2e30b3 in /usr/lib/x86_64-linux-gnu/libc.so.6)
frame #21: _start + 0x2e (0x5f927e in /usr/bin/python3)

