Comments (7)

Adel-Moumen commented on June 3, 2024

Could you please try with the SpeechBrain version available in the develop branch and get back to me with the results? We fixed several issues with DDP in this new version.

You can install it with the following command:

pip install git+https://github.com/speechbrain/speechbrain.git@develop

from speechbrain.

pplantinga commented on June 3, 2024

Hi, thanks for your very detailed investigation of this issue; it makes it much easier to debug and fix on our side. Here are my responses to the three issues:

  1. Yes, this was an issue, and we have fixed it.
  2. This approach should be unnecessary; it should "just work" because the default saving function is marked with @main_process_only (see this line). However, I have opened PR #2404 based on this feedback to enable this approach to work, though you'd have to use a @main_process_only function rather than if_main_process.
  3. I don't think this is the right place to insert the print statement. Instead, try putting it inside the default saving function (same line as above). The issue should no longer occur; if it does, please let us know.
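
To illustrate the difference mentioned in point 2, here is a minimal sketch of a decorator-style guard built on top of a boolean check. The two helpers below are simplified stand-ins that only reuse the names from this thread; they are not SpeechBrain's actual implementation (which may also handle synchronization between workers), and save_checkpoint is a hypothetical hook. The sketch assumes the process rank is exposed through the RANK environment variable.

```python
import functools
import os


def if_main_process():
    """Stand-in check: True when RANK is 0 or unset (assumption)."""
    return int(os.environ.get("RANK", "0")) == 0


def main_process_only(function):
    """Stand-in decorator: run the wrapped function on the main process only."""

    @functools.wraps(function)
    def wrapper(*args, **kwargs):
        if if_main_process():
            return function(*args, **kwargs)
        return None  # every other rank skips the call entirely

    return wrapper


@main_process_only
def save_checkpoint(path, written):
    """Hypothetical save hook; records the path instead of writing a file."""
    written.append(path)
    return path
```

With the decorator, the guard travels with the function, so callers on every rank can call save_checkpoint unconditionally and only the main process does the work.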

Adel-Moumen commented on June 3, 2024

Hello @kokamido, thanks for opening this issue! Could you please let us know whether your SpeechBrain version is from the main branch or the develop branch? How did you install SpeechBrain: through pip install speechbrain or git clone? Thanks.

I'm pinging @pplantinga again, as this is a very important issue.

kokamido commented on June 3, 2024

I installed speechbrain==0.5.16 via pip.
To add the print described in the "Multiple writings of the same checkpoint" section, I modified the /usr/local/lib/python3.10/dist-packages/speechbrain/utils/checkpoints.py file of the pip-installed speechbrain package.

kokamido commented on June 3, 2024

I tested the develop version of the speechbrain package, installed with pip install git+https://github.com/speechbrain/speechbrain.git@develop

1. Write intra-epoch checkpoints only

This seems fixed. With speechbrain==0.5.16 from pip, training crashes after a few epochs, but with the develop version it ran for 100 epochs without problems. I think this means the issue is fixed in the develop branch.

2. Write end-of-epoch checkpoints in main thread only.

No changes. Both setups (with and without TORCH_DISTRIBUTED_DEBUG=DETAIL) behave as described in the issue.

3. Write end-of-epoch checkpoints in all threads.

No changes. Both DDP workers write a checkpoint, according to the logs from print(f'{os.environ.get("LOCAL_RANK")}\t{ckpt_dir}/{name}') injected at this line.

100%|██████████| 160/160 [00:01<00:00, 153.53it/s, train_loss=0.68] 
0       experiments/ddp_crash_repro/save/CKPT+2024-02-10+13-30-56+00/counter
0       experiments/ddp_crash_repro/save/CKPT+2024-02-10+13-30-56+00/brain
1       experiments/ddp_crash_repro/save/CKPT+2024-02-10+13-30-56+00/counter
1       experiments/ddp_crash_repro/save/CKPT+2024-02-10+13-30-56+00/brain
1       experiments/ddp_crash_repro/save/CKPT+2024-02-10+13-30-56+00/optimizer
0       experiments/ddp_crash_repro/save/CKPT+2024-02-10+13-30-56+00/optimizer
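
For anyone who wants to check a longer log the same way, here is a rough sketch that groups the printed lines by path and reports every file written by more than one rank. It assumes the "rank, whitespace, path" format produced by the print above; ranks_per_file is a hypothetical helper name, not part of SpeechBrain.

```python
from collections import defaultdict

# Log lines copied from the output above (rank, tab, path).
LOG = """\
0\texperiments/ddp_crash_repro/save/CKPT+2024-02-10+13-30-56+00/counter
0\texperiments/ddp_crash_repro/save/CKPT+2024-02-10+13-30-56+00/brain
1\texperiments/ddp_crash_repro/save/CKPT+2024-02-10+13-30-56+00/counter
1\texperiments/ddp_crash_repro/save/CKPT+2024-02-10+13-30-56+00/brain
1\texperiments/ddp_crash_repro/save/CKPT+2024-02-10+13-30-56+00/optimizer
0\texperiments/ddp_crash_repro/save/CKPT+2024-02-10+13-30-56+00/optimizer
"""


def ranks_per_file(log_text):
    """Map each logged path to the set of ranks that reported writing it."""
    writers = defaultdict(set)
    for line in log_text.splitlines():
        parts = line.split(None, 1)
        # Skip progress bars and anything else that is not "<rank> <path>".
        if len(parts) != 2 or not parts[0].isdigit():
            continue
        rank, path = parts
        writers[path.strip()].add(int(rank))
    return dict(writers)


# Keep only the files that more than one rank wrote.
duplicates = {
    path: sorted(ranks)
    for path, ranks in ranks_per_file(LOG).items()
    if len(ranks) > 1
}

for path, ranks in duplicates.items():
    print(f"{path} written by ranks {ranks}")
```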

kokamido commented on June 3, 2024

Thanks for the clarification. Now I understand how the checkpoints should be saved, and I have no more questions.

Adel-Moumen commented on June 3, 2024

Solved in #2404
