
Comments (7)

kai-0430 commented on May 18, 2024

No, I haven't. Maybe I'll try it these days.

jacklanda commented on May 18, 2024

The same problem

jomayeri commented on May 18, 2024

@kai-0430 Can you provide the output of nvidia-smi topo -m?

kai-0430 commented on May 18, 2024

@jomayeri Sure. In the 4×A100 setting, the GPUs are interconnected with NVLink. But whether NCCL_P2P_DISABLE=1 is set or not, the hang always occurs.
[Screenshot: nvidia-smi topo -m output on TWCC]
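(For reference, a sketch of one way the P2P toggle can be applied; the assumption is that the variable is visible before torch.distributed / NCCL initializes, whether exported in the shell or set at the top of the training script.)

```python
# Sketch only: one way to disable NVLink / P2P transfers for debugging.
# Assumption: this runs before any torch.distributed / NCCL initialization.
import os
os.environ["NCCL_P2P_DISABLE"] = "1"
```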

Here is another issue.
After I turn off validation (evaluation), a similar situation happens at the end of an epoch. It hangs at one of the following points:
(1) At the last step.
(2) After the last step: the training loss is shown, but the program hangs and fails to complete.
(3) At the second-to-last step.

For (1), setting dataloader_drop_last=True seems to solve it.
For (2), I set NCCL_IB_DISABLE="1" according to this, and set report_to="none" in the training arguments because of the logger sync issue, according to this and this.
After solving (1) and (2), it appears that the training can complete (a rough sketch of these settings is below).
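A rough sketch of how these settings fit together (the paths and batch sizes below are illustrative, not my actual configuration):

```python
# Rough sketch of the workarounds only, not my actual training script.
# Assumptions: "./out" and "ds_config_zero2.json" are illustrative paths.
import os

# Workaround for (2): disable the InfiniBand transport before NCCL initializes.
os.environ["NCCL_IB_DISABLE"] = "1"

from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="./out",
    num_train_epochs=1,                 # illustrative value
    per_device_train_batch_size=4,      # illustrative value
    dataloader_drop_last=True,          # workaround for (1): drop the ragged last batch
    report_to="none",                   # workaround for (2): avoid the logger sync issue
    deepspeed="ds_config_zero2.json",   # ZeRO stage 1/2 config, path is illustrative
)
# Trainer(model=model, args=args, train_dataset=train_dataset).train()
```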
But I found that when I enlarge the dataset from 0.01B tokens to 0.1B tokens, case (3) happens. Here is the output of py-spy dump. It seems to get stuck in the backward pass. Then, after 10 minutes, NCCL times out with opType ALLREDUCE.

Thread 36455 (idle): "MainThread"
    backward (torch/autograd/__init__.py:266)
    backward (torch/_tensor.py:522)
    backward (deepspeed/runtime/fp16/loss_scaler.py:63)
    backward (deepspeed/runtime/zero/stage_1_and_2.py:2051)
    backward (deepspeed/runtime/engine.py:1976)
    wrapped_fn (deepspeed/utils/nvtx.py:15)
    backward (accelerate/utils/deepspeed.py:166)
    backward (accelerate/accelerator.py:1995)
    training_step (transformers/trainer.py:3045)
    _inner_training_loop (transformers/trainer.py:2118)
    train (transformers/trainer.py:1780)
    main (llama2_ds_v3.py:232)
    <module> (llama2_ds_v3.py:240)
Thread 36635 (idle): "Thread-1"
    wait (threading.py:331)
    wait (threading.py:629)
    run (tqdm/_monitor.py:60)
    _bootstrap_inner (threading.py:1045)
    _bootstrap (threading.py:1002)
Thread 1020 (idle): "Thread-11"
    wait (threading.py:331)
    wait (threading.py:629)
    run (tqdm/_monitor.py:60)
    _bootstrap_inner (threading.py:1045)
    _bootstrap (threading.py:1002)
Thread 2867 (idle): "Thread-15 (_pin_memory_loop)"
    select (selectors.py:415)
    wait (multiprocessing/connection.py:947)
    _poll (multiprocessing/connection.py:440)
    poll (multiprocessing/connection.py:257)
    get (multiprocessing/queues.py:113)
    do_one_step (torch/utils/data/_utils/pin_memory.py:30)
    _pin_memory_loop (torch/utils/data/_utils/pin_memory.py:53)
    run (threading.py:982)
    _bootstrap_inner (threading.py:1045)
    _bootstrap (threading.py:1002)
Thread 2938 (idle): "QueueFeederThread"
    wait (threading.py:327)
    _feed (multiprocessing/queues.py:231)
    run (threading.py:982)
    _bootstrap_inner (threading.py:1045)
    _bootstrap (threading.py:1002)
Thread 2939 (idle): "QueueFeederThread"
    wait (threading.py:327)
    _feed (multiprocessing/queues.py:231)
    run (threading.py:982)
    _bootstrap_inner (threading.py:1045)
    _bootstrap (threading.py:1002)
Thread 2940 (idle): "QueueFeederThread"
    wait (threading.py:327)
    _feed (multiprocessing/queues.py:231)
    run (threading.py:982)
    _bootstrap_inner (threading.py:1045)
    _bootstrap (threading.py:1002)
Thread 2941 (idle): "QueueFeederThread"
    wait (threading.py:327)
    _feed (multiprocessing/queues.py:231)
    run (threading.py:982)
    _bootstrap_inner (threading.py:1045)
    _bootstrap (threading.py:1002)
Thread 2942 (idle): "QueueFeederThread"
    wait (threading.py:327)
    _feed (multiprocessing/queues.py:231)
    run (threading.py:982)
    _bootstrap_inner (threading.py:1045)
    _bootstrap (threading.py:1002)
Thread 2943 (idle): "QueueFeederThread"
    wait (threading.py:327)
    _feed (multiprocessing/queues.py:231)
    run (threading.py:982)
    _bootstrap_inner (threading.py:1045)
    _bootstrap (threading.py:1002)
Thread 2944 (idle): "QueueFeederThread"
    wait (threading.py:327)
    _feed (multiprocessing/queues.py:231)
    run (threading.py:982)
    _bootstrap_inner (threading.py:1045)
    _bootstrap (threading.py:1002)
Thread 2945 (idle): "QueueFeederThread"
    wait (threading.py:327)
    _feed (multiprocessing/queues.py:231)
    run (threading.py:982)
    _bootstrap_inner (threading.py:1045)
    _bootstrap (threading.py:1002)
Thread 2963 (idle)
Thread 2964 (idle)
Thread 2962 (active)
    _flatten_dense_tensors (torch/_utils.py:526)
    allreduce_bucket (deepspeed/runtime/zero/stage_1_and_2.py:1477)
    allreduce_and_copy_with_multiple_ranks (deepspeed/runtime/zero/stage_1_and_2.py:1000)
    allreduce_and_scatter (deepspeed/runtime/zero/stage_1_and_2.py:1027)
    average_tensor (deepspeed/runtime/zero/stage_1_and_2.py:1123)
    reduce_ipg_grads (deepspeed/runtime/zero/stage_1_and_2.py:1363)
    reduce_independent_p_g_buckets_and_remove_grads (deepspeed/runtime/zero/stage_1_and_2.py:928)
    reduce_ready_partitions_and_remove_grads (deepspeed/runtime/zero/stage_1_and_2.py:1412)
    reduce_partition_and_remove_grads (deepspeed/runtime/zero/stage_1_and_2.py:899)
    backward (torch/autograd/__init__.py:266)
    backward (torch/utils/checkpoint.py:319)
    apply (torch/autograd/function.py:289)
Thread 2965 (idle)

jomayeri commented on May 18, 2024

This seems to be a systems issue. If you run without DeepSpeed, does the hang also occur?

kai-0430 commented on May 18, 2024

Thanks for your reply, @jomayeri!
If I run training without DeepSpeed (using 4 V100s, but with only one active at a time), the hang does not occur.
I was curious whether this is a DeepSpeed issue or not, so I tried another distributed training method, the FSDP integration in the Accelerate package. Surprisingly, the hang also occurs! So in my case the issue is not specific to DeepSpeed.
So, what system issues could be causing it? I want to figure out possible solutions to make it work.
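For reference, a minimal NCCL-only all-reduce test (independent of DeepSpeed, Accelerate, and Transformers) could help tell a communication problem apart from a framework problem. A sketch, assuming a torchrun launch:

```python
# nccl_check.py -- sketch of a standalone NCCL all-reduce sanity check.
# Launch (assumption): torchrun --nproc_per_node=4 nccl_check.py
import os
import torch
import torch.distributed as dist

def main():
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    # Each rank contributes a tensor of ones; after each all-reduce (SUM) the
    # tensor holds world_size, then it is normalized back to ones.
    x = torch.ones(1024, 1024, device="cuda")
    for _ in range(100):
        dist.all_reduce(x)
        x /= dist.get_world_size()
    torch.cuda.synchronize()
    print(f"rank {dist.get_rank()}: all-reduce loop finished, value={x[0, 0].item()}")

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```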

jacklanda commented on May 18, 2024

I guess the issue could be happening in accelerate / transformers. Hence, I filed a related issue here.

Have you tried using the native FSDP API of PyTorch to conduct the parallel training, as you would with DDP?
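Something along these lines, for example (a rough sketch with a toy model in place of LLaMA-2, just to show the native wrapping; it assumes a torchrun --nproc_per_node=4 launch):

```python
# Rough sketch of native PyTorch FSDP usage (no Accelerate / DeepSpeed).
# Launch (assumption): torchrun --nproc_per_node=4 train_fsdp.py
# The Sequential model below is a toy stand-in, not LLaMA-2; a real run would
# also need an auto-wrap policy and mixed-precision settings.
import os
import torch
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

dist.init_process_group(backend="nccl")
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)

model = torch.nn.Sequential(
    torch.nn.Linear(1024, 4096),
    torch.nn.ReLU(),
    torch.nn.Linear(4096, 1024),
).to(local_rank)
model = FSDP(model, device_id=local_rank)

# Optimizer is created after wrapping, as FSDP shards the parameters.
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

# One illustrative training step on random data.
x = torch.randn(8, 1024, device=local_rank)
loss = model(x).pow(2).mean()
loss.backward()
optimizer.step()

dist.destroy_process_group()
```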
