Comments (7)
No, I haven't. I may give it a try soon.
from deepspeed.
I'm hitting the same problem.
@kai-0430 Can you provide the output of nvidia-smi topo -m
@jomayeri Sure. In the 4×A100 setup, the GPUs are interconnected with NVLink. But whether or not NCCL_P2P_DISABLE=1 is set, the hang always occurs.
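For reference, the checks discussed above can be run as follows (a sketch; the training script name is a placeholder):

```shell
# Show the GPU interconnect topology (NVLink vs. PCIe paths between GPUs).
nvidia-smi topo -m

# Try ruling out peer-to-peer GPU transfers as the culprit.
# (In my case the hang occurs either way.)
NCCL_P2P_DISABLE=1 deepspeed train.py   # train.py is a placeholder
```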
Here is another observation. After I turn off validation (evaluation), a similar hang happens at the end of an epoch, at one of the following moments:
(1) At the last step.
(2) After the last step: the training loss is printed, but the program hangs and never completes.
(3) At the second-to-last step.
For (1), setting dataloader_drop_last=True seems to solve it.
For (2), I set NCCL_IB_DISABLE="1" according to this, and set report_to="none" in the training arguments due to the logger sync issue, according to this and this.
After fixing (1) and (2), training can run to completion.
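Concretely, the fixes for (1) and (2) amount to the following (a sketch; this mirrors the settings named above, nothing else is implied):

```shell
# Fix for case (2): disable the InfiniBand transport in NCCL, so it
# falls back to sockets/NVLink for inter-GPU traffic.
export NCCL_IB_DISABLE=1

# Optional: verbose NCCL logging to see which collective stalls.
export NCCL_DEBUG=INFO

# On the Python side, in the HF TrainingArguments:
#   dataloader_drop_last=True   # fix for case (1): drop the short last batch
#   report_to="none"            # fix for case (2): avoid the logger sync issue
```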
But I found that when I enlarge the dataset from 0.01B tokens to 0.1B tokens, case (3) happens. Here is the output of py-spy dump. It seems to get stuck in the backward pass; after 10 minutes, NCCL times out with opType ALLREDUCE.
Thread 36455 (idle): "MainThread"
backward (torch/autograd/__init__.py:266)
backward (torch/_tensor.py:522)
backward (deepspeed/runtime/fp16/loss_scaler.py:63)
backward (deepspeed/runtime/zero/stage_1_and_2.py:2051)
backward (deepspeed/runtime/engine.py:1976)
wrapped_fn (deepspeed/utils/nvtx.py:15)
backward (accelerate/utils/deepspeed.py:166)
backward (accelerate/accelerator.py:1995)
training_step (transformers/trainer.py:3045)
_inner_training_loop (transformers/trainer.py:2118)
train (transformers/trainer.py:1780)
main (llama2_ds_v3.py:232)
<module> (llama2_ds_v3.py:240)
Thread 36635 (idle): "Thread-1"
wait (threading.py:331)
wait (threading.py:629)
run (tqdm/_monitor.py:60)
_bootstrap_inner (threading.py:1045)
_bootstrap (threading.py:1002)
Thread 1020 (idle): "Thread-11"
wait (threading.py:331)
wait (threading.py:629)
run (tqdm/_monitor.py:60)
_bootstrap_inner (threading.py:1045)
_bootstrap (threading.py:1002)
Thread 2867 (idle): "Thread-15 (_pin_memory_loop)"
select (selectors.py:415)
wait (multiprocessing/connection.py:947)
_poll (multiprocessing/connection.py:440)
poll (multiprocessing/connection.py:257)
get (multiprocessing/queues.py:113)
do_one_step (torch/utils/data/_utils/pin_memory.py:30)
_pin_memory_loop (torch/utils/data/_utils/pin_memory.py:53)
run (threading.py:982)
_bootstrap_inner (threading.py:1045)
_bootstrap (threading.py:1002)
Thread 2938 (idle): "QueueFeederThread"
wait (threading.py:327)
_feed (multiprocessing/queues.py:231)
run (threading.py:982)
_bootstrap_inner (threading.py:1045)
_bootstrap (threading.py:1002)
Thread 2939 (idle): "QueueFeederThread"
wait (threading.py:327)
_feed (multiprocessing/queues.py:231)
run (threading.py:982)
_bootstrap_inner (threading.py:1045)
_bootstrap (threading.py:1002)
Thread 2940 (idle): "QueueFeederThread"
wait (threading.py:327)
_feed (multiprocessing/queues.py:231)
run (threading.py:982)
_bootstrap_inner (threading.py:1045)
_bootstrap (threading.py:1002)
Thread 2941 (idle): "QueueFeederThread"
wait (threading.py:327)
_feed (multiprocessing/queues.py:231)
run (threading.py:982)
_bootstrap_inner (threading.py:1045)
_bootstrap (threading.py:1002)
Thread 2942 (idle): "QueueFeederThread"
wait (threading.py:327)
_feed (multiprocessing/queues.py:231)
run (threading.py:982)
_bootstrap_inner (threading.py:1045)
_bootstrap (threading.py:1002)
Thread 2943 (idle): "QueueFeederThread"
wait (threading.py:327)
_feed (multiprocessing/queues.py:231)
run (threading.py:982)
_bootstrap_inner (threading.py:1045)
_bootstrap (threading.py:1002)
Thread 2944 (idle): "QueueFeederThread"
wait (threading.py:327)
_feed (multiprocessing/queues.py:231)
run (threading.py:982)
_bootstrap_inner (threading.py:1045)
_bootstrap (threading.py:1002)
Thread 2945 (idle): "QueueFeederThread"
wait (threading.py:327)
_feed (multiprocessing/queues.py:231)
run (threading.py:982)
_bootstrap_inner (threading.py:1045)
_bootstrap (threading.py:1002)
Thread 2963 (idle)
Thread 2964 (idle)
Thread 2962 (active)
_flatten_dense_tensors (torch/_utils.py:526)
allreduce_bucket (deepspeed/runtime/zero/stage_1_and_2.py:1477)
allreduce_and_copy_with_multiple_ranks (deepspeed/runtime/zero/stage_1_and_2.py:1000)
allreduce_and_scatter (deepspeed/runtime/zero/stage_1_and_2.py:1027)
average_tensor (deepspeed/runtime/zero/stage_1_and_2.py:1123)
reduce_ipg_grads (deepspeed/runtime/zero/stage_1_and_2.py:1363)
reduce_independent_p_g_buckets_and_remove_grads (deepspeed/runtime/zero/stage_1_and_2.py:928)
reduce_ready_partitions_and_remove_grads (deepspeed/runtime/zero/stage_1_and_2.py:1412)
reduce_partition_and_remove_grads (deepspeed/runtime/zero/stage_1_and_2.py:899)
backward (torch/autograd/__init__.py:266)
backward (torch/utils/checkpoint.py:319)
apply (torch/autograd/function.py:289)
Thread 2965 (idle)
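A hang at the last or second-to-last step is a classic symptom of ranks disagreeing on the number of batches, so one rank issues an extra allreduce that the others never join. A minimal, framework-free sketch of the arithmetic (a simplified contiguous-shard model, not the exact HF/DeepSpeed sampler; all numbers are made up for illustration):

```python
def batches_per_rank(num_samples, per_device_batch, world_size, drop_last):
    """Number of batches each rank sees under a simple contiguous shard split."""
    # Split samples across ranks as evenly as possible (low ranks get extras).
    base, extra = divmod(num_samples, world_size)
    shard_sizes = [base + (1 if r < extra else 0) for r in range(world_size)]
    counts = []
    for n in shard_sizes:
        full, rem = divmod(n, per_device_batch)
        # Without drop_last, a leftover partial batch adds one more step.
        counts.append(full if (drop_last or rem == 0) else full + 1)
    return counts

# Without drop_last, ranks can disagree on the step count:
print(batches_per_rank(1001, 10, 4, drop_last=False))  # → [26, 25, 25, 25]
# With drop_last=True, every rank runs the same number of steps:
print(batches_per_rank(1001, 10, 4, drop_last=True))   # → [25, 25, 25, 25]
```

When rank 0 runs a 26th backward step, its allreduce blocks forever waiting for the other ranks, which matches the ALLREDUCE timeout in the dump above.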
This seems to be a systems issue. If you run without DeepSpeed, does the hang also occur?
Thanks for your reply! @jomayeri
If I run training without DeepSpeed (using 4 V100s, but with only one active at a time), the hang does not occur.
I was curious whether this is a DeepSpeed issue, so I tried another distributed training method: the FSDP integration in the Accelerate package. Surprisingly, the hang occurs there too! So in my case the issue is not specific to DeepSpeed.
So, what system issues could be causing it? I want to find possible solutions to make it work.
I suspect the issue may lie in accelerate / transformers, so I filed a related issue here.
Have you tried using PyTorch's native FSDP API to run the parallel training?