Comments (7)
No, I haven't. I may give it a try soon.
from deepspeed.
I'm hitting the same problem.
@kai-0430 Can you provide the output of nvidia-smi topo -m
@jomayeri Sure. In the 4×A100 setup, the GPUs are interconnected with NVLink. But whether or not NCCL_P2P_DISABLE=1 is set, the hang always occurs.
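For reference, the checks discussed above can be run as follows (a sketch; the training script name is a placeholder):

```shell
# Show the GPU interconnect topology (NVLink vs. PCIe paths between GPUs).
nvidia-smi topo -m

# Try ruling out peer-to-peer GPU transfers as the culprit.
# (In my case the hang occurs either way.)
NCCL_P2P_DISABLE=1 deepspeed train.py   # train.py is a placeholder
```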
Here is another observation. After I turn off validation (evaluation), a similar hang happens at the end of an epoch, at one of the following moments:
(1) At the last step.
(2) After the last step: the training loss is printed, but the program hangs and never completes.
(3) At the second-to-last step.
For (1), setting dataloader_drop_last=True seems to solve it.
For (2), I set NCCL_IB_DISABLE="1" according to this, and set report_to="none" in the training arguments due to the logger sync issue, according to this and this.
After fixing (1) and (2), training can run to completion.
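Concretely, the fixes for (1) and (2) amount to the following (a sketch; this mirrors the settings named above, nothing else is implied):

```shell
# Fix for case (2): disable the InfiniBand transport in NCCL, so it
# falls back to sockets/NVLink for inter-GPU traffic.
export NCCL_IB_DISABLE=1

# Optional: verbose NCCL logging to see which collective stalls.
export NCCL_DEBUG=INFO

# On the Python side, in the HF TrainingArguments:
#   dataloader_drop_last=True   # fix for case (1): drop the short last batch
#   report_to="none"            # fix for case (2): avoid the logger sync issue
```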
But I found that when I enlarge the dataset from 0.01B tokens to 0.1B tokens, case (3) happens. Here is the output of py-spy dump. It seems to get stuck in the backward pass; after 10 minutes, NCCL times out with opType ALLREDUCE.
Thread 36455 (idle): "MainThread"
backward (torch/autograd/__init__.py:266)
backward (torch/_tensor.py:522)
backward (deepspeed/runtime/fp16/loss_scaler.py:63)
backward (deepspeed/runtime/zero/stage_1_and_2.py:2051)
backward (deepspeed/runtime/engine.py:1976)
wrapped_fn (deepspeed/utils/nvtx.py:15)
backward (accelerate/utils/deepspeed.py:166)
backward (accelerate/accelerator.py:1995)
training_step (transformers/trainer.py:3045)
_inner_training_loop (transformers/trainer.py:2118)
train (transformers/trainer.py:1780)
main (llama2_ds_v3.py:232)
<module> (llama2_ds_v3.py:240)
Thread 36635 (idle): "Thread-1"
wait (threading.py:331)
wait (threading.py:629)
run (tqdm/_monitor.py:60)
_bootstrap_inner (threading.py:1045)
_bootstrap (threading.py:1002)
Thread 1020 (idle): "Thread-11"
wait (threading.py:331)
wait (threading.py:629)
run (tqdm/_monitor.py:60)
_bootstrap_inner (threading.py:1045)
_bootstrap (threading.py:1002)
Thread 2867 (idle): "Thread-15 (_pin_memory_loop)"
select (selectors.py:415)
wait (multiprocessing/connection.py:947)
_poll (multiprocessing/connection.py:440)
poll (multiprocessing/connection.py:257)
get (multiprocessing/queues.py:113)
do_one_step (torch/utils/data/_utils/pin_memory.py:30)
_pin_memory_loop (torch/utils/data/_utils/pin_memory.py:53)
run (threading.py:982)
_bootstrap_inner (threading.py:1045)
_bootstrap (threading.py:1002)
Thread 2938 (idle): "QueueFeederThread"
wait (threading.py:327)
_feed (multiprocessing/queues.py:231)
run (threading.py:982)
_bootstrap_inner (threading.py:1045)
_bootstrap (threading.py:1002)
Thread 2939 (idle): "QueueFeederThread"
wait (threading.py:327)
_feed (multiprocessing/queues.py:231)
run (threading.py:982)
_bootstrap_inner (threading.py:1045)
_bootstrap (threading.py:1002)
Thread 2940 (idle): "QueueFeederThread"
wait (threading.py:327)
_feed (multiprocessing/queues.py:231)
run (threading.py:982)
_bootstrap_inner (threading.py:1045)
_bootstrap (threading.py:1002)
Thread 2941 (idle): "QueueFeederThread"
wait (threading.py:327)
_feed (multiprocessing/queues.py:231)
run (threading.py:982)
_bootstrap_inner (threading.py:1045)
_bootstrap (threading.py:1002)
Thread 2942 (idle): "QueueFeederThread"
wait (threading.py:327)
_feed (multiprocessing/queues.py:231)
run (threading.py:982)
_bootstrap_inner (threading.py:1045)
_bootstrap (threading.py:1002)
Thread 2943 (idle): "QueueFeederThread"
wait (threading.py:327)
_feed (multiprocessing/queues.py:231)
run (threading.py:982)
_bootstrap_inner (threading.py:1045)
_bootstrap (threading.py:1002)
Thread 2944 (idle): "QueueFeederThread"
wait (threading.py:327)
_feed (multiprocessing/queues.py:231)
run (threading.py:982)
_bootstrap_inner (threading.py:1045)
_bootstrap (threading.py:1002)
Thread 2945 (idle): "QueueFeederThread"
wait (threading.py:327)
_feed (multiprocessing/queues.py:231)
run (threading.py:982)
_bootstrap_inner (threading.py:1045)
_bootstrap (threading.py:1002)
Thread 2963 (idle)
Thread 2964 (idle)
Thread 2962 (active)
_flatten_dense_tensors (torch/_utils.py:526)
allreduce_bucket (deepspeed/runtime/zero/stage_1_and_2.py:1477)
allreduce_and_copy_with_multiple_ranks (deepspeed/runtime/zero/stage_1_and_2.py:1000)
allreduce_and_scatter (deepspeed/runtime/zero/stage_1_and_2.py:1027)
average_tensor (deepspeed/runtime/zero/stage_1_and_2.py:1123)
reduce_ipg_grads (deepspeed/runtime/zero/stage_1_and_2.py:1363)
reduce_independent_p_g_buckets_and_remove_grads (deepspeed/runtime/zero/stage_1_and_2.py:928)
reduce_ready_partitions_and_remove_grads (deepspeed/runtime/zero/stage_1_and_2.py:1412)
reduce_partition_and_remove_grads (deepspeed/runtime/zero/stage_1_and_2.py:899)
backward (torch/autograd/__init__.py:266)
backward (torch/utils/checkpoint.py:319)
apply (torch/autograd/function.py:289)
Thread 2965 (idle)
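A hang at the last or second-to-last step is a classic symptom of ranks disagreeing on the number of batches, so one rank issues an extra allreduce that the others never join. A minimal, framework-free sketch of the arithmetic (a simplified contiguous-shard model, not the exact HF/DeepSpeed sampler; all numbers are made up for illustration):

```python
def batches_per_rank(num_samples, per_device_batch, world_size, drop_last):
    """Number of batches each rank sees under a simple contiguous shard split."""
    # Split samples across ranks as evenly as possible (low ranks get extras).
    base, extra = divmod(num_samples, world_size)
    shard_sizes = [base + (1 if r < extra else 0) for r in range(world_size)]
    counts = []
    for n in shard_sizes:
        full, rem = divmod(n, per_device_batch)
        # Without drop_last, a leftover partial batch adds one more step.
        counts.append(full if (drop_last or rem == 0) else full + 1)
    return counts

# Without drop_last, ranks can disagree on the step count:
print(batches_per_rank(1001, 10, 4, drop_last=False))  # → [26, 25, 25, 25]
# With drop_last=True, every rank runs the same number of steps:
print(batches_per_rank(1001, 10, 4, drop_last=True))   # → [25, 25, 25, 25]
```

When rank 0 runs a 26th backward step, its allreduce blocks forever waiting for the other ranks, which matches the ALLREDUCE timeout in the dump above.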
This seems to be a systems issue. If you run without DeepSpeed, does the hang also occur?
Thanks for your reply! @jomayeri
If I run training without DeepSpeed (using 4 V100s, but with only one active at a time), the hang does not occur.
I was curious whether this is a DeepSpeed issue, so I tried another distributed training method: the FSDP integration in the Accelerate package. Surprisingly, the hang occurs there too! So in my case the issue is not specific to DeepSpeed.
So, what system issues could be causing it? I want to find possible solutions to make it work.
I suspect the issue may lie in accelerate / transformers, so I filed a related issue here.
Have you tried using PyTorch's native FSDP API to run the parallel training?