
Comments (6)

yifuwang commented on June 28, 2024

Hey @briandw, can you try this small script to see if the issue reproduces?

import os

import torch
import torch.distributed as dist


if __name__ == "__main__":
    # torchrun sets these environment variables for each worker process.
    rank = int(os.environ["RANK"])
    world_size = int(os.environ["WORLD_SIZE"])
    local_rank = int(os.environ["LOCAL_RANK"])
    pid = os.getpid()

    def log(msg) -> None:
        print(f"[rank {rank}, pid {pid}] {msg}")

    # Bind this process to its own GPU before initializing NCCL.
    torch.cuda.set_device(f"cuda:{local_rank}")

    log("Initializing process group...")
    dist.init_process_group(backend="nccl")
    log("Process group initialization completed")

    log("Testing all_reduce...")
    t = torch.full((8, 8), rank, device="cuda")
    dist.all_reduce(t)
    # After the sum across ranks, every element should equal
    # 0 + 1 + ... + (world_size - 1) = world_size * (world_size - 1) // 2.
    assert t.eq(world_size * (world_size - 1) // 2).all()
    log("All_reduce completed")

Run it with torchrun --nproc_per_node=2 --monitor-interval=1 [name].py. If it hangs, it would be very helpful if you could provide a stack trace: find the PIDs in the log, attach with gdb -p [pid], then run bt.
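A side note on the assertion in the script: with the default sum op, all_reduce replaces each rank's tensor (filled with its rank id) with the elementwise sum over all ranks, so every element should equal 0 + 1 + ... + (world_size - 1). A quick pure-Python check of the closed form the assert uses:

```python
# Each rank contributes a tensor filled with its rank id, so the reduced
# result holds 0 + 1 + ... + (world_size - 1) in every element. The
# script's assert compares against the equivalent closed form.
def expected_allreduce_value(world_size: int) -> int:
    return world_size * (world_size - 1) // 2

# Sanity-check the closed form against a direct sum of the rank ids.
for ws in range(1, 9):
    assert expected_allreduce_value(ws) == sum(range(ws))
```

For the two-GPU run above, every element of t should therefore be 1 after the collective completes.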

from gpt-fast.

yifuwang commented on June 28, 2024

@briandw looks like it got past process group initialization but got stuck in all_reduce. Curious, have you had success with NCCL before on your dual-4090 setup?

I don't have access to such a setup, so I can only offer some ideas:

  • Try setting NCCL_P2P_DISABLE=1 (IIRC, p2p access is locked out on 4090s, and I'm not sure NCCL handles that correctly)
  • Check nvidia-smi topo -m to see how the cards are connected
  • Poke around nccl-tests and make sure it works
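For the first bullet, NCCL only reads the variable at initialization time, so it has to be in the environment before init_process_group runs. A minimal sketch of setting it from inside the script (exporting it in the shell before torchrun works just as well):

```python
import os

# NCCL_P2P_DISABLE=1 tells NCCL to skip direct GPU peer-to-peer transfers
# and stage traffic through host memory instead. It must be set before
# dist.init_process_group() is called, since NCCL reads it during init.
os.environ.setdefault("NCCL_P2P_DISABLE", "1")

# Then initialize the process group as usual:
# import torch.distributed as dist
# dist.init_process_group(backend="nccl")
```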


briandw commented on June 28, 2024

Wow, looks like NCCL_P2P_DISABLE=1 did the trick!
Result for the test code:

All_reduce completed
[rank 1, pid 151015] All_reduce completed
The tp_example.py also works now.

Thanks for your help @yifuwang

BTW, I found a good discussion of this issue: NVIDIA/nccl-tests#117
and
https://forums.developer.nvidia.com/t/standard-nvidia-cuda-tests-fail-with-dual-rtx-4090-linux-box/233202/34


Chillee commented on June 28, 2024

This seems like a configuration issue with NCCL. Generally, communication collectives (and thus the process) hang when the ranks are unable to connect to each other in some manner.
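When a collective hangs like this, NCCL's own logging often shows which transport the ranks are negotiating. NCCL_DEBUG and NCCL_DEBUG_SUBSYS are documented NCCL environment variables; the snippet below is just a sketch of setting them from Python, and it has to run before the process group is initialized:

```python
import os

# NCCL_DEBUG=INFO makes NCCL print its topology detection and transport
# selection (P2P, SHM, NET) to stderr, which typically shows where the
# ranks stop making progress. Set it before initializing the process group.
os.environ.setdefault("NCCL_DEBUG", "INFO")
# Optionally restrict the (fairly verbose) output to the init subsystems:
os.environ.setdefault("NCCL_DEBUG_SUBSYS", "INIT,GRAPH")
```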

Not sure if PyTorch distributed folks have any ideas offhand (cc: @yifuwang ).


briandw commented on June 28, 2024

@Chillee Do you have a version / git hash of PyTorch that works with TP?
I've filed an issue with PyTorch as well: pytorch/pytorch#115964


briandw commented on June 28, 2024

@yifuwang
Thanks for having a look at this.

I ran the code and it hung.

Here's the stack trace:

GNU gdb (Ubuntu 12.1-0ubuntu1~22.04) 12.1
Copyright (C) 2022 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
Type "show copying" and "show warranty" for details.
This GDB was configured as "x86_64-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<https://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
    <http://www.gnu.org/software/gdb/documentation/>.

For help, type "help".
Type "apropos word" to search for commands related to "word".
Attaching to process 136014
[New LWP 136019]
[New LWP 136020]
[New LWP 136021]
[New LWP 136039]
[New LWP 136040]
[New LWP 136041]
[New LWP 136042]
[New LWP 136049]
[New LWP 136052]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
0x00007ffe1c5ae6e8 in ?? ()
(gdb) bt
#0  0x00007ffe1c5ae6e8 in ?? ()
#1  0x00007ffe1c5ae84a in ?? ()
#2  0x00007fc91c8e566d in __GI___clock_gettime (clock_id=<optimized out>, tp=<optimized out>) at ../sysdeps/unix/sysv/linux/clock_gettime.c:42
#3  0x00007fc8718b6e24 in ?? () from /lib/x86_64-linux-gnu/libcuda.so.1
#4  0x00007fc87177cf56 in ?? () from /lib/x86_64-linux-gnu/libcuda.so.1
#5  0x00007fc871b01e8a in ?? () from /lib/x86_64-linux-gnu/libcuda.so.1
#6  0x00007fc87188cb66 in ?? () from /lib/x86_64-linux-gnu/libcuda.so.1
#7  0x00007fc8718747be in ?? () from /lib/x86_64-linux-gnu/libcuda.so.1
#8  0x00007fc871877140 in ?? () from /lib/x86_64-linux-gnu/libcuda.so.1
#9  0x00007fc8718d8d24 in ?? () from /lib/x86_64-linux-gnu/libcuda.so.1
#10 0x00007fc91ba37c5d in ?? ()
   from /home/brian/miniconda3/envs/vartia/lib/python3.11/site-packages/torch/lib/../../nvidia/cuda_runtime/lib/libcudart.so.12
#11 0x00007fc91ba383a0 in ?? ()
   from /home/brian/miniconda3/envs/vartia/lib/python3.11/site-packages/torch/lib/../../nvidia/cuda_runtime/lib/libcudart.so.12
#12 0x00007fc91ba383ff in ?? ()
   from /home/brian/miniconda3/envs/vartia/lib/python3.11/site-packages/torch/lib/../../nvidia/cuda_runtime/lib/libcudart.so.12
#13 0x00007fc91ba3af84 in ?? ()
   from /home/brian/miniconda3/envs/vartia/lib/python3.11/site-packages/torch/lib/../../nvidia/cuda_runtime/lib/libcudart.so.12
#14 0x00007fc91ba14930 in ?? ()
   from /home/brian/miniconda3/envs/vartia/lib/python3.11/site-packages/torch/lib/../../nvidia/cuda_runtime/lib/libcudart.so.12
#15 0x00007fc91ba6bf5e in cudaLaunchKernel ()
   from /home/brian/miniconda3/envs/vartia/lib/python3.11/site-packages/torch/lib/../../nvidia/cuda_runtime/lib/libcudart.so.12
#16 0x00007fc8d1ae2f7b in void at::native::gpu_kernel_impl_nocast<at::native::AUnaryFunctor<long, long, bool, at::native::(anonymous namespace)::CompareEqFunctor<long> > >(at::TensorIteratorBase&, at::native::AUnaryFunctor<long, long, bool, at::native::(anonymous namespace)::CompareEqFunctor<long> > const&) () from /home/brian/miniconda3/envs/vartia/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so
#17 0x00007fc8d1ae3575 in void at::native::gpu_kernel_impl<at::native::AUnaryFunctor<long, long, bool, at::native::(anonymous namespace)::CompareEqFunctor<long> > >(at::TensorIteratorBase&, at::native::AUnaryFunctor<long, long, bool, at::native::(anonymous namespace)::CompareEqFunctor<long> > const&) () from /home/brian/miniconda3/envs/vartia/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so
#18 0x00007fc8d1ae3b0b in void at::native::gpu_kernel<at::native::AUnaryFunctor<long, long, bool, at::native::(anonymous namespace)::CompareEqFunctor<long> > >(at::TensorIteratorBase&, at::native::AUnaryFunctor<long, long, bool, at::native::(anonymous namespace)::CompareEqFunctor<long> > const&) ()
   from /home/brian/miniconda3/envs/vartia/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so
#19 0x00007fc8d1ae3c69 in void at::native::opmath_symmetric_gpu_kernel_with_scalars<long, bool, at::native::(anonymous namespace)::CompareEqFunctor<long> >(at::TensorIteratorBase&, at::native::(anonymous namespace)::CompareEqFunctor<long> const&) ()
   from /home/brian/miniconda3/envs/vartia/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so
#20 0x00007fc8d1ab22a9 in at::native::compare_eq_ne_kernel(at::TensorIteratorBase&, at::native::(anonymous namespace)::EqOpType) ()
   from /home/brian/miniconda3/envs/vartia/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so
#21 0x00007fc8d34a3ddb in at::(anonymous namespace)::wrapper_CUDA_eq_Scalar(at::Tensor const&, c10::Scalar const&) ()
   from /home/brian/miniconda3/envs/vartia/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so
#22 0x00007fc8d34a3e70 in c10::impl::wrap_kernel_functor_unboxed_<c10::impl::detail::WrapFunctionIntoFunctor_<c10::CompileTimeFunctionPointer<at::Tensor (at::Tensor const&, c10::Scalar const&), &at::(anonymous namespace)::wrapper_CUDA_eq_Scalar>, at::Tensor, c10::guts::typelist::typelist<at::Tensor const&, c10::Scalar const&> >, at::Tensor (at::Tensor const&, c10::Scalar const&)>::call(c10::OperatorKernel*, c10::DispatchKeySet, at::Tensor const&, c10::Scalar const&) () from /home/brian/miniconda3/envs/vartia/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so
#23 0x00007fc904c0a3ce in at::_ops::eq_Scalar::redispatch(c10::DispatchKeySet, at::Tensor const&, c10::Scalar const&) () from /home/brian/miniconda3/envs/vartia/lib/python3.11/site-packages/torch/lib/libtorch_cpu.so
#24 0x00007fc906aab99a in torch::autograd::VariableType::(anonymous namespace)::eq_Scalar(c10::DispatchKeySet, at::Tensor const&, c10::Scalar const&) ()
   from /home/brian/miniconda3/envs/vartia/lib/python3.11/site-packages/torch/lib/libtorch_cpu.so
#25 0x00007fc906aab9e3 in c10::impl::wrap_kernel_functor_unboxed_<c10::impl::detail::WrapFunctionIntoFunctor_<c10::CompileTimeFunctionPointer<at::Tensor (c10::DispatchKeySet, at::Tensor const&, c10::Scalar const&), &torch::autograd::VariableType::(anonymous namespace)::eq_Scalar>, at::Tensor, c10::guts::typelist::typelist<c10::DispatchKeySet, at::Tensor const&, c10::Scalar const&> >, at::Tensor (c10::DispatchKeySet, at::Tensor const&, c10::Scalar const&)>::call(c10::OperatorKernel*, c10::DispatchKeySet, at::Tensor const&, c10::Scalar const&) () from /home/brian/miniconda3/envs/vartia/lib/python3.11/site-packages/torch/lib/libtorch_cpu.so
#26 0x00007fc904c5ecc1 in at::_ops::eq_Scalar::call(at::Tensor const&, c10::Scalar const&) () from /home/brian/miniconda3/envs/vartia/lib/python3.11/site-packages/torch/lib/libtorch_cpu.so
#27 0x00007fc91a2c2850 in torch::autograd::THPVariable_eq(_object*, _object*, _object*) () from /home/brian/miniconda3/envs/vartia/lib/python3.11/site-packages/torch/lib/libtorch_python.so
#28 0x000056486a93bdb0 in method_vectorcall_VARARGS_KEYWORDS (func=<optimized out>, args=<optimized out>, nargsf=<optimized out>, kwnames=<optimized out>) at /usr/local/src/conda/python-3.11.0/Objects/descrobject.c:364
#29 0x000056486a927e91 in _PyObject_VectorcallTstate (kwnames=<optimized out>, nargsf=<optimized out>, args=<optimized out>, callable=0x7fc91bd9ee80, tstate=0x56486acb1d98 <_PyRuntime+166328>)
    at /usr/local/src/conda/python-3.11.0/Include/internal/pycore_call.h:92
#30 PyObject_Vectorcall (callable=0x7fc91bd9ee80, args=<optimized out>, nargsf=<optimized out>, kwnames=<optimized out>) at /usr/local/src/conda/python-3.11.0/Objects/call.c:299
#31 0x000056486a91ac62 in _PyEval_EvalFrameDefault (tstate=<optimized out>, frame=<optimized out>, throwflag=<optimized out>) at /usr/local/src/conda/python-3.11.0/Python/ceval.c:4772
#32 0x000056486a9d805e in _PyEval_EvalFrame (throwflag=0, frame=0x7fc91cc0b020, tstate=0x56486acb1d98 <_PyRuntime+166328>) at /usr/local/src/conda/python-3.11.0/Include/internal/pycore_ceval.h:73
#33 _PyEval_Vector (tstate=0x56486acb1d98 <_PyRuntime+166328>, func=0x7fc91c1d1f80, locals=0x7fc91c1f2580, args=<optimized out>, argcount=<optimized out>, kwnames=<optimized out>)
    at /usr/local/src/conda/python-3.11.0/Python/ceval.c:6428
#34 0x000056486a9d75ef in PyEval_EvalCode (co=<optimized out>, globals=0x7fc91c1f2580, locals=<optimized out>) at /usr/local/src/conda/python-3.11.0/Python/ceval.c:1154
#35 0x000056486a9fa12c in run_eval_code_obj (tstate=0x56486acb1d98 <_PyRuntime+166328>, co=0x7fc91c0ea670, globals=0x7fc91c1f2580, locals=0x7fc91c1f2580) at /usr/local/src/conda/python-3.11.0/Python/pythonrun.c:1714
#36 0x000056486a9f63a4 in run_mod (mod=<optimized out>, filename=<optimized out>, globals=0x7fc91c1f2580, locals=0x7fc91c1f2580, flags=<optimized out>, arena=<optimized out>)
    at /usr/local/src/conda/python-3.11.0/Python/pythonrun.c:1735
#37 0x000056486aa0b372 in pyrun_file (fp=fp@entry=0x56486ca57030, filename=filename@entry=0x7fc91c01eba0, start=start@entry=257, globals=globals@entry=0x7fc91c1f2580, locals=locals@entry=0x7fc91c1f2580, closeit=closeit@entry=1, 
    flags=0x7ffe1c4698a8) at /usr/local/src/conda/python-3.11.0/Python/pythonrun.c:1630
#38 0x000056486aa0aca5 in _PyRun_SimpleFileObject (fp=0x56486ca57030, filename=0x7fc91c01eba0, closeit=1, flags=0x7ffe1c4698a8) at /usr/local/src/conda/python-3.11.0/Python/pythonrun.c:440
#39 0x000056486aa0aa73 in _PyRun_AnyFileObject (fp=0x56486ca57030, filename=0x7fc91c01eba0, closeit=1, flags=0x7ffe1c4698a8) at /usr/local/src/conda/python-3.11.0/Python/pythonrun.c:79
#40 0x000056486aa04b76 in pymain_run_file_obj (skip_source_first_line=0, filename=0x7fc91c01eba0, program_name=0x7fc91c0d7b10) at /usr/local/src/conda/python-3.11.0/Modules/main.c:360
#41 pymain_run_file (config=0x56486ac97de0 <_PyRuntime+59904>) at /usr/local/src/conda/python-3.11.0/Modules/main.c:379
#42 pymain_run_python (exitcode=0x7ffe1c4698a0) at /usr/local/src/conda/python-3.11.0/Modules/main.c:601
#43 Py_RunMain () at /usr/local/src/conda/python-3.11.0/Modules/main.c:680
#44 0x000056486a9c5e19 in Py_BytesMain (argc=<optimized out>, argv=<optimized out>) at /usr/local/src/conda/python-3.11.0/Modules/main.c:734
#45 0x00007fc91c829d90 in __libc_start_call_main (main=main@entry=0x56486a9c5d70 <main>, argc=argc@entry=3, argv=argv@entry=0x7ffe1c469af8) at ../sysdeps/nptl/libc_start_call_main.h:58
#46 0x00007fc91c829e40 in __libc_start_main_impl (main=0x56486a9c5d70 <main>, argc=3, argv=0x7ffe1c469af8, init=<optimized out>, fini=<optimized out>, rtld_fini=<optimized out>, stack_end=0x7ffe1c469ae8) at ../csu/libc-start.c:392
#47 0x000056486a9c5cb1 in _start ()

from gpt-fast.
