
Comments (6)

yifuwang commented on June 28, 2024

Hey @briandw, can you try this small script to see if the issue reproduces?

import os

import torch
import torch.distributed as dist


if __name__ == "__main__":
    # torchrun sets these environment variables for each worker process.
    rank = int(os.environ["RANK"])
    world_size = int(os.environ["WORLD_SIZE"])
    local_rank = int(os.environ["LOCAL_RANK"])
    pid = os.getpid()

    def log(msg) -> None:
        print(f"[rank {rank}, pid {pid}] {msg}")

    # Bind this process to its own GPU before initializing NCCL.
    torch.cuda.set_device(f"cuda:{local_rank}")

    log("Initializing process group...")
    dist.init_process_group(backend="nccl")
    log("Process group initialization completed")

    log("Testing all_reduce...")
    t = torch.full((8, 8), rank, device="cuda")
    dist.all_reduce(t)
    # After the sum across ranks, every element should equal
    # 0 + 1 + ... + (world_size - 1) = world_size * (world_size - 1) // 2.
    assert t.eq(world_size * (world_size - 1) // 2).all()
    log("All_reduce completed")

Run it with torchrun --nproc_per_node=2 --monitor-interval=1 [name].py. If it hangs, it would be very helpful if you could provide a stack trace: find the PIDs in the log, attach with gdb -p [pid], then run bt.
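A side note on the assertion in the script: with the default sum op, all_reduce replaces each rank's tensor (filled with its rank id) with the elementwise sum over all ranks, so every element should equal 0 + 1 + ... + (world_size - 1). A quick pure-Python check of the closed form the assert uses:

```python
# Each rank contributes a tensor filled with its rank id, so the reduced
# result holds 0 + 1 + ... + (world_size - 1) in every element. The
# script's assert compares against the equivalent closed form.
def expected_allreduce_value(world_size: int) -> int:
    return world_size * (world_size - 1) // 2

# Sanity-check the closed form against a direct sum of the rank ids.
for ws in range(1, 9):
    assert expected_allreduce_value(ws) == sum(range(ws))
```

For the two-GPU run above, every element of t should therefore be 1 after the collective completes.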

from gpt-fast.

yifuwang commented on June 28, 2024

@briandw looks like it got past process group initialization but got stuck in all_reduce. Curious, have you had success with NCCL before on your dual-4090 setup?

I don't have access to such a setup, so I can only offer some ideas:

  • Try setting NCCL_P2P_DISABLE=1 (IIRC, p2p access is locked out on 4090s, and I'm not sure NCCL handles that correctly)
  • Check nvidia-smi topo -m to see how the cards are connected
  • Poke around nccl-tests and make sure it works
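For the first bullet, NCCL only reads the variable at initialization time, so it has to be in the environment before init_process_group runs. A minimal sketch of setting it from inside the script (exporting it in the shell before torchrun works just as well):

```python
import os

# NCCL_P2P_DISABLE=1 tells NCCL to skip direct GPU peer-to-peer transfers
# and stage traffic through host memory instead. It must be set before
# dist.init_process_group() is called, since NCCL reads it during init.
os.environ.setdefault("NCCL_P2P_DISABLE", "1")

# Then initialize the process group as usual:
# import torch.distributed as dist
# dist.init_process_group(backend="nccl")
```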


briandw commented on June 28, 2024

Wow, looks like NCCL_P2P_DISABLE=1 did the trick!
Result for the test code:

All_reduce completed
[rank 1, pid 151015] All_reduce completed
The tp_example.py also works now.

Thanks for your help @yifuwang

BTW, I found a good discussion of this issue: NVIDIA/nccl-tests#117
and
https://forums.developer.nvidia.com/t/standard-nvidia-cuda-tests-fail-with-dual-rtx-4090-linux-box/233202/34


Chillee commented on June 28, 2024

This seems like a configuration issue with NCCL. Generally, communication collectives (and thus the process) hang when the ranks are unable to connect to each other in some manner.
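When a collective hangs like this, NCCL's own logging often shows which transport the ranks are negotiating. NCCL_DEBUG and NCCL_DEBUG_SUBSYS are documented NCCL environment variables; the snippet below is just a sketch of setting them from Python, and it has to run before the process group is initialized:

```python
import os

# NCCL_DEBUG=INFO makes NCCL print its topology detection and transport
# selection (P2P, SHM, NET) to stderr, which typically shows where the
# ranks stop making progress. Set it before initializing the process group.
os.environ.setdefault("NCCL_DEBUG", "INFO")
# Optionally restrict the (fairly verbose) output to the init subsystems:
os.environ.setdefault("NCCL_DEBUG_SUBSYS", "INIT,GRAPH")
```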

Not sure if PyTorch distributed folks have any ideas offhand (cc: @yifuwang ).


briandw commented on June 28, 2024

@Chillee Do you have a version / git hash of PyTorch that works with TP?
I've filed an issue with PyTorch as well: pytorch/pytorch#115964


briandw commented on June 28, 2024

@yifuwang
Thanks for having a look at this.

I ran the code and it hung.

Here's the stack trace:

GNU gdb (Ubuntu 12.1-0ubuntu1~22.04) 12.1
Copyright (C) 2022 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
Type "show copying" and "show warranty" for details.
This GDB was configured as "x86_64-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<https://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
    <http://www.gnu.org/software/gdb/documentation/>.

For help, type "help".
Type "apropos word" to search for commands related to "word".
Attaching to process 136014
[New LWP 136019]
[New LWP 136020]
[New LWP 136021]
[New LWP 136039]
[New LWP 136040]
[New LWP 136041]
[New LWP 136042]
[New LWP 136049]
[New LWP 136052]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
0x00007ffe1c5ae6e8 in ?? ()
(gdb) bt
#0  0x00007ffe1c5ae6e8 in ?? ()
#1  0x00007ffe1c5ae84a in ?? ()
#2  0x00007fc91c8e566d in __GI___clock_gettime (clock_id=<optimized out>, tp=<optimized out>) at ../sysdeps/unix/sysv/linux/clock_gettime.c:42
#3  0x00007fc8718b6e24 in ?? () from /lib/x86_64-linux-gnu/libcuda.so.1
#4  0x00007fc87177cf56 in ?? () from /lib/x86_64-linux-gnu/libcuda.so.1
#5  0x00007fc871b01e8a in ?? () from /lib/x86_64-linux-gnu/libcuda.so.1
#6  0x00007fc87188cb66 in ?? () from /lib/x86_64-linux-gnu/libcuda.so.1
#7  0x00007fc8718747be in ?? () from /lib/x86_64-linux-gnu/libcuda.so.1
#8  0x00007fc871877140 in ?? () from /lib/x86_64-linux-gnu/libcuda.so.1
#9  0x00007fc8718d8d24 in ?? () from /lib/x86_64-linux-gnu/libcuda.so.1
#10 0x00007fc91ba37c5d in ?? ()
   from /home/brian/miniconda3/envs/vartia/lib/python3.11/site-packages/torch/lib/../../nvidia/cuda_runtime/lib/libcudart.so.12
#11 0x00007fc91ba383a0 in ?? ()
   from /home/brian/miniconda3/envs/vartia/lib/python3.11/site-packages/torch/lib/../../nvidia/cuda_runtime/lib/libcudart.so.12
#12 0x00007fc91ba383ff in ?? ()
   from /home/brian/miniconda3/envs/vartia/lib/python3.11/site-packages/torch/lib/../../nvidia/cuda_runtime/lib/libcudart.so.12
#13 0x00007fc91ba3af84 in ?? ()
   from /home/brian/miniconda3/envs/vartia/lib/python3.11/site-packages/torch/lib/../../nvidia/cuda_runtime/lib/libcudart.so.12
#14 0x00007fc91ba14930 in ?? ()
   from /home/brian/miniconda3/envs/vartia/lib/python3.11/site-packages/torch/lib/../../nvidia/cuda_runtime/lib/libcudart.so.12
#15 0x00007fc91ba6bf5e in cudaLaunchKernel ()
   from /home/brian/miniconda3/envs/vartia/lib/python3.11/site-packages/torch/lib/../../nvidia/cuda_runtime/lib/libcudart.so.12
#16 0x00007fc8d1ae2f7b in void at::native::gpu_kernel_impl_nocast<at::native::AUnaryFunctor<long, long, bool, at::native::(anonymous namespace)::CompareEqFunctor<long> > >(at::TensorIteratorBase&, at::native::AUnaryFunctor<long, long, bool, at::native::(anonymous namespace)::CompareEqFunctor<long> > const&) () from /home/brian/miniconda3/envs/vartia/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so
#17 0x00007fc8d1ae3575 in void at::native::gpu_kernel_impl<at::native::AUnaryFunctor<long, long, bool, at::native::(anonymous namespace)::CompareEqFunctor<long> > >(at::TensorIteratorBase&, at::native::AUnaryFunctor<long, long, bool, at::native::(anonymous namespace)::CompareEqFunctor<long> > const&) () from /home/brian/miniconda3/envs/vartia/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so
#18 0x00007fc8d1ae3b0b in void at::native::gpu_kernel<at::native::AUnaryFunctor<long, long, bool, at::native::(anonymous namespace)::CompareEqFunctor<long> > >(at::TensorIteratorBase&, at::native::AUnaryFunctor<long, long, bool, at::native::(anonymous namespace)::CompareEqFunctor<long> > const&) ()
   from /home/brian/miniconda3/envs/vartia/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so
#19 0x00007fc8d1ae3c69 in void at::native::opmath_symmetric_gpu_kernel_with_scalars<long, bool, at::native::(anonymous namespace)::CompareEqFunctor<long> >(at::TensorIteratorBase&, at::native::(anonymous namespace)::CompareEqFunctor<long> const&) ()
   from /home/brian/miniconda3/envs/vartia/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so
#20 0x00007fc8d1ab22a9 in at::native::compare_eq_ne_kernel(at::TensorIteratorBase&, at::native::(anonymous namespace)::EqOpType) ()
   from /home/brian/miniconda3/envs/vartia/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so
#21 0x00007fc8d34a3ddb in at::(anonymous namespace)::wrapper_CUDA_eq_Scalar(at::Tensor const&, c10::Scalar const&) ()
   from /home/brian/miniconda3/envs/vartia/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so
#22 0x00007fc8d34a3e70 in c10::impl::wrap_kernel_functor_unboxed_<c10::impl::detail::WrapFunctionIntoFunctor_<c10::CompileTimeFunctionPointer<at::Tensor (at::Tensor const&, c10::Scalar const&), &at::(anonymous namespace)::wrapper_CUDA_eq_Scalar>, at::Tensor, c10::guts::typelist::typelist<at::Tensor const&, c10::Scalar const&> >, at::Tensor (at::Tensor const&, c10::Scalar const&)>::call(c10::OperatorKernel*, c10::DispatchKeySet, at::Tensor const&, c10::Scalar const&) () from /home/brian/miniconda3/envs/vartia/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so
#23 0x00007fc904c0a3ce in at::_ops::eq_Scalar::redispatch(c10::DispatchKeySet, at::Tensor const&, c10::Scalar const&) () from /home/brian/miniconda3/envs/vartia/lib/python3.11/site-packages/torch/lib/libtorch_cpu.so
#24 0x00007fc906aab99a in torch::autograd::VariableType::(anonymous namespace)::eq_Scalar(c10::DispatchKeySet, at::Tensor const&, c10::Scalar const&) ()
   from /home/brian/miniconda3/envs/vartia/lib/python3.11/site-packages/torch/lib/libtorch_cpu.so
#25 0x00007fc906aab9e3 in c10::impl::wrap_kernel_functor_unboxed_<c10::impl::detail::WrapFunctionIntoFunctor_<c10::CompileTimeFunctionPointer<at::Tensor (c10::DispatchKeySet, at::Tensor const&, c10::Scalar const&), &torch::autograd::VariableType::(anonymous namespace)::eq_Scalar>, at::Tensor, c10::guts::typelist::typelist<c10::DispatchKeySet, at::Tensor const&, c10::Scalar const&> >, at::Tensor (c10::DispatchKeySet, at::Tensor const&, c10::Scalar const&)>::call(c10::OperatorKernel*, c10::DispatchKeySet, at::Tensor const&, c10::Scalar const&) () from /home/brian/miniconda3/envs/vartia/lib/python3.11/site-packages/torch/lib/libtorch_cpu.so
#26 0x00007fc904c5ecc1 in at::_ops::eq_Scalar::call(at::Tensor const&, c10::Scalar const&) () from /home/brian/miniconda3/envs/vartia/lib/python3.11/site-packages/torch/lib/libtorch_cpu.so
#27 0x00007fc91a2c2850 in torch::autograd::THPVariable_eq(_object*, _object*, _object*) () from /home/brian/miniconda3/envs/vartia/lib/python3.11/site-packages/torch/lib/libtorch_python.so
#28 0x000056486a93bdb0 in method_vectorcall_VARARGS_KEYWORDS (func=<optimized out>, args=<optimized out>, nargsf=<optimized out>, kwnames=<optimized out>) at /usr/local/src/conda/python-3.11.0/Objects/descrobject.c:364
#29 0x000056486a927e91 in _PyObject_VectorcallTstate (kwnames=<optimized out>, nargsf=<optimized out>, args=<optimized out>, callable=0x7fc91bd9ee80, tstate=0x56486acb1d98 <_PyRuntime+166328>)
    at /usr/local/src/conda/python-3.11.0/Include/internal/pycore_call.h:92
#30 PyObject_Vectorcall (callable=0x7fc91bd9ee80, args=<optimized out>, nargsf=<optimized out>, kwnames=<optimized out>) at /usr/local/src/conda/python-3.11.0/Objects/call.c:299
#31 0x000056486a91ac62 in _PyEval_EvalFrameDefault (tstate=<optimized out>, frame=<optimized out>, throwflag=<optimized out>) at /usr/local/src/conda/python-3.11.0/Python/ceval.c:4772
#32 0x000056486a9d805e in _PyEval_EvalFrame (throwflag=0, frame=0x7fc91cc0b020, tstate=0x56486acb1d98 <_PyRuntime+166328>) at /usr/local/src/conda/python-3.11.0/Include/internal/pycore_ceval.h:73
#33 _PyEval_Vector (tstate=0x56486acb1d98 <_PyRuntime+166328>, func=0x7fc91c1d1f80, locals=0x7fc91c1f2580, args=<optimized out>, argcount=<optimized out>, kwnames=<optimized out>)
    at /usr/local/src/conda/python-3.11.0/Python/ceval.c:6428
#34 0x000056486a9d75ef in PyEval_EvalCode (co=<optimized out>, globals=0x7fc91c1f2580, locals=<optimized out>) at /usr/local/src/conda/python-3.11.0/Python/ceval.c:1154
#35 0x000056486a9fa12c in run_eval_code_obj (tstate=0x56486acb1d98 <_PyRuntime+166328>, co=0x7fc91c0ea670, globals=0x7fc91c1f2580, locals=0x7fc91c1f2580) at /usr/local/src/conda/python-3.11.0/Python/pythonrun.c:1714
#36 0x000056486a9f63a4 in run_mod (mod=<optimized out>, filename=<optimized out>, globals=0x7fc91c1f2580, locals=0x7fc91c1f2580, flags=<optimized out>, arena=<optimized out>)
    at /usr/local/src/conda/python-3.11.0/Python/pythonrun.c:1735
#37 0x000056486aa0b372 in pyrun_file (fp=fp@entry=0x56486ca57030, filename=filename@entry=0x7fc91c01eba0, start=start@entry=257, globals=globals@entry=0x7fc91c1f2580, locals=locals@entry=0x7fc91c1f2580, closeit=closeit@entry=1, 
    flags=0x7ffe1c4698a8) at /usr/local/src/conda/python-3.11.0/Python/pythonrun.c:1630
#38 0x000056486aa0aca5 in _PyRun_SimpleFileObject (fp=0x56486ca57030, filename=0x7fc91c01eba0, closeit=1, flags=0x7ffe1c4698a8) at /usr/local/src/conda/python-3.11.0/Python/pythonrun.c:440
#39 0x000056486aa0aa73 in _PyRun_AnyFileObject (fp=0x56486ca57030, filename=0x7fc91c01eba0, closeit=1, flags=0x7ffe1c4698a8) at /usr/local/src/conda/python-3.11.0/Python/pythonrun.c:79
#40 0x000056486aa04b76 in pymain_run_file_obj (skip_source_first_line=0, filename=0x7fc91c01eba0, program_name=0x7fc91c0d7b10) at /usr/local/src/conda/python-3.11.0/Modules/main.c:360
#41 pymain_run_file (config=0x56486ac97de0 <_PyRuntime+59904>) at /usr/local/src/conda/python-3.11.0/Modules/main.c:379
#42 pymain_run_python (exitcode=0x7ffe1c4698a0) at /usr/local/src/conda/python-3.11.0/Modules/main.c:601
#43 Py_RunMain () at /usr/local/src/conda/python-3.11.0/Modules/main.c:680
#44 0x000056486a9c5e19 in Py_BytesMain (argc=<optimized out>, argv=<optimized out>) at /usr/local/src/conda/python-3.11.0/Modules/main.c:734
#45 0x00007fc91c829d90 in __libc_start_call_main (main=main@entry=0x56486a9c5d70 <main>, argc=argc@entry=3, argv=argv@entry=0x7ffe1c469af8) at ../sysdeps/nptl/libc_start_call_main.h:58
#46 0x00007fc91c829e40 in __libc_start_main_impl (main=0x56486a9c5d70 <main>, argc=3, argv=0x7ffe1c469af8, init=<optimized out>, fini=<optimized out>, rtld_fini=<optimized out>, stack_end=0x7ffe1c469ae8) at ../csu/libc-start.c:392
#47 0x000056486a9c5cb1 in _start ()

from gpt-fast.
