
Comments (4)

sjeaugey commented on June 30, 2024

Ah. Good point. Indeed, in that case, as we're creating connections lazily, one side will try to connect both the send and recv connections, while the other side will only try to connect one, causing a mismatch in the amount of metadata they send to each other.
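To make the mismatch concrete, here is a toy model of the situation described above. It is not NCCL code: the `Op` tuple, the helper name, and the assumed call pattern (rank 0 groups its send and recv, rank 1 issues them as two separate calls) are all hypothetical and only count how many connections each side would try to set up on first contact.

```python
from collections import namedtuple

# Hypothetical model, not NCCL internals: one point-to-point call issued by a rank.
Op = namedtuple("Op", ["kind", "peer"])  # kind is "send" or "recv"


def first_contact_connections(groups, peer):
    """Count the connection directions toward `peer` that a rank would set up
    lazily in the first group that involves that peer."""
    for group in groups:
        needed = {op.kind for op in group if op.peer == peer}
        if needed:
            return len(needed)
    return 0


# Assumed pattern: rank 0 groups send and recv together,
# rank 1 issues recv and send as two separate (ungrouped) calls.
rank0_groups = [[Op("send", 1), Op("recv", 1)]]
rank1_groups = [[Op("recv", 0)], [Op("send", 0)]]

n0 = first_contact_connections(rank0_groups, peer=1)
n1 = first_contact_connections(rank1_groups, peer=0)
mismatch = n0 != n1
print(f"rank 0 sets up {n0} connection(s), rank 1 sets up {n1}; mismatch={mismatch}")
```

In this model rank 0 tries to exchange setup data for two connections while rank 1 expects only one, which is the kind of asymmetry that could surface as a "Message truncated" error.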

from nccl.

sjeaugey commented on June 30, 2024

Yes, it should be considered undefined.

This would probably work (at this small scale), but in general, it is not safe to assume that all calls within a group will progress without blocking each other.

If rank 0 were communicating with hundreds of other ranks, then its recv operation could be blocked by the send operation, which would block waiting for the Recv on rank 1.
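The blocking scenario can be sketched with a small simulation. This is a worst-case toy model, not how NCCL schedules work: it assumes every rank executes its calls strictly in order and that a send completes only when the peer's current call is the matching recv. The function name and the queue layout are hypothetical.

```python
def runs_to_completion(queues):
    """queues maps rank -> ordered list of (kind, peer) calls.
    Each rank runs its calls strictly in order; a send/recv pair completes
    only when both ranks are currently at the matching call (rendezvous).
    Returns True if every call completes, False if the ranks deadlock."""
    queues = {rank: list(ops) for rank, ops in queues.items()}
    progressed = True
    while progressed:
        progressed = False
        for rank, ops in queues.items():
            if not ops:
                continue
            kind, peer = ops[0]
            peer_ops = queues.get(peer, [])
            if kind == "send" and peer_ops and peer_ops[0] == ("recv", rank):
                ops.pop(0)       # the send completes
                peer_ops.pop(0)  # and so does the matching recv
                progressed = True
    return all(not ops for ops in queues.values())


# Both ranks send first: each recv is stuck behind a send that cannot complete.
serialized = {0: [("send", 1), ("recv", 1)], 1: [("send", 0), ("recv", 0)]}
# Rank 1 posts its recv first, so every send finds a matching recv.
matched = {0: [("send", 1), ("recv", 1)], 1: [("recv", 0), ("send", 0)]}

print(runs_to_completion(serialized))
print(runs_to_completion(matched))
```

The first pattern only works if the calls inside a group really progress independently, which is exactly the assumption the comment says is unsafe in general.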


constroy commented on June 30, 2024

Thanks for the explanation!
Actually, in my experiment I observe not only blocking but also an NCCL error
(the experiment was run using PyTorch):

torch.distributed.DistBackendError: NCCL error in: /root/pytorch/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:3608, internal error - please report this issue to the NCCL developers, NCCL version 2.20.5
ncclInternalError: Internal check failed.
Last error:
Message truncated : received 4096 bytes instead of 2048

I guess that the two ranks are using different communication protocols (or schemes) according to group patterns. Is that right?


constroy commented on June 30, 2024

Very clear explanation. Thank you very much.

