Comments (4)
Ah. Good point. Indeed, in that case, as we're creating connections lazily, one side will try to connect both the send and recv connection, while the other side will only try to connect one, causing a mismatch in the amount of metadata they send to each other.
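To make the mismatch concrete, here is a toy model in plain Python. It is not NCCL code: `connections_to_set_up` and `handshake_matches` are hypothetical helpers, and the assumption that each lazily created connection contributes exactly one metadata record is made purely for illustration.

```python
def connections_to_set_up(group_ops):
    """Connections a rank would lazily create for one group.

    group_ops is a list of (kind, peer) pairs, with kind "send" or "recv".
    Assumption for this sketch: one connection per distinct (kind, peer).
    """
    return {(kind, peer) for kind, peer in group_ops}


def handshake_matches(ops_a, rank_a, ops_b, rank_b):
    """Compare how many metadata records each side would exchange.

    Assumption for this sketch: one record per connection to the peer.
    Returns (counts_agree, (records_from_a, records_from_b)).
    """
    from_a = sum(1 for _, peer in connections_to_set_up(ops_a) if peer == rank_b)
    from_b = sum(1 for _, peer in connections_to_set_up(ops_b) if peer == rank_a)
    return from_a == from_b, (from_a, from_b)


# Rank 0 groups a send and a recv; rank 1 only sees a recv in its first group.
mismatched = handshake_matches(
    [("send", 1), ("recv", 1)], 0,
    [("recv", 0)], 1,
)

# Both ranks group the same pair of operations.
matched = handshake_matches(
    [("send", 1), ("recv", 1)], 0,
    [("recv", 0), ("send", 0)], 1,
)
```

In this model the first case has rank 0 preparing two records while rank 1 prepares one, which is the kind of disagreement described above. The second case is balanced.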
from nccl.
Yes, it should be considered undefined.
This would probably work (at this small scale), but in general, it is not safe to assume that all calls within a group will progress without blocking each other.
If rank 0 were communicating with hundreds of other ranks, then its recv operation could be blocked by the send operation, which would in turn block waiting for the recv on rank 1.
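A small simulation of that worst case, as a sketch only. It assumes every operation is blocking and runs strictly in program order with rendezvous semantics, which is a pessimistic model and not how NCCL actually schedules work; `run_in_order` is a hypothetical helper written for this illustration.

```python
def run_in_order(programs):
    """Simulate ranks that execute blocking send/recv ops strictly in order.

    programs maps rank -> list of (kind, peer), with kind "send" or "recv".
    An op completes only when the peer's current op is the matching one.
    Returns True if every rank finishes, False if no rank can progress.
    """
    pc = {rank: 0 for rank in programs}  # index of each rank's current op
    progressed = True
    while progressed:
        progressed = False
        for rank, ops in programs.items():
            if pc[rank] >= len(ops):
                continue  # this rank is done
            kind, peer = ops[pc[rank]]
            if pc[peer] >= len(programs[peer]):
                continue  # peer has nothing left to match against
            peer_kind, peer_target = programs[peer][pc[peer]]
            # A send pairs with a recv that targets this rank, and vice versa.
            if peer_target == rank and {kind, peer_kind} == {"send", "recv"}:
                pc[rank] += 1
                pc[peer] += 1
                progressed = True
    return all(pc[rank] == len(ops) for rank, ops in programs.items())


# Both ranks issue send first: each send waits for a recv that never starts.
stuck = run_in_order({
    0: [("send", 1), ("recv", 1)],
    1: [("send", 0), ("recv", 0)],
})

# Rank 1 posts its recv first, so every op finds its match.
ok = run_in_order({
    0: [("send", 1), ("recv", 1)],
    1: [("recv", 0), ("send", 0)],
})
```

Under these assumptions the first ordering never completes and the second does, which is why relying on a particular progress order inside a group is unsafe.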
from nccl.
Thanks for the explanation!
Actually, in my experiment I observed not only blocking but also an NCCL error.
(the experiment is conducted using PyTorch)
torch.distributed.DistBackendError: NCCL error in: /root/pytorch/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:3608, internal error - please report this issue to the NCCL developers, NCCL version 2.20.5
ncclInternalError: Internal check failed.
Last error:
Message truncated : received 4096 bytes instead of 2048
I guess that the two ranks are using different communication protocols (or schemes) according to group patterns. Is that right?
from nccl.
Very clear explanation. Thank you very much.
from nccl.