Comments (6)
FWIW, while developing Caffe2 we found that if multiple threads are running and one of them calls cudaMallocHost() or cudaFree(), NCCL freezes. What solved the problem for us was guarding all malloc, free, and NCCL calls so that they never overlap. That took some care (thanks to @akyrola). An example can be seen here:
https://github.com/caffe2/caffe2/blob/master/caffe2/core/context_gpu.cu#L284
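The guarding idea described above can be sketched roughly as follows. This is a minimal illustration, not the Caffe2 code behind the link: `gpu_call_mutex`, `with_gpu_lock`, `fake_gpu_call`, and `run_guarded_demo` are hypothetical names, and `fake_gpu_call` is only a stand-in for real calls such as cudaMallocHost(), cudaFree(), or ncclAllReduce() so that the sketch builds without CUDA. It assumes a single process-wide mutex is acceptable for serializing those calls.

```cpp
#include <atomic>
#include <cassert>
#include <mutex>
#include <thread>
#include <vector>

// One process-wide mutex shared by every allocation, free, and NCCL call.
// (Hypothetical helper; the name is not taken from Caffe2 or NCCL.)
std::mutex& gpu_call_mutex() {
  static std::mutex m;
  return m;
}

// Runs fn while holding the shared mutex, so guarded calls never overlap.
template <typename Fn>
auto with_gpu_lock(Fn&& fn) -> decltype(fn()) {
  std::lock_guard<std::mutex> lock(gpu_call_mutex());
  return fn();
}

// Instrumentation: how many guarded calls are running at the same time.
std::atomic<int> in_flight{0};
std::atomic<int> max_in_flight{0};

// Stand-in for a real cudaMallocHost / cudaFree / ncclAllReduce call.
void fake_gpu_call() {
  int now = ++in_flight;
  int seen = max_in_flight.load();
  // Record the highest concurrency level observed so far.
  while (now > seen && !max_in_flight.compare_exchange_weak(seen, now)) {
  }
  std::this_thread::yield();  // give other threads a chance to interleave
  --in_flight;
}

// Starts num_threads workers that each issue calls_per_thread guarded calls
// and returns the maximum number of calls that ever ran concurrently.
int run_guarded_demo(int num_threads, int calls_per_thread) {
  assert(num_threads > 0 && calls_per_thread > 0);
  in_flight = 0;
  max_in_flight = 0;
  std::vector<std::thread> workers;
  for (int t = 0; t < num_threads; ++t) {
    workers.emplace_back([calls_per_thread] {
      for (int i = 0; i < calls_per_thread; ++i) {
        with_gpu_lock([] { fake_gpu_call(); });
      }
    });
  }
  for (auto& w : workers) {
    w.join();
  }
  return max_in_flight.load();
}
```

In a real program the lambda passed to `with_gpu_lock` would wrap the actual CUDA or NCCL call. Holding one lock around every such call trades some concurrency for the guarantee that an allocation or free never runs while a collective is being issued from another thread.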
from nccl.
First, are you running all the processes on the same node?
In any case, you should not need a barrier in the case of MPI, only with multiple threads. A small code sample that reproduces the hang would help.
Yes, it happens on a single node.
The project that hangs is Caffe, calling ncclAllReduce during the backward pass. I will try to reproduce it with a small demo.
Thanks.
Hi @Yangqing, thank you very much. I am working on our own branch of a very old version of Caffe, with Open MPI integrated. In other words, our case is multi-process on a single node, with one GPU per process. I think no thread-based lock is needed. Am I right?
PS: it hangs about once every 1 to 1.5 days. After each freeze I have to resume training from a snapshot. The trained model's performance is quite good.
Closing old issue. Please re-open if you still have issues with 2.3.