Comments (3)
Using NCCL_COMM_ID to try to pin the IP address does not help: NCCL always picks 127.0.0.1, the first IP on the interface.
Initializing distributed: GLOBAL_RANK: 0, MEMBER: 1/2
Added key: store_based_barrier_key:1 to store for rank: 1
Added key: store_based_barrier_key:1 to store for rank: 0
Rank 0: Completed store-based barrier for key:store_based_barrier_key:1 with 2 nodes.
----------------------------------------------------------------------------------------------------
distributed_backend=nccl
All distributed processes registered. Starting with 2 processes
----------------------------------------------------------------------------------------------------
Rank 1: Completed store-based barrier for key:store_based_barrier_key:1 with 2 nodes.
axa:3048391:3048391 [0] NCCL INFO Bootstrap : Using lo:X.Y.Z.Q<0>
axa:3048391:3048391 [0] NCCL INFO NET/Plugin: Failed to find ncclNetPlugin_v6 symbol.
axa:3048391:3048391 [0] NCCL INFO NET/Plugin: Loaded net plugin NCCL RDMA Plugin (v5)
axa:3048391:3048391 [0] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v6 symbol.
axa:3048391:3048391 [0] NCCL INFO NET/Plugin: Loaded coll plugin SHARP (v5)
axa:3048391:3048391 [0] NCCL INFO NCCL_COMM_ID set by environment to X.Y.Z.Q:45711
axa:3048391:3048391 NCCL CALL ncclGetUniqueId(0x87648658c0c08b42)
axa:3048391:3048391 NCCL CALL ncclGroupStart()
axa:3048391:3048391 [0] NCCL INFO cudaDriverVersion 12020
axa:3048391:3048391 [0] NCCL INFO NCCL_COMM_ID set by environment to X.Y.Z.Q:45711
NCCL version 2.17.1+cuda12.1
ixh:523729:523729 NCCL CALL ncclGroupStart()
ixh:523729:523729 [0] NCCL INFO cudaDriverVersion 12020
ixh:523729:523729 [0] NCCL INFO Bootstrap : Using lo:X.Y.Z.W<0>
ixh:523729:523729 [0] NCCL INFO NET/Plugin: Failed to find ncclNetPlugin_v6 symbol.
ixh:523729:523729 [0] NCCL INFO NET/Plugin: Loaded net plugin NCCL RDMA Plugin (v5)
ixh:523729:523729 [0] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v6 symbol.
ixh:523729:523729 [0] NCCL INFO NET/Plugin: Loaded coll plugin SHARP (v5)
ixh:523729:523729 [0] NCCL INFO init.cc:1301 Cuda Host Alloc Size 4 pointer 0x7f48ea800000
ixh:523729:523985 [0] NCCL INFO Plugin Path : /opt/hpcx/nccl_rdma_sharp_plugin/lib/libnccl-net.so
ixh:523729:523985 [0] NCCL INFO P2P plugin IBext
ixh:523729:523985 [0] NCCL INFO NCCL_IBEXT_DISABLE set by environment to 1.
ixh:523729:523985 [0] NCCL INFO net.cc:79 -> 3
ixh:523729:523985 [0] NCCL INFO NCCL_IB_DISABLE set by environment to 1.
ixh:523729:523985 [0] NCCL INFO NCCL_SOCKET_IFNAME set by environment to lo
ixh:523729:523985 [0] NCCL INFO NCCL_SOCKET_IFNAME set to lo
ixh:523729:523985 [0] NCCL INFO NET/Socket : Using [0]lo:127.0.0.1<0>
ixh:523729:523985 [0] NCCL INFO Using network Socket
ixh:523729:523985 [0] NCCL INFO Could not get speed from /sys/class/net/lo/speed. Defaulting to 10 Gbps.
ixh:523729:523985 [0] NCCL INFO NET/Socket : GPU Direct RDMA Disabled for HCA 0 'lo'
ixh:523729:523985 [0] NCCL INFO === System : maxBw 1.2 totalBw 396.0 ===
ixh:523729:523985 [0] NCCL INFO CPU/0 (1/1/2)
ixh:523729:523985 [0] NCCL INFO + PCI[5000.0] - NIC/0
ixh:523729:523985 [0] NCCL INFO + NET[1.2] - NET/0 (0/0/1.250000)
ixh:523729:523985 [0] NCCL INFO + PCI[48.0] - PCI/16000 (15b3197900000000)
ixh:523729:523985 [0] NCCL INFO + PCI[48.0] - PCI/19000 (15b3197900000000)
ixh:523729:523985 [0] NCCL INFO + PCI[48.0] - GPU/1B000 (1)
ixh:523729:523985 [0] NCCL INFO + NVL[396.0] - NVS/0
ixh:523729:523985 [0] NCCL INFO ==========================================
ixh:523729:523985 [0] NCCL INFO GPU/1B000 :GPU/1B000 (0/5000.000000/LOC) CPU/0 (3/48.000000/PHB) NET/0 (5/1.250000/PHB)
ixh:523729:523985 [0] NCCL INFO NET/0 :GPU/1B000 (5/1.250000/PHB) CPU/0 (2/1.250000/PHB) NET/0 (0/5000.000000/LOC)
ixh:523729:523985 [0] NCCL INFO Setting affinity for GPU 0 to ffff0000,00000000,00000000,0000ffff
ixh:523729:523985 [0] NCCL INFO Pattern 4, crossNic 0, nChannels 1, bw 1.200000/1.200000, type LOC/PHB, sameChannels 1
ixh:523729:523985 [0] NCCL INFO 0 : NET/0 GPU/1 NET/0
ixh:523729:523985 [0] NCCL INFO Pattern 3, crossNic 0, nChannels 1, bw 2.400000/1.200000, type LOC/PHB, sameChannels 1
ixh:523729:523985 [0] NCCL INFO 0 : NET/0 GPU/1 NET/0
ixh:523729:523985 [0] NCCL INFO Pattern 3, crossNic 0, nChannels 0, bw 0.000000/0.000000, type LOC/PIX, sameChannels 1
ixh:523729:523985 [0] NCCL INFO Tree 0 : 0 -> 1 -> -1/-1/-1
ixh:523729:523985 [0] NCCL INFO Tree 1 : -1 -> 1 -> 0/-1/-1
ixh:523729:523985 [0] NCCL INFO Ring 00 : 0 -> 1 -> 0
ixh:523729:523985 [0] NCCL INFO Ring 01 : 0 -> 1 -> 0
ixh:523729:523985 [0] NCCL INFO Trees [0] -1/-1/-1->1->0 [1] 0/-1/-1->1->-1
ixh:523729:523985 [0] NCCL INFO P2P Chunksize set to 131072
ixh:523729:523985 [0] NCCL INFO misc/utils.cc:235 memory stack hunk malloc(65536)
ixh:523729:523985 [0] NCCL INFO channel.cc:24 Cuda Alloc Size 1536 pointer 0x7f49bb800000
ixh:523729:523985 [0] NCCL INFO channel.cc:28 Cuda Alloc Size 8 pointer 0x7f49bb800600
ixh:523729:523985 [0] NCCL INFO channel.cc:24 Cuda Alloc Size 1536 pointer 0x7f49bb800800
ixh:523729:523985 [0] NCCL INFO channel.cc:28 Cuda Alloc Size 8 pointer 0x7f49bb800e00
ixh:523729:523988 [0] NCCL INFO Mem Realloc old size 0, new size 8 pointer 0x7f48ac004e80
ixh:523729:523988 [0] NCCL INFO Allocated 4194660 bytes of shared memory in /dev/shm/nccl-pSRjS6
ixh:523729:523988 [0] NCCL INFO New proxy recv connection 0 from local rank 0, transport 2
ixh:523729:523988 [0] NCCL INFO Received and initiated operation=Init res=0
axa:3048391:3048391 [0] NCCL INFO init.cc:1301 Cuda Host Alloc Size 4 pointer 0x7f961b400000
axa:3048391:3048566 [0] NCCL INFO Plugin Path : /opt/hpcx/nccl_rdma_sharp_plugin/lib/libnccl-net.so
axa:3048391:3048566 [0] NCCL INFO P2P plugin IBext
axa:3048391:3048566 [0] NCCL INFO NCCL_IBEXT_DISABLE set by environment to 1.
axa:3048391:3048566 [0] NCCL INFO net.cc:79 -> 3
axa:3048391:3048566 [0] NCCL INFO NCCL_IB_DISABLE set by environment to 1.
axa:3048391:3048566 [0] NCCL INFO NCCL_SOCKET_IFNAME set by environment to lo
axa:3048391:3048566 [0] NCCL INFO NCCL_SOCKET_IFNAME set to lo
axa:3048391:3048566 [0] NCCL INFO NET/Socket : Using [0]lo:127.0.0.1<0>
axa:3048391:3048566 [0] NCCL INFO Using network Socket
axa:3048391:3048566 [0] NCCL INFO Could not get speed from /sys/class/net/lo/speed. Defaulting to 10 Gbps.
axa:3048391:3048566 [0] NCCL INFO NET/Socket : GPU Direct RDMA Disabled for HCA 0 'lo'
axa:3048391:3048566 [0] NCCL INFO === System : maxBw 1.2 totalBw 264.0 ===
axa:3048391:3048566 [0] NCCL INFO CPU/3 (1/2/-1)
axa:3048391:3048566 [0] NCCL INFO + PCI[5000.0] - NIC/0
axa:3048391:3048566 [0] NCCL INFO + NET[1.2] - NET/0 (0/0/1.250000)
axa:3048391:3048566 [0] NCCL INFO + PCI[24.0] - PCI/1000 (1000c01010000000)
axa:3048391:3048566 [0] NCCL INFO + PCI[24.0] - PCI/5000 (1000c01010de13b8)
axa:3048391:3048566 [0] NCCL INFO + PCI[24.0] - GPU/7000 (0)
axa:3048391:3048566 [0] NCCL INFO + NVL[264.0] - NVS/0
axa:3048391:3048566 [0] NCCL INFO ==========================================
axa:3048391:3048566 [0] NCCL INFO GPU/7000 :GPU/7000 (0/5000.000000/LOC) CPU/3 (3/24.000000/PHB) NET/0 (5/1.250000/PHB)
axa:3048391:3048566 [0] NCCL INFO NET/0 :GPU/7000 (5/1.250000/PHB) CPU/3 (2/1.250000/PHB) NET/0 (0/5000.000000/LOC)
axa:3048391:3048566 [0] NCCL INFO Pattern 4, crossNic 0, nChannels 1, bw 1.200000/1.200000, type LOC/PHB, sameChannels 1
axa:3048391:3048566 [0] NCCL INFO 0 : NET/0 GPU/0 NET/0
axa:3048391:3048566 [0] NCCL INFO Pattern 3, crossNic 0, nChannels 1, bw 2.400000/1.200000, type LOC/PHB, sameChannels 1
axa:3048391:3048566 [0] NCCL INFO 0 : NET/0 GPU/0 NET/0
axa:3048391:3048566 [0] NCCL INFO Pattern 3, crossNic 0, nChannels 0, bw 0.000000/0.000000, type LOC/PIX, sameChannels 1
axa:3048391:3048566 [0] NCCL INFO Tree 0 : -1 -> 0 -> 1/-1/-1
axa:3048391:3048566 [0] NCCL INFO Tree 1 : 1 -> 0 -> -1/-1/-1
axa:3048391:3048566 [0] NCCL INFO Channel 00/02 : 0 1
axa:3048391:3048566 [0] NCCL INFO Channel 01/02 : 0 1
axa:3048391:3048566 [0] NCCL INFO Ring 00 : 1 -> 0 -> 1
axa:3048391:3048566 [0] NCCL INFO Ring 01 : 1 -> 0 -> 1
axa:3048391:3048566 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1 [1] -1/-1/-1->0->1
axa:3048391:3048566 [0] NCCL INFO P2P Chunksize set to 131072
axa:3048391:3048566 [0] NCCL INFO misc/utils.cc:235 memory stack hunk malloc(65536)
axa:3048391:3048566 [0] NCCL INFO channel.cc:24 Cuda Alloc Size 1536 pointer 0x7f96c7000000
axa:3048391:3048566 [0] NCCL INFO channel.cc:28 Cuda Alloc Size 8 pointer 0x7f96c7000600
axa:3048391:3048566 [0] NCCL INFO channel.cc:24 Cuda Alloc Size 1536 pointer 0x7f96c7000800
axa:3048391:3048566 [0] NCCL INFO channel.cc:28 Cuda Alloc Size 8 pointer 0x7f96c7000e00
axa:3048391:3048570 [0] NCCL INFO Mem Realloc old size 0, new size 8 pointer 0x7f95f8000b60
axa:3048391:3048570 [0] NCCL INFO Allocated 4194660 bytes of shared memory in /dev/shm/nccl-Y04u6d
axa:3048391:3048570 [0] NCCL INFO New proxy recv connection 0 from local rank 0, transport 2
axa:3048391:3048570 [0] NCCL INFO Received and initiated operation=Init res=0
axa:3048391:3048566 [0] NCCL INFO Connection to proxy localRank 0 -> connection 0x7f95f8004f30
axa:3048391:3048570 [0] NCCL INFO proxyConnSetupConnect for peer->localRank 0,
axa:3048391:3048570 [0] NCCL INFO proxyProgressAsync::proxySetup() opId=0x7f95f1788020
axa:3048391:3048570 [0] NCCL INFO Could not get speed from /sys/class/net/lo/speed. Defaulting to 10 Gbps.
axa:3048391:3048566 [0] NCCL INFO ncclPollProxyResponse Recieved new opId=0x7f95f1788020
axa:3048391:3048570 [0] NCCL INFO Received and initiated operation=Setup res=0
axa:3048391:3048566 [0] NCCL INFO recvOpId=0x7f95f1788020 matches expected opId=0x7f95f1788020
axa:3048391:3048566 [0] NCCL INFO Channel 00/0 : 1[1b000] -> 0[7000] [receive] via NET/Socket/0
axa:3048391:3048570 [0] NCCL INFO New proxy recv connection 1 from local rank 0, transport 2
axa:3048391:3048570 [0] NCCL INFO Received and initiated operation=Init res=0
axa:3048391:3048566 [0] NCCL INFO Connection to proxy localRank 0 -> connection 0x7f95f8004f70
axa:3048391:3048570 [0] NCCL INFO proxyConnSetupConnect for peer->localRank 0,
axa:3048391:3048570 [0] NCCL INFO proxyProgressAsync::proxySetup() opId=0x7f95f1788020
axa:3048391:3048570 [0] NCCL INFO Could not get speed from /sys/class/net/lo/speed. Defaulting to 10 Gbps.
axa:3048391:3048566 [0] NCCL INFO ncclPollProxyResponse Recieved new opId=0x7f95f1788020
axa:3048391:3048570 [0] NCCL INFO Received and initiated operation=Setup res=0
axa:3048391:3048566 [0] NCCL INFO recvOpId=0x7f95f1788020 matches expected opId=0x7f95f1788020
axa:3048391:3048566 [0] NCCL INFO Channel 01/0 : 1[1b000] -> 0[7000] [receive] via NET/Socket/0
axa:3048391:3048570 [0] NCCL INFO New proxy send connection 2 from local rank 0, transport 2
axa:3048391:3048570 [0] NCCL INFO Received and initiated operation=Init res=0
axa:3048391:3048566 [0] NCCL INFO Connection to proxy localRank 0 -> connection 0x7f95f8004fb0
axa:3048391:3048570 [0] NCCL INFO proxyConnSetupConnect for peer->localRank 0,
axa:3048391:3048570 [0] NCCL INFO proxyProgressAsync::proxySetup() opId=0x7f95f1788020
axa:3048391:3048570 [0] NCCL INFO Could not get speed from /sys/class/net/lo/speed. Defaulting to 10 Gbps.
axa:3048391:3048566 [0] NCCL INFO ncclPollProxyResponse Recieved new opId=0x7f95f1788020
axa:3048391:3048570 [0] NCCL INFO Received and initiated operation=Setup res=0
axa:3048391:3048566 [0] NCCL INFO recvOpId=0x7f95f1788020 matches expected opId=0x7f95f1788020
axa:3048391:3048566 [0] NCCL INFO Channel 00/0 : 0[7000] -> 1[1b000] [send] via NET/Socket/0
axa:3048391:3048570 [0] NCCL INFO New proxy send connection 3 from local rank 0, transport 2
axa:3048391:3048570 [0] NCCL INFO Received and initiated operation=Init res=0
axa:3048391:3048566 [0] NCCL INFO Connection to proxy localRank 0 -> connection 0x7f95f8004ff0
axa:3048391:3048570 [0] NCCL INFO proxyConnSetupConnect for peer->localRank 0,
axa:3048391:3048570 [0] NCCL INFO proxyProgressAsync::proxySetup() opId=0x7f95f1788020
axa:3048391:3048570 [0] NCCL INFO Could not get speed from /sys/class/net/lo/speed. Defaulting to 10 Gbps.
axa:3048391:3048566 [0] NCCL INFO ncclPollProxyResponse Recieved new opId=0x7f95f1788020
axa:3048391:3048570 [0] NCCL INFO Received and initiated operation=Setup res=0
axa:3048391:3048566 [0] NCCL INFO recvOpId=0x7f95f1788020 matches expected opId=0x7f95f1788020
axa:3048391:3048566 [0] NCCL INFO Channel 01/0 : 0[7000] -> 1[1b000] [send] via NET/Socket/0
axa:3048391:3048566 [0] NCCL INFO sendConnect ncclProxyCallAsync opId=0x7f95f1789440
axa:3048391:3048570 [0] NCCL INFO proxyConnSetupConnect for peer->localRank 0,
axa:3048391:3048570 [0] NCCL INFO proxyProgressAsync::proxyConnect() opId=0x7f95f1789440 op.reqBuff=0x7f95f801bfd0
axa:3048391:3048566 [0] NCCL INFO recvConnect ncclProxyCallAsync opId=0x7f95f1789570 &recv->proxyConn=0x7f95f1789578 connectInfo=0x7f95f179ae40
axa:3048391:3048566 [0] NCCL INFO sendConnect ncclProxyCallAsync opId=0x7f95f1789df8
axa:3048391:3048566 [0] NCCL INFO recvConnect ncclProxyCallAsync opId=0x7f95f1789f28 &recv->proxyConn=0x7f95f1789f30 connectInfo=0x7f95f179aec0
axa:3048391:3048570 [0] misc/socket.cc:480 NCCL WARN socketStartConnect: Connect to 127.0.0.1<53737> failed : Software caused connection abort
Indeed, NCCL does not support picking the second IP address of an IP interface.
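To see which address would come first on an interface, one option is to look at the output of `ip -o -4 addr show`. The sketch below is only an illustration: `first_ipv4_per_interface` is a hypothetical helper, the sample text is made up (192.0.2.10 is a documentation address), and it assumes the order printed by `ip` matches the order NCCL sees.

```python
def first_ipv4_per_interface(ip_output):
    """Map each interface name to the first IPv4 address listed for it."""
    first = {}
    for line in ip_output.splitlines():
        fields = line.split()
        # Expected one-line shape: "<index>: <ifname> inet <addr>/<prefix> ..."
        if len(fields) < 4 or fields[2] != "inet":
            continue
        ifname = fields[1]
        address = fields[3].split("/")[0]
        # Keep only the first address seen for each interface.
        first.setdefault(ifname, address)
    return first


# Illustrative input: loopback with a routable /32 added as a secondary address.
sample = (
    "1: lo    inet 127.0.0.1/8 scope host lo\\       valid_lft forever preferred_lft forever\n"
    "1: lo    inet 192.0.2.10/32 scope global lo\\       valid_lft forever preferred_lft forever\n"
)

print(first_ipv4_per_interface(sample))
```

With the routable address listed second on `lo`, the helper reports 127.0.0.1 for that interface, which matches the behaviour described above.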
Something like this should work for the case above.
Instead of adding your routable /32 as a secondary address on the loopback, create a new dummy interface that owns the IP:
ip link add name primaryip type dummy
ip addr add x.x.x.x/32 dev primaryip
NCCL_SOCKET_IFNAME=primaryip
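If the job is launched from Python, the same setting can be applied from the launcher before NCCL is initialized. This is a minimal sketch, not a tested recipe: `select_nccl_interface` is a hypothetical helper, and the `env` and `available` parameters exist only so the check can be tried without touching the real environment.

```python
import os
import socket


def select_nccl_interface(ifname, env=None, available=None):
    """Set NCCL_SOCKET_IFNAME to ifname, failing early if it does not exist."""
    env = os.environ if env is None else env
    if available is None:
        # Names of the network interfaces currently present on this host.
        available = [name for _, name in socket.if_nameindex()]
    if ifname not in available:
        raise ValueError(f"interface {ifname!r} not found; have {sorted(available)}")
    env["NCCL_SOCKET_IFNAME"] = ifname
    return env


# Dry run against an explicit interface list instead of the live system.
env = select_nccl_interface("primaryip", env={}, available=["lo", "primaryip"])
print(env)
```

In a real launcher the call would be `select_nccl_interface("primaryip")` on every node, made before the process group is created, so the variable is already set when NCCL reads it.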