Comments (3)

itzsimpl commented on September 21, 2024

Setting NCCL_COMM_ID to pin the IP address does not help: NCCL always picks 127.0.0.1, the first IP on the interface.

Initializing distributed: GLOBAL_RANK: 0, MEMBER: 1/2
Added key: store_based_barrier_key:1 to store for rank: 1
Added key: store_based_barrier_key:1 to store for rank: 0
Rank 0: Completed store-based barrier for key:store_based_barrier_key:1 with 2 nodes.
----------------------------------------------------------------------------------------------------
distributed_backend=nccl
All distributed processes registered. Starting with 2 processes
----------------------------------------------------------------------------------------------------

Rank 1: Completed store-based barrier for key:store_based_barrier_key:1 with 2 nodes.
axa:3048391:3048391 [0] NCCL INFO Bootstrap : Using lo:X.Y.Z.Q<0>
axa:3048391:3048391 [0] NCCL INFO NET/Plugin: Failed to find ncclNetPlugin_v6 symbol.
axa:3048391:3048391 [0] NCCL INFO NET/Plugin: Loaded net plugin NCCL RDMA Plugin (v5)
axa:3048391:3048391 [0] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v6 symbol.
axa:3048391:3048391 [0] NCCL INFO NET/Plugin: Loaded coll plugin SHARP (v5)
axa:3048391:3048391 [0] NCCL INFO NCCL_COMM_ID set by environment to X.Y.Z.Q:45711
axa:3048391:3048391 NCCL CALL ncclGetUniqueId(0x87648658c0c08b42)
axa:3048391:3048391 NCCL CALL ncclGroupStart()
axa:3048391:3048391 [0] NCCL INFO cudaDriverVersion 12020
axa:3048391:3048391 [0] NCCL INFO NCCL_COMM_ID set by environment to X.Y.Z.Q:45711
NCCL version 2.17.1+cuda12.1
ixh:523729:523729 NCCL CALL ncclGroupStart()
ixh:523729:523729 [0] NCCL INFO cudaDriverVersion 12020
ixh:523729:523729 [0] NCCL INFO Bootstrap : Using lo:X.Y.Z.W<0>
ixh:523729:523729 [0] NCCL INFO NET/Plugin: Failed to find ncclNetPlugin_v6 symbol.
ixh:523729:523729 [0] NCCL INFO NET/Plugin: Loaded net plugin NCCL RDMA Plugin (v5)
ixh:523729:523729 [0] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v6 symbol.
ixh:523729:523729 [0] NCCL INFO NET/Plugin: Loaded coll plugin SHARP (v5)
ixh:523729:523729 [0] NCCL INFO init.cc:1301 Cuda Host Alloc Size 4 pointer 0x7f48ea800000
ixh:523729:523985 [0] NCCL INFO Plugin Path : /opt/hpcx/nccl_rdma_sharp_plugin/lib/libnccl-net.so
ixh:523729:523985 [0] NCCL INFO P2P plugin IBext
ixh:523729:523985 [0] NCCL INFO NCCL_IBEXT_DISABLE set by environment to 1.
ixh:523729:523985 [0] NCCL INFO net.cc:79 -> 3
ixh:523729:523985 [0] NCCL INFO NCCL_IB_DISABLE set by environment to 1.
ixh:523729:523985 [0] NCCL INFO NCCL_SOCKET_IFNAME set by environment to lo
ixh:523729:523985 [0] NCCL INFO NCCL_SOCKET_IFNAME set to lo
ixh:523729:523985 [0] NCCL INFO NET/Socket : Using [0]lo:127.0.0.1<0>
ixh:523729:523985 [0] NCCL INFO Using network Socket
ixh:523729:523985 [0] NCCL INFO Could not get speed from /sys/class/net/lo/speed. Defaulting to 10 Gbps.
ixh:523729:523985 [0] NCCL INFO NET/Socket : GPU Direct RDMA Disabled for HCA 0 'lo'
ixh:523729:523985 [0] NCCL INFO === System : maxBw 1.2 totalBw 396.0 ===
ixh:523729:523985 [0] NCCL INFO CPU/0 (1/1/2)
ixh:523729:523985 [0] NCCL INFO + PCI[5000.0] - NIC/0
ixh:523729:523985 [0] NCCL INFO                 + NET[1.2] - NET/0 (0/0/1.250000)
ixh:523729:523985 [0] NCCL INFO + PCI[48.0] - PCI/16000 (15b3197900000000)
ixh:523729:523985 [0] NCCL INFO               + PCI[48.0] - PCI/19000 (15b3197900000000)
ixh:523729:523985 [0] NCCL INFO                             + PCI[48.0] - GPU/1B000 (1)
ixh:523729:523985 [0] NCCL INFO                                           + NVL[396.0] - NVS/0
ixh:523729:523985 [0] NCCL INFO ==========================================
ixh:523729:523985 [0] NCCL INFO GPU/1B000 :GPU/1B000 (0/5000.000000/LOC) CPU/0 (3/48.000000/PHB) NET/0 (5/1.250000/PHB)
ixh:523729:523985 [0] NCCL INFO NET/0 :GPU/1B000 (5/1.250000/PHB) CPU/0 (2/1.250000/PHB) NET/0 (0/5000.000000/LOC)
ixh:523729:523985 [0] NCCL INFO Setting affinity for GPU 0 to ffff0000,00000000,00000000,0000ffff
ixh:523729:523985 [0] NCCL INFO Pattern 4, crossNic 0, nChannels 1, bw 1.200000/1.200000, type LOC/PHB, sameChannels 1
ixh:523729:523985 [0] NCCL INFO  0 : NET/0 GPU/1 NET/0
ixh:523729:523985 [0] NCCL INFO Pattern 3, crossNic 0, nChannels 1, bw 2.400000/1.200000, type LOC/PHB, sameChannels 1
ixh:523729:523985 [0] NCCL INFO  0 : NET/0 GPU/1 NET/0
ixh:523729:523985 [0] NCCL INFO Pattern 3, crossNic 0, nChannels 0, bw 0.000000/0.000000, type LOC/PIX, sameChannels 1
ixh:523729:523985 [0] NCCL INFO Tree 0 : 0 -> 1 -> -1/-1/-1
ixh:523729:523985 [0] NCCL INFO Tree 1 : -1 -> 1 -> 0/-1/-1
ixh:523729:523985 [0] NCCL INFO Ring 00 : 0 -> 1 -> 0
ixh:523729:523985 [0] NCCL INFO Ring 01 : 0 -> 1 -> 0
ixh:523729:523985 [0] NCCL INFO Trees [0] -1/-1/-1->1->0 [1] 0/-1/-1->1->-1
ixh:523729:523985 [0] NCCL INFO P2P Chunksize set to 131072
ixh:523729:523985 [0] NCCL INFO misc/utils.cc:235 memory stack hunk malloc(65536)
ixh:523729:523985 [0] NCCL INFO channel.cc:24 Cuda Alloc Size 1536 pointer 0x7f49bb800000
ixh:523729:523985 [0] NCCL INFO channel.cc:28 Cuda Alloc Size 8 pointer 0x7f49bb800600
ixh:523729:523985 [0] NCCL INFO channel.cc:24 Cuda Alloc Size 1536 pointer 0x7f49bb800800
ixh:523729:523985 [0] NCCL INFO channel.cc:28 Cuda Alloc Size 8 pointer 0x7f49bb800e00
ixh:523729:523988 [0] NCCL INFO Mem Realloc old size 0, new size 8 pointer 0x7f48ac004e80
ixh:523729:523988 [0] NCCL INFO Allocated 4194660 bytes of shared memory in /dev/shm/nccl-pSRjS6
ixh:523729:523988 [0] NCCL INFO New proxy recv connection 0 from local rank 0, transport 2
ixh:523729:523988 [0] NCCL INFO Received and initiated operation=Init res=0
axa:3048391:3048391 [0] NCCL INFO init.cc:1301 Cuda Host Alloc Size 4 pointer 0x7f961b400000
axa:3048391:3048566 [0] NCCL INFO Plugin Path : /opt/hpcx/nccl_rdma_sharp_plugin/lib/libnccl-net.so
axa:3048391:3048566 [0] NCCL INFO P2P plugin IBext
axa:3048391:3048566 [0] NCCL INFO NCCL_IBEXT_DISABLE set by environment to 1.
axa:3048391:3048566 [0] NCCL INFO net.cc:79 -> 3
axa:3048391:3048566 [0] NCCL INFO NCCL_IB_DISABLE set by environment to 1.
axa:3048391:3048566 [0] NCCL INFO NCCL_SOCKET_IFNAME set by environment to lo
axa:3048391:3048566 [0] NCCL INFO NCCL_SOCKET_IFNAME set to lo
axa:3048391:3048566 [0] NCCL INFO NET/Socket : Using [0]lo:127.0.0.1<0>
axa:3048391:3048566 [0] NCCL INFO Using network Socket
axa:3048391:3048566 [0] NCCL INFO Could not get speed from /sys/class/net/lo/speed. Defaulting to 10 Gbps.
axa:3048391:3048566 [0] NCCL INFO NET/Socket : GPU Direct RDMA Disabled for HCA 0 'lo'
axa:3048391:3048566 [0] NCCL INFO === System : maxBw 1.2 totalBw 264.0 ===
axa:3048391:3048566 [0] NCCL INFO CPU/3 (1/2/-1)
axa:3048391:3048566 [0] NCCL INFO + PCI[5000.0] - NIC/0
axa:3048391:3048566 [0] NCCL INFO                 + NET[1.2] - NET/0 (0/0/1.250000)
axa:3048391:3048566 [0] NCCL INFO + PCI[24.0] - PCI/1000 (1000c01010000000)
axa:3048391:3048566 [0] NCCL INFO               + PCI[24.0] - PCI/5000 (1000c01010de13b8)
axa:3048391:3048566 [0] NCCL INFO                             + PCI[24.0] - GPU/7000 (0)
axa:3048391:3048566 [0] NCCL INFO                                           + NVL[264.0] - NVS/0
axa:3048391:3048566 [0] NCCL INFO ==========================================
axa:3048391:3048566 [0] NCCL INFO GPU/7000 :GPU/7000 (0/5000.000000/LOC) CPU/3 (3/24.000000/PHB) NET/0 (5/1.250000/PHB)
axa:3048391:3048566 [0] NCCL INFO NET/0 :GPU/7000 (5/1.250000/PHB) CPU/3 (2/1.250000/PHB) NET/0 (0/5000.000000/LOC)
axa:3048391:3048566 [0] NCCL INFO Pattern 4, crossNic 0, nChannels 1, bw 1.200000/1.200000, type LOC/PHB, sameChannels 1
axa:3048391:3048566 [0] NCCL INFO  0 : NET/0 GPU/0 NET/0
axa:3048391:3048566 [0] NCCL INFO Pattern 3, crossNic 0, nChannels 1, bw 2.400000/1.200000, type LOC/PHB, sameChannels 1
axa:3048391:3048566 [0] NCCL INFO  0 : NET/0 GPU/0 NET/0
axa:3048391:3048566 [0] NCCL INFO Pattern 3, crossNic 0, nChannels 0, bw 0.000000/0.000000, type LOC/PIX, sameChannels 1
axa:3048391:3048566 [0] NCCL INFO Tree 0 : -1 -> 0 -> 1/-1/-1
axa:3048391:3048566 [0] NCCL INFO Tree 1 : 1 -> 0 -> -1/-1/-1
axa:3048391:3048566 [0] NCCL INFO Channel 00/02 :    0   1
axa:3048391:3048566 [0] NCCL INFO Channel 01/02 :    0   1
axa:3048391:3048566 [0] NCCL INFO Ring 00 : 1 -> 0 -> 1
axa:3048391:3048566 [0] NCCL INFO Ring 01 : 1 -> 0 -> 1
axa:3048391:3048566 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1 [1] -1/-1/-1->0->1
axa:3048391:3048566 [0] NCCL INFO P2P Chunksize set to 131072
axa:3048391:3048566 [0] NCCL INFO misc/utils.cc:235 memory stack hunk malloc(65536)
axa:3048391:3048566 [0] NCCL INFO channel.cc:24 Cuda Alloc Size 1536 pointer 0x7f96c7000000
axa:3048391:3048566 [0] NCCL INFO channel.cc:28 Cuda Alloc Size 8 pointer 0x7f96c7000600
axa:3048391:3048566 [0] NCCL INFO channel.cc:24 Cuda Alloc Size 1536 pointer 0x7f96c7000800
axa:3048391:3048566 [0] NCCL INFO channel.cc:28 Cuda Alloc Size 8 pointer 0x7f96c7000e00
axa:3048391:3048570 [0] NCCL INFO Mem Realloc old size 0, new size 8 pointer 0x7f95f8000b60
axa:3048391:3048570 [0] NCCL INFO Allocated 4194660 bytes of shared memory in /dev/shm/nccl-Y04u6d
axa:3048391:3048570 [0] NCCL INFO New proxy recv connection 0 from local rank 0, transport 2
axa:3048391:3048570 [0] NCCL INFO Received and initiated operation=Init res=0
axa:3048391:3048566 [0] NCCL INFO Connection to proxy localRank 0 -> connection 0x7f95f8004f30
axa:3048391:3048570 [0] NCCL INFO proxyConnSetupConnect for peer->localRank 0,
axa:3048391:3048570 [0] NCCL INFO proxyProgressAsync::proxySetup() opId=0x7f95f1788020
axa:3048391:3048570 [0] NCCL INFO Could not get speed from /sys/class/net/lo/speed. Defaulting to 10 Gbps.
axa:3048391:3048566 [0] NCCL INFO ncclPollProxyResponse Recieved new opId=0x7f95f1788020
axa:3048391:3048570 [0] NCCL INFO Received and initiated operation=Setup res=0
axa:3048391:3048566 [0] NCCL INFO recvOpId=0x7f95f1788020 matches expected opId=0x7f95f1788020
axa:3048391:3048566 [0] NCCL INFO Channel 00/0 : 1[1b000] -> 0[7000] [receive] via NET/Socket/0
axa:3048391:3048570 [0] NCCL INFO New proxy recv connection 1 from local rank 0, transport 2
axa:3048391:3048570 [0] NCCL INFO Received and initiated operation=Init res=0
axa:3048391:3048566 [0] NCCL INFO Connection to proxy localRank 0 -> connection 0x7f95f8004f70
axa:3048391:3048570 [0] NCCL INFO proxyConnSetupConnect for peer->localRank 0,
axa:3048391:3048570 [0] NCCL INFO proxyProgressAsync::proxySetup() opId=0x7f95f1788020
axa:3048391:3048570 [0] NCCL INFO Could not get speed from /sys/class/net/lo/speed. Defaulting to 10 Gbps.
axa:3048391:3048566 [0] NCCL INFO ncclPollProxyResponse Recieved new opId=0x7f95f1788020
axa:3048391:3048570 [0] NCCL INFO Received and initiated operation=Setup res=0
axa:3048391:3048566 [0] NCCL INFO recvOpId=0x7f95f1788020 matches expected opId=0x7f95f1788020
axa:3048391:3048566 [0] NCCL INFO Channel 01/0 : 1[1b000] -> 0[7000] [receive] via NET/Socket/0
axa:3048391:3048570 [0] NCCL INFO New proxy send connection 2 from local rank 0, transport 2
axa:3048391:3048570 [0] NCCL INFO Received and initiated operation=Init res=0
axa:3048391:3048566 [0] NCCL INFO Connection to proxy localRank 0 -> connection 0x7f95f8004fb0
axa:3048391:3048570 [0] NCCL INFO proxyConnSetupConnect for peer->localRank 0,
axa:3048391:3048570 [0] NCCL INFO proxyProgressAsync::proxySetup() opId=0x7f95f1788020
axa:3048391:3048570 [0] NCCL INFO Could not get speed from /sys/class/net/lo/speed. Defaulting to 10 Gbps.
axa:3048391:3048566 [0] NCCL INFO ncclPollProxyResponse Recieved new opId=0x7f95f1788020
axa:3048391:3048570 [0] NCCL INFO Received and initiated operation=Setup res=0
axa:3048391:3048566 [0] NCCL INFO recvOpId=0x7f95f1788020 matches expected opId=0x7f95f1788020
axa:3048391:3048566 [0] NCCL INFO Channel 00/0 : 0[7000] -> 1[1b000] [send] via NET/Socket/0
axa:3048391:3048570 [0] NCCL INFO New proxy send connection 3 from local rank 0, transport 2
axa:3048391:3048570 [0] NCCL INFO Received and initiated operation=Init res=0
axa:3048391:3048566 [0] NCCL INFO Connection to proxy localRank 0 -> connection 0x7f95f8004ff0
axa:3048391:3048570 [0] NCCL INFO proxyConnSetupConnect for peer->localRank 0,
axa:3048391:3048570 [0] NCCL INFO proxyProgressAsync::proxySetup() opId=0x7f95f1788020
axa:3048391:3048570 [0] NCCL INFO Could not get speed from /sys/class/net/lo/speed. Defaulting to 10 Gbps.
axa:3048391:3048566 [0] NCCL INFO ncclPollProxyResponse Recieved new opId=0x7f95f1788020
axa:3048391:3048570 [0] NCCL INFO Received and initiated operation=Setup res=0
axa:3048391:3048566 [0] NCCL INFO recvOpId=0x7f95f1788020 matches expected opId=0x7f95f1788020
axa:3048391:3048566 [0] NCCL INFO Channel 01/0 : 0[7000] -> 1[1b000] [send] via NET/Socket/0
axa:3048391:3048566 [0] NCCL INFO sendConnect ncclProxyCallAsync opId=0x7f95f1789440
axa:3048391:3048570 [0] NCCL INFO proxyConnSetupConnect for peer->localRank 0,
axa:3048391:3048570 [0] NCCL INFO proxyProgressAsync::proxyConnect() opId=0x7f95f1789440 op.reqBuff=0x7f95f801bfd0
axa:3048391:3048566 [0] NCCL INFO recvConnect ncclProxyCallAsync opId=0x7f95f1789570 &recv->proxyConn=0x7f95f1789578 connectInfo=0x7f95f179ae40
axa:3048391:3048566 [0] NCCL INFO sendConnect ncclProxyCallAsync opId=0x7f95f1789df8
axa:3048391:3048566 [0] NCCL INFO recvConnect ncclProxyCallAsync opId=0x7f95f1789f28 &recv->proxyConn=0x7f95f1789f30 connectInfo=0x7f95f179aec0

axa:3048391:3048570 [0] misc/socket.cc:480 NCCL WARN socketStartConnect: Connect to 127.0.0.1<53737> failed : Software caused connection abort
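
To see which address is listed first on each interface, a small helper along these lines may be useful. It is only a sketch: the function name `first_ipv4_per_iface` is made up for illustration, it assumes the one-line-per-address output of `ip -o -4 addr show`, and it assumes the order printed by `ip` matches the order NCCL sees, which is not confirmed in this thread.

```shell
#!/bin/sh
# Print the first IPv4 address listed for each interface.
# Reads `ip -o -4 addr show` style lines on stdin, for example:
#   1: lo    inet 127.0.0.1/8 scope host lo ...
# Fields: $2 = interface name, $3 = "inet", $4 = address/prefix.
first_ipv4_per_iface() {
  awk '$3 == "inet" && !seen[$2]++ { split($4, a, "/"); print $2, a[1] }'
}

# Usage on a live host (requires iproute2):
#   ip -o -4 addr show | first_ipv4_per_iface
```

If `lo` shows 127.0.0.1 here while the routable address is only a secondary entry, that matches the `NET/Socket : Using [0]lo:127.0.0.1<0>` line in the log above.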

sjeaugey commented on September 21, 2024

Indeed, NCCL does not support picking the second IP address of an IP interface.

blegear commented on September 21, 2024

Something like this should work for the case above.

Instead of assigning your routable /32 as a secondary address on the loopback, create a new dummy interface that owns the IP:

ip link add name primaryip type dummy
ip addr add x.x.x.x/32 dev primaryip
export NCCL_SOCKET_IFNAME=primaryip
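
The same steps can be wrapped in a small script so they can be previewed before touching the host. This is a sketch under a few assumptions: the helper name `setup_dummy_iface` and the `RUN` dry-run switch are made up for illustration, and the extra `ip link set ... up` step is not part of the commands above (dummy links are created in the down state, so bringing the link up may be needed).

```shell
#!/bin/sh
# Sketch of the dummy-interface workaround. RUN defaults to "echo", so the
# commands are only printed; set RUN="" (as root) to actually execute them.
RUN="${RUN-echo}"

setup_dummy_iface() {
  ifname="$1"   # interface name, e.g. primaryip
  addr="$2"     # the routable address, without the /32 suffix
  $RUN ip link add name "$ifname" type dummy
  $RUN ip addr add "$addr/32" dev "$ifname"
  # Assumption: dummy links start out down, so bring the link up.
  $RUN ip link set "$ifname" up
}

# Example (dry run): setup_dummy_iface primaryip x.x.x.x
# Then, in the environment that launches the job:
#   export NCCL_SOCKET_IFNAME=primaryip
```

The variable has to be exported (or passed through by the launcher) on every node so that each rank's NCCL process sees it.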
