Comments (5)
More information: in the sendrecv_perf test, there is no difference between 2.18, 2.19, and 2.10.
from nccl.
It could be that in 2.18, by default we'd use 32 channels for collectives, hence 32 channels for p2p. In 2.19 we have reduced the memory footprint and SM usage to something more reasonable, but that may have impacted the alltoall performance.
But first, I'd advise to unset NCCL_NCHANNELS_PER_NET_PEER. Setting it to 8 can have a negative effect on alltoall operations. Can you run the comparison again without that variable set?
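Running the comparison without the variable just means dropping the `-x NCCL_NCHANNELS_PER_NET_PEER=...` flag from the mpirun line so NCCL falls back to its built-in default. A minimal dry-run sketch (the host list is a placeholder, and most mpirun options from the thread are omitted for brevity; it only prints the command for inspection):

```shell
# Dry run: compose the test command with NCCL_NCHANNELS_PER_NET_PEER left unset,
# so NCCL chooses its own default. Hosts are an assumption; adapt to your cluster.
hosts="node1,node2,node3,node4,node5,node6,node7,node8"
cmd="mpirun --allow-run-as-root --host $hosts \
-x NCCL_SOCKET_IFNAME==bond2 -x NCCL_IB_GID_INDEX=3 \
/root/nccl-tests/build/alltoall_perf --ngpus=1 --minbytes=64M --maxbytes=16G --stepfactor=2 --iters=200"
# Note: no -x NCCL_NCHANNELS_PER_NET_PEER anywhere in the command.
echo "$cmd"
```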
Thank you for your reply.
We removed NCCL_NCHANNELS_PER_NET_PEER from the command and ran it on 8 nodes. However, performance degraded by about 2 GB/s.
date && /usr/mpi/gcc/openmpi-4.1.7a1/bin/mpirun --allow-run-as-root --mca oob_tcp_if_include bond2 --bind-to none --host $hosts -x UCX_NET_DEVICES=bond2 -x UCX_IB_GID_INDEX=3 -x NCCL_SOCKET_IFNAME==bond2 -x NCCL_IB_HCA==mlx5_bond_1,mlx5_bond_2,mlx5_bond_3,mlx5_bond_4,mlx5_bond_5,mlx5_bond_6,mlx5_bond_7,mlx5_bond_8 -x NCCL_IB_GID_INDEX=3 -x NCCL_IB_QPS_PER_CONNECTION=2 -x NCCL_MIN_NCHANNELS=16 -x NCCL_DEBUG=INFO -x NCCL_DEBUG_FILE=/dev/stderr -x NCCL_IB_SPLIT_DATA_ON_QPS=0 /root/nccl-tests/build/alltoall_perf --ngpus=1 --minbytes=64M --maxbytes=16G --stepfactor=2 --iters=200 2>/dev/null
As you said, "In 2.19 we have reduced the memory footprint and SM usage to something more reasonable." Are there environment variables we can set to force NCCL to use more SMs and get higher performance? We tried NCCL_MIN_P2P_NCHANNELS=16/32 to use more SMs, but it didn't work.
date && /usr/mpi/gcc/openmpi-4.1.7a1/bin/mpirun --allow-run-as-root --mca oob_tcp_if_include bond2 --bind-to none --host $hosts -x UCX_NET_DEVICES=bond2 -x UCX_IB_GID_INDEX=3 -x NCCL_SOCKET_IFNAME==bond2 -x NCCL_IB_HCA==mlx5_bond_1,mlx5_bond_2,mlx5_bond_3,mlx5_bond_4,mlx5_bond_5,mlx5_bond_6,mlx5_bond_7,mlx5_bond_8 -x NCCL_IB_GID_INDEX=3 -x NCCL_IB_QPS_PER_CONNECTION=2 -x NCCL_MIN_NCHANNELS=16 -x NCCL_MIN_P2P_NCHANNELS=16 -x NCCL_DEBUG=INFO -x NCCL_DEBUG_FILE=/dev/stderr -x NCCL_IB_SPLIT_DATA_ON_QPS=0 /root/nccl-tests/build/alltoall_perf --ngpus=1 --minbytes=64M --maxbytes=16G --stepfactor=2 --iters=200 2>/dev/null
Sorry, I missed that the GPUs were H800. The number of channels would likely be limited due to the number of NVLinks, so my theory doesn't hold (and your experiments confirmed that).
Unfortunately I don't see much else you could play with to optimize the alltoall performance. Given that you have 2 ports per NIC, I'm wondering whether NCCL_IB_QPS_PER_CONNECTION=2 could hurt, having to progress too many QPs at once.
On the other hand, given you're setting that environment variable, I'm guessing the fabric is RoCE. Given the lack of a good adaptive routing on most RoCE fabrics, optimizing performance on RoCE can be tricky and any change in the algorithm/chunk size/timing can make performance go up or down, so it goes beyond NCCL.
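One way to check whether QP count is the bottleneck is to sweep NCCL_IB_QPS_PER_CONNECTION and compare the reported bus bandwidth at each setting. A hedged dry-run sketch (it only prints one command per setting; the mpirun options are deliberately elided with "..." and must be filled in from the full command earlier in the thread):

```shell
# Dry run: print one nccl-tests invocation per QP setting so they can be
# launched on the cluster and their busbw columns compared.
results=""
for qps in 1 2 4; do
  line="NCCL_IB_QPS_PER_CONNECTION=$qps mpirun ... /root/nccl-tests/build/alltoall_perf --minbytes=64M --maxbytes=16G --stepfactor=2 --iters=200"
  echo "$line"
  results="$results$qps "
done
```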
NCCL_IB_QPS_PER_CONNECTION does hurt performance: on 2.18, the more QPs we use per connection, the lower the performance. But since we use bonded NICs, the minimum number of QPs per connection is 2. Considering that 2.18 can reach a satisfying bandwidth, 2 may be suitable.
We tested more combinations of variables, and finally found that increasing NCCL_NCHANNELS_PER_NET_PEER to 32 brought a bit more performance; the switch port utilization reached 85% in the end.
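For anyone reproducing this, the variable sweep can be scripted. A dry-run sketch that only prints the extra flag to append to the full mpirun line used earlier in this thread (the value range is an assumption, not a recommendation from NCCL):

```shell
# Dry run: emit one candidate -x flag per NCCL_NCHANNELS_PER_NET_PEER value.
# Append each printed flag to the full mpirun command and compare busbw.
for ch in 4 8 16 32; do
  echo "-x NCCL_NCHANNELS_PER_NET_PEER=$ch"
done
```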
If you have any other suggestions at any time, I would be very grateful.