Comments (5)

GeofferyGeng avatar GeofferyGeng commented on August 16, 2024

More information: in the sendrecv_perf test, there is no difference between 2.18, 2.19, and 2.10.

sjeaugey avatar sjeaugey commented on August 16, 2024

It could be that in 2.18, by default we'd use 32 channels for collectives, hence 32 channels for p2p. In 2.19 we have reduced the memory footprint and SM usage to something more reasonable, but that may have impacted the alltoall performance.

But first, I'd advise unsetting NCCL_NCHANNELS_PER_NET_PEER. Setting it to 8 can have a negative effect on alltoall operations. Can you run the comparison again without that variable set?
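
For example, a minimal A/B sketch of that comparison (the host list, binary path, and test flags are placeholders modeled on the commands elsewhere in this thread; the only difference between the two runs is whether NCCL_NCHANNELS_PER_NET_PEER is exported):

# Run 1: previous setting (hypothetical value of 8, as discussed above)
mpirun --allow-run-as-root --host $hosts -x NCCL_NCHANNELS_PER_NET_PEER=8 /root/nccl-tests/build/alltoall_perf --ngpus=1 --minbytes=64M --maxbytes=16G --stepfactor=2 --iters=200
# Run 2: identical command, with the variable left unset
mpirun --allow-run-as-root --host $hosts /root/nccl-tests/build/alltoall_perf --ngpus=1 --minbytes=64M --maxbytes=16G --stepfactor=2 --iters=200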

GeofferyGeng avatar GeofferyGeng commented on August 16, 2024

Thank you for your reply.

We removed NCCL_NCHANNELS_PER_NET_PEER from the command and ran it on 8 nodes. However, the performance was still degraded by about 2 GB/s.
date && /usr/mpi/gcc/openmpi-4.1.7a1/bin/mpirun --allow-run-as-root --mca oob_tcp_if_include bond2 --bind-to none --host $hosts -x UCX_NET_DEVICES=bond2 -x UCX_IB_GID_INDEX=3 -x NCCL_SOCKET_IFNAME==bond2 -x NCCL_IB_HCA==mlx5_bond_1,mlx5_bond_2,mlx5_bond_3,mlx5_bond_4,mlx5_bond_5,mlx5_bond_6,mlx5_bond_7,mlx5_bond_8 -x NCCL_IB_GID_INDEX=3 -x NCCL_IB_QPS_PER_CONNECTION=2 -x NCCL_MIN_NCHANNELS=16 -x NCCL_DEBUG=INFO -x NCCL_DEBUG_FILE=/dev/stderr -x NCCL_IB_SPLIT_DATA_ON_QPS=0 /root/nccl-tests/build/alltoall_perf --ngpus=1 --minbytes=64M --maxbytes=16G --stepfactor=2 --iters=200 2>/dev/null

As you said, "In 2.19 we have reduced the memory footprint and SM usage to something more reasonable". Are there environment variables we can set to force NCCL to use more SMs and get higher performance? We tried NCCL_MIN_P2P_NCHANNELS=16/32 to use more SMs, but it did not help.
date && /usr/mpi/gcc/openmpi-4.1.7a1/bin/mpirun --allow-run-as-root --mca oob_tcp_if_include bond2 --bind-to none --host $hosts -x UCX_NET_DEVICES=bond2 -x UCX_IB_GID_INDEX=3 -x NCCL_SOCKET_IFNAME==bond2 -x NCCL_IB_HCA==mlx5_bond_1,mlx5_bond_2,mlx5_bond_3,mlx5_bond_4,mlx5_bond_5,mlx5_bond_6,mlx5_bond_7,mlx5_bond_8 -x NCCL_IB_GID_INDEX=3 -x NCCL_IB_QPS_PER_CONNECTION=2 -x NCCL_MIN_NCHANNELS=16 -x NCCL_MIN_P2P_NCHANNELS=16 -x NCCL_DEBUG=INFO -x NCCL_DEBUG_FILE=/dev/stderr -x NCCL_IB_SPLIT_DATA_ON_QPS=0 /root/nccl-tests/build/alltoall_perf --ngpus=1 --minbytes=64M --maxbytes=16G --stepfactor=2 --iters=200 2>/dev/null
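
Since these commands already set NCCL_DEBUG=INFO, one way to confirm whether NCCL_MIN_P2P_NCHANNELS took effect is to count the channel-setup lines in the debug output. A rough sketch, assuming the debug log is written to a file (here a hypothetical /tmp/nccl_debug.log instead of /dev/stderr) and that the "NCCL INFO Channel" init lines are present (their exact format varies between NCCL versions):

# Hypothetical check: how many channel lines does rank 0 report during init?
grep -c 'NCCL INFO Channel' /tmp/nccl_debug.log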

sjeaugey avatar sjeaugey commented on August 16, 2024

Sorry, I missed that the GPUs are H800. The number of channels would likely be limited due to the number of NVLinks, so my theory doesn't hold (and your experiments confirmed that).

Unfortunately I don't see much else you could play with to optimize the alltoall performance. Given that you have 2 ports per NIC, I'm wondering whether NCCL_IB_QPS_PER_CONNECTION=2 could hurt, having to progress too many QPs at once.

On the other hand, given that you're setting that environment variable, I'm guessing the fabric is RoCE. Given the lack of good adaptive routing on most RoCE fabrics, optimizing performance on RoCE can be tricky, and any change in the algorithm/chunk size/timing can make performance go up or down, so it goes beyond NCCL.
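
One way to test that hypothesis is a small sweep over NCCL_IB_QPS_PER_CONNECTION with everything else held constant. A sketch, with the host list, remaining environment variables, and paths as placeholders for the full commands above:

# Hypothetical sweep: keep all other flags identical and compare bus bandwidth per setting
for qps in 2 4; do
  mpirun --allow-run-as-root --host $hosts -x NCCL_IB_QPS_PER_CONNECTION=$qps /root/nccl-tests/build/alltoall_perf --ngpus=1 --minbytes=64M --maxbytes=16G --stepfactor=2 --iters=200
done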

GeofferyGeng avatar GeofferyGeng commented on August 16, 2024

NCCL_IB_QPS_PER_CONNECTION does hurt performance. On 2.18, the more QPs we use per connection, the lower the performance. But we use bonded NICs, so the minimum number of QPs per connection is 2. Considering that 2.18 can reach a satisfying bandwidth, 2 is probably suitable.

We tested more combinations of variables and finally found that increasing NCCL_NCHANNELS_PER_NET_PEER to 32 brings a bit more performance; in the end the switch port reached 85% utilization.
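
For reference, a sketch of the kind of run described (not the exact command; the host list and the remaining environment variables from the earlier commands are assumed unchanged):

# Hypothetical final configuration: more channels per network peer, 2 QPs per connection
mpirun --allow-run-as-root --host $hosts -x NCCL_NCHANNELS_PER_NET_PEER=32 -x NCCL_IB_QPS_PER_CONNECTION=2 /root/nccl-tests/build/alltoall_perf --ngpus=1 --minbytes=64M --maxbytes=16G --stepfactor=2 --iters=200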

If you have any other suggestions at any time, I would be very grateful.
