Comments (5)

GeofferyGeng avatar GeofferyGeng commented on August 16, 2024

More information: in the sendrecv_perf test, there is no difference between 2.18, 2.19, and 2.10.

sjeaugey avatar sjeaugey commented on August 16, 2024

It could be that in 2.18, by default we'd use 32 channels for collectives, hence 32 channels for p2p. In 2.19 we have reduced the memory footprint and SM usage to something more reasonable, but that may have impacted the alltoall performance.

But first, I'd advise unsetting NCCL_NCHANNELS_PER_NET_PEER. Setting it to 8 can have a negative effect on alltoall operations. Can you run the comparison again without that variable set?
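
For example, a minimal A/B sketch of that comparison (the host list, binary path, and test flags are placeholders modeled on the commands elsewhere in this thread; the only difference between the two runs is whether NCCL_NCHANNELS_PER_NET_PEER is exported):

# Run 1: previous setting (hypothetical value of 8, as discussed above)
mpirun --allow-run-as-root --host $hosts -x NCCL_NCHANNELS_PER_NET_PEER=8 /root/nccl-tests/build/alltoall_perf --ngpus=1 --minbytes=64M --maxbytes=16G --stepfactor=2 --iters=200
# Run 2: identical command, with the variable left unset
mpirun --allow-run-as-root --host $hosts /root/nccl-tests/build/alltoall_perf --ngpus=1 --minbytes=64M --maxbytes=16G --stepfactor=2 --iters=200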

GeofferyGeng avatar GeofferyGeng commented on August 16, 2024

Thank you for your reply.

We removed NCCL_NCHANNELS_PER_NET_PEER from the command and ran it on 8 nodes. However, the performance was still degraded by about 2 GB/s.
date && /usr/mpi/gcc/openmpi-4.1.7a1/bin/mpirun --allow-run-as-root --mca oob_tcp_if_include bond2 --bind-to none --host $hosts -x UCX_NET_DEVICES=bond2 -x UCX_IB_GID_INDEX=3 -x NCCL_SOCKET_IFNAME==bond2 -x NCCL_IB_HCA==mlx5_bond_1,mlx5_bond_2,mlx5_bond_3,mlx5_bond_4,mlx5_bond_5,mlx5_bond_6,mlx5_bond_7,mlx5_bond_8 -x NCCL_IB_GID_INDEX=3 -x NCCL_IB_QPS_PER_CONNECTION=2 -x NCCL_MIN_NCHANNELS=16 -x NCCL_DEBUG=INFO -x NCCL_DEBUG_FILE=/dev/stderr -x NCCL_IB_SPLIT_DATA_ON_QPS=0 /root/nccl-tests/build/alltoall_perf --ngpus=1 --minbytes=64M --maxbytes=16G --stepfactor=2 --iters=200 2>/dev/null

As you said, "In 2.19 we have reduced the memory footprint and SM usage to something more reasonable". Are there environment variables we can set to force NCCL to use more SMs and get higher performance? We tried NCCL_MIN_P2P_NCHANNELS=16/32 to use more SMs, but it did not help.
date && /usr/mpi/gcc/openmpi-4.1.7a1/bin/mpirun --allow-run-as-root --mca oob_tcp_if_include bond2 --bind-to none --host $hosts -x UCX_NET_DEVICES=bond2 -x UCX_IB_GID_INDEX=3 -x NCCL_SOCKET_IFNAME==bond2 -x NCCL_IB_HCA==mlx5_bond_1,mlx5_bond_2,mlx5_bond_3,mlx5_bond_4,mlx5_bond_5,mlx5_bond_6,mlx5_bond_7,mlx5_bond_8 -x NCCL_IB_GID_INDEX=3 -x NCCL_IB_QPS_PER_CONNECTION=2 -x NCCL_MIN_NCHANNELS=16 -x NCCL_MIN_P2P_NCHANNELS=16 -x NCCL_DEBUG=INFO -x NCCL_DEBUG_FILE=/dev/stderr -x NCCL_IB_SPLIT_DATA_ON_QPS=0 /root/nccl-tests/build/alltoall_perf --ngpus=1 --minbytes=64M --maxbytes=16G --stepfactor=2 --iters=200 2>/dev/null
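
Since these commands already set NCCL_DEBUG=INFO, one way to confirm whether NCCL_MIN_P2P_NCHANNELS took effect is to count the channel-setup lines in the debug output. A rough sketch, assuming the debug log is written to a file (here a hypothetical /tmp/nccl_debug.log instead of /dev/stderr) and that the "NCCL INFO Channel" init lines are present (their exact format varies between NCCL versions):

# Hypothetical check: how many channel lines does rank 0 report during init?
grep -c 'NCCL INFO Channel' /tmp/nccl_debug.log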

sjeaugey avatar sjeaugey commented on August 16, 2024

Sorry, I missed that the GPUs are H800. The number of channels would likely be limited due to the number of NVLinks, so my theory doesn't hold (and your experiments confirmed that).

Unfortunately I don't see much else you could play with to optimize the alltoall performance. Given that you have 2 ports per NIC, I'm wondering whether NCCL_IB_QPS_PER_CONNECTION=2 could hurt, having to progress too many QPs at once.

On the other hand, given that you're setting that environment variable, I'm guessing the fabric is RoCE. Given the lack of good adaptive routing on most RoCE fabrics, optimizing performance on RoCE can be tricky, and any change in the algorithm/chunk size/timing can make performance go up or down, so it goes beyond NCCL.
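
One way to test that hypothesis is a small sweep over NCCL_IB_QPS_PER_CONNECTION with everything else held constant. A sketch, with the host list, remaining environment variables, and paths as placeholders for the full commands above:

# Hypothetical sweep: keep all other flags identical and compare bus bandwidth per setting
for qps in 2 4; do
  mpirun --allow-run-as-root --host $hosts -x NCCL_IB_QPS_PER_CONNECTION=$qps /root/nccl-tests/build/alltoall_perf --ngpus=1 --minbytes=64M --maxbytes=16G --stepfactor=2 --iters=200
done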

GeofferyGeng avatar GeofferyGeng commented on August 16, 2024

NCCL_IB_QPS_PER_CONNECTION does hurt performance. On 2.18, the more QPs we use per connection, the lower the performance. But we use bonded NICs, so the minimum number of QPs per connection is 2. Considering that 2.18 can reach a satisfying bandwidth, 2 is probably suitable.

We tested more combinations of variables and finally found that increasing NCCL_NCHANNELS_PER_NET_PEER to 32 brings a bit more performance; in the end the switch port reached 85% utilization.
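
For reference, a sketch of the kind of run described (not the exact command; the host list and the remaining environment variables from the earlier commands are assumed unchanged):

# Hypothetical final configuration: more channels per network peer, 2 QPs per connection
mpirun --allow-run-as-root --host $hosts -x NCCL_NCHANNELS_PER_NET_PEER=32 -x NCCL_IB_QPS_PER_CONNECTION=2 /root/nccl-tests/build/alltoall_perf --ngpus=1 --minbytes=64M --maxbytes=16G --stepfactor=2 --iters=200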

If you have any other suggestions at any time, I would be very grateful.
