
Comments (8)

martinjaggi commented on August 29, 2024

Yes.
And also check the impact of multiple GPUs per node (depending on which works better, this will also determine which setup we want to fix as the default in the benchmark task in the end. Right now it's 2.5 GPUs per node in some places, which is weird.)

from mlbench-benchmarks.

martinjaggi commented on August 29, 2024

Thanks for the update.
@ehoelzl can you add more details on the experiment setup (# nodes, GPUs per node, GPU type, etc.)?
@tvogels any idea about the 10x difference with NCCL? I remember in your earlier experiments NCCL basically always beat GLOO.


tvogels commented on August 29, 2024

I don’t know what could cause NCCL to be so slow. Indeed, I always saw it perform much better than GLOO. In my recent tests against a highly optimized MPI implementation, NCCL was still slightly faster than MPI for all-reduce. Are all tensors on GPUs?


ehoelzl commented on August 29, 2024

@tvogels Yes, all tensors are on the GPU. Could it be a connectivity issue?


tvogels commented on August 29, 2024

Could it be a connectivity issue?

I don't know... Here are some thoughts on the potential causes I can think of:

I guess you built a custom PyTorch with MPI support. If that is the case, could you try

import torch
print(torch.__config__.show())

and share the compilation settings? Maybe something went wrong there.

Another thing that comes to mind is that NCCL behaves a bit differently from the other backends: because it strictly operates on GPU tensors, ops return immediately and are put on a CUDA queue (even .wait()). They only synchronize/pause when you synchronize explicitly or transfer results to the CPU. Did you take that into account when benchmarking (e.g. by adding explicit CUDA barriers)?
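The pattern above can be sketched as follows. This is a minimal illustration, not the actual benchmark code; it assumes a process group has already been initialized via dist.init_process_group:

```python
# Sketch: timing an all-reduce with explicit synchronization, so the
# measurement covers the communication itself rather than just the
# asynchronous kernel launch that NCCL performs.
import time

import torch
import torch.distributed as dist


def timed_all_reduce(tensor: torch.Tensor) -> float:
    """Return the wall-clock time of one all-reduce on `tensor`."""
    if tensor.is_cuda:
        torch.cuda.synchronize()  # drain any previously enqueued GPU work
    start = time.perf_counter()
    dist.all_reduce(tensor)       # with NCCL this returns as soon as the op is enqueued
    if tensor.is_cuda:
        torch.cuda.synchronize()  # block until the reduction has actually finished
    return time.perf_counter() - start
```

Without the second synchronize, an NCCL timing only measures the enqueue, which can make NCCL look misleadingly fast, or shift its cost into whatever operation is timed next.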

For reference, here are my measurements (16 nodes, 1 K80 GPU per node). What I said before is not completely accurate: my MPI (with UCX) seems slightly faster than NCCL, but not by nearly as much as the difference you are seeing.
[image: all-reduce timing measurements, MPI (UCX) vs. NCCL]


ehoelzl commented on August 29, 2024

and share the compilation settings? Maybe something went wrong there.

Here is the output of that command:

PyTorch built with:
  - GCC 5.4
  - Intel(R) Math Kernel Library Version 2020.0.0 Product Build 20191122 for Intel(R) 64 architecture applications
  - Intel(R) MKL-DNN v0.21.1 (Git Hash 7d2fd500bc78936d1d648ca713b901012f470dbc)
  - OpenMP 201307 (a.k.a. OpenMP 4.0)
  - NNPACK is enabled
  - Build settings: BLAS=MKL, BUILD_NAMEDTENSOR=OFF, BUILD_TYPE=Release, CXX_FLAGS= -Wno-deprecated -fvisibility-inlines-hidden -fopenmp -DUSE_FBGEMM -DUSE_QNNPACK -DUSE_PYTORCH_QNNPACK -O2 -fPIC -Wno-narrowing -Wall -Wextra -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wno-sign-compare -Wno-unused-parameter -Wno-unused-variable -Wno-unused-function -Wno-unused-result -Wno-strict-overflow -Wno-strict-aliasing -Wno-error=deprecated-declarations -Wno-error=pedantic -Wno-error=redundant-decls -Wno-error=old-style-cast -fdiagnostics-color=always -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math, DISABLE_NUMA=1, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, PERF_WITH_AVX512=1, USE_CUDA=ON, USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_MKL=ON, USE_MKLDNN=ON, USE_MPI=ON, USE_NCCL=ON, USE_NNPACK=ON, USE_OPENMP=ON, USE_STATIC_DISPATCH=OFF, 

Another thing that comes to mind is that NCCL behaves a bit differently from the other backends: because it strictly operates on GPU tensors, ops return immediately and are put on a CUDA queue (even .wait()). They only synchronize/pause when you synchronize explicitly or transfer results to the CPU. Did you take that into account when benchmarking (e.g. by adding explicit CUDA barriers)?

After every reduction, I added a torch.cuda.synchronize() and printed the first element of the reduced tensor (which forces a device-to-host transfer).

I'm really not sure where this huge difference comes from. Let me know if you have any ideas.


ehoelzl commented on August 29, 2024

@tvogels Do you use any NCCL environment variables? Also, do you have any special interconnect between the nodes (e.g. NVLink)?

By reducing NCCL_BUFFSIZE, I can get slightly better performance (but still worse than MPI), and likewise by setting NCCL_IB_DISABLE=1, which disables InfiniBand and forces the use of IP sockets.

Also, what version of NCCL are you using? I'm on 2.4.8+cuda10.1.
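For reference, settings like these are applied as environment variables before launching the job, and the NCCL version the PyTorch build links against can be queried from Python. A sketch; the buffer size value here is illustrative, not the one used in the experiments:

```shell
# Tune NCCL behaviour for the benchmark run (values are illustrative).
export NCCL_BUFFSIZE=1048576   # shrink the communication buffer
export NCCL_IB_DISABLE=1       # disable InfiniBand, forcing IP sockets
export NCCL_DEBUG=INFO         # optional: log which transport NCCL picks

# Check which NCCL version the PyTorch build links against.
python -c "import torch; print(torch.cuda.nccl.version())"
```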


ehoelzl commented on August 29, 2024

@iamtao @martinjaggi I have updated the issue with a new graph from my experiments. We still need to understand why NCCL performs so much worse.

