
Comments (8)

martinjaggi commented on August 29, 2024

Yes.
And also check the impact of multiple GPUs per node (depending on which works better, this will also determine which setup we want to fix as the default in the benchmark task in the end. Right now it's 2.5 GPUs per node in some places, which is weird.)

from mlbench-benchmarks.

martinjaggi commented on August 29, 2024

Thanks for the update.
@ehoelzl can you add more details on the experiment setup (# nodes, GPUs per node, GPU type, etc.)?
@tvogels any idea about the 10x difference with NCCL? I remember in your earlier experiments NCCL basically always beat GLOO.


tvogels commented on August 29, 2024

I don’t know what could cause NCCL to be so slow. Indeed, I always saw it perform much better than GLOO. In my recent tests against a highly optimized MPI implementation, NCCL was still slightly faster than MPI for all-reduce. Are all tensors on GPUs?


ehoelzl commented on August 29, 2024

@tvogels Yes, all tensors are on the GPU. Could it be a connectivity issue?


tvogels commented on August 29, 2024

Could it be a connectivity issue?

I don't know... Here are some thoughts on the potential causes I can think of:

I guess you built a custom PyTorch with MPI support. If that is the case, could you try

import torch
print(torch.__config__.show())

and share the compilation settings? Maybe something went wrong there.

Another thing that comes to mind is that NCCL behaves a bit differently from the other backends: because it strictly operates on GPU tensors, ops return immediately and are put on a CUDA queue (even .wait()). They only synchronize/pause when you synchronize explicitly or transfer results to the CPU. Did you take that into account when benchmarking (e.g. by adding explicit CUDA barriers)?
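The pattern above can be sketched as follows. This is a minimal illustration, not the actual benchmark code; it assumes a process group has already been initialized via dist.init_process_group:

```python
# Sketch: timing an all-reduce with explicit synchronization, so the
# measurement covers the communication itself rather than just the
# asynchronous kernel launch that NCCL performs.
import time

import torch
import torch.distributed as dist


def timed_all_reduce(tensor: torch.Tensor) -> float:
    """Return the wall-clock time of one all-reduce on `tensor`."""
    if tensor.is_cuda:
        torch.cuda.synchronize()  # drain any previously enqueued GPU work
    start = time.perf_counter()
    dist.all_reduce(tensor)       # with NCCL this returns as soon as the op is enqueued
    if tensor.is_cuda:
        torch.cuda.synchronize()  # block until the reduction has actually finished
    return time.perf_counter() - start
```

Without the second synchronize, an NCCL timing only measures the enqueue, which can make NCCL look misleadingly fast, or shift its cost into whatever operation is timed next.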

For reference, here are my measurements (16 nodes, 1 K80 GPU per node). What I said before is not completely accurate: my MPI (with UCX) seems slightly faster than NCCL, but not by nearly as much as the difference you are seeing.
[image: all-reduce timing measurements, MPI (UCX) vs. NCCL]


ehoelzl commented on August 29, 2024

and share the compilation settings? Maybe something went wrong there.

Here is the output of that command:

PyTorch built with:
  - GCC 5.4
  - Intel(R) Math Kernel Library Version 2020.0.0 Product Build 20191122 for Intel(R) 64 architecture applications
  - Intel(R) MKL-DNN v0.21.1 (Git Hash 7d2fd500bc78936d1d648ca713b901012f470dbc)
  - OpenMP 201307 (a.k.a. OpenMP 4.0)
  - NNPACK is enabled
  - Build settings: BLAS=MKL, BUILD_NAMEDTENSOR=OFF, BUILD_TYPE=Release, CXX_FLAGS= -Wno-deprecated -fvisibility-inlines-hidden -fopenmp -DUSE_FBGEMM -DUSE_QNNPACK -DUSE_PYTORCH_QNNPACK -O2 -fPIC -Wno-narrowing -Wall -Wextra -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wno-sign-compare -Wno-unused-parameter -Wno-unused-variable -Wno-unused-function -Wno-unused-result -Wno-strict-overflow -Wno-strict-aliasing -Wno-error=deprecated-declarations -Wno-error=pedantic -Wno-error=redundant-decls -Wno-error=old-style-cast -fdiagnostics-color=always -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math, DISABLE_NUMA=1, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, PERF_WITH_AVX512=1, USE_CUDA=ON, USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_MKL=ON, USE_MKLDNN=ON, USE_MPI=ON, USE_NCCL=ON, USE_NNPACK=ON, USE_OPENMP=ON, USE_STATIC_DISPATCH=OFF, 

Another thing that comes to mind is that NCCL behaves a bit differently from the other backends: because it strictly operates on GPU tensors, ops return immediately and are put on a CUDA queue (even .wait()). They only synchronize/pause when you synchronize explicitly or transfer results to the CPU. Did you take that into account when benchmarking (e.g. by adding explicit CUDA barriers)?

After every reduction, I added a torch.cuda.synchronize() and printed the first element of the reduced tensor (which forces a device-to-host transfer).

I'm really not sure where this huge difference comes from. Let me know if you have any ideas.


ehoelzl commented on August 29, 2024

@tvogels Do you use any NCCL environment variables? Also, do you have any special interconnect between the nodes (e.g. NVLink)?

By reducing NCCL_BUFFSIZE, I can get slightly better performance (but still worse than MPI), and likewise by setting NCCL_IB_DISABLE=1, which disables InfiniBand and forces the use of IP sockets.

Also, what version of NCCL are you using? I'm on 2.4.8+cuda10.1.
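For reference, settings like these are applied as environment variables before launching the job, and the NCCL version the PyTorch build links against can be queried from Python. A sketch; the buffer size value here is illustrative, not the one used in the experiments:

```shell
# Tune NCCL behaviour for the benchmark run (values are illustrative).
export NCCL_BUFFSIZE=1048576   # shrink the communication buffer
export NCCL_IB_DISABLE=1       # disable InfiniBand, forcing IP sockets
export NCCL_DEBUG=INFO         # optional: log which transport NCCL picks

# Check which NCCL version the PyTorch build links against.
python -c "import torch; print(torch.cuda.nccl.version())"
```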


ehoelzl commented on August 29, 2024

@iamtao @martinjaggi I have updated the issue with a new graph from my experiments. We still need to understand why NCCL performs so much worse.

