NCCL

Optimized primitives for inter-GPU communication.

Introduction

NCCL (pronounced "Nickel") is a stand-alone library of standard communication routines for GPUs, implementing all-reduce, all-gather, reduce, broadcast, reduce-scatter, as well as any send/receive based communication pattern. It has been optimized to achieve high bandwidth on platforms using PCIe, NVLink, NVswitch, as well as networking using InfiniBand Verbs or TCP/IP sockets. NCCL supports an arbitrary number of GPUs installed in a single node or across multiple nodes, and can be used in either single- or multi-process (e.g., MPI) applications.

For more information on NCCL usage, please refer to the NCCL documentation.
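
For orientation, the sketch below shows how a single process can drive several GPUs with one communicator per device. It is a minimal illustration rather than an excerpt from the documentation: error handling is omitted, the helper name allreduce_example and the buffer arrays sendbuff/recvbuff are placeholders, and it assumes an NCCL 2.x release (for ncclGroupStart/ncclGroupEnd) with at most 8 devices.

// Minimal single-process, multi-GPU all-reduce sketch.
// sendbuff[i] and recvbuff[i] are device buffers of `count` floats on GPU i.
#include <nccl.h>
#include <cuda_runtime.h>

void allreduce_example(float** sendbuff, float** recvbuff, size_t count, int nDev) {
  ncclComm_t comms[8];        // assumes nDev <= 8
  cudaStream_t streams[8];
  int devs[8];
  for (int i = 0; i < nDev; ++i) devs[i] = i;

  ncclCommInitAll(comms, nDev, devs);              // one communicator per GPU
  for (int i = 0; i < nDev; ++i) {
    cudaSetDevice(i);
    cudaStreamCreate(&streams[i]);
  }

  ncclGroupStart();                                // launch all ranks' collectives together
  for (int i = 0; i < nDev; ++i)
    ncclAllReduce(sendbuff[i], recvbuff[i], count, ncclFloat, ncclSum,
                  comms[i], streams[i]);
  ncclGroupEnd();

  for (int i = 0; i < nDev; ++i) {                 // wait for completion
    cudaSetDevice(i);
    cudaStreamSynchronize(streams[i]);
  }
  for (int i = 0; i < nDev; ++i) ncclCommDestroy(comms[i]);
}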

Build

Note: the official and tested builds of NCCL can be downloaded from: https://developer.nvidia.com/nccl. You can skip the following build steps if you choose to use the official builds.

To build the library :

$ cd nccl
$ make -j src.build

If CUDA is not installed in the default /usr/local/cuda path, you can define the CUDA path with :

$ make src.build CUDA_HOME=<path to cuda install>

NCCL will be compiled and installed in build/ unless BUILDDIR is set.

By default, NCCL is compiled for all supported architectures. To accelerate the compilation and reduce the binary size, consider redefining NVCC_GENCODE (defined in makefiles/common.mk) to only include the architecture of the target platform :

$ make -j src.build NVCC_GENCODE="-gencode=arch=compute_70,code=sm_70"

Install

To install NCCL on the system, create a package then install it as root.

Debian/Ubuntu :

$ # Install tools to create debian packages
$ sudo apt install build-essential devscripts debhelper fakeroot
$ # Build NCCL deb package
$ make pkg.debian.build
$ ls build/pkg/deb/

RedHat/CentOS :

$ # Install tools to create rpm packages
$ sudo yum install rpm-build rpmdevtools
$ # Build NCCL rpm package
$ make pkg.redhat.build
$ ls build/pkg/rpm/

OS-agnostic tarball :

$ make pkg.txz.build
$ ls build/pkg/txz/

Tests

Tests for NCCL are maintained separately at https://github.com/nvidia/nccl-tests.

$ git clone https://github.com/NVIDIA/nccl-tests.git
$ cd nccl-tests
$ make
$ ./build/all_reduce_perf -b 8 -e 256M -f 2 -g <ngpus>

Copyright

All source code and accompanying documentation is copyright (c) 2015-2020, NVIDIA CORPORATION. All rights reserved.


nccl's Issues

Receive MPI warning when running test/mpi/mpi_test

Hi all,
I receive the following warning when running test/mpi/mpi_test:

A process has executed an operation involving a call to the
"fork()" system call to create a child process.  Open MPI is currently
operating in a condition that could result in memory corruption or
other system errors; your job may hang, crash, or produce silent
data corruption.  The use of fork() (or system() or other calls that
create child processes) is strongly discouraged.

The process that invoked fork was:

  Local host:          [[46957,1],3] (PID 2896)

If you are *absolutely sure* that your application will successfully
and correctly survive a call to fork(), you may disable this warning
by setting the mpi_warn_on_fork MCA parameter to 0.

Test Passed!

The test passed, but does it matter, and how can I avoid it?

I use OpenMPI 2.0.0. It also seems that I cannot get any information when the program crashes, which is important for my application.

mpi + nccl hangs in multi-process scenarios

Hi NCCL team,
I observed mysterious hangs with MPI + NCCL. Basically, I follow the test case nccl/test/mpi/mpi_test.cu. My scenario is also a single GPU per process (rank). Issue #37 discussed a multi-threaded scenario where the hang was resolved by adding a boost::barrier before the NCCL call. Similarly, I added an MPI_Barrier() before the NCCL call in my case, but it still hangs.

Is this a known issue with NCCL? Maybe I'm missing something. Do you have any suggestions for fixing it?
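
For reference, the usual one-rank-per-GPU setup with MPI looks like the sketch below (a generic illustration, not a diagnosis of the hang above; error checking is omitted and the device-selection line is an assumption about the node layout): rank 0 creates the unique id, MPI broadcasts it, and every rank then calls ncclCommInitRank.

#include <mpi.h>
#include <nccl.h>
#include <cuda_runtime.h>

int main(int argc, char* argv[]) {
  int rank, nranks;
  MPI_Init(&argc, &argv);
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  MPI_Comm_size(MPI_COMM_WORLD, &nranks);

  cudaSetDevice(rank % 8);                     // assumes at most 8 GPUs per node

  ncclUniqueId id;
  if (rank == 0) ncclGetUniqueId(&id);
  MPI_Bcast(&id, sizeof(id), MPI_BYTE, 0, MPI_COMM_WORLD);

  ncclComm_t comm;
  ncclCommInitRank(&comm, nranks, id, rank);   // collective: every rank must call it

  // ... ncclAllReduce(sendbuff, recvbuff, count, ncclFloat, ncclSum, comm, stream);
  //     cudaStreamSynchronize(stream); ...

  ncclCommDestroy(comm);
  MPI_Finalize();
  return 0;
}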

Testing nccl with a difficult topology

Dear NCCL team,
First of all, thanks very much for such a nice open-source project.
I just got to know about you through the Parallel Forall blog.
Currently, I'm testing your examples on a small production PC, and I noticed that the topology I'm using is a little bit complex, namely:

[r1bsl@supermicro single]$ nvidia-smi topo --matrix
        GPU0 GPU1 GPU2 GPU3 GPU4 GPU5 CPU Affinity
GPU0     X  PIX SOC SOC SOC SOC 0-7,16-23
GPU1    PIX  X  SOC SOC SOC SOC 0-7,16-23
GPU2    SOC SOC  X  PIX PHB PHB 8-15,24-31
GPU3    SOC SOC PIX  X  PHB PHB 8-15,24-31
GPU4    SOC SOC PHB PHB  X  PIX 8-15,24-31
GPU5    SOC SOC PHB PHB PIX  X  8-15,24-31


Legend:
  X   = Self
  SOC = Path traverses a socket-level link (e.g. QPI)
  PHB = Path traverses a PCIe host bridge
  PXB = Path traverses multiple PCIe internal switches
  PIX = Path traverses a PCIe internal switch

As you may see, I'm working with K80-type GPUs in this machine.
I've noticed that I have no problem running your tests using one of the internal GPUs, e.g.:

[r1bsl@supermicro single]$ ./all_gather_test 10000000 3 1 3 5 
# Using devices
#   Rank  0 uses device  1 [0x06] Tesla K80
#   Rank  1 uses device  3 [0x85] Tesla K80
#   Rank  2 uses device  5 [0x89] Tesla K80

#      bytes             N    type     time  algbw  busbw    delta
    10000000      10000000    char    5.247   3.81   3.81    0e+00
    10000000       2500000     int    4.872   4.11   4.11    0e+00
    10000000       5000000    half    4.802   4.16   4.16    0e+00
    10000000       2500000   float    4.816   4.15   4.15    0e+00
    10000000       1250000  double    4.793   4.17   4.17    0e+00
    10000000       1250000   int64    4.766   4.20   4.20    0e+00
    10000000       1250000  uint64    4.731   4.23   4.23    0e+00

However, if I want to run the test using both internal GPUs of a single K80 card, I get into trouble:

[r1bsl@supermicro single]$ ./all_gather_test 100000 2 2 3
# Using devices
#   Rank  0 uses device  2 [0x84] Tesla K80
#   Rank  1 uses device  3 [0x85] Tesla K80

#      bytes             N    type     time  algbw  busbw    delta
[code stalls]
^C

The execution stalls and I have no option but to kill it.
My question is: can NCCL handle such a complex topology? And if so, how can I modify the examples so that I can run them with all 6 of my GPUs?

Undefined identifiers in all_reduce.cu

Hello!

I'm trying to compile nccl, but I'm getting the following errors:

Compiling src/all_reduce.cu         > build/obj/all_reduce.o   
src/reduce_kernel.h(199): error: identifier "__half22float2" is undefined

src/reduce_kernel.h(203): error: identifier "__float22half2_rn" is undefined

src/reduce_kernel.h(214): error: identifier "__half22float2" is undefined

src/reduce_kernel.h(218): error: identifier "__float22half2_rn" is undefined

src/reduce_kernel.h(229): error: identifier "__half22float2" is undefined

src/reduce_kernel.h(233): error: identifier "__float22half2_rn" is undefined

src/reduce_kernel.h(248): error: identifier "__half22float2" is undefined

src/reduce_kernel.h(252): error: identifier "__float22half2_rn" is undefined

8 errors detected in the compilation of "/tmp/tmpxft_000004bb_00000000-13_all_reduce.compute_52.cpp1.ii".
make: *** [build/obj/all_reduce.o] Error 2

I have a TITAN X and cuda-7.5 installed. I ran make CUDA_HOME=/usr/local/cuda-7.5 test to build the library.

Do you have any idea why it fails? I've seen that these identifiers are defined in /usr/local/cuda-7.5/include/cuda_fp16.h, but that header is not included by reduce_kernel.h. Also, they are guarded by a check for __CUDA_ARCH__ >= 530, while my GPU has compute capability 5.2. Since the TITAN X is a Maxwell card, it should be supported, right?
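
For context, the guard in question typically looks like the illustrative kernel below (this is not NCCL's actual source). Because sm_52 falls below the 530 threshold, the half2 intrinsics are simply not declared for that target.

// Illustrative only: the half2 conversion intrinsics in cuda_fp16.h are
// declared for __CUDA_ARCH__ >= 530, so a kernel compiled for sm_52
// cannot reference them.
#include <cuda_fp16.h>

__global__ void half2_demo(const __half2* in, float2* out, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i >= n) return;
#if defined(__CUDA_ARCH__) && __CUDA_ARCH__ >= 530
  out[i] = __half22float2(in[i]);   // available on sm_53 and newer
#else
  out[i] = make_float2(0.f, 0.f);   // no packed-half support below sm_53
#endif
}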

MPI test Segmentation fault

Dear NCCL team:
I had no problem compiling against the OpenMPI 1.10.2 libraries and CUDA toolkit 7.5 to run your example on my workstation. The PC I'm using has the following topology:

[manuel@nhri single]$ nvidia-smi topo --matrix
GPU0 GPU1 CPU Affinity
GPU0 X PHB 0-7
GPU1 PHB X 0-7

Legend:

X = Self
SOC = Path traverses a socket-level link (e.g. QPI)
PHB = Path traverses a PCIe host bridge
PXB = Path traverses multiple PCIe internal switches
PIX = Path traverses a PCIe internal switch

I'm testing your MPI example but I'm running into Segmentation fault errors, namely:

[manuel@nhri mpi]$ ~/openMPI/bin/mpirun -np 1 mpi_test
[nhri:08445] *** Process received signal ***
[nhri:08445] Signal: Segmentation fault (11)
[nhri:08445] Signal code: Address not mapped (1)
[nhri:08445] Failing at address: (nil)
[nhri:08445] [ 0] /lib64/libpthread.so.0(+0xf100)[0x7f1b82419100]
[nhri:08445] [ 1] /lib64/libc.so.6(+0x3a167)[0x7f1b8165f167]
[nhri:08445] [ 2] mpi_test[0x40151d]
[nhri:08445] [ 3] /lib64/libc.so.6(__libc_start_main+0xf5)[0x7f1b81646b15]
[nhri:08445] [ 4] mpi_test[0x401c09]
[nhri:08445] *** End of error message ***
--------------------------------------------------------------------------
mpirun noticed that process rank 0 with PID 8445 on node nhri exited on signal 11 (Segmentation fault).
--------------------------------------------------------------------------

All my machines use CentOS 7.0 and my CUDA toolkit is 7.5. I know this is a very naive question, but is there any preferred machine configuration for running your examples?

make failed

Hi all,
I get an error and cannot build NCCL:

user@user-ProLiant-DL380-Gen9:~/nccl$ make CUDA_HOME=</usr/local/cuda-7.5> test
-bash: test: Is a directory
user@user-ProLiant-DL380-Gen9:~/nccl$ ls
debian  fortran  LICENSE.txt  Makefile  Makefile~  README.md  src  test  

test is an existing directory; I don't know what my mistake is.

Expose Ring Communications

This is an excellent and necessary library. My understanding is that each collective communication is implemented via ring communications. If this is the case, a large class of problems (e.g. halo communications) could benefit greatly from exposing the collective ring communication as another primitive.

I imagine this could look similar to MPI's virtual topology:
https://computing.llnl.gov/tutorials/mpi/#Virtual_Topologies
where the ncclComm (or a wrapper-like object) would be exposed as a ring_communicator that could be passed to ring_rank, ring_coord, ring_shift, send, recv, and sendrecv-like functions.

I was going to take a quick crack at this, but thought I would get some feedback from the experts first.
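
To make the proposal concrete, a purely hypothetical wrapper API could look like the sketch below; none of these functions or types exist in NCCL, and the names are invented here only to illustrate the idea.

/* Hypothetical ring primitives mirroring MPI's virtual-topology idea:
 * query your position on the ring, then shift data to a neighbor. */
#include <nccl.h>

typedef struct ncclRingComm* ncclRingComm_t;

ncclResult_t ncclRingCommCreate(ncclComm_t comm, ncclRingComm_t* ring);
ncclResult_t ncclRingRank(ncclRingComm_t ring, int* rank);        /* position on the ring */
ncclResult_t ncclRingShift(ncclRingComm_t ring,
                           const void* sendbuff, void* recvbuff,  /* send to rank+1, receive from rank-1 */
                           size_t count, ncclDataType_t type,
                           cudaStream_t stream);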

Cannot install nccl

I was trying to install nccl on my PC:
Ubuntu 16.04 LTS
GTX 1080
CUDA 8.0
CUDNN 5.1
but got errors:

minedl@minedl-machine:~/nccl$ make install
Compiling src/core.cu > /home/minedl/nccl/build/obj/core.o
src/common_coll.h(96): error: class "ncclComm" has no member "buffSizePerRing"

src/common_coll.h(100): error: class "ncclMem" has no member "doneCount"

src/common_coll.h(105): error: class "ncclComm" has no member "nRings"

3 errors detected in the compilation of "/tmp/tmpxft_000037f6_00000000-5_core.cpp4.ii".
Makefile:109: recipe for target '/home/minedl/nccl/build/obj/core.o' failed
make: *** [/home/minedl/nccl/build/obj/core.o] Error 2

Has anyone encountered the same error and solved it?

CuRAND Error on Tesla M2070 + CUDA 7.0

I am facing an issue while installing 'nccl' for Tesla M2070 + CUDA library 7.0.

 -> % ./build/test/single/all_gather_test 10000000
 # Using devices
 #   Rank  0 uses device  0 [0x06] Tesla M2070
 #   Rank  1 uses device  1 [0x14] Tesla M2070
 #   Rank  2 uses device  2 [0x11] Tesla M2070

 #      bytes             N    type     time  algbw  busbw    delta
 CuRAND error 204 at test/include/test_utilities.h:112

NCCL compiled without any errors, but when the test is run, a cuRAND error occurs.
What could be the reason, and how can I fix it?

Thanks in advance!

puzzled by the latency number

Hi All,
I was having a difficult time understanding the posted latency of around 1.6 usec for a 10000000-byte allreduce, which gives a bandwidth of ~6 GB/sec. Where do all the other overheads, such as kernel launch and CUDA synchronization, go? These easily amount to more than 10 usec. I am missing something here; please help me understand.

Thanks.

scatter support

I noticed that there is only ncclReduceScatter, but no ncclScatter (it is commented out in the header file).

Why is that?

Troubleshooting help?

I followed the instructions from the readme, but I can't get the tests to run. Is there any additional advice someone can give me?

# make CUDA_HOME=/usr/local/cuda test
Compiling src/libwrap.cu            > build/obj/libwrap.o
Compiling src/core.cu               > build/obj/core.o
Compiling src/all_gather.cu         > build/obj/all_gather.o
Compiling src/all_reduce.cu         > build/obj/all_reduce.o
Compiling src/broadcast.cu          > build/obj/broadcast.o
Compiling src/reduce.cu             > build/obj/reduce.o
Compiling src/reduce_scatter.cu     > build/obj/reduce_scatter.o
Linking   build/lib/libnccl.so.1.2.2
Grabbing  src/nccl.h                > build/include/nccl.h
Building  test/single/all_gather_test.cu > build/test/single/all_gather_test
Building  test/single/all_reduce_test.cu > build/test/single/all_reduce_test
Building  test/single/broadcast_test.cu > build/test/single/broadcast_test
Building  test/single/reduce_test.cu > build/test/single/reduce_test
Building  test/single/reduce_scatter_test.cu > build/test/single/reduce_scatter_test
# export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:./build/lib
# ./build/test/single/all_reduce_test
Error: must specify at least data size in bytes!

Tests nccl AllReduce with user supplied arguments.
    Usage: all_reduce_test <data size in bytes> [number of GPUs] [GPU 0] [GPU 1] ...

# ./build/test/single/all_reduce_test 10000000
NCCL failure test/single/all_reduce_test.cu:259 'unhandled cuda error'
# nvidia-smi
Tue Jun  7 18:35:23 2016
+------------------------------------------------------+
| NVIDIA-SMI 361.42     Driver Version: 361.42         |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX TIT...  Off  | 0000:02:00.0     Off |                  N/A |
| 22%   35C    P8    15W / 250W |     23MiB / 12287MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  GeForce GTX TIT...  Off  | 0000:04:00.0     Off |                  N/A |
| 22%   34C    P8    14W / 250W |     23MiB / 12287MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   2  GeForce GTX TIT...  Off  | 0000:83:00.0     Off |                  N/A |
| 22%   34C    P8    16W / 250W |     23MiB / 12287MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   3  GeForce GTX TIT...  Off  | 0000:84:00.0     Off |                  N/A |
| 22%   32C    P8    15W / 250W |     23MiB / 12287MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID  Type  Process name                               Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

Issue on K10 board

Hi NCCL team,
I downloaded the NCCL library and compiled the whole suite (lib + sample code) under CUDA 7.5.
Then I tried one of your sample runs (./build/test/single/all_reduce_test 10000000)
on my in-house hardware setup (K10/K80).

In the K10 run, I get this:
CuRAND error 204 at test/include/test_utilities.h:111
triggered by a Randomize() call in the sample code.
I can work around it simply by zeroing the buffer instead of randomizing it,
but then strange elapsed-time and bandwidth estimates are printed on screen, like this:

$ ./build/test/single/all_reduce_test 10000000
# Using devices
#   Rank  0 uses device  0 [0x26] Tesla K10.G1.8GB
#   Rank  1 uses device  1 [0x27] Tesla K10.G1.8GB
#   Rank  2 uses device  2 [0x2a] Tesla K10.G1.8GB
#   Rank  3 uses device  3 [0x2b] Tesla K10.G1.8GB

#                                                 out-of-place                    in-place
#      bytes             N    type      op     time  algbw  busbw      res     time  algbw  busbw      res
    10000000      10000000    char     sum    0.009  1153.20  1729.81   2e-316    0.020  493.75  740.63   7e-310
    10000000      10000000    char    prod    0.008  1239.88  1859.82   7e-310    0.020  510.67  766.01   7e-310
    10000000      10000000    char     max    0.008  1221.47  1832.21   7e-310    0.020  510.46  765.70   7e-310
    10000000      10000000    char     min    0.008  1244.91  1867.37   7e-310    0.020  511.95  767.93   7e-310

In the case of the K80 (with Randomize replaced by memset to 0), it seems fine:

$ ./build/test/single/all_reduce_test 10000000
# Using devices
#   Rank  0 uses device  0 [0x0c] Tesla K80
#   Rank  1 uses device  1 [0x0d] Tesla K80
#   Rank  2 uses device  2 [0x10] Tesla K80
#   Rank  3 uses device  3 [0x11] Tesla K80
#   Rank  4 uses device  4 [0x14] Tesla K80
#   Rank  5 uses device  5 [0x15] Tesla K80
#   Rank  6 uses device  6 [0x18] Tesla K80
#   Rank  7 uses device  7 [0x19] Tesla K80

#                                                 out-of-place                    in-place
#      bytes             N    type      op     time  algbw  busbw      res     time  algbw  busbw      res
    10000000      10000000    char     sum    2.887   3.46   6.06    0e+00    2.895   3.45   6.05    0e+00
    10000000      10000000    char    prod    2.547   3.93   6.87    0e+00    2.583   3.87   6.77    0e+00
    10000000      10000000    char     max    2.151   4.65   8.14    0e+00    2.166   4.62   8.08    0e+00
    10000000      10000000    char     min    1.966   5.09   8.90    0e+00    1.994   5.02   8.78    0e+00

What is going on with the K10? It seems to me that it is covered by the NCCL requirements.
In any case, could you support the K10? Even though we're moving toward newer GPU architectures, my lab is still largely populated with K10s.

Thanks,
Franco

Edited for legibility by @lukeyeager

Can nccl be used with multiple processes per GPU?

The header file suggests that the number of ranks (or tasks) must be less than (or equal to) the number of devices. However, it would be convenient to have, say, two processes training their own copies of a neural net on the same GPU and then using the reduce and bcast functionality to transfer data between the models during an update. Specifically, using reduce to sum all the gradient parameters onto the master nnet, and then, after updating the master nnet's parameters, using bcast to send the updated parameters to the slave nnets. Is this already possible, or do I need to wait for an enhancement?

NCCL allreduce hangs when cudaFreeHost

Hi Nickel team,

I have introduced your library into my application. The integration was done in a multi-threaded scenario. Each thread uses allreduce, and the allreduce is called inside a loop.
The first part of the loop body computes intermediate data, and at the end
of the loop body I call allreduce.

It works perfectly, but from time to time I run into a deadlock. Attaching to the process with gdb,
I can see that (N-1) threads are inside cudaStreamSynchronize() (each allreduce has its own
custom CUDA stream) while 1 thread is inside cuMemFreeHost() (I use a GPU malloc-based allocator for GPU and CPU memory).
What happens is that during the first part of the loop body one thread needs to reallocate
some memory before doing its processing, while the other (N-1) threads finish their own processing
and enter the Nickel allreduce.
From time to time this creates a deadlock. My guess is that there is some timing condition under which the threads' actions produce the deadlock.
It is not deterministic: the need to reallocate occurs deterministically after some iterations, but it does not always produce a deadlock.

Could you help me in some way?

It is not clear to me whether this is a CUDA issue, a Nickel/CUDA bug, or a CUDA limitation.
Does any memory-management action (alloc/free of CPU/GPU memory) require the GPUs to be idle?

I use Nickel allreduce with GPU-based synchronization only; no CPU-based barrier is introduced before entering allreduce().
Do I need to add a CPU-based barrier? Is there any safe C/C++ code I could use for that? (See the sketch after the gdb output below.)

Thanks a lot,
Franco

Here are some details from gdb:
(gdb) where
#0 0x00007fffc6bffa11 in clock_gettime ()
#1 0x0000003ab7a03e46 in clock_gettime () from /lib64/librt.so.1
#2 0x00007fc415a821de in ?? () from /usr/lib64/libcuda.so.1
#3 0x00007fc4154377ab in ?? () from /usr/lib64/libcuda.so.1
#4 0x00007fc41538ffde in ?? () from /usr/lib64/libcuda.so.1
#5 0x00007fc415412916 in ?? () from /usr/lib64/libcuda.so.1
#6 0x00007fc415412fa8 in ?? () from /usr/lib64/libcuda.so.1
#7 0x00007fc4153793fc in ?? () from /usr/lib64/libcuda.so.1
#8 0x00007fc415347392 in cuMemFreeHost () from /usr/lib64/libcuda.so.1
#9 0x00007fc41ac6284d in ?? () from /usr/local/cuda-7.5//lib64/libcudart.so.7.5
#10 0x00007fc41ac4782c in ?? () from /usr/local/cuda-7.5//lib64/libcudart.so.7.5

(gdb) where
#0 0x00007fffc6bffa11 in clock_gettime ()
#1 0x0000003ab7a03e46 in clock_gettime () from /lib64/librt.so.1
#2 0x00007fc415a821de in ?? () from /usr/lib64/libcuda.so.1
#3 0x00007fc4154377ab in ?? () from /usr/lib64/libcuda.so.1
#4 0x00007fc415414e33 in ?? () from /usr/lib64/libcuda.so.1
#5 0x00007fc415414f89 in ?? () from /usr/lib64/libcuda.so.1
#6 0x00007fc415388c87 in ?? () from /usr/lib64/libcuda.so.1
#7 0x00007fc4153610c2 in cuStreamSynchronize () from /usr/lib64/libcuda.so.1
#8 0x00007fc41ac40d90 in ?? () from /usr/local/cuda-7.5//lib64/libcudart.so.7.5
#9 0x00007fc41ac781fd in cudaStreamSynchronize () from /usr/local/cuda-7.5//lib64/libcudart.so.7.5
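
Regarding the barrier question above: a CPU-side barrier could be sketched with pthreads as below. This is only an assumption about what might help; whether it actually avoids the deadlock described here has not been verified.

/* Sketch: make every thread pass a barrier after its allocation phase and
 * before launching the allreduce, so no thread is inside NCCL while another
 * is still calling cudaMalloc/cudaFreeHost. */
#include <pthread.h>
#include <nccl.h>
#include <cuda_runtime.h>

static pthread_barrier_t barrier;   /* pthread_barrier_init(&barrier, NULL, nThreads) once at startup */

void iteration(const float* send, float* recv, size_t count,
               ncclComm_t comm, cudaStream_t stream) {
  /* ... per-thread compute, including any reallocation, happens here ... */

  pthread_barrier_wait(&barrier);   /* everyone is past the allocation phase */

  ncclAllReduce(send, recv, count, ncclFloat, ncclSum, comm, stream);
  cudaStreamSynchronize(stream);
}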

double gtx680 cuda8.0 ubuntu14.04LTS

I followed the instructions from the readme, but I can't get the tests to run.
Can someone help me?

My steps:
1. git clone https://github.com/NVIDIA/nccl.git
2. cd nccl
3. sudo make CUDA_HOME=/usr/local/cuda-8.0 test
4. sudo make install
5. export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:./build/lib

and then:
sudo ./build/test/single/all_reduce_test 10000000

the output is:
./build/test/single/all_reduce_test: error while loading shared libraries: libnccl.so.1: cannot open shared object file: No such file or directory

but:
zengraoli@zengraoli-desktop:/dl/nccl$ ll /usr/local/lib/libnccl.so*
lrwxrwxrwx 1 root root 12 Mar  4 18:51 /usr/local/lib/libnccl.so -> libnccl.so.1*
lrwxrwxrwx 1 root root 16 Mar  4 18:51 /usr/local/lib/libnccl.so.1 -> libnccl.so.1.3.3*
-rwxr-xr-x 1 root root 23126897 Mar  4 18:51 /usr/local/lib/libnccl.so.1.3.3*
zengraoli@zengraoli-desktop:/dl/nccl$

Mystery hangs with nccl, CUDA 8rc, and Torch

I have attached a script that used to work and now does not (parallel-nccl.lua); I have also attached an ancillary file (nccl_ffi.lua) which serves as a Torch/Lua wrapper for nccl.h and is needed to run parallel-nccl.lua

The runtime environment is a Linux laptop, Ubuntu 16.04, 2 x GTX 970M cards, driver version 367.35 (the current release level for the GTX 970M's). Cuda Toolkit v. 8rc.
I am using the Torch development environment.

The script tests the basic nccl operations. Instead of MPI, it uses a functionally similar package in Torch called "parallel", which is a multi-process harness, not to be confused with other Torch packages which also use the word "parallel" in their names.

The script hangs during the "reduce-out-of-place" test when running 2 workers per GPU.
I have seen similar behavior when training a neural net using the "AllReduce" function to consolidate gradient parameters between processes (each process trains a clone of a network), the code hangs at the first subsequent call to inspect/modify a (Torch) tensor.

Stopping the X Windows Server (sudo service lightdm stop) does not help.

Any insights would be much appreciated.

Save the attached files into the same folder, run the test script using "th parallel-nccl.lua".
The script hangs (for me, at any rate) during the reduce-out-of-place code block towards the beginning. Note that the script comprises a "parent" process and a "worker" process. The code for the worker is above the code for the parent.

parallel-nccl.lua.txt
nccl_ffi.lua.txt

Is NCCL available in CUDA 8.0?

Not really an issue.
Looking ahead to a move to CUDA 8.0 and the Pascal (PCIe/NVLink) architecture, is Nickel already available in CUDA 8.0?
Do I need to download it from the GitHub site?
Is Nickel already optimized for NVLink?

Thanks,
Franco

NCCL not found when building NV Caffe

Hi, I am new to DIGITS and Linux, so please forgive me if this issue is too naive.
I built NV Caffe from source; my system has 4 Tesla K40c cards.
I built the NCCL library according to the guidelines on this page, but when I run "cmake .." for Caffe,
the system tells me that NCCL was not found. I think it is a problem with the environment
variable setup. Could you please tell me how to fix it? Thanks.

all_reduce_test stops

I used CentOS 7.0 and CUDA 7.5 on a server with 6 Tesla cards; the test stops and becomes unresponsive when running ./all_reduce_test 10000000 from the single folder.

My GPU topo is as below

CPU 0 -- GPU0
-- GPU1
-- GPU2
CPU 1 -- GPU3
-- GPU4
-- GPU5
Even when I ran ./all_reduce_test 2 0 1, it still didn't run.
Do I need to install MPI even if I only use the tests in the single folder? Is the single test valid for a multi-CPU topology like the one above?
I checked ACSCtl; all entries are negative. I don't know what else I can do.

CUDA_VERSION instead of __CUDACC_VER_{MAJOR,MINOR}__?

Just out of curiosity, is there a reason we infer CUDA_VERSION from libcudart.so instead of using __CUDACC_VER_MAJOR__ and __CUDACC_VER_MINOR__ as defined by nvcc? I am curious mainly because having CUDA_VERSION figured out via a shell script makes it rather hard to compile in a separate environment without manually feeding in these macros.
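
For illustration, the compile-time alternative could look like the sketch below; the macro name NCCL_CUDA_VERSION_GUESS is invented for this example and is not part of the build system.

/* Derive a CUDA version at compile time from the macros nvcc defines
 * (available since CUDA 7.5), instead of probing libcudart.so in a shell script. */
#include <cuda.h>   /* provides CUDA_VERSION as a fallback */

#if defined(__CUDACC_VER_MAJOR__) && defined(__CUDACC_VER_MINOR__)
#define NCCL_CUDA_VERSION_GUESS (__CUDACC_VER_MAJOR__ * 1000 + __CUDACC_VER_MINOR__ * 10)
#else
#define NCCL_CUDA_VERSION_GUESS CUDA_VERSION
#endif

#if NCCL_CUDA_VERSION_GUESS >= 8000
/* code paths that need CUDA 8.0 features */
#endif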

How can I scale with nccl adoption?

Hi,

I have a single-GPU application, which offloads data in segments, that I am now trying to extend to work on multiple GPUs, and I would like to let it scale both horizontally and vertically.

Q1: I am considering adopting NCCL for single-node multi-GPU task distribution. I think this is the way to go at the moment. Is this where NCCL will help me?

Q2: When it comes to distribution across multi-GPU clusters (for example, I have 2x 8-GPU systems), is NCCL also applicable there, or is it better to look into frameworks like rCUDA or MPI?

Q3: Does NCCL aspire to become a generic layer to which I could potentially connect any system with any number of GPUs and distribute tasks?

Thanks for suggestions/answers.

Ladislav

Test files absent

The documentation specifies

"Test binaries are located in the subdirectories nccl/build/test and nccl/build/mpitest."
and
"./build/test/all_reduce_test"

However, when I build the library, the mpitest directory is absent and all_reduce_test is located in the ./build/test/single/ directory.

Am I missing something here?

Windows platform support

As more and more libraries are built on this project, support for the Windows platform is necessary. PR #31 already makes such an effort, but it lags behind the newest version. I think official support for the Windows platform should be added in the future.

NCCL All reduce on M40

Hello,

I tried running test/single/all_reduce_test on M40 nodes and the test just hangs. The same test works fine on TitanX nodes. I'm running the test on 2 GPUs with CUDA 7.5. The driver version for the TitanX node is 352.79. Here is the output from the TitanX nodes:

~/nccl$ ./build/test/single/all_reduce_test 100
# Using devices
#   Rank  0 uses device  0 [0x04] GeForce GTX TITAN X
#   Rank  1 uses device  1 [0x05] GeForce GTX TITAN X

#                                                 out-of-place                    in-place
#      bytes             N    type      op     time  algbw  busbw      res     time  algbw  busbw      res
         100           100    char     sum    0.038   0.00   0.00    0e+00    0.046   0.00   0.00    0e+00
         100           100    char    prod    0.032   0.00   0.00    0e+00    0.046   0.00   0.00    0e+00
         100           100    char     max    0.032   0.00   0.00    0e+00    0.045   0.00   0.00    0e+00
         100           100    char     min    0.037   0.00   0.00    0e+00    0.044   0.00   0.00    0e+00
         100            25     int     sum    0.033   0.00   0.00    0e+00    0.045   0.00   0.00    0e+00
         100            25     int    prod    0.032   0.00   0.00    0e+00    0.044   0.00   0.00    0e+00
         100            25     int     max    0.032   0.00   0.00    0e+00    0.044   0.00   0.00    0e+00
         100            25     int     min    0.032   0.00   0.00    0e+00    0.043   0.00   0.00    0e+00
         100            50    half     sum    0.033   0.00   0.00    0e+00    0.044   0.00   0.00    0e+00
         100            50    half    prod    0.048   0.00   0.00    0e+00    0.069   0.00   0.00    0e+00
         100            50    half     max    0.035   0.00   0.00    0e+00    0.053   0.00   0.00    0e+00
         100            50    half     min    0.036   0.00   0.00    0e+00    0.048   0.00   0.00    0e+00
         100            25   float     sum    0.034   0.00   0.00    0e+00    0.052   0.00   0.00    0e+00
         100            25   float    prod    0.035   0.00   0.00    0e+00    0.049   0.00   0.00    0e+00
         100            25   float     max    0.035   0.00   0.00    0e+00    0.050   0.00   0.00    0e+00
         100            25   float     min    0.036   0.00   0.00    0e+00    0.050   0.00   0.00    0e+00
          96            12  double     sum    0.033   0.00   0.00    0e+00    0.049   0.00   0.00    0e+00
          96            12  double    prod    0.034   0.00   0.00    0e+00    0.050   0.00   0.00    0e+00
          96            12  double     max    0.033   0.00   0.00    0e+00    0.049   0.00   0.00    0e+00
          96            12  double     min    0.046   0.00   0.00    0e+00    0.049   0.00   0.00    0e+00
          96            12   int64     sum    0.034   0.00   0.00    0e+00    0.049   0.00   0.00    0e+00
          96            12   int64    prod    0.033   0.00   0.00    0e+00    0.049   0.00   0.00    0e+00
          96            12   int64     max    0.034   0.00   0.00    0e+00    0.052   0.00   0.00    0e+00
          96            12   int64     min    0.034   0.00   0.00    0e+00    0.048   0.00   0.00    0e+00
          96            12  uint64     sum    0.046   0.00   0.00    0e+00    0.049   0.00   0.00    0e+00
          96            12  uint64    prod    0.033   0.00   0.00    0e+00    0.049   0.00   0.00    0e+00
          96            12  uint64     max    0.034   0.00   0.00    0e+00    0.052   0.00   0.00    0e+00
          96            12  uint64     min    0.034   0.00   0.00    0e+00    0.053   0.00   0.00    0e+00

I tried running the same binary on M40 nodes with drivers 352.79 and 352.93. However, the test just stalls:

~/nccl$ ./build/test/single/all_reduce_test 100
# Using devices
#   Rank  0 uses device  0 [0x04] Tesla M40
#   Rank  1 uses device  1 [0x05] Tesla M40

#                                                 out-of-place                    in-place
#      bytes             N    type      op     time  algbw  busbw      res     time  algbw  busbw      res

Can you advise on this problem? Have you tried running these tests on M40s?

NCCL fails to build on mac

I am building the master version (at commit 2a974f5), with Mac OS 10.12 and got the following error:

jiayq-mbp:nccl jiayq$ make
ls: /usr/local/cuda/lib64/libcudart.so.: No such file or directory
ls: /usr/local/cuda/lib64/libcudart.so.
: No such file or directory
ls: /usr/local/cuda/lib64/libcudart.so.: No such file or directory
ls: /usr/local/cuda/lib64/libcudart.so.
: No such file or directory
Grabbing src/nccl.h > /Users/jiayq/Research/nccl/build/include/nccl.h
ls: /usr/local/cuda/lib64/libcudart.so.: No such file or directory
ls: /usr/local/cuda/lib64/libcudart.so.
: No such file or directory
Compiling src/libwrap.cu > /Users/jiayq/Research/nccl/build/obj/libwrap.o
ls: /usr/local/cuda/lib64/libcudart.so.: No such file or directory
ls: /usr/local/cuda/lib64/libcudart.so.
: No such file or directory
Compiling src/core.cu > /Users/jiayq/Research/nccl/build/obj/core.o
src/core.cu(717): error: expected an expression

src/core.cu(717): error: expected an expression

2 errors detected in the compilation of "/var/folders/4x/jpsdl58x643dsgw7tbq1zs5clp6v5p/T//tmpxft_00013b6e_00000000-11_core.compute_52.cpp1.ii".
make: *** [/Users/jiayq/Research/nccl/build/obj/core.o] Error 2

@slayton58 recommended opening an issue; happy to provide more details :)

My build environment is nvcc 8.0.54 and Apple LLVM version 8.0.0 (clang-800.0.42.1).

Segmentation fault with cuda-7.5

I build nccl with cuda-7.5:

make CUDA_HOME=/usr/local/cuda-7.5 test

And run test with the following command:

export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:./build/lib
./build/test/all_reduce_test

causes a segmentation fault:

# Using devices
Segmentation fault

But all tests run smoothly if I build nccl with cuda-7.0.
Is the current version of nccl not compatible with cuda-7.5?

Compiling NCCL on Fedora 24

Hello everyone!

I tried compiling NCCL from source on Fedora 24 with GCC 6.1 and CUDA 8.0 on a server with 2 GPUs (a single Tesla K80 card). However, I ran into the following error while trying to compile:

/usr/local/cuda/include/cuda_fp16.h(2970): error: more than one instance of overloaded function "isinf" matches the argument list

I then tried adding the -std=c++98 flag to the CXXFLAGS in the Makefile, and it progressed further, albeit with the following warnings:

cc1: warning: command line option '-std=c++98' is valid for C++/ObjC++ but not for C

The error it threw this time was:

error: #error This file requires compiler and library support for the ISO C++ 2011 standard. This support must be enabled with the -std=c++11 or -std=gnu++11 compiler options

So if I understand correctly, if I try to compile with GCC 6 without any flags, it throws the overloaded error, but if I try to compile with an older GCC standard, it can't compile because it needs the newer GCC functions.

Do you maybe have any suggestions or am I out of luck trying to get NCCL working in this environment? Is there perhaps any alternative to NCCL for multi-GPU usage for Deep Learning (specifically, Caffe)?

Thank you very much! :)

How to use nccl?

When I use NCCL with the cifar10 example, I find the time does not change. How should I use it?
Using $TOOLS/caffe train --solver=examples/cifar10/cifar10_full_solver.prototxt -gpu 0
I0302 03:19:23.473100 8238 caffe.cpp:197] Using GPUs 0
I0302 03:19:23.473930 8238 caffe.cpp:202] GPU 0: Tesla P100-SXM2-16GB
I0302 03:19:23.825523 8238 solver.cpp:48] Initializing solver from parameters:
I0302 03:28:01.841064 8238 solver.cpp:362] Iteration 55000, Testing net (#0)
I0302 03:28:02.094408 8238 solver.cpp:429] Test net output #0: accuracy = 0.7896
I0302 03:28:02.094431 8238 solver.cpp:429] Test net output #1: loss = 0.620957 (* 1 = 0.620957 loss)
I0302 03:28:02.104485 8238 solver.cpp:242] Iteration 55000 (102.234 iter/s, 1.9563s/200 iter), loss = 0.370225
I0302 03:28:02.104549 8238 solver.cpp:261] Train net output #0: loss = 0.370225 (* 1 = 0.370225 loss)

Using $TOOLS/caffe train --solver=examples/cifar10/cifar10_full_solver.prototxt -gpu all, there is little change in the time!
I0302 03:31:23.499303 8361 solver.cpp:362] Iteration 5000, Testing net (#0)
I0302 03:31:23.723893 8361 solver.cpp:429] Test net output #0: accuracy = 0.6884
I0302 03:31:23.723942 8361 solver.cpp:429] Test net output #1: loss = 0.890609 (* 1 = 0.890609 loss)
I0302 03:31:23.733755 8361 solver.cpp:242] Iteration 5000 (99.2158 iter/s, 2.01581s/200 iter), loss = 0.571789
I0302 03:31:23.733794 8361 solver.cpp:261] Train net output #0: loss = 0.571788 (* 1 = 0.571788 loss)

Running NCCL mpi test across multiple nodes

Hi,

I've built and run the mpi_test on 1 node with 8 TitanX gpus successfully. I use srun to launch the mpi test and it passes. However, the test fails when run across 2 nodes with 8 TitanX gpus per node. I use the following command line:

srun -N2 -n16 --gres=gpu:8 -p TitanXx8 build/test/mpi/mpi_test 0 1 2 3 4 5 6 7 0 1 2 3 4 5 6 7

The test fails with the following error:

WARN src/core.cu:225 failed to allocate 2101248 byte device buffer
WARN src/core.cu:596 rank 12 failed to allocate device buffer
WARN src/core.cu:683 rank 12 failed to allocate communicator
NCCL Init failed (10) 'cuda malloc failed'

Does NCCL run across multiple nodes?

gcc version

Which gcc version is needed by NCCL?
/usr/bin/make64 MAC=64 CUDA_HOME=/home/work/cuda-7.5/ test
Compiling src/libwrap.cu > build/obj/libwrap.o
nvcc warning : The -c++11 flag is not supported with the configured host compiler. Flag will be ignored.
src/core.h(47): error: expected an identifier

src/core.h(61): warning: parsing restarts here after previous syntax error

src/core.h(111): error: DevRing is not a template

2 errors detected in the compilation of "/tmp/tmpxft_00004dea_00000000-13_libwrap.compute_52.cpp1.ii".
make64: *** [build/obj/libwrap.o] Error 2

When I compile this project, the error above occurs.

Question on HF support

Hi,

I included the Nickel library in my tool and I'm up and running.
So far I have used ncclAllReduce with float32 buffers and ncclSum reduction.
Is there an easy way to keep the API working with float32 buffers but have the transfer and the addition
performed with FP16 (half-precision) floats? (The sum could even be emulated in float32, with the transfer at FP16; see the sketch below.)

Thanks,
franco
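
One possible approach (a sketch, not an NCCL feature: buffer names are placeholders, and note that the accumulation itself then happens in half precision, which costs accuracy compared to an fp32 sum) is to convert on the device, reduce with ncclHalf, and convert back:

#include <cuda_fp16.h>
#include <nccl.h>

__global__ void f32_to_f16(const float* in, __half* out, size_t n) {
  size_t i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) out[i] = __float2half(in[i]);
}
__global__ void f16_to_f32(const __half* in, float* out, size_t n) {
  size_t i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) out[i] = __half2float(in[i]);
}

void allreduce_as_half(float* fp32buf, __half* fp16scratch, size_t n,
                       ncclComm_t comm, cudaStream_t stream) {
  int threads = 256;
  int blocks = (int)((n + threads - 1) / threads);
  f32_to_f16<<<blocks, threads, 0, stream>>>(fp32buf, fp16scratch, n);
  ncclAllReduce(fp16scratch, fp16scratch, n, ncclHalf, ncclSum, comm, stream);  // in place
  f16_to_f32<<<blocks, threads, 0, stream>>>(fp16scratch, fp32buf, n);
}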

occasional crashes when using more than one comm per GPU

Hi All,

I have noticed crashes when I overload a device with more than one NCCL comm. For example,
below I want to use 6 instances of the comm with only two devices, 0 and 1. I see crashes even with fewer instances, e.g. two instances of the comm per device. Does NCCL assume that only one comm is created per device? If so, this is restrictive.

./build/test/single/broadcast_test 10000000 6 0 0 0 0 0 1
INFO NCCL debug level set to INFO
INFO rank 0 using buffSize = 2097152
INFO rank 0 using device 0 (0000:03:00.0)
INFO rank 1 using buffSize = 2097152
INFO rank 1 using device 0 (0000:03:00.0)
INFO rank 2 using buffSize = 2097152
INFO rank 2 using device 0 (0000:03:00.0)
INFO rank 3 using buffSize = 2097152
INFO rank 3 using device 0 (0000:03:00.0)
INFO rank 4 using buffSize = 2097152
INFO rank 4 using device 0 (0000:03:00.0)
Segmentation fault

Amith

Install error: [nvcc fatal : Value 'gnu++0x' is not defined for option 'std']

Some information:

  1. Red Hat Enterprise Linux Server release 6.6 (Santiago)
  2. cuda7.5
  3. nccl-1.2.3-1-cuda7.5

I installed with the command below:

[@ppk_02 nccl-1.2.3-1-cuda7.5]$ make CUDA_HOME=/usr/local/cuda test
Compiling src/libwrap.cu > build/obj/libwrap.o
nvcc fatal : Value 'gnu++0x' is not defined for option 'std'
make: *** [build/obj/libwrap.o] Error 1

Is there anyone who can help me figure it out? Thanks very much.

Is NCCL thread safe?

Hello,

It has come to my attention that all the examples are single-threaded. I tried a multithreaded example with the following code:
void* GPU(void* threadid)
{
  const int size = 2;
  int tid = *((int*) threadid);
  cudaSetDevice(tid);
  ncclComm_t comm = comms[tid]; // initialized as a file-scope variable in the main thread
  PerThreadData* data = (PerThreadData*) malloc(sizeof(PerThreadData));
  int cudaDev;
  int rank;
  cudaDeviceProp prop;
  ncclCommCuDevice(comm, &cudaDev);
  ncclCommUserRank(comm, &rank);
  cudaGetDeviceProperties(&prop, cudaDev);
  // initialization
  cudaStreamCreate(&(data->stream));
  cudaMalloc(&(data->sendBuff), sizeof(double)*size);
  cudaMalloc(&(data->recvBuff), sizeof(double)*size);
  double temp[2] = {tid+1, tid+1};
  cudaMemcpy(data->sendBuff, temp, sizeof(double)*size, cudaMemcpyHostToDevice);
  cudaMemcpy(data->recvBuff, temp, sizeof(double)*size, cudaMemcpyHostToDevice);
  data->size = size;
  printf("# Rank %2d uses device %2d [0x%02x] %s\n", rank, cudaDev, prop.pciBusID, prop.name);
  printf("Hello World! It's me, thread #%d!\n", tid);
  // destruction
  pthread_exit(NULL);
}

Each thread is responsible for one GPU and runs the code above. This creates a deadlock.

I assume NCCL is not thread-safe in this case. Is that true? Should we only use it from a single thread?
Thank you.

Schedule about Gather and Scatter

I replaced MPI with NCCL in my program, and I'm surprised that it greatly outperforms MPI. Thank you very much for your wonderful work. I'd like to call Gather and Scatter in my program, but these two functions are not implemented yet. Could you put them on the agenda? (A possible workaround is sketched below.)
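
In the meantime, a Gather can be approximated with NCCL's point-to-point calls, as in the sketch below. This assumes an NCCL release with send/receive support (2.7 or newer); it is not an official ncclGather, and the buffer names are placeholders.

#include <nccl.h>

/* Every rank contributes `count` floats; `root` receives them in rank order. */
ncclResult_t gather_floats(const float* sendbuff, float* recvbuff, size_t count,
                           int root, int rank, int nranks,
                           ncclComm_t comm, cudaStream_t stream) {
  ncclGroupStart();
  if (rank == root) {
    for (int r = 0; r < nranks; ++r)
      ncclRecv(recvbuff + (size_t)r * count, count, ncclFloat, r, comm, stream);
  }
  ncclSend(sendbuff, count, ncclFloat, root, comm, stream);
  return ncclGroupEnd();
}

A Scatter can be built the same way by swapping the roles of ncclSend and ncclRecv.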

allreduce and barriers

I'm looking at the NV branch of Caffe with NCCL support. It uses a barrier before doing allreduce. Is it still necessary, or is NCCL tracking data dependencies already?

MPI_test failed using cuda 7.5

Dear NCCL developers,

I got a confusing error message when trying to run mpi_test.

I build mpi_test using $ make CUDA_HOME=/usr/local/cuda MPI_HOME=$WORK_PATH/openmpi mpitest

And then $ export PATH=$PATH:./build/test/mpi/

And when I run the test using $ mpirun -np 2 mpi_test 0 1 (there are 4 GPUs in the machine), I get the following message:

*** stack smashing detected ***: mpi_test terminated
[snake07:16792] *** Process received signal ***
[snake07:16792] Signal: Aborted (6)
[snake07:16792] Signal code: (-6)
[snake07:16792] [ 0] /lib/x86_64-linux-gnu/libc.so.6(+0x36cb0)[0x7f1ffd7c7cb0]
[snake07:16792] [ 1] /lib/x86_64-linux-gnu/libc.so.6(gsignal+0x37)[0x7f1ffd7c7c37]
[snake07:16792] [ 2] /lib/x86_64-linux-gnu/libc.so.6(abort+0x148)[0x7f1ffd7cb028]
[snake07:16792] [ 3] /lib/x86_64-linux-gnu/libc.so.6(+0x732a4)[0x7f1ffd8042a4]
[snake07:16792] [ 4] /lib/x86_64-linux-gnu/libc.so.6(__fortify_fail+0x5c)[0x7f1ffd89bbbc]
[snake07:16792] [ 5] /lib/x86_64-linux-gnu/libc.so.6(__fortify_fail+0x0)[0x7f1ffd89bb60]
[snake07:16792] [ 6] /net/mlfs01/export/users/leywang/openmpi/lib/libopen-pal.so.13(+0x7e315)[0x7f1ffd0a5315]
[snake07:16792] [ 7] /net/mlfs01/export/users/leywang/openmpi/lib/libopen-pal.so.13(opal_hwloc191_hwloc_backends_notify_new_object+0x41)[0x7f1ffd0a0291]
[snake07:16792] [ 8] /net/mlfs01/export/users/leywang/openmpi/lib/libopen-pal.so.13(opal_hwloc191_hwloc_insert_pci_device_list+0x1b5)[0x7f1ffd0a47d5]
[snake07:16792] [ 9] /net/mlfs01/export/users/leywang/openmpi/lib/libopen-pal.so.13(+0x821fc)[0x7f1ffd0a91fc]
[snake07:16792] [10] /net/mlfs01/export/users/leywang/openmpi/lib/libopen-pal.so.13(opal_hwloc191_hwloc_topology_load+0x29d)[0x7f1ffd0c6abd]
[snake07:16792] [11] /net/mlfs01/export/users/leywang/openmpi/lib/libopen-pal.so.13(opal_hwloc_base_get_topology+0xe2)[0x7f1ffd0991d2]
[snake07:16792] [12] /net/mlfs01/export/users/leywang/openmpi/lib/libmpi.so(ompi_mpi_init+0x5dd)[0x7f20046f7b2d]
[snake07:16792] [13] /net/mlfs01/export/users/leywang/openmpi/lib/libmpi.so(MPI_Init+0x16b)[0x7f20047176eb]
[snake07:16792] [14] mpi_test[0x4014e4]
[snake07:16792] [15] /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf5)[0x7f1ffd7b2f45]
[snake07:16792] [16] mpi_test[0x401c77]
[snake07:16792] *** End of error message ***
*** stack smashing detected ***: mpi_test terminated
[snake07:16793] *** Process received signal ***
[snake07:16793] Signal: Aborted (6)
[snake07:16793] Signal code: (-6)
[snake07:16793] [ 0] /lib/x86_64-linux-gnu/libc.so.6(+0x36cb0)[0x7f975342fcb0]
[snake07:16793] [ 1] /lib/x86_64-linux-gnu/libc.so.6(gsignal+0x37)[0x7f975342fc37]
[snake07:16793] [ 2] /lib/x86_64-linux-gnu/libc.so.6(abort+0x148)[0x7f9753433028]
[snake07:16793] [ 3] /lib/x86_64-linux-gnu/libc.so.6(+0x732a4)[0x7f975346c2a4]
[snake07:16793] [ 4] /lib/x86_64-linux-gnu/libc.so.6(__fortify_fail+0x5c)[0x7f9753503bbc]
[snake07:16793] [ 5] /lib/x86_64-linux-gnu/libc.so.6(__fortify_fail+0x0)[0x7f9753503b60]
[snake07:16793] [ 6] /net/mlfs01/export/users/leywang/openmpi/lib/libopen-pal.so.13(+0x7e315)[0x7f9752d0d315]
[snake07:16793] [ 7] /net/mlfs01/export/users/leywang/openmpi/lib/libopen-pal.so.13(opal_hwloc191_hwloc_backends_notify_new_object+0x41)[0x7f9752d08291]
[snake07:16793] [ 8] /net/mlfs01/export/users/leywang/openmpi/lib/libopen-pal.so.13(opal_hwloc191_hwloc_insert_pci_device_list+0x1b5)[0x7f9752d0c7d5]
[snake07:16793] [ 9] /net/mlfs01/export/users/leywang/openmpi/lib/libopen-pal.so.13(+0x821fc)[0x7f9752d111fc]
[snake07:16793] [10] /net/mlfs01/export/users/leywang/openmpi/lib/libopen-pal.so.13(opal_hwloc191_hwloc_topology_load+0x29d)[0x7f9752d2eabd]
[snake07:16793] [11] /net/mlfs01/export/users/leywang/openmpi/lib/libopen-pal.so.13(opal_hwloc_base_get_topology+0xe2)[0x7f9752d011d2]
[snake07:16793] [12] /net/mlfs01/export/users/leywang/openmpi/lib/libmpi.so(ompi_mpi_init+0x5dd)[0x7f975a35fb2d]
[snake07:16793] [13] /net/mlfs01/export/users/leywang/openmpi/lib/libmpi.so(MPI_Init+0x16b)[0x7f975a37f6eb]
[snake07:16793] [14] mpi_test[0x4014e4]
[snake07:16793] [15] /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf5)[0x7f975341af45]
[snake07:16793] [16] mpi_test[0x401c77]
[snake07:16793] *** End of error message ***

I don't get any clue from the error message. Could you help me?

How to use memory allocated by cudaMallocPitch?

I could use this library easily for memory allocated with cudaMalloc, since I know the exact count/datatype I requested. How can I use memory allocated with cudaMallocPitch? (See the sketch below.)

Thanks,
Pranav
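
One way to handle a pitched allocation is sketched below: reduce row by row so the padding bytes at the end of each row are skipped. The function name and parameters are illustrative, and the grouped launch assumes NCCL 2.x; on versions without group calls the per-row operations could simply be issued one after another.

#include <nccl.h>

void allreduce_pitched(float* d_data, size_t pitchBytes,
                       size_t width, size_t height,
                       ncclComm_t comm, cudaStream_t stream) {
  ncclGroupStart();
  for (size_t row = 0; row < height; ++row) {
    float* rowPtr = (float*)((char*)d_data + row * pitchBytes);               /* skip the row padding */
    ncclAllReduce(rowPtr, rowPtr, width, ncclFloat, ncclSum, comm, stream);   /* in place */
  }
  ncclGroupEnd();
}

Alternatively, cudaMemcpy2D can first pack the rows into a contiguous temporary buffer, after which a single all-reduce over width * height elements suffices.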

Is it safe to reuse cudaStream_t?

Hi all,
Is it safe to reuse a cudaStream_t object (say stream0) after cudaStreamSynchronize(stream0)? More specifically, is the following code safe:

cudaStream_t stream0;
CHECK_EQ(cudaStreamCreateWithFlags(&stream0, cudaStreamNonBlocking), cudaSuccess);
for(int i = 0; i < 1000000; ++i) {
  // communication
  CHECK_EQ(ncclAllReduce(src_ptr, dst_ptr, some_count, ncclFloat,
                         ncclSum, some_comm, stream0), ncclSuccess);
  // do something else
  ...
  ...

  // wait for it
  cudaStreamSynchronize(stream0);

  // without destroying stream0
}

nccl mpi communication

Dear all,

I have some questions from testing the nccl/test/mpi/nccl_test.cu code. First, does it support communication across different workstations? Second, will it block GPU computation during the NCCL data communication (such as ncclAllReduce)? Thanks a lot.

Yours sincerely

how to reduce the extra memory usage?

While using NCCL in my program, I found that every GPU uses extra memory. For example, I used 4 GPUs, and every GPU uses 3 extra allocations of about 110 MB each, as shown in the image. My question is: how can I reduce the 110 MB?

Build error __builtin_ia32_monitorx is undefined solution

Distro: Linux 4.2.6-300.fc23.x86_64

/usr/lib/gcc/x86_64-redhat-linux/5.3.1/include/mwaitxintrin.h(36): error: identifier "__builtin_ia32_monitorx" is undefined

/usr/lib/gcc/x86_64-redhat-linux/5.3.1/include/mwaitxintrin.h(42): error: identifier "__builtin_ia32_mwaitx" is undefined

However when I add the CXX flags:
-D_MWAITXINTRIN_H_INCLUDED
-D_FORCE_INLINES
-D__STRICT_ANSI__

the build completes successfully; see here:
tensorflow/tensorflow#1066

6x slowdown on Pascal TitanX on 80 lane PCIe switch.

So we have some 8-GPU machines running Maxwell Titan Xs, and we decided to try swapping them out for the newer cards. The basic hardware architecture is a pair of 80-lane switches connected to the same CPU. The driver is version 367.35.

When running your benchmark on the Maxwell cards I get pretty much the expected numbers:

./build/test/single/all_reduce_test 10000000 2 0 1
       N    type      op     time  algbw  busbw
10000000    char     sum    0.886  11.28  11.28

./build/test/single/all_reduce_test 10000000 2 0 7
       N    type      op     time  algbw  busbw
10000000    char     sum    1.215   8.23   8.23

I get about a 6x slowdown running on the exact same system but with Pascal instead of Maxwell cards. Also, the test that traverses the CPU runs at the same speed:

./build/test/single/all_reduce_test 10000000 2 0 1
       N    type      op     time  algbw  busbw
10000000    char     sum    5.650   1.77   1.77

./build/test/single/all_reduce_test 10000000 2 0 7
       N    type      op     time  algbw  busbw
10000000    char     sum    5.661   1.77   1.77

The slowdown is about 5x when running the test with all 8 gpus enabled.

Here are the results on an Intel Z170 chipset running two Pascal Titans on 8x PCIe. There doesn't seem to be an issue here (about 2x slower than the Maxwells running on 16x PCIe).

       N    type      op     time  algbw  busbw
10000000    char     sum    1.674   5.97   5.97

When testing with the cuda sample p2pBandwidthLatencyTest program, I get nearly identical results with the 2 sets of cards. The exception is the latency numbers with peer access enabled:

Maxwell:

P2P=Enabled Latency Matrix (us)
   D\D     0      1      2      3      4      5      6      7 
     0   4.10   8.03   8.22   7.46   7.25   6.95   7.47   6.76 
     1   7.37   4.49   7.26   7.25   7.05   6.66   7.25   6.72 
     2   7.27   7.33   4.24   7.34   7.16   6.66   7.49   6.85 
     3   7.19   7.05   7.38   4.04   6.94   6.47   6.86   6.73 
     4   7.03   6.85   6.89   7.30   3.90   6.52   7.19   6.72 
     5   7.42   7.24   6.94   7.02   7.00   4.22   7.17   7.09 
     6   8.68   7.32   7.11   7.24   7.07   7.10   4.41   6.39 
     7   7.77   7.76   7.20   7.68   8.09   6.77   7.55   4.01 

Pascal:

P2P=Enabled Latency Matrix (us)
   D\D     0      1      2      3      4      5      6      7 
     0   3.39  20.00  14.53  19.57  19.73  16.15  19.17  17.48 
     1  16.02  11.21  14.42  19.54  19.75  16.20  19.11  17.55 
     2  16.07  19.93   3.79  19.58  19.72  16.48  19.18  17.56 
     3  16.03  19.85  14.43   4.35  19.72  16.25  19.10  17.47 
     4  16.21  19.81  14.58  19.88  11.34  16.06  19.09  17.39 
     5  16.28  19.76  14.62  19.63  19.61   3.27  19.14  17.31 
     6  16.07  19.95  14.55  19.70  19.62  16.18   4.03  17.36 
     7  16.08  20.03  14.62  19.89  19.78  16.05  19.27  11.23 

We see similar slowdowns when training large models in tensorflow (which is how this came to our attention). Your tool seemed like a good way to probe this issue. Is there a more appropriate place to submit this bug?

peer mapping resources exhausted for < 8 GPUs

I am running a NCCL reduction across multiple GPUs on an Amazon P2 16x instance in a multi-process context (one MPI rank per GPU). When I added small arrays together across 16 workers I got the error "peer mapping resources exhausted". Looking online I determined that perhaps I was limited to 8 GPUs in a group and NCCL wasn't dealing with this limitation internally.

However, when I reduced between two groups of 8 GPUs using NCCL (by splitting MPI_COMM_WORLD into two separate communicators) and then did a standard MPI reduction in host memory to reduce the remaining two arrays, I got the same error. Same for 7 GPUs. I had to reduce the group size to 4 to get the correct behaviour.

It seems this is unrelated to the peer ensemble limitation but instead is related to other resources needed for multi-process reductions on a single node.

Joss Knight
