Comments (11)
Hi,
Do you see CPU_SYNC and GPU_SYNC events in the trace file?
If yes, the code path that emits LL128 events may not have been triggered (net send/recv is not covered because you're running a single-node test). Please try test_npkit_events.sh in the samples/NPKit folder.
If no, then NPKit is probably not enabled. Please make sure NCCL with NPKit is properly built and enabled.
from npkit.
Hi,
Thank you for your quick response.
Here are what seem to be some CPU_SYNC events.
{'id': 52, 'size': 0, 'rsvd': 0, 'timestamp': 1675011352062627302}
{'id': 52, 'size': 0, 'rsvd': 0, 'timestamp': 1675011352108313614}
{'id': 52, 'size': 0, 'rsvd': 0, 'timestamp': 1675011352165713543}
{'id': 52, 'size': 0, 'rsvd': 0, 'timestamp': 1675011352220442632}
{'id': 52, 'size': 0, 'rsvd': 0, 'timestamp': 1675011352262747296}
{'id': 52, 'size': 0, 'rsvd': 0, 'timestamp': 1675011352309870301}
{'id': 52, 'size': 0, 'rsvd': 0, 'timestamp': 1675011352312142589}
{'id': 52, 'size': 0, 'rsvd': 0, 'timestamp': 1675011352312155555}
{'id': 52, 'size': 0, 'rsvd': 0, 'timestamp': 1675011352312169877}
{'id': 52, 'size': 0, 'rsvd': 0, 'timestamp': 1675011352312183815}
{'id': 52, 'size': 0, 'rsvd': 0, 'timestamp': 1675011352312489432}
{'id': 52, 'size': 0, 'rsvd': 0, 'timestamp': 1675011352313935949}
{'id': 52, 'size': 0, 'rsvd': 0, 'timestamp': 1675011352313962963}
{'id': 52, 'size': 0, 'rsvd': 0, 'timestamp': 1675011352313974965}
...
Here are what seem to be some GPU_SYNC events.
{'id': 51, 'size': 0, 'rsvd': 0, 'timestamp': 29425007248}
{'id': 51, 'size': 0, 'rsvd': 0, 'timestamp': 29487368954}
{'id': 51, 'size': 0, 'rsvd': 0, 'timestamp': 29565711993}
{'id': 51, 'size': 0, 'rsvd': 0, 'timestamp': 29640398388}
{'id': 51, 'size': 0, 'rsvd': 0, 'timestamp': 29700178222}
{'id': 51, 'size': 0, 'rsvd': 0, 'timestamp': 29789938685}
{'id': 51, 'size': 0, 'rsvd': 0, 'timestamp': 29794267047}
{'id': 51, 'size': 0, 'rsvd': 0, 'timestamp': 29794291744}
{'id': 51, 'size': 0, 'rsvd': 0, 'timestamp': 29794319086}
{'id': 51, 'size': 0, 'rsvd': 0, 'timestamp': 29794347790}
{'id': 51, 'size': 0, 'rsvd': 0, 'timestamp': 29794927153}
{'id': 51, 'size': 0, 'rsvd': 0, 'timestamp': 29797682420}
{'id': 51, 'size': 0, 'rsvd': 0, 'timestamp': 29797733911}
{'id': 51, 'size': 0, 'rsvd': 0, 'timestamp': 29797756752}
{'id': 51, 'size': 0, 'rsvd': 0, 'timestamp': 29797782310}
{'id': 51, 'size': 0, 'rsvd': 0, 'timestamp': 29797805932}
{'id': 51, 'size': 0, 'rsvd': 0, 'timestamp': 29797844888}
{'id': 51, 'size': 0, 'rsvd': 0, 'timestamp': 29797868134}
{'id': 51, 'size': 0, 'rsvd': 0, 'timestamp': 29797889785}
{'id': 51, 'size': 0, 'rsvd': 0, 'timestamp': 29797913820}
{'id': 51, 'size': 0, 'rsvd': 0, 'timestamp': 29797937842}
{'id': 51, 'size': 0, 'rsvd': 0, 'timestamp': 29797960900}
{'id': 51, 'size': 0, 'rsvd': 0, 'timestamp': 29797984347}
{'id': 51, 'size': 0, 'rsvd': 0, 'timestamp': 29798008106}
{'id': 51, 'size': 0, 'rsvd': 0, 'timestamp': 29798031891}
{'id': 51, 'size': 0, 'rsvd': 0, 'timestamp': 29798056051}
{'id': 51, 'size': 0, 'rsvd': 0, 'timestamp': 29798079492}
{'id': 51, 'size': 0, 'rsvd': 0, 'timestamp': 29798102956}
{'id': 51, 'size': 0, 'rsvd': 0, 'timestamp': 29798127477}
{'id': 51, 'size': 0, 'rsvd': 0, 'timestamp': 29798150586}
{'id': 51, 'size': 0, 'rsvd': 0, 'timestamp': 29798174421}
{'id': 51, 'size': 0, 'rsvd': 0, 'timestamp': 29798743592}
{'id': 51, 'size': 0, 'rsvd': 0, 'timestamp': 29799730967}
{'id': 51, 'size': 0, 'rsvd': 0, 'timestamp': 29802882597}
{'id': 51, 'size': 0, 'rsvd': 0, 'timestamp': 29804244328}
{'id': 51, 'size': 0, 'rsvd': 0, 'timestamp': 29804268890}
...
Is it as expected?
from npkit.
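A quick way to check which event types a decoded trace contains is to tally the dumped entries by id. The following is a minimal sketch, assuming each event is printed as one Python dict literal per line (as in the dumps above) and that ids 51 and 52 correspond to GPU_SYNC and CPU_SYNC, as the dumps suggest. The helper name `tally_events` is hypothetical and not part of NPKit.

```python
import ast
from collections import Counter

# Assumption: ids 51 and 52 are the GPU_SYNC and CPU_SYNC events,
# as suggested by the dumps in this thread.
EVENT_NAMES = {51: "GPU_SYNC", 52: "CPU_SYNC"}


def tally_events(lines):
    """Count events per id from lines that each hold one dict literal."""
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line.startswith("{"):
            continue  # skip "..." and other non-event lines
        event = ast.literal_eval(line)
        counts[event["id"]] += 1
    return counts


sample = [
    "{'id': 52, 'size': 0, 'rsvd': 0, 'timestamp': 1675011352062627302}",
    "{'id': 51, 'size': 0, 'rsvd': 0, 'timestamp': 29425007248}",
    "{'id': 51, 'size': 0, 'rsvd': 0, 'timestamp': 29487368954}",
    "...",
]

for event_id, n in sorted(tally_events(sample).items()):
    print(EVENT_NAMES.get(event_id, f"id {event_id}"), n)
```

If only the two sync ids show up in the tally, the other enabled events were likely never emitted on the code path that ran.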
Thanks, it looks like NPKit is enabled. Could you please try enforcing NCCL_PROTO=LL128 in the NCCL run and see whether there are more events?
from npkit.
However, I find that the number of out-of-bounds values changes.
Here is my log. Is it as expected?
# nThread 1 nGpus 1 minBytes 8 maxBytes 33554432 step: 2(factor) warmup iters: 5 iters: 20 validation: 1
#
# Using devices
# Rank 0 Pid 20141 on gpu9 device 0 [0x3d] NVIDIA GeForce RTX 2080 Ti
# Rank 1 Pid 20142 on gpu9 device 1 [0x3e] NVIDIA GeForce RTX 2080 Ti
# Rank 2 Pid 20143 on gpu9 device 2 [0xb1] NVIDIA GeForce RTX 2080 Ti
# Rank 3 Pid 20144 on gpu9 device 3 [0xb2] NVIDIA GeForce RTX 2080 Ti
#
# out-of-place in-place
# size count type redop time algbw busbw error time algbw busbw error
# (B) (elements) (us) (GB/s) (GB/s) (us) (GB/s) (GB/s)
8 2 float sum 40.17 0.00 0.00 1e-07 62.08 0.00 0.00 0e+00
16 4 float sum 32.28 0.00 0.00 3e-08 33.14 0.00 0.00 3e-08
32 8 float sum 32.79 0.00 0.00 3e-08 32.98 0.00 0.00 3e-08
64 16 float sum 32.89 0.00 0.00 3e-08 33.51 0.00 0.00 3e-08
128 32 float sum 34.14 0.00 0.01 3e-08 32.12 0.00 0.01 3e-08
256 64 float sum 33.03 0.01 0.01 3e-08 32.94 0.01 0.01 3e-08
512 128 float sum 33.03 0.02 0.02 3e-08 32.58 0.02 0.02 1e-08
1024 256 float sum 33.54 0.03 0.05 1e-07 35.28 0.03 0.04 1e-07
2048 512 float sum 38.17 0.05 0.08 1e-07 39.18 0.05 0.08 1e-07
4096 1024 float sum 40.24 0.10 0.15 1e-07 41.26 0.10 0.15 1e-07
8192 2048 float sum 58.39 0.14 0.21 1e-07 56.03 0.15 0.22 1e-07
16384 4096 float sum 71.16 0.23 0.35 1e-07 64.84 0.25 0.38 1e-07
32768 8192 float sum 94.28 0.35 0.52 1e-07 90.56 0.36 0.54 1e-07
65536 16384 float sum 87.96 0.75 1.12 1e-07 85.24 0.77 1.15 1e-07
131072 32768 float sum 91.06 1.44 2.16 2e-07 89.04 1.47 2.21 2e-07
262144 65536 float sum 138.0 1.90 2.85 2e-07 136.8 1.92 2.87 2e-07
524288 131072 float sum 232.1 2.26 3.39 2e-07 235.6 2.23 3.34 2e-07
1048576 262144 float sum 428.7 2.45 3.67 2e-07 459.4 2.28 3.42 2e+00
2097152 524288 float sum 852.8 2.46 3.69 3e-02 835.2 2.51 3.77 2e-07
4194304 1048576 float sum 1688.1 2.48 3.73 4e-02 1786.7 2.35 3.52 2e-07
8388608 2097152 float sum 4149.9 2.02 3.03 6e-02 3209.5 2.61 3.92 6e-03
16777216 4194304 float sum 6795.0 2.47 3.70 1e+00 7391.7 2.27 3.40 2e-02
33554432 8388608 float sum 14205 2.36 3.54 1e+00 14141 2.37 3.56 4e+10
# Out of bounds values : 27 FAILED
# Avg bus bandwidth : 1.41103
#
--------------------------------------------------------------------------
Primary job terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:
Process name: [[1455,1],2]
Exit code: 1
--------------------------------------------------------------------------
from npkit.
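To see which message sizes produce the bad results, the data rows of the log can be filtered on their error columns. This is a minimal sketch, assuming the 12-column row layout shown in the log above (size, count, type, redop, then time/algbw/busbw/error for out-of-place and in-place). The function name `find_bad_rows` and the `tolerance` value are illustrative assumptions; the tolerance is not the threshold nccl-tests itself applies.

```python
def find_bad_rows(log_text, tolerance=1e-4):
    """Return (size_bytes, out_of_place_error, in_place_error) for suspicious rows."""
    bad = []
    for line in log_text.splitlines():
        fields = line.split()
        # Data rows have 12 fields and start with the message size;
        # header lines start with '#' and are skipped.
        if len(fields) != 12 or not fields[0].isdigit():
            continue
        size = int(fields[0])
        oop_err, ip_err = float(fields[7]), float(fields[11])
        # tolerance is a hypothetical cutoff chosen for illustration only.
        if oop_err > tolerance or ip_err > tolerance:
            bad.append((size, oop_err, ip_err))
    return bad


log = """\
# size count type redop time algbw busbw error time algbw busbw error
524288 131072 float sum 232.1 2.26 3.39 2e-07 235.6 2.23 3.34 2e-07
1048576 262144 float sum 428.7 2.45 3.67 2e-07 459.4 2.28 3.42 2e+00
2097152 524288 float sum 852.8 2.46 3.69 3e-02 835.2 2.51 3.77 2e-07
"""

for size, oop_err, ip_err in find_bad_rows(log):
    print(size, oop_err, ip_err)
```

In the log above, the large errors only start at 1 MiB and beyond, which is the kind of pattern this filter makes easy to spot.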
That's not expected. Does this happen with the same command but without NPKit as well?
Also, you could try a newer version (NCCL 2.12.12 with NPKit enabled) here: https://github.com/yzygitzh/nccl/tree/npkit-2.12.12
from npkit.
I tried NCCL 2.12.12 with NPKit enabled; however, I get a fatal error.
Here are my build commands.
git clone https://github.com/yzygitzh/nccl.git
cd nccl
git checkout eb51f579251741e16c444fbc6e76b530d85fb023 -f
make src.build CUDA_HOME=/usr/local/cuda-10.2/
Here is my log.
make: Warning: File 'Makefile' has modification time 20 s in the future
make -C src build BUILDDIR=/home/xinglinpan/npkit/nccl/build
make[1]: Entering directory '/home/xinglinpan/npkit/nccl/src'
make[1]: Warning: File '../makefiles/formatting.mk' has modification time 20 s in the future
Generating nccl.h.in > /home/xinglinpan/npkit/nccl/build/include/nccl.h
Grabbing include/nccl_net.h > /home/xinglinpan/npkit/nccl/build/include/nccl_net.h
Compiling init.cc > /home/xinglinpan/npkit/nccl/build/obj/init.o
Compiling channel.cc > /home/xinglinpan/npkit/nccl/build/obj/channel.o
Compiling bootstrap.cc > /home/xinglinpan/npkit/nccl/build/obj/bootstrap.o
Compiling transport.cc > /home/xinglinpan/npkit/nccl/build/obj/transport.o
Compiling enqueue.cc > /home/xinglinpan/npkit/nccl/build/obj/enqueue.o
Compiling group.cc > /home/xinglinpan/npkit/nccl/build/obj/group.o
Compiling debug.cc > /home/xinglinpan/npkit/nccl/build/obj/debug.o
Compiling proxy.cc > /home/xinglinpan/npkit/nccl/build/obj/proxy.o
Compiling enhcompat.cc > /home/xinglinpan/npkit/nccl/build/obj/enhcompat.o
Compiling net.cc > /home/xinglinpan/npkit/nccl/build/obj/net.o
Compiling misc/nvmlwrap.cc > /home/xinglinpan/npkit/nccl/build/obj/misc/nvmlwrap.o
Compiling misc/ibvwrap.cc > /home/xinglinpan/npkit/nccl/build/obj/misc/ibvwrap.o
Compiling misc/gdrwrap.cc > /home/xinglinpan/npkit/nccl/build/obj/misc/gdrwrap.o
Compiling misc/utils.cc > /home/xinglinpan/npkit/nccl/build/obj/misc/utils.o
Compiling misc/argcheck.cc > /home/xinglinpan/npkit/nccl/build/obj/misc/argcheck.o
Compiling misc/socket.cc > /home/xinglinpan/npkit/nccl/build/obj/misc/socket.o
Compiling misc/shmutils.cc > /home/xinglinpan/npkit/nccl/build/obj/misc/shmutils.o
Compiling misc/profiler.cc > /home/xinglinpan/npkit/nccl/build/obj/misc/profiler.o
Compiling misc/param.cc > /home/xinglinpan/npkit/nccl/build/obj/misc/param.o
Compiling misc/npkit.cc > /home/xinglinpan/npkit/nccl/build/obj/misc/npkit.o
In file included from misc/npkit.cc:6:0:
include/npkit/npkit.h:7:10: fatal error: hip/hip_runtime.h: No such file or directory
#include <hip/hip_runtime.h>
^~~~~~~~~~~~~~~~~~~
compilation terminated.
Makefile:111: recipe for target '/home/xinglinpan/npkit/nccl/build/obj/misc/npkit.o' failed
make[1]: *** [/home/xinglinpan/npkit/nccl/build/obj/misc/npkit.o] Error 1
make[1]: Leaving directory '/home/xinglinpan/npkit/nccl/src'
Makefile:25: recipe for target 'src.build' failed
make: *** [src.build] Error 2
from npkit.
You'll need to use the npkit-2.12.12 branch.
from npkit.
Here are my new commands.
git checkout npkit-2.12.12
make clean
make src.build CUDA_HOME=/usr/local/cuda-10.2/ NVCC_GENCODE="-gencode=arch=compute_75,code=sm_75" NPKIT_FLAGS="-DENABLE_NPKIT -DENABLE_NPKIT_EVENT_TIME_SYNC_CPU -DENABLE_NPKIT_EVENT_TIME_SYNC_GPU -DENABLE_NPKIT_EVENT_PRIM_LL128_DATA_PROCESS_ENTRY -DENABLE_NPKIT_EVENT_PRIM_LL128_DATA_PROCESS_EXIT -DENABLE_NPKIT_EVENT_NET_SEND_ENTRY -DENABLE_NPKIT_EVENT_NET_SEND_EXIT -DENABLE_NPKIT_EVENT_NET_RECV_ENTRY -DENABLE_NPKIT_EVENT_NET_RECV_EXIT"
Here is my new log.
make -C src build BUILDDIR=/home/xinglinpan/npkit/nccl/build
make[1]: Entering directory '/home/xinglinpan/npkit/nccl/src'
Generating nccl.h.in > /home/xinglinpan/npkit/nccl/build/include/nccl.h
Grabbing include/nccl_net.h > /home/xinglinpan/npkit/nccl/build/include/nccl_net.h
Compiling init.cc > /home/xinglinpan/npkit/nccl/build/obj/init.o
Compiling channel.cc > /home/xinglinpan/npkit/nccl/build/obj/channel.o
Compiling bootstrap.cc > /home/xinglinpan/npkit/nccl/build/obj/bootstrap.o
Compiling transport.cc > /home/xinglinpan/npkit/nccl/build/obj/transport.o
Compiling enqueue.cc > /home/xinglinpan/npkit/nccl/build/obj/enqueue.o
Compiling group.cc > /home/xinglinpan/npkit/nccl/build/obj/group.o
Compiling debug.cc > /home/xinglinpan/npkit/nccl/build/obj/debug.o
Compiling proxy.cc > /home/xinglinpan/npkit/nccl/build/obj/proxy.o
Compiling enhcompat.cc > /home/xinglinpan/npkit/nccl/build/obj/enhcompat.o
Compiling net.cc > /home/xinglinpan/npkit/nccl/build/obj/net.o
Compiling misc/nvmlwrap.cc > /home/xinglinpan/npkit/nccl/build/obj/misc/nvmlwrap.o
Compiling misc/ibvwrap.cc > /home/xinglinpan/npkit/nccl/build/obj/misc/ibvwrap.o
Compiling misc/gdrwrap.cc > /home/xinglinpan/npkit/nccl/build/obj/misc/gdrwrap.o
Compiling misc/utils.cc > /home/xinglinpan/npkit/nccl/build/obj/misc/utils.o
Compiling misc/argcheck.cc > /home/xinglinpan/npkit/nccl/build/obj/misc/argcheck.o
Compiling misc/socket.cc > /home/xinglinpan/npkit/nccl/build/obj/misc/socket.o
Compiling misc/shmutils.cc > /home/xinglinpan/npkit/nccl/build/obj/misc/shmutils.o
Compiling misc/profiler.cc > /home/xinglinpan/npkit/nccl/build/obj/misc/profiler.o
Compiling misc/param.cc > /home/xinglinpan/npkit/nccl/build/obj/misc/param.o
Compiling misc/npkit.cc > /home/xinglinpan/npkit/nccl/build/obj/misc/npkit.o
Compiling transport/p2p.cc > /home/xinglinpan/npkit/nccl/build/obj/transport/p2p.o
Compiling transport/shm.cc > /home/xinglinpan/npkit/nccl/build/obj/transport/shm.o
Compiling transport/net.cc > /home/xinglinpan/npkit/nccl/build/obj/transport/net.o
Compiling transport/net_socket.cc > /home/xinglinpan/npkit/nccl/build/obj/transport/net_socket.o
Compiling transport/net_ib.cc > /home/xinglinpan/npkit/nccl/build/obj/transport/net_ib.o
Compiling transport/coll_net.cc > /home/xinglinpan/npkit/nccl/build/obj/transport/coll_net.o
Compiling collectives/sendrecv.cc > /home/xinglinpan/npkit/nccl/build/obj/collectives/sendrecv.o
Compiling collectives/all_reduce.cc > /home/xinglinpan/npkit/nccl/build/obj/collectives/all_reduce.o
Compiling collectives/all_gather.cc > /home/xinglinpan/npkit/nccl/build/obj/collectives/all_gather.o
Compiling collectives/broadcast.cc > /home/xinglinpan/npkit/nccl/build/obj/collectives/broadcast.o
Compiling collectives/reduce.cc > /home/xinglinpan/npkit/nccl/build/obj/collectives/reduce.o
Compiling collectives/reduce_scatter.cc > /home/xinglinpan/npkit/nccl/build/obj/collectives/reduce_scatter.o
Compiling graph/topo.cc > /home/xinglinpan/npkit/nccl/build/obj/graph/topo.o
Compiling graph/paths.cc > /home/xinglinpan/npkit/nccl/build/obj/graph/paths.o
Compiling graph/search.cc > /home/xinglinpan/npkit/nccl/build/obj/graph/search.o
Compiling graph/connect.cc > /home/xinglinpan/npkit/nccl/build/obj/graph/connect.o
Compiling graph/rings.cc > /home/xinglinpan/npkit/nccl/build/obj/graph/rings.o
Compiling graph/trees.cc > /home/xinglinpan/npkit/nccl/build/obj/graph/trees.o
Compiling graph/tuning.cc > /home/xinglinpan/npkit/nccl/build/obj/graph/tuning.o
Compiling graph/xml.cc > /home/xinglinpan/npkit/nccl/build/obj/graph/xml.o
make[2]: Entering directory '/home/xinglinpan/npkit/nccl/src/collectives/device'
Generating rules > /home/xinglinpan/npkit/nccl/build/obj/collectives/device/Makefile.rules
make[2]: Warning: File '/home/xinglinpan/npkit/nccl/build/obj/collectives/device/Makefile.rules' has modification time 160 s in the future
Compiling sendrecv.cu > /home/xinglinpan/npkit/nccl/build/obj/collectives/device/sendrecv_sum_i8.o
../../include/npkit/npkit_struct.h(8): error: "Bitfields and field types containing bitfields are not supported in packed structures and unions for device compilation!"
/home/xinglinpan/npkit/nccl/build/obj/collectives/device/Makefile.rules:2: recipe for target '/home/xinglinpan/npkit/nccl/build/obj/collectives/device/sendrecv_sum_i8.o' failed
make[2]: *** [/home/xinglinpan/npkit/nccl/build/obj/collectives/device/sendrecv_sum_i8.o] Error 2
make[2]: Leaving directory '/home/xinglinpan/npkit/nccl/src/collectives/device'
Makefile:50: recipe for target '/home/xinglinpan/npkit/nccl/build/obj/collectives/device/colldevice.a' failed
make[1]: *** [/home/xinglinpan/npkit/nccl/build/obj/collectives/device/colldevice.a] Error 2
make[1]: Leaving directory '/home/xinglinpan/npkit/nccl/src'
Makefile:25: recipe for target 'src.build' failed
make: *** [src.build] Error 2
from npkit.
Regarding "Does this happen with the same command but without NPKit as well?":
I rebuilt my code without NPKit, and the problem still seems to happen.
I built nccl-tests without NPKit with
make MPI=1 MPI_HOME=/home/xinglinpan/mpi/openmpi-4.1.4 CUDA_HOME=/usr/local/cuda-10.2/ NCCL_HOME=/home/xinglinpan/nccl_2.10.3-1+cuda10.2_x86_64/
and ran it with
CUDA_HOME=/usr/local/cuda-10.2/ LD_LIBRARY_PATH=/home/xinglinpan/nccl_2.10.3-1+cuda10.2_x86_64/lib:/home/xinglinpan/mpi/openmpi-4.1.4/lib/ mpirun -np 4 -mca pml ob1 -mca btl ^openib -mca btl_tcp_if_exclude lo,docker0 -mca coll_hcoll_enable 0 -mca plm_rsh_no_tree_spawn 1 -mca plm_rsh_num_concurrent 8192 -x NCCL_UCX_TLS=rc_x,cuda_copy,cuda_ipc -x NCCL_UCX_RNDV_THRESH=0 -x NCCL_UCX_RNDV_SCHEME=get_zcopy -x UCX_RC_MLX5_TM_ENABLE=y -x NPKIT_DUMP_DIR=./ -x NCCL_PROTO=LL128 ./build/all_reduce_perf -b 8 -e 128M -f 2 -g 1
Here is my log.
# nThread 1 nGpus 1 minBytes 8 maxBytes 134217728 step: 2(factor) warmup iters: 5 iters: 20 validation: 1
#
# Using devices
# Rank 0 Pid 26003 on gpu9 device 0 [0x3d] NVIDIA GeForce RTX 2080 Ti
# Rank 1 Pid 26004 on gpu9 device 1 [0x3e] NVIDIA GeForce RTX 2080 Ti
# Rank 2 Pid 26005 on gpu9 device 2 [0xb1] NVIDIA GeForce RTX 2080 Ti
# Rank 3 Pid 26006 on gpu9 device 3 [0xb2] NVIDIA GeForce RTX 2080 Ti
#
# out-of-place in-place
# size count type redop time algbw busbw error time algbw busbw error
# (B) (elements) (us) (GB/s) (GB/s) (us) (GB/s) (GB/s)
8 2 float sum 35.76 0.00 0.00 1e-07 35.73 0.00 0.00 0e+00
16 4 float sum 36.02 0.00 0.00 3e-08 34.24 0.00 0.00 3e-08
32 8 float sum 37.94 0.00 0.00 3e-08 34.84 0.00 0.00 3e-08
64 16 float sum 41.73 0.00 0.00 3e-08 38.56 0.00 0.00 3e-08
128 32 float sum 39.08 0.00 0.00 3e-08 33.29 0.00 0.01 3e-08
256 64 float sum 36.59 0.01 0.01 3e-08 42.12 0.01 0.01 3e-08
512 128 float sum 40.98 0.01 0.02 3e-08 41.01 0.01 0.02 1e-08
1024 256 float sum 50.67 0.02 0.03 1e-07 38.64 0.03 0.04 1e-07
2048 512 float sum 47.87 0.04 0.06 1e-07 51.49 0.04 0.06 1e-07
4096 1024 float sum 62.50 0.07 0.10 1e-07 60.32 0.07 0.10 1e-07
8192 2048 float sum 65.22 0.13 0.19 1e-07 108.4 0.08 0.11 1e-07
16384 4096 float sum 235.1 0.07 0.10 1e-07 100.7 0.16 0.24 1e-07
32768 8192 float sum 257.4 0.13 0.19 1e-07 128.1 0.26 0.38 1e-07
65536 16384 float sum 80.75 0.81 1.22 1e-07 85.43 0.77 1.15 1e-07
131072 32768 float sum 102.9 1.27 1.91 2e-07 111.9 1.17 1.76 2e-07
262144 65536 float sum 229.9 1.14 1.71 2e+00 238.7 1.10 1.65 2e-07
524288 131072 float sum 324.4 1.62 2.42 2e-07 277.4 1.89 2.83 2e-07
1048576 262144 float sum 485.0 2.16 3.24 2e-07 846.7 1.24 1.86 2e-07
2097152 524288 float sum 964.3 2.17 3.26 2e-07 918.3 2.28 3.43 2e-07
4194304 1048576 float sum 1822.6 2.30 3.45 2e-07 3010.2 1.39 2.09 2e-01
8388608 2097152 float sum 3542.5 2.37 3.55 2e-07 4195.9 2.00 3.00 2e-07
16777216 4194304 float sum 8426.8 1.99 2.99 2e-07 7970.6 2.10 3.16 1e-02
33554432 8388608 float sum 15913 2.11 3.16 3e-02 16281 2.06 3.09 1e+00
67108864 16777216 float sum 31764 2.11 3.17 4e-02 32472 2.07 3.10 2e-02
134217728 33554432 float sum 66614 2.01 3.02 1e+00 66821 2.01 3.01 3e+10
# Out of bounds values : 31 FAILED
# Avg bus bandwidth : 1.29865
#
--------------------------------------------------------------------------
Primary job terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:
Process name: [[11904,1],0]
Exit code: 1
--------------------------------------------------------------------------
After removing NCCL_PROTO=LL128, I get the expected result.
# nThread 1 nGpus 1 minBytes 8 maxBytes 134217728 step: 2(factor) warmup iters: 5 iters: 20 validation: 1
#
# Using devices
# Rank 0 Pid 34181 on gpu9 device 0 [0x3d] NVIDIA GeForce RTX 2080 Ti
# Rank 1 Pid 34182 on gpu9 device 1 [0x3e] NVIDIA GeForce RTX 2080 Ti
# Rank 2 Pid 34183 on gpu9 device 2 [0xb1] NVIDIA GeForce RTX 2080 Ti
# Rank 3 Pid 34184 on gpu9 device 3 [0xb2] NVIDIA GeForce RTX 2080 Ti
#
# out-of-place in-place
# size count type redop time algbw busbw error time algbw busbw error
# (B) (elements) (us) (GB/s) (GB/s) (us) (GB/s) (GB/s)
8 2 float sum 20.56 0.00 0.00 1e-07 133.3 0.00 0.00 0e+00
16 4 float sum 14.07 0.00 0.00 3e-08 15.91 0.00 0.00 3e-08
32 8 float sum 15.94 0.00 0.00 3e-08 22.74 0.00 0.00 3e-08
64 16 float sum 21.74 0.00 0.00 3e-08 19.91 0.00 0.00 3e-08
128 32 float sum 15.01 0.01 0.01 3e-08 19.08 0.01 0.01 3e-08
256 64 float sum 27.95 0.01 0.01 3e-08 39.70 0.01 0.01 3e-08
512 128 float sum 29.36 0.02 0.03 3e-08 22.62 0.02 0.03 1e-08
1024 256 float sum 21.65 0.05 0.07 1e-07 21.35 0.05 0.07 1e-07
2048 512 float sum 21.99 0.09 0.14 1e-07 27.51 0.07 0.11 1e-07
4096 1024 float sum 34.00 0.12 0.18 2e-07 26.85 0.15 0.23 2e-07
8192 2048 float sum 21.81 0.38 0.56 2e-07 38.44 0.21 0.32 2e-07
16384 4096 float sum 42.64 0.38 0.58 2e-07 40.77 0.40 0.60 2e-07
32768 8192 float sum 68.61 0.48 0.72 2e-07 55.08 0.59 0.89 2e-07
65536 16384 float sum 77.09 0.85 1.28 2e-07 66.30 0.99 1.48 2e-07
131072 32768 float sum 122.9 1.07 1.60 2e-07 98.81 1.33 1.99 2e-07
262144 65536 float sum 162.2 1.62 2.42 2e-07 155.7 1.68 2.53 2e-07
524288 131072 float sum 237.9 2.20 3.31 2e-07 197.8 2.65 3.98 2e-07
1048576 262144 float sum 374.1 2.80 4.20 2e-07 378.5 2.77 4.16 2e-07
2097152 524288 float sum 723.4 2.90 4.35 2e-07 737.7 2.84 4.26 2e-07
4194304 1048576 float sum 1420.6 2.95 4.43 2e-07 2185.8 1.92 2.88 2e-07
8388608 2097152 float sum 4362.8 1.92 2.88 2e-07 2931.3 2.86 4.29 2e-07
16777216 4194304 float sum 6279.3 2.67 4.01 2e-07 6639.8 2.53 3.79 2e-07
33554432 8388608 float sum 12984 2.58 3.88 2e-07 12806 2.62 3.93 2e-07
67108864 16777216 float sum 25783 2.60 3.90 2e-07 25296 2.65 3.98 2e-07
134217728 33554432 float sum 48364 2.78 4.16 2e-07 47972 2.80 4.20 2e-07
# Out of bounds values : 0 OK
# Avg bus bandwidth : 1.72964
from npkit.
It looks like some issue in the low-level software/hardware stack is causing the LL128 errors. You could try upgrading the driver and CUDA (e.g. to 11.8) and running again.
BTW, you can also trace other events instead of the LL128 events, since NCCL doesn't use LL128 by default. For example, you can try the Simple (-DENABLE_NPKIT_EVENT_PRIM_SIMPLE_REDUCE_OR_COPY_MULTI_ENTRY -DENABLE_NPKIT_EVENT_PRIM_SIMPLE_REDUCE_OR_COPY_MULTI_EXIT) or LL (-DENABLE_NPKIT_EVENT_PRIM_LL_DATA_PROCESS_ENTRY -DENABLE_NPKIT_EVENT_PRIM_LL_DATA_PROCESS_EXIT) protocols.
from npkit.
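Since each traced event needs a matching pair of ENTRY/EXIT defines, the flag string is easy to get wrong by hand. Below is a minimal sketch that assembles it from event base names. The helper `npkit_flags` is hypothetical and not part of NPKit; the define names it produces are the ones quoted in this thread.

```python
def npkit_flags(event_names, time_sync=True):
    """Build a compiler-flag string from NPKit event base names."""
    flags = ["-DENABLE_NPKIT"]
    if time_sync:
        # CPU/GPU time sync events, as used in the build command earlier in the thread.
        flags += [
            "-DENABLE_NPKIT_EVENT_TIME_SYNC_CPU",
            "-DENABLE_NPKIT_EVENT_TIME_SYNC_GPU",
        ]
    for name in event_names:
        # Each traced event is enabled as an ENTRY/EXIT pair.
        flags += [
            f"-DENABLE_NPKIT_EVENT_{name}_ENTRY",
            f"-DENABLE_NPKIT_EVENT_{name}_EXIT",
        ]
    return " ".join(flags)


# Simple and LL protocol events, as suggested above.
print(npkit_flags(["PRIM_SIMPLE_REDUCE_OR_COPY_MULTI", "PRIM_LL_DATA_PROCESS"]))
```

The printed string can then be passed as the value of NPKIT_FLAGS in the make command, in the same way as in the build command shown earlier.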
I set NCCL_PROTO=LL,Simple (the default) and added both the Simple and LL events to NPKIT_FLAGS. It seems to work without problems now! Thank you! Here is my log.
from npkit.