
Empty trace file (npkit issue, closed)

microsoft commented on July 26, 2024
Empty trace file

from npkit.

Comments (11)

yzygitzh commented on July 26, 2024

Hi,

Do you see CPU_SYNC and GPU_SYNC events in the trace file?

If yes, the code path that emits LL128 events was probably not triggered (net send/recv events are not covered because you're running a single-node test). Please try test_npkit_events.sh in the samples/NPKit folder.

If not, NPKit is probably not enabled. Please make sure NCCL with NPKit is properly built and enabled.


Fragile-azalea commented on July 26, 2024

Hi,
Thank you for your quick response.
There seem to be some CPU_SYNC events:

{'id': 52, 'size': 0, 'rsvd': 0, 'timestamp': 1675011352062627302}
{'id': 52, 'size': 0, 'rsvd': 0, 'timestamp': 1675011352108313614}
{'id': 52, 'size': 0, 'rsvd': 0, 'timestamp': 1675011352165713543}
{'id': 52, 'size': 0, 'rsvd': 0, 'timestamp': 1675011352220442632}
{'id': 52, 'size': 0, 'rsvd': 0, 'timestamp': 1675011352262747296}
{'id': 52, 'size': 0, 'rsvd': 0, 'timestamp': 1675011352309870301}
{'id': 52, 'size': 0, 'rsvd': 0, 'timestamp': 1675011352312142589}
{'id': 52, 'size': 0, 'rsvd': 0, 'timestamp': 1675011352312155555}
{'id': 52, 'size': 0, 'rsvd': 0, 'timestamp': 1675011352312169877}
{'id': 52, 'size': 0, 'rsvd': 0, 'timestamp': 1675011352312183815}
{'id': 52, 'size': 0, 'rsvd': 0, 'timestamp': 1675011352312489432}
{'id': 52, 'size': 0, 'rsvd': 0, 'timestamp': 1675011352313935949}
{'id': 52, 'size': 0, 'rsvd': 0, 'timestamp': 1675011352313962963}
{'id': 52, 'size': 0, 'rsvd': 0, 'timestamp': 1675011352313974965}
...

There also seem to be some GPU_SYNC events:

{'id': 51, 'size': 0, 'rsvd': 0, 'timestamp': 29425007248}
{'id': 51, 'size': 0, 'rsvd': 0, 'timestamp': 29487368954}
{'id': 51, 'size': 0, 'rsvd': 0, 'timestamp': 29565711993}
{'id': 51, 'size': 0, 'rsvd': 0, 'timestamp': 29640398388}
{'id': 51, 'size': 0, 'rsvd': 0, 'timestamp': 29700178222}
{'id': 51, 'size': 0, 'rsvd': 0, 'timestamp': 29789938685}
{'id': 51, 'size': 0, 'rsvd': 0, 'timestamp': 29794267047}
{'id': 51, 'size': 0, 'rsvd': 0, 'timestamp': 29794291744}
{'id': 51, 'size': 0, 'rsvd': 0, 'timestamp': 29794319086}
{'id': 51, 'size': 0, 'rsvd': 0, 'timestamp': 29794347790}
{'id': 51, 'size': 0, 'rsvd': 0, 'timestamp': 29794927153}
{'id': 51, 'size': 0, 'rsvd': 0, 'timestamp': 29797682420}
{'id': 51, 'size': 0, 'rsvd': 0, 'timestamp': 29797733911}
{'id': 51, 'size': 0, 'rsvd': 0, 'timestamp': 29797756752}
{'id': 51, 'size': 0, 'rsvd': 0, 'timestamp': 29797782310}
{'id': 51, 'size': 0, 'rsvd': 0, 'timestamp': 29797805932}
{'id': 51, 'size': 0, 'rsvd': 0, 'timestamp': 29797844888}
{'id': 51, 'size': 0, 'rsvd': 0, 'timestamp': 29797868134}
{'id': 51, 'size': 0, 'rsvd': 0, 'timestamp': 29797889785}
{'id': 51, 'size': 0, 'rsvd': 0, 'timestamp': 29797913820}
{'id': 51, 'size': 0, 'rsvd': 0, 'timestamp': 29797937842}
{'id': 51, 'size': 0, 'rsvd': 0, 'timestamp': 29797960900}
{'id': 51, 'size': 0, 'rsvd': 0, 'timestamp': 29797984347}
{'id': 51, 'size': 0, 'rsvd': 0, 'timestamp': 29798008106}
{'id': 51, 'size': 0, 'rsvd': 0, 'timestamp': 29798031891}
{'id': 51, 'size': 0, 'rsvd': 0, 'timestamp': 29798056051}
{'id': 51, 'size': 0, 'rsvd': 0, 'timestamp': 29798079492}
{'id': 51, 'size': 0, 'rsvd': 0, 'timestamp': 29798102956}
{'id': 51, 'size': 0, 'rsvd': 0, 'timestamp': 29798127477}
{'id': 51, 'size': 0, 'rsvd': 0, 'timestamp': 29798150586}
{'id': 51, 'size': 0, 'rsvd': 0, 'timestamp': 29798174421}
{'id': 51, 'size': 0, 'rsvd': 0, 'timestamp': 29798743592}
{'id': 51, 'size': 0, 'rsvd': 0, 'timestamp': 29799730967}
{'id': 51, 'size': 0, 'rsvd': 0, 'timestamp': 29802882597}
{'id': 51, 'size': 0, 'rsvd': 0, 'timestamp': 29804244328}
{'id': 51, 'size': 0, 'rsvd': 0, 'timestamp': 29804268890}
...

Is it as expected?
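
A quick way to answer this kind of question is to count events per id. The following is a minimal Python sketch, assuming the records have already been parsed into dicts like the ones above. The id-to-name mapping (51 for GPU_SYNC, 52 for CPU_SYNC) is taken from the dumps in this comment and may differ in other NPKit versions, and summarize_events is a hypothetical helper, not part of NPKit.

```python
from collections import Counter

# Assumption: ids as they appear in the dumps above; other NPKit builds may differ.
SYNC_EVENT_NAMES = {51: "GPU_SYNC", 52: "CPU_SYNC"}

def summarize_events(events):
    """Count events per id and count how many are not time-sync events."""
    counts = Counter(event["id"] for event in events)
    non_sync = sum(n for event_id, n in counts.items()
                   if event_id not in SYNC_EVENT_NAMES)
    return counts, non_sync

events = [
    {'id': 52, 'size': 0, 'rsvd': 0, 'timestamp': 1675011352062627302},
    {'id': 52, 'size': 0, 'rsvd': 0, 'timestamp': 1675011352108313614},
    {'id': 51, 'size': 0, 'rsvd': 0, 'timestamp': 29425007248},
]
counts, non_sync = summarize_events(events)
for event_id, n in sorted(counts.items()):
    print(SYNC_EVENT_NAMES.get(event_id, f"id {event_id}"), n)
if non_sync == 0:
    print("only sync events found: no instrumented code path seems to have run")
```

If only sync events show up, tracing itself is working and the missing piece is most likely the code path that emits the other events.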


yzygitzh commented on July 26, 2024

Thanks, it looks like NPKit is enabled. Could you please try forcing NCCL_PROTO=LL128 in the NCCL run and see whether more events appear?
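
Forcing the protocol only requires setting the environment variable for the NCCL processes. A small sketch, assuming the benchmark is launched from Python: build_env is a hypothetical helper, and NPKIT_DUMP_DIR is the dump-directory variable that appears in the mpirun command later in this thread.

```python
import os

def build_env(proto="LL128", dump_dir="./"):
    """Return a copy of the current environment with NCCL_PROTO forced."""
    env = dict(os.environ)
    env["NCCL_PROTO"] = proto          # restrict NCCL to the given protocol(s)
    env["NPKIT_DUMP_DIR"] = dump_dir   # directory for the NPKit dump files
    return env

env = build_env()
print(env["NCCL_PROTO"], env["NPKIT_DUMP_DIR"])
# The result can be passed as subprocess.run(cmd, env=env) with your own launch command.
```

With mpirun, the same variables are usually forwarded with -x, as in the command shown further down.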


Fragile-azalea commented on July 26, 2024

It works for me.
image

However, I noticed that the number of out-of-bounds values changes.
Here is my log. Is this expected?

# nThread 1 nGpus 1 minBytes 8 maxBytes 33554432 step: 2(factor) warmup iters: 5 iters: 20 validation: 1
#
# Using devices
#   Rank  0 Pid  20141 on       gpu9 device  0 [0x3d] NVIDIA GeForce RTX 2080 Ti
#   Rank  1 Pid  20142 on       gpu9 device  1 [0x3e] NVIDIA GeForce RTX 2080 Ti
#   Rank  2 Pid  20143 on       gpu9 device  2 [0xb1] NVIDIA GeForce RTX 2080 Ti
#   Rank  3 Pid  20144 on       gpu9 device  3 [0xb2] NVIDIA GeForce RTX 2080 Ti
#
#                                                       out-of-place                       in-place
#       size         count      type   redop     time   algbw   busbw  error     time   algbw   busbw  error
#        (B)    (elements)                       (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)
           8             2     float     sum    40.17    0.00    0.00  1e-07    62.08    0.00    0.00  0e+00
          16             4     float     sum    32.28    0.00    0.00  3e-08    33.14    0.00    0.00  3e-08
          32             8     float     sum    32.79    0.00    0.00  3e-08    32.98    0.00    0.00  3e-08
          64            16     float     sum    32.89    0.00    0.00  3e-08    33.51    0.00    0.00  3e-08
         128            32     float     sum    34.14    0.00    0.01  3e-08    32.12    0.00    0.01  3e-08
         256            64     float     sum    33.03    0.01    0.01  3e-08    32.94    0.01    0.01  3e-08
         512           128     float     sum    33.03    0.02    0.02  3e-08    32.58    0.02    0.02  1e-08
        1024           256     float     sum    33.54    0.03    0.05  1e-07    35.28    0.03    0.04  1e-07
        2048           512     float     sum    38.17    0.05    0.08  1e-07    39.18    0.05    0.08  1e-07
        4096          1024     float     sum    40.24    0.10    0.15  1e-07    41.26    0.10    0.15  1e-07
        8192          2048     float     sum    58.39    0.14    0.21  1e-07    56.03    0.15    0.22  1e-07
       16384          4096     float     sum    71.16    0.23    0.35  1e-07    64.84    0.25    0.38  1e-07
       32768          8192     float     sum    94.28    0.35    0.52  1e-07    90.56    0.36    0.54  1e-07
       65536         16384     float     sum    87.96    0.75    1.12  1e-07    85.24    0.77    1.15  1e-07
      131072         32768     float     sum    91.06    1.44    2.16  2e-07    89.04    1.47    2.21  2e-07
      262144         65536     float     sum    138.0    1.90    2.85  2e-07    136.8    1.92    2.87  2e-07
      524288        131072     float     sum    232.1    2.26    3.39  2e-07    235.6    2.23    3.34  2e-07
     1048576        262144     float     sum    428.7    2.45    3.67  2e-07    459.4    2.28    3.42  2e+00
     2097152        524288     float     sum    852.8    2.46    3.69  3e-02    835.2    2.51    3.77  2e-07
     4194304       1048576     float     sum   1688.1    2.48    3.73  4e-02   1786.7    2.35    3.52  2e-07
     8388608       2097152     float     sum   4149.9    2.02    3.03  6e-02   3209.5    2.61    3.92  6e-03
    16777216       4194304     float     sum   6795.0    2.47    3.70  1e+00   7391.7    2.27    3.40  2e-02
    33554432       8388608     float     sum    14205    2.36    3.54  1e+00    14141    2.37    3.56  4e+10
# Out of bounds values : 27 FAILED
# Avg bus bandwidth    : 1.41103
#
--------------------------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:

  Process name: [[1455,1],2]
  Exit code:    1
--------------------------------------------------------------------------
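
To see which message sizes carry the large errors, the table rows can be filtered by their error columns. This is only a sketch: parse_rows and suspicious_sizes are hypothetical helpers, the 12-column layout is assumed from the log above, and the tolerance is an arbitrary cutoff for illustration, not the threshold nccl-tests applies internally.

```python
def parse_rows(log_text):
    """Yield (size_bytes, out_of_place_error, in_place_error) for each data row."""
    for line in log_text.splitlines():
        fields = line.split()
        # Data rows have 12 columns and start with the message size; '#' lines are skipped.
        if len(fields) != 12 or not fields[0].isdigit():
            continue
        yield int(fields[0]), float(fields[7]), float(fields[11])

def suspicious_sizes(log_text, tolerance=1e-3):
    # tolerance is a hypothetical cutoff, not the one nccl-tests uses.
    return [size for size, oop, ip in parse_rows(log_text)
            if max(oop, ip) > tolerance]

sample = """\
#       size         count      type   redop     time   algbw   busbw  error     time   algbw   busbw  error
      524288        131072     float     sum    232.1    2.26    3.39  2e-07    235.6    2.23    3.34  2e-07
     1048576        262144     float     sum    428.7    2.45    3.67  2e-07    459.4    2.28    3.42  2e+00
     2097152        524288     float     sum    852.8    2.46    3.69  3e-02    835.2    2.51    3.77  2e-07
"""
print(suspicious_sizes(sample))
```

In the log above, every row up to 524288 bytes stays at or below 2e-07, and the large errors start at 1048576 bytes.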


yzygitzh commented on July 26, 2024

That's not expected. Does this also happen with the same command but without NPKit?
You could also try a newer version (NCCL 2.12.12 with NPKit enabled) here: https://github.com/yzygitzh/nccl/tree/npkit-2.12.12


Fragile-azalea commented on July 26, 2024

I tried NCCL 2.12.12 with NPKit enabled, but I got a fatal error.
Here are my build commands.

git clone https://github.com/yzygitzh/nccl.git
cd nccl
git checkout eb51f579251741e16c444fbc6e76b530d85fb023 -f
make src.build CUDA_HOME=/usr/local/cuda-10.2/

Here is my log.

make: Warning: File 'Makefile' has modification time 20 s in the future
make -C src build BUILDDIR=/home/xinglinpan/npkit/nccl/build
make[1]: Entering directory '/home/xinglinpan/npkit/nccl/src'
make[1]: Warning: File '../makefiles/formatting.mk' has modification time 20 s in the future
Generating nccl.h.in                           > /home/xinglinpan/npkit/nccl/build/include/nccl.h
Grabbing   include/nccl_net.h                  > /home/xinglinpan/npkit/nccl/build/include/nccl_net.h
Compiling  init.cc                             > /home/xinglinpan/npkit/nccl/build/obj/init.o
Compiling  channel.cc                          > /home/xinglinpan/npkit/nccl/build/obj/channel.o
Compiling  bootstrap.cc                        > /home/xinglinpan/npkit/nccl/build/obj/bootstrap.o
Compiling  transport.cc                        > /home/xinglinpan/npkit/nccl/build/obj/transport.o
Compiling  enqueue.cc                          > /home/xinglinpan/npkit/nccl/build/obj/enqueue.o
Compiling  group.cc                            > /home/xinglinpan/npkit/nccl/build/obj/group.o
Compiling  debug.cc                            > /home/xinglinpan/npkit/nccl/build/obj/debug.o
Compiling  proxy.cc                            > /home/xinglinpan/npkit/nccl/build/obj/proxy.o
Compiling  enhcompat.cc                        > /home/xinglinpan/npkit/nccl/build/obj/enhcompat.o
Compiling  net.cc                              > /home/xinglinpan/npkit/nccl/build/obj/net.o
Compiling  misc/nvmlwrap.cc                    > /home/xinglinpan/npkit/nccl/build/obj/misc/nvmlwrap.o
Compiling  misc/ibvwrap.cc                     > /home/xinglinpan/npkit/nccl/build/obj/misc/ibvwrap.o
Compiling  misc/gdrwrap.cc                     > /home/xinglinpan/npkit/nccl/build/obj/misc/gdrwrap.o
Compiling  misc/utils.cc                       > /home/xinglinpan/npkit/nccl/build/obj/misc/utils.o
Compiling  misc/argcheck.cc                    > /home/xinglinpan/npkit/nccl/build/obj/misc/argcheck.o
Compiling  misc/socket.cc                      > /home/xinglinpan/npkit/nccl/build/obj/misc/socket.o
Compiling  misc/shmutils.cc                    > /home/xinglinpan/npkit/nccl/build/obj/misc/shmutils.o
Compiling  misc/profiler.cc                    > /home/xinglinpan/npkit/nccl/build/obj/misc/profiler.o
Compiling  misc/param.cc                       > /home/xinglinpan/npkit/nccl/build/obj/misc/param.o
Compiling  misc/npkit.cc                       > /home/xinglinpan/npkit/nccl/build/obj/misc/npkit.o
In file included from misc/npkit.cc:6:0:
include/npkit/npkit.h:7:10: fatal error: hip/hip_runtime.h: No such file or directory
 #include <hip/hip_runtime.h>
          ^~~~~~~~~~~~~~~~~~~
compilation terminated.
Makefile:111: recipe for target '/home/xinglinpan/npkit/nccl/build/obj/misc/npkit.o' failed
make[1]: *** [/home/xinglinpan/npkit/nccl/build/obj/misc/npkit.o] Error 1
make[1]: Leaving directory '/home/xinglinpan/npkit/nccl/src'
Makefile:25: recipe for target 'src.build' failed
make: *** [src.build] Error 2


yzygitzh commented on July 26, 2024

You'll need to use the npkit-2.12.12 branch.


Fragile-azalea commented on July 26, 2024

Here are my new commands.

git checkout npkit-2.12.12
make clean
make src.build CUDA_HOME=/usr/local/cuda-10.2/ NVCC_GENCODE="-gencode=arch=compute_75,code=sm_75" NPKIT_FLAGS="-DENABLE_NPKIT -DENABLE_NPKIT_EVENT_TIME_SYNC_CPU -DENABLE_NPKIT_EVENT_TIME_SYNC_GPU -DENABLE_NPKIT_EVENT_PRIM_LL128_DATA_PROCESS_ENTRY -DENABLE_NPKIT_EVENT_PRIM_LL128_DATA_PROCESS_EXIT -DENABLE_NPKIT_EVENT_NET_SEND_ENTRY -DENABLE_NPKIT_EVENT_NET_SEND_EXIT -DENABLE_NPKIT_EVENT_NET_RECV_ENTRY -DENABLE_NPKIT_EVENT_NET_RECV_EXIT"

Here is my new log.

make -C src build BUILDDIR=/home/xinglinpan/npkit/nccl/build
make[1]: Entering directory '/home/xinglinpan/npkit/nccl/src'
Generating nccl.h.in                           > /home/xinglinpan/npkit/nccl/build/include/nccl.h
Grabbing   include/nccl_net.h                  > /home/xinglinpan/npkit/nccl/build/include/nccl_net.h
Compiling  init.cc                             > /home/xinglinpan/npkit/nccl/build/obj/init.o
Compiling  channel.cc                          > /home/xinglinpan/npkit/nccl/build/obj/channel.o
Compiling  bootstrap.cc                        > /home/xinglinpan/npkit/nccl/build/obj/bootstrap.o
Compiling  transport.cc                        > /home/xinglinpan/npkit/nccl/build/obj/transport.o
Compiling  enqueue.cc                          > /home/xinglinpan/npkit/nccl/build/obj/enqueue.o
Compiling  group.cc                            > /home/xinglinpan/npkit/nccl/build/obj/group.o
Compiling  debug.cc                            > /home/xinglinpan/npkit/nccl/build/obj/debug.o
Compiling  proxy.cc                            > /home/xinglinpan/npkit/nccl/build/obj/proxy.o
Compiling  enhcompat.cc                        > /home/xinglinpan/npkit/nccl/build/obj/enhcompat.o
Compiling  net.cc                              > /home/xinglinpan/npkit/nccl/build/obj/net.o
Compiling  misc/nvmlwrap.cc                    > /home/xinglinpan/npkit/nccl/build/obj/misc/nvmlwrap.o
Compiling  misc/ibvwrap.cc                     > /home/xinglinpan/npkit/nccl/build/obj/misc/ibvwrap.o
Compiling  misc/gdrwrap.cc                     > /home/xinglinpan/npkit/nccl/build/obj/misc/gdrwrap.o
Compiling  misc/utils.cc                       > /home/xinglinpan/npkit/nccl/build/obj/misc/utils.o
Compiling  misc/argcheck.cc                    > /home/xinglinpan/npkit/nccl/build/obj/misc/argcheck.o
Compiling  misc/socket.cc                      > /home/xinglinpan/npkit/nccl/build/obj/misc/socket.o
Compiling  misc/shmutils.cc                    > /home/xinglinpan/npkit/nccl/build/obj/misc/shmutils.o
Compiling  misc/profiler.cc                    > /home/xinglinpan/npkit/nccl/build/obj/misc/profiler.o
Compiling  misc/param.cc                       > /home/xinglinpan/npkit/nccl/build/obj/misc/param.o
Compiling  misc/npkit.cc                       > /home/xinglinpan/npkit/nccl/build/obj/misc/npkit.o
Compiling  transport/p2p.cc                    > /home/xinglinpan/npkit/nccl/build/obj/transport/p2p.o
Compiling  transport/shm.cc                    > /home/xinglinpan/npkit/nccl/build/obj/transport/shm.o
Compiling  transport/net.cc                    > /home/xinglinpan/npkit/nccl/build/obj/transport/net.o
Compiling  transport/net_socket.cc             > /home/xinglinpan/npkit/nccl/build/obj/transport/net_socket.o
Compiling  transport/net_ib.cc                 > /home/xinglinpan/npkit/nccl/build/obj/transport/net_ib.o
Compiling  transport/coll_net.cc               > /home/xinglinpan/npkit/nccl/build/obj/transport/coll_net.o
Compiling  collectives/sendrecv.cc             > /home/xinglinpan/npkit/nccl/build/obj/collectives/sendrecv.o
Compiling  collectives/all_reduce.cc           > /home/xinglinpan/npkit/nccl/build/obj/collectives/all_reduce.o
Compiling  collectives/all_gather.cc           > /home/xinglinpan/npkit/nccl/build/obj/collectives/all_gather.o
Compiling  collectives/broadcast.cc            > /home/xinglinpan/npkit/nccl/build/obj/collectives/broadcast.o
Compiling  collectives/reduce.cc               > /home/xinglinpan/npkit/nccl/build/obj/collectives/reduce.o
Compiling  collectives/reduce_scatter.cc       > /home/xinglinpan/npkit/nccl/build/obj/collectives/reduce_scatter.o
Compiling  graph/topo.cc                       > /home/xinglinpan/npkit/nccl/build/obj/graph/topo.o
Compiling  graph/paths.cc                      > /home/xinglinpan/npkit/nccl/build/obj/graph/paths.o
Compiling  graph/search.cc                     > /home/xinglinpan/npkit/nccl/build/obj/graph/search.o
Compiling  graph/connect.cc                    > /home/xinglinpan/npkit/nccl/build/obj/graph/connect.o
Compiling  graph/rings.cc                      > /home/xinglinpan/npkit/nccl/build/obj/graph/rings.o
Compiling  graph/trees.cc                      > /home/xinglinpan/npkit/nccl/build/obj/graph/trees.o
Compiling  graph/tuning.cc                     > /home/xinglinpan/npkit/nccl/build/obj/graph/tuning.o
Compiling  graph/xml.cc                        > /home/xinglinpan/npkit/nccl/build/obj/graph/xml.o
make[2]: Entering directory '/home/xinglinpan/npkit/nccl/src/collectives/device'
Generating rules                               > /home/xinglinpan/npkit/nccl/build/obj/collectives/device/Makefile.rules
make[2]: Warning: File '/home/xinglinpan/npkit/nccl/build/obj/collectives/device/Makefile.rules' has modification time 160 s in the future
Compiling  sendrecv.cu                         > /home/xinglinpan/npkit/nccl/build/obj/collectives/device/sendrecv_sum_i8.o
../../include/npkit/npkit_struct.h(8): error: "Bitfields and field types containing bitfields are not supported in packed structures and unions for device compilation!"

/home/xinglinpan/npkit/nccl/build/obj/collectives/device/Makefile.rules:2: recipe for target '/home/xinglinpan/npkit/nccl/build/obj/collectives/device/sendrecv_sum_i8.o' failed
make[2]: *** [/home/xinglinpan/npkit/nccl/build/obj/collectives/device/sendrecv_sum_i8.o] Error 2
make[2]: Leaving directory '/home/xinglinpan/npkit/nccl/src/collectives/device'
Makefile:50: recipe for target '/home/xinglinpan/npkit/nccl/build/obj/collectives/device/colldevice.a' failed
make[1]: *** [/home/xinglinpan/npkit/nccl/build/obj/collectives/device/colldevice.a] Error 2
make[1]: Leaving directory '/home/xinglinpan/npkit/nccl/src'
Makefile:25: recipe for target 'src.build' failed
make: *** [src.build] Error 2


Fragile-azalea commented on July 26, 2024

> Does this happen with the same command but without NPKit as well?

I rebuilt my code without NPKit.
The problem still seems to happen.
I built nccl-tests without NPKit

make MPI=1 MPI_HOME=/home/xinglinpan/mpi/openmpi-4.1.4  CUDA_HOME=/usr/local/cuda-10.2/ NCCL_HOME=/home/xinglinpan/nccl_2.10.3-1+cuda10.2_x86_64/

and ran it with

CUDA_HOME=/usr/local/cuda-10.2/ LD_LIBRARY_PATH=/home/xinglinpan/nccl_2.10.3-1+cuda10.2_x86_64/lib:/home/xinglinpan/mpi/openmpi-4.1.4/lib/  mpirun -np 4 -mca pml ob1 -mca btl ^openib -mca btl_tcp_if_exclude lo,docker0 -mca coll_hcoll_enable 0 -mca plm_rsh_no_tree_spawn 1 -mca plm_rsh_num_concurrent 8192 -x NCCL_UCX_TLS=rc_x,cuda_copy,cuda_ipc -x NCCL_UCX_RNDV_THRESH=0 -x NCCL_UCX_RNDV_SCHEME=get_zcopy -x UCX_RC_MLX5_TM_ENABLE=y -x NPKIT_DUMP_DIR=./ -x NCCL_PROTO=LL128 ./build/all_reduce_perf -b 8 -e 128M -f 2 -g 1

Here is my log.

# nThread 1 nGpus 1 minBytes 8 maxBytes 134217728 step: 2(factor) warmup iters: 5 iters: 20 validation: 1
#
# Using devices
#   Rank  0 Pid  26003 on       gpu9 device  0 [0x3d] NVIDIA GeForce RTX 2080 Ti
#   Rank  1 Pid  26004 on       gpu9 device  1 [0x3e] NVIDIA GeForce RTX 2080 Ti
#   Rank  2 Pid  26005 on       gpu9 device  2 [0xb1] NVIDIA GeForce RTX 2080 Ti
#   Rank  3 Pid  26006 on       gpu9 device  3 [0xb2] NVIDIA GeForce RTX 2080 Ti
#
#                                                       out-of-place                       in-place
#       size         count      type   redop     time   algbw   busbw  error     time   algbw   busbw  error
#        (B)    (elements)                       (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)
           8             2     float     sum    35.76    0.00    0.00  1e-07    35.73    0.00    0.00  0e+00
          16             4     float     sum    36.02    0.00    0.00  3e-08    34.24    0.00    0.00  3e-08
          32             8     float     sum    37.94    0.00    0.00  3e-08    34.84    0.00    0.00  3e-08
          64            16     float     sum    41.73    0.00    0.00  3e-08    38.56    0.00    0.00  3e-08
         128            32     float     sum    39.08    0.00    0.00  3e-08    33.29    0.00    0.01  3e-08
         256            64     float     sum    36.59    0.01    0.01  3e-08    42.12    0.01    0.01  3e-08
         512           128     float     sum    40.98    0.01    0.02  3e-08    41.01    0.01    0.02  1e-08
        1024           256     float     sum    50.67    0.02    0.03  1e-07    38.64    0.03    0.04  1e-07
        2048           512     float     sum    47.87    0.04    0.06  1e-07    51.49    0.04    0.06  1e-07
        4096          1024     float     sum    62.50    0.07    0.10  1e-07    60.32    0.07    0.10  1e-07
        8192          2048     float     sum    65.22    0.13    0.19  1e-07    108.4    0.08    0.11  1e-07
       16384          4096     float     sum    235.1    0.07    0.10  1e-07    100.7    0.16    0.24  1e-07
       32768          8192     float     sum    257.4    0.13    0.19  1e-07    128.1    0.26    0.38  1e-07
       65536         16384     float     sum    80.75    0.81    1.22  1e-07    85.43    0.77    1.15  1e-07
      131072         32768     float     sum    102.9    1.27    1.91  2e-07    111.9    1.17    1.76  2e-07
      262144         65536     float     sum    229.9    1.14    1.71  2e+00    238.7    1.10    1.65  2e-07
      524288        131072     float     sum    324.4    1.62    2.42  2e-07    277.4    1.89    2.83  2e-07
     1048576        262144     float     sum    485.0    2.16    3.24  2e-07    846.7    1.24    1.86  2e-07
     2097152        524288     float     sum    964.3    2.17    3.26  2e-07    918.3    2.28    3.43  2e-07
     4194304       1048576     float     sum   1822.6    2.30    3.45  2e-07   3010.2    1.39    2.09  2e-01
     8388608       2097152     float     sum   3542.5    2.37    3.55  2e-07   4195.9    2.00    3.00  2e-07
    16777216       4194304     float     sum   8426.8    1.99    2.99  2e-07   7970.6    2.10    3.16  1e-02
    33554432       8388608     float     sum    15913    2.11    3.16  3e-02    16281    2.06    3.09  1e+00
    67108864      16777216     float     sum    31764    2.11    3.17  4e-02    32472    2.07    3.10  2e-02
   134217728      33554432     float     sum    66614    2.01    3.02  1e+00    66821    2.01    3.01  3e+10
# Out of bounds values : 31 FAILED
# Avg bus bandwidth    : 1.29865
#
--------------------------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:

  Process name: [[11904,1],0]
  Exit code:    1
--------------------------------------------------------------------------

After removing NCCL_PROTO=LL128, I get the expected result.

# nThread 1 nGpus 1 minBytes 8 maxBytes 134217728 step: 2(factor) warmup iters: 5 iters: 20 validation: 1
#
# Using devices
#   Rank  0 Pid  34181 on       gpu9 device  0 [0x3d] NVIDIA GeForce RTX 2080 Ti
#   Rank  1 Pid  34182 on       gpu9 device  1 [0x3e] NVIDIA GeForce RTX 2080 Ti
#   Rank  2 Pid  34183 on       gpu9 device  2 [0xb1] NVIDIA GeForce RTX 2080 Ti
#   Rank  3 Pid  34184 on       gpu9 device  3 [0xb2] NVIDIA GeForce RTX 2080 Ti
#
#                                                       out-of-place                       in-place
#       size         count      type   redop     time   algbw   busbw  error     time   algbw   busbw  error
#        (B)    (elements)                       (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)
           8             2     float     sum    20.56    0.00    0.00  1e-07    133.3    0.00    0.00  0e+00
          16             4     float     sum    14.07    0.00    0.00  3e-08    15.91    0.00    0.00  3e-08
          32             8     float     sum    15.94    0.00    0.00  3e-08    22.74    0.00    0.00  3e-08
          64            16     float     sum    21.74    0.00    0.00  3e-08    19.91    0.00    0.00  3e-08
         128            32     float     sum    15.01    0.01    0.01  3e-08    19.08    0.01    0.01  3e-08
         256            64     float     sum    27.95    0.01    0.01  3e-08    39.70    0.01    0.01  3e-08
         512           128     float     sum    29.36    0.02    0.03  3e-08    22.62    0.02    0.03  1e-08
        1024           256     float     sum    21.65    0.05    0.07  1e-07    21.35    0.05    0.07  1e-07
        2048           512     float     sum    21.99    0.09    0.14  1e-07    27.51    0.07    0.11  1e-07
        4096          1024     float     sum    34.00    0.12    0.18  2e-07    26.85    0.15    0.23  2e-07
        8192          2048     float     sum    21.81    0.38    0.56  2e-07    38.44    0.21    0.32  2e-07
       16384          4096     float     sum    42.64    0.38    0.58  2e-07    40.77    0.40    0.60  2e-07
       32768          8192     float     sum    68.61    0.48    0.72  2e-07    55.08    0.59    0.89  2e-07
       65536         16384     float     sum    77.09    0.85    1.28  2e-07    66.30    0.99    1.48  2e-07
      131072         32768     float     sum    122.9    1.07    1.60  2e-07    98.81    1.33    1.99  2e-07
      262144         65536     float     sum    162.2    1.62    2.42  2e-07    155.7    1.68    2.53  2e-07
      524288        131072     float     sum    237.9    2.20    3.31  2e-07    197.8    2.65    3.98  2e-07
     1048576        262144     float     sum    374.1    2.80    4.20  2e-07    378.5    2.77    4.16  2e-07
     2097152        524288     float     sum    723.4    2.90    4.35  2e-07    737.7    2.84    4.26  2e-07
     4194304       1048576     float     sum   1420.6    2.95    4.43  2e-07   2185.8    1.92    2.88  2e-07
     8388608       2097152     float     sum   4362.8    1.92    2.88  2e-07   2931.3    2.86    4.29  2e-07
    16777216       4194304     float     sum   6279.3    2.67    4.01  2e-07   6639.8    2.53    3.79  2e-07
    33554432       8388608     float     sum    12984    2.58    3.88  2e-07    12806    2.62    3.93  2e-07
    67108864      16777216     float     sum    25783    2.60    3.90  2e-07    25296    2.65    3.98  2e-07
   134217728      33554432     float     sum    48364    2.78    4.16  2e-07    47972    2.80    4.20  2e-07
# Out of bounds values : 0 OK
# Avg bus bandwidth    : 1.72964
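
For reading these tables, it helps to know how the bandwidth columns follow from size and time. As documented for nccl-tests, algbw is the message size divided by the time, and for all-reduce busbw is algbw scaled by 2(n-1)/n for n ranks. A short sketch (allreduce_bandwidth is a hypothetical helper name) that reproduces the last out-of-place row of the table above:

```python
def allreduce_bandwidth(size_bytes, time_us, n_ranks):
    """Return (algbw, busbw) in GB/s for an all-reduce of size_bytes."""
    algbw = size_bytes / (time_us * 1e-6) / 1e9       # bytes per second, scaled to GB/s
    busbw = algbw * 2 * (n_ranks - 1) / n_ranks       # all-reduce correction factor
    return algbw, busbw

# Last row above: 134217728 bytes in 48364 us across 4 ranks.
algbw, busbw = allreduce_bandwidth(134217728, 48364, 4)
print(f"{algbw:.2f} {busbw:.2f}")  # 2.78 4.16, matching the table
```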


yzygitzh commented on July 26, 2024

It looks like some issue in the low-level software/hardware stack is causing the LL128 errors. You could try upgrading the driver and CUDA (e.g. to 11.8) and running again.
BTW, you can also trace other events instead of the LL128 ones, since NCCL doesn't use LL128 by default. For example, you can try the Simple (-DENABLE_NPKIT_EVENT_PRIM_SIMPLE_REDUCE_OR_COPY_MULTI_ENTRY -DENABLE_NPKIT_EVENT_PRIM_SIMPLE_REDUCE_OR_COPY_MULTI_EXIT) or LL (-DENABLE_NPKIT_EVENT_PRIM_LL_DATA_PROCESS_ENTRY -DENABLE_NPKIT_EVENT_PRIM_LL_DATA_PROCESS_EXIT) protocols.
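
Since these flag lists get long, one option is to generate the NPKIT_FLAGS value from a list of event names. This is only an illustrative sketch: npkit_flags is a hypothetical helper, and the event names are the ones quoted in this thread.

```python
def npkit_flags(event_names):
    """Build the NPKIT_FLAGS value for make from a list of event names."""
    defines = ["-DENABLE_NPKIT"]
    defines += [f"-DENABLE_NPKIT_EVENT_{name}" for name in event_names]
    return " ".join(defines)

events = [
    "TIME_SYNC_CPU",
    "TIME_SYNC_GPU",
    "PRIM_SIMPLE_REDUCE_OR_COPY_MULTI_ENTRY",
    "PRIM_SIMPLE_REDUCE_OR_COPY_MULTI_EXIT",
    "PRIM_LL_DATA_PROCESS_ENTRY",
    "PRIM_LL_DATA_PROCESS_EXIT",
]
print(f'NPKIT_FLAGS="{npkit_flags(events)}"')
```

The printed string can be pasted into the make src.build command in place of the LL128 flag list.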


Fragile-azalea commented on July 26, 2024

I set NCCL_PROTO=LL,Simple (the default) and added both the Simple and LL event flags to NPKIT_FLAGS. There seems to be no problem now! Thank you! Here is my log.
image

