
npkit's Issues

Unable to generate GPU traces for MSCCL

I have 8 machines, each with a single GPU. When following the build instructions for NCCL, I get traces for both CPU and GPU events, but after following the steps for MSCCL I only get traces for CPU events. Below are the steps I took to try to get GPU traces with MSCCL.

git clone https://github.com/microsoft/NPKit.git
cd NPKit
git clone https://github.com/microsoft/msccl msccl-master-e52c525
cd msccl-master-e52c525
git checkout e52c525
find ../npkit_for_msccl_master_e52c525/ | grep '.diff$' | awk '{print "git apply "$1}' | bash
make -j src.build NVCC_GENCODE="-arch=sm_80" NPKIT_FLAGS="-DENABLE_NPKIT -DENABLE_NPKIT_EVENT_TIME_SYNC_CPU -DENABLE_NPKIT_EVENT_TIME_SYNC_GPU -DENABLE_NPKIT_EVENT_PRIM_LL_DATA_PROCESS_ENTRY -DENABLE_NPKIT_EVENT_PRIM_LL_DATA_PROCESS_EXIT -DENABLE_NPKIT_EVENT_NET_SEND_ENTRY -DENABLE_NPKIT_EVENT_NET_SEND_EXIT -DENABLE_NPKIT_EVENT_NET_RECV_ENTRY -DENABLE_NPKIT_EVENT_NET_RECV_EXIT"
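(A side note on the patch step above: a patch that no longer applies to this commit can leave a tree that still builds but records nothing. A minimal sketch, run from inside msccl-master-e52c525 before applying, that dry-runs each NPKit .diff with "git apply --check":)

# Sketch: dry-run every NPKit patch so a version mismatch is reported instead of silently skipped.
import pathlib, subprocess

patch_dir = pathlib.Path("../npkit_for_msccl_master_e52c525")   # patch directory from the clone above
for diff in sorted(patch_dir.rglob("*.diff")):
    result = subprocess.run(["git", "apply", "--check", str(diff)],
                            capture_output=True, text=True)
    print(diff.name, "ok" if result.returncode == 0 else "FAILED: " + result.stderr.strip())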

cd ..
git clone https://github.com/NVIDIA/nccl-tests.git
cd nccl-tests/
make MPI=1 MPI_HOME=/usr/local/openmpi/ NCCL_HOME=/home/jasonfantl/NPKit/MSCCL/NPKit/msccl-master-e52c525/build -j
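Since the failure mode here is that only CPU events show up, one thing worth ruling out is the test binary silently resolving a system libnccl instead of the NPKit-patched build (the LD_PRELOAD used in the run command below overrides this at run time, but it is still worth confirming). A minimal sketch using ldd, with the binary path from the build step above:

# Sketch: show which libnccl the test binary resolves to by default.
import subprocess

binary = "/home/jasonfantl/NPKit/MSCCL/NPKit/nccl-tests/build/all_reduce_perf"
for line in subprocess.run(["ldd", binary], capture_output=True, text=True).stdout.splitlines():
    if "libnccl" in line:
        print(line.strip())  # ideally points into msccl-master-e52c525/build/lib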

cd ..
mkdir dump_files
mkdir trace_files

# root directory copied to all machines

mpirun -H navi,quiritis,saria,oshun,midna,parvati,rhiannon,tara rm -f /home/jasonfantl/NPKit/MSCCL/NPKit/dump_files/* && \
mpirun \
    --tag-output \
    -H navi,quiritis,saria,oshun,midna,parvati,rhiannon,tara \
    -x PATH \
    -x LD_PRELOAD=/home/jasonfantl/NPKit/MSCCL/NPKit/msccl-master-e52c525/build/lib/libnccl.so.2 \
    -x LD_LIBRARY_PATH=/home/jasonfantl/NPKit/MSCCL/NPKit/msccl-master-e52c525/build:/usr/local/openmpi/lib:/usr/local/cuda/lib64:/usr/local/openmpi/lib:$LD_LIBRARY_PATH  \
    -x NCCL_P2P_DISABLE=1 \
    -x NCCL_SHM_DISABLE=1 \
    -x NCCL_SOCKET_IFNAME=wan0 \
    -x NCCL_NET=IB \
    -x NCCL_IB_GID_INDEX=3 \
    -x NCCL_IB_HCA=mlx5 \
    -x NCCL_NET_GDR_LEVEL=SYS  \
    -x NCCL_ALGO=MSCCL \
    -x NCCL_PROTO=LL \
    -x NPKIT_DUMP_DIR=/home/jasonfantl/NPKit/MSCCL/NPKit/dump_files \
    -x MSCCL_XML_FILES=/home/jasonfantl/NPKit/MSCCL/NPKit/msccl_samples/msccl_algo_sample.xml \
    /home/jasonfantl/NPKit/MSCCL/NPKit/nccl-tests/build/all_reduce_perf -b 1048576 -e 1048576 -f 2 -g 1 -c 1 -n 100 -w 100 -z 0
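Before post-processing, it may help to confirm that each host actually wrote GPU event dumps into NPKIT_DUMP_DIR (the directory is local to each machine, so this would need to run on every host). A minimal sketch, assuming the dumps follow the cpu_events_*/gpu_events_* naming that npkit_post_process.py parses; adjust the prefix if your copy differs:

# Sketch: list NPKit dump files and flag GPU dumps that are missing or empty.
import os

dump_dir = "/home/jasonfantl/NPKit/MSCCL/NPKit/dump_files"
names = sorted(os.listdir(dump_dir))
gpu_files = [n for n in names if "gpu_events" in n]
print(f"{len(names)} dump files, {len(gpu_files)} GPU event dumps")
for name in gpu_files:
    size = os.path.getsize(os.path.join(dump_dir, name))
    print(name, size, "bytes", "(EMPTY)" if size == 0 else "")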

python /home/jasonfantl/NPKit/MSCCL/NPKit/msccl-master-e52c525/samples/npkit/npkit_post_process.py \
  --npkit_dump_dir=/home/jasonfantl/NPKit/MSCCL/NPKit/dump_files \
  --npkit_event_header_path=/home/jasonfantl/NPKit/MSCCL/NPKit/msccl-master-e52c525/src/include/npkit/npkit_event.h \
  --output_dir=/home/jasonfantl/NPKit/MSCCL/NPKit/trace_files

A potentially useful note: while trying different settings, I noticed that with NCCL_ALGO=RING, NCCL_PROTO=LL (built with -DENABLE_NPKIT_EVENT_PRIM_LL_DATA_PROCESS_ENTRY -DENABLE_NPKIT_EVENT_PRIM_LL_DATA_PROCESS_EXIT) does not produce GPU traces, but NCCL_PROTO=LL128 (built with -DENABLE_NPKIT_EVENT_PRIM_LL128_DATA_PROCESS_ENTRY -DENABLE_NPKIT_EVENT_PRIM_LL128_DATA_PROCESS_EXIT) does. I also believe there is a typo in npkit_post_process.py line 77: curr_cpu_base_time needs to be replaced with curr_gpu_base_time for parsing to work.

The current MSCCL example npkit_runner.sh uses NPKIT=1 as the build flag, which does not seem to enable any traces at all. I saw that the MSCCL example had recently used -DENABLE_NPKIT_EVENT_MSCCL_REDUCE_ENTRY -DENABLE_NPKIT_EVENT_MSCCL_REDUCE_EXIT, which also did not produce traces.

What are the correct build flags to generate GPU traces with MSCCL?

How to set the environment variables for all_to_all_perf profiling?

Hi! Thank you for developing this helpful profiling tool.

I would like to profile alltoall communication in detail. However, I noticed that the default environment variable settings are configured for allreduce. I have experimented with several settings, and so far only "-DENABLE_NPKIT_EVENT_NET_SEND_ENTRY xxx" has produced useful profiling results.

Could you please provide the appropriate variable settings for alltoall and send-receive profiling on multiple nodes? The test binary is the official nccl-tests alltoall_perf.

Thank you for your assistance!
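A way to see which of the enabled events actually fire for alltoall is to count event names in the trace produced by npkit_post_process.py; a minimal sketch, assuming the output is a Chrome-trace-format JSON (the trace_files/*.json glob is a placeholder; point it at your own --output_dir):

# Sketch: count how many times each NPKit event name appears in the generated trace(s).
import collections, glob, json

counts = collections.Counter()
for path in glob.glob("trace_files/*.json"):
    with open(path) as f:
        data = json.load(f)
    events = data if isinstance(data, list) else data.get("traceEvents", [])
    for event in events:
        counts[event.get("name", "?")] += 1
for name, count in counts.most_common():
    print(f"{count:8d}  {name}")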

Question about the misalignment of the generated files

I used NPKit to generate profiling files on two machines, and the timestamps of the files from the two machines do not seem to be aligned.
Here is an example: Process 3 and Process 4 are on different machines, and their trace timelines do not line up.
[Screenshot: trace view showing the Process 3 and Process 4 timelines starting at different offsets]

I built with NPKIT_FLAGS="-DENABLE_NPKIT -DENABLE_NPKIT_EVENT_TIME_SYNC_CPU -DENABLE_NPKIT_EVENT_TIME_SYNC_GPU -DENABLE_NPKIT_EVENT_PRIM_SIMPLE_REDUCE_OR_COPY_MULTI_ENTRY -DENABLE_NPKIT_EVENT_PRIM_SIMPLE_REDUCE_OR_COPY_MULTI_EXIT -DENABLE_NPKIT_EVENT_PRIM_LL_DATA_PROCESS_ENTRY -DENABLE_NPKIT_EVENT_PRIM_LL_DATA_PROCESS_EXIT -DENABLE_NPKIT_EVENT_NET_SEND_ENTRY -DENABLE_NPKIT_EVENT_NET_SEND_EXIT -DENABLE_NPKIT_EVENT_NET_RECV_ENTRY -DENABLE_NPKIT_EVENT_NET_RECV_EXIT" and ran the command below.

CUDA_HOME=/usr/local/cuda-10.2/ LD_LIBRARY_PATH=/home/xinglinpan/npkit/nccl-v2.10.3-1/build/lib/:/home/xinglinpan/mpi/openmpi-4.1.4/lib/ mpirun --prefix /home/xinglinpan/mpi/openmpi-4.1.4/ -np 8 -x NPKIT_DUMP_DIR=./ -x LD_LIBRARY_PATH=/home/xinglinpan/npkit/nccl-v2.10.3-1/build/lib/:/home/xinglinpan/mpi/openmpi-4.1.4/lib/ -x NCCL_DEBUG=TRACE -x NCCL_DEBUG_SUBSYS=GRAPH -H gpu9:4,gpu10:4 ./build/alltoall_perf -b 64M -e 64M -f 2 -g 1 -n 1 -w 0
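One way to quantify the offset is to compare, per process, the earliest timestamp that made it into the generated trace. A minimal sketch, assuming the post-processed output is a Chrome-trace JSON where 'pid' identifies the process and 'ts' is the event timestamp; the file name below is a placeholder. (NPKit's TIME_SYNC events align each GPU to its host CPU clock; whether the post-processor attempts any cross-host alignment is worth checking in npkit_post_process.py.)

# Sketch: print the earliest event timestamp recorded for each pid in the trace.
import json

with open("npkit_event_trace.json") as f:   # placeholder name; use your generated trace file
    data = json.load(f)
events = data if isinstance(data, list) else data.get("traceEvents", [])

first_ts = {}
for e in events:
    pid, ts = e.get("pid"), e.get("ts")
    if pid is None or ts is None:
        continue
    first_ts[pid] = min(ts, first_ts.get(pid, float("inf")))
for pid in sorted(first_ts):
    print(f"pid {pid}: first event at ts = {first_ts[pid]}")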

By the way, the events seem to be divided into 6 parts. When I set -n 0, 4 parts remain (i.e., only initialization completes), so is the AlltoAll itself covered by the events in the last 2 parts?

Empty trace file

Describe the bug
Hi authors, I got an empty trace file when following the usage example, and I am confused about it. My steps are below. No dump file yields a parsed_gpu_event that satisfies the condition at https://github.com/microsoft/NPKit/blob/main/npkit_for_nccl_v2.10.3-1/samples/npkit/npkit_post_process.py.diff#L79

To Reproduce
Steps to reproduce the behavior:

$ git clone https://github.com/nvidia/nccl nccl-v2.10.3-1
$ cd nccl-v2.10.3-1
$ git checkout 7e51592
$ find ../npkit_for_nccl_v2.10.3-1/ | grep '.diff$' | awk '{print "git apply "$1}' | bash
$ make -j src.build NPKIT_FLAGS="-DENABLE_NPKIT -DENABLE_NPKIT_EVENT_TIME_SYNC_CPU -DENABLE_NPKIT_EVENT_TIME_SYNC_GPU -DENABLE_NPKIT_EVENT_PRIM_LL128_DATA_PROCESS_ENTRY -DENABLE_NPKIT_EVENT_PRIM_LL128_DATA_PROCESS_EXIT -DENABLE_NPKIT_EVENT_NET_SEND_ENTRY -DENABLE_NPKIT_EVENT_NET_SEND_EXIT -DENABLE_NPKIT_EVENT_NET_RECV_ENTRY -DENABLE_NPKIT_EVENT_NET_RECV_EXIT"
$ cd samples
$ git clone https://github.com/NVIDIA/nccl-tests.git
$ cd nccl-tests
$ make MPI=1 MPI_HOME=/home/xinglinpan/mpi/openmpi-4.1.4/ CUDA_HOME=/usr/local/cuda-10.2/ NCCL_HOME=/home/xinglinpan/npkit/npkit_result/npkit_src/
$ CUDA_HOME=/usr/local/cuda-10.2/ LD_LIBRARY_PATH=/home/xinglinpan/npkit/nccl-v2.10.3-1/build/lib:/home/xinglinpan/mpi/openmpi-4.1.4/lib/  mpirun -np 4 -mca pml ob1 -mca btl ^openib -mca btl_tcp_if_exclude lo,docker0 -mca coll_hcoll_enable 0 -mca plm_rsh_no_tree_spawn 1 -mca plm_rsh_num_concurrent 8192 -x NCCL_UCX_TLS=rc_x,cuda_copy,cuda_ipc -x NCCL_UCX_RNDV_THRESH=0 -x NCCL_UCX_RNDV_SCHEME=get_zcopy -x UCX_RC_MLX5_TM_ENABLE=y -x NPKIT_DUMP_DIR=./  ./all_reduce_perf -b 8 -e 128M -f 2 -g 1
$ python npkit_post_process.py --npkit_dump_dir=/home/xinglinpan/nccl-tests/build --npkit_event_header_path=/home/xinglinpan/npkit/nccl-v2.10.3-1/src/include/npkit/npkit_event.h --output_dir=./
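One thing worth ruling out in the steps above is a mismatch between the directory NPKIT_DUMP_DIR pointed at during the run (./ resolves relative to the run's working directory) and the --npkit_dump_dir later passed to the post-processor. A minimal sketch that reports which candidate directory actually holds NPKit dumps; the first path is only a guess at where ./ resolved, so substitute the real working directory:

# Sketch: check which candidate directory contains the NPKit dump files.
import os

candidates = [
    "/home/xinglinpan/npkit/nccl-v2.10.3-1/samples/nccl-tests",  # guess at where NPKIT_DUMP_DIR=./ resolved
    "/home/xinglinpan/nccl-tests/build",                         # the --npkit_dump_dir value used above
]
for d in candidates:
    if not os.path.isdir(d):
        print(d, "-> does not exist")
        continue
    dumps = [n for n in os.listdir(d) if "cpu_events" in n or "gpu_events" in n]
    print(d, f"-> {len(dumps)} NPKit dump files")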

Logs

# nThread 1 nGpus 1 minBytes 8 maxBytes 134217728 step: 2(factor) warmup iters: 5 iters: 20 validation: 1
#
# Using devices
#   Rank  0 Pid   3229 on       gpu9 device  0 [0x3d] NVIDIA GeForce RTX 2080 Ti
#   Rank  1 Pid   3230 on       gpu9 device  1 [0x3e] NVIDIA GeForce RTX 2080 Ti
#   Rank  2 Pid   3231 on       gpu9 device  2 [0xb1] NVIDIA GeForce RTX 2080 Ti
#   Rank  3 Pid   3232 on       gpu9 device  3 [0xb2] NVIDIA GeForce RTX 2080 Ti
#
#                                                       out-of-place                       in-place
#       size         count      type   redop     time   algbw   busbw  error     time   algbw   busbw  error
#        (B)    (elements)                       (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)
           8             2     float     sum    18.34    0.00    0.00  1e-07    56.46    0.00    0.00  0e+00
          16             4     float     sum    20.02    0.00    0.00  3e-08    24.77    0.00    0.00  3e-08
          32             8     float     sum    21.58    0.00    0.00  3e-08    23.57    0.00    0.00  3e-08
          64            16     float     sum    28.89    0.00    0.00  3e-08    25.10    0.00    0.00  3e-08
         128            32     float     sum    18.51    0.01    0.01  3e-08    17.26    0.01    0.01  3e-08
         256            64     float     sum    17.96    0.01    0.02  3e-08    18.64    0.01    0.02  3e-08
         512           128     float     sum    16.71    0.03    0.05  3e-08    20.96    0.02    0.04  1e-08
        1024           256     float     sum    27.02    0.04    0.06  1e-07    27.71    0.04    0.06  1e-07
        2048           512     float     sum    28.58    0.07    0.11  1e-07    24.29    0.08    0.13  1e-07
        4096          1024     float     sum    27.17    0.15    0.23  2e-07    30.42    0.13    0.20  2e-07
        8192          2048     float     sum    30.83    0.27    0.40  2e-07    36.36    0.23    0.34  2e-07
       16384          4096     float     sum    30.71    0.53    0.80  2e-07    40.37    0.41    0.61  2e-07
       32768          8192     float     sum    268.6    0.12    0.18  2e-07    43.18    0.76    1.14  2e-07
       65536         16384     float     sum    58.56    1.12    1.68  2e-07    64.99    1.01    1.51  2e-07
      131072         32768     float     sum    102.0    1.28    1.93  2e-07    105.5    1.24    1.86  2e-07
      262144         65536     float     sum    167.2    1.57    2.35  2e-07    178.3    1.47    2.21  2e-07
      524288        131072     float     sum    276.3    1.90    2.85  2e-07    239.1    2.19    3.29  2e-07
     1048576        262144     float     sum    360.9    2.91    4.36  2e-07    358.3    2.93    4.39  2e-07
     2097152        524288     float     sum    726.3    2.89    4.33  2e-07    729.2    2.88    4.31  2e-07
     4194304       1048576     float     sum   1442.7    2.91    4.36  2e-07   2084.9    2.01    3.02  2e-07
     8388608       2097152     float     sum   4161.9    2.02    3.02  2e-07   2828.9    2.97    4.45  2e-07
    16777216       4194304     float     sum   6272.6    2.67    4.01  2e-07   6720.3    2.50    3.74  2e-07
    33554432       8388608     float     sum    13248    2.53    3.80  2e-07    13023    2.58    3.86  2e-07
    67108864      16777216     float     sum    26097    2.57    3.86  2e-07    25792    2.60    3.90  2e-07
   134217728      33554432     float     sum    49738    2.70    4.05  2e-07    49205    2.73    4.09  2e-07
# Out of bounds values : 0 OK
# Avg bus bandwidth    : 1.71279

Platform

  • Device: GeForce RTX 2080Ti * 4
  • OS: Linux gpu9 4.4.0-142-generic #168-Ubuntu SMP Wed Jan 16 21:00:45 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux
  • CUDA version: 10.2
  • NCCL version: 2.7.8-1
  • PyTorch version: 1.9.1
  • Python Version: 3.8

Some questions regarding the time scale in npkit_trace_generator.py

Hello! I have some (possibly silly) questions about the way timestamps are post-processed in npkit_trace_generator.py:

The script sets trace['displayTimeUnit'] = 'ns', so the time unit displayed in the trace should be nanoseconds. To my understanding, parsed_cpu_event['timestamp'] (used as 'ts': parsed_cpu_event['timestamp'] / cpu_clock_scale) is a number of CPU clock counts, and most CPU clocks run at 1e9 Hz, which also matches my experiments: my dump files report cpu_clock_period_num as 1 and cpu_clock_period_den as 1e9. So, if the time unit displayed in the trace is supposed to be nanoseconds, should the 1e6 in "return den / num / 1e6" be changed to 1e9? I think the 1e6 in "return float(freq_in_khz) * 1e3 / 1e6" should also be changed to 1e9. Is there anything wrong with my understanding of the time scale? Thank you in advance for your help!

Thanks a lot again!
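For reference, here is the arithmetic from the question spelled out with the reported values (num = 1, den = 1e9). With those values the /1e6 scaling yields microseconds, and if I read the Chrome trace-event format correctly, the 'ts' field is expected in microseconds regardless of displayTimeUnit, which only controls the granularity the viewer displays; a minimal sketch:

# Sketch: the CPU clock scale with cpu_clock_period_num = 1 and cpu_clock_period_den = 1e9,
# i.e. a counter with 1 ns resolution.
num, den = 1, 1e9

cpu_clock_scale = den / num / 1e6          # = 1000.0 counts per microsecond
timestamp_counts = 5_000_000               # example: 5e6 counts = 5 ms of wall time
ts = timestamp_counts / cpu_clock_scale    # = 5000.0, i.e. 'ts' expressed in microseconds
print(cpu_clock_scale, ts)

# The GPU path is analogous: float(freq_in_khz) * 1e3 / 1e6 is again counts per microsecond.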

What does the index in the tracing result mean?

Hi, I generated some tracing results, but I am confused by the indices in the plot. What do the numbers (indices) 1000-1008 and 2000-2015 mean? To my understanding, the 1000+ and 2000+ ranges represent different channels. Is that right? And what does each individual index (1000, 1001, ..., 1008, 2000, 2001, ..., 2015) represent? Thanks a lot!
[Screenshot: trace view with rows indexed 1000-1008 and 2000-2015]
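One way to work out the mapping without guessing is to list, for each numeric row in the viewer, which event names are recorded under it. A minimal sketch, assuming the generated file is a Chrome-trace JSON and the row numbers correspond to its 'pid'/'tid' fields (the file name is a placeholder):

# Sketch: group event names by (pid, tid) to see what each numeric row contains.
import collections, json

with open("npkit_event_trace.json") as f:   # placeholder name; use your generated trace file
    data = json.load(f)
events = data if isinstance(data, list) else data.get("traceEvents", [])

rows = collections.defaultdict(set)
for e in events:
    rows[(e.get("pid", -1), e.get("tid", -1))].add(e.get("name", "?"))
for (pid, tid), names in sorted(rows.items()):
    sample = sorted(names)[:5]
    print(f"pid={pid} tid={tid}: {sample}{' ...' if len(names) > 5 else ''}")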
