
npkit's Issues

Unable to generate GPU traces for MSCCL

I have 8 machines, each with a single GPU. When following the build instructions for NCCL, I get traces for both CPU and GPU events, but after following the steps for MSCCL I only get traces for CPU events. Below are the steps I took to try to get GPU traces with MSCCL.

git clone https://github.com/microsoft/NPKit.git
cd NPKit
git clone https://github.com/microsoft/msccl msccl-master-e52c525
cd msccl-master-e52c525
git checkout e52c525
find ../npkit_for_msccl_master_e52c525/ | grep '.diff$' | awk '{print "git apply "$1}' | bash
make -j src.build NVCC_GENCODE="-arch=sm_80" NPKIT_FLAGS="-DENABLE_NPKIT -DENABLE_NPKIT_EVENT_TIME_SYNC_CPU -DENABLE_NPKIT_EVENT_TIME_SYNC_GPU -DENABLE_NPKIT_EVENT_PRIM_LL_DATA_PROCESS_ENTRY -DENABLE_NPKIT_EVENT_PRIM_LL_DATA_PROCESS_EXIT -DENABLE_NPKIT_EVENT_NET_SEND_ENTRY -DENABLE_NPKIT_EVENT_NET_SEND_EXIT -DENABLE_NPKIT_EVENT_NET_RECV_ENTRY -DENABLE_NPKIT_EVENT_NET_RECV_EXIT"
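(A side note on the patch step above: a patch that no longer applies to this commit can leave a tree that still builds but records nothing. A minimal sketch, run from inside msccl-master-e52c525 before applying, that dry-runs each NPKit .diff with "git apply --check":)

# Sketch: dry-run every NPKit patch so a version mismatch is reported instead of silently skipped.
import pathlib, subprocess

patch_dir = pathlib.Path("../npkit_for_msccl_master_e52c525")   # patch directory from the clone above
for diff in sorted(patch_dir.rglob("*.diff")):
    result = subprocess.run(["git", "apply", "--check", str(diff)],
                            capture_output=True, text=True)
    print(diff.name, "ok" if result.returncode == 0 else "FAILED: " + result.stderr.strip())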

cd ..
git clone https://github.com/NVIDIA/nccl-tests.git
cd nccl-tests/
make MPI=1 MPI_HOME=/usr/local/openmpi/ NCCL_HOME=/home/jasonfantl/NPKit/MSCCL/NPKit/msccl-master-e52c525/build -j
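Since the failure mode here is that only CPU events show up, one thing worth ruling out is the test binary silently resolving a system libnccl instead of the NPKit-patched build (the LD_PRELOAD used in the run command below overrides this at run time, but it is still worth confirming). A minimal sketch using ldd, with the binary path from the build step above:

# Sketch: show which libnccl the test binary resolves to by default.
import subprocess

binary = "/home/jasonfantl/NPKit/MSCCL/NPKit/nccl-tests/build/all_reduce_perf"
for line in subprocess.run(["ldd", binary], capture_output=True, text=True).stdout.splitlines():
    if "libnccl" in line:
        print(line.strip())  # ideally points into msccl-master-e52c525/build/lib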

cd ..
mkdir dump_files
mkdir trace_files

# root directory copied to all machines

mpirun -H navi,quiritis,saria,oshun,midna,parvati,rhiannon,tara rm -f /home/jasonfantl/NPKit/MSCCL/NPKit/dump_files/* && \
mpirun \
    --tag-output \
    -H navi,quiritis,saria,oshun,midna,parvati,rhiannon,tara \
    -x PATH \
    -x LD_PRELOAD=/home/jasonfantl/NPKit/MSCCL/NPKit/msccl-master-e52c525/build/lib/libnccl.so.2 \
    -x LD_LIBRARY_PATH=/home/jasonfantl/NPKit/MSCCL/NPKit/msccl-master-e52c525/build:/usr/local/openmpi/lib:/usr/local/cuda/lib64:/usr/local/openmpi/lib:$LD_LIBRARY_PATH  \
    -x NCCL_P2P_DISABLE=1 \
    -x NCCL_SHM_DISABLE=1 \
    -x NCCL_SOCKET_IFNAME=wan0 \
    -x NCCL_NET=IB \
    -x NCCL_IB_GID_INDEX=3 \
    -x NCCL_IB_HCA=mlx5 \
    -x NCCL_NET_GDR_LEVEL=SYS  \
    -x NCCL_ALGO=MSCCL \
    -x NCCL_PROTO=LL \
    -x NPKIT_DUMP_DIR=/home/jasonfantl/NPKit/MSCCL/NPKit/dump_files \
    -x MSCCL_XML_FILES=/home/jasonfantl/NPKit/MSCCL/NPKit/msccl_samples/msccl_algo_sample.xml \
    /home/jasonfantl/NPKit/MSCCL/NPKit/nccl-tests/build/all_reduce_perf -b 1048576 -e 1048576 -f 2 -g 1 -c 1 -n 100 -w 100 -z 0
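Before post-processing, it may help to confirm that each host actually wrote GPU event dumps into NPKIT_DUMP_DIR (the directory is local to each machine, so this would need to run on every host). A minimal sketch, assuming the dumps follow the cpu_events_*/gpu_events_* naming that npkit_post_process.py parses; adjust the prefix if your copy differs:

# Sketch: list NPKit dump files and flag GPU dumps that are missing or empty.
import os

dump_dir = "/home/jasonfantl/NPKit/MSCCL/NPKit/dump_files"
names = sorted(os.listdir(dump_dir))
gpu_files = [n for n in names if "gpu_events" in n]
print(f"{len(names)} dump files, {len(gpu_files)} GPU event dumps")
for name in gpu_files:
    size = os.path.getsize(os.path.join(dump_dir, name))
    print(name, size, "bytes", "(EMPTY)" if size == 0 else "")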

python /home/jasonfantl/NPKit/MSCCL/NPKit/msccl-master-e52c525/samples/npkit/npkit_post_process.py \
  --npkit_dump_dir=/home/jasonfantl/NPKit/MSCCL/NPKit/dump_files \
  --npkit_event_header_path=/home/jasonfantl/NPKit/MSCCL/NPKit/msccl-master-e52c525/src/include/npkit/npkit_event.h \
  --output_dir=/home/jasonfantl/NPKit/MSCCL/NPKit/trace_files

A potentially useful note: while trying different settings, I noticed that with NCCL_ALGO=RING, NCCL_PROTO=LL (built with -DENABLE_NPKIT_EVENT_PRIM_LL_DATA_PROCESS_ENTRY -DENABLE_NPKIT_EVENT_PRIM_LL_DATA_PROCESS_EXIT) does not produce GPU traces, but NCCL_PROTO=LL128 (built with -DENABLE_NPKIT_EVENT_PRIM_LL128_DATA_PROCESS_ENTRY -DENABLE_NPKIT_EVENT_PRIM_LL128_DATA_PROCESS_EXIT) does. I also believe there is a typo in npkit_post_process.py line 77: curr_cpu_base_time needs to be replaced with curr_gpu_base_time for parsing to work.

The current MSCCL example npkit_runner.sh uses NPKIT=1 as the build flag, which does not seem to enable any traces at all. I saw that the MSCCL example had recently used -DENABLE_NPKIT_EVENT_MSCCL_REDUCE_ENTRY -DENABLE_NPKIT_EVENT_MSCCL_REDUCE_EXIT, which also did not produce traces.

What are the correct build flags to generate GPU traces with MSCCL?

How to set the environment variables for all_to_all_perf profiling?

Hi! Thank you for developing this helpful profiling tool.

I would like to profile alltoall communication in detail. However, I noticed that the default environment variable settings are configured for allreduce. I have experimented with several settings, and so far only "-DENABLE_NPKIT_EVENT_NET_SEND_ENTRY xxx" has produced useful profiling results.

Could you please provide the appropriate variable settings for alltoall and send-receive profiling on multiple nodes? The test binary is the official nccl-tests alltoall_perf.

Thank you for your assistance!
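A way to see which of the enabled events actually fire for alltoall is to count event names in the trace produced by npkit_post_process.py; a minimal sketch, assuming the output is a Chrome-trace-format JSON (the trace_files/*.json glob is a placeholder; point it at your own --output_dir):

# Sketch: count how many times each NPKit event name appears in the generated trace(s).
import collections, glob, json

counts = collections.Counter()
for path in glob.glob("trace_files/*.json"):
    with open(path) as f:
        data = json.load(f)
    events = data if isinstance(data, list) else data.get("traceEvents", [])
    for event in events:
        counts[event.get("name", "?")] += 1
for name, count in counts.most_common():
    print(f"{count:8d}  {name}")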

Question about the misalignment of the generated files

I used NPKit to generate profiling files on two machines, and the timestamps of the files from the two machines do not seem to be aligned.
Here is an example: Process 3 and Process 4 are on different machines, and their trace timelines do not line up.
[Screenshot: trace view showing the Process 3 and Process 4 timelines starting at different offsets]

I built with NPKIT_FLAGS="-DENABLE_NPKIT -DENABLE_NPKIT_EVENT_TIME_SYNC_CPU -DENABLE_NPKIT_EVENT_TIME_SYNC_GPU -DENABLE_NPKIT_EVENT_PRIM_SIMPLE_REDUCE_OR_COPY_MULTI_ENTRY -DENABLE_NPKIT_EVENT_PRIM_SIMPLE_REDUCE_OR_COPY_MULTI_EXIT -DENABLE_NPKIT_EVENT_PRIM_LL_DATA_PROCESS_ENTRY -DENABLE_NPKIT_EVENT_PRIM_LL_DATA_PROCESS_EXIT -DENABLE_NPKIT_EVENT_NET_SEND_ENTRY -DENABLE_NPKIT_EVENT_NET_SEND_EXIT -DENABLE_NPKIT_EVENT_NET_RECV_ENTRY -DENABLE_NPKIT_EVENT_NET_RECV_EXIT" and ran the command below.

CUDA_HOME=/usr/local/cuda-10.2/ LD_LIBRARY_PATH=/home/xinglinpan/npkit/nccl-v2.10.3-1/build/lib/:/home/xinglinpan/mpi/openmpi-4.1.4/lib/ mpirun --prefix /home/xinglinpan/mpi/openmpi-4.1.4/ -np 8 -x NPKIT_DUMP_DIR=./ -x LD_LIBRARY_PATH=/home/xinglinpan/npkit/nccl-v2.10.3-1/build/lib/:/home/xinglinpan/mpi/openmpi-4.1.4/lib/ -x NCCL_DEBUG=TRACE -x NCCL_DEBUG_SUBSYS=GRAPH -H gpu9:4,gpu10:4 ./build/alltoall_perf -b 64M -e 64M -f 2 -g 1 -n 1 -w 0
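One way to quantify the offset is to compare, per process, the earliest timestamp that made it into the generated trace. A minimal sketch, assuming the post-processed output is a Chrome-trace JSON where 'pid' identifies the process and 'ts' is the event timestamp; the file name below is a placeholder. (NPKit's TIME_SYNC events align each GPU to its host CPU clock; whether the post-processor attempts any cross-host alignment is worth checking in npkit_post_process.py.)

# Sketch: print the earliest event timestamp recorded for each pid in the trace.
import json

with open("npkit_event_trace.json") as f:   # placeholder name; use your generated trace file
    data = json.load(f)
events = data if isinstance(data, list) else data.get("traceEvents", [])

first_ts = {}
for e in events:
    pid, ts = e.get("pid"), e.get("ts")
    if pid is None or ts is None:
        continue
    first_ts[pid] = min(ts, first_ts.get(pid, float("inf")))
for pid in sorted(first_ts):
    print(f"pid {pid}: first event at ts = {first_ts[pid]}")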

By the way, the events seem to be divided into 6 parts. When I set -n 0, 4 parts remain (i.e., only initialization completes), so is the AlltoAll itself covered by the events in the last 2 parts?

Empty trace file

Describe the bug
Hi authors, I got an empty trace file when following the usage example, and I am confused about it. My steps are below. No dump file yields a parsed_gpu_event that satisfies the condition at https://github.com/microsoft/NPKit/blob/main/npkit_for_nccl_v2.10.3-1/samples/npkit/npkit_post_process.py.diff#L79

To Reproduce
Steps to reproduce the behavior:

$ git clone https://github.com/nvidia/nccl nccl-v2.10.3-1
$ cd nccl-v2.10.3-1
$ git checkout 7e51592
$ find ../npkit_for_nccl_v2.10.3-1/ | grep '.diff$' | awk '{print "git apply "$1}' | bash
$ make -j src.build NPKIT_FLAGS="-DENABLE_NPKIT -DENABLE_NPKIT_EVENT_TIME_SYNC_CPU -DENABLE_NPKIT_EVENT_TIME_SYNC_GPU -DENABLE_NPKIT_EVENT_PRIM_LL128_DATA_PROCESS_ENTRY -DENABLE_NPKIT_EVENT_PRIM_LL128_DATA_PROCESS_EXIT -DENABLE_NPKIT_EVENT_NET_SEND_ENTRY -DENABLE_NPKIT_EVENT_NET_SEND_EXIT -DENABLE_NPKIT_EVENT_NET_RECV_ENTRY -DENABLE_NPKIT_EVENT_NET_RECV_EXIT"
$ cd samples
$ git clone https://github.com/NVIDIA/nccl-tests.git
$ cd nccl-tests
$ make MPI=1 MPI_HOME=/home/xinglinpan/mpi/openmpi-4.1.4/ CUDA_HOME=/usr/local/cuda-10.2/ NCCL_HOME=/home/xinglinpan/npkit/npkit_result/npkit_src/
$ CUDA_HOME=/usr/local/cuda-10.2/ LD_LIBRARY_PATH=/home/xinglinpan/npkit/nccl-v2.10.3-1/build/lib:/home/xinglinpan/mpi/openmpi-4.1.4/lib/  mpirun -np 4 -mca pml ob1 -mca btl ^openib -mca btl_tcp_if_exclude lo,docker0 -mca coll_hcoll_enable 0 -mca plm_rsh_no_tree_spawn 1 -mca plm_rsh_num_concurrent 8192 -x NCCL_UCX_TLS=rc_x,cuda_copy,cuda_ipc -x NCCL_UCX_RNDV_THRESH=0 -x NCCL_UCX_RNDV_SCHEME=get_zcopy -x UCX_RC_MLX5_TM_ENABLE=y -x NPKIT_DUMP_DIR=./  ./all_reduce_perf -b 8 -e 128M -f 2 -g 1
$ python npkit_post_process.py --npkit_dump_dir=/home/xinglinpan/nccl-tests/build --npkit_event_header_path=/home/xinglinpan/npkit/nccl-v2.10.3-1/src/include/npkit/npkit_event.h --output_dir=./
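One thing worth ruling out in the steps above is a mismatch between the directory NPKIT_DUMP_DIR pointed at during the run (./ resolves relative to the run's working directory) and the --npkit_dump_dir later passed to the post-processor. A minimal sketch that reports which candidate directory actually holds NPKit dumps; the first path is only a guess at where ./ resolved, so substitute the real working directory:

# Sketch: check which candidate directory contains the NPKit dump files.
import os

candidates = [
    "/home/xinglinpan/npkit/nccl-v2.10.3-1/samples/nccl-tests",  # guess at where NPKIT_DUMP_DIR=./ resolved
    "/home/xinglinpan/nccl-tests/build",                         # the --npkit_dump_dir value used above
]
for d in candidates:
    if not os.path.isdir(d):
        print(d, "-> does not exist")
        continue
    dumps = [n for n in os.listdir(d) if "cpu_events" in n or "gpu_events" in n]
    print(d, f"-> {len(dumps)} NPKit dump files")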

Logs

# nThread 1 nGpus 1 minBytes 8 maxBytes 134217728 step: 2(factor) warmup iters: 5 iters: 20 validation: 1
#
# Using devices
#   Rank  0 Pid   3229 on       gpu9 device  0 [0x3d] NVIDIA GeForce RTX 2080 Ti
#   Rank  1 Pid   3230 on       gpu9 device  1 [0x3e] NVIDIA GeForce RTX 2080 Ti
#   Rank  2 Pid   3231 on       gpu9 device  2 [0xb1] NVIDIA GeForce RTX 2080 Ti
#   Rank  3 Pid   3232 on       gpu9 device  3 [0xb2] NVIDIA GeForce RTX 2080 Ti
#
#                                                       out-of-place                       in-place
#       size         count      type   redop     time   algbw   busbw  error     time   algbw   busbw  error
#        (B)    (elements)                       (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)
           8             2     float     sum    18.34    0.00    0.00  1e-07    56.46    0.00    0.00  0e+00
          16             4     float     sum    20.02    0.00    0.00  3e-08    24.77    0.00    0.00  3e-08
          32             8     float     sum    21.58    0.00    0.00  3e-08    23.57    0.00    0.00  3e-08
          64            16     float     sum    28.89    0.00    0.00  3e-08    25.10    0.00    0.00  3e-08
         128            32     float     sum    18.51    0.01    0.01  3e-08    17.26    0.01    0.01  3e-08
         256            64     float     sum    17.96    0.01    0.02  3e-08    18.64    0.01    0.02  3e-08
         512           128     float     sum    16.71    0.03    0.05  3e-08    20.96    0.02    0.04  1e-08
        1024           256     float     sum    27.02    0.04    0.06  1e-07    27.71    0.04    0.06  1e-07
        2048           512     float     sum    28.58    0.07    0.11  1e-07    24.29    0.08    0.13  1e-07
        4096          1024     float     sum    27.17    0.15    0.23  2e-07    30.42    0.13    0.20  2e-07
        8192          2048     float     sum    30.83    0.27    0.40  2e-07    36.36    0.23    0.34  2e-07
       16384          4096     float     sum    30.71    0.53    0.80  2e-07    40.37    0.41    0.61  2e-07
       32768          8192     float     sum    268.6    0.12    0.18  2e-07    43.18    0.76    1.14  2e-07
       65536         16384     float     sum    58.56    1.12    1.68  2e-07    64.99    1.01    1.51  2e-07
      131072         32768     float     sum    102.0    1.28    1.93  2e-07    105.5    1.24    1.86  2e-07
      262144         65536     float     sum    167.2    1.57    2.35  2e-07    178.3    1.47    2.21  2e-07
      524288        131072     float     sum    276.3    1.90    2.85  2e-07    239.1    2.19    3.29  2e-07
     1048576        262144     float     sum    360.9    2.91    4.36  2e-07    358.3    2.93    4.39  2e-07
     2097152        524288     float     sum    726.3    2.89    4.33  2e-07    729.2    2.88    4.31  2e-07
     4194304       1048576     float     sum   1442.7    2.91    4.36  2e-07   2084.9    2.01    3.02  2e-07
     8388608       2097152     float     sum   4161.9    2.02    3.02  2e-07   2828.9    2.97    4.45  2e-07
    16777216       4194304     float     sum   6272.6    2.67    4.01  2e-07   6720.3    2.50    3.74  2e-07
    33554432       8388608     float     sum    13248    2.53    3.80  2e-07    13023    2.58    3.86  2e-07
    67108864      16777216     float     sum    26097    2.57    3.86  2e-07    25792    2.60    3.90  2e-07
   134217728      33554432     float     sum    49738    2.70    4.05  2e-07    49205    2.73    4.09  2e-07
# Out of bounds values : 0 OK
# Avg bus bandwidth    : 1.71279

Platform

  • Device: GeForce RTX 2080Ti * 4
  • OS: Linux gpu9 4.4.0-142-generic #168-Ubuntu SMP Wed Jan 16 21:00:45 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux
  • CUDA version: 10.2
  • NCCL version: 2.7.8-1
  • PyTorch version: 1.9.1
  • Python Version: 3.8

Some questions regarding the time scale in npkit_trace_generator.py

Hello! I have some (possibly silly) questions about the way timestamps are post-processed in npkit_trace_generator.py:

The script sets trace['displayTimeUnit'] = 'ns', so the time unit displayed in the trace should be nanoseconds. To my understanding, parsed_cpu_event['timestamp'] (used as 'ts': parsed_cpu_event['timestamp'] / cpu_clock_scale) is a number of CPU clock counts, and most CPU clocks run at 1e9 Hz, which also matches my experiments: my dump files report cpu_clock_period_num as 1 and cpu_clock_period_den as 1e9. So, if the time unit displayed in the trace is supposed to be nanoseconds, should the 1e6 in "return den / num / 1e6" be changed to 1e9? I think the 1e6 in "return float(freq_in_khz) * 1e3 / 1e6" should also be changed to 1e9. Is there anything wrong with my understanding of the time scale? Thank you in advance for your help!

Thanks a lot again!
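For reference, here is the arithmetic from the question spelled out with the reported values (num = 1, den = 1e9). With those values the /1e6 scaling yields microseconds, and if I read the Chrome trace-event format correctly, the 'ts' field is expected in microseconds regardless of displayTimeUnit, which only controls the granularity the viewer displays; a minimal sketch:

# Sketch: the CPU clock scale with cpu_clock_period_num = 1 and cpu_clock_period_den = 1e9,
# i.e. a counter with 1 ns resolution.
num, den = 1, 1e9

cpu_clock_scale = den / num / 1e6          # = 1000.0 counts per microsecond
timestamp_counts = 5_000_000               # example: 5e6 counts = 5 ms of wall time
ts = timestamp_counts / cpu_clock_scale    # = 5000.0, i.e. 'ts' expressed in microseconds
print(cpu_clock_scale, ts)

# The GPU path is analogous: float(freq_in_khz) * 1e3 / 1e6 is again counts per microsecond.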

What does the index in the tracing result mean?

Hi, I generated some tracing results, but I am confused by the indices in the plot. What do the numbers (indices) 1000-1008 and 2000-2015 mean? To my understanding, the 1000+ and 2000+ ranges represent different channels. Is that right? And what does each individual index (1000, 1001, ..., 1008, 2000, 2001, ..., 2015) represent? Thanks a lot!
[Screenshot: trace view with rows indexed 1000-1008 and 2000-2015]
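One way to work out the mapping without guessing is to list, for each numeric row in the viewer, which event names are recorded under it. A minimal sketch, assuming the generated file is a Chrome-trace JSON and the row numbers correspond to its 'pid'/'tid' fields (the file name is a placeholder):

# Sketch: group event names by (pid, tid) to see what each numeric row contains.
import collections, json

with open("npkit_event_trace.json") as f:   # placeholder name; use your generated trace file
    data = json.load(f)
events = data if isinstance(data, list) else data.get("traceEvents", [])

rows = collections.defaultdict(set)
for e in events:
    rows[(e.get("pid", -1), e.get("tid", -1))].add(e.get("name", "?"))
for (pid, tid), names in sorted(rows.items()):
    sample = sorted(names)[:5]
    print(f"pid={pid} tid={tid}: {sample}{' ...' if len(names) > 5 else ''}")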
