microsoft / npkit
NCCL Profiling Kit
License: MIT License
I used NPKit to generate profiler files on two machines, and the timestamps in the files from the two machines do not appear to be aligned.
Here is an example: Process 3 and Process 4 are from different machines, and their trace files do not line up on a common timeline.
Here is my code with NPKIT_FLAGS="-DENABLE_NPKIT -DENABLE_NPKIT_EVENT_TIME_SYNC_CPU -DENABLE_NPKIT_EVENT_TIME_SYNC_GPU -DENABLE_NPKIT_EVENT_PRIM_SIMPLE_REDUCE_OR_COPY_MULTI_ENTRY -DENABLE_NPKIT_EVENT_PRIM_SIMPLE_REDUCE_OR_COPY_MULTI_EXIT -DENABLE_NPKIT_EVENT_PRIM_LL_DATA_PROCESS_ENTRY -DENABLE_NPKIT_EVENT_PRIM_LL_DATA_PROCESS_EXIT -DENABLE_NPKIT_EVENT_NET_SEND_ENTRY -DENABLE_NPKIT_EVENT_NET_SEND_EXIT -DENABLE_NPKIT_EVENT_NET_RECV_ENTRY -DENABLE_NPKIT_EVENT_NET_RECV_EXIT"
CUDA_HOME=/usr/local/cuda-10.2/ LD_LIBRARY_PATH=/home/xinglinpan/npkit/nccl-v2.10.3-1/build/lib/:/home/xinglinpan/mpi/openmpi-4.1.4/lib/ mpirun --prefix /home/xinglinpan/mpi/openmpi-4.1.4/ -np 8 -x NPKIT_DUMP_DIR=./ -x LD_LIBRARY_PATH=/home/xinglinpan/npkit/nccl-v2.10.3-1/build/lib/:/home/xinglinpan/mpi/openmpi-4.1.4/lib/ -x NCCL_DEBUG=TRACE -x NCCL_DEBUG_SUBSYS=GRAPH -H gpu9:4,gpu10:4 ./build/alltoall_perf -b 64M -e 64M -f 2 -g 1 -n 1 -w 0
By the way, the events seem to be divided into six parts. When I set -n 0, four parts remain (i.e., only initialization completes), so is the AlltoAll itself captured by the events in the last two parts?
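For what it's worth, the TIME_SYNC_CPU/TIME_SYNC_GPU events let the post-processor map GPU timestamps onto each host's CPU clock, but nothing synchronizes the CPU clocks of different machines, so traces from two hosts generally need an extra per-host offset. Below is a minimal re-basing sketch, assuming you can estimate a common reference time per host (e.g., the completion of a global MPI barrier, or NTP); the helper name and trace layout are illustrative, not NPKit API:

```python
def rebase_trace(events, host_ref_time_us):
    """Shift every event so that host_ref_time_us becomes t=0.

    events: list of dicts with a 'ts' field in microseconds, as in a
    Chrome-trace-style JSON produced by post-processing.
    """
    return [dict(ev, ts=ev["ts"] - host_ref_time_us) for ev in events]

# Hypothetical example: each host estimated its own reference time.
trace_host_a = [{"name": "NET_SEND", "ts": 1_000_050}]
trace_host_b = [{"name": "NET_RECV", "ts": 7_000_060}]

aligned_a = rebase_trace(trace_host_a, 1_000_000)  # ref time on host A
aligned_b = rebase_trace(trace_host_b, 7_000_000)  # ref time on host B
# After re-basing, both traces share a common t=0.
```

The quality of the alignment is then bounded by how well the two hosts' reference times agree, not by anything NPKit records.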
Describe the bug
Hi authors, I got an empty trace file when following the usage example, and I am confused about it. Here is my setup. No trace file contains a parsed_gpu_event satisfying the condition at https://github.com/microsoft/NPKit/blob/main/npkit_for_nccl_v2.10.3-1/samples/npkit/npkit_post_process.py.diff#L79
To Reproduce
Steps to reproduce the behavior:
$ git clone https://github.com/nvidia/nccl nccl-v2.10.3-1
$ cd nccl-v2.10.3-1
$ git checkout 7e51592
$ find ../npkit_for_nccl_v2.10.3-1/ | grep '.diff$' | awk '{print "git apply "$1}' | bash
$ make -j src.build NPKIT_FLAGS="-DENABLE_NPKIT -DENABLE_NPKIT_EVENT_TIME_SYNC_CPU -DENABLE_NPKIT_EVENT_TIME_SYNC_GPU -DENABLE_NPKIT_EVENT_PRIM_LL128_DATA_PROCESS_ENTRY -DENABLE_NPKIT_EVENT_PRIM_LL128_DATA_PROCESS_EXIT -DENABLE_NPKIT_EVENT_NET_SEND_ENTRY -DENABLE_NPKIT_EVENT_NET_SEND_EXIT -DENABLE_NPKIT_EVENT_NET_RECV_ENTRY -DENABLE_NPKIT_EVENT_NET_RECV_EXIT"
$ cd samples
$ git clone https://github.com/NVIDIA/nccl-tests.git
$ cd nccl-tests
$ make MPI=1 MPI_HOME=/home/xinglinpan/mpi/openmpi-4.1.4/ CUDA_HOME=/usr/local/cuda-10.2/ NCCL_HOME=/home/xinglinpan/npkit/npkit_result/npkit_src/
$ CUDA_HOME=/usr/local/cuda-10.2/ LD_LIBRARY_PATH=/home/xinglinpan/npkit/nccl-v2.10.3-1/build/lib:/home/xinglinpan/mpi/openmpi-4.1.4/lib/ mpirun -np 4 -mca pml ob1 -mca btl ^openib -mca btl_tcp_if_exclude lo,docker0 -mca coll_hcoll_enable 0 -mca plm_rsh_no_tree_spawn 1 -mca plm_rsh_num_concurrent 8192 -x NCCL_UCX_TLS=rc_x,cuda_copy,cuda_ipc -x NCCL_UCX_RNDV_THRESH=0 -x NCCL_UCX_RNDV_SCHEME=get_zcopy -x UCX_RC_MLX5_TM_ENABLE=y -x NPKIT_DUMP_DIR=./ ./all_reduce_perf -b 8 -e 128M -f 2 -g 1
$ python npkit_post_process.py --npkit_dump_dir=/home/xinglinpan/nccl-tests/build --npkit_event_header_path=/home/xinglinpan/npkit/nccl-v2.10.3-1/src/include/npkit/npkit_event.h --output_dir=./
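When debugging an empty trace, it can help to inspect the raw dump files before post-processing. The sketch below assumes a 16-byte NPKit event record (one 8-byte word packing type/size/reserved bits, followed by an 8-byte timestamp); treat that layout as an assumption and check it against npkit_event.h on your checkout:

```python
import struct

def parse_npkit_events(raw: bytes):
    """Parse raw NPKit dump bytes into (type, size, timestamp) tuples.

    Assumed record layout (16 bytes, little-endian):
      word0: bits 0-7 event type, bits 8-39 size, bits 40-63 reserved
      word1: 64-bit timestamp
    """
    events = []
    for off in range(0, len(raw) - 15, 16):
        word0, ts = struct.unpack_from("<QQ", raw, off)
        ev_type = word0 & 0xFF
        ev_size = (word0 >> 8) & 0xFFFFFFFF
        events.append((ev_type, ev_size, ts))
    return events

# Synthetic record for illustration: type=0x10, size=4096, timestamp=123456.
word0 = 0x10 | (4096 << 8)
raw = struct.pack("<QQ", word0, 123456)
print(parse_npkit_events(raw))  # [(16, 4096, 123456)]
```

If every GPU-side dump file decodes to zero events, the problem is in event collection (build flags, protocol selection) rather than in the post-processing script.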
Logs
# nThread 1 nGpus 1 minBytes 8 maxBytes 134217728 step: 2(factor) warmup iters: 5 iters: 20 validation: 1
#
# Using devices
# Rank 0 Pid 3229 on gpu9 device 0 [0x3d] NVIDIA GeForce RTX 2080 Ti
# Rank 1 Pid 3230 on gpu9 device 1 [0x3e] NVIDIA GeForce RTX 2080 Ti
# Rank 2 Pid 3231 on gpu9 device 2 [0xb1] NVIDIA GeForce RTX 2080 Ti
# Rank 3 Pid 3232 on gpu9 device 3 [0xb2] NVIDIA GeForce RTX 2080 Ti
#
# out-of-place in-place
# size count type redop time algbw busbw error time algbw busbw error
# (B) (elements) (us) (GB/s) (GB/s) (us) (GB/s) (GB/s)
8 2 float sum 18.34 0.00 0.00 1e-07 56.46 0.00 0.00 0e+00
16 4 float sum 20.02 0.00 0.00 3e-08 24.77 0.00 0.00 3e-08
32 8 float sum 21.58 0.00 0.00 3e-08 23.57 0.00 0.00 3e-08
64 16 float sum 28.89 0.00 0.00 3e-08 25.10 0.00 0.00 3e-08
128 32 float sum 18.51 0.01 0.01 3e-08 17.26 0.01 0.01 3e-08
256 64 float sum 17.96 0.01 0.02 3e-08 18.64 0.01 0.02 3e-08
512 128 float sum 16.71 0.03 0.05 3e-08 20.96 0.02 0.04 1e-08
1024 256 float sum 27.02 0.04 0.06 1e-07 27.71 0.04 0.06 1e-07
2048 512 float sum 28.58 0.07 0.11 1e-07 24.29 0.08 0.13 1e-07
4096 1024 float sum 27.17 0.15 0.23 2e-07 30.42 0.13 0.20 2e-07
8192 2048 float sum 30.83 0.27 0.40 2e-07 36.36 0.23 0.34 2e-07
16384 4096 float sum 30.71 0.53 0.80 2e-07 40.37 0.41 0.61 2e-07
32768 8192 float sum 268.6 0.12 0.18 2e-07 43.18 0.76 1.14 2e-07
65536 16384 float sum 58.56 1.12 1.68 2e-07 64.99 1.01 1.51 2e-07
131072 32768 float sum 102.0 1.28 1.93 2e-07 105.5 1.24 1.86 2e-07
262144 65536 float sum 167.2 1.57 2.35 2e-07 178.3 1.47 2.21 2e-07
524288 131072 float sum 276.3 1.90 2.85 2e-07 239.1 2.19 3.29 2e-07
1048576 262144 float sum 360.9 2.91 4.36 2e-07 358.3 2.93 4.39 2e-07
2097152 524288 float sum 726.3 2.89 4.33 2e-07 729.2 2.88 4.31 2e-07
4194304 1048576 float sum 1442.7 2.91 4.36 2e-07 2084.9 2.01 3.02 2e-07
8388608 2097152 float sum 4161.9 2.02 3.02 2e-07 2828.9 2.97 4.45 2e-07
16777216 4194304 float sum 6272.6 2.67 4.01 2e-07 6720.3 2.50 3.74 2e-07
33554432 8388608 float sum 13248 2.53 3.80 2e-07 13023 2.58 3.86 2e-07
67108864 16777216 float sum 26097 2.57 3.86 2e-07 25792 2.60 3.90 2e-07
134217728 33554432 float sum 49738 2.70 4.05 2e-07 49205 2.73 4.09 2e-07
# Out of bounds values : 0 OK
# Avg bus bandwidth : 1.71279
Platform
I have 8 machines, each with a single GPU. When I follow the build instructions for NCCL I get traces for both CPU and GPU events, but after following the steps for MSCCL I only get traces for CPU events. Below are the steps I took to try to get GPU traces with MSCCL.
git clone https://github.com/microsoft/NPKit.git
cd NPKit
git clone https://github.com/microsoft/msccl msccl-master-e52c525
cd msccl-master-e52c525
git checkout e52c525
find ../npkit_for_msccl_master_e52c525/ | grep '.diff$' | awk '{print "git apply "$1}' | bash
make -j src.build NVCC_GENCODE="-arch=sm_80" NPKIT_FLAGS="-DENABLE_NPKIT -DENABLE_NPKIT_EVENT_TIME_SYNC_CPU -DENABLE_NPKIT_EVENT_TIME_SYNC_GPU -DENABLE_NPKIT_EVENT_PRIM_LL_DATA_PROCESS_ENTRY -DENABLE_NPKIT_EVENT_PRIM_LL_DATA_PROCESS_EXIT -DENABLE_NPKIT_EVENT_NET_SEND_ENTRY -DENABLE_NPKIT_EVENT_NET_SEND_EXIT -DENABLE_NPKIT_EVENT_NET_RECV_ENTRY -DENABLE_NPKIT_EVENT_NET_RECV_EXIT"
cd ..
git clone https://github.com/NVIDIA/nccl-tests.git
cd nccl-tests/
make MPI=1 MPI_HOME=/usr/local/openmpi/ NCCL_HOME=/home/jasonfantl/NPKit/MSCCL/NPKit/msccl-master-e52c525/build -j
cd ..
mkdir dump_files
mkdir trace_files
# root directory copied to all machines
mpirun -H navi,quiritis,saria,oshun,midna,parvati,rhiannon,tara rm -f /home/jasonfantl/NPKit/MSCCL/NPKit/dump_files/* && \
mpirun \
--tag-output \
-H navi,quiritis,saria,oshun,midna,parvati,rhiannon,tara \
-x PATH \
-x LD_PRELOAD=/home/jasonfantl/NPKit/MSCCL/NPKit/msccl-master-e52c525/build/lib/libnccl.so.2 \
-x LD_LIBRARY_PATH=/home/jasonfantl/NPKit/MSCCL/NPKit/msccl-master-e52c525/build:/usr/local/openmpi/lib:/usr/local/cuda/lib64:/usr/local/openmpi/lib:$LD_LIBRARY_PATH \
-x NCCL_P2P_DISABLE=1 \
-x NCCL_SHM_DISABLE=1 \
-x NCCL_SOCKET_IFNAME=wan0 \
-x NCCL_NET=IB \
-x NCCL_IB_GID_INDEX=3 \
-x NCCL_IB_HCA=mlx5 \
-x NCCL_NET_GDR_LEVEL=SYS \
-x NCCL_ALGO=MSCCL \
-x NCCL_PROTO=LL \
-x NPKIT_DUMP_DIR=/home/jasonfantl/NPKit/MSCCL/NPKit/dump_files \
-x MSCCL_XML_FILES=/home/jasonfantl/NPKit/MSCCL/NPKit/msccl_samples/msccl_algo_sample.xml \
/home/jasonfantl/NPKit/MSCCL/NPKit/nccl-tests/build/all_reduce_perf -b 1048576 -e 1048576 -f 2 -g 1 -c 1 -n 100 -w 100 -z 0
python /home/jasonfantl/NPKit/MSCCL/NPKit/msccl-master-e52c525/samples/npkit/npkit_post_process.py \
--npkit_dump_dir=/home/jasonfantl/NPKit/MSCCL/NPKit/dump_files \
--npkit_event_header_path=/home/jasonfantl/NPKit/MSCCL/NPKit/msccl-master-e52c525/src/include/npkit/npkit_event.h \
--output_dir=/home/jasonfantl/NPKit/MSCCL/NPKit/trace_files
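A quick way to see whether the GPU side produced anything at all is to scan NPKIT_DUMP_DIR and report per-file sizes, grouping files by whether their name mentions "gpu" or "cpu". The grouping-by-substring convention is an assumption here; check the actual filenames NPKit writes on your build:

```python
import os

def summarize_dump_dir(dump_dir):
    """Group dump files by 'gpu'/'cpu' in the filename and sum their sizes."""
    summary = {"gpu": 0, "cpu": 0, "other": 0}
    for name in os.listdir(dump_dir):
        path = os.path.join(dump_dir, name)
        if not os.path.isfile(path):
            continue
        kind = "gpu" if "gpu" in name else "cpu" if "cpu" in name else "other"
        summary[kind] += os.path.getsize(path)
    return summary
```

In the failure mode described above, the expectation would be a nonzero "cpu" total and a zero "gpu" total, which points at the GPU event collection path rather than at post-processing.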
A potentially useful note: while trying different settings I noticed that with NCCL_ALGO=RING, NCCL_PROTO=LL (built with -DENABLE_NPKIT_EVENT_PRIM_LL_DATA_PROCESS_ENTRY -DENABLE_NPKIT_EVENT_PRIM_LL_DATA_PROCESS_EXIT) does not produce GPU traces, but NCCL_PROTO=LL128 (built with -DENABLE_NPKIT_EVENT_PRIM_LL128_DATA_PROCESS_ENTRY -DENABLE_NPKIT_EVENT_PRIM_LL128_DATA_PROCESS_EXIT) does. I also believe there is a typo on line 77 of npkit_post_process.py: curr_cpu_base_time needs to be replaced with curr_gpu_base_time for parsing to succeed.
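For context on why that base-time variable matters: the post-processor converts raw GPU clock ticks to host wall-clock time with a linear map anchored at the sync events. A sketch of that conversion follows; the variable names are chosen here for illustration and are not necessarily those used in npkit_post_process.py:

```python
def gpu_tick_to_cpu_time_us(gpu_tick, gpu_base_tick, cpu_base_time_us, gpu_freq_mhz):
    """Map a raw GPU clock tick onto the host CPU timeline.

    Anchors the GPU clock at (gpu_base_tick, cpu_base_time_us), taken from a
    pair of TIME_SYNC_GPU / TIME_SYNC_CPU events, then scales ticks to
    microseconds using the GPU clock frequency in MHz (ticks per microsecond).
    """
    return cpu_base_time_us + (gpu_tick - gpu_base_tick) / gpu_freq_mhz

# Example: a 1000 MHz GPU clock means 1000 ticks per microsecond.
print(gpu_tick_to_cpu_time_us(5_000, 1_000, 10.0, 1000.0))  # 14.0
```

Using the CPU base where the GPU base belongs would anchor the map at the wrong origin, which is consistent with the parse failure described above.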
The current MSCCL example npkit_runner.sh uses NPKIT=1 as the build flag, which does not seem to enable any traces at all. I saw the MSCCL example had recently used -DENABLE_NPKIT_EVENT_MSCCL_REDUCE_ENTRY -DENABLE_NPKIT_EVENT_MSCCL_REDUCE_EXIT, which also didn't produce traces.
What are the correct build flags to generate GPU traces with MSCCL?
Hi, can NPKit be used with NCCL v2.18?
I noticed in the MSCCL example that the workload is launched by the MSCCL tests. Does that mean NPKit can only trace workloads generated by the MSCCL tests, or can NPKit trace arbitrary workloads that involve GPU communication?
Hi, I obtained some tracing results, but I am confused about the indices in the plot. What do the numbers 1000-1008 and 2000-2015 mean? My understanding is that 1000+ and 2000+ represent different channels. Is that right? And what does each index (1000, 1001, ..., 1008, 2000, 2001, ..., 2015) represent? Thanks a lot!
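The thread does not confirm what those indices mean. Purely as an illustration of the questioner's own hypothesis (a per-category base offset of 1000 or 2000 plus a channel number; both the bases and the interpretation are assumptions), a decode helper would look like:

```python
def decode_index(idx):
    """Split a trace-row index into (category_base, channel) under the
    hypothetical 'base + channel' encoding; raises if idx fits no base."""
    for base in (2000, 1000):  # assumed category bases, largest first
        if idx >= base:
            return base, idx - base
    raise ValueError(f"index {idx} below every known base")

print(decode_index(1003))  # (1000, 3)
print(decode_index(2015))  # (2000, 15)
```

Under that reading, 1000-1008 would be nine channels of one event category and 2000-2015 sixteen channels of another, but the authoritative answer is whatever npkit_post_process.py assigns to the trace rows.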
Hi! Thank you for developing this helpful profiling tool.
I would like to profile the detailed information of alltoall communication. However, I noticed that the default environment variable settings are configured for allreduce. I have experimented with several settings, and so far, only "-DENABLE_NPKIT_EVENT_NET_SEND_ENTRY xxx" has produced useful profiling results.
Could you please provide the appropriate variable settings for alltoall and send-receive profiling across multiple nodes? The test binary is the official nccl-tests alltoall_perf.
Thank you for your assistance!