microsoft / superbenchmark

A validation and profiling tool for AI infrastructure

License: MIT License

Dockerfile 2.55% Python 65.32% Makefile 0.04% Shell 0.08% CMake 1.00% Cuda 5.41% Jinja 0.01% C++ 22.10% JavaScript 0.43% CSS 0.09% HTML 0.27% Batchfile 0.07% HLSL 2.62%

benchmark ai-system superbench azure hacktoberfest

superbenchmark's Introduction

SuperBench


SuperBench is a validation and profiling tool for AI infrastructure.

📢 v0.10.0 has been released!

Check aka.ms/superbench for more details.

Trademarks

This project may contain trademarks or logos for projects, products, or services. Authorized use of Microsoft trademarks or logos is subject to and must follow Microsoft's Trademark & Brand Guidelines. Use of Microsoft trademarks or logos in modified versions of this project must not cause confusion or imply Microsoft sponsorship. Any use of third-party trademarks or logos are subject to those third-party's policies.


superbenchmark's Issues

cublaslt_gemm microbenchmark fails when running with large matrix sizes

When I run the cublaslt_gemm microbenchmark with B=64, M=8192, K=8192, N=8192, it fails in cudaMalloc.

I debugged this and found that the products B * M * K, B * M * N, and B * K * N each exceed 4 GB in BF16, so a 32-bit int cannot hold the result of the multiplication. I made local changes to fix this, but once I get past cudaMalloc, the cublasCreate(&handle) call fails with CUBLAS_STATUS_NOT_INITIALIZED.

These are the steps to reproduce the error:
cd superbench/benchmarks/micro_benchmarks/cublaslt_gemm
cmake -S ./
make
./cublaslt_gemm -b 64 -m 8192 -k 8192 -n 8192 -i 1000 -t bf16
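
The overflow described above can be checked with a few lines of arithmetic. This is a minimal sketch, not code from the benchmark: the helper names are hypothetical, and `ctypes.c_int32` is used only to mimic the wraparound that a 32-bit signed int typically shows in C.

```python
import ctypes

BYTES_PER_BF16 = 2  # a BF16 value occupies two bytes


def element_count(batch, rows, cols):
    """Exact number of elements in a batched matrix (Python ints do not overflow)."""
    return batch * rows * cols


def as_int32(value):
    """Value a 32-bit signed int would hold after wraparound (hypothetical helper)."""
    return ctypes.c_int32(value).value


B, M, K, N = 64, 8192, 8192, 8192
count_a = element_count(B, M, K)   # 2**32 elements for the B x M x K buffer
print(count_a)                     # exact element count
print(as_int32(count_a))           # what a 32-bit int would see
print(count_a * BYTES_PER_BF16)    # bytes needed in BF16
```

With these sizes the exact count is 2**32, which wraps to 0 in 32 bits, so any size computed in a 32-bit int is wrong before cudaMalloc is even called. Computing sizes in a 64-bit type (for example `size_t` on the C++ side) avoids this.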

Run benchmark failed (superbenchmark-0.8.0)

What's the issue, what's expected?:

PLAY RECAP *********************************************************************
localhost                  : ok=2    changed=1    unreachable=0    failed=0    skipped=0    rescued=0    ignored=0
[2023-04-14 19:17:19,195 u22:21920][ansible.py:79][INFO] Run succeed, return code 0.
[2023-04-14 19:17:19,199 u22:21920][runner.py:275][ERROR] Invalid content in JSON file: /home/edison/Downloads/superbenchmark-0.8.0/outputs/2023-04-14_19-16-21/nodes/u22/benchmarks/cublas-function/rank0/results.json
[2023-04-14 19:17:19,199 u22:21920][runner.py:275][ERROR] Invalid content in JSON file: /home/edison/Downloads/superbenchmark-0.8.0/outputs/2023-04-14_19-16-21/nodes/u22/benchmarks/cudnn-function/rank0/results.json
[2023-04-14 19:17:19,199 u22:21920][runner.py:275][ERROR] Invalid content in JSON file: /home/edison/Downloads/superbenchmark-0.8.0/outputs/2023-04-14_19-16-21/nodes/u22/benchmarks/gemm-flops/rank0/results.json
[2023-04-14 19:17:19,199 u22:21920][runner.py:275][ERROR] Invalid content in JSON file: /home/edison/Downloads/superbenchmark-0.8.0/outputs/2023-04-14_19-16-21/nodes/u22/benchmarks/gpu-burn/rank0/results.json
[2023-04-14 19:17:19,199 u22:21920][runner.py:275][ERROR] Invalid content in JSON file: /home/edison/Downloads/superbenchmark-0.8.0/outputs/2023-04-14_19-16-21/nodes/u22/benchmarks/mem-bw/rank0/results.json
[2023-04-14 19:17:19,199 u22:21920][runner.py:275][ERROR] Invalid content in JSON file: /home/edison/Downloads/superbenchmark-0.8.0/outputs/2023-04-14_19-16-21/nodes/u22/benchmarks/nccl-bw:default/rank0/results.json
[2023-04-14 19:17:19,199 u22:21920][runner.py:275][ERROR] Invalid content in JSON file: /home/edison/Downloads/superbenchmark-0.8.0/outputs/2023-04-14_19-16-21/nodes/u22/benchmarks/nccl-bw:gdr-only/rank0/results.json
[2023-04-14 19:17:19,199 u22:21920][runner.py:275][ERROR] Invalid content in JSON file: /home/edison/Downloads/superbenchmark-0.8.0/outputs/2023-04-14_19-16-21/nodes/u22/benchmarks/ort-inference/rank0/results.json
[2023-04-14 19:17:19,199 u22:21920][runner.py:275][ERROR] Invalid content in JSON file: /home/edison/Downloads/superbenchmark-0.8.0/outputs/2023-04-14_19-16-21/nodes/u22/benchm

How to reproduce it?:
OS: ubuntu 22.04.02
GPU: GeForce RTX 3060 x1

wget https://github.com/microsoft/superbenchmark/archive/refs/tags/v0.8.0.tar.gz
tar xf v0.8.0.tar.gz
cd superbenchmark-0.8.0/
python3 -m venv --system-site-packages ./venv
source ./venv/bin/activate
python3 -m pip install .
python3 -m pip install --upgrade pip setuptools==65.7
make postinstall
cp superbench/config/default.yaml sb.yaml # and change the proc_num: 8 to proc_num: 1
nano local.ini
set +H
sb deploy -f local.ini --host-password=mysshpassword

docker images # check docker images
REPOSITORY TAG IMAGE ID CREATED SIZE
superbench/superbench latest 36fe2cd49200 2 hours ago 19.5GB

docker run -it --rm --gpus all -e NVIDIA_VISIBLE_DEVICES=0 --shm-size=1g --ulimit memlock=-1 superbench/superbench
nvidia-smi #it works.
exit

sb run -f local.ini -c sb.yaml --host-password=mysshpassword

Log message or snapshot?:

see attached

Additional information:
2023-04-14_19-16-21.tar.gz
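
To narrow down which result files the runner rejects, one option is to try parsing each `results.json` under the output directory. This is a sketch under the assumption that each file is meant to hold one JSON document; `find_invalid_results` is a hypothetical helper, not part of superbench.

```python
import json
from pathlib import Path


def find_invalid_results(output_dir):
    """Return sorted paths of results.json files that fail to parse as JSON."""
    bad = []
    for path in sorted(Path(output_dir).rglob("results.json")):
        try:
            with path.open() as f:
                json.load(f)
        except (json.JSONDecodeError, UnicodeDecodeError):
            # Empty or truncated files end up here as well.
            bad.append(path)
    return bad
```

Calling `find_invalid_results("outputs/2023-04-14_19-16-21")` would list the files that cannot be parsed; opening one of them usually shows whether it is empty or contains an error message instead of results.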

V0.2.0 Release Plan

Release Manager

@TobeyQin

Endgame

Feature freeze: May. 10
Code freeze: May. 28
Demo date: May. 28
Bug Bash date: May. 28
Release date: Jun. 4

Main Features

SuperBench Framework Implementation

SB Benchmarks Implementation -- @guoshzhao

- Design Doc
SB Benchmark Base
- Benchmark Base
- Model Base
- Microbenchmark Base
Environment Build Pipeline ETA:
- Design Doc
- Implementation

SB Agent Implementation -- @abuccts

- Design Doc
SB CLI
- SB CLI Implementation
- Integration with SB Runner
- Integration with SB Executor
SB Executor
- SB Executor Implementation
SB Runner ETA:
- SB Runner Implementation
- Integration with SB Executor

Benchmark Tasks

E2E Benchmarks (including metrics: _float, _half, _float_throughput, _half_throughput) -- @guoshzhao

CNN models -- Use PyTorch TORCHVISION.MODELS sub-package

ResNet: ResNet-50, ResNet-101, ResNet-152

DenseNet: DenseNet-169, DenseNet-201

VGG: VGG-11, VGG-13, VGG-16, VGG-19

BERT -- Use huggingface Transformers

BERT

BERT LARGE

LSTM -- Use PyTorch TORCH.NN sub-package
GPT-2 -- Use huggingface Transformers

Micro Benchmarks

GEMM FLOPS (Tool: Nvidia Cutlass Tool) -- @guoshzhao ETA: May 21

Metrics	Unit	Description
FP64	GFLOPS	FP64 FLOPS without TensorCore
FP32	GFLOPS	FP32 FLOPS without TensorCore
FP16	GFLOPS	FP16 FLOPS without TensorCore
FP64(TC)	GFLOPS	FP64 FLOPS with TensorCore
TF32(TC)	GFLOPS	TF32 FLOPS with TensorCore
FP16(TC)	GFLOPS	FP16 FLOPS with TensorCore
BF16(TC)	GFLOPS	BF16 FLOPS with TensorCore
INT8(TC)	GOPS	INT8 OPS with TensorCore
INT4(TC)	GOPS	INT4 OPS with TensorCore
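
For reference, a GFLOPS figure for a GEMM is usually derived from the matrix sizes and the measured time, counting one multiply and one add per inner-product term. The function below is a sketch of that arithmetic only; its name and the optional `batch` parameter are assumptions, and it does not describe how the Cutlass tool reports its numbers.

```python
def gemm_gflops(m, n, k, seconds, batch=1):
    """GFLOPS of a (batched) GEMM: 2 * m * n * k floating-point operations per matrix."""
    if seconds <= 0:
        raise ValueError("seconds must be positive")
    flops = 2 * batch * m * n * k
    return flops / seconds / 1e9


# An 8192 x 8192 x 8192 GEMM that takes one second is about 1100 GFLOPS.
print(gemm_gflops(8192, 8192, 8192, seconds=1.0))
```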

KernelLaunch (Tool: MSR-A build) -- @guoshzhao ETA: May 21

Metrics	Unit	Description
Kernel_Launch_Event_Time	Time (ms)	Dispatch latency measured in GPU time using cudaEventRecord()/hipEventRecord()
Kernel_Launch_Wall_Time	Time (ms)	Dispatch latency measured in CPU time

Kernel (Tool: MSR-A build) -- @yukirora ETA: May 21

Metrics	Unit	Description
cublasSgemm	Time (ms)	Cublas Kernel Process Time for cublasSgemm
cublasSgemmStridedBatched	Time (ms)	Time for cublasSgemmStridedBatched
cublasGemmStridedBatchedEx	Time (ms)	Time for cublasGemmStridedBatchedEx
cublasGemmEx	Time (ms)	Time for cublasGemmEx
cublasCgemm3mStridedBatched	Time (ms)	Time for cublasCgemm3mStridedBatched
cublasCgemm	Time (ms)	Time for cublasCgemm
Mul_During_NCCL	Time (ms)	Time for Mul_During_NCCL
MatMul_During_NCCL	Time (ms)	Time for MatMul_During_NCCL
MM_AllReduce_Opsharding	Time (ms)	Time for MM_AllReduce_Opsharding
MM_AllGather_Concat_Opsharing	Time (ms)	Time for MM_AllGather_Concat_Opsharing

Document refine -- @TobeyQin ETA: May 28

Update README file

Add SuperBenchmark architecture and refine the goals
Define results format

Test Plan

Test Overall Pipeline

Set Environment and SuperBench Installation Test
Run through the entire benchmarking process

CLI commands execute test

E2E model benchmark test

Micro benchmark test

Utils

Unit-Test platform

Add pipelines for CPU/GPU tests

[Enhancement] - Add HPL random generator to gemm-flops with ROCm

What would you like to be added:
For the GEMM-FLOPs test with ROCm, pass a flag to rocblas-bench to specify which random tensor generation is used:
--initialization rand_int
--initialization hpl

Why is this needed:
--initialization rand_int uses simple random tensor generation, which results in relatively more 0 values
--initialization hpl uses HPL-style random tensor generation, which results in relatively fewer 0 values

Without this feature, how does current superbenchmark work:
More 0 values lead to better benchmark results; this is the default for ROCm before the 5.1 release

Components that may involve changes:
GEMM-FLOPs

Brief description of your proposal if any:
reference:
https://github.com/ROCmSoftwarePlatform/rocBLAS/releases/tag/rocm-5.1.0
https://ontrack.amd.com/browse/MSRCHA-325
ROCm/rocBLAS@9e9ced4
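
One way the proposal could look on the benchmark side is a small helper that validates the requested initialization and appends it to the rocblas-bench command line. This is a sketch only: `rocblas_bench_args` is a hypothetical function, and the `-f`, `-m`, `-n`, `-k` arguments are assumed here for illustration; only `--initialization rand_int` and `--initialization hpl` come from this issue.

```python
VALID_INITIALIZATIONS = ("rand_int", "hpl")


def rocblas_bench_args(m, n, k, initialization="hpl"):
    """Build an argument list for rocblas-bench with an explicit initialization mode."""
    if initialization not in VALID_INITIALIZATIONS:
        raise ValueError(
            f"initialization must be one of {VALID_INITIALIZATIONS}, got {initialization!r}"
        )
    return [
        "rocblas-bench",
        "-f", "gemm",                 # assumed function selector
        "-m", str(m),                 # assumed size flags
        "-n", str(n),
        "-k", str(k),
        "--initialization", initialization,
    ]


print(" ".join(rocblas_bench_args(7680, 8192, 8192, "hpl")))
```

Making the mode an explicit parameter means results from different ROCm releases stay comparable, because the benchmark no longer depends on the library default.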

docker images does not run

Ubuntu 18.04. After the image is downloaded, it fails to run. nvidia-docker is installed, but it is erroneously reported as not installed.


(venv) nonroot@nonroot-MS-7B22:~/superbenchmark$ sudo docker run superbench/superbench:v0.4.0-cuda11.1.1
Unable to find image 'superbench/superbench:v0.4.0-cuda11.1.1' locally
v0.4.0-cuda11.1.1: Pulling from superbench/superbench
6a5697faee43: Pulling fs layer 
ba13d3bc422b: Pulling fs layer 
a254829d9e55: Pulling fs layer 
ff2daf3cdab6: Pulling fs layer 
9867a212b99b: Pulling fs layer 
da2dc255298e: Pulling fs layer 
45c66138abc4: Pulling fs layer 
69f8f14337fe: Pulling fs layer 
ca6a80844c87: Pulling fs layer 
f1cef55f2f91: Pulling fs layer 
7da10256993e: Pulling fs layer 
6b2e44626eea: Pulling fs layer 
6e6939188865: Pulling fs layer 
e935c7b1a998: Pulling fs layer 
6ff8cf358a74: Pulling fs layer 
590ea530411f: Pulling fs layer 
e5e48f41197f: Pulling fs layer 
125a5da70c41: Pulling fs layer 
9279dc6b257d: Pulling fs layer 
f594d963eb87: Pulling fs layer 
1b18685e6f7a: Pulling fs layer 
9591a6fe4536: Pulling fs layer 
e935c7b1a998: Waiting 
bdb38838130b: Pulling fs layer 
f9b433e418df: Pulling fs layer 
1115e15a521a: Pulling fs layer 
10b0575de683: Pulling fs layer 
6ff8cf358a74: Waiting 
40ad6f0e66a8: Pulling fs layer 
922bdc233ecf: Pulling fs layer 
9867a212b99b: Waiting 
2fed69baa886: Pulling fs layer 
f594d963eb87: Waiting 
590ea530411f: Waiting 
1b18685e6f7a: Waiting 
da2dc255298e: Waiting 
e5e48f41197f: Waiting 
9591a6fe4536: Waiting 
24d6f5e10b64: Pulling fs layer 
28cb839879c0: Pulling fs layer 
125a5da70c41: Waiting 
45c66138abc4: Waiting 
9279dc6b257d: Waiting 
1a297942cbed: Pull complete 
7088b2887299: Pull complete 
0cd61a107eb7: Pull complete 
f4abf5809bbd: Pull complete 
5ce8cc51c2d6: Pull complete 
c19296d8b165: Pull complete 
65b415727830: Pull complete 
92ef4ed872d5: Pull complete 
8a6a6385784d: Pull complete 
7159e82f10c2: Pull complete 
8903926b1920: Pull complete 
b497b1efac36: Pull complete 
e0466816640f: Pull complete 
2992460da256: Pull complete 
a5ff18c9283b: Pull complete 
e039b86398d9: Pull complete 
802a59289df4: Pull complete 
f65af8b3e314: Pull complete 
828253d36d6c: Pull complete 
6543d2035b7e: Pull complete 
b16beecccd1d: Pull complete 
205f3c109cf1: Pull complete 
9fdd474bb4ec: Pull complete 
bde9957e1552: Pull complete 
c173907422c4: Pull complete 
ab107e9c96c0: Pull complete 
0b6e632691f0: Pull complete 
426d52c2e7a6: Pull complete 
Digest: sha256:80661452672edbd2017d36f8fc9033bb3083a32120f35efed1191339c6437482
Status: Downloaded newer image for superbench/superbench:v0.4.0-cuda11.1.1


=============
== PyTorch ==
=============

NVIDIA Release 20.12 (build 17950526)
PyTorch Version 1.8.0a0+1606899

Container image Copyright (c) 2020, NVIDIA CORPORATION.  All rights reserved.

Copyright (c) 2014-2020 Facebook Inc.
Copyright (c) 2011-2014 Idiap Research Institute (Ronan Collobert)
Copyright (c) 2012-2014 Deepmind Technologies    (Koray Kavukcuoglu)
Copyright (c) 2011-2012 NEC Laboratories America (Koray Kavukcuoglu)
Copyright (c) 2011-2013 NYU                      (Clement Farabet)
Copyright (c) 2006-2010 NEC Laboratories America (Ronan Collobert, Leon Bottou, Iain Melvin, Jason Weston)
Copyright (c) 2006      Idiap Research Institute (Samy Bengio)
Copyright (c) 2001-2004 Idiap Research Institute (Ronan Collobert, Samy Bengio, Johnny Mariethoz)
Copyright (c) 2015      Google Inc.
Copyright (c) 2015      Yangqing Jia
Copyright (c) 2013-2016 The Caffe contributors
All rights reserved.

NVIDIA Deep Learning Profiler (dlprof) Copyright (c) 2020, NVIDIA CORPORATION.  All rights reserved.

Various files include modifications (c) NVIDIA CORPORATION.  All rights reserved.
NVIDIA modifications are covered by the license terms that apply to the underlying project or file.

WARNING: The NVIDIA Driver was not detected.  GPU functionality will not be available.
   Use 'nvidia-docker run' to start this container; see
   https://github.com/NVIDIA/nvidia-docker/wiki/nvidia-docker .

NOTE: MOFED driver for multi-node communication was not detected.
      Multi-node communication performance may be reduced.

NOTE: The SHMEM allocation limit is set to the default of 64MB.  This may be
   insufficient for PyTorch.  NVIDIA recommends the use of the following flags:
   nvidia-docker run --ipc=host ...
(venv) nonroot@nonroot-MS-7B22:~/superbenchmark$ which nvidia-docker
/usr/bin/nvidia-docker

'sb deploy' is expected to exit with non-zero when failed

Summary

This issue was found using the v0.6.0 release. On this system, ansible was not set up properly because of a test environment issue. When I ran 'sb deploy -f local.ini -i superbench/superbench:v0.6.0-cuda11.1.1', it printed the error messages below. Although the log says ansible.py got return code 127, the sb program exited with 0.

[2022-09-09 18:49:10,573 N000000:30359][runner.py:43][INFO] Runner writes to: /home/aiscadmin/superbench/outputs/2022-09-09_18-49-10.
[2022-09-09 18:49:10,622 N000000:30359][runner.py:48][INFO] Runner will run: ['gpu-burn', 'nccl-bw:default', 'nccl-bw:gdr-only', 'ib-loopback', 'mem-bw', 'gpu-copy-bw:correctness', 'gpu-copy-bw:perf', 'kernel-launch', 'gemm-flops', 'cudnn-function', 'cublas-function', 'matmul', 'sharding-matmul', 'computation-communication-overlap', 'ort-inference', 'tensorrt-inference', 'gpt_models', 'bert_models', 'lstm_models', 'resnet_models', 'densenet_models', 'vgg_models']
[2022-09-09 18:49:10,622 N000000:30359][runner.py:165][INFO] Preparing SuperBench environment.
[2022-09-09 18:49:10,622 N000000:30359][ansible.py:125][INFO] Run playbook deploy.yaml ...
The command was not found or was not executable: ansible-playbook.
[2022-09-09 18:49:10,628 N000000:30359][ansible.py:80][WARNING] Run failed, return code 127.

$ echo $?
0

How to repro

Set up superbench normally. Before running 'sb deploy', remove the ~/.ansible directory. Then run 'sb deploy' as above.
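
The expected behaviour amounts to passing the failing step's return code through to the process exit status. Below is a minimal sketch of that pattern; `run_step` is a hypothetical helper and is not how superbench's runner is actually structured.

```python
import subprocess
import sys

COMMAND_NOT_FOUND = 127  # conventional shell code for a missing or non-executable command


def run_step(argv):
    """Run one external step and return its exit code instead of swallowing it."""
    try:
        return subprocess.run(argv, check=False).returncode
    except (FileNotFoundError, PermissionError):
        return COMMAND_NOT_FOUND


# A CLI entry point would then end with something like:
#     sys.exit(run_step(["ansible-playbook", "deploy.yaml"]))
# so that `echo $?` reports the failure.
```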

Why is it probing for nvidia when running on MI?

It does not say anything specific about running on either NVIDIA or AMD, but your platform certainly cannot distinguish between them!

[2023-04-16 07:09:53,869 abys245:321][executor.py:235][INFO] Executor is going to execute resnet_models/pytorch-resnet101.
[0]: /opt/conda/lib/python3.8/site-packages/torch/cuda/__init__.py:52: UserWarning: CUDA initialization: Found no NVIDIA driver on your system. Please check that you have an NVIDIA GPU and installed a driver from http://www.nvidia.com/Download/index.aspx (Triggered internally at  ../c10/cuda/CUDAFunctions.cpp:100.)
  return torch._C._cuda_getDeviceCount() > 0
[2023-04-16 07:09:54,404 abys245:321][model_base.py:201][INFO] Model placement - model: pytorch-resnet101, GPU availablility: False, pin memory: False, force fp32: False.
[2023-04-16 07:09:54,405 abys245:321][pytorch_base.py:93][INFO] Distributed training is enabled - model: pytorch-resnet101, distributed implementation: ddp.
[1]: /opt/conda/lib/python3.8/site-packages/torch/cuda/__init__.py:52: UserWarning: CUDA initialization: Found no NVIDIA driver on your system. Please check that you have an NVIDIA GPU and installed a driver from http://www.nvidia.com/Download/index.aspx (Triggered internally at  ../c10/cuda/CUDAFunctions.cpp:100.)
  return torch._C._cuda_getDeviceCount() > 0
[2023-04-16 07:09:54,440 abys245:322][model_base.py:201][INFO] Model placement - model: pytorch-resnet101, GPU availablility: False, pin memory: False, force fp32: False.
[2023-04-16 07:09:54,440 abys245:322][pytorch_base.py:93][INFO] Distributed training is enabled - model: pytorch-resnet101, distributed implementation: ddp.
[2023-04-16 07:09:54,442 abys245:322][base.py:179][ERROR] Run benchmark failed - benchmark: pytorch-resnet101, message: ProcessGroupNCCL is only supported with GPUs, no GPUs found!
[2023-04-16 07:09:54,442 abys245:322][executor.py:131][INFO] benchmark: pytorch-resnet101, return code: 4, result: {'return_code': [4]}.
[2023-04-16 07:09:54,442 abys245:322][executor.py:138][ERROR] Executor failed in resnet_models/pytorch-resnet101.
[2023-04-16 07:09:54,448 abys245:321][base.py:179][ERROR] Run benchmark failed - benchmark: pytorch-resnet101, message: ProcessGroupNCCL is only supported with GPUs, no GPUs found!
[2023-04-16 07:09:54,448 abys245:321][executor.py:131][INFO] benchmark: pytorch-resnet101, return code: 4, result: {'return_code': [4]}.
[2023-04-16 07:09:54,448 abys245:321][executor.py:138][ERROR] Executor failed in resnet_models/pytorch-resnet101.
[2023-04-16 07:09:54,905 abys245:74026][ansible.py:79][INFO] Run succeed, return code 0.
[2023-04-16 07:09:54,906 abys245:74026][ansible.py:127][INFO] Run playbook fetch_results.yaml ...

PLAY [Fetch Results] ***********************************************************

TASK [Gathering Facts] *********************************************************
ok: [localhost]

TASK [Synchronize Output Directory] ********************************************
changed: [localhost]

PLAY RECAP *********************************************************************
localhost                  : ok=2    changed=1    unreachable=0    failed=0    skipped=0    rescued=0    ignored=0   
[2023-04-16 07:09:57,458 abys245:74026][ansible.py:79][INFO] Run succeed, return code 0.
(venv) jd@lab101:~/nm/git/superbenchmark$

Found no NVIDIA driver on your system.

What's the issue, what's expected?:
The torch inside the docker cannot find my GPU.

How to reproduce it?:
Install superbenchmark as normal.

sb run -f local.ini -c resnet.yaml --host-password=mypassword

GPU: Quadro RTX 6000, driver is 530.30.02
nvidia-smi

Log message or snapshot?:
/opt/conda/lib/python3.8/site-packages/torch/cuda/__init__.py:52: UserWarning: CUDA initialization: Found no NVIDIA driver on your system.

Additional information:
OS: ubuntu 22.04.02

How can I fix this problem?

ib validation benchmark should support mixed IB device naming schema

This is for superbench latest code.

The current superbench IB validation benchmark assumes consistent IB device names across the nodes. A user must specify this name, or a default one is used. The command passed to the IB tool (such as ib_write_bw) is built here:
https://github.com/microsoft/superbenchmark/blob/main/superbench/benchmarks/micro_benchmarks/ib_validation_performance.py#L310

This design has a problem: in some environments, IB device naming is not consistent. For example, some VMs call the IB device mlx5_0, while others call it mlx5_ib0. There is no way to run the ib-validation benchmark on these VMs together.

Expected:
The IB validation benchmark should work when some VMs call the IB device mlx5_0 and others call it mlx5_ib0 (or another name). One possible design: in the run config yaml, the user specifies the index of the IB device (e.g. 0, 1, 2), and superbench resolves the actual physical device name at runtime on each VM (e.g. mlx5_0, mlx5_ib0). 'ibstat -l' can list the IB device names.
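
The index-based lookup proposed above could be sketched as follows. It assumes that `ibstat -l` prints one device name per line and that the listing order is a usable index; both function names are hypothetical.

```python
import subprocess


def ib_device_by_index(ibstat_output, index):
    """Map a device index to the name at that position in `ibstat -l` output."""
    names = [line.strip() for line in ibstat_output.splitlines() if line.strip()]
    if not 0 <= index < len(names):
        raise IndexError(f"IB device index {index} out of range, found {names}")
    return names[index]


def local_ib_device(index):
    """Resolve the device name on the current node by running `ibstat -l`."""
    result = subprocess.run(["ibstat", "-l"], capture_output=True, text=True, check=True)
    return ib_device_by_index(result.stdout, index)


print(ib_device_by_index("mlx5_ib0\nmlx5_ib1\n", 0))
```

Each node would call `local_ib_device(index)` itself, so a config that says "device 0" resolves to mlx5_0 on one VM and mlx5_ib0 on another.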

V0.7.0 Test Plan

Test Cases

single-node test

Machine Type	#Node * #GPU * GPU Type	PyTorch Version	Accelerated Computing Toolkit	Status
ND A100 v4	1 * 8 * A100 40GB SXM	PyTorch 1.8	CUDA 11.1	Done
NDm A100 v4	1 * 8 * A100 80GB SXM	PyTorch 1.8	CUDA 11.1	Done
Hopper	1 * 8 * H100	PyTorch 1.x	CUDA 11.8	Done

single-node Micro-benchmark Test

tensorrt-inference

Fix Transformers version to avoid tensorrt-inference failure (#441)

cublas-function/cudnn-function

Support list of custom config string in cudnn-functions and cublas-functions (#414)

Support correctness check in cublas-functions (#450, #452)

mem-bw

Add wait time option to resolve mem-bw unstable issue (#438)

SuperBench Improvement

Support non-zero return code (#410, #411, #425)

Support log flushing to the result file during runtime (#445)

Update sb version to include revision hash and date (#427)

Hopper GPU and FP8 related benchmarks

docker building

Add CUDA11.8 Docker image for Nvidia arch90 GPUs (#449)

micro-benchmark

Support GEMM-FLOPS for Nvidia arch90 GPUs (#456)

Support cuBLASLt FP16 and FP8 GEMM (#451, #455)

Debug some cuBLAS and cuDNN kernel crash issues

model-benchmark

Support FP8 in Bert model training (#446)

New in bug bash

[x]

[x]

multiple-node test

Test Table

Machine Type	#Node * #GPU * GPU Type	PyTorch Version	Accelerated Computing Toolkit	Status
NDm A100 v4	32 * 8 * A100 80GB SXM	PyTorch 1.8	CUDA 11.1	Done

distributed Micro-benchmark test

ib-traffic

Support pair-wise pattern in IB validation benchmark (#453)

Support 'pattern' in 'mpi' mode to run tasks in parallel (#447)

nccl-bw

Support topo-aware, all-pair, and K-batch pattern in 'mpi' mode (#437, #458)

Support topo-aware, pair-wise, and K-batch pattern in nccl-bw benchmark (#454)

New in bug bash

[x]

[x]

V0.5.0 Test Plan

Test Table

Machine Type	#Node * #GPU * GPU Type	PyTorch Version	Accelerated Computing Toolkit
ND A100 v4	1 * 8 * A100 40GB SXM	PyTorch 1.8	CUDA 11.1
NDm A100 v4	1 * 8 * A100 80GB SXM	PyTorch 1.8	CUDA 11.1
NDm A100 v4	2 * 8 * A100 80GB SXM	PyTorch 1.8	CUDA 11.1
Hayabusa	1 * 16 * MI200	PyTorch 1.9	ROCm 5.0
Hayabusa	2 * 16 * MI200	PyTorch 1.9	ROCm 5.0
NC96_v4	1 * 4 * A100 PCIe	PyTorch 1.8	CUDA 11.1

Test Cases

Micro-benchmark Test

Support bi-directional bandwidth benchmark

Support data checking and make it optional

GEMM benchmark (NVIDIA only)

Support T4 and A10 in GEMM benchmark

GPU-burn benchmark (NVIDIA only)

Model-benchmark Test

Pytorch models

Sync results on root rank for e2e model benchmarks in distributed mode

Support customized env in local and torch.distributed mode

Add support for pytorch>=1.9.0

Keep BatchNorm as fp32 for pytorch cnn models cast to fp16

Remove FP16 samples type converting time

Support FAMBench

Inference Benchmark Improvement

Add percentile metrics for ort and pytorch inference benchmarks

Add configuration with inference benchmark

SuperBench Improvement

Add command to support listing all optional parameters for benchmarks.

Unify benchmark naming convention and support multiple tests with same benchmark and different parameters/options in one configuration file

Support timeout to detect the benchmark failure and stop the process automatically

Improve Output Interface

Tools

data diagnosis
- Support multi-benchmark check
- Support output in excel and html format
- Support result output for all nodes in data diagnosis
Support result summary in excel, md and html format

Where is the cudnn_benchmark binary located?

# python3 examples/benchmarks/cudnn_function.py
[2022-07-31 05:32:08,788 ac5e130cece6:26637][micro_base.py:129][ERROR] The binary does not exist - benchmark: cudnn-function, binary name: cudnn_benchmark, binary directory: None.
[2022-07-31 05:32:08,788 ac5e130cece6:26637][cudnn_function.py:21][INFO] benchmark: cudnn-function, return code: 31, result: {'return_code': [31]}

Some tests do not support compute capability 8.9 (RTX 4080/4090)

What's the issue, what's expected?:
[2023-04-16 12:26:24,006 u22:880][cuda_gemm_flops_performance.py:77][ERROR] Unsupported architecture - benchmark: gemm-flops, compute capability: 8.9, supports 7.0 7.5 8.0 8.6 9.0

How to reproduce it?:
Run superbenchmark with RTX 4080/4090.

Log message or snapshot?:

Additional information:

sb-exec.log

pytorch cannot find libopen-orted-mpir.so

What's the issue, what's expected?:
pytorch cannot find libopen-orted-mpir.so

Log message or snapshot?:

ERROR: libopen-orted-mpir.so: cannot open shared object file: No such file or directory
Traceback (most recent call last):
  File "/usr/local/lib/python3.8/dist-packages/knack/cli.py", line 233, in invoke
    cmd_result = self.invocation.execute(args)
  File "/usr/local/lib/python3.8/dist-packages/knack/invocation.py", line 224, in execute
    cmd_result = parsed_args.func(params)
  File "/usr/local/lib/python3.8/dist-packages/knack/commands.py", line 146, in __call__
    return self.handler(*args, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/knack/commands.py", line 253, in _command_handler
    result = op(client, **command_args) if client else op(**command_args)
  File "/usr/local/lib/python3.8/dist-packages/superbench/cli/_handler.py", line 208, in exec_command_handler
    executor.exec()
  File "/usr/local/lib/python3.8/dist-packages/superbench/executor/executor.py", line 247, in exec
    context = BenchmarkRegistry.create_benchmark_context(
  File "/usr/local/lib/python3.8/dist-packages/superbench/common/utils/lazy_import.py", line 42, in __getattr__
    self._import()
  File "/usr/local/lib/python3.8/dist-packages/superbench/common/utils/lazy_import.py", line 31, in _import
    self._callback()
  File "/usr/local/lib/python3.8/dist-packages/superbench/benchmarks/__init__.py", line 15, in <module>
    'superbench.benchmarks.registry', 'BenchmarkRegistry', lambda: list(
  File "/usr/lib/python3.8/importlib/__init__.py", line 127, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
  File "<frozen importlib._bootstrap>", line 1014, in _gcd_import
  File "<frozen importlib._bootstrap>", line 991, in _find_and_load
  File "<frozen importlib._bootstrap>", line 975, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 671, in _load_unlocked
  File "<frozen importlib._bootstrap_external>", line 848, in exec_module
  File "<frozen importlib._bootstrap>", line 219, in _call_with_frames_removed
  File "/usr/local/lib/python3.8/dist-packages/superbench/benchmarks/model_benchmarks/__init__.py", line 7, in <module>
    from superbench.benchmarks.model_benchmarks.pytorch_bert import PytorchBERT
  File "/usr/local/lib/python3.8/dist-packages/superbench/benchmarks/model_benchmarks/pytorch_bert.py", line 6, in <module>
    import torch
  File "/usr/local/lib/python3.8/dist-packages/torch/__init__.py", line 191, in <module>
    _load_global_deps()
  File "/usr/local/lib/python3.8/dist-packages/torch/__init__.py", line 153, in _load_global_deps
    ctypes.CDLL(lib_path, mode=ctypes.RTLD_GLOBAL)
  File "/usr/lib/python3.8/ctypes/__init__.py", line 373, in __init__
    self._handle = _dlopen(self._name, mode)
OSError: libopen-orted-mpir.so: cannot open shared object file: No such file or directory
non-zero return code

Additional information:
Ubuntu 20.04, Python 3.8, OpenMPI 4.04.
I have libopen-orted-mpir.so at /usr/lib/x86_64-linux-gnu/openmpi/lib/libopen-orted-mpir.so
and I have added that path in ~/.bashrc.
I also asked ChatGPT-4, but it could not solve this issue.
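
One thing worth checking is whether the library directory is actually visible in the environment of the process that fails, since a setting in ~/.bashrc does not necessarily reach every process. The sketch below uses the same `ctypes.CDLL` call that appears in the traceback; the helper names are hypothetical, and this is a diagnostic aid rather than a confirmed fix.

```python
import ctypes
import os


def can_load(library):
    """Return True if the dynamic loader can open the given library name or path."""
    try:
        ctypes.CDLL(library, mode=ctypes.RTLD_GLOBAL)
        return True
    except OSError:
        return False


def on_library_path(lib_dir, env_value=None):
    """Return True if lib_dir is listed in LD_LIBRARY_PATH (or in env_value if given)."""
    if env_value is None:
        env_value = os.environ.get("LD_LIBRARY_PATH", "")
    wanted = os.path.normpath(lib_dir)
    return any(os.path.normpath(p) == wanted for p in env_value.split(os.pathsep) if p)
```

Running `can_load("libopen-orted-mpir.so")` and `on_library_path("/usr/lib/x86_64-linux-gnu/openmpi/lib")` inside the same shell or container that runs `sb` shows whether the loader sees what the interactive shell sees.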

[Bug] ansible_runner updated to 2.3.2 last week, superbench fails with python 3.6

What's the issue, what's expected?:
from __future__ import annotations was added in version 2.3.2 of ansible_runner, and it is only supported on Python 3.7+.

How to reproduce it?:
Build superbench on the Ubuntu 18.04 docker image (Ubuntu 18.04 comes with Python 3.6 by default).

superbench runtime needs to flush log to the result file

When superbench (version 0.6.0-rc1) runs a test, the test output is held in memory and is not flushed to the output file. This makes it hard for users to track test progress, especially for interactive tests and long-running tests (>= 5 minutes).

This feature is especially important because a test may sometimes hang. Without output, it is hard to tell whether it has actually hung.

Expected: superbench should flush the outputs to the log file immediately.
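
The requested behaviour can be sketched as reading the child's output line by line and flushing after each write. This is a minimal illustration with a hypothetical `run_and_stream` helper, not superbench's executor code; a child that buffers its own stdout would still need to be run unbuffered.

```python
import subprocess
import sys


def run_and_stream(argv, log_path):
    """Run a command, append each output line to log_path as it arrives, return the exit code."""
    with open(log_path, "w") as log:
        with subprocess.Popen(
            argv,
            stdout=subprocess.PIPE,
            stderr=subprocess.STDOUT,  # keep stderr in the same log
            text=True,
            bufsize=1,                 # line-buffered pipe on the parent side
        ) as proc:
            for line in proc.stdout:
                log.write(line)
                log.flush()            # make the line visible to `tail -f` right away
        return proc.returncode
```

With this pattern, `tail -f` on the log file shows progress during a long run, and a hang is visible as the point where output stops.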

Fails to run on Ubuntu 18.04 with NVIDIA RTX 2070

What's the issue, what's expected?:
The benchmark fails to run; see log.
The sb deploy log shows: "could not select device driver"
How to reproduce it?:

Ubuntu1804
kernel: 5.4.0-80-generic
python --version
Python 3.8.8

01:00.0 VGA compatible controller: NVIDIA Corporation Device 1e84 (rev a1)
	Subsystem: NVIDIA Corporation Device 139e
	Kernel driver in use: nvidia
	Kernel modules: nvidiafb, nouveau, nvidia_drm, nvidia

Tue Jul 27 23:01:58 2021       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 465.19.01    Driver Version: 465.19.01    CUDA Version: 11.3     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  On   | 00000000:01:00.0  On |                  N/A |
| 41%   35C    P8    14W / 215W |      1MiB /  7979MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

Log message or snapshot?:
(see attached)

Additional information:

'sb result diagnosis' should exit with valid error code when input data format has error

What's the issue, what's expected?:
This is superbench release/0.6. The 'sb result diagnosis' command does not exit with a proper error code when the input data file has the wrong format. For example, with an empty data file the error is detected correctly, as stdout/stderr show, but the exit code is 0. See the repro steps.

While an empty file is a rare case, this problem points to a general issue in superbench.

How to reproduce it?:
Create an empty data file. Create valid rule and baseline files. Then run the command below. (For privacy reasons, the timestamp and hostname were removed from the following log.)

$ sb result diagnosis --data-file outputs/b/results-summary.jsonl --rule-file rule1.y --baseline-file baseline1.json --output-file-format json --output-all --output-dir diag

...[file_handler.py:41][ERROR] Analyzer: invalid raw data fomat - 'node'
...[rule_base.py:106][ERROR] RuleBase: empty raw data
...[data_diagnosis.py:405][INFO] DataDiagnosis: Begin to process 0 nodes
...[data_diagnosis.py:111][ERROR] DataDiagnosis: get criteria failed
...[data_diagnosis.py:407][INFO] DataDiagnosis: Processed finished
...[data_diagnosis.py:428][INFO] DataDiagnosis: Output results to diag1/diagnosis_summary.json

$ echo $?
0
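
A sketch of the expected behaviour: validate the input first and return a non-zero code when it is empty or malformed. The exit code values and function names are hypothetical, and the requirement that every row carries a 'node' key is an assumption drawn from the log message above.

```python
import json

EXIT_OK = 0
EXIT_INVALID_DATA = 1  # hypothetical code for malformed or empty input


def load_rows(text):
    """Parse JSON Lines text into a list of rows, skipping blank lines."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]


def diagnose(text):
    """Return an exit code: non-zero when the raw data is empty or malformed."""
    try:
        rows = load_rows(text)
    except json.JSONDecodeError:
        return EXIT_INVALID_DATA
    if not rows:
        return EXIT_INVALID_DATA
    if any(not isinstance(row, dict) or "node" not in row for row in rows):
        return EXIT_INVALID_DATA
    return EXIT_OK


print(diagnose(""))  # an empty data file is reported as invalid
```

A command-line wrapper would pass this value to `sys.exit`, so scripts that call the tool can detect the failure without parsing the log.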

docker image /root/hostfile cannot be updated with redeploy

This is using superbench v0.6.0-rc1-cuda11.1.1.

After I deployed and ran some tests on a set of nodes (e.g. n1, n2), I wanted to change the nodes to (n1, n3). I removed the running containers and re-ran deploy with (n1, n3). However, the /root/hostfile inside the docker image still lists the old nodes (n1, n2).

Expected: there should be an easy way to switch to the new nodes (n1, n3), using either sb deploy or sb run.

superbench failed at default most typical run config

What's the issue, what's expected?:


TASK [Starting Container] ******************************************************
fatal: [localhost]: FAILED! => {"changed": true, "cmd": "docker rm --force sb-workspace ||: && docker run -itd --name=sb-workspace  --privileged --net=host --ipc=host  --gpus=all    -w /root -v /root/sb-workspace:/root -v /mnt:/mnt  -v /var/run/docker.sock:/var/run/docker.sock  --entrypoint /bin/bash superbench/superbench && docker exec sb-workspace bash -c  \"chown -R root:root ~ && \\\n  sed -i 's/[# ]*Port.*/Port 22066/g' /etc/ssh/sshd_config && \\\n  service ssh restart && sb help\"\n", "delta": "0:00:36.069805", "end": "2023-04-15 03:01:11.455660", "msg": "non-zero return code", "rc": 125, "start": "2023-04-15 03:00:35.385855", "stderr": "Error response from daemon: No such container: sb-workspace\ndocker: Error response from daemon: could not select device driver \"\" with capabilities: [[gpu]].", "stderr_lines": ["Error response from daemon: No such container: sb-workspace", "docker: Error response from daemon: could not select device driver \"\" with capabilities: [[gpu]]."], "stdout": "28784ba8358530ee44bf82ec37213a691e8573b1b52231c794533c0db781483c", "stdout_lines": ["28784ba8358530ee44bf82ec37213a691e8573b1b52231c794533c0db781483c"]}

PLAY RECAP *********************************************************************
localhost                  : ok=10   changed=5    unreachable=0    failed=1    skipped=1    rescued=0    ignored=0
[2023-04-15 03:01:11,663 jd-MS-7B22:26239][ansible.py:82][WARNING] Run failed, return code 2.
jd@jd-MS-7B22:~/gg/git/superbenchmark$
jd@jd-MS-7B22:~/gg/git/superbenchmark$
jd@jd-MS-7B22:~/gg/git/superbenchmark$ sudo docker container list
[sudo] password for jd:
CONTAINER ID   IMAGE     COMMAND   CREATED   STATUS    PORTS     NAMES
jd@jd-MS-7B22:~/gg/git/superbenchmark$ sudo docker container list --all
CONTAINER ID   IMAGE                   COMMAND       CREATED       STATUS    PORTS     NAMES
28784ba83585   superbench/superbench   "/bin/bash"   5 hours ago   Created             sb-workspace
jd@jd-MS-7B22:~/gg/git/superbenchmark$

How to reproduce it?:
follow your own instruction at https://aka.ms/superbench.

Log message or snapshot?:
above

Additional information:
ubuntu 22.04 bare metal, gtx 2070, cuda 12.x

Executor/Run benchmark failed messages while running vgg models with superbench

What's the issue, what's expected?:
I tried to run some vgg models with superbench on an 8-GPU A100-80G machine, but some of them failed with the messages shown below.

How to reproduce it?:
Run Command:
sb run --no-docker -l localhost -c --output-dir

Log message or snapshot?:
Executor is going to execute model-benchmarks:vgg:float/pytorch-vgg16.
Model placement - model: pytorch-vgg16, GPU availablility: True, pin memory: False, force fp32: False.
Distributed training is enabled - model: pytorch-vgg16, distributed implementation: ddp.
Run benchmark failed - benchmark: pytorch-vgg19, message: trying to initialize the default process group twice!
benchmark: pytorch-vgg19, return code: 4, result: {'return_code': [4]}.
Executor failed in model-benchmarks:vgg:float/pytorch-vgg19.
Run benchmark failed - benchmark: pytorch-vgg19, message: trying to initialize the default process group twice!
benchmark: pytorch-vgg19, return code: 4, result: {'return_code': [4]}.
Executor failed in model-benchmarks:vgg:float/pytorch-vgg19.
Run benchmark failed - benchmark: pytorch-vgg19, message: trying to initialize the default process group twice!
benchmark: pytorch-vgg19, return code: 4, result: {'return_code': [4]}.
Executor failed in model-benchmarks:vgg:float/pytorch-vgg19.
Run benchmark failed - benchmark: pytorch-vgg19, message: trying to initialize the default process group twice!
Run benchmark failed - benchmark: pytorch-vgg19, message: trying to initialize the default process group twice!
benchmark: pytorch-vgg19, return code: 4, result: {'return_code': [4]}.
benchmark: pytorch-vgg19, return code: 4, result: {'return_code': [4]}.
Executor failed in model-benchmarks:vgg:float/pytorch-vgg19.
Executor failed in model-benchmarks:vgg:float/pytorch-vgg19.
Run benchmark failed - benchmark: pytorch-vgg16, message: Timed out initializing process group in store based barrier on rank: 1, for key: store_based_barrier_key:1 (world_size=8, worker_count=3, timeout=0:05:00)
benchmark: pytorch-vgg16, return code: 4, result: {'return_code': [4]}.
Executor failed in model-benchmarks:vgg:float/pytorch-vgg16.
Executor is going to execute model-benchmarks:vgg:float/pytorch-vgg19.
Model placement - model: pytorch-vgg19, GPU availablility: True, pin memory: False, force fp32: False.
Distributed training is enabled - model: pytorch-vgg19, distributed implementation: ddp.
Run benchmark failed - benchmark: pytorch-vgg19, message: trying to initialize the default process group twice!
benchmark: pytorch-vgg19, return code: 4, result: {'return_code': [4]}.
Executor failed in model-benchmarks:vgg:float/pytorch-vgg19.
Run benchmark failed - benchmark: pytorch-vgg16, message: Timed out initializing process group in store based barrier on rank: 5, for key: store_based_barrier_key:1 (world_size=8, worker_count=3, timeout=0:05:00)
benchmark: pytorch-vgg16, return code: 4, result: {'return_code': [4]}.
Executor failed in model-benchmarks:vgg:float/pytorch-vgg16.
Executor is going to execute model-benchmarks:vgg:float/pytorch-vgg19.
Model placement - model: pytorch-vgg19, GPU availablility: True, pin memory: False, force fp32: False.
Distributed training is enabled - model: pytorch-vgg19, distributed implementation: ddp.
Run benchmark failed - benchmark: pytorch-vgg16, message: Timed out initializing process group in store based barrier on rank: 0, for key: store_based_barrier_key:1 (world_size=8, worker_count=3, timeout=0:05:00)
Run benchmark failed - benchmark: pytorch-vgg19, message: trying to initialize the default process group twice!
benchmark: pytorch-vgg19, return code: 4, result: {'return_code': [4]}.
benchmark: pytorch-vgg16, return code: 4, result: {'return_code': [4]}.
Executor failed in model-benchmarks:vgg:float/pytorch-vgg19.
Executor failed in model-benchmarks:vgg:float/pytorch-vgg16.
Executor is going to execute model-benchmarks:vgg:float/pytorch-vgg19.
Model placement - model: pytorch-vgg19, GPU availablility: True, pin memory: False, force fp32: False.
Distributed training is enabled - model: pytorch-vgg19, distributed implementation: ddp.
Run benchmark failed - benchmark: pytorch-vgg19, message: The server socket has failed to listen on any local network address. The server socket has failed to bind to [::]:29501 (errno: 98 - Address already in use). The server socket has failed to bind to 0.0.0.0:29501 (errno: 98 - Address already in use).
benchmark: pytorch-vgg19, return code: 4, result: {'return_code': [4]}.
Executor failed in model-benchmarks:vgg:float/pytorch-vgg19.
[W socket.cpp:426] [c10d] The server socket has failed to bind to [::]:29501 (errno: 98 - Address already in use).
[W socket.cpp:426] [c10d] The server socket has failed to bind to 0.0.0.0:29501 (errno: 98 - Address already in use).
[E socket.cpp:462] [c10d] The server socket has failed to listen on any local network address.
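
The `errno: 98 - Address already in use` lines mean that another socket on the host was still bound to port 29501 when the process tried to listen on it. A minimal sketch, using only the Python standard library, for checking a port before launching and for asking the OS for an unused one; `port_is_free` and `find_free_port` are hypothetical helper names and are not part of superbench or PyTorch:

```python
import socket


def port_is_free(port, host='0.0.0.0'):
    """Return True if a TCP socket can be bound to (host, port) right now."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        try:
            sock.bind((host, port))
        except OSError:
            return False
    return True


def find_free_port(host='0.0.0.0'):
    """Ask the OS for an unused TCP port by binding to port 0."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.bind((host, 0))
        return sock.getsockname()[1]
```

Both helpers only report the state at the moment of the call, so another process can still take the port before the benchmark binds it. They are a diagnostic aid, not a guarantee.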

Additional information:

/sys/fs/cgroup/cpuacct/cpuacct missing causing superbench failures

Docker container: nvidia/cuda:11.6.1-cudnn8-devel-ubuntu20.04
GPU 0: NVIDIA A100 80GB PCIe

[2022-11-23T15:02:17.260Z] Running model...
[2022-11-23T15:02:17.260Z] > docker exec dd7780c3a5f9 bash -c "cd superbenchmark && bash run.sh atoa_small_hayabusa.yaml atoa_small_hayabusa_performance.csv"
[2022-11-23T15:18:24.479Z] NVIDIA GPU detected.
[2022-11-23T15:18:24.479Z] sb exec --config-file   atoa_small_ndv4.yaml    2>&1 | tee log.txt
[2022-11-23T15:18:24.479Z] [2022-11-23 15:02:18,200 rocm-framework-a100-1:400][executor.py:224][INFO] Executor is going to execute gpt_models/pytorch-gpt2-small.
[2022-11-23T15:18:24.480Z] [2022-11-23 15:02:18,202 rocm-framework-a100-1:466][monitor.py:100][INFO] Start monitoring.
[2022-11-23T15:18:24.480Z] [2022-11-23 15:02:18,203 rocm-framework-a100-1:466][monitor.py:226][ERROR] Failed to read process cpu ticks information - error message: [Errno 2] No such file or directory: '/sys/fs/cgroup/cpuacct/cpuacct.stat'
[2022-11-23T15:18:24.480Z] [2022-11-23 15:02:19,205 rocm-framework-a100-1:466][monitor.py:226][ERROR] Failed to read process cpu ticks information - error message: [Errno 2] No such file or directory: '/sys/fs/cgroup/cpuacct/cpuacct.stat'
[2022-11-23T15:18:24.480Z] [2022-11-23 15:02:19,206 rocm-framework-a100-1:466][monitor.py:105][ERROR] Failed to launch the monitor process - error message: unsupported operand type(s) for -: 'NoneType' and 'NoneType'
[2022-11-23T15:18:24.480Z] Process Monitor-1:
[2022-11-23T15:18:24.480Z] Traceback (most recent call last):
[2022-11-23T15:18:24.480Z]   File "/usr/local/lib/python3.8/dist-packages/superbench/monitor/monitor.py", line 102, in run
[2022-11-23T15:18:24.480Z]     self.__sample()
[2022-11-23T15:18:24.480Z]   File "/usr/local/lib/python3.8/dist-packages/superbench/monitor/monitor.py", line 126, in __sample
[2022-11-23T15:18:24.480Z]     self.__sample_host_metrics(record)
[2022-11-23T15:18:24.480Z]   File "/usr/local/lib/python3.8/dist-packages/superbench/monitor/monitor.py", line 152, in __sample_host_metrics
[2022-11-23T15:18:24.480Z]     cpu_usage = (container_ticks_e -
[2022-11-23T15:18:24.480Z] TypeError: unsupported operand type(s) for -: 'NoneType' and 'NoneType'
[2022-11-23T15:18:24.480Z] 
[2022-11-23T15:18:24.480Z] During handling of the above exception, another exception occurred:
[2022-11-23T15:18:24.480Z] 
[2022-11-23T15:18:24.480Z] Traceback (most recent call last):
[2022-11-23T15:18:24.480Z]   File "/usr/lib/python3.8/multiprocessing/process.py", line 315, in _bootstrap
[2022-11-23T15:18:24.480Z]     self.run()
[2022-11-23T15:18:24.480Z]   File "/usr/local/lib/python3.8/dist-packages/superbench/monitor/monitor.py", line 106, in run
[2022-11-23T15:18:24.480Z]     self.stop()
[2022-11-23T15:18:24.480Z]   File "/usr/local/lib/python3.8/dist-packages/superbench/monitor/monitor.py", line 117, in stop
[2022-11-23T15:18:24.480Z]     self.join()
[2022-11-23T15:18:24.480Z]   File "/usr/lib/python3.8/multiprocessing/process.py", line 147, in join
[2022-11-23T15:18:24.480Z]     assert self._parent_pid == os.getpid(), 'can only join a child process'
[2022-11-23T15:18:24.480Z] AssertionError: can only join a child process

superbenchmark/superbench/monitor/monitor.py

Line 83 in 6e357fb

self._cpu_file = '/sys/fs/cgroup/cpuacct/cpuacct.stat'
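
The traceback shows that a missing `cpuacct.stat` leaves both tick samples as `None`, which are then subtracted. One possible cause, stated here as an assumption, is a host using cgroup v2, where the `cpuacct` controller directory does not exist and CPU time is reported in `/sys/fs/cgroup/cpu.stat` instead. The sketch below is a hypothetical fallback, not the project's fix, and all function names are made up for illustration:

```python
from pathlib import Path

# cgroup v1 reports USER_HZ ticks; cgroup v2 reports microseconds.
CGROUP_V1_CPU_FILE = '/sys/fs/cgroup/cpuacct/cpuacct.stat'
CGROUP_V2_CPU_FILE = '/sys/fs/cgroup/cpu.stat'


def read_cpu_stat(path):
    """Parse 'key value' lines into a dict, or return None if the file is unreadable."""
    try:
        text = Path(path).read_text()
    except OSError:
        return None
    stats = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[1].isdigit():
            stats[parts[0]] = int(parts[1])
    return stats


def read_container_cpu_usage(v1_path=CGROUP_V1_CPU_FILE, v2_path=CGROUP_V2_CPU_FILE):
    """Return (value, unit) for container CPU time, or None when neither file is usable."""
    stats = read_cpu_stat(v1_path)
    if stats and 'user' in stats and 'system' in stats:
        return stats['user'] + stats['system'], 'ticks'
    stats = read_cpu_stat(v2_path)
    if stats and 'usage_usec' in stats:
        return stats['usage_usec'], 'usec'
    return None


def cpu_usage_delta(start, end):
    """Subtract two samples, returning None if either is missing or the units differ."""
    if start is None or end is None or start[1] != end[1]:
        return None
    return end[0] - start[0]
```

With a guard like `cpu_usage_delta`, a host without either file yields a missing metric instead of a `TypeError` in the monitor process.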

Installation failing when following instruction [forcefully closed without fixing]

Ubuntu 18.04


nonroot@nonroot-MS-7B22:~/superbenchmark$ pwd
/home/nonroot/superbenchmark
nonroot@nonroot-MS-7B22:~/superbenchmark$ git remote -v
origin	https://github.com/microsoft/superbenchmark (fetch)
origin	https://github.com/microsoft/superbenchmark (push)
nonroot@nonroot-MS-7B22:~/superbenchmark$ 

nonroot@nonroot-MS-7B22:~/superbenchmark$ sudo python3 -m pip install .
Keyring is skipped due to an exception: org.freedesktop.DBus.Error.NoServer: Failed to connect to socket /tmp/dbus-9Lms1YHOlb: Connection refused
WARNING: The directory '/home/nonroot/.cache/pip' or its parent directory is not owned or is not writable by the current user. The cache has been disabled. Check the permissions and owner of that directory. If executing pip with sudo, you should use sudo's -H flag.
Processing /home/nonroot/superbenchmark
  Preparing metadata (setup.py) ... done
Collecting ansible_runner>=2.0.0rc1
  Downloading ansible_runner-2.1.1-py3-none-any.whl (83 kB)
     |████████████████████████████████| 83 kB 4.7 MB/s             
Collecting colorlog>=4.7.2
  Downloading colorlog-6.6.0-py2.py3-none-any.whl (11 kB)
Collecting jinja2>=2.10.1
  Downloading Jinja2-3.0.3-py3-none-any.whl (133 kB)
     |████████████████████████████████| 133 kB 12.2 MB/s            
Requirement already satisfied: joblib>=1.0.1 in /usr/local/lib/python3.6/dist-packages (from superbench==0.4.0) (1.1.0)
Collecting jsonlines>=2.0.0
  Downloading jsonlines-3.0.0-py3-none-any.whl (8.5 kB)
Collecting knack>=0.7.2
  Downloading knack-0.9.0-py3-none-any.whl (59 kB)
     |████████████████████████████████| 59 kB 94.4 MB/s            
Requirement already satisfied: matplotlib>=3.0.0 in /usr/local/lib/python3.6/dist-packages (from superbench==0.4.0) (3.3.4)
Collecting natsort>=7.1.1
  Downloading natsort-8.0.2-py3-none-any.whl (37 kB)
Collecting omegaconf==2.0.6
  Downloading omegaconf-2.0.6-py3-none-any.whl (36 kB)
Collecting openpyxl>=3.0.7
  Downloading openpyxl-3.0.9-py2.py3-none-any.whl (242 kB)
     |████████████████████████████████| 242 kB 44.7 MB/s            
Requirement already satisfied: pandas>=1.1.5 in /usr/local/lib/python3.6/dist-packages (from superbench==0.4.0) (1.1.5)
Collecting pyyaml>=5.3
  Downloading PyYAML-6.0-cp36-cp36m-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_12_x86_64.manylinux2010_x86_64.whl (603 kB)
     |████████████████████████████████| 603 kB 59.7 MB/s            
Collecting seaborn>=0.11.2
  Downloading seaborn-0.11.2-py3-none-any.whl (292 kB)
     |████████████████████████████████| 292 kB 62.9 MB/s            
Collecting tcping>=0.1.1rc1
  Downloading tcping-0.1.1rc1.tar.gz (4.1 kB)
  Preparing metadata (setup.py) ... done
Collecting xlrd>=2.0.1
  Downloading xlrd-2.0.1-py2.py3-none-any.whl (96 kB)
     |████████████████████████████████| 96 kB 45.4 MB/s            
Collecting xlsxwriter>=1.3.8
  Downloading XlsxWriter-3.0.2-py3-none-any.whl (149 kB)
     |████████████████████████████████| 149 kB 69.3 MB/s            
Collecting xmltodict>=0.12.0
  Downloading xmltodict-0.12.0-py2.py3-none-any.whl (9.2 kB)
Collecting ansible_base>=2.10.9
  Downloading ansible-base-2.10.16.tar.gz (6.1 MB)
     |████████████████████████████████| 6.1 MB 25.8 MB/s            
  Preparing metadata (setup.py) ... done
Requirement already satisfied: dataclasses in /home/nonroot/.local/lib/python3.6/site-packages (from omegaconf==2.0.6->superbench==0.4.0) (0.8)
Requirement already satisfied: typing-extensions in /home/nonroot/.local/lib/python3.6/site-packages (from omegaconf==2.0.6->superbench==0.4.0) (4.0.1)
Requirement already satisfied: cryptography in /usr/lib/python3/dist-packages (from ansible_base>=2.10.9->superbench==0.4.0) (2.1.4)
Collecting packaging
  Downloading packaging-21.3-py3-none-any.whl (40 kB)
     |████████████████████████████████| 40 kB 72.5 MB/s            
Requirement already satisfied: six in /home/nonroot/.local/lib/python3.6/site-packages (from ansible_runner>=2.0.0rc1->superbench==0.4.0) (1.16.0)
Collecting pexpect>=4.5
  Downloading pexpect-4.8.0-py2.py3-none-any.whl (59 kB)
     |████████████████████████████████| 59 kB 63.9 MB/s            
Collecting python-daemon
  Downloading python_daemon-2.3.0-py2.py3-none-any.whl (35 kB)
Requirement already satisfied: MarkupSafe>=2.0 in /usr/local/lib/python3.6/dist-packages (from jinja2>=2.10.1->superbench==0.4.0) (2.0.1)
Collecting attrs>=19.2.0
  Downloading attrs-21.4.0-py2.py3-none-any.whl (60 kB)
     |████████████████████████████████| 60 kB 63.3 MB/s            
Collecting argcomplete
  Downloading argcomplete-2.0.0-py2.py3-none-any.whl (37 kB)
Collecting jmespath
  Downloading jmespath-0.10.0-py2.py3-none-any.whl (24 kB)
Requirement already satisfied: tabulate in /usr/local/lib/python3.6/dist-packages (from knack>=0.7.2->superbench==0.4.0) (0.8.9)
Collecting pygments
  Downloading Pygments-2.11.2-py3-none-any.whl (1.1 MB)
     |████████████████████████████████| 1.1 MB 99.9 MB/s            
Requirement already satisfied: python-dateutil>=2.1 in /usr/local/lib/python3.6/dist-packages (from matplotlib>=3.0.0->superbench==0.4.0) (2.8.2)
Requirement already satisfied: cycler>=0.10 in /usr/local/lib/python3.6/dist-packages (from matplotlib>=3.0.0->superbench==0.4.0) (0.11.0)
Requirement already satisfied: kiwisolver>=1.0.1 in /usr/local/lib/python3.6/dist-packages (from matplotlib>=3.0.0->superbench==0.4.0) (1.3.1)
Requirement already satisfied: numpy>=1.15 in /home/nonroot/.local/lib/python3.6/site-packages (from matplotlib>=3.0.0->superbench==0.4.0) (1.19.5)
Requirement already satisfied: pillow>=6.2.0 in /usr/local/lib/python3.6/dist-packages (from matplotlib>=3.0.0->superbench==0.4.0) (8.4.0)
Requirement already satisfied: pyparsing!=2.0.4,!=2.1.2,!=2.1.6,>=2.0.3 in /usr/local/lib/python3.6/dist-packages (from matplotlib>=3.0.0->superbench==0.4.0) (3.0.7)
Collecting et-xmlfile
  Downloading et_xmlfile-1.1.0-py3-none-any.whl (4.7 kB)
Requirement already satisfied: pytz>=2017.2 in /usr/lib/python3/dist-packages (from pandas>=1.1.5->superbench==0.4.0) (2018.3)
Requirement already satisfied: scipy>=1.0 in /usr/local/lib/python3.6/dist-packages (from seaborn>=0.11.2->superbench==0.4.0) (1.5.4)
Collecting click
  Downloading click-8.0.3-py3-none-any.whl (97 kB)
     |████████████████████████████████| 97 kB 64.2 MB/s            
Collecting prettytable
  Downloading prettytable-2.5.0-py3-none-any.whl (24 kB)
Requirement already satisfied: ptyprocess>=0.5 in /usr/local/lib/python3.6/dist-packages (from pexpect>=4.5->ansible_runner>=2.0.0rc1->superbench==0.4.0) (0.7.0)
Requirement already satisfied: importlib-metadata<5,>=0.23 in /usr/local/lib/python3.6/dist-packages (from argcomplete->knack>=0.7.2->superbench==0.4.0) (4.8.3)
Requirement already satisfied: wcwidth in /usr/local/lib/python3.6/dist-packages (from prettytable->tcping>=0.1.1rc1->superbench==0.4.0) (0.2.5)
Requirement already satisfied: docutils in /usr/local/lib/python3.6/dist-packages (from python-daemon->ansible_runner>=2.0.0rc1->superbench==0.4.0) (0.18.1)
Requirement already satisfied: lockfile>=0.10 in /usr/local/lib/python3.6/dist-packages (from python-daemon->ansible_runner>=2.0.0rc1->superbench==0.4.0) (0.12.2)
Requirement already satisfied: setuptools in /usr/lib/python3/dist-packages (from python-daemon->ansible_runner>=2.0.0rc1->superbench==0.4.0) (39.0.1)
Requirement already satisfied: zipp>=0.5 in /usr/local/lib/python3.6/dist-packages (from importlib-metadata<5,>=0.23->argcomplete->knack>=0.7.2->superbench==0.4.0) (3.6.0)
Building wheels for collected packages: superbench, ansible-base, tcping
  Building wheel for superbench (setup.py) ... done
  Created wheel for superbench: filename=superbench-0.4.0-py3-none-any.whl size=139599 sha256=f08e0e47cd18aadcd92ad96228b346a44daa7d569eb24ad2269e8d3ebb472819
  Stored in directory: /tmp/pip-ephem-wheel-cache-dsy5_fgx/wheels/dc/bb/c9/0181c21c034d8eb5089301c92e8e8b249ecc42b0a8569ef352
  Building wheel for ansible-base (setup.py) ... done
  Created wheel for ansible-base: filename=ansible_base-2.10.16-py3-none-any.whl size=1871330 sha256=550d11df3e25bf809017d59c36152caf22db6031f3823c725064ac06325d9600
  Stored in directory: /tmp/pip-ephem-wheel-cache-dsy5_fgx/wheels/ac/3a/eb/1953c987dfe9515f0b3c0770e22520361beedf030ec746b716
  Building wheel for tcping (setup.py) ... done
  Created wheel for tcping: filename=tcping-0.1.1rc1-py3-none-any.whl size=6400 sha256=8e0a41e98d6c0f4d7ea34b176f95b1cc7444edd7eeddf591d3a7fcc948d4f381
  Stored in directory: /tmp/pip-ephem-wheel-cache-dsy5_fgx/wheels/79/b4/79/cd1464d78ff94847f17dde162e88301861ffcdbae7b57279f0
Successfully built superbench ansible-base tcping
Installing collected packages: pyyaml, python-daemon, pygments, prettytable, pexpect, packaging, jmespath, jinja2, et-xmlfile, click, attrs, argcomplete, xmltodict, xlsxwriter, xlrd, tcping, seaborn, openpyxl, omegaconf, natsort, knack, jsonlines, colorlog, ansible-runner, ansible-base, superbench
  Attempting uninstall: pyyaml
    Found existing installation: PyYAML 3.12
ERROR: Cannot uninstall 'PyYAML'. It is a distutils installed project and thus we cannot accurately determine which files belong to it which would lead to only a partial uninstall.
nonroot@nonroot-MS-7B22:~/superbenchmark$ sudo pip3 install --upgrade PyYAML
WARNING: The directory '/home/nonroot/.cache/pip' or its parent directory is not owned or is not writable by the current user. The cache has been disabled. Check the permissions and owner of that directory. If executing pip with sudo, you should use sudo's -H flag.
Requirement already satisfied: PyYAML in /usr/lib/python3/dist-packages (3.12)
Collecting PyYAML
  Downloading PyYAML-6.0-cp36-cp36m-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_12_x86_64.manylinux2010_x86_64.whl (603 kB)
     |████████████████████████████████| 603 kB 4.2 MB/s            
Installing collected packages: PyYAML
  Attempting uninstall: PyYAML
    Found existing installation: PyYAML 3.12
ERROR: Cannot uninstall 'PyYAML'. It is a distutils installed project and thus we cannot accurately determine which files belong to it which would lead to only a partial uninstall.
nonroot@nonroot-MS-7B22:~/superbenchmark$ sudo python3 -m pip install .^C
nonroot@nonroot-MS-7B22:~/superbenchmark$ pwd
/home/nonroot/superbenchmark
nonroot@nonroot-MS-7B22:~/superbenchmark$ git remote -v
origin	https://github.com/microsoft/superbenchmark (fetch)
origin	https://github.com/microsoft/superbenchmark (push)
nonroot@nonroot-MS-7B22:~/superbenchmark$
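
A commonly used workaround for the "distutils installed project" error is pip's `--ignore-installed` option, which installs the requested version without first trying to uninstall the existing one. Whether that is appropriate on this machine is an assumption; installing into a virtual environment avoids touching the system PyYAML at all. The helper below only builds the command line, and `pip_install_command` is a hypothetical name:

```python
import sys


def pip_install_command(package, ignore_installed=True):
    """Build a pip command line for the current interpreter.

    --ignore-installed skips the uninstall step that fails in the log above.
    """
    cmd = [sys.executable, '-m', 'pip', 'install']
    if ignore_installed:
        cmd.append('--ignore-installed')
    cmd.append(package)
    return cmd


# Example (not executed here):
#   import subprocess
#   subprocess.run(pip_install_command('PyYAML'), check=True)
```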

superbench logging: 'color' output should be optional

In the current superbench code, the logger has a hard-coded 'color' mode. While this design is good for interactive use, it is a problem when the logs are processed programmatically in a system like Geneva.

The following is superbench output's raw text:
^[[0;33m[1,4]:^[[0m[^[[36m2022-08-29 22:50:16,377 node1:287^[[0m][^[[34mexecutor.py:125^[[0m][^[[32mINFO^[[0m] Executor succeeded in nccl-bw:nvlink.^[[0m^[[0m^M

Expected: the color mode should be an option of the 'sb' command, or it should be possible to turn it off with an environment variable.
The following is the expected raw text:

[1,4]:[2022-08-29 22:50:16,377 ND40rsv201000009:287][executor.py:125][INFO] Executor succeeded in nccl-bw:nvlink.
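
Until the logger offers such an option, a log consumer can strip the escape sequences itself. This is a consumer-side workaround sketch, not a change to superbench; `strip_color` is a hypothetical helper, and the sample line is the raw text quoted above with `^[` written as the ESC character and `^M` as a carriage return:

```python
import re

# Matches ANSI "Select Graphic Rendition" sequences such as ESC[0;33m and ESC[0m.
ANSI_SGR = re.compile(r'\x1b\[[0-9;]*m')


def strip_color(line):
    """Remove ANSI color codes and any trailing CR/LF from one log line."""
    return ANSI_SGR.sub('', line).rstrip('\r\n')


raw = ('\x1b[0;33m[1,4]:\x1b[0m[\x1b[36m2022-08-29 22:50:16,377 node1:287\x1b[0m]'
       '[\x1b[34mexecutor.py:125\x1b[0m][\x1b[32mINFO\x1b[0m] '
       'Executor succeeded in nccl-bw:nvlink.\x1b[0m\x1b[0m\r')
print(strip_color(raw))
# [1,4]:[2022-08-29 22:50:16,377 node1:287][executor.py:125][INFO] Executor succeeded in nccl-bw:nvlink.
```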

[Enhancement] maybe algo argument can be omitted in cudnn-function?

When using the superbench cudnn-function microbenchmark to benchmark conv2d, it is hard to choose the algo argument, because the best-performing algo differs from shape to shape. I'm currently traversing all seven algos to test the conv2d benchmark, but that takes a lot of time.

# conv_info, input_strides, output_strides, pad_top and pad_left are defined earlier in the test.
for algo in algos:
    custom_config_str = '{' + '"algo":{2},"arrayLength":2,"convType":0,"dilationA":[{0},{0}],"filterStrideA":[{1},{1}],'.format(conv_info['D'], conv_info['S'], algo) \
        + '"filterDims":[{0},{1},{2},{2}],"inputDims":[{0},{3},{4},{5}],"inputStride":[{6},{7},{8},{9}],"inputType":0,'.format(conv_info['N'], conv_info['F'], conv_info['K'], conv_info['C'], conv_info['H'], conv_info['W'], input_strides[0], input_strides[1], input_strides[2], input_strides[3]) \
        + '"mode":1,"name":"cudnnConvolutionForward","outputDims":[{0},{1},{2},{3}],'.format(conv_info['N'], conv_info['F'], conv_info['HO'], conv_info['WO']) \
        + '"outputStride":[{0},{1},{2},{3}],"padA":[{4},{4}],"tensorOp":false'.format(output_strides[0], output_strides[1], output_strides[2], output_strides[3], pad_top, pad_left) \
        + '}'
    parameters = '--num_warmup 8 --num_steps 100 --num_in_step 1000 --config_json_str {0}'.format(custom_config_str)
    context = BenchmarkRegistry.create_benchmark_context(
        'cudnn-function', platform=Platform.CUDA, parameters=parameters
    )

From my point of view, the algo can be omitted because the cuDNN library already provides three functions that automatically choose the best algo:

cudnnFindConvolutionForwardAlgorithm
cudnnFindConvolutionBackwardAlgorithm
cudnnFindConvolutionBackwardFilterAlgorithm
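
Until automatic selection is available, the manual sweep described above can at least be wrapped in a small helper. This is a sketch under assumptions: `run_once` is a hypothetical callable that runs the benchmark for one algo and returns its time, or `None` if that algo is not supported for the shape. It does not call cuDNN or superbench itself.

```python
from typing import Callable, Iterable, Optional, Tuple


def pick_fastest_algo(run_once: Callable[[int], Optional[float]],
                      algos: Iterable[int]) -> Tuple[int, float]:
    """Run every algo once and return (algo, time) for the fastest supported one."""
    best = None
    for algo in algos:
        elapsed = run_once(algo)
        if elapsed is None:
            continue  # algo not supported for this shape
        if best is None or elapsed < best[1]:
            best = (algo, elapsed)
    if best is None:
        raise RuntimeError('no algo produced a result')
    return best
```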

why '/dev/nvidia-uvm' is a required file check for nvidia GPU?

What's the issue, what's expected?:
This is v0.6.0. The superbench code in gpu.py (https://github.com/microsoft/superbenchmark/blob/main/superbench/common/devices/gpu.py#L24) decides whether a GPU is nvidia by checking both '/dev/nvidiactl' and '/dev/nvidia-uvm'. The question is: why does it require /dev/nvidia-uvm? I found that some GPU types, like 'Tesla K80' with cuda 11.4, don't have this file.

How to reproduce it?:
Log on to a machine with a 'Tesla K80' and cuda 11.4.
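
A more tolerant check could look like the sketch below. It is an illustration of the question raised here, not the code in gpu.py: `detect_gpu_vendor` is a hypothetical function, and the use of `/dev/kfd` and `/dev/dri` for AMD is an assumption based on the device files that ROCm normally exposes.

```python
import os


def detect_gpu_vendor(exists=os.path.exists):
    """Guess the GPU vendor from device files; `exists` is injectable for testing."""
    if exists('/dev/nvidiactl'):
        # /dev/nvidia-uvm is treated as optional, since the reported K80 host lacks it.
        return 'nvidia'
    if exists('/dev/kfd') and exists('/dev/dri'):
        return 'amd'
    return None
```

Passing the existence check as a parameter lets the detection be exercised on machines without any GPU.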

V0.5.0 Release Plan

Release Manager

@cp5555

Endgame

Code freeze: April 10th, 2022
Bug Bash date: April 11th, 2022
Release date: April 22nd, 2022

Main Features

Micro-benchmark Improvement

- Support nccl bandwidth benchmark only with NIC in NCCL/RCCL Bandwidth Test (#299)
- Support bi-directional bandwidth benchmark in GPU Copy Bandwidth Test (#285, #298, #302)
- Support data checking in GPU Copy Bandwidth Test (#301)
- Update rccl-tests submodule to fix divide by zero error (#306)
- Add GPU-Burn as microbenchmark (#324)

Model-benchmark Improvement

- Sync results on root rank for e2e model benchmarks in distributed mode (#287)
- Support customized env in local and torch.distributed mode (#295)
- Add support for pytorch>=1.9.0 (#305)
- Keep BatchNorm as fp32 for pytorch cnn models cast to fp16 (#322)
- Remove FP16 samples type converting time (#330, #332)
- Support FAMBench (#338)

Inference Benchmark Improvement

- Revise the default setting for inference benchmark (#311, #329)
- Add percentile metrics for inference benchmarks (#283)
- Support T4 and A10 in GEMM benchmark (#294)
- Add configuration with inference benchmark (#311)

SuperBench Improvement

- Add command to support listing all optional parameters for benchmarks. (#279)
- Unify benchmark naming convention and support multiple tests with same benchmark and different parameters/options in one configuration file (#284)
- Support timeout to detect the benchmark failure and stop the process automatically (#288)
- Add rocm5.0 dockerfile (#307)
- Improve Output Interface (#333)

Data Diagnosis & Analysis

- Support multi-benchmark check (#289)
- Support result summary in md, excel and html format (#320, #321, #335)
- Support data diagnosis in md and html format (#325)
- Support result output for all nodes in data diagnosis (#336, #339)
- Add document for result summary usage (#337)

Backlogs

SuperBench Improvement

- Support automatic configuration yaml selection on Azure VM

Inference Benchmark Improvement

- Support VGG, LSTM, and GPT-2 small in TensorRT Inference Backend
- Support VGG, LSTM, and GPT-2 small in ORT Inference Backend

Data Diagnosis & Analysis

- Support boxplot and outlier analysis

Document

- Metric Reasoning Doc

V0.3.0 Release Plan

Release Manager

@TobeyQin

Endgame

Code freeze: 9/1/2021
Bug Bash date: 9/2/2021
Release date: 9/17/2021

Main Features

SuperBench Framework

SB Runner -- @abuccts

MPI mode implementation
PR: #146

SB Benchmarks -- @guoshzhao

Docker Base
PR: #179 and #180

Single-node Validation

Micro-benchmarks -- @guoshzhao @yukirora

- Memory (Tool: Nvidia Bandwidth Test Tool) -- @yukirora ETA: 5/28/2021
  PR: #114
  

Metrics	Unit	Description
H2D_Mem_BW_<GPU ID>	GB/s	host-to-GPU bandwidth for each GPU
D2H_Mem_BW_<GPU ID>	GB/s	GPU-to-host bandwidth for each GPU

Device P2P Bandwidth (Tool: Nvidia p2pBandwidthLatencyTest Tool) -- Delayed

Metrics	Unit	Description
P2P_BW_Max	GB/s	The maximum bandwidth in Bidirectional P2P=Enabled Bandwidth Matrix for all GPUs
P2P_BW_Min	GB/s	The minimum bandwidth
P2P_BW_Avg	GB/s	The average bandwidth

IBLoopback (Tool: PerfTest – Standard RDMA Test Tool) -- @yukirora ETA: 7/30/2021
PR: #112 and #129

Metrics	Unit	Description
IB_Write	MB/s	The IB write loopback throughput with different message size
IB_Read	MB/s	The IB read loopback throughput with different message size
IB_Send	MB/s	The IB send loopback throughput with different message size

NCCL (Tool: Nvidia NCCL Test) -- @yukirora ETA: 7/30/2021
PR: #113 and #128

Metrics	Unit	Description
NCCL_AllReduce	GB/s	The NCCL AllReduce performance with different message size
NCCL_AllGather	GB/s	The NCCL AllGather performance with different message size
NCCL_broadcast	GB/s	The NCCL Broadcast performance with different message size
NCCL_reduce	GB/s	The NCCL Reduce performance with different message size
NCCL_reduce_scatter	GB/s	The NCCL ReduceScatter performance with different message size

Disk (Tool: FIO – Standard Disk Performance Tool) -- @yzygitzh ETA: 7/30/2021
PR: #127 and #132 and #161

Metrics	Unit	Description
Seq_Read	MB/s	Sequential read performance
Seq_Write	MB/s	Sequential write performance
Rand_Read	MB/s	Random read performance
Rand_Write	MB/s	Random write performance
Seq_R/W_Read	MB/s	Read performance in sequential read/write, fixed measurement (read:write = 4:1)
Seq_R/W_Write	MB/s	Write performance in sequential read/write (read:write = 4:1)
Rand_R/W_Read	MB/s	Read performance in random read/write (read:write = 4:1)
Rand_R/W_Write	MB/s	Write performance in random read/write (read:write = 4:1)

- H2D/D2H SM Transmission Bandwidth (Tool: MSR-A build) -- @yzygitzh ETA: 8/6/2021
  PR: #162 and #169
  

Metrics	Unit	Description
H2D_SM_BW_<GPU ID>	GB/s	host-to-GPU bandwidth using GPU kernel for each GPU
D2H_SM_BW_<GPU ID>	GB/s	GPU-to-host bandwidth using GPU kernel for each GPU

Support AMD

Docker Image Support -- @guoshzhao ETA: 7/16/2021

ROCm 4.2 PyTorch 1.7 PR: #164
ROCm 4.0 PyTorch 1.7 PR: #164

Micro Benchmarks

Kernel Launch (Tool: MSR-A build) -- @yukirora ETA: 7/30/2021
PR: #137 and #136

Metrics	Unit	Description
Kernel_Launch_Event_Time	Time (ms)	Dispatch latency measured in GPU time using hipEventRecord()
Kernel_Launch_Wall_Time	Time (ms)	Dispatch latency measured in CPU time

RCCL (Tool: AMD RCCL Test) -- @yukirora ETA: 7/30/2021
PR: #139 and #143

Metrics	Unit	Description
RCCL_AllReduce	GB/s	The RCCL AllReduce performance with different message size
RCCL_AllGather	GB/s	The RCCL AllGather performance with different message size
RCCL_broadcast	GB/s	The RCCL Broadcast performance with different message size
RCCL_reduce	GB/s	The RCCL Reduce performance with different message size
RCCL_reduce_scatter	GB/s	The RCCL ReduceScatter performance with different message size

GEMM FLOPS (Tool: AMD rocblas-bench Tool) -- @yukirora ETA: 8/27/2021
PR: #144 and #165

Metrics	Unit	Description
FP64	GFLOPS	FP64 FLOPS without MatrixCore
FP32	GFLOPS	FP32 FLOPS without MatrixCore
FP16	GFLOPS	FP16 FLOPS without MatrixCore
FP32(MC)	GFLOPS	TF32 FLOPS with MatrixCore
FP16(MC)	GFLOPS	FP16 FLOPS with MatrixCore
BF16(MC)	GFLOPS	BF16 FLOPS with MatrixCore
INT8(MC)	GOPS	INT8 FLOPS with MatrixCore
INT4(MC)	GOPS	INT4 FLOPS with MatrixCore

- Memory (Tool: HIP Bandwidth Test Tool) -- @yukirora ETA: 8/27/2021
  PR: #159 and #153
  

Metrics	Unit	Description
H2D_Mem_BW_<GPU ID>	GB/s	host-to-GPU bandwidth for each GPU
D2H_Mem_BW_<GPU ID>	GB/s	GPU-to-host bandwidth for each GPU

E2E Benchmarks -- @guoshzhao ETA: 7/16/2021

- CNN models -- Use PyTorch TORCHVISION.MODELS sub-package
  - ResNet: ResNet-50, ResNet-101, ResNet-152
  - DenseNet: DenseNet-169, DenseNet-201
  - VGG: VGG-11, VGG-13, VGG-16, VGG-19
- BERT -- Use huggingface Transformers
  - BERT
  - BERT LARGE
- LSTM -- Use PyTorch TORCH.NN sub-package
- GPT-2 -- Use huggingface Transformers

Result Summary -- @cp5555

Generate a report to summarize the results -- @guoshzhao ETA: 7/30/2021
PR: #147, #149, and #157
Support basic analysis feature (boxplot figure, outlier detection, etc.)

Bug Fix

VGG models failed on A100 GPU with batch_size=128 #115
PR: #134

Other Improvement

Contribution related -- @lynex
- Contribute rule (#131)
- system information collection (#160)
Document -- @TobeyQin
- Add release process doc (#130)
- Add design documents (#125)
- Add developer guide doc for coding style (#155)
- Add contribution rules (#131)
- Add docker image list (#154)
- Add initial validation results
- ~~Add metric reasoning doc -- @cp5555 @guoshzhao~~
Process monitor
- Add Heart beat to monitor process health
- Auto kill all processes on all nodes
Coding style -- @abuccts
- Add vscode online

Backlogs

Multi-Node Benchmarks

Mellanox ClusterKit
GPCNeT

UI Design

Executor/Run benchmark failed messages while running bert-large model with superbench

What's the issue, what's expected?:
The bert-large model fails with the error messages shown below when run with superbench on an A100-80GB machine with 8 GPUs. Please suggest how to fix this.

How to reproduce it?:
Run Command:
sb run --no-docker -l localhost -c --output-dir

Log message or snapshot?:
Model placement - model: pytorch-bert-large, GPU availablility: True, pin memory: True, force fp32: False.
Distributed training is enabled - model: pytorch-bert-large, distributed implementation: ddp.
Executor is going to execute bert_models/pytorch-bert-large.
Model placement - model: pytorch-bert-large, GPU availablility: True, pin memory: True, force fp32: False.
Distributed training is enabled - model: pytorch-bert-large, distributed implementation: ddp.
Run benchmark failed - benchmark: pytorch-bert-large, message: Broken pipe
benchmark: pytorch-bert-large, return code: 4, result: {'return_code': [4]}.
Executor failed in bert_models/pytorch-bert-large.
Run benchmark failed - benchmark: pytorch-bert-large, message: Broken pipe
benchmark: pytorch-bert-large, return code: 4, result: {'return_code': [4]}.
Executor failed in bert_models/pytorch-bert-large.
benchmark: pytorch-bert-base, return code: 0, result: {'return_code': [0]}.
Executor succeeded in bert_models/pytorch-bert-base.
Run benchmark failed - benchmark: pytorch-bert-large, message: Broken pipe
benchmark: pytorch-bert-large, return code: 4, result: {'return_code': [4]}.
Executor failed in bert_models/pytorch-bert-large.
Executor is going to execute bert_models/pytorch-bert-large.
Model placement - model: pytorch-bert-large, GPU availablility: True, pin memory: True, force fp32: False.
Distributed training is enabled - model: pytorch-bert-large, distributed implementation: ddp.
benchmark: pytorch-bert-base, return code: 0, result: {'return_code': [0]}.
Executor succeeded in bert_models/pytorch-bert-base.
Executor is going to execute bert_models/pytorch-bert-large.
benchmark: pytorch-bert-base, return code: 0, result: {'return_code': [0], 'fp32_train_step_time': [96.17431712523103], 'fp32_train_throughput': [333.46321017523894], 'fp16_train_step_time': [66.33570070005953], 'fp16_train_throughput': [484.904031859023]}.
Executor succeeded in bert_models/pytorch-bert-base.
Model placement - model: pytorch-bert-large, GPU availablility: True, pin memory: True, force fp32: False.
Distributed training is enabled - model: pytorch-bert-large, distributed implementation: ddp.
benchmark: pytorch-bert-base, return code: 0, result: {'return_code': [0]}.
Executor succeeded in bert_models/pytorch-bert-base.
Executor is going to execute bert_models/pytorch-bert-large.
Model placement - model: pytorch-bert-large, GPU availablility: True, pin memory: True, force fp32: False.
Distributed training is enabled - model: pytorch-bert-large, distributed implementation: ddp.
Executor is going to execute bert_models/pytorch-bert-large.
Model placement - model: pytorch-bert-large, GPU availablility: True, pin memory: True, force fp32: False.
Distributed training is enabled - model: pytorch-bert-large, distributed implementation: ddp.
Run benchmark failed - benchmark: pytorch-bert-large, message: Timed out initializing process group in store based barrier on rank: 1, for key: store_based_barrier_key:1 (world_size=8, worker_count=4, timeout=0:05:00)
benchmark: pytorch-bert-large, return code: 4, result: {'return_code': [4]}.
Executor failed in bert_models/pytorch-bert-large.
Run benchmark failed - benchmark: pytorch-bert-large, message: Timed out initializing process group in store based barrier on rank: 2, for key: store_based_barrier_key:1 (world_size=8, worker_count=4, timeout=0:05:00)
benchmark: pytorch-bert-large, return code: 4, result: {'return_code': [4]}.
Executor failed in bert_models/pytorch-bert-large.
Run benchmark failed - benchmark: pytorch-bert-large, message: Timed out initializing process group in store based barrier on rank: 4, for key: store_based_barrier_key:1 (world_size=8, worker_count=4, timeout=0:05:00)
benchmark: pytorch-bert-large, return code: 4, result: {'return_code': [4]}.
Executor failed in bert_models/pytorch-bert-large.
Run benchmark failed - benchmark: pytorch-bert-large, message: Timed out initializing process group in store based barrier on rank: 7, for key: store_based_barrier_key:1 (world_size=8, worker_count=4, timeout=0:05:00)
benchmark: pytorch-bert-large, return code: 4, result: {'return_code': [4]}.
Executor failed in bert_models/pytorch-bert-large.
Run benchmark failed - benchmark: pytorch-bert-large, message: Timed out initializing process group in store based barrier on rank: 0, for key: store_based_barrier_key:1 (world_size=8, worker_count=5, timeout=0:05:00)
benchmark: pytorch-bert-large, return code: 4, result: {'return_code': [4]}.
Executor failed in bert_models/pytorch-bert-large.
/usr/local/lib/python3.8/dist-packages/torch/distributed/launch.py:180: FutureWarning: The module torch.distributed.launch is deprecated

Additional information:

V0.3.0 Test Plan

Test Table

Machine Type	#Node * #GPU * GPU Type	PyTorch Version	Accelerated Computing Toolkit	Status
NDv4	1 * 1 * A100	PyTorch 1.8	CUDA 11.1	Succeeded
NDv4	1 * 8 * A100	PyTorch 1.8	CUDA 11.1	Succeeded
NDv4	2 * 8 * A100	PyTorch 1.8	CUDA 11.1	Done
	1 * 2 * V100	PyTorch 1.8	CUDA 11.1	Succeeded
HPE	1 * 8 * MI100	PyTorch 1.7	ROCm 4.2	Done

Test Cases

Benchmarks Test

E2E Benchmarks

CNN models

BERT

LSTM

GPT-2

Micro Benchmarks

Kernel Launch

GEMM FLOPS

Memory

NCCL/RCCL

IB

Disk

Other Features

- Docker Images Check
- Document Correctness Check

V0.8.0 Test Plan

Test Cases

single-node test

Machine Type	#Node * #GPU * GPU Type	PyTorch Version	Accelerated Computing Toolkit	Status
NDv5 SXM	1 * 8 * H100	PyTorch 1.x	CUDA 11.8	Done
ND A100 v4/NDm A100 v4	1 * 8 * A100 80GB SXM	PyTorch 1.x	CUDA 11.8
ND A100 v4/NDm A100 v4	1 * 8 * A100 40GB SXM	PyTorch 1.8	CUDA 11.1

Hopper GPU and FP8 related benchmarks

microbenchmark

Add distributed inference benchmark (#493)

Support tensor core precisions (e.g., FP8) and batch/shape range in cublaslt gemm (#492 and #494)

e2e benchmark

Support TE FP8 in BERT/GPT2 models (#496, #499)

SuperBench existing benchmark improvement

microbenchmark improvement

Support flexible warmup and non-random data initialization in cublas-benchmark (#479)

Support error tolerance in micro-benchmark for CuDNN function (#490)

e2e benchmark improvement

Fix torch.dist init issue with multiple models (#495)

CPU benchmark

Add STREAM benchmark for sustainable memory bandwidth and the corresponding computation rate. (#473)

Add HPL Benchmark for HPC Linpack Benchmark. (#482)

SuperBench Improvement

install pipeline

Remove fixed rccl version in rocm5.1.x docker file (#476)

Upgrade networkx version to fix installation compatibility issue (#478)

Pin setuptools version to v65.7.0 (#483)

Limit ansible_runner version for Python3.6 (#485)

monitor

Support cgroup V2 when reading system metrics in Monitor

multi-node test

Machine Type	#Node * #GPU * GPU Type	PyTorch Version	Accelerated Computing Toolkit	Status
NDv5 SXM	2 * 8 * H100	PyTorch 1.x	CUDA 11.8

Hopper GPU and FP8 related benchmarks

microbenchmark

Add distributed inference benchmark (#493)

V0.7.0 Release Plan

Release Manager

@cp5555

Endgame

Code freeze: Jan. 3rd, 2023
Bug Bash date: Jan 13th, 2023
Release date: Jan 20th, 2023

Main Features

SuperBench Improvement

- Support non-zero return code when “sb deploy” or “sb run” fails in Ansible (Related to #410 and #411) (#425)
- Support log flushing to the result file during runtime (Related to #390) (#445)
- Update version to include revision hash and date (#427)
- Support 'pattern' in 'mpi' mode to run tasks in parallel (#430, #458)
- Support topo-aware, all-pair, and K-batch pattern in 'mpi' mode (#437, #447)
- Fix Transformers version to avoid TensorRT failure (#441)
- Add CUDA11.8 Docker image for Nvidia arch90 GPUs (#449)
- Support sb deploy without docker pulling (#466)

Micro-benchmark Improvement

- Support list of custom config string in cudnn-functions and cublas-functions (#414)
- Support correctness check in cublas-functions (#450, #452)
- Support GEMM-FLOPS for Nvidia arch90 GPUs (#456)
- Support cuBLASLt FP16 and FP8 GEMM (#451, #455, #460)
- Add wait time option to resolve mem-bw unstable issue (#438)
- Fix bug for incorrect datatype judgement in cublas-function source code. (#462)

Model-benchmark Improvement

- Support FP8 in Bert model training (#446, #461)

Distributed Benchmark Improvement

- Support pair-wise pattern in IB validation benchmark. (#453)
- Support topo-aware, pair-wise, and K-batch pattern in nccl-bw benchmark. (#454)

Backlog

Inference Benchmark Improvement

Support VGG, LSTM, and GPT-2 small in TensorRT Inference Backend
Support VGG, LSTM, and GPT-2 small in ORT Inference Backend
Support more TensorRT parameters (Related to #366)

Document

Metric Reasoning Doc

[Suggestion] Why tensorrt backend uses trtexec instead of tensorrt python interface?

From my point of view, the Python interface would let us insert cudaProfilerStart() and cudaProfilerStop() to profile our program more precisely. With trtexec, superbench starts another process to execute it, so nvprof cannot correctly profile the real command. Profiling trtexec directly also captures both the compilation phase and the runtime phase, while in most cases we only need the latter.

tensorrt python interface example:

import tensorrt as trt
import common
import time
import pycuda.driver as cuda
import torch
import os

TRT_LOGGER = trt.Logger()


def inference(context, test_data):
    inputs, outputs, bindings, stream = common.allocate_buffers(context.engine)
    result = []
    inputs[0].host = test_data

    _, elapsed_time = common.do_inference_v2(context, bindings=bindings, inputs=inputs, outputs=outputs, stream=stream)

    return result, elapsed_time

# This function builds an engine from an ONNX model.
def build_engine(model_file, batch_size=32):
    with trt.Builder(TRT_LOGGER) as builder, builder.create_network(common.EXPLICIT_BATCH) as network, trt.OnnxParser(network, TRT_LOGGER) as parser, builder.create_builder_config() as trt_config:

        # Note: max_batch_size must be set to 1 because of how allocate_buffers is implemented
        builder.max_batch_size = 1
        # builder.max_workspace_size = common.GiB(1)
        trt_config.max_workspace_size = common.GiB(4)

        
        # Parse onnx model
        with open(model_file, 'rb') as model:
            if not parser.parse(model.read()):
                print ('ERROR: Failed to parse the ONNX file.')
                for error in range(parser.num_errors):
                    print (parser.get_error(error))
                return None


        # This design may not be correct if a layer has more than one output
        """
        for i in range(network.num_layers):
            layer = network.get_layer(i)
            layer.precision = trt.int8
            layer.set_output_type(0, trt.int8)
        """


        # network.mark_output(model_tensors.find(ModelData.OUTPUT_NAME))
        # Build engine and do int8 calibration.
        # engine = builder.build_cuda_engine(network)
        engine = builder.build_engine(network, trt_config)
        return engine

onnx_path = "/workspace/v-leiwang3/benchmark/nnfusion_models/resnet50.float32.1.onnx"
dummy_input = torch.rand(1, 3, 224, 224).numpy()

engine = build_engine(onnx_path)
context = engine.create_execution_context()

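# Warm up so that one-time initialization cost is excluded from the measurement.
for _ in range(5):
    inference(context, dummy_input)

# Timed iterations. The result is stored in `elapsed` so it does not shadow the imported `time` module.
time_set = []
for _ in range(100):
    _, elapsed = inference(context, dummy_input)
    time_set.append(elapsed)

print(f'average time: {sum(time_set) / len(time_set) * 1000} ms')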

superbench mpi job should use the proper Ethernet interface

Several superbench tests (e.g. nccltests, ib-traffic) use Open MPI to launch the tests on multiple nodes. Some node types have multiple network interfaces (e.g. azure2, eth0, docker0, ib0, ib1), and the working IPv4 Ethernet interface is not always the default 'eth0' (it may be azure2, for example).

A user can manually check the node type and work out which Ethernet interface MPI should use (e.g. --mca btl_tcp_if_include azure2 --mca oob_tcp_if_include azure2), but this is not generic across different node types.

Expected: because superbench launches the MPI command, superbench should detect the proper Ethernet interface to use and add it to the Open MPI command line.

The following is one way to find this interface using bash. It would be much simpler to do this in Python.

get_eth_interfaces() {
    # Names of all interfaces that have an IPv4 address, concatenated as "lo:eth0:..."
    IPV4List=$(ip -4 -f inet a |grep mtu|awk '{print $2}' | sed ':a; N; $!ba; s/\n//g')
    for ifname in $(ls /sys/class/net); do
        # Keep Ethernet devices (type 1) that are not bridges
        if [[ -f /sys/class/net/$ifname/type && $(cat /sys/class/net/$ifname/type) -eq 1 && ! -f /sys/class/net/$ifname/bridge ]]; then
            isIPV4=$(echo ${IPV4List} | grep "$ifname:" | wc -l)
            isDocker=$(echo $ifname | grep docker | wc -l)
            # Print interfaces that have an IPv4 address and are not Docker interfaces
            if [[ "${isIPV4}" == "1" && "${isDocker}" == "0" ]]; then
                echo $ifname
            fi
        fi
    done
}
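
A rough Python counterpart of the bash function above might look like the following. It is only a sketch under assumptions: it parses the one-line output of `ip -4 -o addr show`, reads the same /sys/class/net entries, and the function names (parse_ipv4_interfaces, is_ethernet_non_bridge, get_eth_interfaces) are hypothetical and not part of superbench.

```python
import os
import subprocess


def parse_ipv4_interfaces(ip_output):
    """Return interface names with an IPv4 address, parsed from `ip -4 -o addr show` output."""
    names = []
    for line in ip_output.splitlines():
        fields = line.split()
        # One-line format: "<index>: <ifname> inet <addr>/<prefix> ..."
        if len(fields) >= 4 and fields[2] == 'inet' and fields[1] not in names:
            names.append(fields[1])
    return names


def is_ethernet_non_bridge(ifname, sysfs_root='/sys/class/net'):
    """Return True if the interface has link type 1 (Ethernet) and no 'bridge' entry."""
    base = os.path.join(sysfs_root, ifname)
    try:
        with open(os.path.join(base, 'type')) as f:
            if f.read().strip() != '1':
                return False
    except OSError:
        return False
    return not os.path.exists(os.path.join(base, 'bridge'))


def get_eth_interfaces(ip_output=None, sysfs_root='/sys/class/net'):
    """Return non-Docker Ethernet interfaces that carry an IPv4 address."""
    if ip_output is None:
        # Assumption: the iproute2 `ip` command is available on the node.
        ip_output = subprocess.run(
            ['ip', '-4', '-o', 'addr', 'show'],
            capture_output=True, text=True, check=True,
        ).stdout
    return [
        name for name in parse_ipv4_interfaces(ip_output)
        if 'docker' not in name and is_ethernet_non_bridge(name, sysfs_root)
    ]
```

The parsing step is kept separate from the sysfs checks so that it can be exercised with captured `ip` output on a machine that does not have the interfaces in question.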

[Bug Report] ONNX export failed on adaptive_avg_pool2d at tensorrt micro bench.

I am currently working in the superbench/superbench:v0.4.0-cuda11.1.1 Docker image to run benchmarks.

To benchmark different models with TensorRT, I customized superbenchmark/examples/benchmarks/tensorrt_inference_performance.py as shown below:

# Copyright (c) Microsoft Corporation.
# Licensed under the MIT license.

"""Micro benchmark example for TensorRT inference performance.

Commands to run:
    python3 examples/benchmarks/tensorrt_inference_performance.py
"""
import sys
from superbench.benchmarks import BenchmarkRegistry, Platform
from superbench.common.utils import logger

if __name__ == '__main__':
    batch = int(sys.argv[1])
    model = sys.argv[2]
    precision = sys.argv[3]
    parameters = '--batch_size {0} --pytorch_models {1} --precision {2} --seq_length 8 --iterations 105'.format(batch, model, precision)

    context = BenchmarkRegistry.create_benchmark_context('tensorrt-inference', platform=Platform.CUDA, parameters=parameters)
    benchmark = BenchmarkRegistry.launch_benchmark(context)
    if benchmark:
        logger.info(
            'benchmark: {}, return code: {}, result: {}'.format(
                benchmark.name, benchmark.return_code, benchmark.result
            )
        )

execution:

nvprof --log-file benches/TensorRT/vgg11/fp32_batch_1_prof.txt /opt/conda/bin/python /opt/superbench/examples/benchmarks/tensorrt_inference_performance.py 1 vgg11 fp32 | tee benches/TensorRT/vgg11/fp32_batch_1_time.txt

log :

root@616b67a69ab7:/opt/superbench# nvprof --log-file benches/TensorRT/vgg11/fp32_batch_1_prof.txt /opt/conda/bin/python /opt/superbench/examples/benchmarks/tensorrt_inference_performance.py 1 vgg11 fp32 | tee benches/TensorRT/vgg11/fp32_batch_1_time.txt
/opt/conda/lib/python3.8/site-packages/torch/onnx/utils.py:256: UserWarning: `add_node_names' can be set to True only when 'operator_export_type' is `ONNX`. Since 'operator_export_type' is not set to 'ONNX', `add_node_names` argument will be ignored.
warnings.warn("`{}' can be set to True only when 'operator_export_type' is "
/opt/conda/lib/python3.8/site-packages/torch/onnx/utils.py:256: UserWarning: `do_constant_folding' can be set to True only when 'operator_export_type' is `ONNX`. Since 'operator_export_type' is not set to 'ONNX', `do_constant_folding` argument will be ignored.
warnings.warn("`{}' can be set to True only when 'operator_export_type' is "
/opt/conda/lib/python3.8/site-packages/torch/onnx/symbolic_helper.py:182: UserWarning: ONNX export failed on adaptive_avg_pool2d because input size not accessible not supported
warnings.warn("ONNX export failed on " + op + " because " + msg + " not supported")
[2022-05-06 12:33:25,995 616b67a69ab7:18330][micro_base.py:167][INFO] Execute command - round: 0, benchmark: tensorrt-inference, command: /opt/tensorrt/bin/trtexec --onnx=/root/.cache/torch/hub/onnx/vgg11.onnx --explicitBatch --optShapes=input:1x3x224x224 --workspace=8192 --iterations=105 --percentile=99.
[2022-05-06 12:33:40,844 616b67a69ab7:18330][micro_base.py:176][ERROR] Microbenchmark execution failed - round: 0, benchmark: tensorrt-inference, error message: &&&& RUNNING TensorRT.trtexec # /opt/tensorrt/bin/trtexec --onnx=/root/.cache/torch/hub/onnx/vgg11.onnx --explicitBatch --optShapes=input:1x3x224x224 --workspace=8192 --iterations=105 --percentile=99
[05/06/2022-12:33:26] [I] === Model Options ===
[05/06/2022-12:33:26] [I] Format: ONNX
[05/06/2022-12:33:26] [I] Model: /root/.cache/torch/hub/onnx/vgg11.onnx
[05/06/2022-12:33:26] [I] Output:
[05/06/2022-12:33:26] [I] === Build Options ===
[05/06/2022-12:33:26] [I] Max batch: explicit
[05/06/2022-12:33:26] [I] Workspace: 8192 MiB
[05/06/2022-12:33:26] [I] minTiming: 1
[05/06/2022-12:33:26] [I] avgTiming: 8
[05/06/2022-12:33:26] [I] Precision: FP32
[05/06/2022-12:33:26] [I] Calibration:
[05/06/2022-12:33:26] [I] Refit: Disabled
[05/06/2022-12:33:26] [I] Safe mode: Disabled
[05/06/2022-12:33:26] [I] Save engine:
[05/06/2022-12:33:26] [I] Load engine:
[05/06/2022-12:33:26] [I] Builder Cache: Enabled
[05/06/2022-12:33:26] [I] NVTX verbosity: 0
[05/06/2022-12:33:26] [I] Tactic sources: Using default tactic sources
[05/06/2022-12:33:26] [I] Input(s)s format: fp32:CHW
[05/06/2022-12:33:26] [I] Output(s)s format: fp32:CHW
[05/06/2022-12:33:26] [I] Input build shape: input=1x3x224x224+1x3x224x224+1x3x224x224
[05/06/2022-12:33:26] [I] Input calibration shapes: model
[05/06/2022-12:33:26] [I] === System Options ===
[05/06/2022-12:33:26] [I] Device: 0
[05/06/2022-12:33:26] [I] DLACore:
[05/06/2022-12:33:26] [I] Plugins:
[05/06/2022-12:33:26] [I] === Inference Options ===
[05/06/2022-12:33:26] [I] Batch: Explicit
[05/06/2022-12:33:26] [I] Input inference shape: input=1x3x224x224
[05/06/2022-12:33:26] [I] Iterations: 105
[05/06/2022-12:33:26] [I] Duration: 3s (+ 200ms warm up)
[05/06/2022-12:33:26] [I] Sleep time: 0ms
[05/06/2022-12:33:26] [I] Streams: 1
[05/06/2022-12:33:26] [I] ExposeDMA: Disabled
[05/06/2022-12:33:26] [I] Data transfers: Enabled
[05/06/2022-12:33:26] [I] Spin-wait: Disabled
[05/06/2022-12:33:26] [I] Multithreading: Disabled
[05/06/2022-12:33:26] [I] CUDA Graph: Disabled
[05/06/2022-12:33:26] [I] Separate profiling: Disabled
[05/06/2022-12:33:26] [I] Skip inference: Disabled
[05/06/2022-12:33:26] [I] Inputs:
[05/06/2022-12:33:26] [I] === Reporting Options ===
[05/06/2022-12:33:26] [I] Verbose: Disabled
[05/06/2022-12:33:26] [I] Averages: 10 inferences
[05/06/2022-12:33:26] [I] Percentile: 99
[05/06/2022-12:33:26] [I] Dump refittable layers:Disabled
[05/06/2022-12:33:26] [I] Dump output: Disabled
[05/06/2022-12:33:26] [I] Profile: Disabled
[05/06/2022-12:33:26] [I] Export timing to JSON file:
[05/06/2022-12:33:26] [I] Export output to JSON file:
[05/06/2022-12:33:26] [I] Export profile to JSON file:
[05/06/2022-12:33:26] [I]
[05/06/2022-12:33:26] [I] === Device Information ===
[05/06/2022-12:33:26] [I] Selected Device: NVIDIA Tesla V100-PCIE-16GB
[05/06/2022-12:33:26] [I] Compute Capability: 7.0
[05/06/2022-12:33:26] [I] SMs: 80
[05/06/2022-12:33:26] [I] Compute Clock Rate: 1.38 GHz
[05/06/2022-12:33:26] [I] Device Global Memory: 16160 MiB
[05/06/2022-12:33:26] [I] Shared Memory per SM: 96 KiB
[05/06/2022-12:33:26] [I] Memory Bus Width: 4096 bits (ECC enabled)
[05/06/2022-12:33:26] [I] Memory Clock Rate: 0.877 GHz
[05/06/2022-12:33:26] [I]
----------------------------------------------------------------
Input filename: /root/.cache/torch/hub/onnx/vgg11.onnx
ONNX IR version: 0.0.6
Opset version: 10
Producer name: pytorch
Producer version: 1.8
Domain:
Model version: 0
Doc string:
----------------------------------------------------------------
[05/06/2022-12:33:40] [W] [TRT] /workspace/TensorRT/parsers/onnx/onnx2trt_utils.cpp:218: Your ONNX model has been generated with INT64 weights, while TensorRT does not natively support INT64. Attempting to cast down to INT32.
[05/06/2022-12:33:40] [I] [TRT] /workspace/TensorRT/parsers/onnx/ModelImporter.cpp:139: No importer registered for op: adaptive_avg_pool2d. Attempting to import as plugin.
[05/06/2022-12:33:40] [I] [TRT] /workspace/TensorRT/parsers/onnx/builtin_op_importers.cpp:3716: Searching for plugin: adaptive_avg_pool2d, plugin_version: 1, plugin_namespace:
[05/06/2022-12:33:40] [E] [TRT] INVALID_ARGUMENT: getPluginCreator could not find plugin adaptive_avg_pool2d version 1
While parsing node number 22 [adaptive_avg_pool2d]:
ERROR: /workspace/TensorRT/parsers/onnx/builtin_op_importers.cpp:3718 In function importFallbackPluginImporter:
[8] Assertion failed: creator && "Plugin not found, are the plugin name, version, and namespace correct?"
[05/06/2022-12:33:40] [E] Failed to parse onnx file
[05/06/2022-12:33:40] [E] Parsing model failed
[05/06/2022-12:33:40] [E] Engine creation failed
[05/06/2022-12:33:40] [E] Engine set up failed
&&&& FAILED TensorRT.trtexec # /opt/tensorrt/bin/trtexec --onnx=/root/.cache/torch/hub/onnx/vgg11.onnx --explicitBatch --optShapes=input:1x3x224x224 --workspace=8192 --iterations=105 --percentile=99
.
[2022-05-06 12:33:40,844 616b67a69ab7:18330][tensorrt_inference_performance.py:23][INFO] benchmark: tensorrt-inference, return code: 32, result: {'return_code': [32]}

It seems that the TensorRT ONNX importer does not support the adaptive_avg_pool2d op?

Please cc.

Run failed (Failed to get information on remote file)

What's the issue, what's expected?:
sb deploy -f local.ini

TASK [Copying Context] *********************************************************
fatal: [localhost]: FAILED! => {"msg": "Failed to get information on remote file (/home/edison/sb-workspace/.ssh/config): sudo: a password is required\n"}

PLAY RECAP *********************************************************************
localhost : ok=8 changed=2 unreachable=0 failed=1 skipped=0 rescued=0 ignored=0
[2022-11-21 17:31:49,645 u22:6029][ansible.py:80][WARNING] Run failed, return code 2.

log message:

[2022-11-21 18:27:12,121 u22:6465][file_handler.py:79][INFO] No benchmark config provided, using config file /home/edison/.local/lib/python3.10/site-packages/superbench/config/default.yaml.
[2022-11-21 18:27:12,156 u22:6465][ansible.py:59][INFO] {'host_pattern': 'all', 'cmdline': '--forks 1 --inventory /home/edison/Downloads/superbenchmark/local.ini'}
[2022-11-21 18:27:12,163 u22:6465][runner.py:42][INFO] Runner uses config: {'superbench': {'benchmarks': {'bert_models': {'enable': True,
'frameworks': ['pytorch'],
'models': ['bert-base',
'bert-large'],
'modes': [{'name': 'torch.distributed',
'node_num': 1,
'proc_num': 8}],
'parameters': {'batch_size': 1,
'duration': 0,
'model_action': ['train'],
'num_steps': 128,
'num_warmup': 16,
'precision': ['float32',
'float16']}},
'computation-communication-overlap': {'enable': True,
'frameworks': ['pytorch'],
'modes': [{'name': 'torch.distributed',
'node_num': 1,
'proc_num': 8}]},
'cpu-memory-bw-latency': {'enable': False,
'modes': [{'name': 'local',
'parallel': False,
'proc_num': 1}],
'parameters': {'tests': ['bandwidth_matrix',
'latency_matrix',
'max_bandwidth']}},
'cublas-function': {'enable': True,
'modes': [{'name': 'local',
'parallel': True,
'prefix': 'CUDA_VISIBLE_DEVICES={proc_rank}',
'proc_num': 8}]},
'cudnn-function': {'enable': True,
'modes': [{'name': 'local',
'parallel': True,
'prefix': 'CUDA_VISIBLE_DEVICES={proc_rank}',
'proc_num': 8}]},
'densenet_models': {'enable': True,
'frameworks': ['pytorch'],
'models': ['densenet169',
'densenet201'],
'modes': [{'name': 'torch.distributed',
'node_num': 1,
'proc_num': 8}],
'parameters': {'batch_size': 1,
'duration': 0,
'model_action': ['train'],
'num_steps': 128,
'num_warmup': 16,
'precision': ['float32',
'float16']}},
'disk-benchmark': {'enable': False,
'modes': [{'name': 'local',
'parallel': False,
'proc_num': 1}],
'parameters': {'block_devices': ['/dev/nvme0n1']}},
'gemm-flops': {'enable': True,
'modes': [{'name': 'local',
'parallel': True,
'prefix': 'CUDA_VISIBLE_DEVICES={proc_rank}',
'proc_num': 8}]},
'gpcnet-network-load-test': {'enable': False,
'modes': [{'env': {'UCX_NET_DEVICES': 'mlx5_0:1'},
'mca': {'btl': '^uct',
'btl_tcp_if_include': 'eth0',
'pml': 'ucx'},
'name': 'mpi',
'proc_num': 1}]},
'gpcnet-network-test': {'enable': False,
'modes': [{'env': {'UCX_NET_DEVICES': 'mlx5_0:1'},
'mca': {'btl': '^uct',
'btl_tcp_if_include': 'eth0',
'pml': 'ucx'},
'name': 'mpi',
'proc_num': 1}]},
'gpt_models': {'enable': True,
'frameworks': ['pytorch'],
'models': ['gpt2-small',
'gpt2-large'],
'modes': [{'name': 'torch.distributed',
'node_num': 1,
'proc_num': 8}],
'parameters': {'batch_size': 1,
'duration': 0,
'model_action': ['train'],
'num_steps': 128,
'num_warmup': 16,
'precision': ['float32',
'float16']}},
'gpu-burn': {'enable': True,
'modes': [{'name': 'local',
'parallel': False,
'proc_num': 1}],
'parameters': {'doubles': True,
'tensor_core': True,
'time': 300}},
'gpu-copy-bw:correctness': {'enable': True,
'modes': [{'name': 'local',
'parallel': False}],
'parameters': {'check_data': True,
'copy_type': ['sm',
'dma'],
'mem_type': ['htod',
'dtoh',
'dtod'],
'num_loops': 1,
'num_warm_up': 0,
'size': 4096}},
'gpu-copy-bw:perf': {'enable': True,
'modes': [{'name': 'local',
'parallel': False}],
'parameters': {'copy_type': ['sm',
'dma'],
'mem_type': ['htod',
'dtoh',
'dtod']}},
'ib-loopback': {'enable': True,
'modes': [{'name': 'local',
'parallel': True,
'prefix': 'PROC_RANK={proc_rank} '
'IB_DEVICES=0,2,4,6 '
'NUMA_NODES=1,0,3,2',
'proc_num': 4},
{'name': 'local',
'parallel': True,
'prefix': 'PROC_RANK={proc_rank} '
'IB_DEVICES=1,3,5,7 '
'NUMA_NODES=1,0,3,2',
'proc_num': 4}]},
'ib-traffic': {'enable': False,
'modes': [{'name': 'mpi',
'proc_num': 8}],
'parameters': {'gpu_dev': '$LOCAL_RANK',
'ib_dev': 'mlx5_$LOCAL_RANK',
'msg_size': 8388608,
'numa_dev': '$((LOCAL_RANK/2))'}},
'kernel-launch': {'enable': True,
'modes': [{'name': 'local',
'parallel': True,
'prefix': 'CUDA_VISIBLE_DEVICES={proc_rank}',
'proc_num': 8}]},
'lstm_models': {'enable': True,
'frameworks': ['pytorch'],
'models': ['lstm'],
'modes': [{'name': 'torch.distributed',
'node_num': 1,
'proc_num': 8}],
'parameters': {'batch_size': 1,
'duration': 0,
'model_action': ['train'],
'num_steps': 128,
'num_warmup': 16,
'precision': ['float32',
'float16']}},
'matmul': {'enable': True,
'frameworks': ['pytorch'],
'modes': [{'name': 'local',
'parallel': True,
'prefix': 'CUDA_VISIBLE_DEVICES={proc_rank}',
'proc_num': 8}]},
'mem-bw': {'enable': True,
'modes': [{'name': 'local',
'parallel': False,
'prefix': 'CUDA_VISIBLE_DEVICES={proc_rank} '
'numactl -N '
'$(({proc_rank}/2))',
'proc_num': 8}]},
'nccl-bw:default': {'enable': True,
'modes': [{'name': 'local',
'parallel': False,
'proc_num': 1}],
'parameters': {'ngpus': 8}},
'nccl-bw:gdr-only': {'enable': True,
'modes': [{'env': {'NCCL_IB_DISABLE': '0',
'NCCL_IB_PCI_RELAXED_ORDERING': '1',
'NCCL_MIN_NCHANNELS': '16',
'NCCL_NET_GDR_LEVEL': '5',
'NCCL_P2P_DISABLE': '1',
'NCCL_SHM_DISABLE': '1'},
'name': 'local',
'parallel': False,
'proc_num': 1}],
'parameters': {'ngpus': 8}},
'ort-inference': {'enable': True,
'modes': [{'name': 'local',
'parallel': True,
'prefix': 'CUDA_VISIBLE_DEVICES={proc_rank}',
'proc_num': 8}],
'parameters': {'batch_size': 1}},
'resnet_models': {'enable': True,
'frameworks': ['pytorch'],
'models': ['resnet50',
'resnet101',
'resnet152'],
'modes': [{'name': 'torch.distributed',
'node_num': 1,
'proc_num': 8}],
'parameters': {'batch_size': 1,
'duration': 0,
'model_action': ['train'],
'num_steps': 128,
'num_warmup': 16,
'precision': ['float32',
'float16']}},
'sharding-matmul': {'enable': True,
'frameworks': ['pytorch'],
'modes': [{'name': 'torch.distributed',
'node_num': 1,
'proc_num': 8}]},
'tcp-connectivity': {'enable': False,
'modes': [{'name': 'local',
'parallel': False}],
'parameters': {'port': 22}},
'tensorrt-inference': {'enable': True,
'modes': [{'name': 'local',
'parallel': True,
'prefix': 'CUDA_VISIBLE_DEVICES={proc_rank}',
'proc_num': 8}],
'parameters': {'batch_size': 1,
'precision': 'int8',
'pytorch_models': ['resnet50',
'resnet101',
'resnet152',
'densenet169',
'densenet201',
'bert-base',
'bert-large'],
'seq_length': 224}},
'vgg_models': {'enable': True,
'frameworks': ['pytorch'],
'models': ['vgg11',
'vgg13',
'vgg16',
'vgg19'],
'modes': [{'name': 'torch.distributed',
'node_num': 1,
'proc_num': 8}],
'parameters': {'batch_size': 1,
'duration': 0,
'model_action': ['train'],
'num_steps': 128,
'num_warmup': 16,
'precision': ['float32',
'float16']}}},
'enable': None,
'monitor': {'enable': True,
'sample_duration': 1,
'sample_interval': 10},
'var': {'common_model_config': {'batch_size': 1,
'duration': 0,
'model_action': ['train'],
'num_steps': 128,
'num_warmup': 16,
'precision': ['float32',
'float16']},
'default_local_mode': {'enable': True,
'modes': [{'name': 'local',
'parallel': True,
'prefix': 'CUDA_VISIBLE_DEVICES={proc_rank}',
'proc_num': 8}]},
'default_pytorch_mode': {'enable': True,
'frameworks': ['pytorch'],
'modes': [{'name': 'torch.distributed',
'node_num': 1,
'proc_num': 8}]}}},
'version': 'v0.6'}.
[2022-11-21 18:27:12,163 u22:6465][runner.py:43][INFO] Runner writes to: /home/edison/Downloads/superbenchmark/outputs/2022-11-21_18-27-12.
[2022-11-21 18:27:12,179 u22:6465][runner.py:48][INFO] Runner will run: ['gpu-burn', 'nccl-bw:default', 'nccl-bw:gdr-only', 'ib-loopback', 'mem-bw', 'gpu-copy-bw:correctness', 'gpu-copy-bw:perf', 'kernel-launch', 'gemm-flops', 'cudnn-function', 'cublas-function', 'matmul', 'sharding-matmul', 'computation-communication-overlap', 'ort-inference', 'tensorrt-inference', 'gpt_models', 'bert_models', 'lstm_models', 'resnet_models', 'densenet_models', 'vgg_models']
[2022-11-21 18:27:12,179 u22:6465][runner.py:165][INFO] Preparing SuperBench environment.
[2022-11-21 18:27:12,179 u22:6465][ansible.py:125][INFO] Run playbook deploy.yaml ...

PLAY [Facts Gathering] *********************************************************

TASK [Gathering Facts] *********************************************************
ok: [localhost]

PLAY [Context Preparation] *****************************************************

TASK [Generating SSH Config] ***************************************************
changed: [localhost]

TASK [Generating SSH Key Pair] *************************************************
changed: [localhost]

PLAY [Check GPU Environment] ***************************************************

TASK [Checking NVIDIA GPU Environment] *****************************************
ok: [localhost] => (item=/dev/nvidiactl)
ok: [localhost] => (item=/dev/nvidia-uvm)

TASK [Checking AMD GPU Environment] ********************************************
ok: [localhost] => (item=/dev/kfd)
ok: [localhost] => (item=/dev/dri)

TASK [Set GPU Facts] ***********************************************************
ok: [localhost]

TASK [Print GPU Checking Result] ***********************************************
ok: [localhost] => {
"msg": [
"NVIDIA GPU detected",
"AMD GPU not operational, pls confirm amdgpu kernel module is loaded"
]
}

PLAY [Remote Deployment] *******************************************************

TASK [Creating Workspace] ******************************************************
ok: [localhost] => (item=/home/edison/sb-workspace)
ok: [localhost] => (item=/home/edison/sb-workspace/.ssh)

TASK [Copying Context] *********************************************************
fatal: [localhost]: FAILED! => {"msg": "Failed to get information on remote file (/home/edison/sb-workspace/.ssh/config): sudo: a password is required\n"}

PLAY RECAP *********************************************************************
localhost : ok=8 changed=2 unreachable=0 failed=1 skipped=0 rescued=0 ignored=0
[2022-11-21 18:27:14,383 u22:6465][ansible.py:80][WARNING] Run failed, return code 2.

How to reproduce it?:
sudo apt-get install sshpass
git clone -b v0.6.0 https://github.com/microsoft/superbenchmark
cd superbenchmark
python3 -m pip install .
make postinstall

create local.ini
[all]
localhost ansible_connection=local

sb deploy -f local.ini

Log message or snapshot?:

Additional information:
sb --version

0.6.0
Python (Linux) 3.10.6 (main, Nov 2 2022, 18:53:38) [GCC 11.3.0]
Python location '/usr/bin/python3'

sshpass -V

sshpass 1.09

OS: Ubuntu 22.04
NVIDIA Driver: 525.53

TensorRT parameter passing can be enhanced

trtexec has many arguments, but SuperBench covers only a fraction of them, and the defaults that trtexec sets are not suitable for benchmarking our program. For example, when I use SuperBench to profile resnet50.onnx with the TensorRT backend, the command that SuperBench generates is:

/opt/tensorrt/bin/trtexec --onnx=/workspace/v-leiwang3/.torch/hub/onnx/resnet50.onnx --explicitBatch --optShapes=input:1x3x224x224 --workspace=8192 --iterations=105 --percentile=99.

However, I found that this command ran more than 200 executions on our V100 GPU. The cause is the default value of --duration, which is 3: trtexec profiles the model for at least 3 seconds, whereas 100 iterations of resnet50 take only about 1.5 seconds. The default value of --duration should therefore be set to 0 so that trtexec strictly executes the given number of iterations.

For warmup, trtexec also provides the --warmUp option, so my expected command is:

/opt/tensorrt/bin/trtexec --onnx=/workspace/v-leiwang3/.torch/hub/onnx/resnet50.onnx --explicitBatch --optShapes=input:1x3x224x224 --workspace=8192 --fp16 --avgRuns=10 --warmUp=5 --iterations=100 --percentile=99. --duration=0
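A minimal sketch of how such a command line could be assembled, assuming a hypothetical helper named `build_trtexec_command` (it is not part of SuperBench). It only uses the trtexec flags quoted in this issue, and it defaults `--duration` to 0 so that the iteration count, not the time-based minimum, bounds the run. The parameter names and defaults are illustrative assumptions.

```python
from typing import List


def build_trtexec_command(
    onnx_path: str,
    input_shape: str,
    iterations: int = 100,
    warm_up: int = 5,
    avg_runs: int = 10,
    duration: int = 0,
    workspace: int = 8192,
    fp16: bool = True,
    trtexec: str = '/opt/tensorrt/bin/trtexec',
) -> List[str]:
    """Assemble a trtexec argument list (hypothetical helper, for illustration only)."""
    cmd = [
        trtexec,
        f'--onnx={onnx_path}',
        '--explicitBatch',
        f'--optShapes={input_shape}',
        f'--workspace={workspace}',
    ]
    if fp16:
        cmd.append('--fp16')
    cmd += [
        f'--avgRuns={avg_runs}',
        f'--warmUp={warm_up}',
        f'--iterations={iterations}',
        '--percentile=99',
        # duration=0 removes the time-based minimum described in this issue.
        f'--duration={duration}',
    ]
    return cmd


cmd = build_trtexec_command('resnet50.onnx', 'input:1x3x224x224')
print(' '.join(cmd))
```

Returning a list rather than a single string keeps each argument separate, which avoids quoting problems if the command is later passed to `subprocess.run`.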

V0.9.0 Release Plan

Release Manager

@cp5555

Endgame

Code freeze: July 5th, 2023
Bug Bash date: July 8th, 2023
Release date: July 19th, 2023

Main Features

SuperBench Improvement

- Support Ctrl+C and interrupt to stop all SuperBench testing. (#530)
- Support CPU docker (#480)
- Support Windows Docker for VDI/Gaming GPU (#534)
- Support DirectX for Nvidia and AMD GPU (#536)
- Add System Config Info feature in SB runner. (#532)
- Support DirectX test pipeline (#545)

Micro-benchmark Improvement

- Add DirectXGPUCopyBw Benchmark to measure HtoD/DtoH bandwidth (#486 and #546)
- Add DirectXGPUCoreFLops Benchmark to measure peak FLOPS (#488 and #542)
- Add DirectXGPUMemBw Benchmark to measure GPU memory bandwidth (#487 and #547)
- Add DirectXVCNEncodingLatency Benchmark to measure the VCN hardware encoding latency (#543 and #548)
- Support best algorithm selection in cudnn-function. (Related to #384) (#540)
- Revise step time collection in distributed inference benchmark (#524)

Model Benchmark Improvement

- Fix early stop logic due to num_steps. (#522)
- Support TensorRT models on Nvidia H100 (#541)

Documentation

- Document Improvement (#528, #529)
- Improve documentation for System Config Info. (#532)
- Update outdated references (#539)
- Update outdated references in micro-benchmarks.md (#544)

Backlog

Micro-benchmark Improvement

Add HPL random generator to gemm-flops with ROCm (Related to #518)
Support Monitoring for AMD GPUs
Support cuDNN Backend API in cudnn-function.
Add DirectXGPURenderFPS Benchmark to measure the FPS of rendering simple frames

Inference Benchmark Improvement

Support VGG, LSTM, and GPT-2 small in TensorRT Inference Backend
Support VGG, LSTM, and GPT-2 small in ORT Inference Backend
Support more TensorRT parameters (Related to #366)

VGG models failed on A100 GPU with batch_size=128

What's the issue, what's expected?:

VGG models fail on an A100 GPU with batch_size=128, reporting an NCCL error or crashing the OS. All commands were run in a Python 3.8 venv.

How to reproduce it?:

Copy default.yaml to the current working directory (the root directory of the superbenchmark repo) and name it "config.yaml".
Change the parameter in config.yaml: set batch_size=128 for vgg_models.
Set "enable: vgg_models" to run the VGG models only.
Run "sb run -f ./host.ini -c config.yaml".
The process fails with an NCCL error.

Log message or snapshot?:

Additional information:

NVIDIA Driver version: 460.39
Python version: 3.8
PyTorch version: 1.8

V0.6.0 Release Plan

Release Manager

@cp5555

Endgame

Code freeze: August 22nd
Bug Bash date: August 22nd
Release date: September 4th

Main Features

SuperBench Improvement

- Support running on host directly without Docker (#358, #362)
- Support running sb command inside docker image (#356)
- Support ROCm 5.1.1 (#353, #354)
- Support ROCm 5.1.3 (#361)
- Fix bugs in data diagnosis (#355)
- Fix cmake and build issues (#360)
- Support automatic configuration yaml selection on Azure VM (#365)
- Refine error message when GPU is not detected. (#368)
- Add return code for Timeout (#383)
- Update Dockerfile for NCCL/RCCL version, tag name, and verbose output. (#371)
- Support node_num=1 in mpi mode (#372)
- Update Python setup for require packages (#387)
- Enhance parameter parsing to allow spaces in value (#397)
- Support NO_COLOR for SuperBench output (#404)

Micro-benchmark Improvement

- Fix issues in ib loopback benchmark (#369)
- Fix stability issue in ib loopback benchmark (#386)

Distributed Benchmark Improvement

- Pair-wise IB benchmark (#363)
- Bug Fix in IB benchmark (#370, #375, #377, #396)
- Topology-aware IB benchmark (#373, #381)

Data Diagnosis & Analysis

- Add failure check function in data_diagnosis.py (#378)
- Support Json and Jsonl in Diagnosis. (#388)
- Add support to store values of metrics in data diagnosis. (#392, #399)
- Support exit code of sb result diagnosis (#403)
- Format int type and unify empty value to N/A in diagnosis output files (#406)

Backlog

Inference Benchmark Improvement

Support VGG, LSTM, and GPT-2 small in TensorRT Inference Backend
Support VGG, LSTM, and GPT-2 small in ORT Inference Backend
Support more TensorRT parameters (Related to #366)

Data Diagnosis & Analysis

Support boxplot and outlier analysis

Document

Metric Reasoning Doc

V0.4.0 Test Plan

Test Table

Machine Type	#Node * #GPU * GPU Type	PyTorch Version	Accelerated Computing Toolkit	Status
ND A100 v4	1 * 8 * A100 40GB SXM	PyTorch 1.8	CUDA 11.1	Succeeded
NDm A100 v4	1 * 8 * A100 80GB SXM	PyTorch 1.8	CUDA 11.1	Done
ND A100 v4	2 * 8 * A100 40GB SXM	PyTorch 1.8	CUDA 11.1	Succeeded
HPE	1 * 8 * MI100	PyTorch 1.7	ROCm 4.2	Succeeded
HPE	2 * 8 * MI100	PyTorch 1.7	ROCm 4.2	Succeeded
NC64as_T4_v3	4 * T4	PyTorch 1.8	CUDA 11.1	Done

Test Cases

Benchmarks Test

Model Benchmarks

ORT model on AMD

Support FP32 mode without TF32

Micro Benchmarks

GPU Memory Validation

GPU Copy Bandwidth

ORT inference on Nvidia

TensorRT inference on Nvidia

Micro Benchmarks(distributed)

IB validation

TCP validation

GPCNet validation

NCCL/RCCL

Tools

- Monitor
- Data diagnosis

Other Features

- Docker Images Check

sb should return non-zero exit code when executor.py failed

What's the issue, what's expected?:
This is using the v0.6.0 release. The gemm-flops benchmark is run on a platform where the GPU is probably not supported (Tesla K80). SuperBench hits an internal error such as "Executor failed in gemm-flops, invalid context." However, at the end, it returns exit code 0.

Expected: it should return a non-zero exit code for this type of error.

How to reproduce it?:
On a VM with Tesla K80 GPU (or CPU), run gemm-flops benchmark.

Log message or snapshot?:
[2022-09-09 20:38:20,578 N000000:38919][runner.py:392][INFO] Runner is going to run gemm-flops in local mode, proc rank 1.
[2022-09-09 20:38:20,580 N000000:38919][ansible.py:107][INFO] Run docker exec --env-file /tmp/sb.env sb-workspace bash -c 'PROC_RANK=1 CUDA_VISIBLE_DEVICES=1 timeout 1200 sb exec --output-dir outputs/2022-09-09_20-38-14 -c sb.config.yaml -C superbench.enable=gemm-flops' on remote ...
[2022-09-09 20:38:20,580 N000000:38919][ansible.py:72][INFO] Run as sudo ...

localhost | CHANGED | rc=0 >>
[2022-09-09 20:38:22,577 N000000:246][executor.py:235][INFO] Executor is going to execute gemm-flops.
[2022-09-09 20:38:23,363 N000000:246][registry.py:255][WARNING] Benchmark has no implementation, name: gemm-flops, platform: CPU
[2022-09-09 20:38:23,364 N000000:246][executor.py:132][ERROR] Executor failed in gemm-flops, invalid context.

localhost | CHANGED | rc=0 >>
[2022-09-09 20:38:22,702 N000000:260][executor.py:235][INFO] Executor is going to execute gemm-flops.
[2022-09-09 20:38:23,479 N000000:260][registry.py:255][WARNING] Benchmark has no implementation, name: gemm-flops, platform: CPU
[2022-09-09 20:38:23,479 N000000:260][executor.py:132][ERROR] Executor failed in gemm-flops, invalid context.
[2022-09-09 20:38:23,731 N000000:38918][ansible.py:78][INFO] Run succeed, return code 0.
[2022-09-09 20:38:23,860 N000000:38919][ansible.py:78][INFO] Run succeed, return code 0.
[2022-09-09 20:38:23,862 N000000:38433][ansible.py:125][INFO] Run playbook fetch_results.yaml ...

PLAY [Fetch Results] ***********************************************************

TASK [Gathering Facts] *********************************************************
ok: [localhost]

TASK [Synchronize Output Directory] ********************************************
changed: [localhost]

PLAY RECAP *********************************************************************
localhost : ok=2 changed=1 unreachable=0 failed=0 skipped=0 rescued=0 ignored=0
[2022-09-09 20:38:26,514 N000000:38433][ansible.py:78][INFO] Run succeed, return code 0.
[2022-09-09 20:38:26,516 N000000:38433][runner.py:256][ERROR] Invalid content in JSON file: /home/aiscadmin/superbench/outputs/2022-09-09_20-38-14/nodes/N000000/benchmarks/gemm-flops/rank0/results.json
[2022-09-09 20:38:26,516 N000000:38433][runner.py:256][ERROR] Invalid content in JSON file: /home/aiscadmin/superbench/outputs/2022-09-09_20-38-14/nodes/N000000/benchmarks/gemm-flops/rank1/results.json
2022-09-09 20:38:26.746808: Command exit code: 0
Finished all. errors=0, runtime=13.1 s
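The expected behaviour can be sketched as follows, assuming a hypothetical wrapper that scans the captured runner output for `[ERROR]` entries in the log format shown above and derives a non-zero exit code from them. The function names are illustrative and are not SuperBench APIs.

```python
import re
from typing import List

# Matches the log format shown above: [timestamp host:pid][file:line][LEVEL] message
LOG_LINE = re.compile(
    r'^\[[^\]]+\]\[(?P<source>[^\]]+)\]\[(?P<level>[A-Z]+)\]\s*(?P<msg>.*)$'
)


def find_errors(log_text: str) -> List[str]:
    """Return 'source: message' for every ERROR-level log line."""
    errors = []
    for line in log_text.splitlines():
        match = LOG_LINE.match(line.strip())
        if match and match.group('level') == 'ERROR':
            errors.append(f"{match.group('source')}: {match.group('msg')}")
    return errors


def exit_code_for(log_text: str) -> int:
    """Return 1 if any ERROR line was logged, otherwise 0."""
    return 1 if find_errors(log_text) else 0


sample = """\
[2022-09-09 20:38:22,577 N000000:246][executor.py:235][INFO] Executor is going to execute gemm-flops.
[2022-09-09 20:38:23,364 N000000:246][executor.py:132][ERROR] Executor failed in gemm-flops, invalid context.
"""
print(exit_code_for(sample))
```

A wrapper script could pass this value to `sys.exit` so that CI pipelines see the failure even while `sb` itself returns 0.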

Additional information:

Any plan to support Nvidia Jetson XX family?

What would you like to be added:

Support for nvidia Jetson devices

Why is this needed:

ML model execution on edge devices (NVIDIA Jetson)

third_party build error when using CUDA Dockerfile

When I try to build the Docker image using dockerfile/cuda11.1.1.dockerfile, I get the following error:

~/superbenchmark main !1 ?2 ❯ docker buildx build \             
  --platform linux/amd64 --cache-to type=inline,mode=max \
  --tag superbench-dev --file dockerfile/cuda11.1.1.dockerfile .
[+] Building 665.5s (16/18)
 => [internal] load build definition from cuda11.1.1.dockerfile                                                0.0s
 => => transferring dockerfile: 4.00kB                                                                         0.0s
 => [internal] load .dockerignore                                                                              0.0s
 => => transferring context: 35B                                                                               0.0s
 => [internal] load metadata for nvcr.io/nvidia/pytorch:20.12-py3                                              1.4s
 => [internal] load build context                                                                              0.6s
 => => transferring context: 788.47kB                                                                          0.5s
 => [ 1/14] FROM nvcr.io/nvidia/pytorch:20.12-py3@sha256:cc14c0cf580989bb1ff39fa78ca697b77a8860b17acead4a60b8  0.0s
 => CACHED [ 2/14] RUN apt-get update &&     apt-get install -y --no-install-recommends     autoconf     auto  0.0s
 => [ 3/14] RUN cd /tmp &&     wget https://download.docker.com/linux/static/stable/x86_64/docker-20.10.8.tgz  9.5s
 => [ 4/14] RUN mkdir -p /root/.ssh &&     touch /root/.ssh/authorized_keys &&     mkdir -p /var/run/sshd &&   0.6s
 => [ 5/14] RUN cd /tmp &&     wget -q http://content.mellanox.com/ofed/MLNX_OFED-5.2-2.2.3.0/MLNX_OFED_LIN  277.4s
 => [ 6/14] RUN cd /opt &&     wget -q https://azhpcstor.blob.core.windows.net/azhpc-images-store/hpcx-v2.8.  62.9s
 => [ 7/14] RUN cd /tmp &&     git clone https://github.com/Mellanox/nccl-rdma-sharp-plugins.git &&     cd n  22.1s
 => [ 8/14] RUN cd /tmp &&     git clone -b v2.10.3-1 https://github.com/NVIDIA/nccl.git &&     cd nccl &&   264.6s
 => [ 9/14] RUN cd /tmp &&     mkdir -p mlc &&     cd mlc &&     wget --user-agent="Mozilla/5.0 (X11; Fedora;  0.8s
 => [10/14] WORKDIR /opt/superbench                                                                            0.1s
 => [11/14] ADD third_party third_party                                                                        0.1s
 => ERROR [12/14] RUN make -j 40 -C third_party cuda                                                          25.8s
------
 > [12/14] RUN make -j 40 -C third_party cuda:
#0 0.415 make: Entering directory '/opt/superbench/third_party'
#0 0.415 mkdir -p /opt/superbench/bin
#0 0.418 mkdir -p /opt/superbench/lib
#0 0.445 if [ -d cuda-samples ]; then rm -rf cuda-samples; fi
#0 0.445 bash -c "source /opt/hpcx/hpcx-init.sh && hpcx_load && make CC=mpicc -C GPCNET all && hpcx_unload"
#0 0.465 git clone -b v11.1 https://github.com/NVIDIA/cuda-samples.git ./cuda-samples
#0 0.468 Cloning into './cuda-samples'...
#0 0.493 make[1]: Entering directory '/opt/superbench/third_party'
#0 0.493 make[1]: warning: jobserver unavailable: using -j1.  Add '+' to parent make rule.
#0 0.495 make[1]: Leaving directory '/opt/superbench/third_party/GPCNET'
#0 0.495 make[1]: *** No rule to make target 'all'.  Stop.
#0 0.496 make: *** [Makefile:98: gpcnet] Error 2
#0 0.496 make: *** Waiting for unfinished jobs....
#0 20.08 Note: switching to 'c4e2869a2becb4b6d9ce5f64914406bf5e239662'.
#0 20.08
#0 20.08 You are in 'detached HEAD' state. You can look around, make experimental
#0 20.08 changes and commit them, and you can discard any commits you make in this
#0 20.08 state without impacting any branches by switching back to a branch.
#0 20.08
#0 20.08 If you want to create a new branch to retain commits you create, you may
#0 20.08 do so (now or later) by using -c with the switch command. Example:
#0 20.08
#0 20.08   git switch -c <new-branch-name>
#0 20.08
#0 20.08 Or undo this operation with:
#0 20.08
#0 20.08   git switch -
#0 20.08
#0 20.08 Turn off this advice by setting config variable advice.detachedHead to false
#0 20.08
#0 20.56 cd ./cuda-samples/Samples/bandwidthTest && make clean && make TARGET_ARCH=x86_64 SMS="70 75 80 86"
#0 20.59 make[1]: warning: jobserver unavailable: using -j1.  Add '+' to parent make rule.
#0 20.59 make[1]: Entering directory '/opt/superbench/third_party/cuda-samples/Samples/bandwidthTest'
#0 20.61 rm -f bandwidthTest bandwidthTest.o
#0 20.62 rm -rf ../../bin/x86_64/linux/release/bandwidthTest
#0 20.62 make[1]: Leaving directory '/opt/superbench/third_party/cuda-samples/Samples/bandwidthTest'
#0 20.62 make[1]: warning: jobserver unavailable: using -j1.  Add '+' to parent make rule.
#0 20.62 make[1]: Entering directory '/opt/superbench/third_party/cuda-samples/Samples/bandwidthTest'
#0 20.65 /usr/local/cuda/bin/nvcc -ccbin g++ -I../../Common  -m64    -gencode arch=compute_70,code=sm_70 -gencode arch=compute_75,code=sm_75 -gencode arch=compute_80,code=sm_80 -gencode arch=compute_86,code=sm_86 -gencode arch=compute_86,code=compute_86 -o bandwidthTest.o -c bandwidthTest.cu
#0 25.31 /usr/local/cuda/bin/nvcc -ccbin g++   -m64      -gencode arch=compute_70,code=sm_70 -gencode arch=compute_75,code=sm_75 -gencode arch=compute_80,code=sm_80 -gencode arch=compute_86,code=sm_86 -gencode arch=compute_86,code=compute_86 -o bandwidthTest bandwidthTest.o
#0 25.62 mkdir -p ../../bin/x86_64/linux/release
#0 25.62 cp bandwidthTest ../../bin/x86_64/linux/release
#0 25.63 make[1]: Leaving directory '/opt/superbench/third_party/cuda-samples/Samples/bandwidthTest'
#0 25.63 cp -v ./cuda-samples/Samples/bandwidthTest/bandwidthTest /opt/superbench/bin/
#0 25.63 './cuda-samples/Samples/bandwidthTest/bandwidthTest' -> '/opt/superbench/bin/bandwidthTest'
#0 25.63 make: Leaving directory '/opt/superbench/third_party'
------
error: failed to solve: executor failed running [/bin/sh -c make -j ${NUM_MAKE_JOBS} -C third_party cuda]: exit code: 2

I cloned the recent main branch; the commit hash is a9634ef.

The problem is in step 12 of the docker build.

Please help. Thanks.

Passing multiple test configurations to cublas microbenchmark

What would you like to be added:
Support multiple configurations as a list in the YAML config file for cublas testing

Why is this needed:
So that a sweep of custom tests can be created and run

Without this feature, how does current superbenchmark work:
Use the default list of test configs or pass only 1 test config

Components that may involve changes:
The logic in https://github.com/microsoft/superbenchmark/blob/main/superbench/benchmarks/micro_benchmarks/cublas_function.py#L248
should be able to handle a list of different dictionaries
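A minimal sketch of the requested behaviour, assuming the custom configuration arrives as a JSON string that holds either a single dictionary or a list of dictionaries. The helper name `parse_custom_configs` and the example keys (`name`, `m`, `n`, `k`) are hypothetical and do not reflect the actual schema used in cublas_function.py.

```python
import json
from typing import Any, Dict, List


def parse_custom_configs(config_json_str: str) -> List[Dict[str, Any]]:
    """Normalize one config or a list of configs into a list (hypothetical helper)."""
    parsed = json.loads(config_json_str)
    if isinstance(parsed, dict):
        # Keep supporting the existing single-config form.
        return [parsed]
    if isinstance(parsed, list) and all(isinstance(item, dict) for item in parsed):
        return parsed
    raise ValueError('config must be a JSON object or a list of JSON objects')


single = parse_custom_configs('{"name": "cublasSgemm", "m": 512, "n": 512, "k": 512}')
sweep = parse_custom_configs(
    '[{"name": "cublasSgemm", "m": 512, "n": 512, "k": 512},'
    ' {"name": "cublasSgemm", "m": 1024, "n": 1024, "k": 1024}]'
)
print(len(single), len(sweep))
```

Normalizing to a list at the parsing step means the code that runs each test can iterate without caring which form the user supplied.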

Brief description of your proposal if any:

V0.6.0 Test Plan

Test Cases

single-node test

Machine Type	#Node * #GPU * GPU Type	PyTorch Version	Accelerated Computing Toolkit	Status
ND A100 v4	1 * 8 * A100 40GB SXM	PyTorch 1.8	CUDA 11.1	Done
NDm A100 v4	1 * 8 * A100 80GB SXM	PyTorch 1.8	CUDA 11.1	Done
Hayabusa	1 * 16 * MI200	PyTorch 1.9	ROCm 5.1	Done

single-node Micro-benchmark Test

ib-loopback

Fix issues in ib loopback benchmark (#369)

Fix stability issue in ib loopback benchmark (#386)

Fix port conflict in ib loopback (#375)

Rccl-test/nccl-test

Update Dockerfile for NCCL/RCCL version, tag name, and verbose output (#371)

Support node_num=1 in mpi mode (#372)

SuperBench Improvement

Support running on host directly without Docker (#356, #358, #362)

Support automatic configuration yaml selection on Azure VM

Add return code for Timeout (#383, #385)

Support ROCm 5.1.1 (#353, #354), Support ROCm 5.1.3 (#361)

Tools

data diagnosis

Fix bugs in data diagnosis (#355)

Add failure check function in data_diagnosis.py (#378)

Support Json and Jsonl in Diagnosis. (#388)

Add support to store values of metrics in data diagnosis. (#392)

New in bug bash

Make baseline file optional in data diagnosis and fix bugs (#399)

Update error handling to support exit code of sb result diagnosis (#403)

Format int type and unify empty value to N/A in diagnosis output files (#406)

Upgrade colorlog for NO_COLOR support (#404)

Enhance timeout cleanup to avoid possible hanging (#405)

multiple-node test

Test Table

Machine Type	#Node * #GPU * GPU Type	PyTorch Version	Accelerated Computing Toolkit	Status
ND A100 v4	32 * 8 * A100 40GB SXM	PyTorch 1.8	CUDA 11.1	Done

distributed Micro-benchmark test

ib-traffic

Support multiple IB/GPU Pair-wise IB benchmark (#363)

Bug Fix in IB benchmark in all-pair mode (#370, #377)

Topology-aware IB benchmark (#373, #381)

New in bug bash

Auto generate ibstat file by pssh (#402)

Enable latency test in ib traffic validation distributed benchmark (#396)

superbench 0.6.0-rc1: setup requires not existing pkg requests>=2.28.1

On a computer with Python 3.6.9, 'python -m pip install --upgrade pip' reports the following error:

ERROR: Could not find a version that satisfies the requirement requests>=2.28.1 (from superbench) (from versions: 0.2.0, 0.2.1, 0.2.2, 0.2.3, 0.2.4, 0.3.0, 0.3.1, 0.3.2, 0.3.3, 0.3.4, 0.4.0, 0.4.1, 0.5.0, 0.5.1, 0.6.0, 0.6.1, 0.6.2, 0.6.3, 0.6.4, 0.6.5, 0.6.6, 0.7.0, 0.7.1, 0.7.2, 0.7.3, 0.7.4, 0.7.5, 0.7.6, 0.8.0, 0.8.1, 0.8.2, 0.8.3, 0.8.4, 0.8.5, 0.8.6, 0.8.7, 0.8.8, 0.8.9, 0.9.0, 0.9.1, 0.9.2, 0.9.3, 0.10.0, 0.10.1, 0.10.2, 0.10.3, 0.10.4, 0.10.6, 0.10.7, 0.10.8, 0.11.1, 0.11.2, 0.12.0, 0.12.1, 0.13.0, 0.13.1, 0.13.2, 0.13.3, 0.13.4, 0.13.5, 0.13.6, 0.13.7, 0.13.8, 0.13.9, 0.14.0, 0.14.1, 0.14.2, 1.0.0, 1.0.1, 1.0.2, 1.0.3, 1.0.4, 1.1.0, 1.2.0, 1.2.1, 1.2.2, 1.2.3, 2.0.0, 2.0.1, 2.1.0, 2.2.0, 2.2.1, 2.3.0, 2.4.0, 2.4.1, 2.4.2, 2.4.3, 2.5.0, 2.5.1, 2.5.2, 2.5.3, 2.6.0, 2.6.1, 2.6.2, 2.7.0, 2.8.0, 2.8.1, 2.9.0, 2.9.1, 2.9.2, 2.10.0, 2.11.0, 2.11.1, 2.12.0, 2.12.1, 2.12.2, 2.12.3, 2.12.4, 2.12.5, 2.13.0, 2.14.0, 2.14.1, 2.14.2, 2.15.1, 2.16.0, 2.16.1, 2.16.2, 2.16.3, 2.16.4, 2.16.5, 2.17.0, 2.17.1, 2.17.2, 2.17.3, 2.18.0, 2.18.1, 2.18.2, 2.18.3, 2.18.4, 2.19.0, 2.19.1, 2.20.0, 2.20.1, 2.21.0, 2.22.0, 2.23.0, 2.24.0, 2.25.0, 2.25.1, 2.26.0, 2.27.0, 2.27.1)
ERROR: No matching distribution found for requests>=2.28.1

Expected: the setup should accept a lower version of requests, so that it works with lower versions of Python 3.
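One way the requirement could be relaxed is with PEP 508 environment markers, sketched below. This assumes that requests 2.27.1, the newest version pip lists for this interpreter in the error above, is sufficient on Python 3.6. The requirement list and the helper `requests_requirement` are hypothetical and are not the project's actual setup code.

```python
import sys
from typing import List, Tuple

# Hypothetical requirement strings using PEP 508 environment markers;
# pip evaluates the marker after ';' against the running interpreter.
REQUESTS_REQUIREMENTS: List[str] = [
    'requests>=2.27.1; python_version < "3.7"',
    'requests>=2.28.1; python_version >= "3.7"',
]


def requests_requirement(version: Tuple[int, int]) -> str:
    """Return the specifier that applies to the given (major, minor) version."""
    return 'requests>=2.27.1' if version < (3, 7) else 'requests>=2.28.1'


print(requests_requirement(sys.version_info[:2]))
```

The marker strings could be placed in an `install_requires` list, and the function only mirrors their logic so that the selection can be checked without running pip.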

V0.8.0 Release Plan

Release Manager

@cp5555

Endgame

Code freeze: March 28th, 2023
Bug Bash date: March 29th, 2023
Release date: April 7th, 2023

Main Features

SuperBench Improvement

- Support SuperBench Executor running on Windows (#475)
- Remove fixed rccl version in rocm5.1.x docker file (#476)
- Upgrade networkx version to fix installation compatibility issue (#478)
- Pin setuptools version to v65.7.0 (#483)
- Limit ansible_runner version for Python3.6 (#485)
- Support cgroup V2 when read system metrics in Monitor (#491, #502)
- Fix analyzer bug in python3.8 due to pandas api change (#504)
- Collect real-time GPU power in Monitor (#507)
- Remove unreachable condition when write host list (#512)
- Update to cuda12.1, nccl 2.17.1, hpcx 2.14, and mlc 3.10 (#513)
- Fix wrong unit of cpu-memory-bw-latency in doc (#515)

Micro-benchmark Improvement

- Add STREAM benchmark for sustainable memory bandwidth and the corresponding computation rate. (#473)
- Add HPL Benchmark for HPC Linpack Benchmark. (#482)
- Support flexible warmup and non-random data initialization in cublas-benchmark (#479)
- Support error tolerance in micro-benchmark for CuDNN function (#490, #506)
- Add distributed inference benchmark (#493 and #505)
- Support tensor core precisions (e.g., FP8) and batch/shape range in cublaslt gemm (#492, #494, and #503)

Model Benchmark Improvement

- Fix torch.dist init issue with multiple models (#495)
- Support TE FP8 in BERT/GPT2 models (#496, #499)
- Add num_workers configurable in model benchmark (#511)

Support for multi-NIC concurrent test - RDMA

What would you like to be added:
Support for multi-NIC concurrent test in GPU-GPU RDMA

Why is this needed:
A concurrent NIC RDMA test would make it possible to test capability and stability at the node level

Without this feature, how does current superbenchmark work:
The current RDMA test is only capable of validating NIC#0

Components that may involve changes:
https://github.com/microsoft/superbenchmark/blob/main/superbench/benchmarks/micro_benchmarks/ib_validation_performance/ib_validation_performance.cc

Brief description of your proposal if any:
Add support for a user-assigned NIC number and for concurrent NICs during the RDMA test

V0.4.0 Release Plan

Release Manager

@cp5555

Endgame

Code freeze: Dec. 12th, 2021
Bug Bash date: Dec. 13th, 2021
Release date: Dec. 24th, 2021

Main Features

Microbenchmark

CPU Memory Validation (Tool: Intel Memory Latency Checker) (#126)

Metrics	Unit	Description
cpu-memory-bw-latency/mem_bandwidth_matrix_numa_[0-9]+_[0-9]+_bw	bandwidth (GB/s)	Former NUMA to latter NUMA memory bandwidth.
cpu-memory-bw-latency/mem_bandwidth_matrix_numa_[0-9]+_[0-9]+_lat	time (us)	Former NUMA to latter NUMA memory latency.
cpu-memory-bw-latency/mem_max_bandwidth_all_reads_bw	bandwidth (GB/s)	Whole-CPU maximum memory bandwidth, full read.
cpu-memory-bw-latency/mem_max_bandwidth_3_1_reads-writes_bw	bandwidth (GB/s)	Whole-CPU maximum memory bandwidth, read : write = 3 : 1.
cpu-memory-bw-latency/mem_max_bandwidth_2_1_reads-writes_bw	bandwidth (GB/s)	Whole-CPU maximum memory bandwidth, read : write = 2 : 1.
cpu-memory-bw-latency/mem_max_bandwidth_1_1_reads-writes_bw	bandwidth (GB/s)	Whole-CPU maximum memory bandwidth, read : write = 1 : 1.
cpu-memory-bw-latency/mem_max_bandwidth_stream-triad_like_bw	bandwidth (GB/s)	Whole-CPU maximum memory bandwidth, with stream-triad like pattern.
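
The NUMA matrix metric names above are regular expressions, so a consumer of the results has to match them rather than compare strings. A small sketch of how that could look; the helper name `parse_numa_metric` and the sample metric names are illustrative assumptions, not part of SuperBench.

```python
import re

# Pattern taken from the table above, with capture groups added
# for the two NUMA node indices and the bw/lat suffix.
NUMA_MATRIX = re.compile(
    r"cpu-memory-bw-latency/mem_bandwidth_matrix_numa_([0-9]+)_([0-9]+)_(bw|lat)"
)


def parse_numa_metric(name):
    """Return (former_numa, latter_numa, kind), or None if name does not match."""
    match = NUMA_MATRIX.fullmatch(name)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2)), match.group(3)


if __name__ == "__main__":
    print(parse_numa_metric("cpu-memory-bw-latency/mem_bandwidth_matrix_numa_0_1_bw"))
    print(parse_numa_metric("cpu-memory-bw-latency/mem_max_bandwidth_all_reads_bw"))
```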

GPU Copy Bandwidth (Tool: Built by MSRA) (#230)

Metrics	Unit	Description
gpu-copy-bw/cpu_to_gpu[0-9]+_by_gpu[0-9]+_using_(sm|dma)_under_numa[0-9]+_bw	GB/s	The bandwidth reading from all NUMA nodes' host memory using DMA engine or GPU SM by all GPUs
gpu-copy-bw/gpu[0-9]+_to_cpu_by_gpu[0-9]+_using_(sm|dma)_under_numa[0-9]+_bw	GB/s	The bandwidth writing to all NUMA nodes' host memory using DMA engine or GPU SM by all GPUs
gpu-copy-bw/gpu[0-9]+_to_gpu[0-9]+_by_gpu[0-9]+_using_(sm|dma)_under_numa[0-9]+_bw	GB/s	The bandwidth reading from or writing to all GPUs using DMA engine or GPU SM by all GPUs with peer communication enabled

Distributed Networking Benchmarks

Support IB Networking Validation (#191)

Metrics	Unit	Description
ib-traffic/${command}${line}${pair}${server}${client}_bw	GB/s	The average bandwidth of the ib command (ib_write_bw, ib_send_bw, ib_read_bw) run between the <pair>th node pair in the <line>th line of the config
ib-traffic/${command}${line}${pair}${server}${client}_lat	usec	The max latency of the ib command (ib_write_lat, ib_send_lat, ib_read_lat) run between the <pair>th node pair in the <line>th line of the config

Support TCP Validation (Tool: TCPing) (#217)

Metrics	Unit	Description
tcp-connectivity/${hostname/ip}_successed_count		Number of successful TCP connections between the current node and other nodes
tcp-connectivity/${hostname/ip}_failed_count		Number of failed TCP connections between the current node and other nodes
tcp-connectivity/${hostname/ip}_success_rate		Success rate (successful / total) of TCP connections between the current node and other nodes
tcp-connectivity/${hostname/ip}_time_min	ms	Minimum latency of TCP connections between the current node and other nodes
tcp-connectivity/${hostname/ip}_time_max	ms	Maximum latency of TCP connections between the current node and other nodes
tcp-connectivity/${hostname/ip}_time_avg	ms	Average latency of TCP connections between the current node and other nodes
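
How the six values relate can be sketched as follows. This is an illustration under stated assumptions, not the benchmark's code: `summarize_tcp_probes` is a hypothetical helper, a failed attempt is represented as `None`, and the success rate is kept as a plain ratio because the table does not say whether it is reported as a fraction or a percentage.

```python
def summarize_tcp_probes(latencies_ms):
    """Summarize connection attempts to one host.

    latencies_ms holds one entry per attempt: the latency in ms,
    or None when the connection failed (an assumed representation).
    """
    succeeded = [t for t in latencies_ms if t is not None]
    total = len(latencies_ms)
    return {
        "successed_count": len(succeeded),
        "failed_count": total - len(succeeded),
        # successful / total, as in the table above
        "success_rate": len(succeeded) / total if total else 0.0,
        "time_min": min(succeeded) if succeeded else None,
        "time_max": max(succeeded) if succeeded else None,
        "time_avg": sum(succeeded) / len(succeeded) if succeeded else None,
    }


if __name__ == "__main__":
    print(summarize_tcp_probes([1.2, None, 0.8, 1.0]))
```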

Support GPCNet Validation (#228 and #229)

Metrics	Unit	Description
gpcnet-network-test/rr_two-sided_lat_${stat}	time (us)	Statistical values (min, max, avg, 99%, 99.9%) across all nodes for the 'random ring communication pattern two-side latency' algorithm in network testing
gpcnet-network-test/rr_two-sided+sync_bw_${stat}	bandwidth (MiB/s/rank)	Statistical values (min, max, avg, 99%, 99.9%) across all nodes for the 'random ring communication pattern two-side bandwidth with barrier' algorithm in network testing
gpcnet-network-test/multiple_allreduce_time_${stat}	time (us)	Statistical values (min, max, avg, 99%, 99.9%) across all nodes for the 'multiple allreduce bandwidth' algorithm in network testing
gpcnet-network-test/rr_get_lat_${stat}	time (us)	Statistical values (min, max, avg, 99%, 99.9%) across all nodes for the 'RR GetLat (8 B)' algorithm in network testing
gpcnet-network-test/rr_two-sided_bw_${stat}	bandwidth (MiB/s/rank)	Statistical values (min, max, avg, 99%, 99.9%) across all nodes for the 'RR Two-sidedBW (131072 B)' algorithm in network testing
gpcnet-network-test/nat_two-sided_bw_${stat}	bandwidth (MiB/s/rank)	Statistical values (min, max, avg, 99%, 99.9%) across all nodes for the 'Nat Two-sidedBW (131072 B)' algorithm in network testing
gpcnet-network-test/multiple_alltoall_bw_${stat}	bandwidth (MiB/s/rank)	Statistical values (min, max, avg, 99%, 99.9%) across all nodes for the 'Multiple Alltoall (4096 B)' algorithm in network testing
gpcnet-network-load-test/rr_two-sided_lat_x_${stat}	factor (x)	Summary of the congestion impact factor of the network test algorithm
gpcnet-network-load-test/rr_two-sided+sync_bw_x_${stat}	factor (x)	Summary of the congestion impact factor of the network test algorithm
gpcnet-network-load-test/multiple_allreduce_x_${stat}	factor (x)	Summary of the congestion impact factor of the network test algorithm

SuperBench Improvement -- @guoshzhao

- Add pipeline for AMD docker (#194)
- Integrate system config info script with SuperBench (#199)
- Support FP32 mode without TF32 (#213)
- Refine the UT for microbenchmark (#268)
- Unify metric names for all benchmarks (#252)

More E2E Models for AMD and Inference -- @lynex

Add ORT Model on AMD GPU platform (#227)
Models: Bert-large, Distilbert-base, GPT-2, facebook/Bart-large, and Roberta-large

Metrics	Unit	Description
onnxruntime-ort-models/bert_large_uncased_ngpu_1_throughput	samples/s	The throughput of the bert large uncased model on 1 GPU
onnxruntime-ort-models/bert_large_uncased_ngpu_8_throughput	samples/s	The throughput of the bert large uncased model on 8 GPUs
onnxruntime-ort-models/distilbert_base_uncased_ngpu_1_throughput	samples/s	The throughput of the distilbert base uncased model on 1 GPU
onnxruntime-ort-models/distilbert_base_uncased_ngpu_8_throughput	samples/s	The throughput of the distilbert base uncased model on 8 GPUs
onnxruntime-ort-models/gpt2_ngpu_1_throughput	samples/s	The throughput of the gpt2 model on 1 GPU
onnxruntime-ort-models/gpt2_ngpu_8_throughput	samples/s	The throughput of the gpt2 model on 8 GPUs
onnxruntime-ort-models/facebook_bart_large_ngpu_1_throughput	samples/s	The throughput of the facebook bart large model on 1 GPU
onnxruntime-ort-models/facebook_bart_large_ngpu_8_throughput	samples/s	The throughput of the facebook bart large model on 8 GPUs
onnxruntime-ort-models/roberta_large_ngpu_1_throughput	samples/s	The throughput of the roberta large model on 1 GPU
onnxruntime-ort-models/roberta_large_ngpu_8_throughput	samples/s	The throughput of the roberta large model on 8 GPUs

Add Inference Backend TensorRT (#236, #254)

Name	Unit	Description
tensorrt-inference/${model}_gpu_time_mean	time (ms)	The mean GPU latency to execute the kernels for a query.
tensorrt-inference/${model}_gpu_time_99	time (ms)	The 99th percentile GPU latency to execute the kernels for a query.
tensorrt-inference/${model}_host_time_mean	time (ms)	The mean H2D, GPU, and D2H latency to execute the kernels for a query.
tensorrt-inference/${model}_host_time_99	time (ms)	The 99th percentile H2D, GPU, and D2H latency to execute the kernels for a query.
tensorrt-inference/${model}_end_to_end_time_mean	time (ms)	The mean duration from when the H2D of a query is called to when the D2H of the same query is completed.
tensorrt-inference/${model}_end_to_end_time_99	time (ms)	The P99 duration from when the H2D of a query is called to when the D2H of the same query is completed.
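
Each pair of metrics above reduces a set of per-query latencies to a mean and a 99th percentile. A minimal sketch of that reduction, assuming the nearest-rank percentile definition; the definition actually used by the TensorRT backend is not stated here, and `mean_and_p99` is a hypothetical helper.

```python
def mean_and_p99(samples_ms):
    """Return (mean, 99th percentile) of per-query latencies in ms.

    Uses the nearest-rank percentile: the smallest sample such that
    at least 99% of all samples are less than or equal to it.
    """
    if not samples_ms:
        raise ValueError("need at least one latency sample")
    ordered = sorted(samples_ms)
    mean = sum(ordered) / len(ordered)
    # ceil(0.99 * n) computed with integers to avoid float rounding.
    rank = -(-99 * len(ordered) // 100)
    return mean, ordered[rank - 1]


if __name__ == "__main__":
    latencies = list(range(1, 101))  # made-up samples: 1 ms .. 100 ms
    print(mean_and_p99(latencies))
```

The P99 is reported alongside the mean because a few slow queries barely move the mean but dominate the tail.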

- Add Inference Backend ORT for Nvidia (#245)
  

Name	Unit	Description
ort-inference/{precision}_{model}_time	time (ms)	The mean latency to execute one batch of inference.

Data Diagnosis & Analysis -- @yukirora

- Support baseline-based data diagnosis (#242)
- Support basic analysis feature (boxplot figure, outlier detection, etc.) (#248)
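
The two ideas in this list can be sketched generically. The code below is not SuperBench's implementation or rule syntax: `violates_baseline`, the 5% tolerance, and the 1.5 x IQR fence (the usual boxplot whisker rule) are assumptions chosen for illustration, and the sample values are made up.

```python
import statistics


def violates_baseline(value, baseline, tolerance=0.05, higher_is_better=True):
    """Flag a metric that is worse than its baseline by more than tolerance."""
    if higher_is_better:
        return value < baseline * (1 - tolerance)
    return value > baseline * (1 + tolerance)


def iqr_outliers(values, k=1.5):
    """Return values outside the boxplot fences [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]


if __name__ == "__main__":
    # Made-up per-node values of one metric.
    per_node = [10, 11, 12, 11, 10, 12, 11, 30]
    print(iqr_outliers(per_node))
    print(violates_baseline(90, 100))
```

The baseline check needs a known-good reference value per metric, while the IQR check only compares nodes against each other, so the two complement one another.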

Monitor -- @guoshzhao

- Add monitor framework for CPU, memory, disk, GPU, etc. (#240)
- Integrate monitor with SuperBench (#259)

Document

- Support Benchmark List (#233, #237, #238, #271)
- Monitor Document (#265)
- Data Diagnosis Document (#249)

Backlogs

SuperBench Improvement

- Improve Output Interface
- Automatically kill all processes on all nodes
- Add heartbeat to monitor process health