mlcommons / training_results_v0.5

This repository contains the results and code for the MLPerf™ Training v0.5 benchmark.

Home Page: https://mlcommons.org/en/training-normal-05/

License: Apache License 2.0

Shell 0.63% Python 48.37% HTML 30.36% JavaScript 0.47% Jupyter Notebook 13.99% Dockerfile 0.03% Java 0.07% C++ 1.97% Go 1.67% Makefile 0.01% Scala 0.16% Cuda 2.09% CMake 0.06% C 0.02% CSS 0.05% Lua 0.01% Starlark 0.05%

training_results_v0.5's Introduction

Submissions

MLPerf™ Training v0.5 results

training_results_v0.5's People

cliffwoolley ieaglefalcon cavdard wang12tao hanchen2000 greenfigo2015 snehilverma41 armandmcqueen stanford-futuredata codyaustun renganxu lanchongyizu aaron276h huaweitechnology sheldon-xiong meilingwang1 aakashkardam nalinaly ntkingstar dekhtiarjonathan quinnrong srivera1 chrisyooh stwang57 tw-nchc vnata jthelin jbalma georges-nvidia gychen1991 yangshuoheng anhhai986 myelintek mtsang cactuswang grandbasis ciaranshu dragonfly-zh zhengzhimiao hubbucket-team atooq mathpopo isdanni fenz elcuervo fannierpeng mlperf jiayisunx delock yxz312 trellixvulnteam

training_results_v0.5's Issues

[object_detection] ZeroDivisionError: float division by zero

I tested object detection with the NVIDIA code. The only difference is that I used two nodes instead of eight, but the following error occurred:
random_number_generator,
File "/workspace/object_detection/maskrcnn_benchmark/engine/trainer.py", line 190, in do_train
use_distributed, use_amp=arguments["use_amp"]
File "/workspace/object_detection/maskrcnn_benchmark/engine/trainer.py", line 105, in train_one_epoch
next(self.gen)
File "/opt/conda/lib/python3.6/site-packages/apex-0.1-py3.6-linux-x86_64.egg/apex/amp/opt.py", line 50, in scale_loss
scaled_losses.backward()
File "/opt/conda/lib/python3.6/contextlib.py", line 88, in __exit__
self._optimizer.param_groups, loss_scale)
next(self.gen)
File "/opt/conda/lib/python3.6/site-packages/apex-0.1-py3.6-linux-x86_64.egg/apex/amp/opt.py", line 50, in scale_loss
File "/opt/conda/lib/python3.6/site-packages/apex-0.1-py3.6-linux-x86_64.egg/apex/amp/scaler.py", line 38, in unscale_and_update
self._optimizer.param_groups, loss_scale)
1. / scale)
File "/opt/conda/lib/python3.6/site-packages/apex-0.1-py3.6-linux-x86_64.egg/apex/amp/scaler.py", line 38, in unscale_and_update
1. / scale)
ZeroDivisionError: float division by zero
ZeroDivisionError: float division by zero
Traceback (most recent call last):
File "tools/train_net.py", line 328, in <module>
main()
File "tools/train_net.py", line 319, in main
model = train(cfg, random_number_generator, args.local_rank, args.distributed, args, args.fp16)
File "tools/train_net.py", line 173, in train
random_number_generator,
File "/workspace/object_detection/maskrcnn_benchmark/engine/trainer.py", line 190, in do_train
use_distributed, use_amp=arguments["use_amp"]

My config_DGX1_MULTI.sh is as follows:

## DL params

EXTRA_PARAMS="--min_bbox_map 0.377 --min_mask_map 0.339"
EXTRA_CONFIG=(
"SOLVER.BASE_LR" "0.16"
"SOLVER.MAX_ITER" "40000"
"SOLVER.WARMUP_FACTOR" "0.000256"
"SOLVER.WARMUP_ITERS" "625"
"SOLVER.WARMUP_METHOD" "mlperf_linear"
"SOLVER.STEPS" "(9000, 12000)"
"DATALOADER.IMAGES_PER_BATCH_TRAIN" "2"
"MODEL.RPN.FPN_POST_NMS_TOP_N_TRAIN" "2000"
)

## System run parms

DGXNNODES=2
DGXSYSTEM=DGX1_multi
WALLTIME=12:00:00

## System config params

DGXNGPU=8
DGXSOCKETCORES=14
DGXHT=2 # HT is on is 2, HT off is 1

I suspect that EXTRA_CONFIG needs to be modified. Can someone give me some guidance? Thanks in advance!
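
One possible direction, offered as a rough sketch rather than a confirmed fix: the traceback shows the loss scale reaching zero in apex's `unscale_and_update`, which may happen when training diverges, and a learning rate tuned for eight nodes may be too high for two. The snippet below applies the common linear scaling heuristic to the SOLVER values from the config above. The function name `rescale_solver` is hypothetical. It is also an assumption, not confirmed by the repository, that the posted values were tuned for 8 nodes and that the schedule should be stretched so the number of images seen stays constant.

```python
def rescale_solver(base_lr, max_iter, steps, warmup_iters, ref_nodes, new_nodes):
    """Linearly rescale solver settings when the node count changes.

    Assumes the per-GPU batch size and GPUs per node stay fixed, so the
    global batch size is proportional to the node count.
    """
    factor = new_nodes / ref_nodes  # e.g. 2 / 8 = 0.25
    return {
        # Smaller global batch -> proportionally smaller learning rate.
        "SOLVER.BASE_LR": base_lr * factor,
        # Fewer images per iteration -> proportionally more iterations.
        "SOLVER.MAX_ITER": round(max_iter / factor),
        "SOLVER.STEPS": tuple(round(s / factor) for s in steps),
        "SOLVER.WARMUP_ITERS": round(warmup_iters / factor),
    }


# Values taken from the config in this issue; ref_nodes=8 is an assumption.
scaled = rescale_solver(0.16, 40000, (9000, 12000), 625, ref_nodes=8, new_nodes=2)
for key, value in scaled.items():
    print(key, value)
```

The resulting values would replace the matching entries in EXTRA_CONFIG. Whether they reach the target accuracy is untested here.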

RuntimeError: cuda runtime error (10) : invalid device ordinal

Hello,

I am trying to re-run the benchmark in results/v0.5.0/nvidia/submission/code/object_detection/pytorch/ and am getting a CUDA error. Does anyone have any hints? Thanks.

Machine: Intel Xeon with 8 x V100-SXM2
OS: CentOS Linux release 7.5.1804 (Core)
NVIDIA driver 410.79
CUDA 10.0
nvidia-docker2 2.0.3-1.docker18.06.1.ce

Commands run:
cd /bfs/hpc_cluster/work/mlperf/src/results/v0.5.0/nvidia/submission/code/object_detection/pytorch
DATADIR=/bfs/hpc_cluster/work/mlperf/src/results/v0.5.0/nvidia/submission/code/object_detection/detectron/lib/datasets/data/coco LOGDIR=/bfs/hpc_cluster/work/mlperf/src/results/v0.5.0/nvidia/submission/code/object_detection/logs DGXSYSTEM=DGX2 ./run.sub

Log:
THCudaCheck FAIL file=torch/csrc/cuda/Module.cpp line=34 error=10 : invalid device ordinal
THCudaCheck FAIL file=torch/csrc/cuda/Module.cpp line=34 error=10 : invalid device ordinal
THCudaCheck FAIL file=torch/csrc/cuda/Module.cpp line=34 error=10 : invalid device ordinal
THCudaCheck FAIL file=torch/csrc/cuda/Module.cpp line=34 error=10 : invalid device ordinal
THCudaCheck FAIL file=torch/csrc/cuda/Module.cpp line=34 error=10 : invalid device ordinal
THCudaCheck FAIL file=torch/csrc/cuda/Module.cpp line=34 error=10 : invalid device ordinal
THCudaCheck FAIL file=torch/csrc/cuda/Module.cpp line=34 error=10 : invalid device ordinal
THCudaCheck FAIL file=torch/csrc/cuda/Module.cpp line=34 error=10 : invalid device ordinal
Traceback (most recent call last):
File "tools/train_net.py", line 328, in <module>
Traceback (most recent call last):
File "tools/train_net.py", line 328, in <module>
Traceback (most recent call last):
Traceback (most recent call last):
File "tools/train_net.py", line 328, in <module>
File "tools/train_net.py", line 328, in <module>
Traceback (most recent call last):
Traceback (most recent call last):
Traceback (most recent call last):
File "tools/train_net.py", line 328, in <module>
File "tools/train_net.py", line 328, in <module>
File "tools/train_net.py", line 328, in <module>
main()
File "tools/train_net.py", line 239, in main
main()
torch.cuda.set_device(args.local_rank)
main()
File "/opt/conda/lib/python3.6/site-packages/torch/cuda/__init__.py", line 264, in set_device
File "tools/train_net.py", line 239, in main
File "tools/train_net.py", line 239, in main
main()
File "tools/train_net.py", line 239, in main
torch.cuda.set_device(args.local_rank)
File "/opt/conda/lib/python3.6/site-packages/torch/cuda/__init__.py", line 264, in set_device
torch.cuda.set_device(args.local_rank)
File "/opt/conda/lib/python3.6/site-packages/torch/cuda/__init__.py", line 264, in set_device
torch.cuda.set_device(args.local_rank)
torch._C._cuda_setDevice(device)
File "/opt/conda/lib/python3.6/site-packages/torch/cuda/__init__.py", line 264, in set_device
RuntimeError: cuda runtime error (10) : invalid device ordinal at torch/csrc/cuda/Module.cpp:34
main()
File "tools/train_net.py", line 239, in main
main()
File "tools/train_net.py", line 239, in main
torch._C._cuda_setDevice(device)
main()
torch._C._cuda_setDevice(device)
File "tools/train_net.py", line 239, in main
RuntimeError: cuda runtime error (10) : invalid device ordinal at torch/csrc/cuda/Module.cpp:34
RuntimeError: cuda runtime error (10) : invalid device ordinal at torch/csrc/cuda/Module.cpp:34
torch._C._cuda_setDevice(device)
RuntimeError: cuda runtime error (10) : invalid device ordinal at torch/csrc/cuda/Module.cpp:34
torch.cuda.set_device(args.local_rank)
File "/opt/conda/lib/python3.6/site-packages/torch/cuda/__init__.py", line 264, in set_device
torch.cuda.set_device(args.local_rank)
File "/opt/conda/lib/python3.6/site-packages/torch/cuda/__init__.py", line 264, in set_device
torch.cuda.set_device(args.local_rank)
File "/opt/conda/lib/python3.6/site-packages/torch/cuda/__init__.py", line 264, in set_device
torch._C._cuda_setDevice(device)
torch._C._cuda_setDevice(device)
RuntimeError: cuda runtime error (10) : invalid device ordinal at torch/csrc/cuda/Module.cpp:34
torch._C._cuda_setDevice(device)
RuntimeError: cuda runtime error (10) : invalid device ordinal at torch/csrc/cuda/Module.cpp:34
RuntimeError: cuda runtime error (10) : invalid device ordinal at torch/csrc/cuda/Module.cpp:34
Traceback (most recent call last):
File "tools/train_net.py", line 328, in <module>
main()
File "tools/train_net.py", line 239, in main
torch.cuda.set_device(args.local_rank)
File "/opt/conda/lib/python3.6/site-packages/torch/cuda/__init__.py", line 264, in set_device
torch._C._cuda_setDevice(device)
RuntimeError: cuda runtime error (10) : invalid device ordinal at torch/csrc/cuda/Module.cpp:34
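
A possible reading of the log above, offered as a hedged sketch: `torch.cuda.set_device(args.local_rank)` fails when the local rank is not smaller than the number of GPUs visible to the process. The machine has 8 GPUs, while `DGXSYSTEM=DGX2` presumably selects a configuration written for a 16-GPU DGX-2. That this configuration launches 16 processes per node is an assumption, though it would match the 8 failures in the log. The helper names below are hypothetical, and `visible_gpus` stands in for what `torch.cuda.device_count()` would report.

```python
def failing_local_ranks(procs_per_node, visible_gpus):
    """Return the local ranks that have no GPU to bind to."""
    return [rank for rank in range(procs_per_node) if rank >= visible_gpus]


def check_local_rank(local_rank, visible_gpus):
    """Fail early with a readable message instead of a CUDA runtime error."""
    if not 0 <= local_rank < visible_gpus:
        raise ValueError(
            f"local_rank {local_rank} is out of range for {visible_gpus} visible GPU(s)"
        )
    return local_rank


# Assumed scenario: 16 processes launched on a node that exposes 8 GPUs.
print(failing_local_ranks(procs_per_node=16, visible_gpus=8))
```

If this reading is right, running with a configuration that matches an 8-GPU machine may avoid the error.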

download_dataset.sh for object detection is outdated (nvidia/submission)

The first tarball URL is out of date and no longer works. The new link is: https://dl.fbaipublicfiles.com/detectron/coco/coco_annotations_minival.tgz

Data preprocessing script is missing

I cannot preprocess the raw data for any of the benchmarks submitted by NVIDIA. Can you please provide the data preprocessing script?

Nvidia/translation ModuleNotFoundError: No module named 'strided_batched_gemm'

The fairseq directory contains some .cpp and .cu files that cannot be imported.

Does anyone know the difference between private TF and public TF in the Intel submission?

[single_stage_detector] Which backbone should we use? Is there a designated ResNet-34 backbone for TensorFlow, or should we just use our own self-trained backbone?

We are trying to run some SSD tests with TensorFlow, but there is no information about which backbone to use.
Is there a designated ResNet-34 backbone for TensorFlow, or should we just use our own self-trained backbone?

pytorch/maskrcnn: URL of extract_dataset.sh is invalid

When I run the command below:

wget https://raw.githubusercontent.com/mlperf/training/master/object_detection/caffe2/extract_dataset.sh

I get the following error:

wget https://raw.githubusercontent.com/mlperf/training/master/object_detection/caffe2/extract_dataset.sh
--2021-04-20 18:34:51--  https://raw.githubusercontent.com/mlperf/training/master/object_detection/caffe2/extract_dataset.sh
Proxy request sent, awaiting response... 404 Not Found
2021-04-20 18:34:53 ERROR 404: Not Found.

Multi-node training on cloud instances

What would be the best way to run multi-node training on cloud compute instances? Is there something similar to the multi-node DGX1/DGX2 training that uses Slurm?

[single_stage_detector] RuntimeError: cuDNN error: CUDNN_STATUS_NOT_SUPPORTED. This error may appear if you passed in a non-contiguous input. (nhwc_bn_fwd_train_cudnn_impl at /tmp/pip-req-build-pq90lolk/csrc/nhwc/batch_norm.cu:109)

When I run the single_stage_detector benchmark, it fails with the message below:

:::MLPv0.5.0 ssd 1548297287.858427048 (/workspace/single_stage_detector/ssd300.py:69) num_defaults_per_cell: [4, 6, 6, 6, 4, 4]
Traceback (most recent call last):
File "train.py", line 710, in <module>
main()
File "train.py", line 703, in main
success = train300_mlperf_coco(args)
File "train.py", line 513, in train300_mlperf_coco
ssd300.module = torch.jit.trace(module_to_jit, example_input)
File "/opt/conda/lib/python3.6/site-packages/torch/jit/__init__.py", line 565, in trace
module._create_method_from_trace('forward', func, example_inputs)
File "/opt/conda/lib/python3.6/site-packages/torch/nn/modules/module.py", line 475, in __call__
result = self._slow_forward(*input, **kwargs)
File "/opt/conda/lib/python3.6/site-packages/torch/nn/modules/module.py", line 465, in _slow_forward
result = self.forward(*input, **kwargs)
File "/opt/conda/lib/python3.6/site-packages/torch/nn/modules/container.py", line 92, in forward
input = module(input)
File "/opt/conda/lib/python3.6/site-packages/torch/nn/modules/module.py", line 475, in __call__
result = self._slow_forward(*input, **kwargs)
File "/opt/conda/lib/python3.6/site-packages/torch/nn/modules/module.py", line 465, in _slow_forward
result = self.forward(*input, **kwargs)
File "/workspace/single_stage_detector/ssd300.py", line 184, in forward
layers = self.model(data)
File "/opt/conda/lib/python3.6/site-packages/torch/nn/modules/module.py", line 475, in __call__
result = self._slow_forward(*input, **kwargs)
File "/opt/conda/lib/python3.6/site-packages/torch/nn/modules/module.py", line 465, in _slow_forward
result = self.forward(*input, **kwargs)
File "/workspace/single_stage_detector/base_model.py", line 99, in forward
layer1_activation = self.layer1(data)
File "/opt/conda/lib/python3.6/site-packages/torch/nn/modules/module.py", line 475, in __call__
result = self._slow_forward(*input, **kwargs)
File "/opt/conda/lib/python3.6/site-packages/torch/nn/modules/module.py", line 465, in _slow_forward
result = self.forward(*input, **kwargs)
File "/opt/conda/lib/python3.6/site-packages/torch/nn/modules/container.py", line 92, in forward
input = module(input)
File "/opt/conda/lib/python3.6/site-packages/torch/nn/modules/module.py", line 475, in __call__
result = self._slow_forward(*input, **kwargs)
File "/opt/conda/lib/python3.6/site-packages/torch/nn/modules/module.py", line 465, in _slow_forward
result = self.forward(*input, **kwargs)
File "/workspace/single_stage_detector/nhwc/batch_norm.py", line 73, in forward
self.eps, self.fuse_relu, self.training, z)
File "/workspace/single_stage_detector/nhwc/batch_norm.py", line 36, in forward
y, save_mean, save_var, reserve = C.bn_fwd_nhwc_cudnn(x, s, b, rm, riv, mom, epsilon, fuse_relu)
RuntimeError: cuDNN error: CUDNN_STATUS_NOT_SUPPORTED. This error may appear if you passed in a non-contiguous input. (nhwc_bn_fwd_train_cudnn_impl at /tmp/pip-req-build-pq90lolk/csrc/nhwc/batch_norm.cu:109)

Is a dataset or model file missing?
I prepared the dataset in /data/coco2017/ but did not include any related model file.
Thanks.

Will the object detection benchmark run on Tesla P100s, or on DGX-1 and DGX-2 only?

I am wondering whether anyone has run the object detection benchmark with PyTorch on Tesla P100s.
Here is a link to the code I want to run on P100s, but the config files are for DGX-1 and DGX-2 only. Are there simple config changes that would allow this to run on Tesla P100s?
https://github.com/mlperf/results/tree/master/v0.5.0/nvidia/submission/code/object_detection/pytorch