mlcommons / training_results_v0.5

This repository contains the results and code for the MLPerf™ Training v0.5 benchmark.

Home Page: https://mlcommons.org/en/training-normal-05/

License: Apache License 2.0

Shell 0.63% Python 48.37% HTML 30.36% JavaScript 0.47% Jupyter Notebook 13.99% Dockerfile 0.03% Java 0.07% C++ 1.97% Go 1.67% Makefile 0.01% Scala 0.16% Cuda 2.09% CMake 0.06% C 0.02% CSS 0.05% Lua 0.01% Starlark 0.05%

training_results_v0.5's Introduction

Submissions

MLPerf™ Training v0.5 results

training_results_v0.5's People

Contributors

cliffwoolley, guschmue, nathanw-mlc, nvpaulius, petermattson, thekanter

training_results_v0.5's Issues

[object_detection] ZeroDivisionError: float division by zero

I ran the object detection benchmark with the NVIDIA code. The only difference was that I used two nodes instead of eight, but the following error occurred:
random_number_generator,
File "/workspace/object_detection/maskrcnn_benchmark/engine/trainer.py", line 190, in do_train
use_distributed, use_amp=arguments["use_amp"]
File "/workspace/object_detection/maskrcnn_benchmark/engine/trainer.py", line 105, in train_one_epoch
next(self.gen)
File "/opt/conda/lib/python3.6/site-packages/apex-0.1-py3.6-linux-x86_64.egg/apex/amp/opt.py", line 50, in scale_loss
scaled_losses.backward()
File "/opt/conda/lib/python3.6/contextlib.py", line 88, in exit
self._optimizer.param_groups, loss_scale)
next(self.gen)
File "/opt/conda/lib/python3.6/site-packages/apex-0.1-py3.6-linux-x86_64.egg/apex/amp/opt.py", line 50, in scale_loss
File "/opt/conda/lib/python3.6/site-packages/apex-0.1-py3.6-linux-x86_64.egg/apex/amp/scaler.py", line 38, in unscale_and_update
self._optimizer.param_groups, loss_scale)
1. / scale)
File "/opt/conda/lib/python3.6/site-packages/apex-0.1-py3.6-linux-x86_64.egg/apex/amp/scaler.py", line 38, in unscale_and_update
1. / scale)
ZeroDivisionError: float division by zero
ZeroDivisionError: float division by zero
Traceback (most recent call last):
File "tools/train_net.py", line 328, in
main()
File "tools/train_net.py", line 319, in main
model = train(cfg, random_number_generator, args.local_rank, args.distributed, args, args.fp16)
File "tools/train_net.py", line 173, in train
random_number_generator,
File "/workspace/object_detection/maskrcnn_benchmark/engine/trainer.py", line 190, in do_train
use_distributed, use_amp=arguments["use_amp"]
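
One possible reading of this traceback: the exception is raised by `1. / scale` in apex's `unscale_and_update`, so the loss scale must have been exactly 0.0 at that point. A dynamic loss scaler typically shrinks its scale each time it sees an overflow (inf/NaN gradients), and if training diverges on every step the scale can eventually underflow to zero. The sketch below is a toy model of that behaviour, not apex's actual implementation; the class name `ToyLossScaler` and the `min_scale` floor are hypothetical and exist only for illustration.

```python
class ToyLossScaler:
    """Toy dynamic loss scaler (hypothetical, NOT apex's code).

    Halves the scale on every overflow. Without a floor, enough
    consecutive overflows drive the scale to exactly 0.0.
    """

    def __init__(self, init_scale=2.0 ** 16, min_scale=None):
        self.scale = init_scale
        self.min_scale = min_scale  # assumption: an optional lower bound

    def update(self, overflow):
        # Shrink the scale only when the last step overflowed.
        if overflow:
            self.scale /= 2.0
            if self.min_scale is not None:
                self.scale = max(self.scale, self.min_scale)

    def inv_scale(self):
        # Mirrors the `1. / scale` expression seen in the traceback.
        return 1.0 / self.scale


if __name__ == "__main__":
    unguarded = ToyLossScaler()
    for _ in range(2000):  # pretend every step overflows
        unguarded.update(overflow=True)
    try:
        unguarded.inv_scale()
    except ZeroDivisionError as exc:
        print("unguarded scaler failed:", exc)

    guarded = ToyLossScaler(min_scale=1.0)
    for _ in range(2000):
        guarded.update(overflow=True)
    print("guarded scale:", guarded.scale)
```

If this reading is right, the division by zero is a symptom rather than the root cause: the thing to investigate is why every step overflows, for example a learning rate that is too high for the reduced global batch size.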

My config_DGX1_MULTI.sh is as follows:

## DL params

EXTRA_PARAMS="--min_bbox_map 0.377 --min_mask_map 0.339"
EXTRA_CONFIG=(
"SOLVER.BASE_LR" "0.16"
"SOLVER.MAX_ITER" "40000"
"SOLVER.WARMUP_FACTOR" "0.000256"
"SOLVER.WARMUP_ITERS" "625"
"SOLVER.WARMUP_METHOD" "mlperf_linear"
"SOLVER.STEPS" "(9000, 12000)"
"DATALOADER.IMAGES_PER_BATCH_TRAIN" "2"
"MODEL.RPN.FPN_POST_NMS_TOP_N_TRAIN" "2000"
)

## System run parms

DGXNNODES=2
DGXSYSTEM=DGX1_multi
WALLTIME=12:00:00

## System config params

DGXNGPU=8
DGXSOCKETCORES=14
DGXHT=2 # HT is on is 2, HT off is 1

I suspect that EXTRA_CONFIG needs to be modified. Can someone give me some guidance? Thanks in advance!
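
A common heuristic when the global batch size changes is the linear scaling rule: scale the learning rate in proportion to the batch size. The sketch below applies it to the values in the config above, assuming that `DATALOADER.IMAGES_PER_BATCH_TRAIN` is a per-GPU batch size and that `SOLVER.BASE_LR` of 0.16 was tuned for the original eight-node run; both are assumptions, and the helper names `global_batch` and `scale_lr` are hypothetical.

```python
def global_batch(nodes, gpus_per_node, images_per_gpu):
    """Total images per optimizer step across all GPUs."""
    return nodes * gpus_per_node * images_per_gpu


def scale_lr(ref_lr, ref_batch, new_batch):
    """Linear scaling rule: learning rate proportional to global batch."""
    return ref_lr * new_batch / ref_batch


if __name__ == "__main__":
    # Assumed reference run: 8 nodes x 8 GPUs x 2 images per GPU.
    ref_batch = global_batch(nodes=8, gpus_per_node=8, images_per_gpu=2)
    # Run from this issue: DGXNNODES=2, DGXNGPU=8, 2 images per GPU.
    new_batch = global_batch(nodes=2, gpus_per_node=8, images_per_gpu=2)

    print("reference global batch:", ref_batch)
    print("two-node global batch:", new_batch)
    print("suggested SOLVER.BASE_LR:", scale_lr(0.16, ref_batch, new_batch))
```

Under these assumptions the global batch shrinks by a factor of four, so a base learning rate around a quarter of 0.16 would be the starting point to try. The iteration-based settings (`SOLVER.MAX_ITER`, `SOLVER.STEPS`, `SOLVER.WARMUP_ITERS`) may need to grow by a similar factor so that the run sees the same number of images, but that is a guess and has not been verified against this benchmark.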

RuntimeError: cuda runtime error (10) : invalid device ordinal

Hello,

I am trying to re-run the benchmark in results/v0.5.0/nvidia/submission/code/object_detection/pytorch/ and am getting a CUDA error. Does anyone have any hints? Thanks.

Machine: Intel Xeon with 8 x V100-SXM2
OS: CentOS Linux release 7.5.1804 (Core)
NVIDIA driver 410.79
CUDA 10.0
nvidia-docker2 2.0.3-1.docker18.06.1.ce

Commands run:
cd /bfs/hpc_cluster/work/mlperf/src/results/v0.5.0/nvidia/submission/code/object_detection/pytorch
DATADIR=/bfs/hpc_cluster/work/mlperf/src/results/v0.5.0/nvidia/submission/code/object_detection/detectron/lib/datasets/data/coco LOGDIR=/bfs/hpc_cluster/work/mlperf/src/results/v0.5.0/nvidia/submission/code/object_detection/logs DGXSYSTEM=DGX2 ./run.sub

Log:
THCudaCheck FAIL file=torch/csrc/cuda/Module.cpp line=34 error=10 : invalid device ordinal
THCudaCheck FAIL file=torch/csrc/cuda/Module.cpp line=34 error=10 : invalid device ordinal
THCudaCheck FAIL file=torch/csrc/cuda/Module.cpp line=34 error=10 : invalid device ordinal
THCudaCheck FAIL file=torch/csrc/cuda/Module.cpp line=34 error=10 : invalid device ordinal
THCudaCheck FAIL file=torch/csrc/cuda/Module.cpp line=34 error=10 : invalid device ordinal
THCudaCheck FAIL file=torch/csrc/cuda/Module.cpp line=34 error=10 : invalid device ordinal
THCudaCheck FAIL file=torch/csrc/cuda/Module.cpp line=34 error=10 : invalid device ordinal
THCudaCheck FAIL file=torch/csrc/cuda/Module.cpp line=34 error=10 : invalid device ordinal
Traceback (most recent call last):
File "tools/train_net.py", line 328, in
Traceback (most recent call last):
File "tools/train_net.py", line 328, in
Traceback (most recent call last):
Traceback (most recent call last):
File "tools/train_net.py", line 328, in
File "tools/train_net.py", line 328, in
Traceback (most recent call last):
Traceback (most recent call last):
Traceback (most recent call last):
File "tools/train_net.py", line 328, in
File "tools/train_net.py", line 328, in
File "tools/train_net.py", line 328, in
main()
File "tools/train_net.py", line 239, in main
main()
torch.cuda.set_device(args.local_rank)
main()
File "/opt/conda/lib/python3.6/site-packages/torch/cuda/init.py", line 264, in set_device
File "tools/train_net.py", line 239, in main
File "tools/train_net.py", line 239, in main
main()
File "tools/train_net.py", line 239, in main
torch.cuda.set_device(args.local_rank)
File "/opt/conda/lib/python3.6/site-packages/torch/cuda/init.py", line 264, in set_device
torch.cuda.set_device(args.local_rank)
File "/opt/conda/lib/python3.6/site-packages/torch/cuda/init.py", line 264, in set_device
torch.cuda.set_device(args.local_rank)
torch._C._cuda_setDevice(device)
File "/opt/conda/lib/python3.6/site-packages/torch/cuda/init.py", line 264, in set_device
RuntimeError: cuda runtime error (10) : invalid device ordinal at torch/csrc/cuda/Module.cpp:34
main()
File "tools/train_net.py", line 239, in main
main()
File "tools/train_net.py", line 239, in main
torch._C._cuda_setDevice(device)
main()
torch._C._cuda_setDevice(device)
File "tools/train_net.py", line 239, in main
RuntimeError: cuda runtime error (10) : invalid device ordinal at torch/csrc/cuda/Module.cpp:34
RuntimeError: cuda runtime error (10) : invalid device ordinal at torch/csrc/cuda/Module.cpp:34
torch._C._cuda_setDevice(device)
RuntimeError: cuda runtime error (10) : invalid device ordinal at torch/csrc/cuda/Module.cpp:34
torch.cuda.set_device(args.local_rank)
File "/opt/conda/lib/python3.6/site-packages/torch/cuda/init.py", line 264, in set_device
torch.cuda.set_device(args.local_rank)
File "/opt/conda/lib/python3.6/site-packages/torch/cuda/init.py", line 264, in set_device
torch.cuda.set_device(args.local_rank)
File "/opt/conda/lib/python3.6/site-packages/torch/cuda/init.py", line 264, in set_device
torch._C._cuda_setDevice(device)
torch._C._cuda_setDevice(device)
RuntimeError: cuda runtime error (10) : invalid device ordinal at torch/csrc/cuda/Module.cpp:34
torch._C._cuda_setDevice(device)
RuntimeError: cuda runtime error (10) : invalid device ordinal at torch/csrc/cuda/Module.cpp:34
RuntimeError: cuda runtime error (10) : invalid device ordinal at torch/csrc/cuda/Module.cpp:34
Traceback (most recent call last):
File "tools/train_net.py", line 328, in
main()
File "tools/train_net.py", line 239, in main
torch.cuda.set_device(args.local_rank)
File "/opt/conda/lib/python3.6/site-packages/torch/cuda/init.py", line 264, in set_device
torch._C._cuda_setDevice(device)
RuntimeError: cuda runtime error (10) : invalid device ordinal at torch/csrc/cuda/Module.cpp:34
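
A possible explanation, offered as a guess: the machine has 8 GPUs, but the run was launched with DGXSYSTEM=DGX2, and a DGX-2 has 16 GPUs. If that configuration starts 16 processes per node (an assumption about the launch script, not something the log confirms), local ranks 8 through 15 would call `torch.cuda.set_device` with an ordinal that does not exist, which would match the eight "invalid device ordinal" failures above. The sketch below shows the kind of check that catches this early; `check_local_rank` and `failing_ranks` are hypothetical helpers, and in a real script the device count would come from `torch.cuda.device_count()`.

```python
def check_local_rank(local_rank, visible_devices):
    """Return local_rank if it maps to an existing GPU, else raise.

    visible_devices would normally be torch.cuda.device_count();
    it is passed in here so the sketch runs without PyTorch.
    """
    if not 0 <= local_rank < visible_devices:
        raise ValueError(
            "local_rank %d is out of range for %d visible GPU(s)"
            % (local_rank, visible_devices)
        )
    return local_rank


def failing_ranks(procs_per_node, visible_devices):
    """Local ranks that would hit 'invalid device ordinal'."""
    return [r for r in range(procs_per_node) if r >= visible_devices]


if __name__ == "__main__":
    # Assumed: a 16-process launch on a node that exposes only 8 GPUs.
    print("ranks that cannot get a GPU:", failing_ranks(16, 8))
```

If the guess holds, choosing a system configuration whose GPU count matches the machine (8 here) should avoid the error.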

Data preprocessing script is missing

I cannot preprocess the raw data for any of the benchmarks submitted by NVIDIA. Can you please provide the data preprocessing script?

pytorch/maskrcnn: URL of extract_dataset.sh is invalid

When I run the command below:

wget https://raw.githubusercontent.com/mlperf/training/master/object_detection/caffe2/extract_dataset.sh

I get the following error:

wget https://raw.githubusercontent.com/mlperf/training/master/object_detection/caffe2/extract_dataset.sh
--2021-04-20 18:34:51--  https://raw.githubusercontent.com/mlperf/training/master/object_detection/caffe2/extract_dataset.sh
Proxy request sent, awaiting response... 404 Not Found
2021-04-20 18:34:53 ERROR 404: Not Found.

[single_stage_detector] RuntimeError: cuDNN error: CUDNN_STATUS_NOT_SUPPORTED. This error may appear if you passed in a non-contiguous input. (nhwc_bn_fwd_train_cudnn_impl at /tmp/pip-req-build-pq90lolk/csrc/nhwc/batch_norm.cu:109)

When I run the single_stage_detector benchmark, it fails with the following message:

:::MLPv0.5.0 ssd 1548297287.858427048 (/workspace/single_stage_detector/ssd300.py:69) num_defaults_per_cell: [4, 6, 6, 6, 4, 4]
Traceback (most recent call last):
File "train.py", line 710, in
main()
File "train.py", line 703, in main
success = train300_mlperf_coco(args)
File "train.py", line 513, in train300_mlperf_coco
ssd300.module = torch.jit.trace(module_to_jit, example_input)
File "/opt/conda/lib/python3.6/site-packages/torch/jit/init.py", line 565, in trace
module._create_method_from_trace('forward', func, example_inputs)
File "/opt/conda/lib/python3.6/site-packages/torch/nn/modules/module.py", line 475, in call
result = self._slow_forward(*input, **kwargs)
File "/opt/conda/lib/python3.6/site-packages/torch/nn/modules/module.py", line 465, in _slow_forward
result = self.forward(*input, **kwargs)
File "/opt/conda/lib/python3.6/site-packages/torch/nn/modules/container.py", line 92, in forward
input = module(input)
File "/opt/conda/lib/python3.6/site-packages/torch/nn/modules/module.py", line 475, in call
result = self._slow_forward(*input, **kwargs)
File "/opt/conda/lib/python3.6/site-packages/torch/nn/modules/module.py", line 465, in _slow_forward
result = self.forward(*input, **kwargs)
File "/workspace/single_stage_detector/ssd300.py", line 184, in forward
layers = self.model(data)
File "/opt/conda/lib/python3.6/site-packages/torch/nn/modules/module.py", line 475, in call
result = self._slow_forward(*input, **kwargs)
File "/opt/conda/lib/python3.6/site-packages/torch/nn/modules/module.py", line 465, in _slow_forward
result = self.forward(*input, **kwargs)
File "/workspace/single_stage_detector/base_model.py", line 99, in forward
layer1_activation = self.layer1(data)
File "/opt/conda/lib/python3.6/site-packages/torch/nn/modules/module.py", line 475, in call
result = self._slow_forward(*input, **kwargs)
File "/opt/conda/lib/python3.6/site-packages/torch/nn/modules/module.py", line 465, in _slow_forward
result = self.forward(*input, **kwargs)
File "/opt/conda/lib/python3.6/site-packages/torch/nn/modules/container.py", line 92, in forward
input = module(input)
File "/opt/conda/lib/python3.6/site-packages/torch/nn/modules/module.py", line 475, in call
result = self._slow_forward(*input, **kwargs)
File "/opt/conda/lib/python3.6/site-packages/torch/nn/modules/module.py", line 465, in _slow_forward
result = self.forward(*input, **kwargs)
File "/workspace/single_stage_detector/nhwc/batch_norm.py", line 73, in forward
self.eps, self.fuse_relu, self.training, z)
File "/workspace/single_stage_detector/nhwc/batch_norm.py", line 36, in forward
y, save_mean, save_var, reserve = C.bn_fwd_nhwc_cudnn(x, s, b, rm, riv, mom, epsilon, fuse_relu)
RuntimeError: cuDNN error: CUDNN_STATUS_NOT_SUPPORTED. This error may appear if you passed in a non-contiguous input. (nhwc_bn_fwd_train_cudnn_impl at /tmp/pip-req-build-pq90lolk/csrc/nhwc/batch_norm.cu:109)

Is a dataset or model file missing in my setup?
I prepared the dataset in /data/coco2017/ but did not include any related model file.
Thanks.
