mlcommons / training_results_v1.0

This repository contains the results and code for the MLPerf™ Training v1.0 benchmark.

Home Page: https://mlcommons.org/en/training-normal-10/

License: Apache License 2.0

Languages: Dockerfile 0.30%, Shell 3.36%, Python 33.94%, CMake 0.36%, C++ 17.28%, Cuda 13.85%, Jupyter Notebook 29.66%, Perl 0.02%, Starlark 0.18%, Makefile 0.05%, TypeScript 0.61%, HTML 0.12%, CSS 0.01%, JavaScript 0.11%, Awk 0.09%, C 0.05%, Visual Basic 6.0 0.02%

training_results_v1.0's People

Contributors: nathanw-mlc

training_results_v1.0's Issues

Command is missing the path to the /cks directory for model.ckpt-28252.pt

When I run the command

python convert_tf_checkpoint.py --tf_checkpoint /cks/model.ckpt-28252.index --bert_config_path /cks/bert_config.json --output_checkpoint model.ckpt-28252.pt

it works as expected. However, it writes the needed file (model.ckpt-28252.pt) to /workspace/bert inside the container, so the file is lost when the container exits. The correct command should be

python convert_tf_checkpoint.py --tf_checkpoint /cks/model.ckpt-28252.index --bert_config_path /cks/bert_config.json --output_checkpoint /cks/model.ckpt-28252.pt
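For anyone hitting the same thing: any path inside the container that is not a bind mount is discarded when the container exits, so the output checkpoint has to land in a mounted directory. A minimal sketch, assuming the repository's BERT image is tagged mlperf-nvidia:language_model and the checkpoints live under $HOME/cks on the host (both are assumptions; adjust to your setup):

# Hypothetical host layout.
CKS_DIR=$HOME/cks   # contains model.ckpt-28252.index/.data and bert_config.json

# Bind-mount the checkpoint directory so anything written to /cks
# survives after the container exits.
docker run --rm -it --gpus=all \
  --volume="${CKS_DIR}:/cks" \
  mlperf-nvidia:language_model \
  python convert_tf_checkpoint.py \
    --tf_checkpoint /cks/model.ckpt-28252.index \
    --bert_config_path /cks/bert_config.json \
    --output_checkpoint /cks/model.ckpt-28252.pt

Alternatively, docker cp from the still-running container works, but writing straight into the mount is simpler.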

Issue with DGX config file (what changes are required in the config file?)

We are trying to run the RNNT benchmark on our DGX Station (https://www.nvidia.com/content/dam/en-zz/Solutions/Data-Center/dgx-station/368040-DGX-Station-DS-R11.pdf). Please help us set the right config parameters. Here are our logs after executing the following command:

CONT=mlperf-nvidia:rnn_speech_recognition-pytorch DATADIR=<path/to/data/dir> LOGDIR=<path/to/output/dir> METADATA_DIR=<path/to/metadata/dir> SENTENCEPIECES_DIR=<path/to/sentencepieces/dir> bash ./run_with_docker.sh

  + : baseline_DGXA100_8x8x32x1
  + : mlperf-nvidia:rnn_speech_recognition-pytorch
  + : 10
    ++ date +%y%m%d%H%M%S%N
  + : 211008164556434440824
  + : 1
  + : /raid/speech_processing/pytorch/datasets
  + : /raid/speech_processing/pytorch/results
  + : ./api_logs
  + : 17584
  + : 01:00:00
  + readonly _config_file=./config_baseline_DGXA100_8x8x32x1.sh
  + _config_file=./config_baseline_DGXA100_8x8x32x1.sh
  + readonly _logfile_base=/raid/speech_processing/pytorch/results/result_211008164556434440824
  + _logfile_base=/raid/speech_processing/pytorch/results/result_211008164556434440824
  + readonly _cont_name=rnn_speech_recognition
  + _cont_name=rnn_speech_recognition
  + _cont_mounts=("--volume=${DATADIR}:/datasets/" "--volume=${LOGDIR}:/results" "--volume=${METADATA_DIR}:/metadata" "--volume=${SENTENCEPIECES_DIR}:/sentencepieces")
  + '[' '' -eq 1 ']'
    ./run_with_docker.sh: line 30: [: : integer expression expected
    ++ source /etc/os-release
    +++ NAME=Ubuntu
    +++ VERSION='18.04.5 LTS (Bionic Beaver)'
    +++ ID=ubuntu
    +++ ID_LIKE=debian
    +++ PRETTY_NAME='Ubuntu 18.04.5 LTS'
    +++ VERSION_ID=18.04
    +++ HOME_URL=https://www.ubuntu.com/
    +++ SUPPORT_URL=https://help.ubuntu.com/
    +++ BUG_REPORT_URL=https://bugs.launchpad.net/ubuntu/
    +++ PRIVACY_POLICY_URL=https://www.ubuntu.com/legal/terms-and-policies/privacy-policy
    +++ VERSION_CODENAME=bionic
    +++ UBUNTU_CODENAME=bionic
    ++ source /etc/dgx-release
    +++ DGX_NAME='DGX Station'
    +++ DGX_PRETTY_NAME='NVIDIA DGX Station'
    +++ DGX_SWBUILD_DATE=2017-10-31
    +++ DGX_SWBUILD_VERSION=3.1.3
    +++ DGX_COMMIT_ID=31e745794370d852fdb0a178ef022a872f58efdf
    +++ DGX_SERIAL_NUMBER=0154917000004
    +++ DGX_OTA_VERSION=3.1.4
    +++ DGX_OTA_DATE='Wed Jan 31 15:04:20 IST 2018'
    +++ DGX_OTA_VERSION=3.1.7
    +++ DGX_OTA_DATE='Tue Nov 27 15:28:38 IST 2018'
    +++ DGX_OTA_VERSION=4.0.4
    +++ DGX_OTA_DATE='Thu Dec 13 15:09:08 IST 2018'
    +++ DGX_OTA_VERSION=4.0.6
    +++ DGX_OTA_DATE='Wed Aug 7 19:13:38 IST 2019'
    +++ DGX_OTA_VERSION=4.0.7
    +++ DGX_OTA_DATE='Mon Sep 14 09:48:51 IST 2020'
    ++ echo 'Ubuntu 18.04.5 LTS / NVIDIA DGX Station 4.0.7'
  + MLPERF_HOST_OS='Ubuntu 18.04.5 LTS / NVIDIA DGX Station 4.0.7'
  + export MLPERF_HOST_OS
  + mkdir -p /raid/speech_processing/pytorch/results
  + source ./config_baseline_DGXA100_8x8x32x1.sh
    ++ export DGXNNODES=8
    ++ DGXNNODES=8
    +++ sed 's/^config_//'
    +++ sed 's/.sh$//'
    ++++ readlink -f ./config_baseline_DGXA100_8x8x32x1.sh
    +++ basename /raid/speech_processing/pytorch/config_baseline_DGXA100_8x8x32x1.sh
    ++ export DGXSYSTEM=baseline_DGXA100_8x8x32x1
    ++ DGXSYSTEM=baseline_DGXA100_8x8x32x1
    ++ export DGXNGPU=8
    ++ DGXNGPU=8
    ++ export DGXSOCKETCORES=24
    ++ DGXSOCKETCORES=24
    ++ export DGXNSOCKET=2
    ++ DGXNSOCKET=2
    ++ export DGXHT=2
    ++ DGXHT=2
    ++ export GRAD_ACCUMULATION_STEPS=1
    ++ GRAD_ACCUMULATION_STEPS=1
    ++ export DATADIR=/raid/datasets/rnnt/LibriSpeech/
    ++ DATADIR=/raid/datasets/rnnt/LibriSpeech/
    ++ export BATCHSIZE=32
    ++ BATCHSIZE=32
    ++ export EVAL_BATCHSIZE=2
    ++ EVAL_BATCHSIZE=2
    ++ export WALLTIME=01:00:00
    ++ WALLTIME=01:00:00
    ++ export VAL_FREQUENCY=1
    ++ VAL_FREQUENCY=1
    ++ export MAX_SYMBOL=300
    ++ MAX_SYMBOL=300
    ++ export EPOCH=90
    ++ EPOCH=90
    ++ export SEED=23975
    ++ SEED=23975
    ++ export LR=0.007
    ++ LR=0.007
    ++ export WEIGHTS_INIT_SCALE=0.5
    ++ WEIGHTS_INIT_SCALE=0.5
    ++ export DATA_CPU_THREADS=8
    ++ DATA_CPU_THREADS=8
  + mapfile -t _config_env
    ++ env -i bash -c '. ./config_baseline_DGXA100_8x8x32x1.sh && compgen -e'
    ++ grep -E -v '^(PWD|SHLVL)'
  + _config_env+=(MLPERF_HOST_OS)
  + mapfile -t _config_env
    ++ for v in "${_config_env[@]}"
    ++ echo --env=BATCHSIZE
    ++ for v in "${_config_env[@]}"
    ++ echo --env=DATADIR
    ++ for v in "${_config_env[@]}"
    ++ echo --env=DATA_CPU_THREADS
    ++ for v in "${_config_env[@]}"
    ++ echo --env=DGXHT
    ++ for v in "${_config_env[@]}"
    ++ echo --env=DGXNGPU
    ++ for v in "${_config_env[@]}"
    ++ echo --env=DGXNNODES
    ++ for v in "${_config_env[@]}"
    ++ echo --env=DGXNSOCKET
    ++ for v in "${_config_env[@]}"
    ++ echo --env=DGXSOCKETCORES
    ++ for v in "${_config_env[@]}"
    ++ echo --env=DGXSYSTEM
    ++ for v in "${_config_env[@]}"
    ++ echo --env=EPOCH
    ++ for v in "${_config_env[@]}"
    ++ echo --env=EVAL_BATCHSIZE
    ++ for v in "${_config_env[@]}"
    ++ echo --env=GRAD_ACCUMULATION_STEPS
    ++ for v in "${_config_env[@]}"
    ++ echo --env=LR
    ++ for v in "${_config_env[@]}"
    ++ echo --env=MAX_SYMBOL
    ++ for v in "${_config_env[@]}"
    ++ echo --env=SEED
    ++ for v in "${_config_env[@]}"
    ++ echo --env=VAL_FREQUENCY
    ++ for v in "${_config_env[@]}"
    ++ echo --env=WALLTIME
    ++ for v in "${_config_env[@]}"
    ++ echo --env=WEIGHTS_INIT_SCALE
    ++ for v in "${_config_env[@]}"
    ++ echo --env=MLPERF_HOST_OS
  + docker run --rm --init --detach --net=host --uts=host --ipc=host --security-opt=seccomp=unconfined --ulimit=stack=67108864 --ulimit=memlock=-1 --name=rnn_speech_recognition --volume=/raid/speech_processing/pytorch/datasets:/datasets/ --volume=/raid/speech_processing/pytorch/results:/results --volume=/raid/speech_processing/pytorch/tokenized:/metadata --volume=/raid/speech_processing/pytorch/sentenpieces:/sentencepieces mlperf-nvidia:rnn_speech_recognition-pytorch sleep infinity
    0a6482c870f1468836b708d3f340a3dfec7b49ef2b1c352b2c5bf588803cf29c
  + docker exec -it rnn_speech_recognition true
  + [[ baseline_DGXA100_8x8x32x1 == \D\G\X\A\1\0\0* ]]
    ++ seq 1 10
  + for _experiment_index in $(seq 1 "${NEXP}")
  + tee /raid/speech_processing/pytorch/results/result_211008164556434440824_1.txt
    tee: /raid/speech_processing/pytorch/results/result_211008164556434440824_1.txt: Permission denied
  + echo 'Beginning trial 1 of 10'
    Beginning trial 1 of 10
  + docker exec -it rnn_speech_recognition python -c ''
  + '[' 1 -eq 1 ']'
  + sync
  + docker exec -it rnn_speech_recognition python -c '
    from mlperf import logging
    logging.log_event(key=logging.constants.CACHE_CLEAR, value=True)'
    :::MLLOG {"namespace": "", "time_ms": 1633691758520, "event_type": "POINT_IN_TIME", "key": "cache_clear", "value": true, "metadata": {"file": "", "lineno": 3}}
  + docker exec -it --env=BATCHSIZE --env=DATADIR --env=DATA_CPU_THREADS --env=DGXHT --env=DGXNGPU --env=DGXNNODES --env=DGXNSOCKET --env=DGXSOCKETCORES --env=DGXSYSTEM --env=EPOCH --env=EVAL_BATCHSIZE --env=GRAD_ACCUMULATION_STEPS --env=LR --env=MAX_SYMBOL --env=SEED --env=VAL_FREQUENCY --env=WALLTIME --env=WEIGHTS_INIT_SCALE --env=MLPERF_HOST_OS rnn_speech_recognition ./run_and_time.sh
    ./run_and_time.sh: line 24: [: : integer expression expected
    STARTING TIMING RUN AT 2021-10-08 11:15:58 AM
    running benchmark
    python -u -m bind_launch --nsockets_per_node=2 --ncores_per_socket=24 --nproc_per_node=8
    ./run_and_time.sh: line 140: [: -ne: unary operator expected
    libnuma: Warning: cpu argument 48-53 is out of range

<0-5,48-53> is invalid
usage: numactl [--all | -a] [--interleave= | -i <nodes>] [--preferred= | -p <node>]
               [--physcpubind= | -C <cpus>] [--cpunodebind= | -N <nodes>]
               [--membind= | -m <nodes>] [--localalloc | -l] command args ...
       numactl [--show | -s]
       numactl [--hardware | -H]
       numactl [--length | -l <length>] [--offset | -o <offset>] [--shmmode | -M <shmmode>]
               [--strict | -t]
               [--shmid | -I <id>] --shm | -S <shmkeyfile>
               [--shmid | -I <id>] --file | -f <shmmemfile>
               [--huge | -u] [--touch | -T]
               memory policy | --dump | -d | --dump-nodes | -D

memory policy is --interleave | -i, --preferred | -p, --membind | -m, --localalloc | -l
<nodes> is a comma delimited list of node numbers or A-B ranges or all.
Instead of a number a node can also be:
  netdev:DEV the node connected to network device DEV
  file:PATH  the node the block device of path is connected to
  ip:HOST    the node of the network device host routes through
  block:PATH the node of block device path
  pci:[seg:]bus:dev[:func] The node of a PCI device
<cpus> is a comma delimited list of cpu numbers or A-B ranges or all
all ranges can be inverted with !
all numbers and ranges can be made cpuset-relative with +
the old --cpubind argument is deprecated.
use --cpunodebind or --physcpubind instead
<length> can have g (GB), m (MB) or k (KB) suffixes
libnuma: Warning: cpu argument 54-59 is out of range

<6-11,54-59> is invalid
libnuma: Warning: cpu argument 60-65 is out of range

<12-17,60-65> is invalid
libnuma: Warning: cpu argument 66-71 is out of range

<18-23,66-71> is invalid
libnuma: Warning: cpu argument 72-77 is out of range

<24-29,72-77> is invalid
libnuma: Warning: cpu argument 78-83 is out of range

<30-35,78-83> is invalid
libnuma: Warning: cpu argument 41,84-89 out of range

<36-41,84-89> is invalid
libnuma: Warning: cpu argument 42-47,90-95 is out of range

<42-47,90-95> is invalid
(numactl prints the same usage message after each invalid binding; repeated output omitted)
ENDING TIMING RUN AT 2021-10-08 11:16:02 AM
RESULT,RNN_SPEECH_RECOGNITION,,4,nvidia,2021-10-08 11:15:58 AM
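A note on the failure above: the numactl errors come from the config, not the benchmark code. config_baseline_DGXA100_8x8x32x1.sh describes a DGX A100 node with 2 sockets × 24 cores × 2 hyperthreads = 96 logical CPUs, so bind_launch generates CPU bindings up to 95, while a DGX Station has a single 20-core Xeon E5-2698 v4 (40 logical CPUs) and 4 V100 GPUs; every binding past CPU 39 is rejected. A minimal single-node config sketch follows; the filename and all values below are assumptions to tune, not a validated submission config:

# config_baseline_DGXStation_1x4x32x1.sh -- hypothetical DGX Station config
export DGXNNODES=1        # single workstation, not an 8-node cluster
export DGXNGPU=4          # DGX Station has 4 V100 GPUs
export DGXNSOCKET=1       # one CPU socket
export DGXSOCKETCORES=20  # 20 physical cores (Xeon E5-2698 v4)
export DGXHT=2            # hyperthreading on -> 40 logical CPUs, bindings stay in range
export BATCHSIZE=32
export EVAL_BATCHSIZE=2
export GRAD_ACCUMULATION_STEPS=16  # assumption: recover some of the lost global batch
export WALLTIME=04:00:00  # expect far longer runs than on 8x DGX A100
export VAL_FREQUENCY=1
export MAX_SYMBOL=300
export EPOCH=90
export SEED=23975
export LR=0.007           # tuned for the original global batch; may need re-tuning
export WEIGHTS_INIT_SCALE=0.5
export DATA_CPU_THREADS=8

run_with_docker.sh derives DGXSYSTEM from the config filename, so the file must keep the config_<system>.sh naming pattern. Note also that this release's kernels target A100 hardware; whether RNN-T training runs at all on a V100 DGX Station is a separate question from fixing the CPU bindings.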

Checkpoint to ".pt" conversion fails

Hello all,

I want to convert the TensorFlow checkpoint to a PyTorch model in .pt format. I ran:

python convert_tf_checkpoint.py --tf_checkpoint /path/to/model.ckpt-xxxxx.index --bert_config_path /path/to/bert_config.json --output_checkpoint /path/to/model_out.pt

I am getting an undefined symbol error that is preventing the conversion:

ImportError: /home/user/.virtualenvs/ai/lib/python3.6/site-packages/fast_self_multihead_attn.cpython-36m-x86_64-linux-gnu.so: undefined symbol: _ZN2at4cuda4blas19_cublasGetErrorEnumE14cublasStatus_t

Is there any solution for this?

Full stack traceback:

Traceback (most recent call last):
  File "convert_tf_checkpoint.py", line 17, in <module>
    from modeling import BertForPretraining, BertConfig
  File "/desktop/user/bert/dell_pytorch_BERT/pytorch/modeling.py", line 37, in <module>
    from apex.contrib.multihead_attn import SelfMultiheadAttn
  File "/home/user/.virtualenvs/ai/lib/python3.6/site-packages/apex/contrib/multihead_attn/__init__.py", line 1, in <module>
    from .self_multihead_attn import SelfMultiheadAttn
  File "/home/user/.virtualenvs/ai/lib/python3.6/site-packages/apex/contrib/multihead_attn/self_multihead_attn.py", line 9, in <module>
    from .fast_self_multihead_attn_func import fast_self_attn_func
  File "/home/user/.virtualenvs/ai/lib/python3.6/site-packages/apex/contrib/multihead_attn/fast_self_multihead_attn_func.py", line 2, in <module>
    import fast_self_multihead_attn
ImportError: /home/user/.virtualenvs/ai/lib/python3.6/site-packages/fast_self_multihead_attn.cpython-36m-x86_64-linux-gnu.so: undefined symbol: _ZN2at4cuda4blas19_cublasGetErrorEnumE14cublasStatus_t

Kindly provide details if someone has encountered a similar issue and was able to resolve it.

Thank you.
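The missing symbol is an at::cuda::blas helper, which typically means apex's compiled extensions were built against a different PyTorch (or CUDA) than the one now importing them. A minimal rebuild sketch, assuming a CUDA toolkit matching your torch build is on PATH; the extras flags come from apex's setup.py, but exact options depend on the apex revision:

# Rebuild apex against the currently installed torch so its compiled
# extensions resolve symbols in this torch's libraries.
pip uninstall -y apex
git clone https://github.com/NVIDIA/apex
cd apex
# --fast_multihead_attn builds the fast_self_multihead_attn extension
# whose import fails above.
pip install -v --no-cache-dir \
  --global-option="--cpp_ext" --global-option="--cuda_ext" \
  --global-option="--fast_multihead_attn" ./

Running the conversion inside the benchmark's own Docker image, where apex is prebuilt against the bundled torch, sidesteps the mismatch entirely.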

mlcommons training_results_v1.0 pytorch bert model fails to run on V100 multi-GPU

Hi, all. I tried to run the mlcommons training_results_v1.0 PyTorch BERT model on multiple V100 GPUs, but it failed. I modified the run_test.sh script as follows:

#!/bin/bash

python -m torch.distributed.launch --nproc_per_node=2 \
    /workspace/bert/run_pretraining.py \
    --seed=42 \
    --do_train \
    --target_mlm_accuracy=0.714 \
    --skip_checkpoint \
    --output_dir=/results \
    --fp16 \
    --allreduce_post_accumulation --allreduce_post_accumulation_fp16 \
    --gradient_accumulation_steps=1 \
    --log_freq=1 \
    --train_batch_size=4 \
    --learning_rate=4e-5 \
    --warmup_proportion=1.0 \
    --input_dir=/data/2048_shards_uncompressed  \
    --phase2 \
    --max_seq_length=512 \
    --max_predictions_per_seq=76 \
    --max_steps=100 \
    --init_checkpoint=/data/model.ckpt-28252.pt \
    --bert_config_path=/data/bert_config.json \
    --distributed_lamb   --dwu-num-rs-pg=1 --dwu-num-ar-pg=1 --dwu-num-blocks=1  \
    --eval_iter_start_samples=100000 --eval_iter_samples=100000 \
    --eval_batch_size=16 --eval_dir=/data/2048_shards_uncompressed \
    --fp16 --fused_gelu_bias --fused_mha --dense_seq_output --unpad --unpad_fmha --exchange_padding

and ran it, but it reports the following error:

......
Torch distributed is available.
Torch distributed is initialized.
(both ranks print the same traceback; de-interleaved and shown once below)
Traceback (most recent call last):
  File "/workspace/bert/run_pretraining.py", line 1592, in <module>
    args, final_loss, train_time_raw = main()
  File "/workspace/bert/run_pretraining.py", line 1141, in main
    model = fwd_loss_bwd_trainer.capture_bert_model_segment_graph(model, use_cuda_graph)
  File "/workspace/bert/fwd_loss_bwd_trainer.py", line 43, in capture_bert_model_segment_graph
    bert_model_segment = graph(bert_model_segment,
  File "/workspace/bert/function.py", line 66, in graph
    outputs = func_or_module(*sample_args)
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1015, in _call_impl
    return forward_call(*input, **kwargs)
  File "/workspace/bert/modeling.py", line 1009, in forward
    sequence_output, pooled_output = self.bert(input_ids, token_type_ids, attention_mask,
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1015, in _call_impl
    return forward_call(*input, **kwargs)
  File "/workspace/bert/modeling.py", line 901, in forward
    encoded_layers = self.encoder(embedding_output,
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1015, in _call_impl
    return forward_call(*input, **kwargs)
  File "/workspace/bert/modeling.py", line 577, in forward
    hidden_states = layer_module(hidden_states, cu_seqlens, actual_seqlens, maxseqlen_in_batch)
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1015, in _call_impl
    return forward_call(*input, **kwargs)
  File "/workspace/bert/modeling.py", line 500, in forward
    attention_output = self.attention(hidden_states, attention_mask, seqlen, batch)
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1015, in _call_impl
    return forward_call(*input, **kwargs)
  File "/workspace/bert/modeling.py", line 424, in forward
    self_output = self.self(input_tensor, attention_mask, seqlen, batch, is_training = self.training)
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1015, in _call_impl
    return forward_call(*input, **kwargs)
  File "/workspace/bert/fmha.py", line 161, in forward
    ctx = FMHAFun.apply(qkv.view(-1, 3, self.h, self.d), cu_seqlens, seqlens, p_dropout, max_s, is_training)
  File "/opt/conda/lib/python3.8/site-packages/apex/contrib/fmha/fmha.py", line 36, in forward
    context, S_dmask = mha.fwd(qkv, cu_seqlens, seqlens, p_dropout, max_s, is_training, None)
RuntimeError: Expected dprops->major == 8 && dprops->minor == 0 to be true, but got false.  (Could this error message be improved?  If so, please report an enhancement request to PyTorch.)
......

What could be the cause?
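The failing check lives in apex's FMHA extension: dprops->major == 8 && dprops->minor == 0 means the fused multi-head attention kernel only supports compute capability 8.0, i.e. A100. A V100 is sm_70, so the A100-only attention path has to come off the command line. A quick check plus a hedged adjustment; exactly which fusion flags must go beyond --unpad_fmha (the one driving the FMHAFun path in this traceback) is an assumption to verify:

# Confirm the device's compute capability from inside the container.
python -c "import torch; print(torch.cuda.get_device_capability(0))"
# A100 prints (8, 0); V100 prints (7, 0).

# On sm_70, drop the FMHA path from run_test.sh, e.g. replace the last line with:
#   --fp16 --fused_gelu_bias --dense_seq_output --unpad --exchange_padding
# and retest; --fused_mha may need the same treatment if its kernel is also sm_80-only.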

DataLoader crash when using FI_EFA_USE_DEVICE_RDMA=1

Our AWS p4d.24xlarge job passed on 08/24, and the throughput was 3511 samples/second.
We used two p4d.24xlarge instances with FI_PROVIDER="efa" and FI_EFA_USE_DEVICE_RDMA=1.

This test failed recently. The error message follows:

File "/workspace/bert/run_pretraining.py", line 1592, in
args, final_loss, train_time_raw = main()
File "/workspace/bert/run_pretraining.py", line 1344, in main
for step, batch in enumerate(train_dataloader):
File "/opt/conda/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 356, in iter
return self._get_iterator()
File "/opt/conda/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 302, in _get_iterator
return _MultiProcessingDataLoaderIter(self)
File "/opt/conda/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 941, in init
self._reset(loader, first_iter=True)
File "/opt/conda/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 972, in _reset
self._try_put_index()
File "/opt/conda/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 1206, in _try_put_index
index = self._next_index()
File "/opt/conda/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 509, in _next_index
return next(self._sampler_iter) # may raise StopIteration
File "/opt/conda/lib/python3.8/site-packages/torch/utils/data/sampler.py", line 226, in iter
for idx in self.sampler:
File "/opt/conda/lib/python3.8/site-packages/torch/utils/data/sampler.py", line 124, in iter
yield from torch.randperm(n, generator=generator).tolist()
File "/opt/conda/lib/python3.8/site-packages/torch/utils/data/_utils/signal_handling.py", line 66, in handler
_error_if_any_worker_fails()
RuntimeError: DataLoader worker (pid 827) is killed by signal: Segmentation fault.

When we test without FI_EFA_USE_DEVICE_RDMA=1, the test passes, but the throughput drops to 1673 samples/sec.

This is the Dockerfile we used:
https://github.com/aws-samples/aws-efa-nccl-baseami-pipeline/blob/master/nvidia-efa-docker_base/Dockerfile.base
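The symptom pattern (workers segfault only when GPUDirect RDMA is on) fits a fork-safety problem: PyTorch DataLoader workers are forked, and RDMA-registered memory is not fork-safe by default. A minimal sketch of the usual mitigations; FI_EFA_FORK_SAFE and RDMAV_FORK_SAFE are real libfabric/rdma-core knobs, but whether they cure this particular crash is an assumption to verify:

# Keep device RDMA enabled but tell the EFA provider to prepare for fork(),
# so forked DataLoader workers don't corrupt registered memory regions.
export FI_PROVIDER=efa
export FI_EFA_USE_DEVICE_RDMA=1
export FI_EFA_FORK_SAFE=1      # libfabric EFA provider fork support
export RDMAV_FORK_SAFE=1       # older rdma-core equivalent; harmless to set both

# As a bisect, running the loader in-process (no forked workers) separates
# the data pipeline from the RDMA path; check run_pretraining.py for the
# actual worker-count option, as the flag name varies by script.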
