mlcommons / training_results_v1.0

This repository contains the results and code for the MLPerf™ Training v1.0 benchmark.

Home Page: https://mlcommons.org/en/training-normal-10/

License: Apache License 2.0

Languages: Dockerfile 0.30%, Shell 3.36%, Python 33.94%, CMake 0.36%, C++ 17.28%, Cuda 13.85%, Jupyter Notebook 29.66%, Perl 0.02%, Starlark 0.18%, Makefile 0.05%, TypeScript 0.61%, HTML 0.12%, CSS 0.01%, JavaScript 0.11%, Awk 0.09%, C 0.05%, Visual Basic 6.0 0.02%

training_results_v1.0's People

Contributors: nathanw-mlc

training_results_v1.0's Issues

Command is missing the path to the /cks directory for model.ckpt-28252.pt

When I run the command

python convert_tf_checkpoint.py --tf_checkpoint /cks/model.ckpt-28252.index --bert_config_path /cks/bert_config.json --output_checkpoint model.ckpt-28252.pt

it works as expected. However, it writes the needed file (model.ckpt-28252.pt) to /workspace/bert inside the container, so the file is lost when the container exits. The correct command should be

python convert_tf_checkpoint.py --tf_checkpoint /cks/model.ckpt-28252.index --bert_config_path /cks/bert_config.json --output_checkpoint /cks/model.ckpt-28252.pt
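For anyone hitting the same thing: any path inside the container that is not a bind mount is discarded when the container exits, so the output checkpoint has to land in a mounted directory. A minimal sketch, assuming the repository's BERT image is tagged mlperf-nvidia:language_model and the checkpoints live under $HOME/cks on the host (both are assumptions; adjust to your setup):

# Hypothetical host layout.
CKS_DIR=$HOME/cks   # contains model.ckpt-28252.index/.data and bert_config.json

# Bind-mount the checkpoint directory so anything written to /cks
# survives after the container exits.
docker run --rm -it --gpus=all \
  --volume="${CKS_DIR}:/cks" \
  mlperf-nvidia:language_model \
  python convert_tf_checkpoint.py \
    --tf_checkpoint /cks/model.ckpt-28252.index \
    --bert_config_path /cks/bert_config.json \
    --output_checkpoint /cks/model.ckpt-28252.pt

Alternatively, docker cp from the still-running container works, but writing straight into the mount is simpler.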

Issue with DGX config file (what changes are required in the config file?)

We are trying to run the RNNT benchmark on our DGX Station (https://www.nvidia.com/content/dam/en-zz/Solutions/Data-Center/dgx-station/368040-DGX-Station-DS-R11.pdf). Please help us set the right config parameters. Here are our logs after executing the following command:

CONT=mlperf-nvidia:rnn_speech_recognition-pytorch DATADIR=<path/to/data/dir> LOGDIR=<path/to/output/dir> METADATA_DIR=<path/to/metadata/dir> SENTENCEPIECES_DIR=<path/to/sentencepieces/dir> bash ./run_with_docker.sh

  + : baseline_DGXA100_8x8x32x1
  + : mlperf-nvidia:rnn_speech_recognition-pytorch
  + : 10
    ++ date +%y%m%d%H%M%S%N
  + : 211008164556434440824
  + : 1
  + : /raid/speech_processing/pytorch/datasets
  + : /raid/speech_processing/pytorch/results
  + : ./api_logs
  + : 17584
  + : 01:00:00
  + readonly _config_file=./config_baseline_DGXA100_8x8x32x1.sh
  + _config_file=./config_baseline_DGXA100_8x8x32x1.sh
  + readonly _logfile_base=/raid/speech_processing/pytorch/results/result_211008164556434440824
  + _logfile_base=/raid/speech_processing/pytorch/results/result_211008164556434440824
  + readonly _cont_name=rnn_speech_recognition
  + _cont_name=rnn_speech_recognition
  + _cont_mounts=("--volume=${DATADIR}:/datasets/" "--volume=${LOGDIR}:/results" "--volume=${METADATA_DIR}:/metadata" "--volume=${SENTENCEPIECES_DIR}:/sentencepieces")
  + '[' '' -eq 1 ']'
    ./run_with_docker.sh: line 30: [: : integer expression expected
    ++ source /etc/os-release
    +++ NAME=Ubuntu
    +++ VERSION='18.04.5 LTS (Bionic Beaver)'
    +++ ID=ubuntu
    +++ ID_LIKE=debian
    +++ PRETTY_NAME='Ubuntu 18.04.5 LTS'
    +++ VERSION_ID=18.04
    +++ HOME_URL=https://www.ubuntu.com/
    +++ SUPPORT_URL=https://help.ubuntu.com/
    +++ BUG_REPORT_URL=https://bugs.launchpad.net/ubuntu/
    +++ PRIVACY_POLICY_URL=https://www.ubuntu.com/legal/terms-and-policies/privacy-policy
    +++ VERSION_CODENAME=bionic
    +++ UBUNTU_CODENAME=bionic
    ++ source /etc/dgx-release
    +++ DGX_NAME='DGX Station'
    +++ DGX_PRETTY_NAME='NVIDIA DGX Station'
    +++ DGX_SWBUILD_DATE=2017-10-31
    +++ DGX_SWBUILD_VERSION=3.1.3
    +++ DGX_COMMIT_ID=31e745794370d852fdb0a178ef022a872f58efdf
    +++ DGX_SERIAL_NUMBER=0154917000004
    +++ DGX_OTA_VERSION=3.1.4
    +++ DGX_OTA_DATE='Wed Jan 31 15:04:20 IST 2018'
    +++ DGX_OTA_VERSION=3.1.7
    +++ DGX_OTA_DATE='Tue Nov 27 15:28:38 IST 2018'
    +++ DGX_OTA_VERSION=4.0.4
    +++ DGX_OTA_DATE='Thu Dec 13 15:09:08 IST 2018'
    +++ DGX_OTA_VERSION=4.0.6
    +++ DGX_OTA_DATE='Wed Aug 7 19:13:38 IST 2019'
    +++ DGX_OTA_VERSION=4.0.7
    +++ DGX_OTA_DATE='Mon Sep 14 09:48:51 IST 2020'
    ++ echo 'Ubuntu 18.04.5 LTS / NVIDIA DGX Station 4.0.7'
  + MLPERF_HOST_OS='Ubuntu 18.04.5 LTS / NVIDIA DGX Station 4.0.7'
  + export MLPERF_HOST_OS
  + mkdir -p /raid/speech_processing/pytorch/results
  + source ./config_baseline_DGXA100_8x8x32x1.sh
    ++ export DGXNNODES=8
    ++ DGXNNODES=8
    +++ sed 's/^config_//'
    +++ sed 's/.sh$//'
    ++++ readlink -f ./config_baseline_DGXA100_8x8x32x1.sh
    +++ basename /raid/speech_processing/pytorch/config_baseline_DGXA100_8x8x32x1.sh
    ++ export DGXSYSTEM=baseline_DGXA100_8x8x32x1
    ++ DGXSYSTEM=baseline_DGXA100_8x8x32x1
    ++ export DGXNGPU=8
    ++ DGXNGPU=8
    ++ export DGXSOCKETCORES=24
    ++ DGXSOCKETCORES=24
    ++ export DGXNSOCKET=2
    ++ DGXNSOCKET=2
    ++ export DGXHT=2
    ++ DGXHT=2
    ++ export GRAD_ACCUMULATION_STEPS=1
    ++ GRAD_ACCUMULATION_STEPS=1
    ++ export DATADIR=/raid/datasets/rnnt/LibriSpeech/
    ++ DATADIR=/raid/datasets/rnnt/LibriSpeech/
    ++ export BATCHSIZE=32
    ++ BATCHSIZE=32
    ++ export EVAL_BATCHSIZE=2
    ++ EVAL_BATCHSIZE=2
    ++ export WALLTIME=01:00:00
    ++ WALLTIME=01:00:00
    ++ export VAL_FREQUENCY=1
    ++ VAL_FREQUENCY=1
    ++ export MAX_SYMBOL=300
    ++ MAX_SYMBOL=300
    ++ export EPOCH=90
    ++ EPOCH=90
    ++ export SEED=23975
    ++ SEED=23975
    ++ export LR=0.007
    ++ LR=0.007
    ++ export WEIGHTS_INIT_SCALE=0.5
    ++ WEIGHTS_INIT_SCALE=0.5
    ++ export DATA_CPU_THREADS=8
    ++ DATA_CPU_THREADS=8
  + mapfile -t _config_env
    ++ env -i bash -c '. ./config_baseline_DGXA100_8x8x32x1.sh && compgen -e'
    ++ grep -E -v '^(PWD|SHLVL)'
  + _config_env+=(MLPERF_HOST_OS)
  + mapfile -t _config_env
    ++ for v in "${_config_env[@]}"
    ++ echo --env=BATCHSIZE
    ++ for v in "${_config_env[@]}"
    ++ echo --env=DATADIR
    ++ for v in "${_config_env[@]}"
    ++ echo --env=DATA_CPU_THREADS
    ++ for v in "${_config_env[@]}"
    ++ echo --env=DGXHT
    ++ for v in "${_config_env[@]}"
    ++ echo --env=DGXNGPU
    ++ for v in "${_config_env[@]}"
    ++ echo --env=DGXNNODES
    ++ for v in "${_config_env[@]}"
    ++ echo --env=DGXNSOCKET
    ++ for v in "${_config_env[@]}"
    ++ echo --env=DGXSOCKETCORES
    ++ for v in "${_config_env[@]}"
    ++ echo --env=DGXSYSTEM
    ++ for v in "${_config_env[@]}"
    ++ echo --env=EPOCH
    ++ for v in "${_config_env[@]}"
    ++ echo --env=EVAL_BATCHSIZE
    ++ for v in "${_config_env[@]}"
    ++ echo --env=GRAD_ACCUMULATION_STEPS
    ++ for v in "${_config_env[@]}"
    ++ echo --env=LR
    ++ for v in "${_config_env[@]}"
    ++ echo --env=MAX_SYMBOL
    ++ for v in "${_config_env[@]}"
    ++ echo --env=SEED
    ++ for v in "${_config_env[@]}"
    ++ echo --env=VAL_FREQUENCY
    ++ for v in "${_config_env[@]}"
    ++ echo --env=WALLTIME
    ++ for v in "${_config_env[@]}"
    ++ echo --env=WEIGHTS_INIT_SCALE
    ++ for v in "${_config_env[@]}"
    ++ echo --env=MLPERF_HOST_OS
  + docker run --rm --init --detach --net=host --uts=host --ipc=host --security-opt=seccomp=unconfined --ulimit=stack=67108864 --ulimit=memlock=-1 --name=rnn_speech_recognition --volume=/raid/speech_processing/pytorch/datasets:/datasets/ --volume=/raid/speech_processing/pytorch/results:/results --volume=/raid/speech_processing/pytorch/tokenized:/metadata --volume=/raid/speech_processing/pytorch/sentenpieces:/sentencepieces mlperf-nvidia:rnn_speech_recognition-pytorch sleep infinity
    0a6482c870f1468836b708d3f340a3dfec7b49ef2b1c352b2c5bf588803cf29c
  + docker exec -it rnn_speech_recognition true
  + [[ baseline_DGXA100_8x8x32x1 == \D\G\X\A\1\0\0* ]]
    ++ seq 1 10
  + for _experiment_index in $(seq 1 "${NEXP}")
  + tee /raid/speech_processing/pytorch/results/result_211008164556434440824_1.txt
    tee: /raid/speech_processing/pytorch/results/result_211008164556434440824_1.txt: Permission denied
  + echo 'Beginning trial 1 of 10'
    Beginning trial 1 of 10
  + docker exec -it rnn_speech_recognition python -c ''
  + '[' 1 -eq 1 ']'
  + sync
  + docker exec -it rnn_speech_recognition python -c '
    from mlperf import logging
    logging.log_event(key=logging.constants.CACHE_CLEAR, value=True)'
    :::MLLOG {"namespace": "", "time_ms": 1633691758520, "event_type": "POINT_IN_TIME", "key": "cache_clear", "value": true, "metadata": {"file": "", "lineno": 3}}
  + docker exec -it --env=BATCHSIZE --env=DATADIR --env=DATA_CPU_THREADS --env=DGXHT --env=DGXNGPU --env=DGXNNODES --env=DGXNSOCKET --env=DGXSOCKETCORES --env=DGXSYSTEM --env=EPOCH --env=EVAL_BATCHSIZE --env=GRAD_ACCUMULATION_STEPS --env=LR --env=MAX_SYMBOL --env=SEED --env=VAL_FREQUENCY --env=WALLTIME --env=WEIGHTS_INIT_SCALE --env=MLPERF_HOST_OS rnn_speech_recognition ./run_and_time.sh
    ./run_and_time.sh: line 24: [: : integer expression expected
    STARTING TIMING RUN AT 2021-10-08 11:15:58 AM
    running benchmark
    python -u -m bind_launch --nsockets_per_node=2 --ncores_per_socket=24 --nproc_per_node=8
    ./run_and_time.sh: line 140: [: -ne: unary operator expected
    libnuma: Warning: cpu argument 48-53 is out of range

<0-5,48-53> is invalid
usage: numactl [--all | -a] [--interleave= | -i <nodes>] [--preferred= | -p <node>]
               [--physcpubind= | -C <cpus>] [--cpunodebind= | -N <nodes>]
               [--membind= | -m <nodes>] [--localalloc | -l] command args ...
       numactl [--show | -s]
       numactl [--hardware | -H]
       numactl [--length | -l <length>] [--offset | -o <offset>] [--shmmode | -M <shmmode>]
               [--strict | -t]
               [--shmid | -I <id>] --shm | -S <shmkeyfile>
               [--shmid | -I <id>] --file | -f <shmmemfile>
               [--huge | -u] [--touch | -T]
               memory policy | --dump | -d | --dump-nodes | -D

memory policy is --interleave | -i, --preferred | -p, --membind | -m, --localalloc | -l
<nodes> is a comma delimited list of node numbers or A-B ranges or all.
Instead of a number a node can also be:
  netdev:DEV the node connected to network device DEV
  file:PATH  the node the block device of path is connected to
  ip:HOST    the node of the network device host routes through
  block:PATH the node of block device path
  pci:[seg:]bus:dev[:func] The node of a PCI device
<cpus> is a comma delimited list of cpu numbers or A-B ranges or all
all ranges can be inverted with !
all numbers and ranges can be made cpuset-relative with +
the old --cpubind argument is deprecated.
use --cpunodebind or --physcpubind instead
<length> can have g (GB), m (MB) or k (KB) suffixes
libnuma: Warning: cpu argument 54-59 is out of range

<6-11,54-59> is invalid
libnuma: Warning: cpu argument 60-65 is out of range

<12-17,60-65> is invalid
libnuma: Warning: cpu argument 66-71 is out of range

<18-23,66-71> is invalid
libnuma: Warning: cpu argument 72-77 is out of range

<24-29,72-77> is invalid
libnuma: Warning: cpu argument 78-83 is out of range

<30-35,78-83> is invalid
libnuma: Warning: cpu argument 41,84-89 out of range

<36-41,84-89> is invalid
libnuma: Warning: cpu argument 42-47,90-95 is out of range

<42-47,90-95> is invalid
(numactl prints the same usage message after each invalid binding; repeated output omitted)
ENDING TIMING RUN AT 2021-10-08 11:16:02 AM
RESULT,RNN_SPEECH_RECOGNITION,,4,nvidia,2021-10-08 11:15:58 AM
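A note on the failure above: the numactl errors come from the config, not the benchmark code. config_baseline_DGXA100_8x8x32x1.sh describes a DGX A100 node with 2 sockets × 24 cores × 2 hyperthreads = 96 logical CPUs, so bind_launch generates CPU bindings up to 95, while a DGX Station has a single 20-core Xeon E5-2698 v4 (40 logical CPUs) and 4 V100 GPUs; every binding past CPU 39 is rejected. A minimal single-node config sketch follows; the filename and all values below are assumptions to tune, not a validated submission config:

# config_baseline_DGXStation_1x4x32x1.sh -- hypothetical DGX Station config
export DGXNNODES=1        # single workstation, not an 8-node cluster
export DGXNGPU=4          # DGX Station has 4 V100 GPUs
export DGXNSOCKET=1       # one CPU socket
export DGXSOCKETCORES=20  # 20 physical cores (Xeon E5-2698 v4)
export DGXHT=2            # hyperthreading on -> 40 logical CPUs, bindings stay in range
export BATCHSIZE=32
export EVAL_BATCHSIZE=2
export GRAD_ACCUMULATION_STEPS=16  # assumption: recover some of the lost global batch
export WALLTIME=04:00:00  # expect far longer runs than on 8x DGX A100
export VAL_FREQUENCY=1
export MAX_SYMBOL=300
export EPOCH=90
export SEED=23975
export LR=0.007           # tuned for the original global batch; may need re-tuning
export WEIGHTS_INIT_SCALE=0.5
export DATA_CPU_THREADS=8

run_with_docker.sh derives DGXSYSTEM from the config filename, so the file must keep the config_<system>.sh naming pattern. Note also that this release's kernels target A100 hardware; whether RNN-T training runs at all on a V100 DGX Station is a separate question from fixing the CPU bindings.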

Checkpoint to ".pt" conversion fails

Hello all,

I want to convert the TensorFlow checkpoint to a PyTorch model in .pt format. I ran:

python convert_tf_checkpoint.py --tf_checkpoint /path/to/model.ckpt-xxxxx.index --bert_config_path /path/to/bert_config.json --output_checkpoint /path/to/model_out.pt

I am getting an undefined symbol error that is preventing the conversion:

ImportError: /home/user/.virtualenvs/ai/lib/python3.6/site-packages/fast_self_multihead_attn.cpython-36m-x86_64-linux-gnu.so: undefined symbol: _ZN2at4cuda4blas19_cublasGetErrorEnumE14cublasStatus_t

Is there any solution for this?

Full stack traceback:

Traceback (most recent call last):
  File "convert_tf_checkpoint.py", line 17, in <module>
    from modeling import BertForPretraining, BertConfig
  File "/desktop/user/bert/dell_pytorch_BERT/pytorch/modeling.py", line 37, in <module>
    from apex.contrib.multihead_attn import SelfMultiheadAttn
  File "/home/user/.virtualenvs/ai/lib/python3.6/site-packages/apex/contrib/multihead_attn/__init__.py", line 1, in <module>
    from .self_multihead_attn import SelfMultiheadAttn
  File "/home/user/.virtualenvs/ai/lib/python3.6/site-packages/apex/contrib/multihead_attn/self_multihead_attn.py", line 9, in <module>
    from .fast_self_multihead_attn_func import fast_self_attn_func
  File "/home/user/.virtualenvs/ai/lib/python3.6/site-packages/apex/contrib/multihead_attn/fast_self_multihead_attn_func.py", line 2, in <module>
    import fast_self_multihead_attn
ImportError: /home/user/.virtualenvs/ai/lib/python3.6/site-packages/fast_self_multihead_attn.cpython-36m-x86_64-linux-gnu.so: undefined symbol: _ZN2at4cuda4blas19_cublasGetErrorEnumE14cublasStatus_t

Kindly provide details if someone has encountered a similar issue and was able to resolve it.

Thank you.
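The missing symbol is an at::cuda::blas helper, which typically means apex's compiled extensions were built against a different PyTorch (or CUDA) than the one now importing them. A minimal rebuild sketch, assuming a CUDA toolkit matching your torch build is on PATH; the extras flags come from apex's setup.py, but exact options depend on the apex revision:

# Rebuild apex against the currently installed torch so its compiled
# extensions resolve symbols in this torch's libraries.
pip uninstall -y apex
git clone https://github.com/NVIDIA/apex
cd apex
# --fast_multihead_attn builds the fast_self_multihead_attn extension
# whose import fails above.
pip install -v --no-cache-dir \
  --global-option="--cpp_ext" --global-option="--cuda_ext" \
  --global-option="--fast_multihead_attn" ./

Running the conversion inside the benchmark's own Docker image, where apex is prebuilt against the bundled torch, sidesteps the mismatch entirely.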

mlcommons training_results_v1.0 pytorch bert model fails to run on V100 multi-GPU

Hi, all. I tried to run the mlcommons training_results_v1.0 PyTorch BERT model on multiple V100 GPUs, but it failed. I modified the run_test.sh script as follows:

#!/bin/bash

python -m torch.distributed.launch --nproc_per_node=2 \
    /workspace/bert/run_pretraining.py \
    --seed=42 \
    --do_train \
    --target_mlm_accuracy=0.714 \
    --skip_checkpoint \
    --output_dir=/results \
    --fp16 \
    --allreduce_post_accumulation --allreduce_post_accumulation_fp16 \
    --gradient_accumulation_steps=1 \
    --log_freq=1 \
    --train_batch_size=4 \
    --learning_rate=4e-5 \
    --warmup_proportion=1.0 \
    --input_dir=/data/2048_shards_uncompressed  \
    --phase2 \
    --max_seq_length=512 \
    --max_predictions_per_seq=76 \
    --max_steps=100 \
    --init_checkpoint=/data/model.ckpt-28252.pt \
    --bert_config_path=/data/bert_config.json \
    --distributed_lamb   --dwu-num-rs-pg=1 --dwu-num-ar-pg=1 --dwu-num-blocks=1  \
    --eval_iter_start_samples=100000 --eval_iter_samples=100000 \
    --eval_batch_size=16 --eval_dir=/data/2048_shards_uncompressed \
    --fp16 --fused_gelu_bias --fused_mha --dense_seq_output --unpad --unpad_fmha --exchange_padding

and ran it, but it reports the following error:

......
Torch distributed is available.
Torch distributed is initialized.
(both ranks print the same traceback; de-interleaved and shown once below)
Traceback (most recent call last):
  File "/workspace/bert/run_pretraining.py", line 1592, in <module>
    args, final_loss, train_time_raw = main()
  File "/workspace/bert/run_pretraining.py", line 1141, in main
    model = fwd_loss_bwd_trainer.capture_bert_model_segment_graph(model, use_cuda_graph)
  File "/workspace/bert/fwd_loss_bwd_trainer.py", line 43, in capture_bert_model_segment_graph
    bert_model_segment = graph(bert_model_segment,
  File "/workspace/bert/function.py", line 66, in graph
    outputs = func_or_module(*sample_args)
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1015, in _call_impl
    return forward_call(*input, **kwargs)
  File "/workspace/bert/modeling.py", line 1009, in forward
    sequence_output, pooled_output = self.bert(input_ids, token_type_ids, attention_mask,
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1015, in _call_impl
    return forward_call(*input, **kwargs)
  File "/workspace/bert/modeling.py", line 901, in forward
    encoded_layers = self.encoder(embedding_output,
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1015, in _call_impl
    return forward_call(*input, **kwargs)
  File "/workspace/bert/modeling.py", line 577, in forward
    hidden_states = layer_module(hidden_states, cu_seqlens, actual_seqlens, maxseqlen_in_batch)
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1015, in _call_impl
    return forward_call(*input, **kwargs)
  File "/workspace/bert/modeling.py", line 500, in forward
    attention_output = self.attention(hidden_states, attention_mask, seqlen, batch)
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1015, in _call_impl
    return forward_call(*input, **kwargs)
  File "/workspace/bert/modeling.py", line 424, in forward
    self_output = self.self(input_tensor, attention_mask, seqlen, batch, is_training = self.training)
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1015, in _call_impl
    return forward_call(*input, **kwargs)
  File "/workspace/bert/fmha.py", line 161, in forward
    ctx = FMHAFun.apply(qkv.view(-1, 3, self.h, self.d), cu_seqlens, seqlens, p_dropout, max_s, is_training)
  File "/opt/conda/lib/python3.8/site-packages/apex/contrib/fmha/fmha.py", line 36, in forward
    context, S_dmask = mha.fwd(qkv, cu_seqlens, seqlens, p_dropout, max_s, is_training, None)
RuntimeError: Expected dprops->major == 8 && dprops->minor == 0 to be true, but got false.  (Could this error message be improved?  If so, please report an enhancement request to PyTorch.)
......

What could be the cause?
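The failing check lives in apex's FMHA extension: dprops->major == 8 && dprops->minor == 0 means the fused multi-head attention kernel only supports compute capability 8.0, i.e. A100. A V100 is sm_70, so the A100-only attention path has to come off the command line. A quick check plus a hedged adjustment; exactly which fusion flags must go beyond --unpad_fmha (the one driving the FMHAFun path in this traceback) is an assumption to verify:

# Confirm the device's compute capability from inside the container.
python -c "import torch; print(torch.cuda.get_device_capability(0))"
# A100 prints (8, 0); V100 prints (7, 0).

# On sm_70, drop the FMHA path from run_test.sh, e.g. replace the last line with:
#   --fp16 --fused_gelu_bias --dense_seq_output --unpad --exchange_padding
# and retest; --fused_mha may need the same treatment if its kernel is also sm_80-only.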

DataLoader crash when using FI_EFA_USE_DEVICE_RDMA=1

Our AWS p4d.24xlarge job passed on 08/24, and the throughput was 3511 samples/second.
We used two p4d.24xlarge instances with FI_PROVIDER="efa" and FI_EFA_USE_DEVICE_RDMA=1.

This test failed recently. The error message follows:

File "/workspace/bert/run_pretraining.py", line 1592, in
args, final_loss, train_time_raw = main()
File "/workspace/bert/run_pretraining.py", line 1344, in main
for step, batch in enumerate(train_dataloader):
File "/opt/conda/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 356, in iter
return self._get_iterator()
File "/opt/conda/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 302, in _get_iterator
return _MultiProcessingDataLoaderIter(self)
File "/opt/conda/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 941, in init
self._reset(loader, first_iter=True)
File "/opt/conda/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 972, in _reset
self._try_put_index()
File "/opt/conda/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 1206, in _try_put_index
index = self._next_index()
File "/opt/conda/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 509, in _next_index
return next(self._sampler_iter) # may raise StopIteration
File "/opt/conda/lib/python3.8/site-packages/torch/utils/data/sampler.py", line 226, in iter
for idx in self.sampler:
File "/opt/conda/lib/python3.8/site-packages/torch/utils/data/sampler.py", line 124, in iter
yield from torch.randperm(n, generator=generator).tolist()
File "/opt/conda/lib/python3.8/site-packages/torch/utils/data/_utils/signal_handling.py", line 66, in handler
_error_if_any_worker_fails()
RuntimeError: DataLoader worker (pid 827) is killed by signal: Segmentation fault.

When we test without FI_EFA_USE_DEVICE_RDMA=1, the test passes, but the throughput drops to 1673 samples/sec.

This is the Dockerfile we used:
https://github.com/aws-samples/aws-efa-nccl-baseami-pipeline/blob/master/nvidia-efa-docker_base/Dockerfile.base
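The symptom pattern (workers segfault only when GPUDirect RDMA is on) fits a fork-safety problem: PyTorch DataLoader workers are forked, and RDMA-registered memory is not fork-safe by default. A minimal sketch of the usual mitigations; FI_EFA_FORK_SAFE and RDMAV_FORK_SAFE are real libfabric/rdma-core knobs, but whether they cure this particular crash is an assumption to verify:

# Keep device RDMA enabled but tell the EFA provider to prepare for fork(),
# so forked DataLoader workers don't corrupt registered memory regions.
export FI_PROVIDER=efa
export FI_EFA_USE_DEVICE_RDMA=1
export FI_EFA_FORK_SAFE=1      # libfabric EFA provider fork support
export RDMAV_FORK_SAFE=1       # older rdma-core equivalent; harmless to set both

# As a bisect, running the loader in-process (no forked workers) separates
# the data pipeline from the RDMA path; check run_pretraining.py for the
# actual worker-count option, as the flag name varies by script.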
