
License: BSD 3-Clause "New" or "Revised" License


PyTorch Benchmarks

This is a collection of open source benchmarks used to evaluate PyTorch performance.

torchbenchmark/models contains copies of popular or exemplary workloads which have been modified to: (a) expose a standardized API for benchmark drivers, (b) optionally, enable backends such as torchinductor/torchscript, (c) contain a miniature version of train/test data and a dependency install script.
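
For orientation, here is a minimal sketch of driving that standardized API, assuming only the entry points described later in this README (list_models(), the Model constructor, and get_module()); it is not a complete benchmark driver.

from torchbenchmark import list_models

for ModelClass in list_models():
    benchmark = ModelClass(test="eval", device="cpu")  # or test="train", device="cuda"
    model, example_inputs = benchmark.get_module()
    model(*example_inputs)  # one forward pass as a smoke test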

Installation

The benchmark suite should be self-contained in terms of dependencies, except for the torch products, which are intended to be installed separately so that different torch versions can be benchmarked.

Using Pre-built Packages

We support Python 3.8+, and 3.11 is recommended. Conda is optional but suggested. To start with Python 3.11 in conda:

# Using your current conda environment:
conda install -y python=3.11

# Or, using a new conda environment:
conda create -n torchbenchmark python=3.11
conda activate torchbenchmark

If you are running NVIDIA GPU tests, we support both CUDA 11.8 and 12.1, and use CUDA 12.1 as default:

conda install -y -c pytorch magma-cuda121

Then install pytorch, torchvision, and torchaudio using conda:

conda install pytorch torchvision torchaudio pytorch-cuda=12.1 -c pytorch-nightly -c nvidia

Or use pip: (but don't mix and match pip and conda for the torch family of libs! - see notes below)

pip3 install --pre torch torchvision torchaudio --index-url https://download.pytorch.org/whl/nightly/cu121

Install the benchmark suite, which will recursively install dependencies for all the models. Currently, the repo is intended to be installed from the source tree.

git clone https://github.com/pytorch/benchmark
cd benchmark
python3 install.py

Install torchbench as a library

If you're interested in running torchbench as a library, you can:

python3 install.py
pip install git+https://github.com/pytorch/benchmark.git

or

python3 install.py
pip install . # add -e for an editable installation

The above lets you import and run TorchBench models directly, for example:

import torchbenchmark.models.densenet121
model, example_inputs = torchbenchmark.models.densenet121.Model(test="eval", device="cuda", batch_size=1).get_module()
model(*example_inputs)

Building From Source

Note that when building PyTorch from source, torchvision and torchaudio must also be built from source to make sure the C APIs match.

See detailed instructions to install torchvision here and torchaudio here. Make sure to enable CUDA (by USE_CUDA=1) if using CUDA. Then,

git clone https://github.com/pytorch/benchmark
cd benchmark
python3 install.py

Notes

  • Setup steps require network connectivity - make sure to enable a proxy if needed.
  • We suggest using the latest PyTorch nightly releases to run the benchmark. Stable versions are NOT tested or maintained.
  • torch, torchvision, and torchaudio must all be installed from the same build process. This means it isn't possible to mix conda torchvision with pip torch, or mix built-from-source torch with pip torchvision. It's important to match even the conda channel (nightly vs regular). This is due to the differences in the compilation process used by different packaging systems producing incompatible Python binary extensions.

Using a low-noise machine

Various sources of noise, such as interrupts, context switches, clock frequency scaling, etc. can all conspire to make benchmark results variable. It's important to understand the level of noise in your setup before drawing conclusions from benchmark data. While any machine can in principle be tuned up, the steps and end-results vary with OS, kernel, drivers, and hardware. To this end, torchbenchmark picks a favorite machine type it can support well, and provides utilities for automated tuning on that machine. In the future, we may support more machine types and would be happy for contributions here.

The currently supported machine type is an AWS g4dn.metal instance using Amazon Linux. This is one of the subsets of AWS instance types that supports processor state control, with documented tuning guides for Amazon Linux. Most if not all of these steps should be possible on Ubuntu but haven't been automated yet.

To tune your g4dn.metal Amazon Linux machine, run

sudo `which python` torchbenchmark/util/machine_config.py --configure

When running pytest (see below), the machine_config script is invoked to assert a proper configuration and to log config info into the output json. It is possible to pass --ignore_machine_config if you want to run pytest without tuning.

Running Model Benchmarks

There are multiple ways to run the model benchmarks.

test.py offers the simplest wrapper around the infrastructure for iterating through each model and installing and executing it.

test_bench.py is a pytest-benchmark script that leverages the same infrastructure but collects benchmark statistics and supports pytest filtering.

userbenchmark allows you to develop and run customized benchmarks.

For each model, the assumption is that the user already has the torch family of packages installed (torch, torchvision, torchaudio, ...); the remaining dependencies for the model are installed by the benchmark.

Using test.py

python3 test.py will execute the APIs for each model as a sanity check. For benchmarking, use test_bench.py. test.py is based on unittest and supports filtering via the CLI.

For instance, to run the BERT model on CPU for the train execution mode:

python3 test.py -k "test_BERT_pytorch_train_cpu"

Test names follow this pattern:

"test_" + <model_name> + "_" + {"train" | "eval" } + "_" + {"cpu" | "cuda"}

Using pytest-benchmark driver

pytest test_bench.py invokes the benchmark driver. See --help for a complete list of options.

Some useful options include:

  • --benchmark-autosave (or other save related flags) to get .json output
  • -k <filter expression> standard pytest filtering
  • --collect-only only shows which tests would run; useful to see what models there are or to debug your filter expression
  • --cpu_only if running on a local CPU machine and ignoring machine configuration checks

Examples of Benchmark Filters

  • -k "test_train[NAME-cuda]" for a particular flavor of a particular model
  • -k "(BERT and (not cuda))" for a more flexible approach to filtering

Note that test_bench.py will eventually be deprecated as the userbenchmark work evolves. Users are encouraged to explore and consider using userbenchmark.

Using userbenchmark

The userbenchmark allows you to develop your customized benchmarks with TorchBench models. Refer to the userbenchmark instructions to learn more about how you can create a new userbenchmark. You can then use the run_benchmark.py driver to drive the benchmark, e.g. python run_benchmark.py <benchmark_name>. Run python run_benchmark.py --help to find the available options.

Using run.py for simple debugging or profiling

Sometimes you may want to just run train or eval on a particular model, e.g. for debugging or profiling. Rather than relying on main implementations inside each model, run.py provides a lightweight CLI for this purpose, building on top of the standard BenchmarkModel API.

python3 run.py <model> [-d {cpu,cuda}] [-t {eval,train}] [--profile]

Note: <model> can be a full, exact name, or a partial string match.

Using torchbench models as a library

If you're interested in using torchbench as a suite of models you can test, the easiest way to integrate it into your code/ci/tests would be something like

import torch
import importlib 
import sys

# Assumes this file sits next to the benchmark checkout, i.e. this_file.py and benchmark/
sys.path.append("benchmark")
model_name = "torchbenchmark.models.stable_diffusion_text_encoder"  # replace with the model you're working on
module = importlib.import_module(model_name)

benchmark_cls = getattr(module, "Model", None)
benchmark = benchmark_cls(test="eval", device="cuda")  # test: "train" or "eval"; device: "cuda" or "cpu"

model, example = benchmark.get_module()
model(*example)

Nightly CI runs

Currently, the models run on nightly pytorch builds and push data to Meta's internal database. The Nightly CI publishes both V1 and V0 performance scores.

See Unidash (Meta-internal only)

Adding new models

See Adding Models.

benchmark's Issues

The test code, which tests the correctness of FastRnn, allows a small difference between the output of FastRnn and the output of cuDNN's RNN

Hi,

First, I want to thank you guys for sharing the code. I learned a lot from it. It gives me a good start for implementing my own GRU layer. :-)

One thing that I find very confusing is that the outputs from cuDNN's LSTM and FastRnn's LSTM are different. Based on the assertions in your test code, I guess it is a common case that I do not need to worry about too much. But I am curious what causes the difference.

Thanks for your time :-)

Cleanup

rename test.py?
move util scripts into utils subdir?

add 'name' API to each model, to avoid having to use the directory names and to promote a more readable 'short name'?

Add torchvision maskrcnn (enable jit, dynamic shapes)

The 'maskrcnn-benchmark' model has already been added to the benchmark, but it doesn't easily JIT, and we want dynamic shape models to go through PE/TE.

This model is known to jit compile, and should be added to the benchmark
https://github.com/pytorch/vision/blob/master/torchvision/models/detection/mask_rcnn.py
see this test for the exact configuration that is tested with JIT
https://github.com/pytorch/vision/blob/master/test/test_models.py#L68

ideally, reuse coco dataset artifacts that are checked in already from the other maskrcnn.

follow general instructions in ADDING_MODELS.md

Add PyTorch-UNet model to benchmark repo

We're building a suite of pytorch benchmarks by forking popular open-source models and modifying them to conform to a common API that facilitates a central tool installing, running, and collecting measurements.

Benchmark Repo: https://github.com/pytorch/benchmark
Instructions: https://github.com/pytorch/benchmark/blob/master/models/ADDING_MODELS.md
Source model: https://github.com/supriyar/Pytorch-UNet

The UNet model has already been forked and modified in a previous phase of this process, but the API (install.py, hubconf.py) had not yet been standardized. Copy the repo into benchmark.git, follow the latest instructions in the ADDING_MODELS readme to add the new API. Note, the 'install.sh' and 'run.sh' scripts added in the previous effort would serve as good hints on how to proceed.

Remove score.yml

The current torchbenchmark framework uses score.yml as the spec. Instead of maintaining a static file, we can generate the spec and thus eliminate the usage of score.yml. Below is the set of sub-tasks needed to accomplish this (a sketch of the spec generation follows the list):

  • Update the BenchmarkModel class to have domain and task class attributes
  • Remove score.yml and generate the spec by reading the data from the Model class
  • Maintain an explicit list of enabled (required) models which are used for the score, rather than using whatever models are found in 'list_models'
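
A hedged sketch of what generating the spec from model metadata could look like; the DOMAIN and TASK attribute names here are assumptions, not the actual class attributes:

from torchbenchmark import list_models

def generate_spec():
    spec = {}
    for ModelClass in list_models():
        short_name = ModelClass.__module__.split(".")[-1]
        spec[short_name] = {
            "domain": getattr(ModelClass, "DOMAIN", None),  # assumed attribute name
            "task": getattr(ModelClass, "TASK", None),      # assumed attribute name
        }
    return spec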

tree2tree doesn't run with PyTorch HEAD

After merging persist_rnn (which does work) with master pytorch/pytorch@909f317 tree2tree now fails with:

Traceback (most recent call last):
  File "tree2tree.py", line 683, in <module>
    train_until_fit(model, optimizer, data_iter, 1)
  File "tree2tree.py", line 613, in train_until_fit
    loss, dec_loss, seq_len = model(batch)
  File "/data/users/ezyang/pytorch/torch/nn/modules/module.py", line 206, in __call__
    result = self.forward(*input, **kwargs)
  File "tree2tree.py", line 26, in wrapper
    r = func(*args, **kwargs)
  File "tree2tree.py", line 133, in forward
    batch.trg, encoding)
  File "tree2tree.py", line 26, in wrapper
    r = func(*args, **kwargs)
  File "tree2tree.py", line 544, in __call__
    self.step(self.tensor_actions[-1], targets)
  File "tree2tree.py", line 26, in wrapper
    r = func(*args, **kwargs)
  File "tree2tree.py", line 593, in step
    self.attention_states, self.alphas, self.ixs)
  File "tree2tree.py", line 26, in wrapper
    r = func(*args, **kwargs)
  File "tree2tree.py", line 424, in step
    x, self.stack_rnn(x, top.hc), parent=top.first)
  File "/data/users/ezyang/pytorch/torch/nn/modules/module.py", line 206, in __call__
    result = self.forward(*input, **kwargs)
  File "/data/users/ezyang/pytorch/torch/nn/modules/rnn.py", line 513, in forward
    self.bias_ih, self.bias_hh,
  File "/data/users/ezyang/pytorch/torch/nn/_functions/rnn.py", line 24, in LSTMCell
    igates = F.linear(input, w_ih)
  File "/data/users/ezyang/pytorch/torch/nn/functional.py", line 526, in linear
    return _functions.linear.Linear.apply(input, weight)
  File "/data/users/ezyang/pytorch/torch/nn/_functions/linear.py", line 12, in forward
    output.addmm_(0, 1, input, weight.t())
RuntimeError: size mismatch at /home/ezyang/local/pytorch/torch/lib/THC/generic/THCTensorMathBlas.cu:243

Collect raw benchmark data in scuba rather than only summarized stats

  • Possibly add new scuba table for the raw data
  • may need to modify upload_scribe to handle new fields and new table/columns
  • pytest_benchmark already has a command line option to include the raw data in the .json file it outputs, but this needs to be enabled in the CI script that runs the benchmark.

License

Hey,

cool work here. I was wondering if you could add a license file to this repository?
I'd like to adopt a few code snippets, but this is currently impossible.

Thanks!

disable slow benchmarks

TODO: determine a long-term plan. Using this as a placeholder for a code comment pointer for now.

Audit existing models

First, create a checklist of important things for each model

Then go through each model and make sure they comply

enable running a model script directly

Right now, you can't python torchbenchmark/models/dlrm/__init__.py because of "ImportError: attempted relative import with no known parent package".

How can the main block in __init__.py be executed? If it's possible, we should write up the workaround in the README; otherwise, it would be nice to enable it.
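
A possible workaround in the meantime (a sketch, assuming the standardized Model API described in the README) is to import the package, which makes the relative imports resolve, and call the model's entry points directly instead of relying on the __main__ block in __init__.py:

# Import the package so relative imports have a parent package, then drive the
# model through the standardized API rather than executing __init__.py as a script.
import torchbenchmark.models.dlrm

benchmark = torchbenchmark.models.dlrm.Model(test="eval", device="cpu")
model, example_inputs = benchmark.get_module()
model(*example_inputs)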

Modify fastNLP to work with latest and older torchtext releases

The torchtext 0.8.0 release moved AG_NEWS from datasets.experimental to datasets. fastNLP should be able to run under both the new and old torchtext for the sake of running historical builds, so we can check the torchtext version in the fastNLP __init__.py and conditionally import the right name.
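
A rough sketch of that conditional import; the exact import paths are assumptions based on the description above and may need adjusting to the real torchtext layout:

import torchtext

# Pick the AG_NEWS import location based on the installed torchtext version.
_TORCHTEXT_VERSION = tuple(int(v) for v in torchtext.__version__.split(".")[:2])
if _TORCHTEXT_VERSION >= (0, 8):
    from torchtext.datasets import AG_NEWS          # new location (0.8.0+)
else:
    from torchtext.experimental.datasets import AG_NEWS  # assumed older location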

python install.py fails on Windows 10 Python 3.7.9

python install.py fails with OSError: [Errno 22] Invalid argument: 'M:\\AI\\benchmark\\myenvW\\lib\\site-packages\\wincertstore-0.2-py3.7.egg-info\\entry_points.txt'

while installing in a virtual environment (myenvW) created with conda 4.9.1 (miniconda).
wincertstore does not have an entry_points.txt, and the package no longer seems to be updated.

Complete traceback:

Traceback (most recent call last):
  File "M:\AI\benchmark\myenvW\lib\runpy.py", line 183, in _run_module_as_main
    mod_name, mod_spec, code = _get_module_details(mod_name, _Error)
  File "M:\AI\benchmark\myenvW\lib\runpy.py", line 142, in _get_module_details
    return _get_module_details(pkg_main_name, error)
  File "M:\AI\benchmark\myenvW\lib\runpy.py", line 109, in _get_module_details
    __import__(pkg_name)
  File "M:\AI\benchmark\myenvW\lib\site-packages\spacy\__init__.py", line 10, in <module>
    from thinc.neural.util import prefer_gpu, require_gpu
  File "M:\AI\benchmark\myenvW\lib\site-packages\thinc\__init__.py", line 8, in <module>
    from ._registry import registry
  File "M:\AI\benchmark\myenvW\lib\site-packages\thinc\_registry.py", line 1, in <module>
    import catalogue
  File "M:\AI\benchmark\myenvW\lib\site-packages\catalogue.py", line 18, in <module>
    AVAILABLE_ENTRY_POINTS = importlib_metadata.entry_points()
  File "M:\AI\benchmark\myenvW\lib\site-packages\importlib_metadata\__init__.py", line 596, in entry_points
    ordered = sorted(eps, key=by_group)
  File "M:\AI\benchmark\myenvW\lib\site-packages\importlib_metadata\__init__.py", line 594, in <genexpr>
    dist.entry_points for dist in distributions())
  File "M:\AI\benchmark\myenvW\lib\site-packages\importlib_metadata\__init__.py", line 289, in entry_points
    return EntryPoint._from_text(self.read_text('entry_points.txt'))
  File "M:\AI\benchmark\myenvW\lib\site-packages\importlib_metadata\__init__.py", line 545, in read_text
    return self._path.joinpath(filename).read_text(encoding='utf-8')
  File "M:\AI\benchmark\myenvW\lib\pathlib.py", line 1221, in read_text
    with self.open(mode='r', encoding=encoding, errors=errors) as f:
  File "M:\AI\benchmark\myenvW\lib\pathlib.py", line 1208, in open
    opener=self._opener)
  File "M:\AI\benchmark\myenvW\lib\pathlib.py", line 1063, in _opener
    return self._accessor.open(self, flags, mode)
OSError: [Errno 22] Invalid argument: 'M:\\AI\\benchmark\\myenvW\\lib\\site-packages\\wincertstore-0.2-py3.7.egg-info\\entry_points.txt'
Traceback (most recent call last):
  File "install.py", line 16, in <module>
    spacy_download('en')
  File "install.py", line 9, in spacy_download
    subprocess.check_call([sys.executable, '-m', 'spacy', 'download', language])
  File "M:\AI\benchmark\myenvW\lib\subprocess.py", line 363, in check_call
    raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command '['M:\\AI\\benchmark\\myenvW\\python.exe', '-m', 'spacy', 'download', 'en']' returned non-zero exit status 1.
Traceback (most recent call last):
  File "install.py", line 12, in <module>
    setup()
  File "M:\AI\benchmark\torchbenchmark\__init__.py", line 44, in setup
    _install_deps(model_path)
  File "M:\AI\benchmark\torchbenchmark\__init__.py", line 28, in _install_deps
    subprocess.check_call([sys.executable, install_file], cwd=model_path)
  File "M:\AI\benchmark\myenvW\lib\subprocess.py", line 363, in check_call
    raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command '['M:\\AI\\benchmark\\myenvW\\python.exe', 'install.py']' returned non-zero exit status 1.

TODO: benchmark models

Follow up tasks:

  • Add batching to LSTMVariants (some issues in a few layers with mismatched sizes)
  • Annotate all examples with tracer code, and verify correctness.

update unidash/scuba

  • add new models
  • update queries to move from hub to benchmark or include both in the same query

Add detectron2 models

Currently the benchmark includes maskrcnn-benchmark models: https://github.com/pytorch/benchmark/tree/master/torchbenchmark/models/maskrcnn_benchmark

However as the link above says, maskrcnn-benchmark is now deprecated in favor of detectron2. Detectron2 models are used widely in facebook production and research so we should add or switch to benchmark detectron2 models instead. The two projects are similar in their core models, so the switch shouldn't be hard.

I can follow up with a few most important models in detectron2 that should be benchmarked.

https://github.com/pytorch/benchmark/blob/master/torchbenchmark/models/ADDING_MODELS.md

Benchmarks leak files

I observed that when you run the benchmarks, many files are created in the current working directory. This is bad because:

  1. It may increase benchmark variance because disk access can have less predictable performance
  2. It is subjectively ugly/messy

Reproduce with:

rm -f Video_data_train_processed.csv labels.json results.png results.txt train_batch0.jpg checkpoints/*
pytest test.py
ls Video_data_train_processed.csv labels.json results.png results.txt train_batch0.jpg checkpoints/*

Actual output of ls

labels.json
results.png
results.txt
train_batch0.jpg
Video_data_train_processed.csv

checkpoints/horse2zebra:
loss_log.txt
train_opt.txt
web

checkpoints/horse2zebra_pretrained:
test_opt.txt

Desired output of ls: empty with many "No such file or directory" errors

pytorch nightly installation version conflicts breaking CI

https://app.circleci.com/pipelines/github/pytorch/benchmark/522/workflows/d359209d-6d05-4175-be29-7630a8b32f19/jobs/550

The following specifications were found to be incompatible with each other:

Output in format: Requested package -> Available versions

Package numpy conflicts for:
torchvision -> numpy[version='>1.11|>=1.11']
torchvision -> pytorch=1.3.1 -> numpy[version='>=1.11.3,<2.0a0|>=1.14.6,<2.0a0|>=1.9.3,<2.0a0|>=1.9']

Package ca-certificates conflicts for:
torchvision -> python[version='>=2.7,<2.8.0a0'] -> ca-certificates
pytorch -> python[version='>=2.7,<2.8.0a0'] -> ca-certificates
python=3.7 -> openssl[version='>=1.1.1g,<1.1.2a'] -> ca-certificates

Package cudatoolkit conflicts for:
torchvision -> pytorch=1.2.0 -> cudatoolkit[version='10.0.*|9.2.*|>=10.1.243,<10.2.0a0|>=8.0,<8.1.0a0|9.0.*|8.0.*|7.5.*']
torchvision -> cudatoolkit[version='>=10.0.130,<10.1.0a0|>=10.1,<10.2|>=10.2,<10.3|>=11.0,<11.1|>=9.2,<9.3|>=9.2,<9.3.0a0|>=9.0,<9.1.0a0']

Package pytorch conflicts for:
torchvision -> pytorch[version='1.1.*|1.2.0.*|1.3.1.*|1.8.0.dev20201110|>=0.4|>=0.3']
torchtext -> pytorch==1.8.0.dev20201110

Package numpy-base conflicts for:
torchvision -> numpy[version='>=1.11'] -> numpy-base[version='1.11.3|1.11.3|1.11.3|1.11.3|1.11.3|1.11.3|1.11.3|1.11.3|1.11.3|1.11.3|1.11.3|1.11.3|1.11.3|1.11.3|1.11.3|1.11.3|1.11.3|1.11.3|1.11.3|1.11.3|1.11.3|1.11.3|1.11.3|1.11.3|1.11.3|1.11.3|1.11.3|1.11.3|1.11.3|1.11.3|1.11.3|1.11.3|1.11.3|1.11.3|1.11.3|1.11.3|1.11.3|1.11.3|1.11.3|1.11.3|1.11.3|1.11.3|1.11.3|1.11.3|1.11.3|1.11.3|1.11.3|1.11.3|1.11.3|1.11.3|1.11.3|1.11.3|1.11.3|1.11.3|1.11.3|1.11.3|1.14.3|1.14.3|1.14.3|1.14.3|1.14.3|1.14.3|1.14.4|1.14.4|1.14.4|1.14.4|1.14.4|1.14.4|1.14.5|1.14.5|1.14.5|1.14.5|1.14.5|1.14.5|1.14.5|1.14.5|1.14.5|1.14.5|1.14.5|1.14.5|1.14.5|1.14.5|1.14.5|1.14.5|1.14.5|1.14.5|1.14.5|1.14.5|1.14.5|1.14.5|1.14.5|1.14.5|1.14.5|1.14.5|1.14.5|1.14.5|1.14.5|1.14.5|1.14.5|1.14.6|1.14.6|1.14.6|1.14.6|1.14.6|1.14.6|1.14.6|1.14.6|1.14.6|1.14.6|1.14.6|1.14.6|1.14.6|1.14.6|1.14.6|1.14.6|1.15.0|1.15.0|1.15.0|1.15.0|1.15.0|1.15.0|1.15.0|1.15.0|1.15.1|1.15.1|1.15.1|1.15.1|1.15.1|1.15.1|1.15.1|1.15.1|1.15.1|1.15.1|1.15.1|1.15.1|1.15.2|1.15.2|1.15.2|1.15.2|1.15.2|1.15.2|1.15.2|1.15.2|1.15.2|1.15.2|1.15.2|1.15.2|1.15.2|1.15.2|1.15.3|1.15.3|1.15.3|1.15.3|1.15.3|1.15.3|1.15.4|1.15.4|1.15.4|1.15.4|1.15.4|1.15.4|1.15.4|1.15.4|1.15.4|1.16.0|1.16.0|1.16.0|1.16.0|1.16.0|1.16.0|1.16.0|1.16.0|1.16.0|1.16.0|1.16.0|1.16.0|1.16.1|1.16.1|1.16.1|1.16.1|1.16.1|1.16.1|1.16.1|1.16.1|1.16.1|1.16.1|1.16.1|1.16.1|1.16.2|1.16.2|1.16.2|1.16.2|1.16.2|1.16.2|1.16.3|1.16.3|1.16.3|1.16.3|1.16.3|1.16.3|1.16.4|1.16.4|1.16.4|1.16.4|1.16.4|1.16.4|1.16.5|1.16.5|1.16.5|1.16.5|1.16.5|1.16.5|1.16.6|1.16.6|1.16.6|1.16.6|1.16.6|1.16.6|1.16.6|1.16.6|1.17.2.|1.17.3.|1.17.4.|1.18.1.|1.18.5.|1.19.1|1.19.1|1.19.1|1.19.1|1.19.1|1.19.1|1.19.2|1.17.0|1.17.0|1.17.0|1.17.0|1.9.3|1.9.3|1.9.3|1.9.3|1.9.3|1.9.3|1.9.3|1.9.3|1.9.3|1.9.3|1.9.3|1.9.3|1.9.3|1.9.3|>=1.9.3,<2.0a0',build='py36h2b20989_6|py37h2b20989_6|py27hdbf6ddf_6|py27hdbf6ddf_7|py27h2b20989_7|py36hdbf6ddf_7|py36h2b20989_7|py35h2b20989_7|py37h2b20989_7|py36h2f8d375_0|py36hde5b4d6_0|py27h2b20989_7|py36hdbf6ddf_7|py37hdbf6ddf_7|py27h2b20989_8|py35h2b20989_8|py35h7cdd4dd_9|py37h3dfced4_9|py36h3dfced4_9|py27h3dfced4_9|py37h81de0dd_9|py27h74e8950_9|py36h74e8950_9|py27h81de0dd_9|py35h81de0dd_9|py36h74e8950_10|py27h81de0dd_10|py37h81de0dd_10|py35h81de0dd_10|py27h2f8d375_10|py37h2f8d375_11|py27hde5b4d6_11|py37h2f8d375_12|py36h2f8d375_12|py36hde5b4d6_12|py38h2f8d375_12|py27h0ea5e3f_1|py35h0ea5e3f_1|py36h9be14a7_1|py35h9be14a7_1|py27h2b20989_0|py36h2b20989_0|py35h2b20989_0|py35hdbf6ddf_0|py27h2b20989_0|py36h2b20989_1|py37h2b20989_1|py27hdbf6ddf_1|py36hdbf6ddf_1|py27h2b20989_1|py37h2b20989_2|py37hdbf6ddf_2|py36h2b20989_3|py37hdbf6ddf_3|py36hdbf6ddf_3|py27h2b20989_3|py37hdbf6ddf_4|py36hdbf6ddf_4|py35hdbf6ddf_4|py36h81de0dd_4|py38h2f8d375_4|py38hde5b4d6_4|py27h2f8d375_5|py37hde5b4d6_5|py36hde5b4d6_5|py27h7cdd4dd_0|py37h7cdd4dd_0|py27h3dfced4_0|py36h3dfced4_0|py37h3dfced4_0|py27h81de0dd_0|py36h2f8d375_0|py35h81de0dd_0|py36h2f8d375_0|py27h2f8d375_1|py27h81de0dd_1|py36h81de0dd_1|py37h81de0dd_1|py27h2f8d375_0|py36h2f8d375_0|py27h81de0dd_0|py36h2f8d375_0|py27h2f8d375_0|py36hde5b4d6_0|py36h2f8d375_0|py36hde5b4d6_0|py37h2f8d375_1|py27h2f8d375_1|py37hde5b4d6_1|py27hde5b4d6_1|py36h2f8d375_0|py36hde5b4d6_0|py27hde5b4d6_0|py27h2f8d375_1|py36h2f8d375_1|py37hde5b4d6_1|py36hde5b4d6_1|py27hde5b4d6_1|py36h2f8d375_0|py36hde5b4d6_0|py36h2f8d375_0|py37hde5b4d6_0|py36hde5b4d6_0|py36h2f8d375_0|py36hde5b4d6_0|py36h2f8d375_0|py37hde5b4d6_0|py36hde5b4d6_0|py38h2f8d375_0|py38hde5b4d6_0|py27hde5b4d6_0|py38h75fe3a5_0|py38hfa32c7d_0|py36hfa32c7d_0|py37h75f
e3a5_0|py37h75fe3a5_0|py38hfa32c7d_0|py36hfa32c7d_0|py36h75fe3a5_0|py38h75fe3a5_0|py37hfa32c7d_0|py36h75fe3a5_0|py37hfa32c7d_0|py27h2f8d375_0|py37h2f8d375_0|py37hde5b4d6_0|py27hde5b4d6_0|py37h2f8d375_0|py27h2f8d375_0|py27hde5b4d6_0|py37hde5b4d6_0|py37h2f8d375_0|py36h2f8d375_0|py27h2f8d375_0|py36hde5b4d6_0|py27hde5b4d6_0|py37h2f8d375_0|py27h2f8d375_0|py27hde5b4d6_0|py37hde5b4d6_0|py27h2f8d375_0|py37h2f8d375_0|py37h2f8d375_1|py37hde5b4d6_0|py27h2f8d375_0|py37h2f8d375_0|py36hde5b4d6_1|py36h2f8d375_1|py27hde5b4d6_0|py37hde5b4d6_0|py37h2f8d375_0|py27h2f8d375_0|py37hde5b4d6_0|py27hde5b4d6_0|py27h81de0dd_0|py36h81de0dd_0|py37h81de0dd_0|py37h2f8d375_0|py37h81de0dd_0|py36h81de0dd_0|py37h2f8d375_0|py37h2f8d375_1|py36h2f8d375_1|py36h81de0dd_0|py35h2f8d375_0|py27h81de0dd_0|py37h81de0dd_0|py27h2f8d375_0|py37h2f8d375_0|py37h2f8d375_0|py27h2f8d375_0|py35h2f8d375_0|py35h81de0dd_0|py37h81de0dd_0|py36h81de0dd_0|py37h74e8950_0|py36h74e8950_0|py27h74e8950_0|py35h74e8950_0|py35h3dfced4_0|py35h7cdd4dd_0|py36h7cdd4dd_0|py27hde5b4d6_5|py37h2f8d375_5|py36h2f8d375_5|py37h81de0dd_4|py35h81de0dd_4|py27h2f8d375_4|py36h2f8d375_4|py27h81de0dd_4|py35h2f8d375_4|py37h2f8d375_4|py35h2b20989_4|py36h2b20989_4|py37h2b20989_4|py27hdbf6ddf_4|py27h2b20989_4|py27hdbf6ddf_3|py37h2b20989_3|py36hdbf6ddf_2|py27hdbf6ddf_2|py36h2b20989_2|py27h2b20989_2|py37hdbf6ddf_1|py35hdbf6ddf_0|py27hdbf6ddf_0|py36hdbf6ddf_0|py36h2b20989_0|py36hdbf6ddf_0|py27hdbf6ddf_0|py27h9be14a7_1|py36h0ea5e3f_1|py38hde5b4d6_12|py37hde5b4d6_12|py27hde5b4d6_12|py27h2f8d375_12|py37hde5b4d6_11|py36hde5b4d6_11|py36h2f8d375_11|py27h2f8d375_11|py35h2f8d375_10|py36h2f8d375_10|py37h2f8d375_10|py36h81de0dd_10|py27h74e8950_10|py35h74e8950_10|py37h74e8950_10|py37h74e8950_9|py35h74e8950_9|py36h81de0dd_9|py35h3dfced4_9|py37h7cdd4dd_9|py27h7cdd4dd_9|py36h7cdd4dd_9|py35hdbf6ddf_8|py27hdbf6ddf_8|py37h2b20989_8|py37hdbf6ddf_8|py36h2b20989_8|py36hdbf6ddf_8|py27hdbf6ddf_7|py36h2b20989_7|py37h2b20989_7|py37hde5b4d6_0|py37h2f8d375_0|py37hdbf6ddf_7|py35hdbf6ddf_7|py36hdbf6ddf_6|py37hdbf6ddf_6|py27h2b20989_6']
pytorch -> numpy[version='>=1.11'] -> numpy-base[version='1.11.3|1.11.3|1.11.3|1.11.3|1.11.3|1.11.3|1.11.3|1.11.3|1.11.3|1.11.3|1.11.3|1.11.3|1.11.3|1.11.3|1.11.3|1.11.3|1.11.3|1.11.3|1.11.3|1.11.3|1.11.3|1.11.3|1.11.3|1.11.3|1.11.3|1.11.3|1.11.3|1.11.3|1.11.3|1.11.3|1.11.3|1.11.3|1.11.3|1.11.3|1.11.3|1.11.3|1.11.3|1.11.3|1.11.3|1.11.3|1.11.3|1.11.3|1.11.3|1.11.3|1.11.3|1.11.3|1.11.3|1.11.3|1.11.3|1.11.3|1.11.3|1.11.3|1.11.3|1.11.3|1.11.3|1.11.3|1.14.3|1.14.3|1.14.3|1.14.3|1.14.3|1.14.3|1.14.4|1.14.4|1.14.4|1.14.4|1.14.4|1.14.4|1.14.5|1.14.5|1.14.5|1.14.5|1.14.5|1.14.5|1.14.5|1.14.5|1.14.5|1.14.5|1.14.5|1.14.5|1.14.5|1.14.5|1.14.5|1.14.5|1.14.5|1.14.5|1.14.5|1.14.5|1.14.5|1.14.5|1.14.5|1.14.5|1.14.5|1.14.5|1.14.5|1.14.5|1.14.5|1.14.5|1.14.5|1.14.6|1.14.6|1.14.6|1.14.6|1.14.6|1.14.6|1.14.6|1.14.6|1.14.6|1.14.6|1.14.6|1.14.6|1.14.6|1.14.6|1.14.6|1.14.6|1.15.0|1.15.0|1.15.0|1.15.0|1.15.0|1.15.0|1.15.0|1.15.0|1.15.1|1.15.1|1.15.1|1.15.1|1.15.1|1.15.1|1.15.1|1.15.1|1.15.1|1.15.1|1.15.1|1.15.1|1.15.2|1.15.2|1.15.2|1.15.2|1.15.2|1.15.2|1.15.2|1.15.2|1.15.2|1.15.2|1.15.2|1.15.2|1.15.2|1.15.2|1.15.3|1.15.3|1.15.3|1.15.3|1.15.3|1.15.3|1.15.4|1.15.4|1.15.4|1.15.4|1.15.4|1.15.4|1.15.4|1.15.4|1.15.4|1.16.0|1.16.0|1.16.0|1.16.0|1.16.0|1.16.0|1.16.0|1.16.0|1.16.0|1.16.0|1.16.0|1.16.0|1.16.1|1.16.1|1.16.1|1.16.1|1.16.1|1.16.1|1.16.1|1.16.1|1.16.1|1.16.1|1.16.1|1.16.1|1.16.2|1.16.2|1.16.2|1.16.2|1.16.2|1.16.2|1.16.3|1.16.3|1.16.3|1.16.3|1.16.3|1.16.3|1.16.4|1.16.4|1.16.4|1.16.4|1.16.4|1.16.4|1.16.5|1.16.5|1.16.5|1.16.5|1.16.5|1.16.5|1.16.6|1.16.6|1.16.6|1.16.6|1.16.6|1.16.6|1.16.6|1.16.6|1.17.2.
|1.17.3.|1.17.4.|1.18.1.|1.18.5.|1.19.1|1.19.1|1.19.1|1.19.1|1.19.1|1.19.1|1.19.2|1.17.0|1.17.0|1.17.0|1.17.0|1.9.3|1.9.3|1.9.3|1.9.3|1.9.3|1.9.3|1.9.3|1.9.3|1.9.3|1.9.3|1.9.3|1.9.3|1.9.3|1.9.3|>=1.9.3,<2.0a0',build='py36h2b20989_6|py37h2b20989_6|py27hdbf6ddf_6|py27hdbf6ddf_7|py27h2b20989_7|py36hdbf6ddf_7|py36h2b20989_7|py35h2b20989_7|py37h2b20989_7|py36h2f8d375_0|py36hde5b4d6_0|py27h2b20989_7|py36hdbf6ddf_7|py37hdbf6ddf_7|py27h2b20989_8|py35h2b20989_8|py35h7cdd4dd_9|py37h3dfced4_9|py36h3dfced4_9|py27h3dfced4_9|py37h81de0dd_9|py27h74e8950_9|py36h74e8950_9|py27h81de0dd_9|py35h81de0dd_9|py36h74e8950_10|py27h81de0dd_10|py37h81de0dd_10|py35h81de0dd_10|py27h2f8d375_10|py37h2f8d375_11|py27hde5b4d6_11|py37h2f8d375_12|py36h2f8d375_12|py36hde5b4d6_12|py38h2f8d375_12|py27h0ea5e3f_1|py35h0ea5e3f_1|py36h9be14a7_1|py35h9be14a7_1|py27h2b20989_0|py36h2b20989_0|py35h2b20989_0|py35hdbf6ddf_0|py27h2b20989_0|py36h2b20989_1|py37h2b20989_1|py27hdbf6ddf_1|py36hdbf6ddf_1|py27h2b20989_1|py37h2b20989_2|py37hdbf6ddf_2|py36h2b20989_3|py37hdbf6ddf_3|py36hdbf6ddf_3|py27h2b20989_3|py37hdbf6ddf_4|py36hdbf6ddf_4|py35hdbf6ddf_4|py36h81de0dd_4|py38h2f8d375_4|py38hde5b4d6_4|py27h2f8d375_5|py37hde5b4d6_5|py36hde5b4d6_5|py27h7cdd4dd_0|py37h7cdd4dd_0|py27h3dfced4_0|py36h3dfced4_0|py37h3dfced4_0|py27h81de0dd_0|py36h2f8d375_0|py35h81de0dd_0|py36h2f8d375_0|py27h2f8d375_1|py27h81de0dd_1|py36h81de0dd_1|py37h81de0dd_1|py27h2f8d375_0|py36h2f8d375_0|py27h81de0dd_0|py36h2f8d375_0|py27h2f8d375_0|py36hde5b4d6_0|py36h2f8d375_0|py36hde5b4d6_0|py37h2f8d375_1|py27h2f8d375_1|py37hde5b4d6_1|py27hde5b4d6_1|py36h2f8d375_0|py36hde5b4d6_0|py27hde5b4d6_0|py27h2f8d375_1|py36h2f8d375_1|py37hde5b4d6_1|py36hde5b4d6_1|py27hde5b4d6_1|py36h2f8d375_0|py36hde5b4d6_0|py36h2f8d375_0|py37hde5b4d6_0|py36hde5b4d6_0|py36h2f8d375_0|py36hde5b4d6_0|py36h2f8d375_0|py37hde5b4d6_0|py36hde5b4d6_0|py38h2f8d375_0|py38hde5b4d6_0|py27hde5b4d6_0|py38h75fe3a5_0|py38hfa32c7d_0|py36hfa32c7d_0|py37h75fe3a5_0|py37h75fe3a5_0|py38hfa32c7d_0|py36hfa32c7d_0|py36h75fe3a5_0|py38h75fe3a5_0|py37hfa32c7d_0|py36h75fe3a5_0|py37hfa32c7d_0|py27h2f8d375_0|py37h2f8d375_0|py37hde5b4d6_0|py27hde5b4d6_0|py37h2f8d375_0|py27h2f8d375_0|py27hde5b4d6_0|py37hde5b4d6_0|py37h2f8d375_0|py36h2f8d375_0|py27h2f8d375_0|py36hde5b4d6_0|py27hde5b4d6_0|py37h2f8d375_0|py27h2f8d375_0|py27hde5b4d6_0|py37hde5b4d6_0|py27h2f8d375_0|py37h2f8d375_0|py37h2f8d375_1|py37hde5b4d6_0|py27h2f8d375_0|py37h2f8d375_0|py36hde5b4d6_1|py36h2f8d375_1|py27hde5b4d6_0|py37hde5b4d6_0|py37h2f8d375_0|py27h2f8d375_0|py37hde5b4d6_0|py27hde5b4d6_0|py27h81de0dd_0|py36h81de0dd_0|py37h81de0dd_0|py37h2f8d375_0|py37h81de0dd_0|py36h81de0dd_0|py37h2f8d375_0|py37h2f8d375_1|py36h2f8d375_1|py36h81de0dd_0|py35h2f8d375_0|py27h81de0dd_0|py37h81de0dd_0|py27h2f8d375_0|py37h2f8d375_0|py37h2f8d375_0|py27h2f8d375_0|py35h2f8d375_0|py35h81de0dd_0|py37h81de0dd_0|py36h81de0dd_0|py37h74e8950_0|py36h74e8950_0|py27h74e8950_0|py35h74e8950_0|py35h3dfced4_0|py35h7cdd4dd_0|py36h7cdd4dd_0|py27hde5b4d6_5|py37h2f8d375_5|py36h2f8d375_5|py37h81de0dd_4|py35h81de0dd_4|py27h2f8d375_4|py36h2f8d375_4|py27h81de0dd_4|py35h2f8d375_4|py37h2f8d375_4|py35h2b20989_4|py36h2b20989_4|py37h2b20989_4|py27hdbf6ddf_4|py27h2b20989_4|py27hdbf6ddf_3|py37h2b20989_3|py36hdbf6ddf_2|py27hdbf6ddf_2|py36h2b20989_2|py27h2b20989_2|py37hdbf6ddf_1|py35hdbf6ddf_0|py27hdbf6ddf_0|py36hdbf6ddf_0|py36h2b20989_0|py36hdbf6ddf_0|py27hdbf6ddf_0|py27h9be14a7_1|py36h0ea5e3f_1|py38hde5b4d6_12|py37hde5b4d6_12|py27hde5b4d6_12|py27h2f8d375_12|py37hde5b4d6_11|py36hde5b4d6_11|py36h2f8d375_11|py27h2f8d375_11|py35h2f8d375_10|py3
6h2f8d375_10|py37h2f8d375_10|py36h81de0dd_10|py27h74e8950_10|py35h74e8950_10|py37h74e8950_10|py37h74e8950_9|py35h74e8950_9|py36h81de0dd_9|py35h3dfced4_9|py37h7cdd4dd_9|py27h7cdd4dd_9|py36h7cdd4dd_9|py35hdbf6ddf_8|py27hdbf6ddf_8|py37h2b20989_8|py37hdbf6ddf_8|py36h2b20989_8|py36hdbf6ddf_8|py27hdbf6ddf_7|py36h2b20989_7|py37h2b20989_7|py37hde5b4d6_0|py37h2f8d375_0|py37hdbf6ddf_7|py35hdbf6ddf_7|py36hdbf6ddf_6|py37hdbf6ddf_6|py27h2b20989_6']

script_lnlstm crashes when used bi-directionally

Dear PyTorch team,
Recently I tried out these versions of LSTMs with layer normalization, which I found through the PyTorch forums. Using script_lnlstm with layer normalization, however, causes the program to crash once loss.backward is called:

===
Traceback (most recent call last):
  File "mydir/custom_lstms.py", line 508, in <module>
    main()
  File "mydir/custom_lstms.py", line 504, in main
    test_script_stacked_lnlstm_bidirectional(5, 2, 3, 7, 4)
  File "mydir/custom_lstms.py", line 494, in test_script_stacked_lnlstm_bidirectional
    loss.backward()
  File "/mydir_2/anaconda3/lib/python3.7/site-packages/torch/tensor.py", line 150, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph)
  File "/mydir_2/anaconda3/lib/python3.7/site-packages/torch/autograd/__init__.py", line 99, in backward
    allow_unreachable=True) # allow_unreachable flag
RuntimeError: dim() called on undefined Tensor
The above operation failed in interpreter, with the following stack trace:
at :138:28
        def backward(grad_output):
            grad_tensors = torch.unbind(grad_output, dim)
            return grad_tensors, None

        return torch.stack(tensors, dim), backward

    def unbind(self,
               dim: int=0):
        def backward(grad_outputs: List[Tensor]):
            grad_self = torch.stack(grad_outputs, dim)
                        ~~~~~~~~~~~ <--- HERE
            return grad_self, None

        return torch.unbind(self, dim), backward

    def cat(tensors: List[Tensor],
            dim: int=0):
        size = len(tensors)
        split_sizes = [0] * size
        for i in range(size):

=====
The following test function reproduces this error:

def test_script_stacked_lnlstm_bidirectional(seq_len, batch, input_size, hidden_size,
                                             num_layers):
    inp = torch.randn(seq_len, batch, input_size)
    states = [[LSTMState(torch.randn(batch, hidden_size),
                         torch.randn(batch, hidden_size)),
               LSTMState(torch.randn(batch, hidden_size),
                         torch.randn(batch, hidden_size))]
              for _ in range(num_layers)]
    print("inp.size(): " + str(inp.size()))
    print("states: " + str(states))
    rnn = script_lnlstm(input_size, hidden_size, num_layers,
                        bidirectional=True)

    # just a smoke test
    out, out_state = rnn(inp, states)

    # This additional code, adding a loss function and using it
    # to compute the loss and then calling the backward function,
    # causes the program to crash
    loss_function = torch.nn.L1Loss()
    out_desired = torch.ones_like(out)
    loss = loss_function(out, out_desired)
    loss.backward()

======
I also had to make a fix to the "reverse" function to even get to this point:

def reverse(lst):
    # type: (List[Tensor]) -> List[Tensor]
    # print("len(lst): " + str(len(lst)))
    # for element in lst:
    #     print("element.size(): " + str(element.size()))

    # See: https://github.com/pytorch/pytorch/issues/27543
    # return lst[::-1]  # This fails with bidirectional LSTM
    # Alternative implementation
    copy_list = lst.copy()
    copy_list.reverse()
    return copy_list
    # lst.reverse()
    # return lst

=====
The problem only occurs when bidirectional=True is set; without it, everything works. Any ideas how to fix this?

Add PyText model to benchmark repo

We're building a suite of pytorch benchmarks by forking popular open-source models and modifying them to conform to a common API that facilitates a central tool installing, running, and collecting measurements.

Benchmark Repo: https://github.com/pytorch/benchmark
Instructions: https://github.com/pytorch/benchmark/blob/master/torchbenchmark/models/ADDING_MODELS.md
Source model: https://github.com/facebookresearch/pytext

After finishing install.py and __init__.py, prep a GitHub PR for review.

Simple PytorchBenchmark script gives CUDA forked subprocess error

When I try to run a simple Pytorch Benchmark test I get this error:

Cannot re-initialize CUDA in forked subprocess. To use CUDA with multiprocessing, you must use the 'spawn' start method
Cannot re-initialize CUDA in forked subprocess. To use CUDA with multiprocessing, you must use the 'spawn' start method
Traceback (most recent call last):
  File "/home/cbarkhof/code-thesis/Experimentation/Benchmarking/benchmarking-models-claartje.py", line 31, in <module>
    benchmark.run()
  File "/home/cbarkhof/.local/lib/python3.6/site-packages/transformers/benchmark/benchmark_utils.py", line 674, in run
    memory, inference_summary = self.inference_memory(model_name, batch_size, sequence_length)
ValueError: too many values to unpack (expected 2)

The script is as simple as this:

from transformers import PyTorchBenchmark, PyTorchBenchmarkArguments

benchmark_args = PyTorchBenchmarkArguments(models=["bert-base-uncased"],
                                           batch_sizes=[8],
                                           sequence_lengths=[8, 32, 128, 512],
                                           save_to_csv=True,
                                           log_filename='log',
                                           env_info_csv_file='env_info')

benchmark = PyTorchBenchmark(benchmark_args)
benchmark.run()

nits for running benchmark models locally

moved from original issue in pytorch/hub#148 on behalf of @zdevito

I am working on python packaging for PyTorch and just used the benchmark models to verify that the packaging approach would work and not get caught up in the complexity of existing models. I used the ability to loop through the models to get a handle on the torch.nn.Module for each one, saved it in the package, and reloaded it. It illustrated a lot of shortcomings of my initial code and I was able to quickly try fixes and see how they would work. Pretty cool! I think the benchmark suite is going to be really useful for this type of design question in addition to purely perf improvements. Thanks for helping to put it together.

As part of the process of using the benchmarks, I ran into a few nits in some of the benchmarks that only got uncovered when trying to use things locally.

Couldn't easily work around
Background-Matting - does not work on my local machine

  • has hard-coded circleci paths,
  • expects local directory to be a specific value

tacotron2

  • requires a GPU even if you ask for a cpu model, because it calls .cuda in load_model()

Require workarounds
Overall

  • the use of sys.path modifications to load files means that error messages are confusing:

e.g. (File "hubconf.py", line 74, in init, which hubconf.py is that?). Would be to treat models/ as part of the path and to load the submodules from there.
BERT-pytorch

  • expects local directory to be a specific value

attention-is-all-you-need-pytorch

  • expects local directory to be a specific value

fastNLP

  • expects local directory to be a specific value

demucs

  • get_module does not return a torch.nn.Module (returns a lambda)
  • doesn't do anything with the jit flag (should throw if it is not supported)
  • puts ScriptModule annotation on model, but doesn't actually script the model

moco

  • default device is set to 'cuda' but the runbook specifies the default device is 'cpu', which causes the model to fail in an unexpected way when CUDA is not installed

unify the approach to train() and eval() implementation

Some models implement this by preparing data, calling .cuda(), etc. in __init__ and doing the bare minimum 'inner loop' in train()/eval() - this plays nicely with the benchmark setup.

Other models may use vanilla 'training_loop' functions from the original model code and call them from train() or eval(), which means we could be benchmarking more overhead than we want.

Audit the existing models, identify which ones (if any) are problematic, and address them.
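
For reference, a hedged sketch of the pattern that plays nicely with the benchmark setup; the class and attribute names here are illustrative, not the actual TorchBench base class:

import torch

class ExampleBenchmark:
    def __init__(self, device="cpu"):
        # All setup happens here: data prep, device placement, optimizer creation.
        self.model = torch.nn.Linear(64, 10).to(device)
        self.example_inputs = (torch.randn(8, 64, device=device),)
        self.optimizer = torch.optim.SGD(self.model.parameters(), lr=0.01)

    def train(self):
        # Only the bare inner loop lives here, so that is all the benchmark times.
        self.optimizer.zero_grad()
        loss = self.model(*self.example_inputs).sum()
        loss.backward()
        self.optimizer.step()

    def eval(self):
        with torch.no_grad():
            return self.model(*self.example_inputs)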

Https proxy comment in test.py

I am trying to run the benchmarks, but they hang when running python test.py. The script does not run and gets stuck somewhere. I discovered this comment in test.py:

Make sure to enable an https proxy if necessary, or the setup steps may hang.

What does that mean? Which script needs the proxy, and for what?

Fix build and installation error due to maskrcnn benchmark

  • python test.py fails to run due to a missing Model attribute from maskrcnn_benchmark
  • python install.py also fails due to maskrcnn_benchmark

Instead of aborting the entire script, this fix will print out a WARNING message and skip the benchmark, so that the rest of the benchmarks may run.

Warning: <module 'torchbenchmark.models.maskrcnn_benchmark' (namespace)> does not define attribute Model, skip it

In addition, this fix improves the error message for python install.py --continue_on_fail. After the fix, the error message is:

Warning: some benchmarks were not installed due to failure

Improvements for benchmark score

This is more like an epic to track feedback relative to the first iteration of the benchmark score. We should file specific tasks if there are actionable chunks we intend to do.

Top Level Metrics

  • what should be produced besides a single score?
  • how can we compare e.g. cuda vs cpu, or jit vs eager?
    • can we highlight which specific tests or categories are responsible for most of the score delta (relative to norm, or to another dataset)?

Weight Hierarchy

  • overly complicated, not easy to understand or modify

Implementation

  • multiple scripts and input/output files are clunky. Can we wrap a score configuration up inside a named function (e.g. compute_score_v1(input_timings) -> value)?

Numerics

  • we should make sure weights are positive
  • and ensure all of the (weight, norm, measurement) values are in a numerically stable range, e.g. enforce a minimum epsilon (see the sketch after this list)
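
A hedged sketch of what such a single scoring function could look like (a weighted geometric mean of speedups over a reference); the weighting scheme, epsilon value, and scaling factor are assumptions:

import math

EPS = 1e-6  # assumed minimum epsilon to keep values in a numerically stable range

def compute_score_v1(timings, norms, weights):
    # timings/norms/weights: dicts keyed by benchmark name.
    total_weight = sum(max(w, EPS) for w in weights.values())  # keep weights positive
    log_sum = 0.0
    for name, t in timings.items():
        w = max(weights[name], EPS) / total_weight
        speedup = max(norms[name], EPS) / max(t, EPS)
        log_sum += w * math.log(speedup)
    return 1000.0 * math.exp(log_sum)  # weighted geometric mean, scaled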

Other

  • names used in data and score are 'abbreviated' ones
  • some of the metadata (e.g. model type, category, etc.) could be moved into the Model class (in the model's __init__.py)

@zdevito @zheng-xq - feel free to continue discussion here.

Add StarGAN model to benchmark repo

We're building a suite of pytorch benchmarks by forking popular open-source models and modifying them to conform to a common API that facilitates a central tool installing, running, and collecting measurements.

Benchmark Repo: https://github.com/pytorch/benchmark
Instructions: https://github.com/pytorch/benchmark/blob/master/models/ADDING_MODELS.md
Source model: https://github.com/zou3519/stargan

The StarGAN model has already been forked and modified in a previous phase of this process, but the API (install.py, hubconf.py) had not yet been standardized. Copy the repo below and follow the latest instructions in the playbook to add the new API. Note, the 'install.sh' and 'run.sh' scripts added in the previous effort would serve as good hints on how to proceed.

Note that in this case it looks like there are no installation steps, but the install.py file should still be present as a no-op for the API's sake.

Filter models in run script

Make a list-models option and a filter option so that the setup and/or run steps only execute for the filtered models (see the sketch below).

Also, rename test.py to something else?
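
A rough sketch of what the proposed options could look like; the flag names and the surrounding run-script structure are assumptions:

import argparse
from torchbenchmark import list_models

parser = argparse.ArgumentParser()
parser.add_argument("--list-models", action="store_true", help="print model names and exit")
parser.add_argument("--filter", default="", help="only run models whose name contains this substring")
args = parser.parse_args()

# Select the models whose short name matches the filter substring.
models = [m for m in list_models() if args.filter in m.__module__.split(".")[-1]]
if args.list_models:
    for m in models:
        print(m.__module__.split(".")[-1])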

Update score and plot to compute subscore for gpu/cpu/train/eval

The current torchbenchmark framework reports one single benchmark score and doesn’t give a way to
calculate the subscore for GPU/CPU/train/eval/eager/JIT. So we need a way to get the benchmark subscore by calculating the subscore for each of the following 8 configurations:

* train-cuda-eager
* train-cpu-eager
* eval-cuda-eager
* eval-cpu-eager
* train-cuda-jit
* train-cpu-jit
* eval-cuda-jit
* eval-cpu-jit

Here are the two subtasks required to accomplish this:

  • Update the subscore depending on the configuration
  • Generate 9 plots: one for each of the above configurations (8), plus one for the whole score

Run benchmarks periodically in PyTorch CI

  • new workflow on pytorch side
    • install pytorch, torchvision, torchtext
      • can't use conda like the benchmark CI because we want access to more fine-grained versions of pytorch than nightly
      • @malfet can help with how to access build artifacts from any particular git version of pytorch repo in last X days
      • install those
    • run benchmarks
    • upload results

A bisect job would build on top of this by:

  • get previous and current periodic result
  • if regression suspected, then
    • do binary search over all commits between A, B
      • for each, get the corresponding pytorch artifacts, install them, run measurement

Remove bloat from git repo

These account for a large % of the repo size but aren't needed. However, some work is needed before they can all be removed, and then the process of rewriting git history to clean the files up should be done all at once and carefully. This issue can accumulate a wish list of files to delete, and at some point we can purge them all at once.

demucs/results/* -- can be removed, not used by benchmark
Background_matting/sample_data -- apparently not used, another ak folder contains used data
yolov3/data --contains mini coco, which is also checked in under torchbenchmark/data, so that should be used instead

Add DLRM to benchmark

Original model source
https://github.com/facebookresearch/dlrm

It looks like this one contains most (all?) of the model code inside one file: https://github.com/facebookresearch/dlrm/blob/master/dlrm_s_pytorch.py

The main function should be a good indicator of how to set up the model for training or eval, but it has a lot of extra baggage that can be removed when creating the benchmark version, such as profiling, onnx, save/load, etc.

Also, note that the pytorch benchmark repo was just reorganized. @zdevito is updating ADDING_MODELS.md with some updated instructions asap. Nothing changes for install.py, but hubconf.py is now named __init__.py, and there are more rules now about making no assumptions about the current working directory for relative paths when loading files.

benchmark models are not cleaned up between tests

Although the test_bench.py fixtures are set up to be alive/shared across the lifetime of a test class only, it appears that at least GPU memory is leaked from one test to the next, resulting in an eventual OOM when enough tests have run in series.
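
A hedged sketch of the kind of per-test cleanup that could mitigate this; it is not the actual test_bench.py fixture:

import gc
import pytest
import torch

@pytest.fixture(autouse=True)
def release_gpu_memory():
    yield  # run the test
    gc.collect()
    if torch.cuda.is_available():
        torch.cuda.empty_cache()  # return cached blocks to the CUDA allocator pool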

Identify Graph NNs to add and write up issues for them

Would any of these JIT and be useful for dynamic shapes work? Still good to cover these use cases in our benchmark suite.

https://github.com/facebookresearch/PyTorch-BigGraph (Partitioning)

Add Stanza model to benchmark repo

We're building a suite of pytorch benchmarks by forking popular open-source models and modifying them to conform to a common API that facilitates a central tool installing, running, and collecting measurements.

The hub repo is no longer used and models are no longer added as git submodules; other parts remain largely the same. Instructions are found in the README in the new repo.

Benchmark Repo: https://github.com/pytorch/benchmark
Instructions: https://github.com/pytorch/benchmark/blob/master/torchbenchmark/models/ADDING_MODELS.md
Source model: https://github.com/ezyang/stanza

The Stanza model has already been forked and modified in a previous phase of this process, but the API (install.py, hubconf.py) had not yet been standardized. Copy the repo into benchmark.git, follow the latest instructions in the ADDING_MODELS readme to add the new API. Note, the 'install.sh' and 'run.sh' scripts added in the previous effort would serve as good hints on how to proceed.

'custom_lstms.py': name 'List' is not defined

I got this error, but I'm not able to find where the error comes from.

  File "lstm_test.py", line 300, in <module>
    test_script_stacked_bidir_rnn(5, 2, 3, 7, 4)

  File "lstm_test.py", line 275, in test_script_stacked_bidir_rnn
    rnn = script_lstm(input_size, hidden_size, num_layers, bidirectional=True)

  File "lstm_test.py", line 50, in script_lstm
    hidden_size])

  File "C:\Users\feder\Anaconda3\lib\site-packages\torch\jit\__init__.py", line 950, in init_then_register
    original_init(self, *args, **kwargs)

  File "lstm_test.py", line 189, in __init__
    other_layer_args)

  File "lstm_test.py", line 150, in init_stacked_lstm
    layers = [layer(*first_layer_args)] + [layer(*other_layer_args)

  File "C:\Users\feder\Anaconda3\lib\site-packages\torch\jit\__init__.py", line 950, in init_then_register
    original_init(self, *args, **kwargs)

  File "lstm_test.py", line 129, in __init__
    ReverseLSTMLayer(cell, *cell_args),

  File "C:\Users\feder\Anaconda3\lib\site-packages\torch\jit\__init__.py", line 951, in init_then_register
    _create_methods_from_stubs(self, methods)

  File "C:\Users\feder\Anaconda3\lib\site-packages\torch\jit\__init__.py", line 912, in _create_methods_from_stubs
    self._create_methods(defs, rcbs, defaults)

  File "C:\Users\feder\Anaconda3\lib\site-packages\torch\jit\annotations.py", line 52, in get_signature
    return parse_type_line(type_line)

  File "C:\Users\feder\Anaconda3\lib\site-packages\torch\jit\annotations.py", line 90, in parse_type_line
    arg_ann = eval(arg_ann_str, _eval_env)

  File "<string>", line 1, in <module>

NameError: name 'List' is not defined

cyclegan is extremely slow

Can it be sped up? Are there overheads to eliminate? Is it properly using CUDA when it says it is?

If fixable, re-enable running it on PR jobs.
