mlbench-benchmarks's Introduction

mlbench Benchmarks: Distributed Machine Learning Benchmark


A public and reproducible collection of reference implementations and a benchmark suite for distributed machine learning algorithms, frameworks, and systems.

This repository contains the implementations for the various benchmark tasks in mlbench.

Features

  • For reproducibility and simplicity, we currently focus on standard supervised ML, including standard deep learning tasks as well as classic linear ML models.
  • We provide reference implementations for each algorithm, to make it easy to port to a new framework.
  • Our goal is to benchmark all/most currently relevant distributed execution frameworks. We welcome contributions of new frameworks in the benchmark suite.
  • We provide precisely defined tasks and datasets to have a fair and precise comparison of all algorithms, frameworks and hardware.
  • Independently of all solver implementations, we provide universal evaluation code that allows comparing the result metrics of different solvers and frameworks.
  • Our benchmark code is easy to run on public clouds.

Community

About us: Authors

Mailing list: https://groups.google.com/d/forum/mlbench

Contact Email: [email protected]

mlbench-benchmarks's People

Contributors

dependabot[bot], ehoelzl, giorgiosav, liehe, lucianamarques, martinjaggi, mmilenkoski, negar-foroutan, panaetius


mlbench-benchmarks's Issues

Fix pytorch reference implementation

The PyTorch reference implementation for cifar10-resnet-openmpi-allreduce currently specifies the wrong execution command; this needs to be adjusted.

Getting Error while creating container with Pytorch

Getting an error while creating a container with PyTorch:
standard_init_linux.go:228: exec user process caused: no such file or directory

This error occurs every time we build the container with Docker or Kubernetes to run the workloads on GKE or EKS.

  1. When we sampled the example from mlbench-benchmarks/examples/mlbench-pytorch-tutorial/, the result was successful for GLOO and NCCL.

  2. When we sampled from mlbench-benchmarks/pytorch/backend_benchmark/ or mlbench-benchmarks/pytorch/imagerecognition/cifar10-resnet20-all-reduce/, we got the error standard_init_linux.go:228: exec user process caused: no such file or directory.

Please suggest how to resolve this.

Create light version of the base image for development

We should add a light version of the base image with only the essential libraries. This will be useful to speed up local development and testing. For example, we could exclude CUDA and OpenMPI. Are there any other non-essential dependencies we could exclude that would significantly reduce the image size?

Change Tensorflow Benchmark to use OpenMPI

The current TensorFlow CIFAR-10 ResNet benchmark uses OpenMPI to start training, but TensorFlow doesn't use OpenMPI for communication.

https://github.com/tensorflow/tensorflow/tree/master/tensorflow/contrib/mpi details how to get tensorflow to use OpenMPI.

It would make sense to rename the current benchmark to reflect that it uses gRPC for communication and calls the workers directly from Python instead of using OpenMPI to start training. A separate benchmark can then be created that uses OpenMPI as the backend, with all else being equal to the current one, so we can compare gRPC vs. OpenMPI.
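A rough sketch of where the protocol choice would appear in a distributed TensorFlow 1.x setup is shown below; the cluster spec, job name, and task index are placeholders, and "grpc+mpi" is only available in builds that include the contrib MPI extension linked above.

```python
# Sketch only: choosing the communication protocol for a distributed
# TensorFlow 1.x server. Cluster addresses, job name and task index are
# placeholders. "grpc+mpi" requires a TensorFlow build with the contrib
# MPI extension; plain "grpc" is the default.
import tensorflow as tf

cluster = tf.train.ClusterSpec({
    "worker": ["worker0:2222", "worker1:2222"],
    "ps": ["ps0:2222"],
})

server = tf.train.Server(
    cluster,
    job_name="worker",
    task_index=0,
    protocol="grpc+mpi",  # use "grpc" for the plain gRPC benchmark
)
server.join()
```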

Add time-to-accuracy speedup plot

In the benchmark description and results page, add a time-to-accuracy speedup plot next to the throughput plot. This measures the relative speedup in time compared to N = 1 nodes.

It is crucial to point out that the same accuracy becomes (relatively) much slower to reach as the number of machines grows, i.e. the issue of large-batch training.

This is currently relevant for both CIFAR-10 and the linear models.
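A minimal sketch of how such a plot could be produced, assuming per-node-count time-to-accuracy measurements are already available; the numbers below are placeholders, not real results.

```python
# Sketch: relative time-to-accuracy speedup versus N = 1 node.
# times_to_accuracy maps number of nodes to the wall-clock time (seconds)
# needed to reach the target accuracy; the values below are placeholders.
import matplotlib.pyplot as plt

times_to_accuracy = {1: 3600, 2: 2000, 4: 1200, 8: 800, 16: 650}

nodes = sorted(times_to_accuracy)
speedup = [times_to_accuracy[1] / times_to_accuracy[n] for n in nodes]

plt.plot(nodes, speedup, marker="o", label="measured")
plt.plot(nodes, nodes, linestyle="--", label="ideal (linear)")
plt.xlabel("number of nodes")
plt.ylabel("speedup over 1 node")
plt.legend()
plt.savefig("time_to_accuracy_speedup.png")
```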

Clean-up tasks

  • Remove duplicate tasks for CIFAR10
  • Check all documentation

Remove stale branches

It looks like several "stale" branches (either already merged or inactive) exist in the repository.
Unless there is a good reason to keep them around, they should probably be deleted, which will remove clutter and improve the clarity of ongoing development activity.

remove open/closed division distinction

Not needed for now, and confusing for newcomers. We now just collect many implementations and declare one of them as the reference implementation for the task.

Make sure to also remove it from the documentation, tutorials, etc.

Update GKE documentation to use kubernetes version 1.10.9

There is a strange networking issue for some Docker containers on version 1.10.5 (the default version on GKE), so version 1.10.9 has to be used. This currently affects the TensorFlow benchmark implementation.

We should update the documentation accordingly.

Run new benchmarks and document costs

Supersedes mlbench/mlbench-core#82. We can now also use PyTorch 1.7.0.

  • CIFAR10, ResNet20, All Reduce, 1 to 16 workers
  • CIFAR10, ResNet20, DDP, 1 to 16 workers
  • Wikitext2, LSTM, All Reduce, 1 to 16 (32 ?) workers
  • Wikitext2, LSTM, DDP, 1 to 16 (32 ?) workers
  • WMT16, LSTM, All Reduce, 1 to 32 workers
  • WMT16, LSTM, DDP, 1 to 32 workers
  • WMT17, Transformer, All Reduce, 1 to 32 workers
  • WMT17, Transformer, DDP, 1 to 32 workers

Remove Communication backend from image name

Now that we have decided to pass the backend as an argument to the benchmark tasks, it would be coherent to remove the communication backend from the image names (e.g. change openmpi-cifar10-resnet20-all-reduce to cifar10-resnet20-all-reduce).

Also, as we discussed on Friday, it would be good to be able to select the backend from the GUI (tick boxes) and to add the used backend to the result names.
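A minimal sketch of what passing the backend at runtime could look like inside a benchmark entrypoint; the flag names and init method below are illustrative, not the actual mlbench interface.

```python
# Sketch: selecting the communication backend at runtime instead of
# baking it into the image name. Flag names are illustrative only.
import argparse
import torch.distributed as dist

parser = argparse.ArgumentParser()
parser.add_argument("--backend", choices=["mpi", "gloo", "nccl"], default="mpi")
parser.add_argument("--rank", type=int, required=True)
parser.add_argument("--world-size", type=int, required=True)
args = parser.parse_args()

# The rendezvous information would typically come from environment
# variables set by the deployment; "env://" is used here as a placeholder.
dist.init_process_group(
    backend=args.backend,
    init_method="env://",
    rank=args.rank,
    world_size=args.world_size,
)
```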

Recreate Benchmarks

Recreate the benchmarks from the original repo:

  • Use benchmark-independent code from mlbench-core to recreate the benchmark tasks
  • Create the relevant folder structure
  • Create Dockerfiles for each benchmark

Fix pytorch resnet performance issue

The PyTorch ResNet implementation currently has an issue where each epoch takes longer than the previous one. This should be investigated and fixed.
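For context, one frequent cause of this symptom in PyTorch (not confirmed to be the bug here) is accumulating loss tensors that are still attached to the autograd graph; a self-contained illustration of the pattern and its fix:

```python
# Illustration of a frequent cause of per-epoch slowdown (not confirmed to
# be the actual bug in this repository): accumulating loss *tensors* keeps
# the autograd graph of every batch alive, so memory use and step time grow.
import torch
import torch.nn as nn

model = nn.Linear(10, 2)
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

running_loss = 0.0
for _ in range(100):  # stand-in for iterating over a DataLoader
    inputs = torch.randn(32, 10)
    targets = torch.randint(0, 2, (32,))

    optimizer.zero_grad()
    loss = criterion(model(inputs), targets)
    loss.backward()
    optimizer.step()

    # Buggy pattern:   running_loss += loss        # keeps the graph alive
    running_loss += loss.item()                     # detaches to a float
```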

Transformers: Add a BERT task

Add BERT as a new task. We can reuse the existing transformer code we already have for the translation NLP task from here:
#33

@guptaprkhr @jbcdnr @mpagli, can you point us to the best code to start from, including the pre-processing pipeline? And could you later have a look at it here again so we can get a draft of standard data-parallel training running?

We should define a very light goal on the BERT training loss at first, to have something to iterate on quickly.

For comparison, MLPerf currently only has a TensorFlow BERT.

pytorch 1.4

We should update to every major new PyTorch version; it also gives us good practice in keeping our code compatible (same for the current CUDA version, etc.).

Add NLP benchmark images & task

Add 1-2 new benchmark tasks for NLP tasks (e.g. Sentiment Analysis, POS Tagging, Machine Translation).

The benchmark should train in a reasonable time (on the order of hours, not days or weeks).

Discuss suitable tasks before implementing.

Create Tensorflow CPU base image

TensorFlow by default depends on CUDA, even when no GPUs are utilized. We need a base image with TensorFlow compiled without CUDA to run CPU experiments on nodes without CUDA installed.

No unit tests

I could not find unit tests for the various benchmark implementations.
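As an illustration, a minimal pytest-style test of the kind that could be added; it uses torchvision's resnet18 as a stand-in, since the exact model modules in this repository may be named differently.

```python
# Sketch of a possible unit test. torchvision's resnet18 is used here as a
# stand-in for the benchmark's ResNet-20; the real test would import the
# model used by the benchmark task.
import torch
from torchvision.models import resnet18


def test_forward_output_shape():
    model = resnet18(num_classes=10)
    batch = torch.randn(4, 3, 32, 32)   # CIFAR-10-sized input
    out = model(batch)
    assert out.shape == (4, 10)
```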

[Not an Issue] Comparing 3 backends on multi-node single-gpu env

[UPDATED]

This issue documents my results from testing the speed of the three different backends for all_reduce operations in a distributed multi-node, single-GPU setting.

The experiments compare the communication speed of each backend by repeatedly sending tensors of increasing size.

For each backend, we test float16 and float32 communication, for both GPU and CPU tensors (when possible). We also compare the advantage of using Horovod.

Here is a graph depicting the results I have obtained (figure: fp32 vs fp16).

This benchmark uses 2 nodes with 1 Tesla T4 GPU per node.

As we can see, native MPI (i.e. without Horovod) always outperforms NCCL for float32, and almost always outperforms GLOO for small tensors.

Also, using Horovod does not bring any speed benefit, but it allows float16 reduction using MPI, and it outperforms NCCL and GLOO for large float16 reductions.
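For reference, a rough sketch of the kind of timing loop behind such measurements (the actual experiment code may differ); it assumes the process group has already been initialized with the backend under test, and that the backend supports the chosen dtype/device combination.

```python
# Rough sketch of an all_reduce timing loop. Assumes the process group was
# already initialised with the backend under test, e.g.:
#   torch.distributed.init_process_group(backend="nccl", init_method="env://")
# Not every backend supports every dtype/device combination.
import time
import torch
import torch.distributed as dist

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

for num_elements in [2 ** p for p in range(10, 25, 2)]:
    for dtype in (torch.float32, torch.float16):
        tensor = torch.ones(num_elements, dtype=dtype, device=device)

        dist.barrier()                      # align all workers before timing
        start = time.perf_counter()
        for _ in range(20):                 # repeat to average out noise
            dist.all_reduce(tensor)
        dist.barrier()
        elapsed = (time.perf_counter() - start) / 20

        if dist.get_rank() == 0:
            print(f"{num_elements:>10} elements  {dtype}  {elapsed * 1e3:.3f} ms")
```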
