mlbench-benchmarks's Introduction

mlbench Benchmarks: Distributed Machine Learning Benchmark


A public and reproducible collection of reference implementations and a benchmark suite for distributed machine learning algorithms, frameworks, and systems.

This repository contains the implementations for the various benchmark tasks in mlbench.

Features

  • For reproducibility and simplicity, we currently focus on standard supervised ML, including standard deep learning tasks as well as classic linear ML models.
  • We provide reference implementations for each algorithm, to make it easy to port to a new framework.
  • Our goal is to benchmark all/most currently relevant distributed execution frameworks. We welcome contributions of new frameworks in the benchmark suite.
  • We provide precisely defined tasks and datasets to have a fair and precise comparison of all algorithms, frameworks and hardware.
  • Independently of all solver implementations, we provide universal evaluation code that allows comparing the result metrics of different solvers and frameworks.
  • Our benchmark code is easy to run on public clouds.

Community

About us: Authors

Mailing list: https://groups.google.com/d/forum/mlbench

Contact Email: [email protected]

mlbench-benchmarks's People

Contributors

dependabot[bot], ehoelzl, giorgiosav, liehe, lucianamarques, martinjaggi, mmilenkoski, negar-foroutan, panaetius


mlbench-benchmarks's Issues

Fix pytorch reference implementation

The PyTorch reference implementation for cifar10-resnet-openmpi-allreduce currently specifies the wrong execution command; this needs to be adjusted.

Getting Error while creating container with Pytorch

Getting an error while creating a container with PyTorch:
standard_init_linux.go:228: exec user process caused: no such file or directory

This error occurs every time we build the container with Docker or Kubernetes to run the workloads on GKE or EKS.

  1. When we sampled the example from mlbench-benchmarks/examples/mlbench-pytorch-tutorial/, the result was successful for GLOO and NCCL.

  2. When we sampled from mlbench-benchmarks/pytorch/backend_benchmark/ or mlbench-benchmarks/pytorch/imagerecognition/cifar10-resnet20-all-reduce/, we got the error standard_init_linux.go:228: exec user process caused: no such file or directory.

Please suggest how to resolve this.

Create light version of the base image for development

We should add a light version of the base image with only the essential libraries. This will be useful to speed up local development and testing. For example, we could exclude CUDA and OpenMPI. Are there any other non-essential dependencies we could exclude that would significantly reduce the image size?

Change Tensorflow Benchmark to use OpenMPI

The current TensorFlow CIFAR-10 ResNet benchmark uses OpenMPI to start training, but TensorFlow doesn't use OpenMPI for communication.

https://github.com/tensorflow/tensorflow/tree/master/tensorflow/contrib/mpi details how to get tensorflow to use OpenMPI.

It would make sense to rename the current benchmark to reflect that it uses gRPC for communication and calls the workers directly from Python instead of using OpenMPI to start training. A separate benchmark can then be created that uses OpenMPI as the backend, with all else being equal to the current one, so we can compare gRPC vs. OpenMPI.
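A rough sketch of where the protocol choice would appear in a distributed TensorFlow 1.x setup is shown below; the cluster spec, job name, and task index are placeholders, and "grpc+mpi" is only available in builds that include the contrib MPI extension linked above.

```python
# Sketch only: choosing the communication protocol for a distributed
# TensorFlow 1.x server. Cluster addresses, job name and task index are
# placeholders. "grpc+mpi" requires a TensorFlow build with the contrib
# MPI extension; plain "grpc" is the default.
import tensorflow as tf

cluster = tf.train.ClusterSpec({
    "worker": ["worker0:2222", "worker1:2222"],
    "ps": ["ps0:2222"],
})

server = tf.train.Server(
    cluster,
    job_name="worker",
    task_index=0,
    protocol="grpc+mpi",  # use "grpc" for the plain gRPC benchmark
)
server.join()
```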

Add time-to-accuracy speedup plot

In the benchmark description and results page, add a time-to-accuracy speedup plot next to the throughput plot. This measures the relative speedup in time compared to N = 1 nodes.

It is crucial to point out that the same accuracy becomes (relatively) much slower to reach as the number of machines grows, i.e. the issue of large-batch training.

This is currently relevant for both CIFAR-10 and the linear models.
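A minimal sketch of how such a plot could be produced, assuming per-node-count time-to-accuracy measurements are already available; the numbers below are placeholders, not real results.

```python
# Sketch: relative time-to-accuracy speedup versus N = 1 node.
# times_to_accuracy maps number of nodes to the wall-clock time (seconds)
# needed to reach the target accuracy; the values below are placeholders.
import matplotlib.pyplot as plt

times_to_accuracy = {1: 3600, 2: 2000, 4: 1200, 8: 800, 16: 650}

nodes = sorted(times_to_accuracy)
speedup = [times_to_accuracy[1] / times_to_accuracy[n] for n in nodes]

plt.plot(nodes, speedup, marker="o", label="measured")
plt.plot(nodes, nodes, linestyle="--", label="ideal (linear)")
plt.xlabel("number of nodes")
plt.ylabel("speedup over 1 node")
plt.legend()
plt.savefig("time_to_accuracy_speedup.png")
```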

Clean-up tasks

  • Remove duplicate tasks for CIFAR10
  • Check all documentation

Remove stale branches

It looks like several "stale" branches (either already merged or inactive) exist in the repository.
Unless there is a good reason to keep them around, they should probably be deleted, which will remove clutter and improve the clarity of ongoing development activity.

remove open/closed division distinction

Not needed for now, and confusing for newcomers. We now just collect many implementations and declare one of them as the reference implementation for the task.

Make sure to also remove it from the documentation, tutorials, etc.

Update GKE documentation to use kubernetes version 1.10.9

There is a strange networking issue for some Docker containers on version 1.10.5 (the default version on GKE), so version 1.10.9 has to be used. This currently affects the TensorFlow benchmark implementation.

We should update the documentation accordingly.

Run new benchmarks and document costs

Supersedes mlbench/mlbench-core#82. We can now also use PyTorch 1.7.0.

  • CIFAR10, ResNet20, All Reduce, 1 to 16 workers
  • CIFAR10, ResNet20, DDP, 1 to 16 workers
  • Wikitext2, LSTM, All Reduce, 1 to 16 (32 ?) workers
  • Wikitext2, LSTM, DDP, 1 to 16 (32 ?) workers
  • WMT16, LSTM, All Reduce, 1 to 32 workers
  • WMT16, LSTM, DDP, 1 to 32 workers
  • WMT17, Transformer, All Reduce, 1 to 32 workers
  • WMT17, Transformer, DDP, 1 to 32 workers

Remove Communication backend from image name

Now that we have decided to pass the backend as an argument to the benchmark tasks, it would be coherent to remove the communication backend from the image names (e.g. change openmpi-cifar10-resnet20-all-reduce to cifar10-resnet20-all-reduce).

Also, as we discussed on Friday, it would be good to be able to select the backend from the GUI (tick boxes) and to add the used backend to the result names.
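A minimal sketch of what passing the backend at runtime could look like inside a benchmark entrypoint; the flag names and init method below are illustrative, not the actual mlbench interface.

```python
# Sketch: selecting the communication backend at runtime instead of
# baking it into the image name. Flag names are illustrative only.
import argparse
import torch.distributed as dist

parser = argparse.ArgumentParser()
parser.add_argument("--backend", choices=["mpi", "gloo", "nccl"], default="mpi")
parser.add_argument("--rank", type=int, required=True)
parser.add_argument("--world-size", type=int, required=True)
args = parser.parse_args()

# The rendezvous information would typically come from environment
# variables set by the deployment; "env://" is used here as a placeholder.
dist.init_process_group(
    backend=args.backend,
    init_method="env://",
    rank=args.rank,
    world_size=args.world_size,
)
```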

Recreate Benchmarks

Recreate the benchmarks from the original repo:

  • Use benchmark-independent code from mlbench-core to recreate the benchmark tasks
  • Create the relevant folder structure
  • Create Dockerfiles for each benchmark

Fix pytorch resnet performance issue

The PyTorch ResNet implementation currently has an issue where each epoch takes longer than the previous one. This should be investigated and fixed.
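For context, one frequent cause of this symptom in PyTorch (not confirmed to be the bug here) is accumulating loss tensors that are still attached to the autograd graph; a self-contained illustration of the pattern and its fix:

```python
# Illustration of a frequent cause of per-epoch slowdown (not confirmed to
# be the actual bug in this repository): accumulating loss *tensors* keeps
# the autograd graph of every batch alive, so memory use and step time grow.
import torch
import torch.nn as nn

model = nn.Linear(10, 2)
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

running_loss = 0.0
for _ in range(100):  # stand-in for iterating over a DataLoader
    inputs = torch.randn(32, 10)
    targets = torch.randint(0, 2, (32,))

    optimizer.zero_grad()
    loss = criterion(model(inputs), targets)
    loss.backward()
    optimizer.step()

    # Buggy pattern:   running_loss += loss        # keeps the graph alive
    running_loss += loss.item()                     # detaches to a float
```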

Transformers: Add a BERT task

Add BERT as a new task. We can reuse the existing transformer code we already have for the translation NLP task from here:
#33

@guptaprkhr @jbcdnr @mpagli, can you point us to the best code to start from, including the pre-processing pipeline? And could you later have a look at it here again so we can get a draft of standard data-parallel training running?

We should define a very light goal on the BERT training loss at first, to have something to iterate on quickly.

For comparison, MLPerf currently only has a TensorFlow BERT.

pytorch 1.4

We should update to every major new PyTorch version; it also gives us good practice in keeping our code compatible (same for the current CUDA version, etc.).

Add NLP benchmark images & task

Add 1-2 new benchmark tasks for NLP tasks (e.g. Sentiment Analysis, POS Tagging, Machine Translation).

The benchmark should train in a reasonable time (on the order of hours, not days or weeks).

Discuss suitable tasks before implementing.

Create Tensorflow CPU base image

TensorFlow by default depends on CUDA, even when no GPUs are utilized. We need a base image with TensorFlow compiled without CUDA to run CPU experiments on nodes without CUDA installed.

No unit tests

I could not find unit tests for the various benchmark implementations.
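As an illustration, a minimal pytest-style test of the kind that could be added; it uses torchvision's resnet18 as a stand-in, since the exact model modules in this repository may be named differently.

```python
# Sketch of a possible unit test. torchvision's resnet18 is used here as a
# stand-in for the benchmark's ResNet-20; the real test would import the
# model used by the benchmark task.
import torch
from torchvision.models import resnet18


def test_forward_output_shape():
    model = resnet18(num_classes=10)
    batch = torch.randn(4, 3, 32, 32)   # CIFAR-10-sized input
    out = model(batch)
    assert out.shape == (4, 10)
```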

[Not an Issue] Comparing 3 backends on multi-node single-gpu env

[UPDATED]

This issue documents my results from testing the speed of the three different backends for all_reduce operations in a distributed multi-node, single-GPU setting.

The experiments compare the communication speed of each backend by repeatedly sending tensors of increasing size.

For each backend, we test float16 and float32 communication, for both GPU and CPU tensors (when possible). We also compare the advantage of using Horovod.

Here is a graph depicting the results I have obtained (figure: fp32 vs fp16).

This benchmark uses 2 nodes with 1 Tesla T4 GPU per node.

As we can see, native MPI (i.e. without Horovod) always outperforms NCCL for float32, and almost always outperforms GLOO for small tensors.

Also, using Horovod does not bring any speed benefit, but it allows float16 reduction using MPI, and it outperforms NCCL and GLOO for large float16 reductions.
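For reference, a rough sketch of the kind of timing loop behind such measurements (the actual experiment code may differ); it assumes the process group has already been initialized with the backend under test, and that the backend supports the chosen dtype/device combination.

```python
# Rough sketch of an all_reduce timing loop. Assumes the process group was
# already initialised with the backend under test, e.g.:
#   torch.distributed.init_process_group(backend="nccl", init_method="env://")
# Not every backend supports every dtype/device combination.
import time
import torch
import torch.distributed as dist

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

for num_elements in [2 ** p for p in range(10, 25, 2)]:
    for dtype in (torch.float32, torch.float16):
        tensor = torch.ones(num_elements, dtype=dtype, device=device)

        dist.barrier()                      # align all workers before timing
        start = time.perf_counter()
        for _ in range(20):                 # repeat to average out noise
            dist.all_reduce(tensor)
        dist.barrier()
        elapsed = (time.perf_counter() - start) / 20

        if dist.get_rank() == 0:
            print(f"{num_elements:>10} elements  {dtype}  {elapsed * 1e3:.3f} ms")
```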
