
This project forked from intelligent-machine-learning/dlrover


DLRover: An Automatic Distributed Deep Learning System

License: Apache License 2.0

Shell 0.56% Python 60.06% Go 38.69% Makefile 0.43% Dockerfile 0.27%

dlrover's Introduction

DLRover

DLRover: An Automatic Distributed Deep Learning System


DLRover automatically trains deep learning models on a distributed cluster. It helps model developers focus on model architecture without handling engineering concerns such as hardware acceleration and distributed execution. It currently provides automated operation and maintenance for deep learning training jobs on K8s/Ray. Major features:

  • Automatic Resource Optimization: automatically optimizes job resources to improve training performance and resource utilization.
  • Dynamic Data Sharding: dynamically dispatches training data to each worker instead of dividing it equally, so faster workers receive more data.
  • Fault Tolerance: recovers from single-node failures without restarting the entire job.
  • Auto-Scaling: automatically scales resources up and down, both at the node level and at the CPU/memory level.

Why DLRover?

No Resource Configuration to Submit a Job.

Compared with TFJob in Kubeflow, users do not need to set any resource configuration to submit a distributed training job.


Fault Tolerance to Improve the Stability of Jobs.

DLRover can recover failed parameter servers and workers to resume training, so the failure of some nodes does not interrupt the training or hurt convergence accuracy. The most common error is a node OOM caused by the user's insufficient memory configuration; DLRover automatically launches a Pod with more memory to recover the OOM-ed node. At AntGroup, DLRover manages hundreds of DL training jobs every day on a customized Kubernetes cluster. Excluding jobs that fail because of code errors, the job completion rate rose from 89% with tf-operator in Kubeflow to 95% with DLRover. The remaining unrecoverable failures come from data errors, NaN loss of the model, network breakdowns, and so on.
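As an illustration of this recovery policy (a minimal sketch; the class and function names below are hypothetical and not DLRover's actual API), a master could respond to an OOM exit by relaunching the node with a larger memory request instead of failing the whole job:

```python
# Illustrative sketch of OOM-driven node recovery (hypothetical names, not
# DLRover's real implementation): relaunch the failed node with more memory.
from dataclasses import dataclass


@dataclass
class NodeSpec:
    node_id: int
    cpu: float        # CPU cores
    memory_mb: int    # memory request in MB


def relaunch_after_oom(failed: NodeSpec, growth_factor: float = 1.5,
                       memory_limit_mb: int = 65536) -> NodeSpec:
    """Return a new spec with more memory for the OOM-ed node."""
    new_memory = min(int(failed.memory_mb * growth_factor), memory_limit_mb)
    return NodeSpec(failed.node_id, failed.cpu, new_memory)


# Example: a worker OOM-ed with 8 GB; the job continues with a 12 GB replacement.
oom_worker = NodeSpec(node_id=3, cpu=4, memory_mb=8192)
replacement = relaunch_after_oom(oom_worker)
print(replacement)  # NodeSpec(node_id=3, cpu=4, memory_mb=12288)
```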


Auto-Scaling to Improve Training Performance.

DLRover automatically scales resources (for parameter servers or workers) up and down at the runtime of a training job. By monitoring node workload and throughput, DLRover can diagnose the bottleneck of the resource configuration. Common bottlenecks include node stragglers, unbalanced workload across parameter servers, insufficient CPU cores per node, and an insufficient number of nodes. DLRover improves training performance through dynamic resource adjustment.
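A minimal sketch of that diagnosis idea, assuming simple illustrative thresholds and function names rather than DLRover's real heuristics:

```python
# Illustrative bottleneck diagnosis (assumed thresholds, not DLRover's real
# policy): inspect per-node metrics and suggest a scaling action.
def diagnose(ps_cpu_util, worker_cpu_util, throughput_history):
    """ps_cpu_util / worker_cpu_util: per-node CPU utilization in [0, 1].
    throughput_history: recent samples of global samples/sec."""
    actions = []
    # Unbalanced PS workload: one PS is much hotter than the average.
    if max(ps_cpu_util) > 1.5 * (sum(ps_cpu_util) / len(ps_cpu_util)):
        actions.append("rebalance or add parameter servers")
    # Workers are CPU-bound: give each worker more CPU cores.
    if min(worker_cpu_util) > 0.9:
        actions.append("increase CPU cores per worker")
    # Throughput still improving after the last scale-up: add more workers.
    if len(throughput_history) >= 2 and throughput_history[-1] > 1.1 * throughput_history[-2]:
        actions.append("add more workers")
    return actions or ["keep current resources"]


print(diagnose(ps_cpu_util=[0.95, 0.40, 0.35],
               worker_cpu_util=[0.92, 0.94, 0.91],
               throughput_history=[1200.0, 1450.0]))
```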

We use the Kaggle CRITEO dataset to train Wide&Deep and xDeepFM for 10 epochs on a K8s cluster. DLRover mitigates stragglers to improve training throughput and shorten the job completion time (JCT).


Auto-Scaling to Improve Resource Utilization.

Different model training jobs require different resources. Users prefer to configure their jobs with over-provisioned resources to avoid any risk from insufficient resources, which usually results in significant resource waste. DLRover auto-scaling allocates resources according to the demand of the model training, reducing this waste.


Dynamic data sharding

There are two reasons why we need dynamic data sharding. The first is that DLRover needs to ensure the training data used to fit the model is processed as users expect. When workers recover from failure or are scaled up or down, they may consume some training data twice or miss it entirely because there is no global coordinator. With dynamic data sharding, DLRover keeps track of every worker's data consumption and tries its best to ensure data is delivered exactly once, at least once, or at most once (see streaming-data-splitter-and-manager.md). As a result, dynamic data sharding eliminates this uncertainty by ensuring data is consumed as users expect.

The second is that dynamic data sharding reduces the complexity of obtaining training data on the worker side. As training-master.md describes, a worker only needs to ask the DLRover master for a data shard, without interacting with other workers to split the data.

Dynamic data sharding also mitigates worker stragglers. After a worker starts its training loop, it queries shards from the TODO queue one by one, so a fast worker consumes more shards than a slow (straggling) worker.
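The sketch below illustrates the shard-dispatching idea with TODO/DOING/DONE queues; the class and method names are assumptions for illustration, not DLRover's actual splitter (see the design docs linked above):

```python
# Minimal sketch of dynamic data sharding (illustrative, not DLRover's code):
# the master keeps TODO/DOING/DONE queues of shards; fast workers simply ask
# for shards more often, and shards of failed workers go back to TODO.
from collections import deque


class ShardManager:
    def __init__(self, dataset_size, shard_size):
        self.todo = deque(
            (start, min(start + shard_size, dataset_size))
            for start in range(0, dataset_size, shard_size)
        )
        self.doing = {}   # worker_id -> in-flight shard
        self.done = []

    def get_shard(self, worker_id):
        """Called by a worker to fetch its next shard; None means no data left."""
        if not self.todo:
            return None
        shard = self.todo.popleft()
        self.doing[worker_id] = shard
        return shard

    def report_done(self, worker_id):
        self.done.append(self.doing.pop(worker_id))

    def recover_worker(self, worker_id):
        """Re-queue the in-flight shard of a failed worker (at-least-once delivery)."""
        if worker_id in self.doing:
            self.todo.appendleft(self.doing.pop(worker_id))


manager = ShardManager(dataset_size=1000, shard_size=128)
shard = manager.get_shard(worker_id=0)   # e.g. (0, 128)
manager.report_done(worker_id=0)
```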

Integration with Offline and Online Deep Learning.

With the data source transparency provided by dynamic data sharding, DLRover can be integrated with offline training that consumes batch data, and it also supports online learning with real-time streaming data (fed from a message queue such as RocketMQ/Kafka/Pulsar/..., or executed as a training sink node inside Flink/Spark/Ray/...).
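As a hedged illustration of this data-source transparency (the helper names are hypothetical, not DLRover's API), the same shard-driven generator can serve both batch and streaming training, because the worker only ever asks for the next shard and never sees the underlying data source:

```python
# Illustrative sketch (hypothetical names, not DLRover's API): one input
# generator serves offline training (finite shards) and online training
# (endless streaming shards) alike.
def shard_records(get_next_shard, read_shard):
    """Yield training records shard by shard until the master returns None."""
    while True:
        shard = get_next_shard()
        if shard is None:          # offline job: data exhausted
            return
        yield from read_shard(shard)


# Offline example: the "master" hands out two index ranges over a list.
data = list(range(10))
shards = [(0, 5), (5, 10)]
records = shard_records(
    get_next_shard=lambda: shards.pop(0) if shards else None,
    read_shard=lambda s: data[s[0]:s[1]],
)
print(list(records))  # [0, 1, ..., 9]
```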

In practice, DLRover is an ideal component for building an end-to-end industrial online learning system; estimator.md provides a detailed example implemented with tf.estimator.Estimator.

What's Next?

  • Fine-grained automatic distributed training for GPU synchronous jobs
    • hybrid-parallel mode
    • adaptive hyperparameter adjustment with dynamic resources
    • more strategies for fine-grained scenarios
  • Full-stack solution for Online Deep Learning
  • High-performance extension library for TensorFlow/PyTorch to speed up training
  • ...

Quick Start

Train a TensorFlow Estimator on Aliyun ACK

Train a PyTorch Model on Aliyun ACK
