Tiresias -- A GPU Cluster Manager for Distributed Deep Learning Training without complete job information

Tiresias is a GPU cluster resource manager that aims to minimize the completion times of distributed deep learning (DDL) jobs with partial or no a priori knowledge of those jobs. It does not rely on any intermediate DL algorithm state (e.g., training loss values) or framework specifics (e.g., tensor-to-parameter-server mapping).

DDL training jobs bring some unique challenges to the cluster manager:

  1. unpredictable training time
  2. over-aggressive job consolidation
  3. all-or-nothing resource allocation
  4. inflexibility in GPU sharing (job preemption and resumption)
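Challenges 2 and 3 can be made concrete with a small sketch. This is not code from this repository: the `allocate` function, its parameters, and the node names are hypothetical, chosen only to illustrate how strict consolidation combined with all-or-nothing allocation can leave a job waiting even though enough GPUs are idle in total.

```python
def allocate(num_gpus, free_gpus, consolidate):
    """Return {node: gpus} for a job, or None if it cannot start now.

    free_gpus maps a node name to its number of idle GPUs.
    All names here are hypothetical, for illustration only.
    """
    if consolidate:
        # Strict consolidation: the whole job must fit on a single node.
        for node, free in sorted(free_gpus.items()):
            if free >= num_gpus:
                return {node: num_gpus}
        return None

    # Relaxed placement: GPUs may span nodes, but allocation is still
    # all-or-nothing, so a partial grant is never returned.
    if sum(free_gpus.values()) < num_gpus:
        return None
    plan, remaining = {}, num_gpus
    # Fill from the node with the most idle GPUs to limit fragmentation.
    for node, free in sorted(free_gpus.items(), key=lambda kv: -kv[1]):
        take = min(free, remaining)
        if take > 0:
            plan[node] = take
            remaining -= take
        if remaining == 0:
            break
    return plan


free = {"node0": 2, "node1": 3}
strict = allocate(4, free, consolidate=True)    # no single node has 4 idle GPUs
relaxed = allocate(4, free, consolidate=False)  # spans both nodes
too_big = allocate(6, free, consolidate=False)  # only 5 GPUs idle in total
```

With five idle GPUs split across two nodes, the 4-GPU job is blocked under strict consolidation but can start once the constraint is relaxed, while the 6-GPU job gets nothing rather than a partial grant.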

Tiresias tackles these challenges with the Discretized-2DAS (two-dimensional age/attained-service based) scheduler and a model profile-based job placement scheme. The 2DAS scheduler considers both the spatial (GPU requirements) and temporal (executed time) aspects of DDL jobs and offers two scheduling algorithms: Discretized 2D-LAS and Discretized 2D-Gittins Index. They aim to minimize the average job completion time (JCT) with no and partial job knowledge, respectively. The profile-based job placement scheme relaxes the consolidation constraint where appropriate and maintains the GPU utilization of the cluster without hurting job performance.
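As a rough illustration of the Discretized 2D-LAS idea, the sketch below computes a job's two-dimensional attained service as GPUs multiplied by executed time, maps it to a discrete priority queue, and orders jobs by queue. It is a minimal sketch under stated assumptions, not the simulator's implementation: the `Job` class, the function names, the threshold values, and the FIFO-by-arrival tie-break within a queue are all hypothetical choices made for this example.

```python
from dataclasses import dataclass


@dataclass
class Job:
    name: str
    num_gpus: int          # spatial dimension: GPUs the job requests
    executed_time: float   # temporal dimension: seconds run so far
    arrival: float         # submission time, used to break ties

    @property
    def attained_service(self) -> float:
        # Two-dimensional attained service in GPU-seconds.
        return self.num_gpus * self.executed_time


def queue_index(job, thresholds):
    """Return the priority queue for a job (0 is the highest priority).

    thresholds is an ascending list of attained-service limits; a job
    is demoted one queue each time its attained service crosses a limit.
    """
    for i, limit in enumerate(thresholds):
        if job.attained_service < limit:
            return i
    return len(thresholds)


def schedule_order(jobs, thresholds):
    """Order jobs by queue, then by arrival time within a queue."""
    return sorted(jobs, key=lambda j: (queue_index(j, thresholds), j.arrival))


# Hypothetical thresholds in GPU-seconds, giving three queues.
thresholds = [3600.0, 36000.0]
jobs = [
    Job("a", num_gpus=8, executed_time=1000.0, arrival=0.0),  # 8000 GPU-s
    Job("b", num_gpus=1, executed_time=1000.0, arrival=1.0),  # 1000 GPU-s
    Job("c", num_gpus=2, executed_time=100.0, arrival=2.0),   # 200 GPU-s
]
order = [j.name for j in schedule_order(jobs, thresholds)]
print(order)  # ['b', 'c', 'a']
```

Job `a` arrived first but has already consumed the most GPU-time, so it falls into a lower-priority queue behind `b` and `c`. Discretizing into a few queues, rather than sorting on the raw attained service, keeps jobs with similar service from repeatedly preempting each other.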

Our testbed experiments and large-scale trace-driven simulations show that Tiresias improves the average JCT by up to 5.5x over current production solutions and by up to 2x over a state-of-the-art DDL cluster scheduler, and that it performs comparably to a solution with perfect knowledge of all job characteristics.

Detailed design and performance are available in our NSDI'19 paper.

What's in this repository?

  1. A discrete-time simulator of a GPU cluster manager for DL training jobs (including both the job scheduler and the placement scheme)

Coming soon ...

  1. Network (RDMA)-level message profiler for DL models

  2. ...

Others

  1. What is the LAS (Least-Attained Service) algorithm?
    Nuyens, Misja, and Adam Wierman. "The foreground–background queue: a survey." Performance Evaluation 65.3-4 (2008): 286-307.

  2. What is the Gittins Index policy?
    Gittins, John, Kevin Glazebrook, and Richard Weber. Multi-armed Bandit Allocation Indices. John Wiley & Sons, 2011.
