gitter-badger / tfmesos

This project forked from douban/tfmesos

Tensorflow in Docker on Mesos #tfmesos #tensorflow #mesos

License: BSD 3-Clause "New" or "Revised" License

Shell 11.38% Python 88.62%

TFMesos

TFMesos is a lightweight framework that helps run distributed Tensorflow machine learning tasks on Apache Mesos within Docker and Nvidia-Docker.

TFMesos dynamically allocates resources from a Mesos cluster, builds a distributed training cluster for Tensorflow, and keeps different training tasks managed and isolated in the shared Mesos cluster with the help of Docker.

Prerequisites

  • For Mesos >= 1.0.0:
  1. A Mesos cluster (see: Mesos Getting Started). All nodes in the cluster should be reachable using their hostnames, and all nodes should have identical /etc/passwd and /etc/group files.
  2. Set up the Mesos Agent to enable the Mesos Containerizer and, optionally, Mesos Nvidia GPU Support, e.g.: mesos-agent --containerizers=mesos --image_providers=docker --isolation=filesystem/linux,docker/runtime,cgroups/devices,gpu/nvidia
  3. (optional) A distributed filesystem (e.g. MooseFS)
  4. Ensure the latest TFMesos docker image (tfmesos/tfmesos) is pulled across the whole cluster
  • For Mesos < 1.0.0:
  1. A Mesos cluster (see: Mesos Getting Started). All nodes in the cluster should be reachable using their hostnames, and all nodes should have identical /etc/passwd and /etc/group files.
  2. Docker (see: Docker Get Start Tutorial)
  3. Mesos Docker Containerizer Support (see: Mesos Docker Containerizer)
  4. (optional) An Nvidia-docker installation (see: Nvidia-docker installation), with nvidia-plugin accessible from remote hosts (with -l 0.0.0.0:3476)
  5. (optional) A distributed filesystem (e.g. MooseFS)
  6. Ensure the latest TFMesos docker image (tfmesos/tfmesos) is pulled across the whole cluster
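The hostname requirement above can be checked before anything is launched. The following is a minimal sketch and not part of TFMesos: the function name unresolvable_hosts is hypothetical, and it only tests that a name resolves to an address, which is necessary for reachability but does not prove it.

```python
import socket


def unresolvable_hosts(hostnames, resolve=socket.gethostbyname):
    """Return the hostnames that fail to resolve to an address.

    `resolve` is injectable so the check can be exercised without a network.
    """
    failed = []
    for name in hostnames:
        try:
            resolve(name)
        except OSError:  # socket.gaierror is a subclass of OSError
            failed.append(name)
    return failed
```

Calling unresolvable_hosts with the list of node names of your own cluster on each node should return an empty list; any name it returns needs a DNS or /etc/hosts entry on that node.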

If you are using an AWS G2 instance, here is a sample script to set up most of these prerequisites.

Running a simple test

After setting up Mesos and pulling the docker image on a single node (or a cluster), you should be able to use the following command to run a simple test.

$ docker run -e MESOS_MASTER=mesos-master:5050 \
    -e DOCKER_IMAGE=tfmesos/tfmesos \
    --net=host \
    -v /path-to-your-tfmesos-code/tfmesos/examples/plus.py:/tmp/plus.py \
    --rm \
    -it \
    tfmesos/tfmesos \
    python /tmp/plus.py mesos-master:5050

Successfully running the test should result in an output of 42 on the console.

Running in replica mode

This mode is called Between-graph replication in the official Distributed Tensorflow Howto.

Most distributed training models that Google has open sourced (such as mnist_replica and inception) use this mode. In this mode, two kinds of jobs are defined, named 'ps' and 'worker'. 'ps' tasks act as parameter servers and 'worker' tasks run the actual training process.
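The two job names can be pictured as keys of a small cluster description. The sketch below is illustrative only: build_cluster and describe_task are hypothetical helpers, and the addresses are made up. A dict of this shape, mapping job names to lists of addresses, is the kind of value that TensorFlow 1.x tf.train.ClusterSpec accepts, but TensorFlow is deliberately not imported here so the example runs on its own.

```python
def build_cluster(ps_hosts, worker_hosts):
    """Group comma-separated "host:port" strings by job name."""
    return {
        "ps": [h for h in ps_hosts.split(",") if h],
        "worker": [h for h in worker_hosts.split(",") if h],
    }


def describe_task(cluster, job_name, task_index):
    """Return the address and role of one task in the cluster."""
    if job_name not in cluster:
        raise ValueError("unknown job name: %r" % job_name)
    address = cluster[job_name][task_index]
    role = "parameter server" if job_name == "ps" else "training worker"
    return "%s task %d at %s acts as a %s" % (job_name, task_index, address, role)


# Hypothetical addresses, for illustration only.
cluster = build_cluster("node1:2222", "node2:2222,node3:2222")
print(describe_task(cluster, "ps", 0))
print(describe_task(cluster, "worker", 1))
```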

Here we use our modified 'mnist_replica' as an example:

  1. Check out the mnist example code into a directory on a shared filesystem, e.g. /nfs/mnist
  2. Assume the Mesos master is mesos-master:5050
  3. Now we can launch the script using the following commands:

CPU:

$ docker run --rm -it -e MESOS_MASTER=mesos-master:5050 \
             --net=host \
             -v /nfs/mnist:/nfs/mnist \
             -v /etc/passwd:/etc/passwd:ro \
             -v /etc/group:/etc/group:ro \
             -u `id -u` \
             -w /nfs/mnist \
             tfmesos/tfmesos \
             tfrun -w 1 -s 1  \
             -V /nfs/mnist:/nfs/mnist \
             -- python mnist_replica.py \
             --ps_hosts {ps_hosts} --worker_hosts {worker_hosts} \
             --job_name {job_name} --worker_index {task_index}

GPU (1 GPU per worker):

$ nvidia-docker run --rm -it -e MESOS_MASTER=mesos-master:5050 \
             --net=host \
             -v /nfs/mnist:/nfs/mnist \
             -v /etc/passwd:/etc/passwd:ro \
             -v /etc/group:/etc/group:ro \
             -u `id -u` \
             -w /nfs/mnist \
             tfmesos/tfmesos \
             tfrun -w 1 -s 1 -Gw 1 -- python mnist_replica.py \
             --ps_hosts {ps_hosts} --worker_hosts {worker_hosts} \
             --job_name {job_name} --worker_index {task_index}

Note:

In this mode, tfrun prepares the cluster and launches the training script on each node. Worker #0 (the chief worker) is launched in the local container. tfrun substitutes {ps_hosts}, {worker_hosts}, {job_name} and {task_index} with the corresponding values for each task.
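The placeholder substitution can be pictured with plain Python string formatting. This is a sketch of the idea only, not tfrun's actual implementation: render_command is a hypothetical helper, the addresses are invented, and joining each host list with commas is an assumption about the format the script expects.

```python
def render_command(template, cluster, job_name, task_index):
    """Fill the four placeholders in a list of command-line arguments."""
    values = {
        "ps_hosts": ",".join(cluster["ps"]),          # assumed comma-separated
        "worker_hosts": ",".join(cluster["worker"]),  # assumed comma-separated
        "job_name": job_name,
        "task_index": task_index,
    }
    return [arg.format(**values) for arg in template]


# The argument list mirrors the command shown in the examples above.
template = [
    "python", "mnist_replica.py",
    "--ps_hosts", "{ps_hosts}", "--worker_hosts", "{worker_hosts}",
    "--job_name", "{job_name}", "--worker_index", "{task_index}",
]

# Hypothetical cluster with one 'ps' task and one 'worker' task (-s 1 -w 1).
cluster = {"ps": ["node1:2222"], "worker": ["node2:2222"]}

for job_name, hosts in sorted(cluster.items()):
    for task_index in range(len(hosts)):
        print(" ".join(render_command(template, cluster, job_name, task_index)))
```

Each task receives the same host lists but its own job name and index, which is how a single script can act as either a parameter server or a worker.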

Running in fine-grained mode

This mode is called In-graph replication in the official Distributed Tensorflow Howto.

In this mode, we have more control over the cluster spec. All nodes in the cluster are remote and each just runs a gRPC server. Each worker is driven by a local thread that runs the training task.
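The "one local thread per remote worker" arrangement can be sketched with the standard threading module. This is an illustration under stated assumptions, not the code of the mnist example: drive_worker and run_all are hypothetical names, the addresses are invented, and the thread body is a placeholder where a real script would open a session against the remote gRPC server.

```python
import threading


def drive_worker(index, address, results):
    """Stand-in for the per-worker training loop.

    A real script would connect to the remote gRPC server here; this
    placeholder only records which address the thread was given.
    """
    results[index] = "worker %d -> grpc://%s" % (index, address)


def run_all(worker_addresses):
    """Start one local thread per remote worker and wait for all of them."""
    results = [None] * len(worker_addresses)
    threads = [
        threading.Thread(target=drive_worker, args=(i, addr, results))
        for i, addr in enumerate(worker_addresses)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results


# Hypothetical addresses of remote gRPC servers.
for line in run_all(["node1:2222", "node2:2222"]):
    print(line)
```

Each thread writes to its own slot of the results list, so no lock is needed in this sketch.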

Here we use our modified mnist as an example:

  1. Check out the mnist example code into a directory, e.g. /tmp/mnist
  2. Assume the Mesos master is mesos-master:5050
  3. Now we can launch the script using the following commands:

CPU:

$ docker run --rm -it -e MESOS_MASTER=mesos-master:5050 \
             --net=host \
             -v /tmp/mnist:/tmp/mnist \
             -v /etc/passwd:/etc/passwd:ro \
             -v /etc/group:/etc/group:ro \
             -u `id -u` \
             -w /tmp/mnist \
             tfmesos/tfmesos \
             python mnist.py

GPU (1 GPU per worker):

$ nvidia-docker run --rm -it -e MESOS_MASTER=mesos-master:5050 \
             --net=host \
             -v /tmp/mnist:/tmp/mnist \
             -v /etc/passwd:/etc/passwd:ro \
             -v /etc/group:/etc/group:ro \
             -u `id -u` \
             -w /tmp/mnist \
             tfmesos/tfmesos \
             python mnist.py --worker-gpus 1

Contributors

giorgioercixu, mckelvin, windreamer
