caffe2-operator

Experimental repository for a caffe2 operator

Motivation

Caffe2 is a popular machine learning framework which currently does not have an operator/controller for Kubernetes. This proposal is aimed at defining what that operator should look like, and adding it to Kubeflow.

Unlike TensorFlow, Caffe2 has no parameter server for distributed training, so workers use Redis/Gloo to discover the other nodes they need to communicate with.

Build

$ make
mkdir -p _output/bin
go build -o _output/bin/caffe2-operator ./cmd/caffe2-operator/
$ _output/bin/caffe2-operator --help

Custom Resource Definition

The custom resource submitted to the Kubernetes API would look something like this:

apiVersion: "kubeflow.org/v1alpha1"
kind: "Caffe2Job"
metadata:
  name: "example-job"
spec:
  backend: "redis"
  replicaSpecs:
      replicas: 2
      template:
        spec:
          hostNetwork: true
          containers:
          - image: kubeflow/caffe2:py2-cuda9.0-cudnn7-ubuntu16.04
            name: caffe2
            resources:
              limits:
                nvidia.com/gpu: 2
            workingDir: /usr/local/caffe2/caffe2/python/examples/
            command: ["python", "resnet50_trainer.py"]

A full resnet50 trainer example is here.

This Caffe2Job resembles the existing TFJob for the tf-operator. The main differences are the omission of the parameter server replica type and the addition of backend options.
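As a rough illustration of the shape described above, the following Python sketch builds the same manifest as a dictionary and prints it as JSON, which kubectl also accepts. The function name build_caffe2job and its parameters are hypothetical, and the field layout simply mirrors the example above as written; the actual CRD schema may differ.

```python
import json


def build_caffe2job(name, replicas, backend="redis",
                    image="kubeflow/caffe2:py2-cuda9.0-cudnn7-ubuntu16.04",
                    command=("python", "resnet50_trainer.py"),
                    gpus=2):
    """Build a Caffe2Job manifest mirroring the example above (hypothetical helper)."""
    container = {
        "image": image,
        "name": "caffe2",
        "resources": {"limits": {"nvidia.com/gpu": gpus}},
        "workingDir": "/usr/local/caffe2/caffe2/python/examples/",
        "command": list(command),
    }
    return {
        "apiVersion": "kubeflow.org/v1alpha1",
        "kind": "Caffe2Job",
        "metadata": {"name": name},
        "spec": {
            # Unlike a TFJob, there is no parameter server replica type,
            # only a backend option and the worker replicas.
            "backend": backend,
            "replicaSpecs": {
                "replicas": replicas,
                "template": {
                    "spec": {"hostNetwork": True, "containers": [container]},
                },
            },
        },
    }


if __name__ == "__main__":
    print(json.dumps(build_caffe2job("example-job", replicas=2), indent=2))
```

The printed JSON could be saved to a file and submitted with kubectl apply -f, assuming the CRD accepts this layout.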

backend: Defines the distributed backend that the Caffe2 master and workers use to communicate when initializing the worker group. Information on the different backends (and the functions they support) can be found here.

The redis backend requires a running Redis server that the workers use to communicate.
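The sketch below illustrates, in general terms, why a shared store is enough for workers to find each other without a parameter server. It is not the Gloo or Caffe2 API: InMemoryStore is a hypothetical stand-in for Redis, and the key layout (run ID plus shard index) is an assumption made for illustration.

```python
class InMemoryStore:
    """Hypothetical stand-in for a shared key-value store such as Redis."""

    def __init__(self):
        self._data = {}

    def set(self, key, value):
        self._data[key] = value

    def get(self, key):
        return self._data.get(key)


def publish_address(store, run_id, shard_id, address):
    """Each worker registers its own address under a key unique to its shard."""
    store.set("{}/{}".format(run_id, shard_id), address)


def discover_peers(store, run_id, num_shards):
    """Return every shard's address, or None if some shard has not registered yet."""
    addresses = []
    for shard_id in range(num_shards):
        address = store.get("{}/{}".format(run_id, shard_id))
        if address is None:
            return None  # a real worker would wait and retry here
        addresses.append(address)
    return addresses


if __name__ == "__main__":
    store = InMemoryStore()
    publish_address(store, "1529307087", 0, "10.0.0.1:2222")
    publish_address(store, "1529307087", 1, "10.0.0.2:2222")
    print(discover_peers(store, "1529307087", 2))
```

Scoping keys by run ID keeps two jobs that share one Redis server from seeing each other's workers, which is one plausible reason a per-run identifier is useful.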

Resulting Worker

apiVersion: v1
kind: Pod
metadata:
  name: caffe2-worker-${job_id}
  labels:
      caffe2_job_key: default-example-job
      caffe2_replica_index: "0"
      caffe2_replica_type: worker
      group_name: kubeflow.org
      runtime_id: "1529307087"
spec:
  containers:
  - image: kubeflow/caffe2:py2-cuda9.0-cudnn7-ubuntu16.04
    imagePullPolicy: IfNotPresent
    name: caffe2
    env:
      - name: SHARD_ID
        value: "0"
      - name: NUM_SHARDS
        value: "1"
      - name: RUN_ID
        value: "1529307087"
      - name: CAFFE2_CONFIG
        value: '{"cluster":{"worker":["default-example-job-worker-0:2222"]},"task":{"type":"worker","index":0}}'
    ...

Each worker spec generates a pod. The workers communicate with the master through the Redis service name.

NOTE: Three additional environment variables are generated based on the worker role: SHARD_ID (the replica index), NUM_SHARDS (the number of replicas), and RUN_ID (the run ID).
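A training script running in the worker pod could read these variables as in the following minimal sketch. The helper name read_worker_env and the fallback defaults are assumptions for illustration; the variable names and the CAFFE2_CONFIG layout are taken from the pod example above.

```python
import json
import os


def read_worker_env(environ=None):
    """Collect the operator-injected settings from the environment (hypothetical helper)."""
    if environ is None:
        environ = os.environ
    # CAFFE2_CONFIG is a JSON string describing the cluster and this task.
    config = json.loads(environ.get("CAFFE2_CONFIG", "{}"))
    task = config.get("task", {})
    return {
        "shard_id": int(environ.get("SHARD_ID", "0")),      # replica index
        "num_shards": int(environ.get("NUM_SHARDS", "1")),  # number of replicas
        "run_id": environ.get("RUN_ID", ""),                # ID shared by one run
        "workers": config.get("cluster", {}).get("worker", []),
        "task_type": task.get("type"),
        "task_index": task.get("index"),
    }


if __name__ == "__main__":
    # Values copied from the example pod above.
    example_env = {
        "SHARD_ID": "0",
        "NUM_SHARDS": "1",
        "RUN_ID": "1529307087",
        "CAFFE2_CONFIG": '{"cluster":{"worker":["default-example-job-worker-0:2222"]},'
                         '"task":{"type":"worker","index":0}}',
    }
    print(read_worker_env(example_env))
```

Passing a dictionary instead of reading os.environ directly makes the helper easy to exercise outside a pod.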

Design

This is an implementation of the Caffe2 distributed design patterns, found here.

Other backends

From here, Caffe2 also supports an NFS backend; however, the NFS backend has not been tested yet.

How to set up

Set up Kubernetes

  • A fully functional Kubernetes cluster.
  • Enable the relevant feature gate if you want to use GPUs.

Create a CRD for Kubernetes

# kubectl apply -f https://raw.githubusercontent.com/kubeflow/caffe2-operator/master/examples/crd.yaml
customresourcedefinition.apiextensions.k8s.io "caffe2jobs.kubeflow.org" created

Start the caffe2-operator

# ./caffe2-operator -alsologtostderr -v 4 -controller-config-file /root/admin.conf

Prepare the dataset

The example uses the MNIST handwritten digits dataset. Convert it to LevelDB format using make_mnist_db.

$ make_mnist_db --channel_first --db leveldb --image_file data/mnist/train-images-idx3-ubyte --label_file data/mnist/train-labels-idx1-ubyte --output_file data/mnist/mnist-train-nchw-leveldb 

$ make_mnist_db --channel_first --db leveldb --image_file data/mnist/t10k-images-idx3-ubyte --label_file data/mnist/t10k-labels-idx1-ubyte --output_file data/mnist/mnist-test-nchw-leveldb 

Run the job

$ kubectl apply -f ./examples/resnet50.yaml
$ kubectl get caffe2jobs
NAME          AGE
example-job   29m
$ kubectl get pods
NAME                READY     STATUS    RESTARTS   AGE
example-job-pdcbs   1/1       Running   0          29m

This example sets hostNetwork: true. It is not the ideal solution, but training runs faster because an overlay network reduces performance.
