
Comments (3)

paramgo commented on August 11, 2024

Yes. I'm not sure how severe the issue is, but if you leave out the restartPolicy, K8s 1.8 appears to default it to Always. The result is that the master and workers are created but the ps is not (error below):

I  Creating Service: master-xyzz-0 
I  Service master-xyzz-0 already exists. 
I  Creating Job: master-xyzz-0 
I  master-xyzz-0 already exists. 
I  Creating Service: worker-xyzz-0 
I  Service worker-xyzz-0 already exists. 
I  Creating Job: worker-xyzz-0 
I  worker-xyzz-0 already exists. 
I  Creating Service: ps-xyzz-0 
I  Service ps-xyzz-0 already exists. 
I  Creating Job: ps-xyzz-0 
E  trainingJobCreateReplicas() error; [Creating Job ps-xyzz-0 returned error., Job.batch "ps-xyzz-0" is invalid: spec.template.spec.restartPolicy: Unsupported value: "Always": supported values: OnFailure, Never] 
A simple TfJob configuration with one master, one worker, and one ps server is enough to reproduce this, as long as the restartPolicy is omitted for the ps server (while it is present for master/worker).
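One way an operator could guard against this before creating the Job is sketched below. This is only an illustration, not tf-operator's actual code: the function name `jobRestartPolicy` and the choice of OnFailure as the fallback are assumptions. The only fact taken from the error above is that a batch Job accepts OnFailure or Never.

```go
package main

import "fmt"

// RestartPolicy mirrors the Kubernetes string-typed restart policy.
type RestartPolicy string

const (
	RestartPolicyAlways    RestartPolicy = "Always"
	RestartPolicyOnFailure RestartPolicy = "OnFailure"
	RestartPolicyNever     RestartPolicy = "Never"
)

// jobRestartPolicy (hypothetical helper) returns a policy that a batch Job
// will accept. An omitted policy falls back to OnFailure (an assumed default);
// any other unsupported value is reported as an error instead of being
// silently rewritten.
func jobRestartPolicy(p RestartPolicy) (RestartPolicy, error) {
	switch p {
	case "":
		return RestartPolicyOnFailure, nil
	case RestartPolicyOnFailure, RestartPolicyNever:
		return p, nil
	default:
		return "", fmt.Errorf("restartPolicy %q is not supported for a Job: supported values: OnFailure, Never", p)
	}
}

func main() {
	for _, p := range []RestartPolicy{"", RestartPolicyNever, RestartPolicyAlways} {
		got, err := jobRestartPolicy(p)
		if err != nil {
			fmt.Println("error:", err)
			continue
		}
		fmt.Printf("%q -> %q\n", p, got)
	}
}
```

With a check like this, the ps replica in the reproduction above would either get a valid policy or fail early with a clear message, rather than failing at Job creation.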

from training-operator.

ScorpioCPH commented on August 11, 2024

After discussion, we will extend the Kubernetes built-in RestartPolicy by adding a new policy, ExitCode:

    RestartPolicyAlways    RestartPolicy = "Always"
    RestartPolicyOnFailure RestartPolicy = "OnFailure"
    RestartPolicyNever     RestartPolicy = "Never"
    RestartPolicyExitCode  RestartPolicy = "ExitCode"

Users set this field according to their model code.

  • If RestartPolicy is set to OnFailure/Always, users should add checkpoint-reloading code themselves.
  • Otherwise restarting will have no effect.

The ExitCode policy means that users should set exit codes themselves; tf-operator will check these exit codes to determine the behavior when an error occurs:

  • 1-127: permanent error, do not restart.
  • 128-255: retryable error, will restart the pod.
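The exit code ranges above could be checked roughly like this. It is a minimal sketch, not the tf-operator implementation: the `Action` type, its values, and `actionForExitCode` are hypothetical names, and treating exit code 0 as success is an assumption that the proposal does not spell out.

```go
package main

import "fmt"

// Action is a hypothetical description of what the operator does with a pod
// after its container exits.
type Action string

const (
	ActionNone    Action = "None"    // exit code 0: assumed success, nothing to do
	ActionFail    Action = "Fail"    // 1-127: permanent error, do not restart
	ActionRestart Action = "Restart" // 128-255: retryable error, restart the pod
)

// actionForExitCode maps a container exit code to an action using the ranges
// from the proposed ExitCode policy.
func actionForExitCode(code int) (Action, error) {
	switch {
	case code == 0:
		return ActionNone, nil
	case code >= 1 && code <= 127:
		return ActionFail, nil
	case code >= 128 && code <= 255:
		return ActionRestart, nil
	default:
		return "", fmt.Errorf("exit code %d is outside 0-255", code)
	}
}

func main() {
	for _, code := range []int{0, 1, 127, 128, 255} {
		action, err := actionForExitCode(code)
		if err != nil {
			fmt.Println("error:", err)
			continue
		}
		fmt.Printf("exit code %d -> %s\n", code, action)
	}
}
```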


gaocegege commented on August 11, 2024

Duplicate of #524.

