Comments (3)
Yes. I'm not sure about the severity of the issue, but if you leave out the restartPolicy, K8s 1.8 seems to default to Always, which results in the master and workers being created but not the ps (error below):
I Creating Service: master-xyzz-0
I Service master-xyzz-0 already exists.
I Creating Job: master-xyzz-0
I master-xyzz-0 already exists.
I Creating Service: worker-xyzz-0
I Service worker-xyzz-0 already exists.
I Creating Job: worker-xyzz-0
I worker-xyzz-0 already exists.
I Creating Service: ps-xyzz-0
I Service ps-xyzz-0 already exists.
I Creating Job: ps-xyzz-0
E trainingJobCreateReplicas() error; [Creating Job ps-xyzz-0 returned error., Job.batch "ps-xyzz-0" is invalid: spec.template.spec.restartPolicy: Unsupported value: "Always": supported values: OnFailure, Never]
A simple TfJob configuration with one master, one worker, and one ps server would suffice where the restartPolicy for ps server is omitted (while it is present for master/worker).
from training-operator.
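The failure in the log above could be caught before any Job is created. The following is a minimal Go sketch, not the operator's actual code: `effectiveRestartPolicy` and `validateJobRestartPolicy` are hypothetical helpers, and the only facts assumed are that an omitted Pod restartPolicy defaults to Always and that a Job accepts only OnFailure and Never, as the error message states.

```go
package main

import "fmt"

// jobRestartPolicies holds the values the error message reports as
// supported for a batch Job.
var jobRestartPolicies = map[string]bool{"OnFailure": true, "Never": true}

// effectiveRestartPolicy (hypothetical helper) returns the policy a pod
// template ends up with; an omitted value defaults to "Always".
func effectiveRestartPolicy(specified string) string {
	if specified == "" {
		return "Always"
	}
	return specified
}

// validateJobRestartPolicy (hypothetical helper) rejects a replica whose
// effective restart policy a Job would refuse.
func validateJobRestartPolicy(replica, specified string) error {
	policy := effectiveRestartPolicy(specified)
	if !jobRestartPolicies[policy] {
		return fmt.Errorf("replica %q: restartPolicy %q is not supported for a Job (supported values: OnFailure, Never)", replica, policy)
	}
	return nil
}

func main() {
	// Mirrors the reproduction: restartPolicy is set for master and worker
	// but omitted for ps.
	replicas := []struct{ name, restartPolicy string }{
		{"master", "OnFailure"},
		{"worker", "OnFailure"},
		{"ps", ""},
	}
	for _, r := range replicas {
		if err := validateJobRestartPolicy(r.name, r.restartPolicy); err != nil {
			fmt.Println("invalid:", err)
			continue
		}
		fmt.Println("ok:", r.name)
	}
}
```

Validating all replicas up front would avoid the partial state seen in the log, where the master and worker Jobs exist but the ps Job does not.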
After discussion, we will extend the Kubernetes built-in RestartPolicy by adding a new policy, ExitCode:
RestartPolicyAlways RestartPolicy = "Always"
RestartPolicyOnFailure RestartPolicy = "OnFailure"
RestartPolicyNever RestartPolicy = "Never"
RestartPolicyExitCode RestartPolicy = "ExitCode"
We let users set this field according to their model code.
- If RestartPolicy is set to OnFailure or Always, users should add checkpoint-reloading code themselves. Otherwise, restarting will have no effect.
- The ExitCode policy means that users should set exit codes themselves; tf-operator will check these exit codes to determine the behavior when an error occurs:
- 1-127: permanent error, do not restart.
- 128-255: retryable error, will restart the pod.
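The exit-code ranges above can be sketched in Go as follows. This is an illustration under stated assumptions, not the tf-operator implementation: `isRetryableExitCode` and `shouldRestart` are hypothetical names, and treating exit code 0 as "do not restart" under ExitCode is an assumption the thread does not spell out.

```go
package main

import "fmt"

// RestartPolicy mirrors the string type used by the constants above.
type RestartPolicy string

const (
	RestartPolicyAlways    RestartPolicy = "Always"
	RestartPolicyOnFailure RestartPolicy = "OnFailure"
	RestartPolicyNever     RestartPolicy = "Never"
	RestartPolicyExitCode  RestartPolicy = "ExitCode"
)

// isRetryableExitCode (hypothetical helper) reports whether a code falls in
// the retryable range 128-255; codes 1-127 are permanent errors.
func isRetryableExitCode(code int) bool {
	return code >= 128 && code <= 255
}

// shouldRestart (hypothetical helper) decides whether a pod that exited
// with exitCode should be restarted under the given policy.
func shouldRestart(policy RestartPolicy, exitCode int) bool {
	switch policy {
	case RestartPolicyAlways:
		return true
	case RestartPolicyOnFailure:
		return exitCode != 0
	case RestartPolicyExitCode:
		// Assumption: exit code 0 (success) is not restarted.
		return isRetryableExitCode(exitCode)
	default:
		// RestartPolicyNever and unknown values: never restart.
		return false
	}
}

func main() {
	for _, code := range []int{0, 1, 127, 128, 255} {
		fmt.Printf("exit code %d -> restart: %v\n", code, shouldRestart(RestartPolicyExitCode, code))
	}
}
```

With this split, user code would exit with a code in 1-127 for errors that a restart cannot fix and a code in 128-255 for errors worth retrying.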
from training-operator.
dup with #524
from training-operator.