Comments (15)
Here's some background.
- The original motivation for using a Job or ReplicaSet controller (as opposed to managing pods directly) was to let K8s handle failures that would cause a pod to be terminated.
  - For example, if a VM becomes unhealthy and its pod is terminated, we want to restart the pod; the assumption is that the training code is robust to these types of failures, i.e. it uses Supervisor or a similar mechanism to recover automatically.
- I originally picked the Job controller over ReplicaSet because the semantics we wanted seemed to be "run until completion", which maps to the Job controller pretty well.
- The Job controller's semantics turn out not to be quite what we want.
  - We eventually realized we needed a mechanism to distinguish between retryable and permanent errors.
  - So we settled on classifying certain exit codes as retryable and others as permanent.
  - Once we did that, we could no longer rely on the built-in controller semantics; we have to look at container terminations and decide what to do based on the exit code and the type of process (worker, ps, master).
  - So at this point none of the built-in controllers really gives us the semantics we want.
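The classification described above can be sketched as a small function. The cutoffs used here are illustrative assumptions, not the controller's actual table: codes 1-127 are treated as application errors (permanent), while 128+ means the process was killed by a signal and is worth retrying.

```go
package main

import "fmt"

// RestartDecision is the outcome of classifying a container exit code.
type RestartDecision int

const (
	Succeeded RestartDecision = iota
	PermanentError
	RetryableError
)

// classifyExitCode maps an exit code to a decision. The ranges are an
// assumed convention: 1-127 are application errors (permanent), while
// 128+ means the process died from a signal (e.g. OOM or a VM failure)
// and the error is considered retryable.
func classifyExitCode(code int32) RestartDecision {
	switch {
	case code == 0:
		return Succeeded
	case code < 128:
		return PermanentError
	default:
		return RetryableError
	}
}

func main() {
	for _, c := range []int32{0, 1, 137} {
		fmt.Printf("exit %d -> %d\n", c, classifyExitCode(c))
	}
}
```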
- Auto-restarting containers can lead to confusing behavior for users.
  - Suppose a container exits with a code indicating a permanent error.
  - The Job controller would automatically restart it.
  - The TfJob controller would detect that the process exited with a permanent error code and terminate it.
  - So if you look at the logs for the container, you see the container exit due to some error, restart, and then get killed by SIGTERM.
  - Telling users to look at the previous container termination rather than the most recent one is very confusing.
Thoughts for the next iteration
- From a requirements perspective, I think we want to be able to supply our own function that decides whether a container should be restarted.
- I think this will give us the flexibility to define semantics that make the most sense for TensorFlow.
- Conceptually, I think it's simpler if the semantics are "containers are never restarted unless the TfJob controller explicitly restarts them".
  - Then it's pretty clear that the controller just needs to enumerate the cases in which containers/pods need to be restarted.
  - We have to enumerate some cases that the default controllers (Job/ReplicaSet) handle automatically, but I don't think that adds undue complexity.
- I think we want to think about how this interacts with kube-arbitrator for gang scheduling (#165).
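A sketch of what such a supplied decision function might look like. The per-replica-type policy encoded here is one plausible choice for illustration, not the actual controller's behavior:

```go
package main

import "fmt"

// shouldRestart is a hypothetical hook of the kind proposed above: kubelet
// never restarts containers, and the TfJob controller calls this on every
// container termination to decide whether to recreate the pod. The policy
// below is purely illustrative.
func shouldRestart(replicaType string, exitCode int32) bool {
	if exitCode >= 128 {
		// Killed by a signal (VM failure, OOM, preemption): retryable.
		return true
	}
	if exitCode == 0 {
		// A clean master exit completes the job; a parameter server
		// should never exit on its own, so bring it back up.
		return replicaType == "ps"
	}
	// Nonzero application exit code: permanent, surface the failure.
	return false
}

func main() {
	fmt.Println(shouldRestart("worker", 137)) // signal death -> true
	fmt.Println(shouldRestart("master", 1))   // permanent error -> false
}
```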
from training-operator.
Another issue is figuring out what to do about logs.
I'd like logs to be available via kubectl logs after a job finishes but before it is deleted. But a StatefulSet will just restart the container, so I'm not sure how we would preserve logs.
If we decide to use pods directly, we could reuse the code in caicloud/kubeflow-controller 😄
I also vote for pods now.
@vishh FYI.
The main challenge to using StatefulSets is figuring out how to set the TF_CONFIG environment variable, which depends on the replica's index in the StatefulSet.
That should be possible through the downward API and some sort of init script that configures the env var differently on each replica. The logs issue is more difficult because StatefulSets were not designed to support completion semantics.
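The init-script idea could be sketched as follows (in Go, for consistency with the controller code). The cluster layout is a made-up example; only the shape of TF_CONFIG ({"cluster": ..., "task": {"type": ..., "index": ...}}) follows TensorFlow's documented format:

```go
package main

import (
	"encoding/json"
	"fmt"
	"strconv"
	"strings"
)

// buildTFConfig derives the task index from a StatefulSet pod hostname
// ("worker-3" -> 3) and renders the TF_CONFIG JSON that TensorFlow reads
// at startup. In practice the cluster map would come from the TfJob spec.
func buildTFConfig(hostname, taskType string, cluster map[string][]string) (string, error) {
	i := strings.LastIndex(hostname, "-")
	if i < 0 {
		return "", fmt.Errorf("hostname %q has no ordinal suffix", hostname)
	}
	index, err := strconv.Atoi(hostname[i+1:])
	if err != nil {
		return "", err
	}
	cfg := map[string]interface{}{
		"cluster": cluster,
		"task":    map[string]interface{}{"type": taskType, "index": index},
	}
	b, err := json.Marshal(cfg)
	return string(b), err
}

func main() {
	tfConfig, _ := buildTFConfig("worker-1", "worker", map[string][]string{
		"worker": {"worker-0:2222", "worker-1:2222"},
		"ps":     {"ps-0:2222"},
	})
	fmt.Println(tfConfig)
}
```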
Another alternative worth investigating is just scheduling individual pods, which gives complete control over all of those aspects.
I think we should also consider creating the pods directly ourselves so that we can get the semantics we want in terms of:
- Restart behavior based on exit code
- Setting of environment variables per replica
I think GoogleCloudPlatform/kube-metacontroller could make it really simple to define a controller that just manages a single pod with the exact semantics we want.
Using a single controller per replica means we can set environment variables specific to each replica (e.g. its index), which I think is very convenient for end users because they can use the downward API to set replica-specific flags based on environment variables, e.g.
--index=${replica_index}
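Concretely, a per-replica controller could inject an index variable into each pod it creates. The variable names here (REPLICA_INDEX etc.) are illustrative, not an existing API:

```go
package main

import (
	"fmt"
	"strconv"
)

// replicaEnv returns the environment a hypothetical per-replica controller
// would set on the single pod it manages; the user's container command can
// then expand these, e.g. --index=${REPLICA_INDEX}.
func replicaEnv(jobName, replicaType string, index int) map[string]string {
	return map[string]string{
		"REPLICA_INDEX": strconv.Itoa(index),
		"REPLICA_TYPE":  replicaType,
		"REPLICA_NAME":  fmt.Sprintf("%s-%s-%d", jobName, replicaType, index),
	}
}

func main() {
	fmt.Println(replicaEnv("mnist", "worker", 2)["REPLICA_NAME"]) // mnist-worker-2
}
```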
We should create the pods ourselves, I agree. But having a controller per pod is not a common pattern; I think it would add quite a lot of complexity. One place the meta-controller fits in is replacing the existing Go-based controller code. But IIRC, there are certain things we may not want to do (yet) with the meta-controller, like scaling (watch optimizations & informers), authorization of individual controllers, etc. For those things, it might be better to continue using the Go-based custom controller.
There was a similar discussion internally at Caicloud. Ref: https://github.com/caicloud/kubeflow-controller/issues/71#issuecomment-355056365
@jlewi Hi, thanks for the background update.

Auto restarting containers can lead to confusing behavior for users.

We have a restartPolicy option in the TFJob API here, so letting the user define the restart policy may be better.
We have a restartPolicy option in the TFJob API here, so letting the user define the restart policy may be better.
I agree with letting the user specify the restart behavior. In the current implementation, restart behavior is tied to replica type. Workers, parameter servers, and masters have different restart behaviors.
I think in a future iteration we should let users define the restart behavior for each replica by picking from a set of policies.
I don't know if we can reuse the existing restartPolicy field. The restartPolicy field is a side effect of using PodTemplateSpec, and the built-in values for K8s restart policies aren't sufficient. For example, there's no policy that corresponds to the existing behavior of restarting on retryable errors but not on permanent errors.
It's not clear to me whether we could introduce TfJob-specific values for PodTemplateSpec's restartPolicy; this might interfere with automatic template validation.
I suspect a cleaner implementation will be to add an appropriate restart-behavior field to TfReplicaSpec. The restartPolicy in PodTemplateSpec should always be set to some suitable default chosen by the TfJobController (#55), so users should never explicitly set restartPolicy. We can enforce consistency for restartPolicy by failing jobs with a validation error if the user explicitly sets it and the value doesn't match the value the TfJobController deems correct.
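The proposed validation rule might look like the following sketch. Using "Never" as the controller-chosen default is an assumption that matches the "containers are never restarted unless the controller restarts them" semantics described earlier:

```go
package main

import "fmt"

// validateRestartPolicy rejects a TfJob whose user-supplied
// PodTemplateSpec restartPolicy disagrees with the value the controller
// would set itself. An empty value means the user left the field unset,
// which is the expected case.
func validateRestartPolicy(userSet, controllerChosen string) error {
	if userSet == "" || userSet == controllerChosen {
		return nil
	}
	return fmt.Errorf("restartPolicy %q conflicts with controller-chosen %q; leave the field unset",
		userSet, controllerChosen)
}

func main() {
	fmt.Println(validateRestartPolicy("Always", "Never")) // non-nil error
	fmt.Println(validateRestartPolicy("", "Never"))       // nil: user left it unset
}
```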
retryable errors

Do you mean we can restart some workers from a checkpoint file?

We can enforce consistency for restartPolicy by failing jobs with a validation error if the user explicitly sets it and the value doesn't match the value deemed correct as chosen by the TfJobController.

Good idea. In consideration of these smart features, I now prefer Pod :) It will give us full control of the whole workflow.
@jlewi @ScorpioCPH Sorry for the late reply.
Good idea. In consideration of these smart features, I now prefer Pod :) It will give us full control of the whole workflow.
SGTM. We'd better move it forward and take care of the implementation at the same time.
@ScorpioCPH @gaocegege Maybe we should merge the CRD first. There are some differences between the upstream and Caicloud implementations.
Some discussion here https://github.com/caicloud/kubeflow-controller/issues/80
@jlewi I think we have reached an agreement (Pod) after the discussion. Should we close this?
Good idea. Opened #325