Comments (15)
Here's some background.
- The original motivation for using a Job or ReplicaSet controller (as opposed to managing pods directly) was to let K8s handle failures that would cause a pod to be terminated.
  - For example, if a VM becomes unhealthy and its pod is terminated, we want to restart the pod; the assumption is that the training code is robust to these types of failures, i.e. it uses Supervisor or a similar mechanism to recover automatically.
- I originally picked the Job controller over ReplicaSet because the semantics we wanted seemed to be "run until completion", which maps to the Job controller pretty well.
- The Job controller's semantics turn out not to be quite what we want.
  - We eventually realized we needed a mechanism to distinguish between retryable and permanent errors.
  - So we settled on classifying certain exit codes as retryable and others as permanent.
  - Once we did that, we could no longer rely on the built-in controller semantics; we have to look at container terminations and decide what to do based on the exit code and the type of process (worker, ps, master).
  - So at this point none of the built-in controllers really gives us the semantics we want.
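The classification described above can be sketched as a small function. The cutoffs used here are illustrative assumptions, not the controller's actual table: codes 1-127 are treated as application errors (permanent), while 128+ means the process was killed by a signal and is worth retrying.

```go
package main

import "fmt"

// RestartDecision is the outcome of classifying a container exit code.
type RestartDecision int

const (
	Succeeded RestartDecision = iota
	PermanentError
	RetryableError
)

// classifyExitCode maps an exit code to a decision. The ranges are an
// assumed convention: 1-127 are application errors (permanent), while
// 128+ means the process died from a signal (e.g. OOM or a VM failure)
// and the error is considered retryable.
func classifyExitCode(code int32) RestartDecision {
	switch {
	case code == 0:
		return Succeeded
	case code < 128:
		return PermanentError
	default:
		return RetryableError
	}
}

func main() {
	for _, c := range []int32{0, 1, 137} {
		fmt.Printf("exit %d -> %d\n", c, classifyExitCode(c))
	}
}
```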
- Auto-restarting containers can lead to confusing behavior for users.
  - Suppose a container exits with a code indicating a permanent error.
  - The Job controller would automatically restart it.
  - The TfJob controller would detect that the process exited with a permanent error code and terminate it.
  - So if you look at the logs for the container, you see the container exit due to some error, restart, and then get killed by SIGTERM.
  - Telling users to look at the previous container termination rather than the most recent one is very confusing.
Thoughts for the next iteration
- From a requirements perspective, I think we want to be able to supply our own function that decides whether a container should be restarted.
- I think this will give us the flexibility to define semantics that make the most sense for TensorFlow.
- Conceptually, I think it's simpler if the semantics are "containers are never restarted unless the TfJob controller explicitly restarts them".
  - Then it's pretty clear that the controller just needs to enumerate the cases in which containers/pods need to be restarted.
  - We have to enumerate some cases that the default controllers (Job/ReplicaSet) handle automatically, but I don't think that adds undue complexity.
- I think we want to think about how this interacts with kube-arbitrator for gang scheduling (#165).
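A sketch of what such a supplied decision function might look like. The per-replica-type policy encoded here is one plausible choice for illustration, not the actual controller's behavior:

```go
package main

import "fmt"

// shouldRestart is a hypothetical hook of the kind proposed above: kubelet
// never restarts containers, and the TfJob controller calls this on every
// container termination to decide whether to recreate the pod. The policy
// below is purely illustrative.
func shouldRestart(replicaType string, exitCode int32) bool {
	if exitCode >= 128 {
		// Killed by a signal (VM failure, OOM, preemption): retryable.
		return true
	}
	if exitCode == 0 {
		// A clean master exit completes the job; a parameter server
		// should never exit on its own, so bring it back up.
		return replicaType == "ps"
	}
	// Nonzero application exit code: permanent, surface the failure.
	return false
}

func main() {
	fmt.Println(shouldRestart("worker", 137)) // signal death -> true
	fmt.Println(shouldRestart("master", 1))   // permanent error -> false
}
```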
from training-operator.
Another issue is figuring out what to do about logs.
I'd like logs to be available via kubectl logs after a job finishes but before it is deleted. But a StatefulSet will just restart the container, so I'm not sure how we would preserve logs.
If we decide to use pods directly, we could reuse the code in caicloud/kubeflow-controller 😄
I also vote for pods now.
@vishh FYI.
The main challenge to using StatefulSets is figuring out how to set the TF_CONFIG environment variable, which depends on the replica's index in the StatefulSet.
That should be possible through the downward API and some sort of init script that configures the env var differently on each replica. The logs issue is more difficult because StatefulSets were not designed to support completion semantics.
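The init-script idea could be sketched as follows (in Go, for consistency with the controller code). The cluster layout is a made-up example; only the shape of TF_CONFIG ({"cluster": ..., "task": {"type": ..., "index": ...}}) follows TensorFlow's documented format:

```go
package main

import (
	"encoding/json"
	"fmt"
	"strconv"
	"strings"
)

// buildTFConfig derives the task index from a StatefulSet pod hostname
// ("worker-3" -> 3) and renders the TF_CONFIG JSON that TensorFlow reads
// at startup. In practice the cluster map would come from the TfJob spec.
func buildTFConfig(hostname, taskType string, cluster map[string][]string) (string, error) {
	i := strings.LastIndex(hostname, "-")
	if i < 0 {
		return "", fmt.Errorf("hostname %q has no ordinal suffix", hostname)
	}
	index, err := strconv.Atoi(hostname[i+1:])
	if err != nil {
		return "", err
	}
	cfg := map[string]interface{}{
		"cluster": cluster,
		"task":    map[string]interface{}{"type": taskType, "index": index},
	}
	b, err := json.Marshal(cfg)
	return string(b), err
}

func main() {
	tfConfig, _ := buildTFConfig("worker-1", "worker", map[string][]string{
		"worker": {"worker-0:2222", "worker-1:2222"},
		"ps":     {"ps-0:2222"},
	})
	fmt.Println(tfConfig)
}
```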
Another alternative worth investigating is just scheduling individual pods, which gives complete control over all of those aspects.
I think we should also consider creating the pods directly ourselves so that we can get the semantics we want in terms of:
- Restart behavior based on exit code
- Setting of environment variables per replica
I think GoogleCloudPlatform/kube-metacontroller could make it really simple to define a controller that just manages a single pod with the exact semantics we want.
Using a single controller per replica means we can set environment variables specific to each replica (e.g. its index), which I think is very convenient for end users because they can use the downward API to set replica-specific flags based on environment variables, e.g.
--index=${replica_index}
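Concretely, a per-replica controller could inject an index variable into each pod it creates. The variable names here (REPLICA_INDEX etc.) are illustrative, not an existing API:

```go
package main

import (
	"fmt"
	"strconv"
)

// replicaEnv returns the environment a hypothetical per-replica controller
// would set on the single pod it manages; the user's container command can
// then expand these, e.g. --index=${REPLICA_INDEX}.
func replicaEnv(jobName, replicaType string, index int) map[string]string {
	return map[string]string{
		"REPLICA_INDEX": strconv.Itoa(index),
		"REPLICA_TYPE":  replicaType,
		"REPLICA_NAME":  fmt.Sprintf("%s-%s-%d", jobName, replicaType, index),
	}
}

func main() {
	fmt.Println(replicaEnv("mnist", "worker", 2)["REPLICA_NAME"]) // mnist-worker-2
}
```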
We should create the pods ourselves, I agree. But having a controller per pod is not a common pattern; I think it would add quite a lot of complexity. One place the meta-controller fits in is replacing the existing Go-based controller code. But IIRC, there are certain things we may not want to do (yet) with the meta-controller, like scaling (watch optimizations & informers), authorization of individual controllers, etc. For those things, it might be better to continue using the Go-based custom controller.
There was a similar discussion internally at Caicloud. Ref: https://github.com/caicloud/kubeflow-controller/issues/71#issuecomment-355056365
@jlewi Hi, thanks for the background update.

Auto restarting containers can lead to confusing behavior for users.

We have a restartPolicy option in the TFJob API here, so letting the user define the restart policy may be better.
We have a restartPolicy option in the TFJob API here, so letting the user define the restart policy may be better.
I agree with letting the user specify the restart behavior. In the current implementation, restart behavior is tied to replica type. Workers, parameter servers, and masters have different restart behaviors.
I think in a future iteration we should let users define the restart behavior for each replica by picking from a set of policies.
I don't know if we can reuse the existing restartPolicy field. The restartPolicy field is a side effect of using PodTemplateSpec, and the built-in values for K8s restart policies aren't sufficient. For example, there's no policy that corresponds to the existing behavior of restarting on retryable errors but not on permanent errors.
It's not clear to me whether we could introduce TfJob-specific values for PodTemplateSpec's restartPolicy; this might interfere with automatic template validation.
I suspect a cleaner implementation will be to add an appropriate restart-behavior field to TfReplicaSpec. The restartPolicy in PodTemplateSpec should always be set to some suitable default chosen by the TfJobController (#55), so users should never explicitly set restartPolicy. We can enforce consistency for restartPolicy by failing jobs with a validation error if the user explicitly sets it and the value doesn't match the value the TfJobController deems correct.
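The proposed validation rule might look like the following sketch. Using "Never" as the controller-chosen default is an assumption that matches the "containers are never restarted unless the controller restarts them" semantics described earlier:

```go
package main

import "fmt"

// validateRestartPolicy rejects a TfJob whose user-supplied
// PodTemplateSpec restartPolicy disagrees with the value the controller
// would set itself. An empty value means the user left the field unset,
// which is the expected case.
func validateRestartPolicy(userSet, controllerChosen string) error {
	if userSet == "" || userSet == controllerChosen {
		return nil
	}
	return fmt.Errorf("restartPolicy %q conflicts with controller-chosen %q; leave the field unset",
		userSet, controllerChosen)
}

func main() {
	fmt.Println(validateRestartPolicy("Always", "Never")) // non-nil error
	fmt.Println(validateRestartPolicy("", "Never"))       // nil: user left it unset
}
```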
retryable errors

Do you mean we can restart some workers from a checkpoint file?

We can enforce consistency for restartPolicy by failing jobs with a validation error if the user explicitly sets it and the value doesn't match the value deemed correct as chosen by the TfJobController.

Good idea. In consideration of these smart features, I now prefer Pod :) It will give us full control of the whole workflow.
@jlewi @ScorpioCPH Sorry for the late reply.
Good idea. In consideration of these smart features, I now prefer Pod :) It will give us full control of the whole workflow.
SGTM. We'd better move it forward and take care of the implementation at the same time.
@ScorpioCPH @gaocegege Maybe we should merge the CRD first. There are some differences between the upstream and Caicloud implementations.
Some discussion here https://github.com/caicloud/kubeflow-controller/issues/80
@jlewi I think we have reached an agreement (Pod) after the discussion. Should we close this?
Good idea. Opened #325