
API Review about training-operator (CLOSED)

kubeflow commented on August 11, 2024
API Review

from training-operator.

Comments (7)

ScorpioCPH commented on August 11, 2024

I also have some comments about API changes #249, maybe we can discuss this topic together :)


gaocegege commented on August 11, 2024
  • Instead of specifying replica type should we specify features e.g. run forever, restart always, etc...

I think the type name tfReplicaType is a little redundant, maybe type is fine.

And I agree with @ScorpioCPH about the status:

Maybe we should revisit status representation: State/ReplicaState/TfReplicaStatus.

For example, I think the State field in TFReplicaStatus cannot express the state accurately because the state is complicated, so TFReplicasStates alone may be enough.


0xgj commented on August 11, 2024

I suggest we use chief instead of master, as #306 stated:

* using master is confusing when TensorFlow already has a distributed master;
* due to issue #61, in TensorFlow 1.4 TF_CONFIG uses chief instead of master.

@jlewi @ScorpioCPH @gaocegege
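
To make the chief-versus-master point concrete, here is a minimal sketch of the TF_CONFIG value a replica might receive when the cluster uses a "chief" job. The helper name build_tf_config, the host names, and the port are assumptions for illustration only; they are not taken from the operator's code.

```python
import json


def build_tf_config(cluster, task_type, task_index):
    """Return a TF_CONFIG-style JSON string for one replica (illustrative helper)."""
    if task_type not in cluster:
        raise ValueError(f"unknown task type: {task_type}")
    if not 0 <= task_index < len(cluster[task_type]):
        raise ValueError(f"task index out of range: {task_index}")
    return json.dumps({
        "cluster": cluster,
        "task": {"type": task_type, "index": task_index},
    })


# Hypothetical host names and port; the job key is "chief" rather than "master".
cluster = {
    "chief": ["example-job-chief-0:2222"],
    "worker": ["example-job-worker-0:2222"],
    "ps": ["example-job-ps-0:2222", "example-job-ps-1:2222"],
}
tf_config = build_tf_config(cluster, "chief", 0)
print(tf_config)
```

The string would typically be exported as the TF_CONFIG environment variable of the corresponding container.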


jlewi commented on August 11, 2024

I agree we should probably get rid of "master". I actually think we should probably get rid of ReplicaType and introduce properties that control different behaviors of the replica such as termination policy. Users can then assign an arbitrary name to each replica.
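
One way to picture this idea is sketched below: each replica carries a free-form name plus a property describing its behavior, and the job-level decision reads that property instead of a fixed type. The names TerminationPolicy, ReplicaSpec, and job_finished are hypothetical and are not the API that was eventually adopted.

```python
from dataclasses import dataclass
from enum import Enum


class TerminationPolicy(Enum):
    """Hypothetical property replacing a fixed replica type."""
    IGNORE = "ignore"    # the replica exiting does not end the job
    END_JOB = "end_job"  # the replica exiting successfully ends the job


@dataclass
class ReplicaSpec:
    name: str  # arbitrary, user-chosen label
    replicas: int = 1
    termination_policy: TerminationPolicy = TerminationPolicy.IGNORE


def job_finished(specs, exited):
    """Return True if any replica that controls termination has exited.

    `exited` is the set of replica names whose pods have all exited successfully.
    """
    return any(
        s.termination_policy is TerminationPolicy.END_JOB and s.name in exited
        for s in specs
    )


specs = [
    ReplicaSpec("coordinator", 1, TerminationPolicy.END_JOB),
    ReplicaSpec("params", 2),
]
```

With this shape, the controller never needs to know what "coordinator" or "params" mean; only the property matters.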


0xgj commented on August 11, 2024

+1, @jlewi. We can use a property named job or name instead of replicaType, which can be assigned any valid string. Other terms we could use include task, group... By default, job will be set to worker.

apiVersion: "tensorflow.org/v1alpha1"
kind: "TfJob"
metadata:
  name: "example-job"
spec:
  replicaSpecs:
    - replicas: 1
      name: chief
      template:
        spec:
          containers:
            - image: gcr.io/tf-on-k8s-dogfood/tf_sample:dc944ff
              name: chief   ## if I need a separate chief; add volumes if we want persistent logs
          restartPolicy: OnFailure
    - replicas: 1
      name: worker
      template:
        spec:
          containers:
            - image: gcr.io/tf-on-k8s-dogfood/tf_sample:dc944ff
              name: worker
          restartPolicy: OnFailure
    - replicas: 2
      name: ps
      template:
        spec:
          containers:
            - image: gcr.io/tf-on-k8s-dogfood/tf_sample:dc944ff
              name: ps
          restartPolicy: OnFailure
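
The defaulting rule proposed above (an unnamed replica group becomes worker) could be sketched roughly as follows. The function normalize_replica_specs is hypothetical, and both the duplicate-name check and the default of one replica are assumptions added for illustration; neither is stated in this thread.

```python
def normalize_replica_specs(replica_specs, default_name="worker"):
    """Fill in a default name for unnamed replica groups (illustrative only)."""
    seen = set()
    normalized = []
    for spec in replica_specs:
        name = spec.get("name") or default_name
        # Assumption: names must be unique so each group can be addressed.
        if name in seen:
            raise ValueError(f"duplicate replica name: {name}")
        seen.add(name)
        # Assumption: a missing replica count defaults to 1.
        normalized.append({**spec, "name": name, "replicas": spec.get("replicas", 1)})
    return normalized


specs = normalize_replica_specs([
    {"name": "chief", "replicas": 1},
    {"replicas": 1},  # no name given, so it becomes "worker"
    {"name": "ps", "replicas": 2},
])
names = [s["name"] for s in specs]
```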


ScorpioCPH commented on August 11, 2024

I think the type name tfReplicaType is a little redundant, maybe type is fine.

It seems like we must use something to specify the Type of TFReplica.
+1 for type.

I don't think name or job can represent this exactly.


jlewi commented on August 11, 2024

Closing this issue since the proposal for the v1alpha2 API has been approved.
