Comments (7)
I also have some comments about API changes #249, maybe we can discuss this topic together :)
from training-operator.
- Instead of specifying a replica type, should we specify features, e.g. run forever, restart always, etc.?
I think the type name `tfReplicaType` is a little redundant; maybe `type` is fine.
And I agree with @ScorpioCPH about the status:
Maybe we should revisit status representation: State/ReplicaState/TfReplicaStatus.
For example, I think the `State` field in `TFReplicaStatus` cannot express the state accurately, since the state is complicated; `TFReplicasStates` alone may be enough.
from training-operator.
I suggest we use `chief` instead of `master`, as #306 stated:
* using `master` is confusing when there is also a distributed master in TensorFlow;
* due to issue #61, in TensorFlow 1.4, `TF_CONFIG` uses `chief` instead of `master`.
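
For context, here is a rough sketch of what a `TF_CONFIG` value using `chief` might look like when injected into a pod as an environment variable. The host names and ports are hypothetical, chosen only to mirror a job with one chief, one worker, and two parameter servers. The `cluster`/`task` layout follows the format TensorFlow's Estimator API reads.

```yaml
# Illustrative only: host names and ports below are assumptions, not values
# produced by the operator.
env:
  - name: TF_CONFIG
    value: |
      {
        "cluster": {
          "chief": ["example-job-chief-0:2222"],
          "worker": ["example-job-worker-0:2222"],
          "ps": ["example-job-ps-0:2222", "example-job-ps-1:2222"]
        },
        "task": {"type": "chief", "index": 0}
      }
```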
from training-operator.
I agree we should probably get rid of "master". I actually think we should probably get rid of ReplicaType and introduce properties that control different behaviors of the replica such as termination policy. Users can then assign an arbitrary name to each replica.
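
A minimal sketch of what that could look like, assuming a hypothetical per-replica `terminationPolicy` field with made-up values. Neither the field placement nor the values are an existing or agreed API; they only illustrate attaching behavior to a replica independently of its name.

```yaml
# Hypothetical sketch: `terminationPolicy` and its values are illustrative only.
spec:
  replicaSpecs:
    - name: coordinator              # arbitrary, user-chosen name
      replicas: 1
      terminationPolicy: EndsJob     # hypothetical: job finishes when this replica exits
    - name: param-store              # arbitrary, user-chosen name
      replicas: 2
      terminationPolicy: RunForever  # hypothetical: runs until the job is torn down
```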
from training-operator.
+1, @jlewi. We can use a property named `job` or `name` instead of `replicaType`, which can be assigned any valid string. Other terms we could use include task, group... By default, `job` will be set to `worker`.
```yaml
apiVersion: "tensorflow.org/v1alpha1"
kind: "TfJob"
metadata:
  name: "example-job"
spec:
  replicaSpecs:
    - replicas: 1
      name: chief
      template:
        spec:
          containers:
            - image: gcr.io/tf-on-k8s-dogfood/tf_sample:dc944ff
              name: chief  # if I need a separate chief; add volumes if we want persistent logs
          restartPolicy: OnFailure
    - replicas: 1
      name: worker
      template:
        spec:
          containers:
            - image: gcr.io/tf-on-k8s-dogfood/tf_sample:dc944ff
              name: worker
          restartPolicy: OnFailure
    - replicas: 2
      name: ps
      template:
        spec:
          containers:
            - image: gcr.io/tf-on-k8s-dogfood/tf_sample:dc944ff
              name: ps
          restartPolicy: OnFailure
```
from training-operator.
I think the type name tfReplicaType is a little redundant, maybe type is fine.
It seems like we must use something to specify the type of a `TFReplica`.
+1 for `type`.
I don't think `name` or `job` can represent this exactly.
from training-operator.
Closing this issue since the proposal for the v1alpha2 API has been approved.
from training-operator.