Comments (20)
Done. Opened: https://github.com/jlewi/mlkube.io/issues/35
from training-operator.
How would we handle different versions of TF?
Can we assume that TF won't make any breaking changes around data exchange between PS and workers in 1.x.x?
Would a TF 1.3 PS work nominally with 1.1 workers today for example?
Using multiple versions of TF in the same job is asking for trouble. I think we'd have to provide a mechanism for the user to specify the TF version and it would be up to the user to select a version matching the version in their container.
Agree.
So we would maintain a collection of docker images containing a simple PS server for each minor TF version.
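As a rough sketch of the version-matching idea, the selection logic could map a user-supplied TF version to the matching per-minor-version PS image. The `tensorflow/tf_grpc_server` image name comes up later in this thread, but the tagging scheme here is an assumption for illustration:

```python
def ps_image_for_tf_version(tf_version: str) -> str:
    """Map a full TF version string (e.g. "1.3.1") to a hypothetical
    per-minor-version parameter-server image tag."""
    major, minor = tf_version.split(".")[:2]
    # Assumed naming scheme: one PS image per TF minor version.
    return f"tensorflow/tf_grpc_server:{major}.{minor}"

print(ps_image_for_tf_version("1.3.1"))  # tensorflow/tf_grpc_server:1.3
```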
Yes.
FYI, I have started working on this one.
Great! Thank you.
How do we take care of the parameter/variable suggestions in https://www.tensorflow.org/performance/performance_models?
@bhack not sure I understand your question.
The parameter server would still be running, and would still be part of the ClusterSpec.
We would just automatically start the PS for you so you don't need to create/maintain your own image.
If I missed your point, feel free to clarify.
It is not strictly related to the PS automation; it is more about how to broadcast variables and aggregate gradients across different GPUs within the same host machine (the same node in k8s).
If you request, say, 1 PS and 2 workers in your TfJob, each instance will be created as a separate job, and each one will be exposed through its own service.
So even though everything might be happening on a single k8s node (assuming you only have one in your cluster), it would still behave as if each instance were on a separate node, and the variable updates will happen over the network.
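To make the per-replica services concrete, here is a sketch of the kind of ClusterSpec the operator could generate, where each replica is reached through its own service. The `<job>-<index>:2222` naming pattern is an assumption for illustration, not the operator's actual convention:

```python
def build_cluster_spec(num_ps: int, num_workers: int, port: int = 2222) -> dict:
    """Build a TF ClusterSpec-style dict in which every replica is
    addressed via its own (assumed) Kubernetes service name."""
    return {
        "ps": [f"ps-{i}:{port}" for i in range(num_ps)],
        "worker": [f"worker-{i}:{port}" for i in range(num_workers)],
    }

spec = build_cluster_spec(num_ps=1, num_workers=2)
print(spec)
# {'ps': ['ps-0:2222'], 'worker': ['worker-0:2222', 'worker-1:2222']}
```

All traffic between these addresses goes through the services, which is why the variable updates happen over the network even on a single node.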
If you want instead to have the PS and workers all run on the same node, I think this falls outside the scope of this feature. You would instead declare a single replica of type master asking for 2 GPUs and 1 CPU, and then declare the CPU as the parameter server in your custom code (does it still make sense to use TfJob in this case, though?).
IMO, this feature is really just there to remove the boilerplate from the simple and common case where you have one or multiple PS managing multiple remote workers.
What do you think?
What I meant is: do we want to totally exclude the intra-GPU/NCCL communication use cases?
Probably there is no way to handle affinity and still let jobs communicate intra-GPU over NCCL, because a deployment mixing intra-GPU and over-the-network communication doesn't exist. How much will the training process be slowed down when GPUs on the same node communicate only over the network?
@bhack K8s has Affinity and Anti-Affinity which are part of the PodSpec. These can be used to try to prevent some pods or containers from running on the same node.
A potential use for this would be to prevent multiple parameter servers from running on the same node. Since parameter servers can be IO constrained you might want to run each PS on different nodes to prevent network contention.
To the extent TfJob includes a PodTemplateSpec, I believe a user should be able to manually set the affinity/anti-affinity to optimize performance. If we wanted to do that automatically we should consider opening up a new issue for that (that's beyond the scope of what I intended when I initially filed this issue).
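For illustration, an anti-affinity stanza of the kind described above might look like the following in a PodTemplateSpec. The `job-role: ps` label is a hypothetical example of how PS pods could be marked, not something TfJob defines:

```yaml
affinity:
  podAntiAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchLabels:
            job-role: ps   # hypothetical label marking parameter-server pods
        topologyKey: kubernetes.io/hostname
```

With this in place, the scheduler would refuse to put two pods carrying that label on the same node, which addresses the network-contention case above.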
In terms of efficient communication between multiple GPUs, I think it's TBD how K8s will handle this; GPU scheduling is still pretty primitive. Right now, if you want to take advantage of intra-GPU communication (e.g. NVIDIA peer-to-peer), I think you'd want to assign multiple GPUs to a single TF process. This should guarantee that all those GPUs are on the same machine, and your TF process should then be able to take advantage of peer-to-peer or other networking features.
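A container spec requesting both GPUs for a single TF process might look like the sketch below; the `nvidia.com/gpu` resource name is the one exposed by the NVIDIA device plugin, and the image tag is just an example:

```yaml
containers:
  - name: tensorflow
    image: tensorflow/tensorflow:latest-gpu
    resources:
      limits:
        nvidia.com/gpu: 2   # both GPUs land in the same pod, so one TF process sees both
```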
@jlewi You got the point; it is really an interesting topic, though probably a bit too advanced for the current k8s GPU support. Can you open a new issue so this topic doesn't get lost?
@jhseu It looks like grpc_tensorflow_server.py isn't part of TensorFlow's published Docker images. Does that seem like a reasonable TF feature request?
@jlewi just found this: https://hub.docker.com/r/tensorflow/tf_grpc_server/
It'd probably be better for us to build the C++ binary grpc_tensorflow_std_server and publish that. File a bug internally?
@jhseu Done thanks.
Fixed by #36