Comments (20)
Done. Opened: https://github.com/jlewi/mlkube.io/issues/35
from training-operator.
How would we handle different versions of TF?
Can we assume that TF won't make any breaking changes around data exchange between PS and workers in 1.x.x?
Would a TF 1.3 PS work nominally with 1.1 workers today for example?
Using multiple versions of TF in the same job is asking for trouble. I think we'd have to provide a mechanism for the user to specify the TF version and it would be up to the user to select a version matching the version in their container.
Agree.
So we would maintain a collection of docker images containing a simple PS server for each minor TF version.
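As a rough sketch of the version-matching idea, the selection logic could map a user-supplied TF version to the matching per-minor-version PS image. The `tensorflow/tf_grpc_server` image name comes up later in this thread, but the tagging scheme here is an assumption for illustration:

```python
def ps_image_for_tf_version(tf_version: str) -> str:
    """Map a full TF version string (e.g. "1.3.1") to a hypothetical
    per-minor-version parameter-server image tag."""
    major, minor = tf_version.split(".")[:2]
    # Assumed naming scheme: one PS image per TF minor version.
    return f"tensorflow/tf_grpc_server:{major}.{minor}"

print(ps_image_for_tf_version("1.3.1"))  # tensorflow/tf_grpc_server:1.3
```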
Yes.
FYI, I have started working on this one.
Great! Thank you.
How do we take care of the parameter/variable suggestions in https://www.tensorflow.org/performance/performance_models?
@bhack not sure I understand your question.
The parameter server would still be running, and would still be part of the ClusterSpec.
We would just automatically start the PS for you so you don't need to create/maintain your own image.
If I missed your point, feel free to clarify.
It is not strictly related to the PS automation; it is more about how to broadcast variables and aggregate gradients across different GPUs within the same host machine (the same node in k8s).
If you request, say, 1 PS and 2 workers in your TfJob, each instance will be created as a separate job, and each one will be exposed through its own service.
So even though everything might be happening on a single k8s node (assuming you only have one in your cluster), it would still behave as if each instance were on a separate node, and the variable updates will happen over the network.
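To make the per-replica services concrete, here is a sketch of the kind of ClusterSpec the operator could generate, where each replica is reached through its own service. The `<job>-<index>:2222` naming pattern is an assumption for illustration, not the operator's actual convention:

```python
def build_cluster_spec(num_ps: int, num_workers: int, port: int = 2222) -> dict:
    """Build a TF ClusterSpec-style dict in which every replica is
    addressed via its own (assumed) Kubernetes service name."""
    return {
        "ps": [f"ps-{i}:{port}" for i in range(num_ps)],
        "worker": [f"worker-{i}:{port}" for i in range(num_workers)],
    }

spec = build_cluster_spec(num_ps=1, num_workers=2)
print(spec)
# {'ps': ['ps-0:2222'], 'worker': ['worker-0:2222', 'worker-1:2222']}
```

All traffic between these addresses goes through the services, which is why the variable updates happen over the network even on a single node.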
If you want instead to have the PS and workers all run on the same node, I think this falls outside the scope of this feature. You would instead declare a single replica of type master asking for 2 GPUs and 1 CPU, and then declare the CPU as the parameter server in your custom code (does it still make sense to use TfJob in this case, though?).
IMO, this feature is really just there to remove the boilerplate from the simple and common case where you have one or multiple PS managing multiple remote workers.
What do you think?
What I meant is: do we want to totally exclude the intra-GPU/NCCL communication use cases?
Probably there is no way to handle affinity and still let jobs communicate intra-GPU over NCCL, because a deployment mixing intra-GPU and over-the-network communication doesn't exist. How much will the training process be slowed down when GPUs on the same node communicate only over the network?
@bhack K8s has Affinity and Anti-Affinity which are part of the PodSpec. These can be used to try to prevent some pods or containers from running on the same node.
A potential use for this would be to prevent multiple parameter servers from running on the same node. Since parameter servers can be IO constrained you might want to run each PS on different nodes to prevent network contention.
To the extent TfJob includes a PodTemplateSpec, I believe a user should be able to manually set the affinity/anti-affinity to optimize performance. If we wanted to do that automatically we should consider opening up a new issue for that (that's beyond the scope of what I intended when I initially filed this issue).
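For illustration, an anti-affinity stanza of the kind described above might look like the following in a PodTemplateSpec. The `job-role: ps` label is a hypothetical example of how PS pods could be marked, not something TfJob defines:

```yaml
affinity:
  podAntiAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchLabels:
            job-role: ps   # hypothetical label marking parameter-server pods
        topologyKey: kubernetes.io/hostname
```

With this in place, the scheduler would refuse to put two pods carrying that label on the same node, which addresses the network-contention case above.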
In terms of efficient communication between multiple GPUs, I think it's TBD how K8s will handle this; GPU scheduling is still pretty primitive. Right now, if you want to take advantage of intra-GPU communication (e.g. NVIDIA peer-to-peer), I think you'd want to assign multiple GPUs to a single TF process. This should guarantee that all those GPUs are on the same machine, and your TF process should then be able to take advantage of peer-to-peer or other networking features.
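A container spec requesting both GPUs for a single TF process might look like the sketch below; the `nvidia.com/gpu` resource name is the one exposed by the NVIDIA device plugin, and the image tag is just an example:

```yaml
containers:
  - name: tensorflow
    image: tensorflow/tensorflow:latest-gpu
    resources:
      limits:
        nvidia.com/gpu: 2   # both GPUs land in the same pod, so one TF process sees both
```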
@jlewi You got the point; it is really an interesting topic, though probably a bit too advanced for the current k8s GPU support. Can you open a new issue so this topic doesn't get lost?
@jhseu It looks like grpc_tensorflow_server.py isn't part of TensorFlow's published Docker images. Does that seem like a reasonable TF feature request?
@jlewi just found this: https://hub.docker.com/r/tensorflow/tf_grpc_server/
It'd probably be better for us to build the C++ binary grpc_tensorflow_std_server and publish that. File a bug internally?
@jhseu Done thanks.
Fixed by #36