
Comments (20)

jlewi commented on August 11, 2024

Done. Opened: https://github.com/jlewi/mlkube.io/issues/35

from training-operator.

wbuchwalter commented on August 11, 2024

How would we handle different versions of TF?
Can we assume that TF won't make any breaking changes around data exchange between PS and workers in 1.x.x?
Would a TF 1.3 PS work nominally with 1.1 workers today, for example?

jlewi commented on August 11, 2024

Using multiple versions of TF in the same job is asking for trouble. I think we'd have to provide a mechanism for the user to specify the TF version and it would be up to the user to select a version matching the version in their container.
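The "user specifies the TF version" idea could look something like the following sketch: a hypothetical helper that maps the user-declared TF version to a matching PS image tag. The repository name and tagging convention here are illustrative assumptions, not the actual tf-operator behavior.

```python
# Hypothetical sketch: pick a PS image matching the user's declared
# TF version. Repository name and tag scheme are illustrative only.
def ps_image_for(tf_version):
    # Keep only major.minor so "1.3.0" and "1.3.1" share one PS image.
    major_minor = ".".join(tf_version.split(".")[:2])
    return "tensorflow/tf-ps:" + major_minor

print(ps_image_for("1.3.0"))  # tensorflow/tf-ps:1.3
```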


wbuchwalter commented on August 11, 2024

Agreed.
So we would maintain a collection of Docker images, each containing a simple PS server, one per minor TF version.

jlewi commented on August 11, 2024

Yes.

wbuchwalter commented on August 11, 2024

FYI, I have started working on this one.

jlewi commented on August 11, 2024

Great! Thank you.

bhack commented on August 11, 2024

How do we take care of the parameter/variable placement suggestions in https://www.tensorflow.org/performance/performance_models ?

wbuchwalter commented on August 11, 2024

@bhack not sure I understand your question.
The parameter server would still be running, and would still be part of the ClusterSpec.
We would just automatically start the PS for you so you don't need to create/maintain your own image.

If I missed your point, feel free to clarify.
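As a rough sketch of what "automatically start the PS" implies, here is a hypothetical helper that builds the ClusterSpec dict handed to both the auto-started PS and the user's workers, assuming each replica is reachable at a stable DNS name via its own k8s Service. The naming scheme and port are assumptions, not the actual tf-operator implementation.

```python
# Hedged sketch: build the TF ClusterSpec for a job with auto-started
# parameter servers. The DNS naming scheme and port are assumptions.
def build_cluster_spec(job_name, num_ps, num_workers, port=2222):
    return {
        "ps": ["%s-ps-%d:%d" % (job_name, i, port) for i in range(num_ps)],
        "worker": ["%s-worker-%d:%d" % (job_name, i, port)
                   for i in range(num_workers)],
    }

spec = build_cluster_spec("mnist", num_ps=1, num_workers=2)
# {'ps': ['mnist-ps-0:2222'],
#  'worker': ['mnist-worker-0:2222', 'mnist-worker-1:2222']}
```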


bhack commented on August 11, 2024

It is not strictly related to the PS automation, but more about how to broadcast variables and aggregate gradients across different GPUs within the same host machine (the same node in k8s).

wbuchwalter commented on August 11, 2024

If you request, say, 1 PS and 2 workers in your TfJob, each instance will be created as a separate job, and each one will be exposed through its own service.
So even though everything might be happening on a single k8s node (assuming you only have one in your cluster), it would still behave as if each instance were on a separate node, and the variable updates would happen over the network.
If you instead want the PS and workers to all run on the same node, I think this falls outside the scope of this feature: you would declare a single replica of type master asking for 2 GPUs and 1 CPU, and then designate the CPU as the parameter server in your custom code (does it still make sense to use TfJob in that case, though?).

IMO, this feature is really just there to remove the boilerplate from the simple and common case where one or multiple PS manage multiple remote workers.

What do you think?

bhack commented on August 11, 2024

What I meant is: do we want to totally exclude intra-GPU/NCCL communication use cases?

bhack commented on August 11, 2024

Probably there is no way to handle affinity and let jobs communicate intra-GPU over NCCL, because a mixed intra-GPU and over-the-network deployment doesn't exist. How much will the training process be slowed down when GPUs on the same node communicate only over the network?

jlewi commented on August 11, 2024

@bhack K8s has Affinity and Anti-Affinity which are part of the PodSpec. These can be used to try to prevent some pods or containers from running on the same node.

A potential use for this would be to prevent multiple parameter servers from running on the same node. Since parameter servers can be IO constrained you might want to run each PS on different nodes to prevent network contention.

To the extent TfJob includes a PodTemplateSpec, I believe a user should be able to manually set the affinity/anti-affinity to optimize performance. If we wanted to do that automatically we should consider opening up a new issue for that (that's beyond the scope of what I intended when I initially filed this issue).

In terms of efficient communication between multiple GPUs, I think it's TBD how K8s will handle this; GPU scheduling is still pretty primitive. Right now, if you want to take advantage of intra-GPU communication (e.g. NVIDIA peer-to-peer), I think you'd want to assign multiple GPUs to a single TF process. This guarantees that all those GPUs are on the same machine, and your TF process should then be able to take advantage of peer-to-peer or other networking features.
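For the manual route mentioned above, a pod anti-affinity stanza, written here as the plain dict the k8s API expects, could discourage two PS pods from landing on the same node. The "tf-replica-type" label key/value is a hypothetical assumption; the operator's actual pod labels may differ.

```python
# Hedged sketch: soft podAntiAffinity spreading PS pods across nodes.
# The "tf-replica-type" label is an assumption for illustration.
ps_anti_affinity = {
    "podAntiAffinity": {
        "preferredDuringSchedulingIgnoredDuringExecution": [{
            "weight": 100,
            "podAffinityTerm": {
                "labelSelector": {"matchLabels": {"tf-replica-type": "ps"}},
                # One topology domain per node hostname, i.e. spread by node.
                "topologyKey": "kubernetes.io/hostname",
            },
        }],
    },
}
```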


bhack commented on August 11, 2024

@jlewi You got the point. It is really an interesting topic, probably a bit too advanced for the current K8s GPU support. Can you move it to a new issue so this topic doesn't get lost?

jlewi commented on August 11, 2024

@jhseu It looks like grpc_tensorflow_server.py isn't part of TensorFlow's published Docker images. Does that seem like a reasonable TF feature request?

wbuchwalter commented on August 11, 2024

@jlewi just found this: https://hub.docker.com/r/tensorflow/tf_grpc_server/

jhseu commented on August 11, 2024

It'd probably be better for us to build the C++ binary grpc_tensorflow_std_server and publish that. File a bug internally?

jlewi commented on August 11, 2024

@jhseu Done, thanks.

jlewi commented on August 11, 2024

Fixed by #36
