Comments (4)
@DjangoPeng Is this the right thing to do? Do you have other suggestions on how we might improve networking efficiency?
from training-operator.
@jlewi I think a LoadBalancer is useful for TensorFlow Serving jobs. Generally we launch a model (e.g. face recognition) in a single TensorFlow Serving Pod while the request volume is manageable. But as the number of requests grows, a single pod is no longer enough to serve them. For that reason, I prefer launching TensorFlow Serving jobs as a Deployment: on the one hand, a Deployment makes it easy to scale up and down; on the other hand, a Deployment recreates dead pods automatically.
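As a minimal sketch of that approach (the names, image, and model path below are illustrative, not from the discussion): a Deployment scales by changing `replicas`, and a regular Service in front of it load-balances inference requests across the pods.

```yaml
# Hypothetical TF Serving Deployment: the controller maintains the replica
# count and replaces any pod that dies.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: face-recognition-serving   # illustrative name
spec:
  replicas: 3
  selector:
    matchLabels:
      app: face-recognition-serving
  template:
    metadata:
      labels:
        app: face-recognition-serving
    spec:
      containers:
      - name: tensorflow-serving
        image: tensorflow/serving
        args:
        - "--model_name=face_recognition"
        - "--model_base_path=/models/face_recognition"
        ports:
        - containerPort: 8500   # gRPC port
---
# A regular (non-headless) Service spreads requests across the replicas.
apiVersion: v1
kind: Service
metadata:
  name: face-recognition-serving
spec:
  type: LoadBalancer
  selector:
    app: face-recognition-serving
  ports:
  - port: 8500
    targetPort: 8500
```

Scaling is then `kubectl scale deployment face-recognition-serving --replicas=5`, or an HPA can do it automatically.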
Do you have other suggestions on how we might improve networking efficiency?
I think the key point is the pod network implementation. Many Kubernetes users deploy the Flannel overlay network by default, but Flannel is not a good choice for TensorFlow and other DL workloads. If we really want to improve networking efficiency, we should use a different network implementation, such as the host network.
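As a sketch, bypassing the overlay network is a single field on the pod spec (the pod name and image below are illustrative). Note the trade-offs: host-networked pods compete for node ports, so only one such pod per port can run on a node.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: tf-worker-0                      # illustrative
spec:
  hostNetwork: true                      # skip the overlay (e.g. Flannel); use the node's network stack directly
  dnsPolicy: ClusterFirstWithHostNet     # keep cluster DNS resolution working with hostNetwork
  containers:
  - name: tensorflow
    image: tensorflow/tensorflow
    ports:
    - containerPort: 2222                # bound directly on the host, so it must be free on the node
```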
Sorry I should have clarified that for headless services I only meant in the context of training jobs. For training jobs we need to assign stable names to each replica. So for a given replica there should be only 1 pod backing it. So I think load balancers are just introducing overhead.
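A minimal sketch of that pattern (names are illustrative): a headless Service sets `clusterIP: None`, so DNS resolves directly to the pod IPs with no load balancing in between, and combined with a StatefulSet each replica gets a stable name such as `tf-worker-0.tf-worker.default.svc.cluster.local`.

```yaml
apiVersion: v1
kind: Service
metadata:
  name: tf-worker          # illustrative; matches the StatefulSet's serviceName
spec:
  clusterIP: None          # headless: no virtual IP, no kube-proxy load balancing
  selector:
    app: tf-worker
  ports:
  - port: 2222             # conventional TF distributed-training port
```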
Regarding network performance, is there a simple benchmark that can be run to measure network performance in a way that's relevant to TF/DL?
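One simple option (a sketch; the image and names below are assumptions, not from the thread) is a point-to-point iperf3 run between two pods. It measures raw pod-to-pod bandwidth and latency rather than anything TF-specific, but it makes the overlay-vs-host-network difference directly visible.

```yaml
# iperf3 server pod
apiVersion: v1
kind: Pod
metadata:
  name: iperf3-server
  labels:
    app: iperf3-server
spec:
  containers:
  - name: iperf3
    image: networkstatic/iperf3   # illustrative community image
    args: ["-s"]                  # server mode
---
# Headless Service so the client can reach the server pod by name
apiVersion: v1
kind: Service
metadata:
  name: iperf3-server
spec:
  clusterIP: None
  selector:
    app: iperf3-server
  ports:
  - port: 5201                    # iperf3 default port
---
# Client Job: runs a 30-second throughput test against the server
apiVersion: batch/v1
kind: Job
metadata:
  name: iperf3-client
spec:
  template:
    spec:
      restartPolicy: Never
      containers:
      - name: iperf3
        image: networkstatic/iperf3
        args: ["-c", "iperf3-server", "-t", "30"]
```

Running the same pair once on the overlay and once with `hostNetwork: true` gives a rough upper bound on what the CNI is costing.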
I will take a stab at this one
Related Issues (20)
- Why overwrite RestartPolicy in podTemplate HOT 2
- Back-off pulling image "alpine:3.10" HOT 6
- KEP-2170: Add APIs for TrainJob and TrainingRuntime HOT 2
- KEP-2170: Create controller for TrainJob HOT 1
- KEP-2170: Create Kustomize manifests to deploy JobSet and TrainJob controllers
- KEP-2170: Implement validations for TrainJob HOT 3
- KEP-2170: Create dataset and model initializers
- KEP-2170: Create PyTorch multi-node distributed training runtime HOT 1
- KEP-2170: Create LLM training runtime for Llama 2 7b
- KEP-2170: Add E2E tests for TrainJob
- KEP-2170: Update documentation for V2 APIs
- KEP-2170: Generate OpenAPI spec for V2 APIs
- Design Kubeflow Python SDK for Training V2
- KEP-2170: Create MPI Runtime HOT 4
- KEP-2170: Support the PodSpecOverrides API in TrainJob
- KEP-2170: Implement validations for TrainingRuntime and ClusterTrainingRuntime HOT 1
- Regarding whether the tf-job-operator v1.0 metrics can expose specific failed pods HOT 2
- KEP-2170: Provide the client-go library for the TrainJob and TrainingRuntime
- Add Dependabot or Renovate HOT 4
- Support debugging webhooks locally HOT 1