Comments (24)
Some updates here. @zw0610 and I wrote a doc; we still need some time to polish it and make it public. Then we can have some discussion in the WG meetings.
from common.
Raised this request in the 3/24 AutoML and Training meeting. I will draft a proposal for deeper discussion in the following month. We can either discuss it offline or talk about it in the next community meeting.
Here's the proposal: All-in-one training operator. Any feedback is welcome. I presented it in the 05/19 US & EU friendly meeting and @zw0610 will present it in the 06/02 CN & EU friendly meeting.
I also just left some specific comments in the doc. Please take a look when you get a chance.
Should we close this issue?
SGTM!
One piece of feedback I heard from users is exactly this case:
> "Before you run any job, there are already tons of pods running in the k8s cluster."
We have tried to merge multiple controllers into one process before. Using kubebuilder's `SetupWithManager`, one can spawn multiple controllers in one pod while implementing each controller separately.
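As a toy sketch of that one-manager-multiple-controllers idea: the types below are invented for illustration and are not the controller-runtime API. Each framework keeps its own reconciler implementation, but all reconcilers register with a single shared manager running in one process.

```go
package main

import "fmt"

// Reconciler is a stand-in for a per-framework controller (TFJob,
// PyTorchJob, ...). Each one is implemented separately.
type Reconciler interface {
	Reconcile(obj string) string
}

// Manager is a stand-in for the controller-runtime manager: one process,
// one shared cache/client, many registered controllers.
type Manager struct {
	controllers []Reconciler
}

// Register mirrors the SetupWithManager pattern: a controller adds itself
// to the shared manager instead of starting its own process.
func (m *Manager) Register(r Reconciler) {
	m.controllers = append(m.controllers, r)
}

// Run dispatches an event to every registered controller (a real manager
// would route events by watched resource type instead).
func (m *Manager) Run(obj string) []string {
	var out []string
	for _, c := range m.controllers {
		out = append(out, c.Reconcile(obj))
	}
	return out
}

type TFJobReconciler struct{}

func (TFJobReconciler) Reconcile(obj string) string {
	return "tfjob controller saw " + obj
}

type PyTorchJobReconciler struct{}

func (PyTorchJobReconciler) Reconcile(obj string) string {
	return "pytorchjob controller saw " + obj
}

func main() {
	mgr := &Manager{}
	mgr.Register(TFJobReconciler{})
	mgr.Register(PyTorchJobReconciler{})
	fmt.Println(mgr.Run("job-1"))
}
```

With controller-runtime, the same shape is achieved by calling each reconciler's `SetupWithManager(mgr)` against one `ctrl.Manager` and starting the manager once.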
Yeah, we also have an internal implementation of it, and Alibaba has a similar design. I think we can try to merge it back into the community.
/cc @jian-he
This is mainly for merging the controllers into one controller manager and we still keep the code base for different operators separate, right?
Personally, I think it would be better to keep all training operators in one repository. WDYT?
@gaocegege Personally I think it would be great if the code bases were closer together, but it might not be easy to maintain releases and versioning. Also, things like stargazers/watchers, commit history, and useful design discussions will be spread out if we start a new repo and develop from scratch.
One repo -> one manager -> multiple controllers -> CRD reconciler loops.
The manager can have many controllers, and each controller should take care of only one CRD. Frameworks should be configurable so that only the needed controllers are enabled.
@gaocegege What's the plan on your side? We have some internal work going on as well and are considering bringing it into Kubeflow.
@Jeffwan We also have some internal work on it; maybe we can discuss it together.
I second @terrytangyuan's suggestion against merging all operators into one repo. While keeping one manager with multiple controllers does bring many benefits like saving traffic, if the controllers work independently, the one-manager-multiple-controller design is perpendicular to the idea of sharing job reconciliation function to all operators for develop cost saving. That includes features like elastic training, error handling, event recording, etc.
Instead, maybe we can move kubeflow/common one step forward by creating an operator-sdk equivalent executable for this common library, making it something like `kubeflow-operator-sdk`. The executable would generate more training-operator-specific functions like `ReconcilePods` and `ReconcileServices` in addition to the code that operator-sdk generates, so developers would only need to fill in these functions for customization.
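To make the "fill in these functions" idea concrete, here is a hypothetical sketch of the contract such a generator could emit. Every name below (`FrameworkHooks`, `runReconcile`, `SetClusterSpec`) is invented for illustration and is not part of operator-sdk or kubeflow/common.

```go
package main

import "fmt"

// FrameworkHooks is the part a framework developer would fill in; the
// generated reconcile loop calls it for every replica.
type FrameworkHooks interface {
	// SetClusterSpec customizes the spec for one replica, e.g. composing
	// TF_CONFIG for TensorFlow or MASTER_ADDR for PyTorch.
	SetClusterSpec(rtype string, index int) string
}

// runReconcile stands in for the generated, shared reconcile loop that
// the developer would not need to touch.
func runReconcile(h FrameworkHooks, rtype string, replicas int) []string {
	specs := make([]string, 0, replicas)
	for i := 0; i < replicas; i++ {
		specs = append(specs, h.SetClusterSpec(rtype, i))
	}
	return specs
}

// tfHooks is the only piece a TensorFlow operator author would write.
type tfHooks struct{}

func (tfHooks) SetClusterSpec(rtype string, index int) string {
	return fmt.Sprintf("TF_CONFIG for %s-%d", rtype, index)
}

func main() {
	fmt.Println(runReconcile(tfHooks{}, "worker", 2))
}
```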
A summary from most operators:
Operator | Roles | Native* Resources | Others |
---|---|---|---|
tf-operator | PS, Worker, Chief, Evaluator | service, pod (podgroup) | |
mpi-operator | Launcher, Worker | service, pod, configMap, serviceAccount, role, roleBinding (podgroup) | |
pytorch-operator | Master, Worker | service, pod (podgroup) | |
xgboost-operator | Master, Worker | service, pod (podgroup) | |
mxnet-operator | Scheduler, Server, Worker, TunerTracker, TunerServer, Tuner | service, pod (podgroup) | |

*While podgroup is not a Kubernetes native resource, we consider it one here as it is defined outside the xxx-operator scope.
If mpi-operator is able to get rid of its reliance on configMap, serviceAccount, role, and roleBinding, maybe we can apply common to all of the operators.
But here I have another concern. The contemporary design of `ReconcilePods` and `ReconcileServices` is really puzzling, thanks to the lack of real OOP in Go. Moreover, this design still prevents developers from sharing the elastic-training feature among operators. For example, when the replicas of `worker` change, the Pod creation/deletion logic still needs to be implemented individually in each operator. However, the reason we need `ReconcilePods` is more about differences in the podTemplate, like composing `TF_CONFIG`.
Is it possible to use a shared `ReconcilePods` method, but allow developers to register decorators for the podTemplate before it is sent to `PodControl.CreatePodsWithControllerRef`?
But sure, developers should still be able to 'override' the `ReconcilePods` method if they want.
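A minimal sketch of this decorator idea, using invented toy types rather than the actual kubeflow/common API: the shared `ReconcilePods` owns the generic create/delete logic, and each framework registers a decorator that mutates the pod template before creation.

```go
package main

import "fmt"

// PodTemplate is a toy stand-in for corev1.PodTemplateSpec.
type PodTemplate struct {
	Env map[string]string
}

// Decorator mutates the template for one replica before the pod is created.
type Decorator func(rtype string, index int, tmpl *PodTemplate)

var decorators []Decorator

// RegisterDecorator lets each framework plug in its customization
// (e.g. composing TF_CONFIG) without overriding ReconcilePods.
func RegisterDecorator(d Decorator) { decorators = append(decorators, d) }

// ReconcilePods is shared: it only knows "make sure replica i of rtype
// exists" and applies all registered decorators to the template. Real
// code would then call PodControl.CreatePodsWithControllerRef.
func ReconcilePods(rtype string, replicas int) []PodTemplate {
	pods := make([]PodTemplate, 0, replicas)
	for i := 0; i < replicas; i++ {
		tmpl := PodTemplate{Env: map[string]string{}}
		for _, d := range decorators {
			d(rtype, i, &tmpl)
		}
		pods = append(pods, tmpl)
	}
	return pods
}

func main() {
	// TF-specific customization lives in a decorator, not in a forked
	// ReconcilePods implementation.
	RegisterDecorator(func(rtype string, index int, tmpl *PodTemplate) {
		tmpl.Env["TF_CONFIG"] = fmt.Sprintf(`{"task":{"type":%q,"index":%d}}`, rtype, index)
	})
	fmt.Println(ReconcilePods("worker", 2)[1].Env["TF_CONFIG"])
}
```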
> if the controllers work independently, the one-manager-multiple-controller design is perpendicular to the idea of sharing job reconciliation function to all operators for develop cost saving

I don't quite understand "perpendicular" in this context. Can you elaborate on this with more details?
> Instead, maybe we can move kubeflow/common one step forward by creating an operator-sdk equivalent executable for this common library, making it something like kubeflow-operator-sdk.
I think this is a good idea. I have two things in mind.
- From an implementation perspective, I am not sure if there's a way to leverage existing tools like `kubebuilder` or `operator-sdk` and inject the generated code. Currently, people still need to use the low-level API, or use kubebuilder to generate a skeleton and then use kubeflow/common. With `kubeflow-operator-sdk`, this would further reduce the time to build a new DL operator from scratch.
- The number of frameworks is growing but still reasonable. Making tools for a small group may not be worth the effort. If the tool can be made flexible enough that the all-in-one operator can use it as well, then it may be worth the effort.
> Moreover, this design still prevents developers from sharing elastic feature among operators. For example, when the replicas of worker is changed, the Pods creation/deletion logic still needs to be implemented individually for multiple operators.

> Is it possible if we use the shared ReconcilePods method, but allow developers to register decorators for podTemplate before it is sent to PodControl.CreatePodsWithControllerRef.

Em, I think we should collect these good use cases and probably try to find a different way to abstract the methods. I agree there's no guarantee we can make it compatible with future use cases, but we can still leave enough flexibility, like overriding `ReconcilePods`.
> if the controllers work independently, the one-manager-multiple-controller design is perpendicular to the idea of sharing job reconciliation function to all operators for develop cost saving

Hi @Jeffwan, what I mean in the comment above is that whether we use one-manager-multi-controller or controllers working individually is a question on a different dimension from the idea of sharing the code for how we reconcile jobs. With or without such a design, we can help developers working on operators for new frameworks with a larger shared code base.
But as you mentioned, the number of new frameworks looks limited, which I agree with.
A few thoughts:
- Are all operators using the same amount of resources? Say, in one deployment PyTorch jobs are spawned more than TFJobs, while in another MXNet jobs are used more. How do we control/recommend resource usage from a single combined operator's point of view?
- Since we will lose isolation between operator deployments, will releases become difficult? Also, what if a CR upgrade is required for one operator while there are running jobs for the other operators?

@gaocegege Do you have this design already?
@zw0610 and @Jeffwan are working on the design, I think we can discuss it in the community call.
> Are all operators using same amount of resources? Say, In one deployment, Pytorch jobs are spawned more than TF Jobs while in other, MXNet jobs are used more. How do we control/recommend resource usage from a single combined operator point of view?

Controllers will use different amounts of resources. Instead of trying to understand each controller's utilization, I would recommend adjusting resources based on the total number of jobs. Users transitioning from multiple operators can sum up the requests/limits and use that for the new operator. (The all-in-one operator uses fewer resources overall because of the shared cache.) However, this won't be accurate; it would be good for us to do some load testing and give recommended numbers.

> Since we will lose isolation between operator deployments, will releases become difficult? Also what if there is CR upgrade required for one operator while there are running jobs for the other operators?

Good point. Do you have concerns about the job state? Is anything different from upgrading an operator today?
When any CR is upgraded, we will need an operator upgrade and a controller restart. Will upgrades be difficult for users?
> When any CR is upgraded, we will need operator upgrade and controller restart. Will upgrade be difficult for the users?

I forgot to respond to that. I think the upgrade behavior is similar to the existing controllers. The only "overhead" is that the operator owner may need more time for coordination, since all users share one operator.
/close
@Jeffwan: Closing this issue.
In response to this:
/close
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.