project-codeflare / appwrapper
AppWrapper controller for Kueue
Home Page: https://project-codeflare.github.io/appwrapper/
License: Apache License 2.0
Consider adding configurable automatic cleanup after completion of an AppWrapper.
We can draw inspiration from the similar feature for Jobs (https://kubernetes.io/docs/concepts/workloads/controllers/job/#ttl-mechanism-for-finished-jobs).
We may want to have different default behavior for successful vs. failed jobs.
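A minimal sketch of what the TTL decision could look like, assuming a per-outcome TTL table. The `Phase`, `CleanupPolicy`, and `ShouldCleanup` names are hypothetical and not part of the AppWrapper API; an outcome without an entry is never cleaned up automatically, which is one way to express different defaults for successful and failed AppWrappers.

```go
package main

import (
	"fmt"
	"time"
)

// Phase is a simplified, hypothetical stand-in for an AppWrapper's terminal phase.
type Phase string

const (
	Succeeded Phase = "Succeeded"
	Failed    Phase = "Failed"
)

// CleanupPolicy maps a terminal phase to its TTL (hypothetical type).
// A phase with no entry is never cleaned up automatically.
type CleanupPolicy map[Phase]time.Duration

// ShouldCleanup reports whether an AppWrapper that finished `elapsed` ago has
// outlived its TTL. If it has not, the second result is how long to wait
// before checking again (zero when no TTL applies).
func ShouldCleanup(policy CleanupPolicy, phase Phase, elapsed time.Duration) (bool, time.Duration) {
	ttl, ok := policy[phase]
	if !ok {
		return false, 0
	}
	if elapsed >= ttl {
		return true, 0
	}
	return false, ttl - elapsed
}

func main() {
	// Successful AppWrappers expire after 10 minutes; failed ones are kept for debugging.
	policy := CleanupPolicy{Succeeded: 10 * time.Minute}
	for _, phase := range []Phase{Succeeded, Failed} {
		expired, wait := ShouldCleanup(policy, phase, 15*time.Minute)
		fmt.Printf("%s: delete=%v requeueAfter=%v\n", phase, expired, wait)
	}
}
```

In a controller, `elapsed` would presumably be derived from the timestamp of the terminal condition, and the returned wait would become the requeue delay.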
MCAD provided a mechanism for encoding conditions of the top-level resources that, if present, would indicate success or failure. This was used to augment the pod-level status information.
We need to design and implement a similar mechanism for the v1beta2 AppWrapper.
We can make the specification of a PodSet for a Component optional for common cases when we can recognize the GVK of the component. This would lower the barrier to entry for using AppWrappers.
Some low-hanging fruit with PodSpecTemplates: Jobs, Deployments, StatefulSets, PyTorchJobs, RayJobs/RayClusters, JobSets.
We probably should do this in conjunction with an extension to the validating webhook that rejects an AppWrapper containing any unknown GVKs that don't have explicit PodSets and appear to contain PodSpecTemplates. A stricter alternative would be to also recognize common GVKs that don't create pods (Secrets, ConfigMaps, ServiceAccounts, etc.) and reject any unknown GVK that doesn't have an explicit PodSet (which may say 0 replicas for resources that don't create Pods).
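A rough sketch of the stricter alternative, assuming a static table keyed by GVK. The table contents, the `Component` type, and `ResolvePodSets` are illustrative only; the real webhook would work on the wrapped objects rather than on strings.

```go
package main

import "fmt"

// podTemplatePaths lists, for each recognized GVK, the field paths assumed to
// hold a PodSpecTemplate. An empty list marks a kind that creates no pods.
// The entries are examples, not an exhaustive or authoritative registry.
var podTemplatePaths = map[string][]string{
	"batch/v1, Kind=Job":         {"spec.template"},
	"apps/v1, Kind=Deployment":   {"spec.template"},
	"apps/v1, Kind=StatefulSet":  {"spec.template"},
	"v1, Kind=ConfigMap":         {},
	"v1, Kind=Secret":            {},
	"v1, Kind=ServiceAccount":    {},
}

// Component is a hypothetical, simplified view of a wrapped resource.
// PodSetPaths is nil when the user declared no PodSets.
type Component struct {
	GVK         string
	PodSetPaths []string
}

// ResolvePodSets returns the PodSet paths for a component: the explicit ones
// if present, otherwise the inferred ones for a known GVK. An unknown GVK
// without explicit PodSets is rejected.
func ResolvePodSets(c Component) ([]string, error) {
	if c.PodSetPaths != nil {
		return c.PodSetPaths, nil
	}
	if paths, known := podTemplatePaths[c.GVK]; known {
		return paths, nil
	}
	return nil, fmt.Errorf("component %q has an unrecognized GVK and no explicit PodSets", c.GVK)
}

func main() {
	for _, c := range []Component{
		{GVK: "batch/v1, Kind=Job"},
		{GVK: "v1, Kind=ConfigMap"},
		{GVK: "example.com/v1, Kind=Widget"},
		{GVK: "example.com/v1, Kind=Widget", PodSetPaths: []string{}},
	} {
		paths, err := ResolvePodSets(c)
		fmt.Printf("%s -> paths=%v err=%v\n", c.GVK, paths, err)
	}
}
```

The last case shows the escape hatch from the proposal: an unknown kind is accepted once the user states explicitly that it has no PodSets.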
Consider following the structure described here: https://github.com/golang/go/wiki/Modules/6fe9f52ac7c4d92cb8fc878d8dee1bda0c63c8a5#how-can-i-track-tool-dependencies-for-a-module
to enable dependency tracking of the Go tools we depend on to build the AppWrapper controller.
The copy button in code blocks on the AppWrapper website is very useful. However, when a block contains both a command (or list of commands) and its output, the button copies both, which is not ideal.
The ChildAdmissionController was a hacky workaround for Kueue 0.6 not correctly recognizing external frameworks.
We contributed upstream enhancements (kubernetes-sigs/kueue#2059) that are included in Kueue 0.7 that make this workaround obsolete. Once we can drop support for Kueue 0.6 from the main branch, we should remove this controller and related configuration and documentation (including controller architecture description in webpages).
Need to wait for #161 to be merged to main before doing this.
We should have a favicon for the AppWrapper website. Make one from the codeflare icon?
Workload pods may become stuck (fail to delete). MCAD implemented an optional forceful deletion: after a user-specified grace period, MCAD would forcefully delete (delete with grace period 0) all remaining pods of a deleted workload.
The code that does this in mcadv2 is here: https://github.com/project-codeflare/mcad/blob/4dffea6bb957248dda957d9d97ec62fb19b7b9dc/internal/controller/resource_manager.go#L216-L241
We need similar functionality in the AppWrapper controller, but should drive it by looking for an annotation that enables it (and sets the deletion grace period) and by using timestamps in conditions instead of adding additional specialized fields to the status.
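One possible shape for the annotation-driven check, as a sketch only. The annotation key, the condition type name, and the `Condition` struct are assumptions made for illustration; the grace period is parsed with `time.ParseDuration`, and the deadline is derived from the condition's transition timestamp rather than from a dedicated status field.

```go
package main

import (
	"fmt"
	"time"
)

const (
	// ForcefulDeletionAnnotation is a hypothetical annotation key; its value
	// is a Go duration string such as "2m".
	ForcefulDeletionAnnotation = "example.com/forceful-deletion-grace-period"
	// DeletingCondition is the assumed name of the condition set while the
	// controller is deleting the wrapped resources.
	DeletingCondition = "DeletingResources"
)

// Condition is a simplified stand-in for metav1.Condition.
type Condition struct {
	Type               string
	Status             string
	LastTransitionTime time.Time
}

// ForceDeleteDue reports whether the remaining pods should now be deleted
// with grace period 0. It returns false when the annotation is absent (the
// feature is opt-in) or when deletion has not started.
func ForceDeleteDue(annotations map[string]string, conds []Condition, now time.Time) (bool, error) {
	raw, ok := annotations[ForcefulDeletionAnnotation]
	if !ok {
		return false, nil
	}
	grace, err := time.ParseDuration(raw)
	if err != nil {
		return false, fmt.Errorf("invalid %s annotation %q: %w", ForcefulDeletionAnnotation, raw, err)
	}
	for _, c := range conds {
		if c.Type == DeletingCondition && c.Status == "True" {
			return now.Sub(c.LastTransitionTime) >= grace, nil
		}
	}
	return false, nil
}

func main() {
	start := time.Date(2024, 1, 1, 12, 0, 0, 0, time.UTC)
	conds := []Condition{{Type: DeletingCondition, Status: "True", LastTransitionTime: start}}
	ann := map[string]string{ForcefulDeletionAnnotation: "2m"}
	due, err := ForceDeleteDue(ann, conds, start.Add(3*time.Minute))
	fmt.Println(due, err)
}
```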
In cases where the AppWrapper contains resources that are managed by Kueue and the component implements ReclaimablePods, we should monitor the workload instances and flow that information through to Kueue.
In the admission webhook, we should check that the AppWrapper controller itself has the permission to create the resources that are wrapped before we admit it (in addition to validating that the user has the permissions too).
Since an appwrapper identifies all the embedded podsets, we could automatically generate a podgroup and label the podtemplates accordingly, using an appwrapper annotation to trigger the podgroup synthesis.
To enable full external customization of an AppWrapper controller deployment, we should adopt the same pattern as Kueue and allow a .yaml configuration file to be provided on the command line. If present, the contents of this YAML file would be merged into the AppWrapperConfig that config.NewConfig() initializes with default values.
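The merge semantics can be sketched by unmarshalling the file onto an already-populated struct, so that only the fields present in the file are overwritten. To stay within the standard library, the sketch below uses JSON in place of YAML; a YAML decoder with the same unmarshal-onto-struct behavior is assumed to work the same way. The config fields and their default values are made up for illustration.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// AppWrapperConfig is a cut-down, hypothetical version of the real config.
type AppWrapperConfig struct {
	EnableKueueIntegrations  bool `json:"enableKueueIntegrations"`
	WarmupGracePeriodSeconds int  `json:"warmupGracePeriodSeconds"`
	RetryLimit               int  `json:"retryLimit"`
}

// NewConfig returns the defaults (values are illustrative).
func NewConfig() *AppWrapperConfig {
	return &AppWrapperConfig{
		EnableKueueIntegrations:  true,
		WarmupGracePeriodSeconds: 300,
		RetryLimit:               3,
	}
}

// LoadConfig starts from the defaults and overlays the contents of the
// user-provided file, if any. Fields absent from the file keep their defaults.
func LoadConfig(overrides []byte) (*AppWrapperConfig, error) {
	cfg := NewConfig()
	if len(overrides) == 0 {
		return cfg, nil
	}
	if err := json.Unmarshal(overrides, cfg); err != nil {
		return nil, fmt.Errorf("parsing configuration: %w", err)
	}
	return cfg, nil
}

func main() {
	cfg, err := LoadConfig([]byte(`{"retryLimit": 5}`))
	if err != nil {
		panic(err)
	}
	fmt.Printf("%+v\n", *cfg)
}
```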
k8s.io/kubernetes is not meant to be depended on, as explained in kubernetes/kubernetes#79384.
This causes issues with some tools.
Replace the dependency at line 30 in 37ce35b with the right one from k8s.io/api/core/v1.
We need to write some unit tests to ensure we don't regress. #131 was quite annoying to debug...
Need to write an e2e test that covers the scenario of Autopilot tagging a resource as unhealthy while a workload is running. Test would verify that the workload gets reset/resumed in response.
When deploying, for example, a PyTorchJob using an AppWrapper, the AppWrapper controller monitors the pods but not the PyTorchJob itself. We should consider extending the monitoring to the AppWrapper's components, in particular detecting the deletion of these components and responding by undeploying any other components, releasing the quota, and reporting the AppWrapper status. This may require a setting to opt in to (or opt out of) this monitoring separately from the monitoring of the pods.
Motivation: eliminate user errors where the duplicated replica count gets out of sync. All the CRDs of interest to us have a replica count in their Template somewhere "adjacent" to their PodSpecTemplate that we can refer to in the PodSet.
This will be a non-backwards-compatible change in the AppWrapper CRD.
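As an illustration of referring to the count instead of duplicating it, the sketch below resolves a dotted path against the unstructured form of a wrapped resource. The path syntax and the `ReplicasAt` helper are assumptions; the actual CRD change may pick a different notation.

```go
package main

import (
	"fmt"
	"strings"
)

// ReplicasAt walks a dotted field path through an unstructured object and
// returns the integer found there. It handles the numeric types that show up
// in Go literals (int, int64) and in decoded JSON (float64).
func ReplicasAt(obj map[string]any, path string) (int64, error) {
	var cur any = obj
	for _, key := range strings.Split(path, ".") {
		m, ok := cur.(map[string]any)
		if !ok {
			return 0, fmt.Errorf("path %q: cannot descend into %q", path, key)
		}
		cur, ok = m[key]
		if !ok {
			return 0, fmt.Errorf("path %q: field %q not found", path, key)
		}
	}
	switch n := cur.(type) {
	case int:
		return int64(n), nil
	case int64:
		return n, nil
	case float64:
		return int64(n), nil
	}
	return 0, fmt.Errorf("path %q: value is not a number", path)
}

func main() {
	deployment := map[string]any{
		"spec": map[string]any{
			"replicas": 4,
			"template": map[string]any{},
		},
	}
	n, err := ReplicasAt(deployment, "spec.replicas")
	fmt.Println(n, err)
}
```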
We failed to consider the use of generateName by the wrapped resource when recording the name of the created resource in the component status. As a result, we fail to track status and properly delete any resource that uses generateName.
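A sketch of one way to address this, assuming the create call fills in the server-assigned name on the object it was given (as Kubernetes clients generally do): read the name after the create returns, not from the template. `Object`, `FakeCreate`, and `Deploy` are stand-ins for illustration, and the fixed suffix replaces the random one a real API server would generate.

```go
package main

import "fmt"

// Object is a minimal stand-in for the metadata of a wrapped resource.
type Object struct {
	Name         string
	GenerateName string
}

// ComponentStatus records what the controller needs to find the resource later.
type ComponentStatus struct {
	Name string
}

// FakeCreate mimics the API server: when Name is empty it assigns one derived
// from GenerateName (with a fixed suffix here, random in a real cluster).
func FakeCreate(obj *Object) error {
	if obj.Name == "" {
		if obj.GenerateName == "" {
			return fmt.Errorf("object has neither name nor generateName")
		}
		obj.Name = obj.GenerateName + "abcde"
	}
	return nil
}

// Deploy creates the object and records its name only after creation, so
// the status holds the final name even when generateName was used.
func Deploy(obj *Object, create func(*Object) error) (ComponentStatus, error) {
	if err := create(obj); err != nil {
		return ComponentStatus{}, err
	}
	return ComponentStatus{Name: obj.Name}, nil
}

func main() {
	status, err := Deploy(&Object{GenerateName: "raycluster-"}, FakeCreate)
	fmt.Println(status.Name, err)
}
```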
See error log in project-codeflare/codeflare-operator#591 which attempted to use GenerateName in a RayCluster wrapped in an AppWrapper and exposed the bug.
This is a follow-on from #65.
Since we can only infer/validate PodSets for known GVKs, there may be some scenarios in which we want a strict mode where we would reject all AppWrappers that contain GVKs that aren't known to the operator.
This would mostly be for user experience and better error messages. The RBACs for the operator will prevent us from actually creating instances of unexpected GVKs, so this would just move the error from when resources are created during the Resuming phase of the AppWrapper to when the AppWrapper is initially created/validated by the webhook.
We should add an additional CI configuration that verifies that the controller works in dev mode (make install; make run).
In this mode, it should still pass all the e2e tests that are not tagged with "Webhook".