
project-codeflare / appwrapper


AppWrapper controller for Kueue

Home Page: https://project-codeflare.github.io/appwrapper/

License: Apache License 2.0

Dockerfile 0.50% Makefile 3.93% Go 88.14% Shell 5.64% Ruby 0.14% JavaScript 0.64% Smarty 1.01%

appwrapper's People

Contributors

dgrove-oss, tardieu


appwrapper's Issues

Automatically infer PodSets for well-known kinds

We can make the specification of a PodSet for Component optional for common cases when we can recognize the GVK of the component. This would lower the barrier to entry for using AppWrappers.

Some low-hanging fruit with PodSpecTemplates: Jobs, Deployments, StatefulSets, PyTorchJobs, RayJobs/RayClusters, JobSets.

We should probably do this in conjunction with an extension to the validating webhook that rejects any AppWrapper containing an unknown GVK that has no explicit PodSets and appears to contain PodSpecTemplates. A stricter alternative would be to also recognize common GVKs that don't create Pods (Secrets, ConfigMaps, ServiceAccounts, etc.) and reject any unknown GVK that doesn't have an explicit PodSet (which may specify 0 replicas for resources that don't create Pods).
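The lookup this issue describes could be sketched roughly as follows. This is only an illustration: the `GVK` and `InferredPodSet` types, the dotted path format, and the table contents (including treating `spec.parallelism` as a Job's replica count) are assumptions, not the controller's actual API.

```go
package main

import "fmt"

// GVK is a simplified stand-in for a Kubernetes group/version/kind triple.
type GVK struct {
	Group, Version, Kind string
}

// InferredPodSet records where a PodSpecTemplate and its replica count are
// expected to live inside a component (hypothetical dotted-path format).
type InferredPodSet struct {
	TemplatePath string
	ReplicaPath  string
}

// wellKnown is an illustrative, deliberately incomplete table of kinds
// for which PodSets could be inferred.
var wellKnown = map[GVK][]InferredPodSet{
	{"apps", "v1", "Deployment"}:  {{TemplatePath: "spec.template", ReplicaPath: "spec.replicas"}},
	{"apps", "v1", "StatefulSet"}: {{TemplatePath: "spec.template", ReplicaPath: "spec.replicas"}},
	{"batch", "v1", "Job"}:        {{TemplatePath: "spec.template", ReplicaPath: "spec.parallelism"}},
}

// InferPodSets returns the inferred PodSets for a recognized kind, or an
// error telling the user to declare PodSets explicitly.
func InferPodSets(gvk GVK) ([]InferredPodSet, error) {
	ps, ok := wellKnown[gvk]
	if !ok {
		return nil, fmt.Errorf("cannot infer PodSets for %s/%s %s: declare them explicitly",
			gvk.Group, gvk.Version, gvk.Kind)
	}
	return ps, nil
}

func main() {
	ps, err := InferPodSets(GVK{"apps", "v1", "Deployment"})
	fmt.Println(ps, err)

	_, err = InferPodSets(GVK{"example.com", "v1", "Widget"})
	fmt.Println(err)
}
```

A table keyed by GVK keeps the inference rules in one place, so the webhook and the controller could share the same notion of "known kind".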

Copy button in AppWrapper website

The copy button in code blocks on the AppWrapper website is very useful. However, when a block contains both a command (or list of commands) and its output, the button copies both, which is not ideal.

Eliminate ChildAdmissionController after dropping support for Kueue 0.6

The ChildAdmissionController was a hacky workaround for Kueue 0.6 not correctly recognizing external frameworks.

We contributed upstream enhancements (kubernetes-sigs/kueue#2059) that are included in Kueue 0.7 that make this workaround obsolete. Once we can drop support for Kueue 0.6 from the main branch, we should remove this controller and related configuration and documentation (including controller architecture description in webpages).

Need to wait for #161 to be merged to main before doing this.

Implement delayed forceful deletion of pods & resources

Workload pods may become stuck (fail to delete). MCAD implemented an optional forceful deletion: after a user-specified grace period, MCAD would forcefully delete (delete with a grace period of 0) all remaining pods of a deleted workload.

The code that does this in mcadv2 is here: https://github.com/project-codeflare/mcad/blob/4dffea6bb957248dda957d9d97ec62fb19b7b9dc/internal/controller/resource_manager.go#L216-L241

We need similar functionality in the AppWrapper controller, but should drive it by looking for an annotation that enables it (and sets the deletion grace period) and by using timestamps in conditions instead of adding additional specialized fields to the status.

Implement Job.ReclaimablePods for AppWrappers

In cases where the AppWrapper contains resources that are managed by Kueue and the component implements ReclaimablePods, we should monitor the workload instances and flow that information through to Kueue.

Automatically generate a podgroup

Since an AppWrapper identifies all of its embedded PodSets, we could automatically generate a PodGroup and label the pod templates accordingly, using an AppWrapper annotation to trigger the PodGroup synthesis.
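One way to picture the synthesis, as a sketch only: sum the replica counts of the PodSets to get the group's minimum member count, and stamp a group label on each pod template. The `PodSet` and `PodGroup` types here are simplified stand-ins, and the label key is an assumption about which gang-scheduling plugin would consume it.

```go
package main

import "fmt"

// PodSet is a simplified stand-in: a replica count plus the labels of the
// pod template it refers to.
type PodSet struct {
	Replicas int32
	Labels   map[string]string
}

// PodGroup is a simplified stand-in for a gang-scheduling group object.
type PodGroup struct {
	Name      string
	MinMember int32
}

// podGroupLabel is an assumed label key; the actual key depends on the
// scheduler plugin in use.
const podGroupLabel = "scheduling.x-k8s.io/pod-group"

// SynthesizePodGroup builds a PodGroup named after the AppWrapper and
// labels every PodSet's pod template with the group name (in place).
func SynthesizePodGroup(appWrapperName string, podSets []PodSet) PodGroup {
	pg := PodGroup{Name: appWrapperName}
	for i := range podSets {
		pg.MinMember += podSets[i].Replicas
		if podSets[i].Labels == nil {
			podSets[i].Labels = map[string]string{}
		}
		podSets[i].Labels[podGroupLabel] = pg.Name
	}
	return pg
}

func main() {
	podSets := []PodSet{{Replicas: 1}, {Replicas: 4}}
	pg := SynthesizePodGroup("my-appwrapper", podSets)
	fmt.Println(pg.Name, pg.MinMember, podSets[1].Labels[podGroupLabel])
}
```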

Support merging a config.yaml into default configs

To enable full external customization of an AppWrapper controller deployment, we should adopt the same pattern as Kueue and allow a YAML configuration file to be provided on the command line. If present, the contents of this file would be merged into the AppWrapperConfig that config.NewConfig() initializes with default values.
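The merge pattern can be illustrated by decoding the user-supplied file directly into a struct that already holds the defaults, so that only the fields present in the file are overwritten. The sketch below uses JSON from the standard library as a stand-in for YAML to stay self-contained, and the `Config` fields are hypothetical, not the real AppWrapperConfig.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Config is a hypothetical, trimmed-down stand-in for AppWrapperConfig.
type Config struct {
	EnableKueueIntegration bool   `json:"enableKueueIntegration"`
	MetricsBindAddress     string `json:"metricsBindAddress"`
	MaxRetries             int    `json:"maxRetries"`
}

// NewConfig returns the default configuration (illustrative values).
func NewConfig() *Config {
	return &Config{
		EnableKueueIntegration: true,
		MetricsBindAddress:     ":8080",
		MaxRetries:             3,
	}
}

// LoadConfig starts from the defaults and overlays whatever fields are
// present in data; fields absent from data keep their default values.
func LoadConfig(data []byte) (*Config, error) {
	cfg := NewConfig()
	if len(data) == 0 {
		return cfg, nil
	}
	if err := json.Unmarshal(data, cfg); err != nil {
		return nil, fmt.Errorf("parsing config: %w", err)
	}
	return cfg, nil
}

func main() {
	cfg, err := LoadConfig([]byte(`{"maxRetries": 7}`))
	fmt.Printf("%+v %v\n", cfg, err)
}
```

Because the decoder only touches fields that appear in the input, a user can override a single setting without restating the rest of the configuration.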

Add e2e testing for Autopilot integration

We need to write an e2e test that covers the scenario of Autopilot tagging a resource as unhealthy while a workload is running. The test would verify that the workload gets reset/resumed in response.

Detect deletion of deployed resources

When deploying, for example, a PyTorchJob using an AppWrapper, the AppWrapper controller monitors the pods but not the PyTorchJob itself. We should consider extending the monitoring to the AppWrapper's components, in particular detecting the deletion of these components and responding by undeploying any other components, releasing the quota, and reporting the AppWrapper status. This may require a setting to opt in to (or opt out of) this monitoring separately from the monitoring of the pods.

Allow PodSets to specify a replicaPath

Motivation: eliminate user errors where the duplicated replica count gets out of sync. All the CRDs of interest to us have a replica count in their Template somewhere "adjacent" to their PodSpecTemplate that we can refer to in the PodSet.

This will be a non-backwards compatible change in the AppWrapper CRD.
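To make the idea concrete, here is a minimal sketch of resolving a replica count from a component given a dotted path. The function name, the dotted-path syntax, and the use of a plain nested map for the decoded component are all assumptions for illustration.

```go
package main

import (
	"fmt"
	"strings"
)

// ReplicasAt walks a decoded component (nested maps, as produced by a
// JSON or YAML decoder) along a dotted path and returns the integer found
// there. The path syntax is hypothetical.
func ReplicasAt(obj map[string]any, path string) (int64, error) {
	var cur any = obj
	for _, field := range strings.Split(path, ".") {
		m, ok := cur.(map[string]any)
		if !ok {
			return 0, fmt.Errorf("path %q: %q is not inside an object", path, field)
		}
		cur, ok = m[field]
		if !ok {
			return 0, fmt.Errorf("path %q: field %q not found", path, field)
		}
	}
	switch n := cur.(type) {
	case int:
		return int64(n), nil
	case int64:
		return n, nil
	case float64: // JSON decoders produce float64 for all numbers
		if n != float64(int64(n)) {
			return 0, fmt.Errorf("path %q: %v is not an integer", path, n)
		}
		return int64(n), nil
	}
	return 0, fmt.Errorf("path %q: value has type %T, want a number", path, cur)
}

func main() {
	deployment := map[string]any{
		"spec": map[string]any{"replicas": 4, "template": map[string]any{}},
	}
	n, err := ReplicasAt(deployment, "spec.replicas")
	fmt.Println(n, err)
}
```

With a path like this in the PodSet, the replica count would be read from the component itself rather than duplicated by the user.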

Optional strict checking mode for known GVKs in PodSet inference

This is a follow-on from #65.

Since we can only infer/validate PodSets for known GVKs, there may be some scenarios in which we want a strict mode where we would reject all AppWrappers that contain GVKs that aren't known to the operator.

This would mostly be for user experience and better error messages. The RBACs for the operator will prevent us from actually creating instances of unexpected GVKs, so this would just move the error from when resources are created during the Resuming phase of the AppWrapper to when the AppWrapper is initially created/validated by the webhook.
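A strict-mode check at validation time might look something like the following sketch. The `GVK` type, the `knownGVKs` set, and the `strict` flag are hypothetical; in a real webhook the errors would be reported through the admission response.

```go
package main

import (
	"errors"
	"fmt"
)

// GVK is a simplified stand-in for a Kubernetes group/version/kind triple.
type GVK struct {
	Group, Version, Kind string
}

// knownGVKs is an illustrative set of kinds the operator understands.
var knownGVKs = map[GVK]bool{
	{"apps", "v1", "Deployment"}: true,
	{"batch", "v1", "Job"}:       true,
	{"", "v1", "ConfigMap"}:      true,
}

// ValidateComponents rejects unknown kinds when strict is true, collecting
// one error per offending component so the user sees all problems at once.
func ValidateComponents(components []GVK, strict bool) error {
	if !strict {
		return nil
	}
	var errs []error
	for i, gvk := range components {
		if !knownGVKs[gvk] {
			errs = append(errs, fmt.Errorf("component %d: kind %s/%s %s is not supported in strict mode",
				i, gvk.Group, gvk.Version, gvk.Kind))
		}
	}
	return errors.Join(errs...) // nil when no errors were collected
}

func main() {
	components := []GVK{
		{"batch", "v1", "Job"},
		{"example.com", "v1", "Widget"},
	}
	fmt.Println(ValidateComponents(components, true))
	fmt.Println(ValidateComponents(components, false))
}
```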

CI testing for "dev mode" controllers

We should add an additional CI configuration that verifies that the controller works in dev mode (make install; make run).

In this mode, it should still pass all the e2e tests that are not tagged with "Webhook".
