
flux-k8s's People

Contributors

arangogutierrez, cmisale, dongahn, milroy, vsoch, xyloid


flux-k8s's Issues

Fluence Refactor with kubernetes-sigs/scheduler-plugin

We want to refactor the build / deploy / testing of Fluence so that:

  • it uses kubernetes-sigs/scheduler-plugins directly
  • a rebase / update is done once a week against upstream:
    • The rebuilt plugin is tested, ensuring we can create the example pod, and it is scheduled and runs
    • Likely we want a pod that outputs something meaningful to check
    • Successful testing -> deploy updated containers, ensuring a version of fluence with latest upstream is always available
    • We will likely maintain this as latest, along with tags for versions of kube-scheduler, and dated tags

Specific design notes for the refactor, we are aiming for a structure like this:

sig-scheduler-plugins/ <-- references kubernetes-sigs/scheduler-plugins clearly
  manifests/fluence
  pkg/fluence
  ...
src/               <-- the current scheduler-plugin source code (will be renamed afterward with a PR)

At a high level, we will keep here only the files that need to be added (for the build) to the upstream, and then in CI, run and test that build, and deploy on success. I'm assigning myself to this issue because (after the first commit of the files from our psap-openshift fork) I should be oriented to handle the above work. When this issue is done I will go through the other issues still open, either following up with people or closing them (if the work is done).

Problems compiling when running `make`

I have cloned the repo and I am trying to build the scheduler plugin.

a) cd scheduler-plugin
b) make

I get the following error in the build:

make[1]: Leaving directory '/home/flux-sched'
cp: cannot stat 'resource/hlapi/bindings/c/.libs/*': No such file or directory
The command '/bin/sh -c git clone https://github.com/cmisale/flux-sched.git --branch gobind-dev --single-branch &&     cd /home/flux-sched/ 	&& ./autogen.sh && PYTHON_VERSION=3.8 ./configure --prefix=/home/flux-install && make -j && make install 	&& cp -r resource/hlapi/bindings/c/.libs/* resource/.libs/* /home/flux-install/lib/ 	&& cp -r resource/hlapi/bindings/go/src/fluxcli /go/src/ 	&& mv  resource/hlapi/bindings /tmp 	&& cd /home && mkdir -p flux-sched/resource/hlapi && mv /tmp/bindings flux-sched/resource/hlapi' returned a non-zero code: 1

I think this is an issue in the Dockerfile:

	&& cp -r resource/hlapi/bindings/c/.libs/* resource/.libs/* /home/flux-install/lib/ \
	&& cp -r resource/hlapi/bindings/go/src/fluxcli /go/src/ \

Any suggestions on how to proceed?

bug: support for affinity rules

When we parse the pod, it looks like we don't take into account affinity rules (e.g., for the Flux Operator here). Regardless of the CPU limits/requests, it could be that a pod has affinity that would ask for the entire node. In this case, we would ignore that and still pass in the cpu/memory via the jobspec here, and fluxion could decide to put two pods on one node (if I understand that correctly). I think affinity rules are typically applied in Filter (which is the step after PreFilter), and we implement it here but don't account for them. In that case we might ignore the affinity rule altogether, so that could result in multiple pods per node for the MiniCluster unless the resource limits are also set.

For context, I'm trying to brainstorm the behavior I'm seeing with the latest experiments. It's most likely I did something wrong, but I think there are features of the Flux Operator that need to be taken into account (such as this one). If the default scheduler is accounting for affinity, that is minimally a subtle difference (even if not the exact problem here). I think likely what is needed is careful debugging of an entire scheduling session and checking of every output. I'll continue to try to think of more subtle differences and open issues as I do.
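To make the gap concrete, here is a minimal sketch (in Go, using only core v1 types) of what checking required node affinity in a Filter-style step could look like. The helper names are hypothetical and the numeric Gt/Lt operators are skipped; the scheduler framework has its own helpers for this, so this only illustrates the rule we currently ignore.

    package fluence

    import (
        corev1 "k8s.io/api/core/v1"
    )

    // checkRequiredNodeAffinity is a hypothetical helper: it returns true if the pod
    // has no required node affinity, or if at least one of its NodeSelectorTerms
    // matches the node's labels (terms are ORed, expressions within a term are ANDed).
    func checkRequiredNodeAffinity(pod *corev1.Pod, node *corev1.Node) bool {
        affinity := pod.Spec.Affinity
        if affinity == nil || affinity.NodeAffinity == nil ||
            affinity.NodeAffinity.RequiredDuringSchedulingIgnoredDuringExecution == nil {
            return true
        }
        for _, term := range affinity.NodeAffinity.RequiredDuringSchedulingIgnoredDuringExecution.NodeSelectorTerms {
            if matchExpressions(term.MatchExpressions, node.Labels) {
                return true
            }
        }
        return false
    }

    func matchExpressions(exprs []corev1.NodeSelectorRequirement, labels map[string]string) bool {
        for _, expr := range exprs {
            value, exists := labels[expr.Key]
            switch expr.Operator {
            case corev1.NodeSelectorOpExists:
                if !exists {
                    return false
                }
            case corev1.NodeSelectorOpDoesNotExist:
                if exists {
                    return false
                }
            case corev1.NodeSelectorOpIn:
                if !exists || !contains(expr.Values, value) {
                    return false
                }
            case corev1.NodeSelectorOpNotIn:
                if exists && contains(expr.Values, value) {
                    return false
                }
            default:
                // Unhandled operator (Gt/Lt): be conservative and treat as a mismatch.
                return false
            }
        }
        return true
    }

    func contains(values []string, v string) bool {
        for _, candidate := range values {
            if candidate == v {
                return true
            }
        }
        return false
    }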

Design Problems for Fluence

I think I've been working on this for over 30 hours this weekend and want to write down some concerns I have about #61, which is still not fully working with the new "bulk submit" model.

  • Resources not accounted for: Fluence can only account for resources on nodes at init time, and doesn't account for resources that are created with the default scheduler (primarily before fluence comes up). E.g., on smaller sizes I would often see fluence assign nodes to pods, the pods then accept, but then (for some reason) the work didn't wind up there. I suspect it was rejected by the kubelet. I'm not sure what the state is after that, because fluence thinks the pod is running there but it is not. In practice I think it leads to a stop or clog.
  • Reliance on state: Fluence stores group job ids in an internal dictionary. This means that if the scheduler restarts (which happens) we lose the record of them. Further, fluence comes up and is not able to reliably make a new mapping between the existing pod groups and node assignments. I'm not even sure how to think about this one aside from avoiding the restart case - it seems like a failure case (again related to state).
  • PodGroup decoupling: For smaller or controlled runs, it works nicely to see a job run, the job complete, and then the reconciler watching pods to see that the number in the group completed / failed is == the min members, and delete the pod group. For scaled runs (where there is likely more stress on watching pods, and even a kubectl get pods can take 10 seconds) I'm worried that the PodGroup logic can get decoupled from the fluence logic, meaning that the PodGroup is cleaned up, and (for some reason) we then do another AskFlux and allocate again. I'm not sure I've seen this happen, but based on the design I think it might be possible.
  • Pod recreation: The main issue I'm seeing now (that I don't understand) is that nodes are allocated for a group, but then for some reason, the pods change. But fluence has already made the assignment (and removed those nodes from its list). This was helped a bit by adding back in the cancel (if AskFlux happens again) but it still clogs.
  • Unit of operation: The scheduler works on the unit of pods. We need to work in the unit of groups. We are getting around that via the PodGroup, and indeed the MicroSecond timestamps help, but we still have edge cases that are hard to handle, like update / delete of a pod, because those events act on an entire group (and maybe it is an erroneous event for the pod and we should not act). I don't know how to handle that right now but want to point out that the design is problematic.

On a high level, we are trying to implement a model that has state into a framework that is largely against that. We are also trying to enforce the idea of a group of pods in a model where the unit is a single pod. For all of the above, I think our model works OK for small, more controlled cases, but we run into trouble for submission en-masse (as I'm trying to do). My head is spinning a bit from all these design problems and probably I need to step away for a bit. Another set / sets of eyes would help too.

documentation: make pretty, web rendered-docs

Reference: #49 (comment)

Note that scheduling pods with kube-scheduler and Fluence on the same cluster isn't supported. There isn't currently any way to propagate pod-to-node mappings generated by kube-scheduler to Fluence.

And for our testing cases:

It's important that kubectl apply -f fluence-job.yaml is executed before kubectl apply -f default-job.yaml, and that they don't specify limits or requests so they could be scheduled on the same node. That's currently the case in this PR, but I'm emphasizing it for posterity.

And a point of emphasis that is needed (the original title of this issue):

docs needed: emphasize cannot run kube default scheduler with fluence

Scheduler simulator

I started to look into ways to get more detail from the scheduler, and one idea is this simulator: https://github.com/kubernetes-sigs/kube-scheduler-simulator/tree/master.

That isn't the same as running real jobs, but I think we might try it out anyway. The other approach I'm thinking about is to make a bland / vanilla custom plugin that actually includes the upstream Kubernetes scheduling module, which we can customize as we want! I'm looking for what a default config should look like (probably I will just bring up a cluster and try to find it).

discussion: version release strategy

When I add a regular test -> deploy (after #47) we will be making regular releases. I want to propose the following release and versioning strategy.

  • We don't match versions of flux-k8s with flux-sched. It adds complexity that I don't think is needed, and often we will have changes here (and no changes to flux-sched) or vice versa. The source of truth will be the version we checkout/build in our Dockerfile/Makefile.
  • Each week we will do one build to test the current scheduler-plugins upstream against here. A successful set of builds should release the latest tag for each container, along with a YYYY-MM-DD tag. I've seen no issue with GitHub having many tags so I think this will work ok to deploy 52/year (registries share layers, so there will be redundancy there).
  • We do explicit releases just via traditional GitHub release, and that will trigger the workflow here to build a container with the same release number. I propose that after #47 is merged, we create release 0.1.0 of fluence and then can increment however others like (I'm weird and change the patch version typically, but I'm good to go with what others like).

In summary:

  • latest and YYYY-MM-DD will be updated weekly, with fresh builds against upstream scheduler-plugins
  • latest and x.x.x will be done manually when we decide to do releases
  • The versions can increment as we like, with no forced frequency and under our jurisdiction to decide when to bump patch vs. minor vs. major.

Passing of duration / timeout from jobs / pods to fluence

Currently, the default time limit for fluence is one hour, meaning that if the Kubernetes abstraction (pod, job, minicluster, etc.) has a different time, the two would not be synced. As an example, a Kubernetes job that requires more than an hour might not be cancelled by Kubernetes until hour 2, but fluence will hit the 1 hour mark (its default) and cancel the job too early.

Another issue (that isn't scoped to fluence, but related to timing) is timeout for a Job that has pods in a group. For example, for a Job abstraction (from here):

The activeDeadlineSeconds applies to the duration of the job, no matter how many Pods are created. Once a Job reaches activeDeadlineSeconds, all of its running Pods are terminated and the Job status will become type: Failed with reason: DeadlineExceeded

I think this means that, given we have an MPI job that spans nodes, the timing will start when the first pod in the job is running. If there is a large delay until the last pod is up (when the job can truly start) we don't actually get the runtime we asked for, but rather the runtime minus the waiting time for all pods to be up. In the context of fluence, we are again not accounting for the waiting time. If the pods are quick to schedule in the group, this likely won't be an issue. But if there is some delay that comes close to the total runtime needed, we might want to mutate the time to allow for that. Given the above, what seems to be a good idea is to set a timeout that would deem the job unreasonably long running, but not one close to the actual runtime.

For the first (simpler) issue, we basically need to pass forward any duration / time limits set on a pod or group abstraction to fluence. Discussed with @milroy today, please add any comments that I forgot.
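As a sketch of the simpler direction (the helper name is hypothetical, and it assumes we can look up the owning Job for a pod), the idea is roughly:

    package fluence

    import (
        batchv1 "k8s.io/api/batch/v1"
        corev1 "k8s.io/api/core/v1"
    )

    // groupDuration derives the duration (in seconds) to pass to fluxion instead of
    // the hard-coded one hour default. It prefers the pod's own
    // activeDeadlineSeconds, then the owning Job's, and otherwise falls back to the
    // default. For gang jobs we may want to pad the Job deadline, since it starts
    // ticking when the first pod runs, not when the whole group is up.
    func groupDuration(pod *corev1.Pod, job *batchv1.Job, defaultSeconds int64) int64 {
        if pod != nil && pod.Spec.ActiveDeadlineSeconds != nil {
            return *pod.Spec.ActiveDeadlineSeconds
        }
        if job != nil && job.Spec.ActiveDeadlineSeconds != nil {
            return *job.Spec.ActiveDeadlineSeconds
        }
        return defaultSeconds
    }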

Restructure repo for usability

The current main branch contains out of date instructions and examples. We need to update it to make it usable. Items to consider:

  • replace the contents of the examples directory with examples from the CANOPIE22 paper
  • update the README to include build and running instructions from the CANOPIE22 paper

Other suggestions are welcome.

Accessing NFD information

This issue originates from the items @cmisale brought up in issue #1.

Let's consider how Fluxion will gain access to NFD information. This issue will also need to be addressed for the second collaboration swimlane: standardizing resource expression for K8s and Fluxion/RJMS.

Test using Active queue "Activate Siblings" vs Current approach

Coscheduling uses a strategy of moving siblings to the active queue when a pod that is about to hit a node reaches the Permit endpoint. The strategy I have in place to schedule the first pod seems to be working OK, but I'd like to test this approach (after we merge the current PR). I can see pros and cons to both ways - having to rely on another queue (subject to other issues) seems less ideal than having them all scheduled at the right time. On the other hand, if something might happen with the latter approach that warrants the active queue, maybe it makes sense. I think empirical testing can help us determine which strategy we like best (or even a combination of the two).
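For reference, a rough sketch of the coscheduling-style mechanism as I understand it (the helper name is hypothetical; the PodsToActivate cycle-state entry is what the upstream framework exposes, but the exact usage should be verified against the version we vendor):

    package fluence

    import (
        corev1 "k8s.io/api/core/v1"
        "k8s.io/kubernetes/pkg/scheduler/framework"
    )

    // activateSiblings stashes the waiting siblings of a group into the framework's
    // PodsToActivate cycle-state entry, so the scheduler moves them to the active
    // queue instead of leaving them in the backoff/unschedulable queues.
    func activateSiblings(state *framework.CycleState, siblings []*corev1.Pod) {
        value, err := state.Read(framework.PodsToActivateKey)
        if err != nil {
            // Nothing registered for this scheduling cycle; nothing to do.
            return
        }
        toActivate, ok := value.(*framework.PodsToActivate)
        if !ok {
            return
        }
        toActivate.Lock()
        defer toActivate.Unlock()
        for _, sibling := range siblings {
            toActivate.Map[sibling.Namespace+"/"+sibling.Name] = sibling
        }
    }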

bug: sorting based on PodGroup timestamp with second granularity?

I'm trying to understand the granularity that we get with using metav1.Time, because (based on what I'm seeing) it seems like when I submit a huge batch of jobs with multiprocessing (likely in the same second) we get interleaving. I can't think of another reason that we'd get blocking, consistently for both default and fluence, when the cluster size is close to the job size (or the ratio is about 1/2, so one large job could take up half the resources). For example, I noticed this issue here tilt-dev/tilt#4313 that mentions some APIs are using time.Time(), which (according to the issue) has second granularity. Their fix was to use time.MicroTime. Specifically:

Currently, metav1.Time is only stored with second-level granularity, which is probably not sufficient for this API.

And indeed the PodGroup is using metav1.Time, as we can see defined here, which wraps here again. I think that if we want to handle this "spamming the scheduler" case (and not screw up the sort) we also need to use https://github.com/kubernetes/apimachinery/blob/02a41040d88da08de6765573ae2b1a51f424e1ca/pkg/apis/meta/v1/micro_time.go#L31. This also means the PodGroup abstraction is going to have that bug, and (I think) it wasn't an issue before when launching just 3-5 jobs. What I probably should do is create a new branch off of my current development one, restore some of the cache logic that I was working on with an internal PodGroup, and test a very simple (stupid) approach to create a MicroTime the first time that I see a group go through sort. If that resolves the interleaving, we can be more confident it's related to time. I ran out of extra credits today but should be able to test this locally with kind (I was seeing interleaving there, which is why I abandoned the experimental design in the first place!).
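A minimal sketch of what that internal bookkeeping could look like, assuming we stamp groups ourselves with metav1.MicroTime the first time they hit Sort (struct and function names are hypothetical):

    package fluence

    import (
        "sort"

        metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    )

    // groupEntry is a hypothetical internal record: the first time we see a group in
    // Sort we stamp it with a metav1.MicroTime, which keeps microsecond precision,
    // unlike the second-granularity timestamp serialized on the PodGroup today.
    type groupEntry struct {
        Name      string
        Timestamp metav1.MicroTime
    }

    // sortGroups orders groups by the micro-timestamp, falling back to name so the
    // ordering is total (and stable across scheduling cycles) even for ties.
    func sortGroups(groups []groupEntry) {
        sort.Slice(groups, func(i, j int) bool {
            ti, tj := groups[i].Timestamp, groups[j].Timestamp
            if ti.Equal(&tj) {
                return groups[i].Name < groups[j].Name
            }
            return ti.Before(&tj)
        })
    }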

Post refactor changes needed

  • Update fluence to go 1.20 or 1.21: We are going to hit issues using fluence (go 1.19) with other integrations like rainbow (go 1.20) and on our systems (go 1.20), and after #69 we should consider updating.
  • Take in the entire group to calculate resources: right now we use one representative pod, and we should be taking in the entire group instead.
  • containment / JGF format issues: I think the Name needs to reference the index of the resource relative to others, and then the paths -> containment needs to be the path to it plus that name (illustrated in the sketch after this list).
  • Carefully review resources generation for the Jobspec (I have only looked at this superficially)
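To illustrate the containment bullet above, here is a tiny hypothetical helper (the real JGF types in the fluence code differ; this only shows the convention where the vertex Name carries the per-type index and the containment path is the parent path plus that name):

    package fluence

    import "fmt"

    // jgfNode is a hypothetical, trimmed-down JGF vertex used only to illustrate the
    // naming convention: Name carries the per-type index and the containment path is
    // the parent path plus that same name.
    type jgfNode struct {
        ID    string
        Name  string
        Paths map[string]string // e.g. {"containment": "/cluster0/node3/core1"}
    }

    // newChildNode builds a child vertex under parentPath, so a core with index 1
    // under /cluster0/node3 gets Name "core1" and containment "/cluster0/node3/core1".
    func newChildNode(id, resourceType string, index int, parentPath string) jgfNode {
        name := fmt.Sprintf("%s%d", resourceType, index)
        return jgfNode{
            ID:   id,
            Name: name,
            Paths: map[string]string{
                "containment": parentPath + "/" + name,
            },
        }
    }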

discussion: affinity for scheduler deployment

I'm looking at the scheduler for "Koordinator" and noticed they have affinity for the pod spec in their deployment:

https://github.com/koordinator-sh/charts/blob/95b88c4df3a7c6ad14020add854d5df4d48c836d/versions/v1.3.0/templates/koord-scheduler.yaml#L67-L78

I seem to remember we had a manual strategy to ensure some topology, but it might be a cool idea to have a proper affinity. Here is our deployment yaml for the same idea:

https://github.com/flux-framework/flux-k8s/blob/main/sig-scheduler-plugins/manifests/fluence/deploy.yaml

Note that I haven't looked at the code that builds that container yet in detail - they could be very different (and then nullify this issue) but I wanted to open it for a future thing to think about (so I don't forget).

flux-K8s usage problem

Hello, I want to learn how to use the flux-k8s plugin, but I can't find any relevant documentation. How should I run this plugin?

Segmentation Fault in ReapiCliInit

Happening sometimes during initialization

[signal SIGSEGV: segmentation violation code=0x2 addr=0x499da8 pc=0x19d1768]

runtime stack:
runtime.throw(0x1ee0b88, 0x2a)
        /usr/local/go/src/runtime/panic.go:1116 +0x72
runtime.sigpanic()
        /usr/local/go/src/runtime/signal_unix.go:704 +0x4ac

goroutine 1 [syscall]:
runtime.cgocall(0x19cb4f3, 0xc000712da0, 0x1896f)
        /usr/local/go/src/runtime/cgocall.go:133 +0x5b fp=0xc000712d70 sp=0xc000712d38 pc=0x460e3b
fluxcli._Cfunc_reapi_cli_initialize(0x7f7a64000cf0, 0x7f7a2c000b60, 0x0)
        _cgo_gotypes.go:161 +0x4d fp=0xc000712da0 sp=0xc000712d70 pc=0x19b636d
fluxcli.ReapiCliInit.func1(0x7f7a64000cf0, 0xc000d34000, 0x1896f, 0xc000d34000)
        /go/src/sigs.k8s.io/scheduler-plugins/vendor/fluxcli/reapi_cli.go:35 +0x7d fp=0xc000712dd8 sp=0xc000712da0 pc=0x19b6abd
fluxcli.ReapiCliInit(0x7f7a64000cf0, 0xc000d34000, 0x1896f, 0xc000d34000)
        /go/src/sigs.k8s.io/scheduler-plugins/vendor/fluxcli/reapi_cli.go:35 +0x3f fp=0xc000712e08 sp=0xc000712dd8 pc=0x19b66ff
sigs.k8s.io/scheduler-plugins/pkg/kubeflux.New(0x0, 0x0, 0x2149000, 0xc0001e9520, 0x0, 0x0, 0x0, 0x0)
        /go/src/sigs.k8s.io/scheduler-plugins/pkg/kubeflux/kubeflux.go:138 +0x18a fp=0xc000712ee8 sp=0xc000712e08 pc=0x19bd5ca
k8s.io/kubernetes/pkg/scheduler/framework/runtime.NewFramework(0xc000db8150, 0xc0004b26c0, 0x0, 0x0, 0x0, 0xc000dfed80, 0x8, 0xc, 0x215a100, 0x817ab4b8b4242ccd, ...)
        /go/src/sigs.k8s.io/scheduler-plugins/vendor/k8s.io/kubernetes/pkg/scheduler/framework/runtime/framework.go:297 +0x866 fp=0xc000713368 sp=0xc000712ee8 pc=0x18a9f06
k8s.io/kubernetes/pkg/scheduler/profile.newProfile(0xc0003216e0, 0x11, 0xc0004b26c0, 0x0, 0x0, 0x0, 0xc000db8150, 0xc00031a330, 0xc0007137c0, 0x6, ...)
        /go/src/sigs.k8s.io/scheduler-plugins/vendor/k8s.io/kubernetes/pkg/scheduler/profile/profile.go:41 +0x12c fp=0xc0007133e8 sp=0xc000713368 pc=0x192c26c
k8s.io/kubernetes/pkg/scheduler/profile.NewMap(0xc000435620, 0x2, 0x2, 0xc000db8150, 0xc00031a330, 0xc0007137c0, 0x6, 0x6, 0xb2b7905169e345a7, 0xc000600c00, ...)
        /go/src/sigs.k8s.io/scheduler-plugins/vendor/k8s.io/kubernetes/pkg/scheduler/profile/profile.go:61 +0x1b5 fp=0xc000713550 sp=0xc0007133e8 pc=0x192c515
k8s.io/kubernetes/pkg/scheduler.(*Configurator).create(0xc000713aa0, 0xc000301920, 0x1eb1481, 0xf)
        /go/src/sigs.k8s.io/scheduler-plugins/vendor/k8s.io/kubernetes/pkg/scheduler/factory.go:135 +0xc0c fp=0xc000713800 sp=0xc000713550 pc=0x199f36c
k8s.io/kubernetes/pkg/scheduler.(*Configurator).createFromProvider(0xc000fa7aa0, 0x1eb1481, 0xf, 0xc0003018c0, 0x2, 0x2)
        /go/src/sigs.k8s.io/scheduler-plugins/vendor/k8s.io/kubernetes/pkg/scheduler/factory.go:201 +0x23d fp=0xc0007138d8 sp=0xc000713800 pc=0x199febd
k8s.io/kubernetes/pkg/scheduler.New(0x215df80, 0xc000158f20, 0x2159100, 0xc0006d61e0, 0xc00031a330, 0xc000150b40, 0xc000fa7c30, 0x9, 0x9, 0x1c9fce0, ...)
        /go/src/sigs.k8s.io/scheduler-plugins/vendor/k8s.io/kubernetes/pkg/scheduler/scheduler.go:237 +0x785 fp=0xc000713b50 sp=0xc0007138d8 pc=0x19a2405
k8s.io/kubernetes/cmd/kube-scheduler/app.Setup(0x2135400, 0xc000a877c0, 0xc000137ba0, 0xc000114e88, 0x1, 0x1, 0x47, 0x47, 0x47, 0x46)
        /go/src/sigs.k8s.io/scheduler-plugins/vendor/k8s.io/kubernetes/cmd/kube-scheduler/app/server.go:322 +0x50f fp=0xc000713c88 sp=0xc000713b50 pc=0x19b536f
k8s.io/kubernetes/cmd/kube-scheduler/app.runCommand(0xc0004238c0, 0xc000137ba0, 0xc000114e88, 0x1, 0x1, 0x0, 0x0)
[...]

References issue here

feat: Update fluxion with changes to K8s graph

This issue originates from the items @cmisale brought up in issue #1.

How can hierarchical Flux-K8s instances access the resource graph? Claudia, can you expand on this question a bit? I didn't fully understand your idea during our discussion.

Restore NFD Properties

I don't think we are using NFD Properties (and there has been considerable work on them in the last few years), so I think what might make sense to do now is to remove them from our parsing, and add them back strategically for specific things that we want/need. I think that with the changes to NFD, if we leave them in now and someone is using NFD, we could hit a bug or it wouldn't do anything, and I want to make sure we are adding metadata / nodes knowing what the outcome / scenarios should be.

It looks like our graph doesn't have sockets or racks either - so we are removing those until we need them too.

:robot: Fluence Reboot! Questions / Features discussed

This is a small list of features / notes we discussed that we want to pick up on after the refactor is finished. Please feel free to add to this list - I didn't properly capture the discussion from yesterday.

  • Question: Is there metadata coming in to fluence that we aren't using (e.g., from custom scheduler)?
  • Question: What other extension interfaces / plugins are active when fluence is added with a KubeSchedulerConfiguration. See the picture at the top here: https://kubernetes.io/docs/concepts/scheduling-eviction/scheduling-framework/#interfaces. I think others might already understand this, but I want to get a cluster running and see (however much I can) for myself what is happening (and then think about how those things work together, etc).
  • Fluence might have different flavors 🍨 of JobSpec. E.g., as a user I want to be able to add some metadata (a label or annotation, likely an annotation because labels are more limited in verbosity) to my pod and ask for the nodes to be closer together (or otherwise have a named topology). A minimal sketch of reading such an annotation follows this list.
  • When fluxion supports resource graph growth, we should support that. When we change the number of pods in our jobs we want to make sure that scheduling decisions continue to be what is desired.
  • Better understand (and factor into experiments) how TopologySpreadConstraint is relevant https://kubernetes.io/docs/concepts/scheduling-eviction/topology-spread-constraints/
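As a minimal sketch of the annotation idea from the JobSpec flavors bullet (the annotation key and function are hypothetical):

    package fluence

    import (
        corev1 "k8s.io/api/core/v1"
    )

    // topologyAnnotation is a hypothetical annotation key a user could set on a pod
    // (or pod template) to request a named placement flavor, e.g. "compact".
    const topologyAnnotation = "fluence.flux-framework.org/topology"

    // topologyFlavor returns the requested flavor, or the default when the
    // annotation is absent; the flavor could then select which jobspec we generate.
    func topologyFlavor(pod *corev1.Pod, defaultFlavor string) string {
        if value, ok := pod.Annotations[topologyAnnotation]; ok && value != "" {
            return value
        }
        return defaultFlavor
    }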

Desired:

  • Separate rendered docs
  • Shared image builds (between the automated build/deploy and testing pipelines for GitHub actions)
  • Review the current cleanup (saving and restoring the cache); I'm not sure it is adequate!

Implement Golang binding for Fluxion Resource API CLI

@cmisale defined several tasks in issue #1. I'm breaking these out into separate issues so we can track them independently as todo items.

We will need a Golang binding for the Fluxion Resource API CLI so that we don't rely on resource-query for scheduling, as was done for kube-flux.

When updating the Kubernetes API version: test the Pending state

Newer versions of the Kubernetes API have support for a Pending state that would skip the backoff queue, and I think this might be helpful / useful. I did (stupidly) try it this weekend (updating library versions) and it led to like, 5 hours of debugging mysterious panics, so yeah, not something we should do soon/first with all the current debugging we still need to do! But I wanted to put a note because I saw it was added recently.

Rename src/fluence/utils to something I'll remember

The "utils" package of fluence has a lot of useful stuff, and I can never find it when I'm looking for it! Let's plan to rename it, or move functions to be under the package where they are most suited. I think some functions can go alongside jgf generation or the graph, and if there are any "stuff that still smell like utils" left over we can leave it there.

Initial Assessment for Scheduler Plugin

As discussed during our meeting on Feb 3rd, we have two main options to build the Fluxion Scheduler Plugin:

  1. Extension Points (Scheduling Framework)
  2. Out-of-tree Scheduler Plugin

We would need to start a discussion about the pros/cons of the two options, so that we can make a decision ASAP.
Initial metrics we can consider are:

  • Maintainability wrt Kubernetes/OpenShift releases
  • Ease of implementation and deployment in Kubernetes/OpenShift
  • Ease of integration with Fluxion
  • What else?

Regardless of the option we go for, there are certain tasks we have to address that we identified during our discussion:

  • Implement a Golang binding for Fluxion C CLI
  • Define where/how to save state for Fluxion (i.e., Resource Graph)
    • Define what's the minimal state that we need
    • How the state is shared/accessed in case of hierarchical instances
  • Talk about how to access information about NFD in the future
