flux-framework / flux-k8s
Project to manage Flux tasks needed to standardize Kubernetes HPC scheduling interfaces
License: Apache License 2.0
We want to refactor the build / deploy / testing of Fluence. Specific design notes for the refactor: we are aiming for a structure like this:
sig-scheduler-plugins/ <-- references kubernetes-sigs/scheduler-plugins clearly
manifests/fluence
pkg/fluence
...
src/ <-- the current scheduler-plugin source code (will be renamed in a follow-up PR)
High level, we will keep here only the files that need to be added (for build) to the upstream, then in CI run and test that build, and deploy on success. I'm assigning myself to this issue because (after the first commit of the files from our psap-openshift fork) I should be oriented to handle the above work. When this issue is done, I will go through the other issues still open, either following up with people or closing them (if the work is done).
I have cloned the repo and I am trying to build the scheduler plugin.
a) cd scheduler-plugin
b) make
I get the following error in the build:
make[1]: Leaving directory '/home/flux-sched'
cp: cannot stat 'resource/hlapi/bindings/c/.libs/*': No such file or directory
The command '/bin/sh -c git clone https://github.com/cmisale/flux-sched.git --branch gobind-dev --single-branch && cd /home/flux-sched/ && ./autogen.sh && PYTHON_VERSION=3.8 ./configure --prefix=/home/flux-install && make -j && make install && cp -r resource/hlapi/bindings/c/.libs/* resource/.libs/* /home/flux-install/lib/ && cp -r resource/hlapi/bindings/go/src/fluxcli /go/src/ && mv resource/hlapi/bindings /tmp && cd /home && mkdir -p flux-sched/resource/hlapi && mv /tmp/bindings flux-sched/resource/hlapi' returned a non-zero code: 1
I think this is an issue in the Dockerfile:
&& cp -r resource/hlapi/bindings/c/.libs/* resource/.libs/* /home/flux-install/lib/ \
&& cp -r resource/hlapi/bindings/go/src/fluxcli /go/src/ \
Any suggestions on how to proceed?
Based on our collaboration discussion this week, we will need to understand how Fluxion (or an RJMS in general) can interact with or override a kubelet's Topology Manager.
When we parse the pod, it looks like we don't take into account affinity rules (e.g., for the Flux Operator here). Regardless of the CPU limits/requests, it could be that a pod has affinity that would ask for the entire node. In this case, we would ignore that and still pass in the cpu/memory via the jobspec here, and fluxion could decide to put two pods on one node (if I understand that correctly). I think affinity rules are typically applied in Filter (which is the step after PreFilter), and we implement that here but don't account for them. In this case we might ignore the affinity rule altogether, so that could result in multiple pods per node for the MiniCluster unless the resource limits are also set.
For context, I'm trying to brainstorm the behavior I'm seeing with the latest experiments. It's most likely I did something wrong, but I think there are features of the Flux Operator that need to be taken into account (such as this one). If the default scheduler is accounting for affinity, that is minimally a subtle difference (even if not the exact problem here). I think likely what is needed is careful debugging of an entire scheduling session and checking of every output. I'll continue to try to think of more subtle differences and open issues as I do.
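The missing check described above could be sketched like this. The types here are hypothetical stand-ins for the real ones in k8s.io/api/core/v1; the point is only to show where the jobspec translation could detect affinity before passing plain cpu/memory requests to fluxion.

```go
package main

import "fmt"

// Minimal stand-ins for the Kubernetes pod types (hypothetical; the real
// types live in k8s.io/api/core/v1).
type Affinity struct {
	NodeAffinity    *struct{}
	PodAffinity     *struct{}
	PodAntiAffinity *struct{}
}

type PodSpec struct {
	Affinity *Affinity
}

// hasAffinity reports whether the pod declares any affinity rule that a
// cpu/memory-only jobspec would silently drop.
func hasAffinity(spec PodSpec) bool {
	a := spec.Affinity
	if a == nil {
		return false
	}
	return a.NodeAffinity != nil || a.PodAffinity != nil || a.PodAntiAffinity != nil
}

func main() {
	plain := PodSpec{}
	pinned := PodSpec{Affinity: &Affinity{PodAntiAffinity: &struct{}{}}}
	fmt.Println(hasAffinity(plain), hasAffinity(pinned)) // false true
}
```

A real version would inspect the actual `pod.Spec.Affinity` and either reject the pod in PreFilter or translate the constraint into the jobspec rather than ignoring it.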
I think I've been working on this for over 30 hours this weekend, and I want to write down some concerns I have about #61, which is still not fully working with the new "bulk submit" model.
At a high level, we are trying to implement a model that has state in a framework that is largely against that. We are also trying to enforce the idea of a group of pods in a model where the unit is a single pod. For all of the above, I think our model works OK for small, more controlled cases, but we run into trouble for submission en masse (as I'm trying to do). My head is spinning a bit from all these design problems and I probably need to step away for a bit. Another set (or sets) of eyes would help too.
In case anyone needs it, the Dockerfile that builds this container is a variant of Spack v0.17, specifically commit 7893be7712ed709f6136ac83f49afc3d718d5ddc.
Reference: #49 (comment)
Note that scheduling pods with kube-scheduler and Fluence on the same cluster isn't supported. There isn't currently any way to propagate pod-to-node mappings generated by kube-scheduler to Fluence.
And for our testing cases:
It's important that kubectl apply -f fluence-job.yaml is executed before kubectl apply -f default-job.yaml, and that they don't specify limits or requests so they could be scheduled on the same node. That's currently the case in this PR, but I'm emphasizing it for posterity.
And a point of emphasis that is needed (the original title of this issue):
docs needed: emphasize cannot run kube default scheduler with fluence
I started to look into ways to get more detail from the scheduler, and one idea is this simulator: https://github.com/kubernetes-sigs/kube-scheduler-simulator/tree/master.
That isn't the same as running real jobs, but I think we might try it out anyway. The other approach I'm thinking about is to make a bland / vanilla custom plugin that actually includes the upstream Kubernetes scheduling module, which we can then customize as we want! I'm looking for what a default config should look like (I'll probably just bring up a cluster and try to find it).
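To anchor that search, here is a minimal sketch of what such a config could look like. The plugin name (Fluence) and the extension points enabled are assumptions for illustration, not a tested configuration:

```yaml
# Sketch of a KubeSchedulerConfiguration for a custom plugin build.
# "Fluence" and the chosen extension points are assumptions.
apiVersion: kubescheduler.config.k8s.io/v1
kind: KubeSchedulerConfiguration
profiles:
  - schedulerName: fluence
    plugins:
      queueSort:
        enabled:
          - name: Fluence
        disabled:
          - name: "*"
      preFilter:
        enabled:
          - name: Fluence
```

The actual default config could be recovered from a running cluster as planned above and diffed against this.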
When I add a regular test -> deploy (after #47) we will be making regular releases, and I want to propose the following release and versioning strategy: a latest tag for each container, along with a YYYY-MM-DD tag. I've seen no issue with GitHub having many tags, so I think this will work OK to deploy 52/year (registries share layers, so there will be redundancy there). In summary:

- latest and YYYY-MM-DD will be updated weekly, with fresh builds against upstream scheduler-plugins
- latest and x.x.x will be done manually when we decide to do releases

Currently, the default time limit for fluence is one hour, meaning that if the Kubernetes abstraction (pod, job, minicluster, etc.) has a different time, the two would not be synced. As an example, given a Kubernetes job that requires more than an hour, it might not be cancelled by Kubernetes until hour 2. However, fluence will hit the 1 hour mark (its default) and cancel the job too early.
Another issue (that isn't scoped to fluence, but related to timing) is timeout for a Job that has pods in a group. For example, for a Job abstraction (from here):
The activeDeadlineSeconds applies to the duration of the job, no matter how many Pods are created. Once a Job reaches activeDeadlineSeconds, all of its running Pods are terminated and the Job status will become type: Failed with reason: DeadlineExceeded
I think this means that, given we have an MPI job that spans nodes, the timing will start when the first pod in the job is running. If there is a large delay before the last pod is up (when the job can truly start), we don't actually get the runtime we asked for, but the runtime minus the waiting time for all pods to be up. In the context of fluence, we are again not accounting for the waiting time. If the pods are quick to schedule in the group, this likely won't be an issue. But if there is some delay that comes close to the total runtime needed, we might want to mutate the time to allow for that. What seems to be a good idea, given the above, is to set a timeout that would deem the job unreasonably long running, but not one close to the actual runtime.
For the first (simpler) issue, we basically need to pass forward any duration / time limits set on a pod or group abstraction to fluence. Discussed with @milroy today, please add any comments that I forgot.
The current main branch contains out-of-date instructions and examples. We need to update it to make it usable. Items to consider:

- an examples directory with examples from the CANOPIE22 paper

Other suggestions are welcome.
Coscheduling uses a strategy of moving siblings to the active queue when a pod that is about to land on a node hits the Permit endpoint. The strategy I have in place to schedule the first pod seems to be working OK, but I'd like to test this approach (after we merge the current PR). I can see pros and cons to both ways: having to rely on another queue (subject to other issues) seems less ideal than having them all scheduled at the right time. On the other hand, if something might happen with the latter approach that warrants the active queue, maybe it makes sense. I think empirical testing can help us determine which strategy we like best (or even a combination of the two).
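The coscheduling-style strategy can be sketched as: when one pod in a group reaches Permit, release all of its waiting siblings together. The types and names below are hypothetical stand-ins; the real plugin would use the scheduling framework's WaitingPod / activation machinery.

```go
package main

import "fmt"

type groupState struct {
	waiting map[string]bool // siblings not yet allowed to schedule
	active  []string        // pods moved to the active queue
}

// permit moves the permitted pod and all of its waiting siblings to the
// active queue, so the whole group proceeds at once.
func (g *groupState) permit(pod string) {
	delete(g.waiting, pod)
	g.active = append(g.active, pod)
	for sib := range g.waiting {
		g.active = append(g.active, sib)
		delete(g.waiting, sib)
	}
}

func main() {
	g := &groupState{waiting: map[string]bool{"pod-0": true, "pod-1": true, "pod-2": true}}
	g.permit("pod-0")
	fmt.Println(len(g.active), len(g.waiting)) // 3 0
}
```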
I'm trying to understand the granularity that we get with metav1.Time, because (based on what I'm seeing) it seems like when I submit a huge batch of jobs with multiprocessing (likely in the same second) we get interleaving. I can't think of another reason we'd get blocking, consistently for both default and fluence, when the cluster size is close to the job size (or the ratio is about 1/2, so one large job could take up half the resources). For example, I noticed this issue, tilt-dev/tilt#4313, which mentions some APIs are using time.Time(), which (according to the issue) has second granularity. Their fix was to use metav1.MicroTime. Specifically:
Currently, metav1.Time is only stored with second-level granularity, which is probably not sufficient for this API.
And indeed the PodGroup is using metav1.Time, as we can see defined here, which wraps here again. I think if we want to handle this "spamming the scheduler" case (and not screw up the sort) we also need to use https://github.com/kubernetes/apimachinery/blob/02a41040d88da08de6765573ae2b1a51f424e1ca/pkg/apis/meta/v1/micro_time.go#L31. This also means the PodGroup abstraction is going to have that bug, and (I think) it wasn't an issue before when launching just 3-5 jobs. What I probably should do is create a new branch off of my current development one, restore some of the cache logic that I was working on with an internal PodGroup, and test a very simple (stupid) approach: create a MicroTime the first time I see a group go through sort. If that resolves the interleaving, we can be more confident it's related to time. I ran out of extra credits today but should be able to test this locally with kind (I was seeing interleaving there, which is why I abandoned the experimental design in the first place!).
Per discussion in #47 (comment)
We want to sanity check:
1) what happens when you don't specify a struct data member for CancelResponse, and 2) what is the relationship between the struct and message CancelResponse.
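For the first question, Go's zero-value semantics are the relevant behavior. The struct below is a hypothetical stand-in for the generated CancelResponse; the real one comes from the protobuf message definition. Any field left unset when constructing a struct gets its zero value, which for proto3 scalars is also what an absent field decodes to on the wire.

```go
package main

import "fmt"

// Hypothetical stand-in for the protobuf-generated CancelResponse struct;
// field names here are illustrative, not from the actual .proto file.
type CancelResponse struct {
	JobID int64
	Errno int32
}

func main() {
	r := CancelResponse{JobID: 42} // Errno not specified
	fmt.Println(r.Errno)           // 0 (the int32 zero value)
}
```

A consequence worth sanity checking: a receiver cannot distinguish "field set to 0" from "field never set" for proto3 scalars, which matters if 0 is a meaningful error code.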
I'm looking at the scheduler for "Koordinator" and noticed they have affinity for the pod spec in their deployment:
I seem to remember we had a manual strategy to ensure some topology, but it might be a cool idea to have a proper affinity. Here is our deployment YAML for the similar idea:
Note that I haven't looked at the code that builds that container yet in detail - they could be very different (and then nullify this issue) but I wanted to open it for a future thing to think about (so I don't forget).
Hello, I want to learn how to use the flux-k8s plugin, but I can't find relevant documentation. How should I run this plugin?
This happens sometimes during initialization:
[signal SIGSEGV: segmentation violation code=0x2 addr=0x499da8 pc=0x19d1768]
runtime stack:
runtime.throw(0x1ee0b88, 0x2a)
/usr/local/go/src/runtime/panic.go:1116 +0x72
runtime.sigpanic()
/usr/local/go/src/runtime/signal_unix.go:704 +0x4ac
goroutine 1 [syscall]:
runtime.cgocall(0x19cb4f3, 0xc000712da0, 0x1896f)
/usr/local/go/src/runtime/cgocall.go:133 +0x5b fp=0xc000712d70 sp=0xc000712d38 pc=0x460e3b
fluxcli._Cfunc_reapi_cli_initialize(0x7f7a64000cf0, 0x7f7a2c000b60, 0x0)
_cgo_gotypes.go:161 +0x4d fp=0xc000712da0 sp=0xc000712d70 pc=0x19b636d
fluxcli.ReapiCliInit.func1(0x7f7a64000cf0, 0xc000d34000, 0x1896f, 0xc000d34000)
/go/src/sigs.k8s.io/scheduler-plugins/vendor/fluxcli/reapi_cli.go:35 +0x7d fp=0xc000712dd8 sp=0xc000712da0 pc=0x19b6abd
fluxcli.ReapiCliInit(0x7f7a64000cf0, 0xc000d34000, 0x1896f, 0xc000d34000)
/go/src/sigs.k8s.io/scheduler-plugins/vendor/fluxcli/reapi_cli.go:35 +0x3f fp=0xc000712e08 sp=0xc000712dd8 pc=0x19b66ff
sigs.k8s.io/scheduler-plugins/pkg/kubeflux.New(0x0, 0x0, 0x2149000, 0xc0001e9520, 0x0, 0x0, 0x0, 0x0)
/go/src/sigs.k8s.io/scheduler-plugins/pkg/kubeflux/kubeflux.go:138 +0x18a fp=0xc000712ee8 sp=0xc000712e08 pc=0x19bd5ca
k8s.io/kubernetes/pkg/scheduler/framework/runtime.NewFramework(0xc000db8150, 0xc0004b26c0, 0x0, 0x0, 0x0, 0xc000dfed80, 0x8, 0xc, 0x215a100, 0x817ab4b8b4242ccd, ...)
/go/src/sigs.k8s.io/scheduler-plugins/vendor/k8s.io/kubernetes/pkg/scheduler/framework/runtime/framework.go:297 +0x866 fp=0xc000713368 sp=0xc000712ee8 pc=0x18a9f06
k8s.io/kubernetes/pkg/scheduler/profile.newProfile(0xc0003216e0, 0x11, 0xc0004b26c0, 0x0, 0x0, 0x0, 0xc000db8150, 0xc00031a330, 0xc0007137c0, 0x6, ...)
/go/src/sigs.k8s.io/scheduler-plugins/vendor/k8s.io/kubernetes/pkg/scheduler/profile/profile.go:41 +0x12c fp=0xc0007133e8 sp=0xc000713368 pc=0x192c26c
k8s.io/kubernetes/pkg/scheduler/profile.NewMap(0xc000435620, 0x2, 0x2, 0xc000db8150, 0xc00031a330, 0xc0007137c0, 0x6, 0x6, 0xb2b7905169e345a7, 0xc000600c00, ...)
/go/src/sigs.k8s.io/scheduler-plugins/vendor/k8s.io/kubernetes/pkg/scheduler/profile/profile.go:61 +0x1b5 fp=0xc000713550 sp=0xc0007133e8 pc=0x192c515
k8s.io/kubernetes/pkg/scheduler.(*Configurator).create(0xc000713aa0, 0xc000301920, 0x1eb1481, 0xf)
/go/src/sigs.k8s.io/scheduler-plugins/vendor/k8s.io/kubernetes/pkg/scheduler/factory.go:135 +0xc0c fp=0xc000713800 sp=0xc000713550 pc=0x199f36c
k8s.io/kubernetes/pkg/scheduler.(*Configurator).createFromProvider(0xc000fa7aa0, 0x1eb1481, 0xf, 0xc0003018c0, 0x2, 0x2)
/go/src/sigs.k8s.io/scheduler-plugins/vendor/k8s.io/kubernetes/pkg/scheduler/factory.go:201 +0x23d fp=0xc0007138d8 sp=0xc000713800 pc=0x199febd
k8s.io/kubernetes/pkg/scheduler.New(0x215df80, 0xc000158f20, 0x2159100, 0xc0006d61e0, 0xc00031a330, 0xc000150b40, 0xc000fa7c30, 0x9, 0x9, 0x1c9fce0, ...)
/go/src/sigs.k8s.io/scheduler-plugins/vendor/k8s.io/kubernetes/pkg/scheduler/scheduler.go:237 +0x785 fp=0xc000713b50 sp=0xc0007138d8 pc=0x19a2405
k8s.io/kubernetes/cmd/kube-scheduler/app.Setup(0x2135400, 0xc000a877c0, 0xc000137ba0, 0xc000114e88, 0x1, 0x1, 0x47, 0x47, 0x47, 0x46)
/go/src/sigs.k8s.io/scheduler-plugins/vendor/k8s.io/kubernetes/cmd/kube-scheduler/app/server.go:322 +0x50f fp=0xc000713c88 sp=0xc000713b50 pc=0x19b536f
k8s.io/kubernetes/cmd/kube-scheduler/app.runCommand(0xc0004238c0, 0xc000137ba0, 0xc000114e88, 0x1, 0x1, 0x0, 0x0)
[...]
References issue here
We could reduce the size with a multi-stage build.
I don't think we are using NFD properties (and there has been considerable work on them in the last few years), so I think what might make sense now is to remove them from our parsing, and add them back strategically for specific things that we want/need. With the changes to NFD, if we leave them in now and someone is using NFD, we could hit a bug or it wouldn't do anything, and I want to make sure we are adding metadata / nodes knowing what the outcome / scenarios should be.
It looks like our graph doesn't have sockets or racks either - so removing those until we need them too.
This is a small list of features / notes we discussed that we want to pick up on after the refactor is finished. Please feel free to add to this list - I didn't properly capture the discussion from yesterday.
- KubeSchedulerConfiguration. See the picture at the top here: https://kubernetes.io/docs/concepts/scheduling-eviction/scheduling-framework/#interfaces. I think others might already understand this, but I want to get a cluster running and see (however much I can) for myself what is happening (and then think about how those things work together, etc.)
- TopologySpreadConstraint is relevant: https://kubernetes.io/docs/concepts/scheduling-eviction/topology-spread-constraints/

Desired:
Newer versions of the Kubernetes API have support for a Pending state that would skip the backoff queue, and I think this might be helpful / useful. I did (stupidly) try it this weekend (updating library versions) and it led to like, 5 hours of debugging mysterious panics, so yeah, not something we should do soon/first with all the current debugging we still need to do! But I wanted to put a note because I saw it was added recently.
The "utils" package of fluence has a lot of useful stuff, and I can never find it when I'm looking for it! Let's plan to rename it, or move functions under the packages where they are most suited. I think some functions can go alongside jgf generation or the graph, and if anything is left over that still smells like "utils" we can leave it there.
As discussed during our meeting on Feb 3rd, we have two main options for building the Fluxion Scheduler Plugin. We would need to start a discussion about the pros/cons of the two options, so that we can make a decision ASAP.
Initial metrics we can consider are:
Regardless of the option we go for, there are certain tasks we have to address that we identified during our discussion: