
Comments (10)

brianberzins avatar brianberzins commented on May 28, 2024 2

pod-reaper grabs all pods that match based on REQUIRE_LABEL_KEY/EXCLUDE_LABEL_KEY and rapidly (without any delay) assesses whether or not each pod should be terminated. With the CHAOS_CHANCE flag set, and with the reaper able to look at every pod in a replicaset (or deployment/daemonset), there isn't currently a way to ensure at least one of those pods is running. Basically, the current behavior will always lead to a ($CHAOS_CHANCE)^(REPLICA_COUNT) chance of having all pods in a replicaset terminated by pod-reaper on a given reaping cycle.

The approach I've taken has been to calculate this chance, observe the effects in a testing environment, and try to do a quick cost/benefit analysis relative to my product's error budget. For systems with extremely tight service level objectives, this usually meant running a high enough replica count that the risks of a clusterwide outage heavily outweighed the chances of pod-reaper (or something else) knocking out all pods in a replicaset at the same time.

That being said, this is something that's come up a couple of times now. If this is something where having the extra layer of safety means the difference between trying it out, and skipping over, then maybe it's time for me to implement it :)

from pod-reaper.

therc avatar therc commented on May 28, 2024 1

Before deleting pods, we compare the values of each owner of the pod to a configurable value.

Such a value is already a standard concept in Kubernetes: the pod disruption budget (PDB). Users and administrators should already be setting them, because they are used e.g. when machines are drained for any reason or the scheduler triggers some priority-based preemption. The above sounds a lot like reinventing PDBs.

You can take advantage of them through the pods' eviction API: https://kubernetes.io/docs/tasks/administer-cluster/safely-drain-node/#eviction-api
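
As a rough illustration of what an eviction request looks like, the Go sketch below builds the sub-resource path and the JSON body using only the standard library. The type and function names are hypothetical, the pod and namespace are example values, and `policy/v1` is an assumption about the cluster version (older clusters use `policy/v1beta1`). A real client would more likely go through client-go than hand-built requests.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// objectMeta and eviction are hypothetical minimal mirrors of the fields an
// Eviction object needs; they are not the real Kubernetes Go types.
type objectMeta struct {
	Name      string `json:"name"`
	Namespace string `json:"namespace"`
}

type eviction struct {
	APIVersion string     `json:"apiVersion"`
	Kind       string     `json:"kind"`
	Metadata   objectMeta `json:"metadata"`
}

// evictionPath returns the pod's eviction sub-resource path on the API server.
func evictionPath(namespace, pod string) string {
	return fmt.Sprintf("/api/v1/namespaces/%s/pods/%s/eviction", namespace, pod)
}

// evictionBody returns the JSON payload to POST to evictionPath.
// Assumption: the cluster serves policy/v1.
func evictionBody(namespace, pod string) ([]byte, error) {
	return json.Marshal(eviction{
		APIVersion: "policy/v1",
		Kind:       "Eviction",
		Metadata:   objectMeta{Name: pod, Namespace: namespace},
	})
}

func main() {
	body, err := evictionBody("reaper", "example-pod")
	if err != nil {
		panic(err)
	}
	fmt.Println("POST", evictionPath("reaper", "example-pod"))
	fmt.Println(string(body))
}
```

When a pod disruption budget would be violated, the API server rejects the eviction with 429 Too Many Requests, so the caller can retry later or skip the pod instead of deleting it outright.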

Now the question is: would you only use evictions? or maybe evictions, followed by deletes if they haven't succeeded after some time? or maybe leave that for the admin to choose for each reaper instance, given that services should be able to survive drains, preemptions AND sudden deaths? @gilgameshskytrooper would only set up evictions, in this case.

I hope the above helps and spares you from writing a lot more code :)

mpp-petew avatar mpp-petew commented on May 28, 2024

This would be really helpful for me as well - the non-zero chance of all replicas being terminated at the same time is a problem for us (even if it is a low chance). It would be especially useful when there is a low number of replicas.

brianberzins avatar brianberzins commented on May 28, 2024

I'm going to block some time to take a deeper look at what an implementation of this would look like.
Inevitably, it will need to involve the reaper being aware of constructs such as replicasets (or deployments/daemonSets), which doesn't immediately make me feel great about the idea. But if having a safety mechanism helps, it's absolutely worth the look!

gilgameshskytrooper avatar gilgameshskytrooper commented on May 28, 2024

This would be very cool. Thanks peeps

brianberzins avatar brianberzins commented on May 28, 2024

Worked on a prototype when I got some time this weekend. Sample output log below:

The way I went about this was to make use of each pod's OwnerReferences. Basically, we keep track of a map of these owners to a pod count. Before deleting pods, we compare the values of each owner of the pod to a configurable value. If it's too low, we don't delete. I've got a ways to go on testing and such, but this looks like an approach that should work without requiring the reaper to be directly aware of deployments/replicasets/daemonsets (it does still need to know about them through OwnerReference.Kind and OwnerReference.UID -- but that doesn't take custom code for each different type).

{"level":"info","msg":"pod flagged as unsafe for delete by minimum replicas DaemonSet/kindnet","pod":"kindnet-zhhw6","reasons":["was flagged for chaos"],"time":"2020-12-27T23:44:28Z"}
{"level":"info","msg":"reaping pod","pod":"kube-apiserver-kind-control-plane","reasons":["was flagged for chaos"],"time":"2020-12-27T23:44:28Z"}
{"level":"info","msg":"reaping pod","pod":"kube-controller-manager-kind-control-plane","reasons":["was flagged for chaos"],"time":"2020-12-27T23:44:28Z"}
{"level":"info","msg":"pod flagged as unsafe for delete by minimum replicas DaemonSet/kube-proxy","pod":"kube-proxy-xw5rv","reasons":["was flagged for chaos"],"time":"2020-12-27T23:44:28Z"}
{"level":"info","msg":"pod flagged as unsafe for delete by minimum replicas Node/kind-control-plane","pod":"kube-scheduler-kind-control-plane","reasons":["was flagged for chaos"],"time":"2020-12-27T23:44:28Z"}
{"level":"info","msg":"pod flagged as unsafe for delete by minimum replicas ReplicaSet/local-path-provisioner-7745554f7f","pod":"local-path-provisioner-7745554f7f-8t7r2","reasons":["was flagged for chaos"],"time":"2020-12-27T23:44:28Z"}
{"level":"info","msg":"pod flagged as unsafe for delete by minimum replicas ReplicaSet/pod-reaper-76cbbf8d87","pod":"pod-reaper-76cbbf8d87-mf9vl","reasons":["was flagged for chaos"],"time":"2020-12-27T23:44:28Z"}

brianberzins avatar brianberzins commented on May 28, 2024

here's an example of it keeping one replica alive and killing the others.

reaper               dummies-65876d98cb-8zfzb                     1/1     Running   0          14s
reaper               dummies-65876d98cb-s8sx6                     1/1     Running   0          14s
reaper               dummies-65876d98cb-wfxgh                     1/1     Running   0          14s
reaper               dummies-65876d98cb-wh7vl                     1/1     Running   0          109s
reaper               pod-reaper-76cbbf8d87-ppw4w                  1/1     Running   0          109s

brianberzins avatar brianberzins commented on May 28, 2024

There's going to be either some form of chance or some form of order to which pod gets saved. The API seems to be returning pods in alphabetical order, and the safety check allows pod deletes until it encounters a violation of minimum replicas, so it favors alphabetically last pods.

Another thing to perhaps consider -- this code does not look at the health of the pods (Running in the case above). That might be something to add before I go much further.

bewing avatar bewing commented on May 28, 2024

Created #68 which uses the eviction API to delete pods, behind an optional feature gate

brianberzins avatar brianberzins commented on May 28, 2024

closing this as I think it's been resolved. Please feel free to reopen if there's anything else!
