Comments (10)
pod-reaper grabs all pods that match based on REQUIRE_LABEL_KEY/EXCLUDE_LABEL_KEY
and rapidly (without any delay) assesses whether or not each pod should be terminated. With the CHAOS_CHANCE
flag set and pod-reaper able to look at every pod in a replicaset (or deployment/daemonset), there isn't currently a way to ensure at least one of those pods stays running. Basically, the current behavior always carries a ($CHAOS_CHANCE)^(REPLICA_COUNT)
chance of having all pods in a replicaset terminated by pod-reaper in a given reaping cycle.
The approach I've taken has been to calculate this chance, observe the effects in a testing environment, and do a quick cost/benefit analysis relative to my product's error budget. For systems with extremely tight service level objectives, this usually meant running a high enough replica count that the risks of a cluster-wide outage heavily outweighed the chances of pod-reaper (or something else) knocking out all pods in a replicaset at the same time.
That being said, this is something that's come up a couple of times now. If having the extra layer of safety means the difference between trying pod-reaper out and skipping over it, then maybe it's time for me to implement it :)
from pod-reaper.
> Before deleting pods, we compare the values of each owner of the pod to a configurable value.
Such a value is already a standard concept in Kubernetes: pod disruption budgets. Users and administrators should already be setting them, because they are used e.g. when machines are drained for any reason or the scheduler triggers some priority-based preemption. The above sounds a lot like reinventing PDBs.
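For anyone unfamiliar with them, a minimal policy/v1 PDB looks like the manifest below; the name and label selector are illustrative, not taken from this repo:

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: dummies-pdb
spec:
  # keep at least one matching pod available during voluntary disruptions
  minAvailable: 1
  selector:
    matchLabels:
      app: dummies
```

With this in place, the API server itself refuses evictions that would drop the selected pods below the floor.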
You can take advantage of them through the pods' eviction API: https://kubernetes.io/docs/tasks/administer-cluster/safely-drain-node/#eviction-api
Now the question is: would you only use evictions? Or evictions, followed by deletes if they haven't succeeded after some time? Or maybe leave that for the admin to choose for each reaper instance, given that services should be able to survive drains, preemptions AND sudden deaths? @gilgameshskytrooper would only set up evictions, in this case.
I hope the above helps and spares you from writing a lot more code :)
This would be really helpful for me as well - the non-zero chance of all replicas being terminated at the same time is a problem for us (even if it is a low chance). It would be especially useful when there is a low number of replicas.
I'm going to block some time to take a deeper look at what an implementation of this would look like.
Inevitably, it will need to involve the reaper being aware of the constructs managing replicasets (or deployments/daemonSets), which doesn't immediately make me feel great about the idea. But if having a safety mechanism helps, it's absolutely worth the look!
This would be very cool. Thanks peeps
Worked on a prototype when I got some time this weekend. Sample output log below:
The way I went about this was to make use of each pod's OwnerReferences. Basically, we keep track of a map of these owners to a pod count. Before deleting pods, we compare the values of each owner of the pod to a configurable value. If it's too low, we don't delete. I've got a ways to go on testing and such, but this looks like an approach that should work without requiring the reaper to be directly aware of deployments/replicasets/daemonsets (it does still need to know about them through OwnerReference.Kind and OwnerReference.UID -- but that doesn't take custom code for each different type).
{"level":"info","msg":"pod flagged as unsafe for delete by minimum replicas DaemonSet/kindnet","pod":"kindnet-zhhw6","reasons":["was flagged for chaos"],"time":"2020-12-27T23:44:28Z"}
{"level":"info","msg":"reaping pod","pod":"kube-apiserver-kind-control-plane","reasons":["was flagged for chaos"],"time":"2020-12-27T23:44:28Z"}
{"level":"info","msg":"reaping pod","pod":"kube-controller-manager-kind-control-plane","reasons":["was flagged for chaos"],"time":"2020-12-27T23:44:28Z"}
{"level":"info","msg":"pod flagged as unsafe for delete by minimum replicas DaemonSet/kube-proxy","pod":"kube-proxy-xw5rv","reasons":["was flagged for chaos"],"time":"2020-12-27T23:44:28Z"}
{"level":"info","msg":"pod flagged as unsafe for delete by minimum replicas Node/kind-control-plane","pod":"kube-scheduler-kind-control-plane","reasons":["was flagged for chaos"],"time":"2020-12-27T23:44:28Z"}
{"level":"info","msg":"pod flagged as unsafe for delete by minimum replicas ReplicaSet/local-path-provisioner-7745554f7f","pod":"local-path-provisioner-7745554f7f-8t7r2","reasons":["was flagged for chaos"],"time":"2020-12-27T23:44:28Z"}
{"level":"info","msg":"pod flagged as unsafe for delete by minimum replicas ReplicaSet/pod-reaper-76cbbf8d87","pod":"pod-reaper-76cbbf8d87-mf9vl","reasons":["was flagged for chaos"],"time":"2020-12-27T23:44:28Z"}
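The owner-counting idea described above can be sketched in a few lines; this is a simplified stand-in, not the actual prototype code, and the `ownerKey` type is a hypothetical condensation of OwnerReference.Kind and OwnerReference.UID:

```go
package main

import "fmt"

// ownerKey identifies a pod's controller; in the real reaper this would
// come from OwnerReference.Kind and OwnerReference.UID.
type ownerKey struct {
	Kind string
	UID  string
}

type pod struct {
	Name  string
	Owner ownerKey
}

// reapable walks candidate pods in order and vetoes any delete that would
// drop an owner's remaining pod count below minReplicas.
func reapable(pods []pod, minReplicas int) []string {
	remaining := map[ownerKey]int{}
	for _, p := range pods {
		remaining[p.Owner]++
	}
	var kill []string
	for _, p := range pods {
		if remaining[p.Owner]-1 < minReplicas {
			continue // unsafe: owner would fall below the floor
		}
		remaining[p.Owner]--
		kill = append(kill, p.Name)
	}
	return kill
}

func main() {
	owner := ownerKey{Kind: "ReplicaSet", UID: "rs-1"}
	pods := []pod{{"dummies-a", owner}, {"dummies-b", owner}, {"dummies-c", owner}}
	// With a floor of 1, two pods are reaped and the last one survives.
	fmt.Println(reapable(pods, 1)) // [dummies-a dummies-b]
}
```

Because the check only needs Kind and UID, it works the same for ReplicaSets, DaemonSets, or any other controller without type-specific code, which matches the mix of owners in the sample log above.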
Here's an example of it keeping one replica alive and killing the others:
reaper dummies-65876d98cb-8zfzb 1/1 Running 0 14s
reaper dummies-65876d98cb-s8sx6 1/1 Running 0 14s
reaper dummies-65876d98cb-wfxgh 1/1 Running 0 14s
reaper dummies-65876d98cb-wh7vl 1/1 Running 0 109s
reaper pod-reaper-76cbbf8d87-ppw4w 1/1 Running 0 109s
There's going to be either some form of chance or some form of order to which pod gets saved. The API seems to return pods in alphabetical order, and the safety check allows pod deletes until it encounters a violation of minimum replicas, so it favors keeping the alphabetically last pods.
Another thing to perhaps consider -- this code does not look at the health of the pods (Running in the case above). That might be something to add before I go much further.
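A health check could slot into the same counting step: only pods whose phase is Running would count toward an owner's total, so crashed or pending pods don't satisfy the minimum. This is a hypothetical extension, not current pod-reaper behavior, and `podInfo` is an illustrative type:

```go
package main

import "fmt"

type podInfo struct {
	Name  string
	Phase string // e.g. "Running", "Pending", "Failed"
}

// healthyCount counts only Running pods toward an owner's replica total,
// so an unhealthy pod can't be the one replica the safety check relies on.
func healthyCount(pods []podInfo) int {
	n := 0
	for _, p := range pods {
		if p.Phase == "Running" {
			n++
		}
	}
	return n
}

func main() {
	pods := []podInfo{{"a", "Running"}, {"b", "Pending"}, {"c", "Running"}}
	fmt.Println(healthyCount(pods)) // 2
}
```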
Created #68 which uses the eviction API to delete pods, behind an optional feature gate
closing this as I think it's been resolved. Please feel free to reopen if there's anything else!