Comments (10)
pod-reaper grabs all pods that match based on REQUIRE_LABEL_KEY/EXCLUDE_LABEL_KEY
and rapidly (without any delay) assesses whether or not each pod should be terminated. With the CHAOS_CHANCE
flag set and pod-reaper able to look at every pod in a replicaset (or deployment/daemonset), there isn't currently a way to ensure at least one of those pods stays running. Basically, the current behavior always carries a ($CHAOS_CHANCE)^(REPLICA_COUNT)
chance of having all pods in a replicaset terminated by pod-reaper in a given reaping cycle.
The approach I've taken has been to calculate this chance, observe the effects in a testing environment, and do a quick cost/benefit analysis relative to my product's error budget. For systems with extremely tight service level objectives, this usually meant running a high enough replica count that the risks of a cluster-wide outage heavily outweighed the chances of pod-reaper (or something else) knocking out all pods in a replicaset at the same time.
That being said, this is something that's come up a couple of times now. If having the extra layer of safety means the difference between trying pod-reaper out and skipping over it, then maybe it's time for me to implement it :)
from pod-reaper.
> Before deleting pods, we compare the values of each owner of the pod to a configurable value.
Such a value is already a standard concept in Kubernetes: pod disruption budgets. Users and administrators should already be setting them, because they are used e.g. when machines are drained for any reason or the scheduler triggers some priority-based preemption. The above sounds a lot like reinventing PDBs.
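For anyone unfamiliar with them, a minimal policy/v1 PDB looks like the manifest below; the name and label selector are illustrative, not taken from this repo:

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: dummies-pdb
spec:
  # keep at least one matching pod available during voluntary disruptions
  minAvailable: 1
  selector:
    matchLabels:
      app: dummies
```

With this in place, the API server itself refuses evictions that would drop the selected pods below the floor.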
You can take advantage of them through the pods' eviction API: https://kubernetes.io/docs/tasks/administer-cluster/safely-drain-node/#eviction-api
Now the question is: would you only use evictions? Or evictions, followed by deletes if they haven't succeeded after some time? Or maybe leave that for the admin to choose for each reaper instance, given that services should be able to survive drains, preemptions AND sudden deaths? @gilgameshskytrooper would only set up evictions, in this case.
I hope the above helps and spares you from writing a lot more code :)
This would be really helpful for me as well - the non-zero chance of all replicas being terminated at the same time is a problem for us (even if it is a low chance). It would be especially useful when there is a low number of replicas.
I'm going to block some time to take a deeper look at what an implementation of this would look like.
Inevitably, it will need to involve the reaper being aware of the constructs managing replicasets (or deployments/daemonSets), which doesn't immediately make me feel great about the idea. But if having a safety mechanism helps, it's absolutely worth the look!
This would be very cool. Thanks peeps
Worked on a prototype when I got some time this weekend. Sample output log below:
The way I went about this was to make use of each pod's OwnerReferences. Basically, we keep track of a map of these owners to a pod count. Before deleting pods, we compare the values of each owner of the pod to a configurable value. If it's too low, we don't delete. I've got a ways to go on testing and such, but this looks like an approach that should work without requiring the reaper to be directly aware of deployments/replicasets/daemonsets (it does still need to know about them through OwnerReference.Kind and OwnerReference.UID -- but that doesn't take custom code for each different type).
{"level":"info","msg":"pod flagged as unsafe for delete by minimum replicas DaemonSet/kindnet","pod":"kindnet-zhhw6","reasons":["was flagged for chaos"],"time":"2020-12-27T23:44:28Z"}
{"level":"info","msg":"reaping pod","pod":"kube-apiserver-kind-control-plane","reasons":["was flagged for chaos"],"time":"2020-12-27T23:44:28Z"}
{"level":"info","msg":"reaping pod","pod":"kube-controller-manager-kind-control-plane","reasons":["was flagged for chaos"],"time":"2020-12-27T23:44:28Z"}
{"level":"info","msg":"pod flagged as unsafe for delete by minimum replicas DaemonSet/kube-proxy","pod":"kube-proxy-xw5rv","reasons":["was flagged for chaos"],"time":"2020-12-27T23:44:28Z"}
{"level":"info","msg":"pod flagged as unsafe for delete by minimum replicas Node/kind-control-plane","pod":"kube-scheduler-kind-control-plane","reasons":["was flagged for chaos"],"time":"2020-12-27T23:44:28Z"}
{"level":"info","msg":"pod flagged as unsafe for delete by minimum replicas ReplicaSet/local-path-provisioner-7745554f7f","pod":"local-path-provisioner-7745554f7f-8t7r2","reasons":["was flagged for chaos"],"time":"2020-12-27T23:44:28Z"}
{"level":"info","msg":"pod flagged as unsafe for delete by minimum replicas ReplicaSet/pod-reaper-76cbbf8d87","pod":"pod-reaper-76cbbf8d87-mf9vl","reasons":["was flagged for chaos"],"time":"2020-12-27T23:44:28Z"}
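The owner-counting idea described above can be sketched in a few lines; this is a simplified stand-in, not the actual prototype code, and the `ownerKey` type is a hypothetical condensation of OwnerReference.Kind and OwnerReference.UID:

```go
package main

import "fmt"

// ownerKey identifies a pod's controller; in the real reaper this would
// come from OwnerReference.Kind and OwnerReference.UID.
type ownerKey struct {
	Kind string
	UID  string
}

type pod struct {
	Name  string
	Owner ownerKey
}

// reapable walks candidate pods in order and vetoes any delete that would
// drop an owner's remaining pod count below minReplicas.
func reapable(pods []pod, minReplicas int) []string {
	remaining := map[ownerKey]int{}
	for _, p := range pods {
		remaining[p.Owner]++
	}
	var kill []string
	for _, p := range pods {
		if remaining[p.Owner]-1 < minReplicas {
			continue // unsafe: owner would fall below the floor
		}
		remaining[p.Owner]--
		kill = append(kill, p.Name)
	}
	return kill
}

func main() {
	owner := ownerKey{Kind: "ReplicaSet", UID: "rs-1"}
	pods := []pod{{"dummies-a", owner}, {"dummies-b", owner}, {"dummies-c", owner}}
	// With a floor of 1, two pods are reaped and the last one survives.
	fmt.Println(reapable(pods, 1)) // [dummies-a dummies-b]
}
```

Because the check only needs Kind and UID, it works the same for ReplicaSets, DaemonSets, or any other controller without type-specific code, which matches the mix of owners in the sample log above.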
Here's an example of it keeping one replica alive and killing the others:
reaper dummies-65876d98cb-8zfzb 1/1 Running 0 14s
reaper dummies-65876d98cb-s8sx6 1/1 Running 0 14s
reaper dummies-65876d98cb-wfxgh 1/1 Running 0 14s
reaper dummies-65876d98cb-wh7vl 1/1 Running 0 109s
reaper pod-reaper-76cbbf8d87-ppw4w 1/1 Running 0 109s
There's going to be either some form of chance or some form of order to which pod gets saved. The API seems to return pods in alphabetical order, and the safety check allows pod deletes until it encounters a violation of minimum replicas, so it favors keeping the alphabetically last pods.
Another thing to perhaps consider -- this code does not look at the health of the pods (Running in the case above). That might be something to add before I go much further.
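A health check could slot into the same counting step: only pods whose phase is Running would count toward an owner's total, so crashed or pending pods don't satisfy the minimum. This is a hypothetical extension, not current pod-reaper behavior, and `podInfo` is an illustrative type:

```go
package main

import "fmt"

type podInfo struct {
	Name  string
	Phase string // e.g. "Running", "Pending", "Failed"
}

// healthyCount counts only Running pods toward an owner's replica total,
// so an unhealthy pod can't be the one replica the safety check relies on.
func healthyCount(pods []podInfo) int {
	n := 0
	for _, p := range pods {
		if p.Phase == "Running" {
			n++
		}
	}
	return n
}

func main() {
	pods := []podInfo{{"a", "Running"}, {"b", "Pending"}, {"c", "Running"}}
	fmt.Println(healthyCount(pods)) // 2
}
```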
Created #68 which uses the eviction API to delete pods, behind an optional feature gate
closing this as I think it's been resolved. Please feel free to reopen if there's anything else!