pod-reaper's Introduction

pod-reaper: kills pods dead


A rules-based pod-killing container. Pod-Reaper was designed to kill pods that meet specific conditions. See the "Implemented Rules" section below for details on specific rules.

Configuring Pod Reaper

Pod-Reaper is configurable through environment variables. The pod-reaper specific environment variables are:

  • NAMESPACE the kubernetes namespace where pod-reaper should look for pods
  • GRACE_PERIOD duration that pods should be given to shut down before hard killing the pod
  • SCHEDULE schedule for when pod-reaper should look for pods to reap
  • RUN_DURATION how long pod-reaper should run before exiting
  • EVICT try to evict pods instead of deleting them
  • EXCLUDE_LABEL_KEY pod metadata label key (of a key-value pair) that pod-reaper should exclude
  • EXCLUDE_LABEL_VALUES comma-separated list of metadata label values (of a key-value pair) that pod-reaper should exclude
  • REQUIRE_LABEL_KEY pod metadata label key (of a key-value pair) that pod-reaper should require
  • REQUIRE_LABEL_VALUES comma-separated list of metadata label values (of a key-value pair) that pod-reaper should require
  • REQUIRE_ANNOTATION_KEY pod metadata annotation key (of a key-value pair) that pod-reaper should require
  • REQUIRE_ANNOTATION_VALUES comma-separated list of metadata annotation values (of a key-value pair) that pod-reaper should require
  • DRY_RUN log pod-reaper's actions but don't actually kill any pods
  • MAX_PODS kill a maximum number of pods on each run
  • POD_SORTING_STRATEGY sorts pods before killing them (most useful when used with MAX_PODS)
  • LOG_LEVEL control verbosity level of log messages
  • LOG_FORMAT choose between several formats of logging

Additionally, at least one rule must be enabled, or the pod-reaper will error and exit. See the Rules section below for configuring and enabling rules.

Example environment variables:

# pod-reaper configuration
NAMESPACE=test
SCHEDULE=@every 30s
RUN_DURATION=15m
EXCLUDE_LABEL_KEY=pod-reaper
EXCLUDE_LABEL_VALUES=disabled,false

# enable at least one rule
CHAOS_CHANCE=.001

NAMESPACE

Default value: "" (which will look at ALL namespaces)

Controls which kubernetes namespace is in scope for the pod-reaper. Note that the pod-reaper uses an InClusterConfig, which makes use of the service account that kubernetes gives to its pods. Only pods (and namespaces) accessible to this service account will be visible to the pod-reaper.

GRACE_PERIOD

Default value: nil (indicates that the default grace period specified for the pod should be used)

Controls the grace period between a soft pod termination and a hard termination. This determines the time between when the pod's containers are sent a SIGTERM signal and when they are sent a SIGKILL signal. The format follows the Go time.Duration format (example: "1h15m30s"). A duration of 0s can be considered a hard kill of the pod.

SCHEDULE

Default value: "@every 1m"

Controls how frequently pod-reaper queries kubernetes for pods. The format follows the upstream cron library https://godoc.org/github.com/robfig/cron. For most use cases, the interval format @every 1h2m3s is sufficient, but more complex use cases can make use of the * * * * * notation. The cron parser used can optionally support seconds if a sixth field is added: 12 * * * * *, for example, will run on the 12th second of every minute.
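For example, both of the following are valid schedules:

# interval format: look for pods every hour
SCHEDULE=@every 1h

# six-field cron format: run on the 12th second of every minute
SCHEDULE=12 * * * * *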

RUN_DURATION

Default value: "0s" (which corresponds to running indefinitely)

Controls the minimum duration that pod-reaper will run before intentionally exiting. The value "0s" (or anything equivalent, such as the empty string) will be interpreted as an indefinite run duration. The format follows the Go time.Duration format (example: "1h15m30s"). Pod-Reaper will not wait for an in-progress reap cycle to finish: it exits immediately (with exit code 0) once the duration has elapsed.

Warnings about RUN_DURATION

  • pod-rescheduling: if the reaper completes, even successfully, it may be restarted depending on the pod spec.
  • self-reaping: the pod-reaper can reap itself if configured to do so, which can cause the reaper to not run for the expected duration.

Recommendations:

One time run:

  • create a pod spec and apply it to kubernetes
  • make sure the pod spec has restartPolicy: Never
  • add an exclusion key and values using EXCLUDE_LABEL_KEY and EXCLUDE_LABEL_VALUES
  • give the reaper's own pod a label matching that exclusion to prevent it from reaping itself (see the sketch below)
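A minimal sketch of a one-time-run pod spec following these recommendations (the pod name, the chaos rule, and the service account name are illustrative; a service account with list/delete permissions is still required, see "Service Accounts" below):

apiVersion: v1
kind: Pod
metadata:
  name: pod-reaper-once            # illustrative name
  labels:
    pod-reaper: disabled           # matches the exclusion below, so the reaper skips itself
spec:
  restartPolicy: Never             # do not restart once RUN_DURATION elapses
  serviceAccount: pod-reaper-service-account
  containers:
    - name: pod-reaper
      image: target/pod-reaper
      env:
        - name: RUN_DURATION
          value: 15m
        - name: EXCLUDE_LABEL_KEY
          value: pod-reaper
        - name: EXCLUDE_LABEL_VALUES
          value: disabled
        - name: CHAOS_CHANCE       # at least one rule must be enabled
          value: ".5"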

Sustained running:

  • do not use RUN_DURATION
  • manage the pod reaper via a deployment

EVICT

Use the Eviction API instead of pod deletion when reaping pods. The Eviction API honors the disruption budget assigned to pods, which can be useful, for example, when reaping pods by duration to ensure that you don't reap all the pods of a specific deployment simultaneously and interrupt a published service. When a pod cannot be reaped due to a disruption budget, the reason will be logged as a warning.
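For example, a sketch that evicts (rather than deletes) pods older than two hours, assuming EVICT accepts the same boolean values as DRY_RUN:

# honor disruption budgets by evicting instead of deleting
EVICT=true
MAX_DURATION=2h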

EXCLUDE_LABEL_KEY and EXCLUDE_LABEL_VALUES

These environment variables are used to build a label selector to exclude pods from reaping. The key must be a properly formed kubernetes label key. Values are a comma-separated (without whitespace) list of kubernetes label values. Setting exactly one of the key or values environment variables will result in an error.

A pod will be excluded from reaping if it has a metadata label whose key matches the pod-reaper's exclude label key and whose value is in the pod-reaper's list of excluded label values. This means that exclusion requires both the pod-reaper and the pod to be configured in a compatible way.
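For example, with the configuration from the snippet near the top of this document:

EXCLUDE_LABEL_KEY=pod-reaper
EXCLUDE_LABEL_VALUES=disabled,false

a pod is skipped only if its own metadata carries a matching label, e.g.:

metadata:
  labels:
    pod-reaper: disabled   # excluded from reaping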

REQUIRE_LABEL_KEY and REQUIRE_LABEL_VALUES

These environment variables build a label selector that pods must match in order to be reaped. Use them the same way as you would EXCLUDE_LABEL_KEY and EXCLUDE_LABEL_VALUES.

REQUIRE_ANNOTATION_KEY and REQUIRE_ANNOTATION_VALUES

These environment variables build an annotation selector that pods must match in order to be reaped. Use them the same way as you would EXCLUDE_LABEL_KEY and EXCLUDE_LABEL_VALUES.

DRY_RUN

Default value: unset (which will behave as if it were set to "false")

Acceptable values are 1, t, T, TRUE, true, True, 0, f, F, FALSE, false, False. Any other values will error. If the provided value is one of the "true" values, pod-reaper will still select pods for reaping but will not actually kill any pods. Logging messages will reflect that a pod was selected for reaping and that the pod was not killed because the reaper is in dry-run mode.

MAX_PODS

Default value: unset (which will behave as if it were set to "0")

Acceptable values are positive integers. Negative integers will evaluate to 0, and any other values will error. This can be useful to prevent too many pods from being killed in one run. Logging messages will reflect that a pod was selected for reaping and that the pod was not killed because too many pods were reaped already.

POD_SORTING_STRATEGY

Default value: unset (which will use the pod ordering returned by the API server, with no particular order guaranteed). Accepted values:

  • (unset) - use the default ordering from the API server
  • random (case-sensitive) will randomly shuffle the list of pods before killing
  • oldest-first (case-sensitive) will sort pods oldest-first based on the pod's start time (!! warning below)
  • youngest-first (case-sensitive) will sort pods youngest-first based on the pod's start time (!! warning below)
  • pod-deletion-cost (case-sensitive) will sort pods based on the pod deletion cost annotation.
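For example, to reap at most two of the oldest matching long-running pods on each run:

# each run, reap no more than 2 pods, oldest first
POD_SORTING_STRATEGY=oldest-first
MAX_PODS=2
MAX_DURATION=2h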

!! WARNINGS !!

Pod start time is not always defined. In these cases, sorting strategies based on age put pods without start times at the end of the list. From my experience, this usually happens during a race condition with the pod initially being scheduled, but there may be other cases hidden away.

Using pod-reaper against the kube-system namespace can have some surprising implications. For example, during testing I found that the kube-scheduler was owned by a master node (not a replicaset/daemonset) and appeared to effectively ignore delete actions. The age returned from kubectl was reset, but the actual pod start time was unaffected. As a result, I found a looping scenario where the kube-scheduler was effectively always the oldest pod.

In examples/pod-sorting-strategy.yml I mitigated this by excluding pods with the label tier: control-plane.

Logging

Pod reaper logs in JSON format using logrus (https://github.com/sirupsen/logrus).

  • rule load: custom messages for each rule are logged when the pod-reaper is starting
  • reap cycle: a message is logged each time the reaper starts a cycle
  • pod reap: a message is logged (with a reason from each rule) when a pod is flagged for reaping
  • exit: a message is logged when the reaper exits successfully (only if RUN_DURATION is specified)

LOG_LEVEL

Default value: Info

Messages at this level and above will be logged. Available logging levels: Debug, Info, Warning, Error, Fatal, and Panic.

Example Log

{"level":"info","msg":"loaded rule: chaos chance .3","time":"2017-10-18T17:09:25Z"}
{"level":"info","msg":"loaded rule: maximum run duration 2m","time":"2017-10-18T17:09:25Z"}
{"level":"info","msg":"executing reap cycle","time":"2017-10-18T17:09:55Z"}
{"level":"info","msg":"reaping pod","pod":"hello-cloud-deployment-3026746346-bj65k","reasons":["was flagged for chaos","has been running for 3m6.257891269s"],"time":"2017-10-18T17:09:55Z"}
{"level":"info","msg":"reaping pod","pod":"example-pod-deployment-125971999cgsws","reasons":["was flagged for chaos","has been running for 2m55.269615797s"],"time":"2017-10-18T17:09:55Z"}
{"level":"info","msg":"executing reap cycle","time":"2017-10-18T17:10:25Z"}
{"level":"info","msg":"reaping pod","pod":"hello-cloud-deployment-3026746346-grw12","reasons":["was flagged for chaos","has been running for 3m36.054164005s"],"time":"2017-10-18T17:10:25Z"}
{"level":"info","msg":"pod reaper is exiting","time":"2017-10-18T17:10:46Z"}

LOG_FORMAT

Default value: Logrus

This environment variable modifies the structured log format for easy ingestion into different logging systems, including Stackdriver via the Fluentd format. Available formats: Logrus, Fluentd
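For example, to get verbose output in a Stackdriver-friendly format:

LOG_LEVEL=Debug
LOG_FORMAT=Fluentd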

Implemented Rules

CHAOS_CHANCE

Flags a pod for reaping based on a random number generator.

Enabled and configured by setting the environment variable CHAOS_CHANCE with a floating point value. A random number generator will generate a value in the range [0,1), and if the generated value is below the configured chaos chance, the pod will be flagged for reaping.

Example:

# every 30 seconds kill 1/100 pods found (based on random chance)
SCHEDULE=@every 30s
CHAOS_CHANCE=.01

Remember that pods can be excluded from reaping if the pod has a label matching the pod-reaper's configuration. See the EXCLUDE_LABEL_KEY and EXCLUDE_LABEL_VALUES section above for more details.

CONTAINER_STATUSES

Flags a pod for reaping based on a container within a pod having a specific container status.

Enabled and configured by setting the environment variable CONTAINER_STATUSES with a comma-separated list (no whitespace) of statuses. If a pod is in either a waiting or terminated state with a status in the specified list, the pod will be flagged for reaping.

Example:

# every 10 minutes, kill all pods with a container with a status ImagePullBackOff, ErrImagePull, or Error
SCHEDULE=@every 10m
CONTAINER_STATUSES=ImagePullBackOff,ErrImagePull,Error

Note that this will not catch statuses that are describing the entire pod like the Evicted status.

POD_STATUSES

Flags a pod for reaping based on the pod status.

Enabled and configured by setting the environment variable POD_STATUSES with a comma-separated list (no whitespace) of statuses. If the pod status is in the specified list, the pod will be flagged for reaping.

Example:

# every 10 minutes, kill all pods with status Evicted or Unknown
SCHEDULE=@every 10m
POD_STATUSES=Evicted,Unknown

Note that pod status is different from container statuses: it checks the status of the overall pod rather than the status of containers in the pod. The most obvious use case for this is dealing with Evicted pods.

MAX_DURATION

Flags a pod for reaping based on the pod's current run duration.

Enabled and configured by setting the environment variable MAX_DURATION with a valid Go time.Duration format (example: "1h15m30s"). If a pod has been running longer than the specified duration, the pod will be flagged for reaping.
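Example:

# every hour, kill all pods that have been running for more than 2 hours
SCHEDULE=@every 1h
MAX_DURATION=2h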

UNREADY

Flags a pod for reaping based on the time the pod has been unready.

Enabled and configured by setting the environment variable MAX_UNREADY with a valid Go time.Duration format (example: "10m"). If a pod has been unready longer than the specified duration, the pod will be flagged for reaping.
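Example:

# every minute, kill all pods that have been unready for more than 10 minutes
SCHEDULE=@every 1m
MAX_UNREADY=10m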

Running Pod-Reapers

Service Accounts

Pod reaper uses the permissions of the pod's service account to list and delete pods. Unless specified, the service account used will be the default service account in the pod's namespace. By default, and in most cases, the default service account will not have the necessary permissions to list and delete pods.

  • Cluster Wide Permissions: example
  • Namespace Specific Permissions: example
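For reference, a minimal cluster-wide sketch (adapted from the deployment manifest in the issues below and updated to the v1 RBAC API; the names are illustrative):

apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: pod-reaper-cluster-role
rules:
  - apiGroups: [""]
    resources: ["pods"]
    verbs: ["list", "delete"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: pod-reaper-role-binding
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: pod-reaper-cluster-role
subjects:
  - kind: ServiceAccount
    name: pod-reaper-service-account
    namespace: reaper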

Combining Rules

A pod will only be reaped if ALL rules flag the pod for reaping, but you can achieve reaping on OR logic by simply running another pod-reaper.

For example, in the same pod-reaper container:

CHAOS_CHANCE=.01
MAX_DURATION=2h

Means that 1/100 of the pods that also have a run duration of over 2 hours will be reaped. If you want 1/100 pods reaped regardless of duration, and also want all pods with a run duration of over 2 hours to be reaped, run two pod-reapers: one with CHAOS_CHANCE=.01 and another with MAX_DURATION=2h.
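Concretely, the OR behavior requires two separate reaper configurations:

# reaper 1: reap 1/100 of matching pods, regardless of age
CHAOS_CHANCE=.01

# reaper 2: reap all pods running longer than 2 hours
MAX_DURATION=2h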

Deployments

Multiple pod-reapers can be easily managed and configured with kubernetes deployments. If you are using deployments, it is encouraged that you leave the RUN_DURATION environment variable unset (or "0s") to let the reaper run forever, since the deployment will reschedule it anyway. Note that the pod-reaper can and will reap itself if it is not excluded; a sketch follows below.
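A sketch of such a deployment (the names and the MAX_DURATION rule are illustrative), with the reaper's own pod labeled to match its exclusion settings:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: pod-reaper
spec:
  replicas: 1
  selector:
    matchLabels:
      app: pod-reaper
  template:
    metadata:
      labels:
        app: pod-reaper
        pod-reaper: disabled          # matches the exclusion below, so the reaper skips itself
    spec:
      serviceAccount: pod-reaper-service-account
      containers:
        - name: pod-reaper
          image: target/pod-reaper
          env:
            - name: EXCLUDE_LABEL_KEY
              value: pod-reaper
            - name: EXCLUDE_LABEL_VALUES
              value: disabled
            - name: MAX_DURATION      # at least one rule must be enabled
              value: 2h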

One Time Runs

You can run pod-reaper as a one-time, limited-duration container by using the RUN_DURATION environment variable. An example use case might be wanting to introduce a high degree of chaos into your kubernetes environment for a short duration:

# 30% chaos chance every 1 minute for 15 minutes
SCHEDULE=@every 1m
RUN_DURATION=15m
CHAOS_CHANCE=.3

pod-reaper's People

Contributors

bewing, bravecorvus, brianberzins, c-apetrei, gregorydosh, hblanks, hlascelles, isker, jdharmon, jnavarro86, jordansussman, matttattoli, mrwako, ojagodzinski, pmotch, shape-klug, skoef, slushpupie, zahradtj


pod-reaper's Issues

Option to prevent pod-reaper from killing all pods in a replica set at the same time

This is one that I have discussed with a few people in person. On one hand: it would allow for a safer learning curve: particularly for things that are clustered.

A couple of options:

  1. optional variable to prevent more than n pods for any one replica set from being deleted at one time.
  2. optional variable to ensure that pod reaper does not kill pods to reduce a replica set below n pods.

There is some fair discussion about whether or not this is a feature pod-reaper should have. I would like to avoid letting people hide problems with this option. For example, if you're running a single pod and we're using option 2, then pod-reaper would effectively be whitelisting that pod. Another example, for option 1: a small n value doesn't necessarily provide much value to large replica sets.

Standard Working Hours

There has been a request for a rule that could be used to control when pods are reaped relative to the time of the day/week.

Use case: I only want to clean stuff up after standard working hours.
Use case: I want to periodically kill pods, but only when most people are in the office.

MAX_DURATION option does not count the Pod Status Start time

The MAX_DURATION option does not count from the Pod Status start time but instead from the Pod start time.

I deployed a pod with an entry point that Evicts it after 10 minutes using the command below:
sleep 600; apt update; apt install curl -y; while true; do curl http://some.url --output some.file; done

The pod reaper is configured with a MAX_DURATION of 5 minutes, POD_STATUSES with Evicted, and running every 1 minute.


I was expecting to see the pod-reaper reap the Evicted pod at minute 15 of the pod's life, but instead the pod was reaped right away at minute 11.

I took a look at the code, and it is using the pod status start time, but it looks like it is getting the first status start time and not the one the pod-reaper is configured for.
https://github.com/target/pod-reaper/blob/master/rules/duration.go#L33

panic with pod not being found:

panic: pods "hello-cloud-deployment-4100001433-scb8x" not found

goroutine 1 [running]:
panic(0x1275080, 0xc420319880)
	/usr/local/Cellar/go/1.7.5/libexec/src/runtime/panic.go:500 +0x1a1
main.reap(0x1bf08eb000, 0x989680, 0xc4204325c0, 0x2, 0x2, 0x13a7d7b, 0xa, 0x13a598e, 0x8, 0xc42001204a, ...)
	/Users/z001kkm/code/go/src/pod-reaper/main.go:52 +0x319
main.main()
	/Users/z001kkm/code/go/src/pod-reaper/main.go:102 +0x98

Thrown by this line:

err := clientSet.Core().Pods(pod.ObjectMeta.Namespace).Delete(pod.ObjectMeta.Name, nil)

This shouldn't be a panic: the pod might have been deleted by something else, or some other event might have happened. If the pod isn't found or the delete fails, we should probably just log it and continue happily.

Allow default configuration override with annotations

It would be nice to be able to override default pod-reaper settings with annotations.

For example if MAX_DURATION=1d, but a pod had the annotation pod-reaper/maxduration: 12h, then that pod would be reaped in 12 hours.

Pod status rule is misleading

Kubectl shows the pod phase as the status, e.g. Running, Succeeded, Failed. The pod status rule checks the optional reason, e.g. Evicted. This is confusing, and also means you cannot create a POD_STATUS=Running rule. Can we change the pod status rule to use phase? Should the current reason rule be renamed/use a different env variable? POD_STATUS_REASON?

Relevant documentation:

	// The phase of a Pod is a simple, high-level summary of where the Pod is in its lifecycle.
	// The conditions array, the reason and message fields, and the individual container status
	// arrays contain more detail about the pod's status.
	// There are five possible phase values:
	//
	// Pending: The pod has been accepted by the Kubernetes system, but one or more of the
	// container images has not been created. This includes time before being scheduled as
	// well as time spent downloading images over the network, which could take a while.
	// Running: The pod has been bound to a node, and all of the containers have been created.
	// At least one container is still running, or is in the process of starting or restarting.
	// Succeeded: All containers in the pod have terminated in success, and will not be restarted.
	// Failed: All containers in the pod have terminated, and at least one container has
	// terminated in failure. The container either exited with non-zero status or was terminated
	// by the system.
	// Unknown: For some reason the state of the pod could not be obtained, typically due to an
	// error in communicating with the host of the pod.
	//
	// More info: https://kubernetes.io/docs/concepts/workloads/pods/pod-lifecycle#pod-phase
	// +optional
	Phase PodPhase `json:"phase,omitempty" protobuf:"bytes,1,opt,name=phase,casttype=PodPhase"`
	// A brief CamelCase message indicating details about why the pod is in this state.
	// e.g. 'Evicted'
	// +optional
	Reason string `json:"reason,omitempty" protobuf:"bytes,4,opt,name=reason"`

Split the helm chart into a new repo or Equate the the Helm chart version with the pod-reaper version

Hello, I'm configuring the pod-reaper helm chart with Renovate bot, and I noticed that the helm chart version is not the same as the pod-reaper version.
I can see 2 strategies for this:
1 - Separate the Helm chart code into a new repo and maintain it there (that would be better IMHO)
2 - Equate the Helm chart version to match the pod-reaper version, so I could create a new MR with a real tagged version.

Pod reaping strategy

Hello!

Thank you for pod-reaper!

One feature we would find useful is to have the ability to decide which pods should be reaped when max_pods is defined. A few strategies that come to mind:

  • Random: any pod that matches the defined rules gets terminated.
  • Age based: sort matching pods by age and terminate the youngest/oldest.
  • Annotation based: Based on the pod-deletion-cost annotation, or any other annotation that could be configured. In this case an integer value would be set on the pods and we would sort based on the same principles used for pod-deletion-cost.

The current behavior is random based on observations.

Configure logging via environment variable.

Logging level should be configurable from environment variables. It's possible that this is already done by the logrus library, in which case the only changes needed would be to documentation.

Improve dry run log accuracy

While testing pod-reaper in dry run, one issue we observed is that, when numerous pods match the defined rules and MAX_PODS is defined, all the matching pods are marked as pod would be reaped but pod-reaper is in dry-run mode.

Our expectation in this case would have been that we would see MAX_PODS pods marked as can reap, while the remainder would be marked as pod would be reaped but maxPods is exceeded (possibly also indicating pod-reaper is in dry-run mode). This would better reflect the non-dry-run behavior (i.e., reaping at most MAX_PODS pods) and would appear safer if dry run was turned off.

A simple approach to solve this issue would be to log/indicate that we're in dry run mode on start, and keep all subsequent log output just as if it was a live run, simply not executing the reaping process.

Metrics

Current thought would be to allow this to interact with prometheus and/or statsD. Looking for feedback on how people would like metrics to look from the reaper.

Deployment bug

Hi, I am trying to run pod-reaper as a deployment but keep getting this panic during run time of the reaper:

{"error":"no rules were loaded","level":"panic","msg":"error loading options","time":"2020-03-26T04:11:46Z"}
panic: (*logrus.Entry) (0x142fba0,0xc42034f810)

goroutine 1 [running]:
github.com/target/pod-reaper/vendor/github.com/sirupsen/logrus.Entry.log(0xc42004e060, 0xc420211620, 0x0, 0x0, 0x0, 0x0, 0x0
	/go/src/github.com/target/pod-reaper/vendor/github.com/sirupsen/logrus/entry.go:239 +0x350
github.com/target/pod-reaper/vendor/github.com/sirupsen/logrus.(*Entry).Log(0xc42034f7a0, 0xc400000000, 0xc4205f9d30, 0x1, 0
	/go/src/github.com/target/pod-reaper/vendor/github.com/sirupsen/logrus/entry.go:268 +0xc8
github.com/target/pod-reaper/vendor/github.com/sirupsen/logrus.(*Entry).Panic(0xc42034f7a0, 0xc4205f9d30, 0x1, 0x1)
	/go/src/github.com/target/pod-reaper/vendor/github.com/sirupsen/logrus/entry.go:306 +0x55
main.newReaper(0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
	/go/src/github.com/target/pod-reaper/reaper/reaper.go:37 +0x2de
main.main()
	/go/src/github.com/target/pod-reaper/reaper/main.go:22 +0x50

Here is my manifest that includes the resources I am deploying.
apiVersion: v1
kind: Namespace
metadata:
  name: reaper
---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: pod-reaper-service-account
  namespace: reaper
---
kind: ClusterRole
apiVersion: rbac.authorization.k8s.io/v1beta1
metadata:
  name: pod-reaper-cluster-role
rules:
  - apiGroups: [""]
    resources: ["pods"]
    verbs: ["list", "delete"]
---
apiVersion: rbac.authorization.k8s.io/v1beta1
kind: ClusterRoleBinding
metadata:
  name: pod-reaper-role-binding
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: pod-reaper-cluster-role
subjects:
  - kind: ServiceAccount
    name: pod-reaper-service-account
    namespace: reaper
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: pod-reaper
  namespace: reaper
spec:
  replicas: 1
  selector:
    matchLabels:
      app: pod-reaper
  template:
    metadata:
      labels:
        app: pod-reaper
        pod-reaper: disabled
    spec:
      serviceAccount: pod-reaper-service-account
      containers:
        - name: airflow-scheduler-terminator
          image: target/pod-reaper
          resources:
            limits:
              cpu: 30m
              memory: 30Mi
            requests:
              cpu: 20m
              memory: 20Mi
          env:
            - name: NAMESPACE
              value: dataloader-airflow-blue
            - name: SCHEDULE
              value: "@every 15m"
            - name: REQUIRE_LABEL_KEY
              value: component
            - name: REQUIRE_LABEL_VALUES
              value: scheduler
Thanks in advance for a great tool.

Easier local development

Working on #42 and running some local testing of #40 made me think more about my local development. I found a lot of quick success playing around with KinD (kubernetes in docker) https://github.com/kubernetes-sigs/kind

I know that I probably overdo local testing on pod-reaper because I want to make sure that something capable of killing every pod in a cluster is functioning like I want. As part of that, I want to make sure that I, and anyone else, have a quick and easy way to try out changes locally without throwing potentially dangerous prototype versions out into non-local docker repositories.

pod reaper health checks

If a pod-reaper is being managed with a deployment, how can we implement health checks against it?

Logging updates

Requests that have been made for logging

  • messages when pods are polled
  • json logging for easy automated ingestion
  • timestamps

In nonprod, reduce resources: reap pods/apps so they don't consume resources on weekends

How to configure this tool for this use case:

  1. User deploys an app.
  2. A central db stores a cron or an uptime schedule, e.g., the app should be up on weekdays, 9-5, US/EST.
  3. Can pod-reaper reap all app resources that are outside the uptime schedule? Can you please provide an example config?
  4. Input for pod-reaper: a list of apps/labels that need to be scaled down to 0.

Docker builds no longer happening automatically

@slushpupie
With recent changes to docker's licensing, the automated builds are now a paid feature.
It doesn't look like I have permissions to upload new builds myself, and I was previously relying on the automated builds.

Do you have thoughts on this?
I'm pretty upset with Docker's license changes overall, but I'll have to find time to pick up podman or an alternative.
Is there a reasonable thing to do in the meantime?

Setup CI/CD outside of docker

Related to #50

After we get end-to-end testing that executes against a cluster, what I've got set up right now for CI isn't going to be good enough. Specifically, it won't handle the end-to-end testing well.

Figure this would be a good time to look into github actions!

Helm chart

Is there a desire for a helm chart for this? Even just a folder in the main repo which can be referred to.

And... has anyone done that work already?

Explicit rule enable

This feature would allow explicitly configuring the pod reaper to only load/respect/look at rules that have been explicitly set.

This feature doesn't really make sense on its own, but the combination of #45 and #44 would greatly benefit from something like this to help keep configuration sane.

Hard kill option!

Currently the pod-reaper terminates "nicely".
Consider an option to hard kill with no SIGTERM... simulating a VERY hard kill

Does pod-reaper act on 1 pod at a time? or all pods simultaneously

I apologize in advance if this was specified in your documentation, but I could not find it in either Github or Docker Hub.

I was wondering if pod-reaper acts on all pods that match based on REQUIRE_LABEL_KEY/EXCLUDE_LABEL_KEY at the same time, or if it iteratively does 1 pod at a time.

This matters to me because I need to ensure that when pod-reaper kills off pods, we have zero downtime. So in a way, I am basically looking for a RollingUpdate + maxUnavailable: 0 option for killing off pods.

I understand I can use CHAOS_CHANCE to try to ensure some pods stay alive. But a rolling strategy for killing off pods would be far more deterministic and predictable.

Please let me know if this is the default implementation, or if there is something I can set to make this happen.

Thank you.

Add a minimum run duration rule

A minimum duration rule could be useful to prevent the pod-reaper from killing any pods that are just starting or have only been alive for a short duration.

Example use case: rolling deployments. Having a pod killed during a rolling deployment isn't necessarily bad, but it could cause undesirable effects in the case of automated canary analysis, where a pod being killed could prevent a move forward towards production through no fault of the pod.

Log messages not parsed by Stackdriver

Firstly, thank you for creating this tool - I really like its simplicity. I've been experimenting with it on GCP, and whilst not really an issue with pod-reaper, the default Logrus structured logs do not get handled well by the fluentd/stackdriver collectors, and all the log messages, regardless of severity, get logged as Errors.

I've put together a small PR to allow for different formatting of the logs, let me know what you think.

Dry run mode

I've started looking at pod-reaper after reaping 10,000 old pods in my cluster (don't ask...). This is probably the first of several feature requests, sorry if they're a bit spammy.

One feature that would make adoption easier and less risky is a dry run mode, where it does all the work but doesn't kill anything, and probably exits right away.

v1beta1 ClusterRole is deprecated in v1.17+, unavailable in v1.22+; Use v1 ClusterRole

To avoid the warning messages below and future blocking issues we need to start using the v1 RBAC API instead.

W1206 09:54:41.302957 6924 warnings.go:70] rbac.authorization.k8s.io/v1beta1 ClusterRole is deprecated in v1.17+, unavailable in v1.22+; use rbac.authorization.k8s.io/v1 ClusterRole
W1206 09:54:41.502812 6924 warnings.go:70] rbac.authorization.k8s.io/v1beta1 ClusterRoleBinding is deprecated in v1.17+, unavailable in v1.22+; use rbac.authorization.k8s.io/v1 ClusterRoleBinding
W1206 09:54:42.119382 6924 warnings.go:70] rbac.authorization.k8s.io/v1beta1 ClusterRole is deprecated in v1.17+, unavailable in v1.22+; use rbac.authorization.k8s.io/v1 ClusterRole
W1206 09:54:42.306488 6924 warnings.go:70] rbac.authorization.k8s.io/v1beta1 ClusterRoleBinding is deprecated in v1.17+, unavailable in v1.22+; use rbac.authorization.k8s.io/v1 ClusterRoleBinding

Run Duration can be misleading

RUN_DURATION is unsafe in the case that pod-reaper is killed. It should be better documented that you should NOT use this configuration option if you are controlling the pod-reaper via a self-healing process (such as a kubernetes deployment), since each time the reaper is restarted it will recalculate the run duration.

This was something that I was "vaguely aware of" when I was writing the feature, as I was imagining two disparate use cases:

  1. a long lived pod-reaper that runs continuously against a kubernetes system
  2. a short lived pod-reaper that runs once and is done

This should really be documented clearly.

Schedule doesn't seem to work correctly

Hi,

First of all, thanks for open sourcing this interesting project.

I was playing with it and found out something odd. However, I am not completely sure if the issue is in your service, or in the cron library that you are using down the line.

I have set up a deployment using the following Schedule option:

- name: SCHEDULE
  value: "0 20 * * *"

My expectation was that pod-reaper would check for pods to 🔥 at 20h every day. However, this is the result I am getting:

{"level":"info","msg":"loaded rule: chaos chance 0.999","time":"2020-03-26T16:04:16Z"}                                                                                    โ”‚
โ”‚ {"level":"info","msg":"reaping pod","pod":"service-2-764cdfbbdd-kft22","reasons":["was flagged for chaos"],"time":"2020-03-26T16:20:00Z"}     โ”‚
โ”‚ {"level":"info","msg":"reaping pod","pod":"service-1-5468d66b7b-wfqs2","reasons":["was flagged for chaos"],"time":"2020-03-26T16:20:00Z"}      โ”‚
โ”‚ {"level":"info","msg":"reaping pod","pod":"service-2-764cdfbbdd-mxcwg","reasons":["was flagged for chaos"],"time":"2020-03-26T17:20:00Z"}     โ”‚
โ”‚ {"level":"info","msg":"reaping pod","pod":"service-1-5468d66b7b-vx4kb","reasons":["was flagged for chaos"],"time":"2020-03-26T17:20:00Z"}      โ”‚
โ”‚ {"level":"info","msg":"reaping pod","pod":"service-2-5468d66b7b-cgq9c","reasons":["was flagged for chaos"],"time":"2020-03-26T18:20:00Z"}     โ”‚
โ”‚ {"level":"info","msg":"reaping pod","pod":"service-1-5468d66b7b-7zlv9","reasons":["was flagged for chaos"],"time":"2020-03-26T18:20:00Z"}      โ”‚
โ”‚ {"level":"info","msg":"reaping pod","pod":"service-2-764cdfbbdd-rkcmh","reasons":["was flagged for chaos"],"time":"2020-03-26T19:20:00Z"}     โ”‚
โ”‚ {"level":"info","msg":"reaping pod","pod":"service-1-5468d66b7b-cgq9c","reasons":["was flagged for chaos"],"time":"2020-03-26T19:20:00Z"}      โ”‚
โ”‚ {"level":"info","msg":"reaping pod","pod":"service-2-asfasfsasf-fsfds","reasons":["was flagged for chaos"],"time":"2020-03-26T20:20:00Z"}     โ”‚
โ”‚ {"level":"info","msg":"reaping pod","pod":"service-1-5468d66b7b-2mcpp","reasons":["was flagged for chaos"],"time":"2020-03-26T20:20:00Z"}

Is there anything that I am doing incorrectly?

Thanks
