
keikoproj / governor


A collection of cluster reliability tools for Kubernetes

License: Apache License 2.0

Dockerfile 0.30% Makefile 0.41% Go 99.30%
kubernetes aws kubernetes-node kubernetes-pod kubernetes-tools kubernetes-cluster auto-healing self-healing eks eks-cluster

governor's People

Contributors

backjo, dependabot[bot], eytan-avisror, kianjones4, matt0x6f, pratyushprakash, pyieh, sandeeps83, sbadla1, shailshah9, shaoxt, shreyas-badiger, tekenstam, viveksyngh, zihanjiang96


governor's Issues

Helm chart

Request / Offer

This tool looks great for us. Is there a Helm chart (maybe in a private repo)? If so, can it be made public, and if not, would you like one?

Node Reaper - Drain timeout has no effect

Is this a BUG REPORT or FEATURE REQUEST?:
BUG

What happened:
Node reaper issues a kubectl drain to drain a node before reaping it. This drain command is invoked with a 10 minute timeout. However, the timeout has no effect because the kubectl drain command is invoked with the --force --grace-period=1 options. These options force the pods to be terminated immediately, without giving them enough time to run their PreStop hooks.

What you expected to happen:
Pods should be given some time to terminate.

Pod-Reaper: Delete Completed/Failed pods

We should take the following flags:
--reap-completed => Bool
--reap-completed-after => Float64 (minutes)
--reap-failed => Bool
--reap-failed-after => Float64 (minutes)

Reaping should be determined by diffing the last container's finishedAt timestamp against the corresponding --reap-x-after flag.
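
A minimal sketch of that check, assuming client-go types; the helper name and wiring are illustrative, not existing governor code:

package podreaper

import (
    "time"

    corev1 "k8s.io/api/core/v1"
)

// isReapableAfterFinish reports whether a Completed/Failed pod finished longer
// ago than reapAfterMinutes, judged by the most recently terminated
// container's finishedAt timestamp.
func isReapableAfterFinish(pod *corev1.Pod, reapAfterMinutes float64) bool {
    var lastFinished time.Time
    for _, cs := range pod.Status.ContainerStatuses {
        if cs.State.Terminated != nil && cs.State.Terminated.FinishedAt.Time.After(lastFinished) {
            lastFinished = cs.State.Terminated.FinishedAt.Time
        }
    }
    if lastFinished.IsZero() {
        return false
    }
    return time.Since(lastFinished).Minutes() > reapAfterMinutes
}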

node reaper should clean up in transition bad nodes

Node reaper should clean up in-transition bad nodes that have been stuck for a long time.

In this case, the node was in a bad state for more than 20 hours while the upgrade manager was rotating nodes in the cluster:


  | time="2021-06-29T19:00:15Z" level=info msg="autoscaling-group is in transition, will not reap unjoined-i-029a8***************"
  | time="2021-06-29T19:00:15Z" level=info msg="instance 'i-029a8***************' has been running for 1343.561029 minutes but is not joined to cluster"
  | time="2021-06-29T18:40:09Z" level=info msg="autoscaling-group is in transition, will not reap unjoined-i-029a8***************"
  | time="2021-06-29T18:40:09Z" level=info msg="instance 'i-029a8***************' has been running for 1323.459359 minutes but is not joined to cluster"
  | time="2021-06-29T18:30:17Z" level=info msg="autoscaling-group is in transition, will not reap unjoined-i-029a8***************"
  | time="2021-06-29T18:30:17Z" level=info msg="instance 'i-029a8***************' has been running for 1313.589378 minutes but is not joined to cluster"
  | time="2021-06-29T18:20:17Z" level=info msg="autoscaling-group is in transition, will not reap unjoined-i-029a8***************"
  | time="2021-06-29T18:20:16Z" level=info msg="instance 'i-029a8***************' has been running for 1303.581947 minutes but is not joined to cluster"
  | time="2021-06-29T18:10:12Z" level=info msg="autoscaling-group is in transition, will not reap unjoined-i-029a8***************"
  | time="2021-06-29T18:10:12Z" level=info msg="instance 'i-029a8***************' has been running for 1293.507664 minutes but is not joined to cluster"


Pod-Reaper: Namespace exclusion annotation

We should have an annotation to exclude a namespace from being operated on.

So that if "governor.keikoproj.io/pod-reaper-disabled" is set to "true" (string), we skip any operations in that namespace.
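
A minimal sketch of the exclusion check, assuming client-go (newer versions take a context argument); the helper name is hypothetical:

package podreaper

import (
    "context"

    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    "k8s.io/client-go/kubernetes"
)

const podReaperDisabledAnnotation = "governor.keikoproj.io/pod-reaper-disabled"

// namespaceExcluded reports whether the namespace has opted out of pod-reaper.
func namespaceExcluded(client kubernetes.Interface, namespace string) (bool, error) {
    ns, err := client.CoreV1().Namespaces().Get(context.TODO(), namespace, metav1.GetOptions{})
    if err != nil {
        return false, err
    }
    return ns.Annotations[podReaperDisabledAnnotation] == "true", nil
}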

Migrate to modules

Getting this project onto modules would bring it in line with the current direction of Go.

Randomize the ordering of age-reapable nodes

Is this a BUG REPORT or FEATURE REQUEST?:

FEATURE REQUEST

What happened:

Currently, age-reapable nodes are sorted by age, which always yields the same fixed ordering. Because of this, if the first node has a problem, the other nodes will not be reaped in subsequent runs.

Randomize the order of age-reapable nodes so that if one node has a problem, the nodes after it in the fixed order can still be reaped in the next run.

What you expected to happen:

Randomize the order of age-reapable nodes so that if one node has a problem, the nodes after it can still be reaped in the next run.
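
A minimal sketch of the shuffle, using math/rand (seed the generator on Go versions before 1.20); the function name is illustrative:

package nodereaper

import (
    "math/rand"

    corev1 "k8s.io/api/core/v1"
)

// shuffleReapableNodes randomizes the candidate order so a single problematic
// node cannot block the rest on every run.
func shuffleReapableNodes(nodes []corev1.Node) {
    rand.Shuffle(len(nodes), func(i, j int) {
        nodes[i], nodes[j] = nodes[j], nodes[i]
    })
}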

How to reproduce it (as minimally and precisely as possible):

Anything else we need to know?:

Environment:

  • Kubernetes version:
kubectl version -o yaml

Other debugging information (if applicable):

  • relevant logs:
kubectl logs <governor-pod>

AZ Cordon: Use tagging to restore the same NAT that was excluded

In the case where there are multiple NATs in a zone, and that zone is cordoned, we may restore a different NAT.
We should use tagging or similar approach to persist which NAT was cordoned so that we can restore the same NAT.

Otherwise this may cause issues for NATs that have downstream ACLs bound to the public IP.
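
A minimal sketch of the tagging half, assuming aws-sdk-go v1; the tag key is a placeholder, not an existing governor convention:

package azcordon

import (
    "github.com/aws/aws-sdk-go/aws"
    "github.com/aws/aws-sdk-go/service/ec2"
    "github.com/aws/aws-sdk-go/service/ec2/ec2iface"
)

// tagCordonedNAT marks the NAT gateway that was taken out of service so the
// restore path can find the exact same one later.
func tagCordonedNAT(client ec2iface.EC2API, natGatewayID string) error {
    _, err := client.CreateTags(&ec2.CreateTagsInput{
        Resources: []*string{aws.String(natGatewayID)},
        Tags: []*ec2.Tag{
            {Key: aws.String("governor.keikoproj.io/az-cordoned"), Value: aws.String("true")},
        },
    })
    return err
}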

BDD e2e test

Governor packages should have an end-to-end functional test.
We can have a Travis cron job that runs this test nightly.

test should:

  • setup eks cluster
  • test pod-reaper:
    • create a pod with infinite sleep, terminate it - it is now stuck, let reaper kill it
  • test node-reaper:
    • schedule a pod on a node with hostNetwork and run ip link set dev eth0 down, node will become NotReady, let reaper kill it
    • Test age-reap by setting a low threshold
    • Test flappy node reap by faking events or restarting kubelet on the node

Feature: Terminate EC2 instances that never join the cluster

I've observed this several times: a node fails to join the cluster for some reason (e.g. the EC2 instance landed on a bad physical host), yet the EC2 instance is still hanging around. This is bad because:

  • Autoscaling Group considers it a healthy node
  • Waste of $$

We should consider:

  • scan all cluster scaling groups
  • scan all instances belonging to scaling groups
  • if a matching node is not found AND the EC2 instance has been running for more than N minutes, terminate the instance (see the sketch after this list)
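
A minimal sketch of the terminate step, assuming aws-sdk-go v1; whether an instance is "joined" would come from matching it against the node list, and the threshold value is an assumption:

package nodereaper

import (
    "time"

    "github.com/aws/aws-sdk-go/aws"
    "github.com/aws/aws-sdk-go/service/ec2"
    "github.com/aws/aws-sdk-go/service/ec2/ec2iface"
)

// terminateIfNeverJoined terminates an instance that has been running longer
// than threshold but has no matching Kubernetes node.
func terminateIfNeverJoined(client ec2iface.EC2API, instanceID string, launchTime time.Time, joined bool, threshold time.Duration) error {
    if joined || time.Since(launchTime) < threshold {
        return nil
    }
    _, err := client.TerminateInstances(&ec2.TerminateInstancesInput{
        InstanceIds: []*string{aws.String(instanceID)},
    })
    return err
}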

Question: Option to configure node-reaper differently for specific node group

Is this a BUG REPORT or FEATURE REQUEST?:
QUESTION

What happened:
We are running node reaper in our Kubernetes cluster to reap nodes older than 7 days for security and compliance reasons. Some of our workloads are ML workloads (Apache Flink jobs) and they run on a specific node group. When we reap a node in that node group, the entire ML job needs to be restarted due to the architecture of Flink (as with many ML architectures). So if we reap the nodes in that node group one by one (many times a day once nodes reach 7 days old), the job has to be restarted many times a day, which adds significant processing lag. Since reaping one node older than 7 days effectively means reaping every node in that node group, we are wondering:

  1. Is there an option in node reaper to wait until all nodes in a specific node group are older than N days and then reap them all at the same time?
  2. Is there an option to reap nodes in a specific node group at a specific time window so that we can contain the downtime?
  3. Any other suggestion to address our scenario?

As noted above, we are looking for advanced configuration for a few special node groups, in addition to the regular options for all other nodes.

Thanks in advance for any help on this.

What you expected to happen:

Option to configure a specific node group differently

How to reproduce it (as minimally and precisely as possible):

N/A

Anything else we need to know?:

N/A

Environment:

  • Kubernetes version: v1.23

Other debugging information (if applicable):

  • relevant logs: N/A

Node Reaper - Add Support To Skip Nodes

Is this a BUG REPORT or FEATURE REQUEST?:
Feature Request

What happened:
We need the ability to conditionally skip reaping flappy nodes based on instance group. We have one instance group where we expect nodes to flap between Ready and NotReady, and we want to prevent node-reaper from reaping those nodes. The flapping stems from IOPS exhaustion on those nodes, and we explicitly decided not to add additional IOPS capacity to save $$$$. We need the ability to disable reaping of flappy nodes in this one instance group while keeping it enabled for our other instance groups.

What you expected to happen:
The node-reaper ignores flappy nodes within one specific instance group but continues to act on flappy nodes in all other instance groups.

How to reproduce it (as minimally and precisely as possible):
N/A

Anything else we need to know?:

Environment:
N/A

  • Kubernetes version:
    1.15.6
→ kubectl version -o yaml
clientVersion:
  buildDate: "2020-02-13T18:06:54Z"
  compiler: gc
  gitCommit: 06ad960bfd03b39c8310aaf92d1e7c12ce618213
  gitTreeState: clean
  gitVersion: v1.17.3
  goVersion: go1.13.8
  major: "1"
  minor: "17"
  platform: darwin/amd64
serverVersion:
  buildDate: "2019-11-13T11:11:50Z"
  compiler: gc
  gitCommit: 7015f71e75f670eb9e7ebd4b5749639d42e20079
  gitTreeState: clean
  gitVersion: v1.15.6
  goVersion: go1.12.12
  major: "1"
  minor: "15"
  platform: linux/amd64

Other debugging information (if applicable):
N/A

Corporate or individual CLA is required for keikoproj / governor?

Hi Team,

Apologies for not using the issue template, but it is not exactly applicable to my question 😄

I am managing the CLAs for the company I work for, and we have an employee who wants to contribute to the keikoproj/governor project.

From what we have found, it looks like that in order to contribute to the project, a contributor is required and prompted to sign an Individual CLA: https://cla-assistant.io/keikoproj/governor. However, on the keiko/community page (https://github.com/keikoproj/keiko/tree/master/community), a corporate CLA is available, and we were wondering if we may sign that instead of the ICLA.

Could you please advise whether the Keiko corporate CLA is a viable option for us to sign? And if the answer is "Yes", please advise whether there are any specific steps, apart from sending the signed CLA to Ms. Mukulika Kapas ([email protected]), that we will need to complete so you may accept our employees' PRs?

Thanks,
Radi

Feature: Dynamic Rule Configuration for pod-reaper

Pod reaper should support some sort of rule configuration to make it much more dynamic.
Instead of using flags, we can load a configuration file that contains something like:

podReaperRules:
- name: killTerminatingPods
  status: Terminating
  reapAfter: 10
  softReap: true
- name: killContainerCreatingPods
  status: ContainerCreating
  reapAfter: 5
  softReap: false
- name: killInitPods
  status: Init
  reapAfter: 10
  softReap: false

valid statuses = Terminating(by timestamp), ContainerCreating(by ContainerStatuses), Init (by ContainerStatuses)

This can be a great enhancement to pod reaper
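
A minimal sketch of loading such a file, assuming sigs.k8s.io/yaml; the type and field names mirror the example above but are otherwise assumptions:

package podreaper

import (
    "os"

    "sigs.k8s.io/yaml"
)

type PodReaperRule struct {
    Name      string  `json:"name"`
    Status    string  `json:"status"`    // Terminating | ContainerCreating | Init
    ReapAfter float64 `json:"reapAfter"` // minutes
    SoftReap  bool    `json:"softReap"`
}

type PodReaperConfig struct {
    PodReaperRules []PodReaperRule `json:"podReaperRules"`
}

// loadRules reads and parses the rule configuration file.
func loadRules(path string) (*PodReaperConfig, error) {
    data, err := os.ReadFile(path)
    if err != nil {
        return nil, err
    }
    var cfg PodReaperConfig
    if err := yaml.Unmarshal(data, &cfg); err != nil {
        return nil, err
    }
    return &cfg, nil
}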

pdb-reaper needs to support K8s 1.25

Is this a BUG REPORT or FEATURE REQUEST?:

FEATURE REQUEST

What happened:

Governor (pdb-reaper specifically) uses policy/v1beta1, which is no longer served in Kubernetes 1.25. It needs to be updated to use policy/v1.

What you expected to happen:

Governor should run on Kubernetes 1.25.

How to reproduce it (as minimally and precisely as possible):

For policy/v1, I think client-go needs to be updated to v0.21.0 or later. Then the following code needs to be changed:

https://github.com/search?q=repo%3Akeikoproj%2Fgovernor%20%22v1beta1%22&type=code

However, it looks like client-go changed after v0.17, with some additional required arguments, etc. So some refactoring may be required.
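
For reference, a minimal sketch of the policy/v1 call shape with a newer client-go (v0.21+ methods take a context); this is not the existing governor code:

package pdbreaper

import (
    "context"

    policyv1 "k8s.io/api/policy/v1"
    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    "k8s.io/client-go/kubernetes"
)

// listPDBs lists PodDisruptionBudgets using the policy/v1 API group.
func listPDBs(client kubernetes.Interface, namespace string) (*policyv1.PodDisruptionBudgetList, error) {
    return client.PolicyV1().PodDisruptionBudgets(namespace).List(context.TODO(), metav1.ListOptions{})
}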

Anything else we need to know?:

See PR #96, which is failing the CI build, for the errors after updating client-go.

Environment:

  • Kubernetes version:
kubectl version -o yaml

Other debugging information (if applicable):

  • relevant logs:
kubectl logs <governor-pod>

Node reaper does not reap aged nodes again after they fail to drain once

Is this a BUG REPORT or FEATURE REQUEST?:
BUG REPORT/FEATURE REQUEST

What happened:
When node reaper fails to drain a node once, it adds the annotation governor.keikoproj.io/age-unreapable: true

ctx.annotateNode(name, ageUnreapableAnnotationKey, "true")

and these nodes are not considered again, as they get filtered out by this statement:

if !nodeHasAnnotation(node, ageUnreapableAnnotationKey, "true") && !hasSkipLabel(node, reapOldDisabledLabelKey) {

What you expected to happen:

It should reconsider aged nodes again even after a drain has failed once.

How to reproduce it (as minimally and precisely as possible):
Make a node drain fail by adding PDBs that do not allow the node to be drained, or add the governor.keikoproj.io/age-unreapable: true annotation manually.

Anything else we need to know?:

We could add a flag --reap-age-unreapable; when set to true, node reaper would reconsider nodes that previously failed to drain. By default the flag can be false.

Environment:

  • Kubernetes version:
kubectl version -o yaml

Other debugging information (if applicable):

  • relevant logs:
kubectl logs <governor-pod>

Node reaper should check for a missing node in the cloud provider before reaping

Is this a BUG REPORT or FEATURE REQUEST?:
BUG REPORT

What happened:

Node reaper got stuck processing a dud node that didn't exist in AWS but was still present in Kubernetes.

What you expected to happen:

We should see if we can introduce a check in node reaper to validate that the node exists in the underlying cloud provider and, if not, delete it from Kubernetes.
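
A minimal sketch of the existence check, assuming aws-sdk-go v1; matching the InvalidInstanceID.NotFound error string is one way to detect a missing instance:

package nodereaper

import (
    "strings"

    "github.com/aws/aws-sdk-go/aws"
    "github.com/aws/aws-sdk-go/service/ec2"
    "github.com/aws/aws-sdk-go/service/ec2/ec2iface"
)

// instanceExists reports whether the node's backing instance is still known to EC2.
func instanceExists(client ec2iface.EC2API, instanceID string) (bool, error) {
    _, err := client.DescribeInstances(&ec2.DescribeInstancesInput{
        InstanceIds: []*string{aws.String(instanceID)},
    })
    if err != nil {
        if strings.Contains(err.Error(), "InvalidInstanceID.NotFound") {
            return false, nil
        }
        return false, err
    }
    return true, nil
}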

How to reproduce it (as minimally and precisely as possible):

Probably remove a node from the cloud provider after node reaper has scanned the nodes in Kubernetes and is trying to remove them.

Anything else we need to know?:

Environment:

  • Kubernetes version:
kubectl version -o yaml

Other debugging information (if applicable):

  • relevant logs:
kubectl logs <governor-pod>

improve readme docs on pod reaper

pod-reaper has lots of options and functions, and the README does not mention any of them.
The docs should be improved to encourage adoption and so users know what they are installing.

Feature: Delete node objects missing from EC2

It's been observed on several occasions that an EC2 instance is terminated but the node object is never removed for some reason. The node name can then be re-allocated to a different node, which joins the cluster with the same IP as the terminated instance.

This can cause major issues with alb-ingress-controller, or other cluster components.

We should consider:

  • Get all nodes
  • Get their instance ids
  • If the instance ID does not exist on EC2 AND the node has existed for more than the ReapAfter threshold, we should delete the node from the API (see the sketch after this list).
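
A minimal sketch of the deletion step, assuming client-go (newer versions take a context argument); names are illustrative:

package nodereaper

import (
    "context"

    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    "k8s.io/client-go/kubernetes"
)

// deleteStaleNode removes a Node object whose backing EC2 instance no longer exists.
func deleteStaleNode(client kubernetes.Interface, nodeName string) error {
    return client.CoreV1().Nodes().Delete(context.TODO(), nodeName, metav1.DeleteOptions{})
}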

travis-ci improvements

Is this a BUG REPORT or FEATURE REQUEST?:
Feature Request

What happened:
Currently Travis CI only builds PRs and pushes images post-merge. This is insufficient when we want to test a PR on a real cluster (or run the BDD test).

What you expected to happen:

We should push an image on every PR, using the commit hash as the tag.
We should also run the BDD test nightly against governor:master and publish governor:nightly as a result.
We should consider some sort of hourly functional test as well.

[Enhancement] Check for PDBs status before draining Node

Is this a BUG REPORT or FEATURE REQUEST?:

enhancement

What happened:

When node reaper tries to drain a node, it does not check whether all pods can be disrupted and the node can actually be drained. It keeps trying to drain the node until the command times out.

What you expected to happen:

It should check PDB status before trying to drain a node.

How to reproduce it (as minimally and precisely as possible):

Try to drain a node that has pods with 0 disruptions allowed.
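
A minimal sketch of such a pre-drain check, assuming policy/v1 and client-go; a fuller implementation would match PDB selectors against the pods actually scheduled on the node:

package nodereaper

import (
    "context"

    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    "k8s.io/client-go/kubernetes"
)

// anyBlockedPDB reports whether any PDB in the namespace currently allows
// zero disruptions, in which case a drain would block.
func anyBlockedPDB(client kubernetes.Interface, namespace string) (bool, error) {
    pdbs, err := client.PolicyV1().PodDisruptionBudgets(namespace).List(context.TODO(), metav1.ListOptions{})
    if err != nil {
        return false, err
    }
    for _, pdb := range pdbs.Items {
        if pdb.Status.DisruptionsAllowed == 0 {
            return true, nil
        }
    }
    return false, nil
}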

Anything else we need to know?:

Environment:

  • Kubernetes version:
kubectl version -o yaml

Other debugging information (if applicable):

  • relevant logs:
kubectl logs <governor-pod>

Node Reaper: Refactor

We need to refactor node reaper. It started as a simple script, but as we add more and more logic, the current structure is becoming a bit flaky.

We should make some structural improvements:

  • K8s/AWS clients should be instantiated outside; this will allow unit testing Run()
  • Better data structures for reapable/drainable
  • Remove duplicate code
  • Remove a lot of unnecessary if conditions
  • Pass arguments to ReaperContext and validate them in a cleaner way

Support deletion of Evicted pods

Is this a BUG REPORT or FEATURE REQUEST?:
FEATURE REQUEST

What happened:
When nodes are drained to reclaim some resource (e.g. filesystem capacity), the pods that were running on that node are marked Evicted. These pods stay in this state unless the node is deleted or the pods are explicitly deleted.

What you expected to happen:
Governor should support an option to delete such Evicted pods after a certain period of time.
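
A minimal sketch of how an Evicted pod might be identified, assuming client-go types; using the pod's creation timestamp as the age reference is an approximation for illustration:

package podreaper

import (
    "time"

    corev1 "k8s.io/api/core/v1"
)

// isReapableEvictedPod reports whether a pod was evicted and is older than reapAfter.
func isReapableEvictedPod(pod *corev1.Pod, reapAfter time.Duration) bool {
    if pod.Status.Phase != corev1.PodFailed || pod.Status.Reason != "Evicted" {
        return false
    }
    return time.Since(pod.CreationTimestamp.Time) > reapAfter
}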

Push PDB reaper metrics to prometheus

Is this a BUG REPORT or FEATURE REQUEST?:
FEATURE REQUEST

What happened:
PDB Reaper only publishes events when it reaps. It is important to also push metrics to a Prometheus Pushgateway.
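
A minimal sketch using prometheus/client_golang's push package; the gateway URL, job name, and metric are placeholders:

package pdbreaper

import (
    "github.com/prometheus/client_golang/prometheus"
    "github.com/prometheus/client_golang/prometheus/push"
)

// pushReapedCount pushes the number of PDBs reaped in this run to a Pushgateway.
func pushReapedCount(gatewayURL string, reaped int) error {
    g := prometheus.NewGauge(prometheus.GaugeOpts{
        Name: "pdb_reaper_reaped_pdbs",
        Help: "Number of PDBs reaped in the last run.",
    })
    g.Set(float64(reaped))
    return push.New(gatewayURL, "pdb-reaper").Collector(g).Push()
}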

What you expected to happen:

How to reproduce it (as minimally and precisely as possible):

Anything else we need to know?:

Environment:

  • Kubernetes version:
kubectl version -o yaml

Other debugging information (if applicable):

  • relevant logs:
kubectl logs <governor-pod>

node-reaper should handle longer node drain times

Is this a BUG REPORT or FEATURE REQUEST?:
feature request

What happened:
Currently, when a node drain times out, node-reaper uncordons the node and then tries to drain it again in the next round (approximately the next 10 minutes).

For a node whose pods take a long time to be evicted, it is possible that the same node ends up under constant drain.

What you expected to happen:

Set the node drain timeout to 1 hour or longer, if possible.

How to reproduce it (as minimally and precisely as possible):
Run node-reaper against a pod that takes more than 10 minutes to be evicted.

Anything else we need to know?:
N/A

Environment:
N/A

  • Kubernetes version:
    N/A
kubectl version -o yaml

Other debugging information (if applicable):

  • relevant logs:
kubectl logs <governor-pod>

Ability to node reap non-ASG nodes

Feature request

Governor relies on Kubernetes nodes being provisioned by an ASG. Nodes provisioned by other methods, such as the newish Karpenter project, do not belong to an ASG, and Governor currently will not interact with them. It would be good if Governor could reap Karpenter-provisioned nodes.

FYI, about Karpenter specifically: it has the ability to reap old nodes (a node TTL), although this feature doesn't have configuration such as maxReapNodes that Governor currently has. Reaping NotReady or otherwise unhealthy nodes is currently not in scope for the Karpenter project.

Inaccurate taint flag documentation

Is this a BUG REPORT or FEATURE REQUEST?:

Bug report

What happened:

The documentation on --reap-tainted doesn't seem to be correct

What you expected to happen:

The documentation should be accurate

How to reproduce it (as minimally and precisely as possible):

the documentation says:

"marks nodes with a given taint reapable, must be in format of comma separated taints key=value:effect, key:effect or key"

But this flag appears to be defined with StringArrayVar, which means we need to pass the flag multiple times instead of a comma-separated string; see: https://pkg.go.dev/github.com/spf13/pflag#StringArrayVar

StringArrayVar defines a string flag with specified name, default value, and usage string. The argument p points to a []string variable in which to store the value of the flag. The value of each argument will not try to be separated by comma. Use a StringSlice for that.
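
A minimal sketch contrasting the two pflag helpers; the flag names here only illustrate the parsing difference and are not governor's actual flags:

package main

import (
    "fmt"

    "github.com/spf13/pflag"
)

func main() {
    var arr, slice []string
    pflag.StringArrayVar(&arr, "taints-array", nil, "repeat the flag once per taint")
    pflag.StringSliceVar(&slice, "taints-slice", nil, "comma-separated taints")
    pflag.Parse()
    // --taints-array=a --taints-array=b  -> [a b]
    // --taints-slice=a,b                 -> [a b]
    fmt.Println(arr, slice)
}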

Introduce flag for drain timeout

Is this a BUG REPORT or FEATURE REQUEST?:

FEATURE REQUEST

What happened:
The drain timeout value is not configurable; it uses a hard-coded value of 600 seconds (10 minutes). The appropriate timeout can vary based on different factors.

What you expected to happen:

The drain timeout value should be configurable.

How to reproduce it (as minimally and precisely as possible):

drainTimeoutSeconds := 600

Anything else we need to know?:

Introduce a flag --drain-timeout to make it configurable.
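
A minimal sketch of the proposed flag; the name --drain-timeout comes from this issue, while the wiring (standard library flag here) is only illustrative and may differ from governor's actual CLI setup:

package main

import (
    "flag"
    "fmt"
)

func main() {
    // Replaces the hard-coded 600-second default quoted above.
    drainTimeoutSeconds := flag.Int("drain-timeout", 600,
        "seconds to wait for a node drain to complete before giving up")
    flag.Parse()
    fmt.Println(*drainTimeoutSeconds)
}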

Environment:

  • Kubernetes version:
kubectl version -o yaml

Other debugging information (if applicable):

  • relevant logs:
kubectl logs <governor-pod>

Rename Org

The org has been renamed to keikoproj; we need to change all references.
