keikoproj / governor
A collection of cluster reliability tools for Kubernetes
License: Apache License 2.0
This tool looks great for us. Is there a helm chart (maybe in a private repo)? If so, can it be made public, and if not, would you like one?
Is this a BUG REPORT or FEATURE REQUEST?:
BUG
What happened:
Node reaper issues a kubectl drain to drain the node before reaping it. This drain command is invoked with a 10-minute timeout. However, the timeout is not effective because kubectl drain is invoked with the --force --grace-period=1 options. These options force the pods to be terminated immediately, without giving them enough time to run PreStop hooks.
What you expected to happen:
Pods should be given some time to terminate.
We should take the following flags:
--reap-completed => Bool
--reap-completed-after => Float64 (minutes)
--reap-failed => Bool
--reap-failed-after => Float64 (minutes)
Reaping should be determined by diffing the last container's finishedAt against the corresponding --reap-x-after flag.
Node reaper should clean up in-transition bad nodes which are stuck for a long time.
In this case, the node was in a bad state for more than 20 hours while the upgrade manager was rotating nodes in the cluster:
| time="2021-06-29T19:00:15Z" level=info msg="autoscaling-group is in transition, will not reap unjoined-i-029a8***************"
| time="2021-06-29T19:00:15Z" level=info msg="instance 'i-029a8***************' has been running for 1343.561029 minutes but is not joined to cluster"
| time="2021-06-29T18:40:09Z" level=info msg="autoscaling-group is in transition, will not reap unjoined-i-029a8***************"
| time="2021-06-29T18:40:09Z" level=info msg="instance 'i-029a8***************' has been running for 1323.459359 minutes but is not joined to cluster"
| time="2021-06-29T18:30:17Z" level=info msg="autoscaling-group is in transition, will not reap unjoined-i-029a8***************"
| time="2021-06-29T18:30:17Z" level=info msg="instance 'i-029a8***************' has been running for 1313.589378 minutes but is not joined to cluster"
| time="2021-06-29T18:20:17Z" level=info msg="autoscaling-group is in transition, will not reap unjoined-i-029a8***************"
| time="2021-06-29T18:20:16Z" level=info msg="instance 'i-029a8***************' has been running for 1303.581947 minutes but is not joined to cluster"
| time="2021-06-29T18:10:12Z" level=info msg="autoscaling-group is in transition, will not reap unjoined-i-029a8***************"
| time="2021-06-29T18:10:12Z" level=info msg="instance 'i-029a8***************' has been running for 1293.507664 minutes but is not joined to cluster"
We should have an annotation to exclude a namespace from being operated on, so that if "governor.keikoproj.io/pod-reaper-disabled" is "true" (string), we skip any operations in that namespace.
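The annotation check could be as simple as the sketch below; the annotation key is the one proposed in this issue, while the helper name is hypothetical:

```go
package main

import "fmt"

// Annotation key proposed in the issue; a namespace carrying this
// annotation with the string value "true" should be skipped by pod-reaper.
const podReaperDisabledAnnotation = "governor.keikoproj.io/pod-reaper-disabled"

// namespaceDisabled reports whether pod-reaper should skip a namespace,
// given that namespace's annotations.
func namespaceDisabled(annotations map[string]string) bool {
	return annotations[podReaperDisabledAnnotation] == "true"
}

func main() {
	fmt.Println(namespaceDisabled(map[string]string{podReaperDisabledAnnotation: "true"})) // skipped
	fmt.Println(namespaceDisabled(map[string]string{}))                                    // processed
}
```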
Getting this project onto modules would bring it in line with the current direction of Go.
Is this a BUG REPORT or FEATURE REQUEST?:
FEATURE REQUEST
What happened:
Currently, age-reapable nodes are always sorted by their age, which produces a fixed ordering. As a result, if the first node has a problem, the other nodes will not be reaped in subsequent runs.
What you expected to happen:
Randomize the order of age-reapable nodes so that if any node has a problem, the nodes after it in the fixed order can still be reaped in the next run.
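The requested randomization amounts to shuffling the candidate slice before processing; a minimal sketch with the standard library (the function name is hypothetical):

```go
package main

import (
	"fmt"
	"math/rand"
)

// shuffleNodes returns a shuffled copy of the age-reapable node list, so
// one problematic node at the front of the sorted order cannot permanently
// block the nodes behind it across runs.
func shuffleNodes(nodes []string) []string {
	shuffled := make([]string, len(nodes))
	copy(shuffled, nodes)
	rand.Shuffle(len(shuffled), func(i, j int) {
		shuffled[i], shuffled[j] = shuffled[j], shuffled[i]
	})
	return shuffled
}

func main() {
	nodes := []string{"node-a", "node-b", "node-c", "node-d"}
	fmt.Println(shuffleNodes(nodes)) // same nodes, random order
}
```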
How to reproduce it (as minimally and precisely as possible):
Anything else we need to know?:
Environment:
kubectl version -o yaml
Other debugging information (if applicable):
kubectl logs <governor-pod>
In the case where there are multiple NATs in a zone and that zone is cordoned, we may restore a different NAT.
We should use tagging or a similar approach to persist which NAT was cordoned, so that we can restore the same NAT.
Otherwise this may cause issues for NATs which have downstream ACLs tied to the public IP.
governor packages should have an end-to-end functional test.
We can have a Travis cron job that runs nightly and runs this test.
The test should run a pod with hostNetwork and run ip link set dev eth0 down; the node will become NotReady, and the reaper should then kill it.
Is this a BUG REPORT or FEATURE REQUEST?:
FEATURE REQUEST
What happened:
What you expected to happen:
In order to support IAM Roles for Service Accounts in EKS, the AWS SDK version needs to be updated from 1.16 to at least 1.23.13.
How to reproduce it (as minimally and precisely as possible):
Anything else we need to know?:
https://docs.aws.amazon.com/eks/latest/userguide/iam-roles-for-service-accounts-minimum-sdk.html
I've observed this several times: a node fails to join the cluster for some reason (e.g. the EC2 instance landed on a bad physical host), but the EC2 instance is still hanging around. This is bad because:
We should consider:
Is this a BUG REPORT or FEATURE REQUEST?:
QUESTION
What happened:
We are running node reaper in our kube cluster to reap nodes older than 7d for security and compliance reasons. Some of our workloads are ML workloads (Apache Flink jobs), and they run on a specific node group. When we reap a node in that node group, the entire ML job needs to be restarted due to the architecture of Flink (the same is true of many ML architectures). So if we reap nodes in that node group one by one (many times a day once nodes reach 7d old), the job has to be restarted many times a day, which adds up to significant processing lag. Since reaping a node older than 7d is effectively the same as reaping all nodes in that node group, we are wondering:
As noted above, we are looking for advanced configuration for a few special node groups in addition to the regular options for all other nodes.
Thanks in advance for any help on this.
What you expected to happen:
Option to configure a specific node group differently
How to reproduce it (as minimally and precisely as possible):
N/A
Anything else we need to know?:
N/A
Environment:
Other debugging information (if applicable):
Is this a BUG REPORT or FEATURE REQUEST?:
Feature Request
What happened:
We need the ability to conditionally skip reaping flappy nodes based on an instance group. We have one instance group where we expect nodes to flap between Ready and NotReady, and we want to disable the node-reaper from reaping these nodes. The flapping stems from IOPS exhaustion on these nodes, and we explicitly decided not to add additional IOPS capacity to save $$$$. We need the ability to disable reaping flappy nodes in this one instance group, but continue to have it enabled for our other instance groups.
What you expected to happen:
The node-reaper ignores flappy nodes within one specific instance group but continues to act on flappy nodes in all other instance groups.
How to reproduce it (as minimally and precisely as possible):
N/A
Anything else we need to know?:
Environment:
N/A
1.15.6
→ kubectl version -o yaml
clientVersion:
buildDate: "2020-02-13T18:06:54Z"
compiler: gc
gitCommit: 06ad960bfd03b39c8310aaf92d1e7c12ce618213
gitTreeState: clean
gitVersion: v1.17.3
goVersion: go1.13.8
major: "1"
minor: "17"
platform: darwin/amd64
serverVersion:
buildDate: "2019-11-13T11:11:50Z"
compiler: gc
gitCommit: 7015f71e75f670eb9e7ebd4b5749639d42e20079
gitTreeState: clean
gitVersion: v1.15.6
goVersion: go1.12.12
major: "1"
minor: "15"
platform: linux/amd64
Other debugging information (if applicable):
N/A
Hi Team,
Apologies for not using the issue template, but it is not exactly applicable to my question 😄
I am managing the CLAs for the company I am working for, and we have an employee that wants to contribute to the keikoproj/governor project.
From what we have found, it looks like that in order to contribute to the project, a contributor is required and prompted to sign an Individual CLA: https://cla-assistant.io/keikoproj/governor. However, on the keiko/community page (https://github.com/keikoproj/keiko/tree/master/community), a corporate CLA is available, and we were wondering if we may sign that instead of the ICLA.
Could you please advise whether the Keiko corporate CLA is a viable option for us to sign? And if the answer is "Yes", please advise if there are any specific steps, apart from sending the signed CLA to Ms. Mukulika Kapas ([email protected]), that we need to complete so we may accept our employees' PRs?
Thanks,
Radi
Pod reaper should support some sort of rule configuration to make it much more dynamic.
Instead of using flags, we can load a configuration file that contains something like:
podReaperRules:
  - name: killTerminatingPods
    status: Terminating
    reapAfter: 10
    softReap: true
  - name: killContainerCreatingPods
    status: ContainerCreating
    reapAfter: 5
    softReap: false
  - name: killInitPods
    status: Init
    reapAfter: 10
    softReap: false
Valid statuses: Terminating (by timestamp), ContainerCreating (by ContainerStatuses), Init (by ContainerStatuses).
This can be a great enhancement to pod reaper
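A minimal Go sketch of how such rules could be represented and matched; the struct fields mirror the YAML above, but the type, function names, and matching logic are assumptions, not pod-reaper's actual implementation:

```go
package main

import "fmt"

// PodReaperRule mirrors one entry of the proposed podReaperRules config.
type PodReaperRule struct {
	Name      string
	Status    string  // Terminating | ContainerCreating | Init
	ReapAfter float64 // minutes
	SoftReap  bool
}

// matchRule returns the first rule that applies to a pod in the given
// status that has been stuck for stuckMinutes.
func matchRule(rules []PodReaperRule, status string, stuckMinutes float64) (PodReaperRule, bool) {
	for _, r := range rules {
		if r.Status == status && stuckMinutes >= r.ReapAfter {
			return r, true
		}
	}
	return PodReaperRule{}, false
}

func main() {
	rules := []PodReaperRule{
		{Name: "killTerminatingPods", Status: "Terminating", ReapAfter: 10, SoftReap: true},
		{Name: "killContainerCreatingPods", Status: "ContainerCreating", ReapAfter: 5, SoftReap: false},
		{Name: "killInitPods", Status: "Init", ReapAfter: 10, SoftReap: false},
	}
	if r, ok := matchRule(rules, "Terminating", 12); ok {
		fmt.Println(r.Name) // killTerminatingPods
	}
	_, ok := matchRule(rules, "ContainerCreating", 2)
	fmt.Println(ok) // false: stuck for less than reapAfter
}
```

In a real implementation, these structs would carry yaml struct tags and be loaded from the configuration file at startup.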
Is this a BUG REPORT or FEATURE REQUEST?:
FEATURE REQUEST
What happened:
Governor (pdb-reaper specifically) is using policy/v1beta1, which is deprecated in K8s 1.25. Need to update to use policy/v1.
What you expected to happen:
Governor should run on Kubernetes 1.25.
How to reproduce it (as minimally and precisely as possible):
For policy/v1, I think client-go needs to be updated to v0.21.0 or later, and then the following code needs to be changed:
https://github.com/search?q=repo%3Akeikoproj%2Fgovernor%20%22v1beta1%22&type=code
However, it looks like client-go changed after v0.17, with some additional required arguments, etc. So some refactoring may be required.
Anything else we need to know?:
See PR #96, which is failing the CI build, for the errors after updating client-go.
Environment:
kubectl version -o yaml
Other debugging information (if applicable):
kubectl logs <governor-pod>
Is this a BUG REPORT or FEATURE REQUEST?:
BUG REPORT/FEATURE REQUEST
What happened:
When node reaper fails to drain a node once, it adds the annotation governor.keikoproj.io/age-unreapable: true
governor/pkg/reaper/nodereaper/helpers.go
Line 157 in 84d1d6e
and these nodes are not considered again, as they get filtered out by this statement:
governor/pkg/reaper/nodereaper/nodereaper.go
Line 340 in 84d1d6e
What you expected to happen:
It should reconsider age-reapable nodes again after a failed attempt.
How to reproduce it (as minimally and precisely as possible):
Make sure a node drain fails, either by adding PDBs which do not allow draining that node, or by adding the `governor.keikoproj.io/age-unreapable: true` annotation.
Anything else we need to know?:
We can add a flag --reap-age-unreapable; when set to true, it will consider failed nodes again. By default, that flag can be set to false.
Environment:
kubectl version -o yaml
Other debugging information (if applicable):
kubectl logs <governor-pod>
Is this a BUG REPORT or FEATURE REQUEST?:
BUG REPORT
What happened:
Node reaper got stuck on processing a dud node that didn't exist in AWS but was still in k8s.
What you expected to happen:
We should see if we can introduce a check in node reaper to validate that the node exists in the underlying cloud-provider and if not, delete it from k8s.
How to reproduce it (as minimally and precisely as possible):
Remove a node from the cloud provider after node reaper has scanned the nodes in k8s and is trying to remove them.
Anything else we need to know?:
Environment:
kubectl version -o yaml
Other debugging information (if applicable):
kubectl logs <governor-pod>
It has lots of options and functions, and the README does not mention any of them.
It should be improved to encourage adoption and so that users know what they are installing.
It's been observed on several occasions that an EC2 instance is terminated but the node object is never removed for some reason; the node name can then be re-allocated to a different node, which joins the cluster with the same IP as the terminated instance.
This can cause major issues with alb-ingress-controller or other cluster components.
We should consider:
Unit-test coverage is around 50%; we should get it above 75% at the very least.
For the sake of other components in the cluster, we should annotate the node that is being terminated.
Is this a BUG REPORT or FEATURE REQUEST?:
Feature Request
What happened:
Currently Travis CI only builds PRs and pushes images post-merge. This is insufficient when wanting to test a PR on a real cluster (or run the BDD test).
What you expected to happen:
We should push an image even on PR, and use the commit hash as tag.
We should also run the BDD test nightly on governor:master and publish governor:nightly as a result.
We should consider some sort of hourly functional test as well.
Is this a BUG REPORT or FEATURE REQUEST?:
enhancement
What happened:
When node reaper tries to drain a node, it does not check whether all pods can be disrupted and the node can actually be drained. It keeps trying to drain the node until the command times out.
What you expected to happen:
It should check before trying to drain a node.
How to reproduce it (as minimally and precisely as possible):
Try to drain a node which has pods with 0 disruptions allowed.
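The requested pre-check could look at the disruptions allowed by each PDB covering pods on the node before issuing the drain. A minimal sketch, using simplified stand-in types rather than the real client-go PodDisruptionBudget objects:

```go
package main

import "fmt"

// pdb is a simplified stand-in for a PodDisruptionBudget, carrying only
// the field the pre-check needs (mirrors status.disruptionsAllowed).
type pdb struct {
	Name               string
	DisruptionsAllowed int
}

// canDrain reports whether a drain could succeed: every PDB covering pods
// on the node must allow at least one disruption. It also returns the
// names of any blocking PDBs for logging.
func canDrain(pdbs []pdb) (bool, []string) {
	var blocking []string
	for _, p := range pdbs {
		if p.DisruptionsAllowed < 1 {
			blocking = append(blocking, p.Name)
		}
	}
	return len(blocking) == 0, blocking
}

func main() {
	ok, blocking := canDrain([]pdb{{"web", 1}, {"db", 0}})
	fmt.Println(ok, blocking) // false [db]: the "db" PDB blocks the drain
}
```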
Anything else we need to know?:
Environment:
kubectl version -o yaml
Other debugging information (if applicable):
kubectl logs <governor-pod>
We need to refactor node reaper. It started as a simple script, but as we add more and more logic, the current structure is becoming a bit flaky.
We should make some structural improvements:
Run()
reapable/drainable
Is this a BUG REPORT or FEATURE REQUEST?:
FEATURE REQUEST
What happened:
When nodes are drained to reclaim some resources (e.g. filesystem capacity), the pods that were running on that node are marked Evicted. These pods stay in this state unless the node is deleted or the pods are explicitly deleted.
What you expected to happen:
Governor should support an option wherein it deletes such Evicted pods after a certain period of time.
Is this a BUG REPORT or FEATURE REQUEST?:
FEATURE REQUEST
What happened:
PDB Reaper only publishes events when reaping. It is important to also push metrics to a Prometheus pushgateway.
What you expected to happen:
How to reproduce it (as minimally and precisely as possible):
Anything else we need to know?:
Environment:
kubectl version -o yaml
Other debugging information (if applicable):
kubectl logs <governor-pod>
Is this a BUG REPORT or FEATURE REQUEST?:
feature request
What happened:
Currently, when a node drain times out, node-reaper will uncordon the node and then try to drain it again in the next round (approximately 10 minutes later).
For a node that has pods which take a long time to be evicted, it is possible that the same node is under constant drain.
What you expected to happen:
Set the node drain timeout to 1 hr or longer if possible.
How to reproduce it (as minimally and precisely as possible):
Run node-reaper with a pod that takes more than 10m to be evicted.
Anything else we need to know?:
N/A
Environment:
N/A
kubectl version -o yaml
Other debugging information (if applicable):
kubectl logs <governor-pod>
Feature request
Governor relies on k8s nodes being provisioned by an ASG. K8s nodes provisioned using other methods, such as the new-ish Karpenter project, do not belong to an ASG, and Governor currently will not interact with those nodes. It would be good if Governor could reap Karpenter-provisioned nodes.
FYI, about Karpenter specifically: it has the ability to reap old nodes (a node TTL), although this feature doesn't have configuration such as maxReapNodes that Governor currently has. Reaping NotReady or otherwise unhealthy nodes is currently not in scope for the Karpenter project.
Is this a BUG REPORT or FEATURE REQUEST?:
Bug report
What happened:
The documentation on --reap-tainted doesn't seem to be correct.
What you expected to happen:
The documentation should be accurate
How to reproduce it (as minimally and precisely as possible):
the documentation says:
"marks nodes with a given taint reapable, must be in format of comma separated taints key=value:effect, key:effect or key"
But this flag seems to be using StringArrayVar, which means we need to pass the flag multiple times instead of a comma-separated string; see: https://pkg.go.dev/github.com/spf13/pflag#StringArrayVar
StringArrayVar defines a string flag with specified name, default value, and usage string. The argument p points to a []string variable in which to store the value of the flag. The value of each argument will not try to be separated by comma. Use a StringSlice for that.
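Because StringArrayVar does not split on commas, each taint must be passed via a separate --reap-tainted occurrence. Regardless of how the flag is wired, parsing the three documented shapes ("key=value:effect", "key:effect", "key") is straightforward; a sketch that illustrates the shapes, not node-reaper's actual parser:

```go
package main

import (
	"fmt"
	"strings"
)

// taint holds the parsed pieces of one --reap-tainted value.
type taint struct {
	Key, Value, Effect string
}

// parseTaint handles the three documented formats:
// "key=value:effect", "key:effect", and bare "key".
func parseTaint(s string) taint {
	var t taint
	// split off the effect after the last ':', if present
	if i := strings.LastIndex(s, ":"); i >= 0 {
		t.Effect = s[i+1:]
		s = s[:i]
	}
	// split key=value, if present
	if i := strings.Index(s, "="); i >= 0 {
		t.Key, t.Value = s[:i], s[i+1:]
	} else {
		t.Key = s
	}
	return t
}

func main() {
	fmt.Println(parseTaint("dedicated=gpu:NoSchedule")) // {dedicated gpu NoSchedule}
	fmt.Println(parseTaint("dedicated:NoSchedule"))
	fmt.Println(parseTaint("dedicated"))
}
```

With StringArrayVar, multiple taints look like: `--reap-tainted dedicated=gpu:NoSchedule --reap-tainted spot:NoExecute`.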
Is this a BUG REPORT or FEATURE REQUEST?:
FEATURE REQUEST
What happened:
The drain timeout value is not configurable; it uses a hard-coded value of 600 seconds (10 minutes). But the appropriate value can vary based on different factors.
What you expected to happen:
Drain timed out value should be configurable
How to reproduce it (as minimally and precisely as possible):
governor/pkg/reaper/nodereaper/helpers.go
Line 144 in 926911b
Anything else we need to know?:
Introduce a flag --drain-timeout to make it configurable.
Environment:
kubectl version -o yaml
Other debugging information (if applicable):
kubectl logs <governor-pod>
The reap-unjoined instance scan also finds terminated instances which have the tag key/value.
This causes the job to fail when it tries to terminate an already terminated node.
The org has been renamed to keikoproj; we need to change all references.