twin / aws-eks-asg-rolling-update-handler
Handles rolling upgrades for AWS ASGs on EKS
License: Apache License 2.0
We're running Cluster Autoscaler and aws-eks-asg-rolling-update-handler in the same EKS cluster. While Cluster Autoscaler was trying to scale up the nodes due to increased traffic, aws-eks-asg-rolling-update-handler made a SetDesiredCapacity request that lowered the number of active instances in the ASG, causing an outage.
"eventTime": "2023-06-28T23:21:46Z",
"arn": "arn:aws:iam::xxxxx:role/xxxxx-aws-eks-asg-rolling-update-handler",
"requestParameters": {
"desiredCapacity": 90,
"autoScalingGroupName": "xxxxx-2022020200515369970000000b",
"honorCooldown": true
},
"requestID": "b02fcaa3-cbbd-4d13-a183-828fdae4477f",
"eventID": "41e31359-c299-4481-bed5-06bed8261347",
ASG activity history:
Successful - Terminating EC2 instance: i-0f525e751d21ff18b
At 2023-06-28T23:21:46Z a user request explicitly set group desired capacity changing the desired capacity from 99 to 90.
At 2023-06-28T23:21:48Z an instance was taken out of service in response to a difference between desired and actual capacity, shrinking the capacity from 99 to 90.
At 2023-06-28T23:21:49Z instance i-087d61bdf2a96135d was selected for termination.
At 2023-06-28T23:21:49Z instance i-07c86111988c8f033 was selected for termination.
At 2023-06-28T23:21:49Z instance i-0635b60dc9625ee25 was selected for termination.
At 2023-06-28T23:21:49Z instance i-036cf4d15fee9fe41 was selected for termination.
At 2023-06-28T23:21:49Z instance i-021dc05fa50e14344 was selected for termination.
At 2023-06-28T23:21:49Z instance i-012bfb97195f1054d was selected for termination.
At 2023-06-28T23:21:49Z instance i-0fcd7d7595f9e49a8 was selected for termination.
At 2023-06-28T23:21:50Z instance i-0f9a8745a9e9f2e23 was selected for termination.
We would like aws-eks-asg-rolling-update-handler to not use SetDesiredCapacity by itself and to instead rely on Cluster Autoscaler to bring up new nodes: aws-eks-asg-rolling-update-handler should evict the node and let Cluster Autoscaler deal with the missing nodes.
Steps to reproduce:
1. Create an EKS cluster environment with Cluster Autoscaler & aws-eks-asg-rolling-update-handler enabled.
2. While rolling through nodes that have many pods with a PDB of maxUnavailable: 0, keep increasing the HPA to cause Cluster Autoscaler to bring up new nodes.
3. At some point, because of the 5-minute eviction timeout, aws-eks-asg-rolling-update-handler will be out of sync with the number of nodes Cluster Autoscaler brought up and will send a SetDesiredCapacity request that is lower than the current ASG size (see the sketch below).
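One way to avoid this class of outage would be to never let a stale in-memory value shrink the group. The following is a minimal, hypothetical sketch of such a safeguard using aws-sdk-go v1; it is not the handler's actual code, and the function name is illustrative:

```go
// Hypothetical safeguard sketch: re-read the ASG right before calling
// SetDesiredCapacity and refuse to ever lower the desired count, so a stale
// value cannot undo a scale-up performed by Cluster Autoscaler in the meantime.
package main

import (
	"fmt"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/service/autoscaling"
	"github.com/aws/aws-sdk-go/service/autoscaling/autoscalingiface"
)

func setDesiredCapacityIfNotLower(svc autoscalingiface.AutoScalingAPI, asgName string, target int64) error {
	out, err := svc.DescribeAutoScalingGroups(&autoscaling.DescribeAutoScalingGroupsInput{
		AutoScalingGroupNames: []*string{aws.String(asgName)},
	})
	if err != nil {
		return err
	}
	if len(out.AutoScalingGroups) == 0 {
		return fmt.Errorf("ASG %s not found", asgName)
	}
	current := aws.Int64Value(out.AutoScalingGroups[0].DesiredCapacity)
	if target < current {
		// Never shrink the group: Cluster Autoscaler may have scaled it up since we last looked.
		return fmt.Errorf("refusing to lower desired capacity of %s from %d to %d", asgName, current, target)
	}
	_, err = svc.SetDesiredCapacity(&autoscaling.SetDesiredCapacityInput{
		AutoScalingGroupName: aws.String(asgName),
		DesiredCapacity:      aws.Int64(target),
		HonorCooldown:        aws.Bool(true),
	})
	return err
}
```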
Version: 1.8.0
Currently, aws-eks-asg-rolling-update-handler uses the EC2 instance id (i-xxxxxxxxxxxxx) to find the matching node in the Kubernetes API; however, the hostname label defaults to ip-x-x-x-x.<searchdomain>. A fallback option would be nice to make sure that matches succeed with any combination. The searchdomain is the first domain in the DHCP options set.
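A minimal sketch of what such a fallback could look like, assuming the handler keeps its current hostname-label lookup as the fast path (the function name is illustrative): the node's spec.providerID always ends with the instance id (e.g. aws:///us-west-2c/i-0123456789abcdef0), regardless of any custom search domain.

```go
// Hypothetical fallback: match on the providerID suffix when no node carries a
// kubernetes.io/hostname label equal to the EC2 instance id.
package k8s

import (
	"context"
	"fmt"
	"strings"

	v1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

func getNodeByInstanceID(ctx context.Context, client kubernetes.Interface, instanceID string) (*v1.Node, error) {
	// Fast path: a node whose hostname label is the instance id.
	nodes, err := client.CoreV1().Nodes().List(ctx, metav1.ListOptions{
		LabelSelector: "kubernetes.io/hostname=" + instanceID,
	})
	if err != nil {
		return nil, err
	}
	if len(nodes.Items) == 1 {
		return &nodes.Items[0], nil
	}
	// Fallback: scan all nodes and match on the providerID suffix.
	all, err := client.CoreV1().Nodes().List(ctx, metav1.ListOptions{})
	if err != nil {
		return nil, err
	}
	for i := range all.Items {
		if strings.HasSuffix(all.Items[i].Spec.ProviderID, "/"+instanceID) {
			return &all.Items[i], nil
		}
	}
	return nil, fmt.Errorf("no node found for instance %s", instanceID)
}
```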
Logs:
2020/07/22 15:27:08 [cluster-name-xxxxxxxxxxx_us-west-2c2020072119285176410000000c] outdated=2; updated=0; updatedAndReady=0; asgCurrent=2; asgDesired=2; asgMax=10
2020/07/22 15:27:08 [cluster-name-xxxxxxxxxxx_us-west-2c2020072119285176410000000c][i-02943160e67727188] Skipping because unable to get outdated node from Kubernetes: nodes with hostname "i-029xxx60e67727188" not found
2020/07/22 15:27:08 [cluster-name-xxxxxxxxxxx_us-west-2c2020072119285176410000000c][i-0bbxxx783491a4d77] Skipping because unable to get outdated node from Kubernetes: nodes with hostname "i-0bbxxx783491a4d77" not found
2020/07/22 15:27:08 Sleeping for 20 seconds
```
kubectl describe node ip-10-5-233-178.us-west-2.compute.internal
Name:   ip-10-5-233-178.us-west-2.compute.internal
Roles:  <none>
Labels: beta.kubernetes.io/arch=amd64
        beta.kubernetes.io/instance-type=c5.xlarge
        beta.kubernetes.io/os=linux
        failure-domain.beta.kubernetes.io/region=us-west-2
        failure-domain.beta.kubernetes.io/zone=us-west-2c
        kubernetes.io/arch=amd64
        kubernetes.io/hostname=ip-10-5-233-178.us-west-2.i.test.top.secret.com <---- custom hostname
        kubernetes.io/os=linux
        pool=xxxxxxxxxxx
```
See openshift/kubernetes-drain#4
openshift/kubernetes-drain looks dead, might have to switch to a different library or implement my own drain.
The releases v0.0.10 and v0.0.11 both added decent self-healing capabilities to aws-eks-asg-rolling-update-handler, but perhaps healing through a panic is a little too violent and the health of the application should be exposed through an HTTP endpoint and consumed by the liveness probe.
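A minimal sketch of what such a liveness endpoint could look like (this is an illustration, not necessarily how the handler implements it): the main loop records a timestamp after every successful execution, and /health fails if that timestamp is too old, so the kubelet restarts the pod instead of relying on a panic.

```go
// Hypothetical /health endpoint for the liveness probe.
package main

import (
	"net/http"
	"sync/atomic"
	"time"
)

var lastSuccessfulRun atomic.Int64 // unix seconds of the last successful execution

// markHealthy is called by the main loop after each successful execution.
func markHealthy() {
	lastSuccessfulRun.Store(time.Now().Unix())
}

func healthHandler(w http.ResponseWriter, r *http.Request) {
	// Consider the application unhealthy if no execution succeeded in the last 5 minutes.
	if time.Since(time.Unix(lastSuccessfulRun.Load(), 0)) > 5*time.Minute {
		http.Error(w, "no successful execution in the last 5 minutes", http.StatusServiceUnavailable)
		return
	}
	w.WriteHeader(http.StatusOK)
	_, _ = w.Write([]byte("ok"))
}

func main() {
	markHealthy()
	http.HandleFunc("/health", healthHandler)
	// The Deployment's livenessProbe would point at this port and path.
	_ = http.ListenAndServe(":8080", nil)
}
```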
[cluster-a-us-east-1-large-nodes-az1] outdated=10; updated=0; updatedAndReady=0; asgCurrent=10; asgDesired=10; asgMax=10
[cluster-a-us-east-1-large-nodes-az1][i-0xyz] Node already started rollout process
[cluster-a-us-east-1-large-nodes-az1][i-0xyz] Updated nodes do not have enough resources available, increasing desired count by 1
[cluster-a-us-east-1-large-nodes-az1][i-0xyz] Unable to increase ASG desired size: cannot increase ASG desired size above max ASG size
[cluster-a-us-east-1-large-nodes-az1][i-0xyz] Skipping
Imagine a world where there are primary, secondary, and tertiary ASGs available to use. In that scenario, if the primary ASG is maxed out, the secondary or tertiary one can be used without any issues.
Would it be possible to implement a feature where aws-eks-asg-rolling-update-handler bumps a secondary or tertiary ASG if the primary one is full? It would determine whether 2 or more ASGs are grouped together with the help of an ASG tag provided by the user (i.e. if a tag X has value Y then the ASG belongs to group Y; if tag X has value Z then it belongs to group Z).
That way I don't have to manually bump the ASG max size.
8 months
An easy win would be to expose this error as a separate Prometheus metric. While rolling_update_handler_errors is good, it doesn't differentiate between different types of errors (or maybe add the error type as a label on the rolling_update_handler_errors metric, at the cost of some cardinality). This way I can create an alert when this happens rather than constantly monitoring the logs.
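A rough sketch of what the labelled counter could look like; the metric name, label, and values here are illustrative, not the handler's existing metrics:

```go
// Hypothetical error counter partitioned by error type, so an alert can target
// a specific failure mode such as "asg_max_reached".
package metrics

import (
	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promauto"
)

var errorsTotal = promauto.NewCounterVec(prometheus.CounterOpts{
	Name: "rolling_update_handler_errors_total",
	Help: "Errors encountered by the rolling update handler, partitioned by type.",
}, []string{"type"})

// RecordError increments the counter for the given error type,
// e.g. "asg_max_reached", "drain_failed", "aws_api".
func RecordError(errorType string) {
	errorsTotal.WithLabelValues(errorType).Inc()
}
```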
When checking if the updated instances have enough resources, pods from daemon sets should be excluded.
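A short sketch of how that exclusion could be done (helper names are illustrative): DaemonSet pods run on every node anyway, so they should not count as workload that has to be rescheduled onto the updated instances.

```go
// Hypothetical filter that drops DaemonSet-owned pods before capacity checks.
package main

import (
	v1 "k8s.io/api/core/v1"
)

// isDaemonSetPod returns true if the pod is controlled by a DaemonSet.
func isDaemonSetPod(pod *v1.Pod) bool {
	for _, ref := range pod.OwnerReferences {
		if ref.Controller != nil && *ref.Controller && ref.Kind == "DaemonSet" {
			return true
		}
	}
	return false
}

// podsToReschedule filters out DaemonSet pods before summing up resource requests.
func podsToReschedule(pods []v1.Pod) []v1.Pod {
	var result []v1.Pod
	for i := range pods {
		if !isDaemonSetPod(&pods[i]) {
			result = append(result, pods[i])
		}
	}
	return result
}
```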
Hi, I have recently updated EKS to 1.19 and updated the AMI to the 1.19 version in the autoscaling group (which uses a launch template), and deployed the handler successfully, but after checking the logs nothing happens; it just writes "Starting execution" and "Execution took ...".
Please advise. Here is my configuration file:
```yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  name: aws-eks-asg-rolling-update-handler
  namespace: kube-system
  labels:
    k8s-app: aws-eks-asg-rolling-update-handler
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: aws-eks-asg-rolling-update-handler
  labels:
    k8s-app: aws-eks-asg-rolling-update-handler
rules:
  - apiGroups:
      - "*"
    resources:
      - "*"
    verbs:
      - get
      - list
      - watch
  - apiGroups:
      - "*"
    resources:
      - nodes
    verbs:
      - get
      - list
      - watch
      - update
      - patch
  - apiGroups:
      - "*"
    resources:
      - pods/eviction
    verbs:
      - get
      - list
      - create
  - apiGroups:
      - "*"
    resources:
      - pods
    verbs:
      - get
      - list
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: aws-eks-asg-rolling-update-handler
  labels:
    k8s-app: aws-eks-asg-rolling-update-handler
roleRef:
  kind: ClusterRole
  name: aws-eks-asg-rolling-update-handler
  apiGroup: rbac.authorization.k8s.io
subjects:
  - kind: ServiceAccount
    name: aws-eks-asg-rolling-update-handler
    namespace: kube-system
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: aws-eks-asg-rolling-update-handler
  namespace: kube-system
  labels:
    k8s-app: aws-eks-asg-rolling-update-handler
spec:
  replicas: 1
  selector:
    matchLabels:
      k8s-app: aws-eks-asg-rolling-update-handler
  template:
    metadata:
      labels:
        k8s-app: aws-eks-asg-rolling-update-handler
    spec:
      automountServiceAccountToken: true
      serviceAccountName: aws-eks-asg-rolling-update-handler
      restartPolicy: Always
      dnsPolicy: Default
      containers:
        - name: aws-eks-asg-rolling-update-handler
          image: twinproduction/aws-eks-asg-rolling-update-handler
          imagePullPolicy: Always
          env:
            - name: AUTO_SCALING_GROUP_NAMES
              value: "asg-name"
            - name: CLUSTER_NAME
              value: some-name
            - name: AWS_REGION
              value: some-region
```
logs:
2021/09/23 08:06:06 Starting execution
2021/09/23 08:06:09 Execution took 3267ms, sleeping for 20s
2021/09/23 08:06:29 Starting execution
2021/09/23 08:06:29 Execution took 103ms, sleeping for 20s
2021/09/23 08:06:49 Starting execution
2021/09/23 08:06:49 Execution took 73ms, sleeping for 20s
2021/09/23 08:07:09 Starting execution
2021/09/23 08:07:09 Execution took 95ms, sleeping for 20s
2021/09/23 08:07:29 Starting execution
2021/09/23 08:07:29 Execution took 87ms, sleeping for 20s
2021/09/23 08:07:49 Starting execution
2021/09/23 08:07:50 Execution took 80ms, sleeping for 20s
2021/09/23 08:08:10 Starting execution
2021/09/23 08:08:10 Execution took 128ms, sleeping for 20s
Newer versions require a context in the helper:
1.4.0 -> https://github.com/TwiN/aws-eks-asg-rolling-update-handler/blob/v1.4.0/vendor/k8s.io/kubectl/pkg/drain/default.go#L51
1.4.2 -> https://github.com/TwiN/aws-eks-asg-rolling-update-handler/blob/v1.4.2/vendor/k8s.io/kubectl/pkg/drain/default.go#L53
2022/08/11 06:36:10 [xx][i-1] Updated nodes have enough resources available
2022/08/11 06:36:10 [xx][i-1] Draining node
2022/08/11 06:36:10 [ip-1][DRAINER] Failed to cordon node: RunCordonOrUncordon error: drainer.Ctx can't be nil
2022/08/11 06:36:10 [xx][i-1] Skipping because ran into error while draining node: RunCordonOrUncordon error: drainer.Ctx can't be nil
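A minimal sketch of the fix implied by that error: in newer versions of k8s.io/kubectl, drain.Helper requires a non-nil Ctx, so it has to be populated when the helper is built. The other field values below are illustrative, not the handler's exact configuration.

```go
// Hypothetical construction of a drain.Helper with the now-required Ctx set.
package k8s

import (
	"context"
	"os"
	"time"

	"k8s.io/client-go/kubernetes"
	"k8s.io/kubectl/pkg/drain"
)

func newDrainHelper(client kubernetes.Interface) *drain.Helper {
	return &drain.Helper{
		Ctx:                 context.Background(), // must not be nil in newer versions
		Client:              client,
		Force:               true,
		IgnoreAllDaemonSets: true,
		GracePeriodSeconds:  -1,
		Timeout:             5 * time.Minute,
		Out:                 os.Stdout,
		ErrOut:              os.Stderr,
	}
}
```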
Hi,
I'm getting the following error, and the handler is stuck:
Unable to increase ASG desired size: unable to increase ASG <ASGNAME> desired count to 7: ScalingActivityInProgress: Scaling activity 0ea5f0da-97c6-1d64-5127-43e061c16819 is in progress and blocks this action
status code: 400, request id: 740c62cf-be5a-4dab-9e9d-466d3b8fd36f
ASG Status:
It seems that when the desired instance count and the current running instance count are equal, it throws this error.
The error went away when I manually increased the desired count to 9, and it started rolling out the instances.
Please advise.
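For context, ScalingActivityInProgress only means that an Auto Scaling activity is still running; one way to treat it as transient rather than as a hard failure is sketched below. This is an illustration of the idea, not necessarily how the handler reacts today; the function names are hypothetical.

```go
// Hypothetical handling that treats ScalingActivityInProgress as retryable.
package cloud

import (
	"log"

	"github.com/aws/aws-sdk-go/aws/awserr"
)

// isScalingActivityInProgress reports whether err is the ScalingActivityInProgress
// error returned by the Auto Scaling API.
func isScalingActivityInProgress(err error) bool {
	if aerr, ok := err.(awserr.Error); ok {
		return aerr.Code() == "ScalingActivityInProgress"
	}
	return false
}

func handleIncreaseError(err error) {
	if isScalingActivityInProgress(err) {
		log.Println("Scaling activity in progress, will retry on the next execution")
		return
	}
	log.Printf("Unable to increase ASG desired size: %v", err)
}
```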
The getRollingUpdateTimestampsFromNode(node) conditions used throughout the code to check whether a node should be cordoned have a flaw: if something or someone decides to stop the rolling update and manually uncordons the nodes, on its next start the handler will happily evict pods on that node without actually cordoning anything. It is especially problematic when using the eager cordoning feature, since it leads to an upgrade that can never end.
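A small sketch of one way to close this gap (illustrative, not the current code): before evicting anything, verify that the node is actually unschedulable and re-cordon it if it was manually uncordoned mid-rollout.

```go
// Hypothetical re-cordon guard run before draining a node.
package k8s

import (
	"context"

	v1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/types"
	"k8s.io/client-go/kubernetes"
)

// ensureCordoned re-applies the cordon if the node is no longer unschedulable.
func ensureCordoned(ctx context.Context, client kubernetes.Interface, node *v1.Node) error {
	if node.Spec.Unschedulable {
		return nil // still cordoned, nothing to do
	}
	patch := []byte(`{"spec":{"unschedulable":true}}`)
	_, err := client.CoreV1().Nodes().Patch(ctx, node.Name, types.StrategicMergePatchType, patch, metav1.PatchOptions{})
	return err
}
```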
The current behaviour is to iterate over every outdated node, cordon it, and then drain it immediately afterwards. I think the behaviour should instead be to first cordon all outdated instances before doing anything else, and then proceed as usual.
I wish for this feature because the current behaviour often (in my experience) leads to pods being rescheduled onto another outdated instance. This causes a lot of pod restarts during rolling updates as pods get replaced more than once. It is especially bad for pods with a long terminationGracePeriod or a long startup period; it can happen that a pod doesn't even become ready after a replacement before it gets replaced again. (See the sketch below.)
~3-4 months
I would volunteer to implement this feature, even with backward compatibility if required.
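A sketch of the proposed two-pass behaviour (the interface and function names are hypothetical): cordon every outdated node up front so evicted pods can only land on updated nodes, then drain the outdated nodes one by one as today.

```go
// Hypothetical two-pass rollout: cordon all outdated nodes first, then drain.
package main

import (
	"context"

	v1 "k8s.io/api/core/v1"
)

type nodeOps interface {
	Cordon(ctx context.Context, node *v1.Node) error
	Drain(ctx context.Context, node *v1.Node) error
}

func rollOutdatedNodes(ctx context.Context, ops nodeOps, outdated []*v1.Node) error {
	// Pass 1: cordon everything, so no pod gets rescheduled onto an outdated node.
	for _, node := range outdated {
		if err := ops.Cordon(ctx, node); err != nil {
			return err
		}
	}
	// Pass 2: drain (and later terminate) the nodes one at a time, as today.
	for _, node := range outdated {
		if err := ops.Drain(ctx, node); err != nil {
			return err
		}
	}
	return nil
}
```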
This is a bit tricky: if, for instance, the max is 1 and the desired size is 1 before the upgrade, it would be fine to increase the ASG by 1 temporarily, let the new instance spin up, evict the old node, delete it, and decrease the ASG back to 1. But what if there's not enough space on the single new instance (i.e. if the instance type changed)? Should we then keep increasing? That would mean we couldn't go back to the original max size of the ASG.
Hi,
Apologies if this is not raised in the correct way or its the wrong place, but not sure where else to ask this.
I'd like to find out whether or not aws-eks-asg-rolling-update-handler can be scaled to more than 1 replica for HA without encountering any issues (i.e. conflicts, duplication, etc.).
I've been trying to find docs on this, but so far I have not been successful. Any help or guidance around running aws-eks-asg-rolling-update-handler highly available would be greatly appreciated.
Thanks, I look forward to your reply.
Nitin
I was trying to upgrade EKS from v1.19 to v1.20 but the handler panicked:
2021/08/23 09:17:03 Starting execution
2021/08/23 09:17:04 [worker-eu-central-1a-020210823065201810300000009] outdated=2; updated=1; updatedAndReady=1; asgCurrent=3; asgDesired=3; asgMax=3
2021/08/23 09:17:04 [worker-eu-central-1a-020210823065201810300000009][i-xxxxxxxx] Node already started rollout process
2021/08/23 09:17:04 [worker-eu-central-1a-020210823065201810300000009][i-xxxxxxxx] Updated nodes have enough resources available
2021/08/23 09:17:04 [worker-eu-central-1a-020210823065201810300000009][i-xxxxxxxx] Draining node
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x20 pc=0x167787a]
goroutine 41 [running]:
golang.org/x/time/rate.(*Limiter).WaitN(0xc00007f180, 0x0, 0x0, 0x1, 0x0, 0x0)
/app/vendor/golang.org/x/time/rate/rate.go:237 +0xba
golang.org/x/time/rate.(*Limiter).Wait(...)
/app/vendor/golang.org/x/time/rate/rate.go:219
k8s.io/client-go/util/flowcontrol.(*tokenBucketRateLimiter).Wait(0xc0002c3d80, 0x0, 0x0, 0xc000644680, 0xc0009550d8)
/app/vendor/k8s.io/client-go/util/flowcontrol/throttle.go:106 +0x4b
k8s.io/client-go/rest.(*Request).tryThrottleWithInfo(0xc0007685a0, 0x0, 0x0, 0x0, 0x0, 0x42, 0x40)
/app/vendor/k8s.io/client-go/rest/request.go:587 +0xa5
k8s.io/client-go/rest.(*Request).tryThrottle(...)
/app/vendor/k8s.io/client-go/rest/request.go:613
k8s.io/client-go/rest.(*Request).request(0xc0007685a0, 0x0, 0x0, 0xc0009556c8, 0x0, 0x0)
/app/vendor/k8s.io/client-go/rest/request.go:873 +0x2fc
k8s.io/client-go/rest.(*Request).Do(0xc0007685a0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
/app/vendor/k8s.io/client-go/rest/request.go:980 +0xf1
k8s.io/client-go/kubernetes/typed/core/v1.(*nodes).Patch(0xc000722480, 0x0, 0x0, 0xc0008ca000, 0x2e, 0x1f67f8d, 0x26, 0xc000573460, 0x1f, 0x20, ...)
/app/vendor/k8s.io/client-go/kubernetes/typed/core/v1/node.go:186 +0x237
k8s.io/kubectl/pkg/drain.(*CordonHelper).PatchOrReplaceWithContext(0xc000955ab0, 0x0, 0x0, 0x226c958, 0xc0002dab00, 0x1cf0100, 0x0, 0x0, 0x7fa2a30ae8f0, 0x10)
/app/vendor/k8s.io/kubectl/pkg/drain/cordon.go:102 +0x416
k8s.io/kubectl/pkg/drain.RunCordonOrUncordon(0xc00070e8f0, 0xc00077ef00, 0xc000722401, 0xc000504a80, 0x2e)
/app/vendor/k8s.io/kubectl/pkg/drain/default.go:60 +0xb3
github.com/TwinProduction/aws-eks-asg-rolling-update-handler/k8s.(*KubernetesClient).Drain(0xc0003e4760, 0xc000504a80, 0x2e, 0x101, 0x2, 0x1)
/app/k8s/client.go:125 +0x245
main.DoHandleRollingUpgrade(0x22537d8, 0xc0003e4760, 0x227b0e8, 0xc0003e4480, 0x2270ef8, 0xc0003e4490, 0xc0007001b0, 0x3, 0x3, 0x0)
/app/main.go:161 +0x14b4
main.HandleRollingUpgrade.func2(0xc0007df500, 0x22537d8, 0xc0003e4760, 0x227b0e8, 0xc0003e4480, 0x2270ef8, 0xc0003e4490, 0xc0007001b0, 0x3, 0x3)
/app/main.go:96 +0x94
created by main.HandleRollingUpgrade
/app/main.go:95 +0x12e
It was deployed via Helm with the following override config:
```yaml
image:
  tag: "latest"
environmentVars:
  - name: CLUSTER_NAME
    value: "cluster_name"
  - name: AWS_REGION
    value: "eu-central-1"
serviceAccount:
  annotations:
    eks.amazonaws.com/role-arn: "arn:aws:iam::111111:role/RollingUpdate"
```
https://github.com/TwinProduction/aws-eks-asg-rolling-update-handler/blob/master/cloud/aws.go#L29
Paging is the recommended approach here.
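A sketch of the paged lookup suggested above, using the paginated call available in aws-sdk-go v1 so results spread across multiple pages are all collected (the function name is illustrative):

```go
// Hypothetical paged DescribeAutoScalingGroups lookup.
package cloud

import (
	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/service/autoscaling"
	"github.com/aws/aws-sdk-go/service/autoscaling/autoscalingiface"
)

func describeAutoScalingGroups(svc autoscalingiface.AutoScalingAPI, names []string) ([]*autoscaling.Group, error) {
	var groups []*autoscaling.Group
	input := &autoscaling.DescribeAutoScalingGroupsInput{
		AutoScalingGroupNames: aws.StringSlice(names),
	}
	// DescribeAutoScalingGroupsPages keeps calling the API until every page has been consumed.
	err := svc.DescribeAutoScalingGroupsPages(input, func(page *autoscaling.DescribeAutoScalingGroupsOutput, lastPage bool) bool {
		groups = append(groups, page.AutoScalingGroups...)
		return true // keep iterating through all pages
	})
	return groups, err
}
```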
The current tag lookup works if you want to manage all ASGs uniformly; however, you cannot exclude ASGs that are still managed by the OSS Cluster Autoscaler.
Support adding a specific tag to enable/disable the update handler on individual ASGs, for the update handler to look up at runtime.
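A sketch of the proposed opt-in behaviour; the tag key below is an example chosen for illustration, not an existing handler option:

```go
// Hypothetical filter that only keeps ASGs carrying an explicit enable tag,
// so groups still managed by Cluster Autoscaler can be left alone.
package cloud

import (
	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/service/autoscaling"
)

const enableTagKey = "aws-eks-asg-rolling-update-handler/enabled" // hypothetical tag key

func filterEnabledGroups(groups []*autoscaling.Group) []*autoscaling.Group {
	var enabled []*autoscaling.Group
	for _, group := range groups {
		for _, tag := range group.Tags {
			if aws.StringValue(tag.Key) == enableTagKey && aws.StringValue(tag.Value) == "true" {
				enabled = append(enabled, group)
				break
			}
		}
	}
	return enabled
}
```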
It would allow us to define an off-hours time range and only roll nodes during that time, e.g. via HANDLER_START_HOUR and HANDLER_STOP_HOUR:
From HANDLER_START_HOUR to HANDLER_STOP_HOUR, it should do what it currently does.
From HANDLER_STOP_HOUR to HANDLER_START_HOUR, it should not cordon or drain any node.
That way I don't have to wake up at night to merge my PR and/or trigger the pipeline to change the template and kick off asg-rolling-update-handler :)
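A sketch of the time-window check the feature describes; the environment variable names come from the proposal above, while the parsing and defaults are illustrative. Outside the window the handler would simply skip cordoning and draining for that execution.

```go
// Hypothetical maintenance-window check based on HANDLER_START_HOUR / HANDLER_STOP_HOUR.
package main

import (
	"os"
	"strconv"
	"time"
)

// withinMaintenanceWindow returns true if now falls between the configured
// start and stop hours, handling windows that wrap around midnight.
func withinMaintenanceWindow(now time.Time) bool {
	start, err1 := strconv.Atoi(os.Getenv("HANDLER_START_HOUR"))
	stop, err2 := strconv.Atoi(os.Getenv("HANDLER_STOP_HOUR"))
	if err1 != nil || err2 != nil {
		return true // no (or invalid) window configured: behave as today
	}
	hour := now.Hour()
	if start <= stop {
		return hour >= start && hour < stop
	}
	// Window wraps past midnight, e.g. start=22, stop=6.
	return hour >= start || hour < stop
}
```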
6 months
This is an amazing project!!! Thank you!!!
In client.Drain(), the value of GracePeriodSeconds is hard-coded but should be configurable via an environment variable. The default should stay -1 if no other value is set.
Letting the Pod decide how long the grace period should be is perfectly fine for production environments, in order to not forcefully delete any workload, but for test environments it delays the rolling update a lot.
I have a lot of pods with different grace period configurations on a large number of nodes, and thus I would like to increase the rolling update speed by setting a smaller grace period. Setting a small grace period introduces the risk of forcefully killing Pods before they have terminated gracefully, but one might be totally fine with such behaviour in the circumstances explained above.
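A sketch of the proposed behaviour; the environment variable name below is an example, not an existing option. The fallback stays -1 so each pod's own terminationGracePeriodSeconds keeps being honoured by default.

```go
// Hypothetical env-var override for the eviction grace period used by drain.Helper.
package k8s

import (
	"os"
	"strconv"

	"k8s.io/kubectl/pkg/drain"
)

// gracePeriodSecondsFromEnv returns the configured eviction grace period,
// or -1 (respect each pod's own setting) when the variable is unset or invalid.
func gracePeriodSecondsFromEnv() int {
	if raw := os.Getenv("POD_TERMINATION_GRACE_PERIOD"); raw != "" { // hypothetical variable name
		if value, err := strconv.Atoi(raw); err == nil {
			return value
		}
	}
	return -1
}

func configureDrainer(drainer *drain.Helper) {
	drainer.GracePeriodSeconds = gracePeriodSecondsFromEnv()
}
```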
I would be happy to implement this feature myself in the next few weeks if @TwiN is fine with the proposal.