twin / aws-eks-asg-rolling-update-handler
Handles rolling upgrades for AWS ASGs on EKS
License: Apache License 2.0
We're running Cluster Autoscaler and aws-eks-asg-rolling-update-handler in the same EKS cluster. While Cluster Autoscaler was trying to scale up the nodes due to increased traffic, aws-eks-asg-rolling-update-handler made a SetDesiredCapacity request that lowered the number of active instances in the ASG, causing an outage.
"eventTime": "2023-06-28T23:21:46Z",
"arn": "arn:aws:iam::xxxxx:role/xxxxx-aws-eks-asg-rolling-update-handler",
"requestParameters": {
"desiredCapacity": 90,
"autoScalingGroupName": "xxxxx-2022020200515369970000000b",
"honorCooldown": true
},
"requestID": "b02fcaa3-cbbd-4d13-a183-828fdae4477f",
"eventID": "41e31359-c299-4481-bed5-06bed8261347",
ASG activity history:
Successful - Terminating EC2 instance: i-0f525e751d21ff18b
At 2023-06-28T23:21:46Z a user request explicitly set group desired capacity changing the desired capacity from 99 to 90.
At 2023-06-28T23:21:48Z an instance was taken out of service in response to a difference between desired and actual capacity, shrinking the capacity from 99 to 90.
At 2023-06-28T23:21:49Z instance i-087d61bdf2a96135d was selected for termination.
At 2023-06-28T23:21:49Z instance i-07c86111988c8f033 was selected for termination.
At 2023-06-28T23:21:49Z instance i-0635b60dc9625ee25 was selected for termination.
At 2023-06-28T23:21:49Z instance i-036cf4d15fee9fe41 was selected for termination.
At 2023-06-28T23:21:49Z instance i-021dc05fa50e14344 was selected for termination.
At 2023-06-28T23:21:49Z instance i-012bfb97195f1054d was selected for termination.
At 2023-06-28T23:21:49Z instance i-0fcd7d7595f9e49a8 was selected for termination.
At 2023-06-28T23:21:50Z instance i-0f9a8745a9e9f2e23 was selected for termination.
We would like aws-eks-asg-rolling-update-handler to not use SetDesiredCapacity by itself and to instead rely on Cluster Autoscaler to bring up new nodes: aws-eks-asg-rolling-update-handler should evict the node and let Cluster Autoscaler deal with the missing nodes.
Steps to reproduce:
1. Create an EKS cluster environment with Cluster Autoscaler & aws-eks-asg-rolling-update-handler enabled.
2. While rolling through nodes that have many pods with a PDB of maxUnavailable: 0, keep increasing the HPA to cause Cluster Autoscaler to bring up new nodes.
3. At some point, because of the 5-minute eviction timeout, aws-eks-asg-rolling-update-handler will be out of sync with the number of nodes Cluster Autoscaler brought up and will send a SetDesiredCapacity request that is lower than the current ASG size (see the sketch below).
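One way to avoid this class of outage would be to never let a stale in-memory value shrink the group. The following is a minimal, hypothetical sketch of such a safeguard using aws-sdk-go v1; it is not the handler's actual code, and the function name is illustrative:

```go
// Hypothetical safeguard sketch: re-read the ASG right before calling
// SetDesiredCapacity and refuse to ever lower the desired count, so a stale
// value cannot undo a scale-up performed by Cluster Autoscaler in the meantime.
package main

import (
	"fmt"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/service/autoscaling"
	"github.com/aws/aws-sdk-go/service/autoscaling/autoscalingiface"
)

func setDesiredCapacityIfNotLower(svc autoscalingiface.AutoScalingAPI, asgName string, target int64) error {
	out, err := svc.DescribeAutoScalingGroups(&autoscaling.DescribeAutoScalingGroupsInput{
		AutoScalingGroupNames: []*string{aws.String(asgName)},
	})
	if err != nil {
		return err
	}
	if len(out.AutoScalingGroups) == 0 {
		return fmt.Errorf("ASG %s not found", asgName)
	}
	current := aws.Int64Value(out.AutoScalingGroups[0].DesiredCapacity)
	if target < current {
		// Never shrink the group: Cluster Autoscaler may have scaled it up since we last looked.
		return fmt.Errorf("refusing to lower desired capacity of %s from %d to %d", asgName, current, target)
	}
	_, err = svc.SetDesiredCapacity(&autoscaling.SetDesiredCapacityInput{
		AutoScalingGroupName: aws.String(asgName),
		DesiredCapacity:      aws.Int64(target),
		HonorCooldown:        aws.Bool(true),
	})
	return err
}
```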
Version: 1.8.0
Currently, aws-eks-asg-rolling-update-handler uses the EC2 instance id (i-xxxxxxxxxxxxx) to find the matching node in the Kubernetes API; however, the hostname label defaults to ip-x-x-x-x.<searchdomain>. A fallback option would be nice to make sure that matches succeed with any combination. The searchdomain is the first domain in the DHCP options set.
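A minimal sketch of what such a fallback could look like, assuming the handler keeps its current hostname-label lookup as the fast path (the function name is illustrative): the node's spec.providerID always ends with the instance id (e.g. aws:///us-west-2c/i-0123456789abcdef0), regardless of any custom search domain.

```go
// Hypothetical fallback: match on the providerID suffix when no node carries a
// kubernetes.io/hostname label equal to the EC2 instance id.
package k8s

import (
	"context"
	"fmt"
	"strings"

	v1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

func getNodeByInstanceID(ctx context.Context, client kubernetes.Interface, instanceID string) (*v1.Node, error) {
	// Fast path: a node whose hostname label is the instance id.
	nodes, err := client.CoreV1().Nodes().List(ctx, metav1.ListOptions{
		LabelSelector: "kubernetes.io/hostname=" + instanceID,
	})
	if err != nil {
		return nil, err
	}
	if len(nodes.Items) == 1 {
		return &nodes.Items[0], nil
	}
	// Fallback: scan all nodes and match on the providerID suffix.
	all, err := client.CoreV1().Nodes().List(ctx, metav1.ListOptions{})
	if err != nil {
		return nil, err
	}
	for i := range all.Items {
		if strings.HasSuffix(all.Items[i].Spec.ProviderID, "/"+instanceID) {
			return &all.Items[i], nil
		}
	}
	return nil, fmt.Errorf("no node found for instance %s", instanceID)
}
```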
Logs:
2020/07/22 15:27:08 [cluster-name-xxxxxxxxxxx_us-west-2c2020072119285176410000000c] outdated=2; updated=0; updatedAndReady=0; asgCurrent=2; asgDesired=2; asgMax=10
2020/07/22 15:27:08 [cluster-name-xxxxxxxxxxx_us-west-2c2020072119285176410000000c][i-02943160e67727188] Skipping because unable to get outdated node from Kubernetes: nodes with hostname "i-029xxx60e67727188" not found
2020/07/22 15:27:08 [cluster-name-xxxxxxxxxxx_us-west-2c2020072119285176410000000c][i-0bbxxx783491a4d77] Skipping because unable to get outdated node from Kubernetes: nodes with hostname "i-0bbxxx783491a4d77" not found
2020/07/22 15:27:08 Sleeping for 20 seconds
```
kubectl describe node ip-10-5-233-178.us-west-2.compute.internal
Name:   ip-10-5-233-178.us-west-2.compute.internal
Roles:  <none>
Labels: beta.kubernetes.io/arch=amd64
        beta.kubernetes.io/instance-type=c5.xlarge
        beta.kubernetes.io/os=linux
        failure-domain.beta.kubernetes.io/region=us-west-2
        failure-domain.beta.kubernetes.io/zone=us-west-2c
        kubernetes.io/arch=amd64
        kubernetes.io/hostname=ip-10-5-233-178.us-west-2.i.test.top.secret.com <---- custom hostname
        kubernetes.io/os=linux
        pool=xxxxxxxxxxx
```
See openshift/kubernetes-drain#4
openshift/kubernetes-drain looks dead, might have to switch to a different library or implement my own drain.
The releases v0.0.10 and v0.0.11 both added decent self-healing capabilities to aws-eks-asg-rolling-update-handler, but perhaps healing through a panic is a little too violent and the health of the application should be exposed through an HTTP endpoint and consumed by the liveness probe.
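A minimal sketch of what such a liveness endpoint could look like (this is an illustration, not necessarily how the handler implements it): the main loop records a timestamp after every successful execution, and /health fails if that timestamp is too old, so the kubelet restarts the pod instead of relying on a panic.

```go
// Hypothetical /health endpoint for the liveness probe.
package main

import (
	"net/http"
	"sync/atomic"
	"time"
)

var lastSuccessfulRun atomic.Int64 // unix seconds of the last successful execution

// markHealthy is called by the main loop after each successful execution.
func markHealthy() {
	lastSuccessfulRun.Store(time.Now().Unix())
}

func healthHandler(w http.ResponseWriter, r *http.Request) {
	// Consider the application unhealthy if no execution succeeded in the last 5 minutes.
	if time.Since(time.Unix(lastSuccessfulRun.Load(), 0)) > 5*time.Minute {
		http.Error(w, "no successful execution in the last 5 minutes", http.StatusServiceUnavailable)
		return
	}
	w.WriteHeader(http.StatusOK)
	_, _ = w.Write([]byte("ok"))
}

func main() {
	markHealthy()
	http.HandleFunc("/health", healthHandler)
	// The Deployment's livenessProbe would point at this port and path.
	_ = http.ListenAndServe(":8080", nil)
}
```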
[cluster-a-us-east-1-large-nodes-az1] outdated=10; updated=0; updatedAndReady=0; asgCurrent=10; asgDesired=10; asgMax=10
[cluster-a-us-east-1-large-nodes-az1][i-0xyz] Node already started rollout process
[cluster-a-us-east-1-large-nodes-az1][i-0xyz] Updated nodes do not have enough resources available, increasing desired count by 1
[cluster-a-us-east-1-large-nodes-az1][i-0xyz] Unable to increase ASG desired size: cannot increase ASG desired size above max ASG size
[cluster-a-us-east-1-large-nodes-az1][i-0xyz] Skipping
Imagine a world where there are primary, secondary, and tertiary ASGs available to use. In that scenario, if the primary ASG is maxed out, the secondary or tertiary one can be used without any issues.
Would it be possible to implement a feature where aws-eks-asg-rolling-update-handler bumps a secondary or tertiary ASG if the primary one is full? It would determine whether 2 or more ASGs are grouped together with the help of an ASG tag provided by the user (i.e. if a tag X has value Y then the ASG belongs to group Y; if tag X has value Z then it belongs to group Z).
That way I don't have to manually bump the ASG max size.
8 months
An easy win would be to expose this error as a separate Prometheus metric. While rolling_update_handler_errors is good, it doesn't differentiate between different types of errors (or maybe add the error type as a label on the rolling_update_handler_errors metric, at the cost of some cardinality). This way I can create an alert when this happens rather than constantly monitoring the logs.
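A rough sketch of what the labelled counter could look like; the metric name, label, and values here are illustrative, not the handler's existing metrics:

```go
// Hypothetical error counter partitioned by error type, so an alert can target
// a specific failure mode such as "asg_max_reached".
package metrics

import (
	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promauto"
)

var errorsTotal = promauto.NewCounterVec(prometheus.CounterOpts{
	Name: "rolling_update_handler_errors_total",
	Help: "Errors encountered by the rolling update handler, partitioned by type.",
}, []string{"type"})

// RecordError increments the counter for the given error type,
// e.g. "asg_max_reached", "drain_failed", "aws_api".
func RecordError(errorType string) {
	errorsTotal.WithLabelValues(errorType).Inc()
}
```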
When checking if the updated instances have enough resources, pods from daemon sets should be excluded.
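A short sketch of how that exclusion could be done (helper names are illustrative): DaemonSet pods run on every node anyway, so they should not count as workload that has to be rescheduled onto the updated instances.

```go
// Hypothetical filter that drops DaemonSet-owned pods before capacity checks.
package main

import (
	v1 "k8s.io/api/core/v1"
)

// isDaemonSetPod returns true if the pod is controlled by a DaemonSet.
func isDaemonSetPod(pod *v1.Pod) bool {
	for _, ref := range pod.OwnerReferences {
		if ref.Controller != nil && *ref.Controller && ref.Kind == "DaemonSet" {
			return true
		}
	}
	return false
}

// podsToReschedule filters out DaemonSet pods before summing up resource requests.
func podsToReschedule(pods []v1.Pod) []v1.Pod {
	var result []v1.Pod
	for i := range pods {
		if !isDaemonSetPod(&pods[i]) {
			result = append(result, pods[i])
		}
	}
	return result
}
```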
Hi, I have recently updated EKS to 1.19 and updated the AMI to the 1.19 version in the autoscaling group (which uses a launch template), and deployed the handler successfully, but after checking the logs nothing happens; it just writes "Starting execution" and "Execution took ...".
Please advise. Here is my configuration file:
```yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  name: aws-eks-asg-rolling-update-handler
  namespace: kube-system
  labels:
    k8s-app: aws-eks-asg-rolling-update-handler
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: aws-eks-asg-rolling-update-handler
  labels:
    k8s-app: aws-eks-asg-rolling-update-handler
rules:
  - apiGroups:
      - "*"
    resources:
      - "*"
    verbs:
      - get
      - list
      - watch
  - apiGroups:
      - "*"
    resources:
      - nodes
    verbs:
      - get
      - list
      - watch
      - update
      - patch
  - apiGroups:
      - "*"
    resources:
      - pods/eviction
    verbs:
      - get
      - list
      - create
  - apiGroups:
      - "*"
    resources:
      - pods
    verbs:
      - get
      - list
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: aws-eks-asg-rolling-update-handler
  labels:
    k8s-app: aws-eks-asg-rolling-update-handler
roleRef:
  kind: ClusterRole
  name: aws-eks-asg-rolling-update-handler
  apiGroup: rbac.authorization.k8s.io
subjects:
  - kind: ServiceAccount
    name: aws-eks-asg-rolling-update-handler
    namespace: kube-system
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: aws-eks-asg-rolling-update-handler
  namespace: kube-system
  labels:
    k8s-app: aws-eks-asg-rolling-update-handler
spec:
  replicas: 1
  selector:
    matchLabels:
      k8s-app: aws-eks-asg-rolling-update-handler
  template:
    metadata:
      labels:
        k8s-app: aws-eks-asg-rolling-update-handler
    spec:
      automountServiceAccountToken: true
      serviceAccountName: aws-eks-asg-rolling-update-handler
      restartPolicy: Always
      dnsPolicy: Default
      containers:
        - name: aws-eks-asg-rolling-update-handler
          image: twinproduction/aws-eks-asg-rolling-update-handler
          imagePullPolicy: Always
          env:
            - name: AUTO_SCALING_GROUP_NAMES
              value: "asg-name"
            - name: CLUSTER_NAME
              value: some-name
            - name: AWS_REGION
              value: some-region
```
logs:
2021/09/23 08:06:06 Starting execution
2021/09/23 08:06:09 Execution took 3267ms, sleeping for 20s
2021/09/23 08:06:29 Starting execution
2021/09/23 08:06:29 Execution took 103ms, sleeping for 20s
2021/09/23 08:06:49 Starting execution
2021/09/23 08:06:49 Execution took 73ms, sleeping for 20s
2021/09/23 08:07:09 Starting execution
2021/09/23 08:07:09 Execution took 95ms, sleeping for 20s
2021/09/23 08:07:29 Starting execution
2021/09/23 08:07:29 Execution took 87ms, sleeping for 20s
2021/09/23 08:07:49 Starting execution
2021/09/23 08:07:50 Execution took 80ms, sleeping for 20s
2021/09/23 08:08:10 Starting execution
2021/09/23 08:08:10 Execution took 128ms, sleeping for 20s
Newer versions require a context in the helper:
1.4.0 -> https://github.com/TwiN/aws-eks-asg-rolling-update-handler/blob/v1.4.0/vendor/k8s.io/kubectl/pkg/drain/default.go#L51
1.4.2 -> https://github.com/TwiN/aws-eks-asg-rolling-update-handler/blob/v1.4.2/vendor/k8s.io/kubectl/pkg/drain/default.go#L53
2022/08/11 06:36:10 [xx][i-1] Updated nodes have enough resources available
2022/08/11 06:36:10 [xx][i-1] Draining node
2022/08/11 06:36:10 [ip-1][DRAINER] Failed to cordon node: RunCordonOrUncordon error: drainer.Ctx can't be nil
2022/08/11 06:36:10 [xx][i-1] Skipping because ran into error while draining node: RunCordonOrUncordon error: drainer.Ctx can't be nil
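A minimal sketch of the fix implied by that error: in newer versions of k8s.io/kubectl, drain.Helper requires a non-nil Ctx, so it has to be populated when the helper is built. The other field values below are illustrative, not the handler's exact configuration.

```go
// Hypothetical construction of a drain.Helper with the now-required Ctx set.
package k8s

import (
	"context"
	"os"
	"time"

	"k8s.io/client-go/kubernetes"
	"k8s.io/kubectl/pkg/drain"
)

func newDrainHelper(client kubernetes.Interface) *drain.Helper {
	return &drain.Helper{
		Ctx:                 context.Background(), // must not be nil in newer versions
		Client:              client,
		Force:               true,
		IgnoreAllDaemonSets: true,
		GracePeriodSeconds:  -1,
		Timeout:             5 * time.Minute,
		Out:                 os.Stdout,
		ErrOut:              os.Stderr,
	}
}
```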
Hi,
I'm getting the following error, and the handler is stuck:
Unable to increase ASG desired size: unable to increase ASG <ASGNAME> desired count to 7: ScalingActivityInProgress: Scaling activity 0ea5f0da-97c6-1d64-5127-43e061c16819 is in progress and blocks this action
status code: 400, request id: 740c62cf-be5a-4dab-9e9d-466d3b8fd36f
ASG Status:
It seems that when the desired instance count and the current running instance count are equal, it throws this error.
The error went away when I manually increased the desired count to 9, and it started rolling out the instances.
Please advise.
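For context, ScalingActivityInProgress only means that an Auto Scaling activity is still running; one way to treat it as transient rather than as a hard failure is sketched below. This is an illustration of the idea, not necessarily how the handler reacts today; the function names are hypothetical.

```go
// Hypothetical handling that treats ScalingActivityInProgress as retryable.
package cloud

import (
	"log"

	"github.com/aws/aws-sdk-go/aws/awserr"
)

// isScalingActivityInProgress reports whether err is the ScalingActivityInProgress
// error returned by the Auto Scaling API.
func isScalingActivityInProgress(err error) bool {
	if aerr, ok := err.(awserr.Error); ok {
		return aerr.Code() == "ScalingActivityInProgress"
	}
	return false
}

func handleIncreaseError(err error) {
	if isScalingActivityInProgress(err) {
		log.Println("Scaling activity in progress, will retry on the next execution")
		return
	}
	log.Printf("Unable to increase ASG desired size: %v", err)
}
```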
The getRollingUpdateTimestampsFromNode(node) conditions used throughout the code to check whether a node should be cordoned have a flaw: if something or someone decides to stop the rolling update and manually uncordons the nodes, on its next start the handler will happily evict pods on that node without actually cordoning anything. It is especially problematic when using the eager cordoning feature, since it leads to an upgrade that can never end.
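A small sketch of one way to close this gap (illustrative, not the current code): before evicting anything, verify that the node is actually unschedulable and re-cordon it if it was manually uncordoned mid-rollout.

```go
// Hypothetical re-cordon guard run before draining a node.
package k8s

import (
	"context"

	v1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/types"
	"k8s.io/client-go/kubernetes"
)

// ensureCordoned re-applies the cordon if the node is no longer unschedulable.
func ensureCordoned(ctx context.Context, client kubernetes.Interface, node *v1.Node) error {
	if node.Spec.Unschedulable {
		return nil // still cordoned, nothing to do
	}
	patch := []byte(`{"spec":{"unschedulable":true}}`)
	_, err := client.CoreV1().Nodes().Patch(ctx, node.Name, types.StrategicMergePatchType, patch, metav1.PatchOptions{})
	return err
}
```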
The current behaviour is to iterate over every outdated node, cordon it, and then drain it immediately afterwards. I think the behaviour should instead be to first cordon all outdated instances before doing anything else, and then proceed as usual.
I wish for this feature because the current behaviour often (in my experience) leads to pods being rescheduled onto another outdated instance. This causes a lot of pod restarts during rolling updates as pods get replaced more than once. It is especially bad for pods with a long terminationGracePeriod or a long startup period; it can happen that a pod doesn't even become ready after a replacement before it gets replaced again. (See the sketch below.)
~3-4 months
I would volunteer to implement this feature, even with backward compatibility if required.
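A sketch of the proposed two-pass behaviour (the interface and function names are hypothetical): cordon every outdated node up front so evicted pods can only land on updated nodes, then drain the outdated nodes one by one as today.

```go
// Hypothetical two-pass rollout: cordon all outdated nodes first, then drain.
package main

import (
	"context"

	v1 "k8s.io/api/core/v1"
)

type nodeOps interface {
	Cordon(ctx context.Context, node *v1.Node) error
	Drain(ctx context.Context, node *v1.Node) error
}

func rollOutdatedNodes(ctx context.Context, ops nodeOps, outdated []*v1.Node) error {
	// Pass 1: cordon everything, so no pod gets rescheduled onto an outdated node.
	for _, node := range outdated {
		if err := ops.Cordon(ctx, node); err != nil {
			return err
		}
	}
	// Pass 2: drain (and later terminate) the nodes one at a time, as today.
	for _, node := range outdated {
		if err := ops.Drain(ctx, node); err != nil {
			return err
		}
	}
	return nil
}
```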
This is a bit tricky: if, for instance, the max is 1 and the desired size is 1 before the upgrade, it would be fine to increase the ASG by 1 temporarily, let the new instance spin up, evict the old node, delete it, and decrease the ASG back to 1. But what if there's not enough space on the single new instance (i.e. if the instance type changed)? Should we then keep increasing? That would mean we couldn't go back to the original max size of the ASG.
Hi,
Apologies if this is not raised in the correct way or its the wrong place, but not sure where else to ask this.
I'd like to find out whether or not aws-eks-asg-rolling-update-handler can be scaled to more than 1 replica for HA without encountering any issues (i.e. conflicts, duplication, etc.).
I've been trying to find docs on this, but so far I have not been successful. Any help or guidance around running aws-eks-asg-rolling-update-handler highly available would be greatly appreciated.
Thanks, I look forward to your reply.
Nitin
I was trying to upgrade EKS from v1.19 to v1.20 but the handler panicked:
2021/08/23 09:17:03 Starting execution
2021/08/23 09:17:04 [worker-eu-central-1a-020210823065201810300000009] outdated=2; updated=1; updatedAndReady=1; asgCurrent=3; asgDesired=3; asgMax=3
2021/08/23 09:17:04 [worker-eu-central-1a-020210823065201810300000009][i-xxxxxxxx] Node already started rollout process
2021/08/23 09:17:04 [worker-eu-central-1a-020210823065201810300000009][i-xxxxxxxx] Updated nodes have enough resources available
2021/08/23 09:17:04 [worker-eu-central-1a-020210823065201810300000009][i-xxxxxxxx] Draining node
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x20 pc=0x167787a]
goroutine 41 [running]:
golang.org/x/time/rate.(*Limiter).WaitN(0xc00007f180, 0x0, 0x0, 0x1, 0x0, 0x0)
/app/vendor/golang.org/x/time/rate/rate.go:237 +0xba
golang.org/x/time/rate.(*Limiter).Wait(...)
/app/vendor/golang.org/x/time/rate/rate.go:219
k8s.io/client-go/util/flowcontrol.(*tokenBucketRateLimiter).Wait(0xc0002c3d80, 0x0, 0x0, 0xc000644680, 0xc0009550d8)
/app/vendor/k8s.io/client-go/util/flowcontrol/throttle.go:106 +0x4b
k8s.io/client-go/rest.(*Request).tryThrottleWithInfo(0xc0007685a0, 0x0, 0x0, 0x0, 0x0, 0x42, 0x40)
/app/vendor/k8s.io/client-go/rest/request.go:587 +0xa5
k8s.io/client-go/rest.(*Request).tryThrottle(...)
/app/vendor/k8s.io/client-go/rest/request.go:613
k8s.io/client-go/rest.(*Request).request(0xc0007685a0, 0x0, 0x0, 0xc0009556c8, 0x0, 0x0)
/app/vendor/k8s.io/client-go/rest/request.go:873 +0x2fc
k8s.io/client-go/rest.(*Request).Do(0xc0007685a0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
/app/vendor/k8s.io/client-go/rest/request.go:980 +0xf1
k8s.io/client-go/kubernetes/typed/core/v1.(*nodes).Patch(0xc000722480, 0x0, 0x0, 0xc0008ca000, 0x2e, 0x1f67f8d, 0x26, 0xc000573460, 0x1f, 0x20, ...)
/app/vendor/k8s.io/client-go/kubernetes/typed/core/v1/node.go:186 +0x237
k8s.io/kubectl/pkg/drain.(*CordonHelper).PatchOrReplaceWithContext(0xc000955ab0, 0x0, 0x0, 0x226c958, 0xc0002dab00, 0x1cf0100, 0x0, 0x0, 0x7fa2a30ae8f0, 0x10)
/app/vendor/k8s.io/kubectl/pkg/drain/cordon.go:102 +0x416
k8s.io/kubectl/pkg/drain.RunCordonOrUncordon(0xc00070e8f0, 0xc00077ef00, 0xc000722401, 0xc000504a80, 0x2e)
/app/vendor/k8s.io/kubectl/pkg/drain/default.go:60 +0xb3
github.com/TwinProduction/aws-eks-asg-rolling-update-handler/k8s.(*KubernetesClient).Drain(0xc0003e4760, 0xc000504a80, 0x2e, 0x101, 0x2, 0x1)
/app/k8s/client.go:125 +0x245
main.DoHandleRollingUpgrade(0x22537d8, 0xc0003e4760, 0x227b0e8, 0xc0003e4480, 0x2270ef8, 0xc0003e4490, 0xc0007001b0, 0x3, 0x3, 0x0)
/app/main.go:161 +0x14b4
main.HandleRollingUpgrade.func2(0xc0007df500, 0x22537d8, 0xc0003e4760, 0x227b0e8, 0xc0003e4480, 0x2270ef8, 0xc0003e4490, 0xc0007001b0, 0x3, 0x3)
/app/main.go:96 +0x94
created by main.HandleRollingUpgrade
/app/main.go:95 +0x12e
It was deployed via Helm with the following override config:
```yaml
image:
  tag: "latest"
environmentVars:
  - name: CLUSTER_NAME
    value: "cluster_name"
  - name: AWS_REGION
    value: "eu-central-1"
serviceAccount:
  annotations:
    eks.amazonaws.com/role-arn: "arn:aws:iam::111111:role/RollingUpdate"
```
https://github.com/TwinProduction/aws-eks-asg-rolling-update-handler/blob/master/cloud/aws.go#L29
Paging is the recommended approach here.
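A sketch of the paged lookup suggested above, using the paginated call available in aws-sdk-go v1 so results spread across multiple pages are all collected (the function name is illustrative):

```go
// Hypothetical paged DescribeAutoScalingGroups lookup.
package cloud

import (
	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/service/autoscaling"
	"github.com/aws/aws-sdk-go/service/autoscaling/autoscalingiface"
)

func describeAutoScalingGroups(svc autoscalingiface.AutoScalingAPI, names []string) ([]*autoscaling.Group, error) {
	var groups []*autoscaling.Group
	input := &autoscaling.DescribeAutoScalingGroupsInput{
		AutoScalingGroupNames: aws.StringSlice(names),
	}
	// DescribeAutoScalingGroupsPages keeps calling the API until every page has been consumed.
	err := svc.DescribeAutoScalingGroupsPages(input, func(page *autoscaling.DescribeAutoScalingGroupsOutput, lastPage bool) bool {
		groups = append(groups, page.AutoScalingGroups...)
		return true // keep iterating through all pages
	})
	return groups, err
}
```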
The current tag lookup works if you want to manage all ASGs uniformly; however, you cannot exclude ASGs that are still managed by the OSS Cluster Autoscaler.
Support adding a specific tag to enable/disable the update handler on individual ASGs, for the update handler to look up at runtime.
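A sketch of the proposed opt-in behaviour; the tag key below is an example chosen for illustration, not an existing handler option:

```go
// Hypothetical filter that only keeps ASGs carrying an explicit enable tag,
// so groups still managed by Cluster Autoscaler can be left alone.
package cloud

import (
	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/service/autoscaling"
)

const enableTagKey = "aws-eks-asg-rolling-update-handler/enabled" // hypothetical tag key

func filterEnabledGroups(groups []*autoscaling.Group) []*autoscaling.Group {
	var enabled []*autoscaling.Group
	for _, group := range groups {
		for _, tag := range group.Tags {
			if aws.StringValue(tag.Key) == enableTagKey && aws.StringValue(tag.Value) == "true" {
				enabled = append(enabled, group)
				break
			}
		}
	}
	return enabled
}
```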
It would allow us to define an off-hours time range and only roll nodes during that time, e.g. via HANDLER_START_HOUR and HANDLER_STOP_HOUR:
From HANDLER_START_HOUR to HANDLER_STOP_HOUR, it should do what it currently does.
From HANDLER_STOP_HOUR to HANDLER_START_HOUR, it should not cordon or drain any node.
That way I don't have to wake up at night to merge my PR and/or trigger the pipeline to change the template and kick off asg-rolling-update-handler :)
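A sketch of the time-window check the feature describes; the environment variable names come from the proposal above, while the parsing and defaults are illustrative. Outside the window the handler would simply skip cordoning and draining for that execution.

```go
// Hypothetical maintenance-window check based on HANDLER_START_HOUR / HANDLER_STOP_HOUR.
package main

import (
	"os"
	"strconv"
	"time"
)

// withinMaintenanceWindow returns true if now falls between the configured
// start and stop hours, handling windows that wrap around midnight.
func withinMaintenanceWindow(now time.Time) bool {
	start, err1 := strconv.Atoi(os.Getenv("HANDLER_START_HOUR"))
	stop, err2 := strconv.Atoi(os.Getenv("HANDLER_STOP_HOUR"))
	if err1 != nil || err2 != nil {
		return true // no (or invalid) window configured: behave as today
	}
	hour := now.Hour()
	if start <= stop {
		return hour >= start && hour < stop
	}
	// Window wraps past midnight, e.g. start=22, stop=6.
	return hour >= start || hour < stop
}
```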
6 months
This is an amazing project!!! Thank you!!!
In client.Drain(), the value of GracePeriodSeconds is hard-coded but should be configurable via an environment variable. The default should stay -1 if no other value is set.
Letting the Pod decide how long the grace period should be is perfectly fine for production environments, in order to not forcefully delete any workload, but for test environments it delays the rolling update a lot.
I have a lot of pods with different grace period configurations on a large number of nodes, and thus I would like to increase the rolling update speed by setting a smaller grace period. Setting a small grace period introduces the risk of forcefully killing Pods before they have terminated gracefully, but one might be totally fine with such behaviour in the circumstances explained above.
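A sketch of the proposed behaviour; the environment variable name below is an example, not an existing option. The fallback stays -1 so each pod's own terminationGracePeriodSeconds keeps being honoured by default.

```go
// Hypothetical env-var override for the eviction grace period used by drain.Helper.
package k8s

import (
	"os"
	"strconv"

	"k8s.io/kubectl/pkg/drain"
)

// gracePeriodSecondsFromEnv returns the configured eviction grace period,
// or -1 (respect each pod's own setting) when the variable is unset or invalid.
func gracePeriodSecondsFromEnv() int {
	if raw := os.Getenv("POD_TERMINATION_GRACE_PERIOD"); raw != "" { // hypothetical variable name
		if value, err := strconv.Atoi(raw); err == nil {
			return value
		}
	}
	return -1
}

func configureDrainer(drainer *drain.Helper) {
	drainer.GracePeriodSeconds = gracePeriodSecondsFromEnv()
}
```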
I would be happy to implement this feature myself in the next few weeks if @TwiN is fine with the proposal.