
Handles rolling upgrades for AWS ASGs on EKS

License: Apache License 2.0

Go 99.60% Dockerfile 0.40%
golang eks kubernetes aws rolling-update rolling-upgrade handler go controller launch-template

aws-eks-asg-rolling-update-handler's Introduction

aws-eks-asg-rolling-update-handler


This application handles rolling upgrades for AWS ASGs for EKS by replacing outdated nodes with new nodes. Outdated nodes are defined as nodes whose current configuration does not match their ASG's current launch template version or launch configuration.

Inspired by aws-asg-roller, this application has only one purpose: scale down outdated nodes gracefully.

Unlike aws-asg-roller, it will not attempt to control the number of nodes at all; it will scale up enough new nodes to move the pods from the old nodes to the new nodes, and then evict the old nodes.

It will not adjust the desired size back to its initial value the way aws-asg-roller does; it simply leaves everything else up to cluster-autoscaler.

Note that unlike other solutions, this application uses the actual resource requests to determine how many new instances should be spun up before draining the old nodes. This is much better, because simply reusing the initial number of instances is useless when the ASG's launch configuration/template update changes the instance type.

Behavior

At every interval, this application:

  1. Iterates over each ASG discovered through the CLUSTER_NAME or AUTODISCOVERY_TAGS environment variables, or over the ones listed in the AUTO_SCALING_GROUP_NAMES environment variable, in that order.
  2. Iterates over each instance of each ASG.
  3. Checks whether any instance has an outdated launch template version.
  4. If the ASG uses a MixedInstancesPolicy, checks whether any instance has an instance type that isn't part of the list of instance type overrides.
  5. Checks whether any instance has an outdated launch configuration.
  6. If any of the conditions from steps 3, 4 or 5 is met for an instance, begins the rolling update process for that instance.

The state of each rolling update is persisted directly on the old nodes via annotations (i.e. when an old node starts rolling out, gets drained, and gets scheduled for termination). As a result, this application can be restarted, rescheduled or stopped at any point in time without running into issues.

NOTE: Ensure that your PodDisruptionBudgets - if you have any - are properly configured. This usually means having at least 1 allowed disruption at all times (i.e. at least minAvailable: 1 with at least 2 replicas, OR maxUnavailable: 1).
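
For example, here is a minimal sketch of a PodDisruptionBudget that always leaves one allowed disruption for a hypothetical workload running at least 2 replicas with the label app: my-app (the name and label are placeholders):

# Hypothetical example: with 2 or more replicas, this PDB always leaves at least 1 allowed disruption.
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: my-app
spec:
  minAvailable: 1
  selector:
    matchLabels:
      app: my-app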

Usage

Environment variable Description Required Default
CLUSTER_NAME Name of the EKS cluster; used in place of AUTODISCOVERY_TAGS and AUTO_SCALING_GROUP_NAMES. Checks for the k8s.io/cluster-autoscaler/<CLUSTER_NAME>: owned and k8s.io/cluster-autoscaler/enabled: true tags on the ASGs yes ""
AUTODISCOVERY_TAGS Comma-separated key-value string with tags to autodiscover ASGs; used in place of CLUSTER_NAME and AUTO_SCALING_GROUP_NAMES. yes ""
AUTO_SCALING_GROUP_NAMES Comma-separated list of ASG names; CLUSTER_NAME takes priority. yes ""
IGNORE_DAEMON_SETS Whether to ignore DaemonSets when draining the nodes no true
DELETE_EMPTY_DIR_DATA Whether to delete empty dir data when draining the nodes no true
AWS_REGION Self-explanatory no us-west-2
ENVIRONMENT If set to dev, will try to create the Kubernetes client using your local kubeconfig. Any other value will use the in-cluster configuration no ""
EXECUTION_INTERVAL Duration to sleep between each execution in seconds no 20
EXECUTION_TIMEOUT Maximum execution duration before timing out in seconds no 900
POD_TERMINATION_GRACE_PERIOD How long to wait for a pod to terminate in seconds; 0 means "delete immediately"; set to a negative value to use the pod's terminationGracePeriodSeconds. no -1
METRICS_PORT Port to bind metrics server to no 8080
METRICS Expose metrics in Prometheus format at :${METRICS_PORT}/metrics no ""
SLOW_MODE If enabled, every time a node is terminated during an execution, the current execution will stop rather than continuing to the next ASG no false
EAGER_CORDONING If enabled, all outdated nodes will get cordoned before any rolling update action. The default mode is to cordon a node just before draining it. See #41 for possible consequences of enabling this. no false
EXCLUDE_FROM_EXTERNAL_LOAD_BALANCERS If enabled, node label node.kubernetes.io/exclude-from-external-load-balancers=true will be added to nodes before draining. See #131 for more information no false

NOTE: Only one of CLUSTER_NAME, AUTODISCOVERY_TAGS and AUTO_SCALING_GROUP_NAMES needs to be set.
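
For instance, here is a hedged sketch of the container's env block from the Deployment shown under "Deploying on Kubernetes", configured for cluster-name-based discovery; my-cluster is a placeholder:

env:
  # Option 1: discover ASGs by cluster name. The ASGs must carry the
  # k8s.io/cluster-autoscaler/my-cluster: owned and k8s.io/cluster-autoscaler/enabled: true tags.
  - name: CLUSTER_NAME
    value: "my-cluster"
  # Option 2 (use one discovery mechanism only): list the ASGs explicitly.
  # - name: AUTO_SCALING_GROUP_NAMES
  #   value: "asg-1,asg-2,asg-3"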

Metrics

Metric name Metric type Labels Description
rolling_update_handler_node_groups Gauge Node groups managed by the handler
rolling_update_handler_outdated_nodes Gauge node_group The number of outdated nodes
rolling_update_handler_updated_nodes Gauge node_group The number of updated nodes
rolling_update_handler_scaled_up_nodes Counter node_group The total number of nodes scaled up
rolling_update_handler_scaled_down_nodes Counter node_group The total number of nodes scaled down
rolling_update_handler_drained_nodes_total Counter node_group The total number of drained nodes
rolling_update_handler_errors Counter The total number of errors
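
If METRICS is enabled, the endpoint at :${METRICS_PORT}/metrics can be scraped like any other pod. Below is a hedged sketch of a plain Prometheus scrape job; the job name and the app label selector are assumptions based on the manifests in this README, and the port must match METRICS_PORT (8080 by default):

scrape_configs:
  - job_name: aws-eks-asg-rolling-update-handler
    kubernetes_sd_configs:
      - role: pod
        namespaces:
          names: [kube-system]
    relabel_configs:
      # Keep only the handler's pods, identified by the app label used in the manifests below.
      - source_labels: [__meta_kubernetes_pod_label_app]
        regex: aws-eks-asg-rolling-update-handler
        action: keep
      # Point the scrape at the metrics port.
      - source_labels: [__address__]
        regex: ([^:]+)(?::\d+)?
        replacement: $1:8080
        target_label: __address__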

Permissions

To function properly, this application requires the following permissions on AWS; a sample policy document is sketched after the list:

  • autoscaling:DescribeAutoScalingGroups
  • autoscaling:DescribeAutoScalingInstances
  • autoscaling:DescribeLaunchConfigurations
  • autoscaling:SetDesiredCapacity
  • autoscaling:TerminateInstanceInAutoScalingGroup
  • autoscaling:UpdateAutoScalingGroup
  • ec2:DescribeLaunchTemplates
  • ec2:DescribeInstances
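
These permissions map onto a policy document like the following sketch, written here in YAML (for example for eksctl's inline attachPolicy; the equivalent JSON works for a standalone IAM policy). Narrow Resource further if your environment allows it:

# Sketch of an IAM policy document granting the permissions listed above.
Version: "2012-10-17"
Statement:
  - Effect: Allow
    Action:
      - autoscaling:DescribeAutoScalingGroups
      - autoscaling:DescribeAutoScalingInstances
      - autoscaling:DescribeLaunchConfigurations
      - autoscaling:SetDesiredCapacity
      - autoscaling:TerminateInstanceInAutoScalingGroup
      - autoscaling:UpdateAutoScalingGroup
      - ec2:DescribeLaunchTemplates
      - ec2:DescribeInstances
    Resource: "*"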

Deploying on Kubernetes

apiVersion: v1
kind: ServiceAccount
metadata:
  name: aws-eks-asg-rolling-update-handler
  namespace: kube-system
  labels:
    app: aws-eks-asg-rolling-update-handler
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: aws-eks-asg-rolling-update-handler
  labels:
    app: aws-eks-asg-rolling-update-handler
rules:
  - apiGroups:
      - "*"
    resources:
      - "*"
    verbs:
      - get
      - list
      - watch
  - apiGroups:
      - "*"
    resources:
      - nodes
    verbs:
      - get
      - list
      - watch
      - update
      - patch
  - apiGroups:
      - "*"
    resources:
      - pods/eviction
    verbs:
      - get
      - list
      - create
  - apiGroups:
      - "*"
    resources:
      - pods
    verbs:
      - get
      - list
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: aws-eks-asg-rolling-update-handler
  labels:
    app: aws-eks-asg-rolling-update-handler
roleRef:
  kind: ClusterRole
  name: aws-eks-asg-rolling-update-handler
  apiGroup: rbac.authorization.k8s.io
subjects:
  - kind: ServiceAccount
    name: aws-eks-asg-rolling-update-handler
    namespace: kube-system
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: aws-eks-asg-rolling-update-handler
  namespace: kube-system
  labels:
    app: aws-eks-asg-rolling-update-handler
spec:
  replicas: 1
  selector:
    matchLabels:
      app: aws-eks-asg-rolling-update-handler
  template:
    metadata:
      labels:
        app: aws-eks-asg-rolling-update-handler
    spec:
      automountServiceAccountToken: true
      serviceAccountName: aws-eks-asg-rolling-update-handler
      restartPolicy: Always
      dnsPolicy: Default
      containers:
        - name: aws-eks-asg-rolling-update-handler
          image: twinproduction/aws-eks-asg-rolling-update-handler
          imagePullPolicy: Always
          env:
            - name: AUTO_SCALING_GROUP_NAMES
              value: "asg-1,asg-2,asg-3" # REPLACE THESE VALUES FOR THE NAMES OF THE ASGs

Deploying with Helm

For the chart associated to this project, see TwiN/helm-charts:

helm repo add twin https://twin.github.io/helm-charts
helm repo update
helm install aws-eks-asg-rolling-update-handler twin/aws-eks-asg-rolling-update-handler
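
Chart values can be overridden with -f as usual. Below is a hedged sketch of a values file; the environmentVars key is assumed from the chart, so verify it against the chart's values.yaml:

# values.yaml (sketch)
environmentVars:
  - name: AUTO_SCALING_GROUP_NAMES
    value: "asg-1,asg-2,asg-3"
  - name: AWS_REGION
    value: "us-west-2"

It can then be installed with helm install aws-eks-asg-rolling-update-handler twin/aws-eks-asg-rolling-update-handler -f values.yaml.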

Developing

To run the application locally, make sure your local kubeconfig file is configured properly (i.e. you can use kubectl).

Once you've done that, set the local environment variable ENVIRONMENT to dev and AUTO_SCALING_GROUP_NAMES to a comma-separated list of auto scaling group names.

Your local AWS credentials must also be valid (i.e. you can use the AWS CLI).

Special thanks

I had originally worked on deitch/aws-asg-roller, but due to the numerous conflicts it had with cluster-autoscaler, I decided to make a project that heavily relies on cluster-autoscaler rather than simply coexisting with it, with a much bigger emphasis on maintaining high availability during rolling upgrades.

In any case, this project was inspired by aws-asg-roller, and the code for comparing launch template versions also comes from there, hence this special thanks section.

aws-eks-asg-rolling-update-handler's People

Contributors

dependabot[bot], derbauer97, lacodon, mvaal, mvaalexp, ryanjkemper, someone-stole-my-name, twin


aws-eks-asg-rolling-update-handler's Issues

GracePeriod in client.Drain() should be configurable

Describe the feature request

In client.Drain() the value of GracePeriodSeconds is hard coded but should be configurable via an environment variable. The default should stay -1 if no other value is set.

Why do you personally want this feature to be implemented?

Letting the Pod decide how long the GracePeriod should be is perfectly fine for production environments, in order to not forcefully delete any workload, but for test environments this delays the rolling update a lot.

I have a lot of pods with different GracePeriod configurations on a large number of nodes, and thus I would like to increase the rolling update speed by setting a smaller GracePeriod. Setting a small GracePeriod introduces the risk of forcefully killing pods before they have terminated gracefully, but one might be totally fine with such behaviour under the circumstances explained above.

How long have you been using this project?

No response

Additional information

I would be happy to implement this feature myself over the coming weeks if @TwiN is fine with the proposal.

Get ASGs to manage based on tag

Describe the feature request

The current tag lookup works if you want to manage all ASGs uniformly; however, you cannot exclude ASGs that are still managed by the OSS Cluster Autoscaler.

Support adding a specific tag to enable/disable the update handler on ASGs, which the update handler would look up at runtime.

Why do you personally want this feature to be implemented?

No response

How long have you been using this project?

No response

Additional information

No response

Deployed the handler successfully, but nothing happens; it just writes "Starting execution" and "Execution took ..."

Hi, I have recently updated EKS to 1.19 and updated the AMI to the 1.19 version in the auto scaling group (which uses a launch template), and deployed the handler successfully. However, after checking the logs, nothing happens; it just writes "Starting execution" and "Execution took ...".

Please advise. Here is my configuration file:

apiVersion: v1
kind: ServiceAccount
metadata:
  name: aws-eks-asg-rolling-update-handler
  namespace: kube-system
  labels:
    k8s-app: aws-eks-asg-rolling-update-handler
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: aws-eks-asg-rolling-update-handler
  labels:
    k8s-app: aws-eks-asg-rolling-update-handler
rules:
  - apiGroups:
      - "*"
    resources:
      - "*"
    verbs:
      - get
      - list
      - watch
  - apiGroups:
      - "*"
    resources:
      - nodes
    verbs:
      - get
      - list
      - watch
      - update
      - patch
  - apiGroups:
      - "*"
    resources:
      - pods/eviction
    verbs:
      - get
      - list
      - create
  - apiGroups:
      - "*"
    resources:
      - pods
    verbs:
      - get
      - list
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: aws-eks-asg-rolling-update-handler
  labels:
    k8s-app: aws-eks-asg-rolling-update-handler
roleRef:
  kind: ClusterRole
  name: aws-eks-asg-rolling-update-handler
  apiGroup: rbac.authorization.k8s.io
subjects:
  - kind: ServiceAccount
    name: aws-eks-asg-rolling-update-handler
    namespace: kube-system
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: aws-eks-asg-rolling-update-handler
  namespace: kube-system
  labels:
    k8s-app: aws-eks-asg-rolling-update-handler
spec:
  replicas: 1
  template:
    metadata:
      labels:
        k8s-app: aws-eks-asg-rolling-update-handler
    spec:
      automountServiceAccountToken: true
      serviceAccountName: aws-eks-asg-rolling-update-handler
      restartPolicy: Always
      dnsPolicy: Default
      containers:
        - name: aws-eks-asg-rolling-update-handler
          image: twinproduction/aws-eks-asg-rolling-update-handler
          imagePullPolicy: Always
          env:
            - name: AUTO_SCALING_GROUP_NAMES
              value: "asg-name"
            - name: CLUSTER_NAME
              value: some-name
            - name: AWS_REGION
              value: some-region
  selector:
    matchLabels:
      k8s-app: aws-eks-asg-rolling-update-handler

logs:

2021/09/23 08:06:06 Starting execution
2021/09/23 08:06:09 Execution took 3267ms, sleeping for 20s
2021/09/23 08:06:29 Starting execution
2021/09/23 08:06:29 Execution took 103ms, sleeping for 20s
2021/09/23 08:06:49 Starting execution
2021/09/23 08:06:49 Execution took 73ms, sleeping for 20s
2021/09/23 08:07:09 Starting execution
2021/09/23 08:07:09 Execution took 95ms, sleeping for 20s
2021/09/23 08:07:29 Starting execution
2021/09/23 08:07:29 Execution took 87ms, sleeping for 20s
2021/09/23 08:07:49 Starting execution
2021/09/23 08:07:50 Execution took 80ms, sleeping for 20s
2021/09/23 08:08:10 Starting execution
2021/09/23 08:08:10 Execution took 128ms, sleeping for 20s

Ability to bump secondary ASG if primary is full

Describe the feature request

[cluster-a-us-east-1-large-nodes-az1] outdated=10; updated=0; updatedAndReady=0; asgCurrent=10; asgDesired=10; asgMax=10
[cluster-a-us-east-1-large-nodes-az1][i-0xyz] Node already started rollout process
[cluster-a-us-east-1-large-nodes-az1][i-0xyz] Updated nodes do not have enough resources available, increasing desired count by 1
[cluster-a-us-east-1-large-nodes-az1][i-0xyz] Unable to increase ASG desired size: cannot increase ASG desired size above max ASG size
[cluster-a-us-east-1-large-nodes-az1][i-0xyz] Skipping

Imagine a world where there are primary, secondary and tertiary ASGs available to use. In that scenario, if the primary ASG is maxed out, the secondary or tertiary one can be used without any issues.

Would it be possible to implement a feature where aws-eks-asg-rolling-update-handler bumps a secondary or tertiary ASG if the primary is full? It would determine whether 2 or more ASGs are grouped together with the help of an ASG tag provided by the user (i.e. if a tag X has value Y then it belongs to group Y; if a tag X has value Z then it belongs to group Z).

Why do you personally want this feature to be implemented?

So I don't have to manually bump the MAX ASG.

How long have you been using this project?

8 months

Additional information

An easy win would be to expose this error as a separate Prometheus metric. While rolling_update_handler_errors is good, it doesn't differentiate between different types of errors (or maybe add the error type as a label on the rolling_update_handler_errors metric). This way I can create an alert when this happens rather than constantly monitoring the logs.

Unable to increase ASG desired size: unable to increase ASG

Hi,

I'm getting the following error, and the handler is stuck:

Unable to increase ASG desired size: unable to increase ASG <ASGNAME> desired count to 7: ScalingActivityInProgress: Scaling activity 0ea5f0da-97c6-1d64-5127-43e061c16819 is in progress and blocks this action
	status code: 400, request id: 740c62cf-be5a-4dab-9e9d-466d3b8fd36f

ASG Status:

  1. Desired Instances: 7
  2. Current running instances: 7
  3. Min: 6
  4. Max: 15

It seems that when the desired and currently running instance counts are equal, it throws this error.

The error went away when I manually increased the desired count to 9, and it started rolling out the instances.

Please advise.

1.4.2 fails to drain nodes

Newer versions require a context in the helper:

kubernetes/kubernetes#105297

1.4.0 -> https://github.com/TwiN/aws-eks-asg-rolling-update-handler/blob/v1.4.0/vendor/k8s.io/kubectl/pkg/drain/default.go#L51
1.4.2 -> https://github.com/TwiN/aws-eks-asg-rolling-update-handler/blob/v1.4.2/vendor/k8s.io/kubectl/pkg/drain/default.go#L53

2022/08/11 06:36:10 [xx][i-1] Updated nodes have enough resources available
2022/08/11 06:36:10 [xx][i-1] Draining node
2022/08/11 06:36:10 [ip-1][DRAINER] Failed to cordon node: RunCordonOrUncordon error: drainer.Ctx can't be nil
2022/08/11 06:36:10 [xx][i-1] Skipping because ran into error while draining node: RunCordonOrUncordon error: drainer.Ctx can't be nil

Cordon all outdated nodes before any rolling update action

Describe the feature request

The current behaviour is to iterate over every outdated node, cordon it, and then drain it immediately afterwards. I think the behaviour should instead be to first cordon all outdated instances before doing anything else, and then behave as usual.

Why do you personally want this feature to be implemented?

I wish for this feature to be implemented because the current behaviour often (in my experience) leads to pods being replaced onto an outdated instance. This leads to a lot of pod restarts during rolling updates, as pods get replaced more than once. This is especially bad for pods with a long terminationGracePeriod or a long startup period. It can happen that a pod doesn't even get ready after a replacement before it gets replaced again.

How long have you been using this project?

~3-4 months

Additional information

I would volunteer to implement this feature, even with backward compatibility if required.

Handle different hostname label options

Currently, aws-eks-asg-rolling-update-handler uses the EC2 instance id (i-xxxxxxxxxxxxx) to find the matching node in the Kubernetes API; however, the hostname label defaults to ip-x-x-x-x.<searchdomain>. A fallback option would be nice to make sure that matches succeed on any combination:

  • i-xxxxxxxxxxx
  • ip-x-x-x-x.<searchdomain>
  • ip-x-x-x-x.<region>.compute.internal (regionless for us-east-1)

The searchdomain is the first domain in the dhcpopts object.

Logs:

2020/07/22 15:27:08 [cluster-name-xxxxxxxxxxx_us-west-2c2020072119285176410000000c] outdated=2; updated=0; updatedAndReady=0; asgCurrent=2; asgDesired=2; asgMax=10
2020/07/22 15:27:08 [cluster-name-xxxxxxxxxxx_us-west-2c2020072119285176410000000c][i-02943160e67727188] Skipping because unable to get outdated node from Kubernetes: nodes with hostname "i-029xxx60e67727188" not found
2020/07/22 15:27:08 [cluster-name-xxxxxxxxxxx_us-west-2c2020072119285176410000000c][i-0bbxxx783491a4d77] Skipping because unable to get outdated node from Kubernetes: nodes with hostname "i-0bbxxx783491a4d77" not found
2020/07/22 15:27:08 Sleeping for 20 seconds
kubectl describe node ip-10-5-233-178.us-west-2.compute.internal
Name:        ip-10-5-233-178.us-west-2.compute.internal
Roles:       <none>
Labels:      beta.kubernetes.io/arch=amd64
             beta.kubernetes.io/instance-type=c5.xlarge
             beta.kubernetes.io/os=linux
             failure-domain.beta.kubernetes.io/region=us-west-2
             failure-domain.beta.kubernetes.io/zone=us-west-2c
             kubernetes.io/arch=amd64
             kubernetes.io/hostname=ip-10-5-233-178.us-west-2.i.test.top.secret.com     <---- custom hostname
             kubernetes.io/os=linux
             pool=xxxxxxxxxxx

Ability to roll nodes at a specific time range

Describe the feature request

It will allow us to define an off-hour time range and only roll nodes during that time.
e.g. HANDLER_START_HOUR, HANDLER_STOP_HOUR

== from HANDLER_START_HOUR to HANDLER_STOP_HOUR ==
it should do what it currently does

== from HANDLER_STOP_HOUR to HANDLER_START_HOUR ==
it should not cordon OR drain any node

Why do you personally want this feature to be implemented?

I don't have to wake up at night to merge my PR and/or trigger the pipeline to change the template and kick off asg-rolling-update-handler :)

How long have you been using this project?

6 months

Additional information

This is an amazing project!!! Thank you!!!

Panic while upgrading EKS ASGs

I was trying to upgrade EKS from v1.19 to v1.20, but the handler panicked:

2021/08/23 09:17:03 Starting execution
2021/08/23 09:17:04 [worker-eu-central-1a-020210823065201810300000009] outdated=2; updated=1; updatedAndReady=1; asgCurrent=3; asgDesired=3; asgMax=3
2021/08/23 09:17:04 [worker-eu-central-1a-020210823065201810300000009][i-xxxxxxxx] Node already started rollout process
2021/08/23 09:17:04 [worker-eu-central-1a-020210823065201810300000009][i-xxxxxxxx] Updated nodes have enough resources available
2021/08/23 09:17:04 [worker-eu-central-1a-020210823065201810300000009][i-xxxxxxxx] Draining node
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x20 pc=0x167787a]

goroutine 41 [running]:
golang.org/x/time/rate.(*Limiter).WaitN(0xc00007f180, 0x0, 0x0, 0x1, 0x0, 0x0)
	/app/vendor/golang.org/x/time/rate/rate.go:237 +0xba
golang.org/x/time/rate.(*Limiter).Wait(...)
	/app/vendor/golang.org/x/time/rate/rate.go:219
k8s.io/client-go/util/flowcontrol.(*tokenBucketRateLimiter).Wait(0xc0002c3d80, 0x0, 0x0, 0xc000644680, 0xc0009550d8)
	/app/vendor/k8s.io/client-go/util/flowcontrol/throttle.go:106 +0x4b
k8s.io/client-go/rest.(*Request).tryThrottleWithInfo(0xc0007685a0, 0x0, 0x0, 0x0, 0x0, 0x42, 0x40)
	/app/vendor/k8s.io/client-go/rest/request.go:587 +0xa5
k8s.io/client-go/rest.(*Request).tryThrottle(...)
	/app/vendor/k8s.io/client-go/rest/request.go:613
k8s.io/client-go/rest.(*Request).request(0xc0007685a0, 0x0, 0x0, 0xc0009556c8, 0x0, 0x0)
	/app/vendor/k8s.io/client-go/rest/request.go:873 +0x2fc
k8s.io/client-go/rest.(*Request).Do(0xc0007685a0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
	/app/vendor/k8s.io/client-go/rest/request.go:980 +0xf1
k8s.io/client-go/kubernetes/typed/core/v1.(*nodes).Patch(0xc000722480, 0x0, 0x0, 0xc0008ca000, 0x2e, 0x1f67f8d, 0x26, 0xc000573460, 0x1f, 0x20, ...)
	/app/vendor/k8s.io/client-go/kubernetes/typed/core/v1/node.go:186 +0x237
k8s.io/kubectl/pkg/drain.(*CordonHelper).PatchOrReplaceWithContext(0xc000955ab0, 0x0, 0x0, 0x226c958, 0xc0002dab00, 0x1cf0100, 0x0, 0x0, 0x7fa2a30ae8f0, 0x10)
	/app/vendor/k8s.io/kubectl/pkg/drain/cordon.go:102 +0x416
k8s.io/kubectl/pkg/drain.RunCordonOrUncordon(0xc00070e8f0, 0xc00077ef00, 0xc000722401, 0xc000504a80, 0x2e)
	/app/vendor/k8s.io/kubectl/pkg/drain/default.go:60 +0xb3
github.com/TwinProduction/aws-eks-asg-rolling-update-handler/k8s.(*KubernetesClient).Drain(0xc0003e4760, 0xc000504a80, 0x2e, 0x101, 0x2, 0x1)
	/app/k8s/client.go:125 +0x245
main.DoHandleRollingUpgrade(0x22537d8, 0xc0003e4760, 0x227b0e8, 0xc0003e4480, 0x2270ef8, 0xc0003e4490, 0xc0007001b0, 0x3, 0x3, 0x0)
	/app/main.go:161 +0x14b4
main.HandleRollingUpgrade.func2(0xc0007df500, 0x22537d8, 0xc0003e4760, 0x227b0e8, 0xc0003e4480, 0x2270ef8, 0xc0003e4490, 0xc0007001b0, 0x3, 0x3)
	/app/main.go:96 +0x94
created by main.HandleRollingUpgrade
	/app/main.go:95 +0x12e

It was deployed via helm with override config:

image:
  tag: "latest"

environmentVars:
- name: CLUSTER_NAME
  value: "cluster_name"
- name: AWS_REGION
  value: "eu-central-1"

serviceAccount:
  annotations:
    eks.amazonaws.com/role-arn: "arn:aws:iam::111111:role/RollingUpdate"

Handle rolling upgrade when ASG desired capacity has reached max size

This is a bit tricky, because if, for instance, the max is 1 and there's a desired size of 1 before the upgrade, it would be fine to increase the ASG by 1 temporarily, let the new instance spin up, evict the old node, delete it and decrease the ASG back to 1. But what if there's not enough space on the 1 new instance (i.e. if the instance type changed)? Should we then keep increasing? That would mean we wouldn't be able to go back to the original max size of the ASG.

Create /health endpoint and report health there instead of panicking

The releases v0.0.10 and v0.0.11 both added decent self-healing capabilities to aws-eks-asg-rolling-update-handler, but perhaps healing through a panic is a little too violent and the health of the application should be exposed through an HTTP endpoint and consumed by the liveness probe.

Checklist:

The handler sometimes doesn't cordon any node

Describe the bug

The getRollingUpdateTimestampsFromNode(node) conditions used throughout the code to check whether a node should be cordoned have a flaw. If something or someone decides to stop the rolling update and manually uncordons the nodes, the handler will, on its next start, happily evict pods on those nodes without actually cordoning anything. It is especially problematic when using the eager cordoning feature, since it leads to an upgrade that can never end.

What do you see?

No response

What do you expect to see?

No response

List the steps that must be taken to reproduce this issue

  1. Start the handler
  2. Let it cordon a node
  3. Stop the handler
  4. Uncordon the node manually
  5. Start the handler again

Version

No response

Additional information

No response

SetDesiredCapacityInfo lowered the number of nodes in ASG

Describe the bug

We're running Cluster Autoscaler and aws-eks-asg-rolling-update-handler in the same EKS cluster. While Cluster Autoscaler was trying to scale up the nodes due to increased traffic, aws-eks-asg-rolling-update-handler made a SetDesiredCapacityInfo request that lowered the number of active instances in the ASG, causing an outage.

    "eventTime": "2023-06-28T23:21:46Z",
    "arn": "arn:aws:iam::xxxxx:role/xxxxx-aws-eks-asg-rolling-update-handler",
    "requestParameters": {
        "desiredCapacity": 90,
        "autoScalingGroupName": "xxxxx-2022020200515369970000000b",
        "honorCooldown": true
    },
    "requestID": "b02fcaa3-cbbd-4d13-a183-828fdae4477f",
    "eventID": "41e31359-c299-4481-bed5-06bed8261347",

ASG activity history

Successful
Terminating EC2 instance: i-0f525e751d21ff18b	
At 2023-06-28T23:21:46Z a user request explicitly set group desired capacity changing the desired capacity from 99 to 90. 
At 2023-06-28T23:21:48Z an instance was taken out of service in response to a difference between desired and actual capacity, shrinking the capacity from 99 to 90. 
At 2023-06-28T23:21:49Z instance i-087d61bdf2a96135d was selected for termination. 
At 2023-06-28T23:21:49Z instance i-07c86111988c8f033 was selected for termination. 
At 2023-06-28T23:21:49Z instance i-0635b60dc9625ee25 was selected for termination.
At 2023-06-28T23:21:49Z instance i-036cf4d15fee9fe41 was selected for termination.
At 2023-06-28T23:21:49Z instance i-021dc05fa50e14344 was selected for termination.
At 2023-06-28T23:21:49Z instance i-012bfb97195f1054d was selected for termination.
At 2023-06-28T23:21:49Z instance i-0fcd7d7595f9e49a8 was selected for termination.
At 2023-06-28T23:21:50Z instance i-0f9a8745a9e9f2e23 was selected for termination.
  • There should be a section in the docs about enabling ASG scale-in protection on all the ASGs & active nodes for safety
  • There should be an option in aws-eks-asg-rolling-update-handler to not use SetDesiredCapacityInfo by itself and instead rely on Cluster Autoscaler to bring up new nodes. aws-eks-asg-rolling-update-handler should evict the node and let Cluster Autoscaler deal with missing nodes

What do you see?

No response

What do you expect to see?

No response

List the steps that must be taken to reproduce this issue

Create EKS cluster environment with Cluster Autoscaler & aws-eks-asg-rolling-update-handler enabled

While rolling through nodes with many pods with PDB of maxUnavailable: 0, keep on increasing hpa to cause Cluster Autoscaler to bring up new nodes.

At some point, because of the 5-minute eviction timeout, aws-eks-asg-rolling-update-handler will be out of sync with the number of nodes that Cluster Autoscaler brought up, and it will send a SetDesiredCapacityInfo request that is lower than the current ASG size.

Version

1.8.0

Additional information

No response

High Availability (HA) for aws-eks-asg-rolling-update-handler

Hi,

Apologies if this is not raised in the correct way or its the wrong place, but not sure where else to ask this.

I'd like to find out whether or not aws-eks-asg-rolling-update-handler can be scaled to more than 1 replica for HA without encountering any issues (i.e. conflicts, duplication, etc.).

I've been trying to look for docs based on this, but currently I have not been successful in doing so. Any help or guidance around aws-eks-asg-rolling-update-handler for HA would be greatly appreciated.

Thanks, look forward to your reply

Nitin
