
Handles rolling upgrades for AWS ASGs on EKS

License: Apache License 2.0

Go 99.60% Dockerfile 0.40%
golang eks kubernetes aws rolling-update rolling-upgrade handler go controller launch-template

aws-eks-asg-rolling-update-handler's Introduction

aws-eks-asg-rolling-update-handler


This application handles rolling upgrades for AWS ASGs for EKS by replacing outdated nodes with new nodes. Outdated nodes are defined as nodes whose current configuration does not match their ASG's current launch template version or launch configuration.

Inspired by aws-asg-roller, this application has only one purpose: scale down outdated nodes gracefully.

Unlike aws-asg-roller, it will not attempt to control the number of nodes at all; it will scale up enough new nodes to move the pods from the old nodes to the new nodes, and then evict the old nodes.

It will not adjust the desired size back to its initial value the way aws-asg-roller does; it simply leaves everything else up to cluster-autoscaler.

Note that unlike other solutions, this application uses the actual resource requests to determine how many new instances should be spun up before draining the old nodes. This is much better, because simply reusing the initial number of instances is useless when the ASG's launch configuration/template update changes the instance type.

Behavior

At every interval, this application:

  1. Iterates over each ASG discovered through the CLUSTER_NAME or AUTODISCOVERY_TAGS environment variables, or over the ones listed in the AUTO_SCALING_GROUP_NAMES environment variable, in that order.
  2. Iterates over each instance of each ASG.
  3. Checks whether any instance has an outdated launch template version.
  4. If the ASG uses a MixedInstancesPolicy, checks whether any instance has an instance type that isn't part of the list of instance type overrides.
  5. Checks whether any instance has an outdated launch configuration.
  6. If any of the conditions from steps 3, 4 or 5 is met for an instance, begins the rolling update process for that instance.

The state of each rolling update is persisted directly on the old nodes via annotations (i.e. when an old node starts rolling out, gets drained, and gets scheduled for termination). As a result, this application can be restarted, rescheduled or stopped at any point in time without running into issues.

NOTE: Ensure that your PodDisruptionBudgets - if you have any - are properly configured. This usually means having at least 1 allowed disruption at all times (i.e. at least minAvailable: 1 with at least 2 replicas, OR maxUnavailable: 1).
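
For example, here is a minimal sketch of a PodDisruptionBudget that always leaves one allowed disruption for a hypothetical workload running at least 2 replicas with the label app: my-app (the name and label are placeholders):

# Hypothetical example: with 2 or more replicas, this PDB always leaves at least 1 allowed disruption.
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: my-app
spec:
  minAvailable: 1
  selector:
    matchLabels:
      app: my-app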

Usage

Environment variable Description Required Default
CLUSTER_NAME Name of the EKS cluster; used in place of AUTODISCOVERY_TAGS and AUTO_SCALING_GROUP_NAMES. Checks for the k8s.io/cluster-autoscaler/<CLUSTER_NAME>: owned and k8s.io/cluster-autoscaler/enabled: true tags on the ASGs yes ""
AUTODISCOVERY_TAGS Comma-separated key-value string with tags to autodiscover ASGs; used in place of CLUSTER_NAME and AUTO_SCALING_GROUP_NAMES. yes ""
AUTO_SCALING_GROUP_NAMES Comma-separated list of ASG names; CLUSTER_NAME takes priority. yes ""
IGNORE_DAEMON_SETS Whether to ignore DaemonSets when draining the nodes no true
DELETE_EMPTY_DIR_DATA Whether to delete empty dir data when draining the nodes no true
AWS_REGION Self-explanatory no us-west-2
ENVIRONMENT If set to dev, will try to create the Kubernetes client using your local kubeconfig. Any other value will use the in-cluster configuration no ""
EXECUTION_INTERVAL Duration to sleep between each execution in seconds no 20
EXECUTION_TIMEOUT Maximum execution duration before timing out in seconds no 900
POD_TERMINATION_GRACE_PERIOD How long to wait for a pod to terminate in seconds; 0 means "delete immediately"; set to a negative value to use the pod's terminationGracePeriodSeconds. no -1
METRICS_PORT Port to bind metrics server to no 8080
METRICS Expose metrics in Prometheus format at :${METRICS_PORT}/metrics no ""
SLOW_MODE If enabled, every time a node is terminated during an execution, the current execution will stop rather than continuing to the next ASG no false
EAGER_CORDONING If enabled, all outdated nodes will get cordoned before any rolling update action. The default mode is to cordon a node just before draining it. See #41 for possible consequences of enabling this. no false
EXCLUDE_FROM_EXTERNAL_LOAD_BALANCERS If enabled, node label node.kubernetes.io/exclude-from-external-load-balancers=true will be added to nodes before draining. See #131 for more information no false

NOTE: Only one of CLUSTER_NAME, AUTODISCOVERY_TAGS and AUTO_SCALING_GROUP_NAMES needs to be set.
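
For instance, here is a hedged sketch of the container's env block from the Deployment shown under "Deploying on Kubernetes", configured for cluster-name-based discovery; my-cluster is a placeholder:

env:
  # Option 1: discover ASGs by cluster name. The ASGs must carry the
  # k8s.io/cluster-autoscaler/my-cluster: owned and k8s.io/cluster-autoscaler/enabled: true tags.
  - name: CLUSTER_NAME
    value: "my-cluster"
  # Option 2 (use one discovery mechanism only): list the ASGs explicitly.
  # - name: AUTO_SCALING_GROUP_NAMES
  #   value: "asg-1,asg-2,asg-3"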

Metrics

Metric name Metric type Labels Description
rolling_update_handler_node_groups Gauge Node groups managed by the handler
rolling_update_handler_outdated_nodes Gauge node_group The number of outdated nodes
rolling_update_handler_updated_nodes Gauge node_group The number of updated nodes
rolling_update_handler_scaled_up_nodes Counter node_group The total number of nodes scaled up
rolling_update_handler_scaled_down_nodes Counter node_group The total number of nodes scaled down
rolling_update_handler_drained_nodes_total Counter node_group The total number of drained nodes
rolling_update_handler_errors Counter The total number of errors
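
If METRICS is enabled, the endpoint at :${METRICS_PORT}/metrics can be scraped like any other pod. Below is a hedged sketch of a plain Prometheus scrape job; the job name and the app label selector are assumptions based on the manifests in this README, and the port must match METRICS_PORT (8080 by default):

scrape_configs:
  - job_name: aws-eks-asg-rolling-update-handler
    kubernetes_sd_configs:
      - role: pod
        namespaces:
          names: [kube-system]
    relabel_configs:
      # Keep only the handler's pods, identified by the app label used in the manifests below.
      - source_labels: [__meta_kubernetes_pod_label_app]
        regex: aws-eks-asg-rolling-update-handler
        action: keep
      # Point the scrape at the metrics port.
      - source_labels: [__address__]
        regex: ([^:]+)(?::\d+)?
        replacement: $1:8080
        target_label: __address__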

Permissions

To function properly, this application requires the following permissions on AWS; a sample policy document is sketched after the list:

  • autoscaling:DescribeAutoScalingGroups
  • autoscaling:DescribeAutoScalingInstances
  • autoscaling:DescribeLaunchConfigurations
  • autoscaling:SetDesiredCapacity
  • autoscaling:TerminateInstanceInAutoScalingGroup
  • autoscaling:UpdateAutoScalingGroup
  • ec2:DescribeLaunchTemplates
  • ec2:DescribeInstances
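
These permissions map onto a policy document like the following sketch, written here in YAML (for example for eksctl's inline attachPolicy; the equivalent JSON works for a standalone IAM policy). Narrow Resource further if your environment allows it:

# Sketch of an IAM policy document granting the permissions listed above.
Version: "2012-10-17"
Statement:
  - Effect: Allow
    Action:
      - autoscaling:DescribeAutoScalingGroups
      - autoscaling:DescribeAutoScalingInstances
      - autoscaling:DescribeLaunchConfigurations
      - autoscaling:SetDesiredCapacity
      - autoscaling:TerminateInstanceInAutoScalingGroup
      - autoscaling:UpdateAutoScalingGroup
      - ec2:DescribeLaunchTemplates
      - ec2:DescribeInstances
    Resource: "*"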

Deploying on Kubernetes

apiVersion: v1
kind: ServiceAccount
metadata:
  name: aws-eks-asg-rolling-update-handler
  namespace: kube-system
  labels:
    app: aws-eks-asg-rolling-update-handler
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: aws-eks-asg-rolling-update-handler
  labels:
    app: aws-eks-asg-rolling-update-handler
rules:
  - apiGroups:
      - "*"
    resources:
      - "*"
    verbs:
      - get
      - list
      - watch
  - apiGroups:
      - "*"
    resources:
      - nodes
    verbs:
      - get
      - list
      - watch
      - update
      - patch
  - apiGroups:
      - "*"
    resources:
      - pods/eviction
    verbs:
      - get
      - list
      - create
  - apiGroups:
      - "*"
    resources:
      - pods
    verbs:
      - get
      - list
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: aws-eks-asg-rolling-update-handler
  labels:
    app: aws-eks-asg-rolling-update-handler
roleRef:
  kind: ClusterRole
  name: aws-eks-asg-rolling-update-handler
  apiGroup: rbac.authorization.k8s.io
subjects:
  - kind: ServiceAccount
    name: aws-eks-asg-rolling-update-handler
    namespace: kube-system
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: aws-eks-asg-rolling-update-handler
  namespace: kube-system
  labels:
    app: aws-eks-asg-rolling-update-handler
spec:
  replicas: 1
  selector:
    matchLabels:
      app: aws-eks-asg-rolling-update-handler
  template:
    metadata:
      labels:
        app: aws-eks-asg-rolling-update-handler
    spec:
      automountServiceAccountToken: true
      serviceAccountName: aws-eks-asg-rolling-update-handler
      restartPolicy: Always
      dnsPolicy: Default
      containers:
        - name: aws-eks-asg-rolling-update-handler
          image: twinproduction/aws-eks-asg-rolling-update-handler
          imagePullPolicy: Always
          env:
            - name: AUTO_SCALING_GROUP_NAMES
              value: "asg-1,asg-2,asg-3" # REPLACE THESE VALUES FOR THE NAMES OF THE ASGs

Deploying with Helm

For the chart associated to this project, see TwiN/helm-charts:

helm repo add twin https://twin.github.io/helm-charts
helm repo update
helm install aws-eks-asg-rolling-update-handler twin/aws-eks-asg-rolling-update-handler
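
Chart values can be overridden with -f as usual. Below is a hedged sketch of a values file; the environmentVars key is assumed from the chart, so verify it against the chart's values.yaml:

# values.yaml (sketch)
environmentVars:
  - name: AUTO_SCALING_GROUP_NAMES
    value: "asg-1,asg-2,asg-3"
  - name: AWS_REGION
    value: "us-west-2"

It can then be installed with helm install aws-eks-asg-rolling-update-handler twin/aws-eks-asg-rolling-update-handler -f values.yaml.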

Developing

To run the application locally, make sure your local kubeconfig file is configured properly (i.e. you can use kubectl).

Once you've done that, set the local environment variable ENVIRONMENT to dev and AUTO_SCALING_GROUP_NAMES to a comma-separated list of auto scaling group names.

Your local AWS credentials must also be valid (i.e. you can use the AWS CLI).

Special thanks

I had originally worked on deitch/aws-asg-roller, but due to the numerous conflicts it had with cluster-autoscaler, I decided to make a project that heavily relies on cluster-autoscaler rather than simply coexisting with it, with a much bigger emphasis on maintaining high availability during rolling upgrades.

In any case, this project was inspired by aws-asg-roller, and the code for comparing launch template versions also comes from there, hence this special thanks section.

aws-eks-asg-rolling-update-handler's People

Contributors

dependabot[bot], derbauer97, lacodon, mvaal, mvaalexp, ryanjkemper, someone-stole-my-name, twin


aws-eks-asg-rolling-update-handler's Issues

GracePeriod in client.Drain() should be configurable

Describe the feature request

In client.Drain() the value of GracePeriodSeconds is hard coded but should be configurable via an environment variable. The default should stay -1 if no other value is set.

Why do you personally want this feature to be implemented?

Letting the Pod decide how long the GracePeriod should be is perfectly fine for production environments, in order to not forcefully delete any workload, but for test environments this delays the rolling update a lot.

I have a lot of pods with different GracePeriod configurations on a large number of nodes, and thus I would like to increase the rolling update speed by setting a smaller GracePeriod. Setting a small GracePeriod introduces the risk of forcefully killing pods before they have terminated gracefully, but one might be totally fine with such behaviour under the circumstances explained above.

How long have you been using this project?

No response

Additional information

I would be happy to implement this feature myself over the coming weeks if @TwiN is fine with the proposal.

Get ASGs to manage based on tag

Describe the feature request

The current tag lookup works if you want to manage all ASGs uniformly; however, you cannot exclude ASGs that are still managed by the OSS Cluster Autoscaler.

Support adding a specific tag to enable/disable the update handler on ASGs, which the update handler would look up at runtime.

Why do you personally want this feature to be implemented?

No response

How long have you been using this project?

No response

Additional information

No response

Deployed the handler successfully, but nothing happens; it just writes "Starting execution" and "Execution took ..."

Hi, I have recently updated EKS to 1.19 and updated the AMI to the 1.19 version in the auto scaling group (which uses a launch template), and deployed the handler successfully. However, after checking the logs, nothing happens; it just writes "Starting execution" and "Execution took ...".

Please advise. Here is my configuration file:

apiVersion: v1
kind: ServiceAccount
metadata:
  name: aws-eks-asg-rolling-update-handler
  namespace: kube-system
  labels:
    k8s-app: aws-eks-asg-rolling-update-handler
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: aws-eks-asg-rolling-update-handler
  labels:
    k8s-app: aws-eks-asg-rolling-update-handler
rules:
  - apiGroups:
      - "*"
    resources:
      - "*"
    verbs:
      - get
      - list
      - watch
  - apiGroups:
      - "*"
    resources:
      - nodes
    verbs:
      - get
      - list
      - watch
      - update
      - patch
  - apiGroups:
      - "*"
    resources:
      - pods/eviction
    verbs:
      - get
      - list
      - create
  - apiGroups:
      - "*"
    resources:
      - pods
    verbs:
      - get
      - list
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: aws-eks-asg-rolling-update-handler
  labels:
    k8s-app: aws-eks-asg-rolling-update-handler
roleRef:
  kind: ClusterRole
  name: aws-eks-asg-rolling-update-handler
  apiGroup: rbac.authorization.k8s.io
subjects:
  - kind: ServiceAccount
    name: aws-eks-asg-rolling-update-handler
    namespace: kube-system
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: aws-eks-asg-rolling-update-handler
  namespace: kube-system
  labels:
    k8s-app: aws-eks-asg-rolling-update-handler
spec:
  replicas: 1
  template:
    metadata:
      labels:
        k8s-app: aws-eks-asg-rolling-update-handler
    spec:
      automountServiceAccountToken: true
      serviceAccountName: aws-eks-asg-rolling-update-handler
      restartPolicy: Always
      dnsPolicy: Default
      containers:
        - name: aws-eks-asg-rolling-update-handler
          image: twinproduction/aws-eks-asg-rolling-update-handler
          imagePullPolicy: Always
          env:
            - name: AUTO_SCALING_GROUP_NAMES
              value: "asg-name"
            - name: CLUSTER_NAME
              value: some-name
            - name: AWS_REGION
              value: some-region
  selector:
    matchLabels:
      k8s-app: aws-eks-asg-rolling-update-handler

logs:

2021/09/23 08:06:06 Starting execution
2021/09/23 08:06:09 Execution took 3267ms, sleeping for 20s
2021/09/23 08:06:29 Starting execution
2021/09/23 08:06:29 Execution took 103ms, sleeping for 20s
2021/09/23 08:06:49 Starting execution
2021/09/23 08:06:49 Execution took 73ms, sleeping for 20s
2021/09/23 08:07:09 Starting execution
2021/09/23 08:07:09 Execution took 95ms, sleeping for 20s
2021/09/23 08:07:29 Starting execution
2021/09/23 08:07:29 Execution took 87ms, sleeping for 20s
2021/09/23 08:07:49 Starting execution
2021/09/23 08:07:50 Execution took 80ms, sleeping for 20s
2021/09/23 08:08:10 Starting execution
2021/09/23 08:08:10 Execution took 128ms, sleeping for 20s

Ability to bump secondary ASG if primary is full

Describe the feature request

[cluster-a-us-east-1-large-nodes-az1] outdated=10; updated=0; updatedAndReady=0; asgCurrent=10; asgDesired=10; asgMax=10
[cluster-a-us-east-1-large-nodes-az1][i-0xyz] Node already started rollout process
[cluster-a-us-east-1-large-nodes-az1][i-0xyz] Updated nodes do not have enough resources available, increasing desired count by 1
[cluster-a-us-east-1-large-nodes-az1][i-0xyz] Unable to increase ASG desired size: cannot increase ASG desired size above max ASG size
[cluster-a-us-east-1-large-nodes-az1][i-0xyz] Skipping

Imagine a world where there are primary, secondary and tertiary ASGs available to use. In that scenario, if the primary ASG is maxed out, the secondary or tertiary one can be used without any issues.

Would it be possible to implement a feature where aws-eks-asg-rolling-update-handler bumps a secondary or tertiary ASG if the primary is full? It would determine whether 2 or more ASGs are grouped together with the help of an ASG tag provided by the user (i.e. if a tag X has value Y then it belongs to group Y; if a tag X has value Z then it belongs to group Z).

Why do you personally want this feature to be implemented?

So I don't have to manually bump the MAX ASG.

How long have you been using this project?

8 months

Additional information

An easy win would be to expose this error as a separate Prometheus metric. While rolling_update_handler_errors is good, it doesn't differentiate between different types of errors (or maybe add the error type as a label on the rolling_update_handler_errors metric). This way I can create an alert when this happens rather than constantly monitoring the logs.

Unable to increase ASG desired size: unable to increase ASG

Hi,

I'm getting the following error, and the handler is stuck:

Unable to increase ASG desired size: unable to increase ASG <ASGNAME> desired count to 7: ScalingActivityInProgress: Scaling activity 0ea5f0da-97c6-1d64-5127-43e061c16819 is in progress and blocks this action
	status code: 400, request id: 740c62cf-be5a-4dab-9e9d-466d3b8fd36f

ASG Status:

  1. Desired Instances: 7
  2. Current running instances: 7
  3. Min: 6
  4. Max: 15

It seems that when the desired and currently running instance counts are equal, it throws this error.

The error went away when I manually increased the desired count to 9, and it started rolling out the instances.

Please advise.

1.4.2 fails to drain nodes

Newer versions require a context in the helper:

kubernetes/kubernetes#105297

1.4.0 -> https://github.com/TwiN/aws-eks-asg-rolling-update-handler/blob/v1.4.0/vendor/k8s.io/kubectl/pkg/drain/default.go#L51
1.4.2 -> https://github.com/TwiN/aws-eks-asg-rolling-update-handler/blob/v1.4.2/vendor/k8s.io/kubectl/pkg/drain/default.go#L53

2022/08/11 06:36:10 [xx][i-1] Updated nodes have enough resources available
2022/08/11 06:36:10 [xx][i-1] Draining node
2022/08/11 06:36:10 [ip-1][DRAINER] Failed to cordon node: RunCordonOrUncordon error: drainer.Ctx can't be nil
2022/08/11 06:36:10 [xx][i-1] Skipping because ran into error while draining node: RunCordonOrUncordon error: drainer.Ctx can't be nil

Cordon all outdated nodes before any rolling update action

Describe the feature request

The current behaviour is to iterate over every outdated node, cordon it, and then drain it immediately afterwards. I think the behaviour should instead be to first cordon all outdated instances before doing anything else, and then behave as usual.

Why do you personally want this feature to be implemented?

I wish for this feature to be implemented because the current behaviour often (in my experience) leads to pods being replaced onto an outdated instance. This leads to a lot of pod restarts during rolling updates, as pods get replaced more than once. This is especially bad for pods with a long terminationGracePeriod or a long startup period. It can happen that a pod doesn't even get ready after a replacement before it gets replaced again.

How long have you been using this project?

~3-4 months

Additional information

I would volunteer to implement this feature, even with backward compatibility if required.

Handle different hostname label options

Currently, aws-eks-asg-rolling-update-handler uses the EC2 instance id (i-xxxxxxxxxxxxx) to find the matching node in the Kubernetes API; however, the hostname label defaults to ip-x-x-x-x.<searchdomain>. A fallback option would be nice to make sure that matches succeed on any combination:

  • i-xxxxxxxxxxx
  • ip-x-x-x-x.<searchdomain>
  • ip-x-x-x-x.<region>.compute.internal (regionless for us-east-1)

The searchdomain is the first domain in the dhcpopts object.

Logs:

2020/07/22 15:27:08 [cluster-name-xxxxxxxxxxx_us-west-2c2020072119285176410000000c] outdated=2; updated=0; updatedAndReady=0; asgCurrent=2; asgDesired=2; asgMax=10
2020/07/22 15:27:08 [cluster-name-xxxxxxxxxxx_us-west-2c2020072119285176410000000c][i-02943160e67727188] Skipping because unable to get outdated node from Kubernetes: nodes with hostname "i-029xxx60e67727188" not found
2020/07/22 15:27:08 [cluster-name-xxxxxxxxxxx_us-west-2c2020072119285176410000000c][i-0bbxxx783491a4d77] Skipping because unable to get outdated node from Kubernetes: nodes with hostname "i-0bbxxx783491a4d77" not found
2020/07/22 15:27:08 Sleeping for 20 seconds
kubectl describe node ip-10-5-233-178.us-west-2.compute.internal
Name:        ip-10-5-233-178.us-west-2.compute.internal
Roles:       <none>
Labels:      beta.kubernetes.io/arch=amd64
             beta.kubernetes.io/instance-type=c5.xlarge
             beta.kubernetes.io/os=linux
             failure-domain.beta.kubernetes.io/region=us-west-2
             failure-domain.beta.kubernetes.io/zone=us-west-2c
             kubernetes.io/arch=amd64
             kubernetes.io/hostname=ip-10-5-233-178.us-west-2.i.test.top.secret.com     <---- custom hostname
             kubernetes.io/os=linux
             pool=xxxxxxxxxxx

Ability to roll nodes at a specific time range

Describe the feature request

It will allow us to define an off-hour time range and only roll nodes during that time.
e.g. HANDLER_START_HOUR, HANDLER_STOP_HOUR

== from HANDLER_START_HOUR to HANDLER_STOP_HOUR ==
it should do what it currently does

== from HANDLER_STOP_HOUR to HANDLER_START_HOUR ==
it should not cordon OR drain any node

Why do you personally want this feature to be implemented?

I don't have to wake up at night to merge my PR and/or trigger the pipeline to change the template and kick off asg-rolling-update-handler :)

How long have you been using this project?

6 months

Additional information

This is an amazing project!!! Thank you!!!

Panic while upgrading EKS ASGs

I was trying to upgrade EKS from v1.19 to v1.20, but the handler panicked:

2021/08/23 09:17:03 Starting execution
2021/08/23 09:17:04 [worker-eu-central-1a-020210823065201810300000009] outdated=2; updated=1; updatedAndReady=1; asgCurrent=3; asgDesired=3; asgMax=3
2021/08/23 09:17:04 [worker-eu-central-1a-020210823065201810300000009][i-xxxxxxxx] Node already started rollout process
2021/08/23 09:17:04 [worker-eu-central-1a-020210823065201810300000009][i-xxxxxxxx] Updated nodes have enough resources available
2021/08/23 09:17:04 [worker-eu-central-1a-020210823065201810300000009][i-xxxxxxxx] Draining node
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x20 pc=0x167787a]

goroutine 41 [running]:
golang.org/x/time/rate.(*Limiter).WaitN(0xc00007f180, 0x0, 0x0, 0x1, 0x0, 0x0)
	/app/vendor/golang.org/x/time/rate/rate.go:237 +0xba
golang.org/x/time/rate.(*Limiter).Wait(...)
	/app/vendor/golang.org/x/time/rate/rate.go:219
k8s.io/client-go/util/flowcontrol.(*tokenBucketRateLimiter).Wait(0xc0002c3d80, 0x0, 0x0, 0xc000644680, 0xc0009550d8)
	/app/vendor/k8s.io/client-go/util/flowcontrol/throttle.go:106 +0x4b
k8s.io/client-go/rest.(*Request).tryThrottleWithInfo(0xc0007685a0, 0x0, 0x0, 0x0, 0x0, 0x42, 0x40)
	/app/vendor/k8s.io/client-go/rest/request.go:587 +0xa5
k8s.io/client-go/rest.(*Request).tryThrottle(...)
	/app/vendor/k8s.io/client-go/rest/request.go:613
k8s.io/client-go/rest.(*Request).request(0xc0007685a0, 0x0, 0x0, 0xc0009556c8, 0x0, 0x0)
	/app/vendor/k8s.io/client-go/rest/request.go:873 +0x2fc
k8s.io/client-go/rest.(*Request).Do(0xc0007685a0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
	/app/vendor/k8s.io/client-go/rest/request.go:980 +0xf1
k8s.io/client-go/kubernetes/typed/core/v1.(*nodes).Patch(0xc000722480, 0x0, 0x0, 0xc0008ca000, 0x2e, 0x1f67f8d, 0x26, 0xc000573460, 0x1f, 0x20, ...)
	/app/vendor/k8s.io/client-go/kubernetes/typed/core/v1/node.go:186 +0x237
k8s.io/kubectl/pkg/drain.(*CordonHelper).PatchOrReplaceWithContext(0xc000955ab0, 0x0, 0x0, 0x226c958, 0xc0002dab00, 0x1cf0100, 0x0, 0x0, 0x7fa2a30ae8f0, 0x10)
	/app/vendor/k8s.io/kubectl/pkg/drain/cordon.go:102 +0x416
k8s.io/kubectl/pkg/drain.RunCordonOrUncordon(0xc00070e8f0, 0xc00077ef00, 0xc000722401, 0xc000504a80, 0x2e)
	/app/vendor/k8s.io/kubectl/pkg/drain/default.go:60 +0xb3
github.com/TwinProduction/aws-eks-asg-rolling-update-handler/k8s.(*KubernetesClient).Drain(0xc0003e4760, 0xc000504a80, 0x2e, 0x101, 0x2, 0x1)
	/app/k8s/client.go:125 +0x245
main.DoHandleRollingUpgrade(0x22537d8, 0xc0003e4760, 0x227b0e8, 0xc0003e4480, 0x2270ef8, 0xc0003e4490, 0xc0007001b0, 0x3, 0x3, 0x0)
	/app/main.go:161 +0x14b4
main.HandleRollingUpgrade.func2(0xc0007df500, 0x22537d8, 0xc0003e4760, 0x227b0e8, 0xc0003e4480, 0x2270ef8, 0xc0003e4490, 0xc0007001b0, 0x3, 0x3)
	/app/main.go:96 +0x94
created by main.HandleRollingUpgrade
	/app/main.go:95 +0x12e

It was deployed via helm with override config:

image:
  tag: "latest"

environmentVars:
- name: CLUSTER_NAME
  value: "cluster_name"
- name: AWS_REGION
  value: "eu-central-1"

serviceAccount:
  annotations:
    eks.amazonaws.com/role-arn: "arn:aws:iam::111111:role/RollingUpdate"

Handle rolling upgrade when ASG desired capacity has reached max size

This is a bit tricky, because if, for instance, the max is 1 and there's a desired size of 1 before the upgrade, it would be fine to increase the ASG by 1 temporarily, let the new instance spin up, evict the old node, delete it and decrease the ASG back to 1. But what if there's not enough space on the 1 new instance (i.e. if the instance type changed)? Should we then keep increasing? That would mean we wouldn't be able to go back to the original max size of the ASG.

Create /health endpoint and report health there instead of panicking

The releases v0.0.10 and v0.0.11 both added decent self-healing capabilities to aws-eks-asg-rolling-update-handler, but perhaps healing through a panic is a little too violent and the health of the application should be exposed through an HTTP endpoint and consumed by the liveness probe.

Checklist:

The handler sometimes doesn't cordon any node

Describe the bug

The getRollingUpdateTimestampsFromNode(node) conditions used throughout the code to check whether a node should be cordoned have a flaw. If something or someone decides to stop the rolling update and manually uncordons the nodes, the handler will, on its next start, happily evict pods on those nodes without actually cordoning anything. It is especially problematic when using the eager cordoning feature, since it leads to an upgrade that can never end.

What do you see?

No response

What do you expect to see?

No response

List the steps that must be taken to reproduce this issue

  1. Start the handler
  2. Let it cordon a node
  3. Stop the handler
  4. Uncordon the node manually
  5. Start the handler again

Version

No response

Additional information

No response

SetDesiredCapacityInfo lowered the number of nodes in ASG

Describe the bug

We're running Cluster Autoscaler and aws-eks-asg-rolling-update-handler in the same EKS cluster. While Cluster Autoscaler was trying to scale up the nodes due to increased traffic, aws-eks-asg-rolling-update-handler made a SetDesiredCapacityInfo request that lowered the number of active instances in the ASG, causing an outage.

    "eventTime": "2023-06-28T23:21:46Z",
    "arn": "arn:aws:iam::xxxxx:role/xxxxx-aws-eks-asg-rolling-update-handler",
    "requestParameters": {
        "desiredCapacity": 90,
        "autoScalingGroupName": "xxxxx-2022020200515369970000000b",
        "honorCooldown": true
    },
    "requestID": "b02fcaa3-cbbd-4d13-a183-828fdae4477f",
    "eventID": "41e31359-c299-4481-bed5-06bed8261347",

ASG activity history

Successful
Terminating EC2 instance: i-0f525e751d21ff18b	
At 2023-06-28T23:21:46Z a user request explicitly set group desired capacity changing the desired capacity from 99 to 90. 
At 2023-06-28T23:21:48Z an instance was taken out of service in response to a difference between desired and actual capacity, shrinking the capacity from 99 to 90. 
At 2023-06-28T23:21:49Z instance i-087d61bdf2a96135d was selected for termination. 
At 2023-06-28T23:21:49Z instance i-07c86111988c8f033 was selected for termination. 
At 2023-06-28T23:21:49Z instance i-0635b60dc9625ee25 was selected for termination.
At 2023-06-28T23:21:49Z instance i-036cf4d15fee9fe41 was selected for termination.
At 2023-06-28T23:21:49Z instance i-021dc05fa50e14344 was selected for termination.
At 2023-06-28T23:21:49Z instance i-012bfb97195f1054d was selected for termination.
At 2023-06-28T23:21:49Z instance i-0fcd7d7595f9e49a8 was selected for termination.
At 2023-06-28T23:21:50Z instance i-0f9a8745a9e9f2e23 was selected for termination.
  • There should be a section in the docs about enabling ASG scale-in protection on all the ASGs & active nodes for safety
  • There should be an option in aws-eks-asg-rolling-update-handler to not use SetDesiredCapacityInfo by itself and instead rely on Cluster Autoscaler to bring up new nodes. aws-eks-asg-rolling-update-handler should evict the node and let Cluster Autoscaler deal with missing nodes

What do you see?

No response

What do you expect to see?

No response

List the steps that must be taken to reproduce this issue

Create EKS cluster environment with Cluster Autoscaler & aws-eks-asg-rolling-update-handler enabled

While rolling through nodes with many pods with PDB of maxUnavailable: 0, keep on increasing hpa to cause Cluster Autoscaler to bring up new nodes.

At some point, because of the 5-minute eviction timeout, aws-eks-asg-rolling-update-handler will be out of sync with the number of nodes that Cluster Autoscaler brought up, and it will send a SetDesiredCapacityInfo request that is lower than the current ASG size.

Version

1.8.0

Additional information

No response

High Availability (HA) for aws-eks-asg-rolling-update-handler

Hi,

Apologies if this is not raised in the correct way or its the wrong place, but not sure where else to ask this.

I'd like to find out whether or not aws-eks-asg-rolling-update-handler can be scaled to more than 1 replica for HA without encountering any issues (i.e. conflicts, duplication, etc.).

I've been trying to look for docs based on this, but currently I have not been successful in doing so. Any help or guidance around aws-eks-asg-rolling-update-handler for HA would be greatly appreciated.

Thanks, look forward to your reply

Nitin
