Comments (5)

github-actions commented on July 17, 2024

This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.

from eksctl.

punkwalker commented on July 17, 2024

@rwilson-release
The difference you are noticing is expected because you started using ManagedNodeGroup and the PropagateAtLaunch is statically set as false.

However, Cluster Autoscaler (CAS) has no hard requirement that these tags be propagated to nodes. CAS only requires the tags to be present on the ASG so that it can scale from 0, and after launching the node, CAS will add those taints and labels. Having PropagateAtLaunch: true should not make any difference to whether CAS adds those taints properly.

On the other hand, for unmanaged nodes PropagateAtLaunch is set to true. @TiberiuGC, can you shed some light on this?

Update:
The scale-up by CAS should happen even with PropagateAtLaunch: false.
Here is my Cluster-Config:

---
apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig

metadata:
 name: "propagate-test"
 region: "us-west-2"
 version: "1.29"
 tags:
   VantaOwner: "Release"
   VantaDescription: "EKS cluster for Release"

addons:
 - name: "vpc-cni"
 - name: "kube-proxy"
 - name: "coredns"
 - name: "aws-ebs-csi-driver"

iam:
 withOIDC: true

managedNodeGroups:
 - name: "standard-workers-4c7827"
   amiFamily: "AmazonLinux2"
   disableIMDSv1: true
   instanceType: "m6a.2xlarge"
   labels: {"ghRunners":"true"}
   minSize: 0
   maxSize: 10
   propagateASGTags: true
   privateNetworking: true
   iam:
     attachPolicyARNs:
       - "arn:aws:iam::aws:policy/AmazonEKSWorkerNodePolicy"
       - "arn:aws:iam::aws:policy/AmazonEC2ContainerRegistryReadOnly"
       - "arn:aws:iam::aws:policy/AmazonEKS_CNI_Policy"
     withAddonPolicies:
       imageBuilder: true
       autoScaler: true
       ebs: true
       efs: true
   tags:
     VantaOwner: "Release"
     VantaDescription: "EKS nodes for Release"
   taints: [{"key":"nodeUsage","value":"ghRunners","effect":"NoSchedule"}]

Test Deployment:

apiVersion: apps/v1
kind: Deployment
metadata:
 labels:
   app: label-test
 name: label-test
spec:
 replicas: 1
 selector:
   matchLabels:
     app: label-test
 template:
   metadata:
     labels:
       app: label-test
   spec:
     nodeSelector:
       ghRunners: "true"
     tolerations:
     - key: nodeUsage
       value: ghRunners
       effect: NoSchedule
     containers:
     - image: nginx
       name: nginx

Scale up log:

I0422 20:58:20.491115       1 event_sink_logging_wrapper.go:48] Event(v1.ObjectReference{Kind:"Pod", Namespace:"default", Name:"label-test-6bf88dd7c5-kqdcb", UID:"26973148-4d7d-44ef-8242-75e835903ed3", APIVersion:"v1", ResourceVersion:"17802", FieldPath:""}): type: 'Normal' reason: 'TriggeredScaleUp' pod triggered scale-up: [{eks-standard-workers-4c7827-04c783ae-70db-f136-a40e-80fe783199d2 0->1 (max: 10)}]

@rwilson-release
My suggestion is to investigate the CAS logs to identify the reason for "Node group xxx is not ready for scaleup - unhealthy". This is not related to eksctl.
Thanks

cPu1 commented on July 17, 2024

> The difference you are noticing is expected because you started using ManagedNodeGroup and the PropagateAtLaunch is statically set as false.
>
> On the other hand for unmanaged nodes the PropagateAtLaunch is set to true

That code is for tagging the backing ASGs for managed nodegroups and PropagateAtLaunch is set to false for the ASG resource itself. I believe the reason it's false for managed nodegroups is that the EKS Managed Nodegroups API also propagates tags to the EC2 instances, so eksctl tries not to override them by propagating them.

For managed nodegroups, EKS does not propagate any tags to the ASG resource; they only apply to the EKS Nodegroup resource and to the EC2 instances launched as part of the nodegroup.

@rwilson-release,

Notice in particular

                "ResourceId": "eks-standard-workers-4c7826-08c6c84c-b284-a68f-0305-fa2ec7b4565c",
                "ResourceType": "auto-scaling-group",
                "Key": "k8s.io/cluster-autoscaler/node-template/taint/nodeUsage",
                "Value": "ghRunners",
                "PropagateAtLaunch": false

Should be "PropagateAtLaunch": true

You are viewing tags for the ASG resource itself. Those tags do not need to be propagated anywhere for scale-from-zero to work.

As @punkwalker noted, this is not an eksctl bug and you might be facing other issues. Can you try upgrading CAS?

Additionally, if the IAM role for Cluster Autoscaler has the eks:DescribeNodegroup permission, CAS can use it to pull labels and taints from the EKS API, eliminating the need for propagateASGTags: true. The set of policies for CAS was updated in eksctl last year, so if your role was created before that, the permission will be missing from it.

rwilson-release commented on July 17, 2024

> As punkwalker noted, this is not an eksctl bug and you might be facing other issues. Can you try upgrading CAS?

I created a brand new cluster on 1.29 with CAS version 1.29.0, which I am pretty sure is relatively recent. A new version was released a few days ago, so maybe there is a long shot there.

> Additionally, if the IAM role for Cluster Autoscaler has the eks:DescribeNodegroup permission...

I was initially excited about this possibility since we had a few clusters of different ages and different eksctl versions, and this seemed like a very simple fix! Unfortunately, we had already updated the policy and use the Helm charts with the correct policy. Verifying all affected clusters confirmed the policy was correct, including the one you mentioned. This policy has been correct in our configs since December 2022.

> My suggestion is to investigate CAS logs for identifying the reason for Node group xxx is not ready for scaleup - unhealthy. This is not related to eksctl.

I almost agreed this was not related to eksctl, but all roads lead back to the labels or tags -- and those are created by eksctl, so please bear with me. If you investigate the errors that I posted, you will find several GitHub issues, in particular this thread, which describes almost identical problems with scaling from 0 for GitHub self-hosted runners (our use case): kubernetes/autoscaler#3780 (comment). However, none of the fixes in that thread help.

Another clue is in the README: https://github.com/kubernetes/autoscaler/blob/master/cluster-autoscaler/cloudprovider/aws/README.md#auto-discovery-setup (scroll down to the section labeled "The following is only required if scaling up from 0 nodes", where the examples are listed). In particular,

k8s.io/cluster-autoscaler/node-template/label/foo: bar for labels (which are correct ✅ )
k8s.io/cluster-autoscaler/node-template/taint/dedicated: true:NoSchedule (which are slightly off ⚠️ )

The actual tag I see is

                "Key": "k8s.io/cluster-autoscaler/node-template/taint/nodeUsage",
                "Value": "ghRunners",

Notice it might need to be

                "Value": "ghRunners:NoSchedule",

This is a bit of a reach -- but is it possible the unmanaged node groups are labeled correctly vs the managed?
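To make the difference between the two tag values concrete, here is a minimal Go sketch that splits a taint tag value of the form `<value>:<effect>`, as shown in the README example. It is only an illustration: `splitTaintTagValue` is a hypothetical helper written for this comment, not the parsing code that Cluster Autoscaler actually uses.

```go
package main

import (
	"fmt"
	"strings"
)

// splitTaintTagValue is a hypothetical helper (not CAS code). It splits a
// tag value of the form "<value>:<effect>" at the last colon. When there is
// no colon, hasEffect is false and the whole string is returned as the value.
func splitTaintTagValue(tagValue string) (value, effect string, hasEffect bool) {
	i := strings.LastIndex(tagValue, ":")
	if i < 0 {
		return tagValue, "", false
	}
	return tagValue[:i], tagValue[i+1:], true
}

func main() {
	// The first value is what the ASG tag currently holds; the second is
	// the form suggested by the README example.
	for _, v := range []string{"ghRunners", "ghRunners:NoSchedule"} {
		value, effect, hasEffect := splitTaintTagValue(v)
		fmt.Printf("%q -> value=%q effect=%q hasEffect=%t\n", v, value, effect, hasEffect)
	}
}
```

With the current tag value the effect comes back empty, which is the discrepancy described above.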

punkwalker commented on July 17, 2024

@rwilson-release
Thank you for pointing that out.

> is it possible the unmanaged node groups are labeled correctly vs the managed?

Even for unmanaged nodegroups, the taint effect was never added to the ASG tag 🙂. Ref

	for _, taint := range taints {
		addTag(taintsPrefix+taint.Key, taint.Value)
	}

I think it has to be changed to something like this:

	for _, taint := range taints {
		addTag(taintsPrefix+taint.Key, taint.Value+":"+string(taint.Effect))
	}
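A self-contained sketch of the proposed change, for anyone who wants to try the tag format outside eksctl. The `Taint` struct and `taintTags` function are hypothetical stand-ins rather than eksctl's real types, and the prefix is taken from the tag key shown earlier in this thread.

```go
package main

import "fmt"

// Taint is a hypothetical stand-in for the taint type eksctl iterates over.
type Taint struct {
	Key    string
	Value  string
	Effect string
}

// taintsPrefix is the tag key prefix seen on the ASG tags in this thread.
const taintsPrefix = "k8s.io/cluster-autoscaler/node-template/taint/"

// taintTags builds one tag per taint, with the value in "<value>:<effect>" form.
func taintTags(taints []Taint) map[string]string {
	tags := make(map[string]string, len(taints))
	for _, t := range taints {
		tags[taintsPrefix+t.Key] = t.Value + ":" + t.Effect
	}
	return tags
}

func main() {
	// The taint from the cluster config posted above.
	tags := taintTags([]Taint{
		{Key: "nodeUsage", Value: "ghRunners", Effect: "NoSchedule"},
	})
	for k, v := range tags {
		fmt.Printf("%s=%s\n", k, v)
	}
}
```

For the taint in the config above, this yields the key `k8s.io/cluster-autoscaler/node-template/taint/nodeUsage` with the value `ghRunners:NoSchedule`.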

@cPu1 What do you think?
