Comments (6)
Hi @sameerjoshinice,
Thanks for reporting this to us. An i/o timeout and a context deadline exceeded error often indicate a network connection problem. I have a few questions:
- Had this setup been working well before you encountered this problem, or was this the first time the scaler was run, causing the outage?
- Did you try to debug by setting up a test pod and making the same SASL + TLS connection using the Kafka CLI instead? If that test does not pass, there is a problem with the TLS cert or the SASL configuration.
- How did you find out that the KEDA operator is causing the CPU spike in the AWS MSK brokers? How many brokers in the AWS MSK fleet were affected?
- If you could get more logs for troubleshooting, that would be great.
from keda.
Hi @dttung2905,
Please see answers inline.
Had this setup been working well before you encountered this problem, or was this the first time the scaler was run, causing the outage?
[SJ]: This was the first time the scaler was run, and it caused the outage.
Did you try to debug by setting up a test pod and making the same SASL + TLS connection using the Kafka CLI instead? If that test does not pass, there is a problem with the TLS cert or the SASL configuration.
[SJ]: Other clients are connecting to the MSK cluster with the same role and are working fine. Those clients are mostly Java-based.
How did you find out that the KEDA operator is causing the CPU spike in the AWS MSK brokers? How many brokers in the AWS MSK fleet were affected?
[SJ]: There are 3 brokers in the shared MSK cluster and all of them were affected. This happened twice, and both times the issue started when the KEDA scaler's permissions for access to MSK were enabled.
If you could get more logs for troubleshooting, that would be great.
[SJ]: I will try to get more logs as and when I find something of importance.
We also saw this error from the KEDA operator before the timeouts and context deadline errors started happening:
ERROR scale_handler error getting metric for trigger {"scaledObject.Namespace": "mynamespace", "scaledObject.Name": "myscaler", "trigger": "apacheKafkaScaler", "error": "error listing consumer group offset: %!w()"}
Our suspicion is that the scaler caused a flood of broken connections that didn't close properly, which eventually caused all of the consumer groups to rebalance and pegged the CPU. The rebalances can be seen within a few minutes of starting the ScaledObject.
I also have this email which highlights some things AWS was finding at the same time:
I've been talking to our AWS TAM and the AWS folks about this issue. They still believe based on the logs that they have access to (which we don't) that the problems are related to a new IAM permission that is required when running under the newest Kafka version. They are seeing many authentication issues related to the routing pods. My coworker and I have been playing with permissions to give the application rights requested by AWS. The CPU on the cluster dropped slightly when we did that, however, we are getting the following error still even after applying the update on the routing pods:
Connection to node -2 () terminated during authentication. This may happen due to any of the following reasons: (1) Authentication failed due to invalid credentials with brokers older than 1.0.0, (2) Firewall blocking Kafka TLS traffic (eg it may only allow HTTPS traffic), (3) Transient network issue.
AWS believes that the authentication sessions on the backend have been marked as expired, but they have not been removed and are treated as invalid. They have been attempting to manually remove them, but have run into problems doing so. They are going to restart all the MSK brokers to clear out the session cache.
Unfortunately, restarting the brokers didn't fix the CPU problems.
Did you try restarting the KEDA operator?
I've been checking, and it looks like we are closing the connection correctly in case of failure:
keda/pkg/scalers/apache_kafka_scaler.go
Lines 564 to 574 in 367fcd3
But maybe there is some other way to close the client that we've missed :/
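To make the pattern under discussion concrete, here is a minimal sketch of closing a client on the failure path. It is not the code at the lines referenced above: `offsetClient`, `fakeClient`, and `fetchOffsets` are hypothetical names used only for illustration, and the real scaler uses a Kafka client library rather than this stand-in interface.

```go
package main

import (
	"errors"
	"fmt"
)

// offsetClient is a hypothetical stand-in for a Kafka client.
// It is an assumption for this sketch, not the KEDA or Kafka library API.
type offsetClient interface {
	ListOffsets() (map[string]int64, error)
	Close() error
}

// fakeClient records whether Close was called so the pattern can be checked.
type fakeClient struct {
	fail   bool
	closed bool
}

func (c *fakeClient) ListOffsets() (map[string]int64, error) {
	if c.fail {
		return nil, errors.New("i/o timeout")
	}
	return map[string]int64{"topic-0": 42}, nil
}

func (c *fakeClient) Close() error {
	c.closed = true
	return nil
}

// fetchOffsets closes the client when the request fails, so that a failed
// poll does not leave a half-open connection behind on the broker.
func fetchOffsets(c offsetClient) (map[string]int64, error) {
	offsets, err := c.ListOffsets()
	if err != nil {
		// The close error is ignored here for brevity; the original failure
		// is the one reported to the caller.
		_ = c.Close()
		return nil, fmt.Errorf("error listing offsets: %w", err)
	}
	return offsets, nil
}
```

If connections still pile up with a pattern like this in place, the leak is more likely on a path that never reaches the error branch, for example a client that is recreated on every poll without the previous one being closed.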