Comments (6)

dttung2905 commented on June 11, 2024

Hi @sameerjoshinice,

Thanks for reporting this to us. An i/o timeout and a "context deadline exceeded" error often indicate a network connection problem. I have a few questions:

  • Had this setup been working well for you before you encountered this problem? Or is this the first time this scaler has been run, causing the outage?
  • Did you try to debug by setting up a testing pod and making the same SASL + TLS connection using the Kafka CLI instead? If this test does not pass, it means there are errors with the TLS cert + SASL configuration.
  • How did you find out that the KEDA operator is causing the CPU spike in the AWS MSK brokers? How many brokers in the AWS MSK fleet were affected?
  • If you could get more logs for troubleshooting, that would be great

from keda.

sameerjoshinice commented on June 11, 2024

Hi @dttung2905 ,
Please see my answers inline.
Had this setup been working well for you before you encountered this problem? Or is this the first time this scaler has been run, causing the outage?
[SJ]: This is the first time this scaler has been run, and it caused the outage.
Did you try to debug by setting up a testing pod and making the same SASL + TLS connection using the Kafka CLI instead? If this test does not pass, it means there are errors with the TLS cert + SASL configuration.
[SJ]: There are other clients contacting the MSK cluster with the same role, and they are working fine. Those clients are mostly Java based.
How did you find out that the KEDA operator is causing the CPU spike in the AWS MSK brokers? How many brokers in the AWS MSK fleet were affected?
[SJ]: There are 3 brokers in the shared MSK cluster and all of them were affected. This happened twice, and both times the issue started right after the KEDA scaler's permissions to access the MSK cluster were enabled.
If you could get more logs for troubleshooting, that would be great.
[SJ]: I will share more logs as soon as I find anything of importance.

jared-schmidt-niceincontact commented on June 11, 2024

We also saw this error from the KEDA operator before the timeouts and "context deadline exceeded" errors started happening:

ERROR scale_handler error getting metric for trigger {"scaledObject.Namespace": "mynamespace", "scaledObject.Name": "myscaler", "trigger": "apacheKafkaScaler", "error": "error listing consumer group offset: %!w()"}

jared-schmidt-niceincontact commented on June 11, 2024

Our suspicion is that the scaler caused a flood of broken connections that didn't close properly, which eventually caused all of the consumer groups to rebalance and pegged the CPU. The rebalances can be seen within a few minutes of starting the ScaledObject.

I also have this email which highlights some things AWS was finding at the same time:

I've been talking to our AWS TAM and the AWS folks about this issue. They still believe, based on the logs that they have access to (which we don't), that the problems are related to a new IAM permission that is required when running under the newest Kafka version. They are seeing many authentication issues related to the routing pods. My coworker and I have been playing with permissions to give the application the rights requested by AWS. The CPU on the cluster dropped slightly when we did that; however, we are still getting the following error even after applying the update on the routing pods:

Connection to node -2 () terminated during authentication. This may happen due to any of the following reasons: (1) Authentication failed due to invalid credentials with brokers older than 1.0.0, (2) Firewall blocking Kafka TLS traffic (eg it may only allow HTTPS traffic), (3) Transient network issue.

AWS believes that the authentication sessions on the backend have been marked as expired, but they have not been removed and are treated as invalid. They have been attempting to manually remove them, but have run into problems doing so. They are going to restart all the MSK brokers to clear out the session cache.

jared-schmidt-niceincontact commented on June 11, 2024

Unfortunately, restarting the brokers didn't fix the CPU problems.

JorTurFer commented on June 11, 2024

Did you try restarting the KEDA operator?
I'm checking, and apparently we are closing the connection correctly in case of failure:

    // Close closes the kafka client
    func (s *apacheKafkaScaler) Close(context.Context) error {
        if s.client == nil {
            return nil
        }
        transport := s.client.Transport.(*kafka.Transport)
        if transport != nil {
            transport.CloseIdleConnections()
        }
        return nil
    }

But maybe there is some other way to close the client that we've missed :/
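
One detail in the snippet above that may be worth a look: a single-value type assertion panics if the interface value is nil or holds a different concrete type. The sketch below shows a defensive variant that uses the comma-ok form. It is only an illustration with stand-in types (`roundTripper`, `pooledTransport`, `client` and `scaler` are all hypothetical and do not come from KEDA or kafka-go), and it does not by itself explain the broken connections reported earlier in the thread:

```go
package main

import "fmt"

// roundTripper stands in for the client's transport interface (hypothetical).
type roundTripper interface{ RoundTrip() }

// pooledTransport stands in for a transport that owns a connection pool
// (hypothetical). closed counts calls to CloseIdleConnections.
type pooledTransport struct{ closed int }

func (t *pooledTransport) RoundTrip()            {}
func (t *pooledTransport) CloseIdleConnections() { t.closed++ }

// client stands in for the Kafka client held by the scaler (hypothetical).
type client struct{ Transport roundTripper }

type scaler struct{ client *client }

// Close releases idle connections when the transport is the expected type.
func (s *scaler) Close() error {
	if s.client == nil {
		return nil
	}
	// The comma-ok form avoids a panic when Transport is a nil interface
	// or holds some other implementation.
	t, ok := s.client.Transport.(*pooledTransport)
	if ok && t != nil {
		t.CloseIdleConnections()
	}
	return nil
}

func main() {
	tr := &pooledTransport{}
	s := &scaler{client: &client{Transport: tr}}
	_ = s.Close()
	fmt.Println("CloseIdleConnections calls:", tr.closed)

	// A client with no transport set no longer panics on Close.
	empty := &scaler{client: &client{}}
	fmt.Println("close with nil transport:", empty.Close())
}
```

Note that a method named `CloseIdleConnections` presumably leaves in-flight connections alone, so whether anything else needs to be shut down when the scaler is closed is still an open question.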
