Not sure of the right fix for this. I was playing with the Kafka trigger again today; here's the cycle:
- Created a Kafka topic with a single partition.
- Deployed a function with KEDA. KEDA activated it, and the first instance locked the partition.
- KEDA kept scaling out (which is fine for now) until I had 4 instances. Only 1 was active (the first one). Once it caught up, KEDA scaled down to 1 instance.
- However, the instance left remaining was one of the additional instances that never got a lock. Checking the logs for that function, it was more or less dead:
```
info: Host.General[0]
      Host lock lease acquired by instance ID '000000000000000000000000448490CC'.
fail: Host.Triggers.Kafka[0]
      kafka-cp-kafka-headless:9092/bootstrap: Failed to resolve 'kafka-cp-kafka-headless:9092': Temporary failure in name resolution (after 5298ms in state CONNECT)
fail: Host.Triggers.Kafka[0]
      1/1 brokers are down
```
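For context, the scaling side was wired up with a ScaledObject roughly along these lines. This is a minimal sketch in current KEDA (v2) syntax rather than my exact manifest; the deployment, consumer group, and topic names are placeholders:

```yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: kafka-function-scaler        # placeholder name
spec:
  scaleTargetRef:
    name: kafka-function             # placeholder: the Deployment running the function
  minReplicaCount: 1
  maxReplicaCount: 4                 # illustrative; matches the 4 instances seen above
  triggers:
    - type: kafka
      metadata:
        bootstrapServers: kafka-cp-kafka-headless:9092
        consumerGroup: my-consumer-group   # placeholder; must match the trigger's group
        topic: my-topic                    # placeholder single-partition topic
        lagThreshold: "10"
```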
I'm not sure if I really hit a reliability issue, or if this was just one of the instances that never had an available partition.
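One way to tell those two cases apart (assuming access to the Kafka CLI tools; the group name is a placeholder) is to describe the consumer group and see which member owns the partition:

```sh
# With a single partition, exactly one group member should show an
# assignment; any other instance will appear with no partition (or not
# at all), which would point at the "no available partition" case.
kafka-consumer-groups.sh --bootstrap-server kafka-cp-kafka-headless:9092 \
  --describe --group my-consumer-group
```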
A few thoughts come to mind:
- Should the Kafka trigger keep retrying to connect if it fails? I assume the runtime in general doesn't do this?
- Should Kubernetes know that this function is in a dead state so it can go into CrashLoopBackOff and restart it? If so, is there an existing health probe we should be hooking up?
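If there isn't an existing probe, I'd imagine something like a liveness probe on the function's Deployment. A rough sketch, assuming the Functions host listens on port 80 inside the container and exposes an unauthenticated ping endpoint (`/admin/host/ping` here is an assumption on my part, not a confirmed contract):

```yaml
livenessProbe:
  httpGet:
    path: /admin/host/ping   # assumed unauthenticated host endpoint
    port: 80                 # assumed container port for the Functions host
  initialDelaySeconds: 30
  periodSeconds: 15
  failureThreshold: 3        # restart after ~45s of consecutive failures
```

The catch is that the instance above was alive enough to acquire the host lock lease, so a generic host ping would probably still return 200; the probe would need to reflect trigger/broker connectivity for Kubernetes to restart it in this state.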
I realize this isn't really a KEDA issue, but I didn't know where else to put it.
/cc @ahmedelnably @fabiocav would be interested to get your thoughts here