payback159 / openfero

OpenFero is intended as an event-triggered job scheduler framework for code agnostic recovery jobs.

Home Page: https://jelinek.website/openfero/

License: Apache License 2.0

Dockerfile 1.47% Go 70.04% CSS 1.54% templ 10.92% Shell 16.04%

kubernetes prometheus remediation self-healing

openfero's Introduction

Hi there, my name is Alex 👋

I had an early interest in Linux and open source software in general, but it was Kubernetes and my entry into the cloud native world that sparked my interest in learning the different ways to deploy a platform. I've been working in Kubernetes cluster administration, development and architecture for ~4 years (as of 12.02.22) and find it as exciting as my first kubectl command on the first cluster I installed myself.

I mainly use Bash, Golang, Python, Terraform and Ansible in my daily work.

Since then, I have kept learning new things so that I can provide even better platform services.

Feel free to contact me, I am always happy to exchange ideas with like-minded people.

Made with the help of https://github.com/anuraghazra/github-readme-stats

openfero's People

step-security-bot

openfero's Issues

Check if kube-go-client informer wouldn't be a better solution for cleanupJob

Is your feature request related to a problem? Please describe.

There is no direct problem at the moment, but in larger clusters the default list function of the kube client seems to put a heavy load on the api-server and etcd. It would therefore be better to use the informer pattern.

Describe the solution you'd like

Each control loop should use an informer, or possibly a sharedInformer, so that OpenFero behaves more "nicely" towards the api-server and etcd. Areas in the code where the optimization should be noticeable include

...
func readinessHandler
...
func cleanupJob
...

using K8s TTL-Controller to cleanup executed jobs

Is your feature request related to a problem? Please describe.

no related problem

Describe the solution you'd like

The TTL-after-finished controller has been stable since Kubernetes v1.23. It lets Kubernetes delete finished jobs automatically after a configurable time. To enable this, the ttlSecondsAfterFinished field must be set in the job's specification.

https://kubernetes.io/docs/concepts/workloads/controllers/job/#ttl-mechanism-for-finished-jobs

...
spec:
  ttlSecondsAfterFinished: 100
...

By setting a default TTL when nothing is set in the job specification we could offload the cleanup logic from OpenFero to the Kubernetes cluster.
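A minimal sketch of such a defaulting step, in Go. The `jobSpec` type is a hypothetical stand-in for the Kubernetes `JobSpec` (whose `TTLSecondsAfterFinished` field is a `*int32`), and `applyDefaultTTL` and the default value used in `main` are illustrative assumptions, not OpenFero's actual code.

```go
package main

import "fmt"

// jobSpec is a hypothetical stand-in for the Kubernetes batch/v1 JobSpec;
// only the field relevant to TTL-based cleanup is modelled here.
type jobSpec struct {
	TTLSecondsAfterFinished *int32
}

// applyDefaultTTL sets the TTL only when the job definition left it unset,
// so an explicit value (including 0) is never overwritten.
func applyDefaultTTL(spec *jobSpec, defaultTTL int32) {
	if spec.TTLSecondsAfterFinished == nil {
		ttl := defaultTTL
		spec.TTLSecondsAfterFinished = &ttl
	}
}

func main() {
	// No TTL in the job definition: the assumed default is applied.
	unset := jobSpec{}
	applyDefaultTTL(&unset, 300)
	fmt.Println(*unset.TTLSecondsAfterFinished)

	// TTL already present: the explicit value is kept.
	explicit := int32(100)
	set := jobSpec{TTLSecondsAfterFinished: &explicit}
	applyDefaultTTL(&set, 300)
	fmt.Println(*set.TTLSecondsAfterFinished)
}
```

Checking for `nil` rather than for zero matters here, because a TTL of 0 is a valid setting that asks for deletion immediately after the job finishes.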

Describe alternatives you've considered

The alternative is to keep the current implementation of the cleanup logic. From my point of view, the current cleanup logic does not cover anything that the TTL controller would not also handle.

Reaction of OpenFero to jobs that have not completed successfully

Currently OpenFero only reacts to successfully completed jobs and cleans them up to avoid unnecessary load on etcd and the Kubernetes API, but OpenFero may be able to react to other exit codes in the future.

At the moment I think this is difficult, because the appropriate reaction to exit codes != 0 depends on the situation and on the Operarios itself, but there may be a way to improve OpenFero in this respect later.

De-duplication of Prometheus alerts

An alert can often fire multiple times over the course of a single incident. Prometheus and Alertmanager support a lot of de-duplication and grouping, which is helpful. However, it is possible for the same alert to resolve and then trigger again while OpenFero already has a job running for it.

We should detect this scenario to reduce the number of jobs and avoid running duplicate jobs.

Alertmanager sends a groupKey with each webhook notification, which is a unique identifier for the alert group. Before starting a new job for a given alert, we should first check whether a job is already running for that groupKey. If one is, we should only log that the alert triggered again rather than create a new job.

The main decision here I think will be how we link a groupKey to a specific job. If we implement this feature we also need a way to synchronize this information between multiple OpenFero instances.
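One possible way to link a groupKey to a job is a small in-memory registry, sketched below in Go. The names `jobTracker`, `tryStart` and `finish`, as well as the example group key, are assumptions made for illustration. The sketch only covers a single process, so it does not address synchronization between multiple OpenFero instances.

```go
package main

import (
	"fmt"
	"sync"
)

// jobTracker remembers which alert groups currently have a running job.
// It is in-memory only and therefore local to one OpenFero instance.
type jobTracker struct {
	mu      sync.Mutex
	running map[string]string // groupKey -> job name
}

func newJobTracker() *jobTracker {
	return &jobTracker{running: make(map[string]string)}
}

// tryStart registers jobName for groupKey and returns true. If a job is
// already registered for that group, nothing changes and it returns false.
func (t *jobTracker) tryStart(groupKey, jobName string) bool {
	t.mu.Lock()
	defer t.mu.Unlock()
	if _, exists := t.running[groupKey]; exists {
		return false
	}
	t.running[groupKey] = jobName
	return true
}

// finish removes the entry once the job for groupKey has completed.
func (t *jobTracker) finish(groupKey string) {
	t.mu.Lock()
	defer t.mu.Unlock()
	delete(t.running, groupKey)
}

func main() {
	tracker := newJobTracker()
	key := `{}:{alertname="ExampleAlert"}` // illustrative group key

	if tracker.tryStart(key, "job-1") {
		fmt.Println("creating job-1")
	}
	if !tracker.tryStart(key, "job-2") {
		fmt.Println("alert triggered again, job already running")
	}
	tracker.finish(key)
}
```

Doing the check and the registration under one lock avoids the race in which two webhook calls for the same group both see "no job running" and both create one.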

add different authentication methods e.g. basicAuth, oauth2

Is your feature request related to a problem? Please describe.

OpenFero's goal is to react to various errors in an event-based manner (as of today, Prometheus alerts) and then take administrative steps. Although OpenFero itself has little to no rights on the systems to be healed, it indirectly gains powerful permissions on those systems through the Operarios.

Describe the solution you'd like

To prevent misuse, the webhook endpoint of OpenFero should be protected with common authentication methods so that only authenticated systems can send an alert to OpenFero and thus execute remediation jobs.
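As a rough illustration of the simplest of these methods, the Go sketch below wraps an HTTP handler with basic authentication using only the standard library. The function names, the `/alerts` path and the credentials are hypothetical; how OpenFero would load and store credentials is left open.

```go
package main

import (
	"crypto/sha256"
	"crypto/subtle"
	"fmt"
	"net/http"
	"net/http/httptest"
)

// credentialsMatch compares user and password in constant time. Hashing
// first yields equal-length inputs, so the length is not leaked either.
func credentialsMatch(gotUser, gotPass, wantUser, wantPass string) bool {
	gu, wu := sha256.Sum256([]byte(gotUser)), sha256.Sum256([]byte(wantUser))
	gp, wp := sha256.Sum256([]byte(gotPass)), sha256.Sum256([]byte(wantPass))
	userOK := subtle.ConstantTimeCompare(gu[:], wu[:]) == 1
	passOK := subtle.ConstantTimeCompare(gp[:], wp[:]) == 1
	return userOK && passOK
}

// requireBasicAuth rejects requests that lack matching credentials and
// passes all other requests on to next.
func requireBasicAuth(user, pass string, next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		u, p, ok := r.BasicAuth()
		if !ok || !credentialsMatch(u, p, user, pass) {
			w.Header().Set("WWW-Authenticate", `Basic realm="openfero"`)
			http.Error(w, "unauthorized", http.StatusUnauthorized)
			return
		}
		next.ServeHTTP(w, r)
	})
}

func main() {
	// Hypothetical webhook handler standing in for the real endpoint.
	webhook := http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		w.WriteHeader(http.StatusOK)
	})
	protected := requireBasicAuth("alertmanager", "s3cret", webhook)

	// Request without credentials.
	rec := httptest.NewRecorder()
	protected.ServeHTTP(rec, httptest.NewRequest(http.MethodPost, "/alerts", nil))
	fmt.Println("anonymous:", rec.Code)

	// Request with the expected credentials.
	req := httptest.NewRequest(http.MethodPost, "/alerts", nil)
	req.SetBasicAuth("alertmanager", "s3cret")
	rec = httptest.NewRecorder()
	protected.ServeHTTP(rec, req)
	fmt.Println("authenticated:", rec.Code)
}
```

Basic authentication sends the credentials with every request, so it is only reasonable over TLS.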

Describe alternatives you've considered

The alternative would be network isolation (firewalling, network policies, an auth proxy and so on). This would move the problem outside OpenFero's scope, but the solution would depend heavily on the environment in which OpenFero is used and could therefore complicate or limit its use.


Integrate in-memory Alert-Store

The idea is to include an in-memory store for received webhooks so that the webhooks sent by Alertmanager to the endpoint can be analyzed quickly and easily.

The store should have a fixed maximum size. Once it is full, the oldest alert is dropped to make room for the newest one.
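A bounded store of this kind could look roughly like the Go sketch below. The `hookMessage` type is a hypothetical, reduced view of a received webhook, and `alertStore` with its methods is an illustrative design rather than an existing OpenFero API.

```go
package main

import (
	"fmt"
	"sync"
)

// hookMessage is a hypothetical, reduced view of a received webhook.
type hookMessage struct {
	GroupKey string
	Status   string
}

// alertStore keeps at most limit messages and drops the oldest first.
type alertStore struct {
	mu       sync.Mutex
	limit    int
	messages []hookMessage
}

// newAlertStore creates a store; limits below 1 are raised to 1.
func newAlertStore(limit int) *alertStore {
	if limit < 1 {
		limit = 1
	}
	return &alertStore{limit: limit}
}

// add appends a message and evicts the oldest entries beyond the limit.
func (s *alertStore) add(m hookMessage) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.messages = append(s.messages, m)
	if over := len(s.messages) - s.limit; over > 0 {
		// Copy so the evicted entries can be garbage collected.
		s.messages = append([]hookMessage(nil), s.messages[over:]...)
	}
}

// all returns a copy, oldest first, so callers cannot mutate the store.
func (s *alertStore) all() []hookMessage {
	s.mu.Lock()
	defer s.mu.Unlock()
	return append([]hookMessage(nil), s.messages...)
}

func main() {
	store := newAlertStore(2)
	store.add(hookMessage{GroupKey: "a", Status: "firing"})
	store.add(hookMessage{GroupKey: "b", Status: "firing"})
	store.add(hookMessage{GroupKey: "c", Status: "resolved"})
	for _, m := range store.all() {
		fmt.Println(m.GroupKey, m.Status)
	}
}
```

A slice with copy-on-evict keeps the code short. For large limits a ring buffer would avoid the copy on every insert.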

Default TTL is not set for jobs

Describe the bug

A default TTL of 300 should be set for a job if nothing is set in the job definition. However, no default TTL is currently applied.

docs: Define and document requirements for an Operarios

Define and document requirements for an Operarios so that future Operarios developers can consider them during implementation.

For example:

* Use of OPENFERO_ environment variables for parameterization.
* Clear use of exit codes so that OpenFero can better recognize already completed Operarios and its cleanup mechanisms are sufficient.
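To make these two example requirements concrete, here is a hedged Go sketch of an Operarios entry point. The helper `openferoParams`, the `run` function and the convention that exit code 0 means success and any other code means failure are assumptions for illustration; the actual requirements are still to be defined in this issue.

```go
package main

import (
	"fmt"
	"os"
	"strings"
)

const envPrefix = "OPENFERO_"

// openferoParams extracts NAME=VALUE entries whose name carries the
// prefix and returns them keyed by the name without the prefix.
func openferoParams(environ []string) map[string]string {
	params := make(map[string]string)
	for _, entry := range environ {
		name, value, ok := strings.Cut(entry, "=")
		if !ok || !strings.HasPrefix(name, envPrefix) {
			continue
		}
		params[strings.TrimPrefix(name, envPrefix)] = value
	}
	return params
}

// run holds the remediation logic; this placeholder only reports input.
func run(params map[string]string) error {
	fmt.Println(len(params), "OPENFERO_ parameters received")
	return nil
}

func main() {
	if err := run(openferoParams(os.Environ())); err != nil {
		fmt.Fprintln(os.Stderr, "remediation failed:", err)
		os.Exit(1) // non-zero: the job is reported as failed
	}
	// Returning normally exits with code 0: the job completed successfully.
}
```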

Check if the key of the ConfigMap datablock must represent the name of the alert.

Currently OpenFero assumes that the key from the data block in the ConfigMap has the name from the alert. However, we already have the unique assignment between alert and jobs through the ConfigMap naming scheme.

In this issue we want to check whether the dependency on the key name can be relaxed or implemented more generically.
