payback159 / openfero


OpenFero is intended as an event-triggered job scheduler framework for code agnostic recovery jobs.

Home Page: https://jelinek.website/openfero/

License: Apache License 2.0

Dockerfile 1.47% Go 70.04% CSS 1.54% templ 10.92% Shell 16.04%
kubernetes prometheus remediation self-healing

openfero's Introduction

Hi there, my name is Alex 👋

I had an early interest in Linux and open source software in general, but it was Kubernetes and my entry into the cloud native world that sparked my interest in learning the different ways to deploy a platform. I've been working in Kubernetes cluster administration, development and architecture for ~4 years (12.02.22) and find it as exciting as my first kubectl command on the first cluster I installed myself.

I mainly use Bash, Golang, Python, Terraform and Ansible in my daily work.

Since then I have always been eager to learn new things so that I can provide even better platform services.



Feel free to contact me, I am always happy to exchange ideas with like-minded people.


Made with the help of https://github.com/anuraghazra/github-readme-stats

openfero's People

Contributors

dependabot[bot], jelinek-wgs, payback159, step-security-bot

openfero's Issues

Use the K8s TTL controller to clean up executed jobs

Is your feature request related to a problem? Please describe.

No related problem.

Describe the solution you'd like

Since v1.23, Kubernetes has offered a stable TTL-after-finished controller. It can be used to have Kubernetes clean up finished jobs after a configurable time. To do this, the ttlSecondsAfterFinished field must be set in the job's specification.

https://kubernetes.io/docs/concepts/workloads/controllers/job/#ttl-mechanism-for-finished-jobs

...
spec:
  ttlSecondsAfterFinished: 100
...

By setting a default TTL whenever the job specification does not define one, we could offload the cleanup logic from OpenFero to the Kubernetes cluster.
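The defaulting rule described above could be sketched as follows. This is a minimal illustration, not OpenFero's actual code: `jobSpec` is a hypothetical stand-in for the relevant part of the Kubernetes `batch/v1` `JobSpec` (where the field is a `*int32`), and the default of 300 seconds is an assumption borrowed from the related bug report further down.

```go
package main

import "fmt"

// jobSpec is a hypothetical stand-in for the relevant part of the
// Kubernetes batch/v1 JobSpec, where TTLSecondsAfterFinished is a *int32.
type jobSpec struct {
	TTLSecondsAfterFinished *int32
}

// defaultTTLSeconds is an assumed default value, not a confirmed setting.
const defaultTTLSeconds int32 = 300

// applyDefaultTTL sets the TTL only when the job definition left it unset,
// so an explicit value (including 0) is preserved.
func applyDefaultTTL(spec *jobSpec) {
	if spec.TTLSecondsAfterFinished == nil {
		ttl := defaultTTLSeconds
		spec.TTLSecondsAfterFinished = &ttl
	}
}

func main() {
	unset := jobSpec{}
	applyDefaultTTL(&unset)
	fmt.Println("unset spec now has TTL:", *unset.TTLSecondsAfterFinished)

	explicit := int32(100)
	set := jobSpec{TTLSecondsAfterFinished: &explicit}
	applyDefaultTTL(&set)
	fmt.Println("explicit spec keeps TTL:", *set.TTLSecondsAfterFinished)
}
```

Checking for nil rather than for zero matters here, because a TTL of 0 is a valid setting that makes the job eligible for deletion as soon as it finishes.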

Describe alternatives you've considered

The alternative is to keep the current implementation of the cleanup logic. From my point of view, the current cleanup logic does not cover anything beyond what the TTL controller would do.

Additional context

Check whether a client-go informer would be a better solution for cleanupJob

Is your feature request related to a problem? Please describe.

There is no direct problem at the moment, but in larger clusters the default list function of the Kubernetes client seems to put a heavy load on the api-server and etcd. It would therefore be better to use the informer pattern.

Describe the solution you'd like

Each control loop should use an informer, or possibly a sharedInformer, to behave more "nicely" towards the api-server and etcd. Areas in the code where the optimization should be noticeable include

...
func readinessHandler
...
func cleanupJob
...


Integrate in-memory Alert-Store

The idea is to include an in-memory alert store so that the webhooks sent by the Alertmanager to the endpoint can be analyzed quickly and easily.

The store should have a fixed maximum size; once it is full, the oldest alert simply expires.
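A bounded store of this kind could look roughly like the sketch below. The `alert` type and its fields are hypothetical placeholders for whatever OpenFero would keep from a webhook, and the eviction strategy (drop the oldest entry on overflow) is just one simple way to read the requirement.

```go
package main

import (
	"fmt"
	"sync"
)

// alert is a hypothetical, trimmed-down record of one received webhook alert.
type alert struct {
	Name   string
	Status string
}

// alertStore keeps at most limit alerts; adding to a full store drops the oldest.
type alertStore struct {
	mu     sync.Mutex
	limit  int
	alerts []alert
}

// newAlertStore returns a store holding at most limit alerts (minimum 1).
func newAlertStore(limit int) *alertStore {
	if limit < 1 {
		limit = 1
	}
	return &alertStore{limit: limit}
}

// Add appends an alert, evicting the oldest one if the store is full.
func (s *alertStore) Add(a alert) {
	s.mu.Lock()
	defer s.mu.Unlock()
	if len(s.alerts) >= s.limit {
		// Shift everything one slot to the left; O(n), but simple and
		// adequate for the small sizes a debugging store needs.
		copy(s.alerts, s.alerts[1:])
		s.alerts = s.alerts[:len(s.alerts)-1]
	}
	s.alerts = append(s.alerts, a)
}

// List returns a copy of the stored alerts, oldest first.
func (s *alertStore) List() []alert {
	s.mu.Lock()
	defer s.mu.Unlock()
	out := make([]alert, len(s.alerts))
	copy(out, s.alerts)
	return out
}

func main() {
	store := newAlertStore(2)
	store.Add(alert{Name: "A", Status: "firing"})
	store.Add(alert{Name: "B", Status: "firing"})
	store.Add(alert{Name: "C", Status: "firing"}) // evicts "A"
	fmt.Println(store.List())
}
```

The mutex is there because webhook handlers run concurrently; a ring buffer would avoid the shifting cost if the store ever needs to be large.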

Reaction of OpenFero to jobs that have not completed successfully

Currently OpenFero only reacts to successfully completed jobs and cleans them up to avoid unnecessary load on etcd and the Kubernetes API, but maybe OpenFero can react to other exit codes in the future.

At the moment I think this is difficult, because the right reaction to exit codes != 0 depends on the situation and on the Operarios itself, but maybe in the future there will be a way to improve OpenFero in this respect.

docs: Define and document requirements for an Operarios

Define and document requirements for an Operarios so that future Operarios developers can consider them during implementation.

For example:

  • Using OPENFERO_ environment variables for parameterization.
  • Clear use of exit codes so that OpenFero can better recognize already completed Operarios and its cleanup mechanisms are sufficient.
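As an illustration of the two example requirements, an Operarios written in Go might start out like the sketch below. The variable name `OPENFERO_ALERTNAME`, the exit code 2 for missing input, and the `run` placeholder are all assumptions made for this example; the actual conventions are exactly what this issue asks to define.

```go
package main

import (
	"os"
	"strings"
)

// prefix marks the environment variables intended for the Operarios.
const prefix = "OPENFERO_"

// openferoParams extracts the KEY=VALUE pairs whose key starts with
// OPENFERO_ and returns them with the prefix stripped.
func openferoParams(environ []string) map[string]string {
	params := make(map[string]string)
	for _, kv := range environ {
		key, value, ok := strings.Cut(kv, "=")
		if !ok || !strings.HasPrefix(key, prefix) {
			continue
		}
		params[strings.TrimPrefix(key, prefix)] = value
	}
	return params
}

// run is a placeholder for the remediation logic. It returns the process
// exit code: 0 for success, 2 (an assumed convention) for missing input.
func run(params map[string]string) int {
	if _, ok := params["ALERTNAME"]; !ok {
		return 2
	}
	// ... the actual remediation would happen here ...
	return 0
}

func main() {
	// Demo input; a real Operarios would pass os.Environ() here.
	env := []string{"PATH=/usr/bin", "OPENFERO_ALERTNAME=ExampleAlert"}
	os.Exit(run(openferoParams(env)))
}
```

Keeping the exit code as the return value of `run` makes the success and failure paths explicit and easy to test.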

Add different authentication methods, e.g. basicAuth, OAuth2

Is your feature request related to a problem? Please describe.

OpenFero's goal is to react to various errors in an event-based manner (as of today, Prometheus alerts) and then take administrative steps. Although OpenFero itself has little to no rights on the systems to be healed, it indirectly gains powerful authorization on them through the Operarios.

Describe the solution you'd like

To prevent misuse, the webhook endpoint of OpenFero should be protected with common authentication methods so that only authenticated systems can send an alert to OpenFero and thus execute remediation jobs.
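For the basicAuth variant, a minimal sketch using only the Go standard library could wrap the webhook handler in a middleware like the one below. The `/alerts` path, the credentials, and the helper names are assumptions for illustration; this does not describe how OpenFero wires its handlers.

```go
package main

import (
	"crypto/sha256"
	"crypto/subtle"
	"fmt"
	"net/http"
	"net/http/httptest"
)

// credentialsMatch compares credentials in constant time. Hashing first
// gives both sides equal length, which ConstantTimeCompare needs in order
// not to leak the length of the expected values.
func credentialsMatch(gotUser, gotPass, wantUser, wantPass string) bool {
	gu, wu := sha256.Sum256([]byte(gotUser)), sha256.Sum256([]byte(wantUser))
	gp, wp := sha256.Sum256([]byte(gotPass)), sha256.Sum256([]byte(wantPass))
	userOK := subtle.ConstantTimeCompare(gu[:], wu[:]) == 1
	passOK := subtle.ConstantTimeCompare(gp[:], wp[:]) == 1
	return userOK && passOK
}

// basicAuth rejects requests without valid HTTP Basic credentials.
func basicAuth(user, pass string, next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		u, p, ok := r.BasicAuth()
		if !ok || !credentialsMatch(u, p, user, pass) {
			w.Header().Set("WWW-Authenticate", `Basic realm="openfero"`)
			http.Error(w, "unauthorized", http.StatusUnauthorized)
			return
		}
		next.ServeHTTP(w, r)
	})
}

// tryLogin sends one request through a protected dummy webhook handler
// and returns the resulting HTTP status code.
func tryLogin(user, pass string) int {
	webhook := http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		w.WriteHeader(http.StatusOK)
	})
	protected := basicAuth("alertmanager", "s3cret", webhook)

	req := httptest.NewRequest(http.MethodPost, "/alerts", nil)
	req.SetBasicAuth(user, pass)
	rec := httptest.NewRecorder()
	protected.ServeHTTP(rec, req)
	return rec.Code
}

func main() {
	fmt.Println("valid credentials:", tryLogin("alertmanager", "s3cret"))
	fmt.Println("wrong password:", tryLogin("alertmanager", "wrong"))
}
```

Because the check is a plain `http.Handler` wrapper, other methods such as OAuth2 token validation could be added later as alternative middlewares without touching the webhook handler itself.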

Describe alternatives you've considered

The alternative would be network isolation (firewalling, network policies, an auth proxy and much more). Although this would move the problem out of OpenFero's scope, the solution would be highly dependent on the environment in which OpenFero is used and could therefore complicate and/or limit the use of OpenFero.


Default TTL is not set for jobs

Describe the bug

A default TTL of 300 should be set for a job if its job definition does not specify one. However, no default TTL is currently set.


De-duplication of Prometheus alerts

An alert can often fire multiple times over the course of a single incident. The Prometheus Alertmanager does a lot of de-duplication and grouping, which is helpful. However, the same alert can resolve and then trigger again while OpenFero already has a job running for it.

We should detect this scenario to reduce the number of jobs and avoid running duplicate jobs.

The Alertmanager webhook sends a groupKey, which is a unique identifier for each alert group. Before starting a new job for a given alert, we should first check whether a job is already running for its groupKey. If one is already running, we should only log that the alert triggered again rather than creating a new job.

I think the main decision here will be how we link a groupKey to a specific job. If we implement this feature, we also need a way to synchronize this information between multiple OpenFero instances.
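Within a single OpenFero instance, the check described above could be sketched with an in-memory map from groupKey to job name. The `jobTracker` type and its methods are hypothetical, and this sketch deliberately leaves the multi-instance synchronization question open.

```go
package main

import (
	"fmt"
	"sync"
)

// jobTracker remembers which job is currently running for each groupKey.
// It is process-local and therefore not shared between OpenFero instances.
type jobTracker struct {
	mu      sync.Mutex
	running map[string]string // groupKey -> job name
}

func newJobTracker() *jobTracker {
	return &jobTracker{running: make(map[string]string)}
}

// TryStart records jobName for groupKey and returns true. If a job is
// already tracked for that groupKey, it changes nothing and returns false.
func (t *jobTracker) TryStart(groupKey, jobName string) bool {
	t.mu.Lock()
	defer t.mu.Unlock()
	if _, exists := t.running[groupKey]; exists {
		return false
	}
	t.running[groupKey] = jobName
	return true
}

// Finish releases the groupKey once its job has completed.
func (t *jobTracker) Finish(groupKey string) {
	t.mu.Lock()
	defer t.mu.Unlock()
	delete(t.running, groupKey)
}

func main() {
	tracker := newJobTracker()
	key := "example-group-key" // placeholder for the groupKey from the webhook payload

	if tracker.TryStart(key, "remediation-job-1") {
		fmt.Println("starting job for", key)
	}
	if !tracker.TryStart(key, "remediation-job-2") {
		fmt.Println("alert triggered again, job already running for", key)
	}
	tracker.Finish(key)
	fmt.Println("can start again:", tracker.TryStart(key, "remediation-job-3"))
}
```

Combining the check and the insert in one locked `TryStart` call avoids a race between two webhooks for the same group. For several instances, the same mapping would have to live somewhere shared, for example as a label or annotation on the job itself.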
