inovex / trovilo

trovilo collects and prepares files from Kubernetes ConfigMaps for Prometheus & friends

License: Apache License 2.0

Go 89.27% Makefile 9.42% Dockerfile 1.31%
prometheus alertmanager alerts kubernetes monitoring grafana dashboards configmap


trovilo's Issues

Fix travis CI

Currently the CI run fails in the publish stage. This should be fixed.

Refactor trovilo job design

Currently trovilo supports multiple jobs (e.g. to gather information for Prometheus and for Grafana). Since trovilo is designed to run inside Kubernetes environments, I don't actually see a benefit in supporting multiple jobs (and the added complexity). From my perspective, trovilo should always be used as a sidecar to the corresponding service, as in the Prometheus example: https://github.com/inovex/trovilo/blob/master/examples/k8s/deployment.yaml#L60

Are there any reasons why we should keep supporting multiple jobs?

add docs

Add README and useful examples

Possible race condition?

Currently there is a possible race condition in the way trovilo is implemented (AFAIK):

Imagine the following flow:

1.) trovilo gets started
2.) A ConfigMap with the labels trovilo expects is added
3.) trovilo writes the ConfigMap (or more precisely, the content of the ConfigMap) to "disk"
4.) trovilo crashes
5.) The ConfigMap from above is deleted
6.) trovilo recovers

--> If a ConfigMap is deleted while trovilo is down, it will never be cleaned up, correct? This line will never be called: https://github.com/inovex/trovilo/blob/master/cmd/trovilo/main.go#L104 — or, to be precise, trovilo never checks the initial state of the targetDir.
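
A minimal Go sketch of how a startup reconciliation could close this gap, assuming the expected file set can be derived from an initial ConfigMap list using the configured label selector (the helper name and paths below are hypothetical, not trovilo's actual API):

```go
package main

import (
	"log"
	"os"
	"path/filepath"
)

// cleanupOrphans removes every regular file in targetDir that is not part of
// the expected set derived from the ConfigMaps currently matching the label
// selector. Running this once at startup, before the watch begins, would also
// catch ConfigMaps that were deleted while trovilo was down.
func cleanupOrphans(targetDir string, expected map[string]bool) error {
	return filepath.Walk(targetDir, func(path string, info os.FileInfo, err error) error {
		if err != nil {
			return err
		}
		if info.IsDir() {
			return nil
		}
		if !expected[path] {
			log.Printf("removing orphaned file %s (no matching ConfigMap)", path)
			return os.Remove(path)
		}
		return nil
	})
}

func main() {
	// Hypothetical: in trovilo this set would be built from an initial
	// ConfigMap List() call using the configured label selector.
	expected := map[string]bool{
		"/etc/prometheus/rules/team-a.yaml": true,
	}
	if err := cleanupOrphans("/etc/prometheus/rules", expected); err != nil {
		log.Fatalf("startup reconciliation failed: %v", err)
	}
}
```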

Allow for a decorator phase / command to i.e. force-tag alerts

Thanks for this very helpful tool!

I'd love to be able to define a decorator that applies changes to the collected data from configmaps before feeding it to e.g. Alertmanager or Grafana.

One very distinct use case is Prometheus alert definitions that are collected from multiple Kubernetes namespaces. If one wants to route the alerts based on the source namespace the configmap was picked up from, this metadata needs to be immutably available. In the case of alerts, this requires the source to either ensure that the PromQL query leaves the namespace as a label on the data, or to set an additional label or annotation containing the namespace on each and every alert. Ensuring this across hundreds of alerts and many different teams and people, without maintaining the source namespace info for each alert behind the scenes, is prone to fail.

My suggestion is to simply allow a command to run for each configmap collected by trovilo, which then receives the Kubernetes metadata of the individual configmap as environment variables, e.g. K8S_METADATA_NAMESPACE and K8S_METADATA_NAME. The command could simply be a call to sed or maybe even a jsonpatch that decorates the source data with additional info.

Running this arbitrary command and simply providing some variables to it does not make trovilo any more domain-specific.
But especially in multi-tenant setups, the namespace might just be the most important piece of information one wants to add to / keep on the data that is then handed to Alertmanager or Grafana.
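
A rough Go sketch of the proposed hook, assuming trovilo would invoke the configured command once per written file with the file path appended as the last argument (that calling convention and the function name are assumptions; only the environment variable names come from this proposal):

```go
package main

import (
	"log"
	"os"
	"os/exec"
)

// runDecorator executes a user-configured command for one file written from a
// ConfigMap, exposing the ConfigMap's Kubernetes metadata as environment
// variables so the command can e.g. inject the source namespace as a label.
func runDecorator(command []string, filePath, namespace, name string) error {
	cmd := exec.Command(command[0], append(command[1:], filePath)...)
	cmd.Env = append(os.Environ(),
		"K8S_METADATA_NAMESPACE="+namespace,
		"K8S_METADATA_NAME="+name,
	)
	cmd.Stdout = os.Stdout
	cmd.Stderr = os.Stderr
	return cmd.Run()
}

func main() {
	// Hypothetical usage: any command works; here a shell one-liner that just
	// prints which file it would decorate and for which namespace ($1 is the
	// appended file path, $K8S_METADATA_NAMESPACE comes from the environment).
	decorator := []string{"sh", "-c",
		`echo "decorating $1 for namespace $K8S_METADATA_NAMESPACE"`, "decorator"}
	if err := runDecorator(decorator, "/etc/prometheus/rules/team-a.yaml", "team-a", "alert-rules"); err != nil {
		log.Printf("decorator failed: %v", err)
	}
}
```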

Trovilo crashes without helpful error message / cause

After running for hours without any issue, Trovilo sometimes crashes with a very short error message like:

"{"error":"EOF","level":"fatal","msg":"Kubernetes ConfigMap watcher encountered an error. Exit..","time":"2018-08-30T16:04:31Z"}"

Unfortunately there is no indication of what could have caused this, and after the container is restarted it works again for a long time until it crashes in the same manner again. While I cannot rule out an influence from the consumed configmaps, I am very certain this is not caused by a change in the configmaps. It also happens when Kubernetes resources are not changed at all over a longer period of time.

Maybe some timeout when talking to the API which is not handled well?
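
If the cause is indeed the API server closing long-lived watch connections (which commonly surfaces as an EOF on the watch), one option would be to treat this as a recoverable condition and re-establish the watch with a small backoff instead of exiting fatally. A hedged Go sketch, where watchConfigMaps stands in for trovilo's actual watch loop:

```go
package main

import (
	"errors"
	"log"
	"time"
)

// watchConfigMaps is a stand-in for trovilo's real ConfigMap watch loop; it is
// expected to block until the watch channel is closed or an error occurs.
func watchConfigMaps() error {
	// ... a real implementation would stream watch events here ...
	return errors.New("EOF") // simulate the API server closing the connection
}

func main() {
	// Instead of exiting fatally on the first watch error, retry with a small
	// backoff. API servers commonly time out long-lived watch connections, so
	// an EOF here is an expected condition rather than a fatal one.
	for {
		err := watchConfigMaps()
		if err == nil {
			return // watch ended cleanly, e.g. on shutdown
		}
		log.Printf("ConfigMap watch ended: %v; re-establishing in 5s", err)
		time.Sleep(5 * time.Second)
	}
}
```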
