
vshn / signalilo


Forward alerts from Prometheus Alertmanager to Icinga2 via Webhooks

License: BSD 3-Clause "New" or "Revised" License

Dockerfile 1.55% Go 96.72% Makefile 1.73%
icinga2-api alertmanager icinga2 webhook vshn-project-ignore

signalilo's Introduction

Signalilo

Signalilo is our Alertmanager to Icinga2 bridge implementation. Signalilo acts on webhooks which it receives from Alertmanager and forwards the alerts in those webhooks to Icinga2 using https://github.com/vshn/go-icinga2-client.

See CHANGELOG.md for changelogs of each release version of Signalilo.

See DockerHub or Quay.io for pre-built Docker images of Signalilo.

Usage

Signalilo gets started from the command line and takes its configuration either as options or as environment variables. Use signalilo --help to get a list of all available configuration parameters.

When started, Signalilo listens to HTTP requests on the following paths:

  • /webhook Endpoint to accept alerts from Alertmanager.
  • /healthz returns HTTP 200 with ok as its payload as long as the webhook serving loop is operational.

Installation

Helm

helm install --name signalilo appuio/signalilo

See https://github.com/appuio/charts/tree/master/appuio/signalilo.

Docker

docker run --name signalilo vshn/signalilo

OpenShift

The Helm chart should work on OpenShift.

Configuration

Mandatory

  • --uuid/SIGNALILO_UUID: UUID which identifies the Signalilo instance.
  • --icinga_hostname/SIGNALILO_ICINGA_HOSTNAME: Name of the service host in Icinga2.
  • --icinga_url/SIGNALILO_ICINGA_URL: URL of the Icinga API. It's possible to specify one or more URLs. The Parameter content will be split on newline character \n, e.g. "http://example.com:5665\nhttp://example2.com:5665" will configure two masters at http://example.com:5665 and http://example2.com:5665. Please keep in mind that the first URL will be the Icinga-Config-Master.
  • --icinga_username/SIGNALILO_ICINGA_USERNAME: Authentication against Icinga2 API.
  • --icinga_password/SIGNALILO_ICINGA_PASSWORD: Authentication against Icinga2 API.

Optional

  • --loglevel/SIGNALILO_LOG_LEVEL: Integer to control verbosity of logging (default: 2).
  • --icinga_insecure_tls/SIGNALILO_ICINGA_INSECURE_TLS: If true, disable strict TLS checking of Icinga2 API SSL certificate (default: false).
  • --icinga_disable_keepalives/SIGNALILO_ICINGA_DISABLE_KEEPALIVES: If true, disable HTTP keep-alives with the Icinga2 API, so that each connection to the server is used for a single HTTP request only (default: false).
  • --icinga_display_name_as_service_name/SIGNALILO_ICINGA_DISPLAY_NAME_AS_SERVICE_NAME: If true, will leave display name same as service name. Useful for users who monitor alerts in Nagstamon (default: false).
  • --icinga_debug/SIGNALILO_ICINGA_DEBUG: If true, enable debugging mode in Icinga client (default: false).
  • --icinga_gc_interval/SIGNALILO_ICINGA_GC_INTERVAL: Interval to run Garbage collection of recovered alerts in Icinga (default 15m).
  • --icinga_heartbeat_interval/SIGNALILO_ICINGA_HEARTBEAT_INTERVAL: Interval to send heartbeat to Icinga (default 60s).
  • --icinga_keep_for/SIGNALILO_ICINGA_KEEP_FOR: How long to keep Icinga2 services around after they transition to state OK (default 168h).
  • --icinga_ca/SIGNALILO_ICINGA_CA: A PEM string of the trusted CA certificate for the Icinga2 API certificate.
  • --icinga_service_checks_active/SIGNALILO_ICINGA_SERVICE_CHECKS_ACTIVE: Use active checks for created icinga services to leverage on Alertmanager resend interval to manage stale checks (default: false).
  • --icinga_service_checks_command/SIGNALILO_ICINGA_SERVICE_CHECKS_COMMAND: Name of the check command used in Icinga2 service creation (default: 'dummy').
  • --icinga_service_checks_interval/SIGNALILO_ICINGA_SERVICE_CHECKS_INTERVAL: Interval (in seconds) to be used for icinga check_interval and retry_interval. This should be set to a multiple of alertmanager repeat_interval in case active checks are enabled (e.g. 1.1 < icinga_service_checks_interval/repeat_interval < 5, default: 43200s).
  • --icinga_service_max_check_attempts/SIGNALILO_ICINGA_SERVICE_MAX_CHECKS_ATTEMPTS: The maximum number of checks which are executed before changing to a hard state.
  • --icinga_service_template/SIGNALILO_ICINGA_SERVICE_TEMPLATE: Creates an icinga service with the given template. It's possible to specify one or more service templates. (default: "generic-service"). The Parameter content will be split on newline character \n, e.g. "generic-service\nexample-template" creates a service with generic-service and example-template. Please keep in mind that generic-service will be overwritten if the parameter is specified.
  • --icinga_reconnect/SIGNALILO_ICINGA_RECONNECT: If set, Signalilo waits for a reconnect instead of switching immediately to another URL.
  • --alertmanager_port/SIGNALILO_ALERTMANAGER_PORT: Port on which Signalilo listens to incoming webhooks (default 8888).
  • --alertmanager_bearer_token/SIGNALILO_ALERTMANAGER_BEARER_TOKEN: Incoming webhook authentication. Can be either set via Authorization header or in the token URL query parameter.
  • --alertmanager_tls_cert/SIGNALILO_ALERTMANAGER_TLS_CERT: Path of certificate file for TLS-enabled webhook endpoint. Should contain the full chain.
  • --alertmanager_tls_key/SIGNALILO_ALERTMANAGER_TLS_KEY: Path of private key file for TLS-enabled webhook endpoint. TLS is enabled when both TLS_CERT and TLS_KEY are set.
  • --alertmanager_pluginoutput_annotations/SIGNALILO_ALERTMANAGER_PLUGINOUTPUT_ANNOTATIONS: The name of an annotation to retrieve the plugin_output from. Can be set multiple times in which case the first annotation with a value found is used.
  • --alertmanager_pluginoutput_by_states/SIGNALILO_ALERTMANAGER_PLUGINOUTPUT_BY_STATES: Enables support for dynamically selecting the Annotation name used for the Plugin Output based on the computed Service State. See Plugin Output for more details on this option.
  • --alertmanager_custom_severity_levels/SIGNALILO_ALERTMANAGER_CUSTOM_SEVERITY_LEVELS: Add or override the default mapping of the severity label of the Alert to an Icinga Service State. Use the format label_name=service_state. The service_state can be 0 for OK, 1 for Warning, 2 for Critical, and 3 for Unknown. Can be set multiple times and you can also override the default values for the labels warning and critical. The severity label is not case-sensitive.
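The label_name=service_state format of --alertmanager_custom_severity_levels can be illustrated with a small Go sketch. This is not Signalilo's own parser: the function names are hypothetical, and mapping an unlisted severity to 3 (Unknown) is an assumption made for the example.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseSeverityLevels applies label_name=service_state entries on top of the
// default mapping for the labels warning and critical.
// Hypothetical helper, not part of Signalilo.
func parseSeverityLevels(entries []string) (map[string]int, error) {
	levels := map[string]int{"warning": 1, "critical": 2}
	for _, entry := range entries {
		name, raw, ok := strings.Cut(entry, "=")
		name = strings.ToLower(strings.TrimSpace(name))
		if !ok || name == "" {
			return nil, fmt.Errorf("invalid entry %q: want label_name=service_state", entry)
		}
		state, err := strconv.Atoi(strings.TrimSpace(raw))
		if err != nil || state < 0 || state > 3 {
			return nil, fmt.Errorf("invalid service state in %q: want 0, 1, 2 or 3", entry)
		}
		// Later entries override earlier ones and the defaults.
		levels[name] = state
	}
	return levels, nil
}

// stateFor looks up a severity label case-insensitively.
// Returning 3 (Unknown) for unlisted labels is an assumption of this sketch.
func stateFor(levels map[string]int, severity string) int {
	if state, ok := levels[strings.ToLower(severity)]; ok {
		return state
	}
	return 3
}

func main() {
	levels, err := parseSeverityLevels([]string{"info=0", "critical=1"})
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(stateFor(levels, "INFO"), stateFor(levels, "critical"), stateFor(levels, "warning"))
}
```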

The environment variable names are generated from the command-line flags. The flag is uppercased and all - characters are replaced with _. Signalilo uses the newline character \n to split flags that are allowed multiple times (like SIGNALILO_ALERTMANAGER_PLUGINOUTPUT_ANNOTATIONS) into an array.
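As a rough illustration of the naming rule and the newline splitting described above, here is a minimal Go sketch. The function names are hypothetical, and the SIGNALILO_ prefix is inferred from the option list rather than taken from the Signalilo source. Not every documented pair follows the rule exactly (for example --loglevel and SIGNALILO_LOG_LEVEL), so the option list remains the reference.

```go
package main

import (
	"fmt"
	"strings"
)

// envName derives an environment variable name from a command-line flag:
// strip leading dashes, replace remaining dashes with underscores, uppercase,
// and add the SIGNALILO_ prefix (prefix inferred from the option list).
func envName(flag string) string {
	name := strings.TrimLeft(flag, "-")
	name = strings.ReplaceAll(name, "-", "_")
	return "SIGNALILO_" + strings.ToUpper(name)
}

// splitMulti splits a multi-valued setting on newline characters and drops
// empty entries. Dropping empty entries is a choice made for this sketch.
func splitMulti(value string) []string {
	var out []string
	for _, part := range strings.Split(value, "\n") {
		if part = strings.TrimSpace(part); part != "" {
			out = append(out, part)
		}
	}
	return out
}

func main() {
	fmt.Println(envName("--icinga_url"))
	for _, url := range splitMulti("http://example.com:5665\nhttp://example2.com:5665") {
		fmt.Println(url)
	}
}
```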

Integration with Prometheus/Alertmanager

The /webhook accepts alerts in the format of Alertmanager. The following Alertmanager configuration is an example taken from a Signalilo installation on OpenShift.

global:
  resolve_timeout: 5m
route:
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 12h
  receiver: default
  routes:
  - match:
      alertname: DeadMansSwitch
    repeat_interval: 5m
    receiver: deadmansswitch
receivers:
- name: default
  webhook_configs:
  - send_resolved: true
    http_config:
      bearer_token: "*****"
    url: http://signalilo.appuio-monitoring/webhook
- name: deadmansswitch
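
To exercise the /webhook endpoint by hand, a request similar to the one Alertmanager sends can be built as in the Go sketch below. The payload struct covers only the fields this README refers to (the real Alertmanager webhook payload carries more), the helper name is hypothetical, and the alert content is purely illustrative. The token is passed as "Authorization: Bearer <token>", which is how Alertmanager's bearer_token setting is transmitted.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
)

// webhookAlert and webhookPayload are a reduced view of the Alertmanager
// webhook format, limited to fields mentioned in this README.
type webhookAlert struct {
	Status       string            `json:"status"`
	Labels       map[string]string `json:"labels"`
	Annotations  map[string]string `json:"annotations"`
	GeneratorURL string            `json:"generatorURL"`
}

type webhookPayload struct {
	Version string         `json:"version"`
	Status  string         `json:"status"`
	Alerts  []webhookAlert `json:"alerts"`
}

// newWebhookRequest builds a POST request that carries the JSON payload and
// the bearer token. Hypothetical helper for testing, not part of Signalilo.
func newWebhookRequest(url, token string, p webhookPayload) (*http.Request, error) {
	body, err := json.Marshal(p)
	if err != nil {
		return nil, err
	}
	req, err := http.NewRequest(http.MethodPost, url, bytes.NewReader(body))
	if err != nil {
		return nil, err
	}
	req.Header.Set("Content-Type", "application/json")
	req.Header.Set("Authorization", "Bearer "+token)
	return req, nil
}

func main() {
	payload := webhookPayload{
		Version: "4",
		Status:  "firing",
		Alerts: []webhookAlert{{
			Status:      "firing",
			Labels:      map[string]string{"alertname": "ExampleAlert", "severity": "warning"},
			Annotations: map[string]string{"message": "example message", "description": "example description"},
		}},
	}
	req, err := newWebhookRequest("http://signalilo.appuio-monitoring/webhook", "*****", payload)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	// Sending is left out on purpose; http.DefaultClient.Do(req) would submit it.
	fmt.Println(req.Method, req.URL)
}
```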

Signalilo requires a set of information to be part of an alert. Without this information, the check generated in Icinga will be lacking.

Required labels:

  • severity: Must be one of warning or critical, or any values set via the --alertmanager_custom_severity_levels option.
  • alertname mapped to display_name.

Required annotations:

  • description: mapped to notes.
  • message: mapped to plugin_output.

You can also use the --alertmanager_pluginoutput_annotations option to change the Annotation used for the plugin_output as well as the --alertmanager_pluginoutput_by_states option. See Plugin Output for more details.

Optional annotations:

  • runbook_url: mapped to notes_url

Inferred fields:

  • generatorURL: mapped to action_url
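
The field mapping listed above can be summarised in a short Go sketch. The struct and function names are hypothetical and only mirror the default mapping from this section; they are not taken from the Signalilo source.

```go
package main

import "fmt"

// alert holds the parts of an Alertmanager alert that this section refers to.
type alert struct {
	Labels       map[string]string
	Annotations  map[string]string
	GeneratorURL string
}

// icingaFields holds the Icinga service attributes named in this section.
type icingaFields struct {
	DisplayName  string
	Notes        string
	PluginOutput string
	NotesURL     string
	ActionURL    string
}

// mapAlert applies the default mapping: alertname -> display_name,
// description -> notes, message -> plugin_output, runbook_url -> notes_url,
// generatorURL -> action_url. Missing keys simply yield empty strings.
func mapAlert(a alert) icingaFields {
	return icingaFields{
		DisplayName:  a.Labels["alertname"],
		Notes:        a.Annotations["description"],
		PluginOutput: a.Annotations["message"],
		NotesURL:     a.Annotations["runbook_url"],
		ActionURL:    a.GeneratorURL,
	}
}

func main() {
	fields := mapAlert(alert{
		Labels:       map[string]string{"alertname": "ExampleAlert", "severity": "critical"},
		Annotations:  map[string]string{"description": "example description", "message": "example message"},
		GeneratorURL: "http://prometheus.example/graph",
	})
	fmt.Printf("%+v\n", fields)
}
```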

Plugin Output

By default, Signalilo will use the message Annotation to set the plugin_output in the Icinga Service.

This can be changed by using the --alertmanager_pluginoutput_annotations option to select either a different Annotation or to provide a list of Annotations where the first one with a value will be used.

Alternatively if you enable the --alertmanager_pluginoutput_by_states option then Signalilo will take the Service State name (ok, warning, critical, or unknown) and suffix this to the Annotation name when looking up the Annotation to use for the Plugin Output (for example: message_ok).

This allows you to configure multiple Annotations with different values that are then used with the corresponding Service State to set the Plugin Output.

If an Annotation is not found for that specific Service State then Signalilo will fall back to just using the Annotation name as configured.
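
The lookup described in this section can be sketched in Go as follows. The function name is hypothetical, and the order in which state-specific and plain Annotation names are tried across a list of several names is an assumption of this sketch, not a statement about Signalilo's implementation.

```go
package main

import "fmt"

// pluginOutput returns the first non-empty annotation value for the
// configured names. With byStates enabled, "<name>_<state>" is tried before
// "<name>" for each configured name (ordering assumed for this sketch).
func pluginOutput(annotations map[string]string, names []string, state string, byStates bool) string {
	for _, name := range names {
		if byStates {
			if value := annotations[name+"_"+state]; value != "" {
				return value
			}
		}
		if value := annotations[name]; value != "" {
			return value
		}
	}
	return ""
}

func main() {
	annotations := map[string]string{
		"message":    "Disk usage is high",
		"message_ok": "Disk usage is back to normal",
	}
	names := []string{"message"}
	fmt.Println(pluginOutput(annotations, names, "ok", true))
	fmt.Println(pluginOutput(annotations, names, "critical", true))
}
```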

Integration with Icinga

Icinga host

You need to create an Icinga service host which Signalilo can use. Signalilo is designed to expect that it has full control over one service host in Icinga. Therefore, you should create a service host for each Signalilo instance which you're running.

Each service host should look as shown below. You can add additional configurations (such as host variables) as you like.

object Host "signalilo_cluster.example.com"  {
  display_name = "Signalilo signalilo_cluster.example.com"
  check_command = "dummy"
  enable_passive_checks = false
  enable_perfdata = false
}

Icinga service template

You need to create an Icinga service template which Signalilo can use to create its own services.

template Service "generic-service" {
}

Icinga API user

We recommend that you create an API user per Icinga service host. This naturally ensures that you create an API user per Signalilo instance, since you should have a service host per Signalilo instance. In that case, you can restrict the API user's permissions to only interact with the service host belonging to the Signalilo instance as shown below.

object ApiUser "signalilo_cluster.example.com"  {
  password = "verysecretpassword"
  permissions = [
  {
    permission = "objects/query/*"
    filter = {{ host.name == "signalilo_cluster.example.com" }}
  },
  {
    permission = "objects/create/service"
    filter = {{ host.name == "signalilo_cluster.example.com" }}
  },
  {
    permission = "objects/modify/service"
    filter = {{ host.name == "signalilo_cluster.example.com" }}
  },
  {
    permission = "objects/delete/service"
    filter = {{ host.name == "signalilo_cluster.example.com" }}
  },
  {
    permission = "actions/process-check-result"
    filter = {{ host.name == "signalilo_cluster.example.com" }}
  }, ]
}

Note that you don't have to use the same name for the API user as for its associated service host. However, you have to make sure that you compare host.name to the name of the service host for which the API user should have permissions.

Garbage Collection

Service objects in Icinga will get garbage collected (aka deleted) on a regular basis, following these rules:

  • Service object is in OK state
  • Last transition to OK state was more than "keep_for" ago
  • UUID of app matches "vars.bridge_uuid"

All state needed for doing garbage collection is stored in Icinga service variables.
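
The rules above can be read as a single predicate, sketched here in Go. The struct, its field names, and the reading that all three conditions must hold at once are assumptions of this sketch; Signalilo itself keeps the corresponding state in Icinga service variables.

```go
package main

import (
	"fmt"
	"time"
)

// service is a hypothetical, reduced view of an Icinga service object.
type service struct {
	State      int           // 0 means OK
	LastOK     time.Time     // time of the last transition to OK
	KeepFor    time.Duration // retention after the transition to OK
	BridgeUUID string        // corresponds to vars.bridge_uuid
}

// shouldCollect reports whether a service is eligible for garbage collection:
// it is OK, it belongs to this Signalilo instance, and it has been OK for
// longer than its retention period.
func shouldCollect(s service, uuid string, now time.Time) bool {
	return s.State == 0 &&
		s.BridgeUUID == uuid &&
		now.Sub(s.LastOK) > s.KeepFor
}

func main() {
	now := time.Now()
	s := service{
		State:      0,
		LastOK:     now.Add(-200 * time.Hour),
		KeepFor:    168 * time.Hour,
		BridgeUUID: "example-uuid",
	}
	fmt.Println(shouldCollect(s, "example-uuid", now))
}
```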

Signalilo Heartbeat

On startup, Signalilo checks if the matching heartbeat service is available in Icinga, otherwise it exits with a fatal error. During operation, Signalilo regularly posts its state to the heartbeat service. If no state update was provided, Icinga automatically marks the check as UNKNOWN.

You need to configure the following service in Icinga:

object Service "heartbeat" {
  check_command = "dummy"
  check_interval = 10s

  /* Set the state to CRITICAL (2) if freshness checks fail. */
  vars.dummy_state = 2

  /* Use a runtime function to retrieve the last check time and more details. */
  vars.dummy_text = {{
    var service = get_service(macro("$host.name$"), macro("$service.name$"))
    var lastCheck = DateTime(service.last_check).to_string()

    return "No check results received. Last result time: " + lastCheck
  }}

  /* This must match the name of the host object for the Signalilo instance */
  host_name = "signalilo_cluster.example.com"
}

Custom Variables

All labels and annotations will be mapped to custom variables. Keys of Labels will be prefixed with label_ and keys of annotations with annotation_.

If the key of an annotation or label starts with icinga_, it will also be added as a custom variable, but without any prefix. Since all labels and annotations are strings, type information needs to be provided so that the value can be converted accordingly. This is done by adding the type as part of the prefix (icinga_<type>_). Currently supported types are number and string.

Examples:

  • foo -> label_foo for a label, or annotation_foo for an annotation.
  • icinga_string_foo -> custom variable foo; the value is passed as is.
  • icinga_number_bar -> custom variable bar; the value is converted to an integer number.

In case there is a label and an annotation with the icinga_<type> prefix, the value of the annotation will take precedence in the resulting set of custom variables.
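
A minimal Go sketch of these rules is shown below. The function names are hypothetical. Two points are assumptions of the sketch: a well-formed icinga_<type>_ key yields only the unprefixed variable, and a key with an unknown type falls back to the regular label_ or annotation_ prefix.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// typedVar handles keys of the form icinga_<type>_<name>. It returns the
// variable name, the converted value, and whether the key was well-formed.
func typedVar(key, value string) (string, interface{}, bool) {
	if !strings.HasPrefix(key, "icinga_") {
		return "", nil, false
	}
	typ, name, found := strings.Cut(strings.TrimPrefix(key, "icinga_"), "_")
	if !found || name == "" {
		return "", nil, false
	}
	switch typ {
	case "string":
		return name, value, true
	case "number":
		n, err := strconv.Atoi(value)
		if err != nil {
			return "", nil, false
		}
		return name, n, true
	}
	return "", nil, false
}

// mapCustomVars builds the custom variables for a service. Labels are
// processed first so that annotations overwrite them on a name clash.
func mapCustomVars(labels, annotations map[string]string) map[string]interface{} {
	vars := map[string]interface{}{}
	add := func(prefix string, kv map[string]string) {
		for key, value := range kv {
			if name, converted, ok := typedVar(key, value); ok {
				vars[name] = converted
				continue
			}
			vars[prefix+"_"+key] = value
		}
	}
	add("label", labels)
	add("annotation", annotations)
	return vars
}

func main() {
	vars := mapCustomVars(
		map[string]string{"foo": "x", "icinga_number_bar": "42"},
		map[string]string{"icinga_string_team": "ops"},
	)
	fmt.Println(vars["label_foo"], vars["bar"], vars["team"])
}
```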

Heartbeat Services

Signalilo supports creating heartbeat services in Icinga. This can be used to map alerts like the DeadMansSwitch which comes with prometheus-operator and signals that the whole Prometheus stack is healthy.

In order for Signalilo to treat an alert as a heartbeat, the alert must have a label heartbeat. Signalilo will try to parse the value of that label as a Go duration.

If the value is parsed successfully, Signalilo will create an Icinga service check with active checks enabled and with the check interval set to the parsed duration plus ten percent. We add ten percent to the parsed duration to account for network latencies etc., which could otherwise lead to flapping heartbeat checks.
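
The interval calculation described above can be sketched with Go's time.ParseDuration. The function name is hypothetical; only the "parsed duration plus ten percent" rule comes from the text.

```go
package main

import (
	"fmt"
	"time"
)

// heartbeatInterval parses the value of the heartbeat label as a Go duration
// and adds ten percent as slack against network latency.
func heartbeatInterval(label string) (time.Duration, error) {
	d, err := time.ParseDuration(label)
	if err != nil {
		return 0, fmt.Errorf("heartbeat label %q is not a valid duration: %w", label, err)
	}
	return d + d/10, nil
}

func main() {
	interval, err := heartbeatInterval("1m")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(interval) // a 1m label yields a 66s check interval
}
```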

signalilo's People

Contributors

corvus-ch, dragoangel, dsteininger86, eyenx, frankfil, hairmare, mc-meta, mhutter, renovate-bot, renovate[bot], simu, srueg, timoses


signalilo's Issues

No spaces in validateServiceName allowed

Hello,

I was searching for a while, after I realized, that signalilo cannot handle alertnames with spaces correctly.
You can reproduce my problem if you fire an alert with an Alertname like "This is a test".

The regex https://github.com/vshn/signalilo/blob/master/webhook/icinga.go#L28 prevents that anything with a space will be sent to icinga2.
Is there a reason, that there is no string replace to convert spaces to underscores or just properly encode the spaces to the GET method for the Icinga2 API?

I would love to see this fixed, since many people are using spaces in their names.

Thanks and best regards,
Dennis

old critical alerts in icinga do not go away after upgrade of openshift

First of all, great product: signalilo.
I recently set this up for our OpenShift clusters.

We had the following scenario:
For our OpenShift Cluster A, we had bunch of critical alerts that showed up in Icinga.
Those alerts were not resolved (as in from OpenShift side).
We did an upgrade on our OpenShift Cluster, and after that re-added in the webhook config in alertmanager.
So from alertmanager perspective, it is now brand new. So old alerts in icinga were not resolved (they never got the resolved notification from alertmanager via signalilo).
Now in Icinga, we have this OpenShift Cluster set up as a Host "Test Host", and although new alerts are coming in and are resolved, the old alerts from previous version of OpenShift are still there.

I understand that there is a SIGNALILO_ICINGA_KEEP_FOR setting, but that is for OK and or resolved alerts.

I think that there should be a criteria such that if the alert is no longer firing from AlertManager, and if there are some lingering critical services in Icinga which did not receive any resolved status, then those should be garbage collected as well.

Documentation about custom variables is incorrect

Hello,

according to custom variables I have defined a label named:
icinga_emails_string.

But in the Icinga service there is no custom variable emails but label_icinga_emails_string.

In the Log of signalilo I've found the following error:

{"level":"debug","msg":"Processing firing alert: alertname=UserPersistentVolumeFillingUp, severity=critical, message=Das PersistentVolume definiert durch test-4-xx im Namespace monitor-demo hat nur 0.006104% frei.","time":"2021-03-19T16:29:38+01:00"}
{"level":"info","msg":"Failed to map Icinga variable 'icinga_emails_string': unknown type","time":"2021-03-19T16:29:38+01:00"}
{"level":"info","msg":"creating service: UserPersistentVolumeFillingUp_b06ac30ea6e1f65b\n","time":"2021-03-19T16:29:38+01:00"}
{"level":"debug","msg":"Executing ProcessCheckResult on icinga2 for UserPersistentVolumeFillingUp_b06ac30ea6e1f65b: exit status 2","time":"2021-03-19T16:29:38+01:00"}

I have checked the code.
The label should be: icinga_string_emails.

I have changed it and it worked.

So, the documentation should be changed.

Kind regards
Xavier

Dependency between heartbeat and services

I'm working on some big deploy and (sadly) we had some big infra problem with a storming because of up/down of many alarms.
Linking the created services to the omnipresent watchdog would solve the storming when shutting down signalilo (in my case after i lost it 😢).

Action Required: Fix Renovate Configuration

There is an error with this repository's Renovate configuration that needs to be fixed. As a precaution, Renovate will stop PRs until it is resolved.

Error type: undefined. Note: this is a nested preset so please contact the preset author if you are unable to fix it yourself.

Customvars from icinga service templates will be overwritten by the default vars

Hi and happy New Year!

I wanted to reference to the Issue #96 . I wanted to implement the feature, so that Icinga Service template names will be configurable. In my tests I found a problematic behaviour.

The problem here is that the "Default customvars" will be transferred as a dictionary vars = {}, so all other custom vars from the templates will be overwritten.
From the Icinga 2 Docs:

If attributes are of the Dictionary type, you can also use the indexer format. This might be necessary to only override specific custom variables and keep all other existing custom variables (e.g. from templates):

"attrs": { "vars.os": "Linux" }

In the following example I created a Service-Template example1, which sets a new custom variable vars.testcustomvar = "VAR_AUS_TEMPLATE"
Example Service:

object Service "PrometheusAlertmanagerJobMissing_f321b9ca0fe3b163" {
        import "generic-service"
        import "example1"
        
        check_command = "dummy"
        [...]
        vars = {
                annotation_description = "A Prometheus AlertManager job has disappeared\n  VALUE = 1\n  LABELS = map[job:alertmanager]"
                annotation_summary = "Prometheus AlertManager job missing (instance )"
                bridge_uuid = "Instanz1"
                keep_for = 604800000000000.000000
                label_alertname = "PrometheusAlertmanagerJobMissing"
                label_job = "alertmanager"
                label_monitor = "my-monitor"
                label_severity = "warning"
        }
        [...]
    }

As you can see, the template is imported but all configured custom variables are not present, because the custom vars will be transferred as a dictionary. So the custom variables need to be posted on one level, like so:

"attrs": {
              [...]
                "vars.annotation_description": "A Prometheus AlertManager job has disappeared\n  VALUE = 1\n  LABELS = map[job:alertmanager]",
                "vars.annotation_summary": "Prometheus AlertManager job missing (instance )",
                "vars.bridge_uuid": "Instanz1",
                "vars.keep_for": 604800000000000,
                "vars.label_alertname": "PrometheusAlertmanagerJobMissing",
                "vars.label_job": "alertmanager",
                "vars.label_monitor": "my-monitor",
                "vars.label_severity": "warning",
                "check_interval": 43200,
                "retry_interval": 43200,
                "max_check_attempts": 1,
                "templates": [
                        "generic-service",
                        "example1"
                ]
        }
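
For illustration, the flattening into indexer-style keys could look like the following Go sketch. The helper name is hypothetical and this is not the code from the PR mentioned below; it only shows how a vars dictionary can be turned into "vars.<key>" attributes so that custom variables coming from templates are not replaced as a whole.

```go
package main

import "fmt"

// flattenVars copies attrs and adds every custom variable as a separate
// "vars.<key>" attribute instead of one nested "vars" dictionary.
// Hypothetical helper for illustration only.
func flattenVars(attrs, vars map[string]interface{}) map[string]interface{} {
	out := make(map[string]interface{}, len(attrs)+len(vars))
	for key, value := range attrs {
		out[key] = value
	}
	for key, value := range vars {
		out["vars."+key] = value
	}
	return out
}

func main() {
	attrs := map[string]interface{}{"check_interval": 43200, "max_check_attempts": 1}
	vars := map[string]interface{}{"bridge_uuid": "Instanz1", "label_severity": "warning"}
	for key, value := range flattenVars(attrs, vars) {
		fmt.Printf("%s = %v\n", key, value)
	}
}
```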

See my PR to fix this Issue.

Best Regards
Philipp

CreateContainerError on openshift 4.5.35 | chdir to cwd /home/nonroot set in config.json failed: permission denied

Dear Signalilo Community,

We recently upgraded from Openshift 4.5.30 to 4.5.35 version and we have problems since then to get signalilo up and running. Also tested with the latest release signalilo:latest (b19ed615764c), with no success.

In the events, we see the following error
Error: container create failed: time="2021-03-23T11:28:38Z" level=error msg="container_linux.go:348: starting container process caused \"chdir to cwd (\\\"/home/nonroot\\\") set in config.json failed: permission denied\"" container_linux.go:348: starting container process caused "chdir to cwd (\"/home/nonroot\") set in config.json failed: permission denied"
Could it be that signalilo needs more configuration to make it work OR the issue could be on the OpenShift side ?

Thank you in advance.

Handling of automatic heartbeat service is not working

Hello,

we have created a heartbeat service WatchdogSHG, by setting a label 'heartbeat' as documented in heartbeat Services.

But in the log we can see that signalilo didn't find the heartbeat service:

{"level":"debug","msg":"Processing firing alert: alertname=WatchdogSHG, severity=critical, message=Diese Message wird erst angezeigt, wenn Signalilo nichts mehr schickt.\n","time":"2021-02-11T12:02:05+01:00"}
{"level":"info","msg":"Creating alert as heartbeat with check interval 1m0s","time":"2021-02-11T12:02:05+01:00"}
{"level":"info","msg":"creating service: WatchdogSHG_35df4c5608e7fd5e\n","time":"2021-02-11T12:02:05+01:00"}
{"level":"debug","msg":"Executing ProcessCheckResult on icinga2 for WatchdogSHG_35df4c5608e7fd5e: exit status 0","time":"2021-02-11T12:02:05+01:00"}
{"level":"debug","msg":"Processing resolved alert: alertname=WatchdogSHG, severity=none, message=This is an alert meant to ensure that the entire alerting pipeline is functional.\nThis alert is always firing, therefore it should always be firing in Alertmanager\nand always fire against a receiver. There are integrations with various notification\nmechanisms that send a notification when this alert is not firing. For example the\n\"DeadMansSnitch\" integration in PagerDuty.\n","time":"2021-02-11T12:02:05+01:00"}
{"level":"info","msg":"Not processing resolved heartbeat for WatchdogSHG_c8b066e8d93e0f22","time":"2021-02-11T12:02:05+01:00"}
{"level":"error","msg":"heartbeat: unable to get heartbeat service: Did not get 200 OK","time":"2021-02-11T12:02:58+01:00"}
{"level":"error","msg":"sending heartbeat: Did not get 200 OK","time":"2021-02-11T12:02:58+01:00"}
{"level":"error","msg":"heartbeat: unable to get heartbeat service: Did not get 200 OK","time":"2021-02-11T12:03:58+01:00"}
{"level":"error","msg":"sending heartbeat: Did not get 200 OK","time":"2021-02-11T12:03:58+01:00"}

If I have really understood, the heartbeat service should be named heartbeat, see serve.go#L76.

My Heartbeat Alert:

  - name: general.rules
    rules:
    - alert: WatchdogSHG
      annotations:
        message: |
          Diese Message wird erst angezeigt, wenn Signalilo nichts mehr schickt.
      expr: vector(1)
      labels:
        severity: critical
        heartbeat: 1m

Maybe I have missed something (again ;-) ).

Kind regards
Xavier

Ability to specify zone for service(s)

I have a setup with one master node that manages the config and exports metrics to graphite/redis/logstash/db. This is zone "master"

i have a slave zone with two HA nodes where the actual checking and alerting takes place.

So naturally i setup signalilo to talk via API to the master node. These are automatically assigned zone "master" and no alerting takes place. Specifying zone as icinga variable makes it passed as vars.zone to icinga2. This doesn't work either.

Is there any way to specify zone for my checks?

Patching the generated files by hand with the correct zone resolves the issue.

Rename go.mod module and update internal package imports

https://github.com/vshn/signalilo/blob/master/webhook/icinga.go#L20 -> is that an internal repo

Originally posted by @roidelapluie in prometheus/docs#1589 (comment)

$ go get github.com/vshn/signalilo
# cd .; git clone -- https://git.vshn.net/appuio/signalilo.git /home/roidelapluie/go/src/git.vshn.net/appuio/signalilo
Cloning into '/home/roidelapluie/go/src/git.vshn.net/appuio/signalilo'...
fatal: could not read Username for 'https://git.vshn.net': terminal prompts disabled
package git.vshn.net/appuio/signalilo/config: exit status 128
package git.vshn.net/appuio/signalilo/gc: cannot find package "git.vshn.net/appuio/signalilo/gc" in any of:
        /home/roidelapluie/godist/go/src/git.vshn.net/appuio/signalilo/gc (from $GOROOT)
        /home/roidelapluie/go/src/git.vshn.net/appuio/signalilo/gc (from $GOPATH)
package git.vshn.net/appuio/signalilo/webhook: cannot find package "git.vshn.net/appuio/signalilo/webhook" in any of:
        /home/roidelapluie/godist/go/src/git.vshn.net/appuio/signalilo/webhook (from $GOROOT)
        /home/roidelapluie/go/src/git.vshn.net/appuio/signalilo/webhook (from $GOPATH)

Originally posted by @roidelapluie in prometheus/docs#1589 (comment)

Is it possible to map some other field to plugin_output?

I am using Grafana with alertmanager to forward alerts to icinga2 (because we have complex alerting logic, we don't want to do implement it twice).

The issue is that message from grafana goes into vars.label_message and i don't really receive anything on plugin_output. I could technically work that around in icinga2, but perhaps there's a way to do that in signalilo itself?

Or maybe there is some alertmanager workaround.

[misconfiguration?] severity labels do not seem to work correctly

from signalilo log:

signalilo-7b4bb75878-vm84w signalilo {"level":"debug","msg":"Processing firing alert: alertname=apiservices_down, severity=CRITICAL, message=Apiservice v1beta1.webhook.cert-manager.io not operational : MissingEndpoints","time":"2021-03-08T13:54:03Z"}
signalilo-7b4bb75878-vm84w signalilo {"level":"info","msg":"updating service: apiservices_down_37c3e7be25078ec2\n","time":"2021-03-08T13:54:03Z"}
signalilo-7b4bb75878-vm84w signalilo {"level":"debug","msg":"Executing ProcessCheckResult on icinga2 for apiservices_down_37c3e7be25078ec2: exit status 3","time":"2021-03-08T13:54:03Z"}

i am always getting UNKNOWN, and i don't understand why.

Reporting a vulnerability

Hello!

I hope you are doing well!

We are a security research team. Our tool automatically detected a vulnerability in this repository. We want to disclose it responsibly. GitHub has a feature called Private vulnerability reporting, which enables security research to privately disclose a vulnerability. Unfortunately, it is not enabled for this repository.

Can you enable it, so that we can report it?

Thanks in advance!

PS: you can read about how to enable private vulnerability reporting here: https://docs.github.com/en/code-security/security-advisories/repository-security-advisories/configuring-private-vulnerability-reporting-for-a-repository

Icinga Service template name not configurable

Hi,

in our Icinga environment, the template name for generic services is not "generic-service", but "Generic Service" and we are not able to change this. Thus, signalilo fails to create services on Alerts with the error

500 Internal Server Error - Object could not be created. Error: Import references unknown template: 'generic-service'

I found out that the template name is hardcoded in your go-icinga2-client:
https://github.com/vshn/go-icinga2-client/blob/b157d0f48abed9f12ceb01a84f0db8021707010a/icinga2/service.go#L71

func (s *WebClient) CreateService(service Service) error {
	serviceCreate := ServiceCreate{Templates: []string{"generic-service"}, Attrs: service}
	// Strip "name" from create payload
	serviceCreate.Attrs.Name = ""
	err := s.CreateObject("/services/"+service.FullName(), serviceCreate)
	return err
}

Would it be an option for you to extend the go-icinga2-client with the option to configure the template name, and make this parameter configurable in signalilo?

Secrets can be passed on the command line

Currently it's possible to pass arguments containing secrets on the command line.

While it's possible to avoid doing so by appropriately setting environment variables for those values, nothing is stopping users from passing secrets in plain sight.

Determine if and how we want to disable passing of secrets on the command line.

Wildcard Certificates not accepted

The most recent version of signalilo (0.8.0) doesn't support wildcard certificates anymore.
I can confirm that it was working with signalilo 0.6.0
{"level":"error","msg":"Unable to send initial heartbeat: Get \"https://monitor.redacted.com/v1/objects/hosts/REDACTED\": invalid certificate name \"*.redacted.com\", expected \"monitor.redacted.com\"","time":"2021-01-18T10:50:19Z"}

Signalilo aborts during garbage collection

Hello,

in a normal environment (Openshift Cluster connected with Icinga Test), we can see always blocks beginning with Running garbage collection, finding some services, e.g. like Watchdog... and ending with Garbage collection complete, like:

{"level":"info","msg":"[Collect] Running garbage collection at ts=2021-02-25 10:55:47.378779355 +0100 CET m=+900.091794889","time":"2021-02-25T10:55:47+01:00"}
...
{"level":"info","msg":"[Collect] Found service Watchdog_af4ff9b2e11f2eb8 with our bridge UUID","time":"2021-02-25T10:55:47+01:00"}
{"level":"debug","msg":"[Collect] Skipping service Watchdog_af4ff9b2e11f2eb8: state=3, downtimed=false","time":"2021-02-25T10:55:47+01:00"}
...
{"level":"info","msg":"[Collect] Garbage collection completed in 406.282653ms","time":"2021-02-25T10:55:47+01:00"}

In one environment (Openshift Cluster C2, connected with Prod Icinga) we see only the first message but no findings and the pod is restarted.

{"level":"info","msg":"[Collect] Running garbage collection at ts=2021-03-05 10:31:28.604140194 +0100 CET m=+900.105834291","time":"2021-03-05T10:31:28+01:00"}
{"level":"info","msg":"Sending heartbeat: 'OK: 2021-03-05T10:31:28+01:00'","time":"2021-03-05T10:31:28+01:00"}

We see that the Icinga server consumes a lot of memory (1GB) at that time, and didn't free it. See file Icinga-memory.jpg

After a few hours we must restart the icinga server, to free the memory.

On the Icinga server we didn't find anything in the logs.

I've set the env SIGNALILO_ICINGA_DEBUG, but I cannot see any problem.

Can you help me to analyse the problem?

Kind regards
Xavier

Make display_name more configurable

I use grafana -> alertmanager -> signalilo -> icinga2. This way my users don't have to meddle with prometheus/alertmanager yaml files for their alerting needs.

The issue:

It seems display_name gets set from alertname.

However, Icinga won't accept certain characters in that field. I think even spaces are invalid. So I have to use simple definitions like "ALERT_01_for_some_service", which seem to go into both the service definition (the object name, which has restrictions) and the display name (which doesn't have many restrictions).

I thought I could pass a specific field to Signalilo to override this, but I think I cannot.

So I tried setting icinga_string_display_name with a custom display_name and providing a template:

template Service "fix-display-name" {
        if (vars.display_name) {
                display_name = vars.display_name
        }
}

What happens is - signalilo creates an object like this :

object Service "CMB_PVC_K8S_6f0bc2786b5c6528" {
        import "generic-service"

(...)
        display_name = "CMB_PVC_K8S"
(.....)
        vars["display_name"] = "Capacity of PVC spool-cc-cmb-test-aggregator-0"

And my custom display_name does not apply.

The problem is that template inheritance happens BEFORE all variables are parsed.

The fix is to import the template at the very end of the service definition. I have no way to configure that in Signalilo, but such a feature would be nice to have.

[Feature Request] Provide pre-built binaries

First of all thank you very much for bridging the gap between Prometheus' Alertmanager and Icinga!

I'd like to suggest building and providing binaries with every release, as you already do with the containers and the source code (https://github.com/vshn/signalilo/releases). GitHub will happily host the assets; they just need to be built :-)

This is quite common among exporters or other Prometheus components:

and a lot of tooling, e.g. Ansible or Puppet modules, exists to install and manage exporters in this way.
There may also be reasons for people not to use containers and to deploy Signalilo natively on their systems.

[Feature Request] use SIGNALILO_ICINGA_HOSTNAME from label value

Hey, first, thanks for this great piece of software!

We would like to use one Signalilo deployment for several hosts in Icinga2. Currently we can only deploy one Signalilo per Icinga2 host, which is not feasible in our case since we have roughly 3k hosts in Icinga2.

My idea would be to take the icinga2 host from a label value and configure the api-user filters accordingly.

Is this a feature you would like to implement anyway? I would like to do it myself and come up with a PR, but I have no experience with Go programming yet.

[Feature Request] Support for Redundant Master API's

Hello,

I would love to see Signalilo support connecting to a multi-master Icinga2 setup.
E.g. I provide two API addresses and, in case one of them is not reachable, Signalilo tries to use the second master to submit whatever it has to submit.

Regards,
Dennis

Proposal to change the owner of go-icinga2-client

Hi,

This doesn't technically belong here, but I don't see any other way since issues on https://github.com/vshn/go-icinga2-client are disabled.

https://github.com/vshn/go-icinga2-client is a fork of https://github.com/simu/go-icinga2-client (archived), which is a fork of our original https://github.com/plusserver/go-icinga2-client, which hasn't seen any development in a long time.

Do you want to make https://github.com/vshn/go-icinga2-client the "official/upstream" repository, since every other fork/origin is no longer in "active" development? To get rid of the fork relationship, I would suggest something like this:

1. git clone --bare https://github.com/vshn/go-icinga2-client
2. Delete old repo vshn/go-icinga2-client
3. Create new repo vshn/go-icinga2-client
4. cd go-icinga2-client.git && git push --mirror https://github.com/vshn/go-icinga2-client

I would probably then archive https://github.com/plusserver/go-icinga2-client

[feature] No ability to create Services in existing ServiceHosts

Problem

There is no way to map alerts in Alertmanager to an existing ServiceHost. I would like to create Services on specific real Icinga2 hosts instead of on the dummy Signalilo host, using a mapping based on the instance label value of an alert.

Proposal

Create a config option that lets a user name the alert label whose value is used as the ServiceHost.
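
A minimal sketch of what such a lookup might look like, assuming a hypothetical option that names the label to read and falling back to the statically configured host when the label is missing. `hostForAlert` and the host names are illustrative only and not part of Signalilo.

```go
package main

import "fmt"

// hostForAlert returns the Icinga2 host an alert should be attached to.
// It uses the value of the label named labelName when present and non-empty,
// and otherwise falls back to the statically configured host.
func hostForAlert(labels map[string]string, labelName, fallback string) string {
	if h, ok := labels[labelName]; ok && h != "" {
		return h
	}
	return fallback
}

func main() {
	labels := map[string]string{
		"alertname": "KubePodNotReady",
		"instance":  "node01.example.com", // hypothetical label value
	}
	fmt.Println(hostForAlert(labels, "instance", "signalilo-servicehost"))
	fmt.Println(hostForAlert(map[string]string{}, "instance", "signalilo-servicehost"))
}
```

Note that instance labels often carry a port suffix, so a real mapping would likely need some normalisation before the value can be used as an Icinga2 host name.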

Signalilo Heartbeat Implementation Question

This is more of a question for discussion, but from what I see:

  1. README.md states that the alert will be in the UNKNOWN state when the heartbeat is triggered, but it actually ends up in the CRITICAL state. I think the heartbeat service example was changed and someone forgot to update the description.
  2. README.md states: On startup, Signalilo checks if the matching heartbeat service is available in Icinga, otherwise it exits with a fatal error. I understand this to mean that Signalilo will die if the heartbeat service does not exist (404) or if there is any other failure such as 4xx/5xx, but I don't see this behavior at the moment. Maybe it was broken at some point?
  3. In my view, Signalilo should report in some way when it has problems writing alerts received from Alertmanager to Icinga. At the moment it can break silently, and if nobody checks the Signalilo logs, nobody will know about it. The cause could be Icinga downtime, or somebody breaking something on the Icinga side, even simply dropping the host or the API user. There are a couple of different options which could resolve this:
    • Update the heartbeat service status if we get errors from the Icinga API in some places. I don't like this approach, as it would be confusing.
    • Work like a proxy between Alertmanager and Icinga: do not reply to Alertmanager with a 200 status code until we get such a status code from Icinga. This way we will know on the Alertmanager side that the Icinga integration looks dead at the moment, and a "fallback" route could be used to notify about AlertmanagerFailedToSendAlerts|AlertmanagerClusterFailedToSendAlerts. The problem is that it will create delays.
    • Last option, which I think is preferable: have a separate mandatory service IcingaApiErorrs for such error handling, which must be created in the same way as the heartbeat service and which shows whether there were any errors in the last minute. A small downside: with multiple Signalilo replicas it could start flapping. If even updates of the IcingaApiErorrs service fail, Signalilo can immediately reply to Alertmanager with a failure. After 1 minute there will be no errors in the Icinga API, as no requests were made, and we will try to update IcingaApiErorrs again: if that fails, wait 1 minute again and reply to Alertmanager with 500; if it succeeds, start accepting alerts from Alertmanager.

What do you think?

Dependency Dashboard

This issue lists Renovate updates and detected dependencies. Read the Dependency Dashboard docs to learn more.

Open

These updates have all been created already. Click a checkbox below to force a retry/rebase of any.

Detected dependencies

dockerfile
Dockerfile
  • golang 1.20
github-actions
.github/workflows/push.yml
  • actions/checkout v3
  • mikepenz/release-changelog-builder-action v4
  • ncipollo/release-action v1
.github/workflows/test.yml
  • actions/checkout v3
  • actions/setup-go v4
gomod
go.mod
  • go 1.16
  • github.com/alecthomas/kingpin/v2 v2.3.2
  • github.com/bketelsen/logr v0.0.0-20170116012416-f3d070bdd1c5@f3d070bdd1c5
  • github.com/corvus-ch/logr v0.0.0-20210413064445-af2a51d190ad@af2a51d190ad
  • github.com/prometheus/alertmanager v0.25.0
  • github.com/sirupsen/logrus v1.9.3
  • github.com/stretchr/testify v1.8.4
  • github.com/vshn/go-icinga2-client v0.0.17


description as Plugin Output?

Hi

We are using Signalilo together with the default alert rules from kube-prometheus. All rules there are set to use the annotation description rather than message.

This means that in our Icinga we see no plugin output when an alert from kube-prometheus-stack gets created.

According to the Signalilo README only the message annotation is used for creating the plugin_output inside icinga:

Required annotations:

    description: mapped to notes.
    message: mapped to plugin_output.

This is a problem for us, as our Icinga alerting system is going to send the plugin_output as an SMS and we don't parse the notes for alerting.

Could this be made configurable, so that we can create the plugin_output from annotations['description'] rather than annotations['message']?
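
One way the requested behaviour could look, as a sketch only: the annotation keys to try are passed in order of preference and the first non-empty one wins. `pluginOutput` is a hypothetical helper, not an existing Signalilo function or option.

```go
package main

import "fmt"

// pluginOutput returns the value of the first annotation in keys that is
// present and non-empty, or the empty string if none of them is set.
func pluginOutput(annotations map[string]string, keys ...string) string {
	for _, k := range keys {
		if v := annotations[k]; v != "" {
			return v
		}
	}
	return ""
}

func main() {
	// A kube-prometheus style alert that only carries a description annotation.
	annotations := map[string]string{
		"description": "Pod has been in a non-ready state for longer than 15 minutes.",
	}
	fmt.Println(pluginOutput(annotations, "message", "description"))
}
```

Keeping message first in the list would preserve the documented behaviour for alerts that already set it.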

Thanks.

Consistent ProcessCheckResult failures with icinga 2.8.4

Hello,

we have experienced consistent failures in signalilo during ProcessCheckResult such as the one below:

{"level":"error","msg":"Error in ProcessCheckResult for Watchdog_53f82e5f93b8c355: Post \"https://icinga-stg.xxxx.io:5665/v1/actions/process-check-result\": EOF","time":"2020-06-16T21:06:45Z"}

We've been able to replicate the exact failure by running some transactions with the Go icinga2 client. Similar failures do not happen when simulating the same requests via curl or an Icinga Python client.

The failure seems to be related to the Go HTTP client not playing well with the icinga2 API module in our environment (icinga 2.8.4 with some customizations on CentOS 7.5) and is likely related to timeout issues.

We have been able to work around this problem by disabling HTTP keepalive.

We are going to propose a couple of pull requests so that http keepalive could be tuned in signalilo and pushed down to Go icinga2 client.

We are available in case you are interested in looking further into this problem.

HTH,

-m

Remove hard coded state of UNKNOWN for Severity Levels other than normal, warning, and critical

Signalilo is currently hard coded to set the matching Icinga Service to UNKNOWN if a Firing Alert with a Severity Level other than normal, warning, or critical is received.

It would be helpful to have an option to either globally state that a Severity Level other than these should be treated as something else (like WARNING instead of UNKNOWN) or perhaps even allow for supplying a custom mapping of Severity Values against Icinga Service State values.

One example for this would be in the kube-prometheus-stack CPU Throttling High alert which has a Severity of info which becomes UNKNOWN when updated in Icinga.

I'm happy to make the change and submit a PR but I'm just wondering if a simple global option would suffice?

Like --icinga_unknown_severity_service_state=n where n is 0-3.

This would suit my particular requirements but maybe others would prefer the ability to completely customise the mapping of Severity Levels to Icinga Service States?
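
Both variants can be covered by one small lookup, sketched below under a few assumptions: the state numbers are Icinga's service states (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN), the mapping of normal to OK is assumed rather than taken from the Signalilo source, and `stateForSeverity` is a hypothetical helper.

```go
package main

import "fmt"

// Icinga2 service states.
const (
	StateOK       = 0
	StateWarning  = 1
	StateCritical = 2
	StateUnknown  = 3
)

// stateForSeverity looks the severity up in mapping and returns fallback
// for any severity that has no entry.
func stateForSeverity(mapping map[string]int, severity string, fallback int) int {
	if s, ok := mapping[severity]; ok {
		return s
	}
	return fallback
}

func main() {
	mapping := map[string]int{
		"normal":   StateOK, // assumption: normal corresponds to OK
		"warning":  StateWarning,
		"critical": StateCritical,
	}
	// Current behaviour as described: unmapped severities become UNKNOWN.
	fmt.Println(stateForSeverity(mapping, "info", StateUnknown))
	// With a global option set to 1, the same alert would become WARNING.
	fmt.Println(stateForSeverity(mapping, "info", StateWarning))
}
```

A global option only changes the fallback argument, while a fully custom mapping would replace the map itself, so the simple option does not rule out the more flexible one later.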

Visibility of alerts with same servicename in Nagstamon

Signalilo currently creates unique services with IDs of the form ${alertname}_${fingerprint} and gives them the display name ${alertname} for nicer output in the web UI. However, when Icinga is used as a server in Nagstamon, Nagstamon displays only one alert out of a group of alerts with the same servicename.

It would be cool if Signalilo could support an optional workaround for it:

  • set display service name as ${alertname}_${fingerprint}
  • set servicename as-is

This logic could be controlled by a command flag or environment variable, while by default it will work as-is.

Additionally, how about supporting a nicer servicename of the form ${alertname}_${id}, where ${id} is an incremental integer? That would be cool, but I'm not sure it's possible or worth the effort, since it means checking for any existing ${alertname}_${id} before creating an alert, and I don't think Icinga can return a service by its display name rather than by its ID. Maybe some kind of search request with a filter on ${alertname}_ is possible?

Thank you in advance!

Env SIGNALILO_ICINGA_KEEP_FOR has no effect.

Hello,

thank you for this very good software.

We are using Signalilo Version v0.8.0 and want to remove 'normal' services after one day and not after 7 days.
The env variable SIGNALILO_ICINGA_KEEP_FOR is set to "24h".
But the services are not deleted.

In the log we can see that the variable has been read (see the third message), but in the garbage collection keep_for still has the default value of 168h.

Signalilo v0.8.0
Build time: 2020-11-11T10:02:48+00:00

{"level":"info","msg":"Configuring logger with LogLevel=2","time":"2021-01-15T17:05:49+01:00"}
{"level":"info","msg":"Signalilo UUID: signalilo-s01","time":"2021-01-15T17:05:49+01:00"}
{"level":"info","msg":"Keep for: 24h0m0s","time":"2021-01-15T17:05:49+01:00"}
{"level":"info","msg":"Starting heartbeat: interval 1m0s","time":"2021-01-15T17:05:49+01:00"}
[...]
{"level":"info","msg":"[Collect] Found service KubePodNotReady_77d3b5309c77cae4 with our bridge UUID","time":"2021-01-15T17:20:50+01:00"}
{"level":"debug","msg":"[Collect] Skipping service KubePodNotReady_77d3b5309c77cae4: keep_for = 168h0m0s; age = 160h37m39.407704s","time":"2021-01-15T17:20:50+01:00"}
[...]

Kind regards
Xavier
