mrwinston / configuration-anomaly-detection

This project forked from openshift/configuration-anomaly-detection

Configuration anomaly detection for OSD clusters

License: Apache License 2.0

Shell 20.45% Go 75.76% Makefile 3.26% Dockerfile 0.53%


Configuration Anomaly Detection

About

Configuration Anomaly Detection (CAD) is responsible for reducing manual SRE effort by pre-investigating alerts, detecting cluster anomalies and sending relevant communications to the cluster owner.

Contributing

Adding a new investigation

CAD investigations are triggered by PagerDuty webhooks. Currently, CAD supports the following two formats of webhooks:

  • WebhookV3
  • EventOrchestrationWebhook

CAD identifies the required investigation from the incident and its payload. Because PagerDuty webhooks can only be scoped per service, not per alert, CAD itself filters out the alerts it should investigate. For more information, please refer to https://support.pagerduty.com/docs/webhooks.

To add a new alert investigation:

  • add a mapping for the alert to the getInvestigation function in investigate.go and write the corresponding CAD investigation (e.g. Investigate() in chgm.go).
  • if the alert is not yet routed to CAD, add a webhook to the service your alert fires on. For production, the service should also have an escalation policy that escalates to SRE on CAD automation timeout.
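
As a rough illustration of the first step, the sketch below maps an alert title to an investigation function. It is not CAD's actual code: the `investigation` type, the `registry` map, the simplified `getInvestigation` signature, the substring matching and the stub `investigateCHGM` are all hypothetical and chosen only for this example. Only the alert name ClusterHasGoneMissing comes from this README.

```go
package main

import (
	"fmt"
	"strings"
)

// investigation is a hypothetical function type; CAD's real signature may differ.
type investigation func(clusterID string) error

// investigateCHGM is a stub standing in for an investigation such as
// Investigate() in chgm.go.
func investigateCHGM(clusterID string) error {
	fmt.Println("investigating ClusterHasGoneMissing for", clusterID)
	return nil
}

// registry maps a substring of the alert title to its investigation.
// The matching rule (substring of the title) is an assumption.
var registry = map[string]investigation{
	"ClusterHasGoneMissing": investigateCHGM,
}

// getInvestigation returns the investigation whose key occurs in the alert
// title, or false if no investigation is registered for the alert.
func getInvestigation(alertTitle string) (investigation, bool) {
	for name, inv := range registry {
		if strings.Contains(alertTitle, name) {
			return inv, true
		}
	}
	return nil, false
}

func main() {
	if inv, ok := getInvestigation("ClusterHasGoneMissing CRITICAL (1)"); ok {
		if err := inv("example-cluster-id"); err != nil {
			fmt.Println("investigation failed:", err)
		}
	}
	if _, ok := getInvestigation("SomeUnrelatedAlert"); !ok {
		fmt.Println("no investigation registered for this alert")
	}
}
```

With a layout like this, adding an investigation means writing one function and adding one entry to the map.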

Testing locally

Pre-requirements

  • an existing cluster
  • an existing PagerDuty incident for the cluster and alert type that is being tested

To quickly create an incident for a cluster_id, you can run ./test/generate_incident.sh <alertname> <clusterid>. Example usage: ./test/generate_incident.sh ClusterHasGoneMissing 2b94brrrrrrrrrrrrrrrrrrhkaj

Running cadctl for an incident ID

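  1. Export the required ENV variables (see the "Required ENV variables" section).
  2. Set INCIDENT_ID to the ID of your incident and create a payload file containing it. printf is used here because single quotes around the JSON would prevent the shell from expanding ${INCIDENT_ID}:
export INCIDENT_ID=
printf '{"event": {"data":{"id": "%s"}}}\n' "${INCIDENT_ID}" > ./payload
  3. Run cadctl using the payload file:
./cadctl/cadctl investigate --payload-path payload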
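
To make the payload shape from step 2 concrete, here is a minimal sketch of reading the incident ID back out of such a file in Go. The `payload` struct and the `incidentID` helper are hypothetical and cover only the `event.data.id` field shown above. Real PagerDuty webhook payloads carry many more fields, and CAD's own parsing may differ.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
	"strings"
)

// payload mirrors only the fields used in the example payload file.
type payload struct {
	Event struct {
		Data struct {
			ID string `json:"id"`
		} `json:"data"`
	} `json:"event"`
}

// incidentID extracts event.data.id from raw JSON. It rejects an empty ID and,
// as a sanity check, an ID starting with "$", which usually means a shell
// variable was written literally instead of being expanded (this assumes real
// incident IDs never start with "$").
func incidentID(raw []byte) (string, error) {
	var p payload
	if err := json.Unmarshal(raw, &p); err != nil {
		return "", fmt.Errorf("invalid payload JSON: %w", err)
	}
	id := p.Event.Data.ID
	if id == "" {
		return "", errors.New("payload has no event.data.id")
	}
	if strings.HasPrefix(id, "$") {
		return "", fmt.Errorf("incident ID %q looks like an unexpanded shell variable", id)
	}
	return id, nil
}

func main() {
	// The ID below is a made-up placeholder, not a real incident.
	id, err := incidentID([]byte(`{"event": {"data":{"id": "EXAMPLEID"}}}`))
	fmt.Println(id, err)
}
```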

Documentation

CAD CLI

  • cadctl -- Performs investigation workflow.

Investigations

Every alert managed by CAD corresponds to an investigation, representing the executed code associated with the alert.

Investigation-specific documentation can be found in the corresponding investigation folder, e.g. for ClusterHasGoneMissing.

Integrations

  • AWS -- Logging into the cluster, retrieving instance info and AWS CloudTrail events.
  • PagerDuty -- Retrieving alert info, escalating or silencing incidents, and adding notes.
  • OCM -- Retrieving cluster info, sending service logs, and managing (post, delete) limited support reasons.
  • osd-network-verifier -- Tool to verify the pre-configured networking components for ROSA and OSD CCS clusters.

Overview

  • CAD is a command-line tool that is run in Tekton pipelines.
  • The Tekton service is running on an app-sre cluster.
  • CAD is triggered by PagerDuty webhooks configured on selected services, meaning that all alerts in that service trigger a CAD pipeline.
  • CAD uses the data received via the webhook to determine which investigation to start.

[Figure: CAD Overview]

Templates

  • Update-Template -- Updating configuration-anomaly-detection-template.Template.yaml.
  • OpenShift -- Used by app-interface to deploy the CAD resources on a target cluster.

Dashboards

Grafana dashboard configmaps are stored in the Dashboards directory. See app-interface for further documentation on dashboards.

Deployment

  • Tekton -- Installation/configuration of Tekton and triggering pipeline runs.
  • Skip Webhooks -- Skipping the eventlistener and creating the pipelinerun directly.
  • Namespace -- Allowing the code to ignore the namespace.

Boilerplate

PipelinePruner

Required ENV variables

  • CAD_OCM_CLIENT_ID: refers to the OCM client ID used by CAD to initialize the OCM client
  • CAD_OCM_CLIENT_SECRET: refers to the OCM client secret used by CAD to initialize the OCM client
  • CAD_OCM_URL: refers to the OCM URL used by CAD to initialize the OCM client
  • AWS_ACCESS_KEY_ID: refers to the access key id of the base AWS account used by CAD
  • AWS_SECRET_ACCESS_KEY: refers to the secret access key of the base AWS account used by CAD
  • CAD_AWS_CSS_JUMPROLE: refers to the arn of the RH-SRE-CCS-Access jumprole
  • CAD_AWS_SUPPORT_JUMPROLE: refers to the arn of the RH-Technical-Support-Access jumprole
  • CAD_ESCALATION_POLICY: refers to the escalation policy CAD should use to escalate the incident
  • CAD_PD_EMAIL: refers to the email for a login via mail/pw credentials
  • CAD_PD_PW: refers to the password for a login via mail/pw credentials
  • CAD_PD_TOKEN: refers to the generated private access token for token-based authentication
  • CAD_PD_USERNAME: refers to the username of CAD on PagerDuty
  • CAD_SILENT_POLICY: refers to the silent policy CAD should use if the incident should be silenced
  • PD_SIGNATURE: refers to the PagerDuty webhook signature (HMAC+SHA256)
  • X_SECRET_TOKEN: refers to our custom Secret Token for authenticating against our pipeline
  • CAD_PROMETHEUS_PUSHGATEWAY: refers to the URL CAD will push metrics to
  • BACKPLANE_URL: refers to the backplane url to use
  • BACKPLANE_INITIAL_ARN: refers to the initial ARN used for the isolated backplane jumprole flow
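
PD_SIGNATURE above is described as an HMAC+SHA256 signature. The following sketch shows how a signature of that kind can be checked with Go's standard library. It assumes the signature is the hex-encoded HMAC-SHA256 of the raw request body under a shared secret. The header PagerDuty uses to transport it, and any version prefix on the value, are not covered in this README and should be checked against the PagerDuty documentation. `sign` and `verifySignature` are hypothetical helper names, not CAD functions.

```go
package main

import (
	"crypto/hmac"
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// sign returns the hex-encoded HMAC-SHA256 of body under secret.
func sign(secret, body []byte) string {
	mac := hmac.New(sha256.New, secret)
	mac.Write(body)
	return hex.EncodeToString(mac.Sum(nil))
}

// verifySignature reports whether gotHex is a valid signature for body.
// hmac.Equal compares in constant time to avoid leaking timing information.
func verifySignature(secret, body []byte, gotHex string) bool {
	got, err := hex.DecodeString(gotHex)
	if err != nil {
		return false // not valid hex, so it cannot be a valid signature
	}
	mac := hmac.New(sha256.New, secret)
	mac.Write(body)
	return hmac.Equal(got, mac.Sum(nil))
}

func main() {
	secret := []byte("example-secret") // placeholder, not a real credential
	body := []byte(`{"event": {"data":{"id": "example"}}}`)
	sig := sign(secret, body)

	fmt.Println(verifySignature(secret, body, sig))               // matching body
	fmt.Println(verifySignature(secret, []byte("tampered"), sig)) // modified body
}
```

The first call prints true and the second prints false, since any change to the body changes the digest.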

Optional ENV variables

  • BACKPLANE_PROXY: refers to the proxy CAD uses for the isolated backplane access flow.

Note: BACKPLANE_PROXY is required for local development, as the backplane API is only accessible through the proxy.

For Red Hat employees, these environment variables can be found in the SRE-P vault.
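
Before running cadctl locally, it can help to confirm that every variable listed above is set. The sketch below is a small standalone check, not part of CAD. `requiredEnv` copies the names from the "Required ENV variables" list, `missingEnv` is a hypothetical helper, and whether CAD itself validates its environment this way is not stated in this README.

```go
package main

import (
	"fmt"
	"os"
)

// requiredEnv lists the names from the "Required ENV variables" section.
var requiredEnv = []string{
	"CAD_OCM_CLIENT_ID", "CAD_OCM_CLIENT_SECRET", "CAD_OCM_URL",
	"AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY",
	"CAD_AWS_CSS_JUMPROLE", "CAD_AWS_SUPPORT_JUMPROLE",
	"CAD_ESCALATION_POLICY", "CAD_PD_EMAIL", "CAD_PD_PW",
	"CAD_PD_TOKEN", "CAD_PD_USERNAME", "CAD_SILENT_POLICY",
	"PD_SIGNATURE", "X_SECRET_TOKEN", "CAD_PROMETHEUS_PUSHGATEWAY",
	"BACKPLANE_URL", "BACKPLANE_INITIAL_ARN",
}

// missingEnv returns the names in keys for which lookup yields no non-empty
// value. lookup has the same shape as os.LookupEnv so it can be faked in tests.
func missingEnv(keys []string, lookup func(string) (string, bool)) []string {
	var missing []string
	for _, k := range keys {
		if v, ok := lookup(k); !ok || v == "" {
			missing = append(missing, k)
		}
	}
	return missing
}

func main() {
	if missing := missingEnv(requiredEnv, os.LookupEnv); len(missing) > 0 {
		fmt.Println("missing required ENV variables:", missing)
		return
	}
	// BACKPLANE_PROXY is optional in general but needed for local development.
	if v, ok := os.LookupEnv("BACKPLANE_PROXY"); !ok || v == "" {
		fmt.Println("note: BACKPLANE_PROXY is unset; it is required for local development")
	}
	fmt.Println("all required ENV variables are set")
}
```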

Contributors

openshift-merge-robot, typeid, raphaelbut, ninabauer, openshift-merge-bot[bot], georgettica, openshift-ci[bot], zmird-r, ramonbutter, dependabot[bot], tnierman, rafael-azevedo, sam-nguyen7, npecka, bergmannf, mjlshen, tessg22, shibumi, mitalibhalla, nikokolas3270, thrasher-redhat
