csmo-icp's Introduction

IBM Cloud Service Management and Operations

Monitoring in IBM Cloud Private

Here you will find artifacts created by the IBM CSMO team to assist you with performance management of your ICP deployment. You will find hints and tips, integration how-tos, Grafana dashboards and more!

As defined in the Cloud Service Management and Operations reference architecture, the monitoring & logging components of the CSMO toolchain "kick off" the incident management process.

IncidentManagement Toolchain

As of late October, we are concentrating only on the Monitoring & Logging components that are built into the Kubernetes component of IBM Cloud Private. We will also discuss integration points with external Event Management solutions such as Netcool Operations Insight.

Soon, this document will be updated to include internal monitoring components such as Cloud Foundry monitoring capabilities and external components such as APM.

csmo-icp's Issues

grafana.ini needs to be changed more easily

Is there a better or easier way to add configuration parameters to /etc/grafana/grafana.ini when Grafana runs in a container? The current method, via a ConfigMap and a volume, is quite complex.
For example, could the Helm chart pass environment parameters through to grafana.ini?
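For context on the question above: Grafana itself can override any grafana.ini option from an environment variable named GF_<SECTION>_<KEY> (section and key uppercased, periods and dashes replaced by underscores). The fragment below is a minimal sketch of how such variables could be set on a Grafana container. The container name, image reference and URL are hypothetical, and whether the ICP Helm chart exposes a value for injecting env entries is an assumption that would need to be checked against the chart.

```yaml
# Sketch only: fragment of a pod spec for a Grafana container.
# Names and values are illustrative, not taken from the ICP chart.
containers:
  - name: grafana                  # hypothetical container name
    image: grafana/grafana         # hypothetical image reference
    env:
      # Equivalent to "[auth.anonymous] enabled = true" in grafana.ini
      - name: GF_AUTH_ANONYMOUS_ENABLED
        value: "true"
      # Equivalent to "[server] root_url = ..." in grafana.ini
      - name: GF_SERVER_ROOT_URL
        value: "https://example.com/grafana"   # placeholder URL
```

Values set this way take precedence over the file, so the ConfigMap would only be needed for settings that are awkward to express as single variables.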

ICP 3.2.1

Is there an update of the ICP Monitoring Dashboards for ICP 3.2.1? The current rules do not work, apparently because the format has changed in Prometheus (PrometheusRule instead of AlertRule).
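To illustrate the format difference mentioned above, here is a minimal sketch of a PrometheusRule resource, assuming the issue refers to the Prometheus Operator custom resource of that name (API group monitoring.coreos.com/v1). The resource name, namespace, group name and the alert itself are placeholders, and any labels the ICP 3.2.1 monitoring stack may require to pick up the rule are not shown.

```yaml
# Sketch only: assumes the Prometheus Operator PrometheusRule custom resource.
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: example-icp-rules          # hypothetical resource name
  namespace: kube-system           # assumption: namespace of the monitoring stack
spec:
  groups:
    - name: example.rules          # hypothetical group name
      rules:
        - alert: ExampleTargetDown # placeholder alert, not from the repository
          expr: up == 0
          for: 10m
          labels:
            severity: warning
          annotations:
            message: 'Target {{ $labels.instance }} is down.'
```

Under this assumption, existing rules would move into spec.groups[].rules with their alert, expr, for, labels and annotations fields unchanged.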

have to refresh the page after you select a namespace

When you select a namespace (default, for example) in Grafana's ICP 2.1 Namespace dashboard, you have to refresh the page to see the graphs.
If you don't, $namespace keeps its previous value.
Once you select a namespace and refresh the page, the graph title changes to the correct variable value and the graph appears.

ICPetcdHighNumberOfFailedGRPCRequests triggers meaningless alert

Hi, all,

I'm from the IBM Cloud Private team. In our environment, we found that the alert ICPetcdHighNumberOfFailedGRPCRequests is triggered frequently (every 5 minutes).

During my investigation, I found that the alert is triggered by a normal operation.
Every time etcdctl lease keep-alive $lease interacts with the etcd cluster, it produces the log entry below, which in turn triggers the ICPetcdHighNumberOfFailedGRPCRequests alert.

{"log":"2019-07-05 08:20:02.426997 D | etcdserver/api/v3rpc: failed to receive lease keepalive request from gRPC stream (\"rpc error: code = Unavailable desc = client disconnected\")\n","stream":"stderr","time":"2019-07-05T08:20:02.427136464Z"}

The alert rule does not seem meaningful when grpc_code="Unavailable" or grpc_method="LeaseKeepAlive", so we would like to change https://github.com/ibm-cloud-architecture/CSMO-ICP/blob/master/prometheus/alerts_icp_2.1.0.2-3.1.1/alert-rules-icp311.yaml#L34 to the following:

     - alert: ICPetcdHighNumberOfFailedGRPCRequests
       annotations:
         message: 'etcd cluster "{{ $labels.job }}": {{ $value }}% of requests for {{
           $labels.grpc_method }} failed on etcd instance {{ $labels.instance }}.'
       expr: |
         100 * sum(rate(grpc_server_handled_total{job=~".*etcd.*", grpc_code!="OK", grpc_code!="Unavailable", grpc_method!="LeaseKeepAlive"}[5m])) BY (job, instance, grpc_service, grpc_method)
           /
         sum(rate(grpc_server_handled_total{job=~".*etcd.*"}[5m])) BY (job, instance, grpc_service, grpc_method)
           > 1
       for: 10m
       labels:
         severity: warning

We are submitting this issue to ask for your opinion. We hope to make this change to avoid meaningless alerts.

A similar issue was reported in:
openshift/cluster-monitoring-operator#248
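One way to check that the proposed expression behaves as intended is a rule unit test for promtool test rules. The sketch below assumes the proposed rule has been saved in a standard Prometheus rule file (a top-level groups: list) under the hypothetical name etcd-rules.yaml. The series values, instance name and grpc_service labels are invented for illustration.

```yaml
# etcd-rules-test.yaml (hypothetical file name)
# Run with: promtool test rules etcd-rules-test.yaml
rule_files:
  - etcd-rules.yaml   # hypothetical: the proposed rule inside a standard "groups:" list

evaluation_interval: 1m

tests:
  - interval: 1m
    input_series:
      # Lease keep-alive streams that end with code Unavailable, as in the log above.
      - series: 'grpc_server_handled_total{job="etcd", instance="etcd-0", grpc_service="etcdserverpb.Lease", grpc_method="LeaseKeepAlive", grpc_code="Unavailable"}'
        values: '0+5x30'
      # Ordinary successful traffic on another method.
      - series: 'grpc_server_handled_total{job="etcd", instance="etcd-0", grpc_service="etcdserverpb.KV", grpc_method="Range", grpc_code="OK"}'
        values: '0+100x30'
    alert_rule_test:
      # Both series are excluded from the numerator, so no alert should be firing.
      - eval_time: 30m
        alertname: ICPetcdHighNumberOfFailedGRPCRequests
        exp_alerts: []
```

With the original expression, which only excludes grpc_code="OK", the same input would be expected to fire the alert for the LeaseKeepAlive series, since every request counted for that method is a failure. Note that the proposed selectors drop all Unavailable responses and all LeaseKeepAlive calls, not just the combination of the two.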
