Comments (9)

CodeSandwich commented on June 10, 2024

The alerts are added for:

  • any node being down
  • any node whose best block height grows by less than 1 in 10 minutes or by less than 40 in an hour.
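
The two thresholds above can be sketched as a small check. This is only an illustration, not the actual Grafana rule: the names `growth` and `is_stalled`, the sample format (Unix timestamp in seconds, best block height), and the reading of the second threshold as "fewer than 40 blocks in an hour" are assumptions.

```python
from typing import List, Tuple

# Hypothetical sample format: (unix_timestamp_seconds, best_block_height)
Sample = Tuple[int, int]


def growth(samples: List[Sample], now: int, window_s: int) -> int:
    """Best-block-height growth over the last `window_s` seconds."""
    in_window = [h for t, h in samples if now - window_s <= t <= now]
    if len(in_window) < 2:
        return 0  # not enough data points: treat as no growth
    return max(in_window) - min(in_window)


def is_stalled(samples: List[Sample], now: int) -> bool:
    """True if either threshold is violated (assumed: <1 in 10 min, <40 in 1 h)."""
    return growth(samples, now, 600) < 1 or growth(samples, now, 3600) < 40
```

A node importing one block per minute passes both checks, while a node importing one block every two minutes trips the hourly threshold even though its 10-minute growth is above zero.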

from infra.

geigerzaehler commented on June 10, 2024

One issue we have with the current setup is that we get false alerts for nodes being down when miners are pre-empted. To check miner availability we need to distinguish between a node being down because its pod is rescheduled after a VM is pre-empted and a node being down because it has crashed.

I found that kube-state-metrics exposes a metric for the pod phase. We could use the “Failed” status of that metric to determine whether a node crashed.
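
A minimal sketch of that distinction, assuming (as proposed above) that the “Failed” phase signals a crash and that any other non-running phase is treated as a reschedule. kube-state-metrics reports the phase through the `kube_pod_status_phase` metric; the function names `classify_pod` and `should_alert` are hypothetical and only illustrate the decision.

```python
# Pod phases defined by Kubernetes.
POD_PHASES = {"Pending", "Running", "Succeeded", "Failed", "Unknown"}


def classify_pod(phase: str) -> str:
    """Map a pod phase to a coarse node state (assumed mapping)."""
    if phase not in POD_PHASES:
        raise ValueError(f"unknown pod phase: {phase}")
    if phase == "Running":
        return "up"
    if phase == "Failed":
        return "crashed"
    # e.g. Pending while the pod is rescheduled after a VM pre-emption
    return "unavailable"


def should_alert(phase: str) -> bool:
    """Alert only on crashes, not on reschedules."""
    return classify_pod(phase) == "crashed"
```

Whether “Failed” catches every crash depends on the pod's restart policy, so this mapping would need checking against the real cluster before being relied on.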

geigerzaehler commented on June 10, 2024

Here are my initial thoughts on this:

As a channel we should choose an existing medium. For us that means email or IRC. I would prefer IRC (a public channel on freenode) since it is more easily managed (people can choose to join the channel instead of an admin managing an email list) and discussion about the incident can happen in that channel.

For checking alerts we can choose between Grafana and Alertmanager. If we use Grafana we can define alerts for all our metric sources, which might include Google Stackdriver for K8s cluster monitoring. Alertmanager would only integrate with Prometheus, and we would need to run an Alertmanager instance for every K8s cluster.

Neither Grafana nor Alertmanager have built-in support for IRC notifications. We’d need to investigate a bridge like irccat which we would need to run ourselves.

As alert conditions I suggest the following:

  • Alert when the average block import rate of a node deviates too much from one per minute. The margin of error should be smaller for averages over larger time windows.
  • Alert when a node is not connected to any peer.
  • Alert when Kubernetes pods are not running.
  • Alert when Kubernetes containers restart.
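
The first condition can be sketched as follows. The thread does not say how the margin should shrink, so the square-root scaling, the 50% margin at a 10-minute window, and the names `allowed_deviation` and `rate_alert` are all assumptions chosen for illustration.

```python
import math

EXPECTED_RATE = 1.0  # blocks per minute, as stated in the list above


def allowed_deviation(window_min: float,
                      base: float = 0.5,
                      base_window_min: float = 10.0) -> float:
    """Relative margin of error; shrinks as the window grows (assumed sqrt scaling)."""
    return base * math.sqrt(base_window_min / window_min)


def rate_alert(blocks_imported: int, window_min: float) -> bool:
    """True if the average import rate deviates too much from the expected rate."""
    rate = blocks_imported / window_min
    return abs(rate - EXPECTED_RATE) > allowed_deviation(window_min) * EXPECTED_RATE
```

With these assumed defaults, 6 blocks in 10 minutes stays within the margin, while the slightly better average of 40 blocks in 60 minutes triggers an alert because the longer window tolerates less deviation.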

CodeSandwich commented on June 10, 2024

Grafana alerts look extremely useful. IRC seems completely isolated, though: I couldn't find any reasonable way to integrate it easily. Neither Grafana nor any of its integrations supports IRC. Gmail can't do it either, not even with add-ons. Riot.im can integrate with IRC, but not with Grafana or email. irccat looks like the most reasonable solution: it requires some extra steps, but should do the job.

Is it a good idea to make IRC our official public notification platform? Nowadays it really looks like a dead technology.

geigerzaehler commented on June 10, 2024

> Is it a good idea to make IRC our official public notification platform? Nowadays it really looks like a dead technology.

I wouldn’t call it dead technology. But the need for DIY integration is a real downside. We should investigate how easy email would be. Maybe a Google Group makes sense. Then it is easier to manage it.

geigerzaehler commented on June 10, 2024

@CodeSandwich and I discussed this. We think it’s best to have the alerting in Grafana and to notify us via an email group. We’ll start with alerts for a low block import rate across all nodes to check network health, and alerts on whether all the nodes that we run are up. We’ll document this in the repo for now and then integrate it into Terraform later.

CodeSandwich commented on June 10, 2024

Possible future alerts:

  • Any node in the cluster is down
  • Any node in the cluster has less than 4 peers (the minimum guaranteed by the size of the cluster)
  • Blocks are mined too quickly (this may be monitored indirectly with block import rate, but adding the last block mining time would be more reliable)
  • Blocks are mined too slowly (this is going to be covered indirectly with the low import rate, but again last block mining time would be more reliable)
  • Too many invalid transactions are being proposed (possible DDoS attack)

Many false alerts can be muted by disabling some or all of the checks while the node is in sync mode: radicle-dev/radicle-registry#402
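
A rough sketch of how a few of these checks could be combined with the sync-mode muting. Everything here is hypothetical: the `NodeStatus` fields are not an actual node API, the 600-second block gap is an assumed threshold, and muting all checks except "node down" during sync is just one possible policy.

```python
from dataclasses import dataclass
from typing import List

MIN_PEERS = 4  # minimum guaranteed by the size of the cluster, per the list above


@dataclass
class NodeStatus:
    # Hypothetical fields, for illustration only
    up: bool
    peers: int
    syncing: bool
    seconds_since_last_block: float


def alerts_for(node: NodeStatus, max_block_gap_s: float = 600.0) -> List[str]:
    """Return the alerts that apply to a node; the threshold is an assumption."""
    if not node.up:
        return ["node down"]
    if node.syncing:
        return []  # mute the remaining checks while the node is in sync mode
    alerts = []
    if node.peers < MIN_PEERS:
        alerts.append("too few peers")
    if node.seconds_since_last_block > max_block_gap_s:
        alerts.append("blocks mined too slowly")
    return alerts
```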

To consider: the email group should be completely closed for all senders except this one Grafana instance.

CodeSandwich commented on June 10, 2024

The group is live: https://groups.google.com/forum/#!forum/radicle-registry-ffnet-cluster-alerts

CodeSandwich commented on June 10, 2024

@geigerzaehler Some of the steps towards this goal have been completed; the others have been split into smaller issues. Can we close it now?
