Comments (9)
The alerts are added for:
- any node being down
- any node whose best block height grows by less than 1 in 10 minutes or by less than 40 in an hour.
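The two growth thresholds above can be sketched as a small check over recorded block heights. This is only an illustration of the rule, not the Grafana configuration that was actually deployed; the `GrowthRule` type, the `(minute, height)` sample format, and the function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GrowthRule:
    window_minutes: int  # length of the trailing window
    min_growth: int      # alert if the height grew by less than this


# Thresholds taken from the comment above: 1 block in 10 minutes, 40 in an hour.
RULES = (GrowthRule(10, 1), GrowthRule(60, 40))


def height_growth(samples, window_minutes):
    """Return the best-block-height growth over the trailing window.

    `samples` is a list of (minute, height) pairs sorted by minute
    (hypothetical format). Returns None when the history does not
    reach back far enough to cover the window.
    """
    latest_minute, latest_height = samples[-1]
    cutoff = latest_minute - window_minutes
    older = [height for minute, height in samples if minute <= cutoff]
    if not older:
        return None
    return latest_height - older[-1]


def firing_rules(samples, rules=RULES):
    """Return the rules whose minimum growth was not reached."""
    fired = []
    for rule in rules:
        growth = height_growth(samples, rule.window_minutes)
        if growth is not None and growth < rule.min_growth:
            fired.append(rule)
    return fired
```

A node with too little history fires nothing in this sketch, which avoids alerting on freshly started nodes; whether that is the desired behaviour is a design choice.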
from infra.
One issue we have with the current setup is that we get false alerts for nodes being down when miners are pre-empted. To check miner availability we need to distinguish between a node being down because its pod was rescheduled after a VM pre-emption and a node being down because it crashed.
I found that kube-state-metrics exposes a metric for the pod phase. We could use the “Failed” status of that metric to determine whether a node crashed.
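As a rough sketch of that idea: kube-state-metrics publishes the phase as `kube_pod_status_phase`, one series per pod and phase, with value 1 for the phase the pod is currently in. The function below classifies a pod from such samples. The `(labels, value)` sample format and the three result strings are hypothetical, and it is an assumption that a pre-empted pod never reports “Failed”; that would need to be verified against the cluster.

```python
def classify_pod(samples, pod):
    """Classify a pod as 'crashed', 'running', or 'rescheduling'.

    `samples` is an iterable of (labels, value) pairs as they might be
    read from the kube_pod_status_phase metric (hypothetical format):
    labels is a dict with at least 'pod' and 'phase', and value is 1
    for the phase the pod is currently in.
    """
    active = {
        labels["phase"]
        for labels, value in samples
        if labels.get("pod") == pod and value == 1
    }
    if "Failed" in active:
        # Proposal from the comment above: Failed means the node crashed.
        return "crashed"
    if "Running" in active:
        return "running"
    # Pending, Unknown, or no sample at all: assumed to be a reschedule,
    # for example after a VM pre-emption.
    return "rescheduling"
```

With this split, only `crashed` would count against miner availability, while `rescheduling` could be tolerated for a grace period.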
Here are my initial thoughts on this:
As a channel we should choose an existing medium. For us that means email or IRC. I would prefer IRC (a public channel on freenode) since it is more easily managed (people can choose to join the channel instead of an admin managing an email list) and discussion about the incident can happen in that channel.
For checking alerts we can choose between Grafana and Alertmanager. If we use Grafana, we can define alerts for all our metric sources, which might include Google Stackdriver for K8s cluster monitoring. Alertmanager would only integrate with Prometheus, and we would need to run an Alertmanager for every K8s cluster.
Neither Grafana nor Alertmanager has built-in support for IRC notifications. We’d need to investigate a bridge like irccat, which we would need to run ourselves.
As alert conditions I suggest the following:
- Alert when the average block import rate of a node deviates too much from one per minute. The margin of error should be smaller for averages over larger time windows.
- Alert when a node is not connected to any peer
- Alert when Kubernetes pods are not running
- Alert when Kubernetes containers restart
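The first condition, a margin of error that shrinks for larger windows, could be sketched as follows. The `k / sqrt(window)` margin is an assumption made for illustration (it treats block imports as roughly Poisson-distributed around one per minute); the comment above does not specify a formula, and `k` and the function names are hypothetical.

```python
import math

TARGET_RATE = 1.0  # expected block imports per minute, from the comment above


def allowed_margin(window_minutes, k=3.0):
    """Relative deviation tolerated for a window of the given length.

    Assumed formula: the margin shrinks with the square root of the
    window length, so longer averages must stay closer to the target.
    """
    return k / math.sqrt(window_minutes)


def import_rate_alert(blocks_imported, window_minutes, k=3.0):
    """Return True when the average import rate deviates too much."""
    rate = blocks_imported / window_minutes
    return abs(rate - TARGET_RATE) > allowed_margin(window_minutes, k)
```

Under these assumptions the same average rate can be acceptable over a short window and alert over a long one: 20 blocks in 36 minutes stays within the margin, while 80 blocks in 144 minutes does not.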
Grafana alerts look extremely useful. IRC seems poorly supported though: I couldn't find any reasonable way to integrate it easily, and neither Grafana nor any of its integrations supports IRC. Gmail can't do it either, not even with add-ons. Riot.im can integrate with IRC, but not with Grafana or email. irccat looks like the most reasonable solution; it requires some extra steps, but should do the job.
Is it a good idea to make IRC our official public notification platform? Nowadays it really looks like a dead technology.
> Is it a good idea to make IRC our official public notification platform? Nowadays it really looks like a dead technology.
I wouldn’t call it a dead technology. But the need for DIY integration is a real downside. We should investigate how easy email would be. Maybe a Google Group makes sense; that would make it easier to manage.
@CodeSandwich and I discussed this. We think it’s best to have the alerting on Grafana and notify us via an email group. We’ll start with alerts for a low block import rate across all nodes to check network health, and alerts on whether all nodes that we run are up. We’ll document this in the repo for now and then integrate it into terraform later.
Possible future alerts:
- Any node in the cluster is down
- Any node in the cluster has fewer than 4 peers (the minimum guaranteed by the size of the cluster)
- Blocks are mined too quickly (this may be monitored indirectly with block import rate, but adding the last block mining time would be more reliable)
- Blocks are mined too slowly (this is going to be covered indirectly with the low import rate, but again last block mining time would be more reliable)
- Too many invalid transactions are being proposed (possible DDoS attack)
Many false alerts can be muted by disabling some or all of the checks while the node is in sync mode: radicle-dev/radicle-registry#402
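A minimal sketch of the peer-count alert with the sync-mode muting described above. The `peers` and `is_syncing` inputs are hypothetical and stand in for whatever the node actually reports; the threshold of 4 is taken from the list above.

```python
MIN_PEERS = 4  # minimum guaranteed by the size of the cluster, per the list above


def peer_alert(peers, is_syncing, min_peers=MIN_PEERS, mute_while_syncing=True):
    """Return True when the low-peer-count alert should fire.

    `peers` and `is_syncing` are hypothetical inputs. While the node is
    syncing the check is skipped, so that a node catching up does not
    raise a false alert.
    """
    if mute_while_syncing and is_syncing:
        return False
    return peers < min_peers
```

Muting is a trade-off: a node that stays in sync mode indefinitely would never alert here, so a separate alert for nodes stuck syncing would still be needed.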
To consider: the email group should be completely closed for all senders except this one Grafana instance.
The group is live: https://groups.google.com/forum/#!forum/radicle-registry-ffnet-cluster-alerts
@geigerzaehler Some of the steps toward this goal have been completed; the others have been split into smaller issues. Can we close it now?