bahmanm / lemmy-meter

A web application to track Lemmy instances performance and represent the results visually

Home Page: https://lemmy-meter.info

License: GNU General Public License v3.0

Makefile 66.52% Jinja 1.48% Perl 31.48% Dockerfile 0.52%

fediverse lemmy observability

lemmy-meter's Introduction

1. lemmy-meter

A solution for Lemmy end-users, like me, to check the health of their favourite instance at 3 levels of detail.

This is the source repository which is used to build and deploy lemmy-meter.info.

2. Health Reports

lemmy-meter provides 3 levels of reports.

2.1 Overall Health

This is what you are, almost always, interested in.

Colour	Meaning	Interpretation
🟢 Green	none of the health checks are failing 🙂	Your instance is healthy and doing well.
🟠 Orange	some of the health checks are failing 🫤	Your instance may be partially down; for example mobile APIs may not be working.
🔴 Red	all health checks are failing 🙁	Your instance may be completely down; for example during a planned maintenance.
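
The colour rule in the table above can be sketched in a few lines of Python. This is only an illustration of the mapping, not lemmy-meter's actual implementation; the function name `overall_health` and the list-of-booleans input are assumptions made for the example.

```python
from typing import Iterable


def overall_health(check_results: Iterable[bool]) -> str:
    """Map individual health-check outcomes (True = passing) to a colour.

    Hypothetical helper: the name and input format are illustrative only.
    """
    results = list(check_results)
    if not results:
        raise ValueError("at least one health check result is required")

    failing = sum(1 for ok in results if not ok)
    if failing == 0:
        return "green"   # none of the health checks are failing
    if failing == len(results):
        return "red"     # all health checks are failing
    return "orange"      # some, but not all, health checks are failing
```

For example, an instance whose landing page responds but whose API checks fail would come out as `"orange"`.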

2.2 Endpoint Health

A breakdown of overall health by a few, subjectively important, endpoints:

Landing page: the web page users see when they visit the instance.
Select API endpoints which are used by mobile (and desktop) applications:
- getPosts
- getComments
- getCommunities

2.3 Endpoint Response Time - Rate

A visual representation of how much the average response time has changed over time.
A flat line indicates a consistent response time, regardless of being slow or fast.
Spikes or changes in elevation mean changes in the response time.

NB: It does not represent the actual response times but only the fluctuations.
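
To see why a flat line says nothing about speed, here is a minimal Python sketch that turns a series of averaged response times into changes between consecutive samples. It is an illustration only: the function name and the sample values are made up, and the real panel is presumably computed by the monitoring stack rather than by code like this.

```python
from typing import List, Sequence


def response_time_changes(avg_response_times_ms: Sequence[float]) -> List[float]:
    """Return the change between each pair of consecutive averaged samples.

    Hypothetical helper for illustration; values are in milliseconds.
    """
    return [
        current - previous
        for previous, current in zip(avg_response_times_ms, avg_response_times_ms[1:])
    ]


# A consistently slow endpoint and a consistently fast one both give a flat line.
slow_but_steady = response_time_changes([900, 900, 900])
fast_but_steady = response_time_changes([120, 120, 120])

# A temporary slowdown shows up as a spike followed by a drop.
spike = response_time_changes([200, 200, 800, 200])
```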

2.4 Endpoint Response Time - Raw

The raw response time per endpoint as it happened.
Lower is better. Anything below 500ms is quite decent.
Don't read too much into the actual values.
The server is currently located in Germany which means non-EU instances will always be slightly slower than you'd expect.

3. How To Run

The only dependency is bmakelib.

3.1 Locally

Simply run make up and make down to start the cluster and tear it down.

You can access Grafana at http://localhost:3000 (admin/admin)

3.2 Remote

Run make deploy to, well, deploy lemmy-meter to the remote server.

lemmy-meter's Issues

Grafana admin password should be configurable in `deploy-remote` playbook

Set the default availability warning and error alert durations

Based on the admins' feedback and real world experience, a default duration of 5m to trigger the warning and 10m to trigger the error is quite reasonable.

Externally embeddable gauges

Investigate if it is possible to embed the health indicator gauges for a given instance in another website, the way a usual health "badge" works.

Thanks @unruffled for bringing this up.

Configure Alertmanager

Configure Prometheus alerts and Alertmanager to notify instance admins/communities of outages/degraded performance, eg in a Matrix channel/chat or a Discord server.

Retire matrix-webhook

With Prometheus alerts in place, there's no more need for the Grafana-Matrix bridge and it can be safely retired.

Integrate Alertmanager with ntfy

Configure ntfy to run as a component in the cluster.
Write a webhook receiver which translates Alertmanager payload to ntfy model.
Configure Alertmanager to use the said receiver.
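
The translation step could look roughly like the sketch below. It assumes the standard Alertmanager webhook payload (an `alerts` list whose items carry `status`, `labels` and `annotations`) and ntfy's JSON publishing fields (`topic`, `title`, `message`, `priority`, `tags`). The function name, the topic value and the choice of priorities and tags are assumptions for illustration, not what lemmy-meter actually ships.

```python
from typing import Any, Dict, List


def alertmanager_to_ntfy(payload: Dict[str, Any], topic: str) -> List[Dict[str, Any]]:
    """Translate an Alertmanager webhook payload into ntfy JSON messages.

    Hypothetical helper: one ntfy message is produced per alert.
    """
    messages = []
    for alert in payload.get("alerts", []):
        labels = alert.get("labels", {})
        annotations = alert.get("annotations", {})
        status = alert.get("status", "unknown")
        firing = status == "firing"
        messages.append(
            {
                "topic": topic,
                "title": f"[{status.upper()}] {labels.get('alertname', 'alert')}",
                "message": annotations.get("summary")
                or annotations.get("description")
                or "No details provided.",
                # ntfy priorities range from 1 (min) to 5 (max); 3 is the default.
                "priority": 4 if firing else 3,
                "tags": ["warning"] if firing else ["white_check_mark"],
            }
        )
    return messages
```

Each returned dictionary could then be sent as a JSON body to the ntfy server; the HTTP handling is left out to keep the sketch focused on the mapping.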

Configure alerts for slow DNS resolution

There have been a couple of incidents already when Blackbox Exporter takes a very long time (10s+) to finish the "resolve" phase.

One suspicion is that the connection between the Docker daemon and the provider's nameserver can become stale (:man_shrugging:). I patched the configuration to always use DNS servers outside the internal network.

However, I'd like to be alerted the next time this happens so I can start investigating right away.

Additional meta tags in the header for SEO and embeddability

Follow up from a conversation in #lemmy-meter:matrix.org

Investigate alerts and notifications

Explore whether it is possible for viewers to sign up for notifications about when their favourite instance becomes (partially) unavailable.

This may be potentially helpful for admins as well.

For this to happen:

There should be an un/subscribe form.
lemmy-meter should be able to send e-mails - probably plenty of them.
Reasonable alerts should be configured.

Use volumes for Prometheus and Grafana

Currently all the cluster nodes use bind mounts which are not totally reliable. Use volumes for at least Prometheus and Grafana.

Configure alerts

It should be possible to subscribe to a particular instance's alerts and receive a notification (eg an e-mail) whenever the alert is triggered.

Endpoint to validate scheduled downtime file

It would be helpful to implement an endpoint to assist admins in validating scheduled-downtime.json.

For example:

$ curl -X GET 'https://lemmy-meter.info/.metadata/validate-json?instance=<INSTANCE>'
Invalid 
<detailed error message>

Tune the frequency/number of HTTP requests for the default configuration

Currently, the default configuration sends a good deal of HTTP requests per minute to an instance (~30 req/min.)

Tune it down to 2-4 req/min.
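
As a rough back-of-the-envelope aid, the probe interval needed to hit a target request rate can be computed as below. The sketch assumes every probed endpoint is requested exactly once per interval and uses four endpoints (the landing page plus the three API endpoints listed in the README) as an example; the real configuration may count requests differently.

```python
def probe_interval_seconds(probed_endpoints: int, target_requests_per_minute: float) -> float:
    """Interval between probe rounds that yields the target request rate.

    Hypothetical helper: assumes one request per endpoint per round.
    """
    if probed_endpoints <= 0 or target_requests_per_minute <= 0:
        raise ValueError("both arguments must be positive")
    return 60.0 * probed_endpoints / target_requests_per_minute


# Under these assumptions, ~30 req/min over 4 endpoints means a round every 8 s,
# while 2-4 req/min means a round every 60 to 120 s.
current_interval = probe_interval_seconds(4, 30)
tuned_fast = probe_interval_seconds(4, 4)
tuned_slow = probe_interval_seconds(4, 2)
```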

Try out Kamal instead of Compose

Kamal v1.0.0 which has just been released seems to be an interesting alternative to Docker Compose. It's worth trying it out while lemmy-meter is in its early stages.

Migrate Grafana to PostgreSQL

Cron syntax for recurring scheduled periods

Follow up from #22

In the case of the planned downtime Google sheet, there should be two new columns for cron schedules.

Consider a special 5xx response to a probe as maintenance mode

Investigate whether it's possible to assume the site is in maintenance mode if it responds to probes w/ a special 5xx response such as 503.
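
A minimal sketch of the idea, assuming 503 is the agreed "maintenance" signal and that any 2xx or 3xx status counts as healthy. Both choices, as well as the names used here, are assumptions for illustration; the issue itself leaves them open.

```python
# Status codes treated as "planned maintenance" (assumption: only 503).
MAINTENANCE_STATUS_CODES = frozenset({503})


def classify_probe(status_code: int) -> str:
    """Classify a probe's HTTP status as 'up', 'maintenance' or 'down'."""
    if status_code in MAINTENANCE_STATUS_CODES:
        return "maintenance"
    if 200 <= status_code < 400:
        return "up"
    return "down"
```

With this split, a 503 would not count against availability the way a 500 or 502 would.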

Integrate Alertmanager w/ Gotify

Run matrix-webhook in the cluster

Currently, matrix-webhook, which is used for alert notifications, is run as a separate user. Move it to the same cluster as other services to ensure fail-over and consistency.

Dependency Dashboard

This issue lists Renovate updates and detected dependencies. Read the Dependency Dashboard docs to learn more.

Open

These updates have all been created already. Click a checkbox below to force a retry/rebase of any.

Update dependency molecule to v24

Detected dependencies

cpanfile

cluster/downtime-processor/cpanfile

perl 5.39.9

Mojolicious 9.36

Net::Prometheus 0.12

Data::Dump 1.25

Schedule::Cron::Events 1.96

Text::CSV 2.04

Moose 2.2207

JSON 4.10

JSON::Validator 5.14

File::Slurper 0.014

Data::UUID 1.227

Log::Log4perl 1.57

docker-compose

cluster/docker-compose.yml

prom/prometheus v2.51.2

grafana/grafana 10.4.2

prom/blackbox-exporter v0.25.0

prometheuscommunity/json-exporter v0.6.0

nginx 1.26

postgres 16.2

prom/alertmanager v0.27.0

ixdotai/smtp v0.5.2

binwiederhier/ntfy v2.10.0

dockerfile

cluster/downtime-processor/Dockerfile

perl 5.39.9

pip_requirements

ansible/requirements.txt

ansible ==9.5.1

molecule ==6.0.3

molecule-plugins ==23.5.3

passlib == 1.7.4

Check this box to trigger a request for Renovate to run again on this repository

Probe Lemmy instances from different geo regions

Create alerting rules for latency and availability

As the first iteration, the following values and durations should be enough:

Availability:
- WARN if < 70% for > 15m
- ERROR if < 70% for > 30m
Latency
- WARN if > 30% for > 15m
- ERROR if > 30% for > 30m
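
The "threshold held for a duration" logic of the availability rules can be illustrated in plain Python. In practice this is what the `for` clause of a Prometheus alerting rule provides; the sketch below only mirrors the semantics, assuming one availability sample per minute (an assumption, as is the function name).

```python
from typing import Sequence


def availability_alert(
    samples_percent: Sequence[float],
    threshold: float = 70.0,
    warn_after_minutes: int = 15,
    error_after_minutes: int = 30,
) -> str:
    """Return 'OK', 'WARN' or 'ERROR' for per-minute availability samples.

    Samples are ordered oldest first; only the most recent unbroken run
    of samples below the threshold is counted.
    """
    breach_minutes = 0
    for value in reversed(samples_percent):
        if value >= threshold:
            break
        breach_minutes += 1

    if breach_minutes > error_after_minutes:
        return "ERROR"
    if breach_minutes > warn_after_minutes:
        return "WARN"
    return "OK"
```

A single healthy sample resets the count, so short recoveries clear a pending alert.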

Enable Renovate

Scheduled downtime does not show up in metrics

Enable load balancing for Grafana

Import/export Grafana dashboards w/ zero downtime

It should be possible to transfer the changes between local lemmy-meter and lemmy-meter.info w/o requiring the cluster to be stopped.

One workflow is

Grab latest dashboards from remote
Experiment and make changes locally
Upload the changes to remote

Or, even better, store the relevant Grafana configuration, such as data sources, users and dashboards, so that it can be versioned in git.

Scrape downtime schedules off instances

Follow up on #22

It should be possible to scrape downtime schedules off predefined URLs from instances. For example, https://INSTANCE/.well-known/host-metadata.json or https://INSTANCE/.well-known/scheduled-downtime.json
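
As a small illustration, the candidate URLs for an instance could be derived like this. The two paths come from the examples above; the function name and the input normalisation are assumptions, and no fetching or parsing is attempted here.

```python
from typing import List

# Paths taken from the examples in this issue.
WELL_KNOWN_PATHS = (
    "/.well-known/host-metadata.json",
    "/.well-known/scheduled-downtime.json",
)


def downtime_schedule_urls(instance: str) -> List[str]:
    """Build the candidate downtime-schedule URLs for an instance hostname."""
    host = instance.strip().rstrip("/")
    if not host:
        raise ValueError("instance hostname must not be empty")
    return [f"https://{host}{path}" for path in WELL_KNOWN_PATHS]
```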

Deploy cluster configuration w/o restarting it

In cases like changes to Prometheus service discovery files (eg adding an instance) there's no need to restart the cluster as Prometheus will pick the changes up OOTB.

Write tests for the deploy playbook

Allow instance owners to schedule planned downtime

As it stands, this project is very good for detecting unplanned service outages, but there is currently not a way to distinguish between planned and unplanned outages.

Broken data link on Overall Health panel

The link points to localhost:3000

Calculate instance availability via recording rules

Expose stats via APIs

It'd be useful to expose the health check results that lemmy-meter collects via some API to interested parties.

For example, uptime.lemmings.world could use such stats to generate uptime badges.

Things to note at the first pass:

The API shouldn't be public. At least not for now, as lemmy-meter simply hasn't got the infrastructure for that.
There are two types of data that lemmy-meter ingests and stores: snapshot and time-series. Again, for the same infrastructural reason, the focus should be on the snapshot data for the time being.

Thanks @RikudouSage for bringing this up.

Re-organise the project into subprojects

Automate the rollout of a new version

The current process for deploying a new version is quite laborious and involves scp, wget and unzip which is just not right 😅

Ideally, there should be an Ansible playbook(s) to automate all or most aspects of that:

Deploying a new version of lemmy-meter
Deploying Grafana dashboards
Restarting the cluster
Restarting a particular service

For the sake of simplicity, the task of deploying a cluster to a new machine can be skipped.