metricq / metricq-sink-nsca

🧪 Send passive service checks for metrics using send_nsca

Home Page: https://metricq.github.io/metricq-sink-nsca/

License: GNU General Public License v3.0

Python 99.12% Dockerfile 0.88%
metricq metricq-client nsca monitoring metricq-sink

metricq-sink-nsca's Introduction

metricq

MetricQ is a highly scalable, distributed metric data processing framework based on RabbitMQ. This repository used to be the central repository, but has since been split into several other repositories.

The different MetricQ language implementations can be found here:

The proto files of the used Protobuf definitions can be found here.

Documentation

Given the distributed architecture of MetricQ, the documentation is scattered over several repositories and webpages:

There are also a lot of client implementations available:

Setup development environment with docker-compose

Note: During startup, especially the first time, errors and restarts of some services are normal. Please be patient.

Just run:

docker-compose -f docker-compose-development.yml up

This will set up:

  • Grafana server (port 3000 forwarded to localhost:3001)
  • CouchDB server (port 5984 forwarded to localhost)
  • RabbitMQ server (port 5672 and 15672 forwarded to localhost)
  • MetricQ Wizard (port 3000 forwarded to localhost)
  • MetricQ Webview (port 80 forwarded to localhost:3002)
  • MetricQ Explorer (port 80 forwarded to localhost:3004)
  • MetricQ Wizard backend (port 8000 forwarded to localhost)
  • metricq-sink-websocket (port 3000 forwarded to localhost:3003)
  • MetricQ Manager
  • metricq-grafana (port 4000 forwarded to localhost)
  • C++ example source generating a metric called dummy.source
  • metricq-rabbitmq-source providing metricq.rabbitmq.[...] performance metrics for the running RabbitMQ server
  • metricq-source-sysinfo providing localhost.[...] performance metrics for the docker host
  • metricq-db-hta database that stores the metrics
  • metricq-example-combinator a combinator that can combine metrics into new metrics

By default, all logins are admin / admin. Do not use this compose file in production!

To run it in the background append -d:

docker-compose -f docker-compose-development.yml up -d

To stop everything run:

docker-compose -f docker-compose-development.yml stop

To stop and remove everything run:

docker-compose -f docker-compose-development.yml down

Connecting to the MetricQ network

You can now connect to the network with amqp://admin:admin@localhost as the URL and dummy.source as a metric, for instance using the examples from metricq-python:

pip install ".[examples]"
./examples/metricq_sink.py --server amqp://admin:admin@localhost -m dummy.source

Setup clustered development environment with docker-compose

If you follow the steps above with docker-compose-cluster.yml instead, three RabbitMQ nodes will be set up. On start, they will automatically form a cluster.

The container names will be (might be different for your specific setup):

  • metricq_rabbitmq-server-node0_1
  • metricq_rabbitmq-server-node1_1
  • metricq_rabbitmq-server-node2_1

By default, all MetricQ agents started from the compose file will connect to rabbitmq-server, which resolves to any of the three nodes.

Note: Make sure to use BuildKit, for instance by setting the environment variable COMPOSE_DOCKER_CLI_BUILD to 1, or use a docker-compose version newer than 1.28.0-rc3.

Configure like live Cluster

  • Create a user-policy with
    • Name: ManagementAsHA
    • Pattern: management
    • Definition: ha-mode: all

Connecting to nodes from docker network

Use the hostname rabbitmq-server and the client will connect to a random node in the cluster.

For specific nodes, use the hostnames rabbitmq-node0, rabbitmq-node1, or rabbitmq-node2.

Connecting to nodes from host or remotely

The different RabbitMQ nodes are listening on the network interface of their host.

  • rabbitmq-node0: 5671 / 15671
  • rabbitmq-node1: 5672 / 15672
  • rabbitmq-node2: 5673 / 15673

Acknowledgements

This work is supported in part by the German Research Foundation (DFG) within the CRC 912 - HAEC.

Primary Reference

Thomas Ilsche, Daniel Hackenberg, Robert Schöne, Mario Bielert, Franz Höpfner and Wolfgang E. Nagel: MetricQ: A Scalable Infrastructure for Processing High-Resolution Time Series Data 📕 2019 IEEE/ACM Industry/University Joint International Workshop on Data-center Automation, Analytics, and Control (DAAC), Denver, CO, USA, 2019, pp. 7-12, DOI: 10.1109/DAAC49578.2019.00007.

Additional Reference

Thomas Ilsche: Energy Measurements of High Performance Computing Systems: From Instrumentation to Analysis 📕 2020 Doctoral dissertation TU Dresden, URN: urn:nbn:de:bsz:14-qucosa2-716000

metricq-sink-nsca's People

Contributors

bmario, dependabot[bot], kinnarr, phijor, tilsche

Forkers

mstud

metricq-sink-nsca's Issues

Add check for discrete values

Sometimes, a metric should have one particular value, and every other value denotes a failure. We need an option to configure that.
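
A minimal sketch of what such a check could look like, not the project's actual implementation. The names `State`, `check_discrete`, and the `tolerance` parameter are hypothetical and only illustrate the idea of comparing a value against a set of allowed values:

```python
import math
from enum import Enum
from typing import Iterable


class State(Enum):
    """Hypothetical subset of service states used by this sketch."""
    OK = 0
    CRITICAL = 2


def check_discrete(
    value: float, expected: Iterable[float], tolerance: float = 0.0
) -> State:
    """Return OK if `value` matches one of `expected`, otherwise CRITICAL."""
    if math.isnan(value):
        # NaN never compares equal to anything, so treat it as a failure.
        return State.CRITICAL
    for target in expected:
        if abs(value - target) <= tolerance:
            return State.OK
    return State.CRITICAL
```

For example, `check_discrete(1.0, {1.0})` yields `State.OK`, while any other value yields `State.CRITICAL`. A tolerance is included because metric values are floating point numbers; whether exact comparison is preferable depends on the metric.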

Add a global "ignore list" of metrics

Sometimes, a metric is known to temporarily produce "bad" data. We would like to exclude it from all checks without having to remove it from the check configuration (as that usually results in misconfigurations when re-adding the metric).

Idea: have a new top-level configuration key override that contains temporary overrides:

{
  "override": {
    "ignored_metrics": [
      "foo.bar.*"
    ]
  }
}
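
How such an override might be applied can be sketched as follows. This assumes that the patterns are shell-style globs (as the `foo.bar.*` example suggests) and that matching is case-sensitive; the helper names `is_ignored` and `filter_metrics` are hypothetical:

```python
from fnmatch import fnmatchcase
from typing import Iterable, List, Mapping


def is_ignored(metric: str, patterns: Iterable[str]) -> bool:
    """Return True if the metric name matches any ignore pattern."""
    return any(fnmatchcase(metric, pattern) for pattern in patterns)


def filter_metrics(metrics: Iterable[str], config: Mapping) -> List[str]:
    """Drop all metrics matched by config["override"]["ignored_metrics"]."""
    patterns = config.get("override", {}).get("ignored_metrics", [])
    return [metric for metric in metrics if not is_ignored(metric, patterns)]


config = {"override": {"ignored_metrics": ["foo.bar.*"]}}
remaining = filter_metrics(["foo.bar.baz", "foo.qux"], config)
```

Note that with glob semantics `*` also matches dots, so `foo.bar.*` would cover `foo.bar.a.b` as well. If that is not desired, a stricter pattern syntax would be needed.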

Connection closed: CHANNEL_ERROR - expected 'channel.open' (ConnectionChannelError)

sink-nsca stopped after the following error:

Sep 20 16:41:09 igel metricq-sink-nsca[751]: [2021-09-20 16:41:09,185] [INFO ] [metricq_sink_nsca.reporter] Sending 1 NSCA report(s)
Sep 20 16:41:21 igel metricq-sink-nsca[751]: [2021-09-20 16:41:21,795] [ERROR] [metricq_sink_nsca.reporter] Failed to send reports to NSCA host at xx.xx.xx.xx:5667: returncode=2
Sep 20 16:41:21 igel metricq-sink-nsca[751]: [2021-09-20 16:41:21,795] [ERROR] [metricq_sink_nsca.reporter] send_nsca: Error: Timeout after %d seconds
Sep 20 16:41:26 igel metricq-sink-nsca[751]: [2021-09-20 16:41:26,364] [ERROR] [metricq.agent       ] Exception in event loop: Future exception was never retrieved
Sep 20 16:41:26 igel metricq-sink-nsca[751]: [2021-09-20 16:41:26,364] [ERROR] [metricq.agent       ] Future: <Future finished exception=BrokenPipeError(32, 'Broken pipe')>
Sep 20 16:41:26 igel metricq-sink-nsca[751]: [2021-09-20 16:41:26,364] [ERROR] [metricq.agent       ] Stopping Agent on unhandled exception (BrokenPipeError)
Sep 20 16:41:27 igel metricq-sink-nsca[751]: [2021-09-20 16:41:27,640] [INFO ] [metricq.data_client ] closing data channel and connection.
Sep 20 16:41:27 igel metricq-sink-nsca[751]: [2021-09-20 16:41:27,812] [INFO ] [metricq_sink_nsca.reporter] Sending 3 NSCA report(s)
Sep 20 16:41:50 igel metricq-sink-nsca[751]: [2021-09-20 16:41:50,062] [INFO ] [metricq.agent       ] Connection closed: CHANNEL_ERROR - expected 'channel.open' (ConnectionChannelError)
Sep 20 16:41:50 igel metricq-sink-nsca[751]: [2021-09-20 16:41:50,062] [INFO ] [aio_pika.robust_connection] Connection to amqps://nsca:******@rabbitmq:5671/data closed. Reconnecting after 5 seconds.
Sep 20 16:41:50 igel metricq-sink-nsca[751]: [2021-09-20 16:41:50,069] [INFO ] [metricq.agent       ] Stopping Agent ReporterSink ([Errno 32] Broken pipe)...
Sep 20 16:41:50 igel metricq-sink-nsca[751]: [2021-09-20 16:41:50,075] [INFO ] [metricq.agent       ] Closing management channel and connection...
Sep 20 16:41:50 igel metricq-sink-nsca[751]: Traceback (most recent call last):
Sep 20 16:41:50 igel metricq-sink-nsca[751]:   File "/home/service/envs/nsca/lib/python3.9/site-packages/metricq/agent.py", line 251, in run
Sep 20 16:41:50 igel metricq-sink-nsca[751]:     self.event_loop.run_until_complete(wait_for_stop())
Sep 20 16:41:50 igel metricq-sink-nsca[751]:   File "uvloop/loop.pyx", line 1494, in uvloop.loop.Loop.run_until_complete
Sep 20 16:41:50 igel metricq-sink-nsca[751]:   File "/home/service/envs/nsca/lib/python3.9/site-packages/metricq/agent.py", line 247, in wait_for_stop
Sep 20 16:41:50 igel metricq-sink-nsca[751]:     return stopped_task.result()
Sep 20 16:41:50 igel metricq-sink-nsca[751]:   File "/home/service/envs/nsca/lib/python3.9/site-packages/metricq/agent.py", line 497, in stopped
Sep 20 16:41:50 igel metricq-sink-nsca[751]:     await self._stop_future
Sep 20 16:41:50 igel metricq-sink-nsca[751]: BrokenPipeError: [Errno 32] Broken pipe
Sep 20 16:41:50 igel metricq-sink-nsca[751]: During handling of the above exception, another exception occurred:
Sep 20 16:41:50 igel metricq-sink-nsca[751]: Traceback (most recent call last):
Sep 20 16:41:50 igel metricq-sink-nsca[751]:   File "/home/service/envs/nsca/bin/metricq-sink-nsca", line 8, in <module>
Sep 20 16:41:50 igel metricq-sink-nsca[751]:     sys.exit(main())
Sep 20 16:41:50 igel metricq-sink-nsca[751]:   File "/home/service/envs/nsca/lib/python3.9/site-packages/click/core.py", line 829, in __call__
Sep 20 16:41:50 igel metricq-sink-nsca[751]:     return self.main(*args, **kwargs)
Sep 20 16:41:50 igel metricq-sink-nsca[751]:   File "/home/service/envs/nsca/lib/python3.9/site-packages/click/core.py", line 782, in main
Sep 20 16:41:50 igel metricq-sink-nsca[751]:     rv = self.invoke(ctx)
Sep 20 16:41:50 igel metricq-sink-nsca[751]:   File "/home/service/envs/nsca/lib/python3.9/site-packages/click/core.py", line 1066, in invoke
Sep 20 16:41:50 igel metricq-sink-nsca[751]:     return ctx.invoke(self.callback, **ctx.params)
Sep 20 16:41:50 igel metricq-sink-nsca[751]:   File "/home/service/envs/nsca/lib/python3.9/site-packages/click/core.py", line 610, in invoke
Sep 20 16:41:50 igel metricq-sink-nsca[751]:     return callback(*args, **kwargs)
Sep 20 16:41:50 igel metricq-sink-nsca[751]:   File "/home/service/envs/nsca/lib/python3.9/site-packages/metricq_sink_nsca/main.py", line 76, in main
Sep 20 16:41:50 igel metricq-sink-nsca[751]:     reporter.run(cancel_on_exception=True)
Sep 20 16:41:50 igel metricq-sink-nsca[751]:   File "/home/service/envs/nsca/lib/python3.9/site-packages/metricq/agent.py", line 254, in run
Sep 20 16:41:50 igel metricq-sink-nsca[751]:     self.event_loop.run_until_complete(self.event_loop.shutdown_asyncgens())
Sep 20 16:41:50 igel metricq-sink-nsca[751]:   File "uvloop/loop.pyx", line 1492, in uvloop.loop.Loop.run_until_complete
Sep 20 16:41:50 igel metricq-sink-nsca[751]: RuntimeError: Event loop stopped before Future completed.

Improved logging: per-check loggers, runtime configuration

Currently, tracing the right log messages is hard: either you log too much or not enough. Look into ways to improve this situation.

  • per-check loggers: filtering log entries by check makes it easier to debug specific error conditions that only apply to a select set of checks
  • runtime configuration: providing per-check/per-component log levels via configure-RPC allows toggling log levels without restarting the client
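
Both ideas can be sketched with the standard `logging` module alone. The logger name `metricq_sink_nsca.check` is taken from the log output in these issues; the helpers `check_logger` and `apply_log_levels`, and the shape of the level mapping, are assumptions for illustration:

```python
import logging
from typing import Mapping

BASE_LOGGER = "metricq_sink_nsca.check"


def check_logger(check_name: str) -> logging.Logger:
    """Return a child logger dedicated to a single check."""
    return logging.getLogger(f"{BASE_LOGGER}.{check_name}")


def apply_log_levels(levels: Mapping[str, str]) -> None:
    """Set log levels by logger name, e.g. from a received configuration."""
    for logger_name, level in levels.items():
        logging.getLogger(logger_name).setLevel(level)


# Enable debug output for one check only; other checks keep inheriting
# their level from the parent logger.
apply_log_levels({f"{BASE_LOGGER}.Check-XYZ": "DEBUG"})
check_logger("Check-XYZ").debug("only visible for this check")
```

Because child loggers inherit from their parent, levels can be changed at runtime without touching handlers. Check names containing dots would create additional hierarchy levels, which may or may not be wanted.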

Improve handling of non-monotonic metric values

In theory, all metrics should only produce strictly monotonically increasing timestamps.

Practice is different though. Currently, the checker fails with:

ValueError: Failed to update state history of 'elab.ariel.s1.package.power.1Hz'
[2021-02-03 15:08:57,053] [ERROR] [metricq_sink_nsca.check] Unhandled exception when checking values for 'elab.ariel.s1.dram.power.1Hz'
Traceback (most recent call last):
  File "/home/service/envs/nsca/lib/python3.9/site-packages/metricq_sink_nsca/state_cache.py", line 417, in update_state
    metric_history.insert(time=timestamp, state=state)
  File "/home/service/envs/nsca/lib/python3.9/site-packages/metricq_sink_nsca/state_cache.py", line 133, in insert
    raise ValueError(

and cannot update the state of the corresponding metric.

Options:

  1. Ignore non-monotonics
  2. Ignore them, but make the behaviour configurable
  3. Keep it this way.

Throttling notifications

Not sure what exactly is going wrong, but today it seems 13k notifications were created for the elab checks within 25 minutes...

Whatever the circumstances, this must not happen.

Rumor has it that, if no one has cleared the backlog, Centreon is still sending mails right now...
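
One possible mitigation on the sink side is to suppress repeated reports of the same state within a minimum interval. The following is a sketch under that assumption, not a description of what the project does; the class `ReportThrottle` and its parameters are hypothetical:

```python
import time
from typing import Callable, Dict, Tuple


class ReportThrottle:
    """Suppress identical reports for a service within `min_interval` seconds."""

    def __init__(
        self, min_interval: float, clock: Callable[[], float] = time.monotonic
    ) -> None:
        self._min_interval = min_interval
        self._clock = clock
        # service name -> (last reported state, time of that report)
        self._last: Dict[str, Tuple[str, float]] = {}

    def should_send(self, service: str, state: str) -> bool:
        now = self._clock()
        previous = self._last.get(service)
        if previous is not None:
            last_state, last_sent = previous
            if last_state == state and now - last_sent < self._min_interval:
                return False
        # State changes and sufficiently old reports are always sent.
        self._last[service] = (state, now)
        return True
```

State changes pass through immediately, so a rapidly flapping service would still produce many reports. Bounding that case would need an additional limit, for example a cap on reports per service and time window.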

Gracefully handle non-monotonic metrics

if time <= latest_transition.time:
    raise ValueError(
        f"Times of state transitions must be strictly increasing: "
        f"new transition at {time.posix_ns} is not after "
        f"latest transition at {latest_transition.time.posix_ns}"
    )

If triggered here, this crashes the sink.

Plan of action (done in 720bc5e):

  • log descriptive error
  • catch exception
  • send CRITICAL report
  • (optionally) handle a lot more errors this way
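
The first three steps of this plan can be sketched as follows. This is a simplified stand-in and not the code from the referenced commit: `History`, `update_state`, the integer timestamps, and the report text are all hypothetical:

```python
import logging
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

logger = logging.getLogger("example.check")


@dataclass
class History:
    """Simplified state history: a list of (time in ns, state) transitions."""
    transitions: List[Tuple[int, str]] = field(default_factory=list)

    def insert(self, time_ns: int, state: str) -> None:
        if self.transitions and time_ns <= self.transitions[-1][0]:
            raise ValueError(
                "Times of state transitions must be strictly increasing: "
                f"new transition at {time_ns} is not after "
                f"latest transition at {self.transitions[-1][0]}"
            )
        self.transitions.append((time_ns, state))


def update_state(
    history: History, metric: str, time_ns: int, state: str
) -> Optional[str]:
    """Insert a transition; return a CRITICAL report message on failure."""
    try:
        history.insert(time_ns, state)
    except ValueError as error:
        # Log a descriptive error instead of letting the exception escape.
        logger.error("Failed to update state history of %r: %s", metric, error)
        return f"CRITICAL: non-monotonic timestamp for {metric}"
    return None
```

The caller would turn a non-`None` return value into a CRITICAL report and keep running, leaving the existing history untouched.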

Make parsing of DataChunks optional

There are cases when the parsing of the DataChunk should be omitted. For example, I'm trying to just send a warning when a certain metric hasn't received new values.

The checks are looking like this:

    "source_elab_lmg95": {
      "metrics": [
        "elab.ariel.s0.package.power",
        "elab.ariel.s1.package.power",
        "elab.ariel.s0.dram.power",
        "elab.ariel.s1.dram.power"
      ],
      "timeout": "900s"
    },

Those metrics aggregate to around 600k samples/s. There is no way Python can process that amount.
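
A timeout-only check could, in principle, record just the arrival time of each message and never decode the payload. The sketch below assumes the metric name is available without parsing the DataChunk (for example from the message's routing key); `TimeoutTracker` and its methods are hypothetical names:

```python
import time
from typing import Callable, Dict, Iterable, List


class TimeoutTracker:
    """Track when each metric was last seen, without decoding payloads."""

    def __init__(
        self,
        metrics: Iterable[str],
        timeout: float,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._timeout = timeout
        self._clock = clock
        start = clock()
        self._last_seen: Dict[str, float] = {m: start for m in metrics}

    def on_message(self, metric: str, payload: bytes) -> None:
        # The payload is deliberately left undecoded; only arrival matters.
        if metric in self._last_seen:
            self._last_seen[metric] = self._clock()

    def timed_out(self) -> List[str]:
        """Return all metrics not seen for longer than the timeout."""
        now = self._clock()
        return sorted(
            metric
            for metric, seen in self._last_seen.items()
            if now - seen > self._timeout
        )
```

This uses the local arrival time rather than the timestamps inside the chunk, so it detects "no messages received" but not "messages received with stale timestamps".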

Client fails to resubscribe after cluster restart

During a cluster restart, clients failed to correctly obtain a new data queue, thus they reported a CRITICAL timeout.

Here's a log:

[2021-04-22 12:23:48,959] [INFO ] [metricq.agent       ] Connection closed: (320, 'CONNECTION_FORCED - Node was put into maintenance mode') (ConnectionClosed)
[2021-04-22 12:23:48,960] [INFO ] [aio_pika.robust_connection] Connection to amqps://user_xxx:******@host_yyy:5671/ closed. Reconnecting after 5 seconds.
[2021-04-22 12:23:51,960] [INFO ] [metricq.agent       ] Connection closed: 0 bytes read on a total of 1 expected bytes (ConnectionError)
[2021-04-22 12:23:51,961] [INFO ] [aio_pika.robust_connection] Connection to amqps://user_xxx:******@host_yyy:5671/data closed. Reconnecting after 5 seconds.
[2021-04-22 12:23:54,000] [INFO ] [metricq.agent       ] Reconnected to amqps://user_xxx:******@host_yyy:5671/
[2021-04-22 12:23:56,985] [INFO ] [metricq.agent       ] Reconnected to amqps://user_xxx:******@host_yyy:5671/data
[2021-04-22 12:23:56,985] [INFO ] [metricq.sink        ] Sink data connection (amqps://user_xxx:******@host_yyy:5671/data) reestablished!
[2021-04-22 12:23:56,985] [INFO ] [metricq.sink        ] Resubscribing to 3881 metric(s) with RPC parameters {'dataQueue': '<redacted>'}...
[2021-04-22 12:23:57,001] [INFO ] [metricq.agent       ] sending RPC sink.subscribe, ex: metricq.management, rk: sink.subscribe, ci: <redacted>, args: {"function": "sink.subscribe", "metrics": [...] }
[2021-04-22 12:24:25,498] [INFO ] [metricq.agent       ] received message from manager, correlation id: <redacted>, reply_to: None, length: 1291442
{"dataServerAddress": "amqps://host_yyy:5671/data", "dataQueue": "<redacted>", "metrics": { ... }
[2021-04-22 12:24:25,515] [INFO ] [metricq.agent       ] rpc completed in 28.5290430621244 s
[2021-04-22 12:31:06,112] [WARNING] [metricq_sink_nsca.check] Check 'Check-XYZ': <redacted> timed out after 300.0s
[2021-04-22 12:31:06,113] [WARNING] [metricq_sink_nsca.check] Check 'Check-XYZ': <redacted> timed out after 300.0s
[2021-04-22 12:31:06,113] [WARNING] [metricq_sink_nsca.check] Check 'Check-XYZ': <redacted> timed out after 300.0s
[2021-04-22 12:31:06,113] [WARNING] [metricq_sink_nsca.check] Check 'Check-XYZ': <redacted> timed out after 300.0s
[2021-04-22 12:31:06,113] [WARNING] [metricq_sink_nsca.check] Check 'Check-XYZ': <redacted> timed out after 300.0s
[2021-04-22 12:31:06,113] [WARNING] [metricq_sink_nsca.check] Check 'Check-XYZ': <redacted> timed out after 300.0s
[2021-04-22 12:31:06,114] [WARNING] [metricq_sink_nsca.check] Check 'Check-XYZ': <redacted> timed out after 300.0s
[2021-04-22 12:31:06,114] [WARNING] [metricq_sink_nsca.check] Check 'Check-UVW': <redacted> timed out after 300.0s
[2021-04-22 12:31:06,114] [WARNING] [metricq_sink_nsca.check] Check 'Check-UVW': <redacted> timed out after 300.0s
[2021-04-22 12:31:06,114] [WARNING] [metricq_sink_nsca.check] Check 'Check-ABC': <redacted> timed out after 300.0s
[2021-04-22 12:31:06,126] [WARNING] [metricq_sink_nsca.check] Check 'Check-UVW': <redacted> timed out after 300.0s
[2021-04-22 12:31:06,126] [WARNING] [metricq_sink_nsca.check] Check 'Check-UVW': <redacted> timed out after 300.0s

NSCA host puts service into `UNKNOWN` state, but client seems operational

We encountered an error where the NSCA host displayed a service in state UNKNOWN since it did not receive check results for a long time. Nonetheless, the metricq-sink-nsca client seemed to run without any issue.
After restarting the client, the problem vanished and the service state recovered.

This might have been caused by one of the following:

  • there is a bug in metricq-sink-nsca where it continues to consume metric data, but does not send any new reports.
  • (unlikely after having a look at the NSCA host logs): metricq-sink-nsca was fully functional, successfully sending check results, but the NSCA host dropped them along the way

In the latter case, we should debug the problem by logging whenever a report was sent successfully. I released version 1.8.1 that includes more log messages when reports are sent.

TODO:

  • actually figure out why the client refuses to send reports randomly
