
collectd-ceph's Introduction

This is a fork of rochaporto's version, which seems to have been abandoned around 2015 with plenty of bugs and pull requests lying around. I have tried to incorporate most of the outstanding pull requests and some of the interesting-looking but not overly conflicting forks.

If you have a clean pull request, I will eventually try to merge it in. I don't promise to fix issues, but if anyone does, I'll merge the fix.

I don't really want to maintain this, but I don't want to leave it lying around bug-ridden either; I want to use it.

collectd-ceph

Overview

A set of collectd plugins that monitor and publish metrics for Ceph components.

Screenshots

Sample Ceph Overview dashboard displaying common metrics from the plugins.


Sample Ceph Pool Capacity dashboard; if you are using pool quotas, this sample may be useful.


Check here for the Ceph Overview dashboard definition.

Check here for the Ceph Pool Capacity dashboard definition.

Plugins and Metrics

There are several plugins, usually mapping to the ceph command line tools.

Find below a list of the available plugins and the metrics they publish.

  • ceph_monitor_plugin
    • ceph-<cluster>.mon.gauge.number (total number of monitors)
    • ceph-<cluster>.mon.gauge.quorum (number of monitors in quorum)
  • ceph_osd_plugin
    • ceph-<cluster>.osd.gauge.up (number of osds 'up')
    • ceph-<cluster>.osd.gauge.down (number of osds 'down')
    • ceph-<cluster>.osd.gauge.in (number of osds 'in')
    • ceph-<cluster>.osd.gauge.out (number of osds 'out')
  • ceph_pool_plugin
    • ceph-<cluster>.pool-<name>.gauge.size (per pool size)
    • ceph-<cluster>.pool-<name>.gauge.min_size (per pool min_size)
    • ceph-<cluster>.pool-<name>.gauge.pg_num (per pool pg_num)
    • ceph-<cluster>.pool-<name>.gauge.pgp_num (per pool pg_placement_num)
    • ceph-<cluster>.pool-<name>.gauge.quota_max_bytes (per pool quota_max_bytes)
    • ceph-<cluster>.pool-<name>.gauge.quota_max_objects (per pool quota_max_objects)
    • ceph-<cluster>.pool-<name>.gauge.max_avail (per pool max_available)
    • ceph-<cluster>.pool-<name>.gauge.objects (per pool objects number)
    • ceph-<cluster>.pool-<name>.gauge.read_bytes_sec (per pool read bytes/sec)
    • ceph-<cluster>.pool-<name>.gauge.write_bytes_sec (per pool write bytes/sec)
    • ceph-<cluster>.pool-<name>.gauge.op_per_sec (per pool iops)
    • ceph-<cluster>.pool-<name>.gauge.bytes_used (per pool bytes used)
    • ceph-<cluster>.pool-<name>.gauge.kb_used (per pool KBytes used)
    • ceph-<cluster>.cluster.gauge.total_avail (cluster space available)
    • ceph-<cluster>.cluster.gauge.total_space (cluster total raw space)
    • ceph-<cluster>.cluster.gauge.total_used (cluster raw space used)
  • ceph_pg_plugin
    • ceph-<cluster>.pg.gauge.<state> (number of pgs in <state>)
    • ceph-<cluster>.osd-<id>.gauge.fs_commit_latency (fs commit latency for osd)
    • ceph-<cluster>.osd-<id>.gauge.apply_commit_latency (apply commit latency for osd)
    • ceph-<cluster>.osd-<id>.gauge.kb_used (kb used by osd)
    • ceph-<cluster>.osd-<id>.gauge.kb (total space of osd)
  • ceph_latency_plugin
    • ceph-<cluster>.cluster.gauge.avg_latency (avg cluster latency)
    • ceph-<cluster>.cluster.gauge.max_latency (max cluster latency)
    • ceph-<cluster>.cluster.gauge.min_latency (min cluster latency)
    • ceph-<cluster>.cluster.gauge.stddev_latency (stddev of cluster latency)
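
The metric names above follow collectd's host/plugin/type naming. As a point of reference, here is a minimal, illustrative sketch of dispatching one such gauge with the standard collectd Python API. The field mapping (host carrying the ceph-<cluster> prefix, plugin "mon", type_instance "quorum") is an assumption inferred from the paths above, not code taken from this repo's base.py, and write_graphite may join the type and instance with a dash rather than a dot.

# Illustrative sketch only, not this repo's base.py.
# Shows how a gauge roughly matching ceph-<cluster>.mon.gauge.quorum
# can be emitted through the collectd Python API.
import collectd

def read_callback():
    monitors_in_quorum = 3  # hypothetical value; a real plugin would parse `ceph mon stat`

    val = collectd.Values(type='gauge')
    val.host = 'ceph-ceph'        # assumed to yield the ceph-<cluster> prefix
    val.plugin = 'mon'
    val.type_instance = 'quorum'
    val.values = [monitors_in_quorum]
    val.dispatch()

collectd.register_read(read_callback)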

Requirements

It assumes an existing installation of collectd - check docs for details.

If you want to publish to graphite, configure the write_graphite collectd plugin.
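
A minimal write_graphite configuration might look like the following; the node name, host, and port are placeholders to adapt to your Graphite setup:

LoadPlugin write_graphite

<Plugin write_graphite>
    <Node "graphite">
        Host "localhost"
        Port "2003"
        Protocol "tcp"
        StoreRates true
    </Node>
</Plugin>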

You might also want grafana, which provides great dashboards for these metrics.

Setup and Configuration

The example configurations below assume the plugins are located under /usr/lib/collectd/plugins/ceph.

If you're on Ubuntu, consider installing from this ppa.

Each plugin should have its own config file under /etc/collectd/conf.d/<pluginname>.conf, which should look something like this:

# cat /etc/collectd/conf.d/ceph_pool.conf

<LoadPlugin "python">
    Globals true
</LoadPlugin>

<Plugin "python">
    ModulePath "/usr/lib/collectd/plugins/ceph"

    Import "ceph_pool_plugin"

    <Module "ceph_pool_plugin">
        Verbose "True"
        Cluster "ceph"
        Interval "60"
        TestPool "test"
    </Module>
</Plugin>

Puppet

If you use Puppet for configuration, try this excellent collectd module.

It has plenty of docs on how to use it, but for our specific plugins:

  collectd::plugin::python { 'ceph_pool':
    modulepath => '/usr/lib/collectd/plugins/ceph',
    module     => 'ceph_pool_plugin',
    config     => {
      'Verbose'  => 'true',
      'Cluster'  => 'ceph',
      'Interval' => 60,
      'TestPool' => 'test',
    },
  }

Docker

Check this repo for a nice docker setup to run collectd-ceph (thanks to Ian Babrou).

Limitations

Debian packaging files are provided, but the package is not yet available in the official repos.

Development

All contributions are more than welcome; just send pull requests.

License

GPLv2 (check LICENSE).

Contributors

Support

Please log tickets and issues on the GitHub project page.

Additional Notes

Some handy instructions on how to build for Ubuntu.

collectd-ceph's People

Contributors

bobrik, cfz, gcmalloc, grinapo, kallio, krissn, lloucas-imvu, mourgaya, patchkez, rochaporto, umesecke, y4ns0l0


collectd-ceph's Issues

Apache License

Are you willing to re-license this code under Apache License so it can be used with OpenStack?

Cluster I/O Latency metrics are not gathered

Hi,
Thank you very much for keeping this project alive.
I'm using this y4ns0l0 dockerized solution to monitor my cluster.
I still have problems with Cluster I/O Latency.
I see a "no datapoints" error, and while looking in Graphite I can see that the relevant metrics are not gathered.
Cluster I/O latency: cluster.gauge.latency
What seems to be the problem? Thank you very much!

Unable to get below metrics using this plugin on my mon node

I am unable to get the metrics below in my graphite after using this collectd plugin.

=================================================================
pg_plugin

-rw-r--r-- 1 _graphite _graphite 751720 Sep 18 15:00 gauge-stale.wsp
-rw-r--r-- 1 _graphite _graphite 751720 Sep 18 15:00 gauge-peering.wsp
-rw-r--r-- 1 _graphite _graphite 751720 Sep 18 15:00 gauge-degraded.wsp
-rw-r--r-- 1 _graphite _graphite 751720 Sep 18 15:00 gauge-undersized.wsp
-rw-r--r-- 1 _graphite _graphite 751720 Sep 18 15:00 gauge-remapped.wsp
-rw-r--r-- 1 _graphite _graphite 751720 Sep 18 15:27 gauge-snaptrim_wait.wsp

=================================================================
cluster

-rw-r--r-- 1 _graphite _graphite 751720 Nov 15 03:50 gauge-health_warn.wsp
-rw-r--r-- 1 _graphite _graphite 751720 Nov 15 03:50 gauge-health.wsp
-rw-r--r-- 1 _graphite _graphite 751720 Nov 15 03:50 gauge-health_err.wsp
-rw-r--r-- 1 _graphite _graphite 751720 Nov 15 03:50 gauge-health_ok.wsp

latency plugin needs love: rados bench output should be documented

Right now there's a regexp which tries to parse the bench output, and it fails with my output.
I have patched it up in 5024baa, but the patterns should either be unified or, if that's not possible, thrown into an array and checked in a loop. I (or someone nice) need to check older and newer output and see whether the regexp was buggy or the output has changed considerably.
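
A rough sketch of the array-of-patterns idea; the exact rados bench line formats below are assumptions and need verifying against both older and newer releases:

# Sketch: try several candidate rados bench formats in a loop.
# The patterns themselves are guesses and must be checked per Ceph release.
import re

AVG_LATENCY_PATTERNS = [
    re.compile(r'Average Latency\(s\):\s+([\d.]+)'),  # assumed newer format
    re.compile(r'Average Latency:\s+([\d.]+)'),       # assumed older format
]

def parse_avg_latency(bench_output):
    """Return the average latency from rados bench output, or None if nothing matches."""
    for pattern in AVG_LATENCY_PATTERNS:
        match = pattern.search(bench_output)
        if match:
            return float(match.group(1))
    return None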

ceph-latency fails with UnboundLocalError: local variable 'stats' referenced before assignment

Hi, first: thanks for creating this fork!

We're running the plugins on Debian jessie, Python 2.7. All plugins work except the ceph-latency plugin, which fails with the following errors:

ceph: failed to get stats :: list index out of range :: Traceback (most recent call last):
File "/usr/lib/collectd/plugins/ceph/base.py", line 162, in read_callback
stats = self.get_stats()...
Nov 01 17:13:55 ceph2 collectd[40152]: Unhandled python exception in read callback: UnboundLocalError: local variable 'stats' referenced before assignment

I have already edited the pool name from 'data' to a pool that we actually have, plus I downloaded https://github.com/collectd/collectd/blob/collectd-5.4/contrib/python/getsigchld.py (since jessie comes with collectd 5.4) to /usr/local/lib/python2.7/site-packages.

Neither helped: same error, still.

Anyone with a suggestion?
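
For what it's worth, the UnboundLocalError usually just masks the real failure inside get_stats(). A sketch of a defensive read callback (illustrative only, not the actual base.py; dispatch() is an assumed helper) that bails out cleanly when stats could not be collected:

import collectd

def safe_read_callback(plugin):
    # Guard against get_stats() raising so that 'stats' is never used unassigned.
    stats = None
    try:
        stats = plugin.get_stats()
    except Exception as exc:
        collectd.error('ceph: failed to get stats :: %s' % exc)
    if not stats:
        return  # nothing to dispatch this interval
    plugin.dispatch(stats)  # assumed helper that pushes the values to collectd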

Per OSD and PG information is not gathered (Luminous)

Hi,
I'm using @y4ns0l0's dockerized solution to monitor my cluster.

I can't see metrics like osd-* or pg.
It used to work before...
I can see only: cluster, mon, osd, pool, pool-*
pg and osd-* are missing...
When I run docker logs I see:

ceph: failed to get stats :: 'fs_perf_stat' :: Traceback (most recent call last):
File "/usr/lib/collectd/plugins/ceph/base.py", line 162, in read_callback
stats = self.get_stats()
File "/usr/lib/collectd/plugins/ceph/ceph_pg_plugin.py", line 70, in get_stats
data[ceph_cluster][osd_id]['apply_latency_ms'] = osd['fs_perf_stat']['apply_latency_ms']
KeyError: 'fs_perf_stat'

Unhandled python exception in read callback: UnboundLocalError: local variable 'stats' referenced before assignment
read-function of plugin `python.ceph_pg_plugin' failed. Will suspend it for 15360.000 seconds.

ceph: failed to get stats :: 'fs_perf_stat' :: Traceback (most recent call last):
File "/usr/lib/collectd/plugins/ceph/base.py", line 162, in read_callback
stats = self.get_stats()
File "/usr/lib/collectd/plugins/ceph/ceph_pg_plugin.py", line 70, in get_stats
data[ceph_cluster][osd_id]['apply_latency_ms'] = osd['fs_perf_stat']['apply_latency_ms']
KeyError: 'fs_perf_stat'

Unhandled python exception in read callback: UnboundLocalError: local variable 'stats' referenced before assignment
read-function of plugin `python.ceph_pg_plugin' failed. Will suspend it for 7680.000 seconds.

Am I doing something wrong, or is it something with Luminous?
When I tried this solution before it worked perfectly.
Thank you
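
A possible workaround sketch for the KeyError: look the per-OSD perf dict up tolerantly. 'fs_perf_stat' is the key from the traceback above; 'perf_stat' is an assumed Luminous-era rename, so verify the key names against your `ceph osd perf -f json` output:

# Sketch of a tolerant lookup for the per-OSD perf dict.
def osd_perf_stat(osd):
    # 'fs_perf_stat' comes from the traceback; 'perf_stat' is an assumed newer name.
    for key in ('fs_perf_stat', 'perf_stat'):
        if key in osd:
            return osd[key]
    return {}

# Roughly, in ceph_pg_plugin.get_stats():
#   perf = osd_perf_stat(osd)
#   data[ceph_cluster][osd_id]['apply_latency_ms'] = perf.get('apply_latency_ms', 0)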
