mrhkyn / ceph-monitoring-with-elk

This project monitors many metrics of the components of a CEPH cluster (version: mimic, bluestore), covering both the physical hosts and the OSDs, using Logstash, Elasticsearch and Kibana.

License: GNU General Public License v3.0

CEPH Monitoring with ELK

This project uses the ELK (ElasticSearch, Logstash, Kibana) stack and Grafana to monitor a CEPH cluster (mimic: 13.2.0) deployed on 4 OSD hosts. The metrics of the physical hosts, such as CPU load, memory usage, interface traffic and the SATA and SSD disks behind each OSD, are collected with the collectd tool using the built-in plugins installed on each host. The CEPH-specific metrics, such as OSD disk capacity, read and write latency, and pool statistics, are collected with collectd using the ceph-specific plugins developed in another project, which is published here. Since the CEPH version is mimic, the plugins were updated accordingly in this project.

Monitoring metrics

  • CEPH Overview
    • Number of Monitors, OSD up/down, OSD in/out
    • Cluster IOPS/Throughput
    • RADOS benchmark latency for osd and hdd pools
    • OSD disk apply/commit latency
    • OSD disk usage/capacity
  • OSD Disk Performance
    • OSD disk read/write IOPS/Throughput
    • OSD SSD read/write IOPS/Throughput
  • OSD Hosts Stats (for each host)
    • Multiple interface (40G and 10G) rx/tx traffic in terms of packets and octets
    • Memory usage
    • Load
  • Pool Statistics (for each pool)
    • Used space, # of stored Objects, pg/pgp_num
    • IOPS/Throughput per pool

Collectd

It is assumed that the collectd tool is already installed on the OSD hosts. The tool needs to collect two different sets of metrics: those specific to the physical hosts and those specific to CEPH. Each set is configured separately.

Collectd for Physical Hosts

collectd.conf is the default configuration file for collecting the basic measurements of a physical host, so it has to be applied to every host separately. The sampling interval, the host name and the list of plugins in use are set in this file. The plugins listed below are enough in our case.

  • disk
  • interface
  • load
  • memory
  • network

The collected metrics are published to LogStash, which listens on port 25826, so the network plugin can be configured as follows.

<Plugin network>
	<Server "172.16.36.4" "25826">
	</Server>
</Plugin>

Collectd for CEPH

The files under the collectd.conf.d folder were prepared to collect metrics about CEPH usage and performance. Interface traffic and disk performance alone are not enough to assess the health of CEPH. Furthermore, there might be many pools, each of which might be mapped to a different CRUSH map (with HDD or SSD devices). The configuration files are listed below.

  • ceph_hdd_latency.conf
  • ceph_monitor.conf
  • ceph_osd.conf
  • ceph_pg.conf
  • ceph_pool.conf
  • ceph_ssd_latency.conf

We do not need to configure every CEPH host to gather these metrics. Instead, a single OSD host that is eligible to run ceph management commands is enough to collect and publish all ceph-relevant metrics.

The location of the python plugins, the time interval (60 seconds) and the name of the cluster (ceph) are set in these configuration files.

The python files (plugins) are installed under the /usr/lib/collectd/plugins/ceph path. They basically run ceph cli commands such as ceph mon dump and ceph pg dump, and read the results in json format. The plugins have been updated for the latest version of CEPH (Mimic for now) because there are slight changes between the releases.
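
The pattern described above (call the ceph cli, ask for json, read a value from the result) can be sketched as follows. This is a minimal illustration and not the code of the actual plugins: the function names are hypothetical, and the "mons" key is an assumption about the layout of the ceph mon dump output.

```python
import json
import subprocess


def build_ceph_command(cluster, *args):
    """Return the argv list for a ceph cli call that prints json."""
    return ["ceph", "--cluster", cluster, *args, "--format", "json"]


def run_ceph(cluster, *args):
    """Run the ceph cli and return the decoded json document.

    Needs a host that is allowed to run ceph management commands.
    """
    output = subprocess.check_output(build_ceph_command(cluster, *args))
    return json.loads(output)


def count_monitors(mon_dump):
    """Count the monitors in a parsed `ceph mon dump` document.

    Assumption: the monitor list is stored under the "mons" key.
    """
    return len(mon_dump.get("mons", []))
```

On the OSD host that publishes the ceph metrics, count_monitors(run_ceph("ceph", "mon", "dump")) would then give the number of monitors. Keeping the parsing apart from the cli call makes it possible to check the parsing against a saved json sample.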

  • If you have multiple pools based on HDD and SSD as in our example, you might prepare two configuration files and plugins and update the metric names for each one. For example, the hdd and ssd prefixes have been added for the latency plugins (hdd_avg_latency, etc). The names of the metrics are important since they are indexed in ElasticSearch and Grafana visualizes them accordingly.
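
The naming convention in the note above can be sketched in a few lines. The helper name and the sample values are hypothetical; only the hdd/ssd prefixes and the hdd_avg_latency name come from the text.

```python
def prefixed_metrics(device_class, metrics):
    """Prefix every metric name with the device class of its pool."""
    if device_class not in ("hdd", "ssd"):
        raise ValueError("unknown device class: %r" % device_class)
    return {
        "%s_%s" % (device_class, name): value
        for name, value in metrics.items()
    }


if __name__ == "__main__":
    # Illustrative value only, not a measurement.
    print(prefixed_metrics("hdd", {"avg_latency": 0.12}))
```

Keeping the prefix in one place helps the metric names stay consistent between the two latency plugins, which matters because the dashboards select metrics by name.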

Logstash

It is assumed that Logstash, ElasticSearch and Kibana have been installed and work correctly. The ELK stack has been deployed on a docker cluster, which is out of the scope of this document.

LogStash runs on 172.16.36.4 and listens on port 25826, so its configuration file needs to be set as follows. The metrics coming from collectd are tagged with collectdceph to distinguish them from the metrics coming from netflow and OpenStack ceilometer.

input {
  udp {
    port => 25826
    buffer_size => 1452
    codec => collectd { }
    tags => ["collectdceph"]
    type => "collectdceph"
  }
}

output {
  if "collectdceph" in [tags] {
    elasticsearch {
      index => "collectdceph-%{+YYYY.MM.dd}"
      hosts => ["172.16.36.2:9200"]
    }
  }
}
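
When debugging, it helps to know which daily index a given sample lands in. The sketch below reproduces the index name pattern of the output section in Python; the function name is hypothetical. Logstash formats the date part from the event timestamp in UTC, so a sample taken shortly after local midnight can belong to the index of the previous day.

```python
from datetime import datetime, timedelta, timezone


def collectdceph_index(timestamp):
    """Return the daily index name for a timezone-aware event timestamp."""
    if timestamp.tzinfo is None:
        raise ValueError("timestamp must be timezone-aware")
    # The date part of the index name is computed in UTC.
    utc = timestamp.astimezone(timezone.utc)
    return utc.strftime("collectdceph-%Y.%m.%d")


if __name__ == "__main__":
    # 01:00 at UTC+3 is still the previous day in UTC.
    local = datetime(2018, 7, 1, 1, 0, tzinfo=timezone(timedelta(hours=3)))
    print(collectdceph_index(local))  # collectdceph-2018.06.30
```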

Kibana is a very useful tool for inspecting the names of the measurements under each plugin and for running specific queries. I strongly recommend installing Kibana and using it to debug the data flow and the queries.

![image](https://gitlab.com/eakkoyun/ceph-monitoring-with-elk/raw/master/grafana/screenshots/kibana-overview.png)
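
A query of the kind that can be tried in Kibana while debugging might look like the sketch below, which builds an ElasticSearch search body for the most recent samples of one plugin on one host. The field names host and plugin are assumptions about what the collectd codec emits, and the host name zula1 is a hypothetical example; check the real field names in Kibana first.

```python
import json


def build_metric_query(host, plugin, minutes=15, size=20):
    """Build a search body for recent samples of one plugin on one host.

    Assumption: events carry "host" and "plugin" fields.
    """
    return {
        "size": size,
        "sort": [{"@timestamp": {"order": "desc"}}],
        "query": {
            "bool": {
                "filter": [
                    {"match_phrase": {"host": host}},
                    {"match_phrase": {"plugin": plugin}},
                    {"range": {"@timestamp": {"gte": "now-%dm" % minutes}}},
                ]
            }
        },
    }


if __name__ == "__main__":
    # "zula1" is a hypothetical host name following the zula prefix.
    print(json.dumps(build_metric_query("zula1", "load"), indent=2))
```

The printed body can be sent to the search endpoint of the collectdceph-* indices, and the filter list can be extended once the measurement names are known.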

Grafana

The data source for Grafana is ElasticSearch. The metrics are collected by Logstash and indexed under the collectdceph prefix, so the source has to be set with this prefix. Furthermore, the names of the OSD hosts start with the zula prefix. We have defined variables that comprise the host names, so Grafana allows us to pick a specific host and visualize only its metrics on the dashboard.

The json files of the dashboards are listed below.

  • CEPH_OVERVIEW.json
  • OSD_Disk_Performance.json
  • OSD_HOSTS_STATS.json
  • Pool_Statistics.json
