cephmetrics's Introduction

cephmetrics

Cephmetrics is a tool that allows a user to visually monitor various metrics in a running Ceph cluster.

Prerequisites

  • RHEL 7 should be running on all hosts
  • A functional Ceph cluster running version ceph-osd-10.2.7-27.el7cp.x86_64 or later.
  • Another host machine independent of the ceph machines must be available. This host will be used to receive data pushed by the hosts in the Ceph cluster, and will run the dashboard to display that data.
  • A host machine on which to execute ansible-playbook to orchestrate the deployment must be available.
  • Passwordless SSH access from the deploy host to the ceph hosts. The username should be the same for all hosts.
  • Passwordless sudo access on the ceph and dashboard hosts
  • All hosts must share the same DNS domain

Resulting configuration

After running this procedure, you will have the following configuration.

  • The ceph nodes will have collectd installed, along with collector plugins from cephmetrics-collectd
  • The dashboard host will have grafana installed and configured to display various dashboards by querying data received from Ceph nodes via a graphite-web, python-carbon, and python-whisper stack.

Installation

Install cephmetrics-ansible

First, decide which machine you want to use to run ansible-playbook. If you used ceph-ansible to set up your cluster, you may want to reuse that same host to take advantage of the inventory file that was created as part of that process.

Once the host is selected, perform the following steps there. They configure a repo containing the cephmetrics installation code and ansible (version 2.2.3 or later), then install the package:

sudo su -
mkdir ~/cephmetrics
subscription-manager repos --enable rhel-7-server-optional-rpms --enable rhel-7-server-rhscon-2-installer-rpms
curl -L -o /etc/yum.repos.d/cephmetrics.repo http://download.ceph.com/cephmetrics/rpm-master/el7/cephmetrics.repo
yum install cephmetrics-ansible
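If you want to confirm that an existing ansible install meets the 2.2.3 minimum mentioned above, a dotted-version comparison like the following sketch works. `version_ge` is a hypothetical helper written for illustration, not part of cephmetrics:

```python
# Sketch: numeric comparison of dotted version strings, e.g. to check the
# ansible >= 2.2.3 requirement. Assumes purely numeric version components.
def version_ge(installed, required):
    as_tuple = lambda v: tuple(int(part) for part in v.split("."))
    return as_tuple(installed) >= as_tuple(required)

print(version_ge("2.4.0", "2.2.3"))  # True
```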

Create or edit the inventory file

Next, we need an inventory file. If you are running ansible-playbook on a host that previously ran ceph-ansible, you may simply modify /etc/ansible/hosts; otherwise you may copy /usr/share/cephmetrics-ansible/inventory.sample and modify it if you wish.

The inventory file format looks like:

[ceph-grafana]
grafana_host.example.com

[osds]
osd0.example.com
osd1.example.com
osd2.example.com

[mons]
mon0.example.com
mon1.example.com
mon2.example.com

[mdss]
mds0.example.com

[rgws]
rgw0.example.com

If you are running ansible-playbook on a host mentioned in the inventory file, you will need to append ansible_connection=local to each line in the inventory file that mentions that host. For example:

my_host.example.com ansible_connection=local

Omit the mdss section if no Ceph MDS nodes are installed. Omit the rgws section if no RGW nodes are installed.

Ansible variables can be set in a vars.yml file if necessary. If you use one, make sure to add -e '@/path/to/vars.yml' to your ansible-playbook invocation below.
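If it helps to see the shape, a vars.yml might look like the following. The variable names devel_mode and use_epel are assumptions here (they appear as task conditions in the playbook), not documented user-facing settings:

```yaml
# /path/to/vars.yml -- illustrative only; variable names are assumptions
devel_mode: false
use_epel: false
```

It would then be passed with -e '@/path/to/vars.yml' on the ansible-playbook command line.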

Deploy via ansible-playbook

If you are using a ceph-ansible host, run these commands:

cd /usr/share/cephmetrics-ansible
ansible-playbook -v playbook.yml

Otherwise, run these commands:

cd /usr/share/cephmetrics-ansible
ansible-playbook -v -i /path/to/inventory playbook.yml

Note: Changing directories is necessary so that ansible-playbook will use the bundled ansible.cfg; there is currently no command-line argument for specifying an arbitrary .cfg file.


cephmetrics's Issues

Merging package repos

At some point, we'll want to merge the repos containing dependencies and cephmetrics itself. I'm not sure if we want to do this manually, or if this chacra feature might help.

Implement bluestore support in the osd collector

The OSD collector provides OS- and OSD-level performance metrics for filestore-based OSDs, but only OS-level metrics for bluestore OSDs.

Bluestore metrics should be exposed within the cephmetrics dashboards

Carbon disk IO causing grafana updates to time out

@pcuzner noticed that on a 10-node cluster, the cephmetrics host was saturating its (spinning) disk's IO capabilities. I noticed that grafana updates were often taking >10s, resulting in their being aborted due to the 10s refresh rate on the dashboards.

num_osd_hosts is calculated incorrectly

The mon collector counts the hosts listed in osd tree output to determine the number of physical hosts, but this is flawed when CRUSH is used to create separate pools: the same physical host can appear multiple times.

Odd per-host metrics counts

I wonder if this might partially explain #45.

$ curl -o /tmp/metrics_index.json http://magna010.ceph.redhat.com:8080/metrics/index.json
$ for i in $(seq 119 128); do host=magna$i; echo -n $host; jq '.' /tmp/metrics_index.json | grep "^ *\"collectd\.$host" | wc -l; done
magna119      93
magna120      93
magna121      93
magna122      93
magna123     169
magna124     157
magna125     157
magna126     516
magna127     516
magna128     532
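The per-host counting done by the shell loop above can be reproduced in Python. The metric names below are a hypothetical three-entry sample standing in for graphite's /metrics/index.json payload:

```python
import json
import re

# Hypothetical trimmed sample of the /metrics/index.json response.
index = json.loads(
    '["collectd.magna119.cpu-0.cpu-idle",'
    ' "collectd.magna119.df-root.df_complex-free",'
    ' "collectd.magna120.cpu-0.cpu-idle"]'
)

def count_for_host(metrics, host):
    # Same filter as the shell one-liner: names starting "collectd.<host>."
    pattern = re.compile(r"^collectd\." + re.escape(host) + r"\.")
    return sum(1 for name in metrics if pattern.match(name))

print(count_for_host(index, "magna119"))  # 2
```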

no utilization stats for SSD journals

For non-collocated OSDs, it would be nice to graph I/O on the journal devices. We might not want to put SSD devices in the same graph as HDD devices for an IOPS or transfer-rate graph, but it would be really useful to see utilization, i.e. are the SSDs over- or undersubscribed?

Add a per-host filter on OSD node detail?

On a cluster with a large number of hosts (36 in this case), the data can be very difficult to read. I'm not sure adding a per-host filter is the right solution, however.

(Screenshots: grafana1, grafana2.)

Logging changes break collectd

d3432ca broke collectd:

Jun 26 16:48:50 magna120 collectd[12136]: Unhandled python exception in loading module: TypeError: __init__() got an unexpected keyword argument 'log_level'

5899e3f helps, but:

Jun 26 16:55:45 magna120 collectd[17899]: Unhandled python exception in loading module: IOError: [Errno 13] Permission denied: '/var/log/collectd-cephmetrics-OSDs.log'

iscsi PR broke collectors

Aug 08 19:44:54 mira120 collectd[28366]: python plugin: Error importing module "cephmetrics".
Aug 08 19:44:54 mira120 collectd[28366]: Unhandled python exception in importing module: ImportError: No module named rtslib_fb
Aug 08 19:44:54 mira120 collectd[28366]: Traceback (most recent call last):
Aug 08 19:44:54 mira120 collectd[28366]:   File "/usr/lib/collectd/cephmetrics/cephmetrics/__init__.py", line 15, in <module>
Aug 08 19:44:54 mira120 collectd[28366]:     from collectors import (common, iscsi, mon, osd, rgw)
Aug 08 19:44:54 mira120 collectd[28366]:   File "/usr/lib/collectd/cephmetrics/cephmetrics/collectors/iscsi.py", line 8, in <module>
Aug 08 19:44:54 mira120 collectd[28366]:     from rtslib_fb import RTSRoot
Aug 08 19:44:54 mira120 collectd[28366]: ImportError: No module named rtslib_fb

Unit tests

We need unit tests. One roadblock is the current organization of the python modules; I'm working on converting cephmetrics' python code into a package.

Another issue is that collectd's python module is only available in the context of collectd itself importing the plugin, due to the way collectd's python.so is constructed.

Closing the packaging gap

I'm using a repo to collect dependencies.

So far we have our own builds of:

collectd-5.7.1-4.el7_3.src.rpm
golang-1.8.3-1.el7_3.src.rpm
grafana-4.3.2-2.el7_3.src.rpm
liboping-1.6.2-2.el7_3.src.rpm
phantomjs-1.9.7-3.el7_3.src.rpm
python-carbon-0.9.15-2.1.el7scon.src.rpm
riemann-c-client-1.6.1-4.el7_3.src.rpm

I've added these to the repo as well. I'm unsure if we'll need to rebuild them.

graphite-web-0.9.15-1.el7scon.src.rpm
python-carbon-0.9.15-2.1.el7scon.src.rpm
python-django-1.6.11-4.el7scon.src.rpm
python-django-tagging-0.3.1-11.1.el7.src.rpm
python-simplejson-3.5.3-6.el7ost.src.rpm
python-whisper-0.9.15-1.1.el7scon.src.rpm

some graphs never fill in

Some of the grafana displays never fill in, and it's really not obvious why. See here for example. Could it state why there is no data if it doesn't display a graph? ceph -s shows the cluster is healthy and has been up for over 1 hour. Some of the graphs have a red triangle in them, and when I hover over it, it just says "internal server error".

RHSC Compatibility

After upgrading our cluster from Ceph 1.3 to 2.3, we now have RHSC installed.
It does roughly the same thing that this project does (collectd and graphite).
If I wanted to use this project instead of RHSC, should it be set up against hosts that don't have any connection to RHSC?

RFE: cephmetrics should be able to pull metrics for a dmcrypt OSD

I just got cephmetrics installed and the dashboards are awesome!
Good job, guys. However, cephmetrics is not working on our OSD hosts because we use dmcrypt for our OSD devices, i.e. we deployed the OSDs with ceph-disk prepare --dmcrypt /dev/sdX

Collectd is erroring when getting stats for the disk; here's the error message from collectd:

Jul 27 13:16:35 data4-lab-00.someinternaldns.com collectd[191421]: Initialization complete, entering read-loop.
Jul 27 13:16:35 data4-lab-00.someinternaldns.com collectd[191421]: Unhandled python exception in read callback: IOError: [Errno 2] No such file or directory: '/sys/block/eeabbcadfcc/queue/rotational'
Jul 27 13:16:35 data4-lab-00.someinternaldns.com collectd[191421]: read-function of plugin `python.cephmetrics' failed. Will suspend it for 20.000 seconds.

here's an output of our /proc/mounts:
/dev/mapper/e970914e-8011-41a1-b9b3-c6a4d0fc7c33 /var/lib/ceph/osd/ceph-2 xfs rw,context=unconfined_u:object_r:var_lib_t:s0,noatime,attr2,inode64,noquota 0 0
/dev/mapper/436ae319-e27a-46d3-8b8e-2dab007999ca /var/lib/ceph/osd/ceph-10 xfs rw,context=unconfined_u:object_r:var_lib_t:s0,noatime,attr2,inode64,noquota 0 0
/dev/mapper/466d6da6-de7a-43ba-9516-462e94fddbf1 /var/lib/ceph/osd/ceph-33 xfs rw,context=unconfined_u:object_r:var_lib_t:s0,noatime,attr2,inode64,noquota 0 0
/dev/mapper/25f3735b-6fdf-4aa2-a876-c0af2f79f4ad /var/lib/ceph/osd/ceph-14 xfs rw,context=unconfined_u:object_r:var_lib_t:s0,noatime,attr2,inode64,noquota 0 0
/dev/mapper/97c186a3-6ccc-4983-ad26-cec38316e7ea /var/lib/ceph/osd/ceph-22 xfs rw,context=unconfined_u:object_r:var_lib_t:s0,noatime,attr2,inode64,noquota 0 0
/dev/mapper/3d5be4ae-7cf3-4308-8838-c2a65e0e64ed /var/lib/ceph/osd/ceph-30 xfs rw,context=unconfined_u:object_r:var_lib_t:s0,noatime,attr2,inode64,noquota 0 0
/dev/mapper/75c55bbc-e4fc-4d1c-86a0-ffe20ec90ffb /var/lib/ceph/osd/ceph-4 xfs rw,context=unconfined_u:object_r:var_lib_t:s0,noatime,attr2,inode64,noquota 0 0
/dev/mapper/b0c64975-1b2b-4cdd-94b2-93bbee656fa4 /var/lib/ceph/osd/ceph-34 xfs rw,context=unconfined_u:object_r:var_lib_t:s0,noatime,attr2,inode64,noquota 0 0
/dev/mapper/c20c2b15-fd65-4bce-869c-c5f3ece25875 /var/lib/ceph/osd/ceph-31 xfs rw,context=unconfined_u:object_r:var_lib_t:s0,noatime,attr2,inode64,noquota 0 0
/dev/mapper/d110d34f-8f10-4e75-a648-7a97d682b0a1 /var/lib/ceph/osd/ceph-17 xfs rw,context=unconfined_u:object_r:var_lib_t:s0,noatime,attr2,inode64,noquota 0 0
/dev/mapper/d467b2d2-da43-4eaf-a48b-77f86c659a28 /var/lib/ceph/osd/ceph-25 xfs rw,context=unconfined_u:object_r:var_lib_t:s0,noatime,attr2,inode64,noquota 0 0
/dev/mapper/356b6b81-a794-4dd0-915e-efaf36c4216d /var/lib/ceph/osd/ceph-7 xfs rw,context=unconfined_u:object_r:var_lib_t:s0,noatime,attr2,inode64,noquota 0 0
/dev/mapper/ccbf012d-7aee-41cf-b788-e6ab3455d25f /var/lib/ceph/osd/ceph-18 xfs rw,context=unconfined_u:object_r:var_lib_t:s0,noatime,attr2,inode64,noquota 0 0
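One way a collector could tolerate device names that have no /sys/block entry (as with the dm-crypt name in the traceback above) is to treat a missing sysfs path as "unknown" instead of letting the IOError propagate. A minimal sketch, not the actual cephmetrics collector code:

```python
import os

def is_rotational(dev_name):
    """Read /sys/block/<dev>/queue/rotational; return True/False, or None
    when the sysfs path does not exist for this device name."""
    path = os.path.join("/sys/block", dev_name, "queue", "rotational")
    try:
        with open(path) as f:
            return f.read().strip() == "1"
    except (IOError, OSError):
        return None

print(is_rotational("this-device-does-not-exist"))  # None
```

Callers would then skip disk stats for devices that report None rather than crashing the whole read callback.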

MON's and OSD's should not have apostrophes

I know style guides differ about these, but most of our documentation and more-common modern usage seems to indicate there should be no apostrophe (it connotes possession, which is not what is meant).

http://www.grammarbook.com/punctuation/apostro.asp Rule 6 is one discussion. The final entry in https://en.oxforddictionaries.com/punctuation/apostrophe is another. Both seem to agree that it's useful to use an apostrophe for plurals in the case of single letters or digits (i.e. "A's" meaning more than one A), presumably because it's so easy to miss the distinction and consider it part of one word rather than a plural.

Anyway I think the majority opinion is 'no apostrophe'.

Installation failed missing dependencies

Deployment failed because of a missing dependency; it requires this package: https://access.redhat.com/downloads/content/python-twisted-core/12.2.0-4.el7/x86_64/fd431d51/package

TASK [ceph-grafana : Install packages] ******************************************************************************************
ok: [qcttwcoehd41.qct.com] => (item=unzip) => {"changed": false, "item": "unzip", "msg": "", "rc": 0, "results": ["All packages providing unzip are up to date", ""]}
changed: [qcttwcoehd41.qct.com] => (item=graphite-web) => {"changed": true, "item": "graphite-web", "msg": "", "rc": 0, "results": ["Loaded plugins: product-id, search-disabled-repos, subscription-manager\nResolving Dependencies\n--> Running transaction check\n---> Package graphite-web.noarch 0:0.9.16-1.el7 will be installed\n--> Processing Dependency: python-whisper >= 0.9.16 for package: graphite-web-0.9.16-1.el7.noarch\n--> Processing Dependency: python-django >= 1.3 for package: graphite-web-0.9.16-1.el7.noarch\n--> Processing Dependency: pytz for package: graphite-web-0.9.16-1.el7.noarch\n--> Processing Dependency: python-simplejson for package: graphite-web-0.9.16-1.el7.noarch\n--> Processing Dependency: pycairo for package: graphite-web-0.9.16-1.el7.noarch\n--> Processing Dependency: mod_wsgi for package: graphite-web-0.9.16-1.el7.noarch\n--> Processing Dependency: django-tagging for package: graphite-web-0.9.16-1.el7.noarch\n--> Processing Dependency: dejavu-serif-fonts for package: graphite-web-0.9.16-1.el7.noarch\n--> Processing Dependency: dejavu-sans-fonts for package: graphite-web-0.9.16-1.el7.noarch\n--> Running transaction check\n---> Package dejavu-sans-fonts.noarch 0:2.33-6.el7 will be installed\n--> Processing Dependency: dejavu-fonts-common = 2.33-6.el7 for package: dejavu-sans-fonts-2.33-6.el7.noarch\n---> Package dejavu-serif-fonts.noarch 0:2.33-6.el7 will be installed\n---> Package mod_wsgi.x86_64 0:3.4-12.el7_0 will be installed\n--> Processing Dependency: httpd-mmn = 20120211x8664 for package: mod_wsgi-3.4-12.el7_0.x86_64\n---> Package pycairo.x86_64 0:1.8.10-8.el7 will be installed\n---> Package python-django.noarch 0:1.6.11.6-1.el7 will be installed\n--> Processing Dependency: python-django-bash-completion = 1.6.11.6-1.el7 for package: python-django-1.6.11.6-1.el7.noarch\n---> Package python-django-tagging.noarch 0:0.3.1-11.1.el7 will be installed\n---> Package python-whisper.noarch 0:0.9.16-1.el7 will be installed\n---> Package 
python2-simplejson.x86_64 0:3.10.0-1.el7 will be installed\n---> Package pytz.noarch 0:2012d-5.el7 will be installed\n--> Running transaction check\n---> Package dejavu-fonts-common.noarch 0:2.33-6.el7 will be installed\n---> Package httpd.x86_64 0:2.4.6-45.el7_3.4 will be installed\n--> Processing Dependency: httpd-tools = 2.4.6-45.el7_3.4 for package: httpd-2.4.6-45.el7_3.4.x86_64\n---> Package python-django-bash-completion.noarch 0:1.6.11.6-1.el7 will be installed\n--> Running transaction check\n---> Package httpd-tools.x86_64 0:2.4.6-45.el7_3.4 will be installed\n--> Finished Dependency Resolution\n\nDependencies Resolved\n\n================================================================================\n Package                       Arch   Version          Repository          Size\n================================================================================\nInstalling:\n graphite-web                  noarch 0.9.16-1.el7     epel               1.8 M\nInstalling for dependencies:\n dejavu-fonts-common           noarch 2.33-6.el7       rhel-7-server-rpms  64 k\n dejavu-sans-fonts             noarch 2.33-6.el7       rhel-7-server-rpms 1.4 M\n dejavu-serif-fonts            noarch 2.33-6.el7       rhel-7-server-rpms 777 k\n httpd                         x86_64 2.4.6-45.el7_3.4 rhel-7-server-rpms 1.2 M\n httpd-tools                   x86_64 2.4.6-45.el7_3.4 rhel-7-server-rpms  84 k\n mod_wsgi                      x86_64 3.4-12.el7_0     rhel-7-server-rpms  76 k\n pycairo                       x86_64 1.8.10-8.el7     rhel-7-server-rpms 157 k\n python-django                 noarch 1.6.11.6-1.el7   epel               4.0 M\n python-django-bash-completion noarch 1.6.11.6-1.el7   epel                16 k\n python-django-tagging         noarch 0.3.1-11.1.el7   cephmetrics-noarch  57 k\n python-whisper                noarch 0.9.16-1.el7     epel                42 k\n python2-simplejson            x86_64 3.10.0-1.el7     epel               188 k\n pytz                 
         noarch 2012d-5.el7      rhel-7-server-rpms  38 k\n\nTransaction Summary\n================================================================================\nInstall  1 Package (+13 Dependent packages)\n\nTotal download size: 9.8 M\nInstalled size: 37 M\nDownloading packages:\n--------------------------------------------------------------------------------\nTotal                                              474 kB/s | 9.8 MB  00:21     \nRunning transaction check\nRunning transaction test\nTransaction test succeeded\nRunning transaction\n  Installing : dejavu-fonts-common-2.33-6.el7.noarch                       1/14 \n  Installing : dejavu-serif-fonts-2.33-6.el7.noarch                        2/14 \n  Installing : dejavu-sans-fonts-2.33-6.el7.noarch                         3/14 \n  Installing : python-django-bash-completion-1.6.11.6-1.el7.noarch         4/14 \n  Installing : python-django-1.6.11.6-1.el7.noarch                         5/14 \n  Installing : python-django-tagging-0.3.1-11.1.el7.noarch                 6/14 \n  Installing : pytz-2012d-5.el7.noarch                                     7/14 \n  Installing : pycairo-1.8.10-8.el7.x86_64                                 8/14 \n  Installing : python-whisper-0.9.16-1.el7.noarch                          9/14 \n  Installing : python2-simplejson-3.10.0-1.el7.x86_64                     10/14 \n  Installing : httpd-tools-2.4.6-45.el7_3.4.x86_64                        11/14 \n  Installing : httpd-2.4.6-45.el7_3.4.x86_64                              12/14 \n  Installing : mod_wsgi-3.4-12.el7_0.x86_64                               13/14 \n  Installing : graphite-web-0.9.16-1.el7.noarch                           14/14 \n  Verifying  : python-django-tagging-0.3.1-11.1.el7.noarch                 1/14 \n  Verifying  : mod_wsgi-3.4-12.el7_0.x86_64                                2/14 \n  Verifying  : httpd-tools-2.4.6-45.el7_3.4.x86_64                         3/14 \n  Verifying  : python2-simplejson-3.10.0-1.el7.x86_64   
                   4/14 \n  Verifying  : python-django-1.6.11.6-1.el7.noarch                         5/14 \n  Verifying  : graphite-web-0.9.16-1.el7.noarch                            6/14 \n  Verifying  : dejavu-fonts-common-2.33-6.el7.noarch                       7/14 \n  Verifying  : dejavu-serif-fonts-2.33-6.el7.noarch                        8/14 \n  Verifying  : dejavu-sans-fonts-2.33-6.el7.noarch                         9/14 \n  Verifying  : python-whisper-0.9.16-1.el7.noarch                         10/14 \n  Verifying  : pycairo-1.8.10-8.el7.x86_64                                11/14 \n  Verifying  : pytz-2012d-5.el7.noarch                                    12/14 \n  Verifying  : python-django-bash-completion-1.6.11.6-1.el7.noarch        13/14 \n  Verifying  : httpd-2.4.6-45.el7_3.4.x86_64                              14/14 \n\nInstalled:\n  graphite-web.noarch 0:0.9.16-1.el7                                            \n\nDependency Installed:\n  dejavu-fonts-common.noarch 0:2.33-6.el7                                       \n  dejavu-sans-fonts.noarch 0:2.33-6.el7                                         \n  dejavu-serif-fonts.noarch 0:2.33-6.el7                                        \n  httpd.x86_64 0:2.4.6-45.el7_3.4                                               \n  httpd-tools.x86_64 0:2.4.6-45.el7_3.4                                         \n  mod_wsgi.x86_64 0:3.4-12.el7_0                                                \n  pycairo.x86_64 0:1.8.10-8.el7                                                 \n  python-django.noarch 0:1.6.11.6-1.el7                                         \n  python-django-bash-completion.noarch 0:1.6.11.6-1.el7                         \n  python-django-tagging.noarch 0:0.3.1-11.1.el7                                 \n  python-whisper.noarch 0:0.9.16-1.el7                                          \n  python2-simplejson.x86_64 0:3.10.0-1.el7                                      \n  pytz.noarch 0:2012d-5.el7                      
                               \n\nComplete!\n"]}
failed: [qcttwcoehd41.qct.com] (item=python-carbon) => {"changed": true, "failed": true, "item": "python-carbon", "msg": "Error: Package: python-carbon-0.9.16-1.el7.noarch (epel)\n           Requires: python-twisted-core >= 8.0\n", "rc": 1, "results": ["Loaded plugins: product-id, search-disabled-repos, subscription-manager\nResolving Dependencies\n--> Running transaction check\n---> Package python-carbon.noarch 0:0.9.16-1.el7 will be installed\n--> Processing Dependency: python-twisted-core >= 8.0 for package: python-carbon-0.9.16-1.el7.noarch\n--> Finished Dependency Resolution\nError: Package: python-carbon-0.9.16-1.el7.noarch (epel)\n           Requires: python-twisted-core >= 8.0\n**********************************************************************\nyum can be configured to try to resolve such errors by temporarily enabling\ndisabled repos and searching for missing dependencies.\nTo enable this functionality please set 'notify_only=0' in /etc/yum/pluginconf.d/search-disabled-repos.conf\n**********************************************************************\n\n You could try using --skip-broken to work around the problem\n** Found 1 pre-existing rpmdb problem(s), 'yum check' output follows:\nceph-ansible-1.0.5-32.el7scon.noarch has missing requires of ansible < ('0', '2', None)\n"]}

selinux deployment problem

Internal testing has shown a problem with deploying the selinux policy
fatal: [ip-172-31-40-85]: FAILED! => {"changed": true, "cmd": ["semodule", "-i", "/tmp/cephmetrics.pp"], "delta": "0:00:12.218719", "end": "2017-07-11 17:34:41.236180", "failed": true, "rc": 1, "start": "2017-07-11 17:34:29.017461", "stderr": "Failed to resolve typeattributeset statement at /etc/selinux/targeted/tmp/modules/400/cephmetrics/cil:3\nsemodule: Failed!", "stdout": "", "stdout_lines": [], "warnings": []}

ref. http://perf1.perf.lab.eng.bos.redhat.com/bengland/tmp/cephmetrics-ansible.log

CentOS compatibility

I have a hyper-converged cloud deployed via tripleo quickstart with 3 controllers and 3 OSD compute nodes, and I've attempted to deploy cephmetrics via ansible, but it failed. The nodes are based on CentOS images, not RHEL images, so the install fails because the playbook uses subscription-manager to do something which is not possible on a CentOS image, I believe. Is there a way around this? I can provide more information if necessary.

OSD hosts calculation is incorrect

The mon collector determines the number of OSD hosts in the cluster by looking at the number of unique host names in ceph osd tree. This strategy fails, however, when a custom crushmap is used to set up different pools across the same physical hosts.

The result is that the OSD Hosts panel shows a higher number of hosts than actually exists.

The OSD collector could count the OSDs it has and store that to graphite. The OSD Hosts panel can then count the number of metrics that contain osd_count to determine the OSD host count, with a countSeries query.
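Deduplicating by host name shows the miscount. The JSON below is a hypothetical, heavily trimmed stand-in for `ceph osd tree -f json` output in which the same physical host appears under two CRUSH hierarchies:

```python
import json

# Hypothetical trimmed `ceph osd tree -f json` shape, not real output.
tree = json.loads("""
{"nodes": [
  {"id": -1, "name": "ssd-root", "type": "root"},
  {"id": -2, "name": "osd0",     "type": "host"},
  {"id": -3, "name": "hdd-root", "type": "root"},
  {"id": -4, "name": "osd0",     "type": "host"},
  {"id": -5, "name": "osd1",     "type": "host"},
  {"id": 0,  "name": "osd.0",    "type": "osd"}
]}
""")

# Naive approach: counts osd0 twice because it appears under both roots.
naive = sum(1 for n in tree["nodes"] if n["type"] == "host")
# Deduplicated: count unique host names instead.
unique = len({n["name"] for n in tree["nodes"] if n["type"] == "host"})
print(naive, unique)  # 3 2
```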

growth and forecast panels showing inconsistent data

The offset used is not anchored to a fixed point in time (e.g. the max over a day, 7 days ago), so the values change in more of a moving-window fashion. This makes the data change frequently, which is not what an admin will expect.
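The difference can be illustrated by anchoring the comparison window to a fixed boundary (midnight, seven days back) rather than a moving "now - 7d" offset, so repeated queries over the same period agree. This is a sketch of the idea, not the panel's actual query logic:

```python
import datetime

def fixed_window(now, days_back=7):
    """Return a one-day window starting at midnight `days_back` days before
    `now`, so the window does not drift as `now` advances within a day."""
    midnight = now.replace(hour=0, minute=0, second=0, microsecond=0)
    start = midnight - datetime.timedelta(days=days_back)
    return start, start + datetime.timedelta(days=1)

start, end = fixed_window(datetime.datetime(2017, 7, 10, 14, 30))
print(start.date(), end.date())  # 2017-07-03 2017-07-04
```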

Flapping in singlestat panels

I've heard reports from two people that the singlestat panels on the at-a-glance dashboard were flapping. I finally saw it myself. These were taken six seconds apart on the internal cluster:

(Screenshots: 2017-06-20 4:55:35 pm and 4:55:41 pm.)

Dashboards not updating properly

A tale of two screenshots. The first is from the Sepia cluster I'm testing on; the second is from the internal cluster. I just reran ansible on both.
(Screenshots: 2017-06-23 12:36:29 pm and 12:40:29 pm.)

Documentation changes needed

The README makes a number of assumptions about the reader and should be updated to reflect:

  1. Repo needs to be added before installing cephmetrics-ansible (i.e. swap yum install cephmetrics-ansible and curl -L <...>)
  2. I think we ought to assume that users will have ceph-ansible already and thus require that playbook to be run from the same machine. That removes the need to add rhel-7-server-rhscon-2-installer-rpms and create a new inventory file, simplifying the ansible command to: ansible-playbook -v playbook.yml
  3. Doc should specify the URL for Grafana after deployment.
  4. Doc should specify that all the hosts should be able to resolve the grafana host.

Thanks to Alexandre Marangone for the feedback

need more control on data retention

A couple of users have now mentioned that they'd like more flexibility in how the retention policy is exposed in the deployment. This could just be a doc error, but it should be addressed.

assigning to zack

collectd-ping and collectd-disk don't get installed on collectd hosts

After the initial deploy, collectd did not start, with an error about a missing ping plugin.
After some troubleshooting we found the ping collector was not installed; once we installed the collectd-ping and collectd-disk packages, collectd was able to start.
Here are the ansible tasks we added to install them:

diff --git a/ansible/roles/ceph-collectd/tasks/install_packages.yml b/ansible/roles/ceph-collectd/tasks/install_packages.yml
index b290ac3..25f70c1 100644
--- a/ansible/roles/ceph-collectd/tasks/install_packages.yml
+++ b/ansible/roles/ceph-collectd/tasks/install_packages.yml
@@ -17,6 +17,26 @@
     - not use_epel
   notify: Restart collectd

+- name: Install collectd-ping
+  package:
+    name: collectd-ping
+    state: latest
+  when:
+    - ansible_pkg_mgr == "yum"
+    - devel_mode
+    - not use_epel
+  notify: Restart collectd
+
+- name: Install collectd-disk
+  package:
+    name: collectd-disk
+    state: latest
+  when:
+    - ansible_pkg_mgr == "yum"
+    - devel_mode
+    - not use_epel
+  notify: Restart collectd
+
 - name: Install cephmetrics-collectors
   package:
    name: cephmetrics-collectors

gracefully handling single OSD failure

I just had an issue with one OSD on a data node where it didn't respond to perf dump on its admin socket. This caused an error message from collectd, which resulted in no stats being collected at all.
I wonder if there could be some better logic around failures of collecting 'perf dump'

Aug 05 07:01:07  collectd[84657]: read-function of plugin `python.cephmetrics' failed. Will suspend it for 320.000 seconds.
Aug 05 07:06:27  collectd[84657]: Unhandled python exception in read callback: AttributeError: 'NoneType' object has no attribute 'get'
Aug 05 07:06:27  collectd[84657]: read-function of plugin `python.cephmetrics' failed. Will suspend it for 640.000 seconds.
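One defensive pattern would be to make the admin-socket query return an empty dict on any failure, so downstream .get() calls never hit the AttributeError shown above. A sketch only; the exact command and error handling are assumptions, not the collector's current code:

```python
import json
import subprocess

def perf_dump(asok_path):
    """Query an OSD admin socket for perf counters, returning {} instead of
    None/raising when the daemon is unresponsive or output is malformed."""
    try:
        out = subprocess.check_output(
            ["ceph", "--admin-daemon", asok_path, "perf", "dump"])
        return json.loads(out) or {}
    except (OSError, subprocess.CalledProcessError, ValueError):
        return {}

print(perf_dump("/var/run/ceph/nonexistent.asok"))  # {}
```

A per-OSD failure then yields empty stats for that OSD while the rest of the node's metrics are still collected.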

RGW Dashboard initial graph needs to be fixed height

The RGW overview "total requests per sec" panel expands automatically if there are lots of RGW hosts. This is fine, but it throws off the panel layout, leaving the second row of charts offset by this first panel's height.

Fixing the panel height at 230 resolves the issue.

cephmetrics dependencies

Why does the cephmetrics rpm (delivering the dashboards) have a dependency on the collectors rpm?

Error: Package: cephmetrics-0.1-231_ga895841.el7.centos.x86_64 (/cephmetrics-0.1-231_ga895841.el7.centos.x86_64)
Requires: cephmetrics-collectors = 0.1-231_ga895841.el7.centos

allowing all dashboards to reference a user-specified time range

I couldn't get all dashboards to restrict the time range to a specified interval. If you click on a widget within a dashboard and it takes you to another one, the time range is preserved, so the behavior is inconsistent. Even with relative time ranges, if you change the range in one dashboard it isn't changed in the others.
Why is this important? Let's say I have a test result, or there was a performance problem reported at a site at a certain time: we'd like to be able to look at what happened around that time from a variety of perspectives, i.e. dashboards. But the grafana documentation here just says no; see the last sentence.

For example, a link like this one will take you to a particular time interval for a particular dashboard; this one was for yesterday. You can either use the time range GUI, or you can calculate the range in python by computing start and end times as seconds since 1970 (what time.time() returns), multiplying by 1000, and converting to int.
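The epoch-milliseconds arithmetic described above, as a tiny helper (`to_grafana_ms` is a name made up for this sketch):

```python
import time

def to_grafana_ms(epoch_seconds):
    """Convert seconds-since-1970 (what time.time() returns) into the
    integer milliseconds that grafana time-range URLs expect."""
    return int(epoch_seconds * 1000)

# e.g. a one-hour window ending now:
now = time.time()
window = (to_grafana_ms(now - 3600), to_grafana_ms(now))
print(to_grafana_ms(1498003200))  # 1498003200000
```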

I was talking with folks in IRC #rhos-scale, where Browbeat developers (OpenStack's collectd-grafana package) are, about this. One of them ("throck" = Tom Throckmorton) directed me to this grafana issue, which may be a chunk of the problem. Here's my comment on this issue.
