We have just started using graylog-plugin-metrics to relay metrics to our graphite instance, and we're finding extremely curious results, like this (the times on this are at GMT-0600, so 10:00 on the graph is 16:00 GMT):
The graph is built from graylog absorption of nginx access.log entries, which have request_time and upstream_response_time fields. Since the upstream response time must be completely contained within the request time, request_time should be strictly greater than upstream_response_time (it is not possible for a request to have a greater request_time than an upstream_response_time).
Note that it is explicitly possible for a request to not have an upstream_response_time (if it is nginx serving a static file), but we've explicitly adjusted for that -- the graylog stream that we are dealing with restricts events to only those that include an upstream_response_time.
I've validated that within graylog itself, the reality holds and that upstream_response_time is always less than request_time:
From 16:07 to 16:08, mean request_time is 193.43 and mean upstream_response_time is 175.78.
From 16:08 to 16:09, mean request_time is 190.33 and mean upstream_response_time is 163.96.
From 16:09 to 16:10, mean request_time is 219.43 and mean upstream_response_time is 183.53.
From 16:10 to 16:11, mean request_time is 215.08 and mean upstream_response_time is 186.41.
... thus, the data in graylog looks good. The request_time and upstream_response_time fields are sent (as histograms, via graylog-plugin-metrics) to graphite, and result in the following graphite data, however:
$ whisper-fetch.py --from=1458230820 --until=1458231060 request_time/mean.wsp
1458230880 192.900000
1458230940 152.620000
1458231000 221.990000
1458231060 206.160000
$ whisper-fetch.py --from=1458230820 --until=1458231060 upstream_response_time/mean.wsp
1458230880 171.720000
1458230940 223.010000
1458231000 202.560000
1458231060 191.670000
As you can see, the data for 1458230940 (16:09 GMT) shows a significantly higher mean upstream_response_time than request_time. This should not be possible.
As the data in graylog appears correct, I can only assume that the metrics plugin is applying a transform to the data as it builds the necessary histogram, and that this is leading to incorrect output being sent to graphite.
Please let me know if there is further debugging that I can do on this front to help aid in diagnosing this issue.