
metrics's People

Contributors

a1div0, andreyaksenov, artembo, artur-barsegyan, asverdlov, differentialorange, eugenepaniot, filonenko-mikhail, int3cd, kasen, knazarov, lenkis, mmelentiev-mail, nickvolynkin, no1seman, ochaton, olegrok, onvember, opomuc, patiencedaur, printercu, reo7sp, runsfor, vanyarock01, vasiliy-t, vpotseluyko, xuniq, ylobankov, yngvar-antonsson, zwirec


metrics's Issues

CI checks with promtool

I propose adding validation of the exported metrics with promtool to CI, to make sure that Prometheus accepts them.

Graphite values ULL suffix

Using Graphite 1.1.7 (docker image graphiteapp/graphite-statsd), I found that Graphite ignores some of the module's default metrics because of the ULL suffix on their values.

Log sample:

dockerd-current[19642]: 09/07/2020 10:46:34 :: [listener] invalid line received from 172.17.0.1, ignoring [myapp.tnt_space_total_bsize;name=transfers;engine=memtx 11511360ULL 1594291594]
dockerd-current[19642]: 09/07/2020 10:46:34 :: [listener] invalid line received from 172.17.0.1, ignoring [myapp.tnt_space_total_bsize;name=kv;engine=memtx 49379ULL 1594291594]
dockerd-current[19642]: 09/07/2020 10:46:34 :: [listener] invalid line received from 172.17.0.1, ignoring [myapp.tnt_space_total_bsize;name=attempts_count;engine=memtx 98850ULL 1594291594]

The same behavior was noticed with the metrics tnt_space_bsize and tnt_cfg_current_time.
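
The suffix comes from stringifying LuaJIT 64-bit cdata values. A minimal sketch, assuming the Graphite plugin formats values via tostring(), of why the suffix appears and one way to strip it:

-- LuaJIT renders uint64 cdata with a "ULL" suffix, which Graphite rejects.
local value = 11511360ULL              -- cdata<uint64_t>, e.g. a space bsize
print(tostring(value))                 -- "11511360ULL"  -> invalid line for Graphite
print(tostring(tonumber(value)))       -- "11511360"     -> accepted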

Add doc for ``average`` collector

There is no description of what it is and how it works.

Average
Can be used only as a collector for HTTP statistics (described below) and cannot be built explicitly.

cartridge metrics

We could collect metrics such as:

  1. A new metric issues_count of type gauge, whose value is the number of cluster issues this instance knows about. This should be good enough for basic alerting: a healthy cluster reports 0 issues (see the sketch after this list). - closed in #243 and #144
  2. Cartridge instance state (like OperationError) as a numerical value -- needs design
  3. Time since last restart -- already present as the tnt_info_uptime metric
  4. Failover trigger count -- could be derived from the tnt_read_only metric, but needs design too
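
A sketch of item 1, assuming cartridge exposes the instance's issue list via cartridge.issues (the function name list_on_instance and the metric name are assumptions):

local metrics = require('metrics')
local issues = require('cartridge.issues')  -- assumption: cartridge issues API

local issues_count = metrics.gauge('cartridge_issues_count',
    'Number of cluster issues known to this instance')

-- Re-evaluated on every collect, so the gauge stays current.
metrics.register_callback(function()
    issues_count:set(#issues.list_on_instance())
end)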

Add optional global labels

Sometimes we need to add a label (e.g. the instance alias) to every metric we collect. But it's inconvenient to pass or change it (or even several of them) on every create and update operation we call. This problem can be solved by a global_labels table in metrics, which is appended to the labels of collected metrics.

Possible solutions:

  • Append global labels to every label_pairs field in each metric on its creation/update. This is a driver-independent and straightforward solution, but it requires extra memory and a table copy on each metric update. It would also be harder to change global labels along the way.
  • Append global labels to label_pairs on output. This is simpler to code, because it doesn't change the inner logic, and it requires fewer memory operations and less storage. It would also be easier to change labels along the way. On the other hand, it's driver-dependent.
  • Append global labels to label_pairs in the Shared:collect(...) method. This combines the positive aspects of both previous solutions (driver-independent, easy to change along the way, no extra memory or memory operations, no change to the inner update logic) and doesn't inherit any significant disadvantage. A usage sketch follows this list.
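
A sketch of how this could look from the user's side. The function name set_global_labels is an assumption here, not something specified by this issue:

local metrics = require('metrics')

-- Set once; appended to every metric's label_pairs when metrics are collected.
metrics.set_global_labels({ alias = 'router-1' })

local c = metrics.counter('http_requests_total')
c:inc(1, { method = 'GET' })
-- metrics.collect() output now carries both {method="GET"} and {alias="router-1"}.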

Wrong metric type

Default metrics report tnt_cfg_listen with a string value, which is not supported by Prometheus.

local metrics = require('metrics')
local http_router = require('http.router')
local http_server = require('http.server')
local http_handler = require('metrics.plugins.prometheus').collect_http

box.cfg{
    listen = '0.0.0.0:3301',
}

metrics.enable_default_metrics()
local httpd = http_server.new('0.0.0.0', 8088, {log_requests = true})
local router = http_router.new():route({path = '/metrics'}, http_handler)
httpd:set_router(router)
httpd:start()
The resulting exposition output and Prometheus scrape error:

# HELP tnt_cfg_listen Tarantool port
# TYPE tnt_cfg_listen gauge
tnt_cfg_listen 0.0.0.0:3301
level=warn ts=2020-01-24T14:45:02.109Z caller=scrape.go:930 component="scrape manager" scrape_pool=tarantool target=http://test5.tarantool.e:8088/metrics msg="append failed" err="strconv.ParseFloat: parsing \"0.0.0.0:3301\": invalid syntax"

Summary: documentation

We need to add documentation and examples for the summary collector to the README and the Getting Started docs.

Expand documentation on http_middleware module

It is not clear from the example that, for this to work correctly, you need ONLY ONE collector per router, preferably even the default collector.

Otherwise you get inconsistent metric names and a lot of metrics, like:

  {
    "label_pairs": {
      "path": "/labels",
      "method": "GET",
      "status": 200,
      "alias": "tnt-router"
    },
    "timestamp": 1600351181350306,
    "metric_name": "labels_latency_avg",
    "value": 0
  },

instead of:

  {
    "label_pairs": {
      "path": "/labels",
      "method": "GET",
      "status": 200,
      "alias": "tnt-router"
    },
    "timestamp": 1600352083432399,
    "metric_name": "http_server_request_latency_avg",
    "value": 0
  },

So when registering routes for endpoints, everything must use the same collector. This can be done with the default collector, something along the lines of:

local http_middleware = require('cartridge').service_get('metrics').http_middleware

-- Build the collector once (e.g. as a module-level variable) and reuse it for every route.
if http_collector == nil then
    http_collector = http_middleware.build_default_collector('average')
end

server:route({ path = "/labels", method = "ANY" }, http_middleware.v1(handler, http_collector))

Add metrics for replication status

Please add metrics for the downstream and upstream replication status. Currently there are only lag and lsn, but that is not enough to detect that a replica has broken.
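
A sketch of how the upstream status could be exported as a numeric gauge from box.info.replication. The metric name and the 1/0 encoding are illustrative, not the library's actual design:

local metrics = require('metrics')

local upstream_status = metrics.gauge('tnt_replication_upstream_status',
    'Upstream replication status (1 = follow, 0 = anything else)')

metrics.register_callback(function()
    for id, replica in pairs(box.info.replication) do
        if replica.upstream ~= nil then
            upstream_status:set(replica.upstream.status == 'follow' and 1 or 0, { id = id })
        end
    end
end)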

Invalid output for Prometheus by default_metrics

Some values from enable_default_metrics() may have the ULL suffix for long numbers, which Prometheus doesn't recognize.

Quick solution:

-- Workaround: strip the ULL suffix from the exposition body before serving it.
local ret = prometheus.collect_http(req)
ret.body = ret.body:gsub("ULL", "")

Add in-depth description of default metrics

We need to add an in-depth description of the default metrics, because the link to the deprecated stat repo contains outdated info and no descriptions.

To answer the question "What do the default metrics consist of?", one currently has to start a Tarantool instance, collect the default metrics (to get their list), and then search the net for their meaning (e.g. https://www.tarantool.io/ru/doc/1.10/reference/reference_lua/fiber/#fiber-info and https://www.tarantool.io/en/doc/1.10/reference/reference_lua/box_slab/).

tnt_stats_op_* metric is not convenient

The current version contains two kinds of metrics, tnt_stats_op_*_total and tnt_stats_op_*_rps, where the asterisk is an operation name. I suggest making the operation a tag instead, e.g. tnt_stats_op_select_total would become tnt_stats_op_total{operation="select"} (the label name here is illustrative).

Metric info_vclock_* is inconvenient

In the current version the vclock metrics look like info_vclock_1, info_vclock_2, etc. I think that is not convenient. I suggest making the vclock component number a tag, e.g. info_vclock{id="1"} (the label name here is illustrative).

Graphite time in seconds

Graphite version: 1.1.7.

The user states that Graphite accepts time only in seconds, but we send microseconds. This results in broken graphs.

User suggestion: replace

ts = tostring(fiber.time64()):sub(1, -4) -- Delete ULL suffix

with

ts = tostring(fiber.time64()):sub(1, -10) -- Delete ULL suffix and 6 digits (microseconds -> seconds)

Rename default metrics

Prometheus has some problems with the '_count' and '_total' suffixes on non-summary and non-histogram metrics.

Port to CMake

Currently the list of files to be installed is hardcoded in both the rockspec and the RPM spec. This should not happen. Instead, the package should be installed with 'make install'.

Please port the build script from Makefile to CMake, and implement packaging for Debian (currently absent).

rocks tests

It would be nice to add some tests for tarantoolctl rocks install ... and luarocks install ..., run as rocks, since after #101 the package contains a dynamic library.

Declare metrics cartridge role as permanent

If I enable this role in my init.lua, I accept that metrics will be enabled on ALL instances. It would be really strange to enable metrics per replica set.

So I propose:

  • Declare the metrics role as permanent (an explicit entry for it in a long roles list is arguably redundant anyway); a sketch follows.
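
A minimal sketch of a permanent role declaration under cartridge's role contract (the init body is illustrative, not the actual role code):

local metrics = require('metrics')

local function init(opts)  -- luacheck: no unused args
    metrics.enable_default_metrics()
end

return {
    role_name = 'metrics',
    permanent = true,  -- cartridge enables permanent roles on every instance
    init = init,
}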

Don't create a counter if it already exists

Currently if you do this:

local counter = metrics.counter('foobar')

Then it creates a new counter object every time. This means that users have to store counter objects carefully.

What I propose is to make such calls idempotent: if the parameters (like histogram buckets, etc.) don't change, just return the existing counter.
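
A minimal sketch of the proposed idempotent behaviour (a toy registry, not the library's actual implementation):

-- Keep a registry keyed by collector name; return the existing object
-- when it is already registered instead of creating a new one.
local registry = {}

local function counter(name, help)
    local existing = registry[name]
    if existing ~= nil then
        return existing
    end
    local new = { name = name, help = help, value = 0 }
    function new:inc(delta) self.value = self.value + (delta or 1) end
    registry[name] = new
    return new
end

assert(counter('foobar') == counter('foobar'))  -- idempotent: same object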

Not working with package.reload

metrics == 0.5.0

unix/:/var/run/tarantool/app.control> package.reload()
---
- error: '/.rocks/share/tarantool/metrics/quantile.lua:10: cannot change a protected
    metatable'
...

metrics shows +Inf as bsize on empty spaces

Example:

tnt_space_bsize{name="sequence",engine="memtx"} +Inf
tnt_space_bsize{name="jobs",engine="memtx"} +Inf
tnt_space_bsize{name="repair_queue",engine="memtx"} +Inf
tnt_space_bsize{name="audit_log_repair",engine="memtx"} +Inf
tnt_space_bsize{name="command_list",engine="memtx"} +Inf
tnt_space_bsize{name="test",engine="memtx"} +Inf

But in fact these spaces are empty.

roles: metrics role for cartridge

According to: tarantool/cartridge#873

proposed configuration format:

metrics:
  export:
    - path: "/metrics/json"
      format: "json"
    - path: "/metrics/prom"
      format: "prometheus"

where

  • metrics is a top-level section name
  • export is the exporter configuration; e.g. the first entry is a way to enable JSON metrics via the HTTP endpoint /metrics/json.
    Default metrics and the global label 'alias' are enabled by default after init().

Quantile: rewrite double sorting in Lua

We should test different sorting solutions for ffi double arrays to avoid using a dynamic library with a comparator function. Note that performance must be the highest priority.
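
A sketch of one candidate, assuming the quantile module keeps its samples in an ffi double[] buffer: a pure-Lua insertion sort that LuaJIT can compile, with no C comparator involved. Whether it is fast enough is exactly what needs benchmarking:

local ffi = require('ffi')

-- Sort `len` doubles in place. Insertion sort is simple and often adequate
-- for the small sample buffers a quantile sketch keeps between compressions.
local function sort_doubles(arr, len)
    for i = 1, len - 1 do
        local v = arr[i]
        local j = i - 1
        while j >= 0 and arr[j] > v do
            arr[j + 1] = arr[j]
            j = j - 1
        end
        arr[j + 1] = v
    end
end

local a = ffi.new('double[?]', 5, {3.5, 1.0, 2.5, 0.5, 4.0})
sort_doubles(a, 5)
print(a[0], a[1], a[2], a[3], a[4])  -- 0.5  1  2.5  3.5  4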

Add metrics.server Plugin

Currently we have metrics.connect() in the public API, which creates a worker doing periodic exports to metrics.server.

We should move it under the metrics/plugins folder.

vshard

There are no vshard metrics collected at all.
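
A sketch of one possible router-side metric built on vshard.router.info() (the metric name and the exact fields read are assumptions):

local metrics = require('metrics')
local vshard = require('vshard')

local buckets_available = metrics.gauge('vshard_router_buckets_available',
    'Buckets available to the router (rw + ro)')

metrics.register_callback(function()
    local info = vshard.router.info()
    -- assumption: info.bucket carries available_rw / available_ro counters
    buckets_available:set(info.bucket.available_rw + info.bucket.available_ro)
end)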

json plugin converts number64 to string in "value" field

{"label_pairs":{"some_label":"label"},"timestamp":1598451366672309,"metric_name":"name","value":"1605461ULL"}
{"label_pairs":{"name":"space_name","engine":"memtx"},"timestamp":1598462586194806,"metric_name":"tnt_space_total_bsize","value":"0ULL"}

Because of this, we cannot measure time in nanoseconds.

I suppose the reason is here:

local function finite(value)
    if type(value) == "string" then
        value = tonumber(value)
        if value == nil then return nil end
    elseif type(value) ~= "number" then
        -- number64 values are cdata, so they fall into this branch:
        -- finite() returns nil and format_value() falls back to tostring(),
        -- producing the "...ULL" strings shown above.
        return nil
    end
    return value > -metrics.INF and value < metrics.INF
end

local function format_value(value)
    return finite(value) and value or tostring(value)
end

Monitor cartridge issues

Cartridge UI (since v2.0.2) displays the number of issues in the header [screenshot], which unrolls into a list of issues with descriptions [screenshot].

It would be convenient to also have this on monitoring dashboards (e.g. Grafana). We can start with a plain issues gauge.

hitting metrics endpoint on storage nodes gives HTTP 500 with undefined variable in metrics library

Using Tarantool 2.3.2-1-g9be641b and metrics == 0.1.8.

curl 10.3.151.235:8081/metrics                                                                                                    
Unhandled error: ...e/tarantool/metrics/default_metrics/tarantool/spaces.lua:41: variable 'include_vinyl_count' is not declared
stack traceback:
	/opt/tarantool/.rocks/share/tarantool/http/server.lua:743: in function 'process_client'
	/opt/tarantool/.rocks/share/tarantool/http/server.lua:1199: in function </opt/tarantool/.rocks/share/tarantool/http/server.lua:1198>
	[C]: in function 'pcall'
	builtin/socket.lua:1073: in function <builtin/socket.lua:1071>

The metrics endpoint is being initialised via the cartridge httpd service using the following code:

            local httpd = cartridge.service_get('httpd')

            if httpd == nil then
                error('failed to get cartridge httpd service for prometheus')
            end

            metrics.enable_default_metrics()

            httpd:route({
                path = '/metrics',
                method = 'GET',
                public = true,
            }, prometheus.collect_http)

Add examples to README.md

Currently it's not clear how to create counter objects and set their values. Please add this to the README.
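
A minimal usage sketch that could go into the README (metric names and labels are illustrative):

local metrics = require('metrics')

-- Counter: a monotonically increasing value.
local http_requests = metrics.counter('http_requests_total', 'Total HTTP requests')
http_requests:inc(1, { method = 'GET', status = 200 })

-- Gauge: a value that can go up and down.
local mem_used = metrics.gauge('lua_memory_bytes', 'Lua memory in use, bytes')
mem_used:set(collectgarbage('count') * 1024)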

WIP: Summary collector

Recently we introduced HTTP middleware to instrument HTTP server metrics and a new collector type, average. The average collector is not compatible with the Prometheus API and could be replaced with a summary collector.

Affects:
