tarantool / metrics
Metric collection library for Tarantool
License: MIT License
This should be refactored: it is pretty dangerous to use test-only switches in production builds.
Originally posted by @vasiliy-t in #69
I've found some sort implementations written in pure Lua: https://github.com/DarkRoku12/lua_sort (you can also take the tests from there).
Could you check them? Maybe some of them are a bit faster.
It would also be great if you shared some benchmarks showing that the current approach is better than the C-written one.
(Leaving aside that we drop the C part and the gcc requirement, which is obviously perfect.)
Originally posted by @olegrok in #112 (comment)
Since metrics now combines the power of above-mentioned modules, it is safe to add a deprecation note to each of them.
I propose adding validation of the metrics output with promtool to CI, to make sure that Prometheus accepts it.
The Average collector resets its count on each collect, so if we have two metric collectors, part of the observation data will be given to the first one and the rest will be given to the second one.
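The effect can be reproduced with a toy model. This is a minimal sketch, assuming a simplified average collector that keeps a running sum and count and clears both on collect; the names new_average, observe and collect are illustrative and are not taken from the library's implementation.

```lua
-- Toy average collector: NOT the library's code, only a model of
-- "reset on collect" behaviour.
local Average = {}
Average.__index = Average

local function new_average()
    return setmetatable({sum = 0, count = 0}, Average)
end

function Average:observe(value)
    self.sum = self.sum + value
    self.count = self.count + 1
end

-- Returns the average since the previous collect and resets the state.
function Average:collect()
    local count, avg = self.count, 0
    if count > 0 then
        avg = self.sum / count
    end
    self.sum, self.count = 0, 0
    return {count = count, avg = avg}
end

local latency = new_average()
latency:observe(10)
latency:observe(20)
local first = latency:collect()   -- consumer A sees two observations
latency:observe(60)
local second = latency:collect()  -- consumer B sees only the third one
```

Neither consumer sees all three observations, which is the split described above.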
The HTTP latency collector starts working only after the first processed request; before that it reports null rps, not 0 rps.
Using Graphite 1.1.7 (Docker image graphiteapp/graphite-statsd), I found that Graphite ignores some default metrics from the module because of the ULL suffix on their values.
Log sample:
dockerd-current[19642]: 09/07/2020 10:46:34 :: [listener] invalid line received from 172.17.0.1, ignoring [myapp.tnt_space_total_bsize;name=transfers;engine=memtx 11511360ULL 1594291594]
dockerd-current[19642]: 09/07/2020 10:46:34 :: [listener] invalid line received from 172.17.0.1, ignoring [myapp.tnt_space_total_bsize;name=kv;engine=memtx 49379ULL 1594291594]
dockerd-current[19642]: 09/07/2020 10:46:34 :: [listener] invalid line received from 172.17.0.1, ignoring [myapp.tnt_space_total_bsize;name=attempts_count;engine=memtx 98850ULL 1594291594]
The same behavior was noticed with the metrics tnt_space_bsize and tnt_cfg_current_time.
metrics/cartridge/roles/metrics.lua
Line 6 in a780a21
There is no description of what it is and how it works:
Average
Can be used only as a collector for HTTP statistics (described below) and cannot be built explicitly.
We could collect such metrics as:
- issues_count of type gauge. The value is the number of cluster issues this instance knows. This should be good enough for basic alerting: a healthy cluster reports 0 issues. Closed in #243 and #144.
- tnt_info_uptime
- tnt_read_only, but it needs design too.
Sometimes we need to add some label (e.g. instance alias) to each metric we collect. But it's inconvenient to pass or change it (or even them) on every create and update operation we call. This problem can be solved by setting up a global_labels table for metrics, which we can append to collected metrics' labels.
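A minimal sketch of what appending a global_labels table to a metric's own labels could look like. The helper with_global_labels is hypothetical, and letting per-metric labels win on a key conflict is an assumption made here, not something the issue specifies.

```lua
-- Hypothetical module-level table of labels shared by every metric.
local global_labels = {alias = 'tnt-router'}

-- Returns a new table: global labels first, then the metric's own labels,
-- so a per-metric label overrides a global one with the same key.
local function with_global_labels(label_pairs)
    local merged = {}
    for key, value in pairs(global_labels) do
        merged[key] = value
    end
    for key, value in pairs(label_pairs or {}) do
        merged[key] = value
    end
    return merged
end

local labels = with_global_labels({path = '/labels', method = 'GET'})
```

The original label_pairs table is left untouched, so the merge can happen at whichever point is chosen without affecting stored observations.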
Possible solutions:
1. Append global labels to the label_pairs field of each metric on its creation/update. It is a driver-independent and straightforward solution, but it requires some excessive memory and a table copy on each metric update. It will also be harder to set global labels along the way.
2. Append global labels to label_pairs on output. It is simpler to code, because it doesn't change the inner logic, and it requires fewer memory operations and less storage. It will also be easier to set them along the way. On the other hand, it's driver-dependent.
3. Append global labels to label_pairs on the Shared:collect(...) method call. It has the positive aspects of both previous solutions (driver-independent, easy to set along the way, doesn't require excessive memory or memory operations, doesn't revise the inner update logic) and doesn't inherit any significant disadvantages.
Default metrics return the metric tnt_cfg_listen with a string type, which is unsupported by Prometheus.
local metrics = require('metrics')
local http_router = require('http.router')
local http_server = require('http.server')
local http_handler = require('metrics.plugins.prometheus').collect_http
box.cfg{
    listen = '0.0.0.0:3301',
}
metrics.enable_default_metrics()
local httpd = http_server.new('0.0.0.0', 8088, {log_requests = true})
local router = http_router.new():route({path = '/metrics'}, http_handler)
httpd:set_router(router)
httpd:start()
# HELP tnt_cfg_listen Tarantool port
# TYPE tnt_cfg_listen gauge
tnt_cfg_listen 0.0.0.0:3301
level=warn ts=2020-01-24T14:45:02.109Z caller=scrape.go:930 component="scrape manager" scrape_pool=tarantool target=http://test5.tarantool.e:8088/metrics msg="append failed" err="strconv.ParseFloat: parsing \"0.0.0.0:3301\": invalid syntax"
It's necessary to add documentation and examples for the summary collector to the README and Getting Started docs.
It is not clear from the example that, for this to work correctly, you need ONLY ONE collector for the router, preferably the default collector.
Otherwise you get inconsistent metric names and a lot of metrics, like:
{
"label_pairs": {
"path": "/labels",
"method": "GET",
"status": 200,
"alias": "tnt-router"
},
"timestamp": 1600351181350306,
"metric_name": "labels_latency_avg",
"value": 0
},
instead of:
{
"label_pairs": {
"path": "/labels",
"method": "GET",
"status": 200,
"alias": "tnt-router"
},
"timestamp": 1600352083432399,
"metric_name": "http_server_request_latency_avg",
"value": 0
},
So when registering routes for endpoints, everything must use the same collector. This can be done using the default collector. Something along the lines of:
local http_middleware = require('cartridge').service_get('metrics').http_middleware
-- http_collector must be shared by every route, so build it only once.
if http_collector == nil then
    http_collector = http_middleware.build_default_collector('average')
end
server:route({ path = "/labels", method = "ANY"}, http_middleware.v1(handler, http_collector))
Please add metrics for the downstream and upstream replication status. Currently there are only lag and lsn, but that is not enough to detect that a replica has broken down.
I think it makes sense to move metrics/default_metrics/tarantool/utils.lua to a different package, because it may be used in other packages.
Originally posted by @oleggator in #69
Some values from enable_default_metrics() may have the ULL suffix for long numbers, and Prometheus doesn't recognize it.
Quick solution:
local ret = prometheus.collect_http(req)
ret.body = ret.body:gsub("ULL", "")
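A blanket gsub("ULL", "") also rewrites any metric or label name that happens to contain those letters. A slightly narrower sketch, assuming every sample line ends with its value (no trailing timestamp), strips the suffix only when it directly follows a digit at the end of a line. The helper name strip_ull is hypothetical.

```lua
-- Removes a trailing ULL/LL suffix from the value at the end of each line.
-- Assumption: the value is the last token on the line.
local function strip_ull(body)
    local out = {}
    for line in (body .. "\n"):gmatch("(.-)\n") do
        -- "(%d)U?LL$" matches the suffix only right after a digit at line end.
        out[#out + 1] = (line:gsub("(%d)U?LL$", "%1"))
    end
    return table.concat(out, "\n")
end

local sample = 'tnt_space_bsize{name="NULLS",engine="memtx"} 49379ULL\n'
local cleaned = strip_ull(sample)
```

With this input the label value NULLS is preserved, whereas the blanket replacement would turn it into NS.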
We need to add an in-depth description of the default metrics, because the link to the deprecated stat repo contains outdated info and no descriptions.
To answer the question "What do the default metrics consist of?", one currently has to start a Tarantool instance, collect the default metrics (to get their list), and then search the net for their meaning (e.g. https://www.tarantool.io/ru/doc/1.10/reference/reference_lua/fiber/#fiber-info and https://www.tarantool.io/en/doc/1.10/reference/reference_lua/box_slab/).
The current version contains two kinds of metrics: tnt_stats_op_*_total and tnt_stats_op_*_rps, where the asterisk is an operation name. I suggest making it a tag.
It would be great to add:
- the number of requests in queues (iproto -> tx, tx -> iproto);
- utilization of the readahead buffer;
- the number of requests executing simultaneously.
In the current version, the vclock metrics look like info_vclock_1, info_vclock_2, etc. I think that is not convenient. I suggest making the vclock component number a tag.
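A minimal sketch of the suggested shape, assuming the vclock is a table that maps a component id to an LSN. The metric name tnt_info_vclock, the label name id and the helper vclock_observations are all hypothetical; the observation layout mirrors the label_pairs / metric_name / value records shown elsewhere on this page.

```lua
-- Turns {[1] = 100, [2] = 250} into one metric name with an "id" label
-- instead of separate info_vclock_1, info_vclock_2, ... metrics.
local function vclock_observations(vclock)
    local observations = {}
    for id, lsn in pairs(vclock) do
        observations[#observations + 1] = {
            metric_name = 'tnt_info_vclock',  -- hypothetical name
            label_pairs = {id = id},
            value = lsn,
        }
    end
    -- pairs() order is unspecified, so sort for a stable output.
    table.sort(observations, function(a, b)
        return a.label_pairs.id < b.label_pairs.id
    end)
    return observations
end

local obs = vclock_observations({[1] = 100, [2] = 250})
```

The same approach would apply to the operation name in tnt_stats_op_*_total and tnt_stats_op_*_rps.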
Graphite version: 1.1.7.
The user states that Graphite accepts time only in seconds, but we send microseconds. This results in broken graphs.
User suggestion: replace
ts = tostring(fiber.time64()):sub(1, -4) -- Delete ULL suffix
with
ts = tostring(fiber.time64()):sub(1, -10) -- Delete ULL suffix and 6 digits
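The fixed offset relies on the string always ending in exactly "ULL". A sketch that is a little less sensitive to the suffix, assuming the input is the string form of a microsecond timestamp such as "1594291594123456ULL"; the helper name graphite_seconds is hypothetical, and a literal string stands in for tostring(fiber.time64()) so the snippet runs outside Tarantool.

```lua
-- Converts a microsecond timestamp string (with or without a ULL/LL suffix)
-- into a string of whole seconds.
local function graphite_seconds(time64_str)
    local digits = time64_str:match("^(%d+)")
    if digits == nil then
        return nil  -- not a numeric timestamp
    end
    if #digits <= 6 then
        return "0"  -- less than one second
    end
    return digits:sub(1, -7)  -- drop the 6 microsecond digits
end

local ts = graphite_seconds("1594291594123456ULL")
```

Here ts is "1594291594", which gives the same result as the sub(1, -10) variant for this input.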
Prometheus has some problems with the suffixes '_count' and '_total' for non-summary and non-histogram metrics.
Most latency-related functions, documented with ldoc (I suppose), describe their input parameters but lack a description of their output parameters.
https://github.com/tarantool/metrics/blob/master/metrics/collectors/shared.lua#L93
https://github.com/tarantool/metrics/blob/master/metrics/http_middleware.lua#L63
Currently the list of files to be installed is hardcoded in both the rockspec and the rpm spec. This should not happen. Instead, the package should be installed with 'make install'.
Please port the build script from Makefile to CMake, and implement packaging for Debian (currently absent).
It would be nice to add some tests for tarantoolctl rocks install ... and luarocks install ... and run as rocks, since after #101 there is a dynamic lib in the package.
If I enable this role in my init.lua I agree that my metrics will be enabled on ALL instances. It could be really strange to enable metrics per replicaset.
So I propose:
Currently if you do this:
local counter = metrics.counter('foobar')
Then it creates a new counter object every time. This means that users have to store counter objects carefully.
What I propose is to make such calls idempotent: if the parameters (like histogram buckets etc.) don't change, then just return the existing counter.
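A minimal sketch of the idempotent behaviour, using a toy registry keyed by metric name. It only models the proposal: the Counter class, the registry table and this counter function are illustrative stand-ins, not the library's code, and the parameter comparison (e.g. histogram buckets) is left out.

```lua
-- Toy registry: one object per metric name.
local registry = {}

local Counter = {}
Counter.__index = Counter

function Counter:inc(num)
    self.value = self.value + (num or 1)
end

-- Returns the already registered counter if there is one,
-- otherwise creates and registers a new one.
local function counter(name, help)
    local existing = registry[name]
    if existing ~= nil then
        return existing
    end
    local obj = setmetatable({name = name, help = help, value = 0}, Counter)
    registry[name] = obj
    return obj
end

local a = counter('foobar')
local b = counter('foobar')  -- same object as a
a:inc()
b:inc(2)
```

With this shape, callers no longer need to keep the object around: asking for the same name again yields the same counter and its accumulated value.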
I enabled default metrics for Prometheus; however, the documentation for these metrics is poor and incomplete.
metrics == 0.5.0
unix/:/var/run/tarantool/app.control> package.reload()
---
- error: '/.rocks/share/tarantool/metrics/quantile.lua:10: cannot change a protected
metatable'
...
Example:
tnt_space_bsize{name="sequence",engine="memtx"} +Inf
tnt_space_bsize{name="jobs",engine="memtx"} +Inf
tnt_space_bsize{name="repair_queue",engine="memtx"} +Inf
tnt_space_bsize{name="audit_log_repair",engine="memtx"} +Inf
tnt_space_bsize{name="command_list",engine="memtx"} +Inf
tnt_space_bsize{name="test",engine="memtx"} +Inf
But in fact the spaces are empty.
metrics/metrics/http_middleware.lua
Line 20 in 23e5674
According to: tarantool/cartridge#873
proposed configuration format:
metrics:
  export:
    - path: "/metrics/json"
      format: "json"
    - path: "/metrics/prom"
      format: "prometheus"
where
- metrics is a top level section name
- export is exporter configuration, e.g. [1] is a way to enable json metrics via http endpoint /metrics/json
According to tarantool/doc#1328
I propose the following structure in docs:
User's Guide
We should test different sorting solutions for ffi double arrays to avoid using a dynamic lib with a comparator function. Note that performance must be the highest priority.
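One candidate worth benchmarking is a comparator-free heapsort written in plain Lua. This is a minimal sketch, assuming a 0-based array of numbers with a known length n (the way an ffi double array would be indexed); a plain Lua table stands in for the ffi array so the snippet runs anywhere. It makes no claim about how its speed compares with the current approach.

```lua
-- Restores the max-heap property for the subtree rooted at `start`,
-- looking only at indexes up to `stop` (inclusive).
local function sift_down(arr, start, stop)
    local root = start
    while true do
        local child = 2 * root + 1
        if child > stop then
            return
        end
        if child + 1 <= stop and arr[child] < arr[child + 1] then
            child = child + 1
        end
        if arr[root] < arr[child] then
            arr[root], arr[child] = arr[child], arr[root]
            root = child
        else
            return
        end
    end
end

-- Sorts arr[0] .. arr[n - 1] in place, ascending, with direct "<"
-- comparisons instead of a comparator callback.
local function heapsort(arr, n)
    for start = math.floor((n - 2) / 2), 0, -1 do
        sift_down(arr, start, n - 1)
    end
    for stop = n - 1, 1, -1 do
        arr[0], arr[stop] = arr[stop], arr[0]
        sift_down(arr, 0, stop - 1)
    end
end

local data = {[0] = 3.5, 1.25, 9, -2, 3.5}
heapsort(data, 5)
```

Heapsort is in place and non-recursive, which keeps memory use flat; whether it beats other pure-Lua sorts on real data is exactly what the benchmarks would need to show.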
Currently we have metrics.connect() in the public API, which creates a worker doing periodic exports to metrics.server.
We should move it under the metrics/plugins folder.
No vshard metrics are collected at all.
getrusage() allows us to get our own resource usage. It would be nice to add it to metrics.
{"label_pairs":{"some_label":"label"},"timestamp":1598451366672309,"metric_name":"name","value":"1605461ULL"}
{"label_pairs":{"name":"space_name","engine":"memtx"},"timestamp":1598462586194806,"metric_name":"tnt_space_total_bsize","value":"0ULL"}
Because of this, we cannot measure time in nanoseconds.
I suppose the reason is here:
metrics/metrics/plugins/json/init.lua
Lines 5 to 17 in f567492
The only types supported by Prometheus are Gauge, Counter, Histogram and Summary (docs):
Using tarantool: 2.3.2-1-g9be641b
and Metrics library: metrics == 0.1.8
curl 10.3.151.235:8081/metrics
Unhandled error: ...e/tarantool/metrics/default_metrics/tarantool/spaces.lua:41: variable 'include_vinyl_count' is not declared
stack traceback:
/opt/tarantool/.rocks/share/tarantool/http/server.lua:743: in function 'process_client'
/opt/tarantool/.rocks/share/tarantool/http/server.lua:1199: in function </opt/tarantool/.rocks/share/tarantool/http/server.lua:1198>
[C]: in function 'pcall'
builtin/socket.lua:1073: in function <builtin/socket.lua:1071>
The metrics endpoint is initialised on the cartridge http server using this function:
local httpd = cartridge.service_get('httpd')
if httpd == nil then
    error('failed to get cartridge httpd service for prometheus')
end
metrics.enable_default_metrics()
httpd:route({
    path = '/metrics',
    method = 'GET',
    public = true,
}, prometheus.collect_http)
Currently it's not clear how to create counter objects and set their values. Please add this to the README.
In the current version, the name of an index is part of the metric name. I suggest making it a tag.
I suggest deleting these metrics.