
metrics's People

Contributors

a1div0, andreyaksenov, artembo, artur-barsegyan, asverdlov, differentialorange, eugenepaniot, filonenko-mikhail, int3cd, kasen, knazarov, lenkis, mmelentiev-mail, nickvolynkin, no1seman, ochaton, olegrok, onvember, opomuc, patiencedaur, printercu, reo7sp, runsfor, vanyarock01, vasiliy-t, vpotseluyko, xuniq, ylobankov, yngvar-antonsson, zwirec


metrics's Issues

CI checks with promtool

I propose adding validation of the exported metrics with promtool to CI, to make sure that Prometheus accepts them.

Graphite values ULL suffix

Using Graphite 1.1.7 (docker image graphiteapp/graphite-statsd), I found that Graphite ignores some of the module's default metrics because of the ULL suffix on their values.

Log sample:

dockerd-current[19642]: 09/07/2020 10:46:34 :: [listener] invalid line received from 172.17.0.1, ignoring [myapp.tnt_space_total_bsize;name=transfers;engine=memtx 11511360ULL 1594291594]
dockerd-current[19642]: 09/07/2020 10:46:34 :: [listener] invalid line received from 172.17.0.1, ignoring [myapp.tnt_space_total_bsize;name=kv;engine=memtx 49379ULL 1594291594]
dockerd-current[19642]: 09/07/2020 10:46:34 :: [listener] invalid line received from 172.17.0.1, ignoring [myapp.tnt_space_total_bsize;name=attempts_count;engine=memtx 98850ULL 1594291594]

The same behavior was noticed with the metrics tnt_space_bsize and tnt_cfg_current_time.
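
The suffix comes from stringifying LuaJIT 64-bit cdata values. A minimal sketch, assuming the Graphite plugin formats values via tostring(), of why the suffix appears and one way to strip it:

-- LuaJIT renders uint64 cdata with a "ULL" suffix, which Graphite rejects.
local value = 11511360ULL              -- cdata<uint64_t>, e.g. a space bsize
print(tostring(value))                 -- "11511360ULL"  -> invalid line for Graphite
print(tostring(tonumber(value)))       -- "11511360"     -> accepted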

Add doc for ``average`` collector

There is no description of what it is and how it works.

Average
Can be used only as a collector for HTTP statistics (described below) and cannot be built explicitly.

cartridge metrics

We could collect metrics such as:

  1. A new metric issues_count of type gauge, whose value is the number of cluster issues this instance knows about. This should be good enough for basic alerting: a healthy cluster reports 0 issues (see the sketch after this list). - closed in #243 and #144
  2. Cartridge instance state (like OperationError) as a numerical value -- needs design
  3. Time since last restart -- already present as the tnt_info_uptime metric
  4. Failover trigger count -- could be derived from the tnt_read_only metric, but needs design too
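
A sketch of item 1, assuming cartridge exposes the instance's issue list via cartridge.issues (the function name list_on_instance and the metric name are assumptions):

local metrics = require('metrics')
local issues = require('cartridge.issues')  -- assumption: cartridge issues API

local issues_count = metrics.gauge('cartridge_issues_count',
    'Number of cluster issues known to this instance')

-- Re-evaluated on every collect, so the gauge stays current.
metrics.register_callback(function()
    issues_count:set(#issues.list_on_instance())
end)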

Add optional global labels

Sometimes we need to add a label (e.g. the instance alias) to every metric we collect. But it's inconvenient to pass or change it (or even several of them) on every create and update operation we call. This problem can be solved by a global_labels table in metrics, which is appended to the labels of collected metrics.

Possible solutions:

  • Append global labels to every label_pairs field in each metric on its creation/update. This is a driver-independent and straightforward solution, but it requires extra memory and a table copy on each metric update. It would also be harder to change global labels along the way.
  • Append global labels to label_pairs on output. This is simpler to code, because it doesn't change the inner logic, and it requires fewer memory operations and less storage. It would also be easier to change labels along the way. On the other hand, it's driver-dependent.
  • Append global labels to label_pairs in the Shared:collect(...) method. This combines the positive aspects of both previous solutions (driver-independent, easy to change along the way, no extra memory or memory operations, no change to the inner update logic) and doesn't inherit any significant disadvantage. A usage sketch follows this list.
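
A sketch of how this could look from the user's side. The function name set_global_labels is an assumption here, not something specified by this issue:

local metrics = require('metrics')

-- Set once; appended to every metric's label_pairs when metrics are collected.
metrics.set_global_labels({ alias = 'router-1' })

local c = metrics.counter('http_requests_total')
c:inc(1, { method = 'GET' })
-- metrics.collect() output now carries both {method="GET"} and {alias="router-1"}.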

Wrong metric type

Default metrics report tnt_cfg_listen with a string value, which is not supported by Prometheus.

local metrics = require('metrics')
local http_router = require('http.router')
local http_server = require('http.server')
local http_handler = require('metrics.plugins.prometheus').collect_http

box.cfg{
    listen = '0.0.0.0:3301',
}

metrics.enable_default_metrics()
local httpd = http_server.new('0.0.0.0', 8088, {log_requests = true})
local router = http_router.new():route({path = '/metrics'}, http_handler)
httpd:set_router(router)
httpd:start()
The resulting exposition output and Prometheus scrape error:

# HELP tnt_cfg_listen Tarantool port
# TYPE tnt_cfg_listen gauge
tnt_cfg_listen 0.0.0.0:3301
level=warn ts=2020-01-24T14:45:02.109Z caller=scrape.go:930 component="scrape manager" scrape_pool=tarantool target=http://test5.tarantool.e:8088/metrics msg="append failed" err="strconv.ParseFloat: parsing \"0.0.0.0:3301\": invalid syntax"

Summary: documentation

We need to add documentation and examples for the summary collector to the README and the Getting Started docs.

Expand documentation on http_middleware module

It is not clear from the example that, for this to work correctly, you need ONLY ONE collector per router, preferably even the default collector.

Otherwise you get inconsistent metric names and a lot of metrics, like:

  {
    "label_pairs": {
      "path": "/labels",
      "method": "GET",
      "status": 200,
      "alias": "tnt-router"
    },
    "timestamp": 1600351181350306,
    "metric_name": "labels_latency_avg",
    "value": 0
  },

instead of:

  {
    "label_pairs": {
      "path": "/labels",
      "method": "GET",
      "status": 200,
      "alias": "tnt-router"
    },
    "timestamp": 1600352083432399,
    "metric_name": "http_server_request_latency_avg",
    "value": 0
  },

So when registering routes for endpoints, everything must use the same collector. This can be done with the default collector, something along the lines of:

local http_middleware = require('cartridge').service_get('metrics').http_middleware

-- Build the collector once (e.g. as a module-level variable) and reuse it for every route.
if http_collector == nil then
    http_collector = http_middleware.build_default_collector('average')
end

server:route({ path = "/labels", method = "ANY" }, http_middleware.v1(handler, http_collector))

Add metrics for replication status

Please add metrics for the downstream and upstream replication status. Currently there are only lag and lsn, but that is not enough to detect that a replica has broken.
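
A sketch of how the upstream status could be exported as a numeric gauge from box.info.replication. The metric name and the 1/0 encoding are illustrative, not the library's actual design:

local metrics = require('metrics')

local upstream_status = metrics.gauge('tnt_replication_upstream_status',
    'Upstream replication status (1 = follow, 0 = anything else)')

metrics.register_callback(function()
    for id, replica in pairs(box.info.replication) do
        if replica.upstream ~= nil then
            upstream_status:set(replica.upstream.status == 'follow' and 1 or 0, { id = id })
        end
    end
end)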

Invalid output for Prometheus by default_metrics

Some values from enable_default_metrics() may have the ULL suffix for long numbers, which Prometheus doesn't recognize.

Quick solution:

-- Workaround: strip the ULL suffix from the exposition body before serving it.
local ret = prometheus.collect_http(req)
ret.body = ret.body:gsub("ULL", "")

Add in-depth description of default metrics

We need to add an in-depth description of the default metrics, because the link to the deprecated stat repo contains outdated info and no descriptions.

To answer the question "What do the default metrics consist of?", one currently has to start a Tarantool instance, collect the default metrics (to get their list), and then search the net for their meaning (e.g. https://www.tarantool.io/ru/doc/1.10/reference/reference_lua/fiber/#fiber-info and https://www.tarantool.io/en/doc/1.10/reference/reference_lua/box_slab/).

tnt_stats_op_* metric is not convenient

The current version contains two kinds of metrics, tnt_stats_op_*_total and tnt_stats_op_*_rps, where the asterisk is an operation name. I suggest making the operation a tag instead, e.g. tnt_stats_op_select_total would become tnt_stats_op_total{operation="select"} (the label name here is illustrative).

Metric info_vclock_* is inconvenient

In the current version the vclock metrics look like info_vclock_1, info_vclock_2, etc. I think that is not convenient. I suggest making the vclock component number a tag, e.g. info_vclock{id="1"} (the label name here is illustrative).

Graphite time in seconds

Graphite version: 1.1.7.

The user states that Graphite accepts time only in seconds, but we send microseconds. This results in broken graphs.

User suggestion: replace

ts = tostring(fiber.time64()):sub(1, -4) -- Delete ULL suffix

with

ts = tostring(fiber.time64()):sub(1, -10) -- Delete ULL suffix and 6 digits (microseconds -> seconds)

Rename default metrics

Prometheus has some problems with the '_count' and '_total' suffixes on non-summary and non-histogram metrics.

Port to CMake

Currently the list of files to be installed is hardcoded in both the rockspec and the RPM spec. This should not happen. Instead, the package should be installed with 'make install'.

Please port the build script from Makefile to CMake, and implement packaging for Debian (currently absent).

rocks tests

It would be nice to add some tests for tarantoolctl rocks install ... and luarocks install ..., run as rocks, since after #101 the package contains a dynamic library.

Declare metrics cartridge role as permanent

If I enable this role in my init.lua, I accept that metrics will be enabled on ALL instances. It would be really strange to enable metrics per replica set.

So I propose:

  • Declare the metrics role as permanent (an explicit entry for it in a long roles list is arguably redundant anyway); a sketch follows.
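
A minimal sketch of a permanent role declaration under cartridge's role contract (the init body is illustrative, not the actual role code):

local metrics = require('metrics')

local function init(opts)  -- luacheck: no unused args
    metrics.enable_default_metrics()
end

return {
    role_name = 'metrics',
    permanent = true,  -- cartridge enables permanent roles on every instance
    init = init,
}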

Don't create a counter if it already exists

Currently if you do this:

local counter = metrics.counter('foobar')

Then it creates a new counter object every time. This means that users have to store counter objects carefully.

What I propose is to make such calls idempotent: if the parameters (like histogram buckets, etc.) don't change, just return the existing counter.
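
A minimal sketch of the proposed idempotent behaviour (a toy registry, not the library's actual implementation):

-- Keep a registry keyed by collector name; return the existing object
-- when it is already registered instead of creating a new one.
local registry = {}

local function counter(name, help)
    local existing = registry[name]
    if existing ~= nil then
        return existing
    end
    local new = { name = name, help = help, value = 0 }
    function new:inc(delta) self.value = self.value + (delta or 1) end
    registry[name] = new
    return new
end

assert(counter('foobar') == counter('foobar'))  -- idempotent: same object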

Not working with package.reload

metrics == 0.5.0

unix/:/var/run/tarantool/app.control> package.reload()
---
- error: '/.rocks/share/tarantool/metrics/quantile.lua:10: cannot change a protected
    metatable'
...

metrics shows +Inf as bsize on empty spaces

Example:

tnt_space_bsize{name="sequence",engine="memtx"} +Inf
tnt_space_bsize{name="jobs",engine="memtx"} +Inf
tnt_space_bsize{name="repair_queue",engine="memtx"} +Inf
tnt_space_bsize{name="audit_log_repair",engine="memtx"} +Inf
tnt_space_bsize{name="command_list",engine="memtx"} +Inf
tnt_space_bsize{name="test",engine="memtx"} +Inf

But in fact these spaces are empty.

roles: metrics role for cartridge

According to: tarantool/cartridge#873

proposed configuration format:

metrics:
  export:
    - path: "/metrics/json"
      format: "json"
    - path: "/metrics/prom"
      format: "prometheus"

where

  • metrics is a top-level section name
  • export is the exporter configuration; e.g. the first entry is a way to enable JSON metrics via the HTTP endpoint /metrics/json.
    Default metrics and the global label 'alias' are enabled by default after init().

Quantile: rewrite double sorting in Lua

We should test different sorting solutions for ffi double arrays to avoid using a dynamic library with a comparator function. Note that performance must be the highest priority.
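
A sketch of one candidate, assuming the quantile module keeps its samples in an ffi double[] buffer: a pure-Lua insertion sort that LuaJIT can compile, with no C comparator involved. Whether it is fast enough is exactly what needs benchmarking:

local ffi = require('ffi')

-- Sort `len` doubles in place. Insertion sort is simple and often adequate
-- for the small sample buffers a quantile sketch keeps between compressions.
local function sort_doubles(arr, len)
    for i = 1, len - 1 do
        local v = arr[i]
        local j = i - 1
        while j >= 0 and arr[j] > v do
            arr[j + 1] = arr[j]
            j = j - 1
        end
        arr[j + 1] = v
    end
end

local a = ffi.new('double[?]', 5, {3.5, 1.0, 2.5, 0.5, 4.0})
sort_doubles(a, 5)
print(a[0], a[1], a[2], a[3], a[4])  -- 0.5  1  2.5  3.5  4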

Add metrics.server Plugin

Currently we have metrics.connect() in the public API, which creates a worker doing periodic exports to metrics.server.

We should move it under the metrics/plugins folder.

vshard

There are no vshard metrics collected at all.
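
A sketch of one possible router-side metric built on vshard.router.info() (the metric name and the exact fields read are assumptions):

local metrics = require('metrics')
local vshard = require('vshard')

local buckets_available = metrics.gauge('vshard_router_buckets_available',
    'Buckets available to the router (rw + ro)')

metrics.register_callback(function()
    local info = vshard.router.info()
    -- assumption: info.bucket carries available_rw / available_ro counters
    buckets_available:set(info.bucket.available_rw + info.bucket.available_ro)
end)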

json plugin converts number64 to string in "value" field

{"label_pairs":{"some_label":"label"},"timestamp":1598451366672309,"metric_name":"name","value":"1605461ULL"}
{"label_pairs":{"name":"space_name","engine":"memtx"},"timestamp":1598462586194806,"metric_name":"tnt_space_total_bsize","value":"0ULL"}

Because of this, we cannot measure time in nanoseconds.

I suppose the reason is here:

local function finite(value)
    if type(value) == "string" then
        value = tonumber(value)
        if value == nil then return nil end
    elseif type(value) ~= "number" then
        -- number64 values are cdata, so they fall into this branch:
        -- finite() returns nil and format_value() falls back to tostring(),
        -- producing the "...ULL" strings shown above.
        return nil
    end
    return value > -metrics.INF and value < metrics.INF
end

local function format_value(value)
    return finite(value) and value or tostring(value)
end

Monitor cartridge issues

Cartridge UI (since v2.0.2) displays the number of issues in the header [screenshot], which unrolls into a list of issues with descriptions [screenshot].

It would be convenient to also have this on monitoring dashboards (e.g. Grafana). We can start with a plain issues gauge.

hitting metrics endpoint on storage nodes gives HTTP 500 with undefined variable in metrics library

Using Tarantool 2.3.2-1-g9be641b and metrics == 0.1.8.

curl 10.3.151.235:8081/metrics                                                                                                    
Unhandled error: ...e/tarantool/metrics/default_metrics/tarantool/spaces.lua:41: variable 'include_vinyl_count' is not declared
stack traceback:
	/opt/tarantool/.rocks/share/tarantool/http/server.lua:743: in function 'process_client'
	/opt/tarantool/.rocks/share/tarantool/http/server.lua:1199: in function </opt/tarantool/.rocks/share/tarantool/http/server.lua:1198>
	[C]: in function 'pcall'
	builtin/socket.lua:1073: in function <builtin/socket.lua:1071>

The metrics endpoint is being initialised via the cartridge httpd service using the following code:

            local httpd = cartridge.service_get('httpd')

            if httpd == nil then
                error('failed to get cartridge httpd service for prometheus')
            end

            metrics.enable_default_metrics()

            httpd:route({
                path = '/metrics',
                method = 'GET',
                public = true,
            }, prometheus.collect_http)

Add examples to README.md

Currently it's not clear how to create counter objects and set their values. Please add this to the README.
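
A minimal usage sketch that could go into the README (metric names and labels are illustrative):

local metrics = require('metrics')

-- Counter: a monotonically increasing value.
local http_requests = metrics.counter('http_requests_total', 'Total HTTP requests')
http_requests:inc(1, { method = 'GET', status = 200 })

-- Gauge: a value that can go up and down.
local mem_used = metrics.gauge('lua_memory_bytes', 'Lua memory in use, bytes')
mem_used:set(collectgarbage('count') * 1024)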

WIP: Summary collector

Recently we introduced HTTP middleware to instrument HTTP server metrics and a new collector type, average. The average collector is not compatible with the Prometheus API and could be replaced with a summary collector.

Affects:
