grafana / metrictank

metrics2.0 based, multi-tenant timeseries store for Graphite and friends.

License: GNU Affero General Public License v3.0

Languages: Go 97.19%, Shell 1.95%, Ruby 0.01%, Makefile 0.07%, Python 0.63%, Dockerfile 0.17%
Topics: monitoring, metrics, graphite, deprecated, unmaintained

metrictank's Introduction

Metrictank logo

UNMAINTAINED

As of August 2023, Grafana is no longer maintaining this repository. Our primary compatibility with Graphite is provided by carbonapi, using Mimir as our backing database.

Grafana Metrictank


Introduction

Grafana Metrictank is a multi-tenant timeseries platform that can be used as a backend or replacement for Graphite. It provides long-term storage, high availability, and efficient storage, retrieval, and processing for large-scale environments.

Grafana Labs has been running Metrictank in production since December 2015. It currently requires an external datastore like Cassandra or Bigtable, and we highly recommend using Kafka to support clustering, as well as a clustering manager like Kubernetes. This makes it non-trivial to operate, though Grafana Labs has an on-premise product that makes this process much easier.

Features

  • 100% open source
  • Heavily compressed chunks (inspired by the Facebook Gorilla paper) dramatically lower CPU, memory, and storage requirements and get much greater performance out of Cassandra than other solutions.
  • Writeback RAM buffers and chunk caches, serving most data out of memory.
  • Multiple rollup functions can be configured per series (or group of series), e.g. min/max/sum/count/average, and selected at query time via consolidateBy(). This lets Metrictank do consolidation (combined runtime + archived) accurately and correctly, unlike most other Graphite backends such as Whisper. (See the sketch after this list.)
  • Flexible tenancy: can be used as single tenant or multi tenant. Selected data can be shared across all tenants.
  • Input options: carbon, metrics2.0, kafka.
  • Guards against excessively large queries. (per-request series/points restrictions)
  • Data backfill/import from whisper
  • Speculative Execution means you can use replicas not only for High Availability but also to reduce query latency.
  • Write-Ahead buffer based on Kafka facilitates robust clustering and enables other analytics use cases.
  • Tags and Meta Tags support
  • Render response metadata: performance statistics, series lineage information and rollup indicator visible through Grafana
  • Index pruning (hide inactive/stale series)
  • Timeseries can change resolution (interval) over time; they are merged seamlessly at read time. No need for any data migrations.
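For example, the rollup function can be chosen per target in a Graphite-style render request via consolidateBy(). A minimal sketch in Go, assuming a Metrictank render endpoint on localhost:6060 and a hypothetical series name:

    package main

    import (
        "fmt"
        "io"
        "net/http"
        "net/url"
    )

    func main() {
        q := url.Values{}
        q.Set("target", `consolidateBy(some.series.name, "max")`) // pick the max rollup at query time
        q.Set("from", "-24h")
        q.Set("format", "json")

        resp, err := http.Get("http://localhost:6060/render?" + q.Encode())
        if err != nil {
            panic(err)
        }
        defer resp.Body.Close()
        body, _ := io.ReadAll(resp.Body)
        fmt.Println(string(body))
    }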

Relation to Graphite

The goal of Metrictank is to provide a more scalable, secure, resource efficient and performant version of Graphite that is backwards compatible, while also adding some novel functionality. (see Features, above)

There are two main ways to deploy Metrictank:

  • as a backend for Graphite-web, by setting the CLUSTER_SERVER configuration value.
  • as an alternative to a Graphite stack. This enables most of the additional functionality. Note that Metrictank's API is not quite on par with Graphite-web yet: some less commonly used functions are not implemented natively, in which case Metrictank relies on a Graphite-web process to handle those requests. See our graphite comparison page for more details.

Limitations

  • No performance/availability isolation between tenants per instance. (only data isolation)
  • Minimal computation locality: we move the data from storage to the processing code (both Metrictank and Graphite).
  • Can't overwrite old data. We support reordering within the most recent time window, but that's it (unless you restart MT).

Interesting design characteristics (feature or limitation... up to you)

  • Upgrades / process restarts require running multiple instances (potentially only for the duration of the maintenance) and possibly re-assigning the primary role. Otherwise data loss of current chunks will be incurred. See the operations guide.
  • Clustering works best with an orchestrator like Kubernetes. MT itself does not automate master promotions. See clustering for more.
  • Only float64 values. Ints and bools are currently stored as floats (which works quite well thanks to the Gorilla compression).
  • Only uint32 unix timestamps in second resolution. For higher resolution, consider streaming directly to Grafana.
  • We distribute data by hashing keys, like many similar systems. This means no data locality: data that is often used together may not live together. (The general idea is sketched below.)
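A minimal sketch of the general idea (not Metrictank's actual partitioning scheme): hash the series key and take it modulo a shard count, so placement ignores any relationship between series.

    package main

    import (
        "fmt"
        "hash/fnv"
    )

    // shardFor assigns a series key to one of numShards shards purely by hash,
    // which is why data that is often queried together may not live together.
    func shardFor(key string, numShards uint32) uint32 {
        h := fnv.New32a()
        h.Write([]byte(key))
        return h.Sum32() % numShards
    }

    func main() {
        fmt.Println(shardFor("litmus.localhost.dev1.dns.time", 8))
    }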

Docs

  • Installation, configuration and operation
  • Features in-depth

Other

Releases and versioning

  • releases and changelog

  • we aim to keep master stable and vet code before merging to master

  • We're pre-1.0 but adopt semver for our 0.MAJOR.MINOR format. The rules are simple:

    • MAJOR version for incompatible API or functionality changes
    • MINOR version when functionality is added in a backwards-compatible manner

    We don't do patch-level releases since minor releases are frequent enough.

License

Copyright 2016-2019 Grafana Labs

This software is distributed under the terms of the GNU Affero General Public License.

Some specific packages have a different license.


metrictank's Issues

a bunch of metrics are listed twice in ES for no reason

steps to reproduce:

  • spin up dev-stack (raintank-docker, tank branch)
  • once grafana is running, run env-load
  • run graphite-watcher (in this repo, tank branch), check the sys dashboard in grafana: it quickly jumps to 18k metrics
  • run inspect-es to confirm (in this repo, tank branch)
    interestingly, for some endpoints it's the ping metrics that appear twice, for others it's dns, etc.
dieter@dieter-m6800 inspect-es ./inspect-es  | wc -l
18000
dieter@dieter-m6800 inspect-es ./inspect-es  | sort | uniq -c | head -n 150
      2 litmus.fake_org_100_endpoint_1.dev1.dns.answers
      2 litmus.fake_org_100_endpoint_1.dev1.dns.default
      2 litmus.fake_org_100_endpoint_1.dev1.dns.error_state
      2 litmus.fake_org_100_endpoint_1.dev1.dns.ok_state
      2 litmus.fake_org_100_endpoint_1.dev1.dns.time
      2 litmus.fake_org_100_endpoint_1.dev1.dns.ttl
      2 litmus.fake_org_100_endpoint_1.dev1.dns.warn_state
      1 litmus.fake_org_100_endpoint_1.dev1.http.connect
      1 litmus.fake_org_100_endpoint_1.dev1.http.dataLength
      1 litmus.fake_org_100_endpoint_1.dev1.http.default
      1 litmus.fake_org_100_endpoint_1.dev1.http.dns
      1 litmus.fake_org_100_endpoint_1.dev1.http.error_state
      1 litmus.fake_org_100_endpoint_1.dev1.http.ok_state
      1 litmus.fake_org_100_endpoint_1.dev1.http.recv
      1 litmus.fake_org_100_endpoint_1.dev1.http.send
      1 litmus.fake_org_100_endpoint_1.dev1.http.statusCode
      1 litmus.fake_org_100_endpoint_1.dev1.http.throughput
      1 litmus.fake_org_100_endpoint_1.dev1.http.total
      1 litmus.fake_org_100_endpoint_1.dev1.http.wait
      1 litmus.fake_org_100_endpoint_1.dev1.http.warn_state
      1 litmus.fake_org_100_endpoint_1.dev1.ping.avg
      1 litmus.fake_org_100_endpoint_1.dev1.ping.default
      1 litmus.fake_org_100_endpoint_1.dev1.ping.error_state
      1 litmus.fake_org_100_endpoint_1.dev1.ping.loss
      1 litmus.fake_org_100_endpoint_1.dev1.ping.max
      1 litmus.fake_org_100_endpoint_1.dev1.ping.mdev
      1 litmus.fake_org_100_endpoint_1.dev1.ping.mean
      1 litmus.fake_org_100_endpoint_1.dev1.ping.min
      1 litmus.fake_org_100_endpoint_1.dev1.ping.ok_state
      1 litmus.fake_org_100_endpoint_1.dev1.ping.warn_state
      1 litmus.fake_org_100_endpoint_2.dev1.dns.answers
      1 litmus.fake_org_100_endpoint_2.dev1.dns.default
      1 litmus.fake_org_100_endpoint_2.dev1.dns.error_state
      1 litmus.fake_org_100_endpoint_2.dev1.dns.ok_state
      1 litmus.fake_org_100_endpoint_2.dev1.dns.time
      1 litmus.fake_org_100_endpoint_2.dev1.dns.ttl
      1 litmus.fake_org_100_endpoint_2.dev1.dns.warn_state
      2 litmus.fake_org_100_endpoint_2.dev1.http.connect
      2 litmus.fake_org_100_endpoint_2.dev1.http.dataLength
      2 litmus.fake_org_100_endpoint_2.dev1.http.default
      2 litmus.fake_org_100_endpoint_2.dev1.http.dns
      2 litmus.fake_org_100_endpoint_2.dev1.http.error_state
      2 litmus.fake_org_100_endpoint_2.dev1.http.ok_state
      2 litmus.fake_org_100_endpoint_2.dev1.http.recv
      2 litmus.fake_org_100_endpoint_2.dev1.http.send
      2 litmus.fake_org_100_endpoint_2.dev1.http.statusCode
      2 litmus.fake_org_100_endpoint_2.dev1.http.throughput
      2 litmus.fake_org_100_endpoint_2.dev1.http.total
      2 litmus.fake_org_100_endpoint_2.dev1.http.wait
      2 litmus.fake_org_100_endpoint_2.dev1.http.warn_state
      2 litmus.fake_org_100_endpoint_2.dev1.ping.avg
      2 litmus.fake_org_100_endpoint_2.dev1.ping.default
      2 litmus.fake_org_100_endpoint_2.dev1.ping.error_state
      2 litmus.fake_org_100_endpoint_2.dev1.ping.loss
      2 litmus.fake_org_100_endpoint_2.dev1.ping.max
      2 litmus.fake_org_100_endpoint_2.dev1.ping.mdev
      2 litmus.fake_org_100_endpoint_2.dev1.ping.mean
      2 litmus.fake_org_100_endpoint_2.dev1.ping.min
      2 litmus.fake_org_100_endpoint_2.dev1.ping.ok_state
      2 litmus.fake_org_100_endpoint_2.dev1.ping.warn_state
      2 litmus.fake_org_100_endpoint_3.dev1.dns.answers
      2 litmus.fake_org_100_endpoint_3.dev1.dns.default
      2 litmus.fake_org_100_endpoint_3.dev1.dns.error_state
      2 litmus.fake_org_100_endpoint_3.dev1.dns.ok_state
      2 litmus.fake_org_100_endpoint_3.dev1.dns.time
      2 litmus.fake_org_100_endpoint_3.dev1.dns.ttl
      2 litmus.fake_org_100_endpoint_3.dev1.dns.warn_state
      1 litmus.fake_org_100_endpoint_3.dev1.http.connect
      1 litmus.fake_org_100_endpoint_3.dev1.http.dataLength
      1 litmus.fake_org_100_endpoint_3.dev1.http.default
      1 litmus.fake_org_100_endpoint_3.dev1.http.dns
      1 litmus.fake_org_100_endpoint_3.dev1.http.error_state
      1 litmus.fake_org_100_endpoint_3.dev1.http.ok_state
      1 litmus.fake_org_100_endpoint_3.dev1.http.recv
      1 litmus.fake_org_100_endpoint_3.dev1.http.send
      1 litmus.fake_org_100_endpoint_3.dev1.http.statusCode
      1 litmus.fake_org_100_endpoint_3.dev1.http.throughput
      1 litmus.fake_org_100_endpoint_3.dev1.http.total
      1 litmus.fake_org_100_endpoint_3.dev1.http.wait
      1 litmus.fake_org_100_endpoint_3.dev1.http.warn_state
      1 litmus.fake_org_100_endpoint_3.dev1.ping.avg
      1 litmus.fake_org_100_endpoint_3.dev1.ping.default
      1 litmus.fake_org_100_endpoint_3.dev1.ping.error_state
      1 litmus.fake_org_100_endpoint_3.dev1.ping.loss
      1 litmus.fake_org_100_endpoint_3.dev1.ping.max
      1 litmus.fake_org_100_endpoint_3.dev1.ping.mdev
      1 litmus.fake_org_100_endpoint_3.dev1.ping.mean
      1 litmus.fake_org_100_endpoint_3.dev1.ping.min
      1 litmus.fake_org_100_endpoint_3.dev1.ping.ok_state
      1 litmus.fake_org_100_endpoint_3.dev1.ping.warn_state
      1 litmus.fake_org_100_endpoint_4.dev1.dns.answers
      1 litmus.fake_org_100_endpoint_4.dev1.dns.default
      1 litmus.fake_org_100_endpoint_4.dev1.dns.error_state
      1 litmus.fake_org_100_endpoint_4.dev1.dns.ok_state
      1 litmus.fake_org_100_endpoint_4.dev1.dns.time
      1 litmus.fake_org_100_endpoint_4.dev1.dns.ttl
      1 litmus.fake_org_100_endpoint_4.dev1.dns.warn_state
      2 litmus.fake_org_100_endpoint_4.dev1.http.connect
      2 litmus.fake_org_100_endpoint_4.dev1.http.dataLength
      2 litmus.fake_org_100_endpoint_4.dev1.http.default
      2 litmus.fake_org_100_endpoint_4.dev1.http.dns
      2 litmus.fake_org_100_endpoint_4.dev1.http.error_state
      2 litmus.fake_org_100_endpoint_4.dev1.http.ok_state
      2 litmus.fake_org_100_endpoint_4.dev1.http.recv
      2 litmus.fake_org_100_endpoint_4.dev1.http.send
      2 litmus.fake_org_100_endpoint_4.dev1.http.statusCode
      2 litmus.fake_org_100_endpoint_4.dev1.http.throughput
      2 litmus.fake_org_100_endpoint_4.dev1.http.total
      2 litmus.fake_org_100_endpoint_4.dev1.http.wait
      2 litmus.fake_org_100_endpoint_4.dev1.http.warn_state
      2 litmus.fake_org_100_endpoint_4.dev1.ping.avg
      2 litmus.fake_org_100_endpoint_4.dev1.ping.default
      2 litmus.fake_org_100_endpoint_4.dev1.ping.error_state
      2 litmus.fake_org_100_endpoint_4.dev1.ping.loss
      2 litmus.fake_org_100_endpoint_4.dev1.ping.max
      2 litmus.fake_org_100_endpoint_4.dev1.ping.mdev
      2 litmus.fake_org_100_endpoint_4.dev1.ping.mean
      2 litmus.fake_org_100_endpoint_4.dev1.ping.min
      2 litmus.fake_org_100_endpoint_4.dev1.ping.ok_state
      2 litmus.fake_org_100_endpoint_4.dev1.ping.warn_state
      2 litmus.fake_org_10_endpoint_1.dev1.dns.answers
      2 litmus.fake_org_10_endpoint_1.dev1.dns.default
      2 litmus.fake_org_10_endpoint_1.dev1.dns.error_state
      2 litmus.fake_org_10_endpoint_1.dev1.dns.ok_state
      2 litmus.fake_org_10_endpoint_1.dev1.dns.time
      2 litmus.fake_org_10_endpoint_1.dev1.dns.ttl
      2 litmus.fake_org_10_endpoint_1.dev1.dns.warn_state
      1 litmus.fake_org_10_endpoint_1.dev1.http.connect
      1 litmus.fake_org_10_endpoint_1.dev1.http.dataLength
      1 litmus.fake_org_10_endpoint_1.dev1.http.default
      1 litmus.fake_org_10_endpoint_1.dev1.http.dns
      1 litmus.fake_org_10_endpoint_1.dev1.http.error_state
      1 litmus.fake_org_10_endpoint_1.dev1.http.ok_state
      1 litmus.fake_org_10_endpoint_1.dev1.http.recv
      1 litmus.fake_org_10_endpoint_1.dev1.http.send
      1 litmus.fake_org_10_endpoint_1.dev1.http.statusCode
      1 litmus.fake_org_10_endpoint_1.dev1.http.throughput
      1 litmus.fake_org_10_endpoint_1.dev1.http.total
      1 litmus.fake_org_10_endpoint_1.dev1.http.wait
      1 litmus.fake_org_10_endpoint_1.dev1.http.warn_state
      1 litmus.fake_org_10_endpoint_1.dev1.ping.avg
      1 litmus.fake_org_10_endpoint_1.dev1.ping.default
      1 litmus.fake_org_10_endpoint_1.dev1.ping.error_state
      1 litmus.fake_org_10_endpoint_1.dev1.ping.loss
      1 litmus.fake_org_10_endpoint_1.dev1.ping.max
      1 litmus.fake_org_10_endpoint_1.dev1.ping.mdev
      1 litmus.fake_org_10_endpoint_1.dev1.ping.mean
      1 litmus.fake_org_10_endpoint_1.dev1.ping.min
      1 litmus.fake_org_10_endpoint_1.dev1.ping.ok_state
      1 litmus.fake_org_10_endpoint_1.dev1.ping.warn_state

consolidation-at-read-time: errors like `json: error calling MarshalJSON for type main.Point: invalid character 'N' looking for beginning of value` at read time

2015/12/02 03:30:52 [D] searchCassandra(): 1 outcomes (queries), 0 total iters
2015/12/02 03:30:52 [D] getSeries: iter mem <chunk T0=1449014400, LastTS=1449027000, NumPoints=9, Saved=false>  values good/total 9/9
2015/12/02 03:30:52 [log.go:202 writerMsg()] [E] json: error calling MarshalJSON for type main.Point: invalid character 'N' looking for beginning of value
2015/12/02 03:30:52 [D] getSeries: iter mem <chunk T0=1449014400, LastTS=1449022200, NumPoints=3, Saved=false>  values good/total 3/3
2015/12/02 03:30:52 [D] load from memory    1.0553c9c7f5934455580d6c66480aa7cb_sum_600 1448940652 - 1449027052 (01 03:30:52 - 02 03:30:52) span:86399s
2015/12/02 03:30:52 [D] AggMetric.Get():    1.0553c9c7f5934455580d6c66480aa7cb_sum_600 1448940652 - 1449027052 (01 03:30:52 - 02 03:30:52) span:86399s
2015/12/02 03:30:52 [D] load from cassandra 1.0553c9c7f5934455580d6c66480aa7cb_sum_600 1448940652 - 1449014400 (01 03:30:52 - 02 00:00:00) span:73747s
2015/12/02 03:30:52 [D] searchCassandra(): 1 outcomes (queries), 0 total iters
2015/12/02 03:30:52 [D] getSeries: iter mem <chunk T0=1449014400, LastTS=1449022200, NumPoints=3, Saved=false>  values good/total 3/3
2015/12/02 03:30:52 [log.go:202 writerMsg()] [E] json: error calling MarshalJSON for type main.Point: invalid character 'N' looking for beginning of value
2015/12/02 03:30:52 [D] searchCassandra(): 1 outcomes (queries), 0 total iters
2015/12/02 03:30:52 [D] getSeries: iter mem <chunk T0=1449014400, LastTS=1449027000, NumPoints=9, Saved=false>  values good/total 9/9
2015/12/02 03:30:52 [log.go:202 writerMsg()] [E] json: error calling MarshalJSON for type main.Point: invalid character 'N' looking for beginning of value

just use the grafana models?

instead of our own models, can we just use what's in github.com/grafana/grafana/pkg/models?
seems like the only difference is the Id field used for ES indexing.
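A minimal sketch of what reusing a shared model could look like: embed it and add only the Id field needed for ES indexing. The type and field names here are hypothetical stand-ins, not the actual grafana package API.

    package model

    // SharedMetric is a stand-in for a model from github.com/grafana/grafana/pkg/models.
    type SharedMetric struct {
        OrgId    int
        Name     string
        Interval int
    }

    // IndexedMetric reuses the shared model and adds only the ES document id.
    type IndexedMetric struct {
        SharedMetric
        Id string // used only for ES indexing
    }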

redis-cluster support

To be able to use redis-cluster with nsq_metrics_to_elasticsearch, metricdefs needs to be updated to use gopkg.in/redis.v3. Additionally, the redis Client object will need to be a redis ClusterClient object (http://godoc.org/gopkg.in/redis.v3#ClusterClient).

Ideally there should be a config option to choose between normal redis and redis cluster. The redis address arguments need to be changed anyway to handle the cluster case. (A rough sketch follows.)
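A rough sketch of that config switch, assuming the gopkg.in/redis.v3 API the issue points to (addresses and the flag are illustrative):

    package main

    import (
        "log"
        "strings"

        "gopkg.in/redis.v3"
    )

    // newPinger builds either a normal client or a cluster client based on config
    // and returns a ping function, so the caller doesn't care which one it got.
    func newPinger(addrs string, clusterMode bool) func() error {
        if clusterMode {
            c := redis.NewClusterClient(&redis.ClusterOptions{
                Addrs: strings.Split(addrs, ","),
            })
            return func() error { return c.Ping().Err() }
        }
        c := redis.NewClient(&redis.Options{Addr: addrs})
        return func() error { return c.Ping().Err() }
    }

    func main() {
        ping := newPinger("localhost:7000,localhost:7001", true)
        if err := ping(); err != nil {
            log.Fatal(err)
        }
    }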

support configuration file for services

Either the apps themselves need to support a configuration file for loading settings, or the init script needs to support using an external file (/etc/default/) for defining the options passed on the command line.

It is not practical to have deb/rpm packages that require users to modify the init scripts in order for the server to work correctly in their environment.

see: https://github.com/raintank/ops/issues/133
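For the app-side option, a minimal sketch using only the standard library: read a simple key=value file (the path and flag name are hypothetical) and apply it as flag values before parsing the command line.

    package main

    import (
        "bufio"
        "flag"
        "fmt"
        "os"
        "strings"
    )

    var listenAddr = flag.String("listen", ":6063", "http listen address")

    // loadDefaults applies key=value pairs from an optional config file as flag
    // values; flags actually passed on the command line still take precedence.
    func loadDefaults(path string) {
        f, err := os.Open(path)
        if err != nil {
            return // no config file: keep built-in defaults
        }
        defer f.Close()
        sc := bufio.NewScanner(f)
        for sc.Scan() {
            line := strings.TrimSpace(sc.Text())
            if line == "" || strings.HasPrefix(line, "#") {
                continue
            }
            if kv := strings.SplitN(line, "=", 2); len(kv) == 2 {
                flag.Set(strings.TrimSpace(kv[0]), strings.TrimSpace(kv[1]))
            }
        }
    }

    func main() {
        loadDefaults("/etc/default/raintank-metric")
        flag.Parse()
        fmt.Println("listening on", *listenAddr)
    }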

bug: state reloading indistinguishable from black magic

in devstack:

  1. i add 1 endpoint at 9:50, so it starts accumulating data, and graphs start getting data.
  2. at 10:00 i check in nsqadmin that there's 0 backlog; NMT is 100% up to date. all graphs have data from 9:50 until 10:00.
  3. i kill metric_tank, rm /tmp/nmt.gob and restart it. it says it's starting with fresh aggmetrics.
  4. i reload the graphs in grafana. all the data since 9:50 is still there! i thought maybe it's caching, but:
root@graphite:/# grep -i cache /var/log/raintank/graphite-*
/var/log/raintank/graphite-api.log:/usr/local/lib/python2.7/dist-packages/flask_cache/__init__.py:152: UserWarning: Flask-Cache: CACHE_TYPE is set to null, caching is effectively disabled.
/var/log/raintank/graphite-api.log:  warnings.warn("Flask-Cache: CACHE_TYPE is set to null, "

also, zooming in grafana still shows the data (so it's not caching by to/from)

where is the data coming from? in network tab, i see requests being made to http://localhost/api/graphite/render and responding properly

resume where we left off after downtime

i hear that if raintank-metric goes down and comes back up later, it doesn't get the data from rabbitmq it missed. we should fix this. that said, I do think the realtime stream is much more important than historical data. It's not worth delaying ingestion of our real-time metrics by several hours because we have to catch up on a day's worth of old data.
I would rather have a high-prio real-time stream and a lower-QOS stream for the historical data.

chunks that are not persisted yet can be cleared

this code section does not account for the chunk not being persisted yet when it clears.
the fewer chunks we keep in memory, the more likely this becomes.
specifically, i think for some rollup data we only want to keep 1 chunk in memory. (a sketch of a possible guard follows the snippet below)

        if len(a.Chunks) < int(a.NumChunks) {
            log.Debug("adding new chunk to cirular Buffer. now %d chunks", a.CurrentChunkPos+1)
            a.Chunks = append(a.Chunks, NewChunk(t0))
        } else {
            chunkClear.Inc(1)
            log.Debug("numChunks: %d  currentPos: %d", len(a.Chunks), a.CurrentChunkPos)
            log.Debug("clearing chunk from circular buffer. %v", a.Chunks[a.CurrentChunkPos])
            a.Chunks[a.CurrentChunkPos] = NewChunk(t0)
        }
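A minimal, self-contained sketch of the kind of guard this issue asks for: persist an unsaved chunk before its slot in the circular buffer is reused. Type, field, and function names are simplified stand-ins, not the real ones.

    package main

    import "fmt"

    type Chunk struct {
        T0    uint32
        Saved bool
    }

    type AggMetric struct {
        Chunks          []*Chunk
        CurrentChunkPos int
    }

    // persist is a stand-in for writing the chunk to the backend store.
    func persist(c *Chunk) {
        fmt.Println("persisting chunk", c.T0)
        c.Saved = true
    }

    // rotate reuses the current slot for a new chunk, but flushes the old
    // chunk first if it was never persisted.
    func (a *AggMetric) rotate(t0 uint32) {
        if old := a.Chunks[a.CurrentChunkPos]; old != nil && !old.Saved {
            persist(old)
        }
        a.Chunks[a.CurrentChunkPos] = &Chunk{T0: t0}
    }

    func main() {
        a := &AggMetric{Chunks: []*Chunk{{T0: 100, Saved: false}}}
        a.rotate(700)
    }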

should all tags have user defined keys?

this is the continuation of a topic that started at #11 (comment) (first point)
then read #11 (comment) and the few comments afterwards.

i basically suggested it might make sense to drop the requirement that all tags have a key defined at the origin (statsd calls etc)

i can't come up with an example where finding a tag key is impossible or really hard, but I think that's not the point.
i know i've had it in the past working in vimeo's backend code (to which i no longer have access), where a lot of stuff was going on and the systems had a lot of considerations before deciding on actions. we were instrumenting those actions (and the considerations that led up to them), so we had a lot of metric dimensionality. while you could come up with keys for all dimensions, they increased the length of the statsd invocations and the amount of work you had to do (typing, and also thinking about wording and semantics). if you were creating tens of metrics, this was annoying.

on the other hand, not typing out keys in the instrumentation code and just using auto assigned tag keys was fine because you'd first search for the data and then drill down using the keys that the dashboard tells you it has auto-assigned (n1, n2, n3, ...)

so it's more about "it can be easier, sometimes people just want to go from A to B as quickly as possible", not "it's a must because of technical reasons"
I think @torkelo also has an opinion on this; he mentioned the other day that a downside of tag-based systems (compared to graphite, for example) is having to type out tag keys.

i know that @jedi4ever also mentioned an appreciation for the no-key-enforced approach of datadog, letting him just tag "staging", "production" etc instead of having to prefix with "environment:"

some tags should not contribute to the metric ID

(*schema.MetricData)(0xc8201de300)({
 OrgId: (int) 1,
 Name: (string) (len=37) "litmus.localhost.dev1.dns.error_state",
 Metric: (string) (len=22) "litmus.dns.error_state",
 Interval: (int) 10,
 Value: (float64) 0,
 Unit: (string) (len=5) "state",
 Time: (int64) 1443523161,
 TargetType: (string) (len=5) "gauge",
 Tags: ([]string) (len=3 cap=3) {
  (string) (len=13) "endpoint_id:1",
  (string) (len=12) "monitor_id:4",
  (string) (len=14) "collector:dev1"
 }
})

Id() uses all tags.
if we get a metric with the exact same name/metric/OrgId and all the same tags, except monitor_id is now 5, then it should be the same metric with the same ID.

metrics2.0 addresses such scenarios by having regular tags and meta tags, which can change and don't contribute to the ID. see http://metrics20.org/spec/
why are we storing monitor_id in there anyway?

as we grow to support all kinds of metrics, it'll be even more common to have meta tags that should be able to change without changing the ID, so I think we should implement meta tags like in metrics2.0. (a rough sketch is below)
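A rough sketch of the metrics2.0 idea, assuming a hypothetical set of meta tag keys: only identity tags feed the hash, so changing monitor_id would no longer change the series id. This is an illustration, not Metrictank's actual Id().

    package main

    import (
        "crypto/md5"
        "fmt"
        "sort"
        "strings"
    )

    var metaTagKeys = map[string]bool{"monitor_id": true} // hypothetical

    // seriesId hashes org, name and identity tags only; meta tags are skipped.
    func seriesId(orgId int, name string, tags []string) string {
        identity := make([]string, 0, len(tags))
        for _, t := range tags {
            key := strings.SplitN(t, ":", 2)[0]
            if !metaTagKeys[key] {
                identity = append(identity, t)
            }
        }
        sort.Strings(identity)
        sum := md5.Sum([]byte(fmt.Sprintf("%d.%s.%s", orgId, name, strings.Join(identity, "."))))
        return fmt.Sprintf("%d.%x", orgId, sum)
    }

    func main() {
        tags := []string{"endpoint_id:1", "monitor_id:4", "collector:dev1"}
        fmt.Println(seriesId(1, "litmus.localhost.dev1.dns.error_state", tags))
    }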

eventQueue filling up

We're seeing it both on dev-portal and production: the eventQueue for grafana events is filling up with events and not being read by raintank-metric. The code there looks in order, but I noticed that the raintank-metric processes in production are restarting with fair regularity (though not every minute). Unfortunately their logs get wiped out, so we don't see whatever error messages they spit out before dying. I've changed their logging to syslog to try and capture that. This could possibly be related to the eventQueue issue.

NMT doesn't properly handle out of order data

due to the way nsqd currently spills traffic over from an in-memory channel to a disk-backed channel (by selecting on them), it can arbitrarily reorder your data. ideally, if the in-memory channel is always empty, this shouldn't happen, but it can due to minor hiccups. we can see if increasing the size of the memory buffer helps, though obviously then we would incur more data loss in case of an nsqd crash.

@woodsaj confirmed this by feeding data into nsqd in order and having an NMT consumer with 1 concurrent handler: the data came out of order.

I've had conversations with Matt Reiferson (of nsq) seeing how feasible it would be to add simple ordering guarantees to nsqd, even if merely per-topic per nsqd instance. but even that seems quite complex/tricky and would require a different model for requeues, defers, msg timeouts etc and would be a drastically different nsqd behavior, even with nsqio/nsq#625

his recommendation was to use an ephemeral channel to always read the latest data to serve up to users from RAM, and just drop what we can't handle; and additionally use a diskqueue-backed channel which you read from and store into something like HDFS, so that you can then use hadoop to properly compute the chunks to store in archival storage (i.e. go-tsz chunks in cassandra) even on out-of-order data.
this seems like far more complexity than we want. i like the idea of separating in-mem data and archival storage, which seems to let us simplify things, but using hadoop to work around poor ordering after the fact...

what we can also do:

  • current approach, but keep a window of messages which we sort, say 10 seconds long. after 10s we can assume we have a good order, decode the messages, and commit their metrics to go-tsz chunks, and we wouldn't have much risk of getting chunks that are >10s late. of course it will then also take 10s for data to start showing up when NMT responds to queries. hmm, well, i guess the query handler could also look through the messages in the window and pull data from there.
  • related idea: don't explicitly keep a window of messages to sort, but keep simple non-go-tsz-optimized datastructures of points (like simple arrays of uint32-float64 pairs) so that the metrics of all new messages can immediately be added and are available for querying. whenever the data is getting old enough to move to cassandra, that's when we generate the chunks, at which point the data should be very stable.
    however, this means for update operations we might commit the wrong values if the 2 writes for the same slot happen in the wrong order (though we're not currently doing any updates), and it would also be less RAM-efficient to keep the data in such arrays.

note that in both of the above approaches we assume ordering of messages is all we need.
in reality, messages from the collectors can contain points for different timestamps (and this is hard to address in the collectors per AJ), so in NMT we would have to order the actual points, not just the messages. (a rough sketch of a per-point reorder window follows)
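A minimal, self-contained sketch of the reorder-window idea from the first bullet, applied to individual points: buffer incoming points, and once a point is older than the window relative to the newest timestamp seen, emit it in sorted order, ready to commit to chunks. The 10s window and names are illustrative.

    package main

    import (
        "fmt"
        "sort"
    )

    type Point struct {
        Ts  uint32
        Val float64
    }

    type ReorderBuffer struct {
        window uint32 // how long to hold points before committing, in seconds
        buf    []Point
    }

    // Add buffers a point and returns any points now older than the window
    // (relative to the newest timestamp seen), sorted and ready to commit.
    func (r *ReorderBuffer) Add(p Point) []Point {
        r.buf = append(r.buf, p)
        sort.Slice(r.buf, func(i, j int) bool { return r.buf[i].Ts < r.buf[j].Ts })
        newest := r.buf[len(r.buf)-1].Ts
        cut := 0
        for cut < len(r.buf) && r.buf[cut].Ts+r.window <= newest {
            cut++
        }
        ready := r.buf[:cut]
        r.buf = append([]Point(nil), r.buf[cut:]...)
        return ready
    }

    func main() {
        rb := &ReorderBuffer{window: 10}
        for _, p := range []Point{{100, 1}, {95, 2}, {120, 3}} {
            for _, out := range rb.Add(p) {
                fmt.Println("commit", out.Ts, out.Val)
            }
        }
    }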

tank update procedure

we have basically 2 options IMHO:

  1. stop tank, saving chunks to disk, update tank, start it, which means it has to decode from disk, and work through a backlog of data. this is AFAIK why we did #40
    3 issues with this approach:
    a) the ordering problem with nsq causes data loss and gaps in data (messages from during downtime will come out of order and we have to drop metrics if they're older than previous)
    b) if the upgrade comes with changes to data structures, it will be a bunch of extra work to accommodate migrating the old data into a new format (if possible), and possibly having to throw away the old chunks, creating an even bigger gap of data.
    c) the datasource as a whole is effectively offline during this process.

  2. launch a new tank instance parallel to the old one on a different ip/port/machine.
    the downsides are:
    a) it needs to run for a while and cover the whole data range between now and the point where we have data in cassandra, ideally up to numChunks*numspan and/or whatever range our dashboards are set to, so cassandra doesn't see an increased read load when we switch.
    b) it's operationally a bit more complex
    c) we need to implement some additional logic so that both instances don't save the same data to cassandra. i guess some kind of signal for an instance to tell it not to save anything, and then another one to tell it to start saving starting with data from a given boundary timestamp. haven't thought too much about this.

but the upside is the three problems from 1 don't apply here, so we can do the upgrade seamlessly without any data loss.

There's some hope for 1a though:

  • I'm hopeful that #48 + probably some parallelized encoding/decoding will make things fast and the time gap between stop and start small, so that we're in fairly good shape until we fix ordering properly
  • @woodsaj was going to experiment with max-in-flight and concurrency settings, which I think will result in mostly-correct ordering out of NSQ. but only one way to find out...

though due to 1b and 1c, i think perhaps we shouldn't be spending too much time on save/restore of chunks to/from disk, and just go for 2? I think its downsides may be easy to overcome.
@woodsaj thoughts?

move tag sorting / id generation to MetricData creation

(i'm using the shorthands here, see https://github.com/raintank/raintank-metric/blob/master/README.md)
currently, *schema.MetricData.Id() is called:

  • in nme, at every datapoint for each series, to look up and save the metric definition
  • in nmt, at every datapoint for each series, to look up the AggMetric structure (not yet, but it should)
  • not in nmk: it doesn't seem to call Id(), but this service will most likely be deprecated

this operation, especially when called so often, is expensive: it has to sort the tag strings, allocate a buffer and a few temporary strings, and call Printf a bunch of times.
while these services are always in flux, it's clear that this operation will happen more frequently in the consumers than in the producers.

solutions:
A) sort the tags when we create the MetricData instance, so that Id() can skip that step (but still repeatedly does the other work)
B) pre-generate the Id field at MetricData creation time, and store it as a member of the struct (a rough sketch is below)

I think we should do A+B, but i also think this is low priority. we can address it when it becomes more of a real problem.
what do you think @woodsaj ?
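A rough sketch of A+B: sort the tags and compute the id once at creation time, cache it on the struct, and Id() becomes a cheap field read. Field and function names are illustrative, not the actual schema package.

    package main

    import (
        "crypto/md5"
        "fmt"
        "sort"
        "strings"
    )

    type MetricData struct {
        OrgId int
        Name  string
        Tags  []string
        id    string // cached at creation time (solution B)
    }

    func NewMetricData(orgId int, name string, tags []string) *MetricData {
        sort.Strings(tags) // solution A: sort once, up front
        m := &MetricData{OrgId: orgId, Name: name, Tags: tags}
        sum := md5.Sum([]byte(fmt.Sprintf("%d.%s.%s", orgId, name, strings.Join(tags, "."))))
        m.id = fmt.Sprintf("%d.%x", orgId, sum)
        return m
    }

    // Id is now a field read instead of repeated sorting and formatting.
    func (m *MetricData) Id() string { return m.id }

    func main() {
        m := NewMetricData(1, "litmus.dns.time", []string{"monitor_id:4", "endpoint_id:1"})
        fmt.Println(m.Id())
    }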

laggy backfill could theoretically result in false alerting negatives

with the NSQ approach

  • mostly, after a consumer recovery where we catch up on a bunch of older messages
  • potentially also anytime, if things are a little slow

the fill-up of recent data is not as smooth as i would like (meaning there are some gaps, or even just a gap at the end), and I think that in theory false negatives can be triggered: if there are first enough errors known to cause an alert, but we then momentarily can't backfill as fast as real time advances, it's possible for the number of errors to "drop again" (because of some nulls) and the alert to become OK, and then for backfill to succeed (more errors) resulting in critical. potentially this could repeat a few times, resulting in a few bogus alerts.

this needs more testing "in the cloud" to see how this behaves in real life scenarios

some ideas that may help

  • the upcoming per-topic WAL in nsqd, which will enforce ordering and also potentially allow seeking to "the last 300 messages" to cleanly recover the last 5min first.
  • look at the message ids to collect contiguous batches of the last 5min, then submit all those, then ack all of them
  • manipulating offsets to account for incomplete data, potentially looking at nsq queue depths.

this ticket is mostly to remind myself to think about this and at some point address it.

not enough chunks returned/queried by/from cassandra

./metric_tank --chunkspan 600 --numchunks 3

running graphite-watcher with this patch

+               then := time.Now().Add(-time.Duration(50) * time.Minute)
+               q.Start = &then

shows that only 2 chunks per row are returned.
similar when opening a dashboard and querying the last hour of data: chunks per row is 3 and the first section of an incomplete chunk is not used. for example, for a query of the last hour, at 12:11 there is no data from 11:11 until 11:20.
(screenshot: last-hour)

consolidation-at-read-time: errors like "GetAggregated called with unknown aggSpan 60"

these happen when loading chunks from /tmp/nmt.gob when the aggregation rules have changed since.
we don't dynamically update the aggregation rules after importing the data. this would be pretty tricky anyway (accommodating various subtle differences in automagic ways),
and the loading/saving approach has been deprecated anyway (see #56), so i don't think we should spend time fixing this.

instead we may want to just deprecate the whole chunk saving/loading.

oops

edit: wrong repo

NMT consumes too much memory

10 orgs -> 1200 metrics, default settings keep 10 min worth of data

NMT rss is 24MB, and in go inuse_space is 4MB.
it should be roughly 60 points per metric (10 min at a 10s interval) * 1200 metrics * ~1.5 bytes per compressed point ≈ 108kB.

null values become 0

in the dashboards, when a probe errors on http requests, it still shows 0ms in the UI, even with "null as null" set.

what's probably happening is that the probe does report nulls, but the schema initializes float64s to 0.

let's switch tank back to go-metrics instead of statsd

  • i'm not a fan of 56d3a07; the pointsPerMetric stat is quite valuable, especially in prod. i tried out a change where we just collect this metric without all the additional goroutines and without having to lock the AggMetrics for each metric (262ad85), but the profile still reveals that we spend almost all of our time in statsd udp writes
(pprof) top30 -cum
4.95s of 12.66s total (39.10%)
Dropped 319 nodes (cum <= 0.06s)
Showing top 30 nodes out of 199 (cum >= 0.90s)
      flat  flat%   sum%        cum   cum%
     0.01s 0.079% 0.079%      9.78s 77.25%  runtime.goexit
     0.03s  0.24%  0.32%      4.56s 36.02%  github.com/Dieterbe/statsd-go.(*Client).Send
     0.03s  0.24%  0.55%      4.32s 34.12%  net.(*conn).Write
     0.04s  0.32%  0.87%      4.29s 33.89%  net.(*netFD).Write
     4.02s 31.75% 32.62%      4.20s 33.18%  syscall.Syscall
     0.01s 0.079% 32.70%      3.98s 31.44%  syscall.Write
     0.01s 0.079% 32.78%      3.97s 31.36%  syscall.write
     0.03s  0.24% 33.02%      3.93s 31.04%  fmt.Fprintf
     0.02s  0.16% 33.18%      3.14s 24.80%  github.com/grafana/grafana/pkg/metric/statsd.(*Meter).Value
     0.02s  0.16% 33.33%      3.12s 24.64%  github.com/grafana/grafana/pkg/metric/statsd.Meter.Value
     0.03s  0.24% 33.57%      3.10s 24.49%  github.com/Dieterbe/statsd-go.(*Client).Timing

getting the graph with the web command confirms that all the conn writing is caused by the statsd client
(profile graph: statsd-overhead)
as i've mentioned before, there are some things we could do (sampling, pre-computing, buffering udp writes), but ultimately all of those are still workarounds for the fact that udp writes come with an overhead, and frankly i don't think any of us have time right now to work on optimizing that. there are some other golang statsd clients that may be more advanced, but it takes time to evaluate them because in my experience there are always a lot of subtle differences between clients.

go-metrics is based on a much nicer model IMHO. it can just compute the metrics in-process and stream them straight to graphite, or to $whatever, and also expose them via expvar (json over http)
in my experience it works quite well.
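A minimal sketch of that model, assuming the rcrowley/go-metrics package is what "go-metrics" refers to here: values are aggregated in-process and flushed periodically, instead of one UDP write per update.

    package main

    import (
        "log"
        "os"
        "time"

        "github.com/rcrowley/go-metrics"
    )

    func main() {
        pointsPerMetric := metrics.NewRegisteredMeter("points.per.metric", metrics.DefaultRegistry)

        // periodically dump the registry; a graphite reporter could be used instead
        go metrics.Log(metrics.DefaultRegistry, 10*time.Second,
            log.New(os.Stderr, "metrics: ", log.Lmicroseconds))

        for i := 0; i < 100; i++ {
            pointsPerMetric.Mark(1) // cheap in-process update, no UDP write per call
            time.Sleep(100 * time.Millisecond)
        }
    }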

when rolling up data compute median as well as average(mean)

Currently when showing aggregated data, we are seeing numerous large spikes in the graphs. This typically happens when, for example, an HTTP connect time spikes from 30ms to 2000ms, likely due to a lost SYN packet. Though this is valid data, these individual points significantly affect the averages being computed when rolling up.

By storing the median value, we will help to improve the user experience and show smoother graphs, which are a better representation of the performance of an endpoint.
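A minimal, self-contained sketch of computing both the mean and the median for a rollup window; the median is robust to the occasional 2000ms outlier described above. This is an illustration, not Metrictank's rollup code.

    package main

    import (
        "fmt"
        "sort"
    )

    // rollup returns the mean and the median of one rollup window.
    func rollup(vals []float64) (mean, median float64) {
        if len(vals) == 0 {
            return 0, 0
        }
        sum := 0.0
        for _, v := range vals {
            sum += v
        }
        mean = sum / float64(len(vals))

        sorted := append([]float64(nil), vals...)
        sort.Float64s(sorted)
        mid := len(sorted) / 2
        if len(sorted)%2 == 1 {
            median = sorted[mid]
        } else {
            median = (sorted[mid-1] + sorted[mid]) / 2
        }
        return mean, median
    }

    func main() {
        window := []float64{30, 31, 29, 2000, 30, 32} // one lost-syn spike
        mean, median := rollup(window)
        fmt.Printf("mean=%.1f median=%.1f\n", mean, median)
    }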

too high cpu usage

my laptop has 8 Intel(R) Core(TM) i7-4810MQ CPU @ 2.80GHz cores.
i'm running the devstack, with just 1 endpoint configured with 3 checks every 10 seconds.
in top, raintank-metric is running with 80 to 180% cpu. this seems like way too much.

i suggest importing _ "net/http/pprof" and creating an http listener, so that when cpu % is high, we can get performance profiles at <tcp_addr>/debug/pprof/ (we should probably expose the http port through docker as well). for example:
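A minimal sketch of that suggestion; the port is arbitrary here:

    package main

    import (
        "log"
        "net/http"
        _ "net/http/pprof" // registers the /debug/pprof/ handlers on the default mux
    )

    func main() {
        go func() {
            log.Println(http.ListenAndServe(":6063", nil))
        }()
        select {} // stand-in for the service's real work
    }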

this is the first time i see this though, btw. maybe it only happens in certain conditions. but i'll keep an eye on it and would love to give you a profile next time.

count is not correct for new aggregators

not sure what's going on yet but

2015/12/03 06:34:40 [I] starting with fresh aggmetrics.
2015/12/03 06:34:40 [I] connected to nsqd
2015/12/03 06:34:40 [I] starting listener for metrics and http/debug on :6063
HUH NewAggregator() called creating agg with 21600 2
HUH NewAggregator() called  creating agg with 21600 2
HUH NewAggregator() called  creating agg with 21600 2
2015/12/03 06:34:42 [W] handler adding litmus.grafana.dev1.ping.max 1449124481 515.178661
2015/12/03 06:34:42 [D] AggMetric 1.9b7bbb1b5ea8144bbf33addc2b411645 Add(): created first chunk with first point: <chunk T0=1449124200, LastTs=1449124481, NumPoints=1, Saved=false>
2015/12/03 06:34:42 [D] AggMetric 1.9b7bbb1b5ea8144bbf33addc2b411645 pushing 1449124481,515.178661 to aggregator 600
2015/12/03 06:34:42 [D] aggregator 600 Add(): added point to existing aggregation
2015/12/03 06:34:42 [D] AggMetric 1.9b7bbb1b5ea8144bbf33addc2b411645 pushing 1449124481,515.178661 to aggregator 7200
2015/12/03 06:34:42 [D] aggregator 7200 Add(): added point to existing aggregation
2015/12/03 06:34:42 [D] AggMetric 1.9b7bbb1b5ea8144bbf33addc2b411645 pushing 1449124481,515.178661 to aggregator 21600
2015/12/03 06:34:42 [D] aggregator 21600 Add(): added point to existing aggregation

all the aggregators should be fresh, so when the first point is added the count should still be 0, yet the log reports "added point to existing aggregation".

rollups should align to minute boundary

Currently, when the metric consumer starts, the rollup period is based on the time the first metric is received. If the consumer restarts, then a new rollup period will be used.

Aligning the rollup period to consistent boundaries should fix this and provide more consistent results.
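A minimal sketch of that alignment: snap the rollup period's start down to the previous multiple of the period length, so restarts land on the same boundaries. The 60s period is illustrative.

    package main

    import "fmt"

    // alignToBoundary returns the start of the rollup bucket that ts falls in.
    func alignToBoundary(ts, period uint32) uint32 {
        return ts - (ts % period)
    }

    func main() {
        const period = 60 // seconds
        fmt.Println(alignToBoundary(1449124481, period)) // -> 1449124440
    }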

tank chunk keeping vs clearing, GC, refactoring

because of these facts

  1. Go's runtime does not release memory to the kernel once allocated, even if the data has been cleared; it keeps the memory to reuse when it needs it.
  2. tank uses a circular buffer with a consistent size
  3. it's always nice to be able to serve more data out of RAM especially if the cost is very low.

.. i figured we can hold on to any chunks occupying a slot in the circular buffer unless we need to force an eviction (newer data needs to reuse the slot and evict the older data).
for example, if we got some data, then for a long time (more than what would fit in NMT ram) we don't get any data, and then data again, we might as well hold on to the old data until we need to clear the slot because we received new data that maps to that slot.
but if there was no new data (perhaps because a system went down, a probe temporarily disabled itself, whatever), there is no need to clear the old data, because it wouldn't gain us any memory back.

I could have documented this more explicitly, but the idea is to keep as much data as we can, bounded by the numchunks and chunkspan directives for purposes of controlling RAM usage (but not forcing out data for the sake of forcing out data), so we can serve as much data out of RAM as possible, exploiting gaps in data to serve larger queries and possibly removing the need to query cassandra at all in some scenarios.
if we phrase the NMT mission statement as "keep the last numchunks chunks of numspan size for each metric" (which seems reasonable), then we don't need any GC for chunks at all and can just evict on an as-needed basis. though there should probably be a facility to clear out inactive metrics (on an aggmetric basis, not a chunk basis).

I saw you did some work around this @woodsaj , i haven't studied all the new code yet, but wanted to make these ideas explicit to see what you think and if you have a different opinion.
the alternative approach (not sure if that's what you're going for) would be a more strict forcing of all chunks in the circular buffer to be the last numchunks chunks, and have gaps if there was missing data. this could be simpler in code but also adds the need for GC.

Some thoughts re #42:

  • i was surprised that 98b0bf1 adds more lines than removes
  • I see a new feature: periodically scan chunks and close any that "have not received data in a while and have not been written to in a while. Lets persist it." will this be a problem if the data stream gets interrupted (and points get delayed so that it may take tens of minutes or hours before a new point for a metric arrives)?
  • i'm concerned about the newly added GC, which locks the entire aggmetrics struct for the duration of the entire GC run. this introduces a delay on queries which scales linearly with the amount of metrics, and it's not easy to solve. this is yet another reason why it's IMHO better to just avoid GC and let data expire naturally.

future of raintank-metric, use something else?

please help me fill this in. we need to agree on what our requirements/desires are before talking about using other tools

current requirements?

  • safely relay metrics from our queue into storage and ES without losing data in case we can't safely deliver
  • decode messages from our custom format used in rabbitmq (but i suppose we could also store them differently in rabbit?)
  • encode messages into our custom format, to be stored in ES

possible future requirements

  • real time aggregation
  • real time processing/alerting (I personally don't think we need to be too concerned about this just yet. once we have high performance/scalability requirements we'll probably use a dedicated real time processing framework like spark/storm/heron/...)

questions

  • can we write our own decode, encode, processor plugins in Go, in heka?
  • can somebody describe what we do with ES from the raintank-metric/rabbitmq perspective and how dependent this is on the main storage backend? like if kairosdb is down, can we or must we still update ES? if ES is down, can or must we still write to kairos?
  • does rabbitmq support multiple readers of the same data, and does it maintain what has been acked by which reader?

make tags strings instead of numbers

(*schema.MetricData)(0xc8201de300)({
 OrgId: (int) 1,
 Name: (string) (len=37) "litmus.localhost.dev1.dns.error_state",
 Metric: (string) (len=22) "litmus.dns.error_state",
 Interval: (int) 10,
 Value: (float64) 0,
 Unit: (string) (len=5) "state",
 Time: (int64) 1443523161,
 TargetType: (string) (len=5) "gauge",
 Tags: ([]string) (len=3 cap=3) {
  (string) (len=13) "endpoint_id:1",
  (string) (len=12) "monitor_id:4",
  (string) (len=14) "collector:dev1"
 }
})

just curious, why tags like endpoint_id:1 and not endpoint=localhost? (the latter would facilitate some search scenarios (metrics2.0/graph-explorer style) where you can search for all metrics with endpoint=localhost etc, or autocomplete/auto-suggest: give me all values for endpoint=local*)

rollups.

the time has come.

  • https://github.com/raintank/ops/issues/112 has some details
  • i know we wrote down some thoughts etc at the summit, do we still have those notes? or perhaps not that important
  • implementation will probably be an nsq consumer that generates all lower-res streams and stores them (including for current data), as opposed to a design where lower-res only starts where higher-res ends.
  • we can generate spread data like librato/omniti/hostedgraphite, or store individual min/max/avg/.. series, or use an algo like LTTB (see https://github.com/sveinn-steinarsson/flot-downsample/)

support clean restarts of NMT

When a shutdown is initiated, NMT needs to:

  • stop consuming from NSQ
  • flush all aggmetrics to disk
  • exit

when starting up we need to:

  • load all aggmetrics from disk
  • start consuming from NSQ

This allows us to restart NMT without experiencing data loss. This is necessary as we plan to deploy an early version and iterate on it frequently. (a rough sketch of the flush/load cycle is below)
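A minimal sketch of that flush/load cycle, using encoding/gob as the on-disk format (the /tmp/nmt.gob path mentioned in other issues suggests gob, but the types here are simplified stand-ins):

    package main

    import (
        "encoding/gob"
        "log"
        "os"
    )

    type AggMetrics struct {
        Metrics map[string][]float64 // simplified stand-in for the real aggmetrics
    }

    // save flushes all aggmetrics to disk during shutdown.
    func save(path string, a *AggMetrics) error {
        f, err := os.Create(path)
        if err != nil {
            return err
        }
        defer f.Close()
        return gob.NewEncoder(f).Encode(a)
    }

    // load restores aggmetrics from disk at startup, before consuming resumes.
    func load(path string) (*AggMetrics, error) {
        f, err := os.Open(path)
        if err != nil {
            return nil, err
        }
        defer f.Close()
        var a AggMetrics
        if err := gob.NewDecoder(f).Decode(&a); err != nil {
            return nil, err
        }
        return &a, nil
    }

    func main() {
        a := &AggMetrics{Metrics: map[string][]float64{"litmus.dns.time": {1, 2, 3}}}
        if err := save("/tmp/nmt.gob", a); err != nil {
            log.Fatal(err)
        }
        if b, err := load("/tmp/nmt.gob"); err == nil {
            log.Println("restored", len(b.Metrics), "metrics")
        }
    }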
