m3db / m3

M3 monorepo - Distributed TSDB, Aggregator and Query Engine, Prometheus Sidecar, Graphite Compatible, Metrics Platform

Home Page: https://m3db.io/

License: Apache License 2.0

Topics: prometheus, kubernetes, graphite, metrics, tsdb, query-engine, aggregator, m3

m3's Introduction

M3


Distributed TSDB and Query Engine, Prometheus Sidecar, Metrics Aggregator, and more such as Graphite storage and query engine.

More Information

Community Meetings

You can find recordings of past meetups here: https://vimeo.com/user/120001164/folder/2290331.

Install

Dependencies

The simplest and quickest way to try M3 is to use Docker; read the M3 quickstart guide for other options.

This example uses jq to format the output of API calls. It is not essential for using M3DB.

Usage

The following is a simplified version of the M3 quickstart guide; we suggest reading that guide for more detail.

  1. Start a Container
docker run -p 7201:7201 -p 7203:7203 --name m3db -v $(pwd)/m3db_data:/var/lib/m3db quay.io/m3db/m3dbnode:v1.0.0
  2. Create a Placement and Namespace
#!/bin/bash
curl -X POST http://localhost:7201/api/v1/database/create -d '{
  "type": "local",
  "namespaceName": "default",
  "retentionTime": "12h"
}' | jq .
  3. Ready a Namespace
curl -X POST http://localhost:7201/api/v1/services/m3db/namespace/ready -d '{
  "name": "default"
}' | jq .
  4. Write Metrics
#!/bin/bash
curl -X POST http://localhost:7201/api/v1/json/write -d '{
  "tags": 
    {
      "__name__": "third_avenue",
      "city": "new_york",
      "checkout": "1"
    },
    "timestamp": '\"$(date "+%s")\"',
    "value": 3347.26
}'
  5. Query Results

Linux

curl -X "POST" -G "http://localhost:7201/api/v1/query_range" \
  -d "query=third_avenue" \
  -d "start=$(date "+%s" -d "45 seconds ago")" \
  -d "end=$( date +%s )" \
  -d "step=5s" | jq .  

macOS/BSD

curl -X "POST" -G "http://localhost:7201/api/v1/query_range" \
  -d "query=third_avenue > 6000" \
  -d "start=$(date -v -45S "+%s")" \
  -d "end=$( date +%s )" \
  -d "step=5s" | jq .

Contributing

You can ask questions and give feedback through the M3 community channels.

M3 welcomes pull requests; read the contributing guide to get set up for building and contributing to M3.


This project is released under the Apache License, Version 2.0.

m3's People

Contributors

andrewmains12, arnikola, benraskin92, cw9, dgromov, fishie9, gediminasgu, gibbscullen, haijuncao, jeromefroe, justinjc, linasm, martin-mao, mway, nbroyles, nikunjgit, notbdu, prateek, rallen090, richardartoul, robskillington, ryanhall07, schallert, soundvibe, teddywahle, vdarulis, vpranckaitis, wesleyk, xichen2020, yyin-sc


m3's Issues

Investigate why adding back a node being removed causes high fetch latency

After a node is added back during a remove (backing out of a remove node), we have noticed that fetch latency spikes to up to 100x the normal latency (which is well under 1s) for quite some time, or until the node is restarted. Write latency and other functions do not appear to be affected, however.

The latency spike seems to affect only that single node; overall cluster latency remains steady when using a read consistency of majority or lower.

Refactor series block merging

Changes:

  • Our integration tests caught (yay!) an edge case in repairs during block merging (discussed below). Address that.
  • Introduce a write lock within a series buffer, and modify the current series lock to be used appropriately (can become a read lock on the series, and a write lock on the buffer)
  • Block merging logic exists in both series.buffer and series.blocks. The buffer only attempts to rotate blocks out of the buffer; refactor this to transfer block ownership from buffer.blocks to series.blocks, and perform the merge lazily there.

Integration test issue:

  • Say we have 2 m3db replicas (R0, R1), each with a block for foo at time t0 (block b0 with R0 and block b1 with R1). Each block has content not present in the other, i.e. needs to be merged.
  • At some later time t1, R0 starts repairing the differences - it fetches metadata from its peer, observes the difference, and issues a Merge() on the block.
  • Underneath the covers, this queues a merge to be triggered when the block is read next, and because of the way we order the code, we swap metadata to reflect that of the peer block retrieved. The merge of the blocks is not triggered upon metadata retrieval.
  • So down the road, when R1 starts its repairs, it asks for metadata from R0 and gets back the same metadata it already has, so it doesn't attempt a merge.
  • It thereby fails to repair data that it should.

Speed up cache shard indices on bootstrap

Right now, bootstrapping takes some time simply to open all the file descriptors and read the shard indices when caching them during a bootstrap.

This could be done with some reasonable level of parallelism to speed it up, as sketched below.
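
A minimal sketch of what that could look like, assuming a bounded worker approach; the function and parameter names (cacheShardIndex, shards, workers) are illustrative rather than actual M3DB APIs:

package bootstrap

import (
	"sync"
)

// cacheShardIndicesParallel fans the per-shard work out across a fixed number
// of workers so that file descriptor opens and index reads overlap.
func cacheShardIndicesParallel(shards []uint32, workers int, cacheShardIndex func(shard uint32) error) error {
	var (
		wg       sync.WaitGroup
		mu       sync.Mutex
		firstErr error
		sem      = make(chan struct{}, workers) // bounds concurrency
	)
	for _, shard := range shards {
		shard := shard
		wg.Add(1)
		sem <- struct{}{}
		go func() {
			defer func() { <-sem; wg.Done() }()
			if err := cacheShardIndex(shard); err != nil {
				mu.Lock()
				if firstErr == nil {
					firstErr = err
				}
				mu.Unlock()
			}
		}()
	}
	wg.Wait()
	return firstErr
}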

Refactor M3DB to not rely upon commitlog filename timestamp

The fs package of M3DB has a function called "filesBefore" which returns a list of commit log files whose blocks start before a given time. Instead of relying on the timestamps in the filenames, we should delete this function and have the packages that depend on it (currently just the cleanup manager) use the fs package to get a list of all commit log files, and then use the ReadLogInfo function in the commitlog package to determine the block start time for each file.
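
An illustrative sketch of the proposed replacement, where readLogInfo stands in for the real commitlog.ReadLogInfo call (whose exact signature may differ):

package cleanup

import "time"

// commitLogsBefore returns files whose block starts before t, using the info
// read from each file rather than filename parsing.
func commitLogsBefore(files []string, t time.Time, readLogInfo func(path string) (blockStart time.Time, err error)) ([]string, error) {
	var out []string
	for _, f := range files {
		start, err := readLogInfo(f)
		if err != nil {
			return nil, err
		}
		if start.Before(t) {
			out = append(out, f)
		}
	}
	return out, nil
}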

Make `all-gen` idempotent on `master`

Currently, running `make all-gen` creates changes in generated files even when the source files are unchanged. Address this to minimise noise in PRs and overhead on developers.

Switch to model where metadata for unread blocks can be removed from memory

Right now we retain block metadata in memory after we unwire the actual data from memory. This means looking up whether a series exists is very fast, but it also means we use far more memory than necessary. Since our performance is currently adversely affected by how large our heap is (due to Go's GC), and since we also do not want to be memory bound long term, we need to move to a model where we can look this up on demand.

This requires a significant amount of change at the shard layer, because our current existence check is a simple map lookup, etc.

It is something that can drastically reduce the hardware footprint required for large datasets (100s of TBs).

Why not Prometheus or Influx

This seems like a big engineering effort; you are reinventing the wheel instead of using one of the existing solutions. Why not use Prometheus or Influx?

Better support for dynamic namespace updates/removals

The current implementation listens to KV for namespace changes, and does the following:

  • (a) If any new namespaces are listed, it creates the corresponding namespaces, bootstraps, and starts serving reads/writes for the new namespaces.
  • (b) If any namespaces currently running in the process are not listed in the KV update, it does NOT remove the corresponding in-memory objects. These changes will be applied when the process restarts.
  • (c) If any namespaces currently running in the process have different settings in the KV update (e.g. new retention), it does NOT apply the updates to the in-memory objects. These changes will be applied when the process restarts.

We should enhance the code to support (b) & (c) safely.
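
A minimal sketch of the diff logic that (b) and (c) would need, assuming a registry of running namespaces keyed by name; applyAdd, applyRemove and applyUpdate are hypothetical hooks into the database, not real M3DB APIs:

package namespace

// Metadata is a simplified stand-in for namespace settings.
type Metadata struct {
	Name      string
	Retention string
}

func applyKVUpdate(running map[string]Metadata, update map[string]Metadata,
	applyAdd, applyRemove, applyUpdate func(Metadata) error) error {
	// (a) new namespaces: create, bootstrap and start serving them.
	for name, md := range update {
		if _, ok := running[name]; !ok {
			if err := applyAdd(md); err != nil {
				return err
			}
		}
	}
	// (b) namespaces missing from the update: safely tear down in-memory objects.
	for name, md := range running {
		if _, ok := update[name]; !ok {
			if err := applyRemove(md); err != nil {
				return err
			}
		}
	}
	// (c) namespaces with changed settings: apply the new options in place.
	for name, md := range update {
		if cur, ok := running[name]; ok && cur != md {
			if err := applyUpdate(md); err != nil {
				return err
			}
		}
	}
	return nil
}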

Add back the ability to blocking flush after a bootstrap

The following PR works around an issue we encountered when naively trying to flush after a bootstrap:
#268

It does this by simply waiting for the next tick and a flush to occur to actually finish the bootstrap process.

Ideally we would actually flush directly after a bootstrap. However, we need to coordinate with the ticking procedure to ensure that all buffers have been rotated into each series, by waiting for a tick to rotate all the buffers before we can start a flush cycle (otherwise we miss blocks that are just about to rotate in for that time window when we flush, and we don't flush them again because the time window is marked flushed).

Add support for writing/reading arbitrary values not just floats

Currently all the interfaces both at RPC and the package level only allow for float64 values to be written and read into M3DB.

There is no need for such a restriction; as long as users can provide a stream encoder and decoder, there is no reason why they can't write/read arbitrary data structures to M3DB as time series data.

Perhaps we can just add some generic write/read methods alongside the current specialized float64 methods.
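
One possible shape for such generic methods, sketched with hypothetical interfaces; the real client API only supports float64 values today and none of these names are part of it:

package client

import (
	"io"
	"time"
)

// ValueEncoder and ValueDecoder would be supplied by users who want to store
// arbitrary structures in a series.
type ValueEncoder interface {
	Encode(w io.Writer, timestamp time.Time, value interface{}) error
}

type ValueDecoder interface {
	Decode(r io.Reader) (timestamp time.Time, value interface{}, err error)
}

// GenericSession illustrates write/read methods that could sit alongside the
// current float64-specific ones.
type GenericSession interface {
	WriteValue(namespace, id string, t time.Time, v interface{}, enc ValueEncoder) error
	FetchValues(namespace, id string, start, end time.Time, dec ValueDecoder) ([]interface{}, error)
}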

Avoid redundant creating pools during Options construction

In the production code path we call storage.NewOptions(), which constructs all the underlying pools, and then we overwrite them with the pools specified in the configuration. We can save a lot of allocations by avoiding this redundant creation. One possible way to do this is to add a constructor which doesn't initialise pool values.

We should audit the code-base to see where else this applies; pretty much all Options that have pools specified need to be considered.
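
One way the constructor split could look, sketched with placeholder pool fields rather than the real storage options:

package storage

type options struct {
	bytesPool   interface{} // stand-ins for the real pool types
	contextPool interface{}
}

// NewOptions allocates default pools (existing behaviour, sketched).
func NewOptions() *options {
	return &options{
		bytesPool:   newDefaultBytesPool(),
		contextPool: newDefaultContextPool(),
	}
}

// NewOptionsWithoutPools skips default pool construction entirely; callers are
// expected to set every pool explicitly (e.g. from configuration).
func NewOptionsWithoutPools() *options {
	return &options{}
}

func newDefaultBytesPool() interface{}   { return struct{}{} }
func newDefaultContextPool() interface{} { return struct{}{} }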

Experiment with drwmutex technique (using CPUID instruction) for faster pooling of objects

@prateek mentioned drwmutex the other day, which uses a CPU instruction that provides the current executing core - albeit only on Linux x86:
https://github.com/jonhoo/drwmutex

All our object pools currently use a channel backed object pool that might be a contention based bottleneck when scaling up further (this should first be proved).

When attempting to push writes per node to the limit, we should investigate a lock-free object pool implemented in Go assembler (at least the get() and put() calls) that uses CPUID to select the per-core queue of pooled objects and returns either the next available pooled object or nil when empty. Go assembler code is currently not preempted and is unlikely ever to be, which makes it safe to provide this functionality.
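
The real proposal needs CPUID support written in Go assembler; the sketch below only illustrates the per-core free-list layout, picks a shard pseudo-randomly instead of by executing core, and is not concurrency-safe as written:

package pool

import (
	"math/rand"
	"runtime"
)

type shardedPool struct {
	shards [][]interface{} // one free list per shard, would be per core
	alloc  func() interface{}
}

func newShardedPool(alloc func() interface{}) *shardedPool {
	return &shardedPool{
		shards: make([][]interface{}, runtime.NumCPU()),
		alloc:  alloc,
	}
}

// Get returns a pooled object from the chosen shard, or allocates on empty.
// A real implementation would index by CPUID and avoid locks entirely.
func (p *shardedPool) Get() interface{} {
	i := rand.Intn(len(p.shards))
	if n := len(p.shards[i]); n > 0 {
		v := p.shards[i][n-1]
		p.shards[i] = p.shards[i][:n-1]
		return v
	}
	return p.alloc()
}

// Put returns an object to a shard's free list.
func (p *shardedPool) Put(v interface{}) {
	i := rand.Intn(len(p.shards))
	p.shards[i] = append(p.shards[i], v)
}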

Optimise Commit Log Bootstrapping

Two changes:
(1) The commit log bootstrapper reads all the files present in the commit log directory, regardless of the time range it's bootstrapping for. This can be optimised to only read the correct files.
(2) The commit log bootstrapper is run per namespace, which means we read all commit log files present on disk for each namespace we bootstrap. Optimise this to only require a single read pass across all the namespaces.
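
A sketch of change (1), under the assumption that each file's block start can be obtained cheaply (from the filename or ReadLogInfo); fileBlockStart is a placeholder for that lookup and blockSize is the commit log block size:

package commitlog

import "time"

type timeRange struct{ start, end time.Time }

func filesInRange(files []string, r timeRange, blockSize time.Duration,
	fileBlockStart func(path string) (time.Time, error)) ([]string, error) {
	var out []string
	for _, f := range files {
		start, err := fileBlockStart(f)
		if err != nil {
			return nil, err
		}
		// Keep the file only if its block [start, start+blockSize) overlaps
		// the requested bootstrap range.
		if start.Before(r.end) && start.Add(blockSize).After(r.start) {
			out = append(out, f)
		}
	}
	return out, nil
}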

Rethink peer streaming fetch block retry semantics

At some point we should rethink our retry semantics here. If any of these fail, they won't be retried against the peer we want them from; instead we fall back to other peers and attempt to fetch from them.

Now that we are doing merged reads for the peers bootstrapper, we can't really fall back to the next peer (we want blocks from all the peers in the case where we need merges).

For now this is fine, but we probably want real retry semantics in the future for trying to retrieve from the same peer on failure.

Address peer bootstrapping retries strategy for nodes that go down

Right now they'll be retried against a different peer, but for checksums that don't agree, requests go out to all peers. Thus a retry against a different peer is a duplicative request and also not valuable, as the entire reason it was issued to that peer was to merge the results together.

Test faulty build using DTest

Using this ticket to track known bad builds we should test once we have a decent (any?) workload being sent in DTests. We should ensure the DTest suite is able to capture these issues:

  • 888db3e is known to have double frees

/cc @robskillington

Address panic in multireader iterator

panic: runtime error: index out of range

goroutine 2815650 [running]:
panic(0xa76940, 0xc42000e0e0)
        /usr/lib/go-1.7/src/runtime/panic.go:500 +0x1a1
code.uber.internal/infra/statsdex/vendor/github.com/m3db/m3db/encoding/m3tsz.(*readerIterator).Current(0xd2d51e8180, 0xecf9e1f01, 0x0, 0xeefe60, 0x28b4576c1162192c, 0x413c01, 0x0, 0x0, 0x0)
        /var/cache/udeploy/r/statsdex_m3dbnode/sjc1-produ-0000000283-v2/tmp/src/code.uber.internal/infra/statsdex/vendor/github.com/m3db/m3db/encoding/m3tsz/iterator.go:355 +0x139
code.uber.internal/infra/statsdex/vendor/github.com/m3db/m3db/encoding.(*multiReaderIterator).moveIteratorToValidNext(0xc9c3fb73b0, 0x7f010f41b378, 0xd2d51e8180, 0x7f010f41b378)
        /var/cache/udeploy/r/statsdex_m3dbnode/sjc1-produ-0000000283-v2/tmp/src/code.uber.internal/infra/statsdex/vendor/github.com/m3db/m3db/encoding/multi_reader_iterator.go:167 +0x8e
code.uber.internal/infra/statsdex/vendor/github.com/m3db/m3db/encoding.(*multiReaderIterator).moveIteratorsToNext(0xc9c3fb73b0)
        /var/cache/udeploy/r/statsdex_m3dbnode/sjc1-produ-0000000283-v2/tmp/src/code.uber.internal/infra/statsdex/vendor/github.com/m3db/m3db/encoding/multi_reader_iterator.go:146 +0x10d
code.uber.internal/infra/statsdex/vendor/github.com/m3db/m3db/encoding.(*multiReaderIterator).moveToNext(0xc9c3fb73b0)
        /var/cache/udeploy/r/statsdex_m3dbnode/sjc1-produ-0000000283-v2/tmp/src/code.uber.internal/infra/statsdex/vendor/github.com/m3db/m3db/encoding/multi_reader_iterator.go:95 +0x309
code.uber.internal/infra/statsdex/vendor/github.com/m3db/m3db/encoding.(*multiReaderIterator).Next(0xc9c3fb73b0, 0xecf9e1ffb)
        /var/cache/udeploy/r/statsdex_m3dbnode/sjc1-produ-0000000283-v2/tmp/src/code.uber.internal/infra/statsdex/vendor/github.com/m3db/m3db/encoding/multi_reader_iterator.go:67 +0xa9
code.uber.internal/infra/statsdex/vendor/github.com/m3db/m3db/client.(*blocksResult).mergeReaders(0xc597a55e60, 0xecf9df480, 0xc500000000, 0xeefe60, 0xc43bc842a0, 0x2, 0x2, 0x0, 0x0, 0x0, ...)
        /var/cache/udeploy/r/statsdex_m3dbnode/sjc1-produ-0000000283-v2/tmp/src/code.uber.internal/infra/statsdex/vendor/github.com/m3db/m3db/client/session.go:1851 +0x174
code.uber.internal/infra/statsdex/vendor/github.com/m3db/m3db/client.(*blocksResult).addBlockFromPeer(0xc597a55e60, 0xec70c0, 0xc536a033c0, 0xd0a6c55680, 0x0, 0x0)
        /var/cache/udeploy/r/statsdex_m3dbnode/sjc1-produ-0000000283-v2/tmp/src/code.uber.internal/infra/statsdex/vendor/github.com/m3db/m3db/client/session.go:1829 +0x81a

Investigate TestCommitLogBootstrap integration test failing for large blockNum values

The integration test TestCommitLogBootstrap currently passes when blockNum is set to 30. However, if you increase this value to 300 it fails (some series / datapoints simply don't get bootstrapped).

I've narrowed the issue down to the bootstrapper discarding the data due to the following check:

blockStart := dp.Timestamp.Truncate(blockSize)
blockEnd := blockStart.Add(blockSize)
blockRange := xtime.Range{
	Start: blockStart,
	End:   blockEnd,
}
if !ranges.Overlaps(blockRange) {
	// Data in this block does not match the requested ranges
	continue
}

Either the test is generating invalid data, or the bootstrapper logic is incorrect and it's throwing away correct data.

[bootstrap] Investigate increasing GOMAXPROCS to improve IO performance

I found this interesting discussion recently on improving the IO throughput of Go programs. The tldr is that setting GOMAXPROCS to a value which is higher than the number of cores can improve IO throughput at the expense of a negligible increase in CPU from increased context switching. I know we've been trying to cut down on commit log bootstrap time so I thought I would bring this up as a potential knob we can optimize if IO is a bottleneck contributing to longer bootstrapping times.

cc @robskillington @prateek
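
For experimentation, the knob can be set either via the GOMAXPROCS environment variable (no code change) or programmatically; a small example below, where the 2x multiplier is an arbitrary starting point rather than a recommendation:

package main

import (
	"fmt"
	"runtime"
)

func main() {
	// Oversubscribe GOMAXPROCS relative to the core count and report the
	// previous value; the right factor would need benchmarking against
	// commit log bootstrap times.
	prev := runtime.GOMAXPROCS(2 * runtime.NumCPU())
	fmt.Printf("GOMAXPROCS raised from %d to %d\n", prev, 2*runtime.NumCPU())
}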

Fix approach to generating mocks

The current approach to generating mocks is untenable as we add developers who aren't the original authors, as it requires manual deduction to trace back from the generated code to the process used to generate it, especially since there isn't a 1:1 mapping between the generated mocks and the interfaces used to generate them. We need a simple, reproducible build target that can be used to re-generate mocks for the types that require them.
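
One possible shape for that target, sketched with placeholder file and interface names rather than the real M3DB packages: colocating go:generate directives with the interfaces they mock lets a single `go generate ./...` (or a make target wrapping it) reproduce every mock.

package storage

//go:generate mockgen -source=types.go -destination=types_mock.go -package=storage

// Database stands in for whichever interfaces need generated mocks.
type Database interface {
	Open() error
	Close() error
}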

Either state that M3TSZ is lossy or fix precision loss for some values interpreted as "int like"

This test case fails if adding it to roundtrip_test.go:

func TestExtraPreciseRoundTrip(t *testing.T) {
	start := time.Now().Truncate(time.Second)
	testRoundTrip(t, []ts.Datapoint{
		{Timestamp: start, Value: 484.81953300000004},
		{Timestamp: start.Add(2 * time.Second), Value: 336.34150700000004},
		{Timestamp: start.Add(3 * time.Second), Value: 138.22963599999997},
		{Timestamp: start.Add(4 * time.Second), Value: 442.91275199999995},
	})
}

cc @martin-mao @xichen2020

[database] Investigate use of a distributed Read-Write Mutex

I came across an interesting package recently that implements a distributed read/write mutex and I thought it might be worth investigating for m3db. Under the hood, the package uses a slice of sync.RWMutex whose length is equal to the number of CPUs on the server. Calls to RLock first check which CPU the goroutine is running on and use that to index into the underlying slice and call RLock on the resulting mutex. Calls to Lock, on the other hand, iterate through the slice and call Lock on each mutex. As a result, the lock can lead to pretty drastic performance improvements (the README provides some benchmarks) for workloads which run on servers with a large number of CPUs and for which reads greatly dominate writes. It seemed plausible to me that the workload characteristics of db might be such that it would see improvements, so I figured I might bring it up.

cc @robskillington @prateek
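
A cut-down sketch of the idea, assuming we approximate the per-CPU shard selection (the real drwmutex picks the shard from the executing CPU, which needs platform-specific support; here the reader supplies an index, e.g. random or per-goroutine):

package xsync

import (
	"runtime"
	"sync"
)

type paddedRWMutex struct {
	sync.RWMutex
	_ [64]byte // padding to reduce false sharing between shards
}

// distributedRWMutex holds one RWMutex per CPU: readers lock a single shard,
// writers lock all of them.
type distributedRWMutex []paddedRWMutex

func newDistributedRWMutex() distributedRWMutex {
	return make(distributedRWMutex, runtime.NumCPU())
}

// RLocker returns the shard a reader should RLock/RUnlock.
func (m distributedRWMutex) RLocker(i int) *sync.RWMutex {
	return &m[i%len(m)].RWMutex
}

// Lock acquires every shard, giving the writer exclusive access.
func (m distributedRWMutex) Lock() {
	for i := range m {
		m[i].Lock()
	}
}

func (m distributedRWMutex) Unlock() {
	for i := range m {
		m[i].Unlock()
	}
}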

Add ability to cancel a bootstrap during a topology change

Right now a bootstrap will continue to try and complete regardless of whether a topology change invalidates a whole set of shards it is bootstrapping.

To avoid wasted work we should make bootstraps cancellable and ensure we restart the cycle should another topology change occur.

Investigate long bootstrap times for the commit log bootstrapper

This may be simply a performance issue or could be an edge case with the commit log iterator, but it seems when bootstrapping from multiple large commit log files the commit log bootstrapper takes an exceptionally long time - much longer than previously benchmarked on a single file.

Add bootstrapping status endpoint

We need the ability to tell if an m3dbnode is currently bootstrapping for DTest. Expose that in the health endpoint if it's cheap; otherwise expose it under a new detailed status endpoint.
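
A hypothetical shape for the detailed endpoint; isBootstrapping would be wired to the database's real bootstrap state, and the route name is a placeholder:

package httpd

import (
	"encoding/json"
	"net/http"
)

// registerStatusHandler exposes a JSON status document including whether the
// node is currently bootstrapping.
func registerStatusHandler(mux *http.ServeMux, isBootstrapping func() bool) {
	mux.HandleFunc("/status", func(w http.ResponseWriter, r *http.Request) {
		w.Header().Set("Content-Type", "application/json")
		json.NewEncoder(w).Encode(map[string]bool{
			"bootstrapping": isBootstrapping(),
		})
	})
}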

Add instrumentation for background processing tasks

Add more instrumentation for background processes (tick/flush/bootstrap/cleanup/...), i.e.
(1) Add logging per namespace (info level),
(2) Add metrics per namespace,
(3) Add logging per shard (probably debug level, with ability to turn up),
(4) Add write/read numbers per shard - useful to gauge load distribution

While we're here, also:
(5) Migrate to using Zap for logging
(6) Use lumberjack to specify output directory for logs, rotation policy
(7) Document dynamic log level changing per namespace/shard via curl
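
A minimal sketch of items (5) and (6), assuming zap writing JSON logs through lumberjack for size-based rotation; the file path and rotation numbers are placeholders:

package instrument

import (
	"go.uber.org/zap"
	"go.uber.org/zap/zapcore"
	"gopkg.in/natefinch/lumberjack.v2"
)

// newRotatingLogger builds a zap logger whose output is rotated by lumberjack.
func newRotatingLogger() *zap.Logger {
	writer := zapcore.AddSync(&lumberjack.Logger{
		Filename:   "/var/log/m3dbnode/m3dbnode.log", // placeholder path
		MaxSize:    256,                              // megabytes per file
		MaxBackups: 5,
		MaxAge:     7, // days
	})
	core := zapcore.NewCore(
		zapcore.NewJSONEncoder(zap.NewProductionEncoderConfig()),
		writer,
		zap.InfoLevel,
	)
	return zap.New(core)
}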
