
moss

moss provides a simple, fast, persistable, ordered key-val collection implementation as a 100% golang library.

moss stands for "memory-oriented sorted segments".


Features

  • ordered key-val collection API
  • 100% go implementation
  • key range iterators
  • snapshots provide for isolated reads
  • atomic mutations via a batch API
  • merge operations allow for read-compute-write optimizations for write-heavy use cases (e.g., updating counters)
  • concurrent readers and writers don't block each other
  • child collections allow multiple related collections to be atomically grouped
  • optional, advanced APIs to avoid extra memory copying
  • optional lower-level storage implementation, called "mossStore", that uses an append-only design for writes and mmap() for reads, with configurable compaction policy; see: OpenStoreCollection()
  • mossStore supports navigating back through previous commit points in read-only fashion, and supports reverting to previous commit points.
  • optional persistence hooks to allow write-back caching to a lower-level storage implementation that advanced users may wish to provide (e.g., you can hook moss up to leveldb, sqlite, etc)
  • event callbacks allow the monitoring of asynchronous tasks
  • unit tests
  • fuzz tests via go-fuzz & smat (github.com/mschoch/smat); see README-smat.md
  • moss store's diagnostic tool: mossScope

License

Apache 2.0

Example

import "github.com/couchbase/moss"

c, err := moss.NewCollection(moss.CollectionOptions{})
c.Start()
defer c.Close()

batch, err := c.NewBatch(0, 0)
defer batch.Close()

batch.Set([]byte("car-0"), []byte("tesla"))
batch.Set([]byte("car-1"), []byte("honda"))

err = c.ExecuteBatch(batch, moss.WriteOptions{})

ss, err := c.Snapshot()
defer ss.Close()

ropts := moss.ReadOptions{}

val0, err := ss.Get([]byte("car-0"), ropts) // val0 == []byte("tesla").
valX, err := ss.Get([]byte("car-not-there"), ropts) // valX == nil.

// A Get can also be issued directly against the collection
val1, err := c.Get([]byte("car-1"), ropts) // val1 == []byte("honda").

For persistence, you can use...

store, collection, err := moss.OpenStoreCollection(directoryPath,
    moss.StoreOptions{}, moss.StorePersistOptions{})

Design

The design is similar to a (much) simplified LSM tree, with a stack of sorted, immutable key-val arrays or "segments".

To incorporate the next Batch of key-val mutations, the incoming key-val entries are first sorted into an immutable "segment", which is then atomically pushed onto the top of the stack of segments.

For readers, a higher segment in the stack will shadow entries of the same key from lower segments.

Separately, an asynchronous goroutine (the "merger") will continuously merge N sorted segments to keep stack height low.

In the best case, a remaining, single, large sorted segment will be efficient in memory usage and efficient for binary search and range iteration.

Iterations when the stack height is > 1 are implemented using an N-way heap merge.

In this design, the stack of segments is treated as immutable via a copy-on-write approach whenever the stack needs to be "modified". So, multiple readers and writers won't block each other, and taking a Snapshot is a similarly cheap operation, as it just clones the stack.

See also the DESIGN.md writeup.

Limitations and considerations

NOTE: Keys in a Batch must be unique. That is, myBatch.Set("x", "foo"); myBatch.Set("x", "bar") is not supported. Applications that do not naturally meet this requirement might maintain their own map[key]val data structures to ensure this uniqueness constraint.
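One way an application might enforce that uniqueness is to stage mutations in its own map and only build the moss Batch afterwards. A minimal sketch; the stagedBatch helper here is hypothetical, not part of moss:

```go
package main

import "fmt"

// stagedBatch is a hypothetical application-side helper that collapses
// repeated Sets on the same key so only the last value reaches moss.
type stagedBatch struct {
	vals map[string][]byte
}

func newStagedBatch() *stagedBatch {
	return &stagedBatch{vals: map[string][]byte{}}
}

// Set records the mutation, silently replacing any earlier value for key.
func (b *stagedBatch) Set(key, val []byte) {
	b.vals[string(key)] = val
}

// Flush would copy the staged, now-unique mutations into a real moss
// Batch; here it just returns the map for illustration.
func (b *stagedBatch) Flush() map[string][]byte {
	return b.vals
}

func main() {
	b := newStagedBatch()
	b.Set([]byte("x"), []byte("foo"))
	b.Set([]byte("x"), []byte("bar")) // replaces "foo" before moss sees it
	fmt.Printf("%s\n", b.Flush()["x"]) // prints "bar"
}
```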

Max key length is 2^24 bytes (24 bits are used to track key length).

Max val length is 2^28 bytes (28 bits are used to track val length).

Metadata overhead for each key-val operation is 16 bytes.

Read performance characterization is roughly O(log N) for key-val retrieval.

Write performance characterization is roughly O(M log M), where M is the number of mutations in a batch when invoking ExecuteBatch().

Those performance characterizations, however, don't account for background, asynchronous processing for the merging of segments and data structure maintenance.

A background merger task, for example, that is too slow can eventually stall ingest of new batches. (See the CollectionOptions settings that limit segment stack height.)

As another example, one slow reader that holds onto a Snapshot or an Iterator for a long time can hold onto a lot of resources. In the worst case, the reader's Snapshot or Iterator may delay the reclamation of large, old segments, even after incoming mutations have obsoleted the immutable segments that the reader is still holding onto.

Error handling

Please note that the background goroutines of moss may run into errors, for example during optional persistence operations. To be notified of these cases, your application can provide an optional CollectionOptions.OnError callback func (highly recommended), which moss will invoke.

Logging

Please see the optional CollectionOptions.Log callback func and the CollectionOptions.Debug flag.

Performance

Please try go test -bench=. for some basic performance tests.

Each performance test will emit output that generally looks like...

------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
spec: {numItems:1000000 keySize:20 valSize:100 batchSize:100 randomLoad:false noCopyValue:false accesses:[]}
     open || time:     0 (ms) |        0 wop/s |        0 wkb/s |        0 rop/s |        0 rkb/s || cumulative:        0 wop/s |        0 wkb/s |        0 rop/s |        0 rkb/s
     load || time:   840 (ms) |  1190476 wop/s |   139508 wkb/s |        0 rop/s |        0 rkb/s || cumulative:  1190476 wop/s |   139508 wkb/s |        0 rop/s |        0 rkb/s
    drain || time:   609 (ms) |        0 wop/s |        0 wkb/s |        0 rop/s |        0 rkb/s || cumulative:   690131 wop/s |    80874 wkb/s |        0 rop/s |        0 rkb/s
    close || time:     0 (ms) |        0 wop/s |        0 wkb/s |        0 rop/s |        0 rkb/s || cumulative:   690131 wop/s |    80874 wkb/s |        0 rop/s |        0 rkb/s
   reopen || time:     0 (ms) |        0 wop/s |        0 wkb/s |        0 rop/s |        0 rkb/s || cumulative:   690131 wop/s |    80874 wkb/s |        0 rop/s |        0 rkb/s
     iter || time:    81 (ms) |        0 wop/s |        0 wkb/s | 12344456 rop/s |  1446616 rkb/s || cumulative:   690131 wop/s |    80874 wkb/s | 12344456 rop/s |  1446616 rkb/s
    close || time:     2 (ms) |        0 wop/s |        0 wkb/s |        0 rop/s |        0 rkb/s || cumulative:   690131 wop/s |    80874 wkb/s | 12344456 rop/s |  1446616 rkb/s
total time: 1532 (ms)
file size: 135 (MB), amplification: 1.133
BenchmarkStore_numItems1M_keySize20_valSize100_batchSize100-8

There are various phases in each test...

  • open - opening a brand new moss storage instance
  • load - time to load N sequential keys
  • drain - additional time after load for persistence to complete
  • close - time to close the moss storage instance
  • reopen - time to reopen the moss storage instance (OS/filesystem caches are still warm)
  • iter - time to sequentially iterate through key-val items
  • access - time to perform various access patterns, like random or sequential reads and writes

The file size measurement is after final compaction, with amplification as a naive calculation to compare overhead against raw key-val size.

Contributing changes

Please see the CONTRIBUTING.md document.

moss's People

Contributors

abhinavdangeti, cb-robot, hisundar, matthiasrms, mschoch, scottlashley, sreekanth-cb, steveyen, tleyden


moss's Issues

optimization idea: copy-on-write, ref-counted segmentStack's

Currently, creating a snapshot copies the segment stack, which means memory allocations and pointer copying. Instead, perhaps creating a snapshot should just bump a ref-count on an existing segmentStack, which would be treated as immutable (except for the ref-count).

Whenever a writer (such as ExecuteBatch, the background Merger, etc) wants to modify the segmentStack, it should use copy-on-write.

The existing moss implementation actually does parts of the above anyways, so we might be near to that already.
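The ref-count plus copy-on-write idea can be sketched as follows. This is a self-contained illustration, not moss's actual code; the segmentStack fields here are stand-ins:

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// segmentStack is an immutable stack of segments plus a ref-count; a
// hypothetical sketch of the copy-on-write, ref-counted proposal.
type segmentStack struct {
	refs     int64
	segments []string // stand-ins for real, immutable segments
}

// snapshot just bumps the ref-count instead of copying the stack.
func (s *segmentStack) snapshot() *segmentStack {
	atomic.AddInt64(&s.refs, 1)
	return s
}

// release drops a reference; when it reaches 0, the segments'
// resources could be reclaimed.
func (s *segmentStack) release() int64 {
	return atomic.AddInt64(&s.refs, -1)
}

// withSegmentPushed is the copy-on-write path for writers (such as
// ExecuteBatch or the background merger): it builds a fresh stack that
// shares the old immutable segments and adds the new one.
func (s *segmentStack) withSegmentPushed(seg string) *segmentStack {
	next := make([]string, 0, len(s.segments)+1)
	next = append(next, s.segments...)
	next = append(next, seg)
	return &segmentStack{refs: 1, segments: next}
}

func main() {
	cur := &segmentStack{refs: 1, segments: []string{"seg0"}}
	snap := cur.snapshot()                // cheap: no copying, just a ref bump
	cur2 := cur.withSegmentPushed("seg1") // writer path: copy-on-write
	fmt.Println(snap.segments, cur2.segments)
	snap.release()
}
```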

race to complete data write

It appears that store.Persist() (and store.Close()) return before persistence to disk has actually completed.

This creates a race and causes a simple write-then-read test (https://github.com/glycerine/bulletin/blob/master/bulletin_test.go#L10) to fail, unless a print is inserted to slow down the test code.

to reproduce: clone https://github.com/glycerine/bulletin and run go test -v


go test -v 
=== RUN   Test001BulletinPersistanceToDisk

  read/write to persistent disk should work ✘


Failures:

  * /Users/jaten/go/src/github.com/glycerine/bulletin/bulletin_test.go
  Line 39:
  Expected: '[]byte{0x61, 0x20, 0x76, 0x61, 0x6c, 0x75, 0x65}'
  Actual:   '[]byte(nil)'
  (Should resemble)!


1 assertion thus far

--- FAIL: Test001BulletinPersistanceToDisk (0.00s)
FAIL
exit status 1
FAIL    github.com/glycerine/bulletin   0.014s

Commenting in the line https://github.com/glycerine/bulletin/blob/master/bulletin.go#L58 slows down the test, causing it to succeed on my MacBook. Hence the code is timing-sensitive; it should not be. I expected that once Persist() or Close() completed, the data was safely on disk. However, inspection of the disk reveals no files written before process shutdown. Hence there is background async processing going on that should be synchronous with the completion of Persist() and/or Close().

Q on write amplification

@mschoch I enjoyed your talk on Moss at GopherCon. At the end you pointed out a situation (I couldn't discern exactly when) where a small write led to a lot of copying/write-amplification.

I just wanted to inquire if that issue had been addressed?

rework moss Segment interface to be more generic

From discussion with @mschoch and @sreekanth-cb, thoughts on changes to Segment interface to make it more generic, and not so specific to an array based implementation...

type Segment interface {
+ // DRAFT / NEW method...
+ Get(key []byte) (operation uint64, val []byte, error)
+
+ // And also remove FindKeyPos().
+
+ // DRAFT - replace FindStartKeyInclusivePos() with
+ // some opaque cursor / handle based approach...
+ type Cursor interface {
+         Current() (operation uint64, key []byte, val []byte, error)
+         Next() error
+ }
+
+ FindStartKeyInclusiveCursor(startKeyInclusive, endKeyExclusive []byte) (Cursor, error)

leveled compaction only considers top/root collection

The leveled compaction policy only looks at the root collection when making its decision in compactMaybe(). It may need to recursively examine the child collections as well.

One thought is if any collection in the collection hierarchy needs a full compaction, then that's the overriding priority.

See also... #40

Iterator only returning one record

I'm using this code to store and retrieve records from moss.

But when I call GetStoredSQSMessages() it only seems to return the last entry, as opposed to all the entries.

If I run strings data-0000000000000001.moss I can see all the records I'm expecting, so I know they're somewhere in the moss file, but I just can't get at them with the iterator.

Can you take a look at my GetStoredSQSMessages method and see if I'm doing anything wrong?

If nothing is obvious, should I try repro'ing this in a unit test? I'm storing the moss in a docker volume mount, so it's possible I'm doing something funny (but like I said I can see all the records with strings, so it seems to be an iterator problem)

is a large pwrite() more efficient than many concurrent smaller pwrite()'s?

In moss, there might be a large segment (many megabytes) that needs to be written or appended to the end of the file.

Is it more efficient to call pwrite() once with the entire segment's large data buffer, where perhaps the OS/filesystem knows how to efficiently split that up into concurrent I/O channel use...

...or, should moss split the big buffer into multiple, smaller, block-aligned sections, where moss can spawn off multiple goroutines to invoke several pwrite()'s concurrently?

Which one is faster / utilizes more I/O bandwidth or channels?

Q: Transaction performance

Hi,
this project looks very interesting. I have a lot of transactions that each perform a single write, instead of multiple writes per transaction.

How many such transactional writes of small data can moss handle per second on an average SSD? For example, with BoltDB it was only about 250, so I wonder if this project can perform better or if it is also limited by the file system.

add optional CompactionCallback

See also the CompactionFilter feature from rocksdb.

Some use cases to consider...

  • the compaction callback (provided by the app) might want an entry to disappear -- for example, to implement expirations

  • the compaction callback might want to change the value associated with a key, perhaps to compress or cleanup the data within a particular value

  • the compaction callback might want to atomically create/update/delete a whole slew of other keys -- for example, if I expire the entry for "user-session:0321", I also want to delete other keys with the same prefix, like "user-session:0321-gestures" and "user-session:0321-stats". This might be an advanced, separate functionality that deserves a different ticket/issue.

TestStoreCompaction, TestStoreCompactionDeferredSort sometimes fails

using commit 9fdd764, sometimes seeing go test failures

$ go test
--- FAIL: TestStoreCompaction (0.04s)
store_test.go:747: expected only 1 file, got: 2, fileNames; [data-000000000000001d.moss (33017) data-000000000000001e.moss (37113)]
store_test.go:758: expected seq: 29 to equal numBatches: 30
store_test.go:778: expected nextFNameseq to be seq+1
--- FAIL: TestStoreCompactionDeferredSort (0.04s)
store_test.go:747: expected only 1 file, got: 2, fileNames; [data-000000000000001d.moss (33017) data-000000000000001e.moss (37113)]
store_test.go:758: expected seq: 29 to equal numBatches: 30
store_test.go:778: expected nextFNameseq to be seq+1
FAIL
exit status 1
FAIL github.com/couchbase/moss 5.517s

consider using sub-tests?

So, if we're willing to support only Go 1.7+, we should consider using sub-tests.

Here is an example failure you get today:

--- FAIL: TestIteratorSeekTo (0.00s)
	iterator_test.go:439: expected done, got: <nil>
	iterator_test.go:439: expected done, got: <nil>
	iterator_test.go:439: expected done, got: <nil>
	iterator_test.go:439: expected done, got: <nil>
	iterator_test.go:439: expected done, got: <nil>
	iterator_test.go:439: expected done, got: <nil>
	iterator_test.go:439: expected done, got: <nil>
	iterator_test.go:439: expected done, got: <nil>

But, when I look inside, there are several variations of seekTo being tested, in my case maybe they're all failing, but the point is you can't easily tell which ones are.

I started using the sub-test capability in vellum and it worked pretty well. Here is a table-style test that uses it:

https://github.com/couchbaselabs/vellum/blob/master/builder_test.go#L26-L109

I include a description field (as opposed to just a comment) explaining what each table entry is testing. Then I use that description when running the sub-test, so that if it fails, it prints out the details of which case failed. And of course the description need not be a static string either; you could fmt.Sprintf() a parameterized version of the case.

As for the output, if you run the one I mentioned above with -v you get:

--- PASS: TestCommonPrefixLen (0.00s)
    --- PASS: TestCommonPrefixLen/both_slices_nil (0.00s)
    --- PASS: TestCommonPrefixLen/slice_a_nil,_slice_b_not (0.00s)
    --- PASS: TestCommonPrefixLen/slice_b_nil,_slice_a_not (0.00s)
    --- PASS: TestCommonPrefixLen/both_slices_empty (0.00s)
    --- PASS: TestCommonPrefixLen/slice_a_empty,_slice_b_not (0.00s)
    --- PASS: TestCommonPrefixLen/slice_b_nil,_slice_a_not#01 (0.00s)
    --- PASS: TestCommonPrefixLen/slices_a_and_b_the_same (0.00s)
    --- PASS: TestCommonPrefixLen/slice_a_substring_of_b (0.00s)
    --- PASS: TestCommonPrefixLen/slice_b_substring_of_a (0.00s)
    --- PASS: TestCommonPrefixLen/slice_a_starts_with_prefix_of_b (0.00s)
    --- PASS: TestCommonPrefixLen/slice_b_starts_with_prefix_of_a (0.00s)

And if I intentionally break one of them, you can see how the error output is more useful:

--- FAIL: TestCommonPrefixLen (0.00s)
    --- FAIL: TestCommonPrefixLen/both_slices_nil (0.00s)
    	builder_test.go:105: wanted: 1, got: 0

Now I can go right to the both_slices_nil case to start debugging.

Also, the cmd-line lets you filter by these names as well, so if I wanted to rerun just a single case that I'm working on, I can do:

go test -run "CommonPrefix/both"
--- FAIL: TestCommonPrefixLen (0.00s)
    --- FAIL: TestCommonPrefixLen/both_slices_nil (0.00s)
    	builder_test.go:105: wanted: 1, got: 0
FAIL
exit status 1
FAIL	github.com/couchbaselabs/vellum	0.007s

Lots more details in the blog: https://blog.golang.org/subtests

If you're OK with it, I could start migrating some of them over time.

Leveled compaction support for heavy DGM

The Data Greater than Memory (DGM) performance of moss store's Log-Structured Merge Arrays is limited by the speed of the merge operation during compaction.
Currently this compaction is done in a single level which comes from the default setting of 0 as the compaction threshold. This can result in heavy write amplification.
Based on discussion with @steveyen, to mitigate this situation, moss store persistence can follow this simple approach:
maxSmallSegments=3, maxBigSegments=2
Initially, maxSmallSegments=0, maxBigSegments=0
Persistence appends small segments to end of the file...
|-seg0-||-seg1-||-seg2-| (maxSmallSegments=3)
On the next round of persistence, the above 3 segments can be compacted into a new file
|====seg0===| (maxSmallSegments=0, maxBigSegments=1)
Following this further persistence rounds simply append smaller segments
|====seg0====||-seg0-||-seg1-||-seg2-|
Now the next round of persistence, only compacts the small segments making the file look as follows..
|====seg0====||...seg0...||...seg1...||...seg2...||=====seg1=====|
The rationale behind this is that compacting fewer segments would be faster than constantly rewriting the whole file on every delta.

Later to support efficient persistence to disk, we can adopt a simple size-tiered leveled compaction support in mossStore by splitting the Footer across multiple levels:
data-L0-0000xx.moss: most recent segments.
data-L1-0000xx.moss: segments merged from L0
data-L2-0000xx.moss: segments merged from L1
We can then size tier these on the levels to achieve good tradeoff between space, read and write efficiencies.

stats update failing data race detection

Here...

$ go test -v -race ./...

  github.com/couchbase/moss.(*collection).updateStats()
      /Users/steveyen/go/src/github.com/couchbase/moss/collection.go:427 +0x313

Previous write at 0x00c4204a4000 by goroutine 26:
  github.com/couchbase/moss.(*segment).Swap()
      /Users/steveyen/go/src/github.com/couchbase/moss/segment.go:274 +0x176

Trying to close an unstarted collection results in deadlock

To repro:

func TestCloseUnstartedCollection(t *testing.T) {

	m, _ := NewCollection(CollectionOptions{})
	m.Close()

}

Results in:

 go test -v -run TestCloseUnstartedCollection
=== RUN   TestCloseUnstartedCollection
fatal error: all goroutines are asleep - deadlock!

How hard would it be to make Close() a no-op if Start() had never been called?

Btw in the README Example I don't see Start() being called, so that code might deadlock too.

consider allowing duplicate keys in a batch

The current restriction which disallows duplicate keys in a batch is potentially problematic for applications.

The reasons for this restriction are:

  • segments are sorted, and the sort is not stable, thus the notion of "initial insert order" is lost
  • subsequently, a binary search may land on any of the duplicates, yielding undefined behavior

Ideas:

Add new collection option (possibly the default) which does the following:

  • use a stable sort
  • additional step to scan/find duplicates
    -- choices here are to either remove (possibly needing to reslice large data)
    -- or to introduce a new NOOP operation, which hides these duplicates

Further, if we introduce these NOOPs, the binary-search code might also have to honor them by looking forward/backward.

Also, some segment persisters could also choose to elide these NOOP operations.
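The stable-sort-plus-NOOP idea above could look roughly like this. A hypothetical sketch, not moss code; the op struct and field names are illustrative:

```go
package main

import (
	"fmt"
	"sort"
)

// op is a single mutation in a batch; noop marks an entry that a later
// mutation on the same key has shadowed.
type op struct {
	key  string
	val  string
	noop bool
}

// collapseDuplicates stable-sorts a batch by key (preserving insert
// order among equal keys) and marks all but the last duplicate as NOOP,
// so binary search can skip them without reslicing the data.
func collapseDuplicates(batch []op) []op {
	sort.SliceStable(batch, func(i, j int) bool {
		return batch[i].key < batch[j].key
	})
	for i := 0; i+1 < len(batch); i++ {
		if batch[i].key == batch[i+1].key {
			batch[i].noop = true // an earlier duplicate: hidden from readers
		}
	}
	return batch
}

func main() {
	batch := []op{
		{key: "x", val: "foo"},
		{key: "y", val: "baz"},
		{key: "x", val: "bar"}, // later write to "x" should win
	}
	for _, o := range collapseDuplicates(batch) {
		fmt.Println(o.key, o.val, o.noop)
	}
	// only x=bar and y=baz remain visible; x=foo becomes a NOOP
}
```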

partial compaction of index/kvs only.

In the segment.kvs array, there's 16^H^H 8 bits that are currently unused / RESERVED for the future, out of the 128 bits for each entry... https://github.com/couchbase/moss/blob/master/segment.go#L149

Maybe we can use those 16^H^H 8 bits somehow. One thought is as moss is appending dirty segments to the end of the file, perhaps it might opportunistically sometimes decide to compact some of the kvs arrays of the segment stack into a big kvs array. That is, leave all the old, already persisted buf's as-is/untouched, but the latest "compacted kvs" would use (some of?) the 16 bits to refer to the old buf that each entry points to.

The thinking is instead of performing binary search on the multiple kvs arrays down through the segment stack, there'd be only a single kvs array to binary search.

(Update: 16 -> 8 bits (thanks marty!); also if 8 bits ain't enough, then it's possible that keys and vals at runtime might not be near the max-keyLen and max-valLen limits, so that might be another place to get bits; and another place might be the offset into the buf)

Buckets?

Hi, from the description it is not obvious to me whether moss supports buckets (like bolt).

There is the

child collections allow multiple related collections to be atomically grouped

which I'm not exactly sure if it is something like buckets or it is just a bunch of selected records manually put together?

multi-directory, storage hierarchy, ephemeral awareness

This is probably too big for one ticket/issue, but...

Many systems have multiple I/O devices. And, today's moss only writes to a single directory, so it isn't able to leverage multi-device I/O concurrency/bandwidth.

Some of those devices will be faster than others (SSD vs HDD).

And, some might be ephemeral (ex: EC2 ephemeral storage) versus long-lived (ex: EBS).

It might be debatable whether a "sufficiently advanced moss of the future" should be aware of such things, or whether such concerns should instead remain (like today) as just the application's responsibility to shard their moss usage on their own and handle the movement of data amongst storage hierarchies on their own.

TestOpenStoreCollection fail, store_test.go:1035

Saw this one time only so far (sporadic)...

=== RUN TestOpenStoreCollection
--- FAIL: TestOpenStoreCollection (0.21s)
store_test.go:1035: expected reopen store to work
panic: runtime error: invalid memory address or nil pointer dereference [recovered]
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x50 pc=0x1180dfd]

goroutine 6028 [running]:
testing.tRunner.func1(0xc4200709c0)
/usr/local/go/src/testing/testing.go:622 +0x29d
panic(0x11cb760, 0x12f0710)
/usr/local/go/src/runtime/panic.go:489 +0x2cf
_/Users/steveyen/dev/couchbase-server.spock/godeps/src/github.com/couchbase/moss.TestOpenStoreCollection(0xc4200709c0)
/Users/steveyen/dev/couchbase-server.spock/godeps/src/github.com/couchbase/moss/store_test.go:1038 +0xebd
testing.tRunner(0xc4200709c0, 0x120e038)
/usr/local/go/src/testing/testing.go:657 +0x96
created by testing.(*T).Run
/usr/local/go/src/testing/testing.go:697 +0x2ca
exit status 2
FAIL _/Users/steveyen/dev/couchbase-server.spock/godeps/src/github.com/couchbase/moss 18.562s

Calling Start() on a StoreCollection gives "panic: close of a closed channel"

To repro:

Add a call to m.Start() in the TestOpenStoreCollection test:

func TestOpenStoreCollection(t *testing.T) {
	tmpDir, _ := ioutil.TempDir("", "mossStore")
	defer os.RemoveAll(tmpDir)

	var mu sync.Mutex
	counts := map[EventKind]int{}
	eventWaiters := map[EventKind]chan bool{}

	co := CollectionOptions{
		MergeOperator: &MergeOperatorStringAppend{Sep: ":"},
		OnEvent: func(event Event) {
			mu.Lock()
			counts[event.Kind]++
			eventWaiter := eventWaiters[event.Kind]
			mu.Unlock()
			if eventWaiter != nil {
				eventWaiter <- true
			}
		},
	}

	store, m, err := OpenStoreCollection(tmpDir, StoreOptions{
		CollectionOptions: co,
	}, StorePersistOptions{})
	if err != nil || m == nil || store == nil {
		t.Errorf("expected open empty store collection to work")
	}
	m.Start()   // <--------- add this line

        .... etc ...

Will give error:

 go test -v -run TestOpenStoreCollection
=== RUN   TestOpenStoreCollection
panic: close of closed channel

It seems like a confusing interface to have to call Start() on non-persistent collections, but you can't call Start() on persisted collections. Is this intentional, or should calling Start() work on persisted collections?

Build failures on non-amd64 architectures

I noticed that tests fail on non-amd64 architectures. Failure logs are available on Debian's reproducible build project [1]. These failure logs were produced using rev:8ea508f. I've also reproduced this same behavior on rev:61afce4.

--- PASS: TestCollectionStatsClose (0.00s)
=== RUN   TestCollectionGet
--- PASS: TestCollectionGet (0.00s)
=== RUN   TestMossDGM
--- FAIL: TestMossDGM (0.00s)
panic: runtime error: invalid memory address or nil pointer dereference [recovered]
        panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0x804931b]

goroutine 1022 [running]:
testing.tRunner.func1(0x1881aa00)
        /usr/lib/go-1.8/src/testing/testing.go:622 +0x226
panic(0x81d40c0, 0x82c2248)
        /usr/lib/go-1.8/src/runtime/panic.go:489 +0x22a
sync/atomic.LoadUint64(0x186f42bc, 0x0, 0x0)
        /usr/lib/go-1.8/src/sync/atomic/asm_386.s:159 +0xb
github.com/couchbase/moss.(*dgmTest).getDGMStats(0x186f4240, 0x23d, 0x0, 0x82c6080)
        /build/golang-github-couchbase-moss-0.0~git20170828.0.61afce4/obj-i686-linux-gnu/src/github.com/couchbase/moss/dgm_moss_test.go:469 +0x88
github.com/couchbase/moss.TestMossDGM(0x1881aa00)
        /build/golang-github-couchbase-moss-0.0~git20170828.0.61afce4/obj-i686-linux-gnu/src/github.com/couchbase/moss/dgm_moss_test.go:887 +0x2f9
testing.tRunner(0x1881aa00, 0x8204518)
        /usr/lib/go-1.8/src/testing/testing.go:657 +0x7e
created by testing.(*T).Run
        /usr/lib/go-1.8/src/testing/testing.go:697 +0x242
exit status 2
FAIL    github.com/couchbase/moss       0.032s

[1] https://tests.reproducible-builds.org/debian/rb-pkg/buster/i386/golang-github-couchbase-moss.html

occasionally sync during compaction

Occasionally fsync()'ing during compaction (e.g., after every 16MB written) might have a performance impact -- see if an optional parameter for this might help moss.

add checksums

@hisundar had an idea / discussion on checksums...

Every 4K written in either the kvs array or the buf array as part of segment can be checksummed. The checksums can be stored in a new system child collection.

A new ReadOption might allow the user to ask for an optional checksum check (or to skip checksum checking) as the segment is accessed. We know mmap will page-fault in entire pages (like 4K), so after the page fault, accessing 4K to double-check the checksum should be cheap (in RAM, though not necessarily in cache).
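A per-page checksum scheme along these lines might be sketched with hash/crc32. This is illustrative only; moss might store and key checksums differently (e.g., in a system child collection):

```go
package main

import (
	"fmt"
	"hash/crc32"
)

const pageSize = 4096 // matches the mmap page-fault granularity

// checksumPages computes one CRC32 per 4K page of a segment buffer.
func checksumPages(buf []byte) []uint32 {
	var sums []uint32
	for off := 0; off < len(buf); off += pageSize {
		end := off + pageSize
		if end > len(buf) {
			end = len(buf)
		}
		sums = append(sums, crc32.ChecksumIEEE(buf[off:end]))
	}
	return sums
}

// verifyPage re-checksums just the single page touched by a read, which
// is cheap since mmap has already faulted the whole page into RAM.
func verifyPage(buf []byte, sums []uint32, off int) bool {
	page := off / pageSize
	end := (page + 1) * pageSize
	if end > len(buf) {
		end = len(buf)
	}
	return crc32.ChecksumIEEE(buf[page*pageSize:end]) == sums[page]
}

func main() {
	buf := make([]byte, 3*pageSize+100) // a stand-in segment buffer
	sums := checksumPages(buf)
	fmt.Println(len(sums), verifyPage(buf, sums, 2*pageSize+5)) // prints 4 true
	buf[5] ^= 0xff                                              // corrupt the first page
	fmt.Println(verifyPage(buf, sums, 0))                       // prints false
}
```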

research madvise() syscalls to avoid mmap() stalls

mmap() is terrific, leading to a simple moss implementation.

But, due to the magic of page-faulting in data from disk from mmap(), which is an operation that's invisible to the app, the app thread may suddenly become blocked... and the golang scheduler has no idea it's happening and cannot arrange for the app thread to be used for other needs.

Perhaps various OS APIs like madvise() and siblings can help manage this situation in a more controlled fashion and reduce page faults.

Add IncrementalPersistence method to Segment interface & decouple from compaction

Currently moss store compactions only work for the Basic segment kind since it uses a custom bufferedSectionWriter. Adding an incremental persistence method to the Segment interface will help decouple the compaction logic from the actual implementation of the Segment interface.
This can make moss much more extensible with custom segment types that support unique application needs, such as btrees for faster indexing, compressed segments for smaller file sizes, finite state transducers, bloom filters, etc.

optimization - same length keys

if moss can somehow detect for a batch or segment that all the keys are exactly the same length, then one optimization might be to compress the kvs array, since there would be no need to store the key length per entry

optimize: file writes should be block-sized

On both the regular persistence pathway and the compaction pathway, moss performs writes block-aligned with the starting byte of the to-be-written buffer. But, what about the end of the buffer?
Need to check/review the codepaths to see if moss is writing out block-sized buffers. If not, the filesystem/os might need to perform a wasteful read-modify-write of the last block.
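Padding the final partial block is one way to avoid that read-modify-write; a small sketch of the rounding arithmetic, assuming a 4KB block size (illustrative, not moss's actual persistence code):

```go
package main

import "fmt"

const blockSize = 4096 // assumed filesystem block size

// roundUpToBlock returns n rounded up to the next multiple of the
// block size, so a write can be zero-padded to a block boundary
// instead of forcing the OS to read-modify-write the last block.
func roundUpToBlock(n int) int {
	return ((n + blockSize - 1) / blockSize) * blockSize
}

func main() {
	buf := make([]byte, 10000) // segment bytes to persist
	padded := make([]byte, roundUpToBlock(len(buf)))
	copy(padded, buf) // trailing pad bytes stay zero
	fmt.Println(len(buf), "->", len(padded)) // prints 10000 -> 12288
}
```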

add a non-snapshot Collection.Get(key, ReaderOptions) API

If a user just wants to lookup a single item in a collection, they have to first create a snapshot, then snapshot.Get(), then snapshot.Close().

One issue is that creating a snapshot means memory allocations (for the snapshot instance and a copy of the segment stack).

A "onesie" API to just look up a single item, if it can be implemented efficiently (without undue memory allocations and without having to hold a lock for a long time), would be more convenient for folks to grok.

(See also #14)

consider making all options arguments pointer and support nil as default

Most of the options structs are small, but they must be copied on each invocation, as opposed to simply copying the pointer.

Most applications will be fine creating one set of options and reusing it.

Passing nil, and having the library interpret it as "default" options, is a common pattern in Go.

support for multiple collections

Users can currently fake out multiple collections by explicitly adding a short collection name prefix to each of their keys. However, such a trick is suboptimal as it repeats a prefix for every key-val item.

Instead, the proposal is to support multiple collections natively in moss by introducing a few additional methods to the API, so that moss remains backwards compatible for current users.

The idea is that the current Collection in moss now logically becomes a "top-most" collection of an optional hierarchy of child collections.

To the Batch interface, the proposal is to add the methods...

NewChildCollectionBatch(childCollectionName string, hints) (Batch, error)

DelChildCollection(childCollectionName string) error

When a batch is executed, the entire hierarchy of a top-level batch and its batches for any child collections will be committed atomically.

Removing a child collection takes precedence over adding more child collection mutations.

To the Snapshot interface, the proposal is to add the methods...

ChildCollectionNames() ([]string, error)

ChildCollectionSnapshot(childCollectionName string) (Snapshot, error)

And, that's it.

The proposed API allows for deeply nested child collections of child collections, but the initial implementation might just return an error if the user tries to have deep nesting.

Panic on Get

I was trying to use moss as a backend store for Dgraph (https://github.com/dgraph-io/dgraph/tree/try/moss), but I faced this issue:

panic: runtime error: slice bounds out of range

goroutine 348 [running]:
github.com/couchbase/moss.(*segment).FindStartKeyInclusivePos(0xc4201ee000, 0xc42a1cb7e0, 0x16, 0x16, 0x16)
	/home/ashwin/go/src/github.com/couchbase/moss/segment.go:313 +0x19b
github.com/couchbase/moss.(*segmentStack).get(0xc4201cac30, 0xc42a1cb7e0, 0x16, 0x16, 0x1e, 0x0, 0x7f7f00, 0xc42006efc0, 0x1f, 0x2a, ...)
	/home/ashwin/go/src/github.com/couchbase/moss/segment_stack.go:90 +0x26b
github.com/couchbase/moss.(*segmentStack).Get(0xc4201cac30, 0xc42a1cb7e0, 0x16, 0x16, 0xc4201cac00, 0x6, 0x6, 0x6, 0x0, 0x6)
	/home/ashwin/go/src/github.com/couchbase/moss/segment_stack.go:74 +0x75
github.com/couchbase/moss.(*Footer).Get(0xc42006efc0, 0xc42a1cb7e0, 0x16, 0x16, 0x465600, 0x1, 0x6, 0xc424e85910, 0x465182, 0xfc7740)
	/home/ashwin/go/src/github.com/couchbase/moss/store_footer.go:426 +0x8a
github.com/couchbase/moss.(*snapshotWrapper).Get(0xc420192fc0, 0xc42a1cb7e0, 0x16, 0x16, 0x0, 0x0, 0xc4200928f0, 0x6, 0x0, 0xc424e85948)
	/home/ashwin/go/src/github.com/couchbase/moss/wrap.go:94 +0x62
github.com/couchbase/moss.(*segmentStack).get(0xc42a1f0d20, 0xc42a1cb7e0, 0x16, 0x16, 0xffffffffffffffff, 0x0, 0xc424e85a00, 0x1, 0x1, 0x6, ...)
	/home/ashwin/go/src/github.com/couchbase/moss/segment_stack.go:110 +0xa2
github.com/couchbase/moss.(*segmentStack).Get(0xc42a1f0d20, 0xc42a1cb7e0, 0x16, 0x16, 0x0, 0x0, 0x414ea2, 0xc42a1f0cd0, 0x50, 0x48)
	/home/ashwin/go/src/github.com/couchbase/moss/segment_stack.go:74 +0x75
github.com/dgraph-io/dgraph/posting.(*List).getPostingList(0xc42a1e9e00, 0x0, 0x0)
	/home/ashwin/go/src/github.com/dgraph-io/dgraph/posting/list.go:190 +0x1ef
github.com/dgraph-io/dgraph/posting.(*List).updateMutationLayer(0xc42a1e9e00, 0xc42a1f0cd0, 0x0)
	/home/ashwin/go/src/github.com/dgraph-io/dgraph/posting/list.go:263 +0x125
github.com/dgraph-io/dgraph/posting.(*List).addMutation(0xc42a1e9e00, 0xfdec00, 0xc425576e40, 0xc42226c660, 0x5, 0x5, 0xb090c8)
	/home/ashwin/go/src/github.com/dgraph-io/dgraph/posting/list.go:340 +0xd7
github.com/dgraph-io/dgraph/posting.(*List).AddMutationWithIndex(0xc42a1e9e00, 0xfdec00, 0xc425576e40, 0xc42226c660, 0x0, 0x0)
	/home/ashwin/go/src/github.com/dgraph-io/dgraph/posting/index.go:171 +0x2da
github.com/dgraph-io/dgraph/worker.runMutations(0x7f62912c2280, 0xc425576e40, 0xc4221e0000, 0x3e8, 0x400, 0x0, 0x0)
	/home/ashwin/go/src/github.com/dgraph-io/dgraph/worker/mutation.go:50 +0x21a
github.com/dgraph-io/dgraph/worker.(*node).processMutation(0xc42008c000, 0x2, 0x114, 0x0, 0xc420f4c000, 0x92c5, 0xa000, 0x0, 0x0, 0x0, ...)
	/home/ashwin/go/src/github.com/dgraph-io/dgraph/worker/draft.go:372 +0x13e
github.com/dgraph-io/dgraph/worker.(*node).process(0xc42008c000, 0x2, 0x114, 0x0, 0xc420f4c000, 0x92c5, 0xa000, 0x0, 0x0, 0x0, ...)
	/home/ashwin/go/src/github.com/dgraph-io/dgraph/worker/draft.go:405 +0x36a
created by github.com/dgraph-io/dgraph/worker.(*node).processApplyCh
	/home/ashwin/go/src/github.com/dgraph-io/dgraph/worker/draft.go:444 +0x49b

Any help would be appreciated.
Thanks!

add IteratorOption to hint on how much to favor cache stability (whether it's ok to blow the cache)

Currently, running an iterator through all the keys can blow the hot working set out of the OS page cache, potentially ejecting the hot, cached entries that the app might need.

On the other hand, perhaps this is the behavior that the app wanted, where iterating through all keys will happen to warm up entries into the OS page cache.

The idea is to add an optional IteratorOption that allows the user to have a little bit more control, as a hint to the iterator to avoid blowing caches if possible.

For example, perhaps the compactor could use such an option, as although it needs to iterate through all the active data, it's copying them to a new file. So old pages from the soon-to-be-deleted old file really don't need to stick around in cache.

add API for synchronous persistence

Persistence is currently handled asynchronously by the background persister goroutine, and if folks want to ensure synchronous persistence, they have to go through backflip dance steps, like hooking up an event callback as seen in various unit tests.

Betcha some folks will just want a simple synchronous persistence API.
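The event-callback backflip amounts to blocking a caller until the persister signals progress. A generic sketch of that waiting pattern (standalone; it does not use moss's actual Event type, and a real version would also check collection stats to confirm the dirty data reached zero):

```go
package main

import (
	"fmt"
	"sync"
)

// persistWaiter lets a caller block until an event callback fires,
// mimicking how unit tests wait for persister progress today.
type persistWaiter struct {
	mu   sync.Mutex
	cond *sync.Cond
	done bool
}

func newPersistWaiter() *persistWaiter {
	w := &persistWaiter{}
	w.cond = sync.NewCond(&w.mu)
	return w
}

// OnEvent would be registered as the collection's event callback and
// invoked by the background persister goroutine.
func (w *persistWaiter) OnEvent() {
	w.mu.Lock()
	w.done = true
	w.cond.Broadcast()
	w.mu.Unlock()
}

// Wait blocks until the persister callback has fired.
func (w *persistWaiter) Wait() {
	w.mu.Lock()
	for !w.done {
		w.cond.Wait()
	}
	w.mu.Unlock()
}

func main() {
	w := newPersistWaiter()
	go w.OnEvent() // stands in for the background persister goroutine
	w.Wait()
	fmt.Println("persisted")
}
```

A first-class synchronous API could hide this machinery behind a single blocking call.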
