spacejam / sled

the champagne of beta embedded databases

License: Apache License 2.0

Rust 98.85% Shell 0.55% Python 0.60%
incredibly-spicy database embedded-kv concurrent lock-free formal-methods high-performance rust kv b-tree

sled's Issues

implement mvp benchmark calls to db

take the arguments, and implement these operations:

  1. spin up the specified number of threads
  2. execute the desired number of total operations
  3. roughly maintain the specified proportion of reads, writes, cas, del, and scan operations among the operations
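
A minimal sketch of those three steps, assuming a hypothetical `Mix` struct for the operation proportions and leaving the actual db calls as comments:

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Arc;
use std::thread;

// hypothetical operation mix; fractions should sum to at most 1.0,
// with the remainder going to scans
struct Mix {
    read: f64,
    write: f64,
    cas: f64,
    del: f64,
}

fn run(threads: usize, total_ops: usize, mix: Mix) {
    let mix = Arc::new(mix);
    let dispensed = Arc::new(AtomicUsize::new(0));
    let handles: Vec<_> = (0..threads)
        .map(|_| {
            let (mix, dispensed) = (Arc::clone(&mix), Arc::clone(&dispensed));
            thread::spawn(move || loop {
                // claim the next operation slot; stop once the budget is spent
                let n = dispensed.fetch_add(1, Ordering::Relaxed);
                if n >= total_ops {
                    return;
                }
                // hash the slot number into [0, 1) so the mix stays roughly
                // proportional without sharing an RNG across threads
                let r = (n.wrapping_mul(2654435761) % 1000) as f64 / 1000.0;
                if r < mix.read {
                    // db.get(..)
                } else if r < mix.read + mix.write {
                    // db.set(..)
                } else if r < mix.read + mix.write + mix.cas {
                    // db.cas(..)
                } else if r < mix.read + mix.write + mix.cas + mix.del {
                    // db.del(..)
                } else {
                    // db.scan(..)
                }
            })
        })
        .collect();
    for h in handles {
        h.join().unwrap();
    }
}
```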

modularize core components

Right now some of the core components, such as tree and log, are large files that each contain several structs, enums, and their corresponding methods and functions.

Those components should be modularized so that a clear file hierarchy is present which makes it easier to understand and reason about the code, its different components and their relationships.

@spacejam already started the process for the Log component in https://github.com/spacejam/rsdb/pull/27.

add child split fragment

this was skipped to simplify the in-memory prototype, but we need to use a frag so that we can linearize key updates with splits

create Epoch structure

should have similar basic structure as Arc:

  1. shared header, represented as AtomicUsize
  2. enroll on clone
  3. un-enroll + re-enroll + merge GC + un-enroll on drop
  4. associated lock-free stack of GC enums for different types of stuff
  5. associated thread-local GC state that is merged after initial un-enroll
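
A rough shape for this, with the GC stack and thread-local state left as comments since their types aren't pinned down yet:

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Arc;

// sketch only: the shared header counts enrolled participants, and the
// lock-free GC stack described above would live alongside it
struct EpochInner {
    header: AtomicUsize,
    // gc_stack: Stack<Garbage>, // lock-free stack of GC enums
}

struct Epoch {
    inner: Arc<EpochInner>,
    // local: Vec<Garbage>, // thread-local GC state, merged on drop
}

impl Clone for Epoch {
    fn clone(&self) -> Epoch {
        // enroll on clone, like Arc bumping its refcount
        self.inner.header.fetch_add(1, Ordering::SeqCst);
        Epoch { inner: self.inner.clone() }
    }
}

impl Drop for Epoch {
    fn drop(&mut self) {
        // un-enroll; the last participant would merge its thread-local
        // GC state and trigger collection here before re-enrolling
        if self.inner.header.fetch_sub(1, Ordering::SeqCst) == 1 {
            // merge GC + collect
        }
    }
}
```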

zstd dictionary support

the use of a dictionary dramatically improves zstd compression and decompression performance.
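
A sketch of what this could look like with the zstd crate's dictionary APIs (`dict::from_samples`, `bulk::Compressor::with_dictionary`); the exact names and signatures vary across versions of that crate:

```rust
fn roundtrip() -> std::io::Result<()> {
    // train a dictionary on representative small entries; training
    // needs a reasonably large and varied sample set to succeed
    let samples: Vec<Vec<u8>> = (0..1000)
        .map(|i| format!("key{:04}=value{:04}", i, i).into_bytes())
        .collect();
    let dict = zstd::dict::from_samples(&samples, 16 * 1024)?;

    // compress each entry against the shared dictionary
    let mut comp = zstd::bulk::Compressor::with_dictionary(3, &dict)?;
    let compressed = comp.compress(b"key0001=value0001")?;

    // decompression needs the same dictionary
    let mut decomp = zstd::bulk::Decompressor::with_dictionary(&dict)?;
    let original = decomp.decompress(&compressed, 1024)?;
    assert_eq!(&original[..], &b"key0001=value0001"[..]);
    Ok(())
}
```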

collect statistics on each operation in the benchmark tool

when executing operations, we want to know about:

  1. throughput (requests / s)
  2. latency (time per request)

For latency, we care about the latency distribution, which means we need to collect percentiles. This is actually a pretty interesting problem, since almost all existing solutions REALLY suck and use probabilistic approaches that destroy accuracy. One approach that I like is:

  • create a Vec of AtomicUsize that is pretty big.
  • create a function that maps from a measured value to a "slot number"
  • each slot number corresponds to values that are ~1% larger than the ones in the previous one
  • when measuring something, like latency, convert the latency to the slot number using the function, then look up the AtomicUsize and fetch_add 1 to it.
  • when done with all measurements, you can scan over the counts to determine any percentile. it's good to know, at a minimum, 0th (fastest), 50th (median), 75th, 90th, 95th, 97.5th, 99th, 99.9th, 99.99th, 99.999th, and 100th (max).

here's an example of a function that converts from a number to a slot number, and back: https://github.com/spacejam/loghisto/blob/master/metrics.go#L316
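
A Rust equivalent of that conversion could look like this: `value_to_slot` is just a logarithm base 1.01 (since each slot covers values ~1% larger than the previous one), and `slot_to_value` inverts it:

```rust
// each slot covers values ~1% larger than the previous one, so the
// forward conversion is log base 1.01 of the measured value
fn value_to_slot(value: f64) -> usize {
    (value.max(1.0).ln() / 1.01f64.ln()).round() as usize
}

fn slot_to_value(slot: usize) -> f64 {
    1.01f64.powi(slot as i32)
}

// recording a measurement against a Vec<AtomicUsize> named `counts`:
//   counts[value_to_slot(latency_ns as f64)].fetch_add(1, Ordering::Relaxed);
```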

For throughput, just atomically increment a counter for each of the ops, and have a thread spit out the values every second, resetting them to 0. The latency percentiles can be for the duration of the entire benchmark, for now.

epoch-based GC

basic idea:
https://github.com/spacejam/tla-rust#lock-free-epoch-based-gc

it may be sufficient to use ping-pong epochs for this to simplify some logic. we keep a single word "header" that is CAS'd by threads to coordinate:

1 bit | 1 bit | 7 bits | 1 bit | 7 bits
left/right active | left state | left active thread count | right state | right active thread count
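
Hypothetical helpers for that layout, packing everything into one word so the whole header can be CAS'd at once (the shift positions are illustrative):

```rust
const RIGHT_COUNT_SHIFT: u32 = 0; // bits 0-6: right active thread count
const RIGHT_SEAL_SHIFT: u32 = 7;  // bit 7: right state (0 = open, 1 = sealed)
const LEFT_COUNT_SHIFT: u32 = 8;  // bits 8-14: left active thread count
const LEFT_SEAL_SHIFT: u32 = 15;  // bit 15: left state
const ACTIVE_SHIFT: u32 = 16;     // bit 16: which side is active

fn left_count(header: usize) -> usize {
    (header >> LEFT_COUNT_SHIFT) & 0x7f
}
fn right_count(header: usize) -> usize {
    (header >> RIGHT_COUNT_SHIFT) & 0x7f
}
fn left_sealed(header: usize) -> bool {
    header & (1 << LEFT_SEAL_SHIFT) != 0
}
fn right_sealed(header: usize) -> bool {
    header & (1 << RIGHT_SEAL_SHIFT) != 0
}
fn right_active(header: usize) -> bool {
    header & (1 << ACTIVE_SHIFT) != 0
}

// sealing the left epoch and flipping the active side in one CAS:
//   new = header | (1 << LEFT_SEAL_SHIFT) | (1 << ACTIVE_SHIFT)
```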

  1. there is only one active epoch at a time, specified by the active bit
  2. each epoch is either in the open or sealed state
  3. when an operation begins on shared state (the tree), a thread first looks at the shared coordination header, and sees which buffer is active
  4. if the active buffer is also in the open state, the thread tries to increment the associated active thread count by 1 using a CAS operation
  5. if the active buffer is in the sealed state, its previous round of GC has not finished yet; we must wait for the last thread to leave that epoch to finish the GC operation and transition it back to the open state, so we spin until it is available
  6. the thread then operates on the shared state
  7. if the thread makes any shared state unreachable, it pushes it to a thread-local buffer that will be added to an epoch after un-enrolling. The state remains readable until the end of the epoch, at which point all threads are guaranteed to have moved on to a later epoch and can no longer be viewing it.
  8. at some point, a condition is reached that causes the current epoch to be sealed, whether a period of time passing since the last seal or the amount of garbage in the epoch crossing a threshold
  9. the thread that detects this condition tries to simultaneously set the seal bit on the epoch and flip the active bit using a single CAS operation, causing enrolling and GC'ing threads to use the other epoch
  10. when the operation on the shared state is complete, the thread un-enrolls in its epoch by decrementing the shared header by 1
  11. if its starting epoch is sealed, and the decrement would cause the epoch active thread count to drop to 0, then this thread is responsible for triggering (not necessarily performing) a GC operation on each item in the GC stack associated with the epoch. Note that the thread still has not done anything with its thread-local GC state; it is just cleaning up the garbage left by other threads.
  12. after the epoch's GC state has been handed off to another thread for cleaning up, or has been completely cleaned up in this thread, the designated clean-up thread then marks the epoch as open again. it can combine this with the CAS that decrements the active thread count to 0 to cut down on memory barriers.
  13. after un-enrolling in the epoch, the thread then enrolls again, and adds its local GC state to that epoch's GC stack
  14. when un-enrolling in the "write" epoch, the same logic applies for handling decrements on sealed epochs as before, and the thread must handle the actual clean-up when responsible

improve lock-free radix pointer ratio

the current radix implementation is super naive, and all nodes contain 64 (1 << 6) atomic pointers to possibly nonexistent children. because this is a dense keyspace, the first few layers will be full, with the bottom leaves (of which there are 64x more than the layer above...) possibly being quite sparse. we want to keep the overall structure lock-free, since this structure is in the hot path for every operation. look into ways of using less memory for sparse nodes, and growing/shrinking nodes with a CAS to keep sparsity in check.
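
For a sense of the cost, here is the dense layout next to one possible sparse alternative (the sparse node is a hypothetical sketch, not a committed design):

```rust
use std::sync::atomic::AtomicPtr;

const FANOUT: usize = 1 << 6;

// the current dense layout: 64 atomic child pointers per node means
// 512 bytes of pointer space even for a node with a single child
struct DenseNode<T> {
    children: [AtomicPtr<DenseNode<T>>; FANOUT],
    value: AtomicPtr<T>,
}

// one possible direction: store only the occupied slots, and swap the
// whole node via a CAS on the parent's pointer when occupancy crosses
// a grow/shrink threshold, keeping the structure lock-free
struct SparseNode<T> {
    slots: Vec<(u8, AtomicPtr<T>)>,
}
```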

recover persisted flushes on start-up

naive recovery, no handling of LRU cache-related stuff yet. no page table snapshotting. read it all into memory for reconstruction.

  • implement iterator on Log trait
  • implement recovery method on PageCache
  • write relevant page metadata to flushes to support reconstruction
  • test that a log's stabilized updates are visible after shutting down and restarting

add logging to page operations

during all page operations:

  1. create consolidated page / block of updates
  2. reserve space for it in log
  3. CAS
  4. either abort or commit the log reservation
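
A sketch of that flow, with `reserve`, `complete`, `abort`, and `cas_page` as hypothetical stand-ins for the real log and pagetable APIs:

```rust
// illustrative stand-in for a log reservation handle
struct Reservation;

impl Reservation {
    fn complete(self) { /* make the reserved bytes durable and visible */ }
    fn abort(self) { /* release the slot so recovery skips it */ }
}

fn reserve(_len: usize) -> Reservation {
    Reservation
}

fn cas_page(_old: usize, _new: usize) -> Result<(), usize> {
    Ok(())
}

fn update_page(serialized_update: &[u8], old: usize, new: usize) {
    // 1. the consolidated page / block of updates is serialized by the caller
    // 2. reserve space for it in the log before publishing the update
    let res = reserve(serialized_update.len());
    // 3. CAS the page table entry to publish the new state
    match cas_page(old, new) {
        // 4a. the update won the race, so commit the log reservation
        Ok(()) => res.complete(),
        // 4b. lost the race; abort without writing the update
        Err(_) => res.abort(),
    }
}
```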

tune Lru

calls to the Lru bookkeeping structures are inefficient. should use simpler data structures.

array-stack

reconstructing pages in-memory involves chasing linked-list pointers around

implement a linked-array structure that minimizes cache thrashing during page reconstruction
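
One possible node layout, assuming a hypothetical `NODE_CAP` chunk size:

```rust
use std::sync::atomic::AtomicPtr;

const NODE_CAP: usize = 8;

// hypothetical "linked array" node: each link stores a chunk of
// entries, so reconstruction chases one pointer per NODE_CAP frags
// instead of one pointer per frag, keeping more of the walk in cache
struct ArrayStackNode<T> {
    len: usize,
    items: [Option<T>; NODE_CAP],
    next: AtomicPtr<ArrayStackNode<T>>,
}
```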

Maybe `Frag: Send`?

I am trying to port rsdb to crossbeam-epoch: https://github.com/crossbeam-rs/crossbeam-epoch/

In doing so, I found out that Stack probably should require T: Send, because an item can be sent elsewhere. In turn, I guess PageCache's P should be Send because of inner: Radix<Stack<CacheEntry<P>>>, and then Frag should be Send. What do you think?
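
To illustrate the bound chain with stand-in types (the real definitions differ):

```rust
use std::marker::PhantomData;

// stand-ins for the real types, showing how the Send bound propagates
struct Stack<T: Send> {
    // a popped item may be handed to another thread, hence T: Send
    _marker: PhantomData<T>,
}
struct Radix<T> {
    _marker: PhantomData<T>,
}
enum CacheEntry<P> {
    Resident(P),
    Flush(u64),
}

// Stack<CacheEntry<P>> only satisfies Stack's T: Send bound when
// CacheEntry<P>: Send, which requires P: Send, so the bound surfaces
// on PageCache and ultimately on Frag
struct PageCache<P: Send> {
    inner: Radix<Stack<CacheEntry<P>>>,
}
```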

dirty page flushing

  • periodically flush the dirty buffer
  • make the LogID in CacheEntry non-optional

in the future we may want to buffer things before claiming a log slot for updates, but for now let's just keep it simple

compression of items written to log slots

we can't compress on the log-buffer level without destroying the property of a LogID being its physical offset, so we need to compress before writing here. Desired properties:

  • works well on small amounts of human-language bytes
  • configurable

test torn pages on log

  1. write garbage to log after valid entries
  2. ensure read on valid entries succeeds
  3. ensure read on garbage fails
  4. ensure that during recovery, we detect the torn page and allow new entries to reuse the garbage space
  5. ensure that the new entries and the old valid entries are valid before shutting down
  6. recover again, ensure all valid entries are readable
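
As a standalone illustration of the detection in step 4, here is a toy length-plus-checksum frame; sled's actual log format and checksum differ:

```rust
// frame an entry as [len: u32][checksum: u32][payload]
fn frame(payload: &[u8]) -> Vec<u8> {
    let mut buf = Vec::new();
    buf.extend_from_slice(&(payload.len() as u32).to_le_bytes());
    let crc: u32 = payload.iter().map(|&b| b as u32).sum(); // toy checksum
    buf.extend_from_slice(&crc.to_le_bytes());
    buf.extend_from_slice(payload);
    buf
}

// returns the payload and frame length, or None on a torn/garbage frame
fn read_frame(buf: &[u8]) -> Option<(&[u8], usize)> {
    if buf.len() < 8 {
        return None;
    }
    let len = u32::from_le_bytes(buf[0..4].try_into().unwrap()) as usize;
    let crc = u32::from_le_bytes(buf[4..8].try_into().unwrap());
    let payload = buf.get(8..8 + len)?;
    let actual: u32 = payload.iter().map(|&b| b as u32).sum();
    // a torn or garbage frame fails the checksum; recovery stops here
    // and the remaining space can be reused by new entries
    if actual != crc {
        return None;
    }
    Some((payload, 8 + len))
}
```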

frag location metadata in pagetable

the pagetable should know how to retrieve frags from the log to satisfy non-resident reads

  • on recovery, load page locations in as a stack with a flush pointing to the disk location
  • on page-in, CAS the stack to the actual one
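
The cache entry could distinguish the two states along these lines (`LogID` and the variant names are stand-ins):

```rust
type LogID = u64;

// sketch: on recovery each page's stack holds only disk locations; on
// page-in the stack is CAS'd to one holding materialized frags
enum CacheEntry<Frag> {
    // non-resident: we only know where in the log to find the frag
    Flush(LogID),
    // resident: the materialized frag, plus where it lives on disk
    Resident(Frag, LogID),
}
```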

tune materializer calls

calls to the page materializer take a lot of time; try to cache results from it so we don't spend as much CPU calling it on the same hot pages over and over.

multi-fragment flush strategy

to improve data locality and reduce written data, write adjacent chunks of frags in a single stack at once to the log

read in arguments for benchmark tool

the benchmark tool should accept arguments for these parameters:

  1. number of threads
  2. number of total operations
  3. proportion of reads, writes, cas, del, and scan operations among the operations
  4. freshness bias (prefer recent/likely to be in cache, prefer old/not likely to be in cache, no preference)
  5. non-present-key chance (the chance that a request for cas/get/set/del may be sent for a key that does not exist)
  6. key size min, max, median
  7. value size min, max, median
  8. scan iterations min, max, median
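
A hedged sketch of the CLI surface using clap's derive API (clap 4.x); every flag name here is invented for illustration:

```rust
use clap::Parser;

#[derive(Parser)]
struct Args {
    #[arg(long, default_value_t = 1)]
    threads: usize,
    #[arg(long, default_value_t = 1_000_000)]
    ops: usize,
    /// read:write:cas:del:scan, e.g. "80:10:5:3:2"
    #[arg(long, default_value = "80:10:5:3:2")]
    mix: String,
    /// -1.0 prefers old keys, 1.0 prefers recent keys, 0.0 no preference
    #[arg(long, default_value_t = 0.0)]
    freshness_bias: f64,
    /// chance that a cas/get/set/del targets a key that does not exist
    #[arg(long, default_value_t = 0.0)]
    non_present_key_chance: f64,
    /// min,median,max in bytes
    #[arg(long, default_value = "8,16,64")]
    key_size: String,
    /// min,median,max in bytes
    #[arg(long, default_value = "32,128,1024")]
    value_size: String,
    /// min,median,max iterations per scan
    #[arg(long, default_value = "10,50,100")]
    scan_iterations: String,
}

fn main() {
    let args = Args::parse();
    println!("threads: {}, ops: {}", args.threads, args.ops);
}
```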

handle tree root hoists during recovery

currently the pagecache recovers itself, but it does not tell the tree above it about updates to the tree root that it encounters, since it doesn't know what a tree root is. We should support an interface that lets trees inject custom logic into the recovery (and eventually the snapshotting) process.

pagetable snapshotting

on recovery, we should be able to recover a snapshot of the pagetable, then read the log after the pagetable was created for full reconstruction. we only want to read and recover the structure of pages and the offsets of chunks of frags in the log for future retrieval, without pulling in any kv data.

initially this can be done by forking and snapshotting, similar to Redis

persistent state definition

define structures around persisted fragments and base nodes, and a way of easily generating a block of serialized state from a Frags struct.
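
One possible shape, using serde + bincode as a stand-in for whatever serialization ends up being used; all variant and field names here are illustrative:

```rust
use serde::{Deserialize, Serialize};

// hypothetical persisted fragment: deltas layered over a base node
#[derive(Serialize, Deserialize)]
enum Frag {
    Set(Vec<u8>, Vec<u8>),
    Del(Vec<u8>),
    ChildSplit { at: Vec<u8>, to: u64 },
    Base(BaseNode),
}

// hypothetical base node: a key range and its resident items
#[derive(Serialize, Deserialize)]
struct BaseNode {
    lo: Vec<u8>,
    hi: Vec<u8>,
    items: Vec<(Vec<u8>, Vec<u8>)>,
}

// a Frags struct could then serialize its chain into one block:
//   let block: Vec<u8> = bincode::serialize(&frags_chain)?;
```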
