
stuffer_shack-rs's Introduction

stuffer_shack

Stuffer shack is a data store intended for medium amounts of append-only data. It stores all data on disk in a simple write log, and keeps a reconstructible in-memory index to look it up. The ideal input for stuffer_shack is content-addressed data that is never deleted.

It was originally written to replace LMDB in a very specific use case: If data is never altered or deleted and keys are completely random, stuffer_shack should be able to lower disk usage and underlying write operations compared to LMDB. If this is not true for your data, please reconsider using stuffer_shack.

Features

  • Zero-copy (via mmap): The backing store of stuffer_shack is a memory mapped file containing both keys and values. Any read from the shack returns a slice in that region.
  • Time-based data locality: All values are stored sequentially in the order they are written, avoiding the problem of random keys (like hashes) maximizing distance between related entries.
  • Compactness: The overhead for storing a value on disk is only 4 bytes.
  • Sequential disk writes: Data is written to disk in a maximally sequential manner; only the database header is ever rewritten. This reduces the number of IOPS to a minimum, which may be at a premium in cloud-based storage.
  • O(1) read and write performance: No data ever needs to be reallocated or reordered after a write. Any read depends only on a single HashMap lookup.
  • Unlimited lock-free concurrent reads: No lock needs to be acquired to read data.
  • Durability: The database cannot be corrupted by power outages; only unfinished writes will be lost.
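
Based on the description above (keys and values stored together in a write log, 4 bytes of per-value overhead, host endianness), the record layout can be sketched as follows. This is an illustration against an in-memory stand-in for the log, not the crate's actual code; the field order and the fixed 32-byte key size are assumptions:

```rust
/// Assumed fixed key size for this sketch.
const KEY_SIZE: usize = 32;

/// Append one record (key, 4-byte native-endian length, value) to an
/// in-memory stand-in for the write log; returns the record's offset.
fn append_record(log: &mut Vec<u8>, key: &[u8; KEY_SIZE], value: &[u8]) -> usize {
    let offset = log.len();
    log.extend_from_slice(key); // key bytes
    log.extend_from_slice(&(value.len() as u32).to_ne_bytes()); // the 4-byte overhead
    log.extend_from_slice(value); // unaligned value bytes
    offset
}

/// Read back the value of the record starting at `offset`.
fn read_value(log: &[u8], offset: usize) -> &[u8] {
    let len_start = offset + KEY_SIZE;
    let len =
        u32::from_ne_bytes(log[len_start..len_start + 4].try_into().unwrap()) as usize;
    &log[len_start + 4..len_start + 4 + len]
}
```

In the real database the returned slice would point into the memory-mapped file, which is what makes reads zero-copy.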

Limitations

Due to its specialized data model, stuffer_shack has quite a range of limitations:

  • Memory bound: The entire index is kept in memory, requiring roughly key_size + 4 bytes of memory per stored key to operate (i.e. a database with 10 million items and 32 byte keys will use at least 360 megabytes of RAM).
  • No disk reclaiming: Any value written will stay in the write log forever and may end up unreachable if its key is overwritten.
  • No deletion support: While a key can be overwritten, deletion is not possible.
  • No stored index: The index must be reconstructed every time the database is loaded.
  • Limited value size: Values can be no larger than 4 GB.
  • No read transactions: Reads cannot be batched to ensure a consistent view of the world when reading multiple values.
  • No write transactions: Every write is executed and persisted immediately, with no rollback or multi-write atomicity guarantees.
  • No alignment enforcement: For maximal compactness, data is written unaligned to disk and memory. As a result, access may be slowed down when reading unaligned values.
  • Endianness dependency: The database always uses host endianness internally. Currently it is not possible to copy a database to a machine whose CPU architecture has a different endianness.
  • No integrity checks: All data is untyped and no user-defined conditions can be enforced.
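
The memory-bound limitation translates into a simple lower-bound estimate, shown here as an illustrative helper (HashMap bookkeeping overhead is not counted):

```rust
/// Lower bound on index memory: roughly key_size + 4 bytes per stored key.
fn index_memory_lower_bound(items: u64, key_size: u64) -> u64 {
    items * (key_size + 4)
}
```

For the example above, 10 million items with 32-byte keys come to 10,000,000 × 36 = 360,000,000 bytes, i.e. at least 360 megabytes.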

Roadmap

This project does not have a detailed roadmap yet, but you can have a look at the open issues below.

stuffer_shack-rs's People

Contributors

marc-casperlabs

stuffer_shack-rs's Issues

Deletion support

Currently, items cannot be deleted. Deletion could be added via tombstones that mark a value as deleted. These tombstones would be written to the write log like regular updates.

This does require some changes to data structures and would possibly increase the record header size. An alternative is to designate a special length value (like u32::MAX) to mean deletion.
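
The sentinel alternative can be sketched as follows; treating u32::MAX as a tombstone length is the suggestion above, while the helper names are hypothetical:

```rust
/// Sentinel length marking a tombstone record (the alternative described
/// above; not part of the current on-disk format).
const TOMBSTONE: u32 = u32::MAX;

/// Encode a record length field, treating `None` as a deletion marker.
fn encode_len(value_len: Option<u32>) -> u32 {
    match value_len {
        Some(len) => {
            // Reserving the sentinel lowers the maximum value size by one byte.
            assert!(len < TOMBSTONE, "value too large");
            len
        }
        None => TOMBSTONE,
    }
}

/// Decode a length field back into `Some(len)` or `None` for a tombstone.
fn decode_len(raw: u32) -> Option<u32> {
    if raw == TOMBSTONE { None } else { Some(raw) }
}
```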

Do not panic when mmap end is reached.

Currently, the database panics when the end of the memory map is reached, due to a failing range check. Depending on the operating system, memory maps may not be allocatable at a large size (e.g. Linux allows mapping almost the entire address range, whereas OS X seems to limit maps to available memory, or at least a smaller size).

Read-only mode for parallel access

With how the database is structured at the moment, it should be possible to open it from multiple processes without issues, as long as only one process is writing to it. This opens the door for tools that inspect a running system.

A good way to prevent multiple write opens should be determined, possibly using flock or something similar, otherwise the user is responsible for guarding against multiple write accesses.
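
A minimal guard against multiple write opens can be sketched with only the standard library, using an exclusive lock file rather than flock (an assumption of this sketch; unlike flock, a stale lock file survives a crash and must be removed manually):

```rust
use std::fs::OpenOptions;
use std::io;
use std::path::{Path, PathBuf};

/// Advisory write lock implemented as an exclusively created lock file.
struct WriteLock {
    path: PathBuf,
}

impl WriteLock {
    /// Fails if another writer already holds the lock file.
    fn acquire(db_path: &Path) -> io::Result<WriteLock> {
        let path = db_path.with_extension("lock");
        // `create_new` maps to O_EXCL: creation fails if the file exists.
        OpenOptions::new().write(true).create_new(true).open(&path)?;
        Ok(WriteLock { path })
    }
}

impl Drop for WriteLock {
    fn drop(&mut self) {
        // Release the lock when the writer goes away.
        let _ = std::fs::remove_file(&self.path);
    }
}
```

Read-only opens would simply skip the lock, matching the one-writer/many-readers model described above.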

Persistent index

The index is currently the memory-usage Achilles' heel of the system in terms of scalability: the database simply cannot operate if there is not enough RAM to hold all keys in memory. At the same time, the current design is what allows for the nice sequential write properties.

As an option, it should be possible to specify that the index be written to disk as well, to allow for faster startup and less memory usage. Which data structure to use and how to design this is left as an exercise to the developer.

This is a fairly significant step that pushes the library from a simple write log to almost a real database, and may not even be desirable, as it is definitely scope creep for the project.

Builder pattern

As the database acquires potential configuration options (like alignment, memory map size, etc), it is likely necessary to add a StufferShack::build() -> StufferShackBuilder method, with two finalizing constructors (in-memory and file).

An alternative would be to pass the builder as settings to the opening functions. An unsolved decision is whether to tie the key size to the builder as well.
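
A possible shape for such a builder, with hypothetical names and defaults:

```rust
/// Sketch of the builder described above; names, defaults and options are
/// assumptions, not the crate's actual API.
struct StufferShackBuilder {
    key_size: usize,
    mmap_size: usize,
    alignment: usize,
}

impl StufferShackBuilder {
    fn new(key_size: usize) -> Self {
        // Tying the key size to the builder is one of the open decisions.
        StufferShackBuilder { key_size, mmap_size: 1 << 30, alignment: 1 }
    }

    fn mmap_size(mut self, size: usize) -> Self {
        self.mmap_size = size;
        self
    }

    fn alignment(mut self, align: usize) -> Self {
        assert!(align.is_power_of_two());
        self.alignment = align;
        self
    }

    // The two finalizing constructors would go here, e.g.:
    // fn open_in_memory(self) -> StufferShack { ... }
    // fn open_file(self, path: &Path) -> io::Result<StufferShack> { ... }
}
```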

Alignment support

Currently, values are not aligned; data starts directly after the header to minimize space. Typically, alignment is important for two reasons:

  • If data is not aligned to a page boundary, the same page may be written multiple times. This is usually not a problem but rather a necessity, since many items are smaller than a typical page size (4096 bytes).
  • If the length field of a record header is not aligned with the CPU's word size, five operations (two reads, two shifts and a combination, see illustration) instead of one are necessary to read it.

A user can enforce some alignment themselves for smaller values: since the header is a nice, round 64 bytes, if they were to use 32-byte keys and write only values with lengths divisible by 4, they would get 4-byte-aligned values. However, this is cumbersome and requires in-depth knowledge of the database.

It would be good if alignment was enforced in the database itself:

  • On construction, the user specifies the desired alignment (typically 4, 8 or 4096 bytes; the implications should be thoroughly documented). Mismatched values result in an error.
  • Entire records are aligned, not just values. This is preferable to aligning only the value in the case of large alignments; for smaller ones, the key size should usually be a multiple of the alignment anyway.
  • No padding bytes are stored as part of the record's data. Rather, the rule is that every record is extended with padding until it is aligned.
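
The padding rule in the last point boils down to the usual power-of-two round-up:

```rust
/// Round `offset` up to the next multiple of `align` (a power of two),
/// i.e. where the next record starts after padding.
fn align_up(offset: usize, align: usize) -> usize {
    debug_assert!(align.is_power_of_two());
    (offset + align - 1) & !(align - 1)
}
```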

Bit recycling

If alignment is always enforced and defined to be at least 2 or 4 bytes, one or two bits of the offset suddenly become available for other purposes (e.g. indicators for deduplication (#16) or tombstones for deletion (#13)).

Endianness conversion

Currently, all values are written in host endianness, since just "raw" u32s are written. The system guards against accidental changes in endianness using a special value in the header.

To make databases transferable to other systems, at least two solutions exist:

  • Convert on load: If the endianness check fails when a database is opened, rewrite all length fields while the index is being rebuilt. The result is a database in the host's endianness.
  • Use a fixed endianness: This means changing length fields to [u8; 4] and using the respective conversion functions, or defining a clever type that does this transparently. Ideally, special CPU instructions for reading foreign-endianness values would be used automatically.

In short, endianness issues can be solved either by standardizing on a specific endianness, or by offering a function that rewrites all lengths in the database on open while the index is being rebuilt.
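
The fixed-endianness option can be sketched as follows (standardizing on little-endian is an assumption of this sketch; the current format uses host endianness):

```rust
/// Store the length as an explicit little-endian byte array. On
/// little-endian hosts the conversion compiles away; on big-endian hosts
/// it becomes a byte swap, typically a single instruction.
fn write_len(buf: &mut [u8; 4], len: u32) {
    *buf = len.to_le_bytes();
}

fn read_len(buf: &[u8; 4]) -> u32 {
    u32::from_le_bytes(*buf)
}
```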

Read transactions

Read transactions offer a way of looking at a snapshot of the data. If data is immutable (so keys cannot be updated), they are not always useful, as the only difference observed between transactional and non-transactional reads is that new items may show up. This does make a difference if data is discoverable through outside means, or not all dependencies of data are written first.

If data is mutable, read transactions offer a clear benefit.

An easy implementation of read transactions is to make the index an immutable data structure (see im-rs) that can simply be cloned to create a snapshot.
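
A minimal sketch of snapshot-based read transactions, using a plain std HashMap clone (O(n) per snapshot; an immutable structure like im-rs, as suggested above, would make the clone cheap). Offsets point into the append-only write log, so a snapshot of the index is enough for a consistent view:

```rust
use std::collections::HashMap;

/// Index mapping (here: 4-byte keys, hypothetical) to log offsets.
struct Index {
    map: HashMap<[u8; 4], u64>,
}

impl Index {
    /// A read "transaction": a frozen copy of the index.
    fn snapshot(&self) -> HashMap<[u8; 4], u64> {
        self.map.clone()
    }
}
```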

Single-write transaction support

By writing multiple records sequentially without updating the insertion pointer, it is possible to implement simple transaction support.

Example:

  1. User requests a new transaction, causing a write lock.
  2. Multiple new writes are completed, without the insertion pointer being advanced (a transaction-local copy is kept instead).
  3. Upon abort, nothing is persisted.
  4. If committed instead, the transaction-local insertion pointer is written to the db.
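
The four steps can be sketched against an in-memory model of the log; names are hypothetical:

```rust
/// In-memory stand-in for the database: a log plus the persisted pointer.
struct Db {
    log: Vec<u8>,
    committed: usize, // persisted insertion pointer
}

/// A write transaction holding a transaction-local insertion pointer.
struct Txn<'a> {
    db: &'a mut Db,
    local_ptr: usize,
}

impl Db {
    /// Step 1: begin a transaction (the &mut borrow acts as the write lock).
    fn begin(&mut self) -> Txn<'_> {
        let local_ptr = self.committed;
        Txn { db: self, local_ptr }
    }
}

impl<'a> Txn<'a> {
    /// Step 2: write past the committed pointer, advancing only the local one.
    fn write(&mut self, data: &[u8]) {
        // Discard leftovers from a previously aborted transaction.
        self.db.log.truncate(self.local_ptr);
        self.db.log.extend_from_slice(data);
        self.local_ptr = self.db.log.len();
    }

    /// Step 4: commit publishes the local pointer; this is the only
    /// persisted update. Dropping without commit() (step 3) aborts.
    fn commit(self) {
        self.db.committed = self.local_ptr;
    }
}
```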

Parallel write support

Currently, only a single write can take place at a time. By adding a lock around the insertion pointer, multiple writes can take place in parallel: each write first reserves space for its data, then writes it.

Example:

  1. Write 1 begins, reserves 4000 bytes of space and updates the insertion pointer, then releases the lock around the pointer.
  2. Write 2 begins, reserves 50 bytes of space and updates the insertion pointer, then releases the lock around the pointer.
  3. Write 2 finishes first, inserting its key into the index hashmap.
  4. Write 3 starts.
  5. Write 1 finishes, inserting its key into the index hashmap.
  6. Similarly, write 3 finishes.

This scheme has the issue that upon a crash, partial writes are visible (the insertion pointer has already been updated). By adding a second pointer tracking completed writes, durability can be preserved.
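
The reserve-then-write scheme, including the second completed-writes pointer, can be sketched as follows (the completion tracking is deliberately simplified; a real implementation would track all in-flight writes):

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Mutex;

/// Hands out disjoint log regions under a short lock; the write itself
/// then proceeds without holding any lock.
struct WriteAllocator {
    reserved: Mutex<usize>,     // insertion pointer (space handed out)
    completed: AtomicUsize,     // everything below this offset is durable
}

impl WriteAllocator {
    fn reserve(&self, len: usize) -> usize {
        let mut ptr = self.reserved.lock().unwrap();
        let start = *ptr;
        *ptr += len; // lock is released when `ptr` drops
        start
    }

    fn finish(&self, start: usize, len: usize) {
        // Simplification: only advance `completed` when the finished write
        // is the lowest outstanding one.
        let _ = self.completed.compare_exchange(
            start, start + len, Ordering::SeqCst, Ordering::SeqCst);
    }
}
```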

Reading-while-writing

Currently, a &mut reference is required to update any data in the database. By either locking the index structure or removing the ability to overwrite items, reads and a single write may take place in parallel --- any read can simply be served from the existing structure (data is immutable), while a write only updates the index and insertion pointer once completed.

Delta compression

A delta-compression feature can be added which would kick in when records are added or overwritten. A user supplies a "previous value hint" when inserting a value. If a key is overwritten, this could implicitly be the previous value.

Given a suitable delta-compression scheme, the new value can be encoded as the previous value + the delta. A simple approach for a compression scheme is to feed the previous data into a stream compressor (e.g. DEFLATE), flushing, then feeding the new value in and only capturing the output.

If the value is not too large, the compressed and uncompressed encodings can be compared for resulting size and the smaller one written; otherwise the compressed form can be written directly. Compression can also be enabled only for a specific range of value sizes.

If an already delta compressed value is used as the basis, it may be useful to specify a limit on how deep the compression chain can be built before a new uncompressed snapshot is written.

This feature is complicated by, or possibly incompatible with, #22.
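
As a dependency-free illustration of encoding a value as previous value + delta, here is a naive byte-range delta; it stands in for the DEFLATE-based stream scheme described above, which would require a real compressor (e.g. the flate2 crate):

```rust
/// Encode `new` relative to `prev` as (offset, literal bytes) runs;
/// positions not covered by a run are copied from `prev`.
fn delta_encode(prev: &[u8], new: &[u8]) -> Vec<(usize, Vec<u8>)> {
    let mut deltas = Vec::new();
    let mut i = 0;
    while i < new.len() {
        if i < prev.len() && prev[i] == new[i] {
            i += 1; // byte unchanged, no literal needed
            continue;
        }
        let start = i;
        while i < new.len() && (i >= prev.len() || prev[i] != new[i]) {
            i += 1;
        }
        deltas.push((start, new[start..i].to_vec()));
    }
    deltas
}

/// Reconstruct the new value from the previous value and the delta runs.
fn delta_decode(prev: &[u8], new_len: usize, deltas: &[(usize, Vec<u8>)]) -> Vec<u8> {
    let mut out: Vec<u8> = prev.iter().copied().take(new_len).collect();
    out.resize(new_len, 0);
    for (off, bytes) in deltas {
        out[*off..*off + bytes.len()].copy_from_slice(bytes);
    }
    out
}
```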

Make `read` and `write` use a generic type

The write and read interface should be improved to automatically convert keys where possible, avoiding the need for GenericArray and making it possible to call them with a [u8; _] as the key. This likely means delegating to a do_read function to avoid monomorphization overhead for the entire read/write function.

Document pattern of using a static instance map

A probably good pattern for an app with a single database is to use a statically allocated instance, allowing it to hand out &'static [u8] slices across the entire program. This makes passing data around quite easy.

A torture-test suite needs to be added.

This should take the form of model-based testing: a number of operations (read, write, [delete], overwrite, close, reopen) should be specified as an enum, making them applicable to a shack db as well as a reference mapping from std. A sequence of these should be generated using proptest and applied step by step to both the reference collection (e.g. HashMap) and the shack db. By comparing their contents between each step, we ensure both behave as expected.

This test should be run and compared for different permutations of settings of the DB itself.
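
The operation enum and the reference model can be sketched as follows; the proptest generation and the shack side are omitted:

```rust
use std::collections::HashMap;

/// Operations applicable to both the shack db and the reference model.
/// A real harness would add delete (if supported), close and reopen.
#[derive(Debug, Clone)]
enum Op {
    Write(Vec<u8>, Vec<u8>),
    Read(Vec<u8>),
}

/// Apply one operation to the std reference model; reads return the value
/// so it can be compared against what the shack db returns.
fn apply_model(model: &mut HashMap<Vec<u8>, Vec<u8>>, op: &Op) -> Option<Vec<u8>> {
    match op {
        Op::Write(k, v) => {
            model.insert(k.clone(), v.clone()); // overwrite allowed, no delete
            None
        }
        Op::Read(k) => model.get(k).cloned(),
    }
}
```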

Re-examine close of backing file

Somewhat related to #5 and #2.

Documentation for sparseness of files can/should be added, but the current use of mem::forget needs to be examined, as it may be possible to get away without this call.

Deduplication support

Stuffer shack assumes that data is mostly unique (for content-addressed data this would be strictly true by definition, barring collisions), so there is normally little use for a deduplication feature. Some applications may, however, have unchangeable data that is not deduplicated. By allowing for deduplication at the database level, we can provide a convenient solution to this problem.

We make the assumption that:

  • Reads and non-duplicated writes are much more frequent than actual duplicated writes.
  • Space is still at a premium.

Deduplication can be added by having a separate index and a fast hash function (e.g. xxhash); the deduplication index will store an offset pointing to the record that already contains the value. This increases memory usage by O(n).

Whenever a duplicate value is added, the key is written with a pointer to the existing record containing the same value, instead of the usual key+value pair. The on-disk format is yet to be decided. One option is to reserve another length value (u32::MAX - 1) and have it be followed by a u64 offset.

If alignment support and a minimum alignment is enforced, a single bit can be used to encode that a deduplication offset is about to follow. A u32 can then follow, to be combined into a u64 offset.

Example:

xx xx xx xy zz zz zz zz

If y has its lowest bit set to 1, this indicates a de-duped value, with the real record being found at

(zz zz zz zz << 32) | (xx xx xx xy & 0xFFFFFFFC)

which is a 64 bit value.
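
The decoding rule can be written out directly; the function name is hypothetical:

```rust
/// Decode the 8-byte dedup encoding sketched above: `low` is the first
/// 32-bit word ("xx xx xx xy"), carrying the flag in its lowest bit (two
/// bits are free given 4-byte alignment); `high` ("zz zz zz zz") supplies
/// the upper offset bits.
fn decode_dedup(low: u32, high: u32) -> Option<u64> {
    if low & 1 == 1 {
        Some(((high as u64) << 32) | ((low as u64) & 0xFFFF_FFFC))
    } else {
        None // not a de-duplicated record
    }
}
```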

A final optimization is to disable deduping for values whose size is either 4 bytes or lower, or whose record size is less than the alignment size, as there are no on-disk savings in these cases.

If space reclamation is implemented (see #22), the entire list of deduplicated values must be kept, otherwise it is sufficient to store one entry for each unique value.

Review volatile memory

A verification needs to be done (and documented) that all writes to the memory map are done in order, using the abstractions provided, and never elided or reordered by the compiler.

Space reclamation

With #13 added or just basic overwrites, there may be unaddressable items left in the database. Their space can be reclaimed in one of two ways:

  • A cleanup function can be added that walks the entire index and keeps only the records that are still reachable. The output can either be a brand-new database, requiring additional space but being safe from crashes during the rewrite, or the cleanup can be done in place by simply rewriting all items in order.
  • Alternatively, a free list can be created and new items inserted into slots they fit. This requires additional "free" records to be written, which can be done using a sentinel value if keys are at least 4 bytes large (e.g. u32::MAX - 2 indicates that a slot is free, followed by the size of the free space, padded to the key size).

Free lists are kept in memory. Fitting is done based on the smallest slot that can hold the new record, plus another free record if there is remaining space. No further consideration is given to fragmentation issues.
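
The best-fit rule, including re-inserting the remainder as a new free record, can be sketched as:

```rust
/// Best-fit allocation from an in-memory free list of (offset, size) slots.
/// Returns the offset of a reclaimed slot, or None if nothing fits.
fn best_fit(free: &mut Vec<(u64, u32)>, needed: u32) -> Option<u64> {
    // Find the smallest slot that still fits the record.
    let mut best: Option<usize> = None;
    for (i, &(_, size)) in free.iter().enumerate() {
        if size >= needed && best.map_or(true, |b| size < free[b].1) {
            best = Some(i);
        }
    }
    let idx = best?;
    let (offset, size) = free.swap_remove(idx);
    if size > needed {
        // Remaining space becomes a new free record. (A real on-disk format
        // would require the remainder to be at least one free-record header
        // in size; ignored in this sketch.)
        free.push((offset + needed as u64, size - needed));
    }
    Some(offset)
}
```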

Configurable `mmap` size

Right now, the size of the mmap is hardcoded. It needs to be made configurable, and the rationale for choosing a proper size documented (see #17 as well).
