Hi, It’s really an excellent work. And I am also working on a new kv

Thanks, good questions :) Yes, indeed, the code

Thanks for your reply, (1.2) For the finding th

The BW is ~2GB/s per disk so ~16GB/s in total. So it takes 6.25s

some questions about the implementation about kvell HOT 4 OPEN

blepers commented on July 2, 2024

some questions about the implementation

from kvell.

Comments (4)

BLepers commented on July 2, 2024

Thanks, good questions :)

Yes, indeed, the code on github doesn't fully reflect that part of the paper. Two points here:

1.1 We did some experiments with the B Tree using mmap to allocate memory for cells instead of the standard malloc/new. The memory is never allocated from our user page cache, we used a regular mmap (so allocated from the kernel directly). To do that, we just modified the memory allocator used by the B tree, but the code was too messy, so I didn't include it. (Performance is anyway very bad when the index doesn't fit in memory.)

1.2. Even when the B Tree is mmaped and not malloced, we still have to scan the whole store to rebuild the index in case of a crash. The mmap is just there to tell the kernel that it is OK to swap out some pages, but we have no guarantee that the index data on disk is up to date. So, in all cases, we scan the store for recovery. The numbers reported in the paper correspond to the time it takes to scan the full store (100GB at 15GB/s = fast recovery, so there is no point in trying to have an up to date version of the index disk, it is way more costly than scanning the store :) ).
If you use a header for each 4K block, you can guarantee atomicity using timestamps. E.g.:
block1 = [rdt1, data]
block2 = [rdt2, data]
if(rdt1 == rdt2) data is correct. You just have to reconstruct the item when reading.

Another option is to add a the rdt at the beginning and end, and force the end to be written after the beginning. E.g.: for an item that spawns N blocks
write blocks 1-(N-1) [rdt, data...............]
wait for the writes to be done
write block N [...end of data, rdt]
Then it is easy to check if the whole item has been written by comparing the rdt in block 1 and N.
No need for a GC. We could use a similar technique as what TCMalloc uses. E.g.: count the number of free spots in a 2MB chunk of the file. When the chunk has no item, it can be given back to the filesystem (there is a syscall to free a portion of a file). We can maintain the count in memory (1 int per 2MB is small). So a simple refcount is enough, no need for a GC. Obviously, in the worst case, it wastes a lot of space, but it is extremely unlikely (it works for memory allocators, so no reason that it wouldn't work here).
Current values are chosen very randomly. But you can choose the values to minimize wasted space. The kernel does that for its slab allocator (values are chosen to match sizes frequently used by the kernel).

from kvell.

Beyer-Yan commented on July 2, 2024

Thanks for your reply,

(1.2) For the finding that “100GB at 15GB/s = fast recovery, so there is no point in trying to have an up to date version of the index disk, it is way more costly than scanning the store“. I notice that the max bandwidth is 1.6GB/s for Amazon-8NVMe config, in which kvell uses it to test the recovery performance. In the evaluation(section 6.6), kvell takes 6.6s to scan the whole database, about 100GB size. That is, the bandwidth is about 15Gb/s. That may be a little confusing, since it needs to take at least 62.5s to load the whole database into memory with max disk bandwidth. One possible explanation is that the evaluation does not write the whole 100GB database into the disk when you simulate a crash by killing the database in the middle of a YCSB A workload. However, the evaluation in the paper does not tell us the actual data size the simulation has written into disk when the evaluation starts recovering from crash for all three test cases. And also, the evaluation may not guarantee a recovering case from the same size of (already written) database for all tree tests. So May I know the actual recovering time for the whole 100GB database for kvell?
For atomicity, a data writing shall be successful, that is the old data item is replace by new data, or failed, that is the old data keeps intact. In fact, both options you have mentioned may not guarantee such principle. eg.:
block1 = [rdt1, data]
block2 = [rdt2, data]
if(rdt1 == rdt2) data is correct. You just have to reconstruct the item when reading.
A counterexample is that a data writing for an item with size of 2 blocks may successfully flushes the new data into block1 but fails to flushes it into block2 because of a crash or a hardware fault. that is, the old data item is damaged. Under such case, commit-log may be essential.
That is a good idea to count the number of free spots in a 2MB chunk of the file. But as far as I am concerned, removing a portion of a file is quite hard, which looks like costlier than compaction. Since if one chooses to free a portion of a file, he has to load the whole file into memory and delete the portion, then dumps the new data into a new file. Even though, the sparse-file feature is supported in most of file systems, it is quite limited and cannot be used to free a portion of a file.

from kvell.

BLepers commented on July 2, 2024

The BW is ~2GB/s per disk so ~16GB/s in total. So it takes 6.25s to scan 100GB.
For updates > 4KB you don't write in place. During recovery, you might have the 2 items on disk but you can use the rdt to only keep the newest version. Doing the updates not in place is basically free thanks to the in-memory freelist (we have a new version of KVell that handles that properly, we might release it at some point).
FALLOC_FL_PUNCH_HOLE ;)

from kvell.

Beyer-Yan commented on July 2, 2024

That is great. Look forward to the new version of KVell.
Thanks for your answers. :)

from kvell.

some questions about the implementation about kvell HOT 4 OPEN

Comments (4)

Related Issues (11)

Recommend Projects

React

Vue.js

Typescript

TensorFlow

Django

Laravel

D3

Recommend Topics

javascript

web

server

Machine learning

Visualization

Game

Recommend Org

Facebook

Microsoft

Google

Alibaba

D3

Tencent