Coder Social home page Coder Social logo

Comments (4)

BLepers avatar BLepers commented on July 2, 2024

Thanks, good questions :)

  1. Yes, indeed, the code on github doesn't fully reflect that part of the paper. Two points here:

    1.1 We did some experiments with the B Tree using mmap to allocate memory for cells instead of the standard malloc/new. The memory is never allocated from our user page cache, we used a regular mmap (so allocated from the kernel directly). To do that, we just modified the memory allocator used by the B tree, but the code was too messy, so I didn't include it. (Performance is anyway very bad when the index doesn't fit in memory.)

    1.2. Even when the B Tree is mmaped and not malloced, we still have to scan the whole store to rebuild the index in case of a crash. The mmap is just there to tell the kernel that it is OK to swap out some pages, but we have no guarantee that the index data on disk is up to date. So, in all cases, we scan the store for recovery. The numbers reported in the paper correspond to the time it takes to scan the full store (100GB at 15GB/s = fast recovery, so there is no point in trying to have an up to date version of the index disk, it is way more costly than scanning the store :) ).

  2. If you use a header for each 4K block, you can guarantee atomicity using timestamps. E.g.:
    block1 = [rdt1, data]
    block2 = [rdt2, data]
    if(rdt1 == rdt2) data is correct. You just have to reconstruct the item when reading.

    Another option is to add a the rdt at the beginning and end, and force the end to be written after the beginning. E.g.: for an item that spawns N blocks
    write blocks 1-(N-1) [rdt, data...............]
    wait for the writes to be done
    write block N [...end of data, rdt]
    Then it is easy to check if the whole item has been written by comparing the rdt in block 1 and N.

  3. No need for a GC. We could use a similar technique as what TCMalloc uses. E.g.: count the number of free spots in a 2MB chunk of the file. When the chunk has no item, it can be given back to the filesystem (there is a syscall to free a portion of a file). We can maintain the count in memory (1 int per 2MB is small). So a simple refcount is enough, no need for a GC. Obviously, in the worst case, it wastes a lot of space, but it is extremely unlikely (it works for memory allocators, so no reason that it wouldn't work here).

  4. Current values are chosen very randomly. But you can choose the values to minimize wasted space. The kernel does that for its slab allocator (values are chosen to match sizes frequently used by the kernel).

from kvell.

Beyer-Yan avatar Beyer-Yan commented on July 2, 2024

Thanks for your reply,

  1. (1.2) For the finding that โ€œ100GB at 15GB/s = fast recovery, so there is no point in trying to have an up to date version of the index disk, it is way more costly than scanning the storeโ€œ. I notice that the max bandwidth is 1.6GB/s for Amazon-8NVMe config, in which kvell uses it to test the recovery performance. In the evaluation(section 6.6), kvell takes 6.6s to scan the whole database, about 100GB size. That is, the bandwidth is about 15Gb/s. That may be a little confusing, since it needs to take at least 62.5s to load the whole database into memory with max disk bandwidth. One possible explanation is that the evaluation does not write the whole 100GB database into the disk when you simulate a crash by killing the database in the middle of a YCSB A workload. However, the evaluation in the paper does not tell us the actual data size the simulation has written into disk when the evaluation starts recovering from crash for all three test cases. And also, the evaluation may not guarantee a recovering case from the same size of (already written) database for all tree tests. So May I know the actual recovering time for the whole 100GB database for kvell?

  2. For atomicity, a data writing shall be successful, that is the old data item is replace by new data, or failed, that is the old data keeps intact. In fact, both options you have mentioned may not guarantee such principle. eg.:
    block1 = [rdt1, data]
    block2 = [rdt2, data]
    if(rdt1 == rdt2) data is correct. You just have to reconstruct the item when reading.
    A counterexample is that a data writing for an item with size of 2 blocks may successfully flushes the new data into block1 but fails to flushes it into block2 because of a crash or a hardware fault. that is, the old data item is damaged. Under such case, commit-log may be essential.

  3. That is a good idea to count the number of free spots in a 2MB chunk of the file. But as far as I am concerned, removing a portion of a file is quite hard, which looks like costlier than compaction. Since if one chooses to free a portion of a file, he has to load the whole file into memory and delete the portion, then dumps the new data into a new file. Even though, the sparse-file feature is supported in most of file systems, it is quite limited and cannot be used to free a portion of a file.

from kvell.

BLepers avatar BLepers commented on July 2, 2024
  1. The BW is ~2GB/s per disk so ~16GB/s in total. So it takes 6.25s to scan 100GB.

  2. For updates > 4KB you don't write in place. During recovery, you might have the 2 items on disk but you can use the rdt to only keep the newest version. Doing the updates not in place is basically free thanks to the in-memory freelist (we have a new version of KVell that handles that properly, we might release it at some point).

  3. FALLOC_FL_PUNCH_HOLE ;)

from kvell.

Beyer-Yan avatar Beyer-Yan commented on July 2, 2024

That is great. Look forward to the new version of KVell.
Thanks for your answers. :)

from kvell.

Related Issues (11)

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.