
kvell's Introduction

Baptiste Lepers

I am a Maître Assistant at the Université de Neuchâtel. My current focus is on proving and improving the performance of concurrent systems. I work on finding bugs in Linux, optimizing the performance of storage, memory, graph engines, and schedulers.

PhD openings: https://sites.google.com/view/usydphd/home

Google Scholar profile: https://scholar.google.com/citations?user=6dsH-1oAAAAJ&hl=en&oi=ao

Contact: [email protected]

Previously, I was a university lecturer at the University of Sydney, a postdoc at EPFL working with Willy Zwaenepoel and a postdoc at SFU working with Alexandra Fedorova. I completed my PhD at the Université de Grenoble under the supervision of Vivien Quéma.

Projects

  • OFence: Pairing Barriers to Find Concurrency Bugs in the Linux Kernel. Bug finding in the Linux kernel using static analysis.
  • KVell: Fast and efficient Key-Value Store for NVMe SSDs.
  • IPanema: proving properties of concurrent schedulers using formal methods.

kvell's People

Contributors

blepers


kvell's Issues

question about kvell latency

Hi,

I just tested KVell on two Optane P4800X 375 GB SSDs with YCSB-A, but found that the latency was very high. Do you think this result is reasonable? (Section 6.5.1 of the paper reports that KVell achieves an average latency of 158us.)

I ran the YCSB-A workload with 20,000,000 requests on a database of 100,000,000 elements.

Here are the YCSB-A results:
[Screenshot: YCSB-A results, 2020-05-04]
I modified the print_stats function in stats.c, adding three statistical outputs:

1. median: cycles_to_us(stats.timing_value[last*50/100])
2. p99: cycles_to_us(stats.timing_value[last*99/100])
3. p999: cycles_to_us(stats.timing_value[p999]), with p999 = (int)(last*99.9/100)

So when the YCSB-A test finishes, I can get the throughput, average latency, median, p90, p99, and p999 automatically.
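For reference, here is a minimal sketch of how such percentile output could be computed, assuming (as in the snippet above) that stats.timing_value is an array of per-request timings in cycles sorted in ascending order, that last is the number of recorded samples, and that cycles_to_us() is available in stats.c; the names mirror the snippet and may not match the actual stats.c exactly:

```c
#include <stdio.h>
#include <stdint.h>
#include <stddef.h>

/* Sketch only, intended for stats.c, which already defines cycles_to_us()
 * and keeps per-request timings sorted ascending in stats.timing_value[],
 * with `last` being the number of recorded samples. */
static uint64_t percentile_us(uint64_t *timing_value, size_t last, double pct) {
   size_t idx = (size_t)((double)last * pct / 100.0);
   if (idx >= last)
      idx = last - 1;   /* clamp rounding at the very top of the distribution */
   return cycles_to_us(timing_value[idx]);
}

/* Called from print_stats() once the timing array has been sorted. */
static void print_percentiles(uint64_t *timing_value, size_t last) {
   printf("median %lu us, p99 %lu us, p99.9 %lu us\n",
          (unsigned long)percentile_us(timing_value, last, 50.0),
          (unsigned long)percentile_us(timing_value, last, 99.0),
          (unsigned long)percentile_us(timing_value, last, 99.9));
}
```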

Here is the configuration for the KVell runtime (I directly use the default settings in your code):

  1. Configuration for YCSB-A uniform on two Optane disks
    1.1 Page cache size: 30 GB
    1.2 Workers: 8 working on 2 disks
    1.3 IO configuration: 64 queue depth (capped: yes, extra waiting: no)
    1.4 Queue configuration: 256 maximum pending callbacks per worker
    1.5 Datastructures: 3 (memory index) 3 (pagecache)
    1.6 Thread pinning: yes
    1.7 Bench: YCSB (100000000 elements)

  2. Configuration for YCSB-A Zipfian on two Optane disks
    2.1 Page cache size: 30 GB
    2.2 Workers: 8 working on 2 disks
    2.3 IO configuration: 64 queue depth (capped: yes, extra waiting: no)
    2.4 Queue configuration: 256 maximum pending callbacks per worker
    2.5 Datastructures: 3 (memory index) 3 (pagecache)
    2.6 Thread pinning: yes
    2.7 Bench: YCSB (100000000 elements)

Looking forward to your reply.
Best regards

some questions about the implementation

Hi,

It's really excellent work. I am also working on a new KV store based on SPDK with no-commit-log, fully asynchronous, no-GC, and share-nothing features. But when reading the paper and all the source code you have pushed, I found a few things puzzling.

  1. In section 5.3, about the user page cache, the paper says that "but B tree data is allocated from an mmap-ed file and can be paged out by the kernel", yet in the actual implementation the memory is simply allocated with malloc. So I am puzzled about whether the in-memory B-tree index can really be reconstructed as efficiently as the evaluation shows, because every time the system restarts, KVell has to scan all the data on disk to rebuild the B-tree index. (See the mmap sketch after this list.)

  2. As you said in section 5.2, "Items larger than 4K have a timestamp header at the beginning of each block on disk", but the implementation actually only handles items of at most 4K. In fact, when an item is larger than 4K, atomic writing cannot be guaranteed, because modern NVMe devices only provide block-granularity atomic write semantics. In such a case, a commit log seems essential to guarantee atomicity and durability. So can a no-commit-log KV store be implemented when items are larger than 4K?

  3. No-GC is an important design feature of KVell, but judging from the implementation, when many items of the same size are put into KVell, the corresponding slab file grows. In that case, when items are deleted, the space in that slab is not shrunk. So I think GC is still needed to reclaim the deleted space.

  4. Each slab file stores fixed-size KV items; that is, when an item is smaller than the slab slot size, KVell has to pad it. So I am really wondering what the best slot-size setting is. The values given by KVell are "size_t slab_sizes[] = { 100, 128, 256, 400, 512, 1024, 1365, 2048, 4096 }". Is there any deep reason behind this design? (A small lookup sketch follows below.)
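On point 4, one plausible (unconfirmed) reading of those particular values is that a whole number of slots fits in a 4 KB page (40x100, 32x128, 16x256, 10x400, 8x512, 4x1024, 3x1365, 2x2048, 1x4096), which keeps padding waste low; only the author can confirm that. Below is a minimal sketch of how a store could pick a slab for an item; slab_sizes[] is taken verbatim from the question, while slab_for_item() is an illustrative name, not a function from the KVell sources:

```c
#include <stddef.h>

/* Slot sizes quoted in the question (from the KVell sources). */
static const size_t slab_sizes[] = { 100, 128, 256, 400, 512, 1024, 1365, 2048, 4096 };
#define NB_SLABS (sizeof(slab_sizes) / sizeof(slab_sizes[0]))

/* Illustrative sketch: pick the smallest slab whose slot fits the item;
 * the difference between the slot size and the item size is padding. */
static int slab_for_item(size_t item_size) {
   for (size_t i = 0; i < NB_SLABS; i++) {
      if (item_size <= slab_sizes[i])
         return (int)i;
   }
   return -1; /* larger than 4096 bytes: would need a large-item path */
}
```

On point 1, the paper's phrase "allocated from an mmap-ed file" describes memory whose cold pages the kernel may write back to the backing file and reclaim under memory pressure, unlike anonymous malloc'd memory. Here is a minimal illustration of that mechanism, with a hypothetical index file path; this is only a sketch of what the paper describes, not the code path in the repository:

```c
#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>

/* Illustration only: back an index arena with a file so the kernel can page
 * cold index pages out to idx_path instead of keeping them all in RAM. */
static void *map_index_arena(const char *idx_path, size_t size) {
   int fd = open(idx_path, O_RDWR | O_CREAT, 0644);
   if (fd < 0)
      return NULL;
   if (ftruncate(fd, (off_t)size) < 0) { close(fd); return NULL; }
   void *arena = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
   close(fd);                       /* the mapping remains valid after close */
   return arena == MAP_FAILED ? NULL : arena;
}
```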

Looking forward to your kind reply.

Some questions about evaluation

Hi,

KVell is excellent work and I'm very interested in it. But when I read the paper and tested the code, I found some problems. Here are my questions.

  1. In section 6.5.3, your results show a significant performance degradation (24M/280K) when the indexes do not fit in RAM. How did you test KVell's performance with different RAM sizes? The paper mentions that the B-tree data is allocated from an mmap-ed file. Did you use an mmap-ed file as the paper says, or just let the indexes be flushed to swap? Is the mmap-ed file or the swap configured on an NVMe SSD? Since the performance degradation is so pronounced, I'd like to understand the reason.

  2. When I tested the code, I found that every time I tested, I needed to delete the data and reinitialize it, otherwise I would get an error like this:
    [screenshot of the error]
    Is this a bug in the code? How did you measure the recovery time?

  3. Figure 5 shows that RocksDB reaches 800 KOps/s under the YCSB-C workload with a Zipfian key distribution. When I tested RocksDB with YCSB, it did not seem to achieve such high throughput, so I was wondering how you configured RocksDB and compared it to KVell.

Looking forward to your reply. Thanks~

strange behaviors in kvell

Hi,

  1. According to scripts/run-aws.sh, YCSB is run many times. The first time, KVell generates a database (e.g., 100 GB) and then runs the YCSB workload. On subsequent runs, KVell can reuse the database from the previous run and recover it. However, I found that sometimes, after recovering the database, it would stop suddenly, which is very confusing. Do you know why this happens?

[Screenshot: the sudden stop after recovery, 2020-09-03]

  2. During my tests, using 2 disks, 4 workers per disk, and a queue depth of 1, I found that the latency and throughput do not match. For example, for YCSB uniform, the latency is 116us and the throughput is 409,838 req/s. Theoretically, the ideal throughput should be (1/116us) * (2 disks * 4 workers) * 10^6 = 68,965 req/s, which is much smaller than 409,838. Can you explain this phenomenon?

Looking forward to your reply.
Best regards

some questions about paper

Hi, how did you measure the bandwidth and CPU utilization of RocksDB in your experiments? One more question: how is the time spent in each step of RocksDB's merge process calculated? Thanks~

Some questions about the transaction support

Hi,

KVell is excellent work and I'm very interested in it. I'm particularly curious about the transaction support: how does KVell provide atomicity and isolation? Are the "share nothing" and "partition" designs still efficient when supporting transactions?

  1. If there are multiple updates in one transaction, they can be mapped to different workers and written to different devices. How does KVell guarantee that either none or all of them are persisted to the devices? If the system crashes during transaction processing and only part of the updates have been persisted, how is this handled during recovery?

  2. If there are two concurrent transactions, how is it guaranteed that Txn A cannot see Txn B's uncommitted updates? Which isolation level is supported (read uncommitted, read committed, or repeatable read)?

  3. In the paper, I found that YCSB-A (50% update and 50% read) always has roughly the same performance as YCSB-F (50% update and 50% read-modify-write). Why do they always perform similarly, and how does KVell implement read-modify-write?

Thanks.

About the YCSB+T benchmark

Hi, @BLepers .

I have a question about the implementation of YCSB+T.

Where did you get the latest YCSB+T project?

The only place I can find to fork it is on Akon Dey's github.
https://github.com/akon-dey/YCSB

Could you help me to implement the ycsb+t benchmark correctly?

In his article he describes the following methods:

"
• doTransactionInsert() creates a new account with an
initial balance captured from doTransactionDelete() operation described below.

• doTransactionRead() reads a set of account balances
determined by the key generator.

• doTransactionScan() scans the database given the start
key and the number of records and fetches them from the
data base.

• doTransactionUpdate() reads a record and add $1 from
the balance captured from delete operations to it and write
it back.

• doTransactionDelete() reads an account record, add the
amount to the captured the balance (capture used in
doTransactionInsert()) and then deletes the record.

• doTransactionReadModifyWrite() reads two records,
subtracts $1 from the one of the two and adds $1 to
the other before writing them both back.
"

In the akon-dey repository that I forked, I didn't find implementations of the methods doTransactionInsert(), doTransactionRead(), doTransactionScan(), doTransactionUpdate(), and doTransactionDelete().

I just noticed that the doTransactionReadModifyWrite() method is implemented, where it subtracts the value 1 from account A and assigns that value to account B.

Could you help me understand this part of the implementation?

Regards,
Caio

questions about remote KVell

Hi,
I've just gone through the paper and realized it's great work. I was thinking of implementing a distributed KVell. Since the current KVell runs as an embedded key-value store (i.e., the client and the KVell server are on the same machine), I was curious what the network overhead would be if I moved the client to a different machine. To do so, how should I split the project into a client and a server?

down scaled Kvell

Hi,

Since KVell is run on an i3.metal instance, I was wondering how the KVell configuration code could be changed so that it can run on an i3.large instance, because I was thinking of running a remote KVell client.

The problem now is that the AWS console does not let me have two i3.metal instances. KVell seems to work on i3.metal, but I am not able to run the code on an i3.large instance. So, how can I alter the code?

Thanks,

There are some things I don't understand in the code interface reading

KVell's performance is much better than RocksDB's, which made me interested in studying this project. When reading the implementation part of the paper, I saw that KVell implements the same interface as an LSM tree: writes Update(k,v), reads Get(k), and range scans Scan(k1,k2); Update(k,v) associates value v with key k and only returns once the value has been persisted to disk. But when I started to read the code, I did not find these functions. Can you tell me where these interfaces (or their aliases) appear in the code? Thank you very much!
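For reference, here is a minimal sketch of what the interface described in the paper could look like as a callback-based C API in a fully asynchronous design. All of the names below are hypothetical illustrations of the paper's description, not the identifiers actually used in the KVell sources (which is exactly what this question asks about):

```c
#include <stddef.h>

/* Hypothetical illustration of the paper's interface (Update/Get/Scan);
 * these declarations do not exist under these names in the KVell sources. */
typedef void (*kv_callback)(void *item, void *user_data);

/* Associates value v with key k; the callback fires only once the value
 * has been persisted to disk, matching the paper's Update(k,v) semantics. */
void kv_update(const void *key, size_t key_size,
               const void *value, size_t value_size,
               kv_callback cb, void *user_data);

/* Reads the value associated with key k (Get(k)). */
void kv_get(const void *key, size_t key_size,
            kv_callback cb, void *user_data);

/* Range scan from k1 to k2 (Scan(k1,k2)); the callback is invoked once per
 * item found in the range. */
void kv_scan(const void *k1, size_t k1_size,
             const void *k2, size_t k2_size,
             kv_callback cb, void *user_data);
```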

./autogen.sh

On some machines, Autoconf may not be installed. So, my suggestion is that, in order not to encounter errors during the first steps of installation, it would be better to add the command "apt install autoconf" to the autogen.sh file.
