Coder Social home page Coder Social logo

raft-engine's Introduction

raft-engine

Rust

A WAL-is-data engine that used to store multi-raft log

background

Currently, we use an individual RocksDB instance to store raft entries in TiKV, there are some problems to use RocksDB to store raft entries. First entries are appended to WAL and then flushed into SST, which are duplicated operation in this situation. On the other hand, writing amplification hurts performance when compaction. Writing is also needed when dropping applied entries. Disk read is also needed when compaction.

RaftEngine is aimed at replacing the RocksDB instance to store raft entries. Here I will explain how RaftEngine works and what we benefit.

thread safe

First of all, all operations on RaftEngine are thread safe.

WAL is data

pipe_log

RaftEngine contains two major components, pipe_log and memtable. pipe_log is consists of a series of files, which only supports appending write and all regions share the same pipe_log. pipe_log is where data persist.

memtable

Each region has an individual memtable. A memtable is consists of three members: entries, kvs, and file_index.

  • entries: a VecDeque, which stores active entries or the position in pipe_log of these active entries for this region. Operation on VecDeque is fast.
  • kvs: a HashMap, stores key-value pairs for this region.
  • file_index: another VecDeque, store which files contain entries for this region, it is used to GC expired files.

LogBatch

LogBatch is similar to rocksdb’s WriteBatch, giving write an atomic guarantee. It supports adding entries, putting key-value pairs, deleting keys, and the command to clean a region.

compact entries

compact_to(region_id, to_indx) only drops entries in memtable, and update memtable’s file_index. This operation is only in memory and is very fast.

recovery

When the engine restarts we read files in pipe_log one by one, and apply each LogBatch to memtable for each region. After that, we get the first index info from kv engine for each region and use this first index to compact memtable. Currently, we have two RocksDB instances, and each one has its own WAL. When kv engine’s WAL lost some data after the machine's crash down, the applied index which persisted in kv engine maybe fallback, but raftdb’s WAL says these entries have been dropped, which will cause panic. RaftEngine rarely meets such a situation because RaftEngine will keep the latest several files. As I have tested, recovering from 1GB pipe_log data takes 4.6 seconds, which is acceptable. After using crc32c, recover from 1.37GB log files takes 2.3 seconds.

write process

Add entries or key-value pairs into LogBatch, LogBatch will be encoded to a bytes slice, and append to the currently active file. The previous writing will return the file number which serves the writing, and then apply this LogBatch to associated memtables.

GC expired files

We use memtable’s file_index to get which file is the oldest file that contains active entries for each region, and files before the oldest file of all regions who can be dropped. RaftEngine has a total-size-limit parameter, which is default 1GB. And when the total size of files reaches this value, we will check whose entries left behind so long. If a region doesn’t have new entries for a long time, and the number of entries doesn't reach gc_raftlog_threshold, we need to rewrite these entries to the currently active file, and the old files can be dropped as soon as possible. Since the amount of entries is small, rewriting is trivial. If a region’s follower lefts behind a lot, causing entries in this region can’t be compacted. We hint raftstore to compact this region’s entries, since we wait for the slow follower so long, “total-size-limt’s” time.

Test result in TiKV

Reduce above 60% raft associated IO. Reduce disk load and disk IO utilization by 50% at lease. Increase data importing speed around 10%-15%.

raft-engine's People

Contributors

hicqu avatar tabokie avatar zhangjinpeng87 avatar mrcroxx avatar fullstop000 avatar busyjay avatar little-wallace avatar yuqi1129 avatar koushiro avatar renkai avatar longfangsong avatar

Watchers

James Cloos avatar  avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.