Coder Social home page Coder Social logo

diode23 / bef Goto Github PK

View Code? Open in Web Editor NEW

This project forked from gbletr42/bef

0.0 0.0 0.0 454 KB

Block Erasure Format - An extensible, fast, and usable file utility to encode and decode interleaved erasure coded streams of data.

License: GNU General Public License v3.0

Shell 5.77% C 89.92% Makefile 1.38% M4 2.92%

bef's Introduction

WARNING

This software package has not been extensively battle tested in the real world. While your data is probably safe, there may be data eating bugs hiding

Description

Block Erasure Format is a file utility and file format designed to fix the pain points I've personally had with existing utilities. It has a nice and easy to use interface, at least according to me, it is simple with minimal overhead, and it is very fast. It is also designed to be modular and extensible, with modular hash and parity library backends. The file format is fully streamable, meaning it does not need to have a seekable file to work, so you can just pipe data right in from say tar. It is finally a very small piece of software, only around 1.5 klocs, so it should be readily auditable and forkable.

Format

The format is designed to be simple, although it was quite a bit more complicated earlier in the design process. It is based on the concept of a 'block' of data, which is then split into data and parity fragments by the parity backend. Then these fragments are hashed and interleaved with fragments from n other blocks. A simple diagram is below. M is the last fragment number, which also are the parity fragments.

[B1-F1] -> [B2-F1] -> [B3-F1] -> [B1-F2] -> [...] -> [B1-FM] -> [B2-FM] -> [B3-FM]

This format is pretty similar to the one used in CDs, but unlike that standard bef is fully variable in how it can follow this format. The number of parity and data fragments, the number of blocks to interleave, the block size, the hash, and the specific parity library providing the erasure codes are all customizable and stored in a header right before the data.

Currently, that is all the information stored in the header, making it only 20 bytes large. However, we want to be able to extend the format in the future and ensure we are getting a good header. So the header has additional operation flags and padding to make it 64 bytes, is duplicated in case it corrupts, and a hash is available to check its integrity. In the worst case, we can omit the header entirely if we already know all information.

I believe this format is, or at least can be with the right settings, highly resilient to burst corruption. Under default settings, it can handle in the worst case one burst of slightly larger than 8KiB per 192KiB. It is however not resilient to random noise, and will corrupt beyond repair in such environments. Luckily environments with such ambient noise in computing are rare, outside of telecommunications.

What will it build on?

I have built and tested it against x86-64 and x86, on Debian Bookworm and Alpine Linux 3.19, and the results are that it seems to work on both architectures!

Only Linux is supported for now, it is not cross platform.

Dependencies

Mandatory dependencies to build this package are

There are some additional optional dependencies as well

  • BLAKE3 for BLAKE3
  • OpenSSL for SHA1, SHA256, SHA3, BLAKE2S, and MD5
  • Zlib for CRC32
  • liberasurecode for Jerasure, ISA-L, and its own native implementation of Reed solomon codes, support.

Most of these are provided by distributions, except for liberasurecode and BLAKE3's C interface.

Benchmarks

These benchmarks are done on a Dell Latitude 7490, i5-7300U, 16GB of RAM, . The test file is 1GiB of random data recently read with cat right before the benchmark, and also in /tmp/. The software packages being compared are mine's truly, zfec, and par2cmdline (you'll see why its last ;) )

Run bef zfec par2
encode time (SSD) 2.17s 17.23s 41.88s
decode time (SSD) 1.06s 1.72s 0s (doesn't touch original data)
decode time + corruption (SSD) 4.39s 1.79s 21.62s
encode time (tmpfs) 2.05s 11.64s 42.90s
decode time (tmpfs) 1.12s 1.46s 0s (doesn't touch original data)
decode time + corruption (tmpfs) 1.11s 1.44s 19.92s

As one can see, bef is significantly faster than each option except zfec when it comes to decoding a corrupted fragment or two on disk. Par2 is very very slow, and that's very much one of the main reasons I made this software.

Future Support/Compatibility

With v0.1, the binary format is now frozen in place and will NEVER change. It can still be extended via use of flags and padding, but it will never be unable to be read by future versions of bef. Thus I guarantee full backward and partial forward compatibility, with the caveat that, since I am not an oracle, the forward compatibility is limited to the subset of features available at that given version, and thus incompatible with newer features extended to the binary format.

However, the CLI is not frozen in place, but I will try my hardest to never modify, and the internal library API/ABI has no guarantees of any compatibility with any other version.

bef's People

Contributors

gbletr42 avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.