This software package has not been extensively battle tested in the real world. While your data is probably safe, there may be data eating bugs hiding
Block Erasure Format is a file utility and file format designed to fix the pain points I've personally had with existing utilities. It has a nice and easy to use interface, at least according to me, it is simple with minimal overhead, and it is very fast. It is also designed to be modular and extensible, with modular hash and parity library backends. The file format is fully streamable, meaning it does not need to have a seekable file to work, so you can just pipe data right in from say tar. It is finally a very small piece of software, only around 1.5 klocs, so it should be readily auditable and forkable.
The format is designed to be simple, although it was quite a bit more complicated earlier in the design process. It is based on the concept of a 'block' of data, which is then split into data and parity fragments by the parity backend. Then these fragments are hashed and interleaved with fragments from n other blocks. A simple diagram is below. M is the last fragment number, which also are the parity fragments.
[B1-F1] -> [B2-F1] -> [B3-F1] -> [B1-F2] -> [...] -> [B1-FM] -> [B2-FM] -> [B3-FM]
This format is pretty similar to the one used in CDs, but unlike that standard bef is fully variable in how it can follow this format. The number of parity and data fragments, the number of blocks to interleave, the block size, the hash, and the specific parity library providing the erasure codes are all customizable and stored in a header right before the data.
Currently, that is all the information stored in the header, making it only 20 bytes large. However, we want to be able to extend the format in the future and ensure we are getting a good header. So the header has additional operation flags and padding to make it 64 bytes, is duplicated in case it corrupts, and a hash is available to check its integrity. In the worst case, we can omit the header entirely if we already know all information.
I believe this format is, or at least can be with the right settings, highly resilient to burst corruption. Under default settings, it can handle in the worst case one burst of slightly larger than 8KiB per 192KiB. It is however not resilient to random noise, and will corrupt beyond repair in such environments. Luckily environments with such ambient noise in computing are rare, outside of telecommunications.
I have built and tested it against x86-64 and x86, on Debian Bookworm and Alpine Linux 3.19, and the results are that it seems to work on both architectures!
Only Linux is supported for now, it is not cross platform.
Mandatory dependencies to build this package are
There are some additional optional dependencies as well
- BLAKE3 for BLAKE3
- OpenSSL for SHA1, SHA256, SHA3, BLAKE2S, and MD5
- Zlib for CRC32
- liberasurecode for Jerasure, ISA-L, and its own native implementation of Reed solomon codes, support.
Most of these are provided by distributions, except for liberasurecode and BLAKE3's C interface.
These benchmarks are done on a Dell Latitude 7490, i5-7300U, 16GB of RAM, . The test file is 1GiB of random data recently read with cat right before the benchmark, and also in /tmp/. The software packages being compared are mine's truly, zfec, and par2cmdline (you'll see why its last ;) )
Run | bef | zfec | par2 |
---|---|---|---|
encode time (SSD) | 2.17s | 17.23s | 41.88s |
decode time (SSD) | 1.06s | 1.72s | 0s (doesn't touch original data) |
decode time + corruption (SSD) | 4.39s | 1.79s | 21.62s |
encode time (tmpfs) | 2.05s | 11.64s | 42.90s |
decode time (tmpfs) | 1.12s | 1.46s | 0s (doesn't touch original data) |
decode time + corruption (tmpfs) | 1.11s | 1.44s | 19.92s |
As one can see, bef is significantly faster than each option except zfec when it comes to decoding a corrupted fragment or two on disk. Par2 is very very slow, and that's very much one of the main reasons I made this software.
With v0.1, the binary format is now frozen in place and will NEVER change. It can still be extended via use of flags and padding, but it will never be unable to be read by future versions of bef. Thus I guarantee full backward and partial forward compatibility, with the caveat that, since I am not an oracle, the forward compatibility is limited to the subset of features available at that given version, and thus incompatible with newer features extended to the binary format.
However, the CLI is not frozen in place, but I will try my hardest to never modify, and the internal library API/ABI has no guarantees of any compatibility with any other version.