ligurio / unreliablefs

A FUSE-based fault injection filesystem.

Home Page: https://ligurio.github.io/unreliablefs/unreliablefs.1.html

License: MIT License

CMake 6.65% Roff 4.36% C 70.18% Python 18.08% Lua 0.73%

fault-injection filesystem fuse-filesystem software-testing software-testing-tools quality-assurance fault-injection-filesystem fuse chaos-engineering chaos-testing

unreliablefs's Introduction

UnreliableFS

UnreliableFS is a FUSE-based fault injection filesystem that allows changing fault injections at runtime using a simple configuration file.

Supported fault injections are:

errinj_errno - return error value and set random errno.
errinj_kill_caller - send SIGKILL to a process that invoked file operation.
errinj_noop - replace file operation with no operation (similar to libeatmydata, but applicable to any file operation).
errinj_slowdown - slow down the invoked file operation.

Building

Prerequisites:

CentOS: dnf install -y gcc cmake fuse fuse-devel
Ubuntu: apt-get install -y gcc cmake fuse libfuse-dev
FreeBSD: pkg install gcc cmake fusefs-libs pkgconf
OpenBSD: pkg_add cmake
macOS: brew install --cask osxfuse

$ cmake -S . -B build -DCMAKE_BUILD_TYPE=Debug
$ cmake --build build --parallel

Using

$ mkdir /tmp/fs
$ unreliablefs /tmp/fs -basedir=/tmp -seed=1618680646
$ cat << EOF > /tmp/fs/unreliablefs.conf
[errinj_noop]
op_regexp = .*
path_regexp = .*
probability = 30
EOF
$ ls -la
$ umount /tmp/fs

Documentation

See documentation in unreliablefs.1 and unreliablefs.conf.5.

License


unreliablefs's Issues

Add missed FUSE operations

https://libfuse.github.io/doxygen/structfuse__operations.html

Fault injection with truncate file

Add printing version

It should be possible to show current version using command line option.

something like bit flip
Data degradation is the gradual corruption of computer data due to an accumulation of non-critical failures in a data storage device. The phenomenon is also known as data decay, data rot or bit rot.

Data degradation results from the gradual decay of storage media over the course of years or longer. Causes vary by medium:

Solid-state media, such as EPROMs, flash memory and other solid-state drives, store data using electrical charges, which can slowly leak away due to imperfect insulation. The chip itself is not affected by this, so reprogramming it approximately once per decade prevents decay. An undamaged copy of the master data is required for the reprogramming.
Magnetic media, such as hard disk drives, floppy disks and magnetic tapes, may experience data decay as bits lose their magnetic orientation. Periodic refreshing by rewriting the data can alleviate this problem. In warm/humid conditions these media, especially those poorly protected against ambient air, are prone to the physical decomposition of the storage medium.[3][4]

see charybdefs 9

https://github.com/scylladb/charybdefs/blob/master/cookbook/recipes.py
https://github.com/scylladb/charybdefs/blob/master/tests/out_of_disk_space_test.py
Empirical Measurements of Disk Failure Rates and Error Rates
https://www.microsoft.com/en-us/research/publication/empirical-measurements-of-disk-failure-rates-and-error-rates/
Bit Rot: How Hard Drives and SSDs Die Over Time
https://www.howtogeek.com/660727/bit-rot-how-hard-drives-and-ssds-die-over-time/

There’s a lot more to it than that, but this provides a basic idea of how the two storage types keep their data. Now let’s look at how they can lose it through bit rot. With hard drives, as mentioned above, saved bits can flip their magnetic polarity. If enough of them flip without being corrected, that can lead to bit rot. Solid-state drives, meanwhile, lose their data when the insulating layer degrades and the charged electrons leak out.

How long it takes to see bit rot in practice depends on a variety of issues. Hard drives have the potential to last with their data intact for decades even if powered down. SSDs, meanwhile, are said to lose their data within a few years in the same state. In fact, there are reports that, if they’re stored in an unusually hot location, the data on an SSD can be wiped out even faster.

The Truth About SSD Data Retention
https://www.anandtech.com/show/9248/the-truth-about-ssd-data-retention
DRAM Errors in the Wild: A Large-Scale Field Study

Report #1 mentions vendors declaring a "Bit Error Rate of 10^-12 for their memory modules" and notes that "a observed error rate is 4 orders of magnitude lower than expected". For memory-related tasks at a rate of 8 GBps, this means a single bit flip may occur every minute (vendors' BER of 10^-12) or once in two days (BER of 10^-16).

https://indico.cern.ch/event/13797/contributions/1362288/attachments/115080/163419/Data_integrity_v3.pdf

According to #2, there can be up to 25000-75000 one-bit FIT per Mbit (failures in time per billion hours), which is equal to 1-5 bit errors per hour for 8GB of RAM according to my napkin. Paper says the same: "mean correctable error rates of 2000–6000 per GB per year".
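The napkin estimate above can be reproduced with a few lines of Python. This is only a sketch of the arithmetic: it assumes binary units (1 GB = 8192 Mbit) and that FIT counts failures per 10^9 hours, as stated in the paragraph. The function name is made up for illustration.

```python
# Napkin check of the FIT figures quoted above.
# FIT = failures per 10**9 hours; the paper quotes one-bit FIT per Mbit.

HOURS_PER_FIT_UNIT = 10**9
MBIT_PER_GB = 8 * 1024  # assumption: binary units, 1 GB = 1024 MB = 8192 Mbit


def errors_per_hour(fit_per_mbit, ram_gb):
    """Expected number of one-bit errors per hour for ram_gb of RAM."""
    mbits = ram_gb * MBIT_PER_GB
    return fit_per_mbit * mbits / HOURS_PER_FIT_UNIT


low = errors_per_hour(25_000, 8)
high = errors_per_hour(75_000, 8)
print(f"{low:.2f} .. {high:.2f} errors per hour for 8 GB")
```

For 8 GB this gives roughly 1.6 to 4.9 errors per hour, which is in line with the "1-5 bit errors per hour" estimate.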

Detection and Correction of Silent Data Corruption for Large-Scale High-Performance Computing
https://www.fiala.me/pubs/papers/sc12-redmpi.pdf

The #3 report says, double bit flips "were deemed unlikely" but at ORNL's Cray XT5 they were observed "at a rate of one per day for 75,000+ DIMMs" even with ECC. And single-bit errors should be higher.

Bitrot and atomic COWs: Inside “next-gen” filesystems
https://web.archive.org/web/20150306225935/http://arstechnica.com/information-technology/2014/01/bitrot-and-atomic-cows-inside-next-gen-filesystems/
https://www.usenix.org/legacy/events/fast07/tech/schroeder/schroeder_html/index.html

In a recently initiated effort, Schwarz et al. [28] have started to gather failure data at the Internet Archive, which they plan to use to study disk failure rates and bit rot rates and how they are affected by different environmental parameters. In their preliminary results, they report ARR values of 2-6% and note that the Internet Archive does not seem to see significant infant mortality. Both observations are in agreement with our findings.

https://stackoverflow.com/questions/24181878/how-to-random-flip-binary-bit-of-char-in-c-c
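A minimal sketch of what a bit-flip injection could do to a buffer before it is written. This is not how unreliablefs implements anything; `flip_random_bit` is a hypothetical helper, and a real implementation would live in the C write path.

```python
import random


def flip_random_bit(data: bytes, rng: random.Random) -> bytes:
    """Return a copy of data with exactly one randomly chosen bit inverted."""
    if not data:
        return data  # nothing to corrupt
    buf = bytearray(data)
    bit = rng.randrange(len(buf) * 8)   # index of the bit within the buffer
    buf[bit // 8] ^= 1 << (bit % 8)     # invert that single bit
    return bytes(buf)


rng = random.Random(1618680646)  # fixed seed keeps the corruption reproducible
original = b"unreliablefs"
corrupted = flip_random_bit(original, rng)
# Count how many bits differ between the two buffers.
differing_bits = sum(bin(a ^ b).count("1") for a, b in zip(original, corrupted))
print(original, corrupted, differing_bits)
```

Seeding the generator mirrors the -seed option shown in the usage example, so a failing run can be replayed.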

Describe how to make realistic fault injections

see Phoebe https://github.com/gluckzhang/phoebe

Fault injection with kill caller

Using unreliablefs via LD_PRELOAD

LD_PRELOAD can be used when FUSE is unavailable on a system
FUSE can be used with static target binaries

It allows using other fault injections such as clear cache.

Log operations to a text file

Sometimes users wonder whether a fault injection happened or not. Logging can help to answer such questions.

Logging real operations in pass-through mode, without fault injections, makes it possible to analyze operation frequency per type and set fault injection probabilities that resemble real usage.

TODO: create a script that analyzes the operations log.
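Such a script might look like the sketch below. The log format is not defined in this issue, so the code assumes a hypothetical one: each line holds a timestamp, an operation name and a path, separated by whitespace.

```python
from collections import Counter


def operation_shares(log_lines):
    """Return {operation: percentage of all logged operations}."""
    counts = Counter()
    for line in log_lines:
        fields = line.split()
        if len(fields) < 3:       # skip blank or malformed lines
            continue
        counts[fields[1]] += 1    # assumption: second field is the operation
    total = sum(counts.values())
    if not total:
        return {}
    return {op: 100.0 * n / total for op, n in counts.items()}


# Hypothetical log excerpt in the assumed "timestamp op path" format.
sample_log = [
    "1618680646.10 open /tmp/fs/a.xlog",
    "1618680646.11 write /tmp/fs/a.xlog",
    "1618680646.12 write /tmp/fs/a.xlog",
    "1618680646.13 fsync /tmp/fs/a.xlog",
]
shares = operation_shares(sample_log)
for op, pct in sorted(shares.items(), key=lambda kv: -kv[1]):
    print(f"{op:8s} {pct:5.1f}%")
```

The resulting percentages could then be used as a starting point for per-operation probabilities in the configuration file.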

Introduce configuration file

Possible formats:

JSON (https://github.com/zserge/jsmn)
INI
- https://github.com/mariusae/trickle/blob/master/conf.c
- https://github.com/benhoyt/inih
XML

The configuration file should consist of a number of sections, each of which describes:

field with type of fault injection:
- errinj_remove (see #35)
- errinj_truncate (see #34)
- errinj_kill_caller: kill the process that runs the operation (see #28)
- errinj_incomplete_write: incomplete writes (see #27)
- errinj_corrupted_write: corrupted writes (see #8)
- errinj_errno: random or fixed errno, plus a regex that describes which errors should be returned (see #6 #9)
- errinj_noop: replace the operation with a no-op for those operations where it is possible (see #18)
- errinj_delayed_op: delay operations (see #29)
- errinj_clear_cache (see #57)
field with probability (0-100%)
field with regexp for filesystem path where fault injection should happen

For example:

[errinj_errno]
path_regexp = .*\.xlog
operation_regexp =
probability = 60

[errinj_noop]
path_regexp = .*
operation_regexp =
probability = 30
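To illustrate how sections like these could drive a decision, here is a small Python sketch that parses an INI text and picks a fault for a given operation and path. It is not the project's parser (the issue lists C libraries for that). Treating an empty regexp as "match everything" and taking the first matching section are assumptions of this sketch.

```python
import configparser
import random
import re

EXAMPLE_CONF = r"""
[errinj_errno]
path_regexp = .*\.xlog
operation_regexp =
probability = 60

[errinj_noop]
path_regexp = .*
operation_regexp =
probability = 30
"""


def load_rules(text):
    """Parse the INI text into a list of rule dicts, in file order."""
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(text)
    rules = []
    for name in parser.sections():
        sec = parser[name]
        rules.append({
            "fault": name,
            # Assumption: an empty or missing regexp matches everything.
            "path": re.compile(sec.get("path_regexp") or ".*"),
            "op": re.compile(sec.get("operation_regexp") or ".*"),
            "probability": sec.getint("probability"),
        })
    return rules


def pick_fault(rules, op, path, rng):
    """Return the first fault whose regexps match and whose dice roll hits."""
    for rule in rules:
        if rule["op"].fullmatch(op) and rule["path"].fullmatch(path):
            if rng.randrange(100) < rule["probability"]:
                return rule["fault"]
    return None


rules = load_rules(EXAMPLE_CONF)
rng = random.Random(1618680646)
print(pick_fault(rules, "write", "/tmp/fs/00001.xlog", rng))
```

A path that does not end in .xlog can only ever trigger errinj_noop with this configuration.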

Add option parsing

see https://github.com/libfuse/libfuse/wiki/Option-Parsing

needed for #12 #7 and #1

test_setxattr() is broken

        os.setxattr(target, attr_name, attr_value)
        assert attr_name.decode("utf-8") in os.listxattr(target)
>       assert os.getxattr(target, attr_name) == attr_value
E       AssertionError: assert b'' == b'unreliablefs'
E         Right contains 12 more items, first extra item: 117
E         Full diff:
E         - b''
E         + b'unreliablefs'

tests/test_unreliablefs.py:497: AssertionError

Support injections used in sqlite testing

https://www.sqlite.org/src/doc/trunk/src/test_vfs.c

  TestFaultInject ioerr_err;
  TestFaultInject full_err;
  TestFaultInject cantopen_err;

...

  if( (p->mask&TESTVFS_OPEN_MASK) &&  tvfsInjectIoerr(p) ) return SQLITE_IOERR;
  if( tvfsInjectCantopenerr(p) ) return SQLITE_CANTOPEN;
  if( tvfsInjectFullerr(p) ) return SQLITE_FULL;

Build DEB and RPM packages

https://github.com/rflament/loggedfs/blob/master/loggedfs.spec

or use CPack to build both packages

Is disk latency really getting injected?

charybdefs 18

Fault injection with removing file

Simulate post fsync failure like cuttlefs

FUSE file system with private page cache to simulate post fsync failure characteristics of modern file systems
https://github.com/WiscADSL/cuttlefs

Fault injection with reading from any file returns EOF

Applicable to read(2) only (?): https://linux.die.net/man/2/read

test_create is broken on FreeBSD

def test_create(setup_unreliablefs):
        mnt_dir, src_dir = setup_unreliablefs
        name = name_generator()
        fullname = pjoin(mnt_dir, name)
        with pytest.raises(OSError) as exc_info:
            os.stat(fullname)
        assert exc_info.value.errno == errno.ENOENT
        assert name not in os.listdir(mnt_dir)
        fd = os.open(fullname, os.O_CREAT | os.O_RDWR)
        os.close(fd)
>       assert name in os.listdir(mnt_dir)
E       AssertionError: assert 'testfile_6' in []
E        +  where [] = <built-in function listdir>('/tmp/pytest-of-root/pytest-0/test_create0/mnt')
E        +    where <built-in function listdir> = os.listdir
tests/test_unreliablefs.py:130: AssertionError

https://cirrus-ci.com/task/5170998354903040?command=test#L47

Update macOS images

mojave is gone on cirrus ci
bigsur is released

Add xfstests to regression testing

xfstests is a rich filesystem testsuite
it is worth testing unreliablefs with it at least once, or even adding it to regression testing on CI
https://git.kernel.org/pub/scm/fs/xfs/xfstests-dev.git/

Fault injection with clear cache

posix_fadvise(fd, 0, 0, POSIX_FADV_DONTNEED)
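The same call is available from Python as os.posix_fadvise (Unix only), which makes it easy to try out. The sketch below only shows the call itself, not an unreliablefs feature; `drop_page_cache` is a hypothetical helper name. The advice is a hint, so the kernel is free to keep pages cached.

```python
import os
import tempfile


def drop_page_cache(fd):
    """Ask the kernel to drop cached pages for the whole file behind fd."""
    os.fsync(fd)  # dirty pages cannot be dropped, so flush them first
    # offset=0, len=0 means "from the start to the end of the file"
    os.posix_fadvise(fd, 0, 0, os.POSIX_FADV_DONTNEED)


with tempfile.NamedTemporaryFile() as tmp:
    tmp.write(b"unreliablefs" * 1024)
    tmp.flush()
    drop_page_cache(tmp.fileno())
    tmp.seek(0)
    data = tmp.read()  # data is still intact, it is just re-read from storage

print(len(data))
```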

Regression test with next incorrect symlink handling

charybdefs 11 and 10

Bump minimal CMake version in FindFUSE.cmake

on macOS Catalina:

-- Detecting CXX compile features - done
CMake Deprecation Warning at cmake/FindFUSE.cmake:41 (cmake_minimum_required):
  Compatibility with CMake < 2.8.12 will be removed from a future version of
  CMake.

  Update the VERSION argument <min> value or use a ...<max> suffix to tell
  CMake that the project does not need compatibility with older versions.
Call Stack (most recent call first):
  CMakeLists.txt:13 (find_package)

https://github.com/ligurio/unreliablefs/runs/1756590354

Update a README

It would be nice to have detailed descriptions of the performance that can be expected from FUSE and of how to simulate real errors using unreliablefs.

Performance

TODO: FUSE_PASSTHROUGH, see https://lkml.org/lkml/2020/8/12/547
TODO: https://chubaofs.readthedocs.io/en/latest/user-guide/fuse.html

Simulation of real errors

https://danluu.com/file-consistency/
see Phoebe https://github.com/gluckzhang/phoebe
All File Systems Are Not Created Equal: On the Complexity of Crafting Crash-Consistent Applications
https://www.usenix.org/system/files/conference/osdi14/osdi14-paper-pillai.pdf
All File Systems Are Not Created Equal: https://www.usenix.org/sites/default/files/conference/protected-files/osdi14_slides_pillai.pdf
https://www.snia.org/sites/default/files/files2/files2/SDC2013/presentations/TestingMethodologies/VinodEswaraprasad_Software-Based_Fault-Storage-v1-0.pdf
http://pages.cs.wisc.edu/~gibson/pdf/fault-injector.pdf
A survey on simulation-based fault injection tools for complex systems https://hal-auf.archives-ouvertes.fr/hal-01075473/document
Describe the reasons to test with noop for fsync(2) ¹ and ²

Add target to Makefile with manual page check

mandoc -T lint unreliablefs.1

test_symlink() is broken

Traceback (most recent call last):                                                                                                                
  File "/home/sergeyb/sources/unreliablefs/tests/test_unreliablefs.py", line 110, in test_symlink                                                 
    os.symlink(target_path, link_path)                                                                                                            
OSError: [Errno 5] Input/output error: '/tmp/testfile_5' -> '/tmp/pytest-of-sergeyb/pytest-0/test_symlink0/mnt/testfile_4'

Report statistics about injected errors

show a report on unmount that may look like this:

errinj_noop triggered 34 times
errinj_errno triggered 12 times

and probably a detailed log with the date and time when each error injection was triggered
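A sketch of the bookkeeping behind such a report, written in Python for brevity. The class and method names are hypothetical; the real counters would be kept in the C code and printed on unmount.

```python
from collections import Counter


class InjectionStats:
    """Counts how many times each fault injection has been triggered."""

    def __init__(self):
        self.counts = Counter()

    def record(self, fault):
        self.counts[fault] += 1

    def report(self):
        # Most frequently triggered injections come first.
        return "\n".join(
            f"{fault} triggered {n} times"
            for fault, n in self.counts.most_common()
        )


stats = InjectionStats()
for _ in range(34):
    stats.record("errinj_noop")
for _ in range(12):
    stats.record("errinj_errno")
print(stats.report())
```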

Fault injection with slowdown

parameters: duration

[errinj_slowdown]
duration = 0.1
op_regexp = .*
path_regexp = .*
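As a rough illustration of the semantics, the sketch below delays an operation when both regexps match. The issue does not state the unit of duration; the sketch treats it as seconds, which is an assumption. The function name and the injectable sleep parameter are hypothetical.

```python
import re
import time


def maybe_slow_down(op, path, duration=0.1, op_regexp=".*",
                    path_regexp=".*", sleep=time.sleep):
    """Sleep for duration if op and path match; return True if delayed."""
    if re.fullmatch(op_regexp, op) and re.fullmatch(path_regexp, path):
        sleep(duration)  # assumption: duration is given in seconds
        return True
    return False


delays = []  # record requested delays instead of really sleeping
hit = maybe_slow_down("write", "/tmp/fs/a.xlog", sleep=delays.append)
miss = maybe_slow_down("read", "/tmp/fs/a.xlog", op_regexp="write",
                       sleep=delays.append)
print(hit, miss, delays)
```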

test_passthrough is broken on FreeBSD 12

setup_unreliablefs = ('/tmp/pytest-of-root/pytest-0/test_passthrough0/mnt', '/tmp/pytest-of-root/pytest-0/test_passthrough0/src')

    def test_passthrough(setup_unreliablefs):
        mnt_dir, src_dir = setup_unreliablefs
        name = name_generator()
        src_name = pjoin(src_dir, name)
        mnt_name = pjoin(src_dir, name)
        assert name not in os.listdir(src_dir)
        assert name not in os.listdir(mnt_dir)
        with open(src_name, 'w') as fh:
            fh.write('Hello, world')
        assert name in os.listdir(src_dir)
>       assert name in os.listdir(mnt_dir)
E       AssertionError: assert 'testfile_20' in []
E        +  where [] = <built-in function listdir>('/tmp/pytest-of-root/pytest-0/test_passthrough0/mnt')
E        +    where <built-in function listdir> = os.listdir

tests/test_unreliablefs.py:344: AssertionError

https://cirrus-ci.com/task/5170998354903040?command=test#L47

Allow specifying a set of errnos to select from

Allow setting a set of errnos rather than just a particular errno or a
random one from the entire set. Update the cookbook for random faults to
exclude any errnos passed via extra arguments.

My objective here is to be able to exclude a specific errno from
random injection. The Go runtime gets confused by EAGAINs, which cause
it to epoll_wait on the file descriptor. I'd like to exclude EAGAIN from
the set of injected errors for my use case.
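The requested behaviour can be sketched in Python as follows. This is neither charybdefs nor unreliablefs code; `pick_errno` is a hypothetical helper that draws from the errnos known to the platform minus an exclusion set.

```python
import errno
import random


def pick_errno(rng, exclude=()):
    """Pick a random errno, never one listed in exclude."""
    # errno.errorcode maps every errno value known on this platform to its name.
    candidates = sorted(set(errno.errorcode) - set(exclude))
    return rng.choice(candidates)


rng = random.Random(1618680646)
code = pick_errno(rng, exclude={errno.EAGAIN})
print(code, errno.errorcode[code])
```

Sorting the candidates keeps the choice reproducible for a given seed.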

charybdefs 24

test_append is broken on FreeBSD


setup_unreliablefs = ('/tmp/pytest-of-root/pytest-0/test_append0/mnt', '/tmp/pytest-of-root/pytest-0/test_append0/src')
    def test_append(setup_unreliablefs):
        mnt_dir, src_dir = setup_unreliablefs
        name = name_generator()
        os_create(pjoin(src_dir, name))
        fullname = pjoin(mnt_dir, name)
        with os_open(fullname, os.O_WRONLY) as fd:
            os.write(fd, b'foo\n')
        with os_open(fullname, os.O_WRONLY|os.O_APPEND) as fd:
>           os.write(fd, b'bar\n')
E           OSError: [Errno 9] Bad file descriptor
tests/test_unreliablefs.py:188: OSError

https://cirrus-ci.com/task/5170998354903040?command=test#L66

Fault injection with replace operation with no-op

Like https://github.com/stewartsmith/libeatmydata

Replace the operation with a no-op for those operations where it is possible.

Implementation https://github.com/stewartsmith/libeatmydata/blob/master/libeatmydata/libeatmydata.c

Add fio to regression test suite

Job files:

'bad file descriptor' is thrown for simple RandomAccessFile.write/seek(0)/read sequence

>>> f = open("/tmp/charibdushka/tst", "wr+")
>>> import binascii
>>> hs="123456789ABCDEF1"
>>> hb=binascii.a2b_hex(hs)
>>> f.write(hb)
>>> f.seek(0)
>>> f.read()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
IOError: [Errno 9] Bad file descriptor

charybdefs 13

test_seek is broken on FreeBSD

setup_unreliablefs = ('/tmp/pytest-of-root/pytest-0/test_seek0/mnt', '/tmp/pytest-of-root/pytest-0/test_seek0/src')

    def test_seek(setup_unreliablefs):
        mnt_dir, src_dir = setup_unreliablefs
        name = name_generator()
        os_create(pjoin(src_dir, name))
        fullname = pjoin(mnt_dir, name)
        with os_open(fullname, os.O_WRONLY) as fd:
            os.lseek(fd, 1, os.SEEK_SET)
>           os.write(fd, b'foobar\n')
E           OSError: [Errno 9] Bad file descriptor

tests/test_unreliablefs.py:200: OSError

https://cirrus-ci.com/task/5170998354903040?command=test#L47

test_open_unlink() is broken

Traceback (most recent call last):                                                                                                                
  File "/home/sergeyb/sources/unreliablefs/tests/test_unreliablefs.py", line 223, in test_open_unlink                                             
    assert fh.read() == data1+data2                                                                                                               
OSError: [Errno 9] Bad file descriptor

Add regression tests

Fault injection with fake storage capacity

There are many fraudulent USB sticks in circulation that report to have a high capacity (ex: 8GB) but are really only capable of storing a much smaller amount (ex: 1GB). Attempts to write on these devices will often result in unrelated files being overwritten. Any use of a fraudulent flash memory device can easily lead to database corruption, therefore. Internet searches such as "fake capacity usb" will turn up lots of disturbing information about this problem.

https://www.sqlite.org/howtocorrupt.html
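One way to picture this fault is a device that accepts writes up to the reported size but wraps them around the real capacity, so later writes clobber earlier data. The wrap-around addressing is an assumption made for this sketch; real fake drives may behave differently. The class is hypothetical.

```python
import errno
import os


class FakeCapacityDevice:
    """Reports reported_size bytes but only stores real_size bytes."""

    def __init__(self, reported_size, real_size):
        self.reported_size = reported_size
        self.storage = bytearray(real_size)

    def write(self, offset, data):
        if offset + len(data) > self.reported_size:
            raise OSError(errno.ENOSPC, os.strerror(errno.ENOSPC))
        for i, byte in enumerate(data):
            # Assumption: addresses beyond the real capacity wrap around.
            self.storage[(offset + i) % len(self.storage)] = byte

    def read(self, offset, length):
        size = len(self.storage)
        return bytes(self.storage[(offset + i) % size] for i in range(length))


dev = FakeCapacityDevice(reported_size=8, real_size=4)
dev.write(0, b"AAAA")      # lands in real storage
dev.write(4, b"BBBB")      # silently overwrites the first block
print(dev.read(0, 4))
```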

Use realistic errno's

Right now unreliablefs can return any available errno, but that is not realistic.
For example errors like "No space left on device" should still provide the
ability to list the directory and change into the directory.
Perhaps every function should have a separate set of available errno's.
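The idea of a separate errno set per function could be sketched like this. The table below is hypothetical and deliberately incomplete; which errnos are plausible for each operation would have to be taken from the corresponding manual pages.

```python
import errno
import random

# Hypothetical, incomplete table: plausible errnos per operation.
PLAUSIBLE_ERRNOS = {
    "write":   [errno.ENOSPC, errno.EIO, errno.EFBIG],
    "read":    [errno.EIO],
    "open":    [errno.EACCES, errno.EMFILE, errno.ENOSPC],
    "readdir": [errno.EIO],   # no ENOSPC: listing a full disk still works
}


def realistic_errno(op, rng):
    """Return a plausible errno for op, or None if op has no table entry."""
    choices = PLAUSIBLE_ERRNOS.get(op)
    return rng.choice(choices) if choices else None


rng = random.Random(1618680646)
print(realistic_errno("write", rng), realistic_errno("readdir", rng))
```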

charybdefs 20

test_chown() is broken

Traceback (most recent call last):                                                                                                                
  File "/home/sergeyb/sources/unreliablefs/tests/test_unreliablefs.py", line 146, in test_chown                                                   
    os.chown(filename, uid_new, -1)                                                                                                               
PermissionError: [Errno 1] Operation not permitted: '/tmp/pytest-of-sergeyb/pytest-7/test_chown0/mnt/testfile_1'

Add pjdtests to regression tests

https://github.com/pjd/pjdfstest

cannot use touch(1)

$ strace touch tmp/ddd
...
close(3)                                = 0                                                                                                       
utimensat(0, NULL, NULL, 0)             = -1 ENOSYS (Function not implemented)                                                                    
utimensat(AT_FDCWD, "tmp/ddd", NULL, 0) = -1 ENOSYS (Function not implemented)                                                                    
close(0)  
...

Traceback (most recent call last):                                                                                                                
  File "/home/sergeyb/sources/unreliablefs/tests/test_unreliablefs.py", line 325, in test_truncate_fd                                             
    assert fh.read(size) == TEST_DATA                                                                                                             
  File "/usr/lib/python3.8/tempfile.py", line 613, in func_wrapper                                                                                
    return func(*args, **kwargs)                                                                                                                  
OSError: [Errno 9] Bad file descriptor
