tikv / fail-rs

Fail points for rust

License: Apache License 2.0

Rust 100.00%

fail-rs's Introduction

fail-rs

Documentation.

A fail point implementation for Rust.

Fail points are code instrumentations that allow errors and other behavior to be injected dynamically at runtime, primarily for testing purposes. Fail points are flexible and can be configured to exhibit a variety of behavior, including panics, early returns, and sleeping. They can be controlled both programmatically and via the environment, and can be triggered conditionally and probabilistically.

This crate is inspired by FreeBSD's failpoints.

Usage

First, add this to your Cargo.toml:

[dependencies]
fail = "0.5"

Now you can import the fail_point! macro from the fail crate and use it to inject dynamic failures. Fail point generation by this macro is disabled by default, and can be enabled where relevant with the failpoints Cargo feature.

As an example, here's a simple program that uses a fail point to simulate an I/O panic:

use fail::{fail_point, FailScenario};

fn do_fallible_work() {
    fail_point!("read-dir");
    let _dir: Vec<_> = std::fs::read_dir(".").unwrap().collect();
    // ... do some work on the directory ...
}

fn main() {
    let scenario = FailScenario::setup();
    do_fallible_work();
    scenario.teardown();
    println!("done");
}

Here, the program calls unwrap on the result of read_dir, a function that returns a Result. In other words, this particular program expects this call to read_dir to always succeed. And in practice it almost always will, which makes the behavior of this program when read_dir fails difficult to test. By instrumenting the program with a fail point we can pretend that read_dir failed, causing the subsequent unwrap to panic, and allowing us to observe the program's behavior under failure conditions.

When the program is run normally it just prints "done":

$ cargo run --features fail/failpoints
    Finished dev [unoptimized + debuginfo] target(s) in 0.01s
     Running `target/debug/failpointtest`
done

But now, by setting the FAILPOINTS variable, we can see what happens if read_dir fails:

$ FAILPOINTS=read-dir=panic cargo run --features fail/failpoints
    Finished dev [unoptimized + debuginfo] target(s) in 0.01s
     Running `target/debug/failpointtest`
thread 'main' panicked at 'failpoint read-dir panic', /home/ubuntu/.cargo/registry/src/github.com-1ecc6299db9ec823/fail-0.2.0/src/lib.rs:286:25
note: Run with `RUST_BACKTRACE=1` for a backtrace.
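
To make the FAILPOINTS=read-dir=panic setting above more concrete, here is a minimal std-only sketch of how such a string could be split into per-point actions. It is not the crate's parser: the helper names parse_failpoints and action_for are hypothetical, and the use of `;` to separate several entries is an assumption of this sketch.

```rust
use std::collections::HashMap;

/// Parses a `FAILPOINTS`-style string such as "read-dir=panic;write=off"
/// into a map from fail point name to its action string.
/// Entries without an `=` are skipped in this sketch.
fn parse_failpoints(spec: &str) -> HashMap<String, String> {
    spec.split(';')
        .filter_map(|entry| {
            let (name, action) = entry.trim().split_once('=')?;
            Some((name.trim().to_string(), action.trim().to_string()))
        })
        .collect()
}

/// Looks up the action configured for `name`, if any.
fn action_for<'a>(config: &'a HashMap<String, String>, name: &str) -> Option<&'a str> {
    config.get(name).map(String::as_str)
}
```

A fail point named "read-dir" would then consult action_for(&config, "read-dir") and panic when the answer is Some("panic").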

For further information see the API documentation.

TODO

Triggering a fail point via the HTTP API is planned but not implemented yet.

fail-rs's Issues

default to injecting crate name in failpoints

I think we should consider defaulting to injecting the crate name in fail_point!. Otherwise it's just too likely to have clashes if this crate is used by library crates for example.

This would need to happen at the next semver break.
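
One way the namespacing idea could look is sketched below. This is not part of fail-rs: the macro name namespaced_name is hypothetical, and the sketch uses module_path!() (which expands to the crate name at the crate root) instead of a Cargo-provided crate name so that it compiles without Cargo.

```rust
/// Builds a fail point name prefixed with the path of the invoking module.
/// At the crate root, `module_path!()` expands to the crate name.
macro_rules! namespaced_name {
    ($name:literal) => {
        concat!(module_path!(), "::", $name)
    };
}

/// Example call site: two crates that both use "read-dir" would end up
/// with different full names.
fn read_dir_point_name() -> &'static str {
    namespaced_name!("read-dir")
}
```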

cargo feature should be opt-in, not opt-out

I have a usecase where I'd like to add failpoints across several of my libraries, however I'm experiencing some friction due to the way that cargo features are used by this crate.

Failpoints are currently active by default and need to be disabled (opt-out) in production via the no_fail cargo feature. This poses a problem when nesting a couple of levels of dependencies, as the top-level consumer is no longer in charge of those features and can't directly opt out.

Considering that cargo features are additive, a better approach would be to make failpoints disabled by default and enable them via a dedicated feature (opt-in). That way, the top-level application/consumer would optionally be in charge of configuring the fail environment and enabling failpoints (transparent to all intermediate libraries).

In practice, this would mean:

getting rid of the no_fail feature
making failpoints disabled by default
introducing a failpoints feature to enable them
releasing fail-0.3 with the new semantics
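
The opt-in shape proposed in the list above could be sketched roughly as follows. This is an illustration only: sketch_fail_point and this local has_failpoints are hypothetical stand-ins, and the "enabled" arm merely logs instead of consulting any configuration.

```rust
/// Reports whether this build was compiled with the opt-in feature.
pub const fn has_failpoints() -> bool {
    cfg!(feature = "failpoints")
}

// Opt-in: this arm only exists when the `failpoints` feature is enabled.
#[cfg(feature = "failpoints")]
macro_rules! sketch_fail_point {
    ($name:expr) => {
        eprintln!("failpoint {} evaluated", $name)
    };
}

// Default: the fail point expands to (almost) nothing.
#[cfg(not(feature = "failpoints"))]
macro_rules! sketch_fail_point {
    ($name:expr) => {{
        let _ = $name;
    }};
}

fn do_work() -> &'static str {
    sketch_fail_point!("read-dir");
    "done"
}
```

Because features are additive, only a crate that explicitly turns the feature on gets the instrumented arm; every other build compiles the no-op.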

If this sounds fine to you, I can have a look around and send a PR in the next weeks.

/cc @BusyJay @kennytm @Hoverbear @brson

Add the global failpoint lock pattern directly to the library

When running failpoint unit tests, one must take a global lock so the failpoint configuration stays consistent during parallel execution. We do this in our own failpoints tests, and it's explained extensively in the fail docs. Since the library is significantly less useful without a global lock, we might add one directly to the library and use it in the tikv failpoints tests.

Just copy the pattern from tikv/tests/failpoints into this library, then test tikv against the new failpoints library. This can be done by temporarily replacing the fail dependency in Cargo.toml with a path dependency to the modified version of fail, then running cargo test --test failpoints.

If it all works, then submit the patch here.
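
For readers unfamiliar with the pattern, a rough std-only sketch of a global scenario lock is shown below. It is not the code from tikv/tests/failpoints: ScenarioGuard, configure, and configured_action are hypothetical names, and the registry is reduced to a map of strings.

```rust
use std::collections::BTreeMap;
use std::sync::{Mutex, MutexGuard};

/// Serializes tests that touch the shared configuration.
static SCENARIO_LOCK: Mutex<()> = Mutex::new(());
/// Global fail point configuration: name -> action string.
static REGISTRY: Mutex<BTreeMap<String, String>> = Mutex::new(BTreeMap::new());

/// Holds the global lock for the lifetime of one test scenario.
struct ScenarioGuard {
    _lock: MutexGuard<'static, ()>,
}

impl ScenarioGuard {
    fn setup() -> Self {
        // Recover from poisoning so one panicking test does not wedge the rest.
        let lock = SCENARIO_LOCK.lock().unwrap_or_else(|e| e.into_inner());
        clear_registry();
        ScenarioGuard { _lock: lock }
    }
}

impl Drop for ScenarioGuard {
    fn drop(&mut self) {
        // Runs before the lock field is released, so the next test starts clean.
        clear_registry();
    }
}

fn clear_registry() {
    REGISTRY.lock().unwrap_or_else(|e| e.into_inner()).clear();
}

fn configure(name: &str, action: &str) {
    let mut registry = REGISTRY.lock().unwrap_or_else(|e| e.into_inner());
    registry.insert(name.to_string(), action.to_string());
}

fn configured_action(name: &str) -> Option<String> {
    REGISTRY.lock().unwrap_or_else(|e| e.into_inner()).get(name).cloned()
}
```

Each test would create a guard first, configure its fail points, and rely on the guard's drop to clean up.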

Thread-local failpoints

Is your feature request related to a problem? Please describe.

Failpoint unit tests require taking a global lock, preventing test parallelism. An alternate or complementary solution to a global lock (#23) would be to have a thread-local failpoint configuration, protected by a guard.

Describe the solution you'd like
Add a thread-local configuration that is protected by a guard that performs teardown.

Describe alternatives you've considered
Global locks: #23

Additional context
This would work for single-threaded test cases, but not generally for tests that require multiple threads.
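
A thread-local configuration with a teardown guard might look roughly like the sketch below. All names (LocalScenario, local_action) are hypothetical, and the configuration is reduced to a map of strings.

```rust
use std::cell::RefCell;
use std::collections::HashMap;

thread_local! {
    /// Per-thread fail point configuration: name -> action string.
    static LOCAL_CONFIG: RefCell<HashMap<String, String>> = RefCell::new(HashMap::new());
}

/// Guard that owns the current thread's configuration until it is dropped.
struct LocalScenario;

impl LocalScenario {
    fn setup() -> Self {
        LOCAL_CONFIG.with(|c| c.borrow_mut().clear());
        LocalScenario
    }

    fn configure(&self, name: &str, action: &str) {
        LOCAL_CONFIG.with(|c| {
            c.borrow_mut().insert(name.to_string(), action.to_string());
        });
    }
}

impl Drop for LocalScenario {
    fn drop(&mut self) {
        // `try_with` avoids a panic if the thread-local is already destroyed.
        let _ = LOCAL_CONFIG.try_with(|c| c.borrow_mut().clear());
    }
}

/// Reads the action configured for `name` on the calling thread only.
fn local_action(name: &str) -> Option<String> {
    LOCAL_CONFIG.with(|c| c.borrow().get(name).cloned())
}
```

The sketch also shows the limitation noted above: a thread spawned by the test sees an empty configuration, so fail points hit on that thread would not fire.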

crater fails to test fail-rs

I just noticed in a crater run that fail-rs is broken: https://crater-reports.s3.amazonaws.com/pr-60466/master%237840a0b753a065a41999f1fb6028f67d33e3fdd5/reg/fail-0.2.1/log.txt

It doesn't look like a problem with the crate, but I've asked @pietroalbini about it. Would be nice to have fail tested properly by crater.

support using failpoints from unit tests without requiring serial test execution

Is your feature request related to a problem? Please describe.
Currently, failpoints cannot be used from tests unless all tests that hit failpoints are executed serially. This essentially prevents us from using failpoints in our unit tests without forcing tests to run with a single test thread. The suggested approach is to place failpoint tests under the tests tree so that they are executed as Rust integration tests. However, this means that the tests cannot use any interfaces that are not exposed by the crate, which makes it difficult to write most of our test cases.

Describe the solution you'd like
The reason for this restriction is that failpoints uses a global failpoint registry to control fault injections. It would be nice if there were a way to set up failpoints so that they could use a test-specific registry. One approach to doing this is to support specifying the failpoint registry in calls to the fail_point! macros, e.g.

let registry = <construct or accept a passed in failpoint registry>
fail_point!(&registry, "fail-a-thing", |_| std::io::Error::new(...))
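
A compilable version of that idea, under stated assumptions, could look like the sketch below. Nothing here exists in fail-rs: Registry, its methods, and the fail_point_in macro are hypothetical, and the registry only tracks which points are enabled.

```rust
use std::collections::HashSet;
use std::io;
use std::sync::Mutex;

/// A test-owned registry; nothing here is global.
#[derive(Default)]
struct Registry {
    enabled: Mutex<HashSet<String>>,
}

impl Registry {
    fn enable(&self, name: &str) {
        self.enabled.lock().unwrap().insert(name.to_string());
    }

    fn is_enabled(&self, name: &str) -> bool {
        self.enabled.lock().unwrap().contains(name)
    }
}

/// Hypothetical registry-aware fail point: when the point is enabled in the
/// given registry, return early with the error built by the closure.
macro_rules! fail_point_in {
    ($registry:expr, $name:expr, $e:expr) => {
        if $registry.is_enabled($name) {
            return Err(($e)(()));
        }
    };
}

fn do_a_thing(registry: &Registry) -> io::Result<u32> {
    fail_point_in!(registry, "fail-a-thing", |_| io::Error::new(
        io::ErrorKind::Other,
        "injected failure"
    ));
    Ok(42)
}
```

Two tests that each build their own Registry cannot affect one another, which is the property the request is after.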

Describe alternatives you've considered
We considered running tests serially and moving our fault tests to the tests directory. Running tests serially might work for now, but could lead to longer build times later. The bigger problem is that tests then fail under the default parallel runner, so developers in our project would always need to remember to run tests with a single thread and configure their IDE to do the same, which is painful. Putting tests in the tests directory is not ideal because it requires us to expose many interfaces from our crate that we would otherwise keep private, just to write the tests we want.

Clean up crate docs

The crate docs are pretty overwhelming. Figure out how to defer some of that discussion to elsewhere in the docs.

Support enabling conditionally fail_points without the third lambda argument

Is your feature request related to a problem? Please describe.
In most of my failpoints I need the condition to enable a fail point, but I rarely use the return feature. Nevertheless, I'm forced to use the three-argument version of the macro and define some return value that makes sense for my function.

Describe the solution you'd like
A fail_point two argument macro with name and enable flag, e.g.: fail_point!("my-fail-point", if: enableFlag)
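
The requested syntax can be matched by macro_rules, as the sketch below suggests. It is only an illustration: fail_point_if and evaluate_point are hypothetical, and evaluate_point just counts hits in place of whatever the library does when a point is evaluated.

```rust
use std::cell::Cell;

thread_local! {
    /// Counts how often a fail point was actually evaluated on this thread.
    static HITS: Cell<u32> = Cell::new(0);
}

/// Stand-in for whatever the real library does when a fail point is evaluated.
fn evaluate_point(_name: &str) {
    HITS.with(|h| h.set(h.get() + 1));
}

fn hits() -> u32 {
    HITS.with(|h| h.get())
}

/// Two-argument form: a name plus an `if:` condition, no return closure.
macro_rules! fail_point_if {
    ($name:expr, if: $cond:expr) => {
        if $cond {
            evaluate_point($name);
        }
    };
}

fn step(enable_flag: bool) {
    fail_point_if!("my-fail-point", if: enable_flag);
}
```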

fail_point! does nothing unless a FailScenario exists

Perhaps I'm doing something wrong, but I have code that looks very similar to the examples, and I can't get it to panic or otherwise respond to failpoints in the environment:

The full code is in https://github.com/sourcefrog/fail-repro

main.rs is

use fail::fail_point;

fn main() {
    println!("Has failpoints: {}", fail::has_failpoints());
    println!(
        "FAILPOINTS is {:?}",
        std::env::var("FAILPOINTS").unwrap_or_default()
    );
    fail_point!("main");
    println!("Failpoint passed");
}

When I run this:

$ FAILPOINTS=main=panic cargo +1.61 r --features fail/failpoints
    Updating crates.io index
...
     Running `target/debug/fail-repro`
Has failpoints: true
FAILPOINTS is "main=panic"
Failpoint passed

$ FAILPOINTS=main=print cargo +1.61 r --features fail/failpoints
    Finished dev [unoptimized + debuginfo] target(s) in 0.01s
     Running `target/debug/fail-repro`
Has failpoints: true
FAILPOINTS is "main=print"
Failpoint passed

In case this was broken by a later Cargo change, I tried it on both 1.76 and 1.63 and they both show the same behavior.

This is on x86_64 Linux.

Upgrade to Rust 2018

After TiKV itself is successfully upgraded (tikv/tikv#3896) we can bump fail to Rust 2018 as well. Do a major version bump.

Cannot use `fail_point!` 3 arguments macro without importing it

Describe the bug
The fail_point! macro cannot be invoked by its fully qualified path in the three-argument case.

To Reproduce
Just try to compile:

fail::fail_point!("fail-point-3", enable, |_| {});

And you'll get:

error: cannot find macro `fail_point` in this scope
   --> my_code.rs:10
    |
10 |                     fail::fail_point!("fail-point-3", enable, |_| {});
    |                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    |
    = note: consider importing this macro:
            fail::fail_point
    = note: this error originates in the macro `fail::fail_point` (in Nightly builds, run with -Z macro-backtrace for more info)

Expected behavior
You should be able to use the macro without importing it with use.

Additional context
Looks like the issue is here:

fail-rs/src/lib.rs

Line 841 in 6645f17

fail_point!($name, $e);

The recursive macro invocation should look like this:

$crate::fail_point!($name, $e);
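
The effect of the fix can be seen in a simplified, self-contained macro. The sketch below is not the crate's fail_point!: sketch_point is a hypothetical macro that returns a value instead of returning early, and it exists only to show a recursive call routed through $crate.

```rust
#[macro_export]
macro_rules! sketch_point {
    // Two-argument form: run the handler with the fail point's name.
    ($name:expr, $e:expr) => {{
        let handler = $e;
        handler($name)
    }};
    // Three-argument form: only evaluate the point when `$cond` holds.
    // `$crate::` makes the recursive call resolve even when the caller
    // invoked the macro by path without importing it.
    ($name:expr, $cond:expr, $e:expr) => {{
        if $cond {
            Some($crate::sketch_point!($name, $e))
        } else {
            None
        }
    }};
}

fn name_len_if(enabled: bool) -> Option<usize> {
    sketch_point!("fail-point-3", enabled, |name: &str| name.len())
}
```

Without the $crate:: prefix, the inner call is looked up by bare name at the caller's location, which is what produces the "cannot find macro" error when the caller never imported it.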

API docs mention `no_fail` feature

The API docs mention the no_fail feature, but that feature no longer exists. Instead the API docs should mention, probably near the top, that failpoints are not active unless the failpoints feature is on, and its existence can be checked (after #38) statically or dynamically with has_failpoints.

cc @lucab

Support dependency wait

Is your feature request related to a problem? Please describe.

Make fail points support dependencies, where one fail point waits for another before proceeding.
We can refer to the implementation of RocksDB's sync points: https://github.com/facebook/rocksdb/blob/e9e0101ca46f00e8a456e69912a913d907be56fc/test_util/sync_point.h

Describe the solution you'd like

Support writing something like fail::cfg("point_A", "wait(point_B)")

wait pauses at point_A until point_B has been passed.
wait_local enables point_A only when point_A and point_B are processed on the same thread; it also pauses at point_A until point_B has been passed.
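
The wait behavior described above can be sketched with a mutex and a condition variable. This is only an illustration of the proposed semantics, not an implementation inside fail-rs: mark_passed, wait_for, and has_passed are hypothetical helpers, and wait_local is not covered.

```rust
use std::collections::BTreeSet;
use std::sync::{Condvar, Mutex};

/// Names of points that have already been passed.
static PASSED: Mutex<BTreeSet<String>> = Mutex::new(BTreeSet::new());
/// Signalled whenever a new point is added to `PASSED`.
static PASSED_CHANGED: Condvar = Condvar::new();

/// Marks `name` as passed and wakes every waiter.
fn mark_passed(name: &str) {
    PASSED.lock().unwrap().insert(name.to_string());
    PASSED_CHANGED.notify_all();
}

/// Blocks the calling thread until `dependency` has been passed.
fn wait_for(dependency: &str) {
    let mut passed = PASSED.lock().unwrap();
    while !passed.contains(dependency) {
        passed = PASSED_CHANGED.wait(passed).unwrap();
    }
}

fn has_passed(name: &str) -> bool {
    PASSED.lock().unwrap().contains(name)
}

/// Behaves like a point configured as `wait(point_B)`.
fn point_a() {
    wait_for("point_B");
    mark_passed("point_A");
}

fn point_b() {
    mark_passed("point_B");
}
```

A thread running point_a() stays parked until some other thread reaches point_b().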

Additional context

Part of tikv/rust-rocksdb#361

need a tag for release 0.3

Describe the bug
From the README, the version has already been bumped to 0.3. But on https://crates.io/crates/fail, the version is still 0.2.1. I guess we need a tag for release 0.3?

Release fail 1.0

This can probably be done shortly after upgrading to 2018: #21

I'll probably want to clean up the documentation a bit. It's overwhelming atm.

Support thread group

Is your feature request related to a problem? Please describe.

fail-rs uses a global registry to expose simple APIs and make FailPoint definitions convenient. But this also means that tests which would otherwise run in parallel have to run one at a time, with cleanup between runs, so that their configurations do not affect each other.

Describe the solution you'd like

This issue proposes using thread groups. Each test case defines a unique thread group, and every configuration is bound to exactly one thread group. Every time a new thread is spawned, it needs to be registered to a thread group so that FailPoint reads that group's configuration. If a thread is not registered to any group, it belongs to a default global group.

New public APIs include:

pub fn current_fail_group() -> FailGroup;

impl FailGroup {
    pub fn register_current(&self) -> Result<()>;
    pub fn deregister_current(&self);
}

Note that this doesn't require users to control how threads are spawned; registering the thread before it uses a FailPoint is enough.
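
To illustrate how the proposed API might behave, here is a std-only sketch. Only current_fail_group, register_current, and deregister_current come from the proposal; FailGroup::new, cfg, action_for_current_thread, the String error type, and the integer group ids are assumptions made for this example.

```rust
use std::cell::Cell;
use std::collections::BTreeMap;
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::Mutex;

const DEFAULT_GROUP: u64 = 0;
static NEXT_GROUP_ID: AtomicU64 = AtomicU64::new(1);
/// (group id, fail point name) -> action string.
static CONFIG: Mutex<BTreeMap<(u64, String), String>> = Mutex::new(BTreeMap::new());

thread_local! {
    /// Group the current thread is registered to.
    static CURRENT_GROUP: Cell<u64> = Cell::new(DEFAULT_GROUP);
}

#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub struct FailGroup(u64);

impl FailGroup {
    /// Hypothetical constructor: allocates a fresh, unique group.
    pub fn new() -> Self {
        FailGroup(NEXT_GROUP_ID.fetch_add(1, Ordering::Relaxed))
    }

    pub fn register_current(&self) -> Result<(), String> {
        CURRENT_GROUP.with(|g| {
            if g.get() != DEFAULT_GROUP && g.get() != self.0 {
                return Err("thread already belongs to another group".to_string());
            }
            g.set(self.0);
            Ok(())
        })
    }

    pub fn deregister_current(&self) {
        CURRENT_GROUP.with(|g| g.set(DEFAULT_GROUP));
    }

    /// Hypothetical: binds a configuration to this group only.
    pub fn cfg(&self, name: &str, action: &str) {
        let mut config = CONFIG.lock().unwrap();
        config.insert((self.0, name.to_string()), action.to_string());
    }
}

pub fn current_fail_group() -> FailGroup {
    FailGroup(CURRENT_GROUP.with(|g| g.get()))
}

/// What a fail point on the calling thread would see for `name`.
pub fn action_for_current_thread(name: &str) -> Option<String> {
    let key = (current_fail_group().0, name.to_string());
    CONFIG.lock().unwrap().get(&key).cloned()
}
```

A worker thread that calls register_current() on the test's group sees the test's configuration, while unregistered threads fall back to the default group.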

Describe alternatives you've considered

One solution is to pass the registry to struct constructors, but this interferes heavily with general code, because the registry needs to be passed anywhere FailPoints are defined.

Another solution is #24, but it lacks support for multi-threaded cases.

test case failed may be due to failpoint lack of isolation

Describe the bug
Four test cases are executed concurrently, which causes them to block.
tikv/tikv#17277 (comment)

"Put test cases exercising fail points into their own test crate" — is 'crate' right?

The fail crate's crate-level doc-comment first says:

fail-rs/src/lib.rs, lines 107 to 108 in 2cf1175

//! this it is a best practice to put all fail point unit tests into their own
//! binary. Here's an example of a snippet from `Cargo.toml` that creates a

and then later says:

fail-rs/src/lib.rs, lines 219 to 220 in 2cf1175

//!    fail points. Put test cases exercising fail points into their own test
//!    crate.

Should the latter advice say "binary" rather than "crate"?
