
openslab-osu / featherfault

🔬FeatherFault tells you why your Arduino program is crashing

License: GNU General Public License v3.0

C++ 63.09% Python 36.91%
crash failure arduino-library samd21 feather-m0 fault-diagnosis hardfault

featherfault's Introduction

🔬FeatherFault

When a microcontroller crashes or hangs, it can be quite difficult to troubleshoot what caused it. FeatherFault is an attempt to build a system that can not only recover from a crash, but explain why the crash happened. FeatherFault supports all boards using the SAMD21 (Adafruit Feather M0, Arduino Zero, etc.), and future support is planned for the SAMD51.

Getting Started

FeatherFault can be installed through the Arduino Library Manager, or by downloading this repository. The Adafruit ASF core is also required, which can be found here. Once these are both installed, FeatherFault can be activated by adding the following lines to the beginning of a sketch:

#include "FeatherFault.h"

void setup() {
    Serial.begin(...);
    while(!Serial);
    FeatherFault::PrintFault(Serial);
    Serial.flush();
    FeatherFault::StartWDT(FeatherFault::WDTTimeout::WDT_8S);
    ...
}

and decorating the sketch code with MARK statements, making sure to surround suspicious code sections with them. MARK may not be used more than once per line, and must be used both before and after the suspected code:

void loop() {
    // Mark a function
    MARK; 
    do_something_suspicious();
    MARK;

    // Mark a loop
    MARK;
    while (unsafe_function_one() == true) { MARK;
        // Ignore safe functions, but mark the unsafe ones
        // Which functions are 'unsafe' is up to the programmer
        safe_function_one();
        safe_function_two();
        safe_function_three();
        MARK;
        unsafe_function_two();
        MARK;
    }
}

Once FeatherFault is activated, it will trigger after a set time of inactivity (we specify 8 seconds above, but this value can be changed), on memory overflow, or on a hard fault. Once triggered, FeatherFault will immediately save the location of the last run MARK statement along with the fault cause, and reset the board. This saved data can then be read by FeatherFault::PrintFault and FeatherFault::GetFault, allowing the developer to determine if the board has failed after it resets.

Usage Example

To show how this behavior works, let's assume that unsafe_function() in the code block below attempts to access memory that doesn't exist, causing a hard fault:

void setup() {
    // Wait for serial to connect to the serial monitor
    Serial.begin(...);
    while(!Serial);
    // begin code
    Serial.println("Start!");
    other_function_one();
    unsafe_function(); // oops
    other_function_two();
    Serial.println("Done!");
}

If we run this code without FeatherFault, we would see the serial monitor output something like this:

Start!

After which the device hard faults, causing it to wait in an infinite loop until it is reset.

This behavior is extremely difficult to troubleshoot: as the developer, all we know is that the device failed between Start! and Done!. Using more print statements, we could eventually narrow the cause down to unsafe_function, but this process is time-consuming, unreliable, and downright annoying. Instead, let's try the same code with FeatherFault activated:

void setup() {
    // Wait for serial to connect to the serial monitor
    Serial.begin(...);
    while(!Serial);
    // Activate FeatherFault
    FeatherFault::PrintFault(Serial);
    FeatherFault::StartWDT(FeatherFault::WDTTimeout::WDT_8S);
    // begin code
    MARK;
    Serial.println("Start!");
    MARK;
    other_function_one();
    MARK;
    unsafe_function(); // oops
    MARK;
    other_function_two();
    MARK;
    Serial.println("Done!");
    MARK;
}

Running that sketch, we would see the following serial monitor output:

No fault
Start!

No fault here indicates that FeatherFault has not been triggered yet. We change that shortly by running unsafe_function(), causing a hard fault. Instead of waiting in an infinite loop, however, the board is immediately reset to the start of the sketch by FeatherFault. We can then open the serial monitor again:

Fault! Cause: HARDFAULT
Fault during recording: No
Line: 18
File: MySketch.ino
Failures since upload: 1
Start!

Since FeatherFault was triggered by the hard fault, FeatherFault::PrintFault prints the last file and line number MARKed before the hard fault happened. In this case, line 18 of MySketch.ino is the MARK statement after other_function_one(), leading us to suspect that unsafe_function() is causing the issue. We can now focus on troubleshooting unsafe_function().

Additional Features

Getting Fault Data In The Sketch

While most projects should only need traces on the serial monitor, some (such as remote deployments) will need to log the data to other mediums. To do this, FeatherFault has the FeatherFault::DidFault and FeatherFault::GetFault functions to check if a fault has occurred, and to get the last fault trace. For more information on these functions, please see FeatherFault.h.

Getting Fault Data Without Serial

If a serial connection cannot be established while the sketch is running, but the board is able to communicate in bootloader mode, the recover_fault python script can download and read FeatherFault trace data using the bootloader. Simply follow the setup instructions contained in the script, reset the board into bootloader mode, and run:

python ./recover_fault.py recover <comport>

Running Code When The Device Faults

Some code may be needed to perform cleanup of external devices after FeatherFault causes an unexpected reset. There are two general methods for this: a safe one, and an unsafe one. While the safe method is generally recommended, access to the state of the program may be needed during the fault, in which case the unsafe method is necessary.

Safe Method

The safe method uses FeatherFault::DidFault at the beginning of setup:

void setup() {
    ...
    if (FeatherFault::DidFault()) {
        // perform cleanup here
        cleanup_code();
    }
    ...
}

Since FeatherFault resets the board immediately upon failure, cleanup_code() will run every time FeatherFault is triggered. When writing the cleanup_code() routine, remember that the program state has been entirely cleared, and any devices or variables in the sketch must be initialized before they can be used (ex. Serial.begin must be called to use Serial). If access to a variable value before the device is reset is needed, please see the unsafe method below.

Unsafe Method

The unsafe method uses FeatherFault::SetCallback to register a function to be called before the device is reset:

volatile void cleanup_code() {
    // perform cleanup here
    // can also read global variables
}

void setup() {
    ...
    FeatherFault::SetCallback(cleanup_code);
    ...
}

cleanup_code() will be called after FeatherFault stores a trace, but before the device is reset—allowing it to access global variables and devices in the faulted state. Note that this implementation has a few major caveats:

  • The callback (cleanup_code) must be interrupt safe (cannot use delay, Serial, etc.).
  • The callback must be extremely careful when accessing memory outside of itself. All memory should be assumed corrupted unless proven otherwise. Pointers should be treated with extra caution.
  • The callback must execute in less time than the specified WDT timeout, or it will be reset by the watchdog timer.
  • If the callback itself faults, an infinite loop will be triggered.

Because of the above restrictions, it is highly recommended that the safe method is used wherever possible.

Implementation Notes

FeatherFault currently handles three failure modes: hanging, memory overflow, and hard fault. When any of these failure modes are triggered, FeatherFault will immediately write the information from the last MARK to flash memory, and cause a system reset. FeatherFault::PrintFault, FeatherFault::GetFault, and FeatherFault::DidFault read this flash memory to retrieve information regarding the last fault.
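The record-then-reset flow described above can be modelled on a desktop machine. The following is a minimal sketch under stated assumptions, not FeatherFault's actual code: every name in it (SIM_MARK, SimFaultRecord, sim_flash, sim_record_fault, sim_did_fault) is hypothetical, and a plain static struct stands in for the flash page that would survive a reset on real hardware.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Hypothetical names throughout: a host-side model, not the library's code.
enum class SimCause : uint8_t { None, Hung, HardFault, OutOfMemory };

struct SimFaultRecord {
    SimCause cause = SimCause::None;
    int32_t line = 0;
    char file[64] = {0};
    uint32_t failures = 0;
};

// Stand-in for a flash page; on real hardware this would survive a reset.
static SimFaultRecord sim_flash;

// Location of the most recent mark; held in RAM, so it is lost on reset.
static volatile int32_t last_line = 0;
static const char* volatile last_file = "";

// Record the current source location; cheap enough to call often.
#define SIM_MARK do { last_file = __FILE__; last_line = __LINE__; } while (0)

// Called from a (simulated) fault handler: copy the last mark into "flash".
void sim_record_fault(SimCause cause) {
    sim_flash.cause = cause;
    sim_flash.line = last_line;
    std::strncpy(sim_flash.file, last_file, sizeof(sim_flash.file) - 1);
    sim_flash.file[sizeof(sim_flash.file) - 1] = '\0';
    sim_flash.failures += 1;
    // A real handler would reset the board at this point.
}

// Called early after the reset to see whether a fault was recorded.
bool sim_did_fault() { return sim_flash.cause != SimCause::None; }
```

In this model, calling SIM_MARK and then sim_record_fault(SimCause::HardFault) leaves the marked line number in sim_flash.line, which is the information a print routine would report after the reset.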

Hanging Detection

Hanging detection is implemented using the SAMD watchdog timer early warning interrupt. As a result, FeatherFault will not detect hanging unless FeatherFault::StartWDT is called somewhere in the beginning of the sketch. Note that, as with normal watchdog operation, FeatherFault's hang detection must be reset periodically using the MARK macro; this means MARK must be placed so that it runs at least once within each timeout period. In long operations that cannot be MARKed (sleep being an example), use FeatherFault::StopWDT to disable the watchdog during that time.

Behind the scenes, watchdog feeding is implemented with a global atomic boolean that determines whether the device should fault during the watchdog interrupt, as opposed to the standard register write found in SleepyDog and other libraries. This decision was made because feeding the WDT on the SAMD21 is extremely slow (1-5ms), which is unacceptable for the MARK macro (see #4). Note that due to this implementation, the watchdog interrupt fires regularly and may take an extended period of time (1-5ms), which can cause timing issues with other code.
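The flag-based feeding scheme can be illustrated with a small host-side model. This is a sketch of the general idea only, assuming one flag that MARK clears and the interrupt sets; the names (should_fault, sim_mark_feed, sim_wdt_early_warning, wdt_feeds, faulted) are hypothetical and the counters stand in for hardware effects.

```cpp
#include <atomic>
#include <cstdint>

// Hypothetical host-side model; names are not taken from FeatherFault.
static std::atomic<bool> should_fault{false};
static uint32_t wdt_feeds = 0;  // counts the slow hardware feeds
static bool faulted = false;    // set when a hang is detected

// Cheap part, run by every mark: clear a flag in RAM, no register write.
inline void sim_mark_feed() {
    should_fault.store(false, std::memory_order_relaxed);
}

// Run periodically by the (simulated) watchdog early warning interrupt.
void sim_wdt_early_warning() {
    // Set the flag and look at its previous value in one step.
    if (should_fault.exchange(true)) {
        // Nothing cleared the flag since the last interrupt: treat as a hang.
        faulted = true;
        return;
    }
    // A mark ran since the last interrupt: do the slow feed once, here.
    ++wdt_feeds;
}
```

The design point this models is that the expensive watchdog write happens once per interrupt rather than once per mark, which is also why the interrupt itself becomes the slow path.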

Memory Overflow Detection

Memory overflow detection is implemented by checking the top of the heap against the top of the stack. If the stack is overwriting the heap, memory is assumed to be corrupted and the board is immediately reset. This check is performed inside the MARK macro.
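As a rough illustration of the comparison involved, the check reduces to one address test, assuming the usual ARM layout where the heap grows upward and the stack grows downward. The helper name sim_stack_hit_heap is hypothetical, and how the two addresses are obtained on a real board (for example from the allocator's break and the current stack pointer) is an assumption left out of this sketch.

```cpp
#include <cstdint>

// Hypothetical helper: true when the descending stack has reached or passed
// the top of the ascending heap. Both arguments are plain addresses.
constexpr bool sim_stack_hit_heap(uintptr_t heap_top, uintptr_t stack_ptr) {
    return stack_ptr <= heap_top;
}
```

A mark macro built on this idea would run the test on every call and jump to the fault path when it returns true.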

Hard Fault Detection

Hard fault detection is implemented using the existing hard fault interrupt vector built into ARM. This interrupt is normally defined as an infinite loop; however, FeatherFault overrides this handler to allow for tracing and a graceful recovery. This feature is activated when FeatherFault is included in the sketch.

featherfault's People

Contributors

bgoto808, kamocat, mirrorkeydev, prototypicalpro

featherfault's Issues

Integrate Register Saving from FeatherTrace

FeatherTrace saves registers pushed onto the stack to a register dump and takes a stacktrace, printing this information on reboot. While a full stacktrace as found in FeatherTrace requires changing compiler flags, saving the pushed registers would only require an inline assembly interrupt handler similar to the one found here. As a result, integrating this feature into FeatherFault would be a useful way to get around being unable to MARK some areas of code while still allowing FeatherFault to be Arduino IDE compatible (unlike FeatherTrace). An external program and the elf from the current build (hopefully compiled with -g3?) would be required to translate the saved address into a line number, but this would be useful information if the line was not MARKed.

The implementation of this ISR would be based on the link above, and would rely on the exception handling behavior in the Cortex-M0+.
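For reference, on exception entry a Cortex-M core pushes eight words onto the active stack: r0 to r3, r12, lr, the return address (pc), and xPSR. The sketch below models reading that frame on a host machine; ExceptionFrame and sim_read_frame are hypothetical names, and the inline assembly needed to capture the stack pointer inside a real handler is deliberately left out.

```cpp
#include <cassert>
#include <cstdint>

// The eight words pushed by the core on exception entry, lowest address first.
struct ExceptionFrame {
    uint32_t r0, r1, r2, r3, r12, lr, pc, xpsr;
};
static_assert(sizeof(ExceptionFrame) == 8 * sizeof(uint32_t),
              "frame must be exactly eight words");

// Hypothetical helper: given the stack pointer value captured at handler
// entry, copy out the stacked registers.
ExceptionFrame sim_read_frame(const uint32_t* stacked_sp) {
    assert(stacked_sp != nullptr);
    return ExceptionFrame{stacked_sp[0], stacked_sp[1], stacked_sp[2],
                          stacked_sp[3], stacked_sp[4], stacked_sp[5],
                          stacked_sp[6], stacked_sp[7]};
}
```

The pc field is the address an external tool would translate into a file and line using the build's ELF file; a tool such as addr2line from the ARM GNU toolchain is one possibility, assuming the ELF carries debug information.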

Using FeatherFault with SleepyDog fails to compile

FeatherFault fails to compile when used with the Adafruit SleepyDog Library. This issue is caused by both FeatherFault and SleepyDog defining a WDT_Handler function, causing a multiple definition error at link time:

C:\Users\User\AppData\Local\Temp\arduino_build_638914\libraries\Adafruit_SleepyDog_Library\utility\WatchdogSAMD.cpp.o: In function `WDT_Handler':

C:\Users\User\Documents\Arduino\libraries\Adafruit_SleepyDog_Library\utility/WatchdogSAMD.cpp:167: multiple definition of `WDT_Handler'

C:\Users\User\AppData\Local\Temp\arduino_build_638914\libraries\FeatherFault\FeatherFault.cpp.o:C:\Users\zackp\OneDrive\Documents\Arduino\libraries\FeatherFault\src/FeatherFault.cpp:148: first defined here

collect2.exe: error: ld returned 1 exit status

FeatherFault needs to be able to respond to the watchdog timer, so removing WDT_Handler is out of the question. The only idea I have to resolve this issue is to implement the FeatherFault watchdog code in terms of the SleepyDog library--resulting in less-than-ideal reliability and another dependency.

For the moment I will mark this as wontfix, though I am open to other solutions.

WDT reset synchronization causes MARK to run slowly

Currently the MARK macro takes nearly 5 milliseconds, which can add up quickly if used 20 times in a loop. I think most of this is in the string processing: extracting the filename from the __FILE__ constant. If we move this step into the interrupt handler, we could reduce the time to copying an int and a pointer, plus the time to actually feed the watchdog. I expect this to take around 2 microseconds if the peripheral clock runs at 1MHz.

Here's my test code:

#include "FeatherFault.h"

void setup() {
  pinMode(13, OUTPUT); // configure the LED pin before driving it
  digitalWrite(13, LOW);
  delay(100);
  digitalWrite(13, HIGH);
  Serial.begin(115200);
  while(!Serial);
  FeatherFault::StartWDT(FeatherFault::WDTTimeout::WDT_8S); MARK;
  FeatherFault::PrintFault(Serial); MARK;
  Serial.flush(); MARK;
}

void loop() {
  long tic = micros();
  MARK;
  long toc = micros();
  Serial.println(toc-tic);
}

And a graph of the result:
[Figure: featherfault_MARK_time]
