arkanosis / bamrescue

Utility to check Binary Sequence Alignment / Map (BAM) files for corruption and repair them

Home Page: https://bamrescue.arkanosis.net/

License: ISC License

Rust 92.04% Shell 7.22% Dockerfile 0.74%
bam bam-files corruption repair rescue bioinformatics

bamrescue's Introduction

bamrescue

bamrescue is a command line utility to check Binary Sequence Alignment / Map (BAM) files for corruption and rescue as much data as possible from them in the event they happen to be corrupted.

Installation

On ArchLinux and derivatives (Manjaro…)

A PKGBUILD is provided on AUR for ArchLinux and derivatives. It is only tested with an up-to-date ArchLinux.

# Get the PKGBUILD
git clone https://aur.archlinux.org/bamrescue.git

# Add the author's PGP key
gpg --recv-keys FA490B15D054C7E83F70B0408C145ABAC11FA702

# Build and install bamrescue
cd bamrescue
makepkg -si

Alternatively, you can install bamrescue using an AUR helper such as yay:

# Install bamrescue
yay -S bamrescue

On Debian and derivatives (Ubuntu, Mint…)

Pre-built packages are provided for Debian and derivatives. They are only tested with Debian 12 (Bookworm) and Ubuntu 24.04 LTS (Noble).

# Install prerequisites
sudo apt install curl gnupg

# Add the author's PGP key
curl -s https://arkanosis.net/jroquet.pub.asc | sudo tee /usr/share/keyrings/arkanosis.asc

# Add the author's apt stable channel to your apt sources
echo 'deb [arch=amd64 signed-by=/usr/share/keyrings/arkanosis.asc] https://apt.arkanosis.net/ stable main' | sudo tee /etc/apt/sources.list.d/arkanosis.list

# Update and install bamrescue
sudo apt update
sudo apt install bamrescue

In OCI containers (Docker, Podman…)

A Dockerfile is provided for Docker and alternatives. It is only tested with Podman 5.

To create an OCI image from the Dockerfile, run this command:

podman build --tag bamrescue:0.3.0 -f ./Dockerfile

To run an ephemeral container with the created image, run this command:

podman run --rm -it bamrescue:0.3.0 bamrescue --help

You can of course replace --help with the command / option of your choice.

Usage

Usage: bamrescue check [--quiet] [--threads=<threads>] <bamfile>
       bamrescue rescue [--threads=<threads>] <bamfile> <output>
       bamrescue -h | --help
       bamrescue --version

Commands:
    check                Check BAM file for corruption.
    rescue               Keep only non-corrupted blocks of BAM file.

Arguments:
    bamfile              BAM file to check or rescue.
    output               Rescued BAM file.

Options:
    -h, --help           Show this screen.
    -q, --quiet          Do not output statistics, stop at first error.
    --threads=<threads>  Number of threads to use, 0 for auto [default: 0].
    --version            Show version.
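
The usage text above only says that --threads=0 means "auto". As a minimal sketch of how such a value could be resolved (an assumption for illustration, not necessarily what bamrescue does), the standard library can report the available parallelism. The function name resolve_threads is hypothetical.

```rust
use std::num::NonZeroUsize;
use std::thread;

/// Resolves a `--threads` value: any non-zero value is used as-is, while 0
/// falls back to the parallelism reported by the standard library (or 1 if
/// that cannot be determined). Hypothetical helper, not bamrescue's code.
fn resolve_threads(requested: usize) -> usize {
    if requested != 0 {
        return requested;
    }
    thread::available_parallelism()
        .map(NonZeroUsize::get)
        .unwrap_or(1)
}
```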

How it works

A BAM file is a BGZF file (specification), and as such is composed of a series of concatenated RFC 1952-compliant gzip blocks (specification).

Each gzip block contains at most 64 KiB of data, including a CRC32 checksum of the uncompressed data which is used to check its integrity.
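
To make the integrity check concrete, here is a minimal sketch of validating an inflated payload against the 8-byte gzip trailer (CRC32 followed by the uncompressed size, both little-endian). The bitwise CRC32 below is written for readability rather than speed, and the function names are illustrative, not taken from bamrescue.

```rust
/// Bitwise CRC-32 with the reflected polynomial 0xEDB88320, the variant
/// stored in gzip trailers. Slow but dependency-free.
fn crc32(data: &[u8]) -> u32 {
    let mut crc: u32 = 0xFFFF_FFFF;
    for &byte in data {
        crc ^= u32::from(byte);
        for _ in 0..8 {
            // All ones when the low bit is set, zero otherwise.
            let mask = (crc & 1).wrapping_neg();
            crc = (crc >> 1) ^ (0xEDB8_8320 & mask);
        }
    }
    !crc
}

/// Returns true when `payload` (the inflated data) matches the CRC32 and
/// size recorded in the gzip trailer of its block.
fn trailer_matches(payload: &[u8], trailer: &[u8; 8]) -> bool {
    let stored_crc = u32::from_le_bytes([trailer[0], trailer[1], trailer[2], trailer[3]]);
    let stored_len = u32::from_le_bytes([trailer[4], trailer[5], trailer[6], trailer[7]]);
    stored_crc == crc32(payload) && stored_len as usize == payload.len()
}
```

A block whose trailer does not match is treated as corrupted, regardless of whether inflating it succeeded.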

Additionally, gzip blocks start with a gzip identifier (i.e. 0x1f8b), a fixed gzip method (i.e. 0x8) and fixed gzip flags (i.e. 0x4), and bgzf blocks also include a bgzf identifier (i.e. 0x4243), a fixed extra subfield length (i.e. 0x2) and their own compressed size. It is therefore possible to skip over corrupted blocks (at most 64 KiB each) to the next non-corrupted block with limited complexity and acceptable reliability.

This property is used to rescue data from corrupted BAM files by keeping only their non-corrupted blocks, hopefully rescuing most reads.
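
The skipping step can be sketched as a forward scan for the fixed bytes listed above. This is only an illustration: it assumes the common 18-byte header layout in which the bgzf subfield is the first (and only) extra subfield, and the function names are hypothetical rather than bamrescue's own.

```rust
/// Length of the fixed-layout header assumed in this sketch.
const HEADER_LEN: usize = 18;

/// Returns true if `buf` starts with what looks like a bgzf block header.
fn looks_like_bgzf_header(buf: &[u8]) -> bool {
    buf.len() >= HEADER_LEN
        && buf[0..4] == [0x1f, 0x8b, 0x08, 0x04] // gzip identifier, method, flags
        && buf[12..16] == [b'B', b'C', 0x02, 0x00] // bgzf identifier, subfield length
}

/// Scans forward from `start` for the next offset that looks like a header.
fn find_next_header(buf: &[u8], start: usize) -> Option<usize> {
    (start..buf.len()).find(|&pos| looks_like_bgzf_header(&buf[pos..]))
}

/// Total size in bytes of the block whose header starts `buf`
/// (the stored field is the block size minus one).
fn block_size(buf: &[u8]) -> Option<usize> {
    if !looks_like_bgzf_header(buf) {
        return None;
    }
    Some(usize::from(u16::from_le_bytes([buf[16], buf[17]])) + 1)
}
```

A match on these few bytes is only a candidate: the block still has to inflate correctly and pass its CRC32 check before it can be trusted.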

Examples

A bam file of 40 MiB (which is very small by today's standards) has been corrupted by two hard drive bad sectors. Most tools (including gzip) choke on the file at the first corrupted byte, meaning that up to 100% of the bam payload is considered lost depending on the tool.

Let's check the file using bamrescue:

$ bamrescue check samples/corrupted_payload.bam
bam file statistics:
   1870 bgzf blocks checked (117 MiB of bam payload)
      2 corrupted blocks found (0% of total)
     46 KiB of bam payload lost (0% of total)

Indeed, a whole hard drive bad sector typically amounts to 512 bytes lost, which is much smaller than an average bgzf block (which can be up to 64 KiB large).
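
As a quick worked check of the "0% of total" figures in the output above, the sketch below computes the same shares with more decimals. It uses the numbers from that output (2 of 1870 blocks, 46 KiB of 117 MiB) and assumes 1 MiB = 1024 KiB; the helper names are hypothetical.

```rust
/// Share of `part` in `total`, as a percentage (0.0 when `total` is zero).
fn percent(part: u64, total: u64) -> f64 {
    if total == 0 {
        0.0
    } else {
        part as f64 * 100.0 / total as f64
    }
}

/// Formats the share with the requested number of decimals.
fn format_percent(part: u64, total: u64, decimals: usize) -> String {
    format!("{:.*}%", decimals, percent(part, total))
}
```

With these inputs, format_percent(2, 1870, 2) gives 0.11% and format_percent(46 * 1024, 117 * 1024 * 1024, 2) gives 0.04%, both of which round to 0% when no decimals are shown.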

Even though most tools would give up on this file, it still contains almost 100% of non-corrupted bam payload, and the user probably wouldn't mind much if they could work only on that close-to-100% amount of data.

Let's rescue the non-corrupted payload (beware: this takes as much additional space on the disk as the original file):

$ bamrescue rescue samples/corrupted_payload.bam rescued_file.bam
bam file statistics:
   1870 bgzf blocks found (117 MiB of bam payload)
      2 corrupted blocks removed (0% of total)
     46 KiB of bam payload lost (0% of total)
   1868 non-corrupted blocks rescued (100% of total)
    111 MiB of bam payload rescued (100% of total)

The resulting bam file can now be used as if it had never been corrupted. Rescued data is validated using a CRC32 checksum, so this is not like ignoring errors and working on corrupted data (the typical use of gzip to get garbage data from a corrupted bam file): it is working on (only marginally) less, but validated, data.

Performance

bamrescue is very fast. With two or more threads, it is even faster than gzip -t while doing more.

Here are some numbers for a 40 MiB, non-corrupted bam file:

Command                          Time      Corruption detected
gzip -t                          695 ms    No
bamrescue check -q --threads=1   1181 ms   No
bamrescue check -q --threads=2   661 ms    No
bamrescue check -q --threads=4   338 ms    No
bamrescue check --threads=1      1181 ms   No
bamrescue check --threads=2      661 ms    No
bamrescue check --threads=4      338 ms    No

Here are some numbers for the same 40 MiB bam file, with two single-byte corruptions (at ~7 MiB and ~18 MiB, respectively):

Command                          Time      Corruption detected   Number of corrupted blocks reported   Amount of data rescuable¹
gzip -t                          93 ms     Yes                   N/A                                   21 MiB (18%)
bamrescue check -q --threads=1   157 ms    Yes                   N/A                                   21 MiB (18%)
bamrescue check -q --threads=2   91 ms     Yes                   N/A                                   21 MiB (18%)
bamrescue check -q --threads=4   56 ms     Yes                   N/A                                   21 MiB (18%)
bamrescue check --threads=1      1174 ms   Yes                   2                                     117 MiB (99.99%)
bamrescue check --threads=2      659 ms    Yes                   2                                     117 MiB (99.99%)
bamrescue check --threads=4      338 ms    Yes                   2                                     117 MiB (99.99%)

¹ uncompressed bam payload, rescued using gzip -d or bamrescue rescue

Note: these benchmarks have been run on an Intel Core i5-6500 CPU running Kubuntu 16.04.2 and rustc 1.18.0.

Caveats

bamrescue does not check whether the bam payload of the file is actually compliant with the bam specification. It only checks that it has not been corrupted after creation, using the error detection codes built into the gzip and bgzf formats. This means that as long as the tool used to create a bam file was compliant with the specification, the output of bamrescue will be as well, but bamrescue itself will do nothing to validate that compliance.

Compiling

Run cargo build --release in your working copy.

Contributing and reporting bugs

Contributions are welcome through GitHub pull requests.

Please report bugs and feature requests on GitHub issues.

License

bamrescue is copyright (C) 2017-2024 Jérémie Roquet and licensed under the ISC license.

bamrescue's People

Contributors

alanhoyle, arkanosis

Forkers

alanhoyle

bamrescue's Issues

Show a progress indicator

  • Add a progress bar to show the overall progress.
  • Update the number of bad blocks and lost bytes in real time.
  • Show an ETA.

Packaging

Package for:

  • ArchLinux: PKGBUILD in the AUR
  • Debian and derivatives: single binary .deb package in apt.arkanosis.net for Debian Bookworm, Ubuntu Noble
  • ArchLinux: binary in repository
  • Debian and derivatives: source .deb
  • RedHat and derivatives: single .rpm package in rpm.arkanosis.net for RHEL7, CentOS 7, Fedora 27, Fedora 28
  • OCI containers: OCI image in oci.arkanosis.net (or some third-party hosting platform)

Shell autocompletion

Add autocompletion for:

  • bash
  • zsh

Include the following completions:

  • commands;
  • options;
  • input files with the .bam extension.

Investigate faster inflate implementations

The currently used inflate implementation works pretty well, but is a lot slower than zlib. This means that unless we use multiple threads, bamrescue is slower than gzip. Switching to a state-of-the-art inflate implementation could make bamrescue as fast as gzip with a single thread, and much faster with multiple threads.

Detailed statistics to be used for optional repair

Optionally (?) output the offsets and size of unrecoverable corrupted blocks to a file (eg. bamrescue.$PID.log).

This file could then be used to fetch only the corresponding non-corrupted blocks from the original bam file and repair the corrupted copy, without having to copy the whole original file again (expect a few KiB of non-corrupted bgzf blocks to transfer, instead of a few GiB of bam file).
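
A minimal sketch of what such a log and repair step might look like is given below. The tab-separated "offset, size" line format, the CorruptedBlock type and the function names are all assumptions made for illustration; the issue does not define a format. The sketch also assumes the corruption happened in place, so that offsets are identical in the corrupted copy and in the original.

```rust
use std::convert::TryFrom;

/// Location of one unrecoverable block (hypothetical record type).
#[derive(Debug, PartialEq)]
struct CorruptedBlock {
    offset: u64,
    size: u64,
}

/// Serializes a record as "offset<TAB>size" (hypothetical log format).
fn to_log_line(block: &CorruptedBlock) -> String {
    format!("{}\t{}", block.offset, block.size)
}

/// Parses a line written by `to_log_line`.
fn parse_log_line(line: &str) -> Option<CorruptedBlock> {
    let (offset, size) = line.trim().split_once('\t')?;
    Some(CorruptedBlock {
        offset: offset.parse().ok()?,
        size: size.parse().ok()?,
    })
}

/// Copies only the logged byte ranges from `pristine` into `damaged`.
/// Returns None if a range does not fit in either buffer.
fn patch(damaged: &mut [u8], pristine: &[u8], blocks: &[CorruptedBlock]) -> Option<()> {
    for block in blocks {
        let start = usize::try_from(block.offset).ok()?;
        let end = start.checked_add(usize::try_from(block.size).ok()?)?;
        damaged.get_mut(start..end)?.copy_from_slice(pristine.get(start..end)?);
    }
    Some(())
}
```

In practice the ranges would be read from and written to files rather than in-memory buffers, and the patched blocks would still need to pass their CRC32 check.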

Rescue corrupted bam files

Rescue corrupted bam files using the following strategies:

  • if the payload size is incorrect, it can be fixed as the CRC32 already validates the payload (100% reliable);
  • if there is no empty bgzf block at the end, one can be added (unreliable);
  • if there's a deflate or CRC32 error and the next block is correct, the block must be removed (reliable, lost one block);
  • if there's a deflate or CRC32 error and the next block is shifted, try to fix the bgzf data size (100% reliable if it works).
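
For the missing end-of-file case, the empty bgzf block in question is the fixed 28-byte marker defined by the SAM/BAM specification. The sketch below checks for it and appends it when absent; the function names are hypothetical and the buffer stands in for the file contents.

```rust
/// The 28-byte empty bgzf block used as an end-of-file marker.
const BGZF_EOF: [u8; 28] = [
    0x1f, 0x8b, 0x08, 0x04, 0x00, 0x00, 0x00, 0x00, 0x00, 0xff, 0x06, 0x00, 0x42, 0x43,
    0x02, 0x00, 0x1b, 0x00, 0x03, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
];

/// Returns true if the data ends with the end-of-file marker.
fn has_eof_marker(file: &[u8]) -> bool {
    file.ends_with(&BGZF_EOF)
}

/// Appends the marker if it is missing; returns true when it was added.
fn ensure_eof_marker(file: &mut Vec<u8>) -> bool {
    if has_eof_marker(file.as_slice()) {
        return false;
    }
    file.extend_from_slice(&BGZF_EOF);
    true
}
```

Appending the marker makes the file well-formed again, but as noted in the list above it cannot tell whether blocks were truncated before it, hence "unreliable".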

Provide a flatpak

Consider providing a flatpak, either on Flathub or on a custom repo. Maybe also a snap?

Handle corruptions in gzip / bgzf headers

In addition to handling corruptions in gzip / bgzf payloads, handle them as well in headers.

There are three main cases to handle:

  • in place bitflip (eg. wrong identifier) ⇒ ignore / fix it;
  • frameshift bitflip (eg. wrong bgzf block size) ⇒ detect the next bgzf header and fix the previous block;
  • bad sector(s) ⇒ detect the next valid bgzf header and drop the previous blocks.

Detecting a bgzf header is quite simple: just look for the following bytes:

0x1f 0x8b // gzip identifier
0x8 // gzip method
0x4 // gzip flags (FEXTRA)
X X X X // modification time
X // gzip extra flags
X // operating system
X X // gzip extra field length
[
    X X // extra subfield identifier
    S S // extra subfield length
    X… // extra subfield (SS bytes)
]
0x42 0x43 // bgzf identifier
0x2 0x0 // bgzf extra subfield length

In pipelined mode, first check that the offset for the next bgzf block is correct. If not, try to fix it first (if the block size alone is corrupted, no data is lost, as the CRC32 can validate the inflated payload). If unable to fix it, drop as many blocks as required.
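
The byte layout above allows other extra subfields to precede the bgzf one. A minimal sketch of a parser that walks the gzip extra field until it finds the bgzf subfield, and returns the total block size, could look like this (the function name is hypothetical; multi-byte fields are little-endian, and the stored size is the block size minus one):

```rust
/// Returns the total size of the bgzf block starting at `buf`, or None if
/// `buf` does not start with a complete, well-formed bgzf header.
fn bgzf_block_size(buf: &[u8]) -> Option<usize> {
    // Fixed part: identifier (2), method (1), flags (1), modification time (4),
    // extra flags (1), operating system (1), extra field length (2).
    if buf.len() < 12 || buf[0..4] != [0x1f, 0x8b, 0x08, 0x04] {
        return None;
    }
    let xlen = usize::from(u16::from_le_bytes([buf[10], buf[11]]));
    let extra = buf.get(12..12 + xlen)?;

    // Walk the subfields: identifier (2), length (2), then `len` bytes of data.
    let mut pos = 0;
    while pos + 4 <= extra.len() {
        let id = [extra[pos], extra[pos + 1]];
        let len = usize::from(u16::from_le_bytes([extra[pos + 2], extra[pos + 3]]));
        let data = extra.get(pos + 4..pos + 4 + len)?;
        if id == [b'B', b'C'] && len == 2 {
            return Some(usize::from(u16::from_le_bytes([data[0], data[1]])) + 1);
        }
        pos += 4 + len;
    }
    None
}
```

In pipelined mode, the returned size gives the expected offset of the next header, which can then be checked with the same function.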

Web page

Single page website with at least:

  • Linux x86_64 binary
  • PGP signature
  • Minimal documentation (eg. GitHub's README.md)
  • Example use / output (eg. asciinema)
  • License
  • “Fork me on GitHub” link

Advanced rescue of correctable corruptions

Some corruptions are actually reliably correctable without too much computation:

  • if fixed header bytes are incorrect, they can be fixed as long as the CRC32 properly validates the payload (100% reliable as long as the bam usage is concerned);
  • if the payload size is incorrect, it can be fixed as the CRC32 already validates the payload (100% reliable);
  • if there's a deflate or CRC32 error and the next block is shifted, try to fix the bgzf data size (100% reliable if it works).

Parallel check

When several threads are available:

  • use one thread for bgzf block header decoding;
  • use a thread pool for bgzf block payload decoding.

This should help make bamrescue faster than gzip -t (and, later on, speed up the rescue command as well).

The futures branch might be of some inspiration.

Pseudo-code:

loop:
  if future.poll(header|payload+footer):
    stats += result
    if repairing:
      keep_block() // may have been fixed in the meantime, btw
  if read_header():
    future(pop(header|payload + footer)) // most likely valid, process with CRC32
    push(header|payload + footer)
  else:
    find_next_header_position()
    if next_header_was_a_little_corrupted(): // ie. almost fine
      fix_next_header()
    elif next_header_was_close(): // ie. bgzf block size was wrong
      fix_previous_header()
      pop(header|payload + footer) // was wrong, drop
      push(header|payload + footer) // corrected, CRC32 will be checked later
    else: // next_header_was_far(), ie. big mess in the next block(s)
      keep(header|payload + footer) // probably wrong, but who knows... CRC32 will tell us
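
The "one producer, pool of checkers" split proposed in this issue can also be sketched with plain standard-library threads and channels instead of futures. This is a swapped-in technique chosen to keep the example dependency-free, not what the futures branch does. The names count_corrupted and starts_with_gzip_id are hypothetical, and the validity check is a trivial stand-in for inflating a block and verifying its CRC32.

```rust
use std::sync::{mpsc, Arc, Mutex};
use std::thread;

/// Stand-in for the real per-block check (inflate + CRC32).
fn starts_with_gzip_id(block: &[u8]) -> bool {
    block.first() == Some(&0x1f)
}

/// Feeds `blocks` to `workers` checker threads and returns how many blocks
/// failed `is_valid`.
fn count_corrupted(blocks: Vec<Vec<u8>>, workers: usize, is_valid: fn(&[u8]) -> bool) -> usize {
    let (block_tx, block_rx) = mpsc::channel::<Vec<u8>>();
    let block_rx = Arc::new(Mutex::new(block_rx));
    let (result_tx, result_rx) = mpsc::channel::<bool>();

    let handles: Vec<_> = (0..workers.max(1))
        .map(|_| {
            let block_rx = Arc::clone(&block_rx);
            let result_tx = result_tx.clone();
            thread::spawn(move || loop {
                // The lock is released at the end of this statement, so the
                // check itself runs without holding it.
                let received = block_rx.lock().unwrap().recv();
                match received {
                    Ok(block) => result_tx.send(is_valid(block.as_slice())).unwrap(),
                    Err(_) => break, // channel closed: no more blocks
                }
            })
        })
        .collect();
    drop(result_tx); // only the workers hold result senders now

    // Producer role (header decoding in the real tool): hand out the blocks.
    for block in blocks {
        block_tx.send(block).unwrap();
    }
    drop(block_tx); // lets the workers exit once the queue is drained

    let corrupted = result_rx.iter().filter(|&ok| !ok).count();
    for handle in handles {
        handle.join().unwrap();
    }
    corrupted
}
```

Results arrive in completion order here; a rescue command would additionally need to carry block indices so that kept blocks can be written back in their original order.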

Make panics human-friendly

There aren't many panics left, hopefully, but given that there are still a few unwrap calls in some places, it would be better to catch them to help the end user report them.

One option would be to use human-panic for this.

What's needed:

  • error message
  • full stack + version in a file
  • bug report URI (preferably bamrescue.arkanosis.net/report_bug which would redirect to GitHub issues, with the ability to change the actual target at any time)
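
Without pulling in human-panic, the first and last items can be sketched with the standard library's panic hook. This is only an illustration under assumptions: the report URL is the one proposed above (it may not exist yet), the message wording and function names are hypothetical, and writing the full stack to a file is left out.

```rust
use std::panic;

/// Bug report URI proposed in this issue (assumed, may not exist yet).
const REPORT_URL: &str = "https://bamrescue.arkanosis.net/report_bug";

/// Builds the text shown to the user when a panic occurs.
fn friendly_message(version: &str, details: &str) -> String {
    format!(
        "bamrescue {} hit an unexpected error and had to stop.\n\
         Details: {}\n\
         Please report this at {}",
        version, details, REPORT_URL
    )
}

/// Replaces the default panic output with the friendly message.
fn install_panic_hook(version: &'static str) {
    panic::set_hook(Box::new(move |info| {
        eprintln!("{}", friendly_message(version, &info.to_string()));
    }));
}
```

Calling install_panic_hook at the very start of main would make any remaining unwrap failure print this message instead of the default panic output.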

Statistics in terms of reads instead of blocks / bytes

Parse the bam payload to be able to output the number of reads in the input bam file, as well as the number of recoverable / unrecoverable reads overall.

Also, give the genomic position of lost reads if the bam is sorted.
