rebar's Introduction

rebar

rebar is a REcombination BARcode detector!

Why rebar?

  1. rebar detects and visualizes genomic recombination.

    It follows the PHA4GE Guidance for Detecting and Characterizing SARS-CoV-2 Recombinants which outlines three steps:

    1. Assess the genomic evidence for recombination.
    2. Identify the breakpoint coordinates and parental regions.
    3. Classify sequences as designated or novel recombinant lineages.
  2. rebar performs generalized clade assignment.

    While specifically designed for recombinants, rebar works on non-recombinants too! It will report a sequence's closest known match in the dataset, as well as any mutation conflicts that were observed. The linelist and visual outputs can be used to detect novel variants, for example as part of the SARS-CoV-2 pango-designation process.

  3. rebar is for exploring hypotheses.

    The recombination search can be customized to test your hypotheses about which parents and genomic regions are recombining. If that sounds overwhelming, you can always just use the pre-configured datasets (ex. SARS-CoV-2) that are validated against known recombinants.

A plot of the breakpoints and parental regions for the recombinant SARS-CoV-2 lineage XBB.1.16. At the top are rectangles arranged side-by-side horizontally. These are colored and labelled by each parent (ex. BJ.1, CJ.1) and are interpreted as reading left to right, 5' to 3'. Below these regions are genomic annotations, which show the coordinates for each gene. At the bottom are horizontal tracks, where each row is a sample, and each column is a mutation. Mutations are colored according to which parent the recombination region derives from.

Install

rebar is a standalone binary file. We recommend installing it with conda or by direct download.

conda install -c bioconda rebar
  • Please see the install docs for Windows, macOS, Docker, Singularity, and Conda.
  • Please see the compile docs for those interested in source compilation.

Usage

Custom Dataset

A small test dataset (toy1) serves as a template for creating custom datasets, and for easier visualization of the method and output.

rebar dataset download --name toy1 --tag custom --output-dir dataset/toy1
rebar run --dataset-dir dataset/toy1 --populations "*" --mask 0,0 --min-length 3 --output-dir output/toy1
rebar plot --run-dir output/toy1 --annotations dataset/toy1/annotations.tsv

SARS-CoV-2

Download a SARS-CoV-2 dataset, version-controlled to the date 2023-11-30 (try any date!).

rebar dataset download --name sars-cov-2 --tag 2023-11-30 --output-dir dataset/sars-cov-2/2023-11-30
rebar run --dataset-dir dataset/sars-cov-2/2023-11-30 --populations "AY.4.2*,BA.5.2,XBC.1.6*,XBB.1.5.1,XBL" --output-dir output/sars-cov-2
rebar plot --run-dir output/sars-cov-2 --annotations dataset/sars-cov-2/2023-11-30/annotations.tsv

Other

Please see the examples docs for more tutorials including:

  • Using your own alignment of genomes as input.
  • Testing specific parent combinations.
  • Performing a 'knockout' experiment.
  • Validating all populations in a dataset.

Please see the dataset and run docs for more methodology.

Output

Linelist

A linelist summary of results (ex. output/toy1/linelist.tsv).

strain        validate  validate_details  population  recombinant  parents  breakpoints  edge_case  unique_key   regions         genome_length  dataset_name  dataset_tag  cli_version
population_A  pass                        A                                              false                                   20             toy1          custom       0.2.0
population_B  pass                        B                                              false                                   20             toy1          custom       0.2.0
population_C  pass                        C                                              false                                   20             toy1          custom       0.2.0
population_D  pass                        D           D            A,B      12-12        false      D_A_B_12-12  1-11|A,12-20|B  20             toy1          custom       0.2.0
population_E  pass                        E           E            C,D      4-4          false      E_C_D_4-4    1-3|C,4-20|D    20             toy1          custom       0.2.0
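
As a rough illustration of working with this output, the sketch below reads a linelist with Python's standard csv module, keeps the rows flagged as recombinants, and splits the regions column into coordinates. It assumes the file is tab-separated with the header shown above. The inline sample is trimmed to a few columns, and the helper names `recombinant_rows` and `parse_regions` are hypothetical, not part of rebar.

```python
import csv
import io

# Trimmed, tab-separated sample mirroring some of the linelist columns shown above.
SAMPLE = (
    "strain\tpopulation\trecombinant\tparents\tbreakpoints\tregions\n"
    "population_A\tA\t\t\t\t\n"
    "population_D\tD\tD\tA,B\t12-12\t1-11|A,12-20|B\n"
    "population_E\tE\tE\tC,D\t4-4\t1-3|C,4-20|D\n"
)

def recombinant_rows(handle):
    """Return rows whose 'recombinant' column is non-empty (hypothetical helper)."""
    reader = csv.DictReader(handle, delimiter="\t")
    return [row for row in reader if row["recombinant"]]

def parse_regions(regions):
    """Turn '1-11|A,12-20|B' into [(1, 11, 'A'), (12, 20, 'B')]."""
    parsed = []
    for region in regions.split(","):
        coords, parent = region.split("|")
        start, end = coords.split("-")
        parsed.append((int(start), int(end), parent))
    return parsed

rows = recombinant_rows(io.StringIO(SAMPLE))
for row in rows:
    print(row["strain"], row["parents"], parse_regions(row["regions"]))
```

To run it against a real result, replace the `io.StringIO(SAMPLE)` handle with `open("output/toy1/linelist.tsv", newline="")`.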

Plots

A visualization of substitutions, parental origins, and breakpoints (ex. output/toy1/plots/).

rebar plot of population D in dataset toy1

Barcodes

The discriminating sites with mutations between samples and their parents (ex. output/toy1/barcodes/).

coord origin Reference A B population_D
1 A A C T C
2 A A C T C
3 A A C T C
4 A A C T C
5 A A C T C
... ... ... ... ... ...
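
One way to read a barcode table is to ask, at each discriminating site, which parent's base the sample carries, and then collapse runs of sites into regions. The sketch below shows that idea only; it is not rebar's implementation. The sites are invented for illustration (the table above is truncated, so they are not the toy1 barcodes), and the function names are hypothetical.

```python
# Hypothetical barcode rows: (coord, base in parent A, base in parent B, base in sample).
# These sites are invented for illustration and are not the toy1 barcodes.
BARCODES = [
    (1, "C", "T", "C"),
    (3, "C", "T", "C"),
    (7, "G", "A", "G"),
    (12, "T", "G", "G"),
    (15, "A", "C", "C"),
]

def site_origins(barcodes):
    """Label each site with the parent whose base matches the sample, or '?' if neither or both do."""
    origins = []
    for coord, base_a, base_b, sample in barcodes:
        matches = [name for name, base in (("A", base_a), ("B", base_b)) if base == sample]
        origins.append((coord, matches[0] if len(matches) == 1 else "?"))
    return origins

def collapse_regions(origins):
    """Collapse consecutive sites with the same origin into (first_coord, last_coord, parent)."""
    regions = []
    for coord, parent in origins:
        if regions and regions[-1][2] == parent:
            regions[-1] = (regions[-1][0], coord, parent)
        else:
            regions.append((coord, coord, parent))
    return regions

regions = collapse_regions(site_origins(BARCODES))
print(regions)
```

In this sketch the regions only span observed discriminating sites; how the gap between two regions is assigned to a breakpoint is left out.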

Credits

rebar is built and maintained by Katherine Eaton at the National Microbiology Laboratory (NML) of the Public Health Agency of Canada (PHAC).

This project follows the all-contributors specification (emoji key). Contributions of any kind welcome!


Katherine Eaton

💻 📖 🎨 🤔 🚇 🚧

Special thanks go to the following people, who are instrumental to the design and data sources in rebar:


Lena Schimmel

🤔

Cornelius Roemer

🔣 🔣 🔣

Josh Levy

🔣

Richard Neher

🤔

Thanks go to the following people, who participated in the development of rebar and ncov-recombinant:


Yatish Turakhia

🔣 🤔

Angie Hinrichs

🔣 🤔

Benjamin Delisle

🐛 ⚠️

Vani Priyadarsini Ikkurthi

🐛 ⚠️

Mark Horsman

🤔 🎨

Dan Fornika

🤔 ⚠️

Tara Newman

🤔 ⚠️

Adrian Zetner

🔣 🤔

Connor Chato

🔣 🤔

Matthew Wells

📦

Andrea Tyler

🔣

rebar's Issues

How does rebar handle IUPAC ambiguous characters?

Hi Katherine,

Thank you for rebar, it's working very nicely with an in-silico dataset (part of a quality assurance program in Australia).

I am getting some mixed results when I include or exclude ambiguous characters (e.g. using the majority base). How does rebar handle ambiguous bases?

Thanks!

explore: Find unknown recombinants in dataset

A current limitation in rebar is that you somewhat need to know which populations in the dataset are recombinants. I'd like to write a new subcommand, maybe explore, that will check all pairwise combinations in a dataset.

So it would work like: rebar explore --dataset-dir dataset/sars-cov-2 --output-dir output/explore/sars-cov-2.

And you could supply the --min-parents/--max-parents arguments like in rebar run. If --min-parents 2 and --max-parents 2, rebar would compare every population against every other pairwise combination. It would be a pretty hefty calculation, but could be a good assessment for Issue #1 (efficiency improvements).
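
To get a feel for how hefty that calculation would be, here is a minimal sketch that enumerates candidate parent sets for each population. The explore subcommand does not exist yet, so this only illustrates the size of the search space. The function names are hypothetical, and whether a population's own descendants should be excluded as parents is not addressed.

```python
from itertools import combinations
from math import comb

def parent_combinations(populations, min_parents=2, max_parents=2):
    """Yield every candidate parent set of size min_parents..max_parents."""
    for size in range(min_parents, max_parents + 1):
        yield from combinations(sorted(populations), size)

def candidates_for(query, populations, min_parents=2, max_parents=2):
    """Parent sets to test for one query population, excluding the query itself."""
    others = [p for p in populations if p != query]
    return list(parent_combinations(others, min_parents, max_parents))

populations = ["A", "B", "C", "D", "E"]
for query in populations:
    print(query, candidates_for(query, populations))

# With exactly two parents, the number of comparisons is n * C(n - 1, 2).
n = len(populations)
total = n * comb(n - 1, 2)
print(total)
```

For the five toy populations this gives 30 comparisons. The count grows roughly with the cube of the number of populations, which is why it would make a useful stress test.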

cli: make phylogeny a new sub-command

In other analysis projects, I'm wishing for the recombination-aware phylogenetic methods I've written here. Pretty much all the functions, but primarily these:

  • phylogeny::get_common_ancestor
  • phylogeny::get_ancestors
  • phylogeny::get_descendants
  • phylogeny::get_recombinants

There might be value in creating a new subcommand rebar phylogeny so that we can do things like:

  1. rebar phylogeny --graph phylogeny.json --mrca XE,XG
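
For readers unfamiliar with what "recombination-aware" means here, the sketch below treats the phylogeny as a directed acyclic graph in which a recombinant has more than one parent. It is a Python illustration under stated assumptions, not the Rust functions listed above: the graph, the population names, and the rule that a population counts as its own ancestor are all hypothetical.

```python
# Hypothetical graph: each population maps to its parents.
# Recombinants (D, E, F) have two parents, so this is a DAG rather than a tree.
PARENTS = {
    "root": [],
    "A": ["root"],
    "B": ["root"],
    "C": ["root"],
    "D": ["A", "B"],
    "E": ["C", "D"],
    "F": ["A", "B"],
}

def get_ancestors(node, parents=PARENTS):
    """Return the node itself plus every population reachable by following parent links."""
    seen = set()
    stack = [node]
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(parents[current])
    return seen

def most_recent_common_ancestors(nodes, parents=PARENTS):
    """Return the most recent common ancestors of the given populations.

    A common ancestor is dropped if it is a strict ancestor of another
    common ancestor, so more than one population can be returned.
    """
    common = set.intersection(*(get_ancestors(n, parents) for n in nodes))
    older = set()
    for candidate in common:
        older |= get_ancestors(candidate, parents) - {candidate}
    return common - older

print(most_recent_common_ancestors(["D", "E"]))
print(most_recent_common_ancestors(["D", "F"]))
```

Because recombinants have several parents, two populations can share more than one most recent common ancestor, as with D and F in this toy graph.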

plot: combine different breakpoints in one plot

Sometimes, sequence quality will cause isolates of the same recombinant to have slightly different breakpoints, ex.

XCU_XBC.1_FL.23_22228-22576
XCU_XBC.1_FL.23_22330-22576

We really need to think of a way to combine these into one plot if desired...
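
One possible approach, sketched here as an assumption rather than a decided design, is to group results by recombinant and parents and keep the widest breakpoint interval per group. The parser assumes the unique key has the form recombinant_parent1_parent2_start-end with a single breakpoint and no underscores in population names. The function names are hypothetical.

```python
from collections import defaultdict

def parse_unique_key(key):
    """Split 'XCU_XBC.1_FL.23_22228-22576' into (recombinant, parents, (start, end)).

    Assumes one breakpoint per key and no underscores in population names.
    """
    recombinant, *parents, coords = key.split("_")
    start, end = (int(x) for x in coords.split("-"))
    return recombinant, tuple(parents), (start, end)

def merge_breakpoints(keys):
    """Group keys by recombinant and parents, keeping the widest breakpoint interval."""
    groups = defaultdict(list)
    for key in keys:
        recombinant, parents, interval = parse_unique_key(key)
        groups[(recombinant, parents)].append(interval)
    return {
        group: (min(s for s, _ in intervals), max(e for _, e in intervals))
        for group, intervals in groups.items()
    }

keys = ["XCU_XBC.1_FL.23_22228-22576", "XCU_XBC.1_FL.23_22330-22576"]
merged = merge_breakpoints(keys)
print(merged)
```

Taking the widest interval is the most conservative choice. The narrowest shared interval would be another option if every isolate is trusted equally.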

Algorithm Deep Dive: XJ

  • XJ is designated as BA.1* and BA.2*.
  • rebar also finds evidence for BA.1 and XV (recursive recombination).
  • Hypotheses:
    • DesignatedRecombinant (BD.1, BA.2.65): score=57, conflict=1
    • NonRecursiveRecombinant (BD.1, BA.2.65): score=57, conflict=1
    • KnockoutRecombinant (BA.1: consensus of BD.1, XE, XL, XV): score=58, conflict=1

The fact that the Knockout BA.1 consensus is a mixture of BD.1 (BA.1.17.2.1), XE, and XL is suspicious.

[Images: Designated | Recursive]

input: stream alignment and show progress bar

When working with large datasets (VirusSeq) it would be nice to monitor progress. We could also use multithreading for it (I think), if we don't mind the output being in a different order from the input.
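
The idea of streaming can be sketched with a generator that yields one record at a time, so that progress can be reported as records pass through. This is a Python illustration only and says nothing about how rebar reads alignments internally. The function names and the `every` reporting interval are hypothetical.

```python
import io
import sys

def stream_fasta(handle):
    """Yield (name, sequence) records one at a time without loading the whole file."""
    name, chunks = None, []
    for line in handle:
        line = line.strip()
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            name, chunks = line[1:], []
        elif line:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)

def process_with_progress(handle, every=2, out=sys.stderr):
    """Count records as they stream past, reporting progress every `every` records."""
    count = 0
    for count, (name, seq) in enumerate(stream_fasta(handle), start=1):
        if count % every == 0:
            print(f"processed {count} sequences", file=out)
    return count

ALIGNMENT = ">s1\nACGT\nACGT\n>s2\nAC-T\n>s3\nNNNN\n"
records = list(stream_fasta(io.StringIO(ALIGNMENT)))
total = process_with_progress(io.StringIO(ALIGNMENT))
```

Without a first pass over the file the total number of sequences is unknown, so a streaming reader can report a running count but not a percentage.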

benchmark: compilation flags for flamegraph

Refer to: c80a888, 967674e

  • Get ready for benchmarking and profiling with flamegraph!
  • Some reminder notes for myself, about what a nightmare it was to install perf for WSL2.

Perf

  1. Check WSL2 kernel version in powershell.

    wsl --version
  2. Download the source code for the matching kernel release: https://github.com/microsoft/WSL2-Linux-Kernel/releases

    Extraction takes a very long time! ☕

    wget https://github.com/microsoft/WSL2-Linux-Kernel/archive/refs/tags/linux-msft-wsl-<VER>.tar.gz
    tar -xf linux-msft-wsl-<VER>.tar.gz
  3. Install the compilation dependencies.

    "To actually make sense of the perf record, and get the interactive menu, also install these on top of flex and bison to let perf demangle binaries": @tbarbette Source, + commands from @MondayCha.

    sudo apt update
    sudo apt install flex bison 
    sudo apt install libdwarf-dev libelf-dev libnuma-dev libunwind-dev \
        libnewt-dev libdwarf++0 libelf++0 libdw-dev libbfb0-dev \
        systemtap-sdt-dev libssl-dev libperl-dev python-dev-is-python3 \
        binutils-dev libiberty-dev libzstd-dev libcap-dev libbabeltrace-dev
  4. Compile and install perf.

    cd WSL2-Linux-Kernel-linux-msft-wsl-<VER>
    cd tools/perf
    make JOBS=1
    
    sudo cp perf /usr/local/bin
  5. Enable perf for unprivileged users.

    "Lower the perf_event_paranoid value in proc to an appropriate level for your environment. The most permissive value is -1 but may not be acceptable for your security needs etc..." Source

    "More information can be found at 'Perf events and tool security' document": https://www.kernel.org/doc/html/latest/admin-guide/perf-security.html

    echo -1 | sudo tee /proc/sys/kernel/perf_event_paranoid

Flamegraph

  1. Install flamegraph.

    cargo install flamegraph
  2. Build for flamegraph.

    cargo build --profile=flamegraph --target x86_64-unknown-linux-musl
  3. Profile.

    perf record target/x86_64-unknown-linux-musl/flamegraph/rebar run --dataset-dir dataset/sars-cov-2/2023-11-30 --populations "*" --output-dir output/flamegraph
    
    flamegraph -o flamegraph_sars-cov-2.svg --perfdata perf.data

Troubleshooting

  1. Profiling a pre-compiled binary doesn't yield informative function names.

    "It seems like you might be stripping away the debug symbols? Do you maybe have strip = true somewhere in Cargo.toml or maybe in some other build script?" @xzaramurd Source

  2. flamegraph hangs after perf record due to an issue collapsing stacks. Might be solved with:

Algorithm Deep Dive: XD

I want to write documentation about how the algorithm works (ex. run.md) with a case study. SARS-CoV-2 recombinant XD often confuses me, so I'll work through some of the results here.

image

  • XD is designated as B.1.617.2* and BA.1*.
  • The "majority" parent is B.1.617.2*, as only a ~3-5 kb section comes from a secondary parent.
  • Prior to designation, XD samples were classified as Delta 21J. However, the UShER phylogeny has them placed as BA.1.15 descendants. Probably because the ~3-5 kb is in the Spike, which is so mutation-rich.
    [Images: Public UShER | GISAID UShER]
  • rebar thinks B.1.617.2 and XS have more support.
  • Who is wrong, rebar or our prior knowledge? (Let's assume rebar for now, to critique the method)
  • Hypotheses:
    • DesignatedRecombinant (BA.1, B.1.617.2): score=20, conflict=18
    • NonRecursiveRecombinant: (BA.1, B.1.617.2* consensus of various AY.*, BA.1): score=41, conflict=8
    • RecursiveRecombinant: (XD, ???): No evidence
    • KnockoutRecombinant: (XS, B.1.617.2* consensus of various AY.*) score=35, conflict=7

These results tell me that:

  • The primary parent is not B.1.617.2 strict, a consensus of various AY.* has way higher scores/less conflict.
  • Non-Recursive Recombinant (BA.1, B.1.617.2) seems like it should be "best", with the highest score (41) and almost the lowest conflict (8).
  • However, there is a large conflict range (18 - 7 = 11) between hypotheses. In cases such as this, rebar prefers the hypothesis that minimizes conflict rather than the one that maximizes support. This is why KnockoutRecombinant with XS was being picked as best. This decision needs to be re-assessed, as I never liked it in the first place.
  • This min_conflict strategy was originally developed to deal with XBB* recursive recombinants, because the original recombination (XBB=BJ.1 and CJ.1) would often have the highest support but a LOT of conflict.
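
The selection rule described above can be sketched as follows, using the XD scores from this issue. This is a simplified illustration, not rebar's actual code: the `Hypothesis` class, the `pick_best` function, and especially the `max_conflict_range` threshold are assumptions, since the issue does not say what counts as a "large" conflict range.

```python
from dataclasses import dataclass

@dataclass
class Hypothesis:
    name: str
    score: int      # support for the hypothesis
    conflict: int   # mutations that contradict the hypothesis

def pick_best(hypotheses, max_conflict_range=10):
    """Pick by minimum conflict when conflicts vary widely, otherwise by maximum score.

    max_conflict_range is a hypothetical threshold, not a documented rebar setting.
    """
    conflicts = [h.conflict for h in hypotheses]
    conflict_range = max(conflicts) - min(conflicts)
    if conflict_range > max_conflict_range:
        return min(hypotheses, key=lambda h: (h.conflict, -h.score))
    return max(hypotheses, key=lambda h: (h.score, -h.conflict))

xd = [
    Hypothesis("DesignatedRecombinant", score=20, conflict=18),
    Hypothesis("NonRecursiveRecombinant", score=41, conflict=8),
    Hypothesis("KnockoutRecombinant", score=35, conflict=7),
]
best = pick_best(xd)
print(best.name)
```

With the assumed threshold of 10, the conflict range of 11 triggers the minimum-conflict branch and the Knockout hypothesis wins. Raising the threshold (for example `pick_best(xd, max_conflict_range=20)`) falls back to maximum support and selects the Non-Recursive hypothesis instead.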

Speed regression

Somewhere between 4756dce and 967674e, I made changes that had significant speed impacts :( The sars-cov-2 dataset went down from 100 sequences/sec to < 1 seq/sec 😢 Going to walk back those changes to identify the problem.

libssl.so.3: cannot open shared object file

Now that we're in the land of compiled languages (Rust), it's time to troubleshoot dynamic library linking on other systems.

error while loading shared libraries: libssl.so.3: cannot open shared object file: No such file or directory

export: linelist operate on single result not vec

Refer to: a4154f2

  • The linelist method expects a Vector of results as input.
  • We instead want it to be a singular result (consensus population and potential recombination).
  • This will allow us to write output in realtime as sequences are processed (ex. Issue #13).

Efficiency improvements

Currently, recombination detection is slow at 5 seconds / sequence. Multiprocessing helps (--threads), but code efficiency improvements are certainly needed.

XV misclassified as XJ

XV samples are occasionally being classified as XJ, despite the fact that the parents and breakpoints correctly match XV.

[Image: XV Classified]

[Image: XJ Classified]

Nextclade

[Image: Nextclade]

[Image: XV Classified]

[Image: XJ Classified]

Add recombinant to plot

Currently (v0.2.0) the barcodes/plots only have the parents and the samples. If a sample is a known recombinant, it would be very helpful to have the recombinant population plotted as well.
