ratschlab / genome_graph_annotation

Sparse Binary Relation Representations for Genome Graph Annotation

License: GNU General Public License v3.0

Genome Graph Annotation Schemes

Sparse binary relation representations for genome graph annotation

Reference

Mikhail Karasikov, Harun Mustafa, Amir Joudaki, Sara Javadzadeh-No, Gunnar Rätsch, and André Kahles. Sparse Binary Relation Representations for Genome Graph Annotation. Apr 2020. Journal of Computational Biology, 27(4), 626-639. http://doi.org/10.1089/cmb.2019.0324

This repository implements the following schemes for representing graph annotation:

  • Column-major compressed
  • Row-major flat
  • Rainbowfish
  • BinRel-WT
  • BRWT
  • Multi-BRWT
  • ...
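To make the first two schemes concrete, here is a toy sketch of the same binary relation (pairs of graph node row and label column) stored column-major and row-major. It is an illustration only: the function names are hypothetical, and plain Python lists stand in for the compressed bit vectors an actual implementation would use.

```python
from collections import defaultdict


def column_major(pairs):
    """Map each column (label) to the sorted list of rows that carry it."""
    columns = defaultdict(list)
    for row, col in sorted(pairs):
        columns[col].append(row)
    return dict(columns)


def row_major(pairs):
    """Map each row (graph node) to the sorted list of columns set in it."""
    rows = defaultdict(list)
    for row, col in sorted(pairs):
        rows[row].append(col)
    return dict(rows)


def labels_of_row_column_major(columns, row):
    """Row query on the column-major layout: every column has to be checked."""
    return sorted(col for col, rows in columns.items() if row in rows)


# A 4 x 3 relation given as (row, column) pairs; row 2 is empty.
relation = {(0, 0), (0, 2), (1, 1), (3, 0), (3, 1), (3, 2)}
by_column = column_major(relation)
by_row = row_major(relation)
```

In this toy version a row query is a single lookup in the row-major layout (`by_row.get(3, [])`), while the column-major layout has to visit every column. That trade-off is one reason to compare several representations.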

This repository is no longer maintained. Check out the MetaGraph project for a significantly optimized and scaled-up implementation of Multi-BRWT as well as many other graph annotation representations.

As an underlying graph structure, the following representations are implemented:

  1. Hash-based de Bruijn graph
  2. Complete de Bruijn graph, taking constant space
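The difference between the two graph structures can be sketched as follows. This is a minimal illustration, not the repository's C++ code. The class and method names are hypothetical, and the alphabet is assumed to be ACGT.

```python
ALPHABET = "ACGT"  # assumption: DNA alphabet without N


class HashDBG:
    """Hash-based de Bruijn graph: stores only the k-mers seen in the input."""

    def __init__(self, k):
        self.k = k
        self.index = {}  # k-mer -> node index, in order of first occurrence

    def add_sequence(self, seq):
        for i in range(len(seq) - self.k + 1):
            self.index.setdefault(seq[i:i + self.k], len(self.index))

    def outgoing(self, kmer):
        """Successors overlap `kmer` by k-1 characters and must be stored."""
        return [kmer[1:] + c for c in ALPHABET if kmer[1:] + c in self.index]


class CompleteDBG:
    """Complete de Bruijn graph: all 4^k k-mers exist, so only k is stored."""

    def __init__(self, k):
        self.k = k

    def kmer_to_index(self, kmer):
        """Read the k-mer as a base-4 number; no lookup table is needed."""
        idx = 0
        for c in kmer:
            idx = idx * 4 + ALPHABET.index(c)
        return idx

    def outgoing(self, kmer):
        return [kmer[1:] + c for c in ALPHABET]


hash_graph = HashDBG(3)
hash_graph.add_sequence("ACGTAC")  # k-mers: ACG, CGT, GTA, TAC
complete_graph = CompleteDBG(3)
```

The hash-based graph grows with the number of distinct k-mers. The complete graph computes node indexes arithmetically, which is why its space does not depend on the input.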

Comparison

The table below shows the final sizes of two compressed binary relations:

  • Kingsford with 3.7 billion rows and 2,652 columns, density ~0.3%
  • RefSeq (family) with 1 billion rows and 3,173 columns, density ~3.8%

Method             Kingsford, Gb   RefSeq, Gb
Column             36.6            80.2
Flat               41.2            121.6
BinRel-WT          49.6            N/A
BinRel-WT (sdsl)   31.4            150.6
Rainbowfish        23.2            136.6
BRWT               14.1            57.2
Multi-BRWT         9.9             43.6
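As a quick reading aid, the reported sizes can be turned into size ratios relative to the Column baseline. The sketch below only does arithmetic on the numbers listed above. The helper names are hypothetical, and the BinRel-WT row without a RefSeq value is left out.

```python
# method -> (Kingsford, RefSeq) sizes, copied from the comparison above
sizes = {
    "Column": (36.6, 80.2),
    "Flat": (41.2, 121.6),
    "BinRel-WT (sdsl)": (31.4, 150.6),
    "Rainbowfish": (23.2, 136.6),
    "BRWT": (14.1, 57.2),
    "Multi-BRWT": (9.9, 43.6),
}


def ratio_vs_column(method):
    """How many times smaller `method` is than Column, per dataset."""
    base = sizes["Column"]
    return tuple(round(b / s, 2) for b, s in zip(base, sizes[method]))


def approx_set_bits(rows, columns, density):
    """Rough number of set bits implied by the stated shape and density."""
    return rows * columns * density


multi_brwt_gain = ratio_vs_column("Multi-BRWT")        # (3.7, 1.84)
kingsford_bits = approx_set_bits(3.7e9, 2652, 0.003)   # about 2.9e10
refseq_bits = approx_set_bits(1e9, 3173, 0.038)        # about 1.2e11
```

With these numbers, Multi-BRWT is about 3.7 times smaller than Column on Kingsford and about 1.8 times smaller on RefSeq. The set-bit counts are estimates because the densities are only given approximately.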

Prerequisites

  • cmake 3.6.1
  • GNU GCC with C++17 (gcc-8 or higher) or LLVM Clang (clang-7 or higher)
  • HTSlib
  • boost
  • folly (optional)

All of these can be installed with brew or linuxbrew.

For compiling with GNU GCC:

brew install gcc autoconf automake libtool cmake make htslib
brew install --build-from-source boost
# optional: folly and its dependencies
brew install --build-from-source double-conversion gflags glog lz4 snappy zstd folly
brew install gcc@8

Then set the environment variables accordingly:

echo "\
# Use gcc-8 with cmake
export CC=\"\$(which gcc-8)\"
export CXX=\"\$(which g++-8)\"
" >> $( [[ "$OSTYPE" == "darwin"* ]] && echo ~/.bash_profile || echo ~/.bashrc )

For compiling with LLVM Clang:

brew install llvm libomp autoconf automake libtool cmake make htslib boost folly

Then set the environment variables accordingly:

echo "\
# OpenMP
export LDFLAGS=\"\$LDFLAGS -L$(brew --prefix libomp)/lib\"
export CPPFLAGS=\"\$CPPFLAGS -I$(brew --prefix libomp)/include\"
# Clang C++ flags
export LDFLAGS=\"\$LDFLAGS -L$(brew --prefix llvm)/lib -Wl,-rpath,$(brew --prefix llvm)/lib\"
export CPPFLAGS=\"\$CPPFLAGS -I$(brew --prefix llvm)/include\"
export CXXFLAGS=\"\$CXXFLAGS -stdlib=libc++\"
# Path to Clang
export PATH=\"$(brew --prefix llvm)/bin:\$PATH\"
# Use Clang with cmake
export CC=\"\$(which clang)\"
export CXX=\"\$(which clang++)\"
" >> $( [[ "$OSTYPE" == "darwin"* ]] && echo ~/.bash_profile || echo ~/.bashrc )

Compile

  1. Clone the repository: git clone --recursive https://github.com/ratschlab/genome_graph_annotation
  2. Make sure all submodules are downloaded: git submodule update --init --recursive
  3. Install the third-party libraries from external-libraries/ following the corresponding instructions,
    or simply run the following script:
git submodule update --init --recursive

pushd external-libraries/sdsl-lite
./install.sh $(pwd)
popd

pushd external-libraries/libmaus2
cmake -DCMAKE_INSTALL_PREFIX:PATH=$(pwd) .
make -j $(($(getconf _NPROCESSORS_ONLN) - 1))
make install
popd
  4. Go to the build directory: mkdir -p build && cd build
  5. Compile: cmake .. && make -j $(($(getconf _NPROCESSORS_ONLN) - 1))
  6. Run the unit tests: ./unit_tests

Typical issues

  • Linking against dynamic libraries in Anaconda when compiling libmaus2
    • make sure that packages like Anaconda are not listed in the exported environment variables

Build types: cmake .. <arguments> where arguments are:

  • -DCMAKE_BUILD_TYPE=[Debug|Release|Profile] -- build modes (Release by default)
  • -DBUILD_STATIC=[ON|OFF] -- link statically (OFF by default)
  • -DWITH_AVX=[ON|OFF] -- compile with support for AVX instructions (ON by default)

Typical workflow

  1. Build de Bruijn graph from Fasta files, FastQ files, or KMC k-mer counters:
    ./annograph build
  2. Annotate graph using the column compressed annotation:
    ./annograph annotate
  3. Transform the built annotation to a different annotation scheme:
    ./annograph transform_anno
  4. Merge annotations (optional):
    ./annograph merge_anno
  5. Query the annotated graph:
    ./annograph classify

Example

./annograph build -k 12 -o tiny_example ../tests/data/tiny.fa

./annograph annotate -i tiny_example --anno-filename -o tiny_example ../tests/data/tiny.fa

./annograph classify -i tiny_example -a tiny_example.column.annodbg ../tests/data/tiny.fa

./annograph stats -a tiny_example tiny_example

Scripts

For real benchmarking scripts, see the scripts directory.

Experiments on simulated matrices

Compressed simulated binary relation matrices can be generated using the script experiments/run_benchmarks.py. Given a column count $N_COLUMNS, the three available simulation modes $MODE are

  1. norepl: a uniformly random matrix of size 1,000,000 x $N_COLUMNS,
  2. uniform_rows: 200,000 rows of size $N_COLUMNS duplicated 5 times to form a 1,000,000 x $N_COLUMNS matrix, and
  3. uniform_columns: $N_COLUMNS / 5 columns of size 1,000,000 duplicated 5 times to form a 1,000,000 x $N_COLUMNS matrix.
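The three modes can be sketched in pure Python as below. This is not the code of run_benchmarks.py. The function signatures, the `density` parameter, and the choice to place duplicates next to each other are assumptions made for the illustration.

```python
import random


def random_bits(n, density, rng):
    """n independent bits, each set with probability `density`."""
    return [1 if rng.random() < density else 0 for _ in range(n)]


def norepl(n_rows, n_cols, density, rng):
    """Every entry drawn independently: no replicated rows or columns."""
    return [random_bits(n_cols, density, rng) for _ in range(n_rows)]


def uniform_rows(n_rows, n_cols, density, rng, copies=5):
    """n_rows / copies random rows, each repeated `copies` times."""
    base = norepl(n_rows // copies, n_cols, density, rng)
    return [list(row) for row in base for _ in range(copies)]


def uniform_columns(n_rows, n_cols, density, rng, copies=5):
    """n_cols / copies random columns, each repeated `copies` times."""
    base = norepl(n_rows, n_cols // copies, density, rng)
    return [[bit for bit in row for _ in range(copies)] for row in base]


rng = random.Random(0)  # fixed seed so the toy matrices are reproducible
plain = norepl(20, 10, 0.3, rng)
dup_rows = uniform_rows(20, 10, 0.3, rng)
dup_cols = uniform_columns(20, 10, 0.3, rng)
```

All three results have the same shape (here 20 x 10 instead of 1,000,000 x $N_COLUMNS). They differ only in how much redundancy a compressor can exploit across rows or across columns.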

The compressor $METHOD can be one of: brwt, bin_rel_wt, bin_rel_wt_sdsl, column, rbfish, flat.

An experiment can then be run with the command

run_benchmarks.py $METHOD $MODE $N_COLUMNS $N_THREADS

When . is passed in place of $METHOD and/or $MODE, all methods/modes are run.
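The effect of the . placeholder can be pictured with the small sketch below. It only mirrors the behaviour described above using the method and mode names listed in this section. The `expand` helper is hypothetical and is not taken from run_benchmarks.py.

```python
from itertools import product

METHODS = ["brwt", "bin_rel_wt", "bin_rel_wt_sdsl", "column", "rbfish", "flat"]
MODES = ["norepl", "uniform_rows", "uniform_columns"]


def expand(method, mode):
    """Return every (method, mode) pair selected by the two arguments."""
    methods = METHODS if method == "." else [method]
    modes = MODES if mode == "." else [mode]
    return list(product(methods, modes))


all_runs = expand(".", ".")          # every method in every mode
column_runs = expand("column", ".")  # one method, all three modes
```

Under this reading, passing . for both arguments corresponds to 6 x 3 = 18 method/mode combinations.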

For method brwt, additional parameters can be passed at the end of the command. These can be one of

  1. --arity <N> generate BRWT of arity N, or
  2. --greedy 1 --relax <N> greedy optimization of column arrangement before construction of a BRWT of maximum arity N
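To give an intuition for what the arity controls, here is a toy BRWT sketch. It assumes the layout in which each node stores the OR of its columns as an index vector and passes only the nonzero rows on to at most `arity` children. The class is hypothetical: it uses plain lists instead of compressed bit vectors with rank support, and splits columns into contiguous groups rather than optimizing their arrangement.

```python
class BRWTNode:
    """Toy BRWT node over a list of equally long 0/1 columns."""

    def __init__(self, columns, col_ids, arity):
        if arity < 2:
            raise ValueError("arity must be at least 2")
        n = len(columns[0])
        # Index vector: OR of all columns assigned to this node.
        self.index = [int(any(col[i] for col in columns)) for i in range(n)]
        self.col_ids = list(col_ids)
        self.children = []
        if len(columns) > 1:
            # Children only see the rows where the index vector is set.
            kept = [i for i, bit in enumerate(self.index) if bit]
            reduced = [[col[i] for i in kept] for col in columns]
            step = -(-len(columns) // arity)  # ceiling division
            for s in range(0, len(columns), step):
                self.children.append(
                    BRWTNode(reduced[s:s + step], col_ids[s:s + step], arity))

    def row(self, i):
        """Return the ids of the columns set in row i."""
        if not self.index[i]:
            return []
        if not self.children:
            return list(self.col_ids)
        rank = sum(self.index[:i])  # position of row i among the kept rows
        return [c for child in self.children for c in child.row(rank)]


# Columns of a 4 x 4 matrix; row 1 is all zeros.
columns = [[1, 0, 0, 1], [0, 0, 1, 1], [0, 0, 1, 0], [1, 0, 0, 0]]
binary_tree = BRWTNode(columns, [0, 1, 2, 3], arity=2)
flat_tree = BRWTNode(columns, [0, 1, 2, 3], arity=4)
```

Both trees answer `row(i)` identically. A higher arity gives a shallower tree with more children per node, and a lower arity gives a deeper tree with more index vectors along each path.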

All resulting matrices are saved to the simulate folder in the directory where the script is run.

Figures from manuscript

To reproduce the simulated matrix experiment results from the manuscript, run the following commands

for N_COLUMNS in 500 1000 3000; do
    run_benchmarks.py . . $N_COLUMNS $N_THREADS
    run_benchmarks.py brwt . $N_COLUMNS $N_THREADS --greedy 1 --relax 7
done

To plot all data for the figures in the experiments, run the command

run_benchmarks.py plot $N_COLUMNS

An alternative set of methods can be passed as subsequent arguments if desired, for example

run_benchmarks.py plot $N_COLUMNS brwt_arity_2 brwt_greedy_relax_6 bin_rel_wt
