hera-1's Introduction

Developed by BioTuring, hera is a bioinformatics tool for analyzing RNA-seq data. With a single command, hera provides:

  • Base-to-base alignment BAM file
  • Transcript abundance estimation
  • Fusion gene detection with fused sequence assemblies

Each step in hera was carefully organized and optimized to maximize performance in terms of both time and accuracy.

Example data

We designed a test using 20 datasets from the Synapse DREAM Challenge SMC-RNA, each of which contains 60 million read pairs. The test was run on a 32-core machine running Ubuntu 14.04. The results are shown in the table below:

                                            Transcriptome    Transcriptome + Genome
Alignment
  Mapped reads                              93.3860%         93.3871%
  Memory                                    8GB              30GB
Abundance estimation compared with truth
  Spearman                                  0.9033           0.9057
  Pearson                                   0.9951           0.9951
Gene fusion compared with truth (one value reported per row)
  True positive                             0.6960
  False negative                            0.304
  False positive                            0.0595
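
The table reports both Pearson and Spearman correlations for the abundance estimates. As a reminder of how the two differ, here is a minimal pure-Python sketch. It is not hera's evaluation script; the function names are illustrative. Pearson measures linear agreement of the values, while Spearman is Pearson computed on ranks, so it is more sensitive to ordering errors among low-abundance transcripts.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)


def average_ranks(values):
    """1-based ranks; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    return pearson(average_ranks(x), average_ranks(y))
```

For a monotone but non-linear relationship, `spearman` returns 1 while `pearson` stays below 1.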

Core algorithm

Alignment

In hera, alignment starts with a hash-based approach that is applied to all reads to anchor gene fragments. The Needleman–Wunsch algorithm is then used to fill in the gaps between these anchored seeds. This approach reduces mapping time without hurting precision. The transcriptome coordinates of each alignment are then converted to genome coordinates. This step provides much better accuracy for splice detection than mapping reads directly onto a reference genome.
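
To illustrate the gap-filling step, here is a minimal sketch of textbook Needleman–Wunsch global alignment. It is not hera's implementation: the scoring values (match +1, mismatch -1, linear gap -1) and the function name are assumptions chosen for readability.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-1):
    """Globally align strings a and b; return (score, aligned_a, aligned_b)."""
    n, m = len(a), len(b)
    # score[i][j] = best score for aligning a[:i] with b[:j]
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(
                score[i - 1][j - 1] + sub,  # align a[i-1] with b[j-1]
                score[i - 1][j] + gap,      # gap in b
                score[i][j - 1] + gap,      # gap in a
            )
    # Trace back from the bottom-right cell to recover one optimal alignment.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        sub = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub:
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return score[n][m], "".join(reversed(out_a)), "".join(reversed(out_b))
```

Because the seeds already pin down most of the read, the dynamic programming only has to run on the short stretches between anchors, which is where the time saving comes from.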

On the other hand, hera can still perform conventional genome mapping, because the available transcriptome annotations are incomplete. Any reads that cannot be mapped properly to the transcriptome are remapped to the genome afterwards. The procedure is the same as for transcriptome mapping, except that the hash-based method is replaced with a Burrows-Wheeler Transform algorithm.
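
For readers unfamiliar with the Burrows-Wheeler Transform, the sketch below builds the transform naively (by sorting rotations) and counts exact pattern occurrences with backward search. It only illustrates the general technique; hera's actual index layout is not described here, and a real genome index would use a suffix array with sampled occurrence tables rather than these quadratic helpers.

```python
from collections import Counter


def bwt(text):
    """Burrows-Wheeler Transform of text, using '$' as the end marker."""
    text += "$"
    rotations = sorted(text[i:] + text[:i] for i in range(len(text)))
    return "".join(rotation[-1] for rotation in rotations)


def count_occurrences(bwt_str, pattern):
    """Count exact matches of pattern in the original text via backward search."""
    counts = Counter(bwt_str)
    # first[c] = number of characters in the text strictly smaller than c
    first, total = {}, 0
    for c in sorted(counts):
        first[c] = total
        total += counts[c]

    def occ(c, i):
        # Occurrences of c in bwt_str[:i]; linear scan for clarity only.
        return bwt_str[:i].count(c)

    lo, hi = 0, len(bwt_str)  # current interval of matching sorted rotations
    for c in reversed(pattern):
        if c not in first:
            return 0
        lo = first[c] + occ(c, lo)
        hi = first[c] + occ(c, hi)
        if lo >= hi:
            return 0
    return hi - lo
```

The memory cost of such an index grows with the reference size, which is consistent with genome mapping needing far more memory than transcriptome-only mapping.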

Abundance estimation

Transcript abundances are estimated with an expectation–maximization (EM) algorithm, accelerated with the SQUAREM procedure (Varadhan, R. & Roland, C. Scand. J. Stat. 35, 335–353 (2008)).
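
The following is a toy sketch of EM abundance estimation with one common form of the SQUAREM extrapolation step. It is a simplified model, not hera's code: all transcripts are assumed to have equal length, each read is represented only by the tuple of transcript indices it is compatible with, and the function names are illustrative.

```python
import math


def em_step(theta, reads):
    """One EM update of transcript proportions theta."""
    new = [0.0] * len(theta)
    for compat in reads:
        denom = sum(theta[t] for t in compat)
        for t in compat:
            # E-step: split the read among compatible transcripts.
            new[t] += theta[t] / denom
    # M-step: proportions are the normalized expected read counts.
    return [x / len(reads) for x in new]


def squarem_step(theta, reads):
    """Two EM steps, a squared extrapolation, then one stabilizing EM step."""
    x0 = theta
    x1 = em_step(x0, reads)
    x2 = em_step(x1, reads)
    r = [b - a for a, b in zip(x0, x1)]
    v = [c - 2 * b + a for a, b, c in zip(x0, x1, x2)]
    norm_r = math.sqrt(sum(d * d for d in r))
    norm_v = math.sqrt(sum(d * d for d in v))
    if norm_v == 0.0:
        return x2
    # Step length; alpha = -1 reproduces plain EM (x_new == x2).
    alpha = min(-1.0, -norm_r / norm_v)
    x_new = [a - 2 * alpha * d + alpha * alpha * e for a, d, e in zip(x0, r, v)]
    if min(x_new) <= 0.0:
        # Safeguard: extrapolation left the valid region, fall back to EM.
        return x2
    return em_step(x_new, reads)


def estimate_abundance(reads, n_transcripts, tol=1e-10, max_iter=1000):
    """Iterate SQUAREM-accelerated EM from a uniform start until convergence."""
    theta = [1.0 / n_transcripts] * n_transcripts
    for _ in range(max_iter):
        nxt = squarem_step(theta, reads)
        if max(abs(a - b) for a, b in zip(theta, nxt)) < tol:
            return nxt
        theta = nxt
    return theta
```

For example, with two reads unique to transcript 0, one read shared by both transcripts, and one read unique to transcript 1, `estimate_abundance([(0,), (0,), (0, 1), (1,)], 2)` converges to proportions close to 2/3 and 1/3, the maximum-likelihood solution of this toy model.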

Fusion detection

To detect fusions, hera keeps track of abnormally mapped reads. These reads are divided into groups according to their potential fusion site, and each group is assembled into a super contig. The contigs are then mapped back onto the reference genome to reveal the fusion gene pairs.
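
The grouping step can be pictured with the sketch below. Everything in it is hypothetical: the record layout `(read_id, gene_a, pos_a, gene_b, pos_b)`, the fixed-size position window, and the function name are assumptions made for illustration, and the assembly of each group into a super contig is not shown.

```python
from collections import defaultdict


def group_by_fusion_site(abnormal_reads, window=100):
    """Group abnormally mapped reads by gene pair and coarse breakpoint position.

    abnormal_reads: iterable of (read_id, gene_a, pos_a, gene_b, pos_b),
    a hypothetical record format. Reads whose two ends fall in the same
    pair of position windows are treated as supporting the same site.
    """
    groups = defaultdict(list)
    for read_id, gene_a, pos_a, gene_b, pos_b in abnormal_reads:
        key = (gene_a, pos_a // window, gene_b, pos_b // window)
        groups[key].append(read_id)
    return dict(groups)
```

Fixed windows can split reads that straddle a window boundary into separate groups; a real implementation would more likely cluster on the distance between positions.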

Build requirements:

Install:

1. git clone https://github.com/kspham/hera.git
2. cd hera/
3. chmod +x build.sh
4. ./build.sh

Usage:

INDEX:

./hera/build/hera_build
        --fasta genome_sequence.fa (text file only)
        --gtf annotation_file.gtf
        --outdir path/to/output_directory
[OPTIONAL]
        --full_index

By default, hera needs ~8GB of memory for transcriptome-only indexing. Full genome indexing needs ~30GB. Pre-built human genome indexes are also available for download: GRCh37.75, GRCh38.82
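
If you drive the indexing step from a script, a small wrapper like the following may help. It is only a sketch: the helper name and the default binary path are assumptions, and it uses nothing beyond the flags listed above.

```python
import subprocess


def hera_build_command(fasta, gtf, outdir, full_index=False,
                       binary="./hera/build/hera_build"):
    """Assemble the hera_build argument list from the documented flags."""
    cmd = [binary, "--fasta", fasta, "--gtf", gtf, "--outdir", outdir]
    if full_index:
        # Optional full genome index (needs roughly 30GB instead of 8GB).
        cmd.append("--full_index")
    return cmd


def run_hera_build(fasta, gtf, outdir, full_index=False):
    """Run the indexing step and raise CalledProcessError if it fails."""
    subprocess.run(hera_build_command(fasta, gtf, outdir, full_index), check=True)
```

Passing the arguments as a list avoids shell quoting problems with paths that contain spaces.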

RUN:

./hera/build/hera quant -i path/to/index_directory [OPTIONAL] read1.fastq read2.fastq

[OPTIONAL]:
  -o [output directory] (default: ./)
  -t [number of running threads] (default: 1)
  -z [level of bam file compression (1 - 9)] (default: -1)
  -b [number of bootstraps] (default: 1)
  -w [Output bam file 0: true, 1: false] (default: 0)
  -f [Genome fasta file]
  1. Index directory: the directory containing the index files from the previous indexing step.

  2. Genome fasta file: if not defined, genome mapping is skipped. Mapping on the transcriptome needs ~8GB of memory, while mapping with the genome needs ~30GB.

  3. Output files include:

  • abundance.tsv : Transcript abundance estimation (tsv file)
  • abundance.h5 : Transcript abundance estimation and bootstrapping results (hdf5 file)
  • fusion.bedpe : Fusion detection result
  • transcript.bam : Alignment result
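
As a rough illustration of how the options above combine, the sketch below assembles a `hera quant` call for a paired-end sample. The helper name, its keyword arguments, and the default binary path are assumptions; only the documented flags are used, and options left as `None` are omitted so hera falls back to its own defaults.

```python
import subprocess


def hera_quant_command(index_dir, read1, read2, outdir=None, threads=None,
                       bootstraps=None, genome_fasta=None,
                       binary="./hera/build/hera"):
    """Assemble a `hera quant` argument list from the documented options."""
    cmd = [binary, "quant", "-i", index_dir]
    if outdir is not None:
        cmd += ["-o", outdir]
    if threads is not None:
        cmd += ["-t", str(threads)]
    if bootstraps is not None:
        cmd += ["-b", str(bootstraps)]
    if genome_fasta is not None:
        # Enables genome mapping in addition to transcriptome mapping.
        cmd += ["-f", genome_fasta]
    # Read files come last, after all optional flags.
    cmd += [read1, read2]
    return cmd


def run_hera_quant(*args, **kwargs):
    """Run hera quant and raise CalledProcessError on a non-zero exit."""
    subprocess.run(hera_quant_command(*args, **kwargs), check=True)
```

After a successful run, the files listed above (abundance.tsv, abundance.h5, fusion.bedpe, transcript.bam) should appear in the chosen output directory.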

Third-party

hera includes some third-party software:

Contacts

Please report any issues directly to the GitHub issue tracker. You can also send your feedback to [email protected]

License

MIT license

Copyright (c) BioTuring Inc. 2017 All rights reserved. This Hera 1.0 version is freely accessible for both academic and industry users.

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
