shao-group / aletsch

Assembler for multiple RNA-seq samples

License: BSD 3-Clause "New" or "Revised" License

rna-seq transcriptome-assembly meta-assembly single-cell

Introduction

Aletsch implements an efficient algorithm to assemble multiple RNA-seq samples (or multiple cells for single-cell RNA-seq data). The datasets and scripts used to compare the performance of Aletsch with other assemblers are available at aletsch-test.

Version v1.1.1

We released Aletsch v1.1.1, which substantially reduces memory usage and running time compared with the previous version, v1.1.0, while maintaining identical assembly accuracy. The improvement comes primarily from fixing an incorrect use of BAM-file queries and from removing PCR duplicates. The tables below detail the memory usage and running time, both CPU time and wall-clock time (10 threads), of the two versions across all datasets we tested (see aletsch-test).

Memory Usage Comparison (in GB):

| Dataset | v1.1.1 | v1.1.0 |
| --- | --- | --- |
| BK-H1 | 6.96 | 35.55 |
| BK-H2 | 11.79 | 64.44 |
| BK-H3 | 5.32 | 34.35 |
| BK-M1 | 21.47 | 168.01 |
| SC-H1&3 | 4.12 | 47.23 |
| SC-H2 | 24.43 | 251.81 |
| SC-M1 | 9.27 | 82.93 |

CPU and Wall-Clock Time Comparison (in minutes):

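| Dataset | v1.1.1 (CPU) | v1.1.1 (Wall) | v1.1.0 (CPU) | v1.1.0 (Wall) |
| --- | --- | --- | --- | --- |
| BK-H1 | 219 | 27 | 541 | 53 |
| BK-H2 | 923 | 96 | 1319 | 135 |
| BK-H3 | 155 | 17 | 258 | 28 |
| BK-M1 | 691 | 73 | 1464 | 169 |
| SC-H1&3 | 186 | 21 | 167 | 20 |
| SC-H2 | 1077 | 129 | 1530 | 183 |
| SC-M1 | 382 | 44 | 441 | 52 |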
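
To make the size of the improvement easier to read, the following sketch computes the v1.1.0 / v1.1.1 ratio for memory and wall-clock time from the figures in the two tables above. The dictionary and function names are illustrative only and are not part of Aletsch.

```python
# Figures copied from the tables above, stored as (v1.1.1, v1.1.0).
MEMORY_GB = {
    "BK-H1": (6.96, 35.55),
    "BK-H2": (11.79, 64.44),
    "BK-H3": (5.32, 34.35),
    "BK-M1": (21.47, 168.01),
    "SC-H1&3": (4.12, 47.23),
    "SC-H2": (24.43, 251.81),
    "SC-M1": (9.27, 82.93),
}
WALL_MINUTES = {
    "BK-H1": (27, 53),
    "BK-H2": (96, 135),
    "BK-H3": (17, 28),
    "BK-M1": (73, 169),
    "SC-H1&3": (21, 20),
    "SC-H2": (129, 183),
    "SC-M1": (44, 52),
}


def fold_change(table):
    """Return v1.1.0 / v1.1.1 per dataset; a value above 1 favours v1.1.1."""
    return {name: old / new for name, (new, old) in table.items()}


memory_fold = fold_change(MEMORY_GB)
wall_fold = fold_change(WALL_MINUTES)

for name in MEMORY_GB:
    print(f"{name}: memory ratio {memory_fold[name]:.2f}, "
          f"wall-clock ratio {wall_fold[name]:.2f}")
```

On these numbers the memory ratio is above 5 for every dataset, while the wall-clock ratio is above 1 for every dataset except SC-H1&3, where the two versions are within one minute of each other.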

Installation

Aletsch can be installed through conda or by compiling source (see INSTALLATION).

Usage

The usage of aletsch is:

./aletsch -i <input-bam-list> -o <output.gtf> [options]

We highly recommend generating profiles for individual samples first:

./aletsch --profile -i <input-bam-list> -p <profile>
./aletsch -i <input-bam-list> -o <output.gtf> -p <profile> [options]

Format of Input and Output

Each line of input-bam-list describes a single sample with three space-separated fields: the alignment file (in .bam format), the alignment index file (in .bai format), and the protocol. The index file can be generated with samtools (e.g., samtools index ...). The protocol is one of five options: single_end (Illumina single-end RNA-seq), paired_end (Illumina paired-end RNA-seq), pacbio_ccs (PacBio Iso-Seq CCS reads), pacbio_sub (PacBio Iso-Seq sub-reads), or ont (Oxford Nanopore RNA-seq). Aletsch uses different parameters and algorithms to process different data types.

Aletsch requires each input alignment file to be sorted; if a file is not sorted, sort it with samtools first (samtools sort input.bam > input.sort.bam).

The assembled transcripts from all these samples will be written to output.gtf, in standard .gtf format.
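
As an illustration of the input format described above, here is a minimal Python sketch that writes an input-bam-list file and rejects protocols outside the five listed options. The helper names (format_sample_line, write_bam_list) and the example paths are hypothetical and not part of Aletsch. The sketch does not check that the BAM files are sorted or that the index files exist.

```python
# The five protocol names listed in this README.
PROTOCOLS = {"single_end", "paired_end", "pacbio_ccs", "pacbio_sub", "ont"}


def format_sample_line(bam, bai, protocol):
    """Return one input-bam-list line: '<bam> <bai> <protocol>'."""
    if protocol not in PROTOCOLS:
        raise ValueError(f"unknown protocol: {protocol!r}")
    for path in (bam, bai):
        # Fields are separated by spaces, so a path containing whitespace
        # would be split into extra fields.
        if any(ch.isspace() for ch in path):
            raise ValueError(f"path contains whitespace: {path!r}")
    return f"{bam} {bai} {protocol}"


def write_bam_list(samples, list_path):
    """Write one line per (bam, bai, protocol) tuple and return the lines."""
    lines = [format_sample_line(*sample) for sample in samples]
    with open(list_path, "w") as handle:
        handle.write("\n".join(lines) + "\n")
    return lines
```

For example, write_bam_list([("s1.sort.bam", "s1.sort.bam.bai", "paired_end")], "samples.list") produces a one-line list file that can then be passed to aletsch with -i.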

Options

Aletsch provides several options for transcript assembly, supporting both its unique parameters and those required by the core algorithm of Scallop. For a detailed list, execute aletsch without arguments.

| Parameters | Type | Default Value | Description |
| --- | --- | --- | --- |
| --help | | | Displays Aletsch usage information and exits. |
| --version | | | Shows Aletsch version information and exits. |
| --profile | | | Profiles individual samples and exits. Writes to files if -p is specified. |
| -l | string | | Specifies chromosomes to assemble. |
| -L | string | | Specifies a file containing a list of chromosomes to assemble. |
| -d | string | | Output directory for individual sample transcripts. Directory must exist prior to execution. |
| -p | string | | Directory for reading/saving individual sample profiles. Directory must exist prior to execution. |
| -t | integer | 10 | Number of threads. |
| -c | integer | 200 | Maximum number of splice graphs in a cluster, recommended as twice the number of samples. |
| -s | float | 0.2 | Minimum similarity for combining two splice graphs. |

  • If -l string or -L file option is provided, Aletsch assembles only the specified chromosomes; otherwise, it assembles all chromosomes.
  • Directories specified by -d and -p must exist before running Aletsch; the tool does not create directories.
  • With --profile, Aletsch infers profiles of individual samples, using the XS tag from input BAM files.
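
Because Aletsch does not create the -d and -p directories itself, a small wrapper can create them before the run. The sketch below is one possible way to do this in Python: it creates both directories and assembles the argument lists for the two-step workflow (profiling, then assembly), setting -c to twice the number of samples as recommended in the table. The function names and the default ./aletsch path are assumptions for illustration; only flags documented in this README are used.

```python
import os


def count_samples(bam_list_path):
    """Count the non-empty lines (one per sample) in an input-bam-list file."""
    with open(bam_list_path) as handle:
        return sum(1 for line in handle if line.strip())


def build_aletsch_commands(bam_list, output_gtf, profile_dir, sample_gtf_dir,
                           num_samples, threads=10, aletsch="./aletsch"):
    """Create the -p and -d directories and return (profile_cmd, assemble_cmd)."""
    # Aletsch expects both directories to exist before it starts.
    os.makedirs(profile_dir, exist_ok=True)
    os.makedirs(sample_gtf_dir, exist_ok=True)

    profile_cmd = [aletsch, "--profile", "-i", bam_list, "-p", profile_dir]
    assemble_cmd = [
        aletsch, "-i", bam_list, "-o", output_gtf,
        "-p", profile_dir,
        "-d", sample_gtf_dir,
        "-t", str(threads),
        "-c", str(2 * num_samples),  # recommended: twice the number of samples
    ]
    return profile_cmd, assemble_cmd
```

Each returned list can be passed to subprocess.run(cmd, check=True), running the profiling command first and the assembly command second.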

Scoring Transcripts with Pre-trained Model

Aletsch employs a random forest model for scoring transcripts, available for download from Zenodo. Use the provided Python script score.py with this model.

Dependencies

Required Python libraries: NumPy, pandas, scikit-learn, joblib

  • Using pip:

    pip install numpy pandas scikit-learn joblib
  • Using conda (recommended):

    conda install numpy pandas scikit-learn joblib

Usage

Score transcripts with the syntax below:

python3 score.py -i <individual_gtf_dir> -m <pretrained_model.joblib> -c <num_of_samples> -p <min_probability_score> -o <output_score.csv>

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| -i | String | | Directory containing Aletsch's feature files (x.trstFeature.csv). This is the same directory where Aletsch outputs individual GTF files, as designated by the -d option in Aletsch's assembly process. |
| -m | String | | Path to the pre-trained model file for scoring. |
| -c | Integer | | Number of samples/cells |
| -p | String | 0.2 | Minimum probability score threshold (range: 0 to 1). |
| -o | String | | Output directory of scored .csv file. |

Assuming a collection of $n$ samples, the directory <individual_gtf_dir> contains a total of $n+1$ feature files, enumerated from 0.trstFeature.csv through to n.trstFeature.csv. Files 0.trstFeature.csv to (n-1).trstFeature.csv correspond to feature files for individual samples, sequentially from the first to the last sample. The file n.trstFeature.csv is derived from the combined graph.
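
The naming scheme above can be checked before scoring. The following sketch lists the $n+1$ expected feature files, reports any that are missing from a directory, and maps a file index back to its source. All function names are hypothetical; this is not part of score.py.

```python
import os


def expected_feature_files(num_samples):
    """Names 0.trstFeature.csv .. n.trstFeature.csv for n samples (n + 1 files)."""
    return [f"{i}.trstFeature.csv" for i in range(num_samples + 1)]


def missing_feature_files(individual_gtf_dir, num_samples):
    """Return the expected feature files that are absent from the directory."""
    present = set(os.listdir(individual_gtf_dir))
    return [name for name in expected_feature_files(num_samples)
            if name not in present]


def feature_file_source(index, num_samples):
    """Describe what a given feature-file index corresponds to."""
    if index == num_samples:
        return "combined graph"
    if 0 <= index < num_samples:
        # Indices 0..n-1 follow the sample order, first to last.
        return f"sample {index + 1}"
    raise ValueError(f"index {index} is out of range for {num_samples} samples")
```

An empty list from missing_feature_files(individual_gtf_dir, n) suggests the directory holds every file that the scoring step expects for $n$ samples.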

Contributors

shaomingfu

aletsch's Issues

Can I use aletsch to assemble single bulk rnaseq sample?

Hi,

I found that aletsch is much faster than scallop2. I want to know whether I can use it to assemble a single bulk RNA-seq sample (or a merged BAM of multiple bulk RNA-seq samples) instead of using scallop2.
Why or why not?

Best,
Kun
