Coder Social home page Coder Social logo

tf-chan-lab / lafite Goto Github PK

View Code? Open in Web Editor NEW
3.0 2.0 1.0 47.63 MB

tool for long read transcriptome assembly

License: Apache License 2.0

Jupyter Notebook 52.81% Python 46.97% CSS 0.23%
nanopore direct-rna-sequencing long-read transcriptome-assembly

lafite's Introduction

LAFITE

Low-abundance Aware Full-length Isoform clusTEr

Overview

LAFITE is designated to identify high-consensus full-length isoforms from Nanopore Direct RNA-seq data. LAFITE combines multiple features from reference annotation and DRS reads (TSS, TES, splicing junction, and read polyadenylation event) and is more sensitive to Low-abundance transcripts.

Prerequisites

Installation

To avoid potential conflicts, we recommend running LAFITE in a conda environment.

conda create -n LAFITE_env -c conda-forge python=3.11
conda activate LAFITE_env

# stable release
pip install LAFITE

# or the latest development version 
pip install git+https://github.com/TF-Chan-Lab/LAFITE

Usage

  1. Run minimap2 and samtools to generate alignment file in bam format

    minimap2 -ax splice -u f -k 14 -G 500000 --secondary=no REFERENCE_FA FASTQ > ALIGNMENT_SAM
    samtools view -bS ALIGNMENT_SAM|samtools sort - > ALIGNMENT_BAM
    

    LAFITE also supports other splicing-aware long read alignment tools.

  2. Run Nanopolish polya to generate read polyadenylation result (optional but recommend)
    Current long-read sequencing technologies (Nanopore cDNA/DRS or PacBio Iso-Seq) are all designed to capture RNA molecules with poly(A) tail. However, RNA fragmentation and pore blocking may bring a considerable part of truncated reads which will interfere downstream analysis. Therefore, LAFITE utilizes the read polyadenylation status reported by Nanopolish to filter reads that have completed the sequencing process.

    nanopolish index -d PATH_TO_FAST5 -s GUPPY_SEQUENCING_SUMMARY FASTQ
    nanopolish polya -t NUM_OF_THREADS -r FASTQ -b ALIGNMENT_BAM -g REFERENCE_FA > Nanopolish_PolyA_RES
    

    LAFITE also provides an alternative approach to estimate read polyadenylation status by scanning any poly(A) motifs that existed at the read 3'-end.

  3. Run LAFITE

    usage: lafite [-h] -b BAM [-B BEDTOOLS] -g GTF -f GENOME -o OUTPUT
              [-n MIN_COUNT_TSS_TES] [-i MIS_INTRON_LENGTH]
              [-c MIN_NOVEL_TRANS_COUNT] [-s MIN_SINGLE_EXON_COVERAGE]
              [-l MIN_SINGLE_EXON_LEN] [-L LABEL] [-p POLYA]
              [-m POLYA_MOTIF_FILE] [-r RELATIVE_ABUNDANCE_THRESHOLD]
              [-j SHORT_SJ_TAB] [-w SJ_CORRECTION_WINDOW] [--no_full_cleanup]
              [-t THREAD] [-T TSS_PEAK] [-d TSS_CUTOFF]
    
    Low-abundance Aware Full-length Isoform clusTEr
    
    optional arguments:
      -h, --help            show this help message and exit
      -b BAM                path to the alignment file in bam format
      -B BEDTOOLS           path to the executable bedtools
      -g GTF                path to the reference gene annotation in GTF format
      -f GENOME             path to the reference genome fasta
      -o OUTPUT             path to the output file
      -n MIN_COUNT_TSS_TES  minimum number of reads supporting a alternative TSS or TES, default: 3
      -i MIS_INTRON_LENGTH  length cutoff for correcting unexpected small intron, default: 150
      -c MIN_NOVEL_TRANS_COUNT
                            minimum occurrences required for a isoform from novel loci, default: 3
      -s MIN_SINGLE_EXON_COVERAGE
                            minimum read coverage required for a novel single-exon transcript, default: 4
      -l MIN_SINGLE_EXON_LEN
                            minimum length for single-exon transcript, default: 100
      -L LABEL              name prefix for output transcripts, default: LAFT
      -p POLYA              path to the file contains read Polyadenylation event
      -m POLYA_MOTIF_FILE   path to the polya motif file
      -r RELATIVE_ABUNDANCE_THRESHOLD
                            minimum abundance of the predicted multi-exon transcripts as a fraction of the
    						total transcript assembled at a given locus, default: 0.01
      -j SHORT_SJ_TAB       path to the short read splice junction file
      -w SJ_CORRECTION_WINDOW
                            edit distance to reference splicing site for splicing correction, default: 40
      --no_full_cleanup     keep all intermediate files
      -t THREAD             number of the threads, default: 4
      -T TSS_PEAK           path to the TSS peak file
      -d TSS_CUTOFF         minimum TSS distance for a transcript to be considered as a novel transcript
    
  • LAFITE can run with the following arguments:

    lafite -b ALIGNMENT_BAM -g REFERENCE_GTF -f REFERENCE_FA -o OUTPUT_GTF -t NUM_OF_THREADS -p Nanopolish_PolyA_RES
    
  • LAFITE can also run without the result from nanoplish polya. Then, a Poly(A) motif list must be provided for the corresponding species.
    We have provided the Poly(A) motif list for human and mouse retrieved from Tian et al. .

    lafite -b ALIGNMENT_BAM -g REFERENCE_GTF -f REFERENCE_FA -o OUTPUT_GTF -t NUM_OF_THREADS -m POLYA_MOTIFS_OF_SPECIES
    
  • LAFITE accepts the TSS peaks from 5'-end CAGE data for identifying high-confidence TSSs. Users can prepare the TSS data in the following format where:

    • The first column is the chromosome name
    • The second column is the 0-based start position of the TSS peak
    • The third column is the 1-based end position of the TSS peak
    • The fourth column is the strand information
  • LAFITE also accepts the splicing junctions from Illumina short read RNA-seq data to proof the long reads. LAFITE supports the SJ.out.tab from STAR aligner. Users can also prepare the splicing junctions in the following format where:

    • The first column is the chromosome name
    • The second column is the 0-based start position of the splicing junction
    • The third column is the 1-based end position of the splicing junction
    • The fourth column is the strand information

Development

LAFITE was developed following the fastai/nbdev framework.

lafite's People

Contributors

loganylchen avatar zjzace avatar

Stargazers

 avatar  avatar  avatar

Watchers

 avatar

Forkers

loganylchen

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.