Coder Social home page Coder Social logo

ryanyip-kat / flames Goto Github PK

View Code? Open in Web Editor NEW

This project forked from luyitian/flames

0.0 0.0 0.0 3.26 MB

Full-length transcriptome splicing and mutation analysis

License: GNU General Public License v3.0

Python 63.85% C++ 17.78% C 18.37%

flames's Introduction

FLAMES

Full-length transcriptome splicing and mutation analysis

Installation

The easiest way to install the dependencies for this package is using a conda environment. The scripts can then be cloned from git.

conda create -n FLAMES \
    python=2.7 samtools pysam minimap2 numpy editdistance \
    -c bioconda -c conda-forge
git clone https://github.com/LuyiTian/FLAMES.git

Usage for bulk data

Before using this software, remember to activate the FLAMES conda environment. The main scripts are the pipelines for single cells and bulk samples.

conda activate FLAMES
FLAMES/python/sc_long_pipeline.py --help
FLAMES/python/bulk_long_pipeline.py --help

An example has been included with a small subset of SIRV data in the examples folder.

PATH=$PWD/FLAMES/python:$PATH

cd examples/SIRV
bulk_long_pipeline.py \
    --gff3 data/SIRV_isoforms_multi-fasta-annotation_C_170612a.gtf \
    --genomefa data/SIRV_isoforms_multi-fasta_170612a.fasta \
    --outdir FLAMES_output \
    --config_file data/SIRV_config.json \
    --fq_dir data/fastq

# output data is in FLAMES_output
ls FLAMES_output

Usage for single cell data

For single cell, the first step is to find cell barcode in each long read. You can use a precompiled linux binary in src/bin/match_cell_barcode. If you encontered error, espically due to the different running environment. you can compile it from source code using the following command. Please use a C++ compiler that support C++11 features.

g++ -std=c++11 -lz -O2 -o match_cell_barcode ssw/ssw_cpp.cpp ssw/ssw.c match_cell_barcode.cpp kseq.h edit_dist.cpp

Then you can run ./match_cell_barcode without argument to print the help message. match_cell_barcode requires a folder that contains all fastq file (can be .gz file), a file name/path for the statistics of barcode matching, a csv file of cell barcode that used as reference and a file name/path to the output fastq.gz file. The cell barcode for 10x will be in filtered_feature_bc_matrix/barcodes.tsv.gz, please unzip the file and use barcodes.tsv as input. It also support scPipe cell barcode annotation file generated from sc_detect_bc function.

Next, after you get the fastq file from match_cell_barcode. you could run sc_long_pipeline.py with the following command.

usage: FLTSA [-h] -a GFF3 [-i INFQ] [-b INBAM] --outdir OUTDIR --genomefa
             GENOMEFA --minimap2_dir MINIMAP2_DIR [--config_file CONFIG_FILE]
             [--downsample_ratio DOWNSAMPLE_RATIO]

# semi-supervised isoform detection and annotation from long read data.
# output:
# outdir:
#   transcript_count.csv.gz   // transcript count matrix
#   isoform_annotated.filtered.gff3 // isoforms in gff3 format
#   transcript_assembly.fa // transcript sequence from the isoforms
#   align2genome.bam       // sorted bam file with reads aligned to genome
#   realign2transcript.bam // sorted realigned bam file using the
#                            transcript_assembly.fa as reference
#   tss_tes.bedgraph       // TSS TES enrichment for all reads (for QC)
################################################################

optional arguments:
  -h, --help            show this help message and exit
  -a GFF3, --gff3 GFF3  The gene annotation in gff3 format.
  -i INFQ, --infq INFQ  input fastq file.
  -b INBAM, --inbam INBAM
                        aligned bam file (should be sorted and indexed). it
                        will overwrite the `--infq` parameter and skip the
                        first alignment step
  --outdir OUTDIR, -o OUTDIR
                        directory to deposite all results in rootdir, use
                        absolute path
  --genomefa GENOMEFA, -f GENOMEFA
                        genome fasta file
  --minimap2_dir MINIMAP2_DIR, -m MINIMAP2_DIR
                        directory contains minimap2, k8 and paftools.js
                        program. k8 and paftools.js are used to convert gff3
                        to bed12.
  --config_file CONFIG_FILE, -c CONFIG_FILE
                        json configuration files (default
                        config_sclr_nanopore_default.json)
  --downsample_ratio DOWNSAMPLE_RATIO, -d DOWNSAMPLE_RATIO
                        downsampling ratio if performing downsampling analysis

configuration file

FLAMES provides a default set of parameters, but can be changed by the configuration JSON file. The pipeline_parameters section specifies which step to be excuated in the pipeline, by default you should go though all steps. The isoform_parameters section determines the results isoform detection, some key parameters include Min_sup_cnt which means transcript with less read aligned than Min_sup_cnt will be discarded, MAX_TS_DIST which will merge transcripts with the same intron chain and TSS/TES distance less than MAX_TS_DIST. strand_specific will specify whether the the read is strand specific, such as the reads are in the same strand as the mRNA (1) or the reverse complement (-1), or the reads are not strand specific (0), which means the method will determine the strand information based on reference annotation.

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.