scinpas's Introduction

SCINPAS (Single Cell Identification of Novel PolyA Sites)

Description

SCINPAS is a nextflow pipeline that identifies previously known and novel polyA sites directly from single cell RNA sequencing data.

Workflow

[Figure: general workflow]

[Figure: read classification into 5 categories]

analyses

There are different layers of analysis, so you need to use the relevant parameters when running the pipeline. Please refer to this figure:

Requirements

  1. The default directory structure is set as follows:

  2. Install nextflow and its dependencies:

mamba create -n nf-env nextflow

  3. Data must be single cell 3'end RNA sequencing data. At the moment, the pipeline supports 10X Genomics 3'end sequencing data.

  4. Make sure all scripts (python, nextflow) are located in the "src" folder.

  5. Make sure all mouse data (sample, negative controls, gtf and fasta) are located in the "data/mouse" folder.

  6. Make sure all human data (sample, negative controls, gtf and fasta) are located in the "data/human" folder.

  7. Make sure canonical_motives.csv is located in the "data" folder (it is common to both human and mouse).

  8. Input files must be named 10X_A_B.bam (with index 10X_A_B.bam.bai), where A and B are the parts of the sample name.

  9. The gtf file must be named genes.gtf.

  10. The reference genome must be named genome.fa (with index genome.fa.fai).

  11. The raw negative control must be named *10X_A_BUmiRaw.bam (with index *10X_A_BUmiRaw.bam.bai). There must be at least 1 letter before 10X to distinguish it from the input file; by default "A" is used.

  12. The deduplicated negative control must be named *10X_A_BUmiDedup.bam (with index *10X_A_BUmiDedup.bam.bai). There must be at least 1 letter before 10X to distinguish it from the input file; by default "A" is used.

The raw negative control is the raw bam file of one of the samples (the same data under a different name). The deduplicated negative control is the UMI-tools deduplicated version of one of the samples.

  13. The type1 and type2 parameters in the nextflow.config file refer to cell type 1 and cell type 2 in your dataset. Type1 is the default cell type, e.g. spermatocyte. Type2 is the cell type in which you expect changes in average terminal exon length and/or the number of intronic polyA sites, e.g. elongating spermatid. This is only relevant if you run "cell_type_analysis".

  14. catalog.bed is needed for computing the overlap between SCINPAS PAS and an existing, known pA catalog.

Note: folder structure/locations, gtf file, catalog, reference genome and result folder can be changed in the nextflow.config file. However, the input file format, the negative control format and the variable names in nextflow.config should not be changed, because downstream processes expect those names.

Note: if some input files are missing (e.g. control, cell type annotation, catalog.bed), the processes that need those files will not be executed. The remaining processes will still run.
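Because downstream processes depend on the file names, it can help to check them before launching the pipeline. The following is a minimal sketch, not part of SCINPAS: the function name classify_bam is hypothetical, and it assumes that the sample name parts A and B consist only of letters, digits and hyphens.

```python
import re

# Assumption: sample name parts A and B contain only letters, digits and hyphens.
PART = r"[A-Za-z0-9-]+"

# Patterns follow the naming rules in the requirements list.
# Controls need at least one character before "10X".
PATTERNS = {
    "input": re.compile(rf"^10X_{PART}_{PART}\.bam$"),
    "raw_control": re.compile(rf"^.+10X_{PART}_{PART}UmiRaw\.bam$"),
    "dedup_control": re.compile(rf"^.+10X_{PART}_{PART}UmiDedup\.bam$"),
}


def classify_bam(filename):
    """Return the category whose naming rule the file name matches, or None."""
    for kind, pattern in PATTERNS.items():
        if pattern.match(filename):
            return kind
    return None


if __name__ == "__main__":
    # Hypothetical file names, used only to exercise the patterns.
    for name in ["10X_P1_rep1.bam", "A10X_P1_rep1UmiRaw.bam", "sample.bam"]:
        print(name, "->", classify_bam(name))
```

A file for which classify_bam returns None does not follow any of the three naming rules and would need to be renamed.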

Command line

Note: execution is shown for a slurm cluster. Create and select another profile as needed.

Once you have created and activated the conda environment (conda activate nf-env), change into the src folder and run the nextflow command as follows:

  1. Running mouse samples:

    1.1. if you do not want to run any analysis: nextflow run main.nf -profile slurm -resume --sample_type "mouse"

    1.2. if you want to run the analysis without the cell type specific, overlap and gene coverage analyses: nextflow run main.nf -profile slurm -resume --sample_type "mouse" --analysis "yes"

    1.3. if you want to run the analysis including the cell type specific analysis: nextflow run main.nf -profile slurm -resume --sample_type "mouse" --analysis "yes" --cell_type_analysis "yes"

    1.4. if you want to run the analysis including the overlap analysis (comparison between SCINPAS-identified PAS and a pre-existing catalog): nextflow run main.nf -profile slurm -resume --sample_type "mouse" --analysis "yes" --overlap "yes"

    1.5. if you want to run the analysis including the gene coverage analysis: nextflow run main.nf -profile slurm -resume --sample_type "mouse" --analysis "yes" --g_coverage "yes"

    1.6. if you want to run all analyses: nextflow run main.nf -profile slurm -resume --sample_type "mouse" --analysis "yes" --cell_type_analysis "yes" --overlap "yes" --g_coverage "yes"

  2. Running human samples:

    2.1. if you do not want to run any analysis: nextflow run main.nf -profile slurm -resume --sample_type "human"

    2.2. if you want to run the analysis without the cell type specific, overlap and gene coverage analyses: nextflow run main.nf -profile slurm -resume --sample_type "human" --analysis "yes"

    2.3. if you want to run the analysis including the cell type specific analysis: nextflow run main.nf -profile slurm -resume --sample_type "human" --analysis "yes" --cell_type_analysis "yes"

    2.4. if you want to run the analysis including the overlap analysis (comparison between SCINPAS-identified PAS and a pre-existing catalog): nextflow run main.nf -profile slurm -resume --sample_type "human" --analysis "yes" --overlap "yes"

    2.5. if you want to run the analysis including the gene coverage analysis: nextflow run main.nf -profile slurm -resume --sample_type "human" --analysis "yes" --g_coverage "yes"

    2.6. if you want to run all analyses: nextflow run main.nf -profile slurm -resume --sample_type "human" --analysis "yes" --cell_type_analysis "yes" --overlap "yes" --g_coverage "yes"

  3. Running the pipeline in the background:

    By default, nextflow displays a progress report on the screen. If you do not want that, prefix the command with "nohup" and append "&" so that the pipeline runs in the background and the progress report is saved in a log file (nohup.out by default). An example command line is:

    nohup nextflow run main.nf -profile slurm -resume --sample_type "mouse" --analysis "yes" --cell_type_analysis "yes" --overlap "yes" --g_coverage "yes" &

  4. Note:

    Running the SCINPAS pipeline on the login node is not recommended, even though it assigns jobs to compute nodes. This is because nextflow displays a progress report on the screen, which can consume I/O extensively on the login node. Hence, it is recommended to log in to a compute node and run the pipeline there.
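The mouse and human commands listed above differ only in the sample type and in which optional analysis flags are switched on. As an illustration only, a small helper could assemble them. The function build_command is hypothetical and not part of SCINPAS, and it assumes that the optional analyses always require --analysis "yes", as in every example above.

```python
import shlex

# Optional analysis flags, taken from the commands listed above.
ANALYSIS_FLAGS = ("cell_type_analysis", "overlap", "g_coverage")


def build_command(sample_type, analysis=False, extras=(), profile="slurm"):
    """Assemble the nextflow command line as a list of arguments."""
    if sample_type not in ("mouse", "human"):
        raise ValueError("sample_type must be 'mouse' or 'human'")
    unknown = set(extras) - set(ANALYSIS_FLAGS)
    if unknown:
        raise ValueError(f"unknown analysis flags: {sorted(unknown)}")
    # Assumption: optional analyses are only meaningful with --analysis "yes".
    if extras and not analysis:
        raise ValueError("optional analyses require analysis=True")

    cmd = ["nextflow", "run", "main.nf", "-profile", profile, "-resume",
           "--sample_type", sample_type]
    if analysis:
        cmd += ["--analysis", "yes"]
    # Keep the flag order used in the examples above.
    for flag in ANALYSIS_FLAGS:
        if flag in extras:
            cmd += [f"--{flag}", "yes"]
    return cmd


if __name__ == "__main__":
    # Equivalent of command 1.6 (all analyses on mouse samples).
    print(shlex.join(build_command("mouse", analysis=True, extras=ANALYSIS_FLAGS)))
```

Returning a list rather than a string means the result can be passed directly to subprocess.run without shell quoting concerns.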

For more nextflow command-line options, refer to the nextflow documentation: https://www.nextflow.io/docs/latest

Contributors

dominikburri, ymoon06

scinpas's Issues

include license

The repository needs a license. Check here for some pointers and select accordingly.
The license should be added in the root directory.

Limit parallel tasks on cluster

By default, the slurm executor does not limit the number of tasks that are handled in parallel. See here: https://www.nextflow.io/docs/latest/config.html.

It would be good to set such a limit, e.g. queueSize = 256.

I don't know in which scope to set this, but the documentation above can help to resolve it. It is likely needed in its own scope, "executor", with queueSize as a setting.

Error that GTF file does not exist

Hi Youngbin and Dominik,

I tried running Nextflow on some public datasets and set the directory according to the requirements. However, I encountered an error indicating that the GTF file could not be found. I have verified that the path is correct and that the GTF file can be printed. Could you please help me resolve this issue?

Cheers,
Qian

Custom SAM tags not always with proper value type

In several places, SCINPAS adds custom SAM tags to mapped reads (from BAM files).
The custom tags are set with pysam's set_tag method, see documentation here.

Since no value type is given, it is deduced. Sometimes this does not work and, for example, intended integers are set as strings.

This can be circumvented by setting value_type according to the intended type, which is either integer "i" or string "Z".
It can also be that the value provided is not of the correct type, e.g. a string "1" is provided. This could be fixed by providing an actual python integer, e.g. by casting it with int("1").
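A possible shape for this fix is sketched below. The helper set_typed_tag and the tag name "XP" are hypothetical and not taken from the SCINPAS code. The sketch only relies on the read object offering set_tag(tag, value, value_type=...), which is what pysam's AlignedSegment provides.

```python
def set_typed_tag(read, tag, value, value_type):
    """Set a SAM tag with an explicit type instead of letting it be deduced.

    value_type is "i" for integer tags or "Z" for string tags.
    """
    if value_type == "i":
        # Cast so that e.g. the string "1" is stored as the integer 1.
        value = int(value)
    elif value_type == "Z":
        value = str(value)
    else:
        raise ValueError(f"unsupported value_type: {value_type!r}")
    read.set_tag(tag, value, value_type=value_type)


# Hypothetical usage with pysam (file name and tag are placeholders):
#
#   import pysam
#   with pysam.AlignmentFile("10X_A_B.bam", "rb") as bam:
#       for read in bam:
#           set_typed_tag(read, "XP", "1", "i")
```

Casting inside the helper covers both cases described above: a missing value type and a value passed with the wrong python type.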

Is SCINPAS compatible with STARsolo generated BAMs?

Hi Youngbin and Dominik,

Really neat approach! We're interested in trying out SCINPAS on some public scRNA-seq datasets to detect expression of a pre-defined list of target PAS. We've processed our data using STARsolo, but all mentions in the manuscript and README refer to CellRanger processed BAMs. Have you ever successfully run STARsolo processed BAMs through the SCINPAS workflow? If so, do you have any recommendations/advice on parameters/run modes prior to input with SCINPAS?

I'm aware that STARsolo has been designed to mimic the CellRanger outputs/workflow, but just wanted to check in before we give it a go.

Cheers,
Sam
