
sequana / rnaseq


RNA-seq, QC and differential analysis pipeline

License: BSD 3-Clause "New" or "Revised" License

Python 41.40% R 58.60%
sequana workflow rna-seq pipeline differential-analysis ngs

rnaseq's Introduction

SEQUANA

Python 3.8 | 3.9 | 3.10 | 3.11
How to cite:

Citations are important for us to carry on development. For the Sequana library (including the pipelines), please cite:

Cokelaer et al. (2017), 'Sequana': a Set of Snakemake NGS pipelines, Journal of Open Source Software, 2(16), 352, doi:10.21105/joss.00352

For the genome coverage tool (sequana_coverage): Desvillechabrol et al. (2018), detection and characterization of genomic variations using running median and mixture models. GigaScience, 7(12), 2018. https://doi.org/10.1093/gigascience/giy110

For Sequanix: Desvillechabrol et al., Sequanix: A Dynamic Graphical Interface for Snakemake Workflows. Bioinformatics, bty034, https://doi.org/10.1093/bioinformatics/bty034. Also available on bioRxiv (DOI: https://doi.org/10.1101/162701)

Sequana includes a set of pipelines for NGS (next-generation sequencing) covering quality control, variant calling, coverage, taxonomy and transcriptomics. We also ship Sequanix, a graphical user interface for Snakemake pipelines.

Pipelines and tools available in the Sequana project

name/github        description                                                  notes (PyPI, apptainers)
sequana_pipetools  Create and manage Sequana pipelines                          apptainers not required
sequana-wrappers   Set of wrappers to build pipelines                           not on PyPI; apptainers not required
demultiplex        Demultiplex your raw data                                    apptainers: license restriction
denovo             denovo sequencing data
fastqc             Get Sequencing Quality control
LORA               Map sequences on target genome
mapper             Map sequences on target genome
nanomerge          Merge barcoded (or unbarcoded) nanopore fastq and reporting
pacbio_qc          Pacbio quality control
ribofinder         Find ribosomal content
rnaseq             RNA-seq analysis
variant_calling    Variant Calling
multicov           Coverage (mapping)
laa                Long read Amplicon Analysis
revcomp            reverse complement of sequence data
downsampling       downsample sequencing data                                   apptainers not required
depletion          remove/select reads mapping a reference

Pipelines not yet released

name/github        description
trf                Find repeats
multitax           Taxonomy analysis

Please see the documentation for the up-to-date status of each pipeline.

Contributors

Maintaining Sequana would not have been possible without users and contributors. Each contribution has been an encouragement to pursue this project. Thanks to all:


Changelog

Version Description
0.17.0
  • viz submodules: remove easydev and cleanup scipy imports
  • remove the substractor utility (use sequana_depletion pipeline instead)
  • remove the unused get_max_gc_correlation function from bedtools.
  • Major change in VCF reader (freebayes). Removed freebayes_bcf_filter, which was redundant with freebayes_vcf_filter; replaced the scipy Fisher test with our own implementation. Removed useless VCF code.
  • Fixes rnadiff HTML report
  • speedup kegg enrichment using multiprocess
  • Allow sequana_taxonomy to download toydb and viruses_masking DBs from zenodo
0.16.9
  • Major fix on PCA and add batch effect plots in RNAdiff analysis
  • fixed the headers of the count matrix and DESeq2 output files, which were missing the index column name (no impact on the analysis; this only matters when opening the CSV files in Excel)
  • Taxonomy revisited to save taxonomy.dat in gzipped CSV format.
0.16.8
  • update IEM for more testing
  • better handling of error in RNADiff
  • Add new methods for ribodesigner
0.16.7
  • Stable release (fix doc), deprecated.
0.16.6
  • Refactor IEM to make it more robust with more tests.
0.16.5
  • refactor to use pyproject instead of setuptools
  • remove pkg_resources (future deprecation)
  • remove unused requirements (cookiecutter, adjusttext, docutils, mock, psutil, pykwalify)
  • cleanup resources (e.g. moving canvas/bar.py into viz)
0.16.4
  • hot fixes on RNAdiff reports and enrichments
0.16.3
0.16.2
  • save coverage PNG image (regression)
  • Update taxonomy/coverage standalone (regression) and more tests
0.16.1
  • hotfix missing module
0.16.0
  • add mpileup module
  • homogenization enrichment + fixup rnadiff
  • Complete refactoring of sequana coverage module. Allow sequana_coverage to handle small eukaryotes in a more memory efficient way.
  • use click for the sequana_taxonomy, sequana_coverage and sequana rnadiff commands
  • Small fixup on homer, idr and phantom modules (for chipseq pipeline)
0.15.4
  • add plot for rnaseq/rnadiff
0.15.3
  • add sequana.viz.plotly module. use tqdm in bamtools module
  • KEGG API changed. We update sequana to use headless server and keep the feature of annotated and colored pathway.
  • Various improvements on KEGG enrichment, including saving pathways, addition of a --comparison option in the sequana sub-command, plotly plots, etc.
0.15.2
  • ribodesigner can now accept an input fasta with no GFF assuming the fasta already contains the rRNA sequences
  • Fix IEM module when dealing with double indexing
  • Fix anchors in HTML reports (rnadiff module)
  • refactorise compare module to take several rnadiff results as input
  • enrichment improvements (export KEGG and GO as csv files)
0.15.1
  • Fix creation of images directory in modules report
  • add missing test related to gff
  • Fix #804
0.15.0
  • add logo in reports
  • RNADiff reports can now use shrinkage or not (optional)
  • remove useless rules now in sequana-wrappers
  • update main README to add LORA in list of pipelines
  • Log2FC values are now shrunken log2FC values in the volcano plot and report table. "NotShrinked" columns holding Log2FC and Log2FCSE prior to shrinkage are displayed in the report table.
0.14.6
  • add fasta_and_gff_annotation module to correct fasta and gff given a vcf file.
  • add macs3 module to read output of macs3 peak detector.
  • add idr module to read results of idr analysis
  • add phantom module to compute phantom peaks
  • add homer module to read annotation files from annotatePeaks
0.14.5
0.14.4
  • hotfix bug on kegg colorised pathways
  • Fix the hover_name in rnadiff volcano plot to include the index/attribute.
  • pin snakemake to be >=7.16
0.14.3
  • new fisher metric in variant calling
  • ability to use several feature in rnaseq/rnadiff
  • pin several libraries due to regression during installs
0.14.2
  • Update ribodesigner
0.14.1
  • Kegg enrichment: add gene list 'all' and fix incomplete annotation case
  • New uniprot module for GO term enrichment and enrichment refactorisation (transparent for users)
0.14.0
  • pinned click>=8.1.0 due to API change (autocomplete)
  • moved tests around to decrease packaging from 16 to 4Mb
  • ribodesigner: new plots, clustering and notebook
0.13.X
  • Remove useless standalones or moved to main sequana command
  • Move sequana_lane_merging into a subcommand (sequana lane_merging)
  • General cleanup of documentation, test and links to pipelines
  • add new ribodesigner subcommand
0.12.X

rnaseq's People

Contributors

cokelaer, khourhin

rnaseq's Issues

explanation of the naming convention in the config and pipeline when using bowtie2_mapping rule

bowtie2_mapping (and other rules) are dynamic rules. This means that we can use them several
times in a pipeline. To do so, we must set their name as follows:

exec(open(sequana.modules["bowtie2_mapping_dynamic"], "r").read())
include: bowtie2_mapping_dynamic("ref", manager)

So here the actual rule is called bowtie2_mapping_ref. However, the rule itself expects a section named bowtie2_mapping in the configuration file. Therefore, the config file should contain bowtie2_mapping and not bowtie2_mapping_ref.
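
The naming convention described above can be sketched in plain Python. This is only an illustration: the helper functions and the `threads` option are hypothetical and are not part of sequana; only the names bowtie2_mapping and bowtie2_mapping_ref come from the text.

```python
# Illustrative sketch of the dynamic-rule naming convention.
# Helper names and the "threads" option are hypothetical, not sequana's API.
BASE_RULE = "bowtie2_mapping"


def dynamic_rule_name(suffix: str) -> str:
    """Name given to one instance of the dynamic rule, e.g. suffix "ref"."""
    return f"{BASE_RULE}_{suffix}"


def config_section_for(rule_name: str) -> str:
    """Every instance of the dynamic rule reads the same config section."""
    if rule_name == BASE_RULE or rule_name.startswith(BASE_RULE + "_"):
        return BASE_RULE
    raise ValueError(f"{rule_name!r} is not an instance of {BASE_RULE}")


# The config file uses the base name, never the suffixed rule name.
config = {"bowtie2_mapping": {"threads": 4}}

rule_name = dynamic_rule_name("ref")
options = config[config_section_for(rule_name)]
```

Here `rule_name` is bowtie2_mapping_ref while the options are looked up under bowtie2_mapping, which is the mismatch the issue warns about.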

Error in indexing : do not create SAIndex output file

Hi,
Before August I could create the indexing of a new genome with a GFF file and a FA file when running sequana-rnaseq, but I tried last week and today with sequana version 0.9.5 and sequana_rnaseq 0.9.15 and I have the error in attached file, maybe due to --latency-wait.
What parameter should I change ?
slurm-31821486.txt

Thanks,
Pierre

improvements rnaseq pipeline

Future

Oct 2021

  • use new sequana-wrappers

Those requested features are for the rnadiff analysis, not sequana_rnaseq:

  • if possible, provide results with and without independent filtering
  • when using --force (rnadiff), we should remove previous DGE results; otherwise they will be added to the HTML reports
  • design: add a column 'alias name'

April/May/June 2021

  • if pvalue == 0, should set a value so that it can be seen in volcano plot
  • fastp tool to complement existing cutadapt trimming tool
  • add html entry point for the enrichment (if several comparisons) or several enrichments
  • refactorise sequana enrichment, maybe to have a syntax such as "sequana enrichment panther"
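
The first item of this list (p-values equal to 0 disappearing from the volcano plot) could be handled along these lines. This is a minimal sketch under an assumption: zeros are replaced by one tenth of the smallest non-zero p-value, which is one possible convention and not necessarily the one adopted in sequana. The function name is hypothetical.

```python
import math


def neg_log10_pvalues(pvalues):
    """Return -log10(p) for each p-value, flooring exact zeros.

    Assumption: zeros are replaced by a tenth of the smallest non-zero
    p-value so that they plot just above the most significant real point.
    """
    positive = [p for p in pvalues if p > 0]
    if not positive:
        raise ValueError("at least one non-zero p-value is needed to choose a floor")
    smallest = min(positive)
    # Guard against underflow to 0.0 when the smallest value is tiny.
    floor = smallest / 10 if smallest / 10 > 0 else smallest
    return [-math.log10(p if p > 0 else floor) for p in pvalues]


scores = neg_log10_pvalues([0.05, 1e-10, 0.0])
```

With this convention the zero p-value ends up one unit above the strongest non-zero point on the -log10 axis instead of being infinite.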

March 2021

  • better filtering for multiqc
  • main summary.html should have more features/summary/plots
  • check rnaseqc gtf input [catch missing GTF in the main.py and rnaseq.rules]. added a converter in sequana
    • gtf input (from GFF) for the prokaryotes case
    • gtf input (from GFF) for the eukaryotes case
  • salmon for eukaryotes tested on mm10
  • check the rnaseqc multiqc module. There is no need for the biomics fork anymore.

Jan 2021

  • BUG fix switch mark duplicates correctly for the qc and others
  • Better GFF handling with custom gff able to handle several feature types, sanity checks of user's choice on attribute and feature
  • Checked rnaseqc functionality and provided a gff2gtf parser in sequana.

Dec 2020

  • Fix issue of seg fault for bacterial genomes with star aligner
  • fastq_screen should work now. The only contaminant searched for is phiX; other genomes must be handled by users (who need to build the index themselves). Searching for phiX is now the default behaviour, since the pipeline should work out of the box
  • fix missing workflow image in the report.
  • add strandness plot in ./outputs directory and add the image in the summary plot
  • bowtie1/star/bowtie2 indexing are now stored in their own sub-directories
  • provide way to disable rRNA search
  • fix issue related to star index rule bug in sequana
  • rnadiff option is now set automatically to one_factor
  • add option --run to execute the pipeline without manual checking (batch mode)

Oct-Nov 2020

  • STAR indexing may emit the following warning:
    --genomeSAindexNbases 14 is too large for the genome size=4456448,
    which may cause seg-fault at the mapping step. Re-run genome generation with
    recommended --genomeSAindexNbases 10
  • a more generic title in the multiqc_config
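
The value recommended in the STAR warning above can be reproduced with the rule given in the STAR manual for small genomes, min(14, log2(GenomeLength)/2 - 1). The sketch below only illustrates that formula; the function name is hypothetical and this is not necessarily how the pipeline sets the option.

```python
import math


def recommended_sa_index_nbases(genome_length: int) -> int:
    """Apply the STAR manual rule min(14, log2(genome_length) / 2 - 1).

    The result is rounded down and kept at 1 or above for tiny inputs.
    """
    if genome_length < 1:
        raise ValueError("genome_length must be a positive number of bases")
    scaled = int(math.log2(genome_length) / 2 - 1)
    return max(1, min(14, scaled))


# Genome size taken from the warning message quoted in the list above.
bacterial = recommended_sa_index_nbases(4456448)
```

For the genome size in the warning (4456448 bases) this gives 10, which matches the value STAR recommends in the message.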

Sept 2020

  • Add tolerance for feature_counts in the pipeline and config file after fixing sequana featurecounts functions (v0.9.17)

Aug 2020

  • the do_indexing option is now pre-filled when instantiating the pipeline.
  • the salmon option validateMappings is deprecated and should be removed
  • salmon indexing included
  • refactorise the way feature counts are handled: no longer in the onsuccess section, but with simpler code from @khourhin, now included in sequana and in this pipeline as of version 0.9.16.

June/July 2020

  • Fix R1/R2 issue for rRNA
  • add mark duplicates in cluster config and set to False by default
  • add paired option for feature counts when paired data is provided.
  • add option to skip the fastqc on the raw data. This will be the default; the fastqc on the filtered data is kept by default.
  • cleanup the multiqc option to exclude fastqc_samples (to not clash with fastqc_filtered)

April-May 2020

  • if the input genome is larger than about 4 billion bases, the bowtie2 index files have the extension .bt2l (not .bt2); therefore, the sequana rule bowtie2_mapping should be updated, and this pipeline as well.
  • add input to the rnadiff analysis in ./rnadiff
  • a faster --help option
  • a --from-project option to import existing pipeline
  • a HTML custom front page
  • add feature counts as a single file
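
For the .bt2 versus .bt2l item above, one way to avoid hard-coding the extension is to look at which index files exist on disk. This is a sketch, not the sequana rule: the function name is hypothetical, and it relies only on the fact that bowtie2-build names its first index file `<prefix>.1.bt2` (small index) or `<prefix>.1.bt2l` (large index).

```python
import tempfile
from pathlib import Path


def bowtie2_index_extension(index_prefix) -> str:
    """Return ".bt2" or ".bt2l" depending on which index files exist."""
    prefix = str(index_prefix)
    for ext in (".bt2", ".bt2l"):
        if Path(prefix + ".1" + ext).exists():
            return ext
    raise FileNotFoundError(f"no bowtie2 index found for prefix {prefix!r}")


# Demonstration with an empty placeholder file standing in for a large index.
with tempfile.TemporaryDirectory() as tmp:
    prefix = Path(tmp) / "genome"
    Path(str(prefix) + ".1.bt2l").touch()
    detected = bowtie2_index_extension(prefix)
```

A rule could then build its expected output names from the detected extension rather than assuming .bt2.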

Jan 2020 - April 2020

  • integrate the biomix scripts to make the link with the differential analysis
  • add feature counts in separate directory ready to use by rnadiff
  • integrate salmon

Dec 2019 - Jan 2020

  • fix the RNAseQC rule, which is broken at the moment
  • check for rRNA feature name presence in the GFF
  • check for feature count type provide by the user
  • check config with schema
  • fix read tag
  • possibility to switch off cutadapt
  • fixing the bowtie2 config/pipeline name conflict (see #3)
  • Fixing indexing issue: indexing is done even though it was not asked for, or vice versa; when indexing is set to False, the pipeline fails with a cryptic message. We will provide better handling of the check on whether or not indexing is done.
  • include the schema file
  • parameter output-directory should be renamed output_directory in the multiqc section
  • handle the stdout correctly in the fastqc, bowtie2 and bowtie1 rules
  • allow rRNA feature and/or files with meaningful error message if the 2 options conflict
  • better multiconfig report (text/title)

config file not configured properly for sequanix

INFO:sequanix:Creating form based on config file
'software_choice'
<string>:19: (WARNING/2) Bullet list ends without a blank line; unexpected unindent.
Traceback (most recent call last):
  File "/home/cokelaer/Work/github/forked/sequanix/sequanix/sequanix.py", line 749, in _update_sequana
    self.create_base_form()
  File "/home/cokelaer/Work/github/forked/sequanix/sequanix/sequanix.py", line 1141, in create_base_form
    rule_box = Ruleform(rule, contains, count, keywords, specials=specials)
  File "/home/cokelaer/Work/github/forked/sequanix/sequanix/widgets/widgets.py", line 108, in __init__
    option_widget = NumberOption(option, value)
  File "/home/cokelaer/Work/github/forked/sequanix/sequanix/widgets/widgets.py", line 281, in __init__
    self.number.setValue(value)
OverflowError: argument 1 overflowed: value must be in the range -2147483648 to 2147483647
Aborted (core dumped)
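
The traceback above shows `setValue` rejecting a number outside the signed 32-bit range. One defensive option, shown here as an assumption and not as the fix adopted in sequanix, is to clamp the value before handing it to the widget. The helper name is hypothetical.

```python
# Bounds quoted in the OverflowError message above.
INT32_MIN = -2147483648
INT32_MAX = 2147483647


def clamp_to_int32(value: int) -> int:
    """Keep a config value inside the range accepted by a 32-bit spin box."""
    return max(INT32_MIN, min(INT32_MAX, value))


# A large config value (for example a genome size) that would overflow.
safe_value = clamp_to_int32(4_000_000_000)
```

Clamping changes the displayed number, so a text field may be a better widget for options that can legitimately exceed this range.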

tolerance argument in get_most_probable_strand_consensus() function

Hi,
I think there is an error in rnaseq.rules, line 439:
"probable_strand = fc.get_most_probable_strand_consensus(".", tolerance=tolerance)"
The get_most_probable_strand_consensus() function in the featurecounts package does not have a tolerance argument.
It's the get_most_probable_strand() function that does.
The pipeline thus throws me an error, as tolerance is an unknown argument.

Should I remove it in the rnaseq.rules file ?

Thanks,
Pierre
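
The mismatch reported above (a keyword argument passed to a function that does not accept it) can be detected before the call with `inspect.signature`. The sketch below is generic: the two stand-in functions are hypothetical and do not reproduce the real sequana signatures.

```python
import inspect


def call_with_supported_kwargs(func, *args, **kwargs):
    """Call func, dropping keyword arguments that its signature does not accept."""
    params = inspect.signature(func).parameters
    accepts_any = any(p.kind is inspect.Parameter.VAR_KEYWORD for p in params.values())
    if not accepts_any:
        kwargs = {name: value for name, value in kwargs.items() if name in params}
    return func(*args, **kwargs)


# Hypothetical stand-ins: one function takes "tolerance", the other does not.
def consensus_without_tolerance(path):
    return ("consensus", path)


def strand_with_tolerance(path, tolerance):
    return ("strand", path, tolerance)


first = call_with_supported_kwargs(consensus_without_tolerance, ".", tolerance=0.15)
second = call_with_supported_kwargs(strand_with_tolerance, ".", tolerance=0.15)
```

Silently dropping an argument can hide real bugs, so in a pipeline it is usually preferable to fix the call site once the correct signature is known.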

simplify pipeline

  • remove kraken
  • remove cutadapt
  • simplify the genome section for the end-user
  • make naming of rules consistent
  • Fix multiqc typo
  • Fix fastq_screen failure
  • cannot remove slurm files in cleaning part
  • create the proper featureCounts directory for sartools
  • remove clean_ngs
  • simplify entire pipeline

Error in rnaseq.rules

Hi Thomas,

I just came across a typo in rnaseq.rules, line 279:
Instead of
"adapter_tool = manager.config.cutadapt.software_choice",
which returns an error for now, it should be:
"adapter_tool = manager.config.trimming.software_choice"

Happy New Year !
Cheers,
Pierre

pipeline does not work if the fastp rule is deactivated

#######################################################
# sofware__choice = ["atropos", "cutadapt", "fastp"]
trimming:
    software_choice: fastp
    do: false
MissingInputException in line 351 of /pasteur/zeus/projets/p02/Biomics/Bioinfo/InProgress/B6766/analysis/rnaseq/rnaseq.rules:
Missing input files for rule sample_rRNA:
D080322_N_CD4_TCR_100_S2/fastp/D080322_N_CD4_TCR_100_S2_R1_.fastp.fastq.gz
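
The error above arises because sample_rRNA still expects the fastp output although trimming is switched off. A possible shape for the fix is to choose the input according to the `do` flag. This is a sketch under assumptions: the function name and the raw-data file pattern are hypothetical, and the trimmed-file pattern is generalised from the single path shown in the error message.

```python
def sample_rrna_input(sample: str, config: dict,
                      raw_pattern: str = "{sample}_R1_.fastq.gz") -> str:
    """Pick the R1 input of the rRNA step depending on the trimming section."""
    trimming = config.get("trimming", {})
    if trimming.get("do", False):
        tool = trimming["software_choice"]
        # Pattern generalised from the path in the error message (assumption).
        return f"{sample}/{tool}/{sample}_R1_.{tool}.fastq.gz"
    # Trimming disabled: fall back to the raw reads (hypothetical location).
    return raw_pattern.format(sample=sample)


config_off = {"trimming": {"software_choice": "fastp", "do": False}}
config_on = {"trimming": {"software_choice": "fastp", "do": True}}

raw_input_file = sample_rrna_input("S1", config_off)
trimmed_input_file = sample_rrna_input("S1", config_on)
```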

seg fault star mapping

Regression bug in the rnaseq pipeline: Not sure whether this is a star issue or a rnaseq pipeline issue

multiple feature counts

  • implement building of several feature counts for different features
  • check non-overlapping features

error in bowtie2_index.rules

Hi,
Running the new version of sequana_rnaseq, I had this error:
NameError in line 34 of /pasteur/homes/pilebury/miniconda3/envs/sequana/lib/python3.7/site-packages/sequana/rules/bowtie2_index/bowtie2_index.rules

It's due to an additional underscore in "__bowtie2_index__output_done". When I remove it, it works perfectly.

Cheers,
Pierre

Improve output table csv file per comparison

Add to the per-comparison output CSV files (COND1_vs_COND2_degs_DESeq2.csv):

  • annotation (present in rnadiff.csv but not in the per-comparison files)
  • Fix the mismatch between header and columns when the CSV files are imported into Excel
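
The header/column mismatch typically happens when the row index is written without a column name, so the header has one field fewer than the data rows. A minimal sketch of writing an explicit index label with the standard csv module follows; the function name, the "gene_id" label and the example columns are illustrative assumptions, not the actual rnadiff output.

```python
import csv
import io


def write_table(rows: dict, columns: list, index_label: str = "gene_id") -> str:
    """Write rows (index -> list of values) with a named index column."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    # The header gets one label per data field, including the index.
    writer.writerow([index_label, *columns])
    for key, values in rows.items():
        writer.writerow([key, *values])
    return buffer.getvalue()


text = write_table({"geneA": [1.5, 0.01]}, ["log2FoldChange", "padj"])
```

Every line of `text` now has the same number of fields, so spreadsheet software aligns the header with the data.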
