Differential gene expression analysis and pathway analysis of RNAseq data

License: MIT License

Dockerfile 0.86% HTML 2.18% R 6.36% Python 9.50% Nextflow 72.45% CSS 0.36% TeX 8.30%

deseq2 nextflow rnaseq pipeline pathway-analysis

rnadeseq's Introduction

qbic-pipelines/rnadeseq

Downstream differential gene expression analysis with DESeq2 package.

Introduction

The pipeline is built using Nextflow, a workflow tool for running tasks across multiple compute infrastructures in a portable manner. It comes with Docker containers, which make installation trivial and results highly reproducible.

Documentation

The qbic-pipelines/rnadeseq pipeline comes with documentation about the pipeline, found in the docs/ directory.

Credits

qbic-pipelines/rnadeseq was written by Gisela Gabernet (@ggabernet), Silvia Morini (@silviamorins) and Oskar Wacker (@WackerO), at QBiC. The DESeq2 scripts were originally written by @qbicStefanC.

The pipeline structure is based on the template by the nf-core project. For more information, please check out the nf-core website.

If you would like to contribute to this pipeline, please see the contributing guidelines.

rnadeseq's Issues

DESeq2 DE genes logFC threshold

By default, no logFC threshold is applied to DE genes, but a threshold can be set through a parameter of the Nextflow pipeline.

Filenames for RNAseq pipeline

For the pipeline to run properly, files need to be named after the sample QBiC code:

QXXXXNNNNN_whateveryoulike.ext

Otherwise MultiQC report and raw count tables will contain the wrong sample names! This is a pre-requisite for the pipeline.

Add this to docs and make everybody aware!
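A small pre-flight check could catch badly named files before the pipeline starts. The sketch below is only an illustration: the exact QBiC code pattern is an assumption (the letter "Q" followed by nine uppercase letters or digits, inferred from sample names such as QMFCE006AD that appear in these issues), and the function names are hypothetical.

```python
import re

# Assumed pattern: "Q" plus nine uppercase letters/digits, an underscore,
# a free-form part, and at least one file extension.
QBIC_FILENAME = re.compile(r"^(Q[A-Z0-9]{9})_[^/]+\.[^./]+$")


def qbic_code(filename):
    """Return the leading QBiC code, or None if the name does not match."""
    match = QBIC_FILENAME.match(filename)
    return match.group(1) if match else None


def badly_named(filenames):
    """List the filenames that do not start with a QBiC code."""
    return [name for name in filenames if qbic_code(name) is None]
```

Calling badly_named() on the input file list and stopping when the result is non-empty would surface the problem before wrong sample names reach the MultiQC report or the raw count tables.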

Add the offer as input file

The offer needs to be added as input file, as it is linked to in the last paragraph of the report (Summary and outlook).

Add DESeq2 versions to report

The package versions used by DESeq2 should be added to the report.
Add all the R packages to the "get_software_versions" process.

Pathview crash with some pathways

Pathview crashes with certain pathways because the original KEGG graphs for them are corrupt.
The solution will be to add a blacklist with these pathways: so far "mmu05206", "mmu04215".
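The blacklist idea can be sketched as a simple filter applied before the pathways are handed to Pathview. This is not the pipeline's actual R code; the function name is hypothetical and only the two pathway IDs come from this issue.

```python
# The two pathways named in this issue; more IDs can be added later.
KEGG_BLACKLIST = {"mmu05206", "mmu04215"}


def drop_blacklisted(pathway_ids, blacklist=KEGG_BLACKLIST):
    """Keep the input order, skipping IDs that are on the blacklist."""
    return [pid for pid in pathway_ids if pid not in blacklist]
```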

Report corrections

Section 3.2.1, Percentages tabs: correct typos.
Go through the report text with Gisela.
Table headers: show spaces instead of dots (ask Marie).
Mapping statistics: remove the text about error rate.

DE tables ordered

order DE gene list by padj value.
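One detail worth handling when ordering by padj is that DESeq2 reports NA for some genes. A minimal sketch, assuming the table has been read into a list of dicts with a "padj" key (the function name is hypothetical):

```python
import math


def sort_by_padj(rows):
    """Sort dict rows by ascending 'padj'; missing or NA values go last."""
    def key(row):
        try:
            value = float(row.get("padj"))
        except (TypeError, ValueError):
            # None or a non-numeric string such as "NA"
            return (1, 0.0)
        return (1, 0.0) if math.isnan(value) else (0, value)

    return sorted(rows, key=key)
```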

Paragraph for Summary and Outlook

All our reports should contain this at the end, thus please add it to the template (there might be some small revisions in the future, but...)

"The results for all work packages, as described in the quote (give link to quote) can be found in this report. Further support for this project will be restricted to the results presented in this report (e.g. requests to update/manipulate figures and tables).
For further analysis (e.g. the re-analysis of the dataset) we will generate a new quote containing cost estimates."

Relevel automatically

Remove further diagnostics plots from report folder

Issue with loading DESeq2 package with singularity

Tests of the pipeline pass locally and on travis, but not on cfc.
Inside the singularity shell the deseq2 package cannot be loaded even though it is in the environment.yml

Remove heatmap for highest variance genes

remove from DESeq2

Show top 20 pathways DE

Corrections cluster profiler

Heatmaps show gene name not gene ID
- For this: normalized count tables should have EnsemblID and gene_name
PCA plot double check, legend title: Sample grouping. Check for better colour palette.

pathway analysis database version

It would be great to track the version of the pathway analysis database to reproduce the results

make quote param optional

Volcano plots have capped logFC on the x axis

Hi there,

Just a small issue regarding the volcano plots: they are capped on the x-axis (logFC) from -5 to 5, which sometimes leaves out genes with a larger absolute logFC:

rnadeseq/assets/RNAseq_report.Rmd

Line 435 in abe7427

scale_x_continuous(limits = c(-5,5), breaks = c(-5:5)) +

I can have a look as it's a minor fix, I just want to keep a written trail.
Thanks and best,
Laurence
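One possible fix is to derive the axis limits from the data instead of hard-coding them. The sketch below shows the idea in Python rather than in the report's R code; the function name and the choice to keep 5 as a minimum bound are assumptions.

```python
import math


def volcano_x_limits(log_fold_changes, minimum=5):
    """Symmetric x-axis limits that cover every finite logFC value."""
    finite = [abs(v) for v in log_fold_changes if math.isfinite(v)]
    # Round each value up to a whole number so the axis breaks stay integers.
    bound = max([minimum] + [math.ceil(v) for v in finite])
    return (-bound, bound)
```

With this approach the plot keeps the familiar -5 to 5 range for typical data and only widens when a gene would otherwise be dropped.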

No legend on the .png output images

In DESeq2/results/plots the .png pictures have no legend, i.e. the sample names are not shown properly. The corresponding .pdf files do show them, but I have not managed to include those files in the report (they are not displayed).

Pathway analysis table gene names

The pathway analysis results table does not provide the gene names of the DE genes found in the pathway.

Folder structure REAC and KEGG

Add KEGG and REAC pathway results inside subfolders.

Add nf-core citation to report

filter columns on final table

How to name and explain better the filter columns of the final table.

Report folder gProfileR rename to pathway_analysis

Parameter --kegg_blacklist does not work

When adding a kegg pathway with --kegg_blacklist, the pipeline runs but ignores the parameter. When I hardcoded the pathway here it worked, so it is the append function here causing the issue.

Just creating a small issue so I have it noted somewhere, as soon as I have 5 minutes to test it out, I will fix this small bug :)
Thank you!
Laurence
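The intended behaviour, user-supplied pathways being added to the built-in blacklist, can be sketched as follows. This is a Python illustration and not the pipeline's R code; the comma-separated format of the parameter value and the function name are assumptions.

```python
DEFAULT_BLACKLIST = ["mmu05206", "mmu04215"]


def merge_blacklist(user_value, defaults=DEFAULT_BLACKLIST):
    """Return the defaults plus user-supplied IDs, without duplicates."""
    merged = list(defaults)  # copy, so the defaults are never mutated
    for pid in (user_value or "").split(","):
        pid = pid.strip()
        if pid and pid not in merged:
            merged.append(pid)
    return merged
```

Returning the merged list, rather than relying on an in-place append, makes it harder for the result to be silently discarded.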

normalized counts table header unmatching metadata conditions

The normalized counts table header names should match qbiccode + secondary name.
Otherwise downstream pathway analysis does not work.
Correct headers in DESeq2 script.

plotcounts plotting vst normalized counts

Vst normalized counts
In the future be able to choose vst or rlog depending on sizeFactor library deviation

Issue with whitespace in Heatmaps / Grouping names cannot have underscores

Gene list in report

If gene list is not provided as input, then the report does not show it.

Make quote optional

feature - The contrast list is not taken into account for boxplots

Hi there,

I am running the pipeline with 2 different conditions (treatment and patient) and have provided a contrast list :

factor  numerator       denominator
condition_treatment_type        A      B

referring only to treatment.
However, in my boxplots, my data is plotted according to condition and patient on the x axis.
Would it be logical to also plot them depending on the contrast list?
Best,
Laurence

Change param quote to offer

The parameter name is misleading, we are supposed to attach the offer, not the quote.

Update pipeline to template v13.3

nf-core/rnaseq salmon count table leads to failure

I used nf-core/rnaseq without STAR but with salmon. Also, I used a rather unusual .gff for bacterial with very limited information.

I could identify three problems:

The salmon output is CSV instead of TSV.
salmon occasionally produced floating-point numbers instead of integers. No idea why; this shouldn't happen, in my opinion.
Maybe because I used a non-Ensembl .gff in combination with --fc_group_features "transcript_id" when running nf-core/rnaseq, the header looked like this:

transcript_id	QMFCE006AD

instead of

Geneid	gene_name	QBICK031A9Aligned.sortedByCoord.out.bam

I solved the problem by converting the CSV to TSV, rounding floats to integers, and changing the header to

Ensembl_ID	gene_name	QMFCE006AD

I am not sure the header is really required that way.
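The three manual fixes described above can be sketched as one conversion step. This is an illustration under stated assumptions, not part of the pipeline: the function name is hypothetical, and since the issue does not say how the added gene_name column was filled, the sketch simply repeats the feature ID there.

```python
import csv
import io


def salmon_csv_to_count_tsv(csv_text):
    """Convert a comma-separated count table into the tab-separated layout."""
    rows = [row for row in csv.reader(io.StringIO(csv_text)) if row]
    header, body = rows[0], rows[1:]
    out = io.StringIO()
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    # The first column becomes Ensembl_ID; gene_name is an added column.
    writer.writerow(["Ensembl_ID", "gene_name"] + header[1:])
    for row in body:
        # Round fractional estimates to whole counts.
        counts = [str(int(round(float(value)))) for value in row[1:]]
        # Assumption: reuse the feature ID as gene_name.
        writer.writerow([row[0], row[0]] + counts)
    return out.getvalue()
```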

Print software versions

Instead, print the versions of all tools used in the R Markdown report at the end of the report, e.g. by printing sessionInfo() there.

Add optional Pathway analysis section to report

The optional section pathway analysis should be added to the report

DE_list_DESeq2.tsv file missing

It would be nice to have this file (list containing only the DE genes), together with the final_list_DESeq2.tsv, as output of the DESeq.v2.7.R script.

complete path: DESeq2/zips/DESeq2/results/final/DE_list_DESeq2.tsv

I would like to discuss this, though, before making a pull request.

output tables

log2count table needs to be removed.
output tables: separate count tables from DE gene list tables.
PCA plots with conditions
Sample distance heatmap with Secondary name.
Table names of normalized counts, etc. Which name should they have: _secondary name.
Final table rename to: complete stats table.

pathway analysis with one DE

Solutions for this:

Minimum number of DE genes to define a deregulated pathway: N genes (at least more than 1) as the default.
SCS multiple testing correction in addition to Benjamini-Hochberg: show both in the table, and select DE pathways based on SCS. Add the citation of the tool to the report.

Pathway analysis for metatranscriptomics

The Problem

The currently used program for pathway analysis gprofiler can't handle bacteria in general and applies only to isolates, i.e. a single species opposed to metatranscriptomics.

The Solution

Creating a community profile with HUMAnN2 (metaphlan2) and plot with krona
Calculating pathway abundances and presence/absence with HUMAnN2 (bowtie2, diamond) and identifying significantly different ones with MaAsLin2
Reporting KEGG orthologs and informative GO terms with HUMAnN2, and adding significance values with MaAsLin2

All this software fits into the existing container without conflicts.

At least three independent analysis could be possible:

Differential gene expression, required: --rawcounts
Pathway analysis isolate/single species, required: --rawcounts, --species
Community composition and pathway analysis for bacterial communities, required: pre-processed reads

Required changes to workflow

Addition of software (trivial)
Achieve maximal flexibility: all parameters need to be optional; the only exception might be --metadata.
New inputs for metatranscriptome samples (feature/pathway abundance):
-- Pre-processed (optimally rRNA depleted) reads e.g. from nf-core/rnaseq v1.4+ (with parameters --remove_rRNA & --save_nonrRNA_reads), required for meta-pathway analysis
-- Optional: databases (nucleotide, protein & utilities), default: automated download
New inputs for paired metatranscriptome - metagenome samples (feature/pathway expression):
-- Pre-processed metagenomics reads e.g. from nf-core/rnaseq v1.4+
-- Either a manifest file to link samples or same sample names but different folders

Conclusion

This would be a major increase in code / parameters and output.
Pathway abundance (only metatranscriptome) would be the first step to implement, followed by addition of pathway expression analysis (RNA & DNA measures).

edit: added section "three independent analysis"
edit2: nf-core/rnaseq v1.4 pre-processing is only valid for environmental samples! For host - microbiome studies the host sequences have to be removed too!

If a gene provided in --genelist is not among the differentially expressed genes, an error occurs

Hi there,

When providing genes that are not differentially expressed in --genelist, the following error occurs:

The following object is masked from ‘package:S4Vectors’:
      space
  The following object is masked from ‘package:stats’:
      lowess
  Registering fonts with R
  Attaching package: ‘limma’
  The following object is masked from ‘package:DESeq2’:
      plotMA
  The following object is masked from ‘package:BiocGenerics’:
      plotMA
  Exiting.
  estimating size factors
  estimating dispersions
  gene-wise dispersion estimates
  mean-dispersion relationship
  final dispersion estimates
  fitting model and testing
  Warning messages:
  1: In data.frame(count = cnts + pc, group = as.integer(group)) :
    NAs introduced by coercion
  2: In data.frame(count = cnts + pc, group = as.integer(group)) :
    NAs introduced by coercion
  3: In data.frame(count = cnts + pc, group = as.integer(group)) :
    NAs introduced by coercion
  Error in counts(dds, normalized = normalized, replaced = replaced)[gene,  :
    subscript out of bounds
  Calls: plotCounts
  Execution halted

This is caused in DESeq2.R starting line 377 with plotCounts().
It would be great to :

Have an explicit error if the gene does not exist in the count tables
Plot the boxplot if the gene is present in the count tables even if not differentially expressed.

Thanks a lot!
Laurence
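The two requested behaviours amount to checking the gene list against the count table rather than against the DE results. A minimal sketch of that check, in Python rather than in DESeq2.R, with hypothetical function names:

```python
def split_gene_list(requested, count_table_ids):
    """Return (present, missing) genes, keeping the requested order."""
    known = set(count_table_ids)
    present = [gene for gene in requested if gene in known]
    missing = [gene for gene in requested if gene not in known]
    return present, missing


def require_known_genes(requested, count_table_ids):
    """Raise a readable error instead of 'subscript out of bounds'."""
    present, missing = split_gene_list(requested, count_table_ids)
    if missing:
        raise ValueError(
            "Genes from --genelist not found in the count table: "
            + ", ".join(missing)
        )
    return present
```

Genes returned as present can be plotted whether or not they are differentially expressed; only genes absent from the count table trigger the explicit error.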

Clarify `--genelist` format

One per line is good, but that it has to be ENSEMBL genes wasn't clear to me ;-)

Colnames in "merged_count_table.txt"

Colnames in merged_count_table.txt need to be QBiC code + Aligned.sortedByCoord.out.
Open issue in RNAseq pipeline so they remove the "Aligned..." part
Check rnadeseq pipeline code still runs
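Until the suffix is removed upstream, the column names could be normalised on the rnadeseq side. The sketch below is an assumption about how that might look: the function name is hypothetical, and the optional ".bam" ending is included because a header with that ending is quoted in another issue.

```python
import re

# Matches the aligner suffix with or without the trailing ".bam".
_ALIGNER_SUFFIX = re.compile(r"Aligned\.sortedByCoord\.out(\.bam)?$")


def strip_aligner_suffix(columns):
    """Remove the aligner suffix so that only the QBiC code remains."""
    return [_ALIGNER_SUFFIX.sub("", name) for name in columns]
```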

Add MultiQC results to report folder

Add MultiQC html to report folder.
Remove multiQC data and plots (just leave necessary plots for results)
fastqc.zip optional

change name of repository to rnadeseq2-workflow

deseq2 contrasts

deseq2 should write a contrasts.tsv file when not provided

  [1] "DE_contrast_condition_treatment_50.CSF_vs_25.CSF"
  [1] "Number of genes in query:"
  [1] 997
  [1] "Number of pathways found:"
  integer(0)

which causes the pipeline to fail at pathway_analysis.R:

##############################################################################
  Pathview is an open source software package distributed under GNU General
  Public License version 3 (GPLv3). Details of GPLv3 is available at
  http://www.gnu.org/licenses/gpl-3.0.html. Particullary, users are required to
  formally cite the original Pathview paper (not just mention it) in publications
  or products. For details, do citation("pathview") within R.
  
  The pathview downloads and uses KEGG data. Non-academic uses may require a KEGG
  license agreement (details at http://www.kegg.jp/kegg/legal.html).
  ##############################################################################
  
  No results to show

I think it's the condition here that needs to be checked

rnadeseq/bin/pathway_analysis.R

Line 204 in abe7427

if (nrow(pathway_gostres) > 0){ #if there are enriched pathways

I opened an issue just as a reminder, maybe it'll be a small task i can do at the hackathon :)
Best,
Laurence

Model design not correctly read in report

KEGG pathway optional

For commercial users it would be nice to be able to remove the KEGG pathway analysis.
