qbic-pipelines / rnadeseq

Differential gene expression analysis and pathway analysis of RNAseq data

License: MIT License

Dockerfile 0.69% HTML 2.37% R 5.33% Python 7.69% Nextflow 59.37% CSS 0.29% TeX 6.72% Groovy 17.54%
deseq2 nextflow rnaseq pipeline pathway-analysis

rnadeseq's People

Contributors

apeltzer, d4straub, ggabernet, jonoave, laurencekuhl, louperelo, qbicstefanc, silviamorins, susijo, wackero

rnadeseq's Issues

output tables

  • log2count table needs to be removed.
  • output tables: separate count tables from DE gene list tables.
  • PCA plots with conditions
  • Sample distance heatmap with Secondary name.
  • Table names of normalized counts, etc. Which name should they have: _secondary name.
  • Final table rename to: complete stats table.

Corrections cluster profiler

  • Heatmaps show gene name not gene ID
    • For this: normalized count tables should have EnsemblID and gene_name
  • PCA plot double check, legend title: Sample grouping. Check for better colour palette.

If a gene provided in --genelist is not among the differentially expressed genes, an error occurs

Hi there,

When --genelist contains genes that are not differentially expressed, the following error occurs:

The following object is masked from ‘package:S4Vectors’:
      space
  The following object is masked from ‘package:stats’:
      lowess
  Registering fonts with R
  Attaching package: ‘limma’
  The following object is masked from ‘package:DESeq2’:
      plotMA
  The following object is masked from ‘package:BiocGenerics’:
      plotMA
  Exiting.
  estimating size factors
  estimating dispersions
  gene-wise dispersion estimates
  mean-dispersion relationship
  final dispersion estimates
  fitting model and testing
  Warning messages:
  1: In data.frame(count = cnts + pc, group = as.integer(group)) :
    NAs introduced by coercion
  2: In data.frame(count = cnts + pc, group = as.integer(group)) :
    NAs introduced by coercion
  3: In data.frame(count = cnts + pc, group = as.integer(group)) :
    NAs introduced by coercion
  Error in counts(dds, normalized = normalized, replaced = replaced)[gene,  :
    subscript out of bounds
  Calls: plotCounts
  Execution halted

This is caused by the plotCounts() call in DESeq2.R, starting at line 377.
It would be great to:

  1. Have an explicit error if the gene does not exist in the count tables
  2. Plot the boxplot if the gene is present in the count tables even if not differentially expressed.

Thanks a lot!
Laurence
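
The first request above could be sketched as a pre-check that runs before any plotting. The example is written in Python only to illustrate the logic; the pipeline's own script is R, and the helper names `split_requested_genes` and `check_genelist` are hypothetical. Genes are checked against the count table rather than the DE list, so genes that are present but not differentially expressed would still be plotted.

```python
def split_requested_genes(requested, count_table_ids):
    """Partition requested genes into those present in the count table and those missing.

    Both arguments are iterables of gene identifiers; the order of `requested` is kept.
    """
    known = set(count_table_ids)
    present = [gene for gene in requested if gene in known]
    missing = [gene for gene in requested if gene not in known]
    return present, missing


def check_genelist(requested, count_table_ids):
    """Raise an explicit error naming missing genes; otherwise return the genes to plot."""
    present, missing = split_requested_genes(requested, count_table_ids)
    if missing:
        raise ValueError(
            "Genes from --genelist not found in the count table: " + ", ".join(missing)
        )
    return present
```

Whether a missing gene should stop the run or only produce a warning is a design choice left open here.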

Colnames in "merged_count_table.txt"

  • Colnames in merged_count_table.txt need to be QBiC code + Aligned.sortedByCoord.out.
  • Open issue in RNAseq pipeline so they remove the "Aligned..." part
  • Check rnadeseq pipeline code still runs
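
Until the suffix is removed upstream, the sample code could be recovered from a column name with a small helper. This is a minimal sketch in Python for illustration only: `strip_alignment_suffix` is a hypothetical name, and the two suffix variants are taken from the column names quoted in these issues.

```python
# Suffix variants seen in the issues; the longer one must be tried first.
ALIGNMENT_SUFFIXES = ("Aligned.sortedByCoord.out.bam", "Aligned.sortedByCoord.out")


def strip_alignment_suffix(column_name):
    """Return the column name without a trailing alignment suffix, if it has one."""
    for suffix in ALIGNMENT_SUFFIXES:
        if column_name.endswith(suffix):
            return column_name[: -len(suffix)]
    return column_name
```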

Volcano plots have capped logFC on the x axis

Hi there,

Just a small issue regarding the volcano plots: the x-axis (logFC) is capped from -5 to 5, which sometimes leaves out genes with a higher absolute logFC:

scale_x_continuous(limits = c(-5,5), breaks = c(-5:5)) +

I can have a look, as it's a minor fix; I just want to keep a written trail.
Thanks and best,
Laurence
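
One possible fix is to derive the axis limit from the data instead of hard-coding it. The sketch below shows only the arithmetic, in Python for illustration (the plotting code itself is R/ggplot2); `symmetric_x_limit` and its `minimum` parameter are hypothetical names, and keeping 5 as the lower bound is an assumption made to preserve the current look for small fold changes.

```python
import math


def symmetric_x_limit(log_fold_changes, minimum=5):
    """Return an integer L such that every finite logFC lies within [-L, L].

    Missing (None) and non-finite values are ignored; L is never below `minimum`.
    """
    finite = [abs(x) for x in log_fold_changes if x is not None and math.isfinite(x)]
    largest = max(finite, default=0.0)
    return max(minimum, math.ceil(largest))
```

The result would then replace the fixed 5 in both the limits and the breaks of the x-axis scale.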

Add the offer as input file

The offer needs to be added as an input file, as it is linked to in the last paragraph of the report (Summary and outlook).

Paragraph for Summary and Outlook

All our reports should contain this at the end, thus please add it to the template (there might be some small revisions in the future, but...)

"The results for all work packages, as described in the quote (give link to quote) can be found in this report. Further support for this project will be restricted to the results presented in this report (e.g. requests to update/manipulate figures and tables).
For further analysis (e.g. the re-analysis of the dataset) we will generate a new quote containing cost estimates."

Add DESeq2 versions to report

  • The package versions used by DESeq2 should be added to the report.
  • Add all the R packages to the "get_software_versions" process

Report corrections

  • Section 3.2.1, Percentages tabs: correct typos.
  • Go through the report text with Gisela.
  • Table headers: show spaces instead of dots (ask Marie).
  • Mapping statistics: remove text about error rate.

Filenames for RNAseq pipeline

For the pipeline to run properly, files need to be named after the sample QBiC code:

QXXXXNNNNN_whateveryoulike.ext

Otherwise MultiQC report and raw count tables will contain the wrong sample names! This is a pre-requisite for the pipeline.

Add this to docs and make everybody aware!
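
A naming check along these lines could be run before starting the pipeline. This is only a sketch: the exact character classes of a QBiC code are an assumption here ("Q" followed by nine uppercase letters or digits, matching the QXXXXNNNNN shape and the sample codes quoted elsewhere in these issues), and `qbic_code` and `invalid_filenames` are hypothetical helper names.

```python
import re
from pathlib import Path

# Assumption: "Q" + 9 uppercase alphanumerics, then "_", free text, and an extension.
QBIC_FILENAME = re.compile(r"^(Q[A-Z0-9]{9})_.+\.[^.]+$")


def qbic_code(filename):
    """Return the leading QBiC code of a file name, or None if the name does not conform."""
    match = QBIC_FILENAME.match(Path(filename).name)
    return match.group(1) if match else None


def invalid_filenames(filenames):
    """Return the file names that do not start with a QBiC code followed by '_'."""
    return [name for name in filenames if qbic_code(name) is None]
```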

DE_list_DESeq2.tsv file missing

It would be nice to have this file (list containing only the DE genes), together with the final_list_DESeq2.tsv, as output of the DESeq.v2.7.R script.

complete path: DESeq2/zips/DESeq2/results/final/DE_list_DESeq2.tsv

I would like to discuss this, though, before making a pull request.

Pipeline fails if 0 pathways found in one of the conditions

Hi there,

I am currently running the pipeline with treatment (control, 25% and 50% of the drug) as condition. No pathways are found between 25% and 50% of the drug intake:

  [1] "DE_contrast_condition_treatment_50.CSF_vs_25.CSF"
  [1] "Number of genes in query:"
  [1] 997
  [1] "Number of pathways found:"
  integer(0)

which causes the pipeline to fail at pathway_analysis.R:

##############################################################################
  Pathview is an open source software package distributed under GNU General
  Public License version 3 (GPLv3). Details of GPLv3 is available at
  http://www.gnu.org/licenses/gpl-3.0.html. Particullary, users are required to
  formally cite the original Pathview paper (not just mention it) in publications
  or products. For details, do citation("pathview") within R.
  
  The pathview downloads and uses KEGG data. Non-academic uses may require a KEGG
  license agreement (details at http://www.kegg.jp/kegg/legal.html).
  ##############################################################################
  
  No results to show

I think it's the condition here that needs to be checked

if (nrow(pathway_gostres) > 0){ #if there are enriched pathways

I opened an issue just as a reminder; maybe it'll be a small task I can do at the hackathon :)
Best,
Laurence

Parameter --kegg_blacklist does not work

When a KEGG pathway is added with --kegg_blacklist, the pipeline runs but ignores the parameter. When I hardcoded the pathway here it worked, so the append function here is causing the issue.

Just creating a small issue so I have it noted somewhere. As soon as I have 5 minutes to test it out, I will fix this small bug :)
Thank you!
Laurence
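
The intended behaviour, merging user-supplied pathways into the built-in blacklist, can be sketched as follows. The example is in Python for illustration only (the affected script is R). It assumes the parameter value is a comma-separated string, which the issue does not state; the default entries are the two pathways named in the "Pathview crash with some pathways" issue, and `build_blacklist` is a hypothetical name.

```python
# Defaults taken from the Pathview crash issue; the real default list may differ.
DEFAULT_BLACKLIST = ("mmu05206", "mmu04215")


def build_blacklist(user_value, defaults=DEFAULT_BLACKLIST):
    """Return the defaults plus any user-supplied pathway IDs, without duplicates.

    `user_value` is assumed to be a comma-separated string, or None/empty if unset.
    """
    merged = list(defaults)
    if user_value:
        for pathway in user_value.split(","):
            pathway = pathway.strip()
            if pathway and pathway not in merged:
                merged.append(pathway)
    return merged
```

Returning a new list, rather than appending in place, avoids the kind of bug where the result of an append is computed but never assigned back.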

pathway analysis with one DE

Proposed solutions:

  • Minimum number of DE genes required to call a pathway deregulated: N genes (more than 1), default
  • SCS multiple testing correction in addition to Benjamini-Hochberg: show both in the table and select DE pathways based on SCS. Add the citation of the tool to the report.

Print software versions

Rather, print the versions of all tools used at the end of the R Markdown report, e.g. by printing sessionInfo() there.

feature - The contrast list is not taken into account for boxplots

Hi there,

I am running the pipeline with 2 different conditions (treatment and patient) and have provided a contrast list:

factor  numerator       denominator
condition_treatment_type        A      B

referring only to treatment.
However, in my boxplots, my data is plotted according to condition and patient on the x axis.
Would it be logical to also plot them depending on the contrast list?
Best,
Laurence
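
If the boxplots were to follow the contrast list, the first step would be to read which factors it names. A minimal sketch, assuming the contrast list is a tab-separated file with the header shown above; it is in Python for illustration only, and `contrast_factors` is a hypothetical name.

```python
import csv
import io


def contrast_factors(contrast_text):
    """Return the distinct values of the 'factor' column, in order of first appearance."""
    reader = csv.DictReader(io.StringIO(contrast_text), delimiter="\t")
    factors = []
    for row in reader:
        factor = (row.get("factor") or "").strip()
        if factor and factor not in factors:
            factors.append(factor)
    return factors
```

For the contrast list above this would return only condition_treatment_type, so the boxplots could group by treatment alone.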

Pathview crash with some pathways

  • Pathview crashes with certain pathways because the original KEGG graphs for them are corrupt.
  • The solution will be to add a blacklist with these pathways: so far "mmu05206", "mmu04215"

nf-core/rnaseq salmon count table leads to failure

I used nf-core/rnaseq without STAR but with salmon. Also, I used a rather unusual .gff for bacteria with very limited information.

I could identify three problems:

  • salmon output is csv instead of tsv.
  • salmon occasionally produced floating-point numbers instead of integers. No idea why; this shouldn't happen, in my opinion.
  • maybe because I used a non-Ensembl .gff in combination with --fc_group_features "transcript_id" when running nf-core/rnaseq, the header looked like this:
transcript_id QMFCE006AD

instead of

Geneid gene_name QBICK031A9Aligned.sortedByCoord.out.bam

I solved the problem by converting the csv to tsv, rounding floats to integers and changing the header to

Ensembl_ID gene_name QMFCE006AD

I am not sure the header is really required that way.
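
The three manual fixes described above could be scripted roughly as follows. This is a sketch under stated assumptions, not the pipeline's method: the input is taken to be comma-separated with the feature ID in the first column and one count column per sample, the added gene_name column is simply filled with the feature ID as a placeholder, and `salmon_csv_to_tsv` is a hypothetical name.

```python
import csv
import io


def salmon_csv_to_tsv(csv_text):
    """Convert a comma-separated count table to a TSV with the header
    Ensembl_ID, gene_name, <samples...> and counts rounded to integers."""
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader)
    samples = header[1:]  # first column is the feature ID, the rest are samples

    out = io.StringIO()
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    writer.writerow(["Ensembl_ID", "gene_name"] + samples)
    for row in reader:
        if not row:
            continue  # skip blank lines
        feature_id, values = row[0], row[1:]
        # Python's round() rounds exact halves to the nearest even integer.
        counts = [round(float(value)) for value in values]
        # Assumption: no gene names are available, so the ID is reused.
        writer.writerow([feature_id, feature_id] + counts)
    return out.getvalue()
```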

Pathway analysis for metatranscriptomics

The Problem

The program currently used for pathway analysis, gprofiler, can't handle bacteria in general and applies only to isolates, i.e. a single species, as opposed to metatranscriptomics.

The Solution

  • Creating a community profile with HUMAnN2 (metaphlan2) and plot with krona
  • Calculating pathway abundances and presence/absence with HUMAnN2 (bowtie2, diamond) and identifying significantly different ones with MaAsLin2
  • Reporting KEGG orthologs and informative GO terms with HUMAnN2, and adding significance values with MaAsLin2

All this software fits into the existing container without conflicts.

At least three independent analyses would be possible:

  • Differential gene expression, required: --rawcounts
  • Pathway analysis isolate/single species, required: --rawcounts, --species
  • Community composition and pathway analysis for bacterial communities, required: pre-processed reads

Required changes to workflow

  1. Addition of software (trivial)
  2. Achieve maximal flexibility: all parameters need to be optional; the only exception might be --metadata.
  3. New inputs for metatranscriptome samples (feature/pathway abundance):
    -- Pre-processed (optimally rRNA depleted) reads e.g. from nf-core/rnaseq v1.4+ (with parameters --remove_rRNA & --save_nonrRNA_reads), required for meta-pathway analysis
    -- Optional: databases (nucleotide, protein & utilities), default: automated download
  4. New inputs for paired metatranscriptome - metagenome samples (feature/pathway expression):
    -- Pre-processed metagenomics reads e.g. from nf-core/rnaseq v1.4+
    -- Either a manifest file to link samples or same sample names but different folders

Conclusion

This would be a major increase in code / parameters and output.
Pathway abundance (only metatranscriptome) would be the first step to implement, followed by addition of pathway expression analysis (RNA & DNA measures).

edit: added section "three independent analysis"
edit2: nf-core/rnaseq v1.4 pre-processing is only valid for environmental samples! For host - microbiome studies the host sequences have to be removed too!

KEGG pathway optional

  • For commercial users it would be nice to be able to remove the KEGG pathway analysis.

No legend on the .png output images

In DESeq2/results/plots the .png pictures have no legend, i.e. the sample names are not shown properly. In the corresponding .pdf files the legends are shown, but I fail to include these files in the report (i.e., they don't get shown).
