csoneson / armor Goto Github PK
View Code? Open in Web Editor NEWLight-weight Snakemake workflow for preprocessing and statistical analysis of RNA-seq data
License: MIT License
Light-weight Snakemake workflow for preprocessing and statistical analysis of RNA-seq data
License: MIT License
hi, for some reason my SE input is calling on trimgalorePE instead of SE?
Log:
Building DAG of jobs...
MissingInputException in line 297 of /home/ruhrig/ARMOR/Snakefile:
Missing input files for rule trimgalorePE:
/nfs/scratch/RNA-seq/Human/carrier/fastq/SRR3583291 SE MCF7_control_None.fastq.gz
Metadata.txt:
names type condition
SRR3583291 SE MCF7_control
SRR3583292 SE MCF7_RA
SRR3583293 SE BT474_control
SRR3583294 SE BT474_RA
Config:
metatxt: /nfs/scratch/ARMOR-runs/metadata_carrier.txt
design: "~ 0 + condition"
contrast: conditionMCF7_RA-conditionMCF7_control,conditionBT474_RA-conditionBT474_control
genesets: H,C5
ncores: 7
FASTQ: /nfs/scratch/RNA-seq/Human/carrier/fastq
fqext1: 1
fqext2: 2
fqsuffix: fastq
output: /nfs/scratch/ARMOR-runs/carrierout
useCondaR: True
Rbin: R
run_trimming: True
run_STAR: True
run_DRIMSeq: True
run_camera: False
Do we still need the groupvar
(specified in the config file)? It is only used to create the bwCond
column of the final SCEs, and in the end any column could be used there.
Hi, I'm not sure what this means? I've previously run the workflow successfully with another dataset. In this case I'm getting the following error:
RuleException:
CalledProcessError in line 526 of /home/ruhrig/ARMOR/Snakefile:
Command 'source /home/ruhrig/miniconda3/bin/activate '/home/ruhrig/ARMOR/.snakemake/conda/17597c04'; set -euo pipefail; R CMD BATCH --no-restore --no-save "--args se='/nfs/scratch/ARMOR-runs/williamsonout/outputR/tximeta_se.rds' organism='Homo_sapiens' design='~0+condition' contrast='UV_8_hr-Untreated,UV_24_hr-Untreated' rmdtemplate='scripts/edgeR_dge.Rmd' outputdir='/nfs/scratch/ARMOR-runs/williamsonout/outputR' outputfile='edgeR_dge.html'" scripts/run_render.R /nfs/scratch/ARMOR-runs/williamsonout/Rout/run_dge_edgeR.Rout' returned non-zero exit status 1.
File "/home/ruhrig/ARMOR/Snakefile", line 526, in __rule_edgeR
File "/home/ruhrig/miniconda3/lib/python3.6/concurrent/futures/thread.py", line 56, in run
[Tue Mar 19 00:19:26 2019]
Finished job 9.
34 of 62 steps (55%) done
Shutting down, this might take some time.
Exiting because a job execution failed. Look above for error message
I am a bit confused regarding parameters in the Reference annotation details
-Section of config.yaml
:
organism
, build
and release
? My GTF is depending on an denovo assembly of the transcriptome which got merged with an already existing annotation of the genome.My organism is Apis mellifera and I the annotation and the genome assembly is fetched from NCBI (ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/003/254/395/GCF_003254395.2_Amel_HAv3.1). How do I have to set the parameters in this case?
airway data?
In the configuration file, config.yaml
, says
## The maximal number of cores to use for FastQC, STAR, Salmon and DRIMSeq.
## Note that the actual number of cores available to Snakemake is determined by
## the --cores argument when it is invoked.
ncores: 1
and then in some rules you have:
threads:
config["ncores"]
I got confused a couple of moths ago when I didn't know snakemake
as I do know. As you say, with --cores
you specify the maximal number of cores used by snakemake
, so when your are running the rule star_SE
with snakemake --cores 12
and you specify in the config file ncores: 2
, you'll be running 6 instances of STAR
in parallel, but each instance will be using 2 cores, you won't be running STAR
in one file with 12 cores (one instance). I didn't get that from the config file. I'm saying this because it's important to know how many input files you have (approximately) to set ncores
properly and, I think, many times is better to set threads
independently in each rule because of this and also, because different software has different performance when running with multiple cores in parallel. My intention isn't that you change the behavior of this pipeline, It is rather to try to avoid confusing (unintentionally) some users (like me). Maybe you can add some information in the wiki
or in the config
file about this.
Use Travis CI?
Test different versions of config.yaml?
Use a predefined output to test, put it in github?
Ncpu
? Study this.In the config file, allow the user to specify an input directory (where the fastq.gz files are), as well as an output directory (recommended default .
) where the output will be generated.
Add bias parameters (and potentially also other arguments) to Salmon call
E.g., like in https://twitter.com/khanaziz84/status/1082203240752734208 (probably printing a success/failure message to the console is enough)
If we want to go this route, we could/should:
config.yaml
?I ran snakemake --use-conda
on the example dataset to test my installation.
I get the following error:
Quitting from lines 396-398 (edgeR_dge.Rmd)
Error in msigdbr(species = organism, category = genesets) :
please specify only one category at a time
Unless we can find a good solution to allow relative paths, stress that all paths must be either absolute or relative to the location of the Snakefile.
If not using conda R, need to set the path to the library you want to use. Also (important!), the library directory must exist, otherwise it will not be used by R.
.libPaths()[1]
useCondaR
(TRUE/FALSE) -> in the Snakefile, set variable with environment to use for the conda directive in the R rulesAdd system-dependent installation instructions
Export SummarizedExperiment
object (with data and DGE results) for use with e.g. iSEE.
• Somewhere (already in the tximeta
script?) we should/could set the rownames of the SEs to geneID__geneSymbol
or so (I'd prefer not to use .
as the separator since it may already be in the gene ID if it's versioned, but __
should not be in any of the individual components)
• When adding the edgeR/DRIMSeq results, we could add an additional column mlog10PValue
or something, containing -log10(PValue). This makes it easier to make volcano plots.
This could be e.g. new packages or packages for which one would like a specific version which is not available via conda. How do we handle this? Maybe one option would be to have an R script with dedicated code to install all such packages, and a rule to run this script and install the packages in the environment?
Dear Authors and Experts ARMOR,
I am a microbiologist, I am learning RNAseq analysis. However, when I run the file, I had the problem when the ARMOR was running:
Activating conda environment: /home/leo/ARMOR/.snakemake/conda/10da7374
Jul 01 15:58:02 ..... started STAR run
Jul 01 15:58:02 ..... loading genome
Jul 01 15:58:03 ..... started mapping
/bin/bash: line 1: 37248 Segmentation fault (core dumped) STAR --genomeDir /home/leo/ARMOR/E109_E453/E109_E453_starindex.idx --readFilesIn /home/leo/ECOPAIR_RNAseq_rawdata/E109_453_RNAseq_output/FASTQtrimmed/E453_4_2_R1_val_1.fq.gz /home/leo/ECOPAIR_RNAseq_rawdata/E109_453_RNAseq_output/FASTQtrimmed/E453_4_2_R2_val_2.fq.gz --runThreadN 18 --outFileNamePrefix /home/leo/ECOPAIR_RNAseq_rawdata/E109_453_RNAseq_output/STAR/E453_4_2/E453_4_2_ --outSAMtype BAM SortedByCoordinate --readFilesCommand gunzip -c
[Wed Jul 1 15:58:05 2020]
Error in rule starPE:
jobid: 57
output: /home/leo/ECOPAIR_RNAseq_rawdata/E109_453_RNAseq_output/STAR/E453_4_2/E453_4_2_Aligned.sortedByCoord.out.bam
log: /home/leo/ECOPAIR_RNAseq_rawdata/E109_453_RNAseq_output/logs/STAR_E453_4_2.log (check log file(s) for error message)
conda-env: /home/leo/ARMOR/.snakemake/conda/10da7374
shell:
echo 'STAR version:
' > /home/leo/ECOPAIR_RNAseq_rawdata/E109_453_RNAseq_output/logs/STAR_E453_4_2.log; STAR --version >> /home/leo/ECOPAIR_RNAseq_rawdata/E109_453_RNAseq_output/logs/STAR_E453_4_2.log; STAR --genomeDir /home/leo/ARMOR/E109_E453/E109_E453_starindex.idx --readFilesIn /home/leo/ECOPAIR_RNAseq_rawdata/E109_453_RNAseq_output/FASTQtrimmed/E453_4_2_R1_val_1.fq.gz /home/leo/ECOPAIR_RNAseq_rawdata/E109_453_RNAseq_output/FASTQtrimmed/E453_4_2_R2_val_2.fq.gz --runThreadN 18 --outFileNamePrefix /home/leo/ECOPAIR_RNAseq_rawdata/E109_453_RNAseq_output/STAR/E453_4_2/E453_4_2_ --outSAMtype BAM SortedByCoordinate --readFilesCommand gunzip -c
(one of the commands exited with non-zero exit code; note that snakemake uses bash strict mode!)
Removing output files of failed job starPE since they might be corrupted:
/home/leo/ECOPAIR_RNAseq_rawdata/E109_453_RNAseq_output/STAR/E453_4_2/E453_4_2_Aligned.sortedByCoord.out.bam
[Wed Jul 1 16:04:31 2020]
Error in rule tximeta:
jobid: 62
output: /home/leo/ECOPAIR_RNAseq_rawdata/E109_453_RNAseq_output/outputR/tximeta_se.rds
log: /home/leo/ECOPAIR_RNAseq_rawdata/E109_453_RNAseq_output/Rout/tximeta_se.Rout (check log file(s) for error message)
conda-env: /home/leo/ARMOR/.snakemake/conda/9f6359b5
shell:
R CMD BATCH --no-restore --no-save "--args salmondir='/home/leo/ECOPAIR_RNAseq_rawdata/E109_453_RNAseq_output/salmon' json='/home/leo/ARMOR/E109_E453/E109_E453_saslmonindex.sidx.json' metafile='/home/leo/ARMOR/E109_E453/metadata_E109_E453.txt' outrds='/home/leo/ECOPAIR_RNAseq_rawdata/E109_453_RNAseq_output/outputR/tximeta_se.rds' annotation='Ensembl' organism='Ecoli'" scripts/run_tximeta.R /home/leo/ECOPAIR_RNAseq_rawdata/E109_453_RNAseq_output/Rout/tximeta_se.Rout
(one of the commands exited with non-zero exit code; note that snakemake uses bash strict mode!)
Shutting down, this might take some time.
Exiting because a job execution failed. Look above for error message
Error! The Snakemake workflow aborted.
Complete log: /home/leo/ARMOR/.snakemake/log/2020-07-01T152546.626254.snakemake.log
I am not a bioinformatic expert. I do not know how to fix the problem,
When you have time, could you please let me know how to solve this problem?
Thank you in advance,
Kind regards,
Le Phuong.
Is it unnecessarily restrictive to assume that FASTQ files are of the form <sample>_{R1/R2}.fastq.gz
? I.e., should we allow the user to specify the {R1/R2}
part in the config file (just {1/2}
is also very common)?
We discussed this, and probably it's best to remove the version numbers from the environment.yaml
file, so that users will not be stuck with old versions. Any user can always specify the versions they want (to make their own analyses reproducible), and we can provide a list of the versions that we have tested the workflow with. In the Wiki, we can also show how to export an environment file from a specific conda environment.
First time trying to use the pipeline on a server.
OS is Linux Mint 18.1
Python 3.5.2
Snakemake installed (pip3), conda installed.
I'm trying to run the example set, but when I run snakemake --use-conda, I get the following error.
File "/usr/local/lib/python3.5/dist-packages/snakemake/init.py", line 624, in snakemake
batch=batch,
File "/usr/local/lib/python3.5/dist-packages/snakemake/workflow.py", line 728, in execute
quiet=list_conda_envs,
File "/usr/local/lib/python3.5/dist-packages/snakemake/dag.py", line 260, in create_conda_envs
env.create(dryrun)
File "/usr/local/lib/python3.5/dist-packages/snakemake/conda.py", line 227, in create
conda = Conda(self._singularity_img)
File "/usr/local/lib/python3.5/dist-packages/snakemake/conda.py", line 338, in new
inst.init(singularity_img=singularity_img)
File "/usr/local/lib/python3.5/dist-packages/snakemake/conda.py", line 352, in init
self.info = json.loads(shell.check_output(self._get_cmd("conda info --json")))
File "/usr/lib/python3.5/json/init.py", line 312, in loads
s.class.name))
TypeError: the JSON object must be str, not 'bytes'
Any ideas on what could be wrong?
Provide a link to some small example data that can be used to test that the workflow works and does what it is intended to do (perhaps also a way to check the output).
Do we need to include pandoc and pandoc-citeproc in the environment file, to make sure that the Rmarkdown files will compile?
When running ARMOR on some data I get an error when it is trying to generate the bigwig files.
Error in rule bigwig: jobid: 51 output: Jin/output/STARbigwig/p63KO35_Aligned.sortedByCoord.out.bw log: Jin/output/logs/bigwig_p63KO35.log (check log file(s) for error message) conda-env: /home/ryan/ARMOR/.snakemake/conda/74957150 shell: echo 'bedtools version: ' > Jin/output/logs/bigwig_p63KO35.log; bedtools --version >> Jin/output/logs/bigwig_p63KO35.log; bedtools genomecov -split -ibam Jin/output/STAR/p63KO35/p63KO35_Aligned.sortedByCoord.out.bam -bg | sort -k1,1 -k2,2n > Jin/output/STARbigwig/p63KO35_Aligned.sortedByCoord.out.bedGraph; bedGraphToBigWig Jin/output/STARbigwig/p63KO35_Aligned.sortedByCoord.out.bedGraph /home/ryan/ARMOR/reference/human/STARIndex/Homo_sapiens.GRCh38.V30.STAR.idx/chrNameLength.txt Jin/output/STARbigwig/p63KO35_Aligned.sortedByCoord.out.bw; rm -f Jin/output/STARbigwig/p63KO35_Aligned.sortedByCoord.out.bedGraph
In the terminal i see:
Jin/output/STARbigwig/p63KO35_Aligned.sortedByCoord.out.bedGraph is not case-sensitive sorted at line 38881834. Please use "sort -k1,1 -k2,2n" with LC_COLLATE=C, or bedSort and try again.
when looking at that specific line, we see:
`sed -n '38881833,38881835p' p63KO35_Aligned.sortedByCoord.out.bedGraph
chrY 57215886 57215994 1
GL000008.2 83526 83545 1
GL000008.2 85565 85566 1
`
I am using the Genome sequence, primary assembly (GRCh38) from gencode.
Not sure how to proceed. Thanks!
Hi,
after successfully running the pipeline for human and zebrafish, I've encountered some problems with a bacterial species. After going through (hopefully all) issues here that were already encountered and solved on github, I can't really find a solution for this one, so maybe you can help?
The error occurs at tximeta, here's an error Output, apparently there is a problem with .makesplicing. Any ideas on what may be causing this?
args <- (commandArgs(trailingOnly = TRUE))
for (i in seq_len(length(args))) {
eval(parse(text = args[[i]]))
suppressPackageStartupMessages({
library(dplyr)
library(tximport)
library(tximeta)
library(SingleCellExperiment)
print(salmondir)
[1] "example_data/output/salmon"
print(json)
[1] "example_data/reference/SalmonIndex/Cg.sidx.json"
print(metafile)
[1] "example_data/metadata.txt"
print(outrds)
[1] "example_data/output/outputR/tximeta_se.rds"
print(annotation)
[1] "Gencode"
print(organism)
[1] "Corynebacterium_glutamicum"Load json linkedTxome
loadLinkedTxome(json)
saving linkedTxome in bfc (first time)Read metadata
metadata <- read.delim(metafile, header = TRUE, as.is = TRUE, sep = "\t")
List Salmon directories
salmonfiles <- paste0(salmondir, "/", metadata$names, "/quant.sf")
names(salmonfiles) <- metadata$namesAdd file column to metadata and import annotated abundances
In transcript level
coldata <- cbind(metadata, files = salmonfiles, stringsAsFactors = FALSE)
st <- tximeta::tximeta(coldata)
importing quantifications
reading in files with read_tsv
1 2 3 4
found matching linked transcriptome:
[ GENCODE - Corynebacterium glutamicum - release 1 ]
building TxDb with 'GenomicFeatures' package
Import genomic features from the file as a GRanges object ... OK
Prepare the 'metadata' data frame ... OK
Make the TxDb object ... Error in .make_splicings(exons, cds, stop_codons) :
some CDS cannot be mapped to an exon
Calls: ... makeTxDbFromGFF -> makeTxDbFromGRanges -> .make_splicings
In addition: Warning message:
In .get_cds_IDX(mcols0$type, mcols0$phase) :
The "phase" metadata column contains non-NA values for features of type
stop_codon. This information was ignored.
Execution halted
best
Anze
Make a summary document describing the provided workflow in words (with references)
Extend checks of input variables (design
, contrast
, etc) in run_render.R
/generate_report.R
. In addition to just checking that these are character scalars, check that they are formatted correctly and give an informative error message otherwise.
Change the default annotation catalog from Ensembl to Gencode. Avoids the need to merge cDNA and ncRNA fasta files, and ensures that the gtf and fasta file have the same set of transcripts.
Affects at least the following files:
README.md
(instructions for downloading reference files)config.yaml
(remove cdna and ncrna, use only txome)Snakefile
(no need to merge cDNA and ncRNA files)generate_tx2gene.R
(transcript ID format is different)run_dge_edgeR.R
and run_dtu_drimseq.R
(set ignoreAfterBar=TRUE
in tximport
call - alternatively run Salmon indexing with gencode
flag)To make things more general (allow as many species as possible), we could allow both Ensembl and Gencode (potentially requiring the user to merge the cDNA and ncRNA files on their own beforehand, if they want to use both). Then, in both cases we would only require an input txome
, and remove the cDNA/ncRNA merging.
Hi all,
I just git cloned a copy of ARMOR and invoked the workflow with
snakemake --cores 10
It came to an error :
rule tximeta:
input: example_data/output/Rout/pkginstall_state.txt, example_data/output/salmon/SRR1039508/quant.sf, example_data/output/salmon/SRR1039509/quant.sf, example_data/output/salmon/SRR1039512/quant.sf, example_data/output/salmon/SRR1039513/quant.sf, example_data/metadata.txt, example_data/reference/SalmonIndex/Homo_sapiens.GRCh38.93_0.8.2/hash.bin, example_data/reference/SalmonIndex/Homo_sapiens.GRCh38.93_0.8.2.json, scripts/run_tximeta.R
output: example_data/output/outputR/tximeta_se.rds
log: example_data/output/Rout/tximeta_se.Rout
jobid: 44
Error in job tximeta while creating output file example_data/output/outputR/tximeta_se.rds.
RuleException:
CalledProcessError in line 499 of /mnt/mcfiles/rlyu/Projects/Snakemake_projects/ARMOR/ARMOR/Snakefile:
Command '/usr/bin/R CMD BATCH --no-restore --no-save "--args salmondir='example_data/output/salmon' json='example_data/reference/SalmonIndex/Homo_sapiens.GRCh38.93_0.8.2.json' metafile='example_data/metadata.txt' outrds='example_data/output/outputR/tximeta_se.rds' annotation='Ensembl' organism='Homo_sapiens'" scripts/run_tximeta.R example_data/output/Rout/tximeta_se.Rout' returned non-zero exit status 1.
File "/mnt/mcfiles/rlyu/Projects/Snakemake_projects/ARMOR/ARMOR/Snakefile", line 499, in __rule_tximeta
File "/mnt/mcfiles/rlyu/Software/miniconda/miniconda3/envs/ARMOR/lib/python3.6/concurrent/futures/thread.py", line 56, in run
Will exit after finishing currently running jobs.
Exiting because a job execution failed. Look above for error message
I looked at the Rout, it was like,
OK
Processing exons ... OK
Processing chromosomes ... Fetch seqlengths from ensembl ... Error in readLines(curl(my_url)) :
Timeout was reached: Connection timed out after 10001 milliseconds
Error in readLines(curl(my_url)) :
Timeout was reached: Connection timed out after 10001 milliseconds
Error in readLines(curl(my_url)) :
Timeout was reached: Connection timed out after 10000 milliseconds
Error in readLines(curl(my_url)) :
Timeout was reached: Connection timed out after 10000 milliseconds
Error in readLines(curl(my_url)) :
Timeout was reached: Connection timed out after 10000 milliseconds
FAIL
OK
...
sg <- summarizeToGene(st)
loading existing EnsDb created: 2019-03-28 04:55:12
obtaining transcript-to-gene mapping from TxDb
Error in tximport::summarizeToGene(txi, tx2gene, varReduce = varReduce, :
unused argument (varReduce = varReduce)
Calls: summarizeToGene
Execution halted
Please let me know if you have any idea about how to fix it.
Thanks,
Provide a docker container with the workflow?
Consider moving instructions from README to a Wiki. The wiki can also include a Methods document, describing the provided workflow in words (with references).
By adding the conda environment to each rule, a user can either activate the environment and run snakemake
, or directly run snakemake --use-conda
.
Use tximeta
to retrieve reference/add gene information. This will affect run_dge_edger.R
and run_dtu_drimseq.R
. Also, there may not be the need for generate_tx2gene.R
anymore.
Add a rule to be run after setting up the config file, but before running the whole workflow. The rule could be "optional, but recommended", and should do things like:
pkginstall
rule)I just replaced my repo with the new one to get the new -ncore update. But now when I try to run it, it hangs at this step:
(base) ruhrig@ubuntu-ruhrig:~/ARMOR$ snakemake --use-conda
/home/ruhrig/miniconda3/lib/python3.6/site-packages/snakemake/workflow.py:34: FutureWarning: read_table is deprecated, use read_csv instead, passing sep='\t'.
import snakemake.wrapper
Building DAG of jobs...
Using shell: /bin/bash
Provided cores: 1
Rules claiming more threads will be scaled down.
Job counts:
count jobs
1 DRIMSeq
1 all
4 bigwig
1 edgeR
8 fastqc
8 fastqc2
1 linkedtxome
1 multiqc
1 pkginstall
4 salmonPE
1 salmonindex
1 shiny
4 starPE
4 staridx
1 starindex
4 trimgalorePE
1 tximeta
46
[Wed Mar 20 16:57:42 2019]
rule pkginstall:
input: scripts/install_pkgs.R
output: example_data/output/Rout/pkginstall_state.txt
log: example_data/output/Rout/install_pkgs.Rout
jobid: 31
priority: 50
Activating conda environment: /home/ruhrig/ARMOR/.snakemake/conda/17597c04
^CTerminating processes on user request, this might take some time.
^CCancelling snakemake on user request.
^CError in atexit._run_exitfuncs:
Traceback (most recent call last):
File "/home/ruhrig/miniconda3/lib/python3.6/concurrent/futures/thread.py", line 40, in _python_exit
t.join()
File "/home/ruhrig/miniconda3/lib/python3.6/threading.py", line 1056, in join
self._wait_for_tstate_lock()
File "/home/ruhrig/miniconda3/lib/python3.6/threading.py", line 1072, in _wait_for_tstate_lock
elif lock.acquire(block, timeout):
KeyboardInterrupt
Currently, the bigwig
rule is not run (it should be a dependency of the shiny rule). If invoked explicitly, it gives the following error:
Target rules may not contain wildcards. Please specify concrete files or a rule without wildcards.
A declarative, efficient, and flexible JavaScript library for building user interfaces.
🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.
TypeScript is a superset of JavaScript that compiles to clean JavaScript output.
An Open Source Machine Learning Framework for Everyone
The Web framework for perfectionists with deadlines.
A PHP framework for web artisans
Bring data to life with SVG, Canvas and HTML. 📊📈🎉
JavaScript (JS) is a lightweight interpreted programming language with first-class functions.
Some thing interesting about web. New door for the world.
A server is a program made to process requests and deliver data to clients.
Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.
Some thing interesting about visualization, use data art
Some thing interesting about game, make everyone happy.
We are working to build community through open source technology. NB: members must have two-factor auth.
Open source projects and samples from Microsoft.
Google ❤️ Open Source for everyone.
Alibaba Open Source for everyone
Data-Driven Documents codes.
China tencent open source team.