zavolanlab / zarp

The Zavolab Automated RNA-seq Pipeline

License: Apache License 2.0

Python 81.63% Shell 18.37%
bioinformatics rna-seq snakemake conda docker fair high-throughput pipeline reproducible-research singularity

zarp's Introduction

ci GitHub license DOI:biorxiv DOI:zenodo DOI:workflowhub

ZARP (Zavolab Automated RNA-seq Pipeline) is a generic RNA-Seq analysis workflow that allows users to process and analyze Illumina short-read sequencing libraries with minimum effort. Better yet: With our companion ZARP-cli command line interface, you can start ZARP runs with the simplest and most intuitive commands.

RNA-seq analysis doesn't get simpler than that!

ZARP relies on publicly available bioinformatics tools and currently handles single or paired-end stranded bulk RNA-seq data. The workflow is developed in Snakemake, a widely used workflow management system in the bioinformatics community.

ZARP will pre-process, align and quantify your single- or paired-end stranded bulk RNA-seq libraries with publicly available, state-of-the-art bioinformatics tools. ZARP's browser-based rich reports and visualizations will give you meaningful initial insights into the quality and composition of your sequencing experiments - fast and simple. Whether you are an experimentalist struggling with large-scale data analysis or an experienced bioinformatician, when there's RNA-seq data to analyze, just zarp 'em!

Note: For a more detailed description of each step, please refer to the workflow documentation.

Requirements

The workflow has been tested on:

  • CentOS 7.5
  • Debian 10
  • Ubuntu 16.04, 18.04

NOTE: Currently, we only support Linux execution.

Installation

IMPORTANT: Rather than installing the ZARP workflow as described in this section, we recommend installing ZARP-cli for most use cases! If you follow its installation instructions, you can skip the instructions below.

1. Clone the repository

Go to the desired directory/folder on your file system, then clone/get the repository and move into the respective directory with:

git clone https://github.com/zavolanlab/zarp.git
cd zarp

2. Conda and Mamba installation

Workflow dependencies can be conveniently installed with the Conda package manager. We recommend that you install Miniconda for your system (Linux). Be sure to select the Python 3 option. The workflow was built and tested with Miniconda 4.7.12. Other versions are not guaranteed to work as expected.

Given that Miniconda has been installed and is available in the current shell, the first dependency for ZARP is the Mamba package manager, which needs to be installed in the base Conda environment with:

conda install mamba -n base -c conda-forge

3. Dependencies installation

For improved reproducibility and reusability of the workflow, each individual step of the workflow runs either in its own Singularity container or in its own Conda virtual environment. As a consequence, running this workflow has very few individual dependencies. The container execution requires Singularity to be installed on the system where the workflow is executed. As the functional installation of Singularity requires root privileges, and Conda currently only provides Singularity for Linux architectures, the installation instructions are slightly different depending on your system/setup:

For most users

If you do not have root privileges on the machine you want to run the workflow on or if you do not have a Linux machine, please install Singularity separately and in privileged mode, depending on your system. You may have to ask an authorized person (e.g., a systems administrator) to do that. This will almost certainly be required if you want to run the workflow on a high-performance computing (HPC) cluster.

NOTE: The workflow has been tested with the following Singularity versions:

  • v2.6.2
  • v3.5.2

After installing Singularity, install the remaining dependencies with:

mamba env create -f install/environment.yml

As root user on Linux

If you have a Linux machine as well as root privileges (e.g., if you plan to run the workflow on your own computer), you can execute the following command to include Singularity in the Conda environment:

mamba env update -f install/environment.root.yml

4. Activate environment

Activate the Conda environment with:

conda activate zarp

Extra installation steps (optional)

5. Non-essential dependencies installation

Most tests have additional dependencies. If you are planning to run tests, you will need to install these by executing the following command in your active Conda environment:

mamba env update -f install/environment.dev.yml

6. Successful installation tests

We have prepared several tests to check the integrity of the workflow and its components. These can be found in subdirectories of the tests/ directory. The most critical of these tests enable you to execute the entire workflow on a set of small example input files. Note that for this and other tests to complete successfully, additional dependencies need to be installed. Execute one of the following commands to run the test workflow on your local machine:

  • Test workflow on local machine with Singularity:
bash tests/test_integration_workflow/test.local.sh
  • Test workflow on local machine with Conda:
bash tests/test_integration_workflow_with_conda/test.local.sh

Execute one of the following commands to run the test workflow on a Slurm-managed high-performance computing (HPC) cluster:

  • Test workflow with Singularity:
bash tests/test_integration_workflow/test.slurm.sh
  • Test workflow with Conda:
bash tests/test_integration_workflow_with_conda/test.slurm.sh

NOTE: Depending on the configuration of your Slurm installation, you may need to adapt the file slurm-config.json (located directly under the profiles directory) and the arguments to the options --cores and --jobs in the file config.yaml of the respective profile. Consult the manual of your workload manager as well as the section of the Snakemake manual dealing with profiles.
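For orientation, a hedged sketch of how you might inspect the relevant settings (the key names mirror the Snakemake command-line options; the values shipped with the profiles may differ from what is shown here):

# Show the scheduler-related settings of the Slurm profiles; lower the values of
# "cores" and "jobs" if they exceed what your cluster allows (hypothetical example:
# change "jobs: 100" to "jobs: 20").
grep -E "cores|jobs" profiles/slurm-singularity/config.yaml profiles/slurm-conda/config.yaml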

Running the workflow on your own samples

Running ZARP with ZARP-cli (recommended)

Head over to ZARP-cli to learn how to start ZARP runs with very simple commands, like:

zarp SRR23590181

Running ZARP without ZARP-cli

  1. Assuming that your current directory is the workflow repository's root directory, create a directory for your workflow run and move into it with:

    mkdir config/my_run
    cd config/my_run
  2. Create an empty sample table and a workflow configuration file:

    touch samples.tsv
    touch config.yaml
  3. Use your editor of choice to populate these files with appropriate values. Have a look at the examples in the tests/ directory to see what the files should look like (e.g., tests/input_files/config.yaml); a rough sketch of a minimal sample table is also shown after this list.

  4. Create a runner script. Pick one of the following choices for either local or cluster execution. Before executing the respective command, remember to update the argument of the --singularity-args option of the respective profile (file: profiles/{profile}/config.yaml) so that it contains a comma-separated list of all directories containing input data files (samples, annotation files, etc.) required for your run.

    Runner script for local execution:

    cat << "EOF" > run.sh
    #!/bin/bash
    
    snakemake \
        --profile="../../profiles/local-singularity" \
        --configfile="config.yaml"
    
    EOF

    OR

    Runner script for Slurm cluster execution (note that you may need to modify the arguments to --jobs and --cores in the file profiles/slurm-singularity/config.yaml, depending on your HPC and workload manager configuration):

    cat << "EOF" > run.sh
    #!/bin/bash
    mkdir -p logs/cluster_log
    snakemake \
        --profile="../profiles/slurm-singularity" \
        --configfile="config.yaml"
    EOF

    Note: When running the pipeline with Conda, use the local-conda and slurm-conda profiles instead.

    Note: The Slurm profiles are adapted to a cluster that uses the quality-of-service (QOS) keyword. If QOS is not supported by your Slurm instance, remove all lines containing "qos" from profiles/slurm-config.json.

  5. Start your workflow run:

    bash run.sh
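As referenced in step 3 above, here is a rough, hedged sketch of how a minimal sample table might be filled (placeholder sample name and file paths; the authoritative set of columns and configuration keys is given by the examples under tests/):

# Minimal sample table sketch: columns sample, fq1 and fq2 as referred to elsewhere
# in this README; further ZARP columns (e.g. seqmode, f1_3p, f2_3p, organism,
# libtype, index_size) can be added here or inferred with the HTSinfer workflow
# described below.
printf 'sample\tfq1\tfq2\n' > samples.tsv
printf 'my_sample\t/path/to/my_sample_R1.fastq.gz\t/path/to/my_sample_R2.fastq.gz\n' >> samples.tsv
# For config.yaml, copy and adapt tests/input_files/config.yaml, which defines the
# full set of keys, including the path to the sample table.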

Sample downloads from SRA

An independent Snakemake workflow, workflow/rules/sra_download.smk, is included for downloading sequencing libraries from the Sequence Read Archive (SRA) and converting them into FASTQ.

The workflow expects the following parameters in the configuration file:

  • samples, a sample table (tsv) with a column sample containing SRR identifiers (ERR and DRR are also supported); see the example in tests/input_files/sra_samples.tsv.
  • outdir, an output directory
  • samples_out, a pointer to a modified sample table with the locations of the corresponding FASTQ files
  • cluster_log_dir, the cluster log directory.

For executing the example with Conda environments, one can use the following command (from within the activated zarp Conda environment):

snakemake --snakefile="workflow/rules/sra_download.smk" \
          --profile="profiles/local-conda" \
          --config samples="tests/input_files/sra_samples.tsv" \
                   outdir="results/sra_downloads" \
                   samples_out="results/sra_downloads/sra_samples.out.tsv" \
                   log_dir="logs" \
                   cluster_log_dir="logs/cluster_log"

Alternatively, change the argument to --profile from local-conda to local-singularity to execute the workflow steps within Singularity containers.

After successful execution, results/sra_downloads/sra_samples.out.tsv should contain:

sample  fq1     fq2
SRR18552868     results/sra_downloads/compress/SRR18552868/SRR18552868.fastq.gz 
SRR18549672     results/sra_downloads/compress/SRR18549672/SRR18549672_1.fastq.gz       results/sra_downloads/compress/SRR18549672/SRR18549672_2.fastq.gz
ERR2248142      results/sra_downloads/compress/ERR2248142/ERR2248142.fastq.gz 

Metadata completion with HTSinfer

An independent Snakemake workflow, workflow/rules/htsinfer.smk, populates the samples.tsv required by ZARP with the sample-specific parameters seqmode, f1_3p, f2_3p, organism, libtype and index_size. These parameters are inferred from the provided fastq.gz files by HTSinfer.

Note: The workflow uses Snakemake's implicit temporary directory, which is accessed via resources.tmpdir.

The workflow expects the following config:

  • samples, a sample table (tsv) with a column sample containing sample identifiers, as well as columns fq1 and fq2 containing the paths to the input FASTQ files (see the example in tests/input_files/htsinfer_samples.tsv). If the table contains further ZARP-compatible columns (see the pipeline documentation), the values specified there by the user take priority over HTSinfer's results.
  • outdir, an output directory
  • samples_out, path to a modified sample table with inferred parameters
  • records, set to 100000 by default

To execute the example, use the following commands (with the zarp environment activated):

cd tests/test_htsinfer_workflow
snakemake \
    --snakefile="../../workflow/rules/htsinfer.smk" \
    --restart-times=0 \
    --profile="../../profiles/local-singularity" \
    --config outdir="results" \
             samples="../input_files/htsinfer_samples.tsv" \
             samples_out="samples_htsinfer.tsv" \
    --notemp \
    --keep-incomplete

However, this call will exit with an error, as not all parameters can be inferred from the example files. The argument --keep-incomplete makes sure the samples_htsinfer.tsv file can nevertheless be inspected.

After successful execution - if all parameters could be either inferred or were specified by the user - [OUTDIR]/[SAMPLES_OUT] should contain a populated table with parameters seqmode, f1_3p, f2_3p, organism, libtype and index_size for all input samples as described in the pipeline documentation.

zarp's People

Contributors

angrymaciek, balajtimate, dominikburri, fgypas, magmir71, mkatsanto, mzavolan, nielsschlusser, ninsch3000, noepozzan, uniqueg


zarp's Issues

docs: add changelog

Set up changelog (in CHANGELOG.md in root directory), describing v1.0.0 features in bullet point format

chore: set up automatic versioning

When consistently using semantic/conventional commits, it is possible to automatically create releases, for example, whenever we merge into the main branch.

This should cover the following:

  • Determine whether a release should bump the major, minor or patch version (this requires strict following of the conventional commits specs!)
  • Tag and describe the release
  • Update the changelog (see #5)

Possibly we could also configure this to publish updates to Zenodo and other outlets (e.g., WorkflowHub, Dockstore) via web hooks, but I'm not sure.
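As a hedged illustration (hypothetical commit messages), the commit type prefix is what determines which part of the version is bumped:

git commit -m "fix: guard against empty read_lengths in htsinfer_to_tsv"   # patch release
git commit -m "feat: add FastQC report for trimmed reads"                  # minor release
git commit -m "feat!: restructure the results directory layout"            # major release (breaking change)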

Github actions too slow

Is your feature request related to a problem? Please describe.
The GitHub Actions CI test creates the same environment multiple times and uses separate jobs for tasks that could run in a single job.
Using Mamba may make it faster.

Describe the solution you'd like
Use Mamba and run all tasks in one job, creating the zarp environment only once.

Additional context
https://github.com/conda-incubator/setup-miniconda

Infer sample metadata from data itself with HTSinfer

Is your feature request related to a problem? Please describe.

Currently, metadata for each sample has to be determined and put in a sample table, which is tedious.

Describe the solution you'd like

We can call HTSinfer and have the tool sniff the missing metadata. As HTSinfer does not estimate fragment size distributions, for single-end libraries we should set a "reasonable" default here if the user does not provide them.

Note that this is a meta-issue, as it will require several changes, at least the following (prepare one issue each for the following):

  • Publish HTSinfer on Conda and BioContainers.
  • Create a rule that runs HTSinfer
  • Parse original sample table and call HTSinfer rule accordingly (no need to call it with all functionalities for all samples if metadata was already supplied; rule or bare Python code)
  • Prepare an updated sample table (or dictionary that Snakemake can access?) from the output (rule or bare Python code)
  • Read values from updated table or dictionary across entire pipeline

Additional context

Originally, we were thinking of integrating HTSinfer into ZARP-cli. However, as it is potentially a long-running tool, it is better to run it as part of the pipeline in its own environment.

unset temporary directory for map genome STAR?

This is rather a note than an issue.

The provided tests/input_files/rule_config.yaml contains specification for the temporary directory in rule map_genome_star (and pe_map_genome_star):

--outTmpDir: '/tmp/star_tmp'

I guess this temporary directory is fine on Linux systems and for local execution, but I ran into problems when running it on Slurm.
Previously I did not have problems with this, but recently I got - for some samples only - errors with the temporary directory (unfortunately I don't have the error message anymore; it got overwritten by new, successful runs).
I tried to

  • switch /tmp/star_tmp to $TMPDIR/star_tmp within the rule config,
  • move the setting into the rule map_genome_star in workflow/rules/single_end.snakefile.smk,
  • use the actual value on the cluster instead of $TMPDIR,

but all attempts gave the same error message that the temporary directory is not accessible. Binding the temp dir in the singularity-args did not help.

I ended up completely removing the specification of the temp directory and instead using the default one: "the temp directory will default to outFileNamePrefix_STARtmp" (from STAR manual 2.7.10a), which resolved the issue.
This is not ideal, as the temp directory is now within the results directory and does not use the cluster's temp directories.
Since I don't know whether the problem is with the cluster, Singularity (e.g. binding paths), or the Snakemake setup, I cannot propose a solution and rather ask for help.

bug: local test with singularity fails?

Describe the bug
I have cloned the repo fresh and followed the instructions from the README to run a local test with singularity containers.

To Reproduce
Follow the testing as in the README.

Expected behavior
Clean pass

Screenshots

I cannot post the screenshot, so as not to disclose the sciCORE directory tree. Please check Slack.

Fastqc report after adapter trimming

Is your feature request related to a problem? Please describe.
The current FastQC rule is applied to the raw FASTQ files. However, it is still relevant for the user to see what the read quality is, and whether there are still overrepresented sequences, after trimming.

Describe the solution you'd like
Introduce a second rule fastqc_trimmed, where the trimmed fastq file is used as input.

chore: inspect the required resource scaling

Is your feature request related to a problem? Please describe.
We have encountered a problem with cutadapt crashing due to low memory on some datasets; we should check how all the steps scale with the input data size and account for that accordingly in the pipeline.

Scale time resources

Is your feature request related to a problem? Please describe.
Each rule has a static time requirement. However, real running times scale with the input file sizes. On top of that, in the existing implementation, specific queues of the sciCORE Slurm system are specified in the Slurm profile, which might cause incompatibilities with other Slurm systems.

Describe the solution you'd like

  • Make use of time variable within each rule. Scale the estimate with the input sizes.

Based on some existing runs, I can fit the size-runtime relationship by least squares and use the coefficient for scaling. Not sure if this will always work (multiple inputs, multiprocessing optimisation). Still, as we gather more information on successful runs, we could improve the accuracy of these estimates.

  • Check if the attempt keyword can be used to increase the time requirement upon restart.

Turns out there is no access to the attempt parameter in the params field. The plan is to first calculate the time estimate in the resources and then interface that to the params field, so that we can then feed it to the cluster.json

  • Check if the time estimate is translated correctly in terms of the queueing system.
    For the 6hour queue this works fine. Jobs that require less than 30 minutes are still going to the 6hour queue, which is specific to this slurm instance and should be fine in general.

Implementation tasks:

  • Add params and resources time parameters and use standard times first
  • Fix the time parameter to be scaled to the input sizes by finding coefficient of scaling for a few rules (ensure the concept works).
  • Fix the time parameter to be scaled to the input sizes by finding coefficient of scaling for each rule
  • Remove specific rule specifications in cluster.json files
  • Run standard tests
  • Run some real dataset to ensure the estimates are reasonable
  • Check efficiency score of the workflow to observe if it has increased

Describe alternatives you've considered
If all of the above do not work, focus on eliminating the queue specification from the cluster.json

cutadapt output is inconsistent while running in singularity and conda

@dominikburri discovered this issue while debugging today.

At the end of the workflow we get different output files for cutadapt depending on which execution strategy we use (on the test data), only because a column with the version number is added to the output table. It is just a format issue; the actual result values are the same.

This is strange, however, since we specify the same version of cutadapt for both Conda and Singularity. Maybe they are pulling different patch versions? Or different builds?

Anyway, this should be easy to fix by freezing exactly the same version of the software for both approaches.
This could potentially solve the problem we discovered in #16 (comment); I think this might be the same thing manifesting in a different place.

FYI @fgypas

Binding path error

Describe the bug
The singularity slurm test fails with error: unable to add "{path}" to mount list: destination must be an absolute path

To Reproduce
Run zarp/tests/test_integration_workflow/test.slurm.sh

Expected behavior
The workflow fails with an error like the one in the description.

feat: Automatically increase compute resources if insufficient

Currently for HPC (Slurm) execution, the compute resources like memory and time are specified per job in profiles/slurm-config.json.

If the execution fails because a job needs more resources, one needs to adjust the above file and re-run the pipeline manually, which is not very efficient.

Snakemake provides a functionality to re-run failed jobs: https://snakemake.readthedocs.io/en/stable/snakefiles/rules.html?highlight=attempt#resources

What it requires:

  • the option restart-times has to be added to the snakemake command with the number of attempts
  • each affected job should include the parameter attempt with which the memory and/or time limits can be adjusted (e.g. multiplying attempt with memory in MB)

In this way, jobs that fail because of insufficient resources are re-submitted with more resources.
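A hedged sketch of the command-line side of this (the attempt-based scaling itself would live in the rules' resources definitions):

# Sketch: allow up to 3 automatic re-submissions of failed jobs; combined with
# attempt-aware memory/time settings in the rules, each retry can request more resources.
snakemake \
    --profile="profiles/slurm-singularity" \
    --configfile="config.yaml" \
    --restart-times=3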

Support for unstranded data

This issue was transferred from the previous issue on GitLab (number 89).

Original description

It seems that the v0.1 milestone will support only stranded data.
We need to make sure that in later versions we support unstranded data as well.

Comments

@fgypas
It seems that we might not support it in the end, so closing for now.

@uniqueg
I disagree strongly and think it should be part of the published version. The very first test with external real-world data (SARS-related samples) demonstrated that there are still plenty of relevant unstranded libraries around. Besides, whether a library is stranded or not is (unfortunately) not typically reported when uploading samples to SRA, so whenever anyone wants to run a sample from there, it's a gamble whether we will accept it (after checking first with another pipeline) or not. That's a really poor user experience...

@mzavolan
Well... I think generating unstranded data makes no sense, especially now. When analyzing such data, people make choices that are not warranted and introduce errors for sure (e.g., cumulating reads from the plus and minus strands and, of course, discarding regions where there is transcription in both directions). I don't want to spend time implementing and testing choices like this. How do you want to proceed?

@uniqueg
I understand the sentiment but then that argument applies more or less to any data that were obtained with inferior protocols or outdated technology. And to my taste that depends a little bit too much on personal opinion and what kind of resources you have access to. Unstranded data has been proven useful in the past, so I can't really see how analyzing them makes no sense, despite all the obvious drawbacks.
Anyway, how about we leave this issue open and defer the discussion until we know how to proceed with Rhea? If we want to allow our users to analyze samples straight from SRA, I think Rhea should handle the vast majority of samples on there, and that would probably mean it should handle unstranded libraries. If we don't care too much about that particular use case and mostly concern ourselves with how Rhea can serve ourselves, then we should probably drop it for the reasons mentioned.

@mkatsanto
This feature will not be implemented in the near future according to our scope. If the public requires it we can implement it in later versions.

@uniqueg
Do it if reviewers ask for it; otherwise wait until users ask for it.

take a closer look at the "notemp" flag in snakemake profiles

An issue we discovered with @mkatsanto today while debugging:

We are not sure the notemp flag should be set in the Snakemake profiles. However, if we get rid of it, then during pipeline execution MultiQC fails - it does not find files which it expects... They are probably removed due to the temp() mechanism.

We need to take a closer look at what needs to be specified as required for that rule...

feat: create .snakemake-workflow-catalog.yml

Is your feature request related to a problem? Please describe.
For snakemake workflow catalogue visibility

Describe the solution you'd like
Create the file; check out the instructions at https://snakemake.github.io/snakemake-workflow-catalog/?rules=true

usage:
  mandatory-flags: # optional definition of additional flags
    desc: # describe your flags here in a few sentences (they will be inserted below the example commands)
    flags: # put your flags here
  software-stack-deployment: # definition of software deployment method (at least one of conda, singularity, or singularity+conda)
    conda: true # whether pipeline works with --use-conda
    singularity: true # whether pipeline works with --use-singularity
    singularity+conda: true # whether pipeline works with --use-singularity --use-conda
  report: true # add this to confirm that the workflow allows to use 'snakemake --report report.zip' to generate a report containing all results and explanations
                


bug: htsinfer: ValueError on max function

Describe the bug
Running the HTSinfer workflow fails on some SRA samples.
It seems to be a problem when no read_lengths are present, so the max function has nothing to operate on.

Tried with this sample table and SRA download workflow (which downloads the fq.gz successfully).

sample	fq1	fq2
SRR18223821		
SRR18125741		
SRR19162082		
SRR19770954		
SRR701798		

Screenshots
Parts of the error within htsinfer_to_tsv.stderr.log

    tparams["index_size"] = max([int(i) for i in read_lengths if i is not None])

ValueError: max() arg is an empty sequence

refactor: remove {seqmode} wildcard if applicable

Is your feature request related to a problem? Please describe.
As sample names have to be unique, there cannot be one sample with two different sequencing modes. That means the seqmode doesn't have to appear in filenames, and consequently the wildcard {seqmode} is redundant. However, whether this is true has to be checked first.

Describe the solution you'd like
Check whether it might be possible to remove seqmode from filenames. If yes, implement.

Describe alternatives you've considered
Maybe this theory is wrong and we need seqmode as a wildcard.

Additional context
At least in some places it should be possible to remove the wildcard

rule star_rpm with FileNotFoundError

This is more of a note in case others experience similar behaviour, so that they can rule this out as the problem.

Describe the bug
Workflow fails on rule star_rpm with FileNotFoundError.

To Reproduce
Steps to reproduce the behavior:

  1. Create a zarp run, but keep the config and results directories separate.
  2. Start zarp with Singularity from base repo directory.
  3. Workflow will likely fail on star_rpm
  4. Remove shadow: "full" from the rule, restart.
  5. Workflow will run through.

Expected behavior
The shadow keyword takes relative paths to the working directory. Maybe a more meaningful error message would help, or a check for the file locations could be done prior to the start.

Additional context
The shadow keyword reduces the disk space footprint of the workflow and makes it possible to delete whole temp directories.
But it requires that all files are in subdirectories of the working directory.
Maybe removing this altogether is less hassle?

chore: snakemake version update

Is your feature request related to a problem? Please describe.
There has been a new major Snakemake release since we previously pushed this project: 7.0.0+.
Should we update @mkatsanto?

Describe the solution you'd like
If yes - it would require changing two lines in the environment recipes, should be easy...

bugfix: update & pull new version of the TIN score repo

The TIN score script takes too much memory; the sciCORE team needed to kill the cluster jobs...
It seems that there is still some problem with the parallelization.
REPO: https://github.com/zavolanlab/tin-score-calculation

Once this is fixed we should bump up the version and create a new release in the TIN score repo + push to pypi (automatic in the CI) + upload to biocontainers. Then we should refer to that new version in our zarp main Snakefile here.

CC: @mkatsanto @dominikburri

Normalize results by fraction of reads in some MultiQC plots

Is your feature request related to a problem? Please describe.
Some MultiQC plots (e.g., from FastQC, tin-score-calculation, STAR), report absolute numbers of reads rather than fractions of reads. While it is useful to know the sequencing depths of the different samples/libraries, it is perhaps not very informative/helpful to represent these differences in multiple plots, as this makes it hard to compare key metrics across samples, e.g., the fraction of (uniquely) mapped reads across samples. This is especially problematic when these plots are exported and put in publications as static images.

Describe the solution you'd like
Have one plot that reports sequencing depths/library sizes for each sample as bar plots. Have all other plots report fractions of reads instead.

Additional context
It may not be possible to configure MultiQC in a way that read fractions are visualized instead of read numbers. In that case, it is probably too much of a hassle to implement the desired solution.

Choose different SRA samples for SRA download pipeline test

Is your feature request related to a problem? Please describe.
The two samples specified in the current test file tests/input_files/sra_samples.tsv are not suitable for testing ZARP, as the first one is Nanopore sequencing (ZARP only supports Illumina) and the second is metagenomic, which makes it impossible to test the organism inference of the HTSinfer pipeline, which is supposed to run after the SRA pipeline.

Describe the solution you'd like
Put different SRA numbers in the sra_samples.tsv file, e.g. some from mouse.

Describe alternatives you've considered
SRAs from E.coli, or C.elegans, or D.melanogaster, or S.cerevisiae


feat: replace snakemake pkg manager: conda -> mamba

Since we use Snakemake version > 6.0.0, the preferred frontend package manager for building internal, per-rule environments is now Mamba, not Conda.

I have tried to tackle it quickly but, strangely enough, when I specify to use Mamba, our output files for cutadapt do not match the expected checksums... (see: #16 (comment)). Therefore, for now our profiles specify --conda-frontend conda.

It might be nice to replace conda with mamba, eventually...

bug: local test with conda fails?

Describe the bug
Running bash tests/test_integration_workflow_with_conda/test.local.sh exits with

WorkflowError:
Python package jinja2 must be installed to create reports.
  File "/home/christina/miniconda3/envs/zarp/lib/python3.7/site-packages/snakemake/report/__init__.py", line 594, in auto_report

This happens at the step

+ snakemake --snakefile=../../workflow/Snakefile --configfile=../input_files/config.yaml --report=snakemake_report.html

although the environment.yml for zarp conda env specifies jinja2=2.11.2

To Reproduce
Steps to reproduce the behavior:

  1. Follow installation instructions in zarp README to create the conda env (I did a fresh creation of the conda env root & update with dev)
  2. conda activate zarp
  3. run bash tests/test_integration_workflow_with_conda/test.local.sh

Expected behavior
exit status 0.


tempdir specification required for STAR-related rules

In a very specific test case where @dominikburri cloned the repo on his local machine and tried to execute the pipeline with Singularity containers, two STAR-related rules failed. The input data were on our login node, mounted to his OS. STAR could not find a tempdir and required an additional flag.

I think this could be fixed in the profiles with the --default-resources keyword for Snakemake: we would need to specify tmpdir= just there.

CC @mkatsanto
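A hedged sketch of that proposal (the scratch path is a placeholder; when running with Singularity, the directory may also need to be added to the bind paths given via --singularity-args):

# Sketch: set Snakemake's standard tmpdir resource for all rules instead of
# hard-coding STAR's --outTmpDir; /scratch/$USER is a placeholder path.
snakemake \
    --profile="profiles/slurm-singularity" \
    --configfile="config.yaml" \
    --default-resources "tmpdir='/scratch/$USER'"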

Run C.elegans samples

Test the workflow on another organism, further using the SRA download. E.g. could try C.elegans, samples from project SRP330447

bug: results subdir architecture bug?

While @meric466 ran the test data, @mkatsanto and I discovered that there is probably a bug in the results directory architecture.
Under results/samples we have a per-sample results directory. However, besides each sample s we have another directory s.se which contains only the sort_genomic_alignments_samtools subdirectory. We are probably not evaluating wildcards properly in some rule...

chore: create an example run

Created on GitLab by @mkatsanto, initially assigned to @meric466 :

Create a real sample run along with the reports and visualisations.
Input data, configs, intermediate data & output should be made available on Zenodo and linked from the documentation.

Slurm QOS parameter

Is your feature request related to a problem? Please describe.
Not every Slurm cluster supports QOS, but the default configuration specifies one:

"qos": "6hours",

This will cause errors on clusters without QOS.

Describe the solution you'd like
Ideally, each slurm cluster can handle the workflow without modifications.
The QOS assignment above could be removed from the default slurm configuration, and an additional one could be created for clusters with QOS.

Describe alternatives you've considered
Alternatively, note the current behaviour with QOS in the readme or in some other format (e.g. FAQ).
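Until then, a hedged workaround consistent with the note in the installation section above (keep a backup and check that the remaining JSON is still valid afterwards):

# Workaround sketch: drop all "qos" entries from the Slurm profile configuration
# on clusters without QOS support (back up the original file first).
cp profiles/slurm-config.json profiles/slurm-config.json.bak
sed -i '/"qos"/d' profiles/slurm-config.json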

Snakemake update

Describe the bug
An update of Snakemake is crucial, as it will deal with Slurm execution.
Currently there are too many sacct calls during Slurm execution, which might cause issues for the execution system.
Snakemake-Profiles/slurm#81

Combine conda installations into one

Is your feature request related to a problem? Please describe.
When installing zarp, one usually wants to test the workflow with the provided tests.
But this requires the additional installation of the dev environment: https://github.com/zavolanlab/zarp#5-non-essential-dependencies-installation
In particular, bedtools is missing from the "standard" environment.

Describe the solution you'd like
One conda installation should be sufficient to run the provided tests.

Describe alternatives you've considered
Keep the dev environment, but explain the behaviour in more detail in the README.

Do not require logo & URL for MultiQC report

The user should have the ability to choose not to supply a custom logo and URL for the MultiQC report when running ZARP. The outcome should be that no logo and URL are provided in the report.

To achieve this, the user should be able to specify empty strings for these parameters. To further improve user experience, the report_logo and report_url parameters could/should be made entirely optional in config.yaml (then taking on empty strings as defaults).

Another question in this context is whether, in the current implementation, a user is allowed to specify an HTTP(S) URL pointing to a logo, rather than the logo itself. It would just be nice to test this out while implementing the above, so that we can document that behavior more accurately.

feat: sra download

Download run files from SRA.

Describe the solution you'd like
This download should NOT happen within core ZARP. Instead, a separate (small) snakemake pipeline will for now be created within the workflow/rules/ directory, which can be called by ZARP-cli. At a later stage that pipeline might be migrated elsewhere.

Inputs:

  • SRR identifier OR SRA run table

Outputs:

  • directory containing all downloaded SRR files

See related zarp-cli issue

outdated gitignore

Describe the bug

gitignore is outdated

To Reproduce
Steps to reproduce the behavior:

  1. Create a new run within config/.
  2. All run files are not ignored by git.

Expected behavior
All results and config files from own samples should be ignored.

possible fix:
The lines with run in .gitignore (lines 329 & 330) should be replaced by config.

chore: migrate future issues from GitLab

We have some ongoing work/discussion in a few issues that we could come back to in the future.
It would be very nice to have them transferred over here.
Should we transfer them manually or is there an automatic way?

doc: config/README.md

Is your feature request related to a problem? Please describe.
For snakemake workflow catalogue visibility

Describe the solution you'd like
move section "running the workflow on your own samples" to config/README.md


Download genome resources

The genome sequence and annotation file for the species from which the RNA-seq data come are needed as inputs to ZARP.

We use the reference files from Ensembl.

The user supplies the species name (e.g. homo_sapiens) or taxon ID (e.g. 9606) and the chromosome sequences that compose the karyotype (no haplotypes) are downloaded, along with the corresponding annotation.
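As a rough manual illustration of what such a download rule might do (hypothetical Ensembl release number; the exact file names depend on species, assembly and release):

# Sketch: fetch the soft-masked primary assembly (chromosomes and scaffolds, no
# alternate haplotypes) and the matching GTF annotation for homo_sapiens from the
# Ensembl FTP server; release 110 is only an example.
wget http://ftp.ensembl.org/pub/release-110/fasta/homo_sapiens/dna/Homo_sapiens.GRCh38.dna_sm.primary_assembly.fa.gz
wget http://ftp.ensembl.org/pub/release-110/gtf/homo_sapiens/Homo_sapiens.GRCh38.110.gtf.gz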
