chanzuckerberg / idseq-workflows

Portable WDL workflows for IDseq production pipelines

License: MIT License

Makefile 0.07% WDL 16.34% Shell 0.20% Python 79.28% Dockerfile 1.34% Awk 0.08% Jupyter Notebook 2.69%

idseq-workflows's Introduction

idseq-workflows - portable IDseq production pipeline logic

Please see https://github.com/chanzuckerberg/czid-workflows for CZ ID workflows. This repository is no longer maintained.

Infectious Disease Sequencing Platform

IDseq is a hypothesis-free global software platform that helps scientists identify pathogens in metagenomic sequencing data.

Discover - Identify the pathogen landscape
Detect - Monitor and review potential outbreaks
Decipher - Find potential infecting organisms in large datasets

IDseq is a collaborative open project of Chan Zuckerberg Initiative and Chan Zuckerberg Biohub.

Running these workflows

This repository contains WDL workflows that the IDseq platform uses in production. See Running WDL workflows locally to get started with them.

CI/CD

We use GitHub Actions for CI/CD. Lint and unit tests run on GitHub from jobs in .github/workflows/wdl-ci.yml (triggered on every commit).

idseq-workflows's Issues

Pipeline continues downloading essential files at every run

Hey @rzlim08, I successfully installed the latest idseq-workflows. Now, when I run the following command, the primer, genome and other files start downloading every time, and the run takes hours.

time miniwdl run --verbose idseq-workflows/consensus-genome/run.wdl docker_image_id=idseq-consensus-genome fastqs_0= SARSCoV2_firstBatch/S11_L001_R1_001.fastq.gz fastqs_1= SARSCoV2_firstBatch/S11_L001_R2_001.fastq.gz sample= S11 technology=Illumina ref_fasta=s3://idseq-public-references/consensus-genome/MN908947.3.fa -i idseq-workflows/consensus-genome/test/local_test.yml --debug

In fact, I already have all the files downloaded in /tmp/miniwdl_download_cache/files/s3/idseq-public-references/_consensus-genome, but the pipeline starts downloading them all again and aborts with an error (sometimes kraken_coronavirus_db_only.tar.gz is not found, and sometimes hg38.fa.gz is not found). I tried manually pasting the essential files into place, but nothing worked.

PS: I always run export MINIWDL__DOWNLOAD_CACHE__DIR=/tmp/miniwdl_download_cache before running the main command above.

Kindly help.

The previous version of idseq-workflows worked fine on my workstation, but I am having difficulties with the latest update.

Running tests locally fails

Hi everybody,

I am working on packaging idseq-dag for Debian, but the tests are failing.

Beyond that, I am trying to run the tests locally, but I get this error:

sudo python3 -m unittest tests/test_samples_on_local_steps.py 
E{"time": "2020-04-21T22:09:03.245", "data": {"event": "ctx_exec", "context_name": "command.make_dirs", "uid": "3f3533f1285b", "values": {"path": "/mnt/idseq/results/star_out/257549"}, "duration_ms": 13}, "thread": "MainThread", "pid": 257549, "level": "INFO"}
{"time": "2020-04-21T22:09:03.246", "data": {"event": "ctx_exec", "context_name": "command.make_dirs", "uid": "1441993ea123", "values": {"path": "/mnt/idseq/ref"}, "duration_ms": 0}, "thread": "MainThread", "pid": 257549, "level": "INFO"}
E
======================================================================
ERROR: test_all_local_steps (tests.test_samples_on_local_steps.TestSamplesOnLocalSteps)
----------------------------------------------------------------------
TypeError: test_all_local_steps() missing 3 required positional arguments: 'dag_file', 'test_bundle', and 'output_dir_s3'

======================================================================
ERROR: test_many_samples (tests.test_samples_on_local_steps.TestSamplesOnLocalSteps)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/eamanu/Debian/idseq-dag/github/idseq-dag/tests/test_samples_on_local_steps.py", line 37, in test_many_samples
    self.test_all_local_steps(dag_file, bundle, output_dir_s3)
  File "/home/eamanu/Debian/idseq-dag/github/idseq-dag/tests/test_samples_on_local_steps.py", line 55, in test_all_local_steps
    step_class, step_name, dag_file, test_bundle, output_dir_s3)
  File "/home/eamanu/Debian/idseq-dag/github/idseq-dag/tests/test_utils.py", line 84, in run_step_and_match_outputs
    test_bundle, output_dir_s3)
  File "/home/eamanu/Debian/idseq-dag/github/idseq-dag/tests/idseq_step_setup.py", line 95, in get_test_step_object
    result_dir_local)
  File "/home/eamanu/Debian/idseq-dag/github/idseq-dag/idseq_dag/engine/pipeline_flow.py", line 173, in fetch_input_files_from_s3
    output_file = idseq_dag.util.s3.fetch_from_s3(s3_file, local_dir, allow_s3mi=True)
  File "/home/eamanu/Debian/idseq-dag/github/idseq-dag/idseq_dag/util/s3.py", line 299, in fetch_from_s3
    if is_reference or os.path.abspath(dst).startswith(config["REF_DIR"]):
TypeError: startswith first arg must be str or a tuple of str, not NoneType

----------------------------------------------------------------------
Ran 2 tests in 0.017s

FAILED (errors=2)

Looking at the code, it seems that PipelineFlow sets config["REF_DIR"]:

idseq_dag.util.s3.config["REF_DIR"] = self.ref_dir_local

but that configuration would need to be stored in PipelineFlow (or persisted in some other way) to outlive that call. As a result, when the test runs fetch_from_s3, config still holds its default values, and REF_DIR is None.
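The failure in the traceback can be reproduced in isolation. The sketch below is not the idseq-dag code: `is_under_ref_dir` is a hypothetical helper, and the `config` dicts are stand-ins for the module-level `config` in idseq_dag.util.s3. It only shows why `startswith` raises when REF_DIR is None, and one possible guard.

```python
import os


def is_under_ref_dir(dst, config):
    """Return True if dst lies under config["REF_DIR"].

    Hypothetical helper: treats an unset REF_DIR as "not a reference path"
    instead of passing None to str.startswith, which raises TypeError.
    """
    ref_dir = config.get("REF_DIR")
    if ref_dir is None:
        return False
    return os.path.abspath(dst).startswith(ref_dir)


# Stand-in for the default state seen by the test: REF_DIR was never set.
default_config = {"REF_DIR": None}
# Stand-in for the state after PipelineFlow has assigned its local ref dir.
configured = {"REF_DIR": "/mnt/idseq/ref"}

unguarded_error = None
try:
    # This mirrors the unguarded expression from the traceback.
    os.path.abspath("/mnt/idseq/ref/db.fa").startswith(default_config["REF_DIR"])
except TypeError as exc:
    unguarded_error = exc

print(type(unguarded_error).__name__)
print(is_under_ref_dir("/mnt/idseq/ref/db.fa", default_config))
print(is_under_ref_dir("/mnt/idseq/ref/db.fa", configured))
```

Whether a guard like this or an explicit REF_DIR setup in the test fixture is the better fix depends on how the maintainers want the tests to initialize the module state.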

Unable to Fetch Some Archives

Hi @mlin, yesterday I cloned the updated idseq-workflows but got an error while running the following command:
docker build -t idseq-consensus-genome idseq-workflows/consensus-genome

The error occurs at step 12/17:

RUN apt-get install -y python3-cffi python3-h5py python3-intervaltree python3-edlib muscle git
 ---> Running in 819cc4b4d14a

Get:41 http://us-west-2.ec2.archive.ubuntu.com/ubuntu focal/universe amd64 python3-h5py amd64 2.10.0-2build2 [873 kB]
Get:42 http://us-west-2.ec2.archive.ubuntu.com/ubuntu focal/main amd64 python3-sortedcontainers all 2.1.0-2 [27.3 kB]
Get:43 http://us-west-2.ec2.archive.ubuntu.com/ubuntu focal/universe amd64 python3-intervaltree all 3.0.2-1.1 [22.4 kB]
Fetched 17.3 MB in 46s (377 kB/s)
E: Failed to fetch http://security.ubuntu.com/ubuntu/pool/main/g/git/git-man_2.25.1-1ubuntu3.1_all.deb  404  Not Found [IP: 34.210.25.51 80]
E: Failed to fetch http://security.ubuntu.com/ubuntu/pool/main/g/git/git_2.25.1-1ubuntu3.1_amd64.deb  404  Not Found [IP: 34.210.25.51 80]
E: Unable to fetch some archives, maybe run apt-get update or try with --fix-missing?
The command '/bin/sh -c apt-get install -y python3-cffi python3-h5py python3-intervaltree python3-edlib muscle git' returned a non-zero code: 100

My workstation is fully updated and upgraded, I retried after installing the packages manually but still got the same error. I have attached the detailed error file.

My Workstation's Specs are:

OS Name: Ubuntu 21.04
OS Type: 64-bit

Thank you in Advance.

Ideseq_error_Samiah.txt

possible issue with INDEL calling by samtools

We recently discovered a bug in our code: by default, we were leaving out most of the INDELs in the VCF.

I see idseq is not enforcing the samtools version, so I'm not sure whether the version used here is affected. For your reference.

Size of VMs needed for full metagenomics test

To run the workflow on the full metagenomics databases used by IDseq, we recommend starting with an Amazon EC2 r5d.24xlarge - you may want to say that this EC2 instance size includes

96 vCPUs
~768 GiB RAM
4 * 900 GiB NVMe SSD instance storage (or ~3.6 TB total).
AWS on-demand pricing for this instance type is currently ~$7 USD/hr.

Also which GCP instance type (and size) would be recommended for this test? - https://cloud.google.com/compute/docs/machine-types

Error while Running Test Example - Consensus Genome

I successfully installed miniwdl and all other dependencies according to the steps on the GitHub page. Now, when I try to run the consensus genome test example using the command:

miniwdl run --verbose consensus-genome/run.wdl docker_image_id=idseq-consensus-genome fastqs_0=idseq-workflows/consensus-genome/test/sample_sars-cov-2_paired_r1.fastq.gz fastqs_1=idseq-workflows/consensus-genome/test/sample_sars-cov-2_paired_r2.fastq.gz sample=sample_sars-cov-2_paired technology=Illumina -i idseq-workflows/consensus-genome/test/local_test.yml

I am getting the following error:

miniwdl-run docker task rejected, desired state shutdown: invalid bind mount source, must be an absolute path: /tmp/miniwdl_download_cache/files/s3/idseq-public-references/_consensus-genome/human_chr1.fa :: error: "RuntimeError", dir: "/home/samiahkanwar/Desktop/AKU_System/IDSeqPipeline_9Jul2021/idseq-workflows/20210713_122226_consensus_genome", from_dir: "/home/samiahkanwar/Desktop/AKU_System/IDSeqPipeline_9Jul2021/idseq-workflows/20210713_122226_consensus_genome/call-RemoveHost"

Kindly help me in this regard. I am available to provide further information.

Question - workflow language selection evaluation?

I noticed that you're switching from a homebrew DAG processor to WDL and was wondering if you did an evaluation of the various workflow languages/processors as part of that process. If so, is that evaluation available anywhere?

I've got a project that will be tackling a similar evaluation soon.

add test cases for maximum e-value filter on alignment results

Assertion: The maximum e-value for alignments in IDseq is 1.

Implementation Details:
The maximum e-value threshold filter is applied in two different locations within the code base:

For short read alignments, the filter is applied inside the iterate_m8() function in the .m8 utils.
For contig alignments, the filter is applied using filters in PipelineStepBlastContigs.

We expect that there may be alignments with e-values > 1 in the initial alignment files (gsnap.m8, rapsearch2.m8, gsnap.blast.m8, rapsearch2.blast.m8).
The filter is then applied to the raw .m8 results when parsing for the top hits. There should never be e-values > 1 in the following files:

gsnap.deduped.m8
rapsearch2.deduped.m8
gsnap.blast.top.m8
rapsearch2.blast.top.m8

This was implemented as part of chanzuckerberg/czid-dag#309

Test Sample:
This was tested on staging using benchmark sample UnAmbiguouslyMapped_ds.gut. In particular: staging sample ID 19379 was run prior to the fix, staging sample ID 19361 was run after the fix.

For example, in sample 19361:
gsnap.m8 has 32 rows with e-value > 1, but gsnap.deduped.m8 has zero.
rapsearch2.m8 has 45 rows with e-value > 1, but rapsearch2.deduped.m8 has zero.
rapsearch2.blast.m8 has 5172 rows with e-value > 1, but rapsearch2.blast.top.m8 has zero.
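
A test case for this assertion could count the rows above the threshold in each file. The following is a minimal sketch, not the pipeline's own iterate_m8(): `count_rows_above_evalue` is a hypothetical helper, the sample rows are made up for illustration, and it assumes the tab-separated BLAST tabular layout in which the e-value is the 11th column and is stored as a plain number.

```python
MAX_EVALUE = 1.0  # threshold stated in the assertion above


def count_rows_above_evalue(m8_lines, max_evalue=MAX_EVALUE):
    """Count alignment rows whose e-value exceeds max_evalue.

    Assumes tab-separated rows with the e-value in the 11th column
    (index 10); blank lines and '#' comment lines are skipped.
    """
    count = 0
    for line in m8_lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        fields = line.split("\t")
        if float(fields[10]) > max_evalue:
            count += 1
    return count


# Made-up rows standing in for a raw .m8 file and its filtered counterpart.
raw_rows = [
    "read1\tacc1\t98.0\t100\t2\t0\t1\t100\t1\t100\t1e-30\t180.0",
    "read2\tacc2\t60.0\t30\t12\t0\t1\t30\t5\t34\t3.2\t25.0",
    "read3\tacc3\t75.0\t40\t10\t0\t1\t40\t7\t46\t1.0\t30.0",
]
filtered_rows = [
    row for row in raw_rows if count_rows_above_evalue([row]) == 0
]

print(count_rows_above_evalue(raw_rows))
print(count_rows_above_evalue(filtered_rows))
```

In a real test, the lines would come from files such as gsnap.m8 and gsnap.deduped.m8, with the raw file allowed a nonzero count and the filtered file required to have zero. A row with an e-value of exactly 1 passes the filter here because the comparison is strict.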