
sunbeam-labs / sunbeam

A robust, extensible metagenomics pipeline

Home Page: http://sunbeam.readthedocs.io

Python 77.95% Shell 19.16% Dockerfile 2.89%
metagenomics snakemake reproducible-research

sunbeam's Introduction

Sunbeam: a robust, extensible metagenomic sequencing pipeline

DOI: 10.1186/s40168-019-0658-x

Sunbeam is a pipeline written in Snakemake that simplifies and automates many of the steps in metagenomic sequencing analysis. It uses conda to manage dependencies, so it doesn't require pre-existing dependencies or admin privileges, and it can be deployed on most Linux workstations and clusters. Sunbeam was designed to be modular and extensible, allowing anyone to build off the core functionality. To read more, check out our paper in Microbiome.

Sunbeam currently automates the following tasks:

  • Quality control, including adapter trimming and low-complexity masking
  • Host (e.g. human) read decontamination
  • Taxonomic read classification
  • Assembly of reads into contigs
  • Contig annotation, including gene prediction and BLAST searches
  • Mapping of reads to reference genomes

More extensions can be found at https://github.com/sunbeam-labs.

To get started, see our documentation!

If you use the Sunbeam pipeline in your research, please cite:

EL Clarke, LJ Taylor, C Zhao, A Connell, J Lee, FD Bushman, K Bittinger. Sunbeam: an extensible pipeline for analyzing metagenomic sequencing experiments. Microbiome 7:46 (2019)

sunbeam's People

Contributors

ctanes, eclarke, khillion, kylebittinger, leipzig, louiejtaylor, naomiwilson, ressy, scottdaniel, ulthran, zhaoc1


sunbeam's Issues

Make Cutadapt optional

In the QC step of the pipeline, custom sequences (defined in the config file as Cutadapt's fwd and rev adapters) are removed; these sequences are introduced in the Bushman lab cDNA synthesis workup of RNA samples and would be expected at both the 5' and 3' ends of cDNA. In some cases they can form long concatemers (again a product of the Bushman lab cDNA synthesis step) and are therefore of little value in downstream analysis.

However, these sequences would not be considered wetside artifacts in sequencing of DNA samples, as they are not deliberately introduced at any point in the library preparation.

In a typical DNA sample, these sequences are identified in about 10% of reads (probably just by chance), and those reads are currently discarded. For DNA samples there is no reason to discard them, so we are needlessly losing 10% of "real" data.

Some suggestions to address separate treatment of DNA versus cDNA samples:

  1. Make cutadapt an optional additional parameter to call on reads AFTER they have been quality trimmed and paired.
  2. Only trim the sequences, don't discard the reads.

This would also be important in cDNA submitted by other users who don't necessarily use the same protocol as we do.
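
A minimal sketch of what suggestion 1 could look like, gating the step on a hypothetical config flag; "run_cutadapt", "fwd_adapters", and "rev_adapters" are assumed keys and the paths are illustrative, not Sunbeam's actual rules or schema:

    # Hypothetical: only include the Cutadapt rule when the config asks for it,
    # so DNA samples can skip adapter removal entirely.
    if config["qc"].get("run_cutadapt", False):
        rule adapter_removal:
            input:
                r1 = "qc/paired/{sample}_R1.fastq",
                r2 = "qc/paired/{sample}_R2.fastq"
            output:
                r1 = "qc/cutadapt/{sample}_R1.fastq",
                r2 = "qc/cutadapt/{sample}_R2.fastq"
            params:
                fwd = config["qc"]["fwd_adapters"],
                rev = config["qc"]["rev_adapters"]
            shell:
                # No --discard-trimmed: trim the sequence but keep the read
                # (suggestion 2 above).
                "cutadapt -g {params.fwd} -G {params.rev} "
                "-o {output.r1} -p {output.r2} {input.r1} {input.r2}"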

Document IGV utilities

There's been an extensive amount of work from @ressy getting read mapping and associated visualizations working. Those should be documented in the Readme so people know how to use them.

Error in custom_removal when data_fp is a file

TypeError: string indices must be integers

This arises because _build_samples_from_file in sunbeamlib produces a different structure than _build_samples_from_dir.

I've temporarily commented the test for this out of the test.sh script to unblock other issues.
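
A hypothetical illustration of the kind of structural mismatch that produces this TypeError (the actual sunbeamlib structures may differ):

    samples_from_dir = {"SSND": {"1": "SSND_R1.fastq", "2": "SSND_R2.fastq"}}
    samples_from_file = {"SSND": "SSND_R1.fastq"}  # plain string, not a dict

    samples_from_dir["SSND"]["1"]   # works: "SSND_R1.fastq"
    samples_from_file["SSND"]["1"]  # TypeError: string indices must be integers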

Option to test in non-temp directory

While writing tests and creating test data, we would like to inspect some of the intermediate files.

Requesting a new feature: if a directory is passed to the test.sh script, the test output should be written there. Otherwise, the output should be written to a temp directory and removed, as it is now.

better defaults

We should have better default values in the config file to avoid the series of path-related errors.

Decontam NameError: 'The name 'human_index_fp' is unknown in this context.'

NameError while running decontamination. Trace per Erik's request:

194 of 2691 steps (7%) done
rule decontam_human:
    input: sunbeam_output/qc/paired/SSND_R1.fastq, sunbeam_output/qc/paired/SSND_R2.fastq
    output: sunbeam_output/qc/decontam-human/SSND_R1.fastq, sunbeam_output/qc/decontam-human/SSND_R2.fastq
    log: sunbeam_output/qc/log/decontam-human/SSND_summary.json
    wildcards: sample=SSND

Error in job decontam_human while creating output files sunbeam_output/qc/decontam-human/SSND_R1.fastq, sunbeam_output/qc/decontam-human/SSND_R2.fastq.
RuleException:
NameError in line 29 of sunbeam/rules/qc/decontaminate.rules:
The name 'human_index_fp' is unknown in this context. Please make sure that you defined that variable. Also note that braces not used for variable access have to be escaped by repeating them, i.e. {{print $1}}
  File "sunbeam/rules/qc/decontaminate.rules", line 29, in __rule_decontam_human
  File "~/miniconda3/envs/sunbeam/lib/python3.5/concurrent/futures/thread.py", line 55, in run

version `GLIBC_2.18' not found (required by kz)

When I run the testing script from the stable branch as suggested in the tutorial, I get an error in the rule remove_low_complexity. I've attached the error log, test_all.err.txt; related to #123.

Following up on 2018-03-28:
I figured out this is an issue with the kcomplexity conda package. As a temporary workaround, I ran conda remove kcomplexity and installed kcomplexity from the git repo.

And here is the related issue for Rust.

IGV now on bioconda

IGV is now on bioconda, so we should use that rather than our custom install script.

using custom config files

It looks like specifying a config file in the Snakefile on line 25 prevents the --configfile flag from working correctly: the flag is ignored and example_config.yml is parsed instead. Can we revert to the previous behavior (specifying a configfile is required, raising an error if not) until we figure out a workaround?
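
A sketch of the two behaviors (filenames as in the issue):

    # Current: a default pinned in the Snakefile, which this report says
    # shadows the --configfile flag.
    configfile: "example_config.yml"

    # Previous: no default; fail fast unless --configfile was given.
    if not config:
        raise SystemExit("Please specify a config file with --configfile")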

Functional tests are breaking

The testing system I had cobbled together for this pipeline isn't working currently. We should figure out what's wrong with it and fix it so that pull requests can be checked and merged automatically.

mask_low_complexity failed

Error in rule mask_low_complexity:
jobid: 0
output: /home/guanxian/sunbeam_output/qc/masked/D5_R2.fastq.gz

RuleException:
KeyError in line 92 of /gpfs/fs02/home/guanxian/sunbeam/rules/qc/qc.rules:
'mask_low_complexity'
File "/gpfs/fs02/home/guanxian/sunbeam/rules/qc/qc.rules", line 92, in __rule_mask_low_complexity
File "/home/guanxian/miniconda3/envs/sunbeam/lib/python3.6/concurrent/futures/thread.py", line 56, in run
Shutting down, this might take some time.
Exiting because a job execution failed. Look above for error message
Complete log: /gpfs/fs02/home/guanxian/sunbeam/.snakemake/log/2018-03-01T135544.313766.snakemake.log
Shutting down, this might take some time.
Exiting because a job execution failed. Look above for error message
Complete log: /gpfs/fs02/home/guanxian/sunbeam/.snakemake/log/2018-03-01T135417.663390.snakemake.log
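
A hypothetical illustration of what a bare KeyError like this usually means (the real keys in qc.rules may differ): the rule indexes a config subsection that the user's config file never defined.

    cfg = {"qc": {}}                          # config missing the subsection
    cfg["qc"]["mask_low_complexity"]          # KeyError: 'mask_low_complexity'
    cfg["qc"].get("mask_low_complexity", {})  # a default degrades gracefully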

cap3 and error with empty contig files

If IDBA-UD does not generate any contigs, then the follow-up program cap3 (which takes the contigs from IDBA-UD as its input) gives an error.
Always having IDBA-UD create an output file, even an empty one, could solve this problem; see the sketch below.
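
A sketch of that suggestion, with illustrative paths (not the actual assembly rules):

    rule run_idba_ud:
        input: "assembly/{sample}/reads.fa"
        output: "assembly/{sample}/contig.fa"
        shell:
            # If IDBA-UD exits without producing contigs, create an empty
            # file so downstream cap3 always has an input to read.
            "idba_ud -r {input} -o assembly/{wildcards.sample} || true; "
            "touch {output}"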

Roadmap for future development

Soliciting comments from @kylebittinger and @zhaoc1

We have bare-bones functional testing up and working right now. I would like to get things a little bit more formal for future development, since we now have active users and we need to worry about things breaking during our updates.

Highest priority:

  • No absolute paths or references to our particular development environment (usernames, paths, etc) should be committed to the repository.
  • All bash scripts must have set -e enabled and they should not be committed if it is missing or commented out.
  • Python code must be indented with spaces, not tabs. I don't really care either way, but it has to be consistent across our codebase, so I'm just picking the side most used currently.
  • Nothing will be merged to master unless it passes functional testing on Travis.
  • All non-trivial code should be refactored into functions in the sunbeam/sunbeam package. We should check this with a service like Landscape to ensure code correctness and consistent formatting.

Architectural changes:

  • Versioning and release system. I want to follow semantic versioning practices for this project.
  • We need a defined, final set of artifacts (annotations, contigs, qc, and read classifications) with a defined folder hierarchy. This will be what's tested and what defines our "API", so to speak. Changes to it will require a version bump.

Mapping rules keep unaligned reads

As I have it written right now, bowtie2 keeps all reads in its output files, even unaligned ones, so the output can be much bigger than it needs to be. It should default to leaving unaligned reads out, with a Snakemake parameter and a Sunbeam configuration option to control the behavior explicitly; see the sketch below.
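
A sketch of that default, assuming the keep_unaligned config key proposed in the next issue (paths and index name illustrative):

    rule bowtie2_align:
        input:
            r1 = "qc/decontam/{sample}_R1.fastq",
            r2 = "qc/decontam/{sample}_R2.fastq"
        output: "mapping/{sample}.sam"
        params:
            # bowtie2's --no-unal flag drops unaligned reads from the SAM.
            unal = "" if config["mapping"].get("keep_unaligned", False) else "--no-unal"
        shell:
            "bowtie2 -x genome -1 {input.r1} -2 {input.r2} {params.unal} -S {output}"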

Mapping config section has numerous missing keys

The following keys need to be added to the default config file in sunbeamlib so that sunbeam_init works; we should also figure out why the tests didn't catch this:

  • keep_unaligned
  • threads
  • igv_fp
  • igv_prefs

IGV truncates display of input names for long filenames

The default IGV preferences don't give enough screen space on the left panel to display long input filenames, so multiple similar names can't be distinguished. Setting the panel width explicitly as a preference would fix this.

IGV images don't always show full genome

I'm seeing some cases where the auto-generated IGV images don't quite show the full genome, as though the view has been zoomed in slightly in the IGV GUI toolbar. It looks like I can fix this by explicitly issuing goto <sequence_id>:1-<sequence_length> every time right after loading the genome FASTA file, rather than an optional goto <sequence_id> only when there are multiple segments/chromosomes.
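
A minimal sketch of generating the IGV batch commands with the explicit range (the function and argument names are illustrative, but new/genome/goto/snapshot/exit are IGV's real batch commands):

    def igv_batch(fasta, seq_id, seq_len, png):
        """Build an IGV batch script pinned to the full sequence range."""
        return "\n".join([
            "new",
            "genome " + fasta,
            "goto {}:1-{}".format(seq_id, seq_len),  # full range, no zoom drift
            "snapshot " + png,
            "exit",
        ])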

Mapping output directory holds too many files

When the number of references and samples is high, there are far too many files written directly to the mapping output directory. We should split that up into sections (like the qc directory has, for example).

Workflow Error for Certain Rules

Currently encountering the same error in calling a few rules related to contig annotation.
Specifically:

"Workflow Error:Target rules may not contain wildcards. Please specify concrete files or a rule without wildcards."

This error appears when individually calling at least the following rules, even when valid paths to nucleotide and protein databases are given in the config file:

  • find_genes_mga
  • run_blastn
  • run_blastp
  • run_blastx

However, when calling the rule all_annotate, there is no error even though that rule includes run_blastn.
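
For context, this is general Snakemake behavior rather than a Sunbeam-specific bug: a rule whose outputs contain wildcards can't be requested by name, because Snakemake has no way to infer concrete wildcard values. all_annotate works because it expands the wildcards over all samples, e.g. (paths illustrative):

    rule all_annotate:
        input: expand("annotation/blastn/{sample}.txt", sample=Samples.keys())

    # Requesting a concrete output file directly also works:
    #   snakemake annotation/blastn/SSND.txt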

IGV image generation fails when running rule in parallel

The igv_snapshot rule calls a function that assumes a constant X server number for running IGV, so if that rule is used in parallel with --threads, it fails. I should be able to fix that by letting it use the first available X server number each time.

Workaround for error during "waiting for missing files"

Sometimes Snakemake raises a MissingFilesError and then encounters a second error while handling the missing files, which prevents the job from returning and the workflow from continuing. It seems to be a bug, because re-running often completes without error. We need to find out why this occurs, but the fix may actually need to happen on Snakemake's end.

Rules fail under latest snakemake version (3.13.2)

When run under the latest snakemake available from the bioconda Anaconda channel, multiple rules fail because some of the snakemake objects have changed. Version 3.13.2 fails, but 3.13.0 still works.

Install still tries to create conda environment if it already exists

In my environment, currently with conda 4.3.33, install.sh doesn't detect an already-existing environment.

Line 25 greps for the environment name:

conda env list | grep -Fxq $SUNBEAM_ENV_NAME

But conda env list gives paths as well as names, so the grep doesn't match any lines. For example:

$ conda env list
# conda environments:
#
ExampleProject      /home/jesse/miniconda3/envs/JesseProject
circonspect         /home/jesse/miniconda3/envs/circonspect
gcc5                /home/jesse/miniconda3/envs/gcc5

I don't remember this happening until recently. Did conda's output change, maybe?
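
The likely culprit: grep -Fxq only matches whole lines, but each line of conda env list is "name  path". A rough sketch of the intended check, written in Python for illustration (the real fix belongs in install.sh), comparing against the first column only:

    import subprocess

    def env_exists(name):
        """Return True if a conda env with this exact name exists."""
        out = subprocess.run(["conda", "env", "list"],
                             capture_output=True, text=True).stdout
        names = [line.split()[0] for line in out.splitlines()
                 if line.strip() and not line.startswith("#")]
        return name in names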

Cluster tips and tricks

We should have a section in the documentation pertaining to running things on clusters, including tips and tricks like -w90.

Installation error

There's an error near the end of the install.sh run, although it appears not to have prevented installation (I'll double-check to make sure everything works):

install.sh: line 50: Solving: command not found
2018-03-16 19:46:26 UTC [    error] Error in ~/sunbeam/install.sh in function debug_capture on line 50

Snakemake version compatibility ('Namedlist' object has no attribute 'readline')

Compatibility issue with snakemake version 3.13.2; worked around by forcing version 3.13.0 (conda install snakemake=3.13.0).

Top-level traceback:
Error in job parse_genes_mga while creating [output files]
RuleException:
AttributeError in line 39 of ~/sunbeam/rules/annotation/orf.rules:
'Namedlist' object has no attribute 'readline'

Mapping rules assume paired-end fastq files

The way I have it written, the bowtie2_align rule assumes FASTQ input with paired-end reads in separate files. Ideally it would just do the right thing based on what's in the Samples dict.
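
A minimal sketch of "doing the right thing" (the Samples structure here is assumed, and input tracking is omitted for brevity):

    def bowtie2_reads(wildcards):
        s = Samples[wildcards.sample]
        if "2" in s:                                  # paired-end
            return "-1 {} -2 {}".format(s["1"], s["2"])
        return "-U {}".format(s["1"])                 # single-end

    rule bowtie2_align:
        output: "mapping/{sample}.sam"
        params: reads = bowtie2_reads
        shell: "bowtie2 -x genome {params.reads} -S {output}"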

Conda metadata issue on PMACS cluster only

Conda debug or update stalls on the PMACS cluster. The prompt hangs at:

Fetching package metadata ...

Note, this is not an issue on microb120 and 191. I've been reading about this, and it seems it can be caused by proxy settings on the cluster.

Do you have a quick solution for this?

Consistency in requiring gzipped fastq files

All rules should take gzipped fastq files as input and output gzipped fastq files (unless they're intermediate steps, in which case the uncompressed fastq outputs should be marked with Snakemake's temp() flag).
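
A sketch of the convention (rule names and paths illustrative):

    rule trim:
        input: "qc/raw/{sample}.fastq.gz"
        # temp(): Snakemake deletes this once downstream rules finish.
        output: temp("qc/trimmed/{sample}.fastq")
        shell: "gzip -dc {input} > {output}"  # stand-in for the real trimmer

    rule compress:
        input: "qc/trimmed/{sample}.fastq"
        output: "qc/cleaned/{sample}.fastq.gz"
        shell: "gzip -c {input} > {output}"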

Intermediate files in assembly (IDBA-UD)

Can the end user request that the intermediate files from the assembly step (IDBA-UD) be saved? These intermediate files can aid in finding reads that are mapped to contigs (or not) and can be valuable for downstream processing.

Support mapping of reads to assembled contigs

Right now the mapping rules only align to a set of existing fasta files provided as input to Sunbeam. It would be useful to also allow the mapping of reads to the contigs created by the assembly section. We should add this as a new feature in the mapping section.

Documentation for Updating Sunbeam

The Readme file shows conda env -d sunbeam to remove the existing Sunbeam environment, but it looks like the correct syntax should be conda env remove -n sunbeam.

mga stopped

mga stopped when processing an empty "final-contigs.fa" file, on the updated Sunbeam version.
(screenshot attached: 2017-09-22, 1:23 PM)

error for filter_reads when qsub jobs

When I qsub all_decontam to respublica, I get the following error messages from the filter_reads rule:

"Error occurred during initialization of VM
Cannot create VM thread. Out of system resources."

After googling the error message, this seems to be a Java version issue.

My experience with respublica is that it doesn't allow passing LD_LIBRARY_PATH, for security reasons, which causes errors when the local Java version differs from the conda environment's Java version. One workaround, thanks to @ressy, is to explicitly set LD_LIBRARY_PATH="$CONDA_PREFIX/lib64". However, I am not sure whether that is the cause of our error message.

Since we are filtering reads by ID, shall we just add a Python script to sunbeamlib to do the work?
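
A minimal sketch of such a helper (names are illustrative; assumes standard 4-line FASTQ records):

    def filter_fastq_by_id(in_handle, out_handle, keep_ids):
        """Write only the reads whose IDs are in keep_ids."""
        while True:
            header = in_handle.readline()
            if not header:
                break
            seq, plus, qual = (in_handle.readline() for _ in range(3))
            read_id = header[1:].split()[0]  # strip '@' and any description
            if read_id in keep_ids:
                out_handle.writelines([header, seq, plus, qual])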

mapping rules are single-threaded

The mapping rules that call bowtie2 and samtools aren't using those programs' multithreading support, so they could run much faster than they do right now.
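
A sketch of the fix on the samtools side (threads value and paths illustrative; -@ is samtools' real threads flag, and bowtie2's equivalent is -p):

    rule samtools_sort:
        input: "mapping/{sample}.sam"
        output: "mapping/{sample}.bam"
        threads: 4
        shell: "samtools sort -@ {threads} -o {output} {input}"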

Detecting and working around core dumps

When a tool called from the shell dumps core (due to running out of memory?), Snakemake doesn't detect it as an error. This may be because the tool never returns for some reason; it could also be due to an issue in the job submission process on the cluster.

Possible workarounds:

  • Detect when a tool dumps core in bash
  • Detect when a tool dumps core in a node and have qsub/bsub react appropriately
