
sunbeam-labs / sunbeam

A robust, extensible metagenomics pipeline

Home Page: http://sunbeam.readthedocs.io

Python 77.95% Shell 19.16% Dockerfile 2.89%
metagenomics snakemake reproducible-research

sunbeam's Introduction

Sunbeam: a robust, extensible metagenomic sequencing pipeline

DOI: 10.1186/s40168-019-0658-x

Sunbeam is a pipeline written in Snakemake that simplifies and automates many of the steps in metagenomic sequencing analysis. It uses conda to manage dependencies, so it doesn't require pre-existing dependencies or admin privileges, and it can be deployed on most Linux workstations and clusters. Sunbeam was designed to be modular and extensible, allowing anyone to build off the core functionality. To read more, check out our paper in Microbiome.

Sunbeam currently automates the following tasks:

  • Quality control, including adapter trimming and low-complexity masking
  • Host (e.g. human) read decontamination
  • Taxonomic read classification
  • Assembly of reads into contigs
  • Contig annotation, including gene prediction and BLAST searches
  • Mapping of reads to reference genomes

More extensions can be found at https://github.com/sunbeam-labs.

To get started, see our documentation!

If you use the Sunbeam pipeline in your research, please cite:

EL Clarke, LJ Taylor, C Zhao, A Connell, J Lee, FD Bushman, K Bittinger. Sunbeam: an extensible pipeline for analyzing metagenomic sequencing experiments. Microbiome 7:46 (2019)

sunbeam's People

Contributors

ctanes, eclarke, khillion, kylebittinger, leipzig, louiejtaylor, naomiwilson, ressy, scottdaniel, ulthran, zhaoc1


sunbeam's Issues

Make Cutadapt optional

In the QC step of the pipeline, custom sequences (defined in the config file as Cutadapt's fwd and rev adapters) are removed; these sequences are introduced in the Bushman lab cDNA synthesis workup of RNA samples and would be expected at both the 5' and 3' ends of cDNA. In some cases they can form long concatemers (again a product of the Bushman lab cDNA synthesis step) and are therefore of little value in downstream analysis.

However, these sequences would not be considered wetside artifacts in sequencing of DNA samples, as they are not deliberately introduced at any point in the library preparation.

In a typical DNA sample, these sequences are identified in about 10% of reads (probably just by chance), and those reads are currently discarded. For DNA samples there is no reason to discard them, so we are needlessly losing 10% of "real" data.

Some suggestions to address separate treatment of DNA versus cDNA samples:

  1. Make cutadapt an optional additional parameter to call on reads AFTER they have been quality trimmed and paired.
  2. Only trim the sequences, don't discard the reads.

This would also be important in cDNA submitted by other users who don't necessarily use the same protocol as we do.
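
A minimal sketch of what suggestion 1 could look like, gating the step on a hypothetical config flag; "run_cutadapt", "fwd_adapters", and "rev_adapters" are assumed keys and the paths are illustrative, not Sunbeam's actual rules or schema:

    # Hypothetical: only include the Cutadapt rule when the config asks for it,
    # so DNA samples can skip adapter removal entirely.
    if config["qc"].get("run_cutadapt", False):
        rule adapter_removal:
            input:
                r1 = "qc/paired/{sample}_R1.fastq",
                r2 = "qc/paired/{sample}_R2.fastq"
            output:
                r1 = "qc/cutadapt/{sample}_R1.fastq",
                r2 = "qc/cutadapt/{sample}_R2.fastq"
            params:
                fwd = config["qc"]["fwd_adapters"],
                rev = config["qc"]["rev_adapters"]
            shell:
                # No --discard-trimmed: trim the sequence but keep the read
                # (suggestion 2 above).
                "cutadapt -g {params.fwd} -G {params.rev} "
                "-o {output.r1} -p {output.r2} {input.r1} {input.r2}"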

Document IGV utilities

There's been an extensive amount of work from @ressy getting read mapping and associated visualizations working. Those should be documented in the Readme so people know how to use them.

Error in custom_removal when data_fp is a file

TypeError: string indices must be integers

This arises because _build_samples_from_file in sunbeamlib produces a different structure than _build_samples_from_dir.

I've temporarily commented the test for this out of the test.sh script to unblock other issues.
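
A hypothetical illustration of the kind of structural mismatch that produces this TypeError (the actual sunbeamlib structures may differ):

    samples_from_dir = {"SSND": {"1": "SSND_R1.fastq", "2": "SSND_R2.fastq"}}
    samples_from_file = {"SSND": "SSND_R1.fastq"}  # plain string, not a dict

    samples_from_dir["SSND"]["1"]   # works: "SSND_R1.fastq"
    samples_from_file["SSND"]["1"]  # TypeError: string indices must be integers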

Option to test in non-temp directory

While writing tests and creating test data, we would like to inspect some of the intermediate files.

Requesting a new feature: if a directory is passed to the test.sh script, the test output should be written there. Otherwise, the output should be written to a temp directory and removed, as it is now.

better defaults

We should have better default values in the config file to avoid the series of path-related errors.

Decontam NameError: 'The name 'human_index_fp' is unknown in this context.'

NameError while running decontamination. Trace per Erik's request:

194 of 2691 steps (7%) done
rule decontam_human:
    input: sunbeam_output/qc/paired/SSND_R1.fastq, sunbeam_output/qc/paired/SSND_R2.fastq
    output: sunbeam_output/qc/decontam-human/SSND_R1.fastq, sunbeam_output/qc/decontam-human/SSND_R2.fastq
    log: sunbeam_output/qc/log/decontam-human/SSND_summary.json
    wildcards: sample=SSND

Error in job decontam_human while creating output files sunbeam_output/qc/decontam-human/SSND_R1.fastq, sunbeam_output/qc/decontam-human/SSND_R2.fastq.
RuleException:
NameError in line 29 of sunbeam/rules/qc/decontaminate.rules:
The name 'human_index_fp' is unknown in this context. Please make sure that you defined that variable. Also note that braces not used for variable access have to be escaped by repeating them, i.e. {{print $1}}
  File "sunbeam/rules/qc/decontaminate.rules", line 29, in __rule_decontam_human
  File "~/miniconda3/envs/sunbeam/lib/python3.5/concurrent/futures/thread.py", line 55, in run

version `GLIBC_2.18' not found (required by kz)

When I run the testing script from the stable branch as suggested in the tutorial, I get an error in the rule remove_low_complexity. I've attached the error log, test_all.err.txt; related to #123.

Following up on 2018-03-28:
I figured out this is an issue with the kcomplexity conda package. As a temporary workaround, I ran conda remove kcomplexity and installed kcomplexity from the git repo.

And here is the related issue for Rust.

IGV now on bioconda

IGV is now on bioconda, so we should use that rather than our custom install script.

using custom config files

It looks like specifying a config file in the Snakefile on line 25 prevents the --configfile flag from working correctly: the flag is ignored and example_config.yml is parsed instead. Can we revert to the previous behavior (specifying a configfile is required, raising an error if not) until we figure out a workaround?
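
A sketch of the two behaviors (filenames as in the issue):

    # Current: a default pinned in the Snakefile, which this report says
    # shadows the --configfile flag.
    configfile: "example_config.yml"

    # Previous: no default; fail fast unless --configfile was given.
    if not config:
        raise SystemExit("Please specify a config file with --configfile")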

Functional tests are breaking

The testing system I had cobbled together for this pipeline isn't working currently. We should figure out what's wrong with it and fix it so that pull requests can be checked and merged automatically.

mask_low_complexity failed

Error in rule mask_low_complexity:
jobid: 0
output: /home/guanxian/sunbeam_output/qc/masked/D5_R2.fastq.gz

RuleException:
KeyError in line 92 of /gpfs/fs02/home/guanxian/sunbeam/rules/qc/qc.rules:
'mask_low_complexity'
File "/gpfs/fs02/home/guanxian/sunbeam/rules/qc/qc.rules", line 92, in __rule_mask_low_complexity
File "/home/guanxian/miniconda3/envs/sunbeam/lib/python3.6/concurrent/futures/thread.py", line 56, in run
Shutting down, this might take some time.
Exiting because a job execution failed. Look above for error message
Complete log: /gpfs/fs02/home/guanxian/sunbeam/.snakemake/log/2018-03-01T135544.313766.snakemake.log
Shutting down, this might take some time.
Exiting because a job execution failed. Look above for error message
Complete log: /gpfs/fs02/home/guanxian/sunbeam/.snakemake/log/2018-03-01T135417.663390.snakemake.log
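
A hypothetical illustration of what a bare KeyError like this usually means (the real keys in qc.rules may differ): the rule indexes a config subsection that the user's config file never defined.

    cfg = {"qc": {}}                          # config missing the subsection
    cfg["qc"]["mask_low_complexity"]          # KeyError: 'mask_low_complexity'
    cfg["qc"].get("mask_low_complexity", {})  # a default degrades gracefully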

cap3 and error with empty contig files

If IDBA-UD does not generate any contigs, then the follow-up program cap3 (which takes the contigs from IDBA-UD as its input) gives an error.
Always having IDBA-UD create an output file, even an empty one, could solve this problem; see the sketch below.
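
A sketch of that suggestion, with illustrative paths (not the actual assembly rules):

    rule run_idba_ud:
        input: "assembly/{sample}/reads.fa"
        output: "assembly/{sample}/contig.fa"
        shell:
            # If IDBA-UD exits without producing contigs, create an empty
            # file so downstream cap3 always has an input to read.
            "idba_ud -r {input} -o assembly/{wildcards.sample} || true; "
            "touch {output}"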

Roadmap for future development

Soliciting comments from @kylebittinger and @zhaoc1

We have bare-bones functional testing up and working right now. I would like to get things a little bit more formal for future development, since we now have active users and we need to worry about things breaking during our updates.

Highest priority:

  • No absolute paths or references to our particular development environment (usernames, paths, etc) should be committed to the repository.
  • All bash scripts must have set -e enabled and they should not be committed if it is missing or commented out.
  • Python code must be indented with spaces, not tabs. I don't really care either way, but it has to be consistent across our codebase, so I'm just picking the side most used currently.
  • Nothing will be merged to master unless it passes functional testing on Travis.
  • All non-trivial code should be refactored into functions in the sunbeam/sunbeam package. We should check this with a service like Landscape to ensure code correctness and consistent formatting.

Architectural changes:

  • Versioning and release system. I want to follow semantic versioning practices for this project.
  • We need a defined, final set of artifacts (annotations, contigs, qc, and read classifications) with a defined folder hierarchy. This will be what's tested and what defines our "API", so to speak. Changes to it will require a version bump.

Mapping rules keep unaligned reads

As I have it written right now, bowtie2 keeps all reads in its output files, even unaligned ones, so the output can be much bigger than it needs to be. It should default to leaving unaligned reads out, with a Snakemake parameter and a Sunbeam configuration option to control the behavior explicitly; see the sketch below.
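
A sketch of that default, assuming the keep_unaligned config key proposed in the next issue (paths and index name illustrative):

    rule bowtie2_align:
        input:
            r1 = "qc/decontam/{sample}_R1.fastq",
            r2 = "qc/decontam/{sample}_R2.fastq"
        output: "mapping/{sample}.sam"
        params:
            # bowtie2's --no-unal flag drops unaligned reads from the SAM.
            unal = "" if config["mapping"].get("keep_unaligned", False) else "--no-unal"
        shell:
            "bowtie2 -x genome -1 {input.r1} -2 {input.r2} {params.unal} -S {output}"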

Mapping config section has numerous missing keys

The following keys need to be added to the default config file in sunbeamlib so that sunbeam_init works; we should also figure out why the tests didn't catch this:

  • keep_unaligned
  • threads
  • igv_fp
  • igv_prefs

IGV truncates display of input names for long filenames

The default IGV preferences don't give enough screen space on the left panel to display long input filenames, so multiple similar names can't be distinguished. Setting the panel width explicitly as a preference would fix this.

IGV images don't always show full genome

I'm seeing some cases where the auto-generated IGV images don't quite show the full genome, as though the view has been zoomed in slightly in the IGV GUI toolbar. It looks like I can fix this by explicitly issuing goto <sequence_id>:1-<sequence_length> every time right after loading the genome FASTA file, rather than an optional goto <sequence_id> only when there are multiple segments/chromosomes.
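
A minimal sketch of generating the IGV batch commands with the explicit range (the function and argument names are illustrative, but new/genome/goto/snapshot/exit are IGV's real batch commands):

    def igv_batch(fasta, seq_id, seq_len, png):
        """Build an IGV batch script pinned to the full sequence range."""
        return "\n".join([
            "new",
            "genome " + fasta,
            "goto {}:1-{}".format(seq_id, seq_len),  # full range, no zoom drift
            "snapshot " + png,
            "exit",
        ])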

Mapping output directory holds too many files

When the number of references and samples is high, there are far too many files written directly to the mapping output directory. We should split that up into sections (like the qc directory has, for example).

Workflow Error for Certain Rules

Currently encountering the same error in calling a few rules related to contig annotation.
Specifically:

"Workflow Error:Target rules may not contain wildcards. Please specify concrete files or a rule without wildcards."

This error appears when individually calling at least the following rules, even when valid paths to nucleotide and protein databases are given in the config file:

  • find_genes_mga
  • run_blastn
  • run_blastp
  • run_blastx

However, when calling the rule all_annotate, there is no error even though that rule includes run_blastn.
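
For context, this is general Snakemake behavior rather than a Sunbeam-specific bug: a rule whose outputs contain wildcards can't be requested by name, because Snakemake has no way to infer concrete wildcard values. all_annotate works because it expands the wildcards over all samples, e.g. (paths illustrative):

    rule all_annotate:
        input: expand("annotation/blastn/{sample}.txt", sample=Samples.keys())

    # Requesting a concrete output file directly also works:
    #   snakemake annotation/blastn/SSND.txt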

IGV image generation fails when running rule in parallel

The igv_snapshot rule calls a function that assumes a constant X server number for running IGV, so if that rule is used in parallel with --threads, it fails. I should be able to fix that by letting it use the first available X server number each time.

Workaround for error during "waiting for missing files"

Sometimes Snakemake raises a MissingFilesError and then encounters a second error while handling the missing files, which prevents the job from returning and the workflow from continuing. It seems to be a bug, because re-running often completes without error. We need to find out why this occurs, but the fix may actually need to happen on Snakemake's end.

Rules fail under latest snakemake version (3.13.2)

When run under the latest snakemake available from the bioconda Anaconda channel, multiple rules fail because some of the snakemake objects have changed. Version 3.13.2 fails, but 3.13.0 still works.

Install still tries to create conda environment if it already exists

In my environment, currently with conda 4.3.33, install.sh doesn't detect an already-existing environment.

Line 25 greps for the environment name:

conda env list | grep -Fxq $SUNBEAM_ENV_NAME

But conda env list gives paths as well as names, so the grep doesn't match any lines. For example:

$ conda env list
# conda environments:
#
ExampleProject      /home/jesse/miniconda3/envs/JesseProject
circonspect         /home/jesse/miniconda3/envs/circonspect
gcc5                /home/jesse/miniconda3/envs/gcc5

I don't remember this happening until recently. Did conda's output change, maybe?
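
The likely culprit: grep -Fxq only matches whole lines, but each line of conda env list is "name  path". A rough sketch of the intended check, written in Python for illustration (the real fix belongs in install.sh), comparing against the first column only:

    import subprocess

    def env_exists(name):
        """Return True if a conda env with this exact name exists."""
        out = subprocess.run(["conda", "env", "list"],
                             capture_output=True, text=True).stdout
        names = [line.split()[0] for line in out.splitlines()
                 if line.strip() and not line.startswith("#")]
        return name in names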

Cluster tips and tricks

We should have a section in the documentation pertaining to running things on clusters, including tips and tricks like -w90.

Installation error

There's an error near the end of the install.sh run, although it appears not to have prevented installation (I'll double-check to make sure everything works):

install.sh: line 50: Solving: command not found
2018-03-16 19:46:26 UTC [    error] Error in ~/sunbeam/install.sh in function debug_capture on line 50

Snakemake version compatibility ('Namedlist' object has no attribute 'readline')

Compatibility issue with snakemake version 3.13.2; worked around by forcing version 3.13.0 (conda install snakemake=3.13.0).

Top-level traceback:
Error in job parse_genes_mga while creating [output files]
RuleException:
AttributeError in line 39 of ~/sunbeam/rules/annotation/orf.rules:
'Namedlist' object has no attribute 'readline'

Mapping rules assume paired-end fastq files

The way I have it written, the bowtie2_align rule assumes FASTQ input with paired-end reads in separate files. Ideally it would just do the right thing based on what's in the Samples dict.
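
A minimal sketch of "doing the right thing" (the Samples structure here is assumed, and input tracking is omitted for brevity):

    def bowtie2_reads(wildcards):
        s = Samples[wildcards.sample]
        if "2" in s:                                  # paired-end
            return "-1 {} -2 {}".format(s["1"], s["2"])
        return "-U {}".format(s["1"])                 # single-end

    rule bowtie2_align:
        output: "mapping/{sample}.sam"
        params: reads = bowtie2_reads
        shell: "bowtie2 -x genome {params.reads} -S {output}"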

Conda metadata issue on PMACS cluster only

Conda debug or update stalls on the PMACS cluster. The prompt hangs at:

Fetching package metadata ...

Note, this is not an issue on microb120 and 191. I've been reading about this, and it seems it can be caused by proxy settings on the cluster.

Do you have a quick solution for this?

Consistency in requiring gzipped fastq files

All rules should take gzipped fastq files as input and output gzipped fastq files (unless they're intermediate steps, in which case the uncompressed fastq outputs should be marked with Snakemake's temp() flag).
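
A sketch of the convention (rule names and paths illustrative):

    rule trim:
        input: "qc/raw/{sample}.fastq.gz"
        # temp(): Snakemake deletes this once downstream rules finish.
        output: temp("qc/trimmed/{sample}.fastq")
        shell: "gzip -dc {input} > {output}"  # stand-in for the real trimmer

    rule compress:
        input: "qc/trimmed/{sample}.fastq"
        output: "qc/cleaned/{sample}.fastq.gz"
        shell: "gzip -c {input} > {output}"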

Intermediate files in assembly (IDBA-UD)

Can the end user request that the intermediate files from the assembly step (IDBA-UD) be saved? These intermediate files can aid in finding reads that are mapped to contigs (or not) and can be valuable for downstream processing.

Support mapping of reads to assembled contigs

Right now the mapping rules only align to a set of existing fasta files provided as input to Sunbeam. It would be useful to also allow the mapping of reads to the contigs created by the assembly section. We should add this as a new feature in the mapping section.

Documentation for Updating Sunbeam

The Readme file shows conda env -d sunbeam to remove the existing Sunbeam environment, but it looks like the correct syntax should be conda env remove -n sunbeam.

mga stopped

mga stopped when processing an empty "final-contigs.fa" file, on the updated Sunbeam version.
(screenshot attached: 2017-09-22, 1:23 PM)

error for filter_reads when qsub jobs

When I qsub all_decontam to respublica, I get the following error messages from the filter_reads rule:

"Error occurred during initialization of VM
Cannot create VM thread. Out of system resources."

After googling the error message, this seems to be a Java version issue.

My experience with respublica is that it doesn't allow passing LD_LIBRARY_PATH, for security reasons, which causes errors when the local Java version differs from the conda environment's Java version. One workaround, thanks to @ressy, is to explicitly set LD_LIBRARY_PATH="$CONDA_PREFIX/lib64". However, I am not sure whether that is the cause of our error message.

Since we are filtering reads by ID, shall we just add a Python script to sunbeamlib to do the work?
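
A minimal sketch of such a helper (names are illustrative; assumes standard 4-line FASTQ records):

    def filter_fastq_by_id(in_handle, out_handle, keep_ids):
        """Write only the reads whose IDs are in keep_ids."""
        while True:
            header = in_handle.readline()
            if not header:
                break
            seq, plus, qual = (in_handle.readline() for _ in range(3))
            read_id = header[1:].split()[0]  # strip '@' and any description
            if read_id in keep_ids:
                out_handle.writelines([header, seq, plus, qual])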

mapping rules are single-threaded

The mapping rules that call bowtie2 and samtools aren't using those programs' multithreading support, so they could run much faster than they do right now.
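
A sketch of the fix on the samtools side (threads value and paths illustrative; -@ is samtools' real threads flag, and bowtie2's equivalent is -p):

    rule samtools_sort:
        input: "mapping/{sample}.sam"
        output: "mapping/{sample}.bam"
        threads: 4
        shell: "samtools sort -@ {threads} -o {output} {input}"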

Detecting and working around core dumps

When a tool called from the shell dumps core (due to running out of memory?), Snakemake doesn't detect it as an error. This may be because the tool never returns for some reason; it could also be due to an issue in the job submission process on the cluster.

Possible workarounds:

  • Detect when a tool dumps core in bash
  • Detect when a tool dumps core in a node and have qsub/bsub react appropriately
