
dada2HPCPipe


16S rRNA microbiome data analysis workflow using DADA2 and R on a high performance cluster.

This repository essentially contains wrapper functions around DADA2 functions in order to streamline the workflow for cluster computing.

This package is meant to serve two purposes: to be an R package and to give structure to an analysis. The project follows an R package structure, so it can be downloaded and installed as such. Additionally, the user is expected to download this repository and run make and slurm commands to run the scripts.

Table of Contents

  • Installation
  • Description
  • Development Setup
  • Troubleshooting

Installation

install.packages("devtools")
devtools::install_github("erictleung/dada2HPCPipe")

Description

This DADA2 workflow stems from the DADA2 tutorial and big data tutorial. You can find more information about the DADA2 package from its publication or from GitHub.

Overview of Makefile

clean                          Remove data from test_data/, download/, and refs/
condar                         Install R and essential packages
dl-ref-dbs                     Download 16S reference databases (SILVA,RDP,GG)
help                           Help page for Makefile
install                        Install and update dada2HPCPipe package in R
setup                          Setup development environment with Conda
test                           Run DADA2 workflow with Mothur MiSeq test data

Development Setup

Here are instructions on how to get started on ExaCloud and set up the development environment needed to run the DADA2 workflow.

Package Management

Interactive Session

To run an interactive session, run the following:

srun --pty /usr/bin/bash

This will allow you to test your code and workflow without worrying about stressing the coordinating head node.

Setup

Follow the instructions listed in this document to set up a modern development environment on the cluster. This isn't necessary if your development environment is on a cluster where you have root access or if you're implementing this workflow locally.

Briefly, following the instructions linked above will give you the following:

  • Miniconda, Python package and virtual environment management
  • Linuxbrew, non-root package management on Linux systems

For this R workflow, you'll only really need to install Miniconda and R essentials. Anaconda has built an r-essentials package with R and the most commonly used R packages for data science.

Linuxbrew is useful to supplement commands and other software tools you might want under package management control.

In summary, to set up the dependencies for DADA2, run the following:

make setup
make condar

The make setup rule runs the following:

# Download and install Miniconda
wget https://repo.continuum.io/miniconda/Miniconda2-latest-Linux-x86_64.sh
bash Miniconda2-latest-Linux-x86_64.sh

# Say yes to adding Miniconda to .bash_profile

# Remove install file
rm Miniconda2-latest-Linux-x86_64.sh

To see changes, you'll first need to exit the cluster and log back in.

The make condar rule installs R and other essential R packages, as laid out below:

# Install R and relevant packages
conda install r-essentials

# For maintenance and update of all packages
conda update r-essentials

# For updating a single particular R package, replace XXXX
conda update r-XXXX

Slurm Workload Manager

Slurm is the resource manager that I'll focus on for this workflow. Slurm stands for "Simple Linux Utility for Resource Management."

An example script might look like this:

$ cat first_script.sh
#!/bin/bash
# Template for simple SLURM script
#SBATCH --job-name="Job Name"
#SBATCH --partition=exacloud

srun hostname
srun pwd
srun hostinfo

The quick answer on sbatch vs srun can be found here.

Below are some useful commands to use within Slurm using the script above.

# Submit your script, first_script.sh
sbatch first_script.sh

# Look at jobs in the queue
squeue
squeue -u $USER # Take a look at your specific jobs

You can use this website to help generate Slurm scripts. It is designed for another cluster, but it should at least help with the initial drafting of a submit script you'll want to use.

For more general resources on using Slurm, check out here, here, and here.

Source: http://www.cism.ucl.ac.be/Services/Formations/slurm/2016/slurm.pdf

Troubleshooting

Installing this package reports trouble installing Bioconductor packages

There are two solutions for this. From within R, you can run the following

setRepositories(ind=1:2)

which will tell R to also include Bioconductor packages in its package search. See https://stackoverflow.com/a/34617938/6873133 for more information.

Additionally, you can install Bioconductor manually using the following within R

# Try http:// if https:// URLs are not supported
source("https://bioconductor.org/biocLite.R")
biocLite()

and then using biocLite() to install the missing packages. See http://bioconductor.org/install/.

How do I update my packages?

For regular R packages (i.e., non-Bioconductor packages), use conda from the terminal.

# XXX is the package name
conda install r-XXX

# For example, installing XML
conda install r-xml

But for Bioconductor packages, use biocLite() from within R.

# Try http:// if https:// URLs are not supported
source("https://bioconductor.org/biocLite.R")

# E.g. installing DESeq2
biocLite("DESeq2")


dada2hpcpipe's Issues

Add submits/ directory with submit scripts

Description

Create a list of smaller submit scripts that perform the steps of the pipeline, allowing user feedback and manual checks.

Proposal

Need submit scripts to be able to do:

  • Check software requirements will work so pipeline can proceed
  • Perform quality checks and output quality plots to view
  • Filter and trim sequences
  • Perform the main DADA2 denoising: error learning, dereplication, merging of paired ends, sequence table construction, and chimera removal (see the sketch after this list)
  • Assign taxonomy using table of taxonomy database
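
A minimal sketch of the core denoising step, following the DADA2 tutorial; the file paths, naming pattern, and reference database shown here are assumptions for illustration:

library(dada2)

# Filtered paired-end reads (paths and naming pattern are hypothetical)
filtFs <- sort(list.files("filtered", pattern = "_R1_filt.fastq.gz", full.names = TRUE))
filtRs <- sort(list.files("filtered", pattern = "_R2_filt.fastq.gz", full.names = TRUE))

# Learn error rates from the filtered reads
errF <- learnErrors(filtFs, multithread = TRUE)
errR <- learnErrors(filtRs, multithread = TRUE)

# Dereplicate and denoise
derepFs <- derepFastq(filtFs)
derepRs <- derepFastq(filtRs)
dadaFs <- dada(derepFs, err = errF, multithread = TRUE)
dadaRs <- dada(derepRs, err = errR, multithread = TRUE)

# Merge pairs, build the sequence table, and remove chimeras
mergers <- mergePairs(dadaFs, derepFs, dadaRs, derepRs)
seqtab <- makeSequenceTable(mergers)
seqtab.nochim <- removeBimeraDenovo(seqtab, method = "consensus", multithread = TRUE)

# Assign taxonomy against a downloaded reference database (path is hypothetical)
taxa <- assignTaxonomy(seqtab.nochim, "refs/silva_nr_train_set.fa.gz", multithread = TRUE)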

Create Make rule to verify data format

Create a Makefile rule that will double check the format of the data.

Data Format Specifications

  • For now, only take paired-end read samples
  • Check that files are .fastq or .fastq.gz
  • Check naming scheme of files

User Story

As a user, I want to be able to type the following

make verify

so that I have a check that at least loading the data won't be a problem.
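
A minimal sketch, in R, of the checks such a rule could call; the function name, data directory, and _R1/_R2 naming scheme are hypothetical:

# Hypothetical helper behind a `make verify` rule: check paired-end FASTQ inputs
verify_input <- function(data_dir = "test_data") {
  files <- list.files(data_dir)

  # Check that files are .fastq or .fastq.gz
  ok_ext <- grepl("\\.fastq(\\.gz)?$", files)
  if (!all(ok_ext)) {
    stop("Non-FASTQ files found: ", paste(files[!ok_ext], collapse = ", "))
  }

  # Check naming scheme: every forward (_R1) file needs a reverse (_R2) mate
  fwd <- grep("_R1", files, value = TRUE)
  rev <- grep("_R2", files, value = TRUE)
  if (length(fwd) == 0 || length(fwd) != length(rev)) {
    stop("Expected matched _R1/_R2 paired-end files")
  }

  message("Input looks OK: ", length(fwd), " paired-end samples")
  invisible(TRUE)
}

The Makefile rule itself would then just wrap a call to this hypothetical function, e.g. Rscript -e 'dada2HPCPipe::verify_input("test_data")'.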

Save a file keeping track of changes from the workflow

Examples:

  • Names of files
  • Number of files
  • Number of raw reads input to workflow
  • Number of reads after trimming and filtering
  • Sequence table of read lengths and quantity of each
  • Number of chimeras and percent of reads left

First take: http://benjjneb.github.io/dada2/tutorial.html#track-reads-through-the-pipeline

getN <- function(x) sum(getUniques(x))
track <- cbind(out, sapply(dadaFs, getN), sapply(mergers, getN), rowSums(seqtab), rowSums(seqtab.nochim))
colnames(track) <- c("input", "filtered", "denoised", "merged", "tabled", "nonchim")
rownames(track) <- sample.names
head(track)
##        input filtered denoised merged tabled nonchim
## F3D0    7793     7113     7113   6600   6600    6588
## F3D1    5869     5299     5299   5078   5078    5067
## F3D141  5958     5463     5463   5047   5047    4928
## F3D142  3183     2914     2914   2663   2663    2600
## F3D143  3178     2941     2941   2575   2575    2550
## F3D144  4827     4312     4312   3668   3668    3527

Add feedback to user in various functions

There is no feedback on what the scripts are doing. It would be good to have messages within the functions to let the user know what is going on.

The feedback is currently written in an R script, but it should live within the functions themselves.
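
A minimal sketch of what that could look like inside a wrapper function; the function name and trimming parameters are illustrative only:

# Illustrative wrapper that reports progress with message()
filter_and_trim_step <- function(fwd, rev, out_dir) {
  message("Filtering and trimming ", length(fwd), " paired-end samples...")
  out <- dada2::filterAndTrim(fwd, file.path(out_dir, basename(fwd)),
                              rev, file.path(out_dir, basename(rev)),
                              truncLen = c(240, 160), maxEE = c(2, 2),
                              verbose = TRUE)
  message("Filtering done; per-sample read counts are in the returned table.")
  out
}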

Add Travis CI badge to README

[![Build Status](https://travis-ci.org/erictleung/dada2HPCPipe.svg?branch=master)](https://travis-ci.org/erictleung/dada2HPCPipe)

Failed installation of Rcpp

Description

DADA2 requires Rcpp, but Rcpp fails to install.

Error

Error: package or namespace load failed for ‘Rcpp’ in dyn.load(file, DLLpath = DLLpath, ...):
 unable to load shared object '/home/users/leunge/rpkgs/3.4/Rcpp/libs/Rcpp.so':
  /usr/lib64/libstdc++.so.6: version `GLIBCXX_3.4.20' not found (required by /home/users/leunge/rpkgs/3.4/Rcpp/libs/Rcpp.so)

Progress to Solution

Create Makevars File

Someone had similar issues installing data.table (https://github.com/Rdatatable/data.table/issues/2055). A possible solution, based on this answer (https://github.com/Rdatatable/data.table/issues/2055#issuecomment-289213983), would be to create an ~/.R/Makevars file to specify the correct versions and paths of the compilers I need for the package.

There is a similar discussion on forcing R to use particular versions of g++ (https://stackoverflow.com/questions/23414448/r-makevars-file-to-overwrite-r-cmds-default-g-options).

For more on startup files, refer to http://stat.ethz.ch/R-manual/R-patched/library/base/html/Startup.html.

Fix Missing GLIBCXX File

The above error says I'm missing GLIBCXX_3.4.20. From another source (https://askubuntu.com/q/575505), I can take a look at these using the following (modified to work on the cluster):

$ strings /usr/lib64/libstdc++.so.6 | grep GLIBCXX
GLIBCXX_3.4
GLIBCXX_3.4.1
GLIBCXX_3.4.2
GLIBCXX_3.4.3
GLIBCXX_3.4.4
GLIBCXX_3.4.5
GLIBCXX_3.4.6
GLIBCXX_3.4.7
GLIBCXX_3.4.8
GLIBCXX_3.4.9
GLIBCXX_3.4.10
GLIBCXX_3.4.11
GLIBCXX_3.4.12
GLIBCXX_3.4.13
GLIBCXX_3.4.14
GLIBCXX_3.4.15
GLIBCXX_3.4.16
GLIBCXX_3.4.17
GLIBCXX_3.4.18
GLIBCXX_3.4.19
GLIBCXX_DEBUG_MESSAGE_LENGTH

As shown, GLIBCXX_3.4.20 is missing. This is, however, from the root usr directory.

Similar issue, but using Anaconda (https://github.com/ContinuumIO/anaconda-issues/issues/483).

Edit: Looking at ~/packages/miniconda/pkgs (I've set this path personally after installing Miniconda), you can find the directory libgcc-5.2.0-0/lib/, and from there you can run the following:

$ strings libstdc++.so.6 | grep GLIBCXX
GLIBCXX_DEBUG_MESSAGE_LENGTH
GLIBCXX_3.4
GLIBCXX_3.4.1
GLIBCXX_3.4.2
GLIBCXX_3.4.3
GLIBCXX_3.4.4
GLIBCXX_3.4.5
GLIBCXX_3.4.6
GLIBCXX_3.4.7
GLIBCXX_3.4.8
GLIBCXX_3.4.9
GLIBCXX_3.4.10
GLIBCXX_3.4.11
GLIBCXX_3.4.12
GLIBCXX_3.4.13
GLIBCXX_3.4.14
GLIBCXX_3.4.15
GLIBCXX_3.4.16
GLIBCXX_3.4.17
GLIBCXX_3.4.18
GLIBCXX_3.4.19
GLIBCXX_3.4.20
GLIBCXX_3.4.21

Note the inclusion of GLIBCXX_3.4.20. I'll need to figure out how this library gets onto the load path (e.g. LD_LIBRARY_PATH) so it can be found.

Add unit tests to functions

Let Travis CI validate and test most functions in this package to make sure they are doing what we think they are doing.
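
A minimal sketch using testthat; the tested helper verify_input() is the hypothetical function sketched in the data-format issue above:

# tests/testthat/test-verify-input.R (hypothetical test file)
library(testthat)

test_that("verify_input rejects non-FASTQ files", {
  tmp <- tempfile("baddata")
  dir.create(tmp)
  file.create(file.path(tmp, "notes.txt"))
  expect_error(verify_input(tmp))
})

test_that("verify_input accepts matched paired-end FASTQ files", {
  tmp <- tempfile("gooddata")
  dir.create(tmp)
  file.create(file.path(tmp, c("sampleA_R1.fastq.gz", "sampleA_R2.fastq.gz")))
  expect_true(verify_input(tmp))
})

Travis CI would then run these through the usual R CMD check / devtools::test() machinery.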

For test example, follow data analysis structure

Create an analysis structure similar to the one found here: https://www.maxmasnick.com/analysis-org/

Directories to have (a sketch of creating this layout follows the list):

  • data (data to be used in the analysis)
  • results (results like plots of quality and errors and others)
  • intermediate (intermediary data to be used between steps)
  • metadata (like mapping files and other)
  • src (R scripts)
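
A minimal sketch, using base R from the project root, of creating this layout:

# Create the analysis directory layout
dirs <- c("data", "results", "intermediate", "metadata", "src")
for (d in dirs) {
  dir.create(d, showWarnings = FALSE)
}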

Will need to update the Makefile to put the example data in a particular place.

The submit scripts can be put in the root of this directory. So for now, move the submit scripts out of the test/ directory, up one level, so they sit alongside the README for the scripts/ directory.

Test SLURM scripts

There are SLURM scripts in the scripts/test/ directory of the project. These need to be tested on the cluster.

  • test_install
  • test_input_quality
  • test_filter_trim_err
  • Rest of pipeline

After filtering and trimming, the rest of the pipeline doesn't require much manual input.

Create configuration file for user input parameters

Currently, configuration and parameters are buried within the R scripts run by the Slurm scripts. Need to create a higher-level configuration file so that you don't need to modify the R scripts themselves.

https://stackoverflow.com/a/5276466/6873133

Filter and Trim

Here are the possible parameters to input into the filtering and trimming step.

filterAndTrim(fwd, filt, rev = NULL, filt.rev = NULL, compress = TRUE,
  truncQ = 2, truncLen = 0, trimLeft = 0, maxLen = Inf, minLen = 20,
  maxN = 0, minQ = 0, maxEE = Inf, rm.phix = TRUE, primer.fwd = NULL,
  matchIDs = FALSE, id.sep = "\\s", id.field = NULL,
  multithread = FALSE, n = 1e+05, OMP = TRUE, verbose = FALSE)

There are a total of 22 different parameters, a subset of which would require some changes.

The parameters you'd likely change are:

  • truncLen for forward and reverse truncation length
  • maxN for max number of N values after truncation
  • maxEE for maximum number of expected errors after truncation

For provenance of what's going on, the verbose parameter should be TRUE to output status messages as the function filters and trims.

For the remaining parameters that are unused but potentially useful, we may need to build an option into the configuration file that allows passing other parameters with the same naming scheme as the ones in the function itself.

Configuration File

The configuration file will be in a YAML format. We can read in this file using the yaml package.

library(yaml)
config <- yaml.load_file("config.yml")
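
A minimal sketch of how the parsed configuration could drive filterAndTrim; the config.yml contents, section name, and file paths are assumptions, but the keys mirror the function's argument names so they can be passed straight through:

library(yaml)
library(dada2)

# Hypothetical config.yml:
#
# filter_and_trim:
#   truncLen: [240, 160]
#   maxN: 0
#   maxEE: [2, 2]
#   verbose: true

config <- yaml.load_file("config.yml")

# Input and output file paths (hypothetical locations and naming pattern)
fwd_files <- sort(list.files("data", pattern = "_R1.fastq.gz", full.names = TRUE))
rev_files <- sort(list.files("data", pattern = "_R2.fastq.gz", full.names = TRUE))
filt_fwd <- file.path("intermediate", basename(fwd_files))
filt_rev <- file.path("intermediate", basename(rev_files))

# Fixed arguments come from the workflow; user-tunable ones come from the config
args <- c(list(fwd = fwd_files, filt = filt_fwd, rev = rev_files, filt.rev = filt_rev),
          config$filter_and_trim)
out <- do.call(filterAndTrim, args)

Because do.call() forwards whatever keys appear under filter_and_trim, this approach also covers passing otherwise unmentioned parameters, as discussed above.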

Overview

There appears to be only one step that requires intervention (i.e. human-in-the-loop), which is the quality step. The user will need to decide on the quality cut-offs that are sufficient for the data.

After the quality decision, the rest of the workflow can be automated.

  • Dereplication: pretty standard; the default maximum number of sequences to dereplicate seems reasonable, and there is no reason to change it.
  • Merge paired reads: may want to split this into its own step, in case there are issues with overlap
  • Construct sequence table: makeSequenceTable() requires no other parameters
  • Remove chimeras: just needs a parameter for the method of removing chimeras, "consensus" vs "per-sample"
  • Track reads through pipeline: can be done with all the other steps
  • Assign taxonomy: just needs to specify the dataset with the taxonomic information

As noted with the filterAndTrim function, there needs to be an option in the configuration file that allows the use of unmentioned parameters. The reason not to list out all the parameters is that it is overly verbose and unnecessary most of the time. However, in the event that those parameters are needed, there is an option to use them.

Save file of parameters used and versioning

Once the analysis has completed, this workflow should save out a file with information to reproduce the analysis.

sessionInfo.txt

Print out a file with results from the command sessionInfo() in R.

parameters.yml (this file should be used as input)

  • File with functions as keys and the corresponding parameter inputs as the values
  • Parameters used in trimAndFilter() function
  • Parameters used in dada() function
  • Parameters used in various other functions

Edit: Use a YAML file instead of a JSON file for the output parameter file.
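
A minimal sketch of writing these two files at the end of a run; the parameter values shown are placeholders for whatever the workflow actually used:

# Record the R session for reproducibility
writeLines(capture.output(sessionInfo()), "sessionInfo.txt")

# Record the parameters actually used, keyed by function name (YAML, per the edit above)
library(yaml)
params <- list(
  filterAndTrim = list(truncLen = c(240, 160), maxN = 0, maxEE = c(2, 2)),
  dada = list(multithread = TRUE)
)
writeLines(as.yaml(params), "parameters.yml")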
