
dada2HPCPipe


16S rRNA microbiome data analysis workflow using DADA2 and R on a high performance cluster.

This repository essentially contains wrapper functions around DADA2 functions in order to streamline the workflow for cluster computing.

This package is meant to serve two purposes: to be an R package and to give structure to an analysis. The project follows an R package structure, so it can be downloaded and installed as such. Additionally, the user is expected to download this repository and run make and slurm commands to run the scripts.

Table of Contents

  • Installation
  • Description
  • Development Setup
  • Troubleshooting

Installation

install.packages("devtools")
devtools::install_github("erictleung/dada2HPCPipe")

Description

This DADA2 workflow stems from the DADA2 tutorial and big data tutorial. You can find more information about the DADA2 package from its publication or from GitHub.

Overview of Makefile

clean                          Remove data from test_data/, download/, and refs/
condar                         Install R and essential packages
dl-ref-dbs                     Download 16S reference databases (SILVA,RDP,GG)
help                           Help page for Makefile
install                        Install and update dada2HPCPipe package in R
setup                          Setup development environment with Conda
test                           Run DADA2 workflow with Mothur MiSeq test data

Development Setup

Here are instructions on how to get started on ExaCloud and set up the development environment needed to run the DADA2 workflow.

Package Management

Interactive Session

To run an interactive session, run the following:

srun --pty /usr/bin/bash

This will allow you to test your code and workflow without worrying about stressing the coordinating head node.

Setup

Follow the instructions listed in this document to set up a modern development environment on the cluster. This isn't necessary if your development environment is on a cluster where you have root access or if you're implementing this workflow locally.

Briefly, following the instructions linked above will give you the following:

  • Miniconda, Python package and virtual environment management
  • Linuxbrew, non-root package management on Linux systems

For this R workflow, you'll only really need to install Miniconda and R essentials. Anaconda has built an r-essentials package with R and the most commonly used R packages for data science.

Linuxbrew is useful to supplement commands and other software tools you might want under package management control.

In summary, to set up the dependencies for DADA2, run the following:

make setup
make condar

The make setup rule runs the following:

# Download and install Miniconda
wget https://repo.continuum.io/miniconda/Miniconda2-latest-Linux-x86_64.sh
bash Miniconda2-latest-Linux-x86_64.sh

# Say yes to adding Miniconda to .bash_profile

# Remove install file
rm Miniconda2-latest-Linux-x86_64.sh

To see changes, you'll first need to exit the cluster and log back in.

The make condar rule installs R and other essential R packages, as laid out below:

# Install R and relevant packages
conda install r-essentials

# For maintenance and update of all packages
conda update r-essentials

# For updating a single particular R package, replace XXXX
conda update r-XXXX

Slurm Workload Manager

Slurm is the resource manager that I'll focus on for this workflow. Slurm stands for "Simple Linux Utility for Resource Management."

An example script might look like this:

$ cat first_script.sh
#!/bin/bash
# Template for simple SLURM script
#SBATCH --job-name="Job Name"
#SBATCH --partition=exacloud

srun hostname
srun pwd
srun hostinfo

The quick answer on sbatch vs srun can be found here.

Below are some useful commands to use within Slurm using the script above.

# Submit your script, first_script.sh
sbatch first_script.sh

# Look at jobs in the queue
squeue
squeue -u $USER # Take a look at your specific jobs

You can use this website to help generate Slurm scripts. It is designed for another cluster, but it should at least help with the initial drafting of a submit script you'll want to use.

For more general resources on using Slurm, check out here, here, and here.

Source: http://www.cism.ucl.ac.be/Services/Formations/slurm/2016/slurm.pdf

Troubleshooting

Installing this package reports trouble installing Bioconductor packages

There are two solutions for this. From within R, you can run the following

setRepositories(ind=1:2)

which will tell R to also include Bioconductor packages in its package search. See https://stackoverflow.com/a/34617938/6873133 for more information.

Additionally, you can install Bioconductor manually using the following within R

# Try http:// if https:// URLs are not supported
source("https://bioconductor.org/biocLite.R")
biocLite()

and then using biocLite() to install the missing packages. See http://bioconductor.org/install/.

How do I update my packages?

For regular R packages (i.e., non-Bioconductor packages), use conda from the terminal.

# XXX is the package name
conda install r-XXX

# For example, installing XML
conda install r-xml

But for Bioconductor packages, use biocLite() from within R.

# Try http:// if https:// URLs are not supported
source("https://bioconductor.org/biocLite.R")

# E.g. installing DESeq2
biocLite("DESeq2")


dada2hpcpipe's Issues

Add submits/ directory with submit scripts

Description

Create a list of smaller submit scripts that perform the steps of the pipeline, allowing user feedback and manual checks.

Proposal

Need submit scripts to be able to do:

  • Check software requirements will work so pipeline can proceed
  • Perform quality checks and output quality plots to view
  • Filter and trim sequences
  • Perform the main DADA2 denoising: error learning, dereplication, merging of paired ends, sequence table construction, and chimera removal (see the sketch after this list)
  • Assign taxonomy using table of taxonomy database
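
A minimal sketch of the core denoising step, following the DADA2 tutorial; the file paths, naming pattern, and reference database shown here are assumptions for illustration:

library(dada2)

# Filtered paired-end reads (paths and naming pattern are hypothetical)
filtFs <- sort(list.files("filtered", pattern = "_R1_filt.fastq.gz", full.names = TRUE))
filtRs <- sort(list.files("filtered", pattern = "_R2_filt.fastq.gz", full.names = TRUE))

# Learn error rates from the filtered reads
errF <- learnErrors(filtFs, multithread = TRUE)
errR <- learnErrors(filtRs, multithread = TRUE)

# Dereplicate and denoise
derepFs <- derepFastq(filtFs)
derepRs <- derepFastq(filtRs)
dadaFs <- dada(derepFs, err = errF, multithread = TRUE)
dadaRs <- dada(derepRs, err = errR, multithread = TRUE)

# Merge pairs, build the sequence table, and remove chimeras
mergers <- mergePairs(dadaFs, derepFs, dadaRs, derepRs)
seqtab <- makeSequenceTable(mergers)
seqtab.nochim <- removeBimeraDenovo(seqtab, method = "consensus", multithread = TRUE)

# Assign taxonomy against a downloaded reference database (path is hypothetical)
taxa <- assignTaxonomy(seqtab.nochim, "refs/silva_nr_train_set.fa.gz", multithread = TRUE)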

Create Make rule to verify data format

Create a Makefile rule that will double check the format of the data.

Data Format Specifications

  • For now, only take paired-end read samples
  • Check that files are .fastq or .fastq.gz
  • Check naming scheme of files

User Story

As a user, I want to be able to type the following

make verify

so that I have a check that at least loading the data won't be a problem.
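
A minimal sketch, in R, of the checks such a rule could call; the function name, data directory, and _R1/_R2 naming scheme are hypothetical:

# Hypothetical helper behind a `make verify` rule: check paired-end FASTQ inputs
verify_input <- function(data_dir = "test_data") {
  files <- list.files(data_dir)

  # Check that files are .fastq or .fastq.gz
  ok_ext <- grepl("\\.fastq(\\.gz)?$", files)
  if (!all(ok_ext)) {
    stop("Non-FASTQ files found: ", paste(files[!ok_ext], collapse = ", "))
  }

  # Check naming scheme: every forward (_R1) file needs a reverse (_R2) mate
  fwd <- grep("_R1", files, value = TRUE)
  rev <- grep("_R2", files, value = TRUE)
  if (length(fwd) == 0 || length(fwd) != length(rev)) {
    stop("Expected matched _R1/_R2 paired-end files")
  }

  message("Input looks OK: ", length(fwd), " paired-end samples")
  invisible(TRUE)
}

The Makefile rule itself would then just wrap a call to this hypothetical function, e.g. Rscript -e 'dada2HPCPipe::verify_input("test_data")'.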

Save a file keeping track of changes from the workflow

Examples:

  • Names of files
  • Number of files
  • Number of raw reads input to workflow
  • Number of reads after trimming and filtering
  • Sequence table of read lengths and quantity of each
  • Number of chimeras and percent of reads left

First take: http://benjjneb.github.io/dada2/tutorial.html#track-reads-through-the-pipeline

getN <- function(x) sum(getUniques(x))
track <- cbind(out, sapply(dadaFs, getN), sapply(mergers, getN), rowSums(seqtab), rowSums(seqtab.nochim))
colnames(track) <- c("input", "filtered", "denoised", "merged", "tabled", "nonchim")
rownames(track) <- sample.names
head(track)
##        input filtered denoised merged tabled nonchim
## F3D0    7793     7113     7113   6600   6600    6588
## F3D1    5869     5299     5299   5078   5078    5067
## F3D141  5958     5463     5463   5047   5047    4928
## F3D142  3183     2914     2914   2663   2663    2600
## F3D143  3178     2941     2941   2575   2575    2550
## F3D144  4827     4312     4312   3668   3668    3527

Add feedback to user in various functions

There is no feedback on what the scripts are doing. It would be good to have messages within the functions to let the user know what is going on.

The feedback is currently written in an R script, but it should live within the functions themselves.
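
A minimal sketch of what that could look like inside a wrapper function; the function name and trimming parameters are illustrative only:

# Illustrative wrapper that reports progress with message()
filter_and_trim_step <- function(fwd, rev, out_dir) {
  message("Filtering and trimming ", length(fwd), " paired-end samples...")
  out <- dada2::filterAndTrim(fwd, file.path(out_dir, basename(fwd)),
                              rev, file.path(out_dir, basename(rev)),
                              truncLen = c(240, 160), maxEE = c(2, 2),
                              verbose = TRUE)
  message("Filtering done; per-sample read counts are in the returned table.")
  out
}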

Add Travis CI badge to README

[![Build Status](https://travis-ci.org/erictleung/dada2HPCPipe.svg?branch=master)](https://travis-ci.org/erictleung/dada2HPCPipe)

Failed installation of Rcpp

Description

DADA2 requires Rcpp, but Rcpp fails to install.

Error

Error: package or namespace load failed for ‘Rcpp’ in dyn.load(file, DLLpath = DLLpath, ...):
 unable to load shared object '/home/users/leunge/rpkgs/3.4/Rcpp/libs/Rcpp.so':
  /usr/lib64/libstdc++.so.6: version `GLIBCXX_3.4.20' not found (required by /home/users/leunge/rpkgs/3.4/Rcpp/libs/Rcpp.so)

Progress to Solution

Create Makevars File

Someone had similar issues installing data.table (https://github.com/Rdatatable/data.table/issues/2055). A possible solution, based on this answer (https://github.com/Rdatatable/data.table/issues/2055#issuecomment-289213983), would be to create an ~/.R/Makevars file to specify the correct versions and paths of the compilers I need for the package.

There is a similar discussion on forcing R to use particular versions of g++ (https://stackoverflow.com/questions/23414448/r-makevars-file-to-overwrite-r-cmds-default-g-options).

For more on startup files, refer to http://stat.ethz.ch/R-manual/R-patched/library/base/html/Startup.html.

Fix Missing GLIBCXX File

The above error says I'm missing GLIBCXX_3.4.20. From another source (https://askubuntu.com/q/575505), I can take a look at these using the following (modified to work on the cluster):

$ strings /usr/lib64/libstdc++.so.6 | grep GLIBCXX
GLIBCXX_3.4
GLIBCXX_3.4.1
GLIBCXX_3.4.2
GLIBCXX_3.4.3
GLIBCXX_3.4.4
GLIBCXX_3.4.5
GLIBCXX_3.4.6
GLIBCXX_3.4.7
GLIBCXX_3.4.8
GLIBCXX_3.4.9
GLIBCXX_3.4.10
GLIBCXX_3.4.11
GLIBCXX_3.4.12
GLIBCXX_3.4.13
GLIBCXX_3.4.14
GLIBCXX_3.4.15
GLIBCXX_3.4.16
GLIBCXX_3.4.17
GLIBCXX_3.4.18
GLIBCXX_3.4.19
GLIBCXX_DEBUG_MESSAGE_LENGTH

As shown, GLIBCXX_3.4.20 is missing. This is, however, from the root usr directory.

Similar issue, but using Anaconda (https://github.com/ContinuumIO/anaconda-issues/issues/483).

Edit: Looking at ~/packages/miniconda/pkgs (I've set this path personally after installing Miniconda), you can find the directory libgcc-5.2.0-0/lib/, and from there you can run the following:

$ strings libstdc++.so.6 | grep GLIBCXX
GLIBCXX_DEBUG_MESSAGE_LENGTH
GLIBCXX_3.4
GLIBCXX_3.4.1
GLIBCXX_3.4.2
GLIBCXX_3.4.3
GLIBCXX_3.4.4
GLIBCXX_3.4.5
GLIBCXX_3.4.6
GLIBCXX_3.4.7
GLIBCXX_3.4.8
GLIBCXX_3.4.9
GLIBCXX_3.4.10
GLIBCXX_3.4.11
GLIBCXX_3.4.12
GLIBCXX_3.4.13
GLIBCXX_3.4.14
GLIBCXX_3.4.15
GLIBCXX_3.4.16
GLIBCXX_3.4.17
GLIBCXX_3.4.18
GLIBCXX_3.4.19
GLIBCXX_3.4.20
GLIBCXX_3.4.21

Note the inclusion of GLIBCXX_3.4.20. I'll need to figure out how this library gets onto the load path (e.g. LD_LIBRARY_PATH) so it can be found.

Add unit tests to functions

Let Travis CI validate and test most functions in this package to make sure they are doing what we think they are doing.
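
A minimal sketch using testthat; the tested helper verify_input() is the hypothetical function sketched in the data-format issue above:

# tests/testthat/test-verify-input.R (hypothetical test file)
library(testthat)

test_that("verify_input rejects non-FASTQ files", {
  tmp <- tempfile("baddata")
  dir.create(tmp)
  file.create(file.path(tmp, "notes.txt"))
  expect_error(verify_input(tmp))
})

test_that("verify_input accepts matched paired-end FASTQ files", {
  tmp <- tempfile("gooddata")
  dir.create(tmp)
  file.create(file.path(tmp, c("sampleA_R1.fastq.gz", "sampleA_R2.fastq.gz")))
  expect_true(verify_input(tmp))
})

Travis CI would then run these through the usual R CMD check / devtools::test() machinery.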

For test example, follow data analysis structure

Create an analysis structure similar to the one found here: https://www.maxmasnick.com/analysis-org/

Directories to have (a sketch of creating this layout follows the list):

  • data (data to be used in the analysis)
  • results (results like plots of quality and errors and others)
  • intermediate (intermediary data to be used between steps)
  • metadata (like mapping files and other)
  • src (R scripts)
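
A minimal sketch, using base R from the project root, of creating this layout:

# Create the analysis directory layout
dirs <- c("data", "results", "intermediate", "metadata", "src")
for (d in dirs) {
  dir.create(d, showWarnings = FALSE)
}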

Will need to update the Makefile to put the example data in a particular place.

The submit scripts can be put in the root of this directory. So for now, move the submit scripts out of the test/ directory, up one level, so they sit alongside the README for the scripts/ directory.

Test SLURM scripts

There are SLURM scripts in the scripts/test/ directory of the project. These need to be tested on the cluster.

  • test_install
  • test_input_quality
  • test_filter_trim_err
  • Rest of pipeline

After filtering and trimming, the rest of the pipeline doesn't require much manual input.

Create configuration file for user input parameters

Currently, configuration and parameters are buried within the R scripts run by the Slurm scripts. Need to create a higher-level configuration file so that you don't need to modify the R scripts themselves.

https://stackoverflow.com/a/5276466/6873133

Filter and Trim

Here are the possible parameters to input into the filtering and trimming step.

filterAndTrim(fwd, filt, rev = NULL, filt.rev = NULL, compress = TRUE,
  truncQ = 2, truncLen = 0, trimLeft = 0, maxLen = Inf, minLen = 20,
  maxN = 0, minQ = 0, maxEE = Inf, rm.phix = TRUE, primer.fwd = NULL,
  matchIDs = FALSE, id.sep = "\\s", id.field = NULL,
  multithread = FALSE, n = 1e+05, OMP = TRUE, verbose = FALSE)

There are a total of 22 different parameters, a subset of which would require some changes.

The parameters you'd likely change are:

  • truncLen for forward and reverse truncation length
  • maxN for max number of N values after truncation
  • maxEE for maximum number of expected errors after truncation

For provenance of what's going on, the verbose parameter should be TRUE to output status messages as the function filters and trims.

For the remaining parameters that are unused but potentially useful, we may need to build an option into the configuration file that allows passing other parameters with the same naming scheme as the ones in the function itself.

Configuration File

The configuration file will be in a YAML format. We can read in this file using the yaml package.

library(yaml)
config <- yaml.load_file("config.yml")
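
A minimal sketch of how the parsed configuration could drive filterAndTrim; the config.yml contents, section name, and file paths are assumptions, but the keys mirror the function's argument names so they can be passed straight through:

library(yaml)
library(dada2)

# Hypothetical config.yml:
#
# filter_and_trim:
#   truncLen: [240, 160]
#   maxN: 0
#   maxEE: [2, 2]
#   verbose: true

config <- yaml.load_file("config.yml")

# Input and output file paths (hypothetical locations and naming pattern)
fwd_files <- sort(list.files("data", pattern = "_R1.fastq.gz", full.names = TRUE))
rev_files <- sort(list.files("data", pattern = "_R2.fastq.gz", full.names = TRUE))
filt_fwd <- file.path("intermediate", basename(fwd_files))
filt_rev <- file.path("intermediate", basename(rev_files))

# Fixed arguments come from the workflow; user-tunable ones come from the config
args <- c(list(fwd = fwd_files, filt = filt_fwd, rev = rev_files, filt.rev = filt_rev),
          config$filter_and_trim)
out <- do.call(filterAndTrim, args)

Because do.call() forwards whatever keys appear under filter_and_trim, this approach also covers passing otherwise unmentioned parameters, as discussed above.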

Overview

There appears to be only one step that requires intervention (i.e. human-in-the-loop), which is the quality step. The user will need to decide on the quality cut-offs that are sufficient for the data.

After the quality decision, the rest of the workflow can be automated.

  • Dereplication: pretty standard; the default maximum number of sequences to dereplicate seems reasonable, and there is no reason to change it.
  • Merge paired reads: may want to split this into its own step, in case there are issues with overlap
  • Construct sequence table: makeSequenceTable() requires no other parameters
  • Remove chimeras: just needs a parameter for the method of removing chimeras, "consensus" vs "per-sample"
  • Track reads through pipeline: can be done with all the other steps
  • Assign taxonomy: just needs to specify the dataset with the taxonomic information

As noted with the filterAndTrim function, there needs to be an option in the configuration file that allows the use of unmentioned parameters. The reason not to list out all the parameters is that it is overly verbose and unnecessary most of the time. However, in the event that those parameters are needed, there is an option to use them.

Save file of parameters used and versioning

Once the analysis has completed, this workflow should save out a file with information to reproduce the analysis.

sessionInfo.txt

Print out a file with results from the command sessionInfo() in R.

parameters.yml (this file should be used as input)

  • File with functions as keys and the corresponding parameter inputs as the values
  • Parameters used in trimAndFilter() function
  • Parameters used in dada() function
  • Parameters used in various other functions

Edit: Use a YAML file instead of a JSON file for the output parameter file.
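
A minimal sketch of writing these two files at the end of a run; the parameter values shown are placeholders for whatever the workflow actually used:

# Record the R session for reproducibility
writeLines(capture.output(sessionInfo()), "sessionInfo.txt")

# Record the parameters actually used, keyed by function name (YAML, per the edit above)
library(yaml)
params <- list(
  filterAndTrim = list(truncLen = c(240, 160), maxN = 0, maxEE = c(2, 2)),
  dada = list(multithread = TRUE)
)
writeLines(as.yaml(params), "parameters.yml")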
