nf-hack17's Introduction

Nextflow hackathon 2017 projects

This repository lists the project ideas and the material used during the hackathon organised in the context of the Nextflow workshop that will take place on 14-15 September 2017 in Barcelona.

Location

PRBB building, Charles Darwin room, ground floor (you will need an ID card to enter the building).

Schedule

The workshop schedule is available here.

Event chat

We will use the following Gitter channel during the workshop activities. Feel free to register and use it for any questions about logistics, problems, or doubts during the hackathon.

Hackathon

Project ideas are listed in the issues page.

Feel free to join one of the listed projects or create a new one if you have a specific idea you would like to work on.

We also encourage you to create a separate GitHub repository to keep track of your project files and link it to the relevant project idea in the issues page. This will allow us to share and follow up on the hackathon achievements.

Tutorial

The Nextflow tutorial is available at this repository.

Tutorial participants are encouraged to implement the Variant Calling pipeline described in this tutorial on the second day.

Slides

You can find the links to the speaker slides below:

nf-hack17's Issues

Project 4: HTML tracing report

The Nextflow tracing reports contain loads of information, but it can be a bit of a pain to dig through them. It would be nice to be able to generate HTML output (like the timeline reports) with tables that can be sorted and plots visualising the numbers.
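As raw material for such a report, the plain-text trace file can be enabled in `nextflow.config`. The sketch below assumes the standard `trace` configuration scope; the output file name and field list are illustrative choices, not part of this project proposal:

```groovy
// nextflow.config: enable the tab-separated trace file that an
// HTML report generator could parse into sortable tables and plots
trace {
    enabled = true
    file    = 'pipeline_trace.txt'              // illustrative file name
    fields  = 'task_id,name,status,realtime,%cpu,rss'
}
```

Each pipeline run then produces one row per task, which is exactly the data a sortable HTML table would be built from.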

Data:

Any example pipeline + data should be suitable.

Computing resources:

Everything can be run on local computers.

Project Lead:

Phil Ewels (@ewels).

Project 11: AWS Batch integration

Nextflow has experimental support for AWS Batch. The goal of this project is to stabilise the current implementation, add missing features, and make it able to process real-world pipelines.
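For context, a minimal configuration for the experimental integration might look like the sketch below; the queue name, container image, and region are placeholders, not tested values:

```groovy
// nextflow.config: sketch of an AWS Batch setup (all values are placeholders)
process {
    executor  = 'awsbatch'
    queue     = 'my-batch-queue'   // a pre-created AWS Batch job queue
    container = 'ubuntu:16.04'     // Batch jobs run inside a container image
}
aws {
    region = 'eu-west-1'
}
```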

Data:

(to be provided)

Computing resources:

(to be provided)

Project Lead:

Francesco Strozzi (@fstrozzi)

Project 8: Reproduce phylogenetic study

Project

The idea is to try to reproduce a part of figure 1 from the following paper.

Data:

The data are simulated, so no external data are needed.

Computing resources:

The compute-intensive tasks are:

  • 200 tasks each taking ~1 min on 3 CPUs and ~600 MB RAM
  • 200 tasks each taking ~2 min on 3 CPUs and ~1 GB RAM
  • 200 tasks each taking ~10 min on 3 CPUs and ~3 GB RAM

An Amazon instance may be useful; its size can be discussed.

Project Lead:

@fredericlemoine

Project 10: De novo assembly of nanopore reads

Project

The project aims at building a modular pipeline for de-novo bacterial assembly using nanopore reads.

Preliminary plan:

If time permits:

(The workflow is open to software suggestions and improvements!) 😄

Data:

We'll use a subset of E. coli data from an R9 run, available here.

We'll work directly with the fasta files (1.5 GB) to avoid losing time on basecalling.

Computing resources:

About 20 GB of RAM is necessary for miniasm. 8 cores would be great, but we could manage with 4. I think a t2.2xlarge or an r4.xlarge would suffice.

Project Lead:

The project will be led by Hadrien Gourlé.

Project 1: Nextflow tutorial for newbie users

Project

A Nextflow training session for newbie users will be organised during the hackathon. During this session participants will learn:

  • how to install the Nextflow framework;
  • how to write a pipeline script;
  • how to manage dependencies with software containers to make it reproducible and portable across different execution platforms.
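To give a flavour of the session, a minimal pipeline in the classic Nextflow syntax might look like this (a sketch; the greeting parameter and file name are illustrative):

```groovy
#!/usr/bin/env nextflow

// A configurable pipeline parameter, overridable with --greeting on the CLI
params.greeting = 'Hello, Nextflow!'

// A single process that echoes the greeting
process sayHello {
    output:
    stdout into result

    """
    echo '${params.greeting}'
    """
}

// Print the captured process output to the console
result.subscribe { println it }
```

Saved as `hello.nf`, this would run with `nextflow run hello.nf`; adding `-with-docker <image>` executes the process inside a container.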

Project lead:

Emilio Palumbo (@emi80)

Project 3: HTML e-mail summaries

We've recently implemented summary reports / e-mails in our pipelines (see NGI-RNAseq). They're basically the same as the example in the docs, but HTML and prettier (eg. a big red box if the pipeline fails).

The idea is basically to write this into core Nextflow somehow so that it's easier for others to get this functionality into their pipelines. We'd write a default template / theme which could be customised if desired. There's an issue discussing this here: nextflow-io/nextflow#375
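As a rough sketch of the idea, assuming the `sendMail` helper discussed in that issue, a completion handler could look like this (the recipient, subject, and HTML body are placeholders, not the proposed default template):

```groovy
// Send a short HTML summary when the run finishes
// (address, subject, and markup are illustrative placeholders)
workflow.onComplete {
    def colour = workflow.success ? 'green' : 'red'
    def status = workflow.success ? 'completed successfully' : 'FAILED'
    sendMail(
        to:      'someone@example.com',
        subject: "Pipeline ${status}",
        body:    "<h2 style='color: ${colour}'>Run ${status}</h2>" +
                 "<p>Duration: ${workflow.duration}</p>"
    )
}
```

The point of the project would be to fold a richer default template like this into core Nextflow so each pipeline doesn't have to hand-roll its own.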

Data:

Any example pipeline + data can be used for this project. Just a dummy pipeline with some hardcoded channel variables would be fine.

Computing resources:

None required. Can be run locally (though could be nice to test on AWS?).

Project Lead:

Phil Ewels (@ewels).

Project 7: Nextflow modules, components and data plugins

Implement and expand on the ideas from the repository here.

  • Nextflow modules could encapsulate the definitions of similar tools, referred to as components.

  • Each module would be made of components which could share the same data input/output types.

  • Individual components would contain their own specific commands and execution environments (containers).

  • Implement mechanisms for downloading/caching data from sources relevant to the module.

Example Module readMapping

Example Component kallisto
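None of this exists yet; purely as a discussion aid, a hypothetical module declaration could look something like the sketch below (the syntax, keywords, and container reference are all invented for illustration):

```groovy
// Hypothetical syntax, NOT valid Nextflow: a 'readMapping' module
// grouping interchangeable components that share input/output types
module readMapping {
    input:  'reads.fastq', 'transcriptome.fasta'
    output: 'abundance.tsv'

    component kallisto {
        container 'quay.io/biocontainers/kallisto'   // invented reference
        script:
        """
        kallisto index -i index.idx transcriptome.fasta
        kallisto quant --single -l 200 -s 20 -i index.idx -o . reads.fastq
        """
    }
}
```

A second component (say, a salmon one) could then be swapped in without changing anything downstream, since both would share the module's input/output contract.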

Data:

Any data relevant to the modules developed.

Project Lead:

Evan Floden (@skptic)

Project 6: ChIPseq comparative pipeline

Project

ChIPseq comparative pipeline for benchmarking procedures:

  • The pipeline will run combinations of tools for both mapping and peak calling
  • The output of every combination should be standardised
  • There should be metrics that can be used for evaluation

Data:

A number of compressed fastq files and genome files (fasta), estimated at around 200 GB.

Computing resources:

(to be estimated)

Project Lead:

Luca Cozzuto (@lucacozzuto)

Project 5: Pipeline for 16S Microbial data

Project

The H3Africa Bioinformatics Network has developed a pipeline for doing 16S analysis (https://github.com/h3abionet/h3abionet16S). This pipeline was produced using CWL. We are interested in migrating it to Nextflow for two reasons: (1) some groups in the network are using Nextflow and would like to extend the pipeline (e.g., for 18S or shotgun data), so it would make sense to have this as a basis; and (2) it would be interesting to compare the two workflow implementations in various ways.

Data:

There are various public data sets, but we suggest using the H3ABionet Accreditation Practice data sets. Our estimate is that the input data set size (including the reference data sets needed) will be about 10 GB, plus another 4 GB or so for analysis and output.

Computing resources:

The following estimate is from Gerrit who worked on the CWL workflow:

The dataset is small, and we would be able to manage with 16 GB of RAM for the run. With the CWL implementation we were not able to do threading, so a t2.xlarge (4 cores, 16 GB RAM) would be sufficient for our current design. The tasks that need threading do not require much RAM, so if it is possible to thread some of our tasks in Nextflow we can probably stick with a 16 GB RAM machine but with more cores, so maybe a c4.4xlarge (16 cores, 30 GB) would be an option.

The above, however, is the requirement if we just need to make one run. We would need to think about how we will work together on this. Will we have one machine where everyone is logged in and doing testing? If so, we may need a bigger machine that allows for more tasks and memory requests; maybe a c3.8xlarge (32 cores, 60 GB RAM).

I do not know if there are any budget requirements. Maybe we just need to say we want two or three of these machines; we will definitely spin up one, and only spin up the others if we find there is a need.

(Also we can make some resources from our Wits cluster available)

Project Lead:

TBD: we have four people coming from Bionet and we'll make a nomination shortly.

Project 9: Approaches to scaling out Nextflow

Project

Approaches to scaling out Nextflow, comparing runs in Kubernetes and HPC environments.

Data:

A few examples of workflows can be found here, but no specific large size datasets will be needed.

Computing resources:

I expect this will mostly be working out how to try things out on different environments, rather than trying to execute any particular HPC workflows, but if access to a small Kubernetes cluster was possible that would be useful. Same for HPC environments.
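One concrete, low-effort way to compare environments is Nextflow's configuration profiles, which switch the executor without touching the workflow script. In the sketch below the queue name is a placeholder and the `k8s` executor value is an assumption:

```groovy
// nextflow.config: profiles for switching execution environments
profiles {
    hpc {
        process.executor = 'slurm'
        process.queue    = 'work'   // placeholder queue name
    }
    kube {
        process.executor = 'k8s'    // assumed Kubernetes executor name
    }
}
```

A run would then be launched with `nextflow run main.nf -profile hpc` (or `-profile kube`), keeping the pipeline code identical across environments.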

Project Lead:

Tim Dudgeon.
Assistance from people familiar with using Nextflow in these environments would be useful.

Project 2: Variant Calling pipeline for RNA data

Project

Implementation of a Variant Calling analysis pipeline for RNAseq data, based on GATK best practices and using Nextflow as the pipeline framework.

This is intended as a tutorial project for newbie Nextflow users who are interested in learning how to assemble a real-world genomic pipeline, manage dependencies using container technology (Docker & Singularity), and deploy the pipeline execution in a reproducible manner in the AWS cloud.

The participants will need to assemble the pipeline on their own following the documentation provided at this link.
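To illustrate the container-based dependency management the tutorial covers, a single pipeline step might be pinned to an image like this (the image tag, channel names, and command are illustrative sketches, not the tutorial's actual code):

```groovy
// Sketch of one pipeline step with its tool pinned to a container image
process markDuplicates {
    container 'broadinstitute/picard:2.9.0'   // illustrative image tag

    input:
    file bam from aligned_bams

    output:
    file 'dedup.bam' into dedup_bams

    """
    picard MarkDuplicates I=${bam} O=dedup.bam M=metrics.txt
    """
}
```

Because the tool version lives in the `container` directive rather than on the host, the same script runs unchanged locally, on a cluster, or in the AWS cloud.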

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.