fmalmeida / ngs-preprocess

A pipeline for preprocessing NGS data from Illumina, Nanopore and PacBio technologies

Home Page: https://ngs-preprocess.readthedocs.io/

License: GNU General Public License v3.0

Nextflow 97.55% Dockerfile 2.45%

ngs-preprocess illumina pacbio nextflow pipeline trimgalore nanopack bax2bam porechop reproducible-research

ngs-preprocess's Introduction

Hello 😁 👋

Hello there, my name is Felipe Almeida. I am a Brazilian scientist, bioinformatician, pipeline developer and problem solver. My main interests are bioinformatics, genomic surveillance, precision medicine, and microbial genomics. You can also find me on Twitter (@fmarquesalmeida), Stack Overflow and LinkedIn.

Academic info

I'm a PhD student at the University of Brasilia, in the CompGen (Computational Genomics) laboratory, under the academic guidance of Prof. Georgios J. Pappas Jr., PhD.

ngs-preprocess's People

Forkers

vikash84 minghao2016 ravinpoudel lorepaga1996

ngs-preprocess's Issues

Update NanoPack tools alternatives

Some NanoPack tools have been replaced by faster alternatives, as described here: https://github.com/wdecoster/nanopack?tab=readme-ov-file

The task is to update these tools in the pipeline.

change structure of output directory

The structure of the output directory is not standardized and needs some changes to give easy access to the final (preprocessed) reads.

It would be nice to have:

A final directory, probably called final_output, that will contain all final (trimmed and filtered) FASTQ files, using the .fq.gz extension to standardize filenames.
This directory will hold all final reads, separated into subdirectories (longreads or shortreads).
The other files (quality, merging steps, correction steps, etc.) would then be saved in other directories, one for each step, software or strain; this still needs more thought.

More brainstorming is required before this is implemented. Help is needed to decide the structure (@gpappasunb).
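As a starting point for that discussion, the proposed layout could be sketched as a small path-building helper. This is only an illustration in Python, not pipeline code: the function names, the platform-to-subdirectory mapping and the `suffix` parameter are assumptions; only `final_output`, `longreads`, `shortreads` and the `.fq.gz` extension come from the proposal above.

```python
from pathlib import Path

# Hypothetical mapping from sequencing platform to the read-type
# subdirectory suggested in the issue.
READ_TYPE_BY_PLATFORM = {
    "illumina": "shortreads",
    "nanopore": "longreads",
    "pacbio": "longreads",
}


def final_read_path(outdir, sample_id, platform, suffix=""):
    """Return the standardized destination of a final (trimmed and filtered) read file."""
    read_type = READ_TYPE_BY_PLATFORM[platform.lower()]
    # Every final file gets the same .fq.gz extension, whatever the tool emitted.
    filename = f"{sample_id}{suffix}.fq.gz"
    return Path(outdir) / "final_output" / read_type / filename


def step_dir(outdir, step):
    """Return the directory for intermediate files of one step (QC, merging, ...)."""
    return Path(outdir) / step


print(final_read_path("results", "sampleA", "Illumina", "_R1").as_posix())
print(step_dir("results", "quality_check").as_posix())
```

Keeping the mapping in one table means that adding a new platform, or renaming a subdirectory, would touch a single place.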

change to bioconda images

Instead of building a custom Docker image with all tools, reconfigure the pipeline to use the bioconda channels and images, which will let users run the pipeline with conda, Docker or Singularity.

fix bam2fastq source code

PacBio has changed the location of many of its tools, including bam2fastx, which is now in a different conda package.
https://github.com/pacificbiosciences/pbtk/

The purpose of this ticket is to update the references and the source from which the pipeline fetches the code, so that the latest version is used.

consider using porechop_abi

Assess and consider switching from Porechop, which is deprecated, to Porechop_ABI, which is actively maintained.

https://github.com/bonsai-team/Porechop_ABI

update module to fetch data from sra

Currently, the pipeline decides how to split downloaded data between modules based on the patterns Illumina, pacbio and nanopore.
But what if a downloaded dataset is not from any of these platforms?
Think about a better approach to channel splitting.
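One way to reason about the problem is to give unmatched platforms an explicit fallback group instead of letting them fall through silently. The sketch below is plain Python rather than Nextflow channel code, and the names `classify_platform`, `split_runs` and the `unsupported` label are assumptions made for illustration; the run identifiers are placeholders.

```python
import re
from collections import defaultdict

# The three patterns named in the issue, matched case-insensitively.
PLATFORM_PATTERNS = {
    "illumina": re.compile(r"illumina", re.IGNORECASE),
    "pacbio": re.compile(r"pacbio", re.IGNORECASE),
    "nanopore": re.compile(r"nanopore", re.IGNORECASE),
}


def classify_platform(platform):
    """Return the module name for a platform string, or 'unsupported'."""
    for module, pattern in PLATFORM_PATTERNS.items():
        if pattern.search(platform or ""):
            return module
    # Explicit fallback: the run is reported instead of being dropped.
    return "unsupported"


def split_runs(runs):
    """Group (run_id, platform) pairs by the module that should handle them."""
    groups = defaultdict(list)
    for run_id, platform in runs:
        groups[classify_platform(platform)].append(run_id)
    return dict(groups)


# Placeholder run ids; platform strings follow SRA-style naming.
runs = [
    ("run1", "ILLUMINA"),
    ("run2", "OXFORD_NANOPORE"),
    ("run3", "PACBIO_SMRT"),
    ("run4", "ION_TORRENT"),
]
print(split_runs(runs))
```

With a dedicated group for unmatched runs, the pipeline could warn the user or skip those samples deliberately, which is easier to document than an implicit drop.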

include the automatic generation of a samplesheet for MpGAP

Add the automatic generation of a samplesheet that can be directly used as input for the https://github.com/fmalmeida/MpGAP pipeline.
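A minimal sketch of what such a generator might look like is shown below. It assumes that MpGAP reads a YAML samplesheet with a top-level `samplesheet` list whose entries have an `id` plus read-type keys such as `illumina` and `nanopore`; those key names, the function name `render_samplesheet` and the file paths are assumptions and should be checked against the MpGAP documentation before use.

```python
def render_samplesheet(samples):
    """Render a YAML samplesheet as text.

    samples: dict mapping a sample id to a dict of {read_type: [paths]}.
    The key names written here are assumed, not taken from the MpGAP docs.
    """
    lines = ["samplesheet:"]
    for sample_id in sorted(samples):
        lines.append(f"  - id: {sample_id}")
        for read_type in sorted(samples[sample_id]):
            lines.append(f"    {read_type}:")
            for path in samples[sample_id][read_type]:
                # Paths are written unquoted; paths containing YAML special
                # characters would need quoting or a real YAML library.
                lines.append(f"      - {path}")
    return "\n".join(lines) + "\n"


# Hypothetical preprocessed outputs for one sample.
samples = {
    "sampleA": {
        "illumina": ["out/sampleA_R1.fq.gz", "out/sampleA_R2.fq.gz"],
        "nanopore": ["out/sampleA_ont.fq.gz"],
    },
}
print(render_samplesheet(samples), end="")
```

Writing the text by hand keeps the sketch free of third-party dependencies; a real implementation would more likely build the structure and serialize it with a YAML library.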

Enhance documentation (paper review)

Background
This issue is meant to address the comments received on the paper review here.

Description
Create an "Output" page that explains the output structure to users and points to the correct tool-specific links, as is done in the bacannot documentation page, which helps users interpret the generated results by describing the directory structure and linking to the relevant tool-specific reference material.

Add more parallel jobs

Add an option to run more jobs in parallel, with each job using up to N threads, as is done in bacannot.

add citation information

Add information about citation: https://f1000research.com/articles/12-1205/v1

Suggestion for hybrid error correction

Hi there,

Found your wrapper on Twitter, great initiative :). I have a suggestion for your pipeline: it would be worth considering hybrid correction (i.e. combining short and long reads). In my current pipeline I use fmlrc, and combining it with ONT data works pretty well.

Cheers,

Tuan

Add example of non-bacterial dataset analysis (paper review)

Background
This issue is meant to address the comments received on the paper review here.

Description
Create a new page in the web documentation showing the analysis of a fungal or plant sequencing dataset. Make sure it includes the necessary command lines from input to output, so that the analysis can be reproduced, and also add an overview of the generated results to the page.

Once done, check how easily we can update the paper to provide an additional Zenodo record for the non-bacterial analysis (ngs-preprocess + MpGAP).

standard profile to not load docker

Instead of having the standard profile load Docker automatically, it would be better to load no container or environment by default, so that the pipeline acts as a simple local pipeline.

Users who want one of the available profiles must then select it explicitly with -profile docker/singularity/conda.

Change software for filtering longreads

Consider replacing the software used to filter long reads with a more recent and faster tool.

Currently, NanoFilt is used for this task.

The candidate replacement is nanoq.

new tool for long reads QC

A new tool for long reads quality assessment is now available:

https://github.com/yfukasawa/LongQC

The task is to evaluate the tool and compare it with NanoPack and pycoQC, in order to decide whether it is worth including in the pipeline or whether it should replace one of those tools.