
SARS-CoV-2 workflow for nanopore sequence data

Home Page: https://case-group.github.io/

License: GNU General Public License v3.0

Nextflow 64.15% Python 32.53% Shell 2.33% R 0.99%
workflow nanopore-data artic bioinformatics sars-cov-2 nanopore basecalling

porecov's Introduction

poreCov | SARS-CoV-2 Workflow for nanopore sequencing data


Citation:

poreCov - an easy to use, fast, and robust workflow for SARS-CoV-2 genome reconstruction via nanopore sequencing
Christian Brandt, Sebastian Krautwurst, Riccardo Spott, Mara Lohde, Mateusz Jundzill, Mike Marquet, Martin Hölzer
https://www.frontiersin.org/articles/10.3389/fgene.2021.711437/full

What is this Repo?

  • poreCov is a SARS-CoV-2 analysis workflow for nanopore data (via the ARTIC protocol) or SARS-CoV-2 genomes (fasta)
  • the workflow is pre-configured to simplify data analysis

Table of Contents

1. Quick Setup (Ubuntu)

1.1 Nextflow (the workflow manager)

  • poreCov needs Nextflow and a Java runtime (default-jre)
    • install the Java runtime via: sudo apt install -y default-jre
    • install Nextflow via: curl -s https://get.nextflow.io | bash && sudo mv nextflow /bin && sudo chmod 755 /bin/nextflow

1.2 Container (choose one - they manage all the tools)

Docker

  • installation here (recommended), alternatively via: sudo apt install -y docker.io
  • add your user to the Docker group: sudo usermod -a -G docker $USER (log out and back in for this to take effect)

Singularity

  • Singularity installation here
  • if you can't use Docker

Note that with Singularity, the following environment variables are automatically passed to the container to ensure execution on HPCs: HTTPS_PROXY, HTTP_PROXY, http_proxy, https_proxy, FTP_PROXY and ftp_proxy.

Conda (not recommended)

  • Conda installation here
  • install Nextflow and Singularity via conda (not cluster compatible) - and use the singularity profile

1.3 Basecalling (optional)

  • only relevant if you want to do GPU basecalling within the workflow:
    • local guppy installation (see the Oxford Nanopore installation guide)
    • or: install the NVIDIA Docker toolkit
    • or: Singularity (with --nv support)

2. Run poreCov

2.1 Test run

  • validate your installation via test data:
# for a Docker installation
nextflow run replikation/poreCov -profile test_fastq,local,docker -r 1.1.0 --update

# or for Singularity or conda installation
nextflow run replikation/poreCov -profile test_fastq,local,singularity -r 1.1.0 --update

2.2 Quick run examples

  • poreCov with basecalling and Docker
    • --update tries to force the most recent pangolin lineage and nextclade release version (optional)
    • -r 1.1.0 specifies the workflow release from here
    • --primerV specifies the primer sets that were used, see --help to see what is supported
      • alternatively provide a primer bed file on your own
nextflow run replikation/poreCov --fast5 fast5/ -r 1.1.0 \
    --cores 6 -profile local,docker --update --primerV V4
  • poreCov with a basecalled fastq directory and custom primer bed file
nextflow run replikation/poreCov --fastq_pass 'fastq_pass/' -r 1.1.0 \
    --cores 32  -profile local,docker --update --primerV primers.bed
  • poreCov with basecalling and renaming of barcodes based on sample_names.csv
# rename barcodes automatically by providing an input file, also using another primer scheme
nextflow run replikation/poreCov --fast5 fast5_dir/ --samples sample_names.csv \
   --primerV V1200 --output results -profile local,docker --update

2.3 Extended Usage

  • see also nextflow run replikation/poreCov --help -r 1.1.0

Version control

  • poreCov supports version control via -r; this way you can run everything reproducibly (e.g. -r 1.1.0)
    • moreover, only releases are extensively tested and validated
  • poreCov releases are listed here
  • add -r <version> to a poreCov run to activate this
  • run nextflow pull replikation/poreCov to install updates
    • if you have issues during update try rm -rf ~/.nextflow and then nextflow pull replikation/poreCov
    • this removes old files and downloads everything new

Important input flags (choose one)

  • these are the flags to get "data" into the workflow
    • --fast5 fast5_dir/ for fast5 directory input
    • --fastq_pass fastq_dir/ directory with basecalled data (contains "barcode01" etc. directories)
    • --fastq "sample*.fastq.gz" alternative fastq input (one sample per file)
    • --fasta "*genomes.fasta" SARS-CoV-2 genomes as fasta (.gz allowed)

Custom primer bed files

  • poreCov supports the input of a custom primer.bed file via --primerV instead of selecting a pre-existing primer version like --primerV V4
  • the main pitfall with primer bed files is that they need the correct columns and naming to be recognized by the ARTIC pipeline
  • the following rules apply to the bed file (see also example)
    • each column is separated via one tab or \t
    • column 1 is the fasta reference, and it should be MN908947.3 (poreCov replaces that automatically)
    • column 2 is the primer start
    • column 3 is the primer end
    • column 4 is the primer name, and it has to end with _RIGHT or _LEFT
    • column 5 is the pool and it should be named nCoV-2019_1 or nCoV-2019_2
    • column 6 defines the strand orientation with either - or +
MN908947.3	30	54	nCoV-2019_1_LEFT	nCoV-2019_1	+
MN908947.3	1183	1205	nCoV-2019_1_RIGHT	nCoV-2019_1	-
MN908947.3	1100	1128	nCoV-2019_2_LEFT	nCoV-2019_2	+
MN908947.3	2244	2266	nCoV-2019_2_RIGHT	nCoV-2019_2	-
MN908947.3	2153	2179	nCoV-2019_3_LEFT	nCoV-2019_1	+
MN908947.3	3235	3257	nCoV-2019_3_RIGHT	nCoV-2019_1	-
MN908947.3	3144	3166	nCoV-2019_4_LEFT	nCoV-2019_2	+
MN908947.3	4240	4262	nCoV-2019_4_RIGHT	nCoV-2019_2	-

Sample sheet

  • barcodes can be automatically renamed via --samples sample_names.csv
  • required columns:
    • _id = sample name
    • Status = barcode number which should be renamed
  • optional column:
    • Description = description column to be included in the output report and tables

Example comma-separated file (keep the header exactly as shown):

_id,Status,Description
Sample_2021,barcode01,good
2ndSample,BC02,bad
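Parsing this sheet into a barcode-to-sample-name mapping is straightforward; a minimal sketch (normalising the two barcode spellings shown above, "barcode01" and "BC02", is an assumption about how poreCov treats them):

```python
import csv

def barcode_mapping(path):
    """Read a poreCov sample sheet and return {barcode: sample name}."""
    mapping = {}
    with open(path, newline="") as fh:
        for row in csv.DictReader(fh):
            bc = row["Status"].strip()
            # normalise the short form 'BC02' to 'barcode02'
            if bc.upper().startswith("BC") and bc[2:].isdigit():
                bc = "barcode" + bc[2:].zfill(2)
            mapping[bc] = row["_id"].strip()
    return mapping
```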

Pangolin Lineage definitions

  • lineage definitions change quickly in response to the pandemic
  • to avoid using out-of-date lineage schemes, the --update flag can be added to any poreCov run to get the most recent version-controlled pangolin container
  • we automatically build version-controlled pangolin containers twice a week, see here
    • instead of --update, it is also possible to specify a container directly, e.g.: --pangolindocker 'nanozoo/pangolin:3.1.1--2021-06-14'
    • this way you can use another container or version, but beware that some containers might not be compatible with poreCov

3. Quality Metrics (default)

  • Regions with coverage of 20 or less are masked ("N")
  • Genome quality is compared to NC_045512.2
  • Pangolin lineages are determined
  • Nextstrain clades are determined, including mutation information
  • Reads are classified as human or SARS-CoV-2 to check for possible contamination and sample-prep issues

4. Workflow

  • poreCov was built with ease of use in mind while staying flexible
  • therefore we provide a few input types which adjust the workflow automatically (see image below)
    • fast5 raw data, fastq files (one sample per file), fastq_pass (the basecalling output) or fasta (supports multi-fastas)
  • primer schemes for ARTIC can be V1, V2, V3(default), V4, V4.1 or V1200 (the 1200bp amplicon ones)

5. Literature / References to cite

If you are using poreCov, please also check the software it uses and cite it in your work:

6. Troubleshooting

  • A collection of helpful information

Singularity

7. Time to results

Table 1: Execution speed of poreCov on different Ubuntu 20 systems using a single sample file with 167,929 reads. Command used: nextflow run replikation/poreCov -r 0.9.4 -profile test_fastq,local,docker.

| Hardware            | First time with download (DB + container)¹ | Default settings |
|---------------------|--------------------------------------------|------------------|
| 2 CPUs, 4 GB RAM    | 1 h 2 min                                  | 32 min 30 s ²    |
| 2 CPUs, 8 GB RAM    | 46 min                                     | 21 min 20 s      |
| 4 CPUs, 16 GB RAM   | 40 min                                     | 12 min 48 s      |
| 8 CPUs, 32 GB RAM   | 35 min                                     | 11 min 39 s      |
| 16 CPUs, 64 GB RAM  | 30 min                                     | 9 min 39 s       |

¹ time depends mostly on available internet speed
² was not able to execute read classification due to limited hardware, but generated and classified SARS-CoV-2 genomes

Table 2: Execution speed of poreCov on different Ubuntu 20 systems using 24 fastq samples. Command used: nextflow run replikation/poreCov -r 0.9.4 --fastq "*.fastq.gz" --primerV V1200 --samples samplenames.csv -profile local,docker. Time measured from the start of the workflow.

| Hardware            | Default settings |
|---------------------|------------------|
| 2 CPUs, 4 GB RAM    | 13 h 33 min ¹    |
| 2 CPUs, 8 GB RAM    | 7 h 56 min ¹     |
| 4 CPUs, 16 GB RAM   | 4 h 10 min       |
| 8 CPUs, 32 GB RAM   | 2 h 15 min       |
| 16 CPUs, 64 GB RAM  | 1 h 25 min       |

¹ was not able to execute read classification due to limited hardware, but generated and classified SARS-CoV-2 genomes

8. Credits

The key steps of poreCov are carried out using the ARTIC Network field bioinformatics pipeline. Kudos to all amazing developers for your incredible efforts during this pandemic! Many thanks to all others who have helped out and contributed to poreCov as well.

porecov's People

Contributors

angelovangel, bwlang, dataspott, hoelzer, marielataretu, raverjay, replikation


porecov's Issues

Summary report also in text table format

It would be really helpful to get a table with the data instead of only an HTML file. This is because often content has to be copied or accessed programmatically per column of data.

The HTML report is great, but having the main content also written as a summary table would be awesome.

@replikation I think you already write some JSON files for your auto database input?

But for other users an XLSX (CSV) would be actually also good.

@RaverJay I imagine it should be relatively easy to print out a dataframe that is anyway constructed for the HTML report also as a CSV or TSV?

I gave this a high priority because it would really help people who are already actively using the pipeline - but of course I know there are other busy things to do! ;)

Warning when older version of pipeline is used

Is it possible to add a warning when an older version of the pipeline is used, so that people automatically become aware of new versions? Not sure if such a feature is maybe even part of Nextflow?

Generate one summary consensus QC TSV

It would be great to provide one summary TSV that tells the user which samples passed QC and which did not. This makes downstream analyses easier/faster as well.

Change/Remove tree approach

  • remove the quick time-tree build feature and the reference to Nextstrain?
  • or allow a simple drop-in of a) GISAID fasta + b) GISAID metadata?

report

It would be good to have at least a simple summary report for the reconstructed consensus sequence. This should include:

  • used version of poreCov
  • used tools and versions within poreCov
  • basic stats about the reconstructed consensuses (length, N50, number of Ns, maybe pairwise identity to the Wuhan strain ...)
  • if possible some stats about the called variants

E.g. in a single PDF report per run.

For the first part (technical stats) it might be also enough to use the nextflow internal functions for reporting.

Update consensus QC

president has a new version v0.6.0:
https://gitlab.com/RKIBioinformaticsPipelines/president/-/releases/v0.6.0

Besides some bug fixes that do not really affect poreCov, the order of the columns in the report.tsv was adjusted to be more intuitive (and not simply alphabetically sorted).

However, I assume this will break poreCov when the container is updated w/o applying some changes to the code where metrics are extracted from the report.tsv?

Simple Read QC

It would be good to have some qc of the input data. I think the easiest is a NanoPlot module.

Maybe PycoQC as an option if the summary.txt is available.

As a final output, it would be great to have a summary QC for all input samples/barcodes where one can immediately observe strange samples.

ukj-Flag to create mongoDB database entry and "filename support"

Create Flag [--ukj]

  • if used, poreCov uses the runinfo.txt file provided in the run directory
  • for each barcode a string in runinfo.txt is given, containing the following information about the sample:
    YYYYMMDD_Location_SampleID_Abbreviation

-> YYYYMMDD = isolation date
-> Location = location of isolation
-> SampleID = internal sample ID
-> Abbreviation = combination of the letter S (= surveillance) or E (= extern) with a number, indicating that it is the x-th run of this sample (starting at 0)

  • parses this info, together with the RKI report, selected president results, the primer info and the analysis date for each sample, into one .csv file for upload into the database.
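The proposed naming scheme could be parsed with a few lines; a hypothetical sketch (assumes none of the fields themselves contain an underscore):

```python
def parse_runinfo_entry(entry):
    """Split a 'YYYYMMDD_Location_SampleID_Abbreviation' string into fields."""
    date, location, sample_id, abbreviation = entry.split("_")
    if not (len(date) == 8 and date.isdigit()):
        raise ValueError(f"bad isolation date: {date}")
    # abbreviation: S (surveillance) or E (extern) followed by a run number
    if abbreviation[0] not in ("S", "E") or not abbreviation[1:].isdigit():
        raise ValueError(f"bad abbreviation: {abbreviation}")
    return {"date": date, "location": location, "sample_id": sample_id,
            "run_type": abbreviation[0], "run_number": int(abbreviation[1:])}
```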

samples from samplesheet.yml are not reported

This might be an issue with an older version (analyses was run end of February).

Samples that are given in the samplesheet.yml and yield no reads in demultiplexing are not reported in the output. This may lead to missing information when tracking samples further downstream.

Could you add information about the lacking reads to a report.csv or similar?

Only add genomes to final RKI FASTA and report.csv that meet QC criteria

Question/Thought

I think currently in #28 all final consensus seqs are added to the all_genomes.fasta and rki_report.csv?

While this is fine (because then all information is in one place), we could also think of only writing sequences to these summary files that also meet the consensus QC. Otherwise, they might be rejected anyway when submitted to RKI.

However, people might want to work also with sequences that don't meet the QC criteria internally and would otherwise miss them when they are not part of the summary files?

And: when QC thresholds change, this must also be reflected in poreCov so as not to reject sequences that might actually pass the later QC.

missing bases in low-covered homopolymer stretches

In regions with lower ONT read coverage single bases might be missed in the generated consensus due to issues in basecalling homopolymer stretches.

For example (position 11075, end of ORF1):
Screenshot from 2021-01-08 12-41-56

Maybe this can be fixed by additionally checking for deletions after Medaka that lie in homopolymer stretches (e.g. length > 6 nt) and by again comparing to a reference (Wuhan) sequence. However, this can be difficult if real deletions occur in homopolymers.

Maybe it's already fixed in the current ARTIC pipeline.
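The check proposed above could be sketched as follows; this is only an illustration against a reference string, not the Medaka/ARTIC implementation (the 0-based position convention is an assumption):

```python
def in_homopolymer(ref_seq, pos, min_len=6):
    """Return True if reference position `pos` (0-based) lies inside a
    homopolymer run of at least `min_len` identical bases."""
    base = ref_seq[pos]
    left = pos
    while left > 0 and ref_seq[left - 1] == base:
        left -= 1
    right = pos
    while right + 1 < len(ref_seq) and ref_seq[right + 1] == base:
        right += 1
    return (right - left + 1) >= min_len
```

A deletion called inside such a run would then be a candidate for re-checking against the Wuhan reference.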

Output QC

I think currently there is no quality check of the generated consensuses?

I suggest having at least some simple qc e.g.

  • check for the number of N
  • pairwise sequence identity to e.g. the Wuhan reference sequence

to automatically decide which samples yield good enough consensus sequences for further processing.

We could even simply implement a python script that people are working on right now:
https://gitlab.com/RKIBioinformaticsPipelines/president

Here, an input FASTA sequence is aligned to the Wuhan reference strain and a pairwise identity is calculated and reported in tabular format.

Filter FASTQ by length - report

As an initial step, FASTQ files are filtered by length and if the file size is too small, the FASTQ is not processed any further. It would be good to have this somehow reported.

E.g. I just tested a FAST5 run (V3 primers) that resulted in 24 barcoded FASTQ and it seems 9 of them were sorted out and not processed any further.

It would be good to have a TSV with e.g. all IDs and a column that states which were sorted out due to low number of reads after filtering.

Add absolute classification numbers to the report

The new report is great!

One thing: to directly see the number of reads assigned to SC2 or human (in particular for a negative control), it would be great to add these numbers instead of / in addition to the percentage values. 100 % SC2 in a negative control sounds worrying, but then the user has to look into the Kraken classification / Krona plot anyway to see that these are maybe only a handful of (false positive) reads.

It would then be important, though, to take care of large numbers and format them in a nice way (e.g. 1k, 1m, 1g) so as not to spoil the nice table view.
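A compact formatter along the suggested lines (the 1k/1m/1g style is taken from the issue; rounding to one decimal place is an assumption):

```python
def human_count(n):
    """Format a read count compactly, e.g. 1234 -> '1.2k'."""
    for threshold, suffix in ((1_000_000_000, "g"), (1_000_000, "m"), (1_000, "k")):
        if n >= threshold:
            return f"{n / threshold:.1f}{suffix}"
    return str(n)
```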

Report Insertions as well

It seems Nextclade also provides information on insertions and not only deletions:

> colnames(x)
 [1] "seqName"
 [2] "clade"
 [3] "qc.overallScore"
 [4] "qc.overallStatus"
 [5] "totalGaps"
 [6] "totalInsertions"
 [7] "totalMissing"
 [8] "totalMutations"
 [9] "totalNonACGTNs"
[10] "totalPcrPrimerChanges"
[11] "substitutions"
[12] "deletions"
[13] "insertions"
[14] "missing"
[15] "nonACGTNs"
[16] "pcrPrimerChanges"
[17] "aaSubstitutions"
[18] "totalAminoacidSubstitutions"
[19] "aaDeletions"
[20] "totalAminoacidDeletions"
[21] "alignmentEnd"
[22] "alignmentScore"
[23] "alignmentStart"
[24] "qc.missingData.missingDataThreshold"
[25] "qc.missingData.score"
[26] "qc.missingData.status"
[27] "qc.missingData.totalMissing"
[28] "qc.mixedSites.mixedSitesThreshold"
[29] "qc.mixedSites.score"
[30] "qc.mixedSites.status"
[31] "qc.mixedSites.totalMixedSites"
[32] "qc.privateMutations.cutoff"
[33] "qc.privateMutations.excess"
[34] "qc.privateMutations.score"
[35] "qc.privateMutations.status"
[36] "qc.privateMutations.total"
[37] "qc.snpClusters.clusteredSNPs"
[38] "qc.snpClusters.score"
[39] "qc.snpClusters.status"
[40] "qc.snpClusters.totalSNPs"
[41] "errors"

see [13]

It could be worthwhile to integrate this as well.

Report percentage of Ns

Maybe also present %Ns rather than absolute numbers (this will make it easier for people who have a hard time quickly dividing by 29902)?

Or both? I know that you likely just take this information from president, so I could also escalate that to that tool and you could then just grep it from the report.tsv?
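Computing %N from a consensus sequence is only a couple of lines; a sketch that divides by the consensus length rather than a fixed reference length (whether poreCov should use the consensus or the reference length as the denominator is exactly the judgment call discussed above):

```python
def percent_n(seq):
    """Percentage of N bases in a consensus sequence."""
    if not seq:
        return 0.0
    seq = seq.upper()
    return 100 * seq.count("N") / len(seq)
```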

collect consensus sequences in one folder

Hi,

would it be possible that the final consensus sequences would be put/linked into an output folder? Anything like 'consensus_sequences' ? This would facilitate copying out all consensus sequences for further use outside of the pipeline.

Add reference genome coverage to the report

One important metric for us is also the reference genome coverage, i.e. how well the Wuhan reference genome is supported by the reads. Since we map the reads to the Wuhan reference anyway, this should be easy to calculate based on a BED file?

In the Illumina pipeline, we use a coverage of at least 20X as a cutoff.

In the end, what I would like to have is basically: "reads from sampleXY cover 98.12345% of the Wuhan reference genome with at least 20X" as another number in the report.

Secondly, based on that it would then also be easy to report a median coverage value (or do you think this is not as useful, @replikation, as we discussed in some other thread regarding nanopore amplicon sequencing?)
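Such a breadth-of-coverage number could be derived from per-position depths, e.g. the tab-separated output of samtools depth -a (chromosome, position, depth); a sketch, not part of poreCov:

```python
def reference_coverage(depth_lines, min_depth=20):
    """Percentage of reference positions covered by at least `min_depth` reads.

    `depth_lines` is an iterable of 'chrom<TAB>pos<TAB>depth' lines, as
    produced by `samtools depth -a` (which emits every reference position).
    """
    total = covered = 0
    for line in depth_lines:
        depth = int(line.rstrip("\n").split("\t")[2])
        total += 1
        if depth >= min_depth:
            covered += 1
    return 100 * covered / total if total else 0.0
```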

Readme

  • cleanup and simplify readme

Adjust percentage of classified reads

Currently, the report shows e.g. 100% SARS-CoV-2 and 0% human even if there are only 10 reads classified as SARS-CoV-2 (and the rest is unclassified).

Thus, it is good to have the absolute numbers now as well, but would it not be better to report the percentages in accordance with the Krona plot? E.g. 4 % SARS-CoV-2 if 96 % are unclassified?
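The suggested recalculation is simple arithmetic over all reads, including the unclassified ones; a sketch:

```python
def classification_percentages(sc2, human, unclassified):
    """Percentages relative to all reads, not only the classified ones."""
    total = sc2 + human + unclassified
    if total == 0:
        return {"sars-cov-2": 0.0, "human": 0.0, "unclassified": 0.0}
    return {"sars-cov-2": 100 * sc2 / total,
            "human": 100 * human / total,
            "unclassified": 100 * unclassified / total}
```

With 4 SARS-CoV-2 reads and 96 unclassified reads this reports 4 %, matching the Krona plot instead of 100 %.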

Clean-up

Is it possible to implement a parameter for a final clean-up after the pipeline has finished? For example, it would be great to automatically get rid of the -w work folder once the pipeline finished successfully.

Might be possible via onComplete?

more verbose error messages if the sample input is faulty or wrong

  • better input validation would be great to validate the input csv
  • the current observation is that users are confused because the workflow does not explicitly state if something is wrong with their input
  • a solution might be some Groovy syntax to validate the file first, before the channel magic happens?

Publish basecalled FASTQ files

For downstream analyses often the basecalled FASTQ files are needed.

Can we add a parameter e.g. --publish_fastq to activate publishing of .fastq.gz files after basecalling and demultiplexing?

Update guppy

Should we update Guppy GPU and CPU with the next PR?

4.4.1-1--a3fcea3

local config

Hi, thanks for implementing this workflow. New fan here!
Just a minor request for the local profile. There are some processes which one would expect to require few resources (e.g. nanoplot), but the local profile assigns cpus = params.cores. In my case, I'm running it on a local server with 32 CPUs. That leads to "allocating" the whole server to a single instance of (say) nanoplot, when this process requires little more than 1 CPU at most, slowing down the whole workflow execution (it blocks parallel execution of multiple instances).
Very easy to solve with a new configuration; my request is just to improve the experience for users new to Nextflow. I think for most of these processes a default of cpus = 2 would be enough and would allow multiple parallel instances. Some of them may require a little more though, such as artic and bwa.

Keep alignment file

For a more detailed variant-calling QC it would be good, as a first step, to keep the alignment file. Best would be the PSL format (I think minimap2 can provide it directly).

Installation from source without conda/singularity/docker

Hi,

Is it possible to install the poreCov workflow directly from source without using conda/singularity/docker?

Could you please publish a list of software/tools that are required for this workflow?

What I could figure out from the README is that guppy is required. On our HPC cluster we already have a guppy 4.0.11 installation with CUDA support. What else is needed?

Best regards

Sam

Optimize CPU basecalling

Current PR integrates CPU basecalling #18

We can check for some optimization of the CPU basecalling process, if needed (number of callers , callers per cpu, ...).

E.g. @hoelzer can check how people currently run CPU basecalling on larger machines to improve the current basic command.

replace rki csv

new template for upload:

SENDING_LAB;DATE_DRAW;SEQ_TYPE;SEQ_REASON;SAMPLE_TYPE;PUBLICATION_STATUS;OWN_FASTA_ID
12346;20210119;ILLUMINA;X;s001;Y;A-899
12346;20210119;OXFORD_NANOPORE;Y;s006;N;A_900
12346;20210119;OTHER;A[B.1.1.7/B.1.351];s017;P;A.901

pangolin version tag

for the sample flag, a pangolin field needs to be added stating which version was used

rki conform metadata output

  • add RKI conform metadata output
IMS_ID;SENDING_LAB;DATE_DRAW;SEQ_TYPE;SEQ_REASON;SAMPLE_TYPE;OWN_FASTA_ID
IMS-12345-CVDP-00001;12346;20210119;ILLUMINA;X;s001;A-899
IMS-12345-CVDP-00002;12346;20210119;OXFORD_NANOPORE;Y;s006;A_900
IMS-12345-CVDP-00003;12346;20210119;OTHER;A[B.1.1.7/B.1.351];s017;A.901
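Writing this semicolon-separated metadata file is straightforward with the csv module; a sketch (the function name and row dictionaries are illustrative, not poreCov code):

```python
import csv

# column order taken from the RKI template above
RKI_FIELDS = ["IMS_ID", "SENDING_LAB", "DATE_DRAW", "SEQ_TYPE",
              "SEQ_REASON", "SAMPLE_TYPE", "OWN_FASTA_ID"]

def write_rki_csv(path, rows):
    """Write RKI-conform metadata as a semicolon-separated file."""
    with open(path, "w", newline="") as fh:
        writer = csv.DictWriter(fh, fieldnames=RKI_FIELDS, delimiter=";")
        writer.writeheader()
        writer.writerows(rows)
```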
