
Deprecated; see https://github.com/MHH-RCUG/nf_wochenende. A whole Genome/Metagenome Sequencing Alignment Pipeline in Python3.

Home Page: https://github.com/MHH-RCUG/nf_wochenende

License: MIT License

Python 70.52% Shell 29.48%
genomics alignment pipeline metagenomics slurm conda-environment bioinformatics nanopore

wochenende's Introduction

Wochenende - A whole Genome/Metagenome Sequencing Alignment Pipeline

This version has been superseded by an improved Nextflow version available at the following link, with support for more clusters and easier configuration:

https://github.com/MHH-RCUG/nf_wochenende

Warning! This repository is deprecated for use outside Hannover Medical School (as of June 2022).

Wochenende runs alignment of short reads (e.g. Illumina) or long reads (e.g. Oxford Nanopore) against a reference sequence. It is relevant for genomics and metagenomics. Wochenende is simple (a Python script), portable, and easy to configure via a central config file.

Preprint

Please view and cite the Wochenende preprint at https://www.biorxiv.org/content/10.1101/2022.03.18.484377v2

Features

Features include (see the programs listed at the bottom of this page):

  • QC (Fastqc)
  • Pre-alignment duplicate removal (Perldup)
  • Pre-alignment poor-sequence removal (Prinseq; used for single-ended reads only)
  • Trimming (Trimmomatic, fastp, trim galore or ea-utils)
  • Alignment (bwa mem or minimap2 for short reads, minimap2 or ngmlr for long reads)
  • SAM -> BAM conversion (samtools and sambamba)
  • AlignerBoost Mapping Quality recalculation (in testing April-June 2021)
  • Report % aligned reads (samtools)
  • Output unmapped reads as fastq (samtools) (from v1.4)
  • Post-alignment duplicate removal (Samtools from v1.7.8, Sambamba)
  • Removal of reads with x mismatches (bamtools), adjustable from v1.7.3
  • Realignment (Abra2)
  • MD tag marking (Samtools)
  • Normalization (to bacteria per human cell, RPMM (Reads Per Million sequenced reads per Million reference bases), etc.; see Reporting below for details)
  • Visualization (chromosome coverage, intended for bacteria in metagenomics projects) (from v1.4)
  • Growth rate estimation: estimate how fast detected bacteria are growing
  • Rare species prediction (Raspir)

Project Haybaler https://github.com/MHH-RCUG/haybaler allows postprocessing of Wochenende results:

  • collation/integration of multiple reports (reporting csv or bam.txt files) using Python Pandas
  • prepare results for heatmaps
  • create heatmaps using multiple different R libraries

Did you know?

Wochenende means weekend in German. The original developer, Tobias, called the pipeline Wochenende, because you can start it running and go off to enjoy your weekend early (at least, that was the plan!).


Platforms

Wochenende has only been tested (by the authors) on Ubuntu Linux 20.04 and 16.04, 64-bit. We advise against any attempts on macOS or Windows. An appropriate conda environment, BASH and Python 3.6+ are critical to get things working. We consider Wochenende stable (master branch) but are still updating the pipeline with new features.

Usage

You can just run the pipeline as a normal Python3 script. However, we also offer a template for the SLURM job scheduler below. This template can also be used with Bash to run the commands at the bottom of the SLURM script while ignoring any SLURM-specific parameters.

SLURM usage

  1. See the section Installation below if you have not already done so.

  2. Copy get_wochenende.sh to the directory containing your FASTQ files (this will set the directory up with the required scripts and subfolders for analysis and later postprocessing).

cp /path/to/wochenende/get_wochenende.sh .

  3. Check that the environment variable used by the get_wochenende.sh script points to your Wochenende directory (now taken care of by setup.sh, see Installation below).

  4. Get Wochenende by running the script:

bash get_wochenende.sh

  5. Adjust settings in the script (e.g. single-ended or paired-end reads, reference sequence):

nano run_Wochenende_SLURM.sh

  6. Run the pipeline using SLURM for all _R1.fastq files in the directory (the "_R1" is important; R1_001.fastq is not allowed). A sketch of the submission loop follows this list.

bash runbatch_sbatch_Wochenende.sh

  7. After completion of the alignment and filtering, run wochenende_postprocess.sh (requires Haybaler for the final integration steps, R for optional automated heatmaps, and optionally raspir for rare species detection).

bash wochenende_postprocess.sh -r -h -s -p -g
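
For orientation, here is a minimal Python sketch of what the batch submission step does (the real runbatch_sbatch_Wochenende.sh is a Bash script): it submits one SLURM job per _R1.fastq file in the current directory.

import glob
import subprocess

# Submit one SLURM job per _R1.fastq file; assumes sbatch is on the PATH and
# run_Wochenende_SLURM.sh was copied here by get_wochenende.sh.
for fastq in sorted(glob.glob("*_R1.fastq")):
    subprocess.run(["sbatch", "run_Wochenende_SLURM.sh", fastq], check=True)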

Tutorial

Once you've got the tools installed and tested, you can look at or run the commands in the tutorial in the subdirectory tutorial. https://github.com/MHH-RCUG/Wochenende/blob/master/tutorial/tutorial.txt

Installation

We recommend using Bioconda for installation of the tools used by our pipeline. First install miniconda if you have not already done so. Use mamba instead of conda if you want faster installs (follow the mamba install instructions at https://github.com/mamba-org/mamba). Required libraries are listed in the file env.wochenende.minimal.yml.

  1. Clone or download the repository to your local machine.

git clone https://github.com/MHH-RCUG/wochenende.git
# or
wget https://github.com/MHH-RCUG/wochenende/archive/master.zip

  2. Create a conda environment for the pipeline. You should have first installed miniconda (64-bit Linux).

cd Wochenende
# mamba is faster
mamba env create -f env.wochenende.minimal.yml

# conda is a slower alternative:
conda env create -f env.wochenende.minimal.yml

  3. Install all the other tools.

  4. Download a reference sequence from https://drive.google.com/drive/folders/1q1btJCxtU15XXqfA-iCyNwgKgQq0SrG4?usp=sharing or create your own.

  5. If you want to use bwa mem as the aligner (recommended for short reads), you'll need to create an index of that fasta reference sequence as usual, e.g. gunzip x.fa.gz && bwa index x.fa. Minimap2 works with fasta directly, without this step.

  6. Important! Edit the configuration section of config.yaml to set the paths to the tools, tmp directory and reference sequences. Use a code editor to avoid breaking the YAML format.

  7. Edit the paths to Wochenende and optionally Haybaler in setup.sh.

  8. Run bash setup.sh to configure the Wochenende BASH environment variables (for the current user and server only).

  9. Log out, and log back in to allow bash to read the environment variables (or source ~/.bashrc).

  10. Activate the conda environment before running the pipeline: conda activate wochenende

  11. Optional: run the tests, see below.

Update conda environment

If there is already a conda environment named wochenende:

conda env update -f env.wochenende.minimal.yml

Wochenende output

Output file meanings

- .calmd. - samtools Calculate MD tags on a BAM file (to enable easily viewing SNPs in some genome browsers like JBrowse).
- .dup. - duplicates excluded by samtools markdup
- .fix. - fixed Paired End reads (PE files only)
- .lc.  - Low-complexity sequences removed by the tool Prinseq
- .mm.  - mismatch filter applied as configured (see run_Wochenende_SLURM.sh file and logs, and or check mismatch distributions in BAM files to check the setting)
- .mq20 - mapping quality filter: only reads with MQ 20 or higher retained
- .mq30 - mapping quality filter: only reads with MQ 30 or higher retained
- .ndp. - no duplicates, as assessed by the program Perldup on Fastq files
- .ns.  - not-sorted. Sorting follows the .fix step (PE files only)
- .s.   - sorted bam file
- .trm. - trimmed reads
- .unmapped. - Fastq reads which were not mapped to the reference (meta)genome. Currently for single ended reads only. Can be further assessed with other tools

Wochenende produces many output files, many of which are superseded by later output files and can be removed.

Initial quality checks and read filtering.
- MB_AERO_044_S70_R1.ndp.fastq                  # Fastq after removal of duplicates by Perldup
- MB_AERO_044_S70_R1.ndp.lc.fastq               # Fastq after removal of low-complexity sequences by Prinseq
- MB_AERO_044_S70_R1.ndp.lc_seqs.fq.fastq       # The removed low-complexity output sequences from Prinseq
- MB_AERO_044_S70_R1.ndp.lc.trm.s.bam.unmapped.fastq    # Unmapped reads in FASTQ format. Can be further analysed, eg with alternative programs such as nextflow-blast, kraken, centrifuge etc

BAMs, Mapping Quality (MQ), Duplicate filtering (dup) and mismatch (mm) filtering results
- MB_aero_S2_R1.fastq               # Input file Read 1. Note the form _R1.fastq is required; R1_001.fastq will not work.
- MB_aero_S2_R1.fastqprogress.tmp   # Temporary file with pipeline stage progress
- MB_aero_S2_R1.trm.bam             # Initial, unsorted BAM. Can usually be deleted!
- MB_aero_S2_R1.trm.fastq           # Trimmed FASTQ.
- MB_aero_S2_R1.trm.s.bam           # Sorted BAM output file
- MB_aero_S2_R1.ndp.lc.trm.s.bam.unmapped.fastq   # unmapped FASTQ reads
- MB_aero_S2_R1.trm.s.mq30.bam                    # BAM where only well mapped reads with Mapping Quality 30 are retained
- MB_aero_S2_R1.trm.s.mq30.mm.bam               # Reads with 3 or more mismatches (default, can be changed) have been excluded
- MB_aero_S2_R1.trm.s.mq30.mm.dup.bam           # Duplicate reads were excluded by Picard
- MB_aero_S2_R1.trm.s.mq30.mm.dup.calmd.bam     # MD tags have been calculated to enable SNV visualization in JBrowse etc
- MB_aero_S2_R2.fastq                 # Read 2 input file
- MB_aero_S2_R2.trm.fastq             # Trimmed Read 2 file

# Wochenende reporting input and output
- MB_aero_S2_R1.trm.s.mq30.mm.dup.bam.txt       # Important: input for simple runbatch_metagen_awk_filter.sh and complex Wochenende reporting
- MB_aero_S2_R1.trm.s.mq30.mm.dup.bam.txt.filt.sort.csv           # Filtered and sorted BAM.txt read output
- MB_aero_S2_R1.trm.s.mq30.mm.dup.bam.txt.reporting.sorted.csv    # Output from Wochenende reporting step
- MB_aero_S2_R1.trm.s.mq30.mm.dup.bam.txt.reporting.unsorted.csv  # Output from Wochenende reporting step

# Haybaler output
- reporting/haybaler/haybaler_output/
- reporting/haybaler/haybaler_output/heattree_plots/
- reporting/haybaler/haybaler_output/top_50_taxa/


# Wochenende_plot.py input (.filt.csv) and output (png images)
- MB_AERO_044_S70_R1.ndp.lc.trm.s.mq30.mm.dup_cov_window.txt              # Coverage per window in each BAM
- MB_AERO_044_S70_R1.ndp.lc.trm.s.mq30.mm.dup_cov_window.txt.filt.csv     # Filtered (regions have at least 1+ reads) coverage per window in each BAM
- MB_AERO_044_S70_R1.ndp.lc.trm.s.mq30.mm.dup_cov_window.txt.filt.sort.csv  # Filtered and sorted (descending) coverage per window

# Wochenende_plot.py output (png images)
- plots/images/sample1.dup_cov_window.txt.filt.csv/
- plots/images/sample1.dup_cov_window.txt.filt.csv/high_score/  
- plots/images/sample1.dup_cov_window.txt.filt.csv/low_med_score/

# Growth rate code output
- growth_rate/fit_results/output/sample/results_summary.csv
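
The .bam.txt count files listed above are samtools idxstats output: one tab-separated line per reference sequence with name, sequence length, mapped read count and unmapped read count. A hedged sketch of reading one in Python (the filename is illustrative):

import csv

# Print references with at least one mapped read from an idxstats .bam.txt file.
with open("sample.trm.s.mq30.mm.dup.bam.txt") as handle:
    for name, length, mapped, unmapped in csv.reader(handle, delimiter="\t"):
        if int(mapped) > 0:
            print(name, length, mapped)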

Run the postprocessing automatically

This is strongly recommended!

After a successful Wochenende run, check that all BAMs have been created and are sized as expected, e.g. ls -lh *.bam

Now start the postprocessing script bash wochenende_postprocess.sh -r -h -s -g -p to automatically:

  • run sambamba depth to get read coverage of all configured BAM files in the current directory
  • run the Wochenende plot to create coverage diagrams (-p)
  • run Wochenende reporting to count and normalize all read data (-r)
  • run the Haybaler report integration tool (-h, provided it is installed and configured)
  • run raspir (-s)
  • run growth_rate analysis (-g)
  • clean up files

This script requires Haybaler and its dependencies to be installed, and will otherwise fail at some steps.

Normalization

The reporting tool (which requires python v3.6+) reports length, GC content of the sequence, read counts attributed to the species and various normalized read count parameters. Normalizations are for:

a) reads normalized to the idealized length of a bacterial chromosome (normalization to 1 million base pairs)

b) total reads in the sequencing library (normalization to 1 million reads)

c) the above two normalizations combined (RPMM, so Reads Per Million reads per Million base pairs)

d) (bacterial) reads per human cell (only works for metagenomes from human hosts. An estimate of absolute abundance).
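
As a rough illustration (not the pipeline's own code), the combined RPMM normalization in c) can be computed from a taxon's read count, the library size and the reference length:

# Hedged sketch of the RPMM normalization described above; names are illustrative.
def rpmm(taxon_reads, total_reads, ref_length_bp):
    """Reads Per Million sequenced reads per Million reference base pairs."""
    reads_per_million = taxon_reads / (total_reads / 1e6)  # normalization b)
    return reads_per_million / (ref_length_bp / 1e6)       # combined with a)

# e.g. 1500 reads on a 5 Mbp genome in a 20 million read library:
print(rpmm(1500, 20_000_000, 5_000_000))  # -> 15.0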

See the subfolder reporting in the repository.

conda activate wochenende
python3 basic_reporting.py --input_file tmp_R1.ndp.lc.trm.s.mq30.01mm.dup.bam.txt --reference /lager2/rcug/seqres/metagenref/2016_06_PPKC_metagenome_test_1p_spec_change_cln.fa --sequencer illumina --output_name test

Usage

Usage: basic_reporting.py [OPTIONS]

  This script can be used to report the results of the Wochenende pipeline.
  The .bam.txt file as input is recommended. The .bam file will take longer
  and  generate more information.

  The column reads_per_human_cell is only for metagenomes from human hosts.

  Reports for solid sequencing data are not supported, a special
  normalisation model has to be implemented first.

Options:
  -i, --input_file TEXT   File in .bam.txt or .bam format from the Wochenende
                          pipeline output
  -r, --reference TEXT    File in .fasta format has to be the reference used
                          by the Wochenende pipeline
  -s, --sequencer TEXT    Sequencer technology used only solid and illumina
                          are available, only illumina is supported, default:
                          illumina
  -o, --output_name TEXT  Name for the output file(sample name), default
                          report
  --help                  Show this message and exit.

Running Wochenende_plot manually

Again, automatic generation via wochenende_postprocess.sh is recommended.

Preparing the data from BAM files

First generate the data files for wochenende_plot.py

# in a directory full of *dup.bam files
bash runbatch_sambamba_depth.sh

Then run:

bash runbatch_metagen_window_filter.sh

The result should be a series of output files containing the keyword window.

Run the actual plotting

Finally, run the actual wochenende_plot.py script or the helper bash script.

bash runbatch_wochenende_plot.sh

wochenende_plot.py usage:

python3 wochenende_plot.py
usage: wochenende_plot.py [-h] [--minMeanCov MINMEANCOV]
                          [--createAllPngs CREATEALLPNGS] [--sclim SCLIM]
                          [--minWindows MINWINDOWS]
                          filename1
wochenende_plot.py: error: the following arguments are required: filename1

Wochenende_plot output

Wochenende_plot creates one subdirectory per input file. These contain png images of taxa which are probably present (high score, largely based on consistent evenness of coverage and high mean coverage) or perhaps present (need manual review). Confident attribution to taxa depends strongly on the number of reads assigned to bacterial taxa (low in airway metagenomes, higher in, for example, stool samples).

Growth rate estimation

The tools in the subfolder growth_rate estimate the speed at which bacteria are growing. Possible values are no growth, slow, medium and fast. This is based on the observation by Korem et al. 2015, namely that the ratio of read copy number at the bacterial genomic origin of replication (ori) to the read copy number at the terminus (ter) can be used to infer growing species in a microbiome. Growth rate is determined for bacteria which are attributed sufficient numbers of reads during the alignment process. A toy sketch of the idea follows.
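
The sketch below assumes a list of per-window read counts along one bacterial chromosome (e.g. parsed from a *_cov_window.txt file); it is an illustration of the peak-to-trough idea, not the growth_rate module itself:

import statistics

def peak_to_trough(window_counts, smooth=5):
    """Crude ori/ter ratio from windowed coverage (after Korem et al. 2015)."""
    # Moving-average smoothing damps single-window noise.
    sm = [statistics.mean(window_counts[i:i + smooth])
          for i in range(len(window_counts) - smooth + 1)]
    trough = min(sm)
    return max(sm) / trough if trough > 0 else float("nan")

counts = [40, 42, 55, 61, 58, 44, 31, 28, 30, 39]  # toy data
print(f"approx. ori/ter ratio: {peak_to_trough(counts):.2f}")  # >1 suggests growth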

Raspir

The external tool raspir has been integrated into the pipeline. Raspir is known to considerably reduce the number of false positives in Wochenende output. It works on the BAM files created by Wochenende and produces another estimate of which species are present in the metagenomic reads. You must install raspir into its own conda environment (pandas is required, for example) before it will run successfully.

Known bugs

RPMM bug: fixed in v1.7.8. In October 2020 a bug was found in the Wochenende reporting script which calculated the RPMM column incorrectly. Please recalculate your reporting statistics if you use this feature. Thanks to @sannareddyk at the MHH and @twehrbein at Leibniz University Hannover.

Running software tests

These tests check the software and installation on your server with a known read set and database. If all is working correctly, an output message is created stating that the tests passed.

From the Wochenende directory

Using SLURM:
sbatch run_Wochenende_SLURM.sh testdb/reads_R1.fastq
Or without a scheduler:
python3 run_Wochenende.py --metagenome testdb --threads 4 --testWochenende --aligner bwamem --mq30 --remove_mismatching --readType SE --debug --force_restart testdb/reads_R1.fastq

You should be able to see in the SLURM outfile or standard out if the tests passed or not. Failed tests may be due to program versions or pipeline configuration issues.

General usage

Warning, this usage is just an example and might be slightly out of date. 

Run this with: 
python3 run_Wochenende.py

Wochenende - Whole Genome/Metagenome Sequencing Alignment Pipeline
Wochenende was created by Dr. Colin Davenport, Tobias Scheithauer and Fabian Friedrich with help from many further contributors https://github.com/MHH-RCUG/Wochenende/graphs/contributors
version: 1.9.1 - Mar 2021

usage: run_Wochenende.py [-h] [--aligner {bwamem,minimap2,ngmlr}]
                         [--readType {PE,SE}]
                         [--metagenome {2021_02_meta_fungi_human_masked,2021_02_meta_fungi_human_unmasked,2020_09_massiveref_human,2020_05_meta_human,2020_03_meta_human,2019_01_meta,2019_10_meta_human,2019_10_meta_human_univec,2019_01_meta_mouse,2019_01_meta_mouse_ASF_OMM,2019_01_meta_mouse_ASF,2019_01_meta_mouse_OMM,hg19,GRCh37,GRCh38-45GB,GRCh38-noalt,GRCh38-mito,mm10,rn6,rat_1AR1_ont,zf10,ss11,PA14,ecoli,nci_viruses,ezv_viruses,testdb,strept_halo,k_variicola,k_oxytoca,clost_bot,clost_bot_e,clost_diff,clost_perf,citro_freundii}]
                         [--threads THREADS] [--fastp] [--nextera]
                         [--trim_galore] [--debug] [--longread]
                         [--no_duplicate_removal] [--no_prinseq] [--no_fastqc]
                         [--no_abra] [--mq20] [--mq30]
                         [--remove_mismatching REMOVE_MISMATCHING]
                         [--force_restart] [--testWochenende]
                         fastq

positional arguments:
  fastq                 _R1.fastq Input read1 fastq file

optional arguments:
  -h, --help            show this help message and exit
  --aligner {bwamem,minimap2,ngmlr}
                        Aligner to use, either bwamem, ngmlr or minimap2.
                        Usage of minimap2 and ngmlr currently optimized for
                        nanopore data only.
  --readType {PE,SE}    Single end or paired end data
  --metagenome {2021_02_meta_fungi_human_masked,2021_02_meta_fungi_human_unmasked,2020_09_massiveref_human,2020_05_meta_human,2020_03_meta_human,2019_01_meta,2019_10_meta_human,2019_10_meta_human_univec,2019_01_meta_mouse,2019_01_meta_mouse_ASF_OMM,2019_01_meta_mouse_ASF,2019_01_meta_mouse_OMM,hg19,GRCh37,GRCh38-45GB,GRCh38-noalt,GRCh38-mito,mm10,rn6,rat_1AR1_ont,zf10,ss11,PA14,ecoli,nci_viruses,ezv_viruses,testdb,strept_halo,k_variicola,k_oxytoca,clost_bot,clost_bot_e,clost_diff,clost_perf,citro_freundii}
                        Meta/genome reference to use
  --threads THREADS     Number of threads to use
  --fastp               Use fastp trimmer instead of fastqc and trimmomatic
  --nextera             Attempt to remove Illumina Nextera adapters and
                        transposase sequence (default is Illumina Ultra II
                        adapters, but Illumina Nextera more common in future)
  --trim_galore         Use trim_galore read trimmer. Effective for Nextera
                        adapters and transposase sequence
  --debug               Report all files
  --longread            Only do steps relevant for long PacBio/ONT reads eg.
                        no dup removal, no trimming, just alignment and bam
                        conversion
  --no_duplicate_removal
                        Skips steps for duplicate removal. Recommended for
                        amplicon sequencing.
  --no_prinseq          Skips prinseq step (low_complexity sequence removal)
  --no_fastqc           Skips FastQC quality control step.
  --no_abra             Skips steps for Abra realignment. Recommended for
                        metagenome and amplicon analysis.
  --mq20                Remove reads with mapping quality less than 20.
                        Recommended for metagenome and amplicon analysis. Less
                        stringent than MQ30.
  --mq30                Remove reads with mapping quality less than 30.
                        Recommended for metagenome and amplicon analysis.
  --remove_mismatching REMOVE_MISMATCHING
                        Remove reads with x or more mismatches (via the NM
                        bam tag). Default 3 (so reads with 0-2 mismatches remain). Argument required.
  --force_restart       Force restart, without regard to existing progress
  --testWochenende      Run pipeline tests vs testdb, needs the subdirectory
                        testdb, default false

We recommend using bioconda for the installation of the tools. Remember to run
'conda activate <environment name>' before you start if you are using
bioconda. Details about the installation are available on
https://github.com/MHH-RCUG/Wochenende#installation

Contributors

Thanks to:

@B1T0 Original programmer, testing, evaluation, documentation

@colindaven Concept, programming, updates, integration, maintenance, evaluation, documentation

@Colorstorm Programming, testing, maintenance

@konnosif Plots visualisation

@Nijerik Wochenende reporting

@sannareddyk Bug testing, updates, evaluation

@poer-sophia Code review, testing, maintenance, programming (haybaler and more)

@twehrbein Growth rate code module, testing

@mmpust Raspir, testing, reference sequences, discussion

@irosenboom Reference sequences, testing, bugfixes

@vangreuj Bugfixes, testing

@LisaHollstein Reference sequences, testing

Tools

Postprocessing

Optional extras



wochenende's Issues

minimap2 @SQ problem with many refs

Fails when piping directly via samtools to BAM.

  • Must perform an intermediate SAM step?
  • Consider ngmlr as an alternative aligner?
  • Trial the --split-prefix tmpPrefix option

All tests failed

Bug

Problem

The tests for the Wochenende pipeline failed

General info

  • Wochenende Version: 1.6.2
  • Data: testdb/reads_R1.fastq

Command

# On console
sbatch run_Wochenende_SLURM.sh testdb/reads_R1.fastq

# In the slurm script 
python3 run_Wochenende.py --metagenome testdb --threads $cpus --testWochenende --aligner bwamem --mq30 --remove_mismatching --readType SE --debug --force_restart $fastq
python3 run_Wochenende.py --metagenome testdb --fastp --threads $cpus --testWochenende --aligner bwamem --mq30 --remove_mismatching --readType SE --debug --force_restart $fastq
python3 run_Wochenende.py --metagenome testdb --longread --threads $cpus --aligner ngmlr --readType SE --debug --force_restart $fastq
python3 run_Wochenende.py --metagenome testdb --longread --threads $cpus --aligner minimap2 --readType SE --debug --force_restart $fastq

Pipeline output

slurm-3145064.txt
slurm-3145075.txt
slurm-3145076.txt
slurm-3145077.txt

Add option --no_abra

With the option --no_abra, the pipeline step abra is not performed.
This step almost always fails on metagenomics alignments.

Try creating minimal conda environment

When trying to install many packages using explicit lists of conda packages and versions, some fail to install.

Idea: set a minimal conda env description with just the bioinfo tools needed, and let everything else get determined by conda as needed.

Suggestion: to be trialled.

env.wochenende.minimal.yml

visualize coverage profiles of bacteria

  • python pandas script
  • simple bar / hist /line diags of each species
  • exclude/ignore human/mouse chromosome reads
  • Terms for image subfolders: probably_present, potentially_present
  • Only produce pictures when 5+ windows are present (else >1000 pictures per BAM file)
  • Stats in additional stats file. Stats for all hits in BAM file, i.e. disregard 5+ rule above

fastp update

@colindaven

fastp is a program for adapter and quality trimming of raw reads.

  • we also need to update the Wochenende pipeline to remove the deprecated options
  • update docs to fastp 0.20.0
  • test

Deprecated options: --cut_by_quality5 DEPRECATED, use --cut_front instead. --cut_by_quality3 DEPRECATED, use --cut_tail instead. --cut_by_quality_aggressive DEPRECATED, use --cut_right instead.

These are the parameters I'm using for trimming SOLiD data (SE 75 bp):

fastp -i $file --adapter_fasta $FASTQS_PATH/adapters_solid.fa --cut_front --cut_window_size 5 --cut_mean_quality 15 -l 36 --json $OUT_PATH/${outfile}.trm.json --html $OUT_PATH/${outfile}.trm.html -o $OUT_PATH/${outfile}.trm.fastq

set ngmlr parameters

--min-identity 0.90 (0.85 currently used!)
--min-residues 0.60

    -i <0-1>,  --min-identity <0-1>
        Alignments with an identity lower than this threshold will be discarded [0.65]
    -R <int/float>,  --min-residues <int/float>
        Alignments containing less than <int> or (<float> * read length) residues will be discarded [0.25]



# Existing code (--min-residues is still default)
    ngmlrMinIdentity = (
        0.85  # Aligner ngmlr only: minimum identity (fraction) of read to reference
    )

idxstats failed to load index

Bug

Problem

No bam.txt is created (the file is empty) by the command samtools idxstats after the mq30 filtering step.

General info

  • Wochenende Version: 1.6.1
  • Data: ONT (Oxford Nanopore sequencing) and SOLiD

Command

python3 run_Wochenende.py --metagenome 2019_10_meta_human --threads $cpus --longread --no_prinseq --aligner minimap2 --mq30 --readType SE --debug --force_rest$
python3 run_Wochenende.py --metagenome 2019_10_meta_human --threads $cpus --longread --no_prinseq --aligner ngmlr --mq30 --readType SE --debug --force_restart$
python3 run_Wochenende.py --metagenome 2019_10_meta_human --threads $cpus --fastp --no_fastqc --aligner bwamem --no_abra --mq30 --remove_mismatching --readType SE --debug --force_restart $fastq

Pipeline output

[M::mm_mapopt_update::96.419*2.40] mid_occ = 236
[M::mm_idx_stat] kmer size: 15; skip: 10; is_hpc: 0; #seq: 1122
[M::mm_idx_stat::98.728*2.36] distinct minimizers: 152568511 (33.07% are singletons); average occurrences: 4.933; average spacing: 5.353
[E::sam_parse1] missing SAM header
[W::sam_read1] Parse error at line 27896
[main_samview] truncated file.
[ERROR] failed to write the results
[bam_sort_core] merging from 0 files and 8 in-memory blocks...
[M::bam2fq_mainloop] discarded 0 singletons
[M::bam2fq_mainloop] processed 27894 reads
samtools idxstats: fail to load index for "/working2/rcug_lw/2020_025624_wiehlmann_fossil/we_mini/2020_03_13_fische_wiehlmann_1_bc05.s.mq30.bam"
[bam_fillmd] input SAM does not have header. Abort!
samtools index: "/working2/rcug_lw/2020_025624_wiehlmann_fossil/we_mini/2020_03_13_fische_wiehlmann_1_bc05.s.mq30.calmd.bam" is in a format that cannot be usef
Wochenende - Whole Genome/Metagenome Sequencing Alignment Pipeline
Wochenende was created by Dr. Colin Davenport and Tobias Scheithauer
version: 1.6 - Mar 2020

Meta/genome selected: 2019_10_meta_human
######  Alignment  ######
minimap2 -x map-ont -a -t 12 /lager2/rcug/seqres/metagenref/bwa/refSeqs_allKingdoms_201910_3.fasta /working2/rcug_lw/2020_025624_wiehlmann_fossil/we_mini/2020_
######  Sort BAM  ######
######  Index BAM  ######
######  samtools flagstat  ######
/mnt/ngsnfs/tools/miniconda3/envs/wochenende/bin/samtools flagstat /working2/rcug_lw/2020_025624_wiehlmann_fossil/we_mini/2020_03_13_fische_wiehlmann_1_bc05.s.
######  Get unmapped reads from BAM  ######
/mnt/ngsnfs/tools/miniconda3/envs/wochenende/bin/samtools view -f 4 /working2/rcug_lw/2020_025624_wiehlmann_fossil/we_mini/2020_03_13_fische_wiehlmann_1_bc05.s
/mnt/ngsnfs/tools/miniconda3/envs/wochenende/bin/samtools view -f 4 /working2/rcug_lw/2020_025624_wiehlmann_fossil/we_mini/2020_03_13_fische_wiehlmann_1_bc05.s
######  Remove MQ30 reads  ######
######  Samtools index stats - chromosome counts  ######
######  Index BAM  ######
Filelist item: Alignment
Filelist item: /working2/rcug_lw/2020_025624_wiehlmann_fossil/we_mini/2020_03_13_fische_wiehlmann_1_bc05.fastq
Filelist item: /working2/rcug_lw/2020_025624_wiehlmann_fossil/we_mini/2020_03_13_fische_wiehlmann_1_bc05.bam
Filelist item: Sort BAM
Filelist item: /working2/rcug_lw/2020_025624_wiehlmann_fossil/we_mini/2020_03_13_fische_wiehlmann_1_bc05.bam
Filelist item: /working2/rcug_lw/2020_025624_wiehlmann_fossil/we_mini/2020_03_13_fische_wiehlmann_1_bc05.s.bam
Filelist item: Remove MQ30 reads
Filelist item: /working2/rcug_lw/2020_025624_wiehlmann_fossil/we_mini/2020_03_13_fische_wiehlmann_1_bc05.s.bam
Filelist item: /working2/rcug_lw/2020_025624_wiehlmann_fossil/we_mini/2020_03_13_fische_wiehlmann_1_bc05.s.mq30.bam
Filelist item: Samtools index stats - chromosome counts
Filelist item: /working2/rcug_lw/2020_025624_wiehlmann_fossil/we_mini/2020_03_13_fische_wiehlmann_1_bc05.s.mq30.bam
Filelist item: /working2/rcug_lw/2020_025624_wiehlmann_fossil/we_mini/2020_03_13_fische_wiehlmann_1_bc05.s.mq30.bam.txt
Filelist item: Samtools calmd
Filelist item: /working2/rcug_lw/2020_025624_wiehlmann_fossil/we_mini/2020_03_13_fische_wiehlmann_1_bc05.s.mq30.bam
Filelist item: /working2/rcug_lw/2020_025624_wiehlmann_fossil/we_mini/2020_03_13_fische_wiehlmann_1_bc05.s.mq30.calmd.bam

Additional information

Without the mq30 step this problem does not occur (at least on ONT data).

plotting - filenames too long

Some Windows filesystems have problems with the filenames (not the path names) produced by Wochenende plotting.

Reported by Lutz W.

Action - reduce filename length.

@konnosif Can you please try to fork the dev branch of the repo and look at this, then make a pull request vs dev?
@Colorstorm Fabian or myself can give you a hand if you like.

For example, you could at least remove the MIGHT and especially the complete_genome substrings from the filename.

Thanks!
Colin

add viral sequences

  • extract and add important human viral ref seqs from NCI USA
  • only use on unmapped read FASTQs after human and bact reads removed
  • solved with c062aab

perl version mismatch

The Wochenende script fails at the PCR duplicate removal step with the following error:

ListUtil.c: loadable library and perl binaries are mismatched (got handshake key 0xdb00080, needed 0xcd00080)

OS: Ubuntu 20.04
Perl version: 5.26.2

The current conda env file has these perl versions:

  • perl=5.22.0.1
  • perl-threaded=5.22.0

Try updating the env.wochenende.yml file to include perl 5.26.2 to resolve the compatibility issue.

Conda env

We can create a conda.env.yml file for the pipeline; this will make the conda environment easier to configure and update.

  • Create the conda.env.yml file
  • Add it to the repo
  • Update documentation

sambamba and ldc version

Hi,

Ran into this error while running the Wochenende pipeline with the latest version of sambamba (0.7.1) installed via conda:

sambamba: error while loading shared libraries: libphobos2-ldc-shared.so.87: cannot open shared object file: No such file or directory

@colindaven has suggested pinning sambamba=0.6.6 and also pinning ldc=1.11.0:

conda install -c conda-forge ldc=1.11.0

request: allow more mismatches (scale by read length)

  • if reads longer than our standard (ca. 76 bp) are used, then more mismatches should be accepted
  • check for a new mismatch program (a more flexible bamfilter on mismatches)
  • otherwise just adjust and expand the current clunky method
  • stay conservative: 0-100 bp: 2 mm allowed, 100-150 bp: 3 mm allowed, 150-300 bp: 4 mm allowed? (see the sketch after this list)
  • user can set the parameter at runtime
  • write a test for too many mismatches in the filtered BAM!
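
A minimal sketch of the proposed length-scaled cutoff using the conservative thresholds above (illustrative only; not implemented in the pipeline):

# Hypothetical helper: conservative mismatch allowance scaled by read length.
def max_mismatches(read_length):
    if read_length <= 100:
        return 2
    if read_length <= 150:
        return 3
    return 4  # 150-300 bp

for length in (76, 125, 250):
    print(length, max_mismatches(length))  # -> 2, 3, 4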

"Dynamic" reference argument adding

Right now, when setting the reference paths in the python script, one has to change the settings in the command line arguments definition section of the script. This is neither documented nor wanted.

wochenende plot error b: exotic headers?

Project 2020_035351

Traceback (most recent call last):
  File "wochenende_plot.py", line 450, in <module>
    organism4=i.split("_")[6]
IndexError: list index out of range
Traceback (most recent call last):
  File "wochenende_plot.py", line 450, in <module>
    organism4=i.split("_")[6]
IndexError: list index out of range
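
A hedged sketch of a defensive fix (the actual wochenende_plot.py code may differ): guard the split before indexing, so exotic reference headers with fewer underscore-separated fields fall back to a placeholder instead of raising IndexError.

# Hypothetical helper illustrating the bounds check.
def organism_field(header, index=6, default="unknown"):
    fields = header.split("_")
    return fields[index] if len(fields) > index else default

print(organism_field("1_2_3"))  # -> "unknown" instead of IndexError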

variable unbound

This is likely a problem with unassigned variables in a bash script being caught by set -u (part of set -euo pipefail).

Workaround: comment out the set line, then rerun.

# before:
set -euo pipefail
# after:
#set -euo pipefail

basic_reporting.py and conda on ngsnfs

I'm running into this error when using the conda installed at /mnt/ngsnfs/tools/miniconda3/etc/profile.d/conda.sh (run_Wochenende_reporting_SLURM.sh). It runs fine with conda installed in my working directory.

File "basic_reporting.py", line 20, in <module>
    import pysam
  File "/mnt/ngsnfs/tools/miniconda3/envs/wochenende/lib/python3.6/site-packages/pysam/__init__.py", line 5, in <module>
    from pysam.libchtslib import *
ImportError: /mnt/ngsnfs/tools/miniconda3/envs/wochenende/lib/././libk5crypto.so.3: undefined symbol: krb5int_utf8s_to_ucs2les, version krb5support_0_MIT

Aligner not chosen - leads to bug

Example command, --aligner not defined
python3 run_Wochenende.py --metagenome 2016_06_1p_spec_corrected --threads 12 --no_abra --readType SE --debug $fastq

Alignment

Traceback (most recent call last):
  File "run_Wochenende.py", line 623, in <module>
    main(parser.parse_args(), sys.argv[1:])
  File "run_Wochenende.py", line 531, in main
    args.threads, args.readType)
  File "run_Wochenende.py", line 135, in runFunc
    cF = func(cF, *extraArgs)
  File "run_Wochenende.py", line 381, in runAligner
    if "minimap2" in aligner:
TypeError: argument of type 'NoneType' is not iterable

Solution: default=bwamem

Is this OK @B1T0? Or should we make the aligner choice a required non-positional argument?
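
A minimal sketch of the proposed fix, assuming argparse is used as in run_Wochenende.py: give --aligner a default so the option can be omitted safely.

import argparse

parser = argparse.ArgumentParser()
parser.add_argument("--aligner", choices=["bwamem", "minimap2", "ngmlr"],
                    default="bwamem",  # proposed default instead of None
                    help="Aligner to use (default: bwamem)")
print(parser.parse_args([]).aligner)  # -> "bwamem"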

add simple test for wochenende

  • tiny ref seq bwa mem only
  • tiny read set
  • check expected sizes and output of full run through
  • additional method : runTests

minimum versions needed for Wochenende viz / plotting

TODO - downgrade values in docs to these values.

@sannareddyk Can you try this (not urgent) please?

  • fork -> your account
  • Update README.md to the versions below
  • Pull request to the account you forked from

Thanks!

conda list | grep numpy
numpy                     1.12.1           py36he24570b_1
numpy-base                1.15.0           py36h1793315_0

conda list | grep pandas
pandas                    0.23.4           py36h04863e7_0

conda list | grep matplot
matplotlib                2.2.2            py36h0e671d2_0

make singularity container

TODO

Advantages: more reproducible; not dependent on changing conda libs; distributable as a single file.
Disadvantages: requires a singularity install; more technical to change reference FASTA paths.

ngmlr -long reads- bam.txt output missing

@Colorstorm @sannareddyk

  • after each relevant stage, e.g. after filtering, samtools idxstats is called to create a *.bam.txt file. The file contains counts of read hits to each genome

  • this needs to be assessed and implemented properly for long reads, where duplicate exclusion eg via Picard is a no-go or not needed

  • For some reason, some of the bams *mq30 on your test data are not generating output.

  • However, it seems the *calmd.bam files ARE generating output using the above command

longread problems

  • make --no_duplicate_removal default on --longread mode
  • adjust SLURM scripts
  • try ngmlr on a big >10 GB reference; if OK, use it for long read alignment to avoid @SQ align and BAM convert errors.

complete nextera adapter removal improvements: add trim-galore

  • basic nextera removal added (1.7.1)
  • add trim-galore code (1.7.4)
  • add bioconda trim-galore to the conda env. Version: 0.6.5-0. Failed.
  • two-step strategy (first trimmomatic, then fastp) --> rejected as failed
  • adjust fastp adapters as appropriate depending on command line switches

minimap2 and fastp

The above programs should both be in the conda env install list, and should not need to be installed manually.

Redesign and revise tests

TODO

For filenames, the test should rename the files created by the test system, to avoid having to change the test:

mv someFancyName1 test1.bam.txt
mv someFancyName_dup2 test2.bam.txt

add long read tests

  • add long read samples, nanopore only, use existing testdb
  • add long read tests
  • test

reporting - filenames too long

Shorten filenames for Windows:

reporting -> rep
sorted -> s
unsorted -> us

@Colorstorm we need to do this so file transfers to our windows colleagues work. sFTP transfers are not working because of this.
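
A hypothetical sketch of the proposed shortening (the actual renaming would use os.rename); note the order matters, since "unsorted" contains "sorted":

SHORTEN = [("reporting", "rep"), ("unsorted", "us"), ("sorted", "s")]

def shorten(name):
    for long_form, short_form in SHORTEN:
        name = name.replace(long_form, short_form)
    return name

print(shorten("sample.mq30.mm.dup.bam.txt.reporting.unsorted.csv"))
# -> sample.mq30.mm.dup.bam.txt.rep.us.csv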

trim_galore trimming considerations

  • double trimming: first trim_galore, then trimmomatic, is being done at present
  • test whether it is better to use no hard (?) Q20 cutoff with trim_galore, but to trim twice (trim_galore THEN trimmomatic). Is trimmomatic's phred-algorithm trimming more effective?
