dkfz-odcf / alignmentandqcworkflows Goto Github PK
View Code? Open in Web Editor NEWThe DKFZ alignment workflow plugin originally developed at the eilslabs
Home Page: https://github.com/DKFZ-ODCF/AlignmentAndQCWorkflows/wiki
License: Other
The DKFZ alignment workflow plugin originally developed at the eilslabs
Home Page: https://github.com/DKFZ-ODCF/AlignmentAndQCWorkflows/wiki
License: Other
The current code concerned with adapter clipping is a PITA. Variables i1, i2, o2, etc. (please use descriptive names) are defined in bwaCommonAlignmentSettings.sh. Partially they are used, partially not (o2?). The exit status of the trimmomatic call (why the hell is there an eval?) is not caught and checked.
Clean this mess up.
E.g. if two lane files are compressed differently:
lane1.fastq (uncompressed)
lane1.fastq.gz (gzip)
lane2.fastq.gz (gzip)
lane2.fastq.bz2 (bzip2)
one could argue, that this should not be the case, but Roddy should be able to detect it and pass it properly to its alignment jobs.
Currently, merging of lanes is only possible together duplication marking.
Implement mere BAM merging, without marking in the WGS and WES workflows.
Currently, the file defined by the CLIP_INDEX variable is checked, but only if it does not start with $DIR_EXECUTION. The file should always be checked and the workflow cancelled, if the file is not readable. The check should already be done in QCPipeline, as clipping can be done with all alignment data (currently in BisulfiteCoreWorkflow)
Use ExecutionContext.getExecutionDirectory for this. Alternatively, it may be sufficient to substitute the $DIR_EXECUTION
by ${DIR_EXECUTION}
.
In the WGBS workflow the CHROMOSOME_INDICES
variable is not cross-checked against the content of the BAM file. An incorrect CHROM_SIZES_FILE
in the may yield an unspecific message from the filter_readbins.pl
script (that also occurs in other situations, e.g. if the input is truncated due to pipe-errors.
This is a nuissance every time a new assembly is included.
The contents of CHROMOSOME_INDICES
, CHROM_SIZES_FILE
must be subsets of what is found in the BAM header. This should be checked early in the job scripts or earlier, e.g. in checkExecutability()
in the Java code. If there is no BAM file yet, it should be checked against the BWA index directly to prevent mapping against the wrong INDEX. For the BWA index the .ann
file can be used.
Currently, adapters are only trimmed if an adapter sequence is known before. Implement de novo adapter detect with fastp.
Allow setting:
(a) don't trim
(b) trim with trimmomatic (should remain default when trimming -- for backwards compatibility)
(c) trim with fastp without known adapter
(d) trim with fastp including known adapter
(e) only check whether de novo (observed) adapter is compatible with provided (expected) adapter.
By default, the quality-trimming should be similar (by cutoffs) to the quality-trimming currently done with trimmomatic.
The implementation should allow for flexible configuration of fastp, in case any of the other trimming options are required.
Implementation details.
fastp \
--thread 12 \
--report_title $title"_"$id \
-i $r1 \
-I $r2 \
-o $OUTDIR"/"$title"_"$id"_"$(baseOutput "$r1")"_R1.fastq" \
-O $OUTDIR"/"$title"_"$id"_"$(baseOutput "$r1")"_R2.fastq" \
-Q \
-l 8 \
--detect_adapter_for_pe \
--json $title"_"$id"_"$(baseOutput "$r1")"_fastp.json" \
--html $title"_"$id"_"$(baseOutput "$r1")"_fastp.html" \
2> $title"_"$id"_"$(baseOutput "$r1")"_fastp_out.log"
This call needs to be extended to additionally do quality-based trimming.
Currently, de.dkfz.b080.co.common.BasicCOProjectsRuntimeService#extractLibrariesFromSampleDirectory
has the position of the library in the path hardcoded. However, it also coded in sampleDirectories config-variable. Use this variable!
Tagmentation data is sometimes produced using a non-directional sequencing protocol. To bypass this issue we have to pre-sort read1, and read2 fastq files according to T <-> C ratio and A <-> G ratio. Charles provides me @schmattes42 with that script.
Chromosomose in the stats file (chromosome lengths ATCG) need to be checked to be a subset of those in the FASTA file/index. If they are not, then probably the workflows is misconfigured. A descriptive error message needs to be thrown. All tools taking the CHROM_SIZES_FILE should do this check. Similarly, alse the CHROMOSOME_INDICES
variable of the WGBS workflow should get checked
Failure to check may result in partial statistics or statistics mislabeled, e.g. in case of xenograph alignments, without any error raised. In the worst case these remain unnoticed. Alternatively, an incorrect CHROM_SIZES_FILE
may yield an unspecific message from the filter_readbins.pl
script (that also occurs in other situations, e.g. if the input is truncated due to pipe-errors).
This should be checked early in the job scripts or earlier, e.g. in checkExecutability()
in the Java code. If there is no BAM file yet, it should be checked against the BWA index directly to prevent mapping against the wrong INDEX. For the BWA index the .ann
file can be used.
Currently, there is code in the COWorkflows plugin referring to the library, however, the library seems only used in the AlignmentAndQCWorkflows plugin (and therein only by the WGBS pipeline). Consider relocating all that code to the AQCWF plugin.
Data can (and does) degenerate on the harddisk. This can be detected by whenever reading from a BAM calculating the MD5 sum comparing the result to the MD5 sum stored in the .md5 file.
This issue would also increase compliance with the Rahmendatenschutzkonzept.
Motivation: If an alignment job (or whatever other job) takes exceptionally long to run, requires exceptionally large memory resources, or shows similar anomalies, it is time-consuming to identify possible characteristics in the data itself that may cause the anomaly. In OTP the online statistics could be shown and indicate to the researcher problems with their sample.
Goal: Do certain QC statistics "on line" on the output (SAM) stream of the alignment (VCF or whatever) and secure these at regular intervals. "On line" here means, that the statistics should be written for individual chunks of reads in the stream, e.g. every 10e6 reads, and/or aggregated over the full sample seen up to the moment. The online statistics need to saved repeatedly to disc during the processing and must not be deleted in the end. Currently, all statistics file are empty and QC scripts just dump their results at the end of processing.
For the alignment and merging steps, the following statistics should be culled from the per-lane SAM stream at regular intervals
All statistics are interesting that may relate to an exceptionally long runtime or otherwise failing jobs during alignment or any of its follow-up processing steps.
Tagmentation WGS and WES data come up and need to be processed in a way similar to the T-WGBS data, namely by first doing duplication marking per library followed by simple merging of library-BAMs.
The following steps need to be taken:
At the moment methylation calling is done per library bam and afterwards on the merged bam. Instead of doing this it would be more preformance friendly to only add up the counts for all libraries.
Consider the following example Roddy call:
roddy.sh run testCoBaseConfigs-Roddy-3-0.Picard.SoftwareBwa.WGBS@alignment methylationpid --usemetadatatable=wgbsTableTest1.tsv --useconfig=applicationProperties-workstation-to-lsf.ini --usefeaturetoggleconfig=featureToggles.ini --configurationDirectories=configs,AlignmentAndQCWorkflows/,AlignmentAndQCWorkflows/resources --usePluginVersion=AlignmentAndQCWorkflows:1.2.73 --useiodir=wgbsTableTest1.testCoBaseConfigs-Roddy-3-0.Picard.SoftwareBwa.WGBS.noInputDir --cvalues="INDEX_PREFIX:reference_genomes/bwa06_methylCtools_hs37d5_PhiX_Lambda/hs37d5_PhiX_Lambda.conv.fa,CHROM_SIZES_FILE:reference_genomes/bwa06_methylCtools_hs37d5_PhiX_Lambda/stats/hs37d5_PhiX_Lambda.fa.chrLenOnlyACGT.tab,CYTOSINE_POSITIONS_INDEX:reference_genomes/bwa06_methylCtools_hs37d5_PhiX_Lambda/hs37d5_PhiX_Lambda.pos.gz,CHROMOSOME_INDICES:(21 22),outputAllowAccessRightsModification:false,usedResourcesSize:xs,runFastQC:true,runFingerprinting:true,usedResourcesSize:m" --additionalImports=wgbs-standard-lsf --useRoddyVersion=3.0.9
When rerunning the WGBS analysis on existing data, the Picard merge/mark duplicates step dies with exit code 100 (99):
To get help, see http://broadinstitute.github.io/picard/index.html#GettingHelp
Exception in thread "main" htsjdk.samtools.SAMException: Value was put into PairInfoMap more than once. 1: run1_id2:HWI-xxxxxxx:104:XXXXXXXXX:1:1102:19656:66811
at htsjdk.samtools.CoordinateSortedPairInfoMap.ensureSequenceLoaded(CoordinateSortedPairInfoMap.java:132)
at htsjdk.samtools.CoordinateSortedPairInfoMap.remove(CoordinateSortedPairInfoMap.java:86)
at picard.sam.markduplicates.util.DiskBasedReadEndsForMarkDuplicatesMap.remove(DiskBasedReadEndsForMarkDuplicatesMap.java:61)
at picard.sam.markduplicates.MarkDuplicates.buildSortedReadEndLists(MarkDuplicates.java:285)
at picard.sam.markduplicates.MarkDuplicates.doWork(MarkDuplicates.java:114)
at picard.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:187)
at picard.cmdline.PicardCommandLine.instanceMain(PicardCommandLine.java:89)
at picard.cmdline.PicardCommandLine.main(PicardCommandLine.java:99)
This is probably because the script tries to keep an existing BAM file and instead merges in a new lane that actually is already in the BAM file (because this a simple "run").
The logic for handling BAMs involves 3 factors:
bam
parameter.useOnlyExistingTargetBam
parameter is set to true or false (false by default).TODO: Clean up the logic and fix the bug.
It is a nuissance having to add target region size, although it can simply and efficiently be calculated from the input target region file.
Currently this is not the case and the workflow will fail e.g. in this case:
lane1.fastq
lane1.fastq.gz
lane2.fastq
=> This will fail
One could argue, that this should not be the case, but people tend to do funny things.
The diffchrom plot script (in R) truncates the output file name (-o
option) to 511 characters (probably 512 including the \0
string terminator in C).
In contrast to WGBS, the WES and WGS workflows currently still use the extraction of execution parameters (sample, pid, etc.) from paths.
Furthermore,
Note that path-parsing code is also in RoddyCore. E.g. de.dkfz.roddy.core.RuntimeService#extractDataSetIDFromPath
It may be a possible to start in the AQCWF, generalize into COWF and even later in the RoddyCore, as was done for the MetadataTable. Then MetadataTable and PathExtraction may become strategies (Strategy Pattern). Others may be e.g. getting data from a database (such as OTP).
Move the PID registry, Tempfile registry, and linear pipe chain extension code into the DefaultPlugin or even its own library, to make it accessibly in all plugins. Don't forget the tests!
In xenograft samples the coverage is expected to be different for the grafted (human) genome and the host's (mouse) genome. This needs to be visible in the statistics.
The workflow usually has one set of chromosomes also in xenograft, but in xenograft either the human or the mouse genome are prefixe. (CHR_PREFIX). This can be used to calculate separate statistics.
The current detection code is:
for (int i = 0; i < er.resultLines.size(); i++) {
//Look at Matthias ExomePipeline extension. The code is mainly taken from there.
String dCString = null
String dRString = "gzip -c" //POSSIBLE-ERROR: Not zipper set for 'cat'/ASCII
String type = er.resultLines[i]
if (type.startsWith("setgid "))
type = type[7..-1].trim().split(" ")[0]
if (type == "0") continue
if (type == "gzip") {
dCString = "gunzip -c"
dRString = "gzip -c"
} else if (type == "bzip2") {
dCString = "bunzip2 -c -k"
dRString = "bzip2 -c -k"
} else if (type == "ASCII" || type == "ASCII text") {
dCString = "cat"
dRString = "head -n E"
}
allLaneFiles[i].setDecompressionString(dCString)
allLaneFiles[i].setRecompressionString(dRString)
}
file -bL used to work? (or we did not recognize that it never worked properly due to other circumstances). It needs to recognize strings e.g. from:
ASCII text with very long lines
If two jobs are (by mistake or bug in an automation system or batch processing system -- our scenario) run during the same time their files get mixed and may be corrupt. For the statistics files this might by acceptible (they can be recalculated), but not for the BAM (e.g. if the FASTQs are deleted after some time -- which is our use case). The costs of such mistakes are more expensive than the costs of implementing and maintaining this feature.
Implement a simple job that checks the integrity of the final merged BAM file, e.g. by samtools flagstat $bam > /dev/null
.
Bash arrays provided to Roddy need to be of the form ( a b c )
, while the actual Bash-compatible form is (a b c)
, i.e. without spaces after (
and before )
.
The two functions waitForRegisteredPids and the _BashSucksVersion wait for all pids en bloc. It is not possible to say exactly, which PID produced the error code.
Currently, reads whose orientation can not unambiguously be classified by WGBS_read_pair_reconstruction.py
are simply ignored and no statistics are reported.
Sooner or later we have to remove the per lane, and per library bam files and only keep the final merged bam file. Therefore, whenever a new lane for a specific library is sequenced, the aligned reads for that library have to be extracted from the merged bam file, and the newly aligned lane has to merged and duplicate marked against the library bam file and later merged into the final bam file.
The context-dependent (CG+CH) methylation rates on autosomes need to be combined across chromosomes into a single number.
Detailed documentation of all input files is needed.
Important side task: While documenting keep track of necessary improvements in output and open new issues.
bwa-mem2 is 1.7-2 times faster with ~2 times more memory. For time-critical data bwa-mem2 should be used.
bwa-mem2 command (CLI) seems to be compatible to bwa-mem. Therefore,
Currently the PID registry is a plain array. With associative arrays one can give a names to the PIDs and report them upon errors, which is great for debugging and development.
Make the PID registry an associative array. See here for extensive good docu: https://www.artificialworlds.net/blog/2012/10/17/bash-associative-array-examples/.
We had two cases where alignment jobs failed, but the alignment files where already moved to the final location, so that a rerun required finding and deleting those cases of wrongly 'completed' alignment files.
Not sure why this happened. One likely explanation is that the files are not moved to their final destination at the very end of the script, but rather before a couple of final BWA checks. So if one of those fails, the job fails, but the alignment files are not marked as temporary. This is unwanted, as the BWA checks are used to indicate alignments which need to be replaced/taken care of.
Things to do in summary:
The bash variable PIPESTATUS contains an array of the most recent pipe expression. So
$ false | true | false | true
$ echo ${PIPESTATUS[@]}
0 1 0 1
Rescue the pipestatus for all pipes in the scripts. Write the status to dedicated files, as is done for the BWA error code.
Note that PIPESTATUS is also available in functions:
$ rescuePipeStatus () {
declare -la pipstat=(${PIPESTATUS[@]})
local outfile="$1"
echo "${pipstat[@]}" > "$outfile"
}
$ (false | true | true; rescuePipeStatus pipstat) &
$ wait
$ cat pipstat
0 1 1
According to this article the biased region during PBAT may extend further than 9 base pairs. However, the current script only ignores 9 bps when creating conversion statistics.
The methylationcalling meta job starts bisulfite calling processes with bash -- always in pairs of largest (front) and smallest (back) chromosome. This has a number of disadvantages
This job should be done by the workflow manager! Move the logic from the meta job into the workflow manager!
MultiQC simplifies the visual representation of QC data. For display in MultiQC result files either should be one of the already supported formats or can be annotated. Check the output files and restructure their content to support MultiQC.
Preferentially S3.
Problems:
We may first have to implement some basic support in Roddy. See TheRoddyWMS/Roddy#331
count(input) <= count(flagstats) [samtools < 1.2, sambamba < 0.5.9]
count(input) <= non_supplementary_count(flagstats) [samtools >= 1.2, sambamba >= 0.5.9]
Instead of
while (my $line = <$fh>) {
doSomethingWithLine($line);
}
the following paradigm should be used
while (!eof($fh)) {
my $line = readline($fh);
if (! defined $line) {
die "IO Error: $!"
}
doSomethingWithLine($line);
}
This correctly recognizes and handles IO errors during the processing (in contrast to the previous simple method).
To be able to compare input and output numbers of reads if trimming is turned on (WGBS!), the number of trimmed reads should be collected from the trimmomatic output. Add the value to the qualitycontrol.json (required for OTP internal consistency tests similar to T1263).
Example output:
Input Read Pairs: 95530459 Both Surviving: 90669916 (94.91%) Forward Only Surviving: 4023149 (4.21%) Reverse Only Surviving: 604635 (0.63%) Dropped: 232759 (0.24%)
Note that all these numbers have the unit "read pairs".
Number of non-supplementary reads in the BAM
$ samtools view -F 0x800 -c ../../replicate1_libNA_run160429_SN7001355_0065_BC7D03ACXX_41_Hf03_LiHe_Ct_WGBS_S_2_CTTGTA_L007_paired.bam.sorted.bam 181339832
This means that only the read pairs are kept, that are long enough (i.e. not dropped) and where both reads are retained.
Bug in R-Script coveragePlot.R, finishes with error:
Error in axis(1, at = pretty(c(0, nrow(sampleOneCon[[i]]))), labels = paste(pretty(c(0, :
'at' and 'labels' lengths differ, 8 != 5
Execution halted
Search for this and correct it.
/**
* Hack method! Allow to decrease a file stage
*/
public LaneFile getFSDecreasedCopy() {
LaneFile lf = new LaneFile(this, this.getExecutionContext());
lf.fileStageSettings = fileStageSettings.decreaseLevel();
return lf;
}
Please fix a small bug in the coveragePlot.R script: the plotting device is only closed if twoSamples is true. The dev.off()
should be moved out of the if(twoSamples)
block. Otherwise, the script fails with a 'Too many open devices error'. See snippet below:
I would do it myself but there seems to be a central place where this change has to be made and committed, and I am not sure how to do that.
for( i in names(sampleOneCon)){
chr.name = i
chr.name = gsub("^0*", "", i)
chr.name = paste("chr", chr.name, sep = "")
fname = outputFile
fname = gsub("\\.png", paste("_", chr.name, ".png", sep = ""), fname)
png(fname, width=1100,height=1000)
par(mfrow=c(3,1),cex=1.5,mar=c(2,4,2,0.1))#
#sampleOne#
plot(log(sampleOneCon[[i]]$reads+1,2),pch=".",main=paste(sampleOneName,"_", chr.name,sep=""),ylab=paste("log2 #",countType," per ",windowSize,"kb",sep=""),axes=FALSE)
box()#
axis(2)#
axis(1,at=pretty(c(0,nrow(sampleOneCon[[i]]))),labels=paste(pretty(c(0,nrow(sampleOneCon[[i]]))/bp_scale),"Mb",sep=""))#
#sampleTwo#
if (twoSamples) {
plot(log(sampleTwoCon[[i]]$reads+1,2),pch=".",main=paste(sampleTwoName,"_", chr.name,sep=""),ylab=paste("log2 #",countType," per ",windowSize,"kb",sep=""),axes=FALSE)
box()#
axis(2)#for( i in names(sampleOneCon)){
axis(1,at=pretty(c(0,nrow(sampleTwoCon[[i]]))),labels=paste(pretty(c(0,nrow(sampleTwoCon[[i]]))/bp_scale),"Mb",sep=""))#
#
#Ratio between the 2 samples normalized to #reads all chromosomes
plot(log(((sampleTwoCon[[i]]$reads)/sumSampleTwo)/((sampleOneCon[[i]]$reads)/sumSampleOne),2),pch=".",main=paste("log2 ratio ",sampleTwoName," vs. ", sampleOneName, " ", chr.name, sep=""),ylab=paste("#",countType," per ", windowSize, "kb normalized to total reads",sep=""),axes=FALSE)#
box()#
axis(2)#
axis(1,at=pretty(c(0,nrow(sampleOneCon[[i]]))),labels=paste(pretty(c(0,nrow(sampleOneCon[[i]]))/bp_scale),"Mb",sep=""))#
abline(h=0,col=2)#
dev.off() # THIS LINE HAS TO BE MOVED DOWN OUT OF THE IF BLOCK, SO THAT DEVICES ARE ALWAYS CLOSED
}
}
Currently, methylCtools (fqconv) currently encodes information in a string that is attached to the identifier. This causes two problems:
/{12}
suffixes the concatenation prevents BWA to strip of the read number suffix and downstream tools failfqconv
should write the meta information into the FASTQ comment in a suitable SAM-spec compatible attribute format and BWA should be called with -C
option. bconv
needs to parse the comment information back instead of taking it from the extended identifier.
See https://github.com/samtools/hts-specs/blob/master/SAMtags.pdf:
e.g. X?:Z:$value
In method mergeAndRemoveDuplicatesSlim, the check and throw of an exception is insufficient during submission time. Instead there should be a check in the workflow/plugin method already earlier before submission starts in checkExecutability()
.
Implementation:
de.dkfz.b080.co.common.BasicCOProjectsRuntimeService#extractLibrariesFromSampleDirectory
into a static method. Leave the throw where it is, because the exception is only useful here. You could also change the code into a hash { biobambam: MERGEANDMORMDUP_SLIM_BIOBAMBAM, ... }
checkExecutability()
.According to Naveed the tool BIS-Sniper is a very fast tool for Methyation Calling as well as SNP-Calling. It could be worthwile to replace our current Methylation Caller by BIS-Sniper. Another advantage of this tool is, that we get SNP calls for free!!!
A declarative, efficient, and flexible JavaScript library for building user interfaces.
๐ Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.
TypeScript is a superset of JavaScript that compiles to clean JavaScript output.
An Open Source Machine Learning Framework for Everyone
The Web framework for perfectionists with deadlines.
A PHP framework for web artisans
Bring data to life with SVG, Canvas and HTML. ๐๐๐
JavaScript (JS) is a lightweight interpreted programming language with first-class functions.
Some thing interesting about web. New door for the world.
A server is a program made to process requests and deliver data to clients.
Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.
Some thing interesting about visualization, use data art
Some thing interesting about game, make everyone happy.
We are working to build community through open source technology. NB: members must have two-factor auth.
Open source projects and samples from Microsoft.
Google โค๏ธ Open Source for everyone.
Alibaba Open Source for everyone
Data-Driven Documents codes.
China tencent open source team.