exascience / elprep Goto Github PK

View Code? Open in Web Editor NEW

286.0 286.0 42.0 3.55 MB

elPrep: a high-performance tool for analyzing sequence alignment/map files in sequencing pipelines.

License: Other

Go 100.00%

elprep's People

Contributors

Stargazers

Watchers

Forkers

blindglobe vswilliamson leonorpalmeira fsoubes geertvandeweyer im-kulikov arun-sub matthdsm cgkarl hoardboard dauss75 haoziyeung dayedepps yixf-self altingia xiaoqiwang19 shammishailaj maurya-anand roysomak4 rjammala flywind2 tanakh pythseq ardeshir raonyguimaraes shalomhillelroth kfabryczny booew linhxxx mewbak juke34 quanrd datastark pstrybol schaudge aipolly iamyourboss gxpeng 20788500 xiongjun19 elliotwutingfeng

elprep's Issues

sfm --haplotypecaller introduces duplicate lines in BAM

Hi,

I am testing elPrep, mostly for the variant-calling step: hoping to use it as a drop-in replacement for GATK4. I am using a small test BAM, paired-end reads, some mates map to different chromosomes. I noticed that in sfm mode, every read whose mate maps to a different chrom gets duplicated in the output.bam. This doesn't affect the GVCFs produced by elPrep, but the resulting BAM is not spec-compliant, and I suspect this may affect other tools working on the elPrep-produced BAMs. Details below.

Run elprep in sfm mode:
elprep sfm testin.bam testout_sfm.bam --haplotypecaller test_sfm.g.vcf.gz --reference /data/HumanGenome/hs38DH.elfasta
Run elprep in filter mode:
elprep filter testin.bam testout_filter.bam --haplotypecaller test_filter.g.vcf.gz --reference /data/HumanGenome/hs38DH.elfasta
Compare small BAMs sfm vs filter:
/home/nthierry/Software/BamUtil/bamUtil_1.0.13/bamUtil/bin/bam diff --in1 testout_sfm.bam --in2 testout_filter.bam
-> testout_sfm.bam contains duplicate lines for each read whose mate maps to a different chromosome.

Question: does this impact the elprep GVCFs?
zdiff test_sfm.g.vcf.gz test_filter.g.vcf.gz
-> only diff is the header ##elPrepCommandLine , no consequence for the elprep variant-caller.

The bug doesn't occur if I don't call variants:
elprep filter testin.bam ttt_filter.bam
elprep sfm testin.bam ttt_sfm.bam
/home/nthierry/Software/BamUtil/bamUtil_1.0.13/bamUtil/bin/bam diff --in1 ttt_sfm.bam --in2 ttt_filter.bam
-> no difference

The SAM spec says:
[QNAME] In a SAM file, a read may occupy multiple alignment lines, when its alignment is chimeric or when multiple mappings are given.
[FLAG] For each read/contig in a SAM file, it is required that one and only one line associated with the read satisfies ‘FLAG & 0x900 == 0’
-> the elPrep-produced BAMs with duplicate lines doesn't seem compliant (AFAICT?)

Regards,
Nicolas

elprep drops into ldb on sbcl 1.2.9 OSX

The following error is reproducible. I traced the problem to (setup-standard-streams). If this is not called, all is well. If I rebind the standard IO streams to sb-sys:*stderr* instead of using setq then the problem is resolved (so far).

./elprep CORRUPTION WARNING in SBCL pid 2496(tid 140735277009248):
Memory fault at 0x1bc000 (pc=0x100116a4b2, sp=0x19ff878)
The integrity of this image is possibly compromised.
Continuing with fingers crossed.

...

Help! 11 nested errors. SB-KERNEL:*MAXIMUM-ERROR-DEPTH* exceeded.
Backtrace for: #<SB-THREAD:THREAD "main thread" RUNNING {10037B6B83}>
Help! 11 nested errors. SB-KERNEL:*MAXIMUM-ERROR-DEPTH* exceeded.
Backtrace for: #<SB-THREAD:THREAD "main thread" RUNNING {10037B6B83}>
fatal error encountered in SBCL pid 2496(tid 140735277009248):
%PRIMITIVE HALT called; the party is over.

Welcome to LDB, a low-level debugger for the Lisp runtime environment.
ldb> backtrace
Backtrace:
   0: Foreign function ldb_monitor, fp = 0x19f6f80, ra = 0x109c16
   1: Foreign function call_lossage_handler, fp = 0x19f6f90, ra = 0x10625a
   2: Foreign function lose, fp = 0x19f7080, ra = 0x1064a1
   3: Foreign function handle_trap, fp = 0x19f70b0, ra = 0x108bc2
   4: Foreign function signal_emulation_wrapper, fp = 0x19f7100, ra = 0x112ff7
   5: Foreign function stack_allocation_recover, fp = 0x19f7170, ra = 0x111bf0
   6: Foreign function stack_allocation_recover, fp = 0x19f75e8, ra = 0x111bf0
   7: SB-IMPL::ERROR-ERROR
   8: SB-IMPL::INFINITE-ERROR-PROTECTOR
   9: SB-KERNEL::INTERNAL-ERROR
  10: Foreign function call_into_lisp, fp = 0x19f76f0, ra = 0x11992e
  11: Foreign function funcall2, fp = 0x19f7720, ra = 0x10298c
  12: Foreign function interrupt_internal_error, fp = 0x19f7760, ra = 0x108ada
  13: Foreign function handle_trap, fp = 0x19f7790, ra = 0x108b87
  14: Foreign function signal_emulation_wrapper, fp = 0x19f77e0, ra = 0x112ff7
  15: Foreign function stack_allocation_recover, fp = 0x19f7850, ra = 0x111bf0
  16: Foreign function stack_allocation_recover, fp = 0x19f7cc0, ra = 0x111bf0
  17: 
  18: SB-IMPL::ERROR-ERROR
  19: SB-IMPL::INFINITE-ERROR-PROTECTOR
  20: COMMON-LISP::ERROR
  21: SB-KERNEL::TWO-ARG--
  22: SB-DEBUG::MAP-BACKTRACE
  23: (COMMON-LISP::FLET THUNK66 KEYWORD::IN SB-DEBUG::PRINT-BACKTRACE)
  24: (COMMON-LISP::LAMBDA () KEYWORD::IN SB-DEBUG::FUNCALL-WITH-DEBUG-IO-SYNTAX)
  25: SB-IMPL::CALL-WITH-SANE-IO-SYNTAX
  26: SB-IMPL::%WITH-STANDARD-IO-SYNTAX
  27: SB-DEBUG::PRINT-BACKTRACE
  28: (COMMON-LISP::LAMBDA () KEYWORD::IN SB-IMPL::ERROR-ERROR)
  29: SB-IMPL::%WITH-STANDARD-IO-SYNTAX
  30: SB-IMPL::ERROR-ERROR
  31: SB-IMPL::INFINITE-ERROR-PROTECTOR
  32: COMMON-LISP::ERROR
  33: SB-KERNEL::TWO-ARG--
  34: SB-DEBUG::MAP-BACKTRACE
  35: (COMMON-LISP::FLET THUNK66 KEYWORD::IN SB-DEBUG::PRINT-BACKTRACE)
  36: (COMMON-LISP::LAMBDA () KEYWORD::IN SB-DEBUG::FUNCALL-WITH-DEBUG-IO-SYNTAX)
  37: SB-IMPL::CALL-WITH-SANE-IO-SYNTAX
  38: SB-IMPL::%WITH-STANDARD-IO-SYNTAX
  39: SB-DEBUG::PRINT-BACKTRACE
  40: (COMMON-LISP::LAMBDA () KEYWORD::IN SB-IMPL::ERROR-ERROR)
  41: SB-IMPL::%WITH-STANDARD-IO-SYNTAX
  42: SB-IMPL::ERROR-ERROR
  43: SB-IMPL::INFINITE-ERROR-PROTECTOR
  44: COMMON-LISP::ERROR
  45: SB-SYS::MEMORY-FAULT-ERROR
  46: Foreign function call_into_lisp, fp = 0x19f8630, ra = 0x11992e
  47: Foreign function post_signal_tramp, fp = 0x19f86b8, ra = 0x119b20
  48: SB-IMPL::OUTPUT-CHAR-UTF-8-LINE-BUFFERED
  49: (COMMON-LISP::LAMBDA (COMMON-LISP::&REST COMMON-LISP::REST) KEYWORD::IN SB-IMPL::GET-EXTERNAL-FORMAT)
  50: COMMON-LISP::TERPRI
  51: SB-FORMAT::&-FORMAT-DIRECTIVE-INTERPRETER
  52: SB-FORMAT::INTERPRET-DIRECTIVE-LIST
  53: SB-FORMAT::%FORMAT
  54: COMMON-LISP::FORMAT
  55: SB-DEBUG::%PRINT-DEBUGGER-INVOCATION-REASON
  56: SB-DEBUG::%INVOKE-DEBUGGER
  57: (COMMON-LISP::LAMBDA () KEYWORD::IN SB-DEBUG::FUNCALL-WITH-DEBUG-IO-SYNTAX)
  58: SB-IMPL::CALL-WITH-SANE-IO-SYNTAX
  59: SB-IMPL::%WITH-STANDARD-IO-SYNTAX
  60: COMMON-LISP::INVOKE-DEBUGGER
  61: COMMON-LISP::ERROR
  62: SB-SYS::MEMORY-FAULT-ERROR
  63: Foreign function call_into_lisp, fp = 0x19f8d00, ra = 0x11992e
  64: Foreign function post_signal_tramp, fp = 0x19f8d88, ra = 0x119b20
  65: SB-IMPL::OUTPUT-BYTES/UTF-8
  66: (COMMON-LISP::LAMBDA (COMMON-LISP::&REST COMMON-LISP::REST) KEYWORD::IN SB-IMPL::GET-EXTERNAL-FORMAT)
  67: SB-IMPL::FD-SOUT
  68: SB-IMPL::%WRITE-STRING
  69: SB-PRETTY::OUTPUT-LINE
  70: SB-PRETTY::MAYBE-OUTPUT
  71: COMMON-LISP::PPRINT-NEWLINE
  72: SB-FORMAT::_-FORMAT-DIRECTIVE-INTERPRETER
  73: SB-FORMAT::INTERPRET-DIRECTIVE-LIST
  74: (COMMON-LISP::LABELS BODY-NAME-1166 KEYWORD::IN SB-FORMAT::INTERPRET-FORMAT-LOGICAL-BLOCK)
  75: SB-FORMAT::INTERPRET-FORMAT-LOGICAL-BLOCK
  76: SB-FORMAT::<-FORMAT-DIRECTIVE-INTERPRETER
  77: SB-FORMAT::INTERPRET-DIRECTIVE-LIST
  78: SB-FORMAT::%FORMAT
  79: COMMON-LISP::FORMAT
  80: SB-DEBUG::%INVOKE-DEBUGGER
  81: (COMMON-LISP::LAMBDA () KEYWORD::IN SB-DEBUG::FUNCALL-WITH-DEBUG-IO-SYNTAX)
  82: SB-IMPL::CALL-WITH-SANE-IO-SYNTAX
  83: SB-IMPL::%WITH-STANDARD-IO-SYNTAX
  84: COMMON-LISP::INVOKE-DEBUGGER
  85: COMMON-LISP::ERROR
  86: SB-SYS::MEMORY-FAULT-ERROR
  87: Foreign function call_into_lisp, fp = 0x19f9840, ra = 0x11992e
  88: Foreign function post_signal_tramp, fp = 0x19f98c8, ra = 0x119b20
  89: SB-IMPL::OUTPUT-BYTES/UTF-8
  90: (COMMON-LISP::LAMBDA (COMMON-LISP::&REST COMMON-LISP::REST) KEYWORD::IN SB-IMPL::GET-EXTERNAL-FORMAT)
  91: SB-IMPL::FD-SOUT
  92: SB-IMPL::%WRITE-STRING
  93: SB-PRETTY::OUTPUT-LINE
  94: SB-PRETTY::MAYBE-OUTPUT
  95: COMMON-LISP::PPRINT-NEWLINE
  96: SB-FORMAT::_-FORMAT-DIRECTIVE-INTERPRETER
  97: SB-FORMAT::INTERPRET-DIRECTIVE-LIST
  98: (COMMON-LISP::LABELS BODY-NAME-1166 KEYWORD::IN SB-FORMAT::INTERPRET-FORMAT-LOGICAL-BLOCK)
  99: SB-FORMAT::INTERPRET-FORMAT-LOGICAL-BLOCK
ldb>

python script handing of `single-end` flag

Hi,

when running the following command:

elprep-sfm-gnupar.py \
       elprep_sWGS.bam \
        elprep_sWGS_filtered.bam \
        --mark-duplicates \
        --sorting-order coordinate\
        --nr-of-threads $CORES \
        --nr-of-jobs 2 \
        --intermediate-files-output-type sam \
        --intermediate-files-output-prefix elprep \
        --single-end

I get the following error when the filter step is about to start:

flag provided but not defined: -single-end

It seems to me the sfm scripts do nog handle the --single-end flag correctly by passing it to the filter command, resulting in the error above.

Could you have a look at those?

Thanks
M

Q: support for PacBio "native" format

Hi,
is elPrep also a drop-in replacement for processing BAM files in "PacBio native" format, i.e., BAM files containing unaligned long reads plus additional quality information that is used for downstream steps such as alignment with pbmm2?
Thanks for the info.
+Peter

vcf-to-elsites

Hi, @caherzee
I am trying to convert a vcf file to elsites but I am getting an error message like this
missing ID in a VCF meta-information line: PEDIGREE=<Child=NA12877,Mother=NA12890,Father=NA12889>
that's the log for the process

elprep version 5.1.1 compiled with go1.16.7 - see http://github.com/exascience/elprep for more information.

2022/01/10 14:00:16 Created log file at ../analysis/elprep_VC/logs/elprep/elprep-2022-01-10-14-00-16-701177358-EET.log
2022/01/10 14:00:16 Command line: [elprep vcf-to-elsites NA12877.vcf NA12877.elsites --log-path ../analysis/elprep_VC/]
2022/01/10 14:00:16 missing ID in a VCF meta-information line: PEDIGREE=<Child=NA12877,Mother=NA12890,Father=NA12889>
panic: missing ID in a VCF meta-information line: PEDIGREE=<Child=NA12877,Mother=NA12890,Father=NA12889>

goroutine 1 [running]:
log.Panicf(0x687f09, 0x2d, 0xc000063b90, 0x1, 0x1)
	/opt/conda/conda-bld/elprep_1633347706013/_build_env/go/src/log/log.go:361 +0xc5
github.com/exascience/elprep/v5/vcf.(*StringScanner).ParseMetaInformation(0xc000063c90, 0xc000022440, 0x36)
	/opt/conda/conda-bld/elprep_1633347706013/work/vcf/vcf-files.go:155 +0x2ed
github.com/exascience/elprep/v5/vcf.ParseHeader(0xc0000104e0, 0x2e, 0xc000074120)
	/opt/conda/conda-bld/elprep_1633347706013/work/vcf/vcf-files.go:293 +0x21b
github.com/exascience/elprep/v5/intervals.FromVcfFile(0x7ffdaa54d41c, 0xb, 0x0)
	/opt/conda/conda-bld/elprep_1633347706013/work/intervals/intervals.go:310 +0xb8
github.com/exascience/elprep/v5/cmd.VcfToElsites()
	/opt/conda/conda-bld/elprep_1633347706013/work/cmd/convert.go:47 +0x228
main.main()
	/opt/conda/conda-bld/elprep_1633347706013/work/main.go:75 +0x38b

I am not sure what could be the issue. the VCF file I am using is for the platinum genome from illumina so can anyone help here ?

Encapsulate format conversions into elprep

Do you have plans to encapsulate format conversions (VCF->elPrep VCF, Fasta->elprep_fasta, BED->ElPrep BED) inside the tool, so the user does not have the overhead of running these commands?
These are standard bioinformatic formats, and additionally keeping these special elprep files could be cumbersome. Thanks :)

Support for multiple input files

Hi,

I was wondering if it would be possible to add support for multiple input files. This way elprep could be used to merge multiple split bam files into a single final bam. This approach is also used by GATK.

Thanks
M

testing?

There only a very light covering of the code base with tests here. In my experience, ensuring a safe and robust implementation of BGZF and BAM (and to a lesser degree SAM) is non-trivial. So more tests are certainly warranted.

filter mismatches and or percentage identity?

Hello,

thanks for a really fast and efficient tool.

I know this is primarily targeted towards speeding up preprocessing before the GATK workflow, but I wonder if you had considered mismatches ?

It's actually not that easy to filter by mismatches, especially dynamically. With bamtools you can set a filter as json, eg NM<4, but you can't do this dynamically per read to my knowledge. That leaves pysam, which is slow and an additional dependency.

With long reads eg ONT, it would be very advantageous to set a read filter defined as percent identity or NM per 100 / 1000 bp. This would help to better filter out poor reads, which at the moment are not easy to filter. Some aligners (ngmlr) include a pc identity filter, yet the fastest tools such as minimap2 do not support a percent identity filter.

Are there any plans to add new features like this ?

cheers,
Colin

GATK equivalency - version

Hi,

Can you add in the README which GATK version exactly Elprep outputs equivalent data with?
Thanks
M

Does the command remove duplicates also does the marking of the duplicates?

Hi,
Should command --remove-duplicates only be used in combination with --mark-duplicates? Does it also mark the duplicates or just removes previously marked ones?
(when I used both --mark-duplicates and --remove-duplicates it seamed i got smaller size output file comparing to the one I got with only --remove-duplicates filter)
Thank you.

Type error for stream in parse-sam-header on sbcl 1.2.9 OSX

I've got further from #2 by using the method I described. I've read the code and would expect this to work (via Gray streams), but nevertheless:

elPrep version 2.11. See http://github.com/exascience/elprep for more information.
Executing command:
  elprep /dev/stdin NA12878-chr22-10pct.only_mapped.bam --filter-unmapped-reads --sorting-order keep --gc-on 0 --nr-of-threads 1

debugger invoked on a TYPE-ERROR in thread
#<THREAD "main thread" RUNNING {10037AECC3}>:
  The value
    #S(ELPREP::BUFFERED-ASCII-INPUT-STREAM
       :INDEX 0
       :LIMIT 8192
       :BUFFER #(69 82 82 49 57 52 49 52 55 46 53 53 ...)
       :SECONDARY-BUFFER NIL
       :ELEMENT-TYPE BASE-CHAR
       :STREAM #<SB-SYS:FD-STREAM for "file /dev/fd/0" {10038C60C3}>)
  is not of type
    STREAM.

Type HELP for debugger help, or (SB-EXT:EXIT) to exit from SBCL.

(no restarts: If you didn't do this on purpose, please report it as a bug.)

(ELPREP:PARSE-SAM-HEADER #<unavailable argument>) [tl,external]
0]

BQSR filters out reads 'not in target region'

I noticed in the logs that when using the BQSR option, reads not in target region are filtered out. In our case this is unwanted behavior, since we intend to use the output bam as our only archived copy of the data. As such we would like to keep a maximum amount of data.
The read removal doesn't seem to happen when omitting BQSR, so I'm not sure if this is a feature or a bug.

Matthias

Originally posted by @matthdsm in #44 (comment)

Variant calling using `sfm` does not adhere to thread limit

Hi,

I noticed that when using 'sfm' mode, elprep uses more resources than allowed (e.g. all available threads instead of the given thread count). When using filter, this does not occur.

Cheers
Matthias

Bioconda recipe

Hi,

Following your talk at FOSDEM earlier this week, I've taken the liberty of creating a conda recipe for elPrep. I believe this will add another layer of visibility to the tool, and make it even easier to install.
Please feel free to modify/update/whatever.

bioconda/bioconda-recipes#7576

Cheers
Matthias

elprep sfm mode exits with out of Memory error

Thanks for the great work with elPrep! It has been really useful in cutting down analysis runtimes!

We have been running elPrep (4.1.5) on a WGS dataset to primarily use the mark duplicates and bqsr functionalities, with mixed success. A subset of samples work as expected while some are exiting with the following runtime out of memory error. Would greatly appreciate any inputs regarding this problem -

fatal error: runtime: out of memory

runtime stack:
runtime.throw(0x5f0611, 0x16)
        /opt/local/lib/go/src/runtime/panic.go:617 +0x72
runtime.sysMap(0x11748000000, 0x4000000, 0x78c078)
        /opt/local/lib/go/src/runtime/mem_linux.go:170 +0xc7
runtime.(*mheap).sysAlloc(0x7746c0, 0x2000, 0x7746d0, 0x1)
        /opt/local/lib/go/src/runtime/malloc.go:633 +0x1cd
runtime.(*mheap).grow(0x7746c0, 0x1, 0x0)
        /opt/local/lib/go/src/runtime/mheap.go:1222 +0x42
runtime.(*mheap).allocSpanLocked(0x7746c0, 0x1, 0x78c088, 0x7f9d8df1a888)
        /opt/local/lib/go/src/runtime/mheap.go:1150 +0x37f
runtime.(*mheap).alloc_m(0x7746c0, 0x1, 0x45002f, 0x7f9d8df1a888)
        /opt/local/lib/go/src/runtime/mheap.go:977 +0xc2
runtime.(*mheap).alloc.func1()
        /opt/local/lib/go/src/runtime/mheap.go:1048 +0x4c
runtime.systemstack(0x0)
        /opt/local/lib/go/src/runtime/asm_amd64.s:351 +0x66
runtime.mstart()
        /opt/local/lib/go/src/runtime/proc.go:1153

The targeted coverage for the dataset is 60X , the input BAM is roughly ~114GB. The command line taken from the logs -

/home/ubuntu/elPrep/elprep sfm INPUTBAM OUTPUTBAM --mark-duplicates --mark-optical-duplicates OUTPUTDUPMETRICS --sorting-order keep --bqsr OUTPUTRECAL --bqsr-reference ucsc.hg19.elfasta --known-sites <knownSiteFiles>

Unclipped position not present in SAM alignment

Hi,
I'm getting failed tasks when using --mark-optical-duplicates, and a message 'Unclipped position not present in SAM alignment'. Could this be the cause, or should I look for some other reason that the task is failing?

Thank you.

Estimate RAM usage based on input filesize

Hi,

I'm trying to find a way to get a rough estimate of how much ram I'll need to run elprep filter based on the size of the input bam.

Do you have any way of calculating this, e.g. for when submitting a job to some cloud provider?

Thanks
Matthias

"claws" dependency in .asd file not found.

compiling or loading the .asd system barfs on the missing "claws" ("claws" is not available in quicklisp and I wasn't sure what is actually is).

Marco

Proper running using sam input file?

Hi,

I read that elprep can work with .sam file input, and since it d oes coordinate sorting, I just mapped my reads to the elfasta converted reference, and used the first .sam file as input.
I am a little confused/concerned whether the final vcf would be correct due to the job log.
I used the following code:
elprep sfm AL91.sam AL91.output.bam --filter-unmapped-reads --nr-of-threads 28 --tmp-path $TMPDIR
--mark-duplicates --mark-optical-duplicates AL91.metrics
--sorting-order coordinate
--bqsr AL91.recal
--reference /users/PHS0338/jpac1984/data/myse-hapog.elfasta
--haplotypecaller AL91.vcf.gz

and the log- I thought for proper variant calling it had to first convert/sort the .sam and then split.
It has been ~16 hours and the only output is the AL91.recal and not a AL91.metrics out.

Here is the log.
elprep version 5.1.1 compiled with go1.16.7 - see http://github.com/exascience/elprep for more information.

2022/01/20 20:44:07 Created log file at /users/PHS0338/jpac1984/logs/elprep/elprep-2022-01-20-20-44-07-250202704-EST.log
2022/01/20 20:44:07 Command line: [elprep sfm AL91.sam AL91.output.bam --filter-unmapped-reads --nr-of-threads 28 --tmp-path /tmp/slurmtmp.17532726 --mark-duplicates --mark-optical-duplicates AL91.metrics --sorting-order coordinate --bqsr AL91.recal --reference /users/PHS0338/jpac1984/data/myse-hapog.elfasta --haplotypecaller AL91.vcf.gz]
2022/01/20 20:44:07 Executing command:
elprep sfm AL91.sam AL91.output.bam --filter-unmapped-reads --mark-duplicates --mark-optical-duplicates AL91.metrics --optical-duplicates-pixel-distance 100 --bqsr AL91.recal --reference /users/PHS0338/jpac1984/data/myse-hapog.elfasta --quantize-levels 0 --max-cycle 500 --haplotypecaller AL91.vcf.gz --sorting-order coordinate --nr-of-threads 28 --tmp-path /tmp/slurmtmp.17532726 --intermediate-files-output-prefix AL91 --intermediate-files-output-type sam
2022/01/20 20:44:07 Splitting...
2022/01/20 21:01:22 Filtering (phase 1)...
2022/01/20 21:29:00 Filtering (phase 2) and variant calling...

Hopefully, I am doing the proper procedure and not wasting time.

Best regards;

Juan

Feature discussion: avoiding samtools calls

Hi,

To further improve performance, would it be a good choice to avoid shelling out to samtools by using a go native HTSlib implementation? A quick google search brought https://github.com/biogo/hts to the surface.

Do you suppose it's viable to integrate this into elprep?

Cheers
M

Compilation problems

Hi,

When I tried to install your tools, I get the following error:
$ go get github.com/exascience/elprep
package github.com/exascience/elprep/v4/cmd: cannot find package "github.com/exascience/elprep/v4/cmd" in any of:
/usr/lib/go-1.7/src/github.com/exascience/elprep/v4/cmd (from $GOROOT)
/home/bvalot3/.go/src/github.com/exascience/elprep/v4/cmd (from $GOPATH

The GOPATH variable is set correctly:
$ echo $GOPATH
/home/bvalot3/.go

My configuration is:

Debian Strech
go version go1.7.4 linux/amd64

Thanks in advance

Is this a reasonable way to limit default threads used

We are very excited to try the new features in Version 5. Version 4 has saved us much time and complexity.

I know that by default GOMAXPROCS is set to all threads. This doesn't work well with our shared system. We have had people use a shell script to force a smaller default. However I was curious if the following change would work for the executable itself. I realize that duplicating the 16 is not the best software hygiene (should be a const var that could be changed just once) but this is an easy change for me to make:

cmd/filter.go:	flags.IntVar(&nrOfThreads, "nr-of-threads", 16, "number of worker threads")
cmd/merge.go:	flags.IntVar(&nrOfThreads, "nr-of-threads", 16, "number of worker threads")
cmd/merge-optical-duplicates-metrics.go:	flags.IntVar(&nrOfThreads, "nr-of-threads", 16, "number of worker threads")
cmd/sfm.go:	flags.IntVar(&nrOfThreads, "nr-of-threads", 16, "number of worker threads")
cmd/split.go:	flags.IntVar(&nrOfThreads, "nr-of-threads", 16, "number of worker threads")

best and thanks for this utility,

Jim Henderson, Cal Academy of Sciences

"exit status 2" error

Hey,

It's really a exciting variant calling tool. I would like to use for aligners benchmark. But I ran it on SSD and always got an error "exit status 2". I attached the log and the commands. Thanks.

Log:

2021/12/10 12:47:43 Splitting...
2021/12/10 12:54:06 Filtering (phase 1)...
2021/12/10 13:05:51 Filtering (phase 2) and variant calling...
2021/12/10 13:05:51 exit status 2

Command:

#1.	ref 
elprep fasta-to-elfasta /ssd-path-to-folder/data/hg37.fna \
/ssd-path-to-folder/data/elprep/hg37.elfasta
 
#2.	sites 
elprep vcf-to-elsites /ssd-path-to-folder/yan/variant-call/v2.19-out/truth_snp.recode.vcf \
/ssd-path-to-folder/data/elprep/hg37_snp.elsites
 
elprep vcf-to-elsites /ssd-path-to-folder/yan/variant-call/v2.19-out/truth_indels.recode.vcf \
/ssd-path-to-folder/data/elprep/hg37_indels.elsites
 
#3.	variant calling
elprep sfm /ssd-path-to-folder/yan/NA12878/accalign.mason.bam \
/ssd-path-to-folder/yan/NA12878/NA12878.output.bam \
--mark-duplicates --mark-optical-duplicates /ssd-path-to-folder/yan/NA12878/NA12878.output.metrics \
--sorting-order coordinate \
--bqsr /ssd-path-to-folder/yan/NA12878/NA12878.output.recal \
--known-sites /ssd-path-to-folder/data/elprep/hg37_indels.elsites,/ssd-path-to-folder/data/elprep/hg37_snp.elsites \
--reference /ssd-path-to-folder/data/elprep/hg37.elfasta \
--haplotypecaller accalign.gatk.vcf.gz

UMI's and duplicate marking

Hi,

Are there plans to support UMI based duplicate marking as is supported in Picard? We're looking into using UMI's in our shallow WGS pipeline and would like to stream into Elprep.

Thanks
Matthias

panic: interface conversion: interface {} is nil, not string

Using this command (from container quay.io/biocontainers/elprep:4.1.6--0):

    elprep sfm ${bam_file} ${bam_file.baseName}.elprep.bam  \\
    --mark-duplicates --mark-optical-duplicates output.metrics \\
    --bqsr output.recal \\
    --bqsr-reference ${genome_index} \\
    --known-sites ${known_sites} \\
    --nr-of-threads ${task.cpus} \\
    ${params.elprep_options}

I got the following error

 elprep merge /home/juke34/git/test/nf_global_editing/.nextflow/workDir/9e/36a20046eb015afb9584a1738d9f6c/elprep-splits-processed-2021-02-02T13:29:40Z/ /dev/stdout --nr-of-threads 2
panic: interface conversion: interface {} is nil, not string

goroutine 10 [running]:
github.com/exascience/elprep/v4/filters.readGroupCovariate(0xc00010d730, 0xc00019a000, 0x5ba740, 0xc000182200)
	/Users/costanza/develop/go/elprep/filters/bqsr.go:76 +0x205
github.com/exascience/elprep/v4/filters.BaseRecalibratorTables.ApplyBQSRWithMaxCycle.func1.2(0xc00019a000, 0xf01)
	/Users/costanza/develop/go/elprep/filters/bqsr.go:1416 +0x9d
github.com/exascience/elprep/v4/sam.ComposeFilters.func1(0x0, 0x5b2460, 0xc0001a37c0, 0x5b2460, 0xc0001a37c0)
	/Users/costanza/develop/go/elprep/sam/filter-pipeline.go:219 +0xcb
github.com/exascience/pargo/pipeline.feed(0xc000078380, 0xc00000dd60, 0x3, 0x4, 0x0, 0x0, 0x5b2ca0, 0xc00000dda0)
	/Users/costanza/go/pkg/mod/github.com/exascience/[email protected]/pipeline/filter.go:64 +0x5b
github.com/exascience/pargo/pipeline.(*lparnode).Begin.func1(0xc000078400, 0xc000078380, 0x0)
	/Users/costanza/go/pkg/mod/github.com/exascience/[email protected]/pipeline/lparnode.go:80 +0xb3
created by github.com/exascience/pargo/pipeline.(*lparnode).Begin
	/Users/costanza/go/pkg/mod/github.com/exascience/[email protected]/pipeline/lparnode.go:58 +0x188

error while trying to split

Hi.
I have a 100X human bam file (about 210G) that I want to mark duplicates inside.

I have a server with about 800Gb of RAM, but judging by your description it would be unsafe to try marking duplicates without splitting first.

When I try the splitting command like this:
elprep-v2.10/elprep split bam.merged.bam splitFiles --output-prefix bamMerged --output-type sam

I get the following error:
"view: invalid option -- '@'
open: No such file or directory
[main_samview] fail to open file for reading.
view: invalid option -- '@'
[main_samview] fail to open file for reading.
view: invalid option -- '@'
[main_samview] fail to open file for reading.
"
I have a hunch that there is a problem with version of samtools, but its only a guess. Do you have any ideas?

Issue during build of v5.1.0

It seems building from source doesn't work:
https://app.circleci.com/pipelines/github/bioconda/bioconda-recipes/53687/workflows/40af8e10-1d30-4541-9fba-0b3cc7ab98c3/jobs/170646

13:19:42 BIOCONDA INFO (OUT) go: downloading github.com/exascience/pargo v1.1.0
13:19:42 BIOCONDA INFO (OUT) go: downloading github.com/google/uuid v1.2.0
13:19:42 BIOCONDA INFO (OUT) go: downloading golang.org/x/sys v0.0.0-20210514084401-e8d321eab015
13:19:42 BIOCONDA INFO (OUT) go: downloading github.com/bits-and-blooms/bitset v1.2.0
13:19:44 BIOCONDA INFO (OUT) # github.com/exascience/elprep/v5/sam
13:19:44 BIOCONDA INFO (OUT) sam/split-merge.go:237:21: undefined: MergeInputs
13:19:44 BIOCONDA INFO (OUT) sam/split-merge.go:671:21: undefined: MergeInputs

How to control compression level of output?

Hi,
Is there a way to control compression level of output, similarly to picard MarkDuplicates?
Thanks.

elprep split behaviour with/without trailing slash on /path/to/output/

elPrep version 2.10 (LispWorks pre-compiled distribution)

I think the documentation for elprep split could be clearer on the importance of the trailing slash on the /path/to/output/. If you omit the trailing slash, everything appears to be written to a single file called e.g. /path/to/output.sam. If the slash is included, then the data are split correctly into the directory.

This could be considered a bug because the non-splitting behaviour is probably not what is intended.

exit status 2

Hi,
I run elprep. However I couldn't get a vcf file.

$ elprep sfm Sample.Aligned.sortedByCoord.out.bam Sample.output.bam --mark-duplicates --mark-optical-duplicates Sample.output.metrics --sorting-order coordinate --bqsr Sample.output.recal --known-sites dbsnp_138.hg19.elsites,Mills_and_1000G_gold_standard.indels.hg19.elsites --reference ucsc.hg19.elfasta --haplotypecaller Sample.output.vcf.gz

elprep version 5.0.2 compiled with go1.16.4 - see http://github.com/exascience/elprep for more information.

2021/06/30 19:13:41 Created log file at /home/nmc-ei/logs/elprep/elprep-2021-06-30-19-13-41-430755928-JST.log
2021/06/30 19:13:41 Command line: [elprep sfm Sample.Aligned.sortedByCoord.out.bam Sample.output.bam --mark-duplicates --mark-optical-duplicates Sample.output.metrics --sorting-order coordinate --bqsr Sample.output.recal --known-sites dbsnp_138.hg19.elsites,Mills_and_1000G_gold_standard.indels.hg19.elsites --reference ucsc.hg19.elfasta --haplotypecaller Sample.output.vcf.gz]
2021/06/30 19:13:41 Executing command:
 elprep sfm Sample.Aligned.sortedByCoord.out.bam Sample.output.bam --mark-duplicates --mark-optical-duplicates Sample.output.metrics --optical-duplicates-pixel-distance 100 --bqsr Sample.output.recal --reference ucsc.hg19.elfasta --quantize-levels 0 --max-cycle 500 --known-sites dbsnp_138.hg19.elsites,Mills_and_1000G_gold_standard.indels.hg19.elsites --haplotypecaller Sample.output.vcf.gz --sorting-order coordinate --intermediate-files-output-prefix Sample.Aligned.sortedByCoord.out --intermediate-files-output-type bam
2021/06/30 19:13:41 Splitting...
2021/06/30 19:15:39 Filtering (phase 1)...
2021/06/30 19:16:46 exit status 2

best and thanks for this utility,

eisuke

sfm sorting problem

elprep v 4.0.1 SFM functionality results in incorrectly sorted bam.

command : /opt/NGS/binaries/elPrep/4.0.1/elprep sfm 'DNA1802266B_recalibrated.bam' /home/shared_data_medgen_frax/Exome_CNV_TestRuns/CNV_tmp_files/dna1802266b/dna1802266b.full.bam --remove-duplicates --filter-mapping-quality '40' --log-path '/home/shared_data_medgen_frax/Exome_CNV_TestRuns/CNV_Job_Output/dna1802266b/' --filter-unmapped-reads --nr-of-threads '16' --sorting-order 'coordinate' --filter-non-overlapping-reads '/opt/NGS/References/hg19/BedFiles/WGS.bed' --replace-reference-sequences '/opt/NGS/References/hg19_masked_par/samtools/0.1.19/hg19.dict'

results in error when applying samtools index resulting.bam

[bam_index_core] the alignment is not sorted (ST-K00127:290:HNFYKBBXX:1:1216:29904:8330): 249240352 > 752696 in 2-th chr [bam_index_build2] fail to index the BAM file.

running elprep filter instead of elprep sfm results in correct sorting, so the problem seems to be related to the SFM merging. I also tried with --contig-group-size 1, instead of default, with similar errors as a result.

BaseRecalibrator maxCycle

Hello,

I was in JOBM2019 conference, and I have seen the presentation of your elprep4 tool and it looks pretty cool. I decided to test it on our data (PGM sequencing).
Unfortunately I have an issue I had with GATK base recalibrator about the number of cycle :

2019/07/11 14:11:09 cycle value exceeds maximum cycle value

In GATK there is an option "-maxCycle" that allow to avoid this error by increasing the default (500 by memory). I believe there is no such option in elprep. Am I wrong ?

Do you have any advice regarding my issue please ?

Thank you for your guidance.

Question about the paper

I read the article you published about the benchmarking with elPrep in C++, Java and Go.
I'm very interested also to know how fast was each version developed.

I would have used the mailing-list/forum, but I can't send any messages on it.

Warning: The --sqq optional flag is set without using --bqsr.

Hi,
I am trying to use elprep in order for whole exome sequencing.
I used this command :
elprep sfm /ngs/datagen/genetique/dev/TEST_GATK/ElPREP/19-16169-A-01-00.sort.bam /ngs/datagen/genetique/dev/TEST_GATK/ElPREP/19-16169-A-01-00.sort.elprep.bam --mark-duplicates --mark-optical-duplicates /ngs/datagen/genetique/dev/TEST_GATK/ElPREP/19-16169-A-01-00.sort.elprep.out_metrics --sorting-order coordinate --bqsr /ngs/datagen/genetique/dev/TEST_GATK/ElPREP/19-16169-A-01-00.sort.elprep.out_recal --sqq 10,20,30 --known-sites /ngs/datagen/genetique/dev/TEST_GATK/ElPREP/dbsnp_138.hg19.elsites,/ngs/datagen/genetique/dev/TEST_GATK/ElPREP/Mills_and_1000G_gold_standard.indels.hg19.elsites --bqsr-reference /ngs/datagen/genetique/dev/TEST_GATK/ElPREP/hg19.elfasta

While runing there is a warning massage at the end saying that the base recalibration is not requested, while I specify that I wanted it.
:
2019/07/09 11:07:53 Command line: [elprep filter /dev/stdin /ngs/datagen/genetique/dev/TEST_GATK/ElPREP/19-16169-A-01-00.sort.elprep.sfm.bam --sqq 10,20,30 --bqsr-apply /ngs/datagen/genetique/dev/TEST_GATK/ElPREP/elprep-tabs-2019-07-09T11:01:46+02:00/ --recal-file /ngs/datagen/genetique/dev/TEST_GATK/ElPREP/19-16169-A-01-00.sort.elprep.sfm.out_recal]
2019/07/09 11:07:53 Warning: The --sqq optional flag is set without using --bqsr. This parameter is ignored because base recalibration is not requested.
2019/07/09 11:07:53 Executing command:
elprep filter /dev/stdin /ngs/datagen/genetique/dev/TEST_GATK/ElPREP/19-16169-A-01-00.sort.elprep.sfm.bam --bqsr-apply /ngs/datagen/genetique/dev/TEST_GATK/ElPREP/elprep-tabs-2019-07-09T11:01:46+02:00/ --quantize-levels 0 --recal-file /ngs/datagen/genetique/dev/TEST_GATK/ElPREP/19-16169-A-01-00.sort.elprep.sfm.out_recal --sorting-order keep
2019/07/09 11:07:54 Executing command:
elprep merge /ngs/datagen/genetique/dev/TEST_GATK/ElPREP/elprep-splits-processed-2019-07-09T11:01:46+02:00/ /dev/stdout

If I don't use sfm but filter instead, I have no warning massage. Does it means that the bqsr is not done with sfm ?

Thanks

How to call population snps/indels?

Hi,

I have 2 questions about haplotypecaller.

How to excute haplotypecaller function only? I have analysis-ready bams and don't need to do any other pre-process step.
How to use elprep to call population snps/indels (input multiple bams and output single vcf)?

Best,
Kun

Converting a dbsnp vcf to elsites uses a lot of resources

While converting a dbsnp vcf to elsites I noticed the process took over the whole node (80 threads) and consumed about 320GB ram.
Perhaps some kind of warning should be in place so unsuspecting users don't crash their servers trying to prepare data.
Alternatively, an argument to limit resource usage could be added.

Matthias

Originally posted by @matthdsm in #44 (comment)

More memory needed than samtools on 1CPU

I was trying to sot a huge file, I didn't succeed with elprep (I tried with different combination of CPU even 1). I went back to samtools and succeeded (on 1 CPU). Did you get any similar feedback that elprer is using more memory than sametools in some cases?

[RFE] haplotypecaller: support for custom GVCF GQ slices

Hi,

GATK4 allows to specify custom boundaries for the GQ slices used to group consecutive non-variant positions into a non-variant block, with --gvcf-gq-bands . This feature is quite useful IMO, as the resulting GVCFs can be much smaller (and much faster to work with downstream). The only "loss" is fine-grained resolution of GQ slices for non-variant blocks, but only in the GQ ranges that the user decided he doesn't care about.
It would be great if a similar feature could be implemented in elPrep!

Regards,
Nicolas

Secondary .bai files

Hi,
Is there a possibility for elprep to generate index .bai files? Or do you maybe plan on including this functionality, as the secondary index files are often requested by the tools for variant calling.

Thank you.

Comment: handling RG's in SFM

With regard to handling RG's when merging different bams in sfmmode as mentioned here.
Would it be possible to just add multiple @RG headers to the bam file?
I've noticed bamsormadup does just that, while preserving the RG tag in the reads. This indirectly also leads to "multi-sample" support in the bam file.
In our current case we merge single sample bam files from different lanes, so we'd expect to see those lanes represented in the RG header lines, which would then correspond to the RG tag in the reads.

Just spitballing here.
Cheers
M

MapReduce markdup

I’m wanting to use this tool in a mapreduce engine that I use (it’s not Hadoop it’s container based).

I’m wondering about how I might go about marking duplicates ( similar to Picard markdup) in a sharded manner. Currently, I have a pipeline that bins unmapped reads, maps them, and then merges the mapped bam file. I then shard the bam file into padded bins of 1Mb. I’m wondering if it’s possible to use your markdup tool on individual shards. Do you see any way forward here? Reads are currently sorted by chr coordinate, but maybe sorting by read name makes this possible?

Doing just variant calling /haplotypecaller? non-model organism no known-variant file for bqsr

Hi,
I got interested in elprep5 because I have been trying GATK4 but it is taking more than five days which is ridiculous.
I wanted to ask two questions. Since I have tried GATK4 I have a sorted/markeduplicated/bam file. Is there a way to just perform variant calling using elprep?

Also, I have tried to do the mapping and converting to get a .bam file as input, I am using the following job script:

./elprep sfm PA113corr.bam --mark-duplicates --mark-optical-duplicates PA113corr.metrics --sorting-order coordinate
--bqsr PA113corr.recal --haplotypecaller PA113corr.vcf.gz --reference Autosome.elfasta

and I still get and error
elprep version 5.0.2 compiled with go1.16.4 - see http://github.com/exascience/elprep for more information.
2021/06/06 03:18:42 Filename(s) in command line missing.

Thanks;

Feature request: Add quality filter

Hi,

We're very excited about elPrep, but we are missing a vital filter flag for it to be drop in replacement for our current pipeline.

Would it be possible to add a quality score filter, similar to the -q flag in samtools view?
Additionally, would it be possible to add a filter flag where the user can set a custom value to filter?

Thanks a lot
Matthias

Large amount of diskspace required

We are running a few benchmarks on the MarkDuplicates features of ElPrep (v4.1.5) on WGS samples in isolation.

We started with a 'small' WGS sample with a mean coverage of 41 (BAM: 60GB - Mean Cov. 41+/-15, Median 43). Your paper would suggest it required 1.6 times that of GATK/Picard. We found it needed nearly 4.5 times the space (~550 GB).

A large WGS sample with a mean coverage of 70 (BAM: 101GB - Mean Cov. 70+/-24, Median 74) needed 4.7 times the disk space (~950GB).

The scaling means that we would not be able to use ElPrep on larger WGS samples (120X). We replicated the call on two different compute clusters. Both calls had the same required disk space. Is there something we are missing?

Call

/usr/bin/time --verbose elprep sfm "$local_file_path" "$output_file" \
    --mark-duplicates \
    --mark-optical-duplicates "$metrics_file" \
    --optical-duplicates-pixel-distance 2500 \
    --sorting-order keep \
    --log-path $(pwd) \
    --nr-of-threads 12 \
    --tmp-path $TMPDIR \
    --timed

Method

A du --bytes $TMPDIR ran in the background with an interval of 5 minutes in order to determine the maximum used disk storage.

panic: runtime error: index out of range

I was trying to run elprep with most of the filters on:

2019/01/31 09:59:09 Executing command:
 elprep filter /data/input/DX123_HFJ5KDSXX_L2.sam /data/output/DX123_HFJ5KDSXX_L2.bam --mark-duplicates --mark-optical-duplicates /data/output/DX123_HFJ5KDSXX_L2.opticaldups.txt --optical-duplicates-pixel-distance 100 --remove-duplicates --bqsr /data/output/DX123_HFJ5KDSXX_L2.bqsr.txt --bqsr-reference /data/reference/hg19/ucsc_hg19.fa.elprep --quantize-levels 0 --sorting-order coordinate --nr-of-threads 30 --log-path /data/output
panic: runtime error: index out of range
goroutine 166040 [running]:
runtime/debug.Stack(0xdcc8256380, 0x0, 0xc00011e700)
	/opt/local/lib/go/src/runtime/debug/stack.go:24 +0xa7
github.com/exascience/pargo/internal.WrapPanic(0x5aa000, 0x741450, 0x741450, 0xe1cf34ab)
	/Users/caherzee/go/pkg/mod/github.com/exascience/[email protected]/internal/internal.go:41 +0x45
github.com/exascience/pargo/parallel.RangeReduce.func1.1.1(0xd559ba65a0, 0xd55ce65a50)
	/Users/caherzee/go/pkg/mod/github.com/exascience/[email protected]/parallel/parallel.go:803 +0x43
panic(0x5aa000, 0x741450)
	/opt/local/lib/go/src/runtime/panic.go:513 +0x1b9
github.com/exascience/elprep/v4/filters.computeSnpEvents(0xd0c194a3c0, 0x0, 0x0, 0x0, 0xd0c1933800, 0x1e, 0x100, 0x4, 0x1e, 0x100)
	/Users/caherzee/Documents/Work/Code/elprep/filters/bqsr.go:326 +0x3b3
github.com/exascience/elprep/v4/filters.(*BaseRecalibrator).Recalibrate.func2(0x457055b, 0x469da0b, 0xc00004a220, 0xc00004aca0)
	/Users/caherzee/Documents/Work/Code/elprep/filters/bqsr.go:952 +0x65a
github.com/exascience/pargo/parallel.RangeReduce.func1(0x457055b, 0x469da0b, 0x1, 0xd55ce65a50, 0x96)
	/Users/caherzee/go/pkg/mod/github.com/exascience/[email protected]/parallel/parallel.go:789 +0x2ba
github.com/exascience/pargo/parallel.RangeReduce.func1.1(0xd559ba65a0, 0xd55ce65a50, 0xd0c21c34e8, 0x457055b, 0x469da0b, 0x2, 0x1, 0xd559ba6590)
	/Users/caherzee/go/pkg/mod/github.com/exascience/[email protected]/parallel/parallel.go:806 +0x83
created by github.com/exascience/pargo/parallel.RangeReduce.func1
	/Users/caherzee/go/pkg/mod/github.com/exascience/[email protected]/parallel/parallel.go:801 +0x1d2

The sam file was created by mapping raw reads to reference using bwa
Reference was generated by elprep fasta-to-elfasta

>uname -a
Linux hostname 2.6.32-696.28.1.el6.x86_64 #1 SMP Wed May 9 23:09:02 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux

Any ideas?

UPDATE:
It seemed there is something wrong with snp event computing, so I tried to pass to --known-sites with elsites compiled from dbsnp by elprep vcf-to-elsites, which didn't help.

Behaviour of markduplicates wrt pairs where one read is mapped to another c'some

I've used elprep split to split a BAM file by chromosome. The test chunk for one chromosome has these counts:

samtools flagstat in.bam
198799 + 0 in total (QC-passed reads + QC-failed reads)
0 + 0 secondary
0 + 0 supplementary
0 + 0 duplicates
198799 + 0 mapped (100.00%:-nan%)
198799 + 0 paired in sequencing
119421 + 0 read1
79378 + 0 read2
115896 + 0 properly paired (58.30%:-nan%)
178706 + 0 with itself and mate mapped
20093 + 0 singletons (10.11%:-nan%)
61880 + 0 with mate mapped to a different chr
21845 + 0 with mate mapped to a different chr (mapQ>=5)

I've run elprep on this chunk:

elprep --mark-duplicates --sorting-order keep in.bam test1.sam

On completion, I run samtools flagstat on the elprep result:

samtools flagstat test1.sam 
136919 + 0 in total (QC-passed reads + QC-failed reads)
0 + 0 secondary
0 + 0 supplementary
84287 + 0 duplicates
136919 + 0 mapped (100.00%:-nan%)
136919 + 0 paired in sequencing
64918 + 0 read1
72001 + 0 read2
115896 + 0 properly paired (84.65%:-nan%)
116826 + 0 with itself and mate mapped
20093 + 0 singletons (14.68%:-nan%)
0 + 0 with mate mapped to a different chr
0 + 0 with mate mapped to a different chr (mapQ>=5)

As a comparison I've run bammarkduplicates from https://github.com/gt1/biobambam. The result of samtools flagstat for this is:

samtools flagstat test2.bam
198799 + 0 in total (QC-passed reads + QC-failed reads)
0 + 0 secondary
0 + 0 supplementary
84287 + 0 duplicates
198799 + 0 mapped (100.00%:-nan%)
198799 + 0 paired in sequencing
119421 + 0 read1
79378 + 0 read2
115896 + 0 properly paired (58.30%:-nan%)
178706 + 0 with itself and mate mapped
20093 + 0 singletons (10.11%:-nan%)
61880 + 0 with mate mapped to a different chr
21845 + 0 with mate mapped to a different chr (mapQ>=5)

I didn't ask elprep to remove any reads, so I was surprised to find the number in the output was reduced compared to the input. Comparing the three results, I can see that elprep and bammarkduplicates agree on the duplicate count. Subtracting the number of reads after elprep from the initial value (198799 - 136919 = 61880) yields the number of reads with mate mapped to a different chr.

From this I understand that it removes reads with mates mapped to a different chr. There was no second file written, so are they discarded? Am I using elprep correctly in this case? Thanks.

elprep split fails to split into CRAM format

elPrep version 2.10 (LispWorks pre-compiled distribution)

split sam-file /path/to/output/ 
 [--output-prefix name] 
 [--output-type [sam | bam | cram]] 
 [--nr-of-threads nr]

When I select CRAM, I get an error reminding me to specifiy a reference:

An error occurred in elPrep 2.10: When creating CRAM output, either a reference-fasta or a reference-fai must be provided.

However, when I add the required arguments, elprep complains:

elprep split --reference-t Homo_sapiens.GRCh37.NCBI.allchr_MT.fa.fai --output-prefix foo_n --output-type cram --nr-of-threads 12 foo_n.bam  ./elprep/split/
elPrep version 2.10. See http://github.com/exascience/elprep for more information.
Incorrect number of parameters: Expected 2, got 4 (--reference-t Homo_sapiens.GRCh37.NCBI.allchr_MT.fa.fai foo_n.bam ./elprep/split/).
split sam-file /path/to/output/ 
 [--output-prefix name] 
 [--output-type [sam | bam | cram]] 
 [--nr-of-threads nr]

elprep sfm run failed due to insufficient RAM

Hi,
Would it be possible to make elPrep sfm more stable by not failing it due to insufficient RAM memory, but to maybe just prolong its runtime? It is really high resource demanding, e.g. elprep has a failed run with a 150x bam file of 113GB on an instance with 92vCPUs and 192GB RAM.
Thank you.

Feature request: working with limits on VMEM

It would be grand if ElPrep could somehow be made to work with limited virtual memory.

Our cluster consists for 90% of nodes that have ~360GB RAM available. In addition to that we process whole-genome BAM files that are around 90-120X Coverage.

From our ElPrep benchmark we guestimate that ElPrep will need between 400~550GB VMEM to run most of our samples. This would negate the runtime benefit because the jobs would spend more time in the shared-cluster queue.

Hence a feature request to be able to work with limited memory without crashing:
runtime: out of memory: cannot allocate 8192-byte block

One could imagine to have ElPrep only start a sfm thread if sufficient memory is available and have the limit set via memory limit.

I doubt we are the only one that would love to take ElPrep into production but are limited by hardware and are looking for a somewhat grey area in the resources X runtime landscape.

Looking forward to your thoughts on the matter.

Cheers,
Chris