morispi / lrez

Standalone tool and library for working with barcoded linked reads

License: GNU Affero General Public License v3.0

Makefile 0.96% Shell 0.02% C++ 92.78% C 5.59% Python 0.65%

barcode barcodes linked-reads linked reads 10x 10xgenomics 10x-genomics index haplotagging

lrez's Issues

"stoi" error when indexing bam positions

This Leviathan issue is in fact an LRez issue.

query bam with option `-H` generates malformatted SAM

I am using LRez query bam with the option -H to include a header in the output. However, it adds a blank line between the header and the alignments, which makes the SAM malformed. See the example below:

...
@PG	ID:samtools.3	PN:samtools	CL:samtools cat -o final.bam chunks/chrA.calling.bam unmapped.bam	PP:samtools.2	VN:1.12

A00621:130:HN5HWDSXX:4:1368:15555:22889	163	chrA	26885	60	137M	=	26959	137	CAAGTGATCTGCCCACCTCGGCCTCCCAAAGTGCTGGGATTACACGTGTGAACCACCATGCCTGGTCTCTAATTTTTCTGATTCTATAAAATTACATTCTATTTGCTGAAAGAGTACTTTAGAGTTGAAGAAAAAGA	FFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF,FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF	XF:i:0	PG:Z:MarkDuplicates	RG:Z:1	XG:f:1	NM:i:0	BX:Z:CTTGGTCATTCATACAGTCC-1	MI:i:198
...
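Until the blank line is fixed upstream, one possible workaround is to filter the output before passing it to other tools. The following is a minimal sketch, not part of LRez: the helper name `drop_blank_lines` and the in-memory example record are made up for illustration. It relies only on the fact that a valid SAM file contains no empty lines.

```python
import io


def drop_blank_lines(lines):
    """Yield SAM lines, skipping any that are empty or whitespace-only."""
    for line in lines:
        if line.strip():
            yield line


# Tiny in-memory example (hypothetical data): one header line,
# a stray blank line, then one alignment record.
sam_text = (
    "@HD\tVN:1.6\tSO:coordinate\n"
    "\n"
    "read1\t0\tchrA\t1\t60\t4M\t*\t0\t0\tACGT\tFFFF\n"
)
cleaned = "".join(drop_blank_lines(io.StringIO(sam_text)))
```

On a real file the same generator can wrap `sys.stdin` and write to `sys.stdout`, so it can sit in a pipe between the query output and the next tool.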

LRez does not handle Haplotagging and stLFR barcodes ending with "-1"

LRez assumes that a "-1" ending on the barcode in the BX tag of the BAM file is specific to 10X Genomics linked-read data.
However, it seems that some mappers (such as EMA) add "-1" at the end of barcodes regardless of the linked-read technology.
Be careful: this does not raise any error, but it results in all barcodes being encoded with the same key value (so no indexing).

Could we handle "-1" endings better in LRez?

Claire

Unrecognized sequencing technology

Hello,

I'm trying to use LEVIATHAN and require the barcode indices from LRez. I introduced a step in my workflow that uses samtools view to filter reads on mapping quality, and it seems that doing so has caused LRez to no longer recognize the BX:Z: tags (where it did previously). These are haplotagging data, where the index is AXXCXXBXXDXX.

$ LRez index bam -p -b 2A_3_221221_15x.bam -o test.bci
determineSequencingTechnology: Unrecognized sequencing technology. Please make sure your barcodes originate from a compatible technology or are reported as nucleotides in the BX:Z tag.

Unless I'm mistaken, my bam files are formatted normally:

$ samtools view -h 2A_3_221221_15x.bam | head -18
@HD     VN:1.6  SO:coordinate
@SQ     SN:2L   LN:23513712
@SQ     SN:2R   LN:25286936
@SQ     SN:3L   LN:28110227
@SQ     SN:3R   LN:32079331
@SQ     SN:4    LN:1348131
@SQ     SN:X    LN:23542271
@SQ     SN:Y    LN:3667352
@RG     ID:2A_3_221221_15x      SM:2A_3_221221_15x
@PG     ID:bwa  PN:bwa  CL:bwa mem -C -t 6 -M -R @RG\tID:2A_3_221221_15x\tSM:2A_3_221221_15x Assembly/Assembly/dmel.trunc.fa Trimming/2A_3_221221_15x.R1.fq.gz Trimming/2A_3_221221_15x.R2.fq.gz  VN:0.7.17-r1188
@PG     ID:samtools     PN:samtools     CL:samtools view -h -F 4 -q 30 -t Assembly/Assembly/dmel.trunc.fa.fai -T Assembly/Assembly/dmel.trunc.fa -   PP:bwa  VN:1.17
@PG     ID:samtools.1   PN:samtools     CL:samtools sort -T Alignments/bwa/2A_3_221221_15x --reference Assembly/Assembly/dmel.trunc.fa -O bam -m 4G -o Alignments/bwa/2A_3_221221_15x.sort.bam -  PP:samtools     VN:1.17
@PG     ID:sambamba     CL:markdup -t 4 -l 4 Alignments/bwa/2A_3_221221_15x.sort.bam Alignments/bwa/2A_3_221221_15x.bam      PP:samtools.1   VN:1.0
@PG     ID:samtools.2   PN:samtools     PP:sambamba     VN:1.17 CL:samtools view -h 2A_3_221221_15x.bam
A00470:481:HNYFWDRX2:1:2177:6090:16266  1123    2L      4831    40      80M     =       5088407      CCAACATATTGTGCTAATGAGTGCCTCTCGTTCTCTGTCTTATATTACCGCAAACCCAAAAAGACAATACACGACAGAGA    FF,FFFF::FFFFF:FFFF:FFF:FFFFFF:FFFFFFFFFFF:FFF:FFFFF,FFFF,F,FFFFFF:FFFFFFFFFFFFF NM:i:0  MD:Z:80      MC:Z:150M       AS:i:80 XS:i:80 RG:Z:2A_3_221221_15x    BX:Z:A95C26B84D96 1:N:0:TATCAGTA+TTACTACT
A00470:481:HNYFWDRX2:1:2207:9426:24267  1123    2L      4831    40      80M     =       5112407      CCAACATATTGTGCTAATGAGTGCCTCTCGTTCTCTGTCTTATATTACCGCAAACCCAAAAAGACAATACACGACAGAGA    FF:F,F,FFF,FFFF,FFFFF::FFFFFFF,FFFF:FFFFFFF:FFFFF:FFF:F:FF,FFF,FF:F,FF,:FFFF,:,F NM:i:0  MD:Z:80      MC:Z:22S126M    AS:i:80 XS:i:80 RG:Z:2A_3_221221_15x    BX:Z:A95C26B84D96 1:N:0:TATCAGTA+TTACTACT
A00470:481:HNYFWDRX2:1:2254:21902:6699  1123    2L      4831    40      80M     =       5088407      CCAACATATTGTGCTAATGAGTGCCTCTCGTTCTCTGTCTTATATTACCGCAAACCCAAAAAGACAATACACGACAGAGA    FFF,F:FFF,FFFFF,FFFFFF:FFFFFFF:FFFFFFFFFF:FFFFFF:FFFFFFFF,FFFF,F::,FFFFF,:FF,FFF NM:i:0  MD:Z:80      MC:Z:150M       AS:i:80 XS:i:80 RG:Z:2A_3_221221_15x    BX:Z:A95C26B84D96 1:N:0:TATCAGTA+TTACTACT
A00470:481:HNYFWDRX2:1:2273:24334:6433  1123    2L      4831    40      80M     =       5088407      CCAACATATTGTGCTAATGAGTGCCTCTCGTTCTCTGTCTTATATTACCGCAAACCCAAAAAGACAATACACGACAGAGA    FFFFFF,F:FFFFFFF:FF:F:F,FFFFFFFFFFFF:FFFFFFFFFFFFFFF:FFFFFFFFFFF:FF,FFFF:FFFFFFF NM:i:0  MD:Z:80      MC:Z:150M       AS:i:80 XS:i:80 RG:Z:2A_3_221221_15x    BX:Z:A95C26B84D96 1:N:0:TATCAGTA+TTACTACT

Do you have insights to provide on this that may reveal a mistake on my end or a bug in LRez?
While the workflow is listed in the @PG tags, the steps are:

map with bwa mem
filter with samtools view
sort and convert to bam with samtools sort
mark duplicates with sambamba markdup
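When debugging a message like the one above, it can help to look at exactly what sits in the BX:Z field of each record, including anything that follows the barcode inside the same field. The sketch below is only a diagnostic aid and does not reproduce LRez's detection logic. The two regular expressions are assumptions: one for nucleotide barcodes with an optional "-<integer>" suffix, one for the AXXCXXBXXDXX haplotagging form with two digits per segment. The function names are hypothetical.

```python
import re

# Assumed patterns, for illustration only (not taken from the LRez source).
NUCLEOTIDE = re.compile(r"[ACGTN]+(-\d+)?")
HAPLOTAG = re.compile(r"A\d{2}C\d{2}B\d{2}D\d{2}")


def bx_value(sam_line):
    """Return the BX:Z value of a tab-separated SAM record, or None if absent."""
    for field in sam_line.rstrip("\n").split("\t")[11:]:
        if field.startswith("BX:Z:"):
            return field[len("BX:Z:"):]
    return None


def classify_bx(value):
    """Label a BX value as 'nucleotide', 'haplotagging' or 'unrecognized'."""
    if NUCLEOTIDE.fullmatch(value):
        return "nucleotide"
    if HAPLOTAG.fullmatch(value):
        return "haplotagging"
    return "unrecognized"


# Hypothetical record whose BX field carries extra text after the barcode.
record = "r1\t0\t2L\t1\t40\t4M\t*\t0\t0\tACGT\tFFFF\tBX:Z:A95C26B84D96 1:N:0:TATCAGTA+TTACTACT"
label = classify_bx(bx_value(record))
```

With these assumed patterns, a bare `A95C26B84D96` is labelled as haplotagging, while the same barcode followed by extra text in the same field is not. Whether that is what happens in the records pasted above cannot be told from the paste alone, since the whitespace may have been altered.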

"LRez index fastq": unable to index a FASTQ file not gzipped

I am trying to index a non-gzipped FASTQ file with the command LRez index fastq, but it returns an error.
I have tried with two different FASTQ files, and it returns an error for both.

FASTQ file 1:
LRez index fastq --fastq NA24385_phased_possorted.fastq --output NA24385_phased.shelve
Error returned:
"gzIndex: Untable to open gzip index for file NA24385_phased_possorted.fastqi for reading. Please make sure the gzip index file exists.: iostream error"

FASTQ file 2:
LRez index fastq --fastq stLFR_NA24385.sort.rmdup_barcodes_extracted.fastq.gz --output stLFR_NA24385.shelve
Error returned:
"gzIndex: could not open stLFR_NA24385.sort.rmdup_barcodes_extracted.fastq.gz for reading. Please make sure the file exists.: iostream error"
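Both messages mention a gzip index, so before indexing it may be worth confirming whether each input really is gzip-compressed, independent of its file extension. The sketch below checks the two-byte gzip magic number (0x1f 0x8b) at the start of the file. The helper name `is_gzipped` and the temporary demo files are hypothetical, and the check says nothing about how LRez itself opens files.

```python
import gzip
import os
import tempfile

GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of every gzip stream


def is_gzipped(path):
    """Return True if the file at `path` starts with the gzip magic number."""
    with open(path, "rb") as handle:
        return handle.read(2) == GZIP_MAGIC


# Demo on two throwaway files holding the same FASTQ record.
with tempfile.TemporaryDirectory() as tmp:
    plain = os.path.join(tmp, "reads.fastq")
    packed = os.path.join(tmp, "reads.fastq.gz")
    record = b"@r1\nACGT\n+\nFFFF\n"
    with open(plain, "wb") as handle:
        handle.write(record)
    with gzip.open(packed, "wb") as handle:
        handle.write(record)
    plain_is_gz = is_gzipped(plain)
    packed_is_gz = is_gzipped(packed)
```

A file that fails this check but carries a `.gz` extension, or the reverse, is a likely source of confusion for any tool that decides how to read the file from its name.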

bam query slow

I have indexed a quite large barcoded BAM file (~220 Gb) using LRez, and now I want to query it for barcodes. I have several lists of about 2000 barcodes each. Unfortunately, the queries are very slow. If I read the paper correctly, queries of about 1000 barcodes took at most 10 min. For me, a query with a 2000-barcode list has been running for almost 3 hours without finishing.

Commands

# Index
LRez index bam -b file.bam -o file.bam.bci -f -t 10

# Query
LRez query bam -b file.bam -i file.bam.bci -l list.bxu -o list.bam -t 10 -H

Below is the memory/CPU usage. I am running two query commands in parallel with 10 threads each.

I guess the initial sharp increase in memory comes from loading the index (about 55 Gb on disk), which seems to take about 10 min. After that it is presumably doing index lookups for the list of barcodes, which is taking much longer than I would expect. Any idea why this is so slow?

As a side note, core utilisation seems quite poor, with only about 1 core per process being used.

Include barcode integer suffix in index.

Relates to #6.

As noted in the longranger docs (below), the suffix number can be any integer, not just "-1", as it is meant to allow merging different 10X libraries into the same BAM.

The BX tag includes a suffix with a dash separator followed by a number:
AGAATGGTCTGCATCG-1
This number denotes what we call a GEM group, and is used to virtualize barcodes in order to achieve a higher effective barcode diversity when combining samples generated from separate GEM chip channel runs. Normally, this number will be "1" across all barcodes when analyzing a sample generated from a single GEM chip channel. It can either be left in place and treated as part of a unique barcode identifier, or explicitly parsed out to leave only the barcode sequence itself.

I ran into this issue when trying to run LRez index bam on a BAM with multiple libraries, which resulted in the following error:

determineSequencingTechnology: Unrecognized sequencing technology. Please make sure your barcodes originate from a compatible technology or are reported as nucleotides in the BX:Z tag.

From what I can understand from the code, this suffix is currently not included in the index. For LRez to work with BAMs that contain multiple libraries, this would need to be fixed.
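To illustrate the parsing described in the longranger excerpt, here is a minimal sketch that separates a BX value into the barcode itself and an optional integer suffix. It is not LRez code. The function name `split_barcode` is hypothetical, and it assumes that the suffix, when present, is a dash followed by one or more digits at the very end of the value.

```python
import re

# Assumption: the suffix is "-<digits>" at the end of the BX value.
_SUFFIX = re.compile(r"^(?P<seq>.+?)-(?P<group>\d+)$")


def split_barcode(bx):
    """Split a BX value into (barcode, group); group is None without a suffix."""
    match = _SUFFIX.match(bx)
    if match is None:
        return bx, None
    return match.group("seq"), int(match.group("group"))


pairs = [split_barcode(b) for b in ("AGAATGGTCTGCATCG-1", "AGAATGGTCTGCATCG-2")]
```

Keeping the integer alongside the sequence means two reads with the same sequence but different suffixes stay distinct, which is the "treated as part of a unique barcode identifier" option in the excerpt. Dropping it corresponds to the "explicitly parsed out" option.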
