I have indexed a quite large barcoded BAM (~220 Gb) file using LRez and now I want to


bam query slow about lrez HOT 5 OPEN

morispi commented on June 23, 2024

bam query slow


Comments (5)

clemaitre commented on June 23, 2024

Hi,

Thank you for the feedback.

We have already encountered a similar slowness problem with a particular linked-read dataset (from TELL-seq) in which the barcode distribution was excessively skewed: most barcodes appeared on only one or two read pairs, while a very small number of barcodes were shared by hundreds of thousands of read pairs. In this case, if the list contains a barcode with such an excessive number of reads to extract, the query can be very slow and multi-threading does not seem to help, since the parallelization is performed by splitting the barcode list.

Could this be the case for your dataset?
Which linked-read technology was used for your dataset?

Best,
Claire
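
The imbalance described in this comment can be illustrated with a small simulation. This is a minimal sketch, not LRez code: the function names, the contiguous splitting strategy, and the read counts are all assumptions made for illustration. It only shows why one very large barcode can dominate the run time when work is divided by barcode.

```python
def split_evenly(items, n_chunks):
    """Split a list into n_chunks contiguous chunks of near-equal length."""
    size, extra = divmod(len(items), n_chunks)
    chunks, start = [], 0
    for i in range(n_chunks):
        end = start + size + (1 if i < extra else 0)
        chunks.append(items[start:end])
        start = end
    return chunks


def per_thread_work(reads_per_barcode, n_threads):
    """Total number of reads each thread must extract (hypothetical cost model)."""
    return [sum(chunk) for chunk in split_evenly(reads_per_barcode, n_threads)]


# Hypothetical skewed dataset: 999 barcodes with 2 reads, one with 200,000 reads.
counts = [2] * 999 + [200_000]
work = per_thread_work(counts, 4)
share_of_busiest = max(work) / sum(work)

print(work)
print(f"busiest thread handles {share_of_busiest:.1%} of all reads")
```

Under these assumptions, three of the four threads finish almost immediately and the total run time is set by the single thread that received the large barcode.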


pontushojer commented on June 23, 2024

Hi Claire,

We are using DBS linked reads as described here. The resulting data is quite similar to TELL-Seq but with 20 bp barcodes. The distribution of reads per barcode is certainly skewed in the way you describe. So you are saying that barcodes with a high number of reads would cause this slowdown? The upper-end barcodes in my dataset have on the order of ~10,000 reads associated with them, so not quite hundreds of thousands. For this number of queries per barcode, do you think this is still a problem?

Regarding the multithreading, you say it is parallelised over the list of barcodes? If that were the case, I would expect an initially high CPU load with a gradual decrease as only the barcodes with many reads remain. In my case the load is more or less constant.

As a side note, would it not be preferable to parallelise over offsets instead of barcodes? This would be more efficient regardless of the linked-read technology. It would also help when doing single queries.

Ps. The queries in question finished after about 5.5 hours.
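
The suggestion to parallelise over offsets rather than barcodes can be sketched as follows. This is an illustration under stated assumptions, not the LRez implementation: `index` is a hypothetical in-memory mapping from barcode to the list of file offsets of its reads, and each offset is assumed to cost one unit of work. Sorting the offsets before chunking is an additional assumption, made so that each worker would read a contiguous region of the file.

```python
def tasks_by_barcode(index, barcodes, n_workers):
    """Assign whole barcodes to workers round-robin; return offsets per worker."""
    chunks = [barcodes[i::n_workers] for i in range(n_workers)]
    return [[off for bc in chunk for off in index[bc]] for chunk in chunks]


def tasks_by_offset(index, barcodes, n_workers):
    """Flatten all offsets, sort them, and give each worker an equal slice."""
    offsets = sorted(off for bc in barcodes for off in index[bc])
    size = -(-len(offsets) // n_workers)  # ceiling division
    return [offsets[i * size:(i + 1) * size] for i in range(n_workers)]


# Hypothetical index: one large barcode and three small ones.
index = {
    "AAAA": list(range(0, 10_000)),
    "CCCC": [10_000, 10_001],
    "GGGG": [10_002],
    "TTTT": [10_003, 10_004, 10_005],
}
barcodes = list(index)

loads_barcode = [len(t) for t in tasks_by_barcode(index, barcodes, 4)]
loads_offset = [len(t) for t in tasks_by_offset(index, barcodes, 4)]
print(loads_barcode)
print(loads_offset)
```

With this toy index, splitting by barcode leaves one worker with almost all of the offsets, while splitting by offset balances the load regardless of how skewed the barcodes are. It would also let a single-barcode query use several workers.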


pontushojer commented on June 23, 2024

I ran LRez stats on the output BAM to check the distribution of reads per barcode for one of the query lists that finished. This is the output:

Number of barcodes: 2128
Number of mapped reads: 2328152

Number of reads per barcode:
	 1st quantile: 53
	 median: 170
	 3rd quantile: 591

That's about 1,100 reads per barcode on average (2,328,152 reads / 2,128 barcodes), but with a median of 170 the distribution is quite skewed.
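
The arithmetic behind this comparison can be checked directly from the reported numbers. The snippet below uses only the values printed above; the variable names are illustrative.

```python
# Values taken from the LRez stats output above.
n_barcodes = 2128
n_mapped_reads = 2_328_152
median_reads = 170

mean_reads = n_mapped_reads / n_barcodes
mean_to_median = mean_reads / median_reads

print(f"mean reads per barcode: {mean_reads:.0f}")
print(f"mean / median ratio: {mean_to_median:.1f}")
```

A mean several times larger than the median is a simple indicator of a right-skewed distribution, where a few barcodes carry a large share of the reads.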


clemaitre commented on June 23, 2024

Hi,

Thank you for the information on this technology. We were somewhat aware of its existence, but we have never tested LRez on such datasets.

On the order of ~10,000 reads per barcode does not seem high enough to me to slow down the query process that much. If you want to check the maximum number of reads per barcode, I just updated LRez stats so that the min and max values are also reported in the output (commit cd56d71).

I have checked, and the multithreading is indeed performed by splitting the barcode list. I agree with you that we should expect an initially high CPU load followed by a decrease. I do not know why we do not see this. Concerning parallelising over offsets instead of barcodes, we initially did not think it necessary, since barcodes had few reads in our initial read datasets and the query time for a single barcode was so low. But this is definitely an interesting avenue to explore for future developments.

Thank you for these useful comments, and by the way thank you also for creating and keeping up to date the awesome repository https://github.com/pontushojer/awesome-linked-reads !

Best,
Claire
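
The CPU load profile that both commenters expect under per-barcode splitting can be sketched with a toy model. The per-thread workloads and checkpoints below are hypothetical, and the model assumes each thread processes one read per time unit; it is not derived from LRez itself.

```python
def busy_workers_over_time(work_per_thread, checkpoints):
    """Count threads that are still busy at each checkpoint (one read per time unit)."""
    return [sum(1 for w in work_per_thread if w > t) for t in checkpoints]


# Hypothetical reads assigned to each of four threads after splitting by barcode.
work_per_thread = [500, 800, 1200, 20_000]
checkpoints = [0, 600, 1000, 5000, 25_000]

profile = busy_workers_over_time(work_per_thread, checkpoints)
print(profile)
```

In this model the number of busy threads drops step by step as the lighter chunks finish. A load that stays constant instead, as reported in this thread, suggests that something other than barcode imbalance may be limiting throughput.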


pontushojer commented on June 23, 2024

> Thank you for the information on this technology. We were somewhat aware of its existence, but we have never tested LRez on such datasets.

It has not been used outside our lab to my knowledge, so not too many datasets are available at the moment. But aside from this issue and #8, I have had no major issues using it on this data.

> On the order of ~10,000 reads per barcode does not seem high enough to me to slow down the query process that much. If you want to check the maximum number of reads per barcode, I just updated LRez stats so that the min and max values are also reported in the output (commit cd56d71).

I pulled the latest version and ran it on the same output BAM:

min: 1
1st quantile: 53
median: 170
3rd quantile: 591
max: 20736

So the maximum is 20736, again not too many in my opinion.

> I have checked, and the multithreading is indeed performed by splitting the barcode list. I agree with you that we should expect an initially high CPU load followed by a decrease. I do not know why we do not see this. Concerning parallelising over offsets instead of barcodes, we initially did not think it necessary, since barcodes had few reads in our initial read datasets and the query time for a single barcode was so low. But this is definitely an interesting avenue to explore for future developments.

> Thank you for these useful comments, and by the way thank you also for creating and keeping up to date the awesome repository https://github.com/pontushojer/awesome-linked-reads !

Thanks for the kind words!

