bam query slow (lrez, open issue)

morispi commented on June 23, 2024
bam query slow

from lrez.

Comments (5)

clemaitre commented on June 23, 2024

Hi,

Thank you for the feedback.

We have already encountered a similar slowness problem with a particular linked-read dataset (from TELL-Seq) in which the barcode distribution was excessively skewed: most barcodes appeared on only one or two read pairs, while a very small number of barcodes were shared by hundreds of thousands of read pairs. In such a case, if the list contains a barcode with an excessive number of reads to extract, the query can be very slow, and the multi-threading appears not to help because the parallelization is performed by splitting the barcode list.

Could this be the case for your dataset?
Which linked-read technology was used for your dataset?

Best,
Claire
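
To illustrate the imbalance described in this comment, here is a minimal Python sketch. It assumes the barcode list is cut into contiguous, near-equal chunks, one per thread; that is an assumption made for illustration only, since the thread does not say how LRez actually splits the list (LRez itself is written in C++). The function names and the read counts are hypothetical.

```python
from typing import Dict, List


def split_barcodes(barcodes: List[str], n_threads: int) -> List[List[str]]:
    """Cut the barcode list into at most n_threads contiguous chunks."""
    chunk = max(1, -(-len(barcodes) // n_threads))  # ceiling division
    return [barcodes[i:i + chunk] for i in range(0, len(barcodes), chunk)]


def reads_per_thread(reads_per_barcode: Dict[str, int], n_threads: int) -> List[int]:
    """Total number of reads each thread has to extract for its chunk."""
    chunks = split_barcodes(list(reads_per_barcode), n_threads)
    return [sum(reads_per_barcode[bc] for bc in chunk) for chunk in chunks]


# Hypothetical skewed dataset: seven tiny barcodes and one very large one.
counts = {f"BC{i}": 2 for i in range(7)}
counts["BC_big"] = 100_000

loads = reads_per_thread(counts, n_threads=4)
print(loads)  # [4, 4, 4, 100002]
```

With these made-up counts, the thread that receives `BC_big` is left with almost all of the reads, so the total runtime is bounded by that one thread however many threads are used.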

pontushojer commented on June 23, 2024

Hi Claire,

We are using DBS linked reads as described here. The resulting data is quite similar to TELL-Seq but with 20 bp barcodes. The distribution of reads per barcode is certainly skewed in the way you describe. So you are saying that barcodes with a high number of reads would cause this slowdown? The upper-end barcodes in my dataset have on the order of 10,000 reads associated with them, so not quite hundreds of thousands. For this number of queries per barcode, do you think this is still a problem?

Regarding the multithreading, you say it's parallelised over the list of barcodes? If that were the case, I would expect an initially high CPU load with a gradual decrease as only barcodes with many reads remain. In this case the load is more or less constant.

Another side note: would it not be preferable to parallelise over offsets instead of barcodes? This would be more efficient regardless of the linked-read technology. It would also help when doing single queries.
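
The idea of parallelising over offsets rather than barcodes can be sketched in a few lines of Python. This is a hedged illustration, not LRez code: it assumes the index can be viewed as a mapping from each barcode to a list of BAM offsets, and the function name, the round-robin assignment, and the example offsets are all hypothetical.

```python
from typing import Dict, List, Tuple


def split_offsets(index: Dict[str, List[int]], n_workers: int) -> List[List[Tuple[str, int]]]:
    """Flatten the index into (barcode, offset) tasks and deal them out round-robin."""
    tasks = [(bc, off) for bc, offsets in index.items() for off in offsets]
    return [tasks[w::n_workers] for w in range(n_workers)]


# Hypothetical index: one barcode with 10 offsets and three barcodes with 1 offset each.
index = {
    "BC_big": list(range(0, 1000, 100)),
    "BC1": [5000],
    "BC2": [6000],
    "BC3": [7000],
}

parts = split_offsets(index, n_workers=4)
print([len(p) for p in parts])  # [4, 3, 3, 3]
```

Because the unit of work is a single offset, the per-worker task counts differ by at most one even when a single barcode owns most of the offsets, and a query for just one barcode can still be spread over several workers. A real implementation might prefer to give each worker a sorted, contiguous range of offsets to keep file access sequential.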

PS: The queries in question finished after about 5.5 hours.

pontushojer commented on June 23, 2024

I ran LRez stats on the output BAM to check the reads-per-barcode distribution for one of the query lists that finished. This is the output:

Number of barcodes: 2128
Number of mapped reads: 2328152

Number of reads per barcode:
	 1st quantile: 53
	 median: 170
	 3rd quantile: 591

That's about 1000 reads per barcode on average, but with a median of 170 the distribution is quite skewed.
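
The average quoted here can be checked directly from the two totals in the LRez stats output. A short Python check, using only the numbers reported above (the variable names are just for illustration):

```python
# Values copied from the LRez stats output above.
n_barcodes = 2128
n_mapped_reads = 2_328_152
median = 170
third_quantile = 591

mean = n_mapped_reads / n_barcodes
print(f"mean reads per barcode: {mean:.0f}")
print(f"mean / median: {mean / median:.1f}")
print(f"mean above 3rd quantile: {mean > third_quantile}")
```

The mean comes out at roughly 1,094 reads per barcode, about 6.4 times the median and above the 3rd quantile, which is what points to a long right tail of a few very large barcodes.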

clemaitre commented on June 23, 2024

Hi,

Thank you for the information on this technology. We were somewhat aware of its existence, but we have never tested LRez on such datasets.

On the order of 10,000 reads per barcode does not seem high enough to me to slow the query process that much. If you want to check the maximum number of reads per barcode, I have just updated LRez stats so that the min and max values are also reported in the output (commit cd56d71).

I have checked: the multithreading is indeed performed by splitting the barcode list. I agree with you that we should expect an initially high CPU load followed by a decrease. I do not know why we do not see this. Concerning parallelising over offsets instead of barcodes, we initially did not think it necessary, since barcodes had few reads in our initial read datasets and the query time for a single barcode was so low. But this is definitely an interesting avenue to explore for future developments.

Thank you for these useful comments, and by the way, thank you also for creating and keeping up to date the awesome repository https://github.com/pontushojer/awesome-linked-reads !

Best,
Claire

pontushojer commented on June 23, 2024

> Thank you for the information on this technology. We were somewhat aware of its existence, but we have never tested LRez on such datasets.

To my knowledge it has not been used outside our lab, so not too many datasets are available at the moment. But aside from this issue and #8, I have had no major issues using LRez on this data.

> On the order of 10,000 reads per barcode does not seem high enough to me to slow the query process that much. If you want to check the maximum number of reads per barcode, I have just updated LRez stats so that the min and max values are also reported in the output (commit cd56d71).

I pulled the latest version and ran it on the same output BAM:

min: 1
1st quantile: 53
median: 170
3rd quantile: 591
max: 20736

So the maximum is 20736, again not too many in my opinion.

> I have checked: the multithreading is indeed performed by splitting the barcode list. I agree with you that we should expect an initially high CPU load followed by a decrease. I do not know why we do not see this. Concerning parallelising over offsets instead of barcodes, we initially did not think it necessary, since barcodes had few reads in our initial read datasets and the query time for a single barcode was so low. But this is definitely an interesting avenue to explore for future developments.

> Thank you for these useful comments, and by the way, thank you also for creating and keeping up to date the awesome repository https://github.com/pontushojer/awesome-linked-reads !

Thanks for the kind words!
