Comments (5)
Hi,
Thank you for the feedback.
We have already encountered a similar slowness problem with a particular linked-read dataset (from TELL-seq) in which the barcode distribution was excessively skewed: most barcodes appeared on only one or two read pairs, and a very small number of barcodes were shared by hundreds of thousands of read pairs. In this case, if the list contains such a barcode with an excessive number of reads to extract, the query can be very slow and the multi-threading does not seem to help, since the parallelization is performed by splitting the barcode list.
Could this be the case for your dataset?
Which linked-read technology was used for your dataset?
Best,
Claire
from lrez.
Hi Claire,
We are using DBS linked reads as described here. The resulting data is quite similar to TELL-Seq but with 20 bp barcodes. The distribution of reads per barcode is certainly skewed in the way you describe. So you are saying that barcodes with a high number of reads would cause this slowdown? The upper-end barcodes in my dataset have on the order of ~10,000 reads associated with them, so not quite hundreds of thousands. For this number of queries per barcode, do you think this is still a problem?
Regarding the multithreading, you say it's parallelised over the list of barcodes? If this were the case, I would expect an initial high CPU load with a gradual decrease as only barcodes with many reads remain. In my case the load is more or less constant.
As a side note, would it not be preferable to parallelise over offsets instead of barcodes? This would be more efficient regardless of the linked-read technology. It would also help when doing single queries.
PS: The queries in question finished after about 5.5 hours.
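To make the difference between the two splitting strategies discussed above concrete, here is a minimal Python sketch. It is not LRez code: the function names, the contiguous-chunk splitting, and the toy read counts are all assumptions chosen for illustration. It only compares how many reads each thread would have to fetch under each strategy.

```python
# Hypothetical illustration only; this does not reproduce LRez internals.
# Each barcode is represented by the number of read offsets to fetch for it.

def split_by_barcode(reads_per_barcode, n_threads):
    """Assign each thread a contiguous chunk of the barcode list.

    Returns the total number of reads each thread must fetch.
    """
    chunk = -(-len(reads_per_barcode) // n_threads)  # ceiling division
    return [
        sum(reads_per_barcode[i * chunk:(i + 1) * chunk])
        for i in range(n_threads)
    ]

def split_by_offset(reads_per_barcode, n_threads):
    """Pool all offsets of all barcodes and share them evenly among threads.

    Returns the total number of reads each thread must fetch.
    """
    total = sum(reads_per_barcode)
    base, extra = divmod(total, n_threads)
    return [base + (1 if i < extra else 0) for i in range(n_threads)]

# Toy skewed distribution (made up): one large barcode and 99 small ones.
counts = [20000] + [100] * 99

by_barcode = split_by_barcode(counts, 4)
by_offset = split_by_offset(counts, 4)

print("reads per thread, split by barcode:", by_barcode)
print("reads per thread, split by offset: ", by_offset)
```

In this toy case the thread that receives the large barcode does most of the work under barcode splitting, while offset splitting gives every thread the same share. Whether this accounts for the run time reported in this thread is not established here.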
I ran `LRez stats` on the output BAM to check the reads-per-barcode distribution for one of the query lists that finished. This is the output:
Number of barcodes: 2128
Number of mapped reads: 2328152
Number of reads per barcode:
1st quantile: 53
median: 170
3rd quantile: 591
That's about 1,000 reads per barcode on average, but with a median of 170 the distribution is quite skewed.
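As a rough illustration of the mean-versus-median comparison above, the following Python sketch summarises a list of per-barcode read counts. The `summarize` helper and the toy counts are hypothetical, and the quantile method used by `LRez stats` may differ from the one chosen here.

```python
import statistics

def summarize(reads_per_barcode):
    """Return min, quartiles, max and mean of per-barcode read counts.

    Uses the "inclusive" quantile method of the standard library; this is
    an assumption and may not match how LRez computes its quantiles.
    """
    q1, median, q3 = statistics.quantiles(
        reads_per_barcode, n=4, method="inclusive"
    )
    return {
        "min": min(reads_per_barcode),
        "q1": q1,
        "median": median,
        "q3": q3,
        "max": max(reads_per_barcode),
        "mean": statistics.mean(reads_per_barcode),
    }

# Made-up skewed counts, not taken from the dataset in this thread.
toy_counts = [1, 5, 50, 170, 600, 20000]
summary = summarize(toy_counts)
for name, value in summary.items():
    print(f"{name}: {value}")

# A mean far above the median indicates a long right tail.
print("mean / median ratio:", summary["mean"] / summary["median"])
```

A mean much larger than the median, as in the output reported above, is the signature of a few barcodes carrying a large share of the reads.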
Hi,
Thank you for the information on this technology. We were somewhat aware of its existence, but we have never tested LRez on such datasets.
On the order of ~10,000 reads per barcode does not seem high enough to me to slow the query process that much. If you want to check the maximum number of reads per barcode, I just updated `LRez stats` so that min and max values are also reported in the output (commit cd56d71).
I have checked: the multithreading is indeed performed by splitting the barcode list. I agree with you that we should expect an initial high CPU load and then a decrease. I do not know why we do not see this. Concerning parallelising over offsets instead of barcodes, we initially did not think it necessary, since barcodes had few reads in our initial read datasets and the query time for a single barcode was very low. But this is definitely an interesting avenue to explore for future developments.
Thank you for these useful comments, and by the way thank you also for creating and keeping up to date the awesome repository https://github.com/pontushojer/awesome-linked-reads!
Best,
Claire
> Thank you for the information on this technology. We were somewhat aware of its existence, but we have never tested LRez on such datasets.
It has not been used outside our lab to my knowledge, so not too many datasets are available at the moment. But aside from this issue and #8, I have had no major issues using LRez on this data.
> On the order of ~10,000 reads per barcode does not seem high enough to me to slow the query process that much. If you want to check the maximum number of reads per barcode, I just updated `LRez stats` so that min and max values are also reported in the output (commit cd56d71).
I pulled the latest version and ran it on the same output BAM:
min: 1
1st quantile: 53
median: 170
3rd quantile: 591
max: 20736
So the maximum is 20,736 reads, which again is not too many in my opinion.
> I have checked: the multithreading is indeed performed by splitting the barcode list. I agree with you that we should expect an initial high CPU load and then a decrease. I do not know why we do not see this. Concerning parallelising over offsets instead of barcodes, we initially did not think it necessary, since barcodes had few reads in our initial read datasets and the query time for a single barcode was very low. But this is definitely an interesting avenue to explore for future developments.
> Thank you for these useful comments, and by the way thank you also for creating and keeping up to date the awesome repository https://github.com/pontushojer/awesome-linked-reads!
Thanks for the kind words!