Comments (3)
First of all I made a simple profiling analysis, with flamegraph 10 % of rasusa runtime is spent in HashSet
contains
and remove
. So it's not where rasusa spent most time but improve it is easy and have an impact.
I write a small benchmark in my branch indices_storage I try to be closer as possible to rasusa process.
First I compare different method to create indices storage (0.2 % of rasusa runtime).
hash: use classic HashSet
rustc_hash: use rustc_hash::FxHashSet
vector: store boolean in vector true if indices must be select.
bitvec: similar to vector but use a bitvec (I'm not happy with this implementation)
create/hash time: [235.87 us 238.62 us 242.05 us]
create/rustc_hash time: [86.565 us 87.738 us 89.163 us]
create/vector time: [13.918 us 13.978 us 14.044 us]
create/bitvec time: [732.96 us 748.85 us 765.80 us]
Faster method is vector, 17 times than hash. About memory usage hash and rustc_hash memory usage is harder to estimate but it's linear in number of selected read, for vector and bitvec it's linear in total number of read. For vector and bitvec we can save some ram by truncate to last selected value.
Second I compare method to read indices storage structure (10 % of rasusa runtime).
remove: when we reach selected indices we remove it from HashSet, when set is empty we stop iteration
count: when we reach a selected indices we just decrease a counter, when the counter reach 0 we stop iteration
full: we iterate over all vector
read/hash_remove time: [2.1765 ms 2.2002 ms 2.2265 ms]
read/hash_count time: [2.0811 ms 2.1786 ms 2.2642 ms]
read/rustc_hash_remove time: [834.15 us 844.61 us 856.79 us]
read/rustc_hash_count time: [450.59 us 460.17 us 470.73 us]
read/vector_full time: [130.23 us 131.96 us 133.79 us]
read/vector_count time: [173.10 us 177.20 us 181.82 us]
read/bitvec_full time: [442.90 us 458.02 us 475.87 us]
read/bitvec_count time: [396.12 us 403.98 us 412.27 us]
Again vector is faster, 16 times than hash. I assume full method is faster than count method because these methods avoid a test.
I think vector is a very good solution and memory usage draw back isn't too large (1 bytes per reads), but if you prefer keep set approach replace standard set by rustc set is easy and efficient (~ 5 times faster).
from rasusa.
Thanks for this @natir. I appreciate the suggestions!
I had originally intended to use a bit vector actually. The reason I decided on a set though was because of the following reason:
When I am outputting the reads that we are keeping in the subsample, we remove the index from the set. So the set gradually shrinks to zero. When the set is empty we exit the subsampling function
Lines 177 to 179 in 1135210
This way we save a small amount of time (theoretically) by not iterating through the entire file again - unless the last read in the file is part of the subsample.
However, upon reflection, a bit vector may still be faster as the lookup time is probably faster in real terms, as we do not need to hash anything and as you've pointed out, hashing can be expensive. (Although I suspect it is only on the order if milli/micro seconds overall).
I would be interest to see a benchmark between the three approaches: current, set with different hashing, and bit vector. However, I appreciate that is a bit of work. If you want to put together a benchmark I would very much appreciate it and would happily take whatever approach is the fastest - provided memory usage stays low(ish).
from rasusa.
Thank you so much for this @natir! This is amazing. It seems a plain ol' vector is definitely the best solution. Even with 1 byte per read, the file would have to have a billion reads (1GB) for the memory to really be of any concern to be.
If you would like to switch the implementation to use a vector I will gladly accept a PR, otherwise I will try and do this in the next week or two.
from rasusa.
Related Issues (20)
- Finer control over coverage HOT 2
- Multi-threading approach implementation HOT 3
- Lower than expected coverage for illumina paired end reads HOT 9
- Suggestion: replace needletail by noodles and niffler HOT 1
- Compress output when requested HOT 5
- Input parameter for number of bases in addition to coverage and genome size HOT 6
- Calculate genome size from an index file HOT 1
- Add number and fraction options
- Docker image issue HOT 9
- Suggestion: Raise warning when desired level of coverage is not possible HOT 1
- Feature: Min/max coverage threshold HOT 2
- Multi-threading approach HOT 7
- Does rasusa outputs reads that still cover the entire original genome? HOT 1
- Random sampling based on bases for the metagenomic dataset HOT 1
- illumina read input error HOT 2
- Question: there is a way to start from a bam file instead of a fastq? HOT 1
- won't detect input file HOT 1
- Generating only half of the target depth HOT 6
- Error: unable to gather read lengths for the first input file HOT 2
Recommend Projects
-
React
A declarative, efficient, and flexible JavaScript library for building user interfaces.
-
Vue.js
🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.
-
Typescript
TypeScript is a superset of JavaScript that compiles to clean JavaScript output.
-
TensorFlow
An Open Source Machine Learning Framework for Everyone
-
Django
The Web framework for perfectionists with deadlines.
-
Laravel
A PHP framework for web artisans
-
D3
Bring data to life with SVG, Canvas and HTML. 📊📈🎉
-
Recommend Topics
-
javascript
JavaScript (JS) is a lightweight interpreted programming language with first-class functions.
-
web
Some thing interesting about web. New door for the world.
-
server
A server is a program made to process requests and deliver data to clients.
-
Machine learning
Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.
-
Visualization
Some thing interesting about visualization, use data art
-
Game
Some thing interesting about game, make everyone happy.
Recommend Org
-
Facebook
We are working to build community through open source technology. NB: members must have two-factor auth.
-
Microsoft
Open source projects and samples from Microsoft.
-
Google
Google ❤️ Open Source for everyone.
-
Alibaba
Alibaba Open Source for everyone
-
D3
Data-Driven Documents codes.
-
Tencent
China tencent open source team.
from rasusa.