Hello, During my small reading of rasusa code I see a possible speed

First of all I made a simple profiling analysis, with <a href="https://github.com/flam

Thanks for this <a class="user-mention notranslate" data-hovercard-type="user" data-ho

Thank you so much for this <a class="user-mention notranslate" data-hovercard-type="us

`HashSet` is slow about rasusa HOT 3 CLOSED

mbhall88 commented on June 9, 2024

`HashSet` is slow

from rasusa.

Comments (3)

natir commented on June 9, 2024 1

First of all I made a simple profiling analysis, with flamegraph 10 % of rasusa runtime is spent in HashSet contains and remove. So it's not where rasusa spent most time but improve it is easy and have an impact.

I write a small benchmark in my branch indices_storage I try to be closer as possible to rasusa process.

First I compare different method to create indices storage (0.2 % of rasusa runtime).
hash: use classic HashSet
rustc_hash: use rustc_hash::FxHashSet
vector: store boolean in vector true if indices must be select.
bitvec: similar to vector but use a bitvec (I'm not happy with this implementation)

create/hash             time:   [235.87 us 238.62 us 242.05 us]
create/rustc_hash       time:   [86.565 us 87.738 us 89.163 us]
create/vector           time:   [13.918 us 13.978 us 14.044 us]
create/bitvec           time:   [732.96 us 748.85 us 765.80 us]

Faster method is vector, 17 times than hash. About memory usage hash and rustc_hash memory usage is harder to estimate but it's linear in number of selected read, for vector and bitvec it's linear in total number of read. For vector and bitvec we can save some ram by truncate to last selected value.

Second I compare method to read indices storage structure (10 % of rasusa runtime).
remove: when we reach selected indices we remove it from HashSet, when set is empty we stop iteration
count: when we reach a selected indices we just decrease a counter, when the counter reach 0 we stop iteration
full: we iterate over all vector

read/hash_remove        time:   [2.1765 ms 2.2002 ms 2.2265 ms]
read/hash_count         time:   [2.0811 ms 2.1786 ms 2.2642 ms]
read/rustc_hash_remove  time:   [834.15 us 844.61 us 856.79 us]
read/rustc_hash_count   time:   [450.59 us 460.17 us 470.73 us]
read/vector_full        time:   [130.23 us 131.96 us 133.79 us]
read/vector_count       time:   [173.10 us 177.20 us 181.82 us]
read/bitvec_full        time:   [442.90 us 458.02 us 475.87 us]
read/bitvec_count       time:   [396.12 us 403.98 us 412.27 us]

Again vector is faster, 16 times than hash. I assume full method is faster than count method because these methods avoid a test.

I think vector is a very good solution and memory usage draw back isn't too large (1 bytes per reads), but if you prefer keep set approach replace standard set by rustc set is easy and efficient (~ 5 times faster).

from rasusa.

mbhall88 commented on June 9, 2024

Thanks for this @natir. I appreciate the suggestions!

I had originally intended to use a bit vector actually. The reason I decided on a set though was because of the following reason:
When I am outputting the reads that we are keeping in the subsample, we remove the index from the set. So the set gradually shrinks to zero. When the set is empty we exit the subsampling function

rasusa/src/fastx.rs

Lines 177 to 179 in 1135210

    
           if reads_to_keep.is_empty() { 
        
               return Ok(total_len); 
        
           }

This way we save a small amount of time (theoretically) by not iterating through the entire file again - unless the last read in the file is part of the subsample.

However, upon reflection, a bit vector may still be faster as the lookup time is probably faster in real terms, as we do not need to hash anything and as you've pointed out, hashing can be expensive. (Although I suspect it is only on the order if milli/micro seconds overall).

I would be interest to see a benchmark between the three approaches: current, set with different hashing, and bit vector. However, I appreciate that is a bit of work. If you want to put together a benchmark I would very much appreciate it and would happily take whatever approach is the fastest - provided memory usage stays low(ish).

from rasusa.

mbhall88 commented on June 9, 2024

Thank you so much for this @natir! This is amazing. It seems a plain ol' vector is definitely the best solution. Even with 1 byte per read, the file would have to have a billion reads (1GB) for the memory to really be of any concern to be.

If you would like to switch the implementation to use a vector I will gladly accept a PR, otherwise I will try and do this in the next week or two.

from rasusa.

`HashSet` is slow about rasusa HOT 3 CLOSED

Comments (3)

Related Issues (20)

Recommend Projects

React

Vue.js

Typescript

TensorFlow

Django

Laravel

D3

Recommend Topics

javascript

web

server

Machine learning

Visualization

Game

Recommend Org

Facebook

Microsoft

Google

Alibaba

D3

Tencent