
`HashSet` is slow (rasusa, closed)

mbhall88 commented on June 9, 2024
`HashSet` is slow

from rasusa.

Comments (3)

natir commented on June 9, 2024

First of all, I did a simple profiling analysis: according to flamegraph, 10% of rasusa's runtime is spent in HashSet contains and remove. So it's not where rasusa spends most of its time, but improving it is easy and has an impact.

I wrote a small benchmark in my branch indices_storage; I tried to stay as close as possible to rasusa's actual process.

First, I compared different methods for creating the indices storage (0.2% of rasusa runtime):
hash: uses the standard HashSet
rustc_hash: uses rustc_hash::FxHashSet
vector: stores one boolean per read in a vector, true if the index must be selected
bitvec: similar to vector but uses a bitvec (I'm not happy with this implementation)
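The hash and vector storage layouts could be sketched roughly as follows (a minimal illustration, not rasusa's actual code; `total_reads` and `selected` are hypothetical inputs):

```rust
use std::collections::HashSet;

// Set-based storage: memory is linear in the number of selected reads.
fn as_hash_set(selected: &[usize]) -> HashSet<usize> {
    selected.iter().copied().collect()
}

// Vector-based storage: one bool per read in the file;
// `true` means the read at that index is kept.
fn as_bool_vec(total_reads: usize, selected: &[usize]) -> Vec<bool> {
    let mut keep = vec![false; total_reads];
    for &i in selected {
        keep[i] = true;
    }
    keep
}

fn main() {
    assert!(as_hash_set(&[1, 3]).contains(&3));
    assert_eq!(as_bool_vec(5, &[1, 3]), vec![false, true, false, true, false]);
}
```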

create/hash             time:   [235.87 us 238.62 us 242.05 us]
create/rustc_hash       time:   [86.565 us 87.738 us 89.163 us]
create/vector           time:   [13.918 us 13.978 us 14.044 us]
create/bitvec           time:   [732.96 us 748.85 us 765.80 us]

The fastest method is vector, 17 times faster than hash. Regarding memory: hash and rustc_hash usage is harder to estimate, but it is linear in the number of selected reads; for vector and bitvec it is linear in the total number of reads. For vector and bitvec we can save some RAM by truncating after the last selected index.
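The truncation trick mentioned above could look like this (an illustrative sketch, assuming a `Vec<bool>` storage):

```rust
// Truncate the bool vector just past the last `true` entry; everything
// after the last selected index is `false` and carries no information.
fn truncate_after_last(keep: &mut Vec<bool>) {
    let end = keep.iter().rposition(|&k| k).map_or(0, |i| i + 1);
    keep.truncate(end);
}

fn main() {
    let mut keep = vec![true, false, true, false, false];
    truncate_after_last(&mut keep);
    assert_eq!(keep, vec![true, false, true]);
}
```

The saving is largest when the last kept read occurs early in the file.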

Second, I compared methods for reading the indices storage structure (10% of rasusa runtime):
remove: when we reach a selected index we remove it from the HashSet; when the set is empty we stop iterating
count: when we reach a selected index we just decrement a counter; when the counter reaches 0 we stop iterating
full: we iterate over the whole vector

read/hash_remove        time:   [2.1765 ms 2.2002 ms 2.2265 ms]
read/hash_count         time:   [2.0811 ms 2.1786 ms 2.2642 ms]
read/rustc_hash_remove  time:   [834.15 us 844.61 us 856.79 us]
read/rustc_hash_count   time:   [450.59 us 460.17 us 470.73 us]
read/vector_full        time:   [130.23 us 131.96 us 133.79 us]
read/vector_count       time:   [173.10 us 177.20 us 181.82 us]
read/bitvec_full        time:   [442.90 us 458.02 us 475.87 us]
read/bitvec_count       time:   [396.12 us 403.98 us 412.27 us]

Again vector is fastest, 16 times faster than hash. I assume the full method is faster than the count method because it avoids an extra test on each iteration.
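The two vector read strategies being compared could be sketched like this (illustrative names, assuming a `Vec<bool>` storage; not rasusa's actual code):

```rust
// "full": scan the whole vector with no early exit and no extra branch.
fn read_full(keep: &[bool]) -> usize {
    keep.iter().filter(|&&k| k).count()
}

// "count": also track how many selected reads remain, and stop as soon
// as the counter hits zero, skipping the tail of the file.
fn read_count(keep: &[bool], mut remaining: usize) -> usize {
    let mut written = 0;
    for &k in keep {
        if remaining == 0 {
            break;
        }
        if k {
            written += 1;
            remaining -= 1;
        }
    }
    written
}

fn main() {
    let keep = vec![true, false, true, false, false];
    assert_eq!(read_full(&keep), 2);
    assert_eq!(read_count(&keep, 2), 2);
}
```

The early-exit branch in read_count is the per-iteration test that the full scan avoids.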

I think vector is a very good solution and the memory-usage drawback isn't too large (1 byte per read), but if you prefer to keep the set approach, replacing the standard set with the rustc set is easy and effective (~5 times faster).


mbhall88 commented on June 9, 2024

Thanks for this @natir. I appreciate the suggestions!

I had originally intended to use a bit vector, actually. The reason I decided on a set was the following: when we output a read that is being kept in the subsample, we remove its index from the set, so the set gradually shrinks to zero. When the set is empty, we exit the subsampling function:

rasusa/src/fastx.rs

Lines 177 to 179 in 1135210

if reads_to_keep.is_empty() {
return Ok(total_len);
}

This way we save a small amount of time (theoretically) by not iterating through the entire file again - unless the last read in the file is part of the subsample.
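The pattern described above could be sketched as follows (identifiers are illustrative, not rasusa's actual code):

```rust
use std::collections::HashSet;

// Write out the kept reads, removing each matched index from the set
// and stopping as soon as the set is empty, so the tail of the file
// is never scanned (unless the last read is part of the subsample).
fn subsample(reads: &[&str], mut reads_to_keep: HashSet<usize>) -> Vec<String> {
    let mut out = Vec::new();
    for (idx, read) in reads.iter().enumerate() {
        if reads_to_keep.remove(&idx) {
            out.push(read.to_string());
        }
        if reads_to_keep.is_empty() {
            break; // early exit: nothing left to keep
        }
    }
    out
}

fn main() {
    let kept = subsample(&["a", "b", "c", "d"], HashSet::from([1, 2]));
    assert_eq!(kept, vec!["b".to_string(), "c".to_string()]);
}
```

A vector-based version would keep the same early exit by tracking a remaining-reads counter instead of calling is_empty.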

However, upon reflection, a bit vector may still be faster, as the lookup time is probably better in real terms: we do not need to hash anything, and as you've pointed out, hashing can be expensive. (Although I suspect the difference is only on the order of milli-/microseconds overall.)

I would be interested to see a benchmark between the three approaches: the current one, a set with a different hasher, and a bit vector. However, I appreciate that is a bit of work. If you want to put together a benchmark I would very much appreciate it, and would happily take whatever approach is the fastest - provided memory usage stays low(ish).


mbhall88 commented on June 9, 2024

Thank you so much for this @natir! This is amazing. It seems a plain ol' vector is definitely the best solution. Even at 1 byte per read, a file would need a billion reads (1 GB) for the memory to really be of any concern to me.

If you would like to switch the implementation to use a vector I will gladly accept a PR, otherwise I will try and do this in the next week or two.

