juliadynamics / streamsampling.jl

Sampling methods for data streams

License: MIT License

streamsampling.jl's Introduction

StreamSampling.jl

This package provides general methods to sample from any stream in a single pass through the data, even when the number of items in the stream is unknown.

This has some advantages over other sampling procedures:

  • If the iterable is lazy, the memory required grows with the size of the sample rather than with the size of the whole population.
  • At any point of the sampling process, the sample collected is a random sample of the portion of the stream seen so far.
  • In some cases, sampling with the techniques implemented in this library can bring considerable performance gains, since the population of items doesn't need to be stored in memory beforehand.
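
To make the single-pass idea concrete, here is a minimal sketch of the classic reservoir sampling scheme usually called Algorithm R, written in plain Julia with only the Random standard library. It is an illustration of the general technique, not the package's implementation; the function name `reservoir_sample` is hypothetical.

```julia
using Random

# Algorithm R: keep a uniform sample of size n (without replacement)
# from an iterator of unknown length, in a single pass.
# `reservoir_sample` is a hypothetical name, not part of StreamSampling.
function reservoir_sample(rng::AbstractRNG, iter, n::Integer)
    reservoir = Any[]              # untyped for simplicity
    seen = 0                       # number of items consumed so far
    for x in iter
        seen += 1
        if seen <= n
            push!(reservoir, x)    # fill phase: keep the first n items
        else
            j = rand(rng, 1:seen)  # uniform index in 1:seen
            if j <= n
                reservoir[j] = x   # replace a slot with probability n/seen
            end
        end
    end
    return reservoir
end

rng = Xoshiro(42)
lazy = Iterators.filter(isodd, 1:10^6)   # lazy stream, never collected
s = reservoir_sample(rng, lazy, 5)
```

Only the five-element reservoir and a counter are kept in memory, which is the property the first bullet above describes.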

Brief overview of the functionalities

The itsample function consumes the whole stream at once and returns the collected sample:

julia> using StreamSampling

julia> st = 1:100;

julia> itsample(st, 5)
5-element Vector{Int64}:
  9
 15
 52
 96
 91

In some cases, one needs to control the updates the ReservoirSample is subject to. In that case, use the update! function to fit new values into the reservoir:

julia> using StreamSampling

julia> rs = ReservoirSample(Int, 5);

julia> for x in 1:100
           update!(rs, x)
       end

julia> value(rs)
5-element Vector{Int64}:
  7
  9
 20
 49
 74

Consult the API page for more information on these and other functionalities.

Benchmark

As stated in the first section, these sampling techniques can considerably reduce the memory usage of a program, but there are cases where they are also more time efficient, as demonstrated below by a comparison with the equivalent methods of StatsBase.sample:

julia> using StreamSampling

julia> using BenchmarkTools, Random, StatsBase

julia> rng = Xoshiro(42);

julia> iter = Iterators.filter(x -> x != 10, 1:10^7);

julia> wv(el) = 1.0;

julia> @btime itsample($rng, $iter, 10^4, algRSWRSKIP);
  12.209 ms (8 allocations: 156.47 KiB)

julia> @btime sample($rng, collect($iter), 10^4; replace=true);
  134.622 ms (20 allocations: 146.91 MiB)

julia> @btime itsample($rng, $iter, 10^4, algL);
  10.450 ms (6 allocations: 78.30 KiB)

julia> @btime sample($rng, collect($iter), 10^4; replace=false);
  135.039 ms (27 allocations: 147.05 MiB)

julia> @btime itsample($rng, $iter, $wv, 10^4, algWRSWRSKIP);
  14.017 ms (13 allocations: 568.84 KiB)

julia> @btime sample($rng, collect($iter), Weights($wv.($iter)), 10^4; replace=true);
  543.582 ms (45 allocations: 702.33 MiB)

julia> @btime itsample($rng, $iter, $wv, 10^4, algAExpJ);
  20.968 ms (9 allocations: 234.73 KiB)

julia> @btime sample($rng, collect($iter), Weights($wv.($iter)), 10^4; replace=false);
  305.226 ms (43 allocations: 370.19 MiB)

Contributing

Contributions are welcome! If you encounter any issues, have suggestions for improvements, or would like to add new features, feel free to open an issue or submit a pull request.

streamsampling.jl's Issues

`calculate_eltype(iter)` produces wrong type inference

I am trying to use this package to sample from Combinatorics.jl objects (which typically produce very large iterators you don't want to collect, so your package is perfect!).

The following works, but the result is a Vector{Any} instead of a Vector{Vector{Int64}}:

using Combinatorics, StreamSampling

X = 1:6
k = 2
iter = multiset_combinations(X, k)
itsample(iter, 2; replace = false, ordered = false)

If we do

itsample(iter, 20; replace = false, ordered = false) # ask for more elements than the 15 in the iter

it errors with ERROR: TypeError: in typeassert, expected Vector{Any}, got a value of type Vector{Vector{Int64}}

I believe this is due to usage of calculate_eltype(iter) i.e. Base.@default_eltype(iter).
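
For context, here is a small sketch in plain Julia of one general way a `Vector{Any}` can arise from an iterator whose elements all share a concrete type: the iterator type does not declare `eltype`, so the `Any` fallback is used. Whether this is what happens inside the package for `multiset_combinations` is an assumption, not something confirmed here. The type `PairsUpTo` is hypothetical.

```julia
# Hypothetical iterator that yields Vector{Int} values
# but does not declare `eltype`.
struct PairsUpTo
    n::Int
end

function Base.iterate(p::PairsUpTo, i::Int = 1)
    i > p.n && return nothing
    return ([i, i + 1], i + 1)
end

# The length is treated as unknown, as for a generic stream.
Base.IteratorSize(::Type{PairsUpTo}) = Base.SizeUnknown()

it = PairsUpTo(3)
declared = eltype(it)             # falls back to Any
collected = collect(it)           # Vector{Any}, though every element is a Vector{Int}
narrowed = identity.(collected)   # broadcasting narrows the element type from the values
```

The last line shows a possible user-side workaround when a narrower element type is needed: broadcasting `identity` rebuilds the vector with an element type computed from the actual values.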

Pkg name

I found this doc link, IteratorSampling.jl, where the package is called IteratorSampling, and in the confusion I was able to add IteratorSampling.
However, add StreamSampling does not work, and the dev docs for StreamSampling.jl seem less complete.
What is the correct package name (I am guessing StreamSampling)?

Weighted version of sampling methods

Create a weighted version of the sampling methods.

2 possible strategies:

  • Passing a function weight accepting an element of the iterable as argument which defines its sampling weight
  • A weight vector is passed to the method

The first approach is more aligned with the scope of the library, so I think it is the right one to implement.
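
As an illustration of the first strategy (a weight function applied to each element), here is a minimal sketch of the A-Res scheme by Efraimidis and Spirakis for weighted sampling without replacement. It is a simple relative of the algorithms named in the benchmark section, not necessarily what the package implements, and the name `weighted_reservoir` is hypothetical.

```julia
using Random

# A-Res: each item gets the key u^(1/w) with u uniform in (0, 1),
# and the n items with the largest keys form the sample.
# `weight` is a function of the element, not a precomputed vector.
function weighted_reservoir(rng::AbstractRNG, iter, weight, n::Integer)
    ks = Float64[]                    # keys of the items currently kept
    items = Any[]                     # the items themselves
    for x in iter
        w = weight(x)
        w > 0 || continue             # zero-weight items can never be selected
        k = rand(rng)^(1 / w)         # larger weight pushes the key toward 1
        if length(items) < n
            push!(ks, k)
            push!(items, x)
        else
            kmin, i = findmin(ks)     # linear scan; a heap would be faster
            if k > kmin
                ks[i] = k
                items[i] = x
            end
        end
    end
    return items
end

rng = Xoshiro(7)
# Items up to 500 get weight 0, the rest get weight 1.
s = weighted_reservoir(rng, 1:1000, x -> x <= 500 ? 0.0 : 1.0, 10)
```

Because the weight is computed per element during the pass, no weight vector of the population's length has to be stored, which keeps the approach usable on lazy streams.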

Stoppable iterator sampling

Related also to #20: it could be useful to be able to run the sampling in a loop so that the user can stop the sampling process at any moment, e.g.

sample, state = init_sample!(iter, n, replace=true, ordered=false)
while condition_is_true
    update_sample!(iter, sample, state) # this just updates the sample one iteration at a time
end
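
The idea can be sketched in plain Julia with a small stateful reservoir that is updated one item at a time, so the caller decides when to stop. This is only an illustration of the request under stated assumptions: `IncrementalReservoir` and `push_item!` are hypothetical names, and the sketch samples without replacement.

```julia
using Random

# Hypothetical stateful reservoir (illustrative names, not the package API).
mutable struct IncrementalReservoir{T,R<:AbstractRNG}
    rng::R
    n::Int               # target sample size
    seen::Int            # items observed so far
    items::Vector{T}     # current sample
end

IncrementalReservoir(rng::AbstractRNG, ::Type{T}, n::Integer) where {T} =
    IncrementalReservoir{T,typeof(rng)}(rng, n, 0, T[])

# Process one item; the reservoir stays a uniform sample of everything seen.
function push_item!(r::IncrementalReservoir, x)
    r.seen += 1
    if length(r.items) < r.n
        push!(r.items, x)                 # fill phase
    else
        j = rand(r.rng, 1:r.seen)
        j <= r.n && (r.items[j] = x)      # replace with probability n/seen
    end
    return r
end

r = IncrementalReservoir(Xoshiro(1), Int, 5)
for x in 1:1_000
    push_item!(r, x)
    r.seen >= 300 && break   # caller-defined stopping condition
end
sample_so_far = copy(r.items)
```

Stopping after 300 items leaves a valid sample of those 300 items, and the loop could later resume from the same state.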
