Coder Social home page Coder Social logo

barrust / pyprobables Goto Github PK

View Code? Open in Web Editor NEW
111.0 7.0 9.0 4.35 MB

Probabilistic data structures in python http://pyprobables.readthedocs.io/en/latest/index.html

License: MIT License

Python 100.00%
probabilistic-programming probability bloom-filter count-min-sketch python data-science data-structures data-mining data-analysis cuckoo-filter

pyprobables's People

Contributors

barrust avatar cunla avatar dekoza avatar dependabot[bot] avatar dnanto avatar kolanich avatar leonhma avatar volker48 avatar

Stargazers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

Watchers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar

pyprobables's Issues

bloom filter intersection failure

Tried an intersection of 2 bloom filters both with est_elements=16000000, got a list index out of range error

Works fine if both have est_elements=16000001.

If one is 160000000 and the other is 16000001, get a None return on the intersection, rather than throwing an error explaining what the problem is.

quotient filter: additional functionality

Additional functionality to add to the quotient filter:

  • Resize / Merge
  • Delete element
  • Import / Export

Something to consider would be to use a form of bit packing to make it more compact, perhaps as a second class

bloom filter initialization

Perhaps the bloom filter initialization could be combined with the main init function. Would need to figure out order of operations, etc.

Update code to reflect end-of-life of Python 2 and 3.5

How do you feel about modernizing the codebase? Py2 is no more and Py3.6 is the lowest supported version right now. I'm preparing Py3.8+ version on my fork if you're interested. The code can also be improved by using black+isort for formatting and incorporating typehints.

constants file

Moving constants out of data structure definitions to a constants file could clean up the code some.

Several problems with the default hash

Hi, I found some problems with the default fnv hash used. Even though it is recommended to provide custom hashes, some users may expect the defaults to work properly.

First off, the results differ from standardized implementations:

$ python -c 'from probables.hashes import fnv_1a; print(fnv_1a("foo"))'
3411864951044856955 # should be 15902901984413996407
$ python -c 'from probables.hashes import fnv_1a; print(fnv_1a("bar"))'
1047262940628067782 # should be 16101355973854746

This is caused by wrong hval value here

hval = 14695981039346656073
(should be 14695981039346656037 instead of 14695981039346656073). Changing this constant helps:

$ python -c 'from probables.hashes import fnv_1a; print(fnv_1a("foo"))'
15902901984413996407
$ python -c 'from probables.hashes import fnv_1a; print(fnv_1a("bar"))'
16101355973854746

The second problem is in the @hash_with_depth_int wrapper once more hashes than one are computed. Because the value of the first hash is used as a seed for the subsequent hashes, once we get a collision in the first hash, all other hashes are identical:

$ python3 -c 'from probables.hashes import default_fnv_1a; print(default_fnv_1a("gMPflVXtwGDXbIhP73TX", 3))'
[10362567113185002004, 14351534809307984379, 3092021042139682764]
$ python3 -c 'from probables.hashes import default_fnv_1a; print(default_fnv_1a("LtHf1prlU1bCeYZEdqWf", 3))'
[10362567113185002004, 14351534809307984379, 3092021042139682764]

This makes all Count*Sketch data structures much less accurate, since they rely on small probabilities of collision in all hash functions involved.

Update Documentation theme

Based on changes to readthedocs.org and the use of the sphinx-rtd-theme, it will be necessary to fix the custom theme by changing the name. Also recommend updating the theme to the latest release of sphinx-rtd-theme

frombytes support

Support having objects loaded directly from bytes to remove the requirement of having to store on disk in some situations.

See this comment

@KOLANICH

Missing method to aggregate count-min sketches

Count-min sketch has in theory property that 2 tables can be summed together which allows parallel count-min sketch building, but I don't see it implemented there.
Should I make pull request which implements it?

hashes library

It could be beneficial to provide a library of standard hashing strategies using different hashing methods. Ideally they would be usable by each data structure

Implement Rolling Bloom Filter

A rolling bloom filter is similar to the expanding bloom filter but, as it expands, it is capped at the number of expansions. Once another expansion is necessary, pop off the first set and continue. It would be like a timed bloom filter of sorts.

Count-Min Sketch check method

Not sure I like how the check method works with selecting a 'type'; I want it to be able to support multiple types of queries, but need to think of a better way to select it.

Math domain error

Hello,

I'm getting the following error when using print(bloom_filter).

File "/home/user/.conda/envs/biopython/lib/python3.9/site-packages/probables/blooms/bloom.py", line 127, in __str__
    self.estimate_elements(),
  File "/home/user/.conda/envs/biopython/lib/python3.9/site-packages/probables/blooms/bloom.py", line 350, in estimate_elements
    log_n = math.log(1 - (float(setbits) / float(self.number_bits)))
ValueError: math domain error

I'm running the latest version, downloaded from pipit only the other day and I'm using python version 3.8.6.

Initial Update

Hi ๐Ÿ‘Š

This is my first visit to this fine repo, but it seems you have been working hard to keep all dependencies updated so far.

Once you have closed this issue, I'll create separate pull requests for every update as soon as I find one.

That's it for now!

Happy merging! ๐Ÿค–

fix the submodule import

After installing from pypi, the blooms, countminsketch, and cuckoo submodules are not found. I believe this has to do with the setup.py packages section. Need to look into this.

How to cPickle count min sketch instance

I encounter this error when using cPickle to save count min sketch instance:

Traceback (most recent call last): File "test.py", line 14, in <module> pkl.dump(cms, f) File "/usr/local/Cellar/python@2/2.7.16/Frameworks/Python.framework/Versions/2.7/lib/python2.7/copy_reg.py", line 77, in _reduce_ex raise TypeError("a class that defines __slots__ without " TypeError: a class that defines __slots__ without defining __getstate__ cannot be pickled

Murmur hash example in the README.rst doesn't work

Some of the examples in the readme use pyprobables as the package name, but the package installs as probables.
After fixing that, the example fails:

In [2]: import mmh3  # murmur hash 3 implementation (pip install mmh3)
   ...: from probables.hashes import hash_with_depth_bytes
   ...: from probables import BloomFilter
   ...: 
   ...: 
   ...: @hash_with_depth_bytes
   ...: def my_hash(key):
   ...:     return mmh3.hash_bytes(key)
   ...: 
   ...: 
   ...: blm = BloomFilter(est_elements=1000, false_positive_rate=0.05, hash_function=my_hash)

In [3]: blm.add("google.com")
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Input In [3], in <module>
----> 1 blm.add("google.com")

File ~/.pyenv/versions/3.9.2/envs/gh392/lib/python3.9/site-packages/probables/blooms/bloom.py:250, in BloomFilter.add(self, key)
    245 def add(self, key: KeyT) -> None:
    246     """Add the key to the Bloom Filter
    247 
    248     Args:
    249         key (str): The element to be inserted"""
--> 250     self.add_alt(self.hashes(key))

File ~/.pyenv/versions/3.9.2/envs/gh392/lib/python3.9/site-packages/probables/blooms/bloom.py:243, in BloomFilter.hashes(self, key, depth)
    235 """Return the hashes based on the provided key
    236 
    237 Args:
   (...)
    240 Returns:
    241     List(int): A list of the hashes for the key in int form"""
    242 tmp = depth if depth is not None else self._number_hashes
--> 243 return self._hash_func(key, tmp)

File ~/.pyenv/versions/3.9.2/envs/gh392/lib/python3.9/site-packages/probables/hashes.py:36, in hash_with_depth_bytes.<locals>.hashing_func(key, depth)
     34 tmp = key if not isinstance(key, str) else key.encode("utf-8")
     35 for idx in range(depth):
---> 36     tmp = func(tmp, idx)
     37     res.append(unpack("Q", tmp[:8])[0])  # turn into 64 bit number
     38 return res

TypeError: my_hash() takes 1 positional argument but 2 were given

Simplify BloomFilter Implementation

The BloomFilter implementation is overly complex for what it is trying to do; the use of the BaseBloom class adds to the complexity and may actually slow down computations. Try removing BaseBloom to see if it reduces overhead and complexity.

No changes to the tests should be needed once this is completed, if it works!

Bloom Filter clear

It might be useful to be able to clear a bloom filter to reset it for use again

Wrong result with large filter

I expect that if I ask the filter to check for a membership and it tells me FALSE, then its definitely NOT a member. I did the following:

def verifyMembership(key):
    global bloom
    if key in bloom:
        print('Its possibly in')
    else:
        print('Definitly not in')

key = 'some'
filterFile = 'index.dat'
bloom = BloomFilter(est_elements=100000000, false_positive_rate=0.03, filepath=filterFile)
verifyMembership(key)
bloom.add(key)
verifyMembership(key)
bloom.export(filterFile)

I called my script twice and the output is:

Definitly not in
Its possibly in
Definitly not in
Its possibly in

But I would expect:

Definitly not in
Its possibly in
Its possibly in
Its possibly in

If i am reducing the est_elements to lets say 10000, then its fine.

Better load factor test

There are times, during the tests, that the cuckoo filter load test fails due to it having to expand. Since the cuckoo algorithm is partially random, a better test should be found.

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.