The pyprobables from barrust

bloom filter intersection failure

Tried an intersection of two Bloom filters, both with est_elements=16000000, and got a "list index out of range" error.

Works fine if both have est_elements=16000001.

If one is 160000000 and the other is 16000001, the intersection returns None instead of raising an error that explains what the problem is.
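A minimal sketch of the behavior this issue asks for, where mismatched filters fail loudly rather than returning None. `TinyBloom` is a hypothetical toy class written for illustration, not the pyprobables `BloomFilter`, and the blake2b-based hashing is an assumption made only to keep the example self-contained.

```python
import hashlib


class TinyBloom:
    """Toy Bloom filter for illustration; NOT the pyprobables implementation."""

    def __init__(self, number_bits: int, number_hashes: int) -> None:
        self.number_bits = number_bits
        self.number_hashes = number_hashes
        self.bits = bytearray((number_bits + 7) // 8)

    def _positions(self, key: str):
        # One salted 64-bit hash per hash function, reduced to a bit index.
        data = key.encode("utf-8")
        for idx in range(self.number_hashes):
            digest = hashlib.blake2b(
                data, digest_size=8, salt=idx.to_bytes(8, "little")
            ).digest()
            yield int.from_bytes(digest, "little") % self.number_bits

    def add(self, key: str) -> None:
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, key: str) -> bool:
        return all(
            self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(key)
        )

    def intersection(self, other: "TinyBloom") -> "TinyBloom":
        # Bitwise AND is only meaningful when both filters share a layout,
        # so a mismatch raises instead of silently returning None.
        if (self.number_bits, self.number_hashes) != (
            other.number_bits,
            other.number_hashes,
        ):
            raise ValueError(
                "cannot intersect filters with different sizes or hash counts: "
                f"{self.number_bits}/{self.number_hashes} vs "
                f"{other.number_bits}/{other.number_hashes}"
            )
        result = TinyBloom(self.number_bits, self.number_hashes)
        result.bits = bytearray(a & b for a, b in zip(self.bits, other.bits))
        return result
```

Raising a `ValueError` with both parameter sets in the message is one possible design; the library may prefer its own exception type.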

quotient filter: Import and Export

Add import and export functionality to the quotient filter

see #112

Bloom Import not storing elements added

quotient filter: additional functionality

Additional functionality to add to the quotient filter:

Resize / Merge
Delete element
Import / Export

Something to consider would be to use a form of bit packing to make it more compact, perhaps as a second class

bloom filter initialization

Perhaps the bloom filter initialization could be combined with the main init function. Would need to figure out order of operations, etc.

Update code to reflect end-of-life of Python 2 and 3.5

How do you feel about modernizing the codebase? Python 2 has reached end-of-life and Python 3.6 is the lowest supported version right now. I'm preparing a Python 3.8+ version on my fork if you're interested. The code could also be improved by using black and isort for formatting and by adding type hints.

constants file

Moving constants out of data structure definitions to a constants file could clean up the code some.

Several problems with the default hash

Hi, I found some problems with the default fnv hash used. Even though it is recommended to provide custom hashes, some users may expect the defaults to work properly.

First off, the results differ from standardized implementations:

$ python -c 'from probables.hashes import fnv_1a; print(fnv_1a("foo"))'
3411864951044856955 # should be 15902901984413996407
$ python -c 'from probables.hashes import fnv_1a; print(fnv_1a("bar"))'
1047262940628067782 # should be 16101355973854746

This is caused by the wrong initial hval value here:

pyprobables/probables/hashes.py

Line 85 in beb73f2

hval = 14695981039346656073

(should be 14695981039346656037 instead of 14695981039346656073). Changing this constant helps:

$ python -c 'from probables.hashes import fnv_1a; print(fnv_1a("foo"))'
15902901984413996407
$ python -c 'from probables.hashes import fnv_1a; print(fnv_1a("bar"))'
16101355973854746

The second problem is in the @hash_with_depth_int wrapper when more than one hash is computed. Because the value of the first hash is used as the seed for the subsequent hashes, a collision in the first hash makes all the other hashes identical too:

$ python3 -c 'from probables.hashes import default_fnv_1a; print(default_fnv_1a("gMPflVXtwGDXbIhP73TX", 3))'
[10362567113185002004, 14351534809307984379, 3092021042139682764]
$ python3 -c 'from probables.hashes import default_fnv_1a; print(default_fnv_1a("LtHf1prlU1bCeYZEdqWf", 3))'
[10362567113185002004, 14351534809307984379, 3092021042139682764]

This makes all Count*Sketch data structures much less accurate, since they rely on small probabilities of collision in all hash functions involved.
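Both problems can be illustrated with a small standalone sketch. It uses the standard 64-bit FNV-1a offset basis quoted above, and it derives each extra hash by prefixing the depth index to the key bytes instead of seeding from the previous hash value. The function names `fnv_1a_64` and `hashes_with_depth` are hypothetical, and the prefixing scheme is just one possible fix, not necessarily the one the library adopted.

```python
FNV_OFFSET_BASIS_64 = 14695981039346656037  # standard 64-bit FNV offset basis
FNV_PRIME_64 = 1099511628211                # standard 64-bit FNV prime
MASK_64 = 0xFFFFFFFFFFFFFFFF


def fnv_1a_64(data: bytes) -> int:
    """Standard 64-bit FNV-1a: XOR each byte in, then multiply by the prime."""
    hval = FNV_OFFSET_BASIS_64
    for byte in data:
        hval ^= byte
        hval = (hval * FNV_PRIME_64) & MASK_64
    return hval


def hashes_with_depth(key: str, depth: int) -> list:
    """Hypothetical helper: one hash per depth index.

    The index is mixed in as a PREFIX of the input, so every depth starts
    hashing the key from a different internal state. Two keys that collide
    at one depth are therefore not forced to collide at the others, which
    is what happens when the previous hash value is used as the seed.
    """
    data = key.encode("utf-8")
    return [fnv_1a_64(idx.to_bytes(4, "little") + data) for idx in range(depth)]


print(fnv_1a_64(b"foo"))  # 15902901984413996407, the reference value cited above
```

A suffix would not help here: once two keys reach the same internal state, appending identical bytes keeps them identical.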

Move to github actions

Move from travis-ci to github actions for development and deployment of pyprobables

Update Documentation theme

Based on changes to readthedocs.org and the use of the sphinx-rtd-theme, it will be necessary to fix the custom theme by changing the name. Also recommend updating the theme to the latest release of sphinx-rtd-theme

quotient filter: Merge

Add functionality to merge multiple quotient filters

see #112

frombytes support

Support having objects loaded directly from bytes to remove the requirement of having to store on disk in some situations.

See this comment

@KOLANICH

counting cuckoo filter

Update python support to python 3.7 or higher

first step: pyupgrade

str for HeavyHitters and StreamThreshold

The library should augment the str for both heavy hitters and stream threshold

Inefficient count-min sketch table representation

The count-min sketch table is represented as a Python list. Using numpy would be more efficient.
Should I make a pull request for this?

Missing method to aggregate count-min sketches

In theory, count-min sketches have the property that two tables can be summed together, which allows sketches to be built in parallel, but I don't see this implemented here.
Should I make a pull request that implements it?
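A rough sketch of what such a merge could look like, assuming both sketches share the same width, depth, and hash functions (otherwise cell-wise addition is meaningless). `TinyCountMin` and its `merge` method are hypothetical names for a toy class, not the pyprobables API.

```python
import hashlib


class TinyCountMin:
    """Toy count-min sketch for illustration; NOT the pyprobables class."""

    def __init__(self, width: int, depth: int) -> None:
        self.width = width
        self.depth = depth
        self.table = [[0] * width for _ in range(depth)]

    def _cells(self, key: str):
        # One salted hash per row selects the column to update in that row.
        data = key.encode("utf-8")
        for row in range(self.depth):
            digest = hashlib.blake2b(
                data, digest_size=8, salt=row.to_bytes(8, "little")
            ).digest()
            yield row, int.from_bytes(digest, "little") % self.width

    def add(self, key: str, count: int = 1) -> None:
        for row, col in self._cells(key):
            self.table[row][col] += count

    def check(self, key: str) -> int:
        # The minimum over the rows is the classic count-min estimate.
        return min(self.table[row][col] for row, col in self._cells(key))

    def merge(self, other: "TinyCountMin") -> None:
        """Add another sketch's counters into this one, cell by cell."""
        if (self.width, self.depth) != (other.width, other.depth):
            raise ValueError("sketches must have the same width and depth")
        for mine, theirs in zip(self.table, other.table):
            for col, value in enumerate(theirs):
                mine[col] += value
```

With this shape, each worker can build its own sketch over a shard of the stream and the results can be folded together with repeated `merge` calls.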

quotient filter: remove element

Add a function to insert raw `bytes` into a filter.

Complete generating documentation

Better, more complete documentation would be very beneficial.

hashes library

It could be beneficial to provide a library of standard hashing strategies using different hashing methods. Ideally they would be usable by each data structure

Counting Bloom continuation

It would be good if the counting bloom supported:

Union
Intersection
Jaccard Index

self expanding cuckoofilter

unique inserts into cuckoo filter

Implement Rolling Bloom Filter

A rolling bloom filter is similar to the expanding bloom filter but, as it expands, it is capped at the number of expansions. Once another expansion is necessary, pop off the first set and continue. It would be like a timed bloom filter of sorts.
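The rolling behavior described above can be sketched as follows. To keep the example short and deterministic, each generation is a plain Python `set` standing in for a Bloom filter; the class name `RollingFilter` and its parameters are hypothetical.

```python
from collections import deque


class RollingFilter:
    """Toy rolling filter: a capped queue of generations.

    Each generation is a set here, standing in for a Bloom filter, so this
    sketch has no false positives. Only the expand-then-drop logic is shown.
    """

    def __init__(self, elements_per_generation: int, max_generations: int) -> None:
        self.elements_per_generation = elements_per_generation
        # A deque with maxlen discards the oldest generation automatically
        # once the cap on the number of expansions is reached.
        self.generations = deque([set()], maxlen=max_generations)

    def add(self, key: str) -> None:
        newest = self.generations[-1]
        if key in newest:
            return
        if len(newest) >= self.elements_per_generation:
            # Newest generation is full: expand, possibly popping the oldest.
            self.generations.append(set())
        self.generations[-1].add(key)

    def __contains__(self, key: str) -> bool:
        return any(key in generation for generation in self.generations)
```

Keys inserted long ago eventually fall out of the filter, which gives the "timed" behavior mentioned in the issue.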

Count-Min Sketch check method

Not sure I like how the check method works with selecting a 'type'; I want it to be able to support multiple types of queries, but need to think of a better way to select it.

changing fingerprint size on Cuckoo Filter

Being able to change the fingerprint size could make the filter more or less probable and potentially smaller.

Math domain error

Hello,

I'm getting the following error when using print(bloom_filter).

File "/home/user/.conda/envs/biopython/lib/python3.9/site-packages/probables/blooms/bloom.py", line 127, in __str__
    self.estimate_elements(),
  File "/home/user/.conda/envs/biopython/lib/python3.9/site-packages/probables/blooms/bloom.py", line 350, in estimate_elements
    log_n = math.log(1 - (float(setbits) / float(self.number_bits)))
ValueError: math domain error

I'm running the latest version, downloaded with pip only the other day, and I'm using Python version 3.8.6.
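The traceback points at `math.log(1 - setbits / number_bits)`, which raises this error whenever every bit in the filter is set, because the argument becomes 0. Below is a guarded version of that estimate as a standalone sketch; the function signature and the choice to return `math.inf` for a saturated filter are assumptions, not the library's behavior.

```python
import math


def estimate_elements(set_bits: int, number_bits: int, number_hashes: int) -> float:
    """Estimate the number of inserted elements from the count of set bits.

    Uses n ~= -(m / k) * ln(1 - X / m), where m is the number of bits,
    k the number of hash functions, and X the number of set bits.
    """
    if set_bits >= number_bits:
        # Saturated filter: log(0) is undefined, so signal "unbounded"
        # instead of raising ValueError (assumed convention for this sketch).
        return math.inf
    log_n = math.log(1 - (set_bits / number_bits))
    return -(number_bits / number_hashes) * log_n
```

A caller such as `__str__` could then print the estimate without crashing when the filter is full.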

Cuckoo Filter string representation

It would be helpful to have a string representation of the cuckoo filter object similar to the bloom filter and count-min sketch's

Initial Update

Hi 👊

This is my first visit to this fine repo, but it seems you have been working hard to keep all dependencies updated so far.

Once you have closed this issue, I'll create separate pull requests for every update as soon as I find one.

That's it for now!

Happy merging! 🤖

fix the submodule import

After installing from pypi, the blooms, countminsketch, and cuckoo submodules are not found. I believe this has to do with the setup.py packages section. Need to look into this.

fix readthedocs build

Implement an arg controlling error rate of cuckoo filter

How to cPickle count min sketch instance

I encounter this error when using cPickle to save count min sketch instance:

Traceback (most recent call last):
  File "test.py", line 14, in <module>
    pkl.dump(cms, f)
  File "/usr/local/Cellar/python@2/2.7.16/Frameworks/Python.framework/Versions/2.7/lib/python2.7/copy_reg.py", line 77, in _reduce_ex
    raise TypeError("a class that defines __slots__ without "
TypeError: a class that defines __slots__ without defining __getstate__ cannot be pickled
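The error message itself names the usual remedy: a class that defines `__slots__` can provide `__getstate__` and `__setstate__`. Here is a minimal sketch with a hypothetical `SlottedSketch` class standing in for the count-min sketch; the attribute names are made up for the example.

```python
import pickle


class SlottedSketch:
    """Hypothetical stand-in for a class that uses __slots__."""

    __slots__ = ("width", "depth", "table")

    def __init__(self, width: int, depth: int) -> None:
        self.width = width
        self.depth = depth
        self.table = [[0] * width for _ in range(depth)]

    def __getstate__(self) -> dict:
        # Collect the slot values into a plain dict that pickle can store.
        return {name: getattr(self, name) for name in self.__slots__}

    def __setstate__(self, state: dict) -> None:
        # Restore each slot from the pickled dict.
        for name, value in state.items():
            setattr(self, name, value)


original = SlottedSketch(4, 2)
original.table[1][3] = 7
restored = pickle.loads(pickle.dumps(original))
```

The example uses Python 3's default pickle protocol; the same two methods are what the Python 2 error message is asking for.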

Counting Bloom Implementation

Providing a counting bloom would be a nice addition

Murmur hash example in the README.rst doesn't work

Some of the examples in the readme use pyprobables as the package name, but the package installs as probables.
After fixing that, the example fails:

In [2]: import mmh3  # murmur hash 3 implementation (pip install mmh3)
   ...: from probables.hashes import hash_with_depth_bytes
   ...: from probables import BloomFilter
   ...: 
   ...: 
   ...: @hash_with_depth_bytes
   ...: def my_hash(key):
   ...:     return mmh3.hash_bytes(key)
   ...: 
   ...: 
   ...: blm = BloomFilter(est_elements=1000, false_positive_rate=0.05, hash_function=my_hash)

In [3]: blm.add("google.com")
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Input In [3], in <module>
----> 1 blm.add("google.com")

File ~/.pyenv/versions/3.9.2/envs/gh392/lib/python3.9/site-packages/probables/blooms/bloom.py:250, in BloomFilter.add(self, key)
    245 def add(self, key: KeyT) -> None:
    246     """Add the key to the Bloom Filter
    247 
    248     Args:
    249         key (str): The element to be inserted"""
--> 250     self.add_alt(self.hashes(key))

File ~/.pyenv/versions/3.9.2/envs/gh392/lib/python3.9/site-packages/probables/blooms/bloom.py:243, in BloomFilter.hashes(self, key, depth)
    235 """Return the hashes based on the provided key
    236 
    237 Args:
   (...)
    240 Returns:
    241     List(int): A list of the hashes for the key in int form"""
    242 tmp = depth if depth is not None else self._number_hashes
--> 243 return self._hash_func(key, tmp)

File ~/.pyenv/versions/3.9.2/envs/gh392/lib/python3.9/site-packages/probables/hashes.py:36, in hash_with_depth_bytes.<locals>.hashing_func(key, depth)
     34 tmp = key if not isinstance(key, str) else key.encode("utf-8")
     35 for idx in range(depth):
---> 36     tmp = func(tmp, idx)
     37     res.append(unpack("Q", tmp[:8])[0])  # turn into 64 bit number
     38 return res

TypeError: my_hash() takes 1 positional argument but 2 were given
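The traceback shows the wrapper calling `func(tmp, idx)`, so the wrapped function has to accept a second argument. The sketch below re-creates the wrapper from the lines visible in the traceback (under the hypothetical name `hash_with_depth_bytes_sketch`) and uses `hashlib.blake2b` as a stand-in for mmh3 so it runs without extra packages. With mmh3 installed, the analogous body would presumably be `mmh3.hash_bytes(key, seed=depth)`, but that is an assumption that has not been tested here.

```python
import hashlib
from struct import unpack


def hash_with_depth_bytes_sketch(func):
    """Re-creation of the wrapper as it appears in the traceback above."""

    def hashing_func(key, depth=1):
        res = []
        tmp = key if not isinstance(key, str) else key.encode("utf-8")
        for idx in range(depth):
            tmp = func(tmp, idx)  # note: two arguments are passed
            res.append(unpack("Q", tmp[:8])[0])  # turn into 64 bit number
        return res

    return hashing_func


@hash_with_depth_bytes_sketch
def my_hash(key, depth=1):
    # Accept the depth index and use it as a salt (stand-in for an mmh3 seed).
    return hashlib.blake2b(
        key, digest_size=16, salt=depth.to_bytes(8, "little")
    ).digest()


print(my_hash("google.com", 3))
```

The only change relative to the README example is the extra `depth` parameter on `my_hash`.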

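from probables import BloomFilter


def verifyMembership(key):
    # Report whether the key may be present in the module-level filter
    global bloom
    if key in bloom:
        print('Its possibly in')
    else:
        print('Definitly not in')

key = 'some'
filterFile = 'index.dat'
bloom = BloomFilter(est_elements=100000000, false_positive_rate=0.03, filepath=filterFile)
verifyMembership(key)  # check before adding
bloom.add(key)
verifyMembership(key)  # check after adding
bloom.export(filterFile)  # write the filter back to the same file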

I called my script twice and the output is:

Definitly not in
Its possibly in
Definitly not in
Its possibly in

But I would expect:

Definitly not in
Its possibly in
Its possibly in
Its possibly in

If I reduce est_elements to, let's say, 10000, then it works fine.
