jvirkki / libbloom Goto Github PK

View Code? Open in Web Editor NEW

282.0 11.0 112.0 2.54 MB

A simple and small bloom filter implementation in plain C.

License: BSD 2-Clause "Simplified" License

C 77.90% Makefile 10.84% Perl 7.16% Shell 0.72% Roff 3.38%

bloom-filter c

libbloom's People

Contributors

Stargazers

Watchers

Forkers

lemonhall czinsane rahulnanda cen5bin kdm9 yyang19 ae-s mahall alfiyazi arminius2 m0nkeyc0der everythingme jakejesi hongkunge songbingyu pombredanne decarboxy orange-sys yowcow emaxerrno hk185 shadowsocks kuree rogers0 cute beingmeta jingfg123 utumen scriptking meridianz mnunberg hmarzan normanx chrisdew edwardbetts p0pr0ck5 xpgongmath sherlockwu fkrestan jun-wen misc-useful robguima nayutaco 0xe0f pratikone bssrdf baotonglu light4 kiwiroy kata-rina beyonddream-productions lato kylewu11 clayne junkrat77 liwenkaixxx cosmossun innerout secp8x32 melvinto konstantin89 welloptimized eztam- velovec jasontangjs ycstech howtolovechina jaketbrown soujanyaponnapalli mayhemheroes ff12323 zhangw 5l1v3r1 zane-a-karl shenhang-seu

libbloom's Issues

Bloom struct types limit filter size

When trying to build fairly large, low-error rate filter (10^10 entries, 0.00001 error rate), bloom_init fails because entries, bits, and bytes are signed integers. Would you consider a patch that adjusts these type definitions to something like uint64_t, or something similar/appropriate, to allow for significantly large filter sizes? I understand that wide platform support is a goal for this project.

For the curious, my use case was building a filter for the HIBP password list (~500 million entries) with varying error levels.

checksum function

I propose to have some checksum function that the user can call on request.

Lets to call this new funtion bloom_checksum

int bloom_checksum(struct bloom * bloom)
{
	return (murmurhash2(bloom->bf, bloom->bytes, 0x9747b28c) == bloom->checksum);
}

This depend also of some internal variable called checksum in the bloom filter structure, the hash function can be murmurhash2, xxhash64 or some other.

Actually im doing this with sha256 outside of the library.

The user can call to another function to indicate that the data in the bloom filter is not going to be modified anymore and construct the internal checksum and then call the bloom checksum function every 2 days or some else trigger.

When someone run an application that only runs for some seconds there is no need to call that function, but what happen in the long run, what if there is some application that use the bloom filter to do something "critical" for some months or a year, well in those case we need to run some checksum to ensure that the memory is not corrupted by some memory error, electrical failures, radiation or even cosmic particles. I know those scenarios are unlikely to happened, but the history show us that it actually happen some times ( https://www.youtube.com/watch?v=AaZ_RSt0KP8 ).

Usually ECC memory solve most of those cases but what happened if you don't have access to that kind of solutions.

Anyway this is only a suggestion for your excellent library. i'm using it for a long time with a lot of modifications to handle more than 128 GB of bf data

Best regards!

-m32 gcc options seems not existing on armel which is 32-bit processor

I'm trying to make debian package. and now it compiles OK for i386 and amd64.
however, it fails on armel:

cc -g -O2 -fdebug-prefix-map=/<<PKGBUILDDIR>>=. -fstack-protector-strong -Wformat -Werror=format-security -I/usr/include/arm-linux-gnueabi -Wdate-time -D_FORTIFY_SOURCE=2 -Wall -O3 -m32 -std=c99 -fPIC -DBLOOM_VERSION=1.4 -I/<<PKGBUILDDIR>> -I/<<PKGBUILDDIR>>/murmur2 -c murmur2/MurmurHash2.c -o /<<PKGBUILDDIR>>/build/murmurhash2.o
cc -g -O2 -fdebug-prefix-map=/<<PKGBUILDDIR>>=. -fstack-protector-strong -Wformat -Werror=format-security -I/usr/include/arm-linux-gnueabi -Wdate-time -D_FORTIFY_SOURCE=2 -Wall -O3 -m32 -std=c99 -fPIC -DBLOOM_VERSION=1.4 -I/<<PKGBUILDDIR>> -I/<<PKGBUILDDIR>>/murmur2 -c bloom.c -o /<<PKGBUILDDIR>>/build/bloom.o
cc: error: unrecognized command line option '-m32'
cc: error: unrecognized command line option '-m32'

Full build log:
http://debomatic-armel.debian.net/distribution#experimental/libbloom/1.4-1~exp1/buildlog

Why test_bit_set_bit() use a mask?

Hi, I am confusing about test_bit_set_bit() implements.

inline static int test_bit_set_bit(unsigned char * buf,
                                   unsigned int x, int set_bit)
{
  unsigned int byte = x >> 3;           // I think this a 8-bytes align.
  unsigned char c = buf[byte];        // "expensive memory access" Then we access a byte
  unsigned int mask = 1 << (x % 8); // Why mask will left shift not equal to "x % 8"

  if (c & mask) {
    return 1;
  } else {
    if (set_bit) {
      buf[byte] = c | mask;
    }
    return 0;
  }
}

about thread safe in multi-thread bloom creation

Hi this is just an observation not a bug.

I'm working with this bloom library but with with several changes.

One of the problems what i notice with the currents branches master and development is that those both are not thread safe.

Let me explain a little of my case, I'm building a really large bloom filter, in my program and usually the process is slow for one single thread, to speed up this I decide to work with multi-threads but when I test it for my purpose for the first time i notice that some times the program die without reason, after debugging for a time I realize that it need a mutex to protect the read and write of the bloom->bf variable.

After change a little the code i mange to keep the bloom as pthread safe like this...

Add a pthread_mutex_t variable to the internal bloom structure bloom.c:

pthread_mutex_t mutex;

Change the function bloom_check_add to protect the call to the test_bit_set_bit function

	pthread_mutex_lock(&bloom->mutex);
	r = test_bit_set_bit(bloom->bf, x, add);
	pthread_mutex_unlock(&bloom->mutex);
    if (r) {

obviously i add the r variable to keep it outside of the if call

add the thread header

#include <pthread.h>

I know, this make the libbloom OS dependent but there are similar calls from windows.

And yes this was only for let you know about the multi-thread problem I don't know if some other user report it before I search in the closed issues and there is nothing related with it.

Best regards!

How to implement delete?

It seems adds work by looping over all hash functions and setting the corresponding bit in the bitbuffer assigned to each hash function.

How could deletes be implemented? Looping over all hash functions and zeroing the bit from the bitbuffer of every hash function?

Also why are there multiple hash functions? Why not just 1 hash function and 1 bit buffer?

Strong correlation between filter size and real error / collision number

During some experiments with this lib, I discover an undesirable correlation between filter size and collision number.

Filters tested for 0.0001 estimated FP accuracy (0.01%).
But real accuracy, in particular ranges of filter sizes, is far from given.

Example of accuracy behavior near filter size 218793:

Globally, it looks much more strange:

filters-statistics.txt

For some filter sizes error jumps 70 times from 0.01% up to 0.7%

Test code.

Publish to conan

Could we get libbloom published to the Conan repository, in order to streamline dependency management for downstream projects?

https://conan.io/center/

Feature Request: upgrade to Murmurhash3

According to the original author,

which is the latest version in the series of MurmurHash functions - the new version is faster, more robust, and its variants can produce 32- and 128-bit hash values efficiently on both x86 and x64 platforms.

The source file can be found at https://github.com/aappleby/smhasher/blob/master/src/MurmurHash3.cpp

Add bloom_get_indexes function

This simple yet wonderful library is great! :) Thanks for all the hard work. I've been using it lately to implement some cryptographic protocols that use bloom filters as building blocks and having a function which returns the indexes output by each hash function given an entry would be enormously convenient.

This function would check whether a given element exists (probabilistically) in the bloom filter, and then hash it with each of the "bloom->hashes"-many hash functions associated with the bloom filter. Finally, it would store the output index (digest) in an output argument array.

Here's what I've been using to accomplish this task so far:

void bloom_get_indexes(unsigned long * indexes,
		       void * element, unsigned long len, bloom_filter_t * bloom)
{
    if (1 == bloom_check(bloom, element, len)) {
	unsigned int a = murmurhash2(element, len, 0x9747b28c); // Magic number from bloom_check_add()
	unsigned int b = murmurhash2(element, len, a);

	for (unsigned long i = 0; i < bloom->hashes; i++) {
	    indexes[i] = ((a + b*i) % bloom->bits) >> 3;
	}
    }
}

Here's what I would expect the function prototype to look like matching your header file's style.

/** ***************************************************************************
 * Retrieve and store the indexes of a given bloom filter element.
 *
 * Upon return, the indexes holds the output of the hash function
 * applied to the given element.
 *
 * Parameters:
 * -----------
 *     indexes - Pointer to an allocated array of length bloom->hashes.
 *     buffer  - Pointer to buffer containing element.
 *     len     - Size of 'buffer'.
 *     bloom   - Pointer to an allocated struct bloom (see above).
 *
 * Return: none
 *
 */
void bloom_get_indexes(unsigned long * indexes,
		       void * element, unsigned long len, bloom_filter_t * bloom);

If this isn't something you feel fits with the minimalism of the library I totally understand. Also, if I made some kind of logic error in the implementation I would love to know.

Build fails on 32 bit arm (option -m64)

Hi, there seems to be a problem deciding between -m32 or -m64 options in Makefile on 32 bit arm:

$ make
mkdir -p /home/khu/test/libbloom/build
cc   -Wall -O3 -m64 -std=c99 -fPIC -DBLOOM_VERSION=1.5 -I/home/khu/test/libbloom -I/home/khu/test/libbloom/murmur2 -c murmur2/MurmurHash2.c -o /home/khu/test/libbloom/build/murmurhash2.o
gcc: error: unrecognized command line option ‘-m64’
make: *** [Makefile:117: /home/khu/test/libbloom/build/murmurhash2.o] Error 1

unit test sometimes fails

I find the unit test sometimes (less than 20%, but maybe not accurate) fails.

Please find the build log on debian builldd:
https://buildd.debian.org/status/fetch.php?pkg=libbloom&arch=amd64&ver=1.4-6&stamp=1494930635&raw=0

----- Running basic tests -----
----- basic -----
bloom at 0x7ffc4a988c30 not initialized!
bloom at 0x7ffc4a988c30 not initialized!
bloom at 0x7ffc4a988c30
->entries = 102
->error = 0.100000
->bits = 488
->bits per elem = 4.792529
->bytes = 61
->hash functions = 4
----- add_random(10, 0.100000, 10, 0, 1, 32, 1) -----
bloom at 0x7ffc4a988b20
->entries = 10
->error = 0.100000
->bits = 47
->bits per elem = 4.792529
->bytes = 6
->hash functions = 4
entries: 10, error: 0.100000, count: 10, coll: 2, error: 0.200000, bytes: 6
error: expected error 0.100000 but observed 0.200000
Makefile:124: recipe for target 'test' failed

Is the combined value after each hash operation evenly distributed in the bitmap?

The statement

x = (a + i*b) % bloom->bits;

in bloom.c in function bloom_check_add uses two hash result and i to combine a new value to determine which bit should be set 1, so, is every x evenly distributed in the bitmap?

total number of bytes decrement with a high number of items

I'm already using the version Two it works very well for some numer of items but I notice that the unsigned int bit limits the amount of real reserved memory.

bytes is always ceil bits/8

This result in a maximum number of bits: 4,294,967,295 but only 536870912 bytes

With some high value of items it result in some cases to a random decrement of bytes

I made this code as a proof of concept

#include <stdio.h>
#include <stdlib.h>
#include "bloom/bloom.h"

struct bloom bloom;

unsigned int items[10] = {4194304,8388608,16777216,33554432,67108864,134217728,268435456,536870912,1073741824,2147483648};

int main(int argc, char **argv)  {
  int i = 0;
  while(i < 10) {
    if(bloom_init2(&bloom,items[i],0.001)  == 1)  {
      fprintf(stderr,"error bloom_init\n");
      exit(0);
    }
    printf("Items %u bloom filter bytes %u\n",items[i],bloom.bytes);
    bloom_reset(&bloom);
    i++;
  }
}

His output is:

Items 4194304 bloom filter bytes 7537997
Items 8388608 bloom filter bytes 15075994
Items 16777216 bloom filter bytes 30151987
Items 33554432 bloom filter bytes 60303973
Items 67108864 bloom filter bytes 120607946
Items 134217728 bloom filter bytes 241215893
Items 268435456 bloom filter bytes 482431785
Items 536870912 bloom filter bytes 427992657
Items 1073741824 bloom filter bytes 319114402
Items 2147483648 bloom filter bytes 101357891

The last line is only 96 MB of RAM for a bloom filter of 2147483648 items.

I solve this in my code changing the bits variable type for "unsigned long long int" instead of "unsigned int"

Kinds regards