jvirkki / libbloom Goto Github PK
View Code? Open in Web Editor NEWA simple and small bloom filter implementation in plain C.
License: BSD 2-Clause "Simplified" License
A simple and small bloom filter implementation in plain C.
License: BSD 2-Clause "Simplified" License
When trying to build fairly large, low-error rate filter (10^10 entries, 0.00001 error rate), bloom_init
fails because entries
, bits
, and bytes
are signed integers. Would you consider a patch that adjusts these type definitions to something like uint64_t
, or something similar/appropriate, to allow for significantly large filter sizes? I understand that wide platform support is a goal for this project.
For the curious, my use case was building a filter for the HIBP password list (~500 million entries) with varying error levels.
I propose to have some checksum function that the user can call on request.
Lets to call this new funtion bloom_checksum
int bloom_checksum(struct bloom * bloom)
{
return (murmurhash2(bloom->bf, bloom->bytes, 0x9747b28c) == bloom->checksum);
}
This depend also of some internal variable called checksum in the bloom filter structure, the hash function can be murmurhash2, xxhash64 or some other.
Actually im doing this with sha256 outside of the library.
The user can call to another function to indicate that the data in the bloom filter is not going to be modified anymore and construct the internal checksum and then call the bloom checksum function every 2 days or some else trigger.
When someone run an application that only runs for some seconds there is no need to call that function, but what happen in the long run, what if there is some application that use the bloom filter to do something "critical" for some months or a year, well in those case we need to run some checksum to ensure that the memory is not corrupted by some memory error, electrical failures, radiation or even cosmic particles. I know those scenarios are unlikely to happened, but the history show us that it actually happen some times ( https://www.youtube.com/watch?v=AaZ_RSt0KP8 ).
Usually ECC memory solve most of those cases but what happened if you don't have access to that kind of solutions.
Anyway this is only a suggestion for your excellent library. i'm using it for a long time with a lot of modifications to handle more than 128 GB of bf data
Best regards!
I'm trying to make debian package. and now it compiles OK for i386 and amd64.
however, it fails on armel:
cc -g -O2 -fdebug-prefix-map=/<<PKGBUILDDIR>>=. -fstack-protector-strong -Wformat -Werror=format-security -I/usr/include/arm-linux-gnueabi -Wdate-time -D_FORTIFY_SOURCE=2 -Wall -O3 -m32 -std=c99 -fPIC -DBLOOM_VERSION=1.4 -I/<<PKGBUILDDIR>> -I/<<PKGBUILDDIR>>/murmur2 -c murmur2/MurmurHash2.c -o /<<PKGBUILDDIR>>/build/murmurhash2.o
cc -g -O2 -fdebug-prefix-map=/<<PKGBUILDDIR>>=. -fstack-protector-strong -Wformat -Werror=format-security -I/usr/include/arm-linux-gnueabi -Wdate-time -D_FORTIFY_SOURCE=2 -Wall -O3 -m32 -std=c99 -fPIC -DBLOOM_VERSION=1.4 -I/<<PKGBUILDDIR>> -I/<<PKGBUILDDIR>>/murmur2 -c bloom.c -o /<<PKGBUILDDIR>>/build/bloom.o
cc: error: unrecognized command line option '-m32'
cc: error: unrecognized command line option '-m32'
Full build log:
http://debomatic-armel.debian.net/distribution#experimental/libbloom/1.4-1~exp1/buildlog
Hi, I am confusing about test_bit_set_bit() implements.
inline static int test_bit_set_bit(unsigned char * buf,
unsigned int x, int set_bit)
{
unsigned int byte = x >> 3; // I think this a 8-bytes align.
unsigned char c = buf[byte]; // "expensive memory access" Then we access a byte
unsigned int mask = 1 << (x % 8); // Why mask will left shift not equal to "x % 8"
if (c & mask) {
return 1;
} else {
if (set_bit) {
buf[byte] = c | mask;
}
return 0;
}
}
Hi this is just an observation not a bug.
I'm working with this bloom library but with with several changes.
One of the problems what i notice with the currents branches master
and development
is that those both are not thread safe.
Let me explain a little of my case, I'm building a really large bloom filter, in my program and usually the process is slow for one single thread, to speed up this I decide to work with multi-threads but when I test it for my purpose for the first time i notice that some times the program die without reason, after debugging for a time I realize that it need a mutex to protect the read and write of the bloom->bf variable.
After change a little the code i mange to keep the bloom as pthread safe like this...
Add a pthread_mutex_t
variable to the internal bloom structure bloom.c
:
pthread_mutex_t mutex;
Change the function bloom_check_add
to protect the call to the test_bit_set_bit
function
pthread_mutex_lock(&bloom->mutex);
r = test_bit_set_bit(bloom->bf, x, add);
pthread_mutex_unlock(&bloom->mutex);
if (r) {
obviously i add the r
variable to keep it outside of the if
call
add the thread header
#include <pthread.h>
I know, this make the libbloom OS dependent but there are similar calls from windows.
And yes this was only for let you know about the multi-thread problem I don't know if some other user report it before I search in the closed issues and there is nothing related with it.
Best regards!
It seems adds work by looping over all hash functions and setting the corresponding bit in the bitbuffer assigned to each hash function.
How could deletes be implemented? Looping over all hash functions and zeroing the bit from the bitbuffer of every hash function?
Also why are there multiple hash functions? Why not just 1 hash function and 1 bit buffer?
During some experiments with this lib, I discover an undesirable correlation between filter size and collision number.
Filters tested for 0.0001 estimated FP accuracy (0.01%).
But real accuracy, in particular ranges of filter sizes, is far from given.
Example of accuracy behavior near filter size 218793:
Globally, it looks much more strange:
For some filter sizes error jumps 70 times from 0.01% up to 0.7%
Could we get libbloom published to the Conan repository, in order to streamline dependency management for downstream projects?
According to the original author,
which is the latest version in the series of MurmurHash functions - the new version is faster, more robust, and its variants can produce 32- and 128-bit hash values efficiently on both x86 and x64 platforms.
The source file can be found at https://github.com/aappleby/smhasher/blob/master/src/MurmurHash3.cpp
This simple yet wonderful library is great! :) Thanks for all the hard work. I've been using it lately to implement some cryptographic protocols that use bloom filters as building blocks and having a function which returns the indexes output by each hash function given an entry would be enormously convenient.
This function would check whether a given element exists (probabilistically) in the bloom filter, and then hash it with each of the "bloom->hashes"-many hash functions associated with the bloom filter. Finally, it would store the output index (digest) in an output argument array.
Here's what I've been using to accomplish this task so far:
void bloom_get_indexes(unsigned long * indexes,
void * element, unsigned long len, bloom_filter_t * bloom)
{
if (1 == bloom_check(bloom, element, len)) {
unsigned int a = murmurhash2(element, len, 0x9747b28c); // Magic number from bloom_check_add()
unsigned int b = murmurhash2(element, len, a);
for (unsigned long i = 0; i < bloom->hashes; i++) {
indexes[i] = ((a + b*i) % bloom->bits) >> 3;
}
}
}
Here's what I would expect the function prototype to look like matching your header file's style.
/** ***************************************************************************
* Retrieve and store the indexes of a given bloom filter element.
*
* Upon return, the indexes holds the output of the hash function
* applied to the given element.
*
* Parameters:
* -----------
* indexes - Pointer to an allocated array of length bloom->hashes.
* buffer - Pointer to buffer containing element.
* len - Size of 'buffer'.
* bloom - Pointer to an allocated struct bloom (see above).
*
* Return: none
*
*/
void bloom_get_indexes(unsigned long * indexes,
void * element, unsigned long len, bloom_filter_t * bloom);
If this isn't something you feel fits with the minimalism of the library I totally understand. Also, if I made some kind of logic error in the implementation I would love to know.
Hi, there seems to be a problem deciding between -m32 or -m64 options in Makefile
on 32 bit arm:
$ make
mkdir -p /home/khu/test/libbloom/build
cc -Wall -O3 -m64 -std=c99 -fPIC -DBLOOM_VERSION=1.5 -I/home/khu/test/libbloom -I/home/khu/test/libbloom/murmur2 -c murmur2/MurmurHash2.c -o /home/khu/test/libbloom/build/murmurhash2.o
gcc: error: unrecognized command line option ‘-m64’
make: *** [Makefile:117: /home/khu/test/libbloom/build/murmurhash2.o] Error 1
I find the unit test sometimes (less than 20%, but maybe not accurate) fails.
Please find the build log on debian builldd:
https://buildd.debian.org/status/fetch.php?pkg=libbloom&arch=amd64&ver=1.4-6&stamp=1494930635&raw=0
----- Running basic tests -----
----- basic -----
bloom at 0x7ffc4a988c30 not initialized!
bloom at 0x7ffc4a988c30 not initialized!
bloom at 0x7ffc4a988c30
->entries = 102
->error = 0.100000
->bits = 488
->bits per elem = 4.792529
->bytes = 61
->hash functions = 4
----- add_random(10, 0.100000, 10, 0, 1, 32, 1) -----
bloom at 0x7ffc4a988b20
->entries = 10
->error = 0.100000
->bits = 47
->bits per elem = 4.792529
->bytes = 6
->hash functions = 4
entries: 10, error: 0.100000, count: 10, coll: 2, error: 0.200000, bytes: 6
error: expected error 0.100000 but observed 0.200000
Makefile:124: recipe for target 'test' failed
The statement
x = (a + i*b) % bloom->bits;
in bloom.c
in function bloom_check_add
uses two hash result and i
to combine a new value to determine which bit should be set 1, so, is every x evenly distributed in the bitmap?
I'm already using the version Two it works very well for some numer of items but I notice that the unsigned int bit limits the amount of real reserved memory.
bytes is always ceil bits/8
This result in a maximum number of bits: 4,294,967,295 but only 536870912 bytes
With some high value of items it result in some cases to a random decrement of bytes
I made this code as a proof of concept
#include <stdio.h>
#include <stdlib.h>
#include "bloom/bloom.h"
struct bloom bloom;
unsigned int items[10] = {4194304,8388608,16777216,33554432,67108864,134217728,268435456,536870912,1073741824,2147483648};
int main(int argc, char **argv) {
int i = 0;
while(i < 10) {
if(bloom_init2(&bloom,items[i],0.001) == 1) {
fprintf(stderr,"error bloom_init\n");
exit(0);
}
printf("Items %u bloom filter bytes %u\n",items[i],bloom.bytes);
bloom_reset(&bloom);
i++;
}
}
His output is:
Items 4194304 bloom filter bytes 7537997
Items 8388608 bloom filter bytes 15075994
Items 16777216 bloom filter bytes 30151987
Items 33554432 bloom filter bytes 60303973
Items 67108864 bloom filter bytes 120607946
Items 134217728 bloom filter bytes 241215893
Items 268435456 bloom filter bytes 482431785
Items 536870912 bloom filter bytes 427992657
Items 1073741824 bloom filter bytes 319114402
Items 2147483648 bloom filter bytes 101357891
The last line is only 96 MB of RAM for a bloom filter of 2147483648 items.
I solve this in my code changing the bits variable type for "unsigned long long int" instead of "unsigned int"
Kinds regards
A declarative, efficient, and flexible JavaScript library for building user interfaces.
🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.
TypeScript is a superset of JavaScript that compiles to clean JavaScript output.
An Open Source Machine Learning Framework for Everyone
The Web framework for perfectionists with deadlines.
A PHP framework for web artisans
Bring data to life with SVG, Canvas and HTML. 📊📈🎉
JavaScript (JS) is a lightweight interpreted programming language with first-class functions.
Some thing interesting about web. New door for the world.
A server is a program made to process requests and deliver data to clients.
Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.
Some thing interesting about visualization, use data art
Some thing interesting about game, make everyone happy.
We are working to build community through open source technology. NB: members must have two-factor auth.
Open source projects and samples from Microsoft.
Google ❤️ Open Source for everyone.
Alibaba Open Source for everyone
Data-Driven Documents codes.
China tencent open source team.