Coder Social home page Coder Social logo

libbloom's People

Contributors

edwardbetts avatar emaxerrno avatar jvirkki avatar mnunberg avatar p0pr0ck5 avatar robguima avatar rogers0 avatar

Stargazers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

Watchers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

libbloom's Issues

Bloom struct types limit filter size

When trying to build fairly large, low-error rate filter (10^10 entries, 0.00001 error rate), bloom_init fails because entries, bits, and bytes are signed integers. Would you consider a patch that adjusts these type definitions to something like uint64_t, or something similar/appropriate, to allow for significantly large filter sizes? I understand that wide platform support is a goal for this project.

For the curious, my use case was building a filter for the HIBP password list (~500 million entries) with varying error levels.

checksum function

I propose to have some checksum function that the user can call on request.

Lets to call this new funtion bloom_checksum

int bloom_checksum(struct bloom * bloom)
{
	return (murmurhash2(bloom->bf, bloom->bytes, 0x9747b28c) == bloom->checksum);
}

This depend also of some internal variable called checksum in the bloom filter structure, the hash function can be murmurhash2, xxhash64 or some other.

Actually im doing this with sha256 outside of the library.

The user can call to another function to indicate that the data in the bloom filter is not going to be modified anymore and construct the internal checksum and then call the bloom checksum function every 2 days or some else trigger.

When someone run an application that only runs for some seconds there is no need to call that function, but what happen in the long run, what if there is some application that use the bloom filter to do something "critical" for some months or a year, well in those case we need to run some checksum to ensure that the memory is not corrupted by some memory error, electrical failures, radiation or even cosmic particles. I know those scenarios are unlikely to happened, but the history show us that it actually happen some times ( https://www.youtube.com/watch?v=AaZ_RSt0KP8 ).

Usually ECC memory solve most of those cases but what happened if you don't have access to that kind of solutions.

Anyway this is only a suggestion for your excellent library. i'm using it for a long time with a lot of modifications to handle more than 128 GB of bf data

Best regards!

-m32 gcc options seems not existing on armel which is 32-bit processor

I'm trying to make debian package. and now it compiles OK for i386 and amd64.
however, it fails on armel:

cc -g -O2 -fdebug-prefix-map=/<<PKGBUILDDIR>>=. -fstack-protector-strong -Wformat -Werror=format-security -I/usr/include/arm-linux-gnueabi -Wdate-time -D_FORTIFY_SOURCE=2 -Wall -O3 -m32 -std=c99 -fPIC -DBLOOM_VERSION=1.4 -I/<<PKGBUILDDIR>> -I/<<PKGBUILDDIR>>/murmur2 -c murmur2/MurmurHash2.c -o /<<PKGBUILDDIR>>/build/murmurhash2.o
cc -g -O2 -fdebug-prefix-map=/<<PKGBUILDDIR>>=. -fstack-protector-strong -Wformat -Werror=format-security -I/usr/include/arm-linux-gnueabi -Wdate-time -D_FORTIFY_SOURCE=2 -Wall -O3 -m32 -std=c99 -fPIC -DBLOOM_VERSION=1.4 -I/<<PKGBUILDDIR>> -I/<<PKGBUILDDIR>>/murmur2 -c bloom.c -o /<<PKGBUILDDIR>>/build/bloom.o
cc: error: unrecognized command line option '-m32'
cc: error: unrecognized command line option '-m32'

Full build log:
http://debomatic-armel.debian.net/distribution#experimental/libbloom/1.4-1~exp1/buildlog

Why test_bit_set_bit() use a mask?

Hi, I am confusing about test_bit_set_bit() implements.

inline static int test_bit_set_bit(unsigned char * buf,
                                   unsigned int x, int set_bit)
{
  unsigned int byte = x >> 3;           // I think this a 8-bytes align.
  unsigned char c = buf[byte];        // "expensive memory access" Then we access a byte
  unsigned int mask = 1 << (x % 8); // Why mask will left shift not equal to "x % 8"

  if (c & mask) {
    return 1;
  } else {
    if (set_bit) {
      buf[byte] = c | mask;
    }
    return 0;
  }
}

about thread safe in multi-thread bloom creation

Hi this is just an observation not a bug.

I'm working with this bloom library but with with several changes.

One of the problems what i notice with the currents branches master and development is that those both are not thread safe.

Let me explain a little of my case, I'm building a really large bloom filter, in my program and usually the process is slow for one single thread, to speed up this I decide to work with multi-threads but when I test it for my purpose for the first time i notice that some times the program die without reason, after debugging for a time I realize that it need a mutex to protect the read and write of the bloom->bf variable.

After change a little the code i mange to keep the bloom as pthread safe like this...

Add a pthread_mutex_t variable to the internal bloom structure bloom.c:

pthread_mutex_t mutex;

Change the function bloom_check_add to protect the call to the test_bit_set_bit function

	pthread_mutex_lock(&bloom->mutex);
	r = test_bit_set_bit(bloom->bf, x, add);
	pthread_mutex_unlock(&bloom->mutex);
    if (r) {

obviously i add the r variable to keep it outside of the if call

add the thread header

#include <pthread.h>

I know, this make the libbloom OS dependent but there are similar calls from windows.

And yes this was only for let you know about the multi-thread problem I don't know if some other user report it before I search in the closed issues and there is nothing related with it.

Best regards!

How to implement delete?

It seems adds work by looping over all hash functions and setting the corresponding bit in the bitbuffer assigned to each hash function.

How could deletes be implemented? Looping over all hash functions and zeroing the bit from the bitbuffer of every hash function?

Also why are there multiple hash functions? Why not just 1 hash function and 1 bit buffer?

Strong correlation between filter size and real error / collision number

During some experiments with this lib, I discover an undesirable correlation between filter size and collision number.

Filters tested for 0.0001 estimated FP accuracy (0.01%).
But real accuracy, in particular ranges of filter sizes, is far from given.

Example of accuracy behavior near filter size 218793:

error-filtersize

Globally, it looks much more strange:

filters-statistics.txt

error-filtersize02

For some filter sizes error jumps 70 times from 0.01% up to 0.7%

Test code.

Add bloom_get_indexes function

This simple yet wonderful library is great! :) Thanks for all the hard work. I've been using it lately to implement some cryptographic protocols that use bloom filters as building blocks and having a function which returns the indexes output by each hash function given an entry would be enormously convenient.

This function would check whether a given element exists (probabilistically) in the bloom filter, and then hash it with each of the "bloom->hashes"-many hash functions associated with the bloom filter. Finally, it would store the output index (digest) in an output argument array.

Here's what I've been using to accomplish this task so far:

void bloom_get_indexes(unsigned long * indexes,
		       void * element, unsigned long len, bloom_filter_t * bloom)
{
    if (1 == bloom_check(bloom, element, len)) {
	unsigned int a = murmurhash2(element, len, 0x9747b28c); // Magic number from bloom_check_add()
	unsigned int b = murmurhash2(element, len, a);

	for (unsigned long i = 0; i < bloom->hashes; i++) {
	    indexes[i] = ((a + b*i) % bloom->bits) >> 3;
	}
    }
}

Here's what I would expect the function prototype to look like matching your header file's style.

/** ***************************************************************************
 * Retrieve and store the indexes of a given bloom filter element.
 *
 * Upon return, the indexes holds the output of the hash function
 * applied to the given element.
 *
 * Parameters:
 * -----------
 *     indexes - Pointer to an allocated array of length bloom->hashes.
 *     buffer  - Pointer to buffer containing element.
 *     len     - Size of 'buffer'.
 *     bloom   - Pointer to an allocated struct bloom (see above).
 *
 * Return: none
 *
 */
void bloom_get_indexes(unsigned long * indexes,
		       void * element, unsigned long len, bloom_filter_t * bloom);

If this isn't something you feel fits with the minimalism of the library I totally understand. Also, if I made some kind of logic error in the implementation I would love to know.

Build fails on 32 bit arm (option -m64)

Hi, there seems to be a problem deciding between -m32 or -m64 options in Makefile on 32 bit arm:

$ make
mkdir -p /home/khu/test/libbloom/build
cc   -Wall -O3 -m64 -std=c99 -fPIC -DBLOOM_VERSION=1.5 -I/home/khu/test/libbloom -I/home/khu/test/libbloom/murmur2 -c murmur2/MurmurHash2.c -o /home/khu/test/libbloom/build/murmurhash2.o
gcc: error: unrecognized command line option ‘-m64’
make: *** [Makefile:117: /home/khu/test/libbloom/build/murmurhash2.o] Error 1

unit test sometimes fails

I find the unit test sometimes (less than 20%, but maybe not accurate) fails.

Please find the build log on debian builldd:
https://buildd.debian.org/status/fetch.php?pkg=libbloom&arch=amd64&ver=1.4-6&stamp=1494930635&raw=0

----- Running basic tests -----
----- basic -----
bloom at 0x7ffc4a988c30 not initialized!
bloom at 0x7ffc4a988c30 not initialized!
bloom at 0x7ffc4a988c30
->entries = 102
->error = 0.100000
->bits = 488
->bits per elem = 4.792529
->bytes = 61
->hash functions = 4
----- add_random(10, 0.100000, 10, 0, 1, 32, 1) -----
bloom at 0x7ffc4a988b20
->entries = 10
->error = 0.100000
->bits = 47
->bits per elem = 4.792529
->bytes = 6
->hash functions = 4
entries: 10, error: 0.100000, count: 10, coll: 2, error: 0.200000, bytes: 6
error: expected error 0.100000 but observed 0.200000
Makefile:124: recipe for target 'test' failed

total number of bytes decrement with a high number of items

I'm already using the version Two it works very well for some numer of items but I notice that the unsigned int bit limits the amount of real reserved memory.

bytes is always ceil bits/8

This result in a maximum number of bits: 4,294,967,295 but only 536870912 bytes

With some high value of items it result in some cases to a random decrement of bytes

I made this code as a proof of concept

#include <stdio.h>
#include <stdlib.h>
#include "bloom/bloom.h"

struct bloom bloom;

unsigned int items[10] = {4194304,8388608,16777216,33554432,67108864,134217728,268435456,536870912,1073741824,2147483648};

int main(int argc, char **argv)  {
  int i = 0;
  while(i < 10) {
    if(bloom_init2(&bloom,items[i],0.001)  == 1)  {
      fprintf(stderr,"error bloom_init\n");
      exit(0);
    }
    printf("Items %u bloom filter bytes %u\n",items[i],bloom.bytes);
    bloom_reset(&bloom);
    i++;
  }
}

His output is:

Items 4194304 bloom filter bytes 7537997
Items 8388608 bloom filter bytes 15075994
Items 16777216 bloom filter bytes 30151987
Items 33554432 bloom filter bytes 60303973
Items 67108864 bloom filter bytes 120607946
Items 134217728 bloom filter bytes 241215893
Items 268435456 bloom filter bytes 482431785
Items 536870912 bloom filter bytes 427992657
Items 1073741824 bloom filter bytes 319114402
Items 2147483648 bloom filter bytes 101357891

The last line is only 96 MB of RAM for a bloom filter of 2147483648 items.

I solve this in my code changing the bits variable type for "unsigned long long int" instead of "unsigned int"

Kinds regards

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.