Coder Social home page Coder Social logo

bloom's Introduction

Bloom

A highly efficient bloom filter implementation for Go

GoDoc Build Status

Bloom is a simple tool that provides a very efficient implementation of Bloom filters for the go language. It provides a command line tool that can be used to easily create Bloom filters with desired capacity and false positive probability. Values can be added to filters through standard input, which makes it easy to use the tool in a pipeline workflow.

Usage

NAME:
   Bloom Filter - Utility to work with bloom filters

USAGE:
   bloom [global options] command [command options] [arguments...]

VERSION:
   0.2.2

COMMANDS:
     create, cr         Create a new Bloom filter and store it in the given filename.
     insert, i          Inserts new values into an existing Bloom filter.
     join, j, merge, m  Joins two Bloom filters into one.
     check, c           Checks values against an existing Bloom filter.
     set-data, sd       Sets the data associated with the Bloom filter.
     get-data, gd       Prints the data associated with the Bloom filter.
     show, s            Shows various details about a given Bloom filter.
     help, h            Shows a list of commands or help for one command

GLOBAL OPTIONS:
   --gzip, --gz                      compress bloom file with gzip
   --interactive, -i                 interactively add values to the filter
   --split, -s                       split the input string
   --each, -e                        print each match of a split string individually
   --delimiter value, -d value       delimiter to use for splitting (default: ",")
   --fields value, -f value          fields of split output to use in filter (a single number or a comma-separated list of numbers, zero-indexed)
   --print-fields value, --pf value  fields of split output to print for a successful match (a single number or a comma-separated list of numbers, zero-indexed).
   --help, -h                        show help
   --version, -v                     print the version

Examples

To create a new bloom filter with a desired capacity and false positive probability, you can use the create command:

#will create a gzipped Bloom filter with 100.000 capacity and a 0.1 % false positive probability
bloom --gzip create -p 0.001 -n 100000 test.bloom.gz

To insert values, you can use the insert command and pipe some input to it (each line will be treated as one value):

cat values | bloom --gzip insert test.bloom.gz

You can also interactively add values to the filter by specifying the --interactive command line option:

bloom --gzip --interactive insert test.bloom.gz

To check if a given value or a list of values is in the filter, you can use the check command:

cat values | bloom --gzip check test.bloom.gz

This will return a list of all values in the filter.

Advanced Usage

Sometimes it is useful to attach additional information to a string that we want to check against the Bloom filter, such as a timestamp or the original line content. To make passing along this additional information easier within a shell context, the Bloom tool provides an option for splitting the input string by a given delimiter and checking the filter against the resulting field values. Example:

# will check the Bloom filter for the values foo, bar and baz
cat "foo,bar,baz" | bloom -s filter.bloom

# uses a different delimiter (--magic-delimiter--)
cat "foo--ac5ba--bar--ac5ba--baz" | bloom  -d "--ac5ba--" -s filter.bloom

# will check the Bloom filter against the second field value only
cat "foo,bar,baz" | bloom -f 1 -s filter.bloom

# will check the Bloom filter against the second and third field values only
cat "foo,bar,baz" | bloom -f 1,2 -s filter.bloom

# will print one line for each field value that matched against the filter
cat "foo,bar,baz" | bloom -e -s filter.bloom

# will print the last field value for each line whose fields matched against the filter
cat "foo,bar,baz" | bloom -e -s --pf -1 filter.bloom

This functionality is especially handy when using CSV data, as it allows you to filter CSV rows by checking individual columns against the filter without having to use external tools to split and reassemble the lines.

Installation

Installation on Debian-based systems

Debian command line tool:

sudo apt install golang-github-dcso-bloom-cli

Installation via go get:

go get github.com/DCSO/bloom/...

Installation from source

These need to be run from within the GOPATH source directory for this project (e.g. $GOPATH/src/github.com/DCSO/bloom). To install the command line tool:

make install

To run the tests:

make test

To run the benchmarks:

make bench

Cross-Compiling

To compile a binary, simply specify the target architecture and go:

#Windows, 64 bit
env GOOS=windows GOARCH=amd64 go build -v -o bloom.exe github.com/DCSO/bloom
#Windows, 32 bit
env GOOS=windows GOARCH=i386 go build -v -o /tmp/bloom github.com/DCSO/bloom

bloom's People

Contributors

0mbi avatar adewes avatar dthadi3 avatar rhaist avatar satta avatar

Stargazers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

Watchers

 avatar  avatar  avatar  avatar  avatar

bloom's Issues

Multiply Overflow in Fingerprint

The Issue:

bloom/bloom.go

Line 201 in 691ea61

hn = (hn * g) % m

produces the an multiply overflow.

	for i := uint64(0); i < s.k; i++ {
		hn = (hn * g) % m
		fingerprint[i] = uint64(hn % s.m)
	}

hn * g overflows in the unit tests. I am not sure if this is really an issue or if this behaviour is intended.

To reproduce:

Add a overflow check function like (as taken
from https://www.programming-idioms.org/idiom/86/check-if-integer-multiplication-will-overflow)

func multiplyWillOverflow(x, y uint64) bool {
   if x <= 1 || y <= 1 {
     return false
   }
   d := x * y
   return d/y != x
}

Add the following after
hn = (hn * g) %m:

if multiplyWillOverflow(hn, g) {
  panic("Multiply overflow occured")
}

And run go test ./..

Bug for bit sizes being a power of two (and maybe other values)

Implementing the following test wouldn't make the test passing anymore.

--- FAIL: TestBugFalsePositives (1.24s)
    bloom_test.go:330: False positive probability is too high at  20.18335038131724 % vs  0.602097140729787 %
FAIL
exit status 1
FAIL	github.com/DCSO/bloom	1.763s
func TestBugFalsePositives(t *testing.T) {
	// this capacity + p would produce a power of 2 bit size
	capacity := uint64(109397)
	p := float64(0.01)
	fillingFactor := 0.9
	N := uint64(float64(capacity) * fillingFactor)
	filter, _ := GenerateExampleFilter(capacity, p, N)
	pAcceptable := math.Pow(1-math.Exp(-float64(filter.k)*float64(N)/float64(filter.m)), float64(filter.k))
	fingerprint := make([]uint64, filter.k)
	cnt := 0.0
	matches := 0.0
	for {
		cnt++
		value := GenerateTestValue(100)
		filter.Fingerprint(value, fingerprint)
		if filter.CheckFingerprint(fingerprint) {
			matches++
		}
		if cnt > float64(capacity)*10 {
			break
		}
	}
	//this might still fail sometimes...
	//we allow for a probability that is two times higher than the normally acceptable probability
	if matches/cnt > pAcceptable*2 {
		t.Error("False positive probability is too high at ", matches/cnt*100, "% vs ", pAcceptable*100, "%")
	}
}

After analysis, it is possible there is a flaw in the fingerprint generation.
Here is my theory (not verified mathematically). Your algorithm generates fingerprints with:

	for i := uint64(0); i < s.k; i++ {
		hn = (hn * g) % m
		fingerprint[i] = uint64(hn % s.m)

However the value of m used (i.e. 0xffffffffffffffc5) is so big that what the code does is equivalent for any x = (hn * g); x < m (which is very likely as we are working with uint64) to the following

	for i := uint64(0); i < s.k; i++ {
		hn = hn * g
		fingerprint[i] = uint64(hn % s.m)

So under those conditions hn is always a multiple of h0 (fnv hash) multiplied by g power i, very likely creating lots of collisions for very specific cases, such as this one.

NB: I did not verify if other bit sizes create the same behavior

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.