
cfilter's People

Contributors

irfansharif · jgeiger · pirosb3 · ridwanmsharif · schumacherfm

cfilter's Issues

Italian spelling mistake

Hi!

Thanks for this amazing package, looking forward to reading the source!
I'm not sure if this was done on purpose, but "good morning" in Italian is buongiorno. Let me know if you want me to make a PR with the typo fix!

Thanks again
Daniel

Cuckoo filter fails after 500,000 records

I am inserting strings of the format:
str:="http://ajaybodhe.com/abc/zrjer/frenfrefnre/efnjerfrje/?%v"
Where "%v" is replaced with 1 to 500,000
And then when I try to search "http://ajaybodhe.com/abc/zrjer/frenfrefnre/efnjerfrje/?1", the lookup fails.

It worked when I tried it with 200,000 records, but failed again at 300,000 records.
Also, how can I get the total memory footprint of the filter?

Here is the code:

	cf := cfilter.New()
	t := time.Now()
	for i := 0; i < 500000; i++ {
		st := fmt.Sprintf(str, i+1)
		cf.Insert([]byte(st))
	}
	// fmt.Println("size(bytes) of cuckoo2 filter for 500000: ", ck.GetSize())
	// fmt.Println("size(MB)    of cuckoo2 filter for 500000", float64(ck.GetSize())/1024.00/1024.00)
	fmt.Println()
	fmt.Println("cuckoo2: time taken to insert 500000 urls: ", time.Since(t))
	fmt.Println("cuckoo2: avg time taken to insert 1 url:   ", time.Since(t)/500000)

	t = time.Now()
	for i := 0; i < 500000; i++ {
		st := fmt.Sprintf(str, i+1)
		if !cf.Lookup([]byte(st)) {
			fmt.Println(st + " is not present")
			panic(1)
		}
	}
	for i := 500000; i < 500000*2; i++ {
		st := fmt.Sprintf(str, i+1)
		if cf.Lookup([]byte(st)) {
			fmt.Println(st + " is present")
			panic(1)
		}
	}
	fmt.Println()
	fmt.Println("cuckoo2: time taken to search 1000000 urls: ", time.Since(t))
	fmt.Println("cuckoo2: avg time taken to search 1 url:    ", time.Since(t)/1000000)

bug in bucket.go, rand bucket selection

bucket.go, line 41
should be corrected to:

i := rand.Intn(len(b)) // - 1)

With a bucket size of 4, rand.Intn(4) will yield 0, 1, 2, 3; as written, rand.Intn(len(b) - 1) will only yield 0, 1, 2, so the last slot can never be selected. The Go documentation says Intn returns a value in the half-open interval [0, n), so it never returns n itself. This can easily be demonstrated with:

for {
	fmt.Println("1", rand.Intn(1)) // always zero, never one
	fmt.Println("2", rand.Intn(2)) // always zero or one
}

https://play.golang.org/p/heTsvQLLbc

low load fill and missing items

The paper suggests it is possible to obtain a 95% fill. However, I think the current hash methodology is the problem: it does not distribute fingerprints evenly across buckets.

How to confirm/reproduce the issue.

Generate 100MM random data records and write them to a file using the following method.

func generate() func() []byte {
	size := 8
	return func() []byte {
		bytes := make([]byte, size)
		for i := range bytes {
			// note: rand.Intn(255) yields 0..254; use 256 for the full byte range
			bytes[i] = byte(rand.Intn(255))
		}
		return bytes
	}
}

func main() {
	next := generate()
	mf, _ := os.Create("./dgen")
	defer mf.Close()
	out := bufio.NewWriter(mf)
	for i := 0; i < 100000000; i++ {
		out.WriteString(hex.EncodeToString(next()) + "\n")
	}
	out.Flush()
}

The resulting file should be 1.7GB; then confirm the lines are all unique with
$ uniq -u dgen dgen2
(strictly speaking, uniq only collapses adjacent duplicate lines, so sort dgen | uniq -d would be the thorough check)

-rw-r--r-- 1 mac staff 1700000000 Jan 5 14:55 dgen
-rw-r--r-- 1 mac staff 1700000000 Jan 5 15:31 dgen2

Use this 100MM randomized data as the source file, then run the following code.

func main() {
	cf := cfilter.New(cfilter.FingerprintSize(8), cfilter.BucketSize(4), cfilter.Size(35000000))
	f, _ := os.Open("dgen")
	defer f.Close()
	scanner := bufio.NewScanner(f)
	i := 0
	t0 := time.Now()
	for scanner.Scan() {
		i++
		st := scanner.Text()
		if chk := cf.Lookup([]byte(st)); !chk {
			if chk := cf.Insert([]byte(st)); !chk {
				fmt.Println(i, st, "stuffed")
				break
			}
		} else {
			fmt.Println(i, st, "exists")
		}
	}
	fmt.Println(time.Since(t0), cf.Count(), float64(cf.Count())/time.Since(t0).Seconds())
	f2, _ := os.Open("dgen")
	defer f2.Close()
	scanner2 := bufio.NewScanner(f2)
	i2 := 0
	missing := 0
	t1 := time.Now()
	for scanner2.Scan() {
		i2++
		st := scanner2.Text()
		if !cf.Lookup([]byte(st)) {
			// fmt.Println(i2, st, "missing")
			missing++
		}
	}
	fmt.Println(time.Since(t1), i2, float64(i2)/time.Since(t1).Seconds(), missing)
}

Based on the paper, it should be able to handle a 95% load. 100MM into 120MM slots is only 83.33% full, so it should be fine; however, running this fails at insert 50,895,993:

50895993 7cae124e8473d992 stuffed

Now change the parameters to the following to give it tons of room.

cf := cfilter.New(cfilter.FingerprintSize(8), cfilter.BucketSize(8), cfilter.Size(20000000))

This time it will complete the inserts without getting overstuffed (only 62.5% full), but there is a new problem: 5162 items are missing when we look them all up afterwards.

$ go run cf.go
3m14.974607833s 100000000 512887.29832959676
1m44.974537314s 100000000 952611.959256893 5162

In fingerprint.go, commenting out lines 20-22, which were meant to account for hashes that somehow hashed to zero,

	/*if fp == nil {
		fp[0] += 7
	}*/

and running again shows that we are still missing 5162:

3m16.405886172s 100000000 509149.7095800179
1m39.876159299s 100000000 1.0012399394257959e+06 5162

So I think there are two issues.

  1. The hashfp() doesn't distribute fingerprints across the buckets randomly enough. The paper says: "Empirically, a filter containing up to 4 billion items can effectively fill its hash table with loads of 95% when each bucket holds four fingerprints that are 6 bits or larger." The hashfp() is the only thing I can think of adjusting that could have an impact and allow a 95% load; I'm only testing with 100 million samples and cannot achieve > 50% load with the current design.

  2. What happened to the 5162 items that were inserted but can't be found with a subsequent lookup?

Any ideas?

optimisations, memory and cpu

Was pointed towards an issue raised here; two possible optimisations could be made:

  • use an array instead of a slice for bucket and fingerprint: sizeof(slice) is 24 bytes, and every slice adds a pointer indirection; since the bucket and fingerprint sizes don't vary, an array would fare better
  • additionally, the original paper proposes 12-bit fingerprints; 8-bit variants have lower memory efficiency for the same false-positive probability

Insert fails when entering rebalance kick loop

In cfilter.go, the Insert method fails during rebalance with an index calculated to be greater than the size of the filter; a valid index must be strictly less than the filter size.

Script to reproduce the error condition.

func main() {
	cf := cfilter.New(cfilter.FingerprintSize(3), cfilter.Size(10))
	for i := 0; i < 29; i++ { // 28 will work; 29 ... 40 will fail
		cf.Insert([]byte("hello, fir" + strconv.Itoa(i+3) + "+st timer" + strconv.Itoa(i) + "aa"))
	}
	cf.Dump() // visualize filter
	fmt.Println(cf.Count())
}

// visualize contents
func (cf *CFilter) Dump() {
	for a, b := range cf.buckets {
		fmt.Print(a)
		for _, c := range b {
			fmt.Print("[", hex.EncodeToString(c), "]")
		}
		fmt.Println()
	}
}

When all the buckets are full, execution drops down to the rebalance section and does at most 500 kicks; however, it crashes early with the following error:

panic: runtime error: index out of range
.../cfilter.go:64 +0x786

Introducing the following commented code shows i before and after the % mod. Notice that i is calculated as 13 while the filter size is 10, so it panics with an index out of range error and never performs the kick and rebalance.

// Insert adds an element (in byte-array form) to the Cuckoo filter,
// returns true if successful and false otherwise.
func (cf *CFilter) Insert(item []byte) bool {
	f := fprint(item, cf.fpSize, cf.hashfn)
	j := hashfp(item) % cf.size
	k := (j ^ hashfp(f)) % cf.size

	if cf.buckets[j].insert(f) || cf.buckets[k].insert(f) {
		cf.count++
		return true
	}

	i := [2]uint{j, k}[rand.Intn(2)]
	for n := uint(0); n < cf.kicks; n++ {
		f = cf.buckets[i].swap(f)
		fmt.Println(hex.EncodeToString(f), j, k, i) // show the values
		i ^= hashfp(f) % cf.size
		fmt.Println(hex.EncodeToString(f), j, k, i) // show the values

		if cf.buckets[i].insert(f) {
			cf.count++
			return true
		}
	}

	return false
}

Sample output.
Running with 28 inserts works because no kicks happen.

0[fc84f2][48f321][36bd37][]
1[09f92d][65de72][87fa10][32ed51]
2[069309][b8a269][69c550][e4ec1f]
3[0a2d31][83995a][][]
4[9cd789][be9358][eb6ed0][]
5[c955b3][][][]
6[77a10b][6bf9c9][371ed3][]
7[99cb92][5784ec][][]
8[92c257][54d3ef][edcb06][beb287]
9[6cafbb][8aa567][][]
28

Running with 29 inserts (falls into kick rebalance section) and crashes because 13 is calculated index, but 10 is filter size:

92c257 1 8 8
92c257 1 8 8
edcb06 1 8 8
edcb06 1 8 13
panic: runtime error: index out of range
...cfilter.go:66 +0x786

Setting the fingerprint size to 8 causes the same failure, though it happens at insert number 34; this time the buckets j, k were equal.

0[fc84f2e7ca2d5e96][48f321721bda24b2][36bd3783fab8ee80][8b4daa6ff23af8ee]
1[09f92dfded564d0b][65de720dfb99e4c9][87fa10567aab693f][1ba6b55473eb9b1f]
2[069309e6acb574ee][b8a269498459a140][69c55042abe22ab6][e4ec1f7924272d3c]
3[6cafbb55b62421f6][83995a56a985e5c7][][]
4[**9cd78960d30fbb5a**][be9358f5060bdf40][eb6ed07472612f90][6015587b8b8596fa]
5[0a2d31eb08846f48][32ed51cfd8d0f3ba][c955b301a3e8fe27][e584dc5296c215a7]
6[77a10b937c13ec62][6bf9c92e470384c6][371ed307bfc0bf10][8279d0bc9502ca72]
7[5784ec3b63a36bbf][][][]
8[92c25770e1271a9c][54d3ef79af523eae][edcb06ae5ff37f20][99cb926b6f4bce38]
9[8aa56766d6b5765f][beb287779de631b8][][]
33

With 34 inserts, failure occurs due to an index calculation of 12.

9cd78960d30fbb5a 4 4 4
9cd78960d30fbb5a 4 4 12

The ultimate result of this error is that the program crashes whenever the kicks need to be invoked, preventing a graceful rebalance; after 500 kicks it should instead report that the filter is full. It is unclear to me how to rectify this even after reading the paper. The logic seems to match Algorithm 1, so the issue appears to be related to hashfp() and the % mod used to get the filter's bucket row number, as this calculates an index greater than cf.size. This bug breaks the ability to maximize the load of the filter. I first noticed it in a POC with a cfilter sized to 1MM, and stepped it back to 10 to be able to visualize what was happening.
