
cfilter's People

Contributors

irfansharif · jgeiger · pirosb3 · ridwanmsharif · schumacherfm

cfilter's Issues

Italian spelling mistake

Hi!

Thanks for this amazing package, looking forward to reading the source!
I'm not sure if this was done on purpose, but "good morning" in Italian is buongiorno. Let me know if you want me to make a PR with the typo fix!

Thanks again
Daniel

Cuckoo filter fails after 500,000 records

I am inserting strings of the format:
str:="http://ajaybodhe.com/abc/zrjer/frenfrefnre/efnjerfrje/?%v"
Where "%v" is replaced with 1 to 500,000
And then when I try to search "http://ajaybodhe.com/abc/zrjer/frenfrefnre/efnjerfrje/?1", the lookup fails.

It worked when I tried it with 200,000 records, but failed again at 300,000 records.
Also, how can I get the total memory footprint of the filter?

Here is the code:

	cf := cfilter.New()
	t := time.Now()
	for i := 0; i < 500000; i++ {
		st := fmt.Sprintf(str, i+1)
		cf.Insert([]byte(st))
	}
	// fmt.Println("size(bytes) of cuckoo2 filter for 500000: ", ck.GetSize())
	// fmt.Println("size(MB)    of cuckoo2 filter for 500000", float64(ck.GetSize())/1024.00/1024.00)
	fmt.Println()
	fmt.Println("cuckoo2: time taken to insert 500000 urls: ", time.Since(t))
	fmt.Println("cuckoo2: avg time taken to insert 1 url:   ", time.Since(t)/500000)

	t = time.Now()
	for i := 0; i < 500000; i++ {
		st := fmt.Sprintf(str, i+1)
		if !cf.Lookup([]byte(st)) {
			fmt.Println(st + " is not present")
			panic(1)
		}
	}
	for i := 500000; i < 500000*2; i++ {
		st := fmt.Sprintf(str, i+1)
		if cf.Lookup([]byte(st)) {
			fmt.Println(st + " is present")
			panic(1)
		}
	}
	fmt.Println()
	fmt.Println("cuckoo2: time taken to search 1000000 urls: ", time.Since(t))
	fmt.Println("cuckoo2: avg time taken to search 1 url:    ", time.Since(t)/1000000)

bug in bucket.go, rand bucket selection

bucket.go, line 41
should be corrected to:

i := rand.Intn(len(b)) // - 1)

With a bucket size of 4, rand.Intn(4) will yield 0, 1, 2, 3; as written, rand.Intn(len(b) - 1) will only yield 0, 1, 2, so the last slot can never be selected. The Go documentation says Intn returns a value in the half-open interval [0, n), so it never returns n itself. This can easily be demonstrated with:

for {
	fmt.Println("1", rand.Intn(1)) // always zero, never one
	fmt.Println("2", rand.Intn(2)) // always zero or one
}

https://play.golang.org/p/heTsvQLLbc

low load fill and missing items

The paper suggests it is possible to obtain a 95% fill. However, I think the current hash methodology is the problem: it does not distribute fingerprints evenly across buckets.

How to confirm/reproduce the issue.

Generate 100MM random data records and write them to a file using the following method.

func generate() func() []byte {
	size := 8
	return func() []byte {
		bytes := make([]byte, size)
		for i := range bytes {
			// note: rand.Intn(255) yields 0..254; use 256 for the full byte range
			bytes[i] = byte(rand.Intn(255))
		}
		return bytes
	}
}

func main() {
	next := generate()
	mf, _ := os.Create("./dgen")
	defer mf.Close()
	out := bufio.NewWriter(mf)
	for i := 0; i < 100000000; i++ {
		out.WriteString(hex.EncodeToString(next()) + "\n")
	}
	out.Flush()
}

The resulting file should be 1.7GB; then confirm the lines are all unique with
$ uniq -u dgen dgen2
(strictly speaking, uniq only collapses adjacent duplicate lines, so sort dgen | uniq -d would be the thorough check)

-rw-r--r-- 1 mac staff 1700000000 Jan 5 14:55 dgen
-rw-r--r-- 1 mac staff 1700000000 Jan 5 15:31 dgen2

Use this 100MM randomized data as the source file, then run the following code.

func main() {
	cf := cfilter.New(cfilter.FingerprintSize(8), cfilter.BucketSize(4), cfilter.Size(35000000))
	f, _ := os.Open("dgen")
	defer f.Close()
	scanner := bufio.NewScanner(f)
	i := 0
	t0 := time.Now()
	for scanner.Scan() {
		i++
		st := scanner.Text()
		if chk := cf.Lookup([]byte(st)); !chk {
			if chk := cf.Insert([]byte(st)); !chk {
				fmt.Println(i, st, "stuffed")
				break
			}
		} else {
			fmt.Println(i, st, "exists")
		}
	}
	fmt.Println(time.Since(t0), cf.Count(), float64(cf.Count())/time.Since(t0).Seconds())
	f2, _ := os.Open("dgen")
	defer f2.Close()
	scanner2 := bufio.NewScanner(f2)
	i2 := 0
	missing := 0
	t1 := time.Now()
	for scanner2.Scan() {
		i2++
		st := scanner2.Text()
		if !cf.Lookup([]byte(st)) {
			// fmt.Println(i2, st, "missing")
			missing++
		}
	}
	fmt.Println(time.Since(t1), i2, float64(i2)/time.Since(t1).Seconds(), missing)
}

Based on the paper, it should be able to handle a 95% load. 100MM into 120MM slots is only 83.33% full, so it should be fine; however, running this fails at insert 50,895,993:

50895993 7cae124e8473d992 stuffed

Now change the parameters to the following to give it tons of room.

cf := cfilter.New(cfilter.FingerprintSize(8), cfilter.BucketSize(8), cfilter.Size(20000000))

This time it will complete the inserts without getting overstuffed (only 62.5% full), but there is a new problem: 5162 items are missing when we look them all up afterwards.

$ go run cf.go
3m14.974607833s 100000000 512887.29832959676
1m44.974537314s 100000000 952611.959256893 5162

In fingerprint.go, commenting out lines 20-22, which were meant to account for hashes that somehow hashed to zero,

	/*if fp == nil {
		fp[0] += 7
	}*/

and running again shows that we are still missing 5162:

3m16.405886172s 100000000 509149.7095800179
1m39.876159299s 100000000 1.0012399394257959e+06 5162

So I think there are two issues.

  1. The hashfp() doesn't distribute fingerprints across the buckets randomly enough. The paper says: "Empirically, a filter containing up to 4 billion items can effectively fill its hash table with loads of 95% when each bucket holds four fingerprints that are 6 bits or larger." The hashfp() is the only thing I can think of adjusting that could have an impact and allow a 95% load; I'm only testing with 100 million samples and cannot achieve > 50% load with the current design.

  2. What happened to the 5162 items that were inserted but can't be found with a subsequent lookup?

Any ideas?

optimisations, memory and cpu

Was pointed towards an issue raised here; two possible optimisations could be made:

  • use an array instead of a slice for bucket and fingerprint: sizeof(slice) is 24 bytes, and every slice adds a pointer indirection; since the bucket and fingerprint sizes don't vary, an array would fare better
  • additionally, the original paper proposes 12-bit fingerprints; 8-bit variants have lower memory efficiency for the same false-positive probability

Insert fails when entering rebalance kick loop

In cfilter.go, the Insert method fails during rebalance with an index calculated to be greater than the size of the filter; a valid index must be strictly less than the filter size.

Script to reproduce the error condition.

func main() {
	cf := cfilter.New(cfilter.FingerprintSize(3), cfilter.Size(10))
	for i := 0; i < 29; i++ { // 28 will work; 29 ... 40 will fail
		cf.Insert([]byte("hello, fir" + strconv.Itoa(i+3) + "+st timer" + strconv.Itoa(i) + "aa"))
	}
	cf.Dump() // visualize filter
	fmt.Println(cf.Count())
}

// visualize contents
func (cf *CFilter) Dump() {
	for a, b := range cf.buckets {
		fmt.Print(a)
		for _, c := range b {
			fmt.Print("[", hex.EncodeToString(c), "]")
		}
		fmt.Println()
	}
}

When all the buckets are full, execution drops down to the rebalance section and does at most 500 kicks; however, it crashes early with the following error:

panic: runtime error: index out of range
.../cfilter.go:64 +0x786

Introducing the following commented code shows i before and after the % mod. Notice that i is calculated as 13 while the filter size is 10, so it panics with an index out of range error and never performs the kick and rebalance.

// Insert adds an element (in byte-array form) to the Cuckoo filter,
// returns true if successful and false otherwise.
func (cf *CFilter) Insert(item []byte) bool {
	f := fprint(item, cf.fpSize, cf.hashfn)
	j := hashfp(item) % cf.size
	k := (j ^ hashfp(f)) % cf.size

	if cf.buckets[j].insert(f) || cf.buckets[k].insert(f) {
		cf.count++
		return true
	}

	i := [2]uint{j, k}[rand.Intn(2)]
	for n := uint(0); n < cf.kicks; n++ {
		f = cf.buckets[i].swap(f)
		fmt.Println(hex.EncodeToString(f), j, k, i) // show the values
		i ^= hashfp(f) % cf.size
		fmt.Println(hex.EncodeToString(f), j, k, i) // show the values

		if cf.buckets[i].insert(f) {
			cf.count++
			return true
		}
	}

	return false
}

Sample output.
Running with 28 inserts works because no kicks happen.

0[fc84f2][48f321][36bd37][]
1[09f92d][65de72][87fa10][32ed51]
2[069309][b8a269][69c550][e4ec1f]
3[0a2d31][83995a][][]
4[9cd789][be9358][eb6ed0][]
5[c955b3][][][]
6[77a10b][6bf9c9][371ed3][]
7[99cb92][5784ec][][]
8[92c257][54d3ef][edcb06][beb287]
9[6cafbb][8aa567][][]
28

Running with 29 inserts (falls into kick rebalance section) and crashes because 13 is calculated index, but 10 is filter size:

92c257 1 8 8
92c257 1 8 8
edcb06 1 8 8
edcb06 1 8 13
panic: runtime error: index out of range
...cfilter.go:66 +0x786

Setting the fingerprint size to 8 causes the same failure, though it happens at insert number 34; this time the buckets j, k were equal.

0[fc84f2e7ca2d5e96][48f321721bda24b2][36bd3783fab8ee80][8b4daa6ff23af8ee]
1[09f92dfded564d0b][65de720dfb99e4c9][87fa10567aab693f][1ba6b55473eb9b1f]
2[069309e6acb574ee][b8a269498459a140][69c55042abe22ab6][e4ec1f7924272d3c]
3[6cafbb55b62421f6][83995a56a985e5c7][][]
4[**9cd78960d30fbb5a**][be9358f5060bdf40][eb6ed07472612f90][6015587b8b8596fa]
5[0a2d31eb08846f48][32ed51cfd8d0f3ba][c955b301a3e8fe27][e584dc5296c215a7]
6[77a10b937c13ec62][6bf9c92e470384c6][371ed307bfc0bf10][8279d0bc9502ca72]
7[5784ec3b63a36bbf][][][]
8[92c25770e1271a9c][54d3ef79af523eae][edcb06ae5ff37f20][99cb926b6f4bce38]
9[8aa56766d6b5765f][beb287779de631b8][][]
33

With 34 inserts, failure occurs due to an index calculation of 12.

9cd78960d30fbb5a 4 4 4
9cd78960d30fbb5a 4 4 12

The ultimate result of this error is that the program crashes whenever the kicks need to be invoked, preventing a graceful rebalance; after 500 kicks it should instead report that the filter is full. It is unclear to me how to rectify this even after reading the paper. The logic seems to match Algorithm 1, so the issue appears to be related to hashfp() and the % mod used to get the filter's bucket row number, as this calculates an index greater than cf.size. This bug breaks the ability to maximize the load of the filter. I first noticed it in a POC with a cfilter sized to 1MM, and stepped it back to 10 to be able to visualize what was happening.
