alexandrnikitin / bloom-filter-scala
Bloom filter for Scala, the fastest for the JVM
Home Page: https://alexandrnikitin.github.io/blog/bloom-filter-for-scala/
License: MIT License
I'm trying to use a BloomFilter inside a class that uses a TypeTag:

class MyClass[T: TypeTag](numberOfItems: Long, falsePositiveRate: Double) {
  val bloomFilter = BloomFilter[T](numberOfItems, falsePositiveRate)
}

However, at compile time I get: could not find implicit value for parameter canGenerateHash: bloomfilter.CanGenerateHashFrom[T]. Is there a way around this?
Thank you,
Chris
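The compiler needs a CanGenerateHashFrom[T] instance in scope at the point where BloomFilter[T](...) is called; a TypeTag context bound alone cannot supply it. The usual fix is to add the typeclass as a context bound on the wrapping class, which forwards the implicit requirement to the call site. A minimal self-contained sketch of the pattern, using a stand-in typeclass named CanHash since the real one lives in the library:

```scala
// Stand-in for bloomfilter.CanGenerateHashFrom[T], so this sketch compiles on its own
trait CanHash[T] { def hash(t: T): Long }

object CanHash {
  implicit val stringCanHash: CanHash[String] = (s: String) => s.hashCode.toLong
}

// Adding the typeclass as a context bound forwards the implicit requirement
// to the call site, just as `BloomFilter[T](...)` requires its hash instance.
class MyClass[T: CanHash](numberOfItems: Long, falsePositiveRate: Double) {
  private val hasher = implicitly[CanHash[T]]
  def hashOf(t: T): Long = hasher.hash(t)
}
```

With this shape, `new MyClass[String](1000000L, 0.01)` compiles because the String instance is found in implicit scope at the call site.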
Nice to have: support for Scala 2.12 and Dotty.
Hi,

Currently the BloomFilter and its buffer are not serializable; can this be added? The same goes for Kryo support, which could be added as an optional dependency (I've already implemented this as part of my Spark application).

A related issue: the provided to/from stream methods work one long at a time. I haven't benchmarked this, but my intuition is that it is slow. I suggest modifying the current methods, or adding new ones, that take/allocate a buffer and leverage Unsafe's ability to copy memory ranges (Spark Tungsten style).

Would you be willing to accept pull requests?

Thanks,
Eyal.
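The buffering idea can be sketched without touching Unsafe at all: stage the filter's words in a ByteBuffer and hand the stream one large write instead of one call per long. The helper name and the Array[Long] input are assumptions for illustration, not the library's API:

```scala
import java.io.OutputStream
import java.nio.ByteBuffer

// Hypothetical bulk writer: stage all 64-bit words in one ByteBuffer,
// then emit them with a single OutputStream.write call.
def writeWords(words: Array[Long], out: OutputStream): Unit = {
  val buf = ByteBuffer.allocate(words.length * java.lang.Long.BYTES)
  words.foreach(w => buf.putLong(w))
  out.write(buf.array())
}
```

An Unsafe.copyMemory variant would additionally avoid the per-word loop, at the cost of portability.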
Add support for Java and Kryo serialization so that the Bloom Filter can be used in distributed applications such as Apache Spark.
How can I get the filter's element count? Google Guava has an approximateElementCount() method for this.
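Guava derives that estimate from the fraction of set bits: n ≈ -(m/k) · ln(1 - x/m), where m is the total bit count, k the number of hash functions, and x the number of bits set to 1. A sketch of the same computation (the parameter plumbing is assumed; this is not the library's API):

```scala
// Estimate the number of inserted elements from bit occupancy, using the
// formula behind Guava's approximateElementCount():
//   n ~= -(m / k) * ln(1 - x / m)
def approximateElementCount(numBits: Long, numHashes: Int, setBits: Long): Long = {
  val fractionSet = setBits.toDouble / numBits
  // log1p(-f) computes ln(1 - f) with better precision for small f
  math.round(-(numBits.toDouble / numHashes) * math.log1p(-fractionSet))
}
```

Exposing this would require the filter to track (or count) its set bits, which ties into the bitCount bookkeeping mentioned in the combine() issue.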
The opposite of a Bloom filter is a data structure that may report a false negative, but can never report a false positive.
It is handy for real-time deduplication where unique values shouldn't be filtered out and possible unfiltered duplicates should be limited to some defined rate.
When I use the example on the main page, I add an element to the BloomFilter; however, when I use the mightContain method to check whether the element is in the set, it returns false. Why?
val expectedElements = 1000000
val falsePositiveRate = 0.1
val bf = BloomFilter[String](expectedElements, falsePositiveRate)
// Put an element
bf.add("lee")
// Check whether an element in a set
bf.mightContain("lee") // returns false
// Dispose the instance
bf.dispose()
Should calculate bitCount in combine()
I'm using the library with Apache Spark; specifically, I'm optimizing a sparse join by creating a Bloom filter per partition. This required wrapping the Bloom filter instances with a class that implements finalize. If I hadn't done so, I'd get a massive memory leak. The wrapper solved part of my problems, but not all: Spark is unaware of the memory usage of the Bloom filters, which may lead to situations where it 'misses' a memory spill (Spark's way of freeing up memory).
My thought is that implementing UnsafeBitArray in terms of a plain Java long array (long[]) should not harm the capabilities or performance of this class, while still allowing proper GC of these objects.
A further possibility is using multiple arrays to avoid huge allocations; the JVM heap (like many other memory allocators) 'suffers' from both tiny and huge allocations, each having its own drawbacks.
@alexandrnikitin, what do you think? Is it worth a benchmark (perhaps in a separate branch, to avoid the complexity of supporting both code paths side by side)?
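The heap-backed idea can be sketched in a few lines. The class name HeapBitArray is hypothetical; the point is that an Array[Long] is ordinary GC-managed memory, so no finalizer wrapper is needed and Spark's accounting sees it as regular heap:

```scala
// Sketch of a heap-backed alternative to UnsafeBitArray (hypothetical name).
// Each Long word holds 64 bits; index >>> 6 selects the word, index & 63 the bit.
final class HeapBitArray(numberOfBits: Long) {
  private val words = new Array[Long](((numberOfBits + 63) >>> 6).toInt)
  def set(index: Long): Unit =
    words((index >>> 6).toInt) |= 1L << (index & 63)
  def get(index: Long): Boolean =
    (words((index >>> 6).toInt) & (1L << (index & 63))) != 0
}
```

The Int cast caps the size at 2^37 bits per array, which is where the multiple-arrays idea would come in for very large filters.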
Instantiating a simple Bloom filter with the following commands

val expectedElements = 10000
val falsePositiveRate: Double = 0.1
val bf = BloomFilter[String](expectedElements, falsePositiveRate)

at the last line (val bf = ...) I get the following error:
[error] (run-main-0) java.lang.NoClassDefFoundError: scala/Product$class
[error] java.lang.NoClassDefFoundError: scala/Product$class
[error] at bloomfilter.CanGenerateHashFrom$CanGenerateHashFromString$.<init>(CanGenerateHashFrom.scala:20)
[error] at bloomfilter.CanGenerateHashFrom$CanGenerateHashFromString$.<clinit>(CanGenerateHashFrom.scala)
With Scala 2.12, scala.concurrent.util.Unsafe has been removed, which impacts this project. Are you planning to update accordingly? I'm curious about your thoughts on how to go about it. Thanks
Could you please publish a binary for Scala 2.13 to Maven?
UnsafeBitArray always allocates a multiple of 64 bits. For example, if numberOfBits is 250 then 256 bits are allocated, but 6 bits always remain unused. Wouldn't it make sense for Bloomfilter.optimalNumberOfBits(..) to return the next higher multiple of 64? The false positive rate would become slightly better while all allocated memory is potentially used.
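A sketch of the proposed rounding (not the library's code): bump the requested bit count to the next 64-bit word boundary so every allocated bit is usable.

```scala
// Round up to the next multiple of 64: divide by 64 rounding up, multiply back.
def roundUpToWordBoundary(numberOfBits: Long): Long =
  ((numberOfBits + 63) >>> 6) << 6
```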
I tried to use your implementation in Spark and got that
A fatal error has been detected by the Java Runtime Environment:
SIGSEGV (0xb) at pc=0x000000010d75d414, pid=34100, tid=48387
JRE version: Java(TM) SE Runtime Environment (8.0_51-b16) (build 1.8.0_51-b16)
Java VM: Java HotSpot(TM) 64-Bit Server VM (25.51-b03 mixed mode bsd-amd64 compressed oops)
Problematic frame:
V [libjvm.dylib+0x55d414]
Failed to write core dump. Core dumps have been disabled. To enable core dumping, try "ulimit -c unlimited" before starting Java again
An error report file with more information is saved as:
hs_err_pid34100.log
Compiled method (nm) 10145 2030 n 0 sun.misc.Unsafe::getObject (native)
total in heap [0x000000010e7b98d0,0x000000010e7b9c28] = 856
relocation [0x000000010e7b99f8,0x000000010e7b9a38] = 64
main code [0x000000010e7b9a40,0x000000010e7b9c28] = 488
If you would like to submit a bug report, please visit:
http://bugreport.java.com/bugreport/crash.jsp
I'm trying to use the filter for one of my projects which is a web application built using the Play framework. As I tried to insert elements into the filter, I got this error:
#
# A fatal error has been detected by the Java Runtime Environment:
#
# SIGSEGV (0xb) at pc=0x00007fd90fd1f9ee, pid=330518, tid=330806
#
# JRE version: OpenJDK Runtime Environment (11.0.10+9) (build 11.0.10+9-Ubuntu-0ubuntu1.20.04)
# Java VM: OpenJDK 64-Bit Server VM (11.0.10+9-Ubuntu-0ubuntu1.20.04, mixed mode, sharing, tiered, compressed oops, g1 gc, linux-amd64)
# Problematic frame:
# V [libjvm.so+0xdd59ee]
#
# Core dump will be written. Default location: Core dumps may be processed with "/usr/share/apport/apport %p %s %c %d %P %E" (or dumping to /home/joesan/Projects/Private/scala-projects/yoke/core.330518)
#
# An error report file with more information is saved as:
# /home/joesan/Projects/Private/scala-projects/yoke/hs_err_pid330518.log
#
# If you would like to submit a bug report, please visit:
# https://bugs.launchpad.net/ubuntu/+source/openjdk-lts
#
Aborted (core dumped)
[hs_err_pid330518.log](https://github.com/alexandrnikitin/bloom-filter-scala/files/6262164/hs_err_pid330518.log)
Please see the attached file for the detailed error. I'm using the following JDK:
joesan@joesan-InfinityBook-S-14-v5:~/Projects/Private/scala-projects/yoke$ java -version
openjdk version "11.0.10" 2021-01-19
OpenJDK Runtime Environment (build 11.0.10+9-Ubuntu-0ubuntu1.20.04)
OpenJDK 64-Bit Server VM (build 11.0.10+9-Ubuntu-0ubuntu1.20.04, mixed mode, sharing)
I'm using JDK 11 (azul 11.0.7) with Scala 2.12.11 and get the exception below when adding an element to the Bloom Filter. The issue appears to be that the value field of class String was changed in JDK 9 from a char[] to a byte[], which causes a ClassCastException on this line:
val value = unsafe.getObject(from, valueOffset).asInstanceOf[Array[Char]]
Exception in thread "main" java.lang.ClassCastException: class [B cannot be cast to class [C ([B and [C are in module java.base of loader 'bootstrap')
at bloomfilter.CanGenerateHashFrom$CanGenerateHashFromString$.generateHash(CanGenerateHashFrom.scala:27)
at bloomfilter.CanGenerateHashFrom$CanGenerateHashFromString$.generateHash(CanGenerateHashFrom.scala:20)
at bloomfilter.mutable.BloomFilter.add(BloomFilter.scala:18)
at Main$.delayedEndpoint$Main$1(Main.scala:12)
at Main$delayedInit$body.apply(Main.scala:3)
at scala.Function0.apply$mcV$sp(Function0.scala:39)
at scala.Function0.apply$mcV$sp$(Function0.scala:39)
at scala.runtime.AbstractFunction0.apply$mcV$sp(AbstractFunction0.scala:17)
at scala.App.$anonfun$main$1$adapted(App.scala:80)
at scala.collection.immutable.List.foreach(List.scala:392)
at scala.App.main(App.scala:80)
at scala.App.main$(App.scala:78)
at Main$.main(Main.scala:3)
at Main.main(Main.scala)
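A portable workaround sketch (an assumption, not the library's actual fix) is to obtain the characters or bytes through the public String API instead of casting the private value field. This copies, so it is slower than the Unsafe field read, but it is correct on every JDK:

```scala
import java.nio.charset.StandardCharsets

// Materialize chars/bytes via the public API rather than reading String's
// private `value` field, which changed from char[] to byte[] in JDK 9.
def charsForHashing(s: String): Array[Char] = s.toCharArray
def utf8BytesForHashing(s: String): Array[Byte] =
  s.getBytes(StandardCharsets.UTF_8)
```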
That is, will calling .add and .mightContain concurrently from many different threads work?
Add support for computing Bloom filters in a distributed fashion by generating and OR'ing multiple Bloom filters into one.
Breeze is a good reference point for this issue: https://github.com/scalanlp/breeze/blob/master/math/src/main/scala/breeze/util/BloomFilter.scala#L84
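The core of the merge is a word-wise OR over the backing bit arrays, which is only valid when both filters were created with identical numberOfBits and numberOfHashes. A minimal sketch over plain Array[Long] words (direct access to the filter's words is assumed for illustration):

```scala
// OR-combine two filters' backing words; both filters must be sized identically,
// otherwise the same element would map to different bit positions.
def union(a: Array[Long], b: Array[Long]): Array[Long] = {
  require(a.length == b.length, "filters must be sized identically")
  Array.tabulate(a.length)(i => a(i) | b(i))
}
```

This is what makes the per-partition pattern work: build one filter per partition, then reduce with union on the driver.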
Awesome work on bloom-filter-scala. I am using bloom-filter-scala on String, but the false positive rate seems too high. My demo example is below:
package sample
import bloomfilter.mutable.BloomFilter
object Sample extends App {
// scalastyle:off
println("""val bf = BloomFilter[String](10000, 0.1)""")
val bf = BloomFilter[String](10000, 0.1)
println("""bf.add("a") """)
bf.add("a")
println(s"""bf.mightContain("s"): ${bf.mightContain("s")}""")
println(s"""bf.mightContain("1"): ${bf.mightContain("1")}""")
println(s"""bf.mightContain(""): ${bf.mightContain("")}""")
// scalastyle:on
}
I expected all of these to return false, but at least 2 of them always return true. May I know if I am using bloom-filter-scala correctly?
Thank you very much.
The original paper: "Cuckoo Filter: Practically Better Than Bloom" by Bin Fan, David G. Andersen, Michael Kaminsky, Michael D. Mitzenmacher
Implementations:
CPP from one of the authors: https://github.com/efficient/cuckoofilter
Java: https://github.com/bdupras/guava-probably
https://github.com/lrodero/java_cuckoo_filter
Go: https://github.com/seiflotfy/cuckoofilter
Guava already has this as part of its interface
CanGenerateHashFromStringByteArray, which is used for JDK 9+, assumes that the string is stored using the UTF-8 character encoding and that the length of the underlying byte[] is the same as the length of the string. This assumption only holds if the string contains only characters from the ISO-8859-1/Latin-1 character set. If the string contains other characters, it is stored in the underlying byte array as UTF-16 characters and the length of the byte array is 2x the number of characters in the string. Additionally, it is possible to disable this storage optimization with the -XX:-CompactStrings JVM flag, in which case all strings are stored as UTF-16 characters. See here and here for more information.