alexandrnikitin / bloom-filter-scala
Bloom filter for Scala, the fastest for the JVM
Home Page: https://alexandrnikitin.github.io/blog/bloom-filter-for-scala/
License: MIT License
I'm trying to use a BloomFilter inside a class that uses a TypeTag:

class MyClass[T: TypeTag](numberOfItems: Long, falsePositiveRate: Double) {
  val bloomFilter = BloomFilter[T](numberOfItems, falsePositiveRate)
}

However, at compile time I get: could not find implicit value for parameter canGenerateHash: bloomfilter.CanGenerateHashFrom[T]. Is there a way around this?
Thank you,
Chris
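The compiler needs a CanGenerateHashFrom[T] instance in scope at the point where BloomFilter[T](...) is called; a TypeTag context bound alone cannot supply it. The usual fix is to add the typeclass as a context bound on the wrapping class, which forwards the implicit requirement to the call site. A minimal self-contained sketch of the pattern, using a stand-in typeclass named CanHash since the real one lives in the library:

```scala
// Stand-in for bloomfilter.CanGenerateHashFrom[T], so this sketch compiles on its own
trait CanHash[T] { def hash(t: T): Long }

object CanHash {
  implicit val stringCanHash: CanHash[String] = (s: String) => s.hashCode.toLong
}

// Adding the typeclass as a context bound forwards the implicit requirement
// to the call site, just as `BloomFilter[T](...)` requires its hash instance.
class MyClass[T: CanHash](numberOfItems: Long, falsePositiveRate: Double) {
  private val hasher = implicitly[CanHash[T]]
  def hashOf(t: T): Long = hasher.hash(t)
}
```

With this shape, `new MyClass[String](1000000L, 0.01)` compiles because the String instance is found in implicit scope at the call site.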
Nice to have: support for Scala 2.12 and Dotty.
Hi,

Currently the BloomFilter and its buffer are not serializable; can this be added? The same goes for Kryo support, which could be added as an optional dependency (I've already implemented this as part of my Spark application).

A related issue: the provided to/from stream methods work one long at a time. I haven't benchmarked this, but my intuition is that it is slow. I suggest modifying the current methods, or adding new ones, that take/allocate a buffer and leverage Unsafe's ability to copy memory ranges (Spark Tungsten style).

Would you be willing to accept pull requests?

Thanks,
Eyal.
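The buffering idea can be sketched without touching Unsafe at all: stage the filter's words in a ByteBuffer and hand the stream one large write instead of one call per long. The helper name and the Array[Long] input are assumptions for illustration, not the library's API:

```scala
import java.io.OutputStream
import java.nio.ByteBuffer

// Hypothetical bulk writer: stage all 64-bit words in one ByteBuffer,
// then emit them with a single OutputStream.write call.
def writeWords(words: Array[Long], out: OutputStream): Unit = {
  val buf = ByteBuffer.allocate(words.length * java.lang.Long.BYTES)
  words.foreach(w => buf.putLong(w))
  out.write(buf.array())
}
```

An Unsafe.copyMemory variant would additionally avoid the per-word loop, at the cost of portability.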
Add support for Java and Kryo serialization so that the Bloom Filter can be used in distributed applications such as Apache Spark.
How can I get the filter's element count? Google Guava has an approximateElementCount() method for this.
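Guava derives that estimate from the fraction of set bits: n ≈ -(m/k) · ln(1 - x/m), where m is the total bit count, k the number of hash functions, and x the number of bits set to 1. A sketch of the same computation (the parameter plumbing is assumed; this is not the library's API):

```scala
// Estimate the number of inserted elements from bit occupancy, using the
// formula behind Guava's approximateElementCount():
//   n ~= -(m / k) * ln(1 - x / m)
def approximateElementCount(numBits: Long, numHashes: Int, setBits: Long): Long = {
  val fractionSet = setBits.toDouble / numBits
  // log1p(-f) computes ln(1 - f) with better precision for small f
  math.round(-(numBits.toDouble / numHashes) * math.log1p(-fractionSet))
}
```

Exposing this would require the filter to track (or count) its set bits, which ties into the bitCount bookkeeping mentioned in the combine() issue.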
The opposite of a Bloom filter is a data structure that may report a false negative, but can never report a false positive.
It is handy for real-time deduplication where unique values shouldn't be filtered out and possible unfiltered duplicates should be limited to some defined rate.
When I use the example on the main page, I add an element to the BloomFilter; however, when I use the mightContain method to check whether the element is in the set, it returns false. Why?
val expectedElements = 1000000
val falsePositiveRate = 0.1
val bf = BloomFilter[String](expectedElements, falsePositiveRate)
// Put an element
bf.add("lee")
// Check whether an element in a set
bf.mightContain("lee") // returns false
// Dispose the instance
bf.dispose()
Should calculate bitCount in combine()
I'm using the library with Apache Spark; specifically, I'm optimizing a sparse join by creating a Bloom filter per partition. This required wrapping the Bloom filter instances with a class that implements finalize. If I hadn't done so, I'd get a massive memory leak. The wrapper solved part of my problems, but not all: Spark is unaware of the memory usage of the Bloom filters, which may lead to situations where it 'misses' a memory spill (Spark's way of freeing up memory).
My thought is that implementing UnsafeBitArray in terms of a plain Java long array (long[]) should not harm the capabilities or performance of this class, while still allowing proper GC of these objects.
A further possibility is using multiple arrays to avoid huge allocations; the JVM heap (like many other memory allocators) 'suffers' from both tiny and huge allocations, each having its own drawbacks.
@alexandrnikitin, what do you think? Is it worth a benchmark (perhaps in a separate branch, to avoid the complexity of supporting both code paths side by side)?
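The heap-backed idea can be sketched in a few lines. The class name HeapBitArray is hypothetical; the point is that an Array[Long] is ordinary GC-managed memory, so no finalizer wrapper is needed and Spark's accounting sees it as regular heap:

```scala
// Sketch of a heap-backed alternative to UnsafeBitArray (hypothetical name).
// Each Long word holds 64 bits; index >>> 6 selects the word, index & 63 the bit.
final class HeapBitArray(numberOfBits: Long) {
  private val words = new Array[Long](((numberOfBits + 63) >>> 6).toInt)
  def set(index: Long): Unit =
    words((index >>> 6).toInt) |= 1L << (index & 63)
  def get(index: Long): Boolean =
    (words((index >>> 6).toInt) & (1L << (index & 63))) != 0
}
```

The Int cast caps the size at 2^37 bits per array, which is where the multiple-arrays idea would come in for very large filters.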
Instantiating a simple Bloom filter with the following commands

val expectedElements = 10000
val falsePositiveRate: Double = 0.1
val bf = BloomFilter[String](expectedElements, falsePositiveRate)

at the last line (val bf = ...) I get the following error:
[error] (run-main-0) java.lang.NoClassDefFoundError: scala/Product$class
[error] java.lang.NoClassDefFoundError: scala/Product$class
[error] at bloomfilter.CanGenerateHashFrom$CanGenerateHashFromString$.<init>(CanGenerateHashFrom.scala:20)
[error] at bloomfilter.CanGenerateHashFrom$CanGenerateHashFromString$.<clinit>(CanGenerateHashFrom.scala)
With Scala 2.12, scala.concurrent.util.Unsafe has been removed, which impacts this project. Are you planning to update accordingly? I'm curious about your thoughts on how to go about it. Thanks
Could you please publish a binary for Scala 2.13 to Maven?
UnsafeBitArray always allocates a multiple of 64 bits. For example, if numberOfBits is 250 then 256 bits are allocated, but 6 bits always remain unused. Wouldn't it make sense for Bloomfilter.optimalNumberOfBits(..) to return the next higher multiple of 64? The false positive rate would become slightly better while all allocated memory is potentially used.
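A sketch of the proposed rounding (not the library's code): bump the requested bit count to the next 64-bit word boundary so every allocated bit is usable.

```scala
// Round up to the next multiple of 64: divide by 64 rounding up, multiply back.
def roundUpToWordBoundary(numberOfBits: Long): Long =
  ((numberOfBits + 63) >>> 6) << 6
```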
I tried to use your implementation in Spark and got that
A fatal error has been detected by the Java Runtime Environment:
SIGSEGV (0xb) at pc=0x000000010d75d414, pid=34100, tid=48387
JRE version: Java(TM) SE Runtime Environment (8.0_51-b16) (build 1.8.0_51-b16)
Java VM: Java HotSpot(TM) 64-Bit Server VM (25.51-b03 mixed mode bsd-amd64 compressed oops)
Problematic frame:
V [libjvm.dylib+0x55d414]
Failed to write core dump. Core dumps have been disabled. To enable core dumping, try "ulimit -c unlimited" before starting Java again
An error report file with more information is saved as:
hs_err_pid34100.log
Compiled method (nm) 10145 2030 n 0 sun.misc.Unsafe::getObject (native)
total in heap [0x000000010e7b98d0,0x000000010e7b9c28] = 856
relocation [0x000000010e7b99f8,0x000000010e7b9a38] = 64
main code [0x000000010e7b9a40,0x000000010e7b9c28] = 488
If you would like to submit a bug report, please visit:
http://bugreport.java.com/bugreport/crash.jsp
I'm trying to use the filter for one of my projects which is a web application built using the Play framework. As I tried to insert elements into the filter, I got this error:
#
# A fatal error has been detected by the Java Runtime Environment:
#
# SIGSEGV (0xb) at pc=0x00007fd90fd1f9ee, pid=330518, tid=330806
#
# JRE version: OpenJDK Runtime Environment (11.0.10+9) (build 11.0.10+9-Ubuntu-0ubuntu1.20.04)
# Java VM: OpenJDK 64-Bit Server VM (11.0.10+9-Ubuntu-0ubuntu1.20.04, mixed mode, sharing, tiered, compressed oops, g1 gc, linux-amd64)
# Problematic frame:
# V [libjvm.so+0xdd59ee]
#
# Core dump will be written. Default location: Core dumps may be processed with "/usr/share/apport/apport %p %s %c %d %P %E" (or dumping to /home/joesan/Projects/Private/scala-projects/yoke/core.330518)
#
# An error report file with more information is saved as:
# /home/joesan/Projects/Private/scala-projects/yoke/hs_err_pid330518.log
#
# If you would like to submit a bug report, please visit:
# https://bugs.launchpad.net/ubuntu/+source/openjdk-lts
#
Aborted (core dumped)
[hs_err_pid330518.log](https://github.com/alexandrnikitin/bloom-filter-scala/files/6262164/hs_err_pid330518.log)
Please see the attached file for the detailed error. I'm using the following JDK:
joesan@joesan-InfinityBook-S-14-v5:~/Projects/Private/scala-projects/yoke$ java -version
openjdk version "11.0.10" 2021-01-19
OpenJDK Runtime Environment (build 11.0.10+9-Ubuntu-0ubuntu1.20.04)
OpenJDK 64-Bit Server VM (build 11.0.10+9-Ubuntu-0ubuntu1.20.04, mixed mode, sharing)
I'm using JDK 11 (azul 11.0.7) with Scala 2.12.11 and get the exception below when adding an element to the Bloom Filter. The issue appears to be that the value field of class String was changed in JDK 9 from a char[] to a byte[], which causes a ClassCastException on this line:
val value = unsafe.getObject(from, valueOffset).asInstanceOf[Array[Char]]
Exception in thread "main" java.lang.ClassCastException: class [B cannot be cast to class [C ([B and [C are in module java.base of loader 'bootstrap')
at bloomfilter.CanGenerateHashFrom$CanGenerateHashFromString$.generateHash(CanGenerateHashFrom.scala:27)
at bloomfilter.CanGenerateHashFrom$CanGenerateHashFromString$.generateHash(CanGenerateHashFrom.scala:20)
at bloomfilter.mutable.BloomFilter.add(BloomFilter.scala:18)
at Main$.delayedEndpoint$Main$1(Main.scala:12)
at Main$delayedInit$body.apply(Main.scala:3)
at scala.Function0.apply$mcV$sp(Function0.scala:39)
at scala.Function0.apply$mcV$sp$(Function0.scala:39)
at scala.runtime.AbstractFunction0.apply$mcV$sp(AbstractFunction0.scala:17)
at scala.App.$anonfun$main$1$adapted(App.scala:80)
at scala.collection.immutable.List.foreach(List.scala:392)
at scala.App.main(App.scala:80)
at scala.App.main$(App.scala:78)
at Main$.main(Main.scala:3)
at Main.main(Main.scala)
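A portable workaround sketch (an assumption, not the library's actual fix) is to obtain the characters or bytes through the public String API instead of casting the private value field. This copies, so it is slower than the Unsafe field read, but it is correct on every JDK:

```scala
import java.nio.charset.StandardCharsets

// Materialize chars/bytes via the public API rather than reading String's
// private `value` field, which changed from char[] to byte[] in JDK 9.
def charsForHashing(s: String): Array[Char] = s.toCharArray
def utf8BytesForHashing(s: String): Array[Byte] =
  s.getBytes(StandardCharsets.UTF_8)
```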
That is, will calling .add and .mightContain concurrently from many different threads work?
Add support for computing Bloom filters in a distributed fashion by generating and OR'ing multiple Bloom filters into one.
Breeze is a good reference point for this issue: https://github.com/scalanlp/breeze/blob/master/math/src/main/scala/breeze/util/BloomFilter.scala#L84
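The core of the merge is a word-wise OR over the backing bit arrays, which is only valid when both filters were created with identical numberOfBits and numberOfHashes. A minimal sketch over plain Array[Long] words (direct access to the filter's words is assumed for illustration):

```scala
// OR-combine two filters' backing words; both filters must be sized identically,
// otherwise the same element would map to different bit positions.
def union(a: Array[Long], b: Array[Long]): Array[Long] = {
  require(a.length == b.length, "filters must be sized identically")
  Array.tabulate(a.length)(i => a(i) | b(i))
}
```

This is what makes the per-partition pattern work: build one filter per partition, then reduce with union on the driver.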
Awesome work on bloom-filter-scala. I am using bloom-filter-scala on String, but the false positive rate seems too high. My demo example is below:
package sample
import bloomfilter.mutable.BloomFilter
object Sample extends App {
// scalastyle:off
println("""val bf = BloomFilter[String](10000, 0.1)""")
val bf = BloomFilter[String](10000, 0.1)
println("""bf.add("a") """)
bf.add("a")
println(s"""bf.mightContain("s"): ${bf.mightContain("s")}""")
println(s"""bf.mightContain("1"): ${bf.mightContain("1")}""")
println(s"""bf.mightContain(""): ${bf.mightContain("")}""")
// scalastyle:on
}
I expected all of these to return false, but at least 2 of them always return true. May I know if I am using bloom-filter-scala correctly?
Thank you very much.
The original paper: "Cuckoo Filter: Practically Better Than Bloom" by Bin Fan, David G. Andersen, Michael Kaminsky, Michael D. Mitzenmacher
Implementations:
CPP from one of the authors: https://github.com/efficient/cuckoofilter
Java: https://github.com/bdupras/guava-probably
https://github.com/lrodero/java_cuckoo_filter
Go: https://github.com/seiflotfy/cuckoofilter
Guava already has this as part of its interface
CanGenerateHashFromStringByteArray, which is used for JDK 9+, assumes that the string is stored using the UTF-8 character encoding and that the length of the underlying byte[] is the same as the length of the string. This assumption only holds if the string contains only characters from the ISO-8859-1/Latin-1 character set. If the string contains other characters, it is stored in the underlying byte array as UTF-16 characters and the length of the byte array is 2x the number of characters in the string. Additionally, it is possible to disable this storage optimization with the -XX:-CompactStrings JVM flag, in which case all strings are stored as UTF-16 characters. See here and here for more information.