
alexandrnikitin / bloom-filter-scala

375 stars, 19 watchers, 57 forks, 284 KB

Bloom filter for Scala, the fastest for JVM

Home Page: https://alexandrnikitin.github.io/blog/bloom-filter-for-scala/

License: MIT License

Languages: Scala 85.55%, Java 14.45%
Topics: bloom-filter, probabilistic, high-performance, datastructures, scala

bloom-filter-scala's People

Contributors

alexandrnikitin, cmarxer, eyalfa, fediq, melrief, seanrohead, sidweng


bloom-filter-scala's Issues

Can't find implicit value for canGenerateHash

I'm trying to use a BloomFilter inside of a class that uses a TypeTag:

class MyClass[T: TypeTag](numberOfItems: Long, falsePositiveRate: Double) {
  val bloomFilter = BloomFilter[T](numberOfItems, falsePositiveRate)
}

However, at compile time I get `could not find implicit value for parameter canGenerateHash: bloomfilter.CanGenerateHashFrom[T]`. Is there a way around this?

Thank you,
Chris
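
A likely workaround (a sketch, not the maintainer's answer): require the library's CanGenerateHashFrom typeclass as a context bound instead of (or alongside) TypeTag, so the implicit the compiler is asking for is in scope when the filter is constructed:

import bloomfilter.CanGenerateHashFrom
import bloomfilter.mutable.BloomFilter

// Sketch: carry the hash typeclass as a context bound so an implicit
// bloomfilter.CanGenerateHashFrom[T] is available inside the class.
class MyClass[T: CanGenerateHashFrom](numberOfItems: Long, falsePositiveRate: Double) {
  val bloomFilter = BloomFilter[T](numberOfItems, falsePositiveRate)
}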

serialization + kryo support

Hi,
Currently the BloomFilter and its buffer are not serializable; can this be added?
The same goes for Kryo support, which could be added with an optional dependency (I've actually already implemented this as part of my Spark application).

Another related issue: the provided to/from stream methods work one long at a time. I haven't really benchmarked this, but my intuition is that it is slow. I suggest modifying the current methods, or adding new ones, to take/allocate a buffer and leverage Unsafe's ability to copy memory ranges (Spark Tungsten style).

Would you be willing to accept pull requests?

thanks,
Eyal.
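
For reference, a minimal sketch of what a Kryo serializer could look like if it delegated to the stream methods this issue mentions (assumptions: Kryo 4's Serializer API, and the companion's readFrom(InputStream) plus the instance's writeTo(OutputStream); the class itself is illustrative, not part of the library):

import bloomfilter.CanGenerateHashFrom
import bloomfilter.mutable.BloomFilter
import com.esotericsoftware.kryo.{Kryo, Serializer}
import com.esotericsoftware.kryo.io.{Input, Output}

// Illustrative Kryo serializer: Kryo's Output/Input extend
// java.io.OutputStream/InputStream, so the existing stream-based
// methods can be reused directly.
class BloomFilterSerializer[T](implicit canGenerateHash: CanGenerateHashFrom[T])
    extends Serializer[BloomFilter[T]] {

  override def write(kryo: Kryo, output: Output, bf: BloomFilter[T]): Unit =
    bf.writeTo(output)

  override def read(kryo: Kryo, input: Input, clazz: Class[BloomFilter[T]]): BloomFilter[T] =
    BloomFilter.readFrom[T](input)
}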

Serialization Support

Add support for Java and Kryo serialization so that the Bloom Filter can be used in distributed applications such as Apache Spark.

Opposite bloom filter wanted

The opposite of a Bloom filter is a data structure that may report a false negative, but can never report a false positive.

It is handy for real-time deduplication where unique values must never be filtered out, while the rate of duplicates that slip through can be bounded at some defined level.

Why does the Bloom filter return false?

When I use the example on the main page, I add an element to the Bloom filter; however, when I use the mightContain method to check whether the element is in the set, it returns false. Why?

val expectedElements = 1000000
val falsePositiveRate = 0.1
val bf = BloomFilter[String](expectedElements, falsePositiveRate)

// Put an element
bf.add("lee")

// Check whether an element in a set
bf.mightContain("lee") // return false

// Dispose the instance
bf.dispose()

reconsider UnsafeBitArray

I'm using the library with Apache Spark; specifically, I'm optimizing a sparse join by creating a Bloom filter per partition. This required wrapping the Bloom filter instances with a class that implements finalize; without it I'd get a massive memory leak 😢. The wrapper solved part of my problems but not all of them, as Spark is unaware of the memory used by the Bloom filters, which may lead to situations where it 'misses' a memory spill (Spark's way of freeing up memory).

My thought is that implementing UnsafeBitArray in terms of a plain Java long array (long[]) should not hurt the capabilities or performance of this class, while still allowing these objects to be garbage-collected properly (see the sketch below).
A further possibility is using multiple arrays to avoid huge allocations; the JVM heap (like many other memory allocators) 'suffers' from both tiny and huge allocations, each approach having its own merits.
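
A minimal sketch of such a heap-backed bit array (class and method names are illustrative; a single JVM array is indexed by Int, so the size is capped accordingly):

// Illustrative heap-backed alternative to UnsafeBitArray: the backing
// Array[Long] lives on the JVM heap, so it is freed by normal GC and
// needs no finalizer or explicit dispose().
final class HeapBitArray(numberOfBits: Long) {
  private val words = new Array[Long](((numberOfBits + 63) >>> 6).toInt)

  def get(index: Long): Boolean =
    (words((index >>> 6).toInt) & (1L << (index & 63L))) != 0L

  def set(index: Long): Unit =
    words((index >>> 6).toInt) |= (1L << (index & 63L))
}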

@alexandrnikitin, what do you think? Is it worth a benchmark (perhaps in a separate branch, to avoid the complexity of supporting both code paths side by side)?

NoClassDefFoundError Product when instantiating a new BloomFilter

Instantiating a simple Bloom filter with the following code

    val expectedElements = 10000
    val falsePositiveRate: Double = 0.1
    val bf = BloomFilter[String](expectedElements, falsePositiveRate)

fails at the last line (val bf = ...) with the following error:


[error] (run-main-0) java.lang.NoClassDefFoundError: scala/Product$class
[error] java.lang.NoClassDefFoundError: scala/Product$class
[error] 	at bloomfilter.CanGenerateHashFrom$CanGenerateHashFromString$.<init>(CanGenerateHashFrom.scala:20)
[error] 	at bloomfilter.CanGenerateHashFrom$CanGenerateHashFromString$.<clinit>(CanGenerateHashFrom.scala)
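
A note for context: scala/Product$class exists only in bytecode produced by Scala 2.11 and earlier, so this error usually indicates a Scala binary-version mismatch (a 2.11 artifact on a 2.12 classpath) rather than a bug in the library. A hedged guess at the fix, assuming the coordinates published in the project README:

// Hypothetical sbt snippet: "%%" appends the project's Scala binary
// version (_2.11 / _2.12 / ...) to the artifact name, so the dependency
// matches the compiler used by the project.
libraryDependencies += "com.github.alexandrnikitin" %% "bloom-filter" % "latest.release"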

Scala 2.12.1 support?

In 2.12, scala.concurrent.util.Unsafe has been removed, which impacts this project. Are you planning to update accordingly? I'm curious about your thoughts on how to go about it. Thanks
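
For what it's worth, a common workaround sketch is to obtain sun.misc.Unsafe reflectively instead of going through the removed scala.concurrent.util.Unsafe forwarder (names here are illustrative, not the project's actual code):

import sun.misc.Unsafe

// Illustrative replacement for the removed forwarder: read the private
// "theUnsafe" singleton field via reflection.
object UnsafeProvider {
  val instance: Unsafe = {
    val field = classOf[Unsafe].getDeclaredField("theUnsafe")
    field.setAccessible(true)
    field.get(null).asInstanceOf[Unsafe]
  }
}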

Better utilization of allocated memory

UnsafeBitArray always allocates a multiple of 64 bits. For example, if numberOfBits is 250, then 256 bits are allocated but 6 bits always remain unused.

Wouldn't it make sense for BloomFilter.optimalNumberOfBits(..) to return the next higher multiple of 64? The false positive rate would become slightly better, and all of the allocated memory could potentially be used.
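
The suggested rounding is a one-liner; a sketch with an illustrative helper name, where roundUpToMultipleOf64(250) == 256:

// Round numberOfBits up to the next multiple of 64 so every bit of the
// last allocated 64-bit word can be used.
def roundUpToMultipleOf64(numberOfBits: Long): Long =
  (numberOfBits + 63L) & ~63L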

got SIGSEGV using in Spark

I tried to use your implementation in Spark and got this:

A fatal error has been detected by the Java Runtime Environment:

SIGSEGV (0xb) at pc=0x000000010d75d414, pid=34100, tid=48387

JRE version: Java(TM) SE Runtime Environment (8.0_51-b16) (build 1.8.0_51-b16)
Java VM: Java HotSpot(TM) 64-Bit Server VM (25.51-b03 mixed mode bsd-amd64 compressed oops)
Problematic frame:
V [libjvm.dylib+0x55d414]

Failed to write core dump. Core dumps have been disabled. To enable core dumping, try "ulimit -c unlimited" before starting Java again

An error report file with more information is saved as:
hs_err_pid34100.log
Compiled method (nm) 10145 2030 n 0 sun.misc.Unsafe::getObject (native)
total in heap [0x000000010e7b98d0,0x000000010e7b9c28] = 856
relocation [0x000000010e7b99f8,0x000000010e7b9a38] = 64
main code [0x000000010e7b9a40,0x000000010e7b9c28] = 488

If you would like to submit a bug report, please visit:
http://bugreport.java.com/bugreport/crash.jsp

hs_err_pid34100.txt

SIGSEGV Error in a Web Application

I'm trying to use the filter in one of my projects, a web application built with the Play framework. When I tried to insert elements into the filter, I got this error:

#
# A fatal error has been detected by the Java Runtime Environment:
#
#  SIGSEGV (0xb) at pc=0x00007fd90fd1f9ee, pid=330518, tid=330806
#
# JRE version: OpenJDK Runtime Environment (11.0.10+9) (build 11.0.10+9-Ubuntu-0ubuntu1.20.04)
# Java VM: OpenJDK 64-Bit Server VM (11.0.10+9-Ubuntu-0ubuntu1.20.04, mixed mode, sharing, tiered, compressed oops, g1 gc, linux-amd64)
# Problematic frame:
# V  [libjvm.so+0xdd59ee]
#
# Core dump will be written. Default location: Core dumps may be processed with "/usr/share/apport/apport %p %s %c %d %P %E" (or dumping to /home/joesan/Projects/Private/scala-projects/yoke/core.330518)
#
# An error report file with more information is saved as:
# /home/joesan/Projects/Private/scala-projects/yoke/hs_err_pid330518.log
#
# If you would like to submit a bug report, please visit:
#   https://bugs.launchpad.net/ubuntu/+source/openjdk-lts
#
Aborted (core dumped)
[hs_err_pid330518.log](https://github.com/alexandrnikitin/bloom-filter-scala/files/6262164/hs_err_pid330518.log)

Please see the attached file for the detailed error. I'm using the following JDK:

joesan@joesan-InfinityBook-S-14-v5:~/Projects/Private/scala-projects/yoke$ java -version
openjdk version "11.0.10" 2021-01-19
OpenJDK Runtime Environment (build 11.0.10+9-Ubuntu-0ubuntu1.20.04)
OpenJDK 64-Bit Server VM (build 11.0.10+9-Ubuntu-0ubuntu1.20.04, mixed mode, sharing)


Support JDK above 8.x

I'm using JDK 11 (azul 11.0.7) with Scala 2.12.11 and get the exception below when adding an element to the Bloom Filter. The issue appears to be that the value field of class String was changed in JDK 9 from a char[] to a byte[], which causes a ClassCastException on this line:

val value = unsafe.getObject(from, valueOffset).asInstanceOf[Array[Char]]

Exception in thread "main" java.lang.ClassCastException: class [B cannot be cast to class [C ([B and [C are in module java.base of loader 'bootstrap')
	at bloomfilter.CanGenerateHashFrom$CanGenerateHashFromString$.generateHash(CanGenerateHashFrom.scala:27)
	at bloomfilter.CanGenerateHashFrom$CanGenerateHashFromString$.generateHash(CanGenerateHashFrom.scala:20)
	at bloomfilter.mutable.BloomFilter.add(BloomFilter.scala:18)
	at Main$.delayedEndpoint$Main$1(Main.scala:12)
	at Main$delayedInit$body.apply(Main.scala:3)
	at scala.Function0.apply$mcV$sp(Function0.scala:39)
	at scala.Function0.apply$mcV$sp$(Function0.scala:39)
	at scala.runtime.AbstractFunction0.apply$mcV$sp(AbstractFunction0.scala:17)
	at scala.App.$anonfun$main$1$adapted(App.scala:80)
	at scala.collection.immutable.List.foreach(List.scala:392)
	at scala.App.main(App.scala:80)
	at scala.App.main$(App.scala:78)
	at Main$.main(Main.scala:3)
	at Main.main(Main.scala)
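
Until the library itself handles the JDK 9+ string layout, one workaround sketch is to supply your own implicit hash instance that never touches String's internals. The example below is illustrative only (and likely slower than the library's Unsafe-based hashing): it builds a 64-bit value from two seeded 32-bit MurmurHash3 hashes from the Scala standard library.

import bloomfilter.CanGenerateHashFrom
import scala.util.hashing.MurmurHash3

object Jdk9SafeHashing {
  // Illustrative JDK-version-independent 64-bit hash for String, built
  // from two seeded 32-bit MurmurHash3 values; not the library's own
  // algorithm.
  implicit object CanGenerateHashFromStringSafe extends CanGenerateHashFrom[String] {
    override def generateHash(from: String): Long = {
      val hi = MurmurHash3.stringHash(from, 0x1b873593)
      val lo = MurmurHash3.stringHash(from, 0x5bd1e995)
      (hi.toLong << 32) | (lo.toLong & 0xffffffffL)
    }
  }
}

Importing Jdk9SafeHashing.CanGenerateHashFromStringSafe at the call site should take precedence over the default instance, since explicitly imported implicits outrank those found via the companion object.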

Issue when using `bloom-filter-scala` on String

@alexandrnikitin,

Awesome work on bloom-filter-scala! I am using bloom-filter-scala on String, but the false positive rate seems too high. My demo example is below:

package sample

import bloomfilter.mutable.BloomFilter

object Sample extends App {

  // scalastyle:off
  println("""val bf = BloomFilter[String](10000, 0.1)""")
  val bf = BloomFilter[String](10000, 0.1)
  println("""bf.add("a") """)
  bf.add("a")
  println(s"""bf.mightContain("s"): ${bf.mightContain("s")}""")
  println(s"""bf.mightContain("1"): ${bf.mightContain("1")}""")
  println(s"""bf.mightContain(""): ${bf.mightContain("")}""")
  // scalastyle:on

}

I expected all of these to return false, but at least 2 of them always return true. May I know if I am using bloom-filter-scala correctly?

Thank you very much.

CanGenerateHashFromString is broken in JDK 9+ when a string contains non-Latin-1 characters or the -XX:-CompactStrings JVM flag is used

CanGenerateHashFromStringByteArray, which is used for JDK 9+, assumes that the string is stored one byte per character and that the length of the underlying byte[] is the same as the length of the string. This assumption only holds if the string contains only characters from the ISO-8859-1/Latin-1 character set. If the string contains other characters, it is stored in the underlying byte array as UTF-16 characters and the length of the byte array is 2x the number of characters in the string. Additionally, it is possible to disable this storage optimization with the -XX:-CompactStrings JVM flag, in which case all strings are stored as UTF-16 characters. See here and here for more information.
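
A quick way to observe the layout described above (a sketch; on recent JDKs the reflective access may require --add-opens java.base/java.lang=ALL-UNNAMED):

// Peek at String's private byte[] "value" field to confirm the
// Latin-1 vs UTF-16 storage difference on JDK 9+.
val valueField = classOf[String].getDeclaredField("value")
valueField.setAccessible(true)

def backingLength(s: String): Int =
  valueField.get(s).asInstanceOf[Array[Byte]].length

println(backingLength("abc"))  // 3: Latin-1, one byte per char
println(backingLength("абв"))  // 6: UTF-16, two bytes per char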
