callidon / bloom-filters

JS implementation of probabilistic data structures: Bloom filter (and its derivatives), HyperLogLog, Count-Min Sketch, Top-K and MinHash

Home Page: https://callidon.github.io/bloom-filters/

License: MIT License

JavaScript 37.92% TypeScript 62.08%
bloom-filter count-min-sketch cuckoo-filter hyperloglog invertible-bloom-filter javascript minhash probabilistic topk

bloom-filters's People

Contributors

callidon · dependabot[bot] · dleppik · folkvir · jonahharris · justrox · lqmanh · zergity


bloom-filters's Issues

Better examples for CustomHashing

Is your feature request related to a problem? Please describe.

It was confusing to figure out how to use the CustomHashing feature, and I discovered that naively constructing numbers from hex-encoded cryptographic hashes such as sha256 or shake256 can lead to scenarios where the filter always returns true.

Describe the solution you'd like

An example showing how to use cryptographic hashes; more precisely, how to safely convert a hex-encoded string to a number. Is it OK to truncate? Why not use BigInt?

Some explanation of how to use standard hash functions safely would be awesome.

Acceptance criteria

unit tests for sha256 and shake256.

Additional context

In some environments (regulated / security-oriented) only specific hash functions are allowed, so it would be nice to have compliant examples to build from.

saveAsJSON() method not found

The README says that you can call .saveAsJSON() on any bloom filter to get a serialization of the filter state that can be persisted or sent over the network. When I try to use it, however, my TypeScript code fails to compile with the error: error TS2339: Property 'saveAsJSON' does not exist on type 'BloomFilter'.

My code looks like:

import { BloomFilter } from 'bloom-filters'

// [...]
const bloomFilterEntries = new Set<string>()
// [...]

const bloomFilter = BloomFilter.from(bloomFilterEntries, 0.01)
const bloomJSON = bloomFilter.saveAsJSON()

ERROR TypeError: "bloom_filters__WEBPACK_IMPORTED_MODULE_2__ is not a constructor"

In angular project
Step 1: run npm install bloom-filters --save
Step 2: add "node_modules/bloom-filters/bloom-filters.js" in angular.json
Step 3: import in the component file as import * as BloomFilter from 'bloom-filters'

my package.json is
{
"name": "",
"version": "0.0.0",
"scripts": {
"ng": "ng",
"start": "ng serve",
"build": "ng build",
"test": "ng test",
"lint": "ng lint",
"e2e": "ng e2e"
},
"private": true,
"dependencies": {
"@angular/animations": "~7.1.0",
"@angular/cdk": "~7.1.0",
"@angular/common": "~7.1.0",
"@angular/compiler": "~7.1.0",
"@angular/core": "~7.1.0",
"@angular/forms": "~7.1.0",
"@angular/platform-browser": "~7.1.0",
"@angular/platform-browser-dynamic": "~7.1.0",
"@angular/router": "~7.1.0",
"@types/axios": "^0.14.0",
"aes-js": "^3.1.2",
"angular-filepond": "^1.0.5",
"bloom-filters": "^0.5.2",
"chart.js": "^2.7.3",
"core-js": "^2.5.4",
"d3": "^5.7.0",
"datatables.net": "^1.10.19",
"datatables.net-dt": "^1.10.19",
"echarts": "^4.2.0-rc.2",
"js-sha256": "^0.9.0",
"jssha": "^2.3.1",
"ng2-charts": "^1.6.0",
"ng2-pdf-viewer": "^5.2.3",
"ng2-toastr": "^4.1.2",
"ngx-echarts": "^4.1.0",
"ngx-extended-pdf-viewer": "^0.9.14",
"rxjs": "~6.3.3",
"tslib": "^1.9.0",
"zone.js": "~0.8.26"
},
"devDependencies": {
"@angular-devkit/build-angular": "~0.11.0",
"@angular/cli": "~7.1.2",
"@angular/compiler-cli": "~7.1.0",
"@angular/language-service": "~7.1.0",
"@schematics/angular": "~7.1.0",
"@types/echarts": "^4.1.3",
"@types/jasmine": "~2.8.8",
"@types/jasminewd2": "~2.0.3",
"@types/node": "^8.9.5",
"codelyzer": "~4.5.0",
"jasmine-core": "~2.99.1",
"jasmine-spec-reporter": "~4.2.1",
"karma": "~3.1.1",
"karma-chrome-launcher": "~2.2.0",
"karma-coverage-istanbul-reporter": "~2.0.1",
"karma-jasmine": "~1.1.2",
"karma-jasmine-html-reporter": "^0.2.2",
"protractor": "~5.4.0",
"ts-node": "~7.0.0",
"tslint": "~5.11.0",
"typescript": "~3.1.6"
}
}

I got the error in the title.

A solution might be documentation on how to import a Node npm module into an Angular project.

Bug: Endless recursion in getDistinctIndices with certain parameters.

Explanation
The getDistinctIndicesBis function within getDistinctIndices will fail with a Maximum call stack size exceeded error when too few distinct indices are returned while n < size.

How to reproduce

const filter = new BloomFilter(39, 28);
filter.add("da5e21f8a67c4163f1a53ef43515bd027967da305ecfc741b2c3f40f832b7f82");

This code will trigger the issue.

Why does this happen?

// This returns the same hashes after n is larger than size
const hashes = hashTwice(elem, true, seed! + (size % n));
// This return from the doubleHashing function will also return the same values when called with the same hashes and different n values
return Math.abs((hashA + n * hashB) % size);

Combined with the filtering of duplicate indices in getDistinctIndicesBis, this leads to the issue.
It's important to note that this only happens in certain cases with a high hash count and a low bit count.

HyperLogLog example not working

The example HyperLogLog provided in the documentation does not work as expected. When I count a freshly initialized sketch, before any updates, I get 71.36002532672464; after updating the sketch with 'alice', the count returns the same number. Is there something I'm missing?

Counting bloom filter

Hello,

is there any chance that you could implement a counting bloom filter?
Would be great!

Thanks in advance!

Filters created with version 1.3.4 and exported to json do not work when imported to version 1.3.9

We use this library as part of the Ceramic Network. We publish batches of work done on the network along with a JSON-serialized bloom filter, to make it easier to look up what happened in each batch. Our hosted service making these batches seems to be using version 1.3.4 of the bloom-filters package. Importing those bloom filters with the current 1.3.9 version of the library causes the .has() method of the reconstituted bloom filter to always return false, even for entries that are definitely present in the filter. Reverting to version 1.3.4 and rebuilding the bloom filter from the same input data makes it behave correctly. It seems there was a format change somewhere between version 1.3.4 and version 1.3.9 that causes filters created on the old version, exported with saveAsJSON(), and then imported with fromJSON() on the new version, to not be rebuilt correctly.

Code that builds the filter: https://github.com/ceramicnetwork/ceramic-anchor-service/blob/9ff9e1a20e46c65036fcbed600a0abf1b09d0bec/src/merkle/merkle-objects.ts#L247-L249

Unexpected behaviour in CuckooFilter

See the example. Many false positives (It happens especially after the first result in a row.) What am I doing wrong?

const { CuckooFilter } = require('bloom-filters')
let filter = new CuckooFilter(15, 3, 2)

 filter.add('alice')
 filter.add('andrew')
 filter.add('bob')
 filter.add('sam')

 filter.add('alice')
 filter.add('andrew')
 filter.add('bob')
 filter.add('sam')

// lookup for some data
console.log(filter.has('samx')) // output: false [ok]
console.log(filter.has('samy')) // output: true [?]
console.log(filter.has('alice')) // output: true [ok]
console.log(filter.has('joe')) // output: true [?]
console.log(filter.has('joe')) // output: true [?]

Version 0.7.1

`Buffer is not defined` error when using in the browser.

Hi, I was trying to create a BloomFilter instance in the browser and it keeps throwing Buffer is not defined, even though I'm only importing the BloomFilter export, which doesn't use the Buffer class.

Is there a workaround for this?

I'm using the library with the vite bundler on firefox 99, if that helps.

saveToJSON doesn't export default seed

This code saves the cuckoo filter to localStorage in v3.0.0, but upon importing it again from localStorage, it finds that the default seed wasn't serialized, and so it throws an error:

!('_seed' in json)

import { CuckooFilter } from "bloom-filters";

// Check if there's a saved filter in localStorage.
let filter;
const savedFilter = JSON.parse(localStorage.getItem("--kiwi-news-read-links"));
console.log(savedFilter);
try {
  // Try to load the filter from localStorage.
  filter = CuckooFilter.fromJSON(savedFilter);
} catch (error) {
  console.log(error);
  // If an error is thrown, create a new filter.
  filter = CuckooFilter.create(1000, 0.01);
}

// Keep a queue of the most recent N links.
let queue = [];

// Check the current path.
const path = window.location.pathname;

if (path === "/" || path === "/new") {
  // Extract all the links on page load.
  const links = Array.from(document.querySelectorAll("a.story-link")).map(
    (a) => a.href,
  );

  links.forEach((link) => {
    if (!filter.has(link)) {
      const parentElement = document
        .querySelector(`a[href="${link}"]`)
        .closest("div");
      const domainElement = parentElement.querySelector(".story-domain");
      domainElement.textContent += " •";
      addLink(link);
    }
  });

  console.log("safe", filter.saveAsJSON());
  // Save the filter to localStorage.
  localStorage.setItem(
    "--kiwi-news-read-links",
    JSON.stringify(filter.saveAsJSON()),
  );
}

function addLink(link) {
  // If the filter is full, remove the oldest link.
  if (queue.length >= 1000) {
    const oldestLink = queue.shift();
    filter.remove(oldestLink);
  }

  // Add the new link to the filter and the queue.
  filter.add(link);
  queue.push(link);
}

however, after adding a custom seed manually, the seed gets exported as JSON and the problem is resolved

try {
  // Try to load the filter from localStorage.
  filter = CuckooFilter.fromJSON(savedFilter);
} catch (error) {
  console.log(error);
  // If an error is thrown, create a new filter.
  filter = CuckooFilter.create(1000, 0.01);
  filter.seed = 0xabcd;
}

Still, I think this should also work with the default seed, shouldn't it?

[Bug?] npm install produces only an api.js file

npm install installs version 1.3.7.
The npm package only contains an api.js in the dist directory. Did I miss something?
The package cannot run because all the files required by api.js are missing.

Todo list

  • HyperLogLog
  • Top-K or Mine Top-K: Using Bloom Filters for Mining Top-k Frequent Itemsets in Data Streams, Younghee Kim et al.
  • Min Hash Wikipedia source, need to find the paper
  • Scalable Bloom Filter: now that the partitioned Bloom filter is fixed, the scalable Bloom filter can easily be implemented.
  • XorFilter #29

Feel free to modify this list or add suggestions in comments.

Browser friendliness?

I'd really like to use this module in browsers but the bundle size ends up being very large, just importing the CuckooFilter adds over 50KB of minified code to the bundle.

Using esbuild to bundle just this script:

import { CuckooFilter } from 'bloom-filters'

const filter = new CuckooFilter(1, 2, 3, 4)
console.info(filter.has('hello world'))

Creates a 216KB bundle (63KB minified), and I have to shim some node internals (e.g. the Buffer class) for it to work:

image

The CuckooFilter implementation itself is 8.4KB un-minified so there's a lot of unused code here.

Would you be open to some PRs that make this module more browser friendly?

Some low hanging fruit I can see immediately:

  1. Switch tsc to output ESM for better tree shaking
  2. Replace use of node Buffers with built-in Uint8Arrays
  3. Remove, replace or optimise use of lodash
  4. Remove long dependency and use built-in BigInts

The first is a breaking change but will yield the biggest benefit - the breaking change is that consumers will no longer be able to require the module, they must import it via static or dynamic imports.

The second is breaking where Buffers are used as return values which from what I can see only appears to be in the Invertible Bloom Lookup Table implementation.

The rest are just internal changes so are non-breaking.
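Point 2 above is mostly mechanical. A sketch of the usual Buffer-to-standard-API translations (TextEncoder, TextDecoder, Uint8Array and DataView are built into both Node and browsers; the helper names here are illustrative):

```javascript
// Common Buffer operations and their portable standard-API equivalents.
const encoder = new TextEncoder()
const decoder = new TextDecoder()

// Buffer.from('hello', 'utf8')  -->
const bytes = encoder.encode('hello')

// buf.toString('utf8')  -->
const text = decoder.decode(bytes)

// Buffer.concat([a, b])  -->
function concat (a, b) {
  const out = new Uint8Array(a.length + b.length)
  out.set(a, 0)
  out.set(b, a.length)
  return out
}

// buf.readUInt32LE(0)  -->  (DataView handles endianness explicitly)
function readUInt32LE (bytes, offset = 0) {
  return new DataView(bytes.buffer, bytes.byteOffset).getUint32(offset, true)
}

console.log(text) // 'hello'
```

Because these are drop-in standard APIs, this part of the migration needs no polyfills or bundler shims.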

Implement Scalable Bloom Filter

From #8, we need to implement the Scalable Bloom Filter; now that the partitioned Bloom filter is fixed, it can easily be implemented.

Question about implementation of double hashing

Hi. I'm just looking into the hashing implementation here and I'm baffled by several aspects of the implementation, hoping you can tell me if there's anything I'm missing.

For example, in getIndices, you're rehashing the input on every iteration of the loop:

bloom-filters/src/utils.ts

Lines 206 to 209 in 5c81fa4

for (let i = 1; i <= hashCount; i++) {
const hashes = hashTwice(element, true, seed + size % i)
arr.push(doubleHashing(i, hashes.first, hashes.second, size))
}

Surely this defeats the point of double hashing, which is to simulate k independent hash functions given only two real ones? Not using double hashing at all would need one hash per index; your implementation does two hashes per index.

It's true that the hashes you're calculating on each loop aren't quite the same, because you're adding size % i -- that's the number of cells modulo the loop iteration -- to the seed each time. But why? That seems like a really strange thing to add to the seed. It doesn't guarantee that the seed is different on each loop (e.g. if the number of cells is even it'll be 0 for the first 2 iterations). But again, that's not something you should want or need in double hashing anyway.

Am I missing something?
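For reference, the classic double-hashing scheme (Kirsch and Mitzenmacher) the question alludes to can be sketched as follows; kIndices is illustrative, not the library's code, and hashA/hashB stand in for the pair a single call to something like hashTwice() would return:

```javascript
// Hash the element ONCE to get two values, then derive all k indices
// arithmetically -- two hashes total, not two per index.
function kIndices (hashA, hashB, k, size) {
  const indices = []
  for (let i = 0; i < k; i++) {
    indices.push(Math.abs((hashA + i * hashB) % size))
  }
  return indices
}

console.log(kIndices(7, 11, 4, 19)) // [ 7, 18, 10, 2 ]
```

Note this scheme does not guarantee the k indices are distinct; standard Bloom filter analysis does not require them to be.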

Serialized filters: are they safe to expose publicly?

Hi, I'm working on a blockchain, on-chain scalable filtering solution, and I wish to know whether an attacker could gain an advantage if the entire serialized filter is stored publicly.

If that's the case, are there any suggestions to follow, like keeping some parts of the filter off-chain? The filter has to be exposed somehow to decentralize the filtering.

The error rate should stay as low as possible, and the filter content size can vary.

I'm evaluating whether to use this nice library.

Btw: I tried an error rate of zero just for testing and the browser CPU usage went crazy. I know it was a utopian case, but I'm mentioning it in case it's useful for you.

Persistence

Thanks for making this library.

Is there any way to serialise the structures into something (e.g. json or a base64 encoded string) so that it can be persisted somewhere and re-instantiated?

Error in Browser Chrome 105

environment is a quasar+vite+vue3

vue-router.mjs:3428 ReferenceError: Buffer is not defined
at cell.js:178:76
at node_modules/bloom-filters/dist/iblt/cell.js (cell.js:200:1)
at __require (chunk-OL3AADLO.js?v=aca5486d:9:50)
at node_modules/bloom-filters/dist/iblt/invertible-bloom-lookup-tables.js (invertible-bloom-lookup-tables.js:84:30)
at __require (chunk-OL3AADLO.js?v=aca5486d:9:50)
at node_modules/bloom-filters/dist/api.js (api.js:54:40)
at __require (chunk-OL3AADLO.js?v=aca5486d:9:50)
at dep:bloom-filters:1:16

Reusing Hashes for faster lookups

Is your feature request related to a problem? Please describe.

My Problem: I replicate bloom filters from many clients (up to 1000) to my server.
The server then has to lookup a given key "foobar" and check which of the replicated filters contains the key.
My bloom filters are pretty big and need about 13 hashes.
To check which replicated filter has the key, the server would then have to do 13*1000 hash operations.

Describe the solution you'd like

A way to precalculate the hash once and then lookup the existence in multiple bloom filters without having to hash again.

import { hash } from 'bloom-filters';
const lookupHash = hash(13, 'foobar');

const foundInFilters = bloomFilters.filter(bf => bf.hasHash(lookupHash));

Acceptance criteria

For testing we could run a test that ensures that lookups for pre-calculated hashes are exactly equal to normal key lookups.

Additional context

Also a question: is it possible to use a different hash function? My plan is to run the hashing inside WebAssembly for better performance, so maybe even an async hash function would be useful.
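The proposed hash()/hasHash() API does not exist in the library yet; purely as an illustration of the idea (all names hypothetical), for filters that share the same size and seed the k indices can be computed once and then probed against each filter's bit array:

```javascript
// Hypothetical sketch of the feature request, not the library's API.
// Assumes all replicated filters share the same size, seed and hash count.
function precomputeIndices (hashA, hashB, k, size) {
  const indices = []
  for (let i = 0; i < k; i++) indices.push((hashA + i * hashB) % size)
  return indices
}

// Probe one filter's raw bit array with the precomputed indices.
function hasHash (bits, indices) {
  return indices.every(i => bits[i] === 1)
}

// One hashing pass, then N cheap array lookups instead of N * k hashes.
const indices = precomputeIndices(7, 11, 3, 32)
const filters = [new Array(32).fill(1), new Array(32).fill(0)]
console.log(filters.map(bits => hasHash(bits, indices))) // [ true, false ]
```

The requested test (precomputed-hash lookups exactly matching normal key lookups) would pin down that the precomputation uses the same index derivation as add()/has().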

Exported JSON isn't portable

I was evaluating this library for a project I'm working on, where I need to export the bloom filter as a file, store it, then load it again from the file.

If I try using saveAsJSON(), the binary data is stored as an array of booleans, which, when saved to a file, looks like this:

[true, false, true, true, true, ...]

The bloom filter for a 25 MB dataset ended up being ~120 MB!

It would be nice to be able to save the file in a more space-efficient format.
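As an illustration of the possible savings (packBits/unpackBits are hypothetical helpers, not the library's API), the booleans can be packed eight per byte and base64-encoded:

```javascript
// Pack an array of booleans into a Uint8Array (8 bits per byte) and
// base64-encode it -- far smaller than JSON '[true,false,...]'.
function packBits (bools) {
  const bytes = new Uint8Array(Math.ceil(bools.length / 8))
  bools.forEach((bit, i) => {
    if (bit) bytes[i >> 3] |= 1 << (i & 7)
  })
  return Buffer.from(bytes).toString('base64')
}

function unpackBits (base64, length) {
  const bytes = Buffer.from(base64, 'base64')
  return Array.from({ length }, (_, i) => (bytes[i >> 3] >> (i & 7) & 1) === 1)
}

const bits = [true, false, true, true, false, false, false, true, true]
const packed = packBits(bits)
console.log(JSON.stringify(bits).length, packed.length) // 50 4
console.log(unpackBits(packed, bits.length)) // round-trips to the original booleans
```

JSON spends 5-6 characters per bit versus roughly 1.33 base64 characters per 8 bits, which accounts for the order-of-magnitude difference reported above.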

failed to load bloom-filter module after npm installation

$ npm install --save bloom-filters
$ node
Welcome to Node.js v12.12.0.
Type ".help" for more information.
> const { BloomFilter } = require('bloom-filters')
Thrown:
Error: Cannot find module '/Users/henry/tmp/bloom/node_modules/bloom-filters/bloom-filters.js'. Please verify that the package.json has a valid "main" entry
    at tryPackage (internal/modules/cjs/loader.js:294:19)
    at Function.Module._findPath (internal/modules/cjs/loader.js:525:18)
    at Function.Module._resolveFilename (internal/modules/cjs/loader.js:781:27)
    at Function.Module._load (internal/modules/cjs/loader.js:687:27)
    at Module.require (internal/modules/cjs/loader.js:849:19)
    at require (internal/modules/cjs/helpers.js:74:18) {
  code: 'MODULE_NOT_FOUND',
  path: '/Users/henry/tmp/bloom/node_modules/bloom-filters/package.json',
  requestPath: 'bloom-filters'
}

But if I use version 0.8.0, it works well.

CuckooFilter returns false negative

I'm using CuckooFilter to test the membership of the following items. The filter returns false - definitely not in the list - for 2 items. These cases are false negatives which should not happen.

Is this a bug or am I missing something?
Version: "bloom-filters": "^3.0.1"

import { CuckooFilter } from "bloom-filters";
const items = [
  "https://www.youtube.com/watch?v=HJjxN05ewEc",
  "https://www.youtube.com/watch?v=BZNUo7orS3k",
  "https://www.youtube.com/watch?v=SD-McWZz_pk",
  "https://www.youtube.com/watch?v=De4QjH9fpgo",
  "https://www.youtube.com/watch?v=Hzko-cjHhTg",
  "https://www.youtube.com/watch?v=vqR-8lgOmBE",
  "https://www.youtube.com/watch?v=j6u0LH67YLk",
  "https://www.youtube.com/watch?v=B2z8ikGLRh8",
  "https://www.youtube.com/watch?v=N3ftBeP16TA",
  "https://www.youtube.com/watch?v=38RBRPwODUk",
  "https://www.youtube.com/watch?v=Ry8nSUfX6fY",
  "https://www.youtube.com/watch?v=-KrYohUJvYw",
  "https://www.youtube.com/watch?v=zRpl7Pr0fs4",
  "https://www.youtube.com/watch?v=uYYiypp6WaY",
  "https://www.youtube.com/watch?v=EPap21FBGbE",
  "https://www.youtube.com/watch?v=zN2_0WC7UfU",
  "https://www.youtube.com/watch?v=DNVwnkgTzbY",
];

const errorRate = 0.04; // 4 % error rate
const filter = CuckooFilter.from(items, errorRate);
console.log(filter.has("hello")); // return false (ok)
console.log(filter.has("https://www.youtube.com/watch?v=HJjxN05ewEc")); // return true (ok)
console.log(filter.has("https://www.youtube.com/watch?v=38RBRPwODUk")); // return false (false negative)
console.log(filter.has("https://www.youtube.com/watch?v=-KrYohUJvYw")); // return false (false negative)

Compact serialized form for BloomFilter

The serialized form of BloomFilter is often larger than the original data. I haven't tested in-memory usage, but since it uses an array of 64-bit 1's and 0's, the same is likely true. This implies there is no advantage to a BloomFilter over a JavaScript Set when hashing short strings.

For example, an 8.2 MB file (3.6 MB gzip-compressed) of 1 million leaked passwords serializes to 27 MB (2.3 MB compressed). This is with a false-positive tolerance of 0.1%.

I have written a fix for this which replaces the array with a BitSet backed by a Uint8Array and serializes the array with base64 encoding. This shrinks the 27 MB filter to 2.3 MB (1.3 MB compressed).

I'm prepared to make a pull request for this fix, if you give me permission. Otherwise I can send you the changes a different way.

import / export BloomFilter not working as expected

Hi, these functionalities seem to have issues:

static create (items = 1000, errorRate = 0.001, seed = utils.getDefaultSeed()) {
  const size = fm.optimalFilterSize(items, errorRate)
  const hashes = fm.optimalHashes(size, items)
  const filter = new BloomFilter(size, hashes)
  filter.seed = seed
  return filter
}

  • BloomFilter.create should be the correct way to construct a filter, since it calculates sizing internally, but it forgets to save its _errorRate for later serialization

const BloomFilterSpecs = {
  export: cloneObject('BloomFilter', '_errorRate', '_size', '_length', '_nbHashes', '_filter', '_seed'),
  import: (FilterConstructor, json) => {
    if ((json.type !== 'BloomFilter') || !assertFields(json, '_errorRate', '_size', '_length', '_nbHashes', '_filter', '_seed')) {
      throw new Error('Cannot create a BloomFilter from a JSON export which does not represent a bloom filter')
    }
    const filter = new FilterConstructor(json._capacity, json._errorRate)
    filter.seed = json._seed
    filter._size = json._size
    filter._nbHashes = json._nbHashes
    filter._filter = json._filter.slice(0)
    filter._length = json._length
    return filter
  }
}

  • Because _errorRate has not been saved, it is left undefined during export and then omitted by JSON.stringify, which later causes assertFields to throw an exception during import.
  • _capacity is neither set on the instance nor serialized, yet it is passed to FilterConstructor as the size.

class BloomFilter extends Exportable {
  /**
   * Constructor
   * @param {int} size - the number of cells
   * @param {number} nbHashes - the number of hash functions used
   */
  constructor (size = 1000, nbHashes = 4) {

Question: v8 release?

I would like to use a cuckoo filter, and saw some work on improving it (ergonomics, and looks like speed?) in the master branch that hasn't been released. When do you think it will be in shape to go out?
