vigna / sux

Succinct data structures in C/C++

License: Other

C++ 97.82% Makefile 1.51% Python 0.67%
fenwick-trees ranking selection succinct-data-structure minimal-perfect-hash elias-fano cplusplus

sux's Introduction

Sux 1.0.3

Welcome to the C++ part of the Sux project.

Available classes

The classes we provide fall into three categories:

  • static structures for ranking and selection (including Elias-Fano representations);
  • compact Fenwick trees and the dynamic ranking and selection structures built on them;
  • minimal perfect hash functions (RecSplit).

All classes are heavily asserted. For testing speed, remember to use -DNDEBUG.

Documentation can be generated by running doxygen.

All provided classes are templates, so you just have to copy the files in the sux directory to somewhere in your include path.
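
For instance, here is a minimal sketch (not taken from the distribution): the header path, hash128_t and the constructor are as in the issue reports below, while querying with a hash128_t is assumed to mirror the 128-bit benchmark binaries.

#include <cstdint>
#include <iostream>
#include <vector>

#include <sux/function/RecSplit.hpp>

using namespace sux::function;

int main() {
	// Synthetic 128-bit keys; in practice these would be hashes of real keys.
	std::vector<hash128_t> keys;
	for (uint64_t i = 0; i < 1000; i++) keys.push_back(hash128_t(0, i));

	// Leaf size 8, bucket size 100 (as in the examples below).
	RecSplit<8> rs(keys, 100);

	// Each key maps to a distinct value in [0, keys.size()).
	std::cout << rs(keys[42]) << std::endl;
	return 0;
}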

Benchmarks

The commands make ranksel, make recsplit, make fenwick and make dynranksel will generate binaries in bin with which you can test the speed of rank/select static structures, RecSplit, compact Fenwick trees and dynamic rank/select structures. Note that you can set the make variable LEAF to change the leaf size of RecSplit, as in make recsplit LEAF=4, and the variable ALLOC_TYPE to one of the possible values of sux::util::AllocType to experiment, for example, with transparent huge pages on Linux.

For ranking and selection, we generate one binary for each type of structure, with some variation in parameters (see the makefile for more details). Besides the number of bits, you can provide one or two probabilities. Bits will be set to one with the first probability in the first half of the test array, and with the second probability in the second half (if no second probability is specified, it is assumed to be equal to the first). This setup is necessary to show the slow behaviour of naive implementations on half-almost-empty-half-almost-full arrays.
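
For illustration only (the benchmark binaries already do this internally), the setup just described amounts to something like the following sketch:

#include <cstdint>
#include <random>
#include <vector>

// Bits in the first half of the test array are set with probability p1,
// bits in the second half with probability p2.
std::vector<uint64_t> make_test_array(size_t num_bits, double p1, double p2) {
	std::vector<uint64_t> bits((num_bits + 63) / 64);
	std::mt19937_64 rng(42);
	std::bernoulli_distribution first(p1), second(p2);
	for (size_t i = 0; i < num_bits; i++)
		if (i < num_bits / 2 ? first(rng) : second(rng)) bits[i / 64] |= UINT64_C(1) << (i % 64);
	return bits;
}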

For RecSplit, we provide dump/load binaries which dump a minimal perfect hash function to disk and test it. The standard version reads the keys from a keys file, whereas the "128" version uses 128-bit random keys. We suggest the latter for benchmarking, as in any case the first step of RecSplit construction is mapping to 128-bit hashes.
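
In code, a dump/load round trip looks roughly as follows. The extraction operator appears in the modified recsplit_load.cpp reproduced below; the matching insertion operator used here for dumping is an assumption.

#include <fstream>

#include <sux/function/RecSplit.hpp>

using namespace sux::function;

void dump(RecSplit<8> &rs, const char *path) {
	std::ofstream out(path, std::ios::binary);
	out << rs; // assumed counterpart of the operator>> used below
}

void load(RecSplit<8> &rs, const char *path) {
	std::ifstream in(path, std::ios::binary);
	in >> rs; // as in recsplit_load.cpp
}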

Licensing

Sux is licensed exactly like libstdc++ (GPLv3 + GCC Runtime Library Exception), which essentially means you can use it everywhere, exactly like libstdc++.

seba ([email protected])

sux's People

Contributors

beling, bytehamster, dominikbez, herlez, micheleandreata, pierlauro, smarchini, vigna

sux's Issues

Segmentation fault in RecSplit

Hello,

I have the following code:

#include "sux/function/RecSplit.hpp"

using namespace sux::function;

int main() {
	vector<hash128_t> keys;
	for (int i = 0; i < 10000; i++) {
		keys.push_back(hash128_t(0, i));
	}
	RecSplit<8> RS(keys, 100);
}

Compiled with g++ -std=c++17, this crashes with a segmentation fault. If the loop goes up to only 1000, it works. Why is that? Thanks.

Improve description

I don't understand exactly what you mean by "Succinct data structures in C/C++", and neither will most people.
A short description of the project would help people find it and learn more about it.

RecSplit MPHF mapping

I am trying to get the mapping from keys to unique indices out of a RecSplit MPHF. I created a file with 4 strings and passed it to recsplit_dump_8, creating an MPHF. I modified recsplit_load.cpp (shown below) to display the mapping. However, the mapping is not a bijection. I also tried with a million keys and could not get an MPHF.

I've only been working with the tool for a few hours today, but I can't see how I'm using the interface incorrectly. Any help would be appreciated.

Here is the output of my run:

$ git status
On branch master
Your branch is up-to-date with 'origin/master'.
Changes not staged for commit:
  (use "git add <file>..." to update what will be committed)
  (use "git checkout -- <file>..." to discard changes in working directory)

	modified:   benchmark/function/recsplit_load.cpp

Untracked files:
  (use "git add <file>..." to include in what will be committed)

	test.keys
	test.mphf

no changes added to commit (use "git add" and/or "git commit -a")
$ make recsplit
g++ -std=c++17 -I./ -O3 -DSTATS -march=native -DLEAF=8 -DALLOC_TYPE=MALLOC benchmark/function/recsplit_dump.cpp -o bin/recsplit_dump_8
g++ -std=c++17 -I./ -O3 -DSTATS -march=native -DLEAF=8 -DALLOC_TYPE=MALLOC benchmark/function/recsplit_dump128.cpp -o bin/recsplit_dump128_8
g++ -std=c++17 -I./ -O3 -DSTATS -march=native -DLEAF=8 -DALLOC_TYPE=MALLOC benchmark/function/recsplit_load.cpp -o bin/recsplit_load_8
g++ -std=c++17 -I./ -O3 -DSTATS -march=native -DLEAF=8 -DALLOC_TYPE=MALLOC benchmark/function/recsplit_load128.cpp -o bin/recsplit_load128_8
$ cat test.keys
fish
cat
dog
horse
$ bin/recsplit_dump_8 test.keys 2 test.mphf
Building...
Elias-Fano cumul sizes:  3.000000 bits/bucket
Elias-Fano cumul bits:   5.000000 bits/bucket
Elias-Fano cumul sizes:  1.500000 bits/key
Elias-Fano cumul bits:   2.500000 bits/key
Rice-Golomb descriptors: 1.500000 bits/key
Total bits:              5.500000 bits/key
Construction time: 0.000 s, 15366 ns/key
$ bin/recsplit_load_8 test.keys test.mphf
fish -> 1
cat -> 1
1 Duplicate!!!
dog -> 1
1 Duplicate!!!
horse -> 2
0 Missing!!!
3 Missing!!!

Here is my modification to recsplit_load.cpp to print the mapping.

$ cat benchmark/function/recsplit_load.cpp 
#include <algorithm>
#include <chrono>
#include <cstdio>
#include <fstream>
#include <iostream>
#include <random>
#include <sux/function/RecSplit.hpp>

#define SAMPLES (11)

using namespace std;
using namespace sux::function;

int main(int argc, char **argv) {
	if (argc < 3) {
		fprintf(stderr, "Usage: %s <keys> <mphf>\n", argv[0]);
		return 1;
	}

	ifstream fin(argv[1]);
	string str;
	vector<string> keys;
	while (getline(fin, str)) keys.push_back(str);
	fin.close();

	fstream fs;
	RecSplit<LEAF, ALLOC_TYPE> rs;

	fs.exceptions(fstream::failbit | fstream::badbit);
	fs.open(argv[2], std::fstream::in | std::fstream::binary);
	fs >> rs;
	fs.close();

	uint8_t *seen = (uint8_t *)calloc(keys.size(), sizeof(uint8_t));
	
	for (size_t k = 0; k < keys.size(); k++) {
	  uint64_t h = rs(keys[k]);
	  printf("%s -> %lu\n", keys[k].c_str(), h);
	  if(seen[h] == 1) {
	    printf("%lu Duplicate!!!\n", h);
	  } else {
	    seen[h] = 1;
	  }
	}

	for (size_t k = 0; k < keys.size(); k++) {
	  if(seen[k] == 0) {
	    printf("%zu Missing!!!\n", k);
	  }
	}
	
	return 0;
}

Lookup semantics

After playing around with RecSplit a bit, I came to realize that the lookup semantics (the operator()) is different from that of BBHash. That is, in RecSplit a key that is not in the map may return any value. In BBHash, a key that is not in the map will return a zero value, but may also return any value (a false positive). In my experience with BBHash, false positives are rare.

I'm wondering if RecSplit could be adjusted to similar semantics, that is, to return

  • 0 if it is known that the key is not in the map.
  • index values: 1, 2, ..., n corresponding to the n keys.
  • false positives may be allowed.

To be clear, I'm not asking that you change your implementation in this repository, only whether or not it is an inherent limitation with RecSplit's underlying data structures. I've looked at the code and the paper, but on first glance it was not trivial to understand.
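
For what it's worth, the usual workaround (not a feature of the library) is to store something per slot, either the key itself or a short fingerprint, and verify after the lookup. Below is a sketch, assuming a string-vector constructor matching the one used by the dump benchmark; the string-keyed operator() appears in recsplit_load.cpp above.

#include <optional>
#include <string>
#include <vector>

#include <sux/function/RecSplit.hpp>

// Rejects keys outside the original set: keys not in the set fail the final
// comparison instead of silently mapping to an arbitrary slot.
class VerifiedMPHF {
	sux::function::RecSplit<8> rs;
	std::vector<std::string> stored; // stored[rs(key)] == key for member keys

  public:
	VerifiedMPHF(const std::vector<std::string> &keys, size_t bucket_size)
		: rs(keys, bucket_size), stored(keys.size()) {
		for (const auto &k : keys) stored[rs(k)] = k;
	}

	// Index of a member key, std::nullopt for everything else.
	std::optional<size_t> lookup(const std::string &key) {
		const size_t h = rs(key);
		if (h < stored.size() && stored[h] == key) return h;
		return std::nullopt;
	}
};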

Duplicate keys in RecSplit

Is there a way to discard duplicate keys at construction time? Rooting them out before constructing a RecSplit can be expensive.
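
For reference, a sketch of the preprocessing step the question is trying to avoid (sort the keys, then drop the duplicates):

#include <algorithm>
#include <string>
#include <vector>

// Removes duplicate keys in place before handing them to RecSplit.
void deduplicate(std::vector<std::string> &keys) {
	std::sort(keys.begin(), keys.end());
	keys.erase(std::unique(keys.begin(), keys.end()), keys.end());
}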

Licensing

Hi: Thank you for this!
There is one thing, though: in C++, your templates are always statically linked with the code that uses them, and therefore my understanding is that the LGPL requirements would extend to the calling code (in contrast with Java, where linking works differently).
Have you considered using the GPL with a runtime exception, as is done for libstdc++, rather than an LGPL license?
See https://gcc.gnu.org/onlinedocs/libstdc++/faq.html#faq.license.what and https://gcc.gnu.org/onlinedocs/libstdc++/manual/license.html

Thank you for your kind consideration!

Making RecSplit MPHF thread safe

I want to call operator() of RecSplit from multiple threads, but I think the function is not currently thread safe as it modifies some internal data structure (e.g., descriptors). Is it possible to make it thread safe?
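
Short of changes to the library, a stopgap sketch is to serialize lookups behind a mutex (RecSplit<8> and the string-keyed operator() as in the benchmark code above), at the obvious cost of contention:

#include <cstdint>
#include <mutex>
#include <string>

#include <sux/function/RecSplit.hpp>

// Serializes all lookups through one mutex so that the state-modifying
// operator() is never called concurrently.
class LockedMPHF {
	sux::function::RecSplit<8> &rs; // wraps an existing instance
	std::mutex mutex;

  public:
	explicit LockedMPHF(sux::function::RecSplit<8> &f) : rs(f) {}

	uint64_t operator()(const std::string &key) {
		std::lock_guard<std::mutex> guard(mutex);
		return rs(key);
	}
};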

Are there any bindings for Golang?

I would like to use this library (specifically RecSplit for creating MPHFs) for some experiments in our Erigon project (if you are interested I can describe the possible use case): https://github.com/ledgerwatch/erigon
But I do need Golang bindings to do so. I assume there are no bindings yet, and I was thinking about creating a bounty for someone to develop them.

Creating select_upper and selectz_upper in EliasFano.hpp

In EliasFano.hpp, line 119, upper_bits.size has the term (num_ones + (num_bits >> l) + 1), which is converted to a number of words by adding 63 and dividing by 64. To be consistent, shouldn't lines 137 and 138 call SimpleSelectHalf and SimpleSelectZeroHalf with 1 also added to (num_ones + (num_bits >> l))? That way, in SimpleSelectHalf, line 83, and SimpleSelectZeroHalf, line 83, the value of num_words would be equal to upper_bits.size.

Recsplit core dump on large keyfiles

Here I try to build an MPHF for 2^30 keys using recsplit. I'm running on an Amazon EC2 instance of type m5.4xlarge. Such an instance has 16 cores and 64 GiB of memory.

$ wc -l test.keys 
1073741823 test.keys
$ bin/recsplit_load_8 test.keys test.mphf
terminate called after throwing an instance of 'std::bad_alloc'
  what():  std::bad_alloc
Aborted (core dumped)
