Coder Social home page Coder Social logo

ewahboolarray's Introduction

Compressed bitset in C++

What is this?

The class EWAHBoolArray is a compressed bitset data structure. It supports several word sizes by a template parameter (16-bit, 32-bit, 64-bit). You should expect the 64-bit word-size to provide better performance, but higher memory usage, while a 32-bit word-size might compress a bit better, at the expense of some performance.

The library also provides a basic BoolArray class which can serve as a traditional bitmap.

Real-world usage

EWAH is used to accelerate the distributed version control system Git.

The Java counterpart of this library (JavaEWAH) is part of Apache Hive and its derivatives (e.g., Apache Spark) and Eclipse JGit. It has been used in production systems for many years. It is part of major Linux distributions.

This library is used by Hustle -- A column oriented, embarrassingly distributed relational event database, by the The yt Project and by the fuzzing tool VUzzer.

When should you use a bitmap?

Sets are a fundamental abstraction in software. They can be implemented in various ways, as hash sets, as trees, and so forth. In databases and search engines, sets are often an integral part of indexes. For example, we may need to maintain a set of all documents or rows (represented by numerical identifier) that satisfy some property. Besides adding or removing elements from the set, we need fast functions to compute the intersection, the union, the difference between sets, and so on.

To implement a set of integers, a particularly appealing strategy is the bitmap (also called bitset or bit vector). Using n bits, we can represent any set made of the integers from the range [0,n): it suffices to set the ith bit is set to one if integer i is present in the set. Commodity processors use words of W=32 or W=64 bits. By combining many such words, we can support large values of n. Intersections, unions and differences can then be implemented as bitwise AND, OR and ANDNOT operations. More complicated set functions can also be implemented as bitwise operations.

When the bitset approach is applicable, it can be orders of magnitude faster than other possible implementation of a set (e.g., as a hash set) while using several times less memory.

When should you use compressed bitmaps?

An uncompressed BitSet can use a lot of memory. For example, if you take a BitSet and set the bit at position 1,000,000 to true and you have just over 100kB. That's over 100kB to store the position of one bit. This is wasteful even if you do not care about memory: suppose that you need to compute the intersection between this BitSet and another one that has a bit at position 1,000,001 to true, then you need to go through all these zeroes, whether you like it or not. That can become very wasteful.

This being said, there are definitively cases where attempting to use compressed bitmaps is wasteful. For example, if you have a small universe size. E.g., your bitmaps represent sets of integers from [0,n) where n is small (e.g., n=64 or n=128). If you can use an BitSet and it does not blow up your memory usage, then compressed bitmaps are probably not useful to you. In fact, if you do not need compression, then a BitSet offers remarkable speed. One of the downsides of a compressed bitmap like those provided by EWAHBoolArray is slower random access: checking whether a bit is set to true in a compressed bitmap takes longer.

How does EWAH compares with the alternatives?

EWAH is part of a larger family of compressed bitmaps that are run-length-encoded bitmaps. They identify long runs of 1s or 0s and they represent them with a marker word. If you have a local mix of 1s and 0, you use an uncompressed word.

There are many formats in this family beside EWAH:

  • Oracle's BBC is an obsolete format at this point: though it may provide good compression, it is likely much slower than more recent alternatives due to excessive branching.
  • WAH is a patented variation on BBC that provides better performance.
  • Concise is a variation on the patented WAH. It some specific instances, it can compress much better than WAH (up to 2x better), but it is generally slower.
  • EWAH is both free of patent, and it is faster than all the above. On the downside, it does not compress quite as well. It is faster because it allows some form of "skipping" over uncompressed words. So though none of these formats are great at random access, EWAH is better than the alternatives.

There are other alternatives however. For example, the Roaring format is not a run-length-encoded hybrid. It provides faster random access than even EWAH.

Licensing

Apache License 2.0.

Update (May 20th, 2013): though by default I use the Apache License 2.0 (which is compatible with GPL 3.0), you can also consider this library licensed under GPL 2.0.

Dependencies

None. (Will work under MacOS, Windows or Linux.)

Compilers tested: clang++, g++, Intel compiler, Microsoft Visual Studio

It works on x64 processors as well as on 32-bit ARM processors.

Versions 0.5 and above assume that the compiler supports the C++11 standard.

Usage (Linux and Linux-like systems)

cmake -B build 
cmake --build build
cd build
ctest

Usage (Visual Studio under Windows)

To build with at least Visual Studio 2017 directly in the IDE:

  • Grab the code from GitHub, e.g., by cloning it using GitHub Desktop.
  • Select the Visual C++ tools for CMake optional component when installing the C++ Development Workload within Visual Studio.
  • Within Visual Studio use File > Open > Folder... to open the CRoaring folder.
  • Right click on CMakeLists.txt in the parent directory within Solution Explorer and select Build to build the project.
  • For testing, in the Standard toolbar, drop the Select Startup Item... menu and choose one of the tests. Run the test by pressing the button to the left of the dropdown.

Quick code sample

  #include "ewah.h"
  using namespace ewah;

  typedef EWAHBoolArray<uint32_t> bitmap;

  bitmap bitset1 =
      bitmap::bitmapOf(9, 1, 2, 1000, 1001, 1002, 1003, 1007, 1009, 100000);
  std::cout << "first bitset : " << bitset1 << std::endl;
  bitmap bitset2 = bitmap::bitmapOf(5, 1, 3, 1000, 1007, 100000);
  std::cout << "second bitset : " << bitset2 << std::endl;
  bitmap bitset3 = bitmap::bitmapOf(3, 10, 11, 12);
  std::cout << "third  bitset : " << bitset3 << std::endl;
  bitmap orbitset = bitset1 | bitset2;
  bitmap andbitset = bitset1 & bitset2;
  bitmap xorbitset = bitset1 ^ bitset2;
  bitmap andnotbitset = bitset1 - bitset2;

Example

Please see examples/example.cpp. For an example with tabular data, please see example2.cpp.

Further reading

Please see

  • Daniel Lemire, Owen Kaser, Nathan Kurz, Luca Deri, Chris O'Hara, François Saint-Jacques, Gregory Ssi-Yan-Kai, Roaring Bitmaps: Implementation of an Optimized Software Library, Software: Practice and Experience 48 (4), 2018 arXiv:1709.07821
  • Daniel Lemire, Owen Kaser, Kamel Aouiche, Sorting improves word-aligned bitmap indexes. Data & Knowledge Engineering 69 (1), pages 3-28, 2010. http://arxiv.org/abs/0901.3751
  • Owen Kaser and Daniel Lemire, Compressed bitmap indexes: beyond unions and intersections, Software: Practice and Experience 46 (2), 2016. http://arxiv.org/abs/1402.4466

Node/JavaScript wrapper

Dimitrios Vasilas wrote a wrapper for JavaScript.

You can install it by typing:

    npm install -g node-gyp
    npm install node-bitmap-ewah

Ruby wrapper

Josh Ferguson wrote a wrapper for Ruby. The implementation is packaged and installable as a ruby gem.

You can install it by typing:

    gem install ewah-bitset

Persistent storage

We do not correct for the endianess. If you use both little endian and big endian machines, you should be careful. Thankfully, big endian hardware is vanishingly rare.

ewahboolarray's People

Contributors

0cjs avatar lemire avatar mattvilim avatar sivaraam avatar zarianw avatar

Stargazers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

Watchers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

ewahboolarray's Issues

Twice called .set(N) method results in assertion error

If twice called i.e. in example.cc

bitset2.set(1000);
bitset2.set(1000);

results in

Assertion failed: (i >= sizeinbits), function set, file headers/ewah.h, line 667.
Abort trap: 6

which is pretty frustrating.

Although .set() declaration mentions that bits must be in ever-increasing order, are there any chances for setting bit N twice to be a legitimate operation?

What is the time complexity of bitset1 & bitset2 ?

I want to know the time complexity of operator bitset1 & bitset2.
I know the complexity of std::set_intersection is up to linear in 2*(count1+count2)-1 (where countX is the distance between firstX and lastX): Compares and assigns elements.
The EWAHBoolArray can faster than std::set_intersection ?
Thank.

Move constructor/ move assignment operator

Can you add move constructor / move assignment operator so that EWAHBoolArray works better inside std::vector, etc.?

Please let me know if you want me to create a PR with the change.

Encountered Error "C4146: unary minus operator applied to unsigned type, result still unsigned"

Hi - I am attempting to compile some examples using this library in Visual Studio 2015 (compiling for x64). While all your example work perfectly, I am having trouble using any of the to* methods (i.e., toArray(), toVector() etc.) as the compile throws an error:

"unary minus operator applied to unsigned type, result still unsigned"

For now, I am able to retrieve the row offsets using the iterator (code below) you provided in your example but I also saw your comment inside the source that this might be inefficient and that I should use toArray when possible.

for (EWAHBoolArray<uint32_t>::const_iterator j = bitmap.begin(); j != bitmap.end(); ++j) {
        std::cout << "term '" << key << "' appears at row index " << *j << std::endl;
}

Do you have any suggestions on how I can resolve this problem?

The exact line of code that is leading to this error is the following:

template <class uword>
std::vector<size_t> EWAHBoolArray<uword>::toArray() const {
      ...
      ...
      while (myword != 0) {
        uint64_t t = myword & -myword; // <--------------------- Compiler complaining here
        uint32_t r = numberOfTrailingZeros(t);
        ans.push_back(pos + r);
        myword ^= t;
      }
      ... 
      ...
  return ans;
}

Thank you!

Build fails in 32 bit mode on MSVC

The 32 bit build fails with the following errors:

Error 1 error C3861: '_BitScanForward64': identifier not found e:\code\forks\ewahboolarray\headers\ewahutil.h 49 1 EWAHBoolArrayTest
Error 2 error C3861: '__popcnt64': identifier not found e:\code\forks\ewahboolarray\headers\ewahutil.h 197 1 EWAHBoolArrayTest

The code needs to use _BitScanForward and __popcnt in 32 bit build. I tried changing that by adding the compile time switch (using #if _WIN64), the build was fine but then a whole bunch of unit tests were failing in 32 bit build.

The failing tests were originating from
if (!testNanJiang<uint64_t>())

I don't know if these can be ignored for the 32 bit build because they are testing uint64_t.

Unit tests fail with Visual Studio 2012

Here is the output of unit.cpp according to Kelly Sommers:

[testing EWAH set/get]
[testing EWAH set/get]
Failed test set/get
Failed test set/get
Failed test set/get
Failed test set/get
[testing EWAH set/get]
Failed test set/get
Failed test set/get
Failed test set/get
Failed test set/get
Failed test set/get
Failed test set/get
Failed test set/get
Failed test set/get
Failed test set/get
Failed test set/get
Failed test set/get
[testing PhongTran]
[testing STL compatibility]
[testing STL compatibility]
[testing STL compatibility]
[testing Hemeury]
[testing Hemeury]
[testing Hemeury]
[testing PhongTran2]
[testing PhongTran2]
[testing PhongTran2]
[testing RunningLengthWord]
[testing RunningLengthWord]
[testing RunningLengthWord]
[testing EWAHBoolArray]
[testing EWAHBoolArray]
[testing EWAHBoolArray]
[testing EWAHBoolArrayLogical]
[testing EWAHBoolArrayLogical]
[testing EWAHBoolArrayLogical]
[testing EWAHBoolArrayLogical2]
[testing EWAHBoolArrayLogical2]
[testing EWAHBoolArrayLogical2]
[testing EWAHBoolArrayAppend]
[testing EWAHBoolArrayAppend]
[testing EWAHBoolArrayAppend]
[testing JoergBukowski]
[testing JoergBukowski]
failed to recover right number -- iterator
failed to recover -- iterator
failed to recover right number -- iterator
failed to recover -- iterator
failed to recover right number -- iterator
failed to recover -- iterator
failed to recover right number -- iterator
failed to recover -- iterator
failed to recover right number -- iterator
failed to recover -- iterator
failed to recover right number -- iterator
failed to recover -- iterator
failed to recover right number -- iterator
failed to recover -- iterator
failed to recover right number -- iterator
failed to recover -- iterator
failed to recover right number -- iterator
failed to recover -- iterator
failed to recover right number -- iterator
failed to recover -- iterator
failed to recover right number -- iterator

failed to recover -- iterator

test failed.

[testing JoergBukowski]
failed to recover right number -- iterator
failed to recover -- iterator
failed to recover right number -- iterator
failed to recover -- iterator
failed to recover right number -- iterator
failed to recover -- iterator
failed to recover right number -- iterator
failed to recover -- iterator
failed to recover right number -- iterator
failed to recover -- iterator
failed to recover right number -- iterator
failed to recover -- iterator
failed to recover right number -- iterator
failed to recover -- iterator
failed to recover right number -- iterator
failed to recover -- iterator
failed to recover right number -- iterator
failed to recover -- iterator
failed to recover right number -- iterator
failed to recover -- iterator
failed to recover right number -- iterator
failed to recover -- iterator
failed to recover right number -- iterator

failed to recover -- iterator

test failed.

number of bytes in ostream::pos_type = 24
number of bytes in size_t = 8
number of bytes in int = 4
number of bytes in long = 4
Non-GCC compiler.
Got 4 failed tests!

Bitmap Sizing/Compression

We ran a test with 1000 and 10000 sequential long integers using 64-bit word size, and the array was 3.5MB in size for both. Should the RLE be able to compress this? Since the raw data is only 8K and 80K respectively I would have expected better compression.

In another test we used "roughly sequential" long integers (snowflake IDs), again with 1K and 10K values. This went to 188MB for both sets. Again the raw data is 8K and 80K in size.

Are there options to achieve better compression?

It may be that we should only use EWAHBoolArray for relatively sparse arrays, and when we have a lot of values like this we can try another structure, but we really like the capability of fast binary comparisons.

Any feedback appreciated.

sizeInBits() values are different for same bitmap.

I created a bitmap in two different ways but with the same bits set. However, the result bitmap behaves differently. Is this expected?

{    // Create a bitmap with (1,2,3,4) using logicalors.
    auto ewah1 = EWAHBoolArray<>::bitmapOf(1, 1);
    auto ewah2 = EWAHBoolArray<>::bitmapOf(1, 2);
    auto ewah3 = EWAHBoolArray<>::bitmapOf(1, 3);
    auto ewah4 = EWAHBoolArray<>::bitmapOf(1, 4);
    auto result = ewah1 | ewah2 | ewah3 | ewah4;
    std::cout << result.set(5) << std::endl; // false b/c sizeInBits() == 32.
}
{    // Create a bitmap with (1, 2, 3, 4) in a naive way.
    auto result = EWAHBoolArray<>::bitmapOf(4, 1, 2, 3, 4);
    std::cout << result.set(5) << std::endl; // true b/c sizeInBits() == 5.
}

Basic equlaity and inequality test is failing on 0.5.6 release

The below one is giving me wrong result.

template bool testInEqualityEWAHBoolArray() {
cout << "[testing testInEqualityEWAHBoolArray] sizeof(uword)=" << sizeof(uword) << endl;
EWAHBoolArray b1;
EWAHBoolArray b = EWAHBoolArray::bitmapOf(3, 1, 10, 11);

return !(b1 == b);
}
I am expecting this should return true, where as this returns false.

Segmentation fault on declaring an array of EWAHBoolArray<uint32_t>

Hi,
The following piece of code gives segmentation fault:
EWAHBoolArray<uint32_t> gpr[10][4];
gpr[0][0].set(1);
cout << gpr[0][0];

The following is the stack trace:
Program received signal SIGSEGV, Segmentation fault.
0x0804b29c in RunningLengthWord<unsigned int>::getNumberOfLiteralWords (this=0xbfffebb4) at /usr/include/runninglengthword.h:65
65 return static_cast<uword> (mydata >> (1 + runninglengthbits));
(gdb) bt
#0 0x0804b29c in RunningLengthWord<unsigned int>::getNumberOfLiteralWords (this=0xbfffebb4) at /usr/include/runninglengthword.h:65
#1 0x0804b1bd in EWAHBoolArray<unsigned int>::addLiteralWord (this=0x8055298, newdata=2) at /usr/include/ewah.h:1043
#2 0x08049c37 in EWAHBoolArray<unsigned int>::set (this=0x8055298, i=1) at /usr/include/ewah.h:816
#3 0x080490c0 in demo<BoolArray<unsigned int> > () at example.cpp:31
#4 0x08048ea1 in main () at example.cpp:94
Can you tell what i am doing wrong.
Thanks

Question about double indexing

Thank you for contributing this amazing library!

I am sorry for posting a question as an issue.

My question is, can you index doubles using a compressed bitmap?

In my understanding, I would need some kind of binning: If I have an attribute X ranging from [0-1] and 10 bins, then I would need 10 bitmaps, one for each range of values: i.e. 1 for [0-0.1], 1 for [0.1-0.2], etc. But then how do you do range queries on X?

Have you got test code for this case?

Space optimization for sparse bitmaps

Hi,
If EWAH bitmap has only a single bit set, it still consumes around a minimum of 20 bytes. In cases like these would it make sense to store the bit position as 4 byte integer and handle the bitmap creation and processing at runtime?
I am looking at options to reduce storage space as majority of our bitmaps and really sparse, at the expense of a minimal runtime overhead.
Are there other bitmap implementations that support this optimization.

intersects method returns incorrect value

Hi Daniel,
It seems from my tests (attached in the email I sent you) that the intersects method may return an incorrect value. I am just noting this here as an issue in case someone else runs into it.

Thank you for your help with this.
Mazen

numberOfOnes() returns an inaccurate number on a logicalnot() object

When using an uint64_t based bitmap, the numberOfOnes() works fine on a regular bitmap, but after inverting it by logicalnot(), numberOfOnes() returns a bigger number.

>> b = bitmap()
>> b.set(5)
>> b.numberOfOnes()
1
>>b.logicalnot()
>>b.numberOfOnes()
63

Seems like it counts the trailing bits(bits after the sizeinbits) as well. Did I use it improperly, or is it designed to behave in this way?

Thanks.

read()/write() with raw buffer

Hi,
Can you expose read()/write() with raw pointers in addition to istream/ostream?

I am using boost::iostreams::streamboost::iostreams::array_source and std::stringstream but the overhead is pretty big (2x - 10x) compared to reading/writing with raw pointers.

Thanks!

&= and |= operators

Hi
I want to &= and |= a bitset with some bitsets. In order to do that, I have to make the bitset equal to the logical and (logical or) result; for example:
original_bitset = original_bitset.logicaland(other_bitset);
Does this API give us another efficient way to get this result, without having the cost of assignment operator?

round up the size in bits to a multiple of the word size

In some cases, after some operations between bitmaps having a different size in bits, you may get bitmaps that are in an unsafe state for mutation.

The essential problem comes from a mismatch between the EWAH data and the 'size in bits' parameter. The Java library has workarounds for this problem.

A good fix would be to round up the size in bits to a multiple of the word size.

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.