bodigrim / bitvec Goto Github PK

View Code? Open in Web Editor NEW

73.0 8.0 7.0 401 KB

Bit vectors: 8x less memory, up to 3500x faster than Vector Bool

Home Page: https://hackage.haskell.org/package/bitvec

License: BSD 3-Clause "New" or "Revised" License

Haskell 90.69% C 9.31%

bit-vectors vectors bitmask bitmap bit-array libgmp

bitvec's Introduction

bitvec

A newtype over Bool with a better Vector instance: 8x less memory, up to 3500x faster.

The vector package represents unboxed arrays of Bools spending 1 byte (8 bits) per boolean. This library provides a newtype wrapper Bit and a custom instance of an unboxed Vector, which packs bits densely, achieving an 8x smaller memory footprint. The performance stays mostly the same; the most significant degradation happens for random writes (up to 10% slower). On the other hand, for certain bulk bit operations Vector Bit is up to 3500x faster than Vector Bool.

Thread safety

Data.Bit is faster, but writes and flips are not thread-safe. This is because naive updates are not atomic: they read the whole word from memory, then modify a bit, then write the whole word back. Concurrently modifying non-intersecting slices of the same underlying array may also lead to unexpected results, since they can share a word in memory.
Data.Bit.ThreadSafe is slower (usually 10-20%), but writes and flips are thread-safe. Additionally, concurrently modifying non-intersecting slices of the same underlying array works as expected. However, operations that affect multiple elements are not guaranteed to be atomic.

Quick start

Consider the following (very naive) implementation of the sieve of Eratosthenes. It returns a vector with True at prime indices and False at composite indices.

import Control.Monad
import Control.Monad.ST
import qualified Data.Vector.Unboxed as U
import qualified Data.Vector.Unboxed.Mutable as MU

eratosthenes :: U.Vector Bool
eratosthenes = runST $ do
  let len = 100
  sieve <- MU.replicate len True
  MU.write sieve 0 False
  MU.write sieve 1 False
  forM_ [2 .. floor (sqrt (fromIntegral len))] $ \p -> do
    isPrime <- MU.read sieve p
    when isPrime $
      forM_ [2 * p, 3 * p .. len - 1] $ \i ->
        MU.write sieve i False
  U.unsafeFreeze sieve

We can switch from Bool to Bit just by adding newtype constructors:

import Data.Bit

import Control.Monad
import Control.Monad.ST
import qualified Data.Vector.Unboxed as U
import qualified Data.Vector.Unboxed.Mutable as MU

eratosthenes :: U.Vector Bit
eratosthenes = runST $ do
  let len = 100
  sieve <- MU.replicate len (Bit True)
  MU.write sieve 0 (Bit False)
  MU.write sieve 1 (Bit False)
  forM_ [2 .. floor (sqrt (fromIntegral len))] $ \p -> do
    Bit isPrime <- MU.read sieve p
    when isPrime $
      forM_ [2 * p, 3 * p .. len - 1] $ \i ->
        MU.write sieve i (Bit False)
  U.unsafeFreeze sieve

The Bit-based implementation requires 8x less memory to store the vector. For large sizes it allows to crunch more data in RAM without swapping. For smaller arrays it helps to fit into CPU caches.

> listBits eratosthenes
[2,3,5,7,11,13,17,19,23,29,31,37,41,43,47,53,59,61,67,71,73,79,83,89,97]

There are several high-level helpers, digesting bits in bulk, which makes them up to 64x faster than the respective counterparts for Vector Bool. One can query the population count (popcount) of a vector (giving us the prime-counting function):

> countBits eratosthenes
25

And vice versa, query an address of the n-th set bit (which corresponds to the n-th prime number here):

> nthBitIndex (Bit True) 10 eratosthenes
Just 29

One may notice that the order of the inner traversal by i does not matter and get tempted to run it in several parallel threads. In this case it is vital to switch from Data.Bit to Data.Bit.ThreadSafe, because the former is not thread-safe with regards to writes. There is a moderate performance penalty (usually 10-20%) for using the thread-safe interface.

Sets

Bit vectors can be used as a blazingly fast representation of sets, as long as their elements are Enumeratable and sufficiently dense, leaving IntSet far behind.

For example, consider three possible representations of a set of Word16:

As an IntSet with a readily available union function.
As a 64k-long unboxed Vector Bool, implementing union as zipWith (||).
As a 64k-long unboxed Vector Bit, implementing union as zipBits (.|.).

When the simd flag is enabled, according to our benchmarks (see bench folder), the union of Vector Bit evaluates magnitudes faster than the union of not-too-sparse IntSets and stunningly outperforms Vector Bool. Here are benchmarks on MacBook M2:

union
  16384
    Vector Bit:
      61.2 ns ± 3.2 ns
    Vector Bool:
      96.1 μs ± 4.5 μs, 1570.84x
    IntSet:
      2.15 μs ± 211 ns, 35.06x
  32768
    Vector Bit:
      143  ns ± 7.4 ns
    Vector Bool:
      225  μs ±  16 μs, 1578.60x
    IntSet:
      4.34 μs ± 429 ns, 30.39x
  65536
    Vector Bit:
      249  ns ±  18 ns
    Vector Bool:
      483  μs ±  28 μs, 1936.42x
    IntSet:
      8.77 μs ± 835 ns, 35.18x
  131072
    Vector Bit:
      322  ns ±  30 ns
    Vector Bool:
      988  μs ±  53 μs, 3071.83x
    IntSet:
      17.6 μs ± 1.6 μs, 54.79x
  262144
    Vector Bit:
      563  ns ±  27 ns
    Vector Bool:
      2.00 ms ± 112 μs, 3555.36x
    IntSet:
      36.8 μs ± 3.3 μs, 65.40x

Binary polynomials

Binary polynomials are polynomials with coefficients modulo 2. Their applications include coding theory and cryptography. While one can successfully implement them with the poly package, operating on UPoly Bit, this package provides even faster arithmetic routines exposed via the F2Poly data type and its instances.

> :set -XBinaryLiterals
> -- (1 + x) * (1 + x + x^2) = 1 + x^3 (mod 2)
> 0b11 * 0b111 :: F2Poly
F2Poly {unF2Poly = [1,0,0,1]}

Use fromInteger / toInteger to convert binary polynomials from Integer to F2Poly and back.

Package flags

Flag simd, enabled by default.

Use a C SIMD implementation for the ultimate performance of zipBits, invertBits and countBits.

Similar packages

bv and bv-little do not offer mutable vectors.
array is memory-efficient for Bool, but lacks a handy Vector interface and is not thread-safe.

Additional resources

Bit vectors without compromises, Haskell Love, 31.07.2020: slides, video.

bitvec's People

Contributors

Stargazers

Watchers

Forkers

sdiehl minoki parsonsmatt konsumlamm vitalbio treeowl sysfce2

bitvec's Issues

More SIMD implementations

Haddocks for F2Poly

Export BitVec and BitMVec constructors

Check whether mutable functions with unaligned offsets are thread-safe

It boils down to writeWord, basicSet, basicUnsafeCopy, zipInPlace.

WASM build failure

This probably isn't surprising, given #64 (comment):

Hmm, wouldn't that break with the JS or WASM backend (or in the rare case that the default C compiler doesn't support OpenMP)?

I would not worry about them, they are too boutique. Until there is a CI job to witness the contrary, we can effectively assume that they are broken anyways.

$ wasm32-wasi-cabal build
Resolving dependencies...
Build profile: -w ghc-9.6.3.20230927 -O1
In order, the following will be built (use -v for more details):
 - bitvec-1.1.5.0 (lib) (first run)
./bitvec.cabal has been changed. Re-configuring with most recently used
options. If this fails, please run configure manually.
Configuring library for bitvec-1.1.5.0..
Preprocessing library for bitvec-1.1.5.0..
Building library for bitvec-1.1.5.0..
[ 9 of 12] Compiling Data.Bit.F2PolyTS ( src/Data/Bit/F2PolyTS.hs, /home/gthomas/code/bitvec/dist-newstyle/build/wasm32-wasi/ghc-9.6.3.20230927/bitvec-1.1.5.0/build/Data/Bit/F2PolyTS.o )

<no location info>: error:
    panic! (the 'impossible' happened)
  GHC version 9.6.3.20230927:
        destination label not in evaluation context
  CallStack (from HasCallStack):
    panic, called at compiler/GHC/Wasm/ControlFlow/FromCmm.hs:303:17 in ghc:GHC.Wasm.ControlFlow.FromCmm


Please report this as a GHC bug:  https://www.haskell.org/ghc/reportabug

[11 of 12] Compiling Data.Bit.F2Poly  ( src/Data/Bit/F2Poly.hs, /home/gthomas/code/bitvec/dist-newstyle/build/wasm32-wasi/ghc-9.6.3.20230927/bitvec-1.1.5.0/build/Data/Bit/F2Poly.o )

<no location info>: error:
    panic! (the 'impossible' happened)
  GHC version 9.6.3.20230927:
        destination label not in evaluation context
  CallStack (from HasCallStack):
    panic, called at compiler/GHC/Wasm/ControlFlow/FromCmm.hs:303:17 in ghc:GHC.Wasm.ControlFlow.FromCmm


Please report this as a GHC bug:  https://www.haskell.org/ghc/reportabug

Error: cabal: Failed to build bitvec-1.1.5.0.

This may well be an actual GHC bug, but I thought I'd open it here first in case you have any ideas. Seeing as this is the only library I'm now failing to build in a pretty large dependency tree, and it has some funky C stuff going on.

Test on x86 platform

Add Fractional instance

Support thread safe unsafeWrite

Current unsafeWrite (consisting internally of read and write) is not thread safe, as demonstrated by the following program:

module Main where

import Control.Concurrent
import Control.Monad
import Control.Monad.Primitive
import Data.Bit
import Data.Foldable
import qualified Data.Vector.Unboxed as V
import qualified Data.Vector.Unboxed.Mutable as MV

len :: Int
len = 1000

foo :: MV.MVector (PrimState IO) Bit -> [Int] -> IO ()
foo vec = traverse_ $ \x -> MV.write vec x (Bit True)

main :: IO ()
main = forever $ do
  m <- newEmptyMVar
  vec <- MV.new len
  forkIO $ do
    foo vec [0,2..len-1]
    putMVar m ()
  foo vec [1,3..len-1]
  takeMVar m
  wec <- V.unsafeFreeze vec
  if V.all (== Bit True) wec
    then putStr "."
    else putStr "!"

One can write a thread-safe implementation using fetchAndIntArray and fetchOrIntArray.

It remains to be benchmarked whether it makes sense to make thread-safe implementation default one or provide both safe and unsafe newtypes.

Arithmetic of F(2) polynomials

While after #19 one would be able to use Data.Poly.UPoly Bit with all arithmetic available, there are a couple of operations, which can be implemented even more efficiently:

Addition, subtraction, negation.
Squaring (and consequently power) and multiplication.

These together give us a sensible Num instance for something like newtype F2Poly = F2Poly (U.Vector Bit).

`setBit` and `clearBit` do not work as expected

ghci> [0,0,0,1] `setBit` 1 :: Vector Bit
[0,1]

I think this happens because the zipWith function works over the minimum of the lengths and setBit is default implemented as bit i .|. x. I would expect the result to have the length of x here.

Document that `zipBits`, `selectBits`, `excludeBits` etc. cut off the longer bit vector

This may be obvious when comparing to zipWith, but it's less obvious for e.g. .|. on Vector Bit (one may expect this to keep the overhang, since it's logically ORed with 0).

Hashable instance for Bit?

This is something I would very much like, as this would enable bitvectors to be hashed without having to munge them into lists first, or something similarly silly.

Build failure with `vector` `HEAD`, missing CPP functions

In my effort to get the ecosystem ready for GHC 9.4, I'm attempting to get persistent and dependencies ready.

vector has some changes in HEAD that are unreleased, and these changes introduce some build failures. It would appear that vector.h changed, and so some things are different.

Build log:

Building library for bitvec-1.1.2.0..
[ 1 of 13] Compiling Data.Bit.Gmp     ( src/Data/Bit/Gmp.hs, dist/build/Data/Bit/Gmp.o, dist/build/Data/Bit/Gmp.dyn_o )
[ 2 of 13] Compiling Data.Bit.PdepPext ( src/Data/Bit/PdepPext.hs, dist/build/Data/Bit/PdepPext.o, dist/build/Data/Bit/PdepPext.dyn_o )
[ 3 of 13] Compiling Data.Bit.Utils   ( src/Data/Bit/Utils.hs, dist/build/Data/Bit/Utils.o, dist/build/Data/Bit/Utils.dyn_o )
[ 4 of 13] Compiling Data.Bit.InternalTS ( src/Data/Bit/InternalTS.hs, dist/build/Data/Bit/InternalTS.o, dist/build/Data/Bit/InternalTS.dyn_o )

src/Data/Bit/Internal.hs:479:3: error:
    Data constructor not in scope:
      BOUNDS_CHECK :: t0 -> String -> Int -> Int -> m0 () -> m ()
    |
479 |   BOUNDS_CHECK(checkIndex) "flipBit" i (MV.length v) $ unsafeFlipBit v i
    |   ^^^^^^^^^^^^

src/Data/Bit/Internal.hs:479:16: error:
    Variable not in scope: checkIndex
    |
479 |   BOUNDS_CHECK(checkIndex) "flipBit" i (MV.length v) $ unsafeFlipBit v i
    |                ^^^^^^^^^^
cabal: Failed to build bitvec-1.1.2.0 (which is required by test:test from

Implement rank/select queries

https://en.wikipedia.org/wiki/Succinct_data_structure#Succinct_dictionaries
https://hackage.haskell.org/package/hw-rankselect-base-0.3.2.1

Check whether the previous definition of signum was incorrect

Use Data.Bits.Pdep for selectBits/excludeBits

We should use Data.Bits.Pdep.pdep from bits-extra package to implement selectBitsInPlace / excludeBitsInPlace in the most efficient way.

Or pdep# / pext# from GHC.Exts.

Unpack vectors

Currently,

data instance U.MVector s Bit = BitMVec !Int !Int !(U.MVector s Word)
data instance U.Vector    Bit = BitVec  !Int !Int !(U.Vector    Word)

A vector or mutable vector is two unboxed Ints and a pointer to a U.(M)Vector, which itself is two unboxed Ints and a pointer to a raw byte array. I'd expect you'd be better off with

data instance U.MVector s Bit = BitMVec !Int !Int {-# UNPACK #-} !(U.MVector s Word)
data instance U.Vector    Bit = BitVec  !Int !Int {-# UNPACK #-} !(U.Vector    Word)

which gets rid of one indirection. You'd have to check that it doesn't interfere with stream fusion somehow in the immutable case.

Fix cloneToWords on big-endian architectures

    cloneFromWords8:   FAIL (0.02s)
      *** Failed! Falsifiable (after 6 tests and 1 shrink):
      [0]
      [1,0,0,0,0,0,0,0] /= [0,0,0,0,0,0,0,0]
      Use --quickcheck-replay=248223 to reproduce.
    cloneToWords8
      simple:          FAIL (0.02s)
        *** Failed! Falsifiable (after 5 tests and 3 shrinks):
        [1]
        [0] /= [1]
        Use --quickcheck-replay=551038 to reproduce.
      simple_long:     FAIL (0.01s)
        *** Failed! Falsifiable (after 2 tests and 63 shrinks):
        Large {getLarge = [1]}
        [0] /= [1]
        Use --quickcheck-replay=925274 to reproduce.
      middle:          FAIL
        *** Failed! Falsifiable (after 3 tests and 1 shrink):
        NonNegative {getNonNegative = 0}
        NonNegative {getNonNegative = 1}
        NonNegative {getNonNegative = 0}
        [0] /= [1]
        Use --quickcheck-replay=164311 to reproduce.
      middle_long:     FAIL
        *** Failed! Falsifiable (after 2 tests):
        NonNegative {getNonNegative = 0}
        NonNegative {getNonNegative = 1}
        NonNegative {getNonNegative = 0}
        [0,0,0,2,95] /= [145,210,69,95,2]
        Use --quickcheck-replay=935562 to reproduce.
    cloneToByteString: FAIL (0.06s)
      *** Failed! Falsifiable (after 31 tests and 6 shrinks):
      [0,0,0,0,0,0,0,0,1,0,0,1,1,0,0,1,0,0,1,1]
      "\DEL\255\255" /= "\NUL\NUL\NUL"
      Use --quickcheck-replay=152022 to reproduce.

`vector` lower bound is incorrect

~/mess/bitvec-0.2.0.1 % cabal new-build -w ghc-7.10.3 --constraint='vector==0.10.*'
Resolving dependencies...
Build profile: -w ghc-7.10.3 -O1
In order, the following will be built (use -v for more details):
 - primitive-0.6.1.2 (lib) (requires build)
 - vector-0.10.12.3 (lib) (requires download & build)
 - bitvec-0.2.0.1 (lib) (first run)
Downloading  vector-0.10.12.3
Starting     primitive-0.6.1.2 (lib)
Building     primitive-0.6.1.2 (lib)
Downloaded   vector-0.10.12.3
Installing   primitive-0.6.1.2 (lib)
Completed    primitive-0.6.1.2 (lib)
Starting     vector-0.10.12.3 (lib)
Building     vector-0.10.12.3 (lib)
Installing   vector-0.10.12.3 (lib)
Completed    vector-0.10.12.3 (lib)
Configuring library for bitvec-0.2.0.1..
Preprocessing library for bitvec-0.2.0.1..
Building library for bitvec-0.2.0.1..
[1 of 5] Compiling Data.Bit.Internal ( src/Data/Bit/Internal.hs, /home/ogre/mess/bitvec-0.2.0.1/dist-newstyle/build/x86_64-linux/ghc-7.10.3/bitvec-0.2.0.1/build/Data/Bit/Internal.o )
[2 of 5] Compiling Data.Vector.Unboxed.Bit.Internal ( src/Data/Vector/Unboxed/Bit/Internal.hs, /home/ogre/mess/bitvec-0.2.0.1/dist-newstyle/build/x86_64-linux/ghc-7.10.3/bitvec-0.2.0.1/build/Data/Vector/Unboxed/Bit/Internal.o )

src/Data/Vector/Unboxed/Bit/Internal.hs:120:16:
    ‘basicInitialize’ is not a (visible) method of class ‘MV.MVector’

(vector-0.11) seems to work.
I'd recommend also adding upper bounds, but at least checking lower bounds in CI,
See e.g. https://github.com/phadej/qc-instances/blob/master/cabal.haskell-ci configuration for https://github.com/haskell-ci/haskell-ci

Do not use basic* functions

Restore Num/Real/Integral instances of Bit

And change examples, removing read and parenthesis.

About Thread Safety

What exactly is the goal of the thread-safe operations? I get it for basicFlipBit and basicUnsafeWrite (they use a single atomic operation to modify the array), but not modifyByteArray (it uses two atomic operations, so it can still happen that the array is modified inbetween). Is it undefined behaviour to have multiple concurrent array writes to the same index?

And why are there Data.Bit.ImmutableTS and Data.Bit.F2PolyTS modules? These only use local mutability, so it shouldn't make a difference wether they use atomics or not (except that the "thread safe" version is slower).

Implement mapBits

Add mutable flip and unsafeFlip

Investigate benchmarks of reverseInPlace

For some reason Vector Bool is much faster than Vector Bit.

Enable `-mbmi2`

While working on #66, I discovered that pext#/pdep# only use hardware instructions when enabling the -mbmi2 flag. This makes selectBits up to 20x faster! This is an x86-specific flag though and it's not available everywhere, so it shouldn't be enabled by default.

Issue building bitvec on Windows

When compiling hw-dsv on Windows, there was an issue trying to link bitvec, which is one of the dependencies:

C:\Users\newho\wrk\hw-dsv>cabal build --ghc-options=-v
...
"C:\ProgramData\chocolatey\lib\ghc.8.6.5\tools\ghc-8.6.5\lib\../mingw/bin/gcc.exe" "-fno-stack-protector" "-DTABLES_NEXT_TO_CODE" "--print-search-dirs"
Loading package ghc-prim-0.5.3 ... linking ... done.
Loading package integer-gmp-1.0.2.0 ... linking ... done.
Loading package base-4.12.0.0 ... linking ... done.
Loading package array-0.5.3.0 ... linking ... done.
Loading package deepseq-1.4.4.0 ... linking ... done.
Loading package transformers-0.5.6.2 ... linking ... done.
Loading package primitive-0.7.1.0 ... linking ... done.
Loading package vector-0.12.1.2 ... linking ... done.
Loading package bits-extra-0.0.2.0 ... linking ... done.
Loading package bytestring-0.10.8.2 ... linking ... done.
Loading package bitvec-1.0.3.0 ... ghc.exe: Could not load `libgmp-10.dll'. Reason: addDLL: libgmp-10.dll or dependencies not loaded. (Win32 error 126)

*** Deleting temp files:
Deleting: C:\Users\newho\AppData\Local\Temp\ghc1656_0\ghc_1.hscpp
*** Deleting temp dirs:
Deleting: C:\Users\newho\AppData\Local\Temp\ghc1656_0
ghc.exe: panic! (the 'impossible' happened)
  (GHC version 8.6.5 for x86_64-unknown-mingw32):
        loadArchive "C:\\ProgramData\\chocolatey\\lib\\ghc.8.6.5\\tools\\ghc-8.6.5\\mingw\\lib\\libgmp.dll.a": failed

Please report this as a GHC bug:  http://www.haskell.org/ghc/reportabug

cabal.exe: Failed to build hw-dsv-0.4.1.0 (which is required by exe:hw-dsv
from hw-dsv-0.4.1.0).

How lazy is listBits?

I know I could read the source code, but the implementation appears quite involved. My specific question relates to code like

take 3 . listBits $ vec

where vec has 100+ set bits. Will this immediately generate a list of 100 bits, or will it only generate the 3 needed?

Convert to Data.Vector.Storable Word8 and back

I've been asked a couple of times about interacting with ByteString. I really do not want to add bytestring as dependency, so hopefully casting to storable vectors would be a reasonable half-measure.

The zipBits type is unsafe

The argument function has type forall a. Bits a => a -> a -> a. This is insufficient to avoid bad behavior. Nothing stops the function from performing shifts, rotations, popCount, bitSizeMaybe, setBit, etc. In principle, we'd be safe if we had a superclass of Bits that just supported the operations we want, but that's not the case. Options:

Make a newtype around Word whose Bits instance errors out for unsupported operations, and make the argument function deal with that type. This seems like the easiest thing. It should be moderately easy to document the restrictions. I believe the supported operations are .|., .&., xor, complement and zeroBits. I suppose isSigned could be supported too, but it would be quite useless.
Use a newtype around Word without a Bits instance, and make the user use custom operators.
Just give each operation a different name, and forget about the general zipBits.

SIMD implementation for `zipBits`/`invertBits`

bitvec has the option to use libgmp for faster implementations of zipBits and invertBits.

I noticed that you can get even better performance with a C implementation that uses SIMD (this is really easy through OpenMP, with #pragma omp simd), for example:

void omp_and(uint8_t *dest, const uint8_t *src1, const uint8_t *src2, size_t len) {
    #pragma omp simd
    for (size_t i = 0; i < len; i++) {
        dest[i] = src1[i] & src2[i];
    }
}

OpenMP is supported by all standard C compilers. The SIMD versions scale better, being about twice as fast as the libgmp versions for higher amounts of bits in the benchmark (looking at its source code, libgmp doesn't seem to use SIMD optimized implementations for this, so this makes sense).

Benchmark results

I benched the SIMD version in the invert, intersection and union benchmarks (as those use the optimized implementations), with the baseline being the libgmp version.

All
  invert
    32
      Bit:    OK (0.25s)
        27.6 ns ± 2.0 ns,       same as baseline
      BitTS:  OK (0.23s)
        25.2 ns ± 1.5 ns, 0.91x, 21% less than baseline
      Vector: OK (0.29s)
        65.1 ns ± 5.0 ns, 2.36x,       same as baseline
    64
      Bit:    OK (0.24s)
        27.3 ns ± 1.6 ns,       same as baseline
      BitTS:  OK (0.23s)
        25.2 ns ± 1.4 ns, 0.92x, 19% less than baseline
      Vector: OK (0.25s)
        109  ns ± 6.3 ns, 4.01x,       same as baseline
    128
      Bit:    OK (0.26s)
        28.2 ns ± 1.4 ns, 10% less than baseline
      BitTS:  OK (0.24s)
        25.9 ns ± 1.9 ns, 0.92x,       same as baseline
      Vector: OK (0.23s)
        200  ns ±  11 ns, 7.09x,       same as baseline
    256
      Bit:    OK (0.26s)
        29.7 ns ± 1.9 ns,       same as baseline
      BitTS:  OK (0.25s)
        27.5 ns ± 1.7 ns, 0.93x,       same as baseline
      Vector: OK (0.22s)
        380  ns ±  22 ns, 12.77x,       same as baseline
    512
      Bit:    OK (0.26s)
        28.6 ns ± 1.3 ns, 13% less than baseline
      BitTS:  OK (0.24s)
        26.4 ns ± 1.6 ns, 0.92x, 17% less than baseline
      Vector: OK (0.21s)
        750  ns ±  43 ns, 26.20x,       same as baseline
    1024
      Bit:    OK (0.14s)
        28.8 ns ± 2.6 ns, 22% less than baseline
      BitTS:  OK (0.25s)
        26.8 ns ± 1.6 ns, 0.93x, 23% less than baseline
      Vector: OK (0.21s)
        1.48 μs ±  93 ns, 51.49x,       same as baseline
    2048
      Bit:    OK (0.29s)
        31.7 ns ± 1.5 ns, 31% less than baseline
      BitTS:  OK (0.14s)
        30.9 ns ± 2.8 ns, 0.98x, 23% less than baseline
      Vector: OK (0.40s)
        2.96 μs ± 126 ns, 93.48x,       same as baseline
    4096
      Bit:    OK (0.19s)
        38.7 ns ± 2.8 ns, 36% less than baseline
      BitTS:  OK (0.18s)
        38.5 ns ± 3.3 ns, 0.99x, 36% less than baseline
      Vector: OK (0.87s)
        12.6 μs ± 389 ns, 325.57x,       same as baseline
    8192
      Bit:    OK (0.44s)
        48.9 ns ± 4.0 ns, 49% less than baseline
      BitTS:  OK (0.43s)
        47.3 ns ± 1.6 ns, 0.97x, 51% less than baseline
      Vector: OK (0.41s)
        46.7 μs ± 3.3 μs, 955.37x,       same as baseline
    16384
      Bit:    OK (0.18s)
        71.4 ns ± 6.0 ns, 61% less than baseline
      BitTS:  OK (0.20s)
        79.0 ns ± 6.6 ns, 1.11x, 57% less than baseline
      Vector: OK (0.25s)
        114  μs ± 7.6 μs, 1595.65x,       same as baseline
  intersection
    32
      Bit:    OK (0.22s)
        92.3 ns ± 5.2 ns,       same as baseline
      BitTS:  OK (0.23s)
        105  ns ± 5.5 ns, 1.14x, 15% more than baseline
      Vector: OK (0.15s)
        251  ns ±  23 ns, 2.72x,       same as baseline
      IntSet: OK (0.16s)
        16.5 ns ± 1.6 ns, 0.18x,       same as baseline
    64
      Bit:    OK (0.22s)
        92.4 ns ± 5.8 ns,       same as baseline
      BitTS:  OK (0.23s)
        100  ns ± 5.4 ns, 1.08x, 10% more than baseline
      Vector: OK (0.26s)
        453  ns ±  38 ns, 4.91x,       same as baseline
      IntSet: OK (0.16s)
        16.6 ns ± 1.5 ns, 0.18x,       same as baseline
    128
      Bit:    OK (0.22s)
        95.9 ns ± 9.6 ns,       same as baseline
      BitTS:  OK (0.22s)
        92.9 ns ± 7.5 ns, 0.97x,       same as baseline
      Vector: OK (0.25s)
        871  ns ±  58 ns, 9.08x,       same as baseline
      IntSet: OK (0.65s)
        37.5 ns ± 3.1 ns, 0.39x,       same as baseline
    256
      Bit:    OK (0.22s)
        95.8 ns ± 7.1 ns,       same as baseline
      BitTS:  OK (0.22s)
        94.5 ns ± 5.1 ns, 0.99x,       same as baseline
      Vector: OK (0.24s)
        1.70 μs ±  97 ns, 17.72x,       same as baseline
      IntSet: OK (0.18s)
        77.2 ns ± 5.4 ns, 0.81x,       same as baseline
    512
      Bit:    OK (0.22s)
        91.9 ns ± 6.1 ns,       same as baseline
      BitTS:  OK (0.21s)
        91.6 ns ± 5.5 ns, 1.00x,       same as baseline
      Vector: OK (0.24s)
        3.41 μs ± 176 ns, 37.10x,       same as baseline
      IntSet: OK (0.21s)
        176  ns ±  12 ns, 1.92x,       same as baseline
    1024
      Bit:    OK (0.22s)
        92.1 ns ± 6.1 ns,       same as baseline
      BitTS:  OK (0.21s)
        91.7 ns ± 5.5 ns, 1.00x,  9% less than baseline
      Vector: OK (0.27s)
        7.67 μs ± 351 ns, 83.24x,       same as baseline
      IntSet: OK (0.77s)
        351  ns ± 8.9 ns, 3.81x,       same as baseline
    2048
      Bit:    OK (0.42s)
        92.1 ns ± 3.4 ns, 16% less than baseline
      BitTS:  OK (0.23s)
        94.9 ns ± 7.4 ns, 1.03x, 16% less than baseline
      Vector: OK (0.20s)
        22.8 μs ± 1.3 μs, 247.50x,       same as baseline
      IntSet: OK (0.23s)
        773  ns ±  45 ns, 8.39x,       same as baseline
    4096
      Bit:    OK (0.24s)
        98.7 ns ± 6.4 ns, 26% less than baseline
      BitTS:  OK (0.24s)
        98.4 ns ± 5.7 ns, 1.00x, 28% less than baseline
      Vector: OK (0.14s)
        60.6 μs ± 6.0 μs, 613.72x,       same as baseline
      IntSet: OK (0.23s)
        1.57 μs ±  90 ns, 15.93x,       same as baseline
    8192
      Bit:    OK (0.27s)
        112  ns ± 5.4 ns, 42% less than baseline
      BitTS:  OK (0.27s)
        112  ns ± 5.6 ns, 1.01x, 42% less than baseline
      Vector: OK (0.15s)
        126  μs ±  12 μs, 1130.02x,       same as baseline
      IntSet: OK (0.25s)
        3.26 μs ± 188 ns, 29.19x,       same as baseline
    16384
      Bit:    OK (0.18s)
        126  ns ±  12 ns, 58% less than baseline
      BitTS:  OK (0.18s)
        128  ns ±  11 ns, 1.02x, 57% less than baseline
      Vector: OK (0.29s)
        259  μs ±  13 μs, 2064.13x,       same as baseline
      IntSet: OK (1.77s)
        6.55 μs ± 118 ns, 52.10x,       same as baseline
  union
    32
      Bit:    OK (0.23s)
        89.6 ns ± 8.3 ns,       same as baseline
      BitTS:  OK (0.23s)
        90.2 ns ± 6.0 ns, 1.01x,       same as baseline
      Vector: OK (0.27s)
        226  ns ±  10 ns, 2.53x,       same as baseline
      IntSet: OK (0.18s)
        15.9 ns ± 1.5 ns, 0.18x,       same as baseline
    64
      Bit:    OK (0.22s)
        89.4 ns ± 5.2 ns,       same as baseline
      BitTS:  OK (0.23s)
        89.8 ns ± 5.4 ns, 1.01x,       same as baseline
      Vector: OK (0.26s)
        429  ns ±  22 ns, 4.81x,       same as baseline
      IntSet: OK (0.18s)
        16.0 ns ± 1.4 ns, 0.18x,       same as baseline
    128
      Bit:    OK (0.23s)
        89.9 ns ± 5.8 ns,       same as baseline
      BitTS:  OK (0.23s)
        90.4 ns ± 5.7 ns, 1.01x,       same as baseline
      Vector: OK (0.25s)
        836  ns ±  43 ns, 9.29x,       same as baseline
      IntSet: OK (0.18s)
        35.9 ns ± 2.7 ns, 0.40x,       same as baseline
    256
      Bit:    OK (0.23s)
        92.6 ns ± 6.3 ns,       same as baseline
      BitTS:  OK (0.23s)
        91.8 ns ± 6.0 ns, 0.99x,       same as baseline
      Vector: OK (0.24s)
        1.63 μs ±  84 ns, 17.64x,       same as baseline
      IntSet: OK (0.20s)
        74.9 ns ± 5.4 ns, 0.81x,       same as baseline
    512
      Bit:    OK (0.23s)
        90.4 ns ± 6.1 ns,       same as baseline
      BitTS:  OK (0.23s)
        90.6 ns ± 5.5 ns, 1.00x,       same as baseline
      Vector: OK (0.25s)
        3.27 μs ± 267 ns, 36.17x,       same as baseline
      IntSet: OK (0.40s)
        172  ns ± 7.2 ns, 1.91x,       same as baseline
    1024
      Bit:    OK (0.23s)
        89.3 ns ± 5.5 ns,       same as baseline
      BitTS:  OK (0.23s)
        90.9 ns ± 5.3 ns, 1.02x,  8% less than baseline
      Vector: OK (0.25s)
        6.79 μs ± 376 ns, 76.02x,       same as baseline
      IntSet: OK (0.41s)
        355  ns ±  26 ns, 3.98x,       same as baseline
    2048
      Bit:    OK (0.23s)
        92.2 ns ± 5.5 ns, 13% less than baseline
      BitTS:  OK (0.22s)
        93.3 ns ± 5.4 ns, 1.01x, 14% less than baseline
      Vector: OK (0.20s)
        20.9 μs ± 1.4 μs, 226.96x,       same as baseline
      IntSet: OK (0.24s)
        771  ns ±  46 ns, 8.37x,       same as baseline
    4096
      Bit:    OK (0.25s)
        96.6 ns ± 7.0 ns, 28% less than baseline
      BitTS:  OK (0.25s)
        97.1 ns ± 6.0 ns, 1.01x, 28% less than baseline
      Vector: OK (0.26s)
        56.3 μs ± 3.8 μs, 582.50x,       same as baseline
      IntSet: OK (0.25s)
        1.58 μs ± 108 ns, 16.36x,       same as baseline
    8192
      Bit:    OK (0.16s)
        105  ns ±  10 ns, 45% less than baseline
      BitTS:  OK (0.26s)
        104  ns ± 5.5 ns, 0.99x, 46% less than baseline
      Vector: OK (0.28s)
        122  μs ±  10 μs, 1160.93x,       same as baseline
      IntSet: OK (0.25s)
        3.23 μs ± 190 ns, 30.66x,       same as baseline
    16384
      Bit:    OK (0.18s)
        114  ns ±  10 ns, 61% less than baseline
      BitTS:  OK (0.18s)
        114  ns ±  11 ns, 1.00x, 62% less than baseline
      Vector: OK (0.16s)
        260  μs ±  24 μs, 2285.21x,       same as baseline
      IntSet: OK (0.26s)
        6.55 μs ± 417 ns, 57.52x,       same as baseline

My implementation can be found at https://github.com/konsumlamm/bitvec/tree/simd. It could also be expanded to be applicable in cases where the offset is non-zero. A similar implementation should also work for zipInPlace/invertInPlace and perhaps also others.

Would you be interested in a PR for this?

Add F2Poly extended Euclidean algorithm

An analogue of gcdExt for F2Poly similar to that of Poly. Would also be convenient to have relevant Semiring instances to use gcd, etc.

Implement immutable reverseBits

Move listBits/countBits to Data.Bit.Internal

And implement them more efficiently.

Convert to bytestring and back

Earlier (cf. #34) I was against it, but I changed my mind.

More reliable zipBits approach

zipBits is used for defining .|., .&., and xor. But it's not at all obvious to me that zipBits will inline to eliminate all the unused paths. I think there are a few ways that might work to improve that; I'm working on them.

Describe what exactly is thread-safe

Benchmark Data.Bit vs Data.Bit.ThreadSafe and update documentation

Currently docs claim that Data.Bit.ThreadSafe is twice slower.

New release

Any chance to get new release that includes #61? Otherwise package causes troubles when used with ghc with native bignums, like not being able to load it to ghci.

Benchmark random writes vs Data.Vector.Unboxed Bool and update documentation

Should zipBits be strict?

Currently, if the first vector argument to zipBits is empty, then the second isn't inspected. This prevents the second argument from being unboxed. Is optimizing that special case worth it? I think it's pretty weird.

Provide a flag to disable integer-gmp dependency

Use bitReverse# when ghc >= 8.10

Use bitReverse# for Data.Bit.Utils.reverseWord, once ghc-8.10 is out: https://github.com/ghc/ghc/blob/ghc-8.10/docs/users_guide/8.10.1-notes.rst#ghc-prim-library

Also mention vector-rotcev package in haddocks for reverseBits / reverseInPlace.

Bit Vectors and most vs least significant bit

I really like the current unboxed implementation of bit vectors and how it plays nicely with other libraries based on vector, but I think there are some additions that'd be nice to have.

I've been using bit vectors for dealing with incremental construction of network packets (together with vector-sized), which works really well for constructing things, but converting to/from these bit vectors to ByteString is rather annoying. As most networking protocols are specified with bit 0 = most significant bit, as opposed to bit 0 = least significant bit.

To simplify my life I'm currently writing some newtype wrappers that use different indexing logic to extract bits so I can more conveniently go from ByteString/Storable.Vector Word8 to a bit vector with the data I expect. I think it'd be nice to add these newtypes and conversion code to bitvec when I'm done.

So consider this issue a poll whether there's interest to merge this functionality and a bikeshedding opportunity on naming of any of this code and how to integrate it with the existing code.

I was thinking of maybe defining another newtype for Bit that has a type parameter that indicates the bit addressing (so whether 0 is least/most significant), additionally I think it'd be nice to have some more functions for creating bit vectors from other vectors, rather than only from Vector Word.

Convert to and from sparse vectors

Tentatively along these lines:

to    :: Vector Bit -> Vector Int 
from  :: Vector Int -> Vector Bit 
from' :: [Int] -> Vector Bit