dnbaker / sketch

C++ Implementations of sketch data structures with SIMD Parallelism, including Python bindings

License: MIT License

C++ 97.87% C 1.82% Makefile 0.01% Python 0.03% Cuda 0.07% Shell 0.01% CMake 0.21%
hll sketch-data-structures hyperloglog bloom-filter count-min-sketch minhash

sketch's People

Contributors

benlangmead, dnbaker, jermp, mattheww95

sketch's Issues

test script and several tests fail

Make 4.4.1 puts all the test names on a single line, which makes the test script fail.
When this problem is fixed (see the revised script below), I get 6 failed tests (bftest, cmtest, divtest, testcontain, testmhmerge, vactest) for various reasons.

#!/bin/bash
make setup_tests -j8 1>/dev/null
# Make 4.4.1 writes all test names on one line; split them one per line.
tests=$(cat tmpfiles.txt | tr " " "\n")
n_failed=0
n_successes=0
rm -f failed.txt
echo "running tests"
start_time="$(date -u +%s)"
for test_exe in $tests; do
    echo -n " ${test_exe}..."
    # Run the test directly and discard its output; only the exit status matters.
    if ./"$test_exe" >/dev/null 2>&1; then
        echo "OK"
        ((n_successes=n_successes+1))
    else
        echo -e "\e[31m\e[1mfailed\e[0m"
        echo "${test_exe}" >> failed.txt
        ((n_failed=n_failed+1))
    fi
done
end_time="$(date -u +%s)"
elapsed="$(($end_time-$start_time))"
echo "Results: ${n_successes} successes, ${n_failed} failures in ${elapsed} s"
# List the failed tests only if there were any (failed.txt exists only then).
if [ $n_failed -gt 0 ]; then
    echo "Failed tests:"
    cat failed.txt
fi
rm -f tmpfiles.txt failed.txt

won't compile under gcc > 11 or clang 16

Compiles under gcc 11, but fails under gcc 12 and 13 with the following error (similar under clang-16):

x86_64-pc-linux-gnu-g++ -std=c++17 -O3 -funroll-loops -pipe -march=native -Iinclude/sketch -I. -Iinclude/blaze -Ivec -Ipybind11/include -Iinclude -fpic -Wall -Wall -Wextra -Wno-char-subscripts -Wpointer-arith -Wwrite-strings -Wdisabled-optimization -Wformat -Wcast-align -Wno-unused-function -Wno-unused-parameter -pedantic -Wunused-variable -Wno-cast-align -fno-strict-aliasing -Wreorder -Wno-unused-parameter -pthread kthread.o testsrc/lpcqf_approx.cpp -o lpcqf_approx -lz # -fsanitize=undefined -fsanitize=address
In file included from testsrc/lpcqf_approx.cpp:1:
include/sketch/lpcqf.h: In instantiation of ‘constexpr const std::array<long double, 64> sketch::POWERS<25, 26, 64>’:
include/sketch/lpcqf.h:160:56: required from ‘static long double sketch::LPCQF<BaseT, sigbits, flags, argnum, argdenom, ModT, NCACHED_POWERS>::ainc_estimate_count(long long int) [with BaseT = short unsigned int; long unsigned int sigbits = 2; int flags = 11; long unsigned int argnum = 52; long unsigned int argdenom = 50; ModT = unsigned int; long long int NCACHED_POWERS = 64]’
include/sketch/lpcqf.h:360:47: required from ‘std::conditional_t<sketch::LPCQF<BaseT, sigbits, flags, argnum, argdenom, ModT, NCACHED_POWERS>::approx_inc, long double, BaseT> sketch::LPCQF<BaseT, sigbits, flags, argnum, argdenom, ModT, NCACHED_POWERS>::count_estimate(uint64_t) const [with BaseT = short unsigned int; long unsigned int sigbits = 2; int flags = 11; long unsigned int argnum = 52; long unsigned int argdenom = 50; ModT = unsigned int; long long int NCACHED_POWERS = 64; std::conditional_t<approx_inc, long double, BaseT> = long double; uint64_t = long unsigned int]’
testsrc/lpcqf_approx.cpp:16:121: required from here
include/sketch/lpcqf.h:113:73: error: invalid use of incomplete type ‘struct std::array<long double, 64>’
113 | constexpr std::array<long double, N> POWERS = get_ipowers<num, denom, N>();
| ~~~~~~~~~~~~~~~~~~~~~~~~~~^~
In file included from /usr/lib/gcc/x86_64-pc-linux-gnu/12/include/g++-v12/bits/unique_ptr.h:36,
from /usr/lib/gcc/x86_64-pc-linux-gnu/12/include/g++-v12/memory:75,
from include/sketch/lpcqf.h:11:
/usr/lib/gcc/x86_64-pc-linux-gnu/12/include/g++-v12/tuple:1595:45: note: declaration of ‘struct std::array<long double, 64>’
1595 | template<typename _Tp, size_t _Nm> struct array;
| ^~~~~
include/sketch/lpcqf.h:113:38: error: ‘constexpr const std::array<long double, 64> sketch::POWERS<25, 26, 64>’ has incomplete type
113 | constexpr std::array<long double, N> POWERS = get_ipowers<num, denom, N>();
| ^~~~~~

Error Compilation

Hello,

I get the following error while compiling the code:

rc/mhtest.cpp: In function ‘int main(int, char**)’:
src/mhtest.cpp:101:9: error: ‘aes’ has not been declared
aes::AesCtr<uint64_t> gen1(13), gen2(1337);

I tried adding #include "aesctr.h" to the code, but that produces a large number of errors.

Can someone help?

Trouble including the library

I have trouble understanding the intended way to include the library. The README says
#include <sketch/sketch.h>, but this file is at include/sketch/sketch.h. If I try #include "sketch/include/sketch/sketch.h", I get fatal error: aesctr/wy.h: No such file or directory (I did clone with submodules). Should I manually move all the directories like aesctr and vec next to sketch.h?

Are all submodules required? Blaze seems huge... I want to try out the count-min sketch, do I need the whole library for that?

Thanks!

bad versioning code in setup.py

setup.py contains the following line:

__version__ = subprocess.check_output( ["git", "describe", "--abbrev=4"]).decode().strip().split('-')[0]

which fails if the package is installed from tarball and not from git.

Fixing this is one of the messes of Python packaging. One popular way is to use the setuptools-scm package, but that has some pitfalls too. With my own code, I find the simplest approach is the best: put a version string into its own .py file, where it can be imported into setup.py. I haven't seen where your version info is stored, but you could have your Makefile write it into the version.py file as part of a release.
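
The suggestion above can be sketched as follows. This is a minimal sketch, not the project's actual setup.py: the helper name read_version and the layout of version.py (a single `__version__ = "..."` line) are assumptions. It reads the version file first and falls back to `git describe` only when the file is absent, so installing from a tarball never needs git.

```python
import re
import subprocess
from pathlib import Path

# Assumed layout: version.py holds one line such as  __version__ = "1.2.3"
_VERSION_RE = re.compile(r'^__version__\s*=\s*["\']([^"\']+)["\']', re.MULTILINE)


def read_version(root="."):
    """Return the package version without requiring a git checkout.

    Hypothetical helper: prefers root/version.py, then falls back to git.
    """
    version_file = Path(root) / "version.py"
    if version_file.is_file():
        match = _VERSION_RE.search(version_file.read_text())
        if match:
            return match.group(1)
    try:
        # Same command as the setup.py line quoted above, used only as a fallback.
        out = subprocess.check_output(
            ["git", "describe", "--abbrev=4"],
            cwd=root,
            stderr=subprocess.DEVNULL,
        )
    except (OSError, subprocess.CalledProcessError) as exc:
        raise RuntimeError("no version.py and no usable git metadata") from exc
    return out.decode().strip().split("-")[0]
```

In setup.py this could then be used as `__version__ = read_version()`, with the release step responsible for writing version.py.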

SetSketch implementations

Hi! Thank you for this wonderful library. I am working on estimating the overlap between different web crawls, which basically requires estimating the number of unique URLs in lists of 10-100 billion URLs and the cardinalities of their intersections. After a literature search, SetSketch seems to be the right method for this case, as it allows both cardinality and Jaccard index estimation.
I found several implementations of SetSketch in your library (ByteSetSketch, CSetSketch, FSetSketch, ShortSetSketch). Could you please advise on how to select the appropriate one? Also, I could not find how to change the hyperparameters a and b from Python. Is it possible and reasonable to tune them, or is it better to rely on the default values?
The final question is about calculating confidence intervals for the estimates. The HLL implementation has the method relative_error() for this; is there a way to get similar estimates for SetSketch?
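
For the overlap use case described above, the intersection size follows from the two cardinalities and the Jaccard index by inclusion-exclusion: with J = |A ∩ B| / |A ∪ B| and |A ∪ B| = |A| + |B| - |A ∩ B|, solving for the intersection gives |A ∩ B| = J (|A| + |B|) / (1 + J). The sketch below is plain Python arithmetic and does not call the library; the function name is hypothetical. In practice all three inputs would be sketch estimates, so their errors carry over to the result.

```python
def intersection_size(card_a, card_b, jaccard):
    """Estimate |A ∩ B| from |A|, |B| and the Jaccard index J.

    Uses J = I / (A + B - I), solved for I: I = J * (A + B) / (1 + J).
    """
    if not 0.0 <= jaccard <= 1.0:
        raise ValueError("jaccard must lie in [0, 1]")
    return jaccard * (card_a + card_b) / (1.0 + jaccard)


# Toy check with exact sets: multiples of 2 and of 5 up to 1000.
a = set(range(2, 1001, 2))    # 500 elements
b = set(range(5, 1001, 5))    # 200 elements
j = len(a & b) / len(a | b)   # exact Jaccard index, 100 / 600
print(intersection_size(len(a), len(b), j), len(a & b))
```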

please choose a unique name and create a python package

Both "sketch" and "hll" are taken on PyPI by packages that are far more popular than yours, creating a namespace collision waiting to happen. Please choose a new name and reserve that name on PyPI. Very little work would be needed to make a Python package once this barrier is overcome.

Compile error

Hi,

I cloned the repo recursively on a CentOS 7.9 machine. When I try to make the repo, I get the following error:

make

g++ -std=c++17 -O3 -funroll-loops -pipe -march=native -I/usr/include/ -Iinclude/sketch -I. -Iinclude/blaze -Ivec -Ipybind11/include -Iinclude -fpic -Wall -Wall -Wextra -Wno-char-subscripts -Wpointer-arith -Wwrite-strings -Wdisabled-optimization -Wformat -Wcast-align -Wno-unused-function -Wno-unused-parameter -pedantic -Wunused-variable -Wno-cast-align -fno-strict-aliasing -Wreorder -Wno-unused-parameter -pthread kthread.o testsrc/swtest.cpp -o swtest -lz # -fsanitize=undefined -fsanitize=address
In file included from include/sketch/ccm.h:5,
from testsrc/swtest.cpp:1:
include/sketch/hash.h:25:24: error: 'vec' has not been declared
25 | using Type = typename vec::SIMDTypes<uint64_t>::Type;
| ^~~

I would appreciate any help fixing it.

error: use of undeclared identifier '_mm512_cmpeq_epi64_mask'

Hey @dnbaker, neat library! I'm trying to compile it on macOS 11.1 (x86-64) but am running into the following error:

make
cc -c -O3 -funroll-loops -pipe -march=native -Iinclude/sketch -I. -Ivec/blaze -Ivec -Ipybind11/include -Iinclude -fpic -Wall -Wall -Wextra -Wno-char-subscripts -Wpointer-arith -Wwrite-strings -Wdisabled-optimization -Wformat -Wcast-align -Wno-unused-function -Wno-unused-parameter -pedantic -Wunused-variable  -Wpedantic  -fno-strict-aliasing -DXXH_INLINE_ALL 	kthread.c -o kthread.o
c++ -O3 -funroll-loops -pipe -march=native -Iinclude/sketch -I. -Ivec/blaze -Ivec -Ipybind11/include -Iinclude -fpic -Wall -Wall -Wextra -Wno-char-subscripts -Wpointer-arith -Wwrite-strings -Wdisabled-optimization -Wformat -Wcast-align -Wno-unused-function -Wno-unused-parameter -pedantic -Wunused-variable  -Wpedantic  -fno-strict-aliasing -DXXH_INLINE_ALL  -Wreorder  	-std=c++14 -Wno-unused-parameter -pthread kthread.o testsrc/bbmhtest.cpp -o bbmhtest -lz # -fsanitize=undefined -fsanitize=address
In file included from testsrc/bbmhtest.cpp:1:
In file included from include/sketch/bbmh.h:3:
In file included from include/sketch/common.h:26:
./libpopcnt/libpopcnt.h:464:10: warning: cast from 'const uint8_t *' (aka 'const unsigned char *') to 'const uint64_t *'
      (aka 'const unsigned long long *') increases required alignment from 1 to 8 [-Wcast-align]
        *(const uint64_t*) *p);
         ^~~~~~~~~~~~~~~~~~~~
./libpopcnt/libpopcnt.h:576:10: warning: cast from 'const uint8_t *' (aka 'const unsigned char *') to 'const uint64_t *'
      (aka 'const unsigned long long *') increases required alignment from 1 to 8 [-Wcast-align]
        *(const uint64_t*) *p);
         ^~~~~~~~~~~~~~~~~~~~
./libpopcnt/libpopcnt.h:654:26: warning: cast from 'const uint8_t *' (aka 'const unsigned char *') to 'const __m512i *' increases required alignment
      from 1 to 64 [-Wcast-align]
    cnt += popcnt_avx512((const __m512i*) ptr, size / 64);
                         ^~~~~~~~~~~~~~~~~~~~
./libpopcnt/libpopcnt.h:668:24: warning: cast from 'const uint8_t *' (aka 'const unsigned char *') to 'const __m256i *' increases required alignment
      from 1 to 32 [-Wcast-align]
    cnt += popcnt_avx2((const __m256i*) ptr, size / 32);
                       ^~~~~~~~~~~~~~~~~~~~
./libpopcnt/libpopcnt.h:679:30: warning: cast from 'const uint8_t *' (aka 'const unsigned char *') to 'const uint64_t *'
      (aka 'const unsigned long long *') increases required alignment from 1 to 8 [-Wcast-align]
    cnt += popcnt64_unrolled((const uint64_t*) ptr, size / 8);
                             ^~~~~~~~~~~~~~~~~~~~~
./libpopcnt/libpopcnt.h:694:32: warning: cast from 'const uint8_t *' (aka 'const unsigned char *') to 'const uint64_t *'
      (aka 'const unsigned long long *') increases required alignment from 1 to 8 [-Wcast-align]
    cnt += popcount64_unrolled((const uint64_t*) ptr, size / 8);
                               ^~~~~~~~~~~~~~~~~~~~~
In file included from testsrc/bbmhtest.cpp:1:
In file included from include/sketch/bbmh.h:3:
In file included from include/sketch/common.h:60:
In file included from include/sketch/./hash.h:17:
././vec/vec.h:603:5: error: use of undeclared identifier '_mm512_cmpeq_epi64_mask'
    declare_all_int512(epi64, 512)
    ^
././vec/vec.h:376:5: note: expanded from macro 'declare_all_int512'
    declare_int_epi64_512(sz) \
    ^
././vec/vec.h:351:5: note: expanded from macro 'declare_int_epi64_512'
    declare_avx512_cmpeq_mask(64)
    ^
././vec/vec.h:168:32: note: expanded from macro 'declare_avx512_cmpeq_mask'
    static constexpr decltype(&_mm512_cmpeq_epi##sz##_mask) cmpeq_mask = &_mm512_cmpeq_epi##sz##_mask; \
                               ^
<scratch space>:193:1: note: expanded from here
_mm512_cmpeq_epi64_mask
^

It looks like it may be related to AVX512. Here are the AVX macros on my machine:

c++ -march=native -dM -E - < /dev/null | egrep "AVX" | sort
#define __AVX2__ 1
#define __AVX512BITALG__ 1
#define __AVX512BW__ 1
#define __AVX512CD__ 1
#define __AVX512DQ__ 1
#define __AVX512F__ 1
#define __AVX512IFMA__ 1
#define __AVX512VBMI2__ 1
#define __AVX512VBMI__ 1
#define __AVX512VL__ 1
#define __AVX512VNNI__ 1
#define __AVX512VPOPCNTDQ__ 1
#define __AVX__ 1

Any ideas?

SetSketches saved from different processes have jaccard estimation of 0

Hi! I'm using CSetSketch from Python. When I create, fill, and save this structure to disk for two sets in the same process, loading both from disk and estimating the Jaccard index works well. But when the two sets are processed in different processes, the estimate is 0. Cardinality estimation works well in both cases.
Here is a minimal example showing this:

import os
import sketch

m = 2**18
hll, hll2 = sketch.setsketch.CSetSketch(m), sketch.setsketch.CSetSketch(m)

step1, step2, maxval1, maxval2 = 2, 5, 1000, 1000
for i in range(step1, maxval1+1, step1):
    hll.addh(str(i))

for i in range(step2, maxval2+1, step2):
    hll2.addh(str(i))
    
hll.write(f'tmp1_{os.getpid()}')
hll2.write(f'tmp2_{os.getpid()}')

Run this code twice, in two different processes. Then run:

from pathlib import Path

import sketch

for ss1 in Path('./').glob('tmp1_*'):
    for ss2 in Path('./').glob('tmp2_*'):
        hll, hll2 = sketch.setsketch.CSetSketch(str(ss1)), sketch.setsketch.CSetSketch(str(ss2))
        jaccard_est = sketch.setsketch.jaccard_index(hll, hll2)
        print(ss1, ss2, jaccard_est)

It will print this in my case:
tmp1_736949 tmp2_736949 0.16761398315429688
tmp1_736949 tmp2_736999 0.0
tmp1_736999 tmp2_736949 0.0
tmp1_736999 tmp2_736999 0.166900634765625
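
One possible explanation, offered as an assumption that the issue itself does not confirm: if addh derives its hash value from Python's built-in hash() for str arguments, two processes would map the same string to different values, because CPython randomizes str hashes per process unless PYTHONHASHSEED is set. Cardinality estimates would be unaffected, while sketches built in different processes would look disjoint. The sketch below only demonstrates the CPython behaviour; it does not use the sketch library, and the helper name is hypothetical.

```python
import os
import subprocess
import sys


def str_hash_in_fresh_process(text, seed=None):
    """Return hash(text) as computed by a newly started Python interpreter.

    seed=None leaves hash randomization at its default (random per process);
    an integer seed is passed to the child via PYTHONHASHSEED.
    """
    env = dict(os.environ)
    env.pop("PYTHONHASHSEED", None)
    if seed is not None:
        env["PYTHONHASHSEED"] = str(seed)
    out = subprocess.check_output(
        [sys.executable, "-c", "import sys; print(hash(sys.argv[1]))", text],
        env=env,
    )
    return int(out)


# Same seed: both processes agree on the hash of the same string.
print(str_hash_in_fresh_process("10", seed=1) == str_hash_in_fresh_process("10", seed=1))
# Different seeds (as with two unrelated default runs): the hashes differ.
print(str_hash_in_fresh_process("10", seed=1) == str_hash_in_fresh_process("10", seed=2))
```

If this is indeed the cause, running both writer processes with the same PYTHONHASHSEED value should make the cross-process estimates agree with the same-process ones (the exact Jaccard index of these two sets is 100/600, about 0.167). This remedy is untested here.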
