harsha-simhadri / big-ann-benchmarks
Framework for evaluating ANNS algorithms on billion-scale datasets.
Home Page: https://big-ann-benchmarks.com
License: MIT License
If an algorithm primarily stores its structures on disk, but makes heavy use of the filesystem cache to speed up access to those structures, does it belong in T1 or T2?
Some concrete examples:
I personally think they should fall under T2, but I could see arguments for both. Anyway, it would be good to clarify this somewhere in the readme or website if possible.
Command:
python install.py --neurips23track ood --algorithm diskann
Output + error:
Building base image...
[+] Building 137.8s (13/13) FINISHED docker:default
=> [internal] load build definition from Dockerfile 0.0s
=> => transferring dockerfile: 556B 0.0s
=> [internal] load .dockerignore 0.0s
=> => transferring context: 66B 0.0s
=> [internal] load metadata for docker.io/library/ubuntu:jammy 0.4s
=> [1/8] FROM docker.io/library/ubuntu:jammy@sha256:ec050c32e4a6085b423d36ecd025c0d3ff00c38ab93a3d71a460ff1c44fa6d77 1.2s
=> => resolve docker.io/library/ubuntu:jammy@sha256:ec050c32e4a6085b423d36ecd025c0d3ff00c38ab93a3d71a460ff1c44fa6d77 0.0s
=> => sha256:ec050c32e4a6085b423d36ecd025c0d3ff00c38ab93a3d71a460ff1c44fa6d77 1.13kB / 1.13kB 0.0s
=> => sha256:56887c5194fddd8db7e36ced1c16b3569d89f74c801dc8a5adbf48236fb34564 424B / 424B 0.0s
=> => sha256:01f29b872827fa6f9aed0ea0b2ede53aea4ad9d66c7920e81a8db6d1fd9ab7f9 2.30kB / 2.30kB 0.0s
=> => sha256:b237fe92c4173e4dfb3ba82e76e5fed4b16186a6161e07af15814cb40eb9069d 29.54MB / 29.54MB 0.4s
=> => extracting sha256:b237fe92c4173e4dfb3ba82e76e5fed4b16186a6161e07af15814cb40eb9069d 0.7s
=> [internal] load build context 0.0s
=> => transferring context: 320B 0.0s
=> [2/8] RUN apt-get update && apt-get install -y python3-numpy python3-scipy python3-pip build-essential git axel wget 46.1s
=> [3/8] RUN wget https://aka.ms/downloadazcopy-v10-linux && mv downloadazcopy-v10-linux azcopy.tgz && tar xzf azcopy.tgz --transform 's!^[^/]\+\($\|/\)!azcopy_folder\1!' 1.0s
=> [4/8] RUN cp azcopy_folder/azcopy /usr/bin 0.4s
=> [5/8] RUN pip3 install -U pip 2.6s
=> [6/8] WORKDIR /home/app 0.0s
=> [7/8] COPY requirements_py3.10.txt run_algorithm.py ./ 0.0s
=> [8/8] RUN pip3 install -r requirements_py3.10.txt 81.0s
=> exporting to image 4.8s
=> => exporting layers 4.7s
=> => writing image sha256:ce6f63808ecd14af21e8afd0f5352768165e29ccccb422ec34b815db1691935f 0.0s
=> => naming to docker.io/library/neurips23 0.0s
Building algorithm images... with (1) processes
Building neurips23-ood-diskann...
docker build --rm -t neurips23-ood-diskann -f neurips23/ood/diskann/Dockerfile .
[+] Building 183.6s (14/15) docker:default
=> [internal] load build definition from Dockerfile 0.0s
=> => transferring dockerfile: 633B 0.0s
=> [internal] load .dockerignore 0.0s
=> => transferring context: 66B 0.0s
=> [internal] load metadata for docker.io/library/neurips23:latest 0.0s
=> [ 1/12] FROM docker.io/library/neurips23 0.1s
=> [ 2/12] RUN apt update 1.4s
=> [ 3/12] RUN apt install -y software-properties-common 14.8s
=> [ 4/12] RUN add-apt-repository -y ppa:git-core/ppa 4.2s
=> [ 5/12] RUN apt update 1.6s
=> [ 6/12] RUN DEBIAN_FRONTEND=noninteractive apt install -y git make cmake g++ libaio-dev libgoogle-perftools-dev libunwind-dev clang-format libboost-dev libboost-program-options-dev libmkl-full-dev libcpprest-dev python3.10 31.5s
=> [ 7/12] RUN git clone https://github.com/microsoft/DiskANN.git --branch 0.5.0.rc3 1.8s
=> [ 8/12] WORKDIR /home/app/DiskANN 0.0s
=> [ 9/12] RUN pip3 install virtualenv build 2.2s
=> [10/12] RUN python3 -m build 125.0s
=> ERROR [11/12] RUN pip install dist/diskannpy-0.5.0rc3-cp310-cp310-linux_x86_64.whl 0.8s
------
> [11/12] RUN pip install dist/diskannpy-0.5.0rc3-cp310-cp310-linux_x86_64.whl:
0.726 WARNING: Requirement 'dist/diskannpy-0.5.0rc3-cp310-cp310-linux_x86_64.whl' looks like a filename, but the file does not exist
0.744 Processing ./dist/diskannpy-0.5.0rc3-cp310-cp310-linux_x86_64.whl
0.749 ERROR: Could not install packages due to an OSError: [Errno 2] No such file or directory: '/home/app/DiskANN/dist/diskannpy-0.5.0rc3-cp310-cp310-linux_x86_64.whl'
0.749
------
Dockerfile:13
--------------------
11 | RUN pip3 install virtualenv build
12 | RUN python3 -m build
13 | >>> RUN pip install dist/diskannpy-0.5.0rc3-cp310-cp310-linux_x86_64.whl
14 | WORKDIR /home/app
15 |
--------------------
ERROR: failed to solve: process "/bin/sh -c pip install dist/diskannpy-0.5.0rc3-cp310-cp310-linux_x86_64.whl" did not complete successfully: exit code: 1
Install Status:
{'neurips23-ood-diskann': 'fail'}
System info:
For easy replication, I'm using an EC2 c5.2xlarge instance with image Deep Learning AMI GPU PyTorch 1.13.1 (Ubuntu 20.04) 20230818 because it comes with docker and conda preinstalled.
Specifics:
Thank you for providing such a great competition.
I would like to confirm the SSD storage for the T1 and T2 hardware. Which storage type is used for them: standard, premium, or ultra disk? Compared to a physical SSD, the standard and premium disks are very slow for building indexes. I haven't tried the ultra disk yet.
Hello, are the results out? If yes, can you please share the results publicly? Ideally, something like https://ann-benchmarks.com/ graphs would be great.
Thanks!
The 1 TB constraint for the local SSD index is considerably lower than the 1.92 TB available on the NVMe disk of the Azure Standard_L8s_v2 VMs. Could this constraint be increased?
There seems to be a mistake in t1_t2/README.md: the result of BBAnn (track 2) on text2image is incorrect.
As we discussed in PR #70, the best entry on the public query set has QPS 1540.622672933968 and recall 0.495423.
We would be grateful if you would update the entry in the results section.
I ran into problems when building the docker images in the neurips23 directory.
apt-get update:
Problem executing scripts APT::Update::Post-Invoke 'rm -f /var/cache/apt/archives/*.deb /var/cache/apt/archives/partial/*.deb /var/cache/apt/*.bin || true'
pip3 install ...:
RuntimeError: can't start new thread
But when I change the ubuntu base image from jammy to 20.04, the problems are fixed.
I think it might be because of the docker version (mine is 20.10.7). I don't see a specified version in this repo.
So I recommend the organizers share the docker version used on the evaluation machines.
I tried to reproduce the T2 baseline using the official indices and code in PR #17, but cannot get reasonable results compared to the official baseline (i.e. 2000 QPS, 0.957 Recall@10 for BIGANN-1B). So would you please release your code and configurations for the T2 baseline evaluation?
Here are my operations and results for BIGANN-1B:
pip install -U -r requirements_py38.txt to install python requirements
python install.py to build docker images
python run.py --dataset bigann-1B --algorithm diskann-t2
python data_export.py --output result.csv to export results
The final results are as follows:
algorithm,parameters,dataset,count,qps,distcomps,build,indexsize,queriessize,wspq,recall/ap
diskann-t2,DiskANN,bigann-1B,10,883.7374471075963,0.0,1000000.0,51774612.0,58585.96596698971,inf,0.0019
diskann-t2,DiskANN,bigann-1B,10,1307.73774493947,0.0,1000000.0,51774612.0,39590.97471977949,inf,0.00108
diskann-t2,DiskANN,bigann-1B,10,803.5946779065205,0.0,1000000.0,51774612.0,64428.764181067374,inf,0.00208
diskann-t2,DiskANN,bigann-1B,10,1118.5931944484782,0.0,1000000.0,51774612.0,46285.4702289937,inf,0.00133
diskann-t2,DiskANN,bigann-1B,10,1604.1311836371815,0.0,1000000.0,51774612.0,32275.796722938252,inf,0.00076
diskann-t2,DiskANN,bigann-1B,10,929.937028492744,0.0,1000000.0,51774612.0,55675.39565976534,inf,0.0017500000000000003
diskann-t2,DiskANN,bigann-1B,10,966.1165292100424,0.0,1000000.0,51774612.0,53590.44218230504,inf,0.0017399999999999998
diskann-t2,DiskANN,bigann-1B,10,990.0256363593018,0.0,1000000.0,51774612.0,52296.23365147877,inf,0.00166
diskann-t2,DiskANN,bigann-1B,10,1051.6221286578705,0.0,1000000.0,51774612.0,49233.09484375076,inf,0.00149
diskann-t2,DiskANN,bigann-1B,10,682.3607260592325,0.0,1000000.0,51774612.0,75875.72089473638,inf,0.00258
Using axel to speed up index downloads can basically only be run with -q (quiet) inside the docker container. It would be better if index downloading happened outside of the container.
Hi! Thanks for providing the scripts for evaluating results.
I found that when running python data_export.py --output res.csv, this line of code:
power_capture.detect_power_benchmarks(metrics, res)
exhausts the generator res, so the next line:
for i, (properties, run) in enumerate(res):
doesn't output anything to be written into res.csv.
I'm still studying the code, and not sure if this is a bug...
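For what it's worth, a small self-contained illustration of the suspected behaviour (the loader below is a stand-in, not the repo's actual function): a generator can only be consumed once, so whichever loop runs second sees nothing unless the results are materialized first.

```python
def load_results():                  # stand-in for the repo's results loader
    for i in range(3):
        yield ({"algo": f"algo{i}"}, f"run{i}")

res = load_results()
consumed = sum(1 for _ in res)       # e.g. detect_power_benchmarks iterating over res
print(consumed)                      # 3
print(list(res))                     # [] -- nothing left for the export loop

res = list(load_results())           # possible fix: materialize once, reuse the list
sum(1 for _ in res)                  # a first pass still sees all 3 runs
print(len(res))                      # 3 -- the export loop can iterate again
```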
There is some non-trivial setup needed for faiss_t3 to work with docker. I'll add this to a README in the t3/faiss_t3/ directory.
In general, docker won't be able to accommodate all the installation steps needed for various T3 submissions (host drivers and libraries, etc.). This was expected :-)
Hi there,
There is a default value for radius here. I am wondering whether this default value of 96237 will be used for the final evaluation in the benchmark?
Hi, I have a question about using this framework with a non-python ANN implementation.
It looks like this is mostly a fork of ann-benchmarks, so the only option for using it outside of Python is to hack together a client/server setup, as has been done for a few algos in ann-benchmarks. This obviously handicaps and complicates non-python implementations, as it introduces the costs of context switching, serialization, and data transfer among processes.
I asked and was told early on by project organizers that the big-ann challenge would support non-python implementations:
So I'm wondering if there has been progress here, or any idea of how it might work?
It seems like it wouldn't be terribly difficult to refactor the code so that the containers executed by runner.py can have any entrypoint, e.g., a program in another language. The interface between runner and algorithm would then simply be some standard file format for inputs and nearest neighbor results. If that sounds like a good idea I can try to implement it. Otherwise maybe we can use this ticket for discussing alternatives.
Thanks
-Alex
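To make that interface idea a bit more concrete, here is a rough sketch of what a language-agnostic, file-based contract could look like; the file names, mount points, and the result layout below are illustrative assumptions, not an agreed format (only the uint32 count/dimension header of the *bin files matches the existing datasets).

```python
import numpy as np

def read_fbin(path):
    """Read a *.fbin file: uint32 n, uint32 d header followed by n*d float32 values."""
    with open(path, "rb") as f:
        n, d = (int(x) for x in np.fromfile(f, dtype=np.uint32, count=2))
        return np.fromfile(f, dtype=np.float32, count=n * d).reshape(n, d)

def write_neighbors(path, ids):
    """Write nearest-neighbor ids as a uint32 shape header plus int32 payload (layout assumed)."""
    with open(path, "wb") as f:
        np.asarray(ids.shape, dtype=np.uint32).tofile(f)
        np.ascontiguousarray(ids, dtype=np.int32).tofile(f)

if __name__ == "__main__":
    queries = read_fbin("/data/queries.fbin")            # path the runner would mount (assumed)
    ids = np.zeros((len(queries), 10), dtype=np.int32)   # placeholder: the real algorithm goes here
    write_neighbors("/results/neighbors.ibin", ids)
```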
For the sparse dataset "base_small.csr" there is no documentation of how to handle this dataset format, for example how to read it into a matrix.
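In the meantime, here is a hedged reading sketch, assuming the file uses the CSR-on-disk layout I believe the sparse datasets in this repo follow (int64 nrow/ncol/nnz, then int64 indptr, int32 indices, float32 data); if the real layout differs, only the dtypes need adjusting.

```python
import numpy as np
from scipy.sparse import csr_matrix

def read_csr_file(fname):
    """Read a sparse matrix stored as: int64 [nrow, ncol, nnz], int64 indptr
    (nrow+1), int32 indices (nnz), float32 data (nnz). Layout assumed."""
    with open(fname, "rb") as f:
        nrow, ncol, nnz = np.fromfile(f, dtype=np.int64, count=3)
        indptr = np.fromfile(f, dtype=np.int64, count=nrow + 1)
        indices = np.fromfile(f, dtype=np.int32, count=nnz)
        data = np.fromfile(f, dtype=np.float32, count=nnz)
    return csr_matrix((data, indices, indptr), shape=(nrow, ncol))

# X = read_csr_file("data/sparse-small/base_small.csr")
# print(X.shape, X.nnz)
```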
When I use this command, there is an error:
python run.py --neurips23track ood --algorithm diskann --dataset random-xs
How do I fix this issue? Can anybody help me? Thanks.
Preparing datasets with 10000 random points and 1000 queries.
Computing groundtruth
2023-09-04 21:25:39,786 - annb - INFO - running only diskann
Traceback (most recent call last):
File "run.py", line 6, in <module>
main()
File "/home/cy/work_cy/big-ann-benchmarks/benchmark/main.py", line 236, in main
for image in docker_client.images.list():
File "/home/cy/.local/lib/python3.8/site-packages/docker/models/images.py", line 230, in list
resp = self.client.api.images(name=name, all=all, filters=filters)
File "/home/cy/.local/lib/python3.8/site-packages/docker/api/image.py", line 93, in images
res = self._result(self._get(self._url("/images/json"), params=params),
File "/home/cy/.local/lib/python3.8/site-packages/docker/utils/decorators.py", line 46, in inner
return f(self, *args, **kwargs)
File "/home/cy/.local/lib/python3.8/site-packages/docker/api/client.py", line 191, in _get
return self.get(url, **self._set_request_timeout(kwargs))
File "/home/cy/.local/lib/python3.8/site-packages/requests/sessions.py", line 602, in get
return self.request("GET", url, **kwargs)
File "/home/cy/.local/lib/python3.8/site-packages/requests/sessions.py", line 589, in request
resp = self.send(prep, **send_kwargs)
File "/home/cy/.local/lib/python3.8/site-packages/requests/sessions.py", line 703, in send
r = adapter.send(request, **kwargs)
File "/home/cy/.local/lib/python3.8/site-packages/requests/adapters.py", line 486, in send
resp = conn.urlopen(
File "/home/cy/.local/lib/python3.8/site-packages/urllib3/connectionpool.py", line 790, in urlopen
response = self._make_request(
File "/home/cy/.local/lib/python3.8/site-packages/urllib3/connectionpool.py", line 496, in _make_request
conn.request(
TypeError: request() got an unexpected keyword argument 'chunked'
When running python install.py --neurips23track ood --algorithm diskann or python install.py --neurips23track streaming --algorithm diskann, an issue is encountered where the installation of diskann fails because the file name for the compiled diskann is specified as dist/diskannpy-0.5.0rc3-cp310-cp310-linux_x86_64.whl when the file created by the build is dist/diskannpy-0.5.0rc2-cp310-cp310-linux_x86_64.whl.
This seems to stem from this commit in which the diskann version is upgraded to rc3, but the rc3 branch does not seem to compile an output with the correct name, causing the issue.
Fixed for me by changing the filename in the dockerfiles as described, but if the upgrade to rc3 matters it might therefore not be getting applied.
When running the cmd:
python3 run.py --dataset msturing-10M-clustered --algorithm diskann --neurips23track streaming --runbook_path neurips23/streaming/delete_runbook.yaml
This error continuously pops up after several iterations:
...
...
2023-08-10 01:08:16,415 - annb.31d58d2d344a - INFO - Step 54 took 3.0553574562072754s.
2023-08-10 01:08:16,621 - annb.31d58d2d344a - INFO - #active pts 4539934 #unprocessed deletes 1500000
2023-08-10 01:09:59,611 - annb.31d58d2d344a - ERROR - Container.wait for container 31d58d2d344a failed with exception
2023-08-10 01:09:59,611 - annb.31d58d2d344a - ERROR - Invoked with ['--dataset', 'msturing-10M-clustered', '--algorithm', 'diskann', '--module', 'neurips23.streaming.diskann.diskann-str', '--constructor', 'diskann', '--runs', '5', '--count', '10', '--neurips23track', 'streaming', '--runbook_path', 'neurips23/streaming/delete_runbook.yaml', '["euclidean", {"R": 64, "L": 50, "insert_threads": 16, "consolidate_threads": 16}]', '[{"Ls": 100, "T": 16}]']
Traceback (most recent call last):
File "/home/impanyu/big-ann-benchmarks/ann/lib/python3.10/site-packages/urllib3/response.py", line 710, in _error_catcher
yield
File "/home/impanyu/big-ann-benchmarks/ann/lib/python3.10/site-packages/urllib3/response.py", line 1077, in read_chunked
self._update_chunk_length()
File "/home/impanyu/big-ann-benchmarks/ann/lib/python3.10/site-packages/urllib3/response.py", line 1005, in _update_chunk_length
line = self._fp.fp.readline() # type: ignore[union-attr]
File "/usr/lib/python3.10/socket.py", line 705, in readinto
return self._sock.recv_into(b)
TimeoutError: timed out
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/home/impanyu/big-ann-benchmarks/ann/lib/python3.10/site-packages/requests/models.py", line 816, in generate
yield from self.raw.stream(chunk_size, decode_content=True)
File "/home/impanyu/big-ann-benchmarks/ann/lib/python3.10/site-packages/urllib3/response.py", line 937, in stream
yield from self.read_chunked(amt, decode_content=decode_content)
File "/home/impanyu/big-ann-benchmarks/ann/lib/python3.10/site-packages/urllib3/response.py", line 1065, in read_chunked
with self._error_catcher():
File "/usr/lib/python3.10/contextlib.py", line 153, in exit
self.gen.throw(typ, value, traceback)
File "/home/impanyu/big-ann-benchmarks/ann/lib/python3.10/site-packages/urllib3/response.py", line 715, in _error_catcher
raise ReadTimeoutError(self._pool, None, "Read timed out.") from e # type: ignore[arg-type]
urllib3.exceptions.ReadTimeoutError: UnixHTTPConnectionPool(host='localhost', port=None): Read timed out.
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/home/impanyu/big-ann-benchmarks/benchmark/runner.py", line 318, in run_docker
return_value = container.wait(timeout=timeout)
File "/home/impanyu/big-ann-benchmarks/ann/lib/python3.10/site-packages/docker/models/containers.py", line 514, in wait
return self.client.api.wait(self.id, **kwargs)
File "/home/impanyu/big-ann-benchmarks/ann/lib/python3.10/site-packages/docker/utils/decorators.py", line 19, in wrapped
return f(self, resource_id, *args, **kwargs)
File "/home/impanyu/big-ann-benchmarks/ann/lib/python3.10/site-packages/docker/api/container.py", line 1338, in wait
res = self._post(url, timeout=timeout, params=params)
File "/home/impanyu/big-ann-benchmarks/ann/lib/python3.10/site-packages/docker/utils/decorators.py", line 46, in inner
return f(self, *args, **kwargs)
File "/home/impanyu/big-ann-benchmarks/ann/lib/python3.10/site-packages/docker/api/client.py", line 233, in _post
return self.post(url, **self._set_request_timeout(kwargs))
File "/home/impanyu/big-ann-benchmarks/ann/lib/python3.10/site-packages/requests/sessions.py", line 637, in post
return self.request("POST", url, data=data, json=json, **kwargs)
File "/home/impanyu/big-ann-benchmarks/ann/lib/python3.10/site-packages/requests/sessions.py", line 589, in request
resp = self.send(prep, **send_kwargs)
File "/home/impanyu/big-ann-benchmarks/ann/lib/python3.10/site-packages/requests/sessions.py", line 747, in send
r.content
File "/home/impanyu/big-ann-benchmarks/ann/lib/python3.10/site-packages/requests/models.py", line 899, in content
self._content = b"".join(self.iter_content(CONTENT_CHUNK_SIZE)) or b""
File "/home/impanyu/big-ann-benchmarks/ann/lib/python3.10/site-packages/requests/models.py", line 822, in generate
raise ConnectionError(e)
requests.exceptions.ConnectionError: UnixHTTPConnectionPool(host='localhost', port=None): Read timed out.
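One hedged workaround idea (sketch only; names follow the traceback rather than the repo's actual runner code): instead of a single long container.wait(), poll in shorter intervals and retry on read timeouts while the container is still running.

```python
import requests

def wait_for_container(container, total_timeout, poll=60):
    """Wait for a docker-py container, tolerating read timeouts from the daemon."""
    waited = 0
    while waited < total_timeout:
        try:
            return container.wait(timeout=poll)          # blocks for at most `poll` seconds
        except (requests.exceptions.ConnectionError,
                requests.exceptions.ReadTimeout):
            container.reload()                           # refresh status from the daemon
            if container.status != "running":
                return container.wait(timeout=poll)      # finished between polls
            waited += poll                               # still working; keep waiting
    raise TimeoutError(f"container {container.short_id} exceeded {total_timeout}s")
```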
> python3 run.py --neurips23track filter --algorithm faiss --dataset yfcc-10M
downloading https://dl.fbaipublicfiles.com/billion-scale-ann-benchmarks/yfcc100M/query.public.100K.u8bin -> data/yfcc100M/query.public.100K.u8bin...
[2.55 s] downloaded 18.31 MiB / 18.31 MiB at 7.19 MiB/s
download finished in 2.55 s, total size 19200008 bytes
downloading https://dl.fbaipublicfiles.com/billion-scale-ann-benchmarks/yfcc100M/GT.public.ibin -> data/yfcc100M/GT.public.ibin...
[1.45 s] downloaded 7.63 MiB / 7.63 MiB at 5.28 MiB/s
download finished in 1.45 s, total size 8000008 bytes
file data/yfcc100M/ already exists
file data/yfcc100M/ already exists
downloading https://dl.fbaipublicfiles.com/billion-scale-ann-benchmarks/yfcc100M/base.metadata.10M.spmat -> data/yfcc100M/base.metadata.10M.spmat...
[94.03 s] downloaded 901.87 MiB / 901.87 MiB at 9.59 MiB/s
download finished in 94.03 s, total size 945683840 bytes
downloading https://dl.fbaipublicfiles.com/billion-scale-ann-benchmarks/yfcc100M/query.metadata.public.100K.spmat -> data/yfcc100M/query.metadata.public.100K.spmat...
[1.00 s] downloaded 1.82 MiB / 1.82 MiB at 1.82 MiB/s
download finished in 1.00 s, total size 1907024 bytes
2023-07-18 19:55:12,243 - annb - INFO - running only faiss
2023-07-18 19:55:12,319 - annb - INFO - Order: [Definition(algorithm='faiss', constructor='FAISS', module='neurips23.filter.faiss.faiss', docker_tag='neurips23-filter-faiss', docker_volumes=[], arguments=['euclidean', {'indexkey': 'IVF16384,SQ8', 'binarysig': True, 'threads': 16}], query_argument_groups=[[{'nprobe': 1, 'mt_threshold': 0.0003}], [{'nprobe': 4, 'mt_threshold': 0.0003}], [{'nprobe': 16, 'mt_threshold': 0.0003}], [{'nprobe': 32, 'mt_threshold': 0.0003}], [{'nprobe': 64, 'mt_threshold': 0.0003}], [{'nprobe': 96, 'mt_threshold': 0.0003}], [{'nprobe': 1, 'mt_threshold': 0.0001}], [{'nprobe': 4, 'mt_threshold': 0.0001}], [{'nprobe': 16, 'mt_threshold': 0.0001}], [{'nprobe': 32, 'mt_threshold': 0.0001}], [{'nprobe': 64, 'mt_threshold': 0.0001}], [{'nprobe': 96, 'mt_threshold': 0.0001}], [{'nprobe': 1, 'mt_threshold': 0.01}], [{'nprobe': 4, 'mt_threshold': 0.01}], [{'nprobe': 16, 'mt_threshold': 0.01}], [{'nprobe': 32, 'mt_threshold': 0.01}], [{'nprobe': 64, 'mt_threshold': 0.01}], [{'nprobe': 96, 'mt_threshold': 0.01}]], disabled=False)]
RW Namespace(dataset='yfcc-10M', count=10, definitions='algos-2021.yaml', algorithm='faiss', docker_tag=None, list_algorithms=False, force=False, rebuild=False, runs=5, timeout=43200, max_n_algorithms=-1, power_capture='', t3=False, nodocker=False, upload_index=False, download_index=False, blob_prefix=None, sas_string=None, private_query=False, neurips23track='filter', runbook_path='neurips23/streaming/simple_runbook.yaml')
Setting container wait timeout to 30 minutes
2023-07-18 19:55:12,762 - annb.d25eedf2531c - INFO - Created container d25eedf2531c: CPU limit 0-11, mem limit 25092139776, timeout 1800, command ['--dataset', 'yfcc-10M', '--algorithm', 'faiss', '--module', 'neurips23.filter.faiss.faiss', '--constructor', 'FAISS', '--runs', '5', '--count', '10', '--neurips23track', 'filter', '["euclidean", {"indexkey": "IVF16384,SQ8", "binarysig": true, "threads": 16}]', '[{"nprobe": 1, "mt_threshold": 0.0003}]', '[{"nprobe": 4, "mt_threshold": 0.0003}]', '[{"nprobe": 16, "mt_threshold": 0.0003}]', '[{"nprobe": 32, "mt_threshold": 0.0003}]', '[{"nprobe": 64, "mt_threshold": 0.0003}]', '[{"nprobe": 96, "mt_threshold": 0.0003}]', '[{"nprobe": 1, "mt_threshold": 0.0001}]', '[{"nprobe": 4, "mt_threshold": 0.0001}]', '[{"nprobe": 16, "mt_threshold": 0.0001}]', '[{"nprobe": 32, "mt_threshold": 0.0001}]', '[{"nprobe": 64, "mt_threshold": 0.0001}]', '[{"nprobe": 96, "mt_threshold": 0.0001}]', '[{"nprobe": 1, "mt_threshold": 0.01}]', '[{"nprobe": 4, "mt_threshold": 0.01}]', '[{"nprobe": 16, "mt_threshold": 0.01}]', '[{"nprobe": 32, "mt_threshold": 0.01}]', '[{"nprobe": 64, "mt_threshold": 0.01}]', '[{"nprobe": 96, "mt_threshold": 0.01}]']
2023-07-18 19:55:13,268 - annb.d25eedf2531c - INFO - ['euclidean', {'indexkey': 'IVF16384,SQ8', 'binarysig': True, 'threads': 16}]
2023-07-18 19:55:13,268 - annb.d25eedf2531c - INFO - Trying to instantiate neurips23.filter.faiss.faiss.FAISS(['euclidean', {'indexkey': 'IVF16384,SQ8', 'binarysig': True, 'threads': 16}])
2023-07-18 19:55:13,305 - annb.d25eedf2531c - INFO - {'indexkey': 'IVF16384,SQ8', 'binarysig': True, 'threads': 16}
2023-07-18 19:55:13,305 - annb.d25eedf2531c - INFO - Running faiss on yfcc-10M
2023-07-18 19:55:13,305 - annb.d25eedf2531c - INFO - preparing binary signatures
2023-07-18 19:55:40,382 - annb.d25eedf2531c - INFO - writing to data/yfcc-10M.IVF16384,SQ8.binarysig
2023-07-18 19:55:44,039 - annb.d25eedf2531c - INFO - Traceback (most recent call last):
2023-07-18 19:55:44,039 - annb.d25eedf2531c - INFO - File "/home/app/run_algorithm.py", line 3, in <module>
2023-07-18 19:55:44,039 - annb.d25eedf2531c - INFO - run_from_cmdline()
2023-07-18 19:55:44,039 - annb.d25eedf2531c - INFO - File "/home/app/benchmark/runner.py", line 222, in run_from_cmdline
2023-07-18 19:55:44,040 - annb.d25eedf2531c - INFO - run(definition, args.dataset, args.count, args.runs, args.rebuild,
2023-07-18 19:55:44,040 - annb.d25eedf2531c - INFO - File "/home/app/benchmark/runner.py", line 69, in run
2023-07-18 19:55:44,040 - annb.d25eedf2531c - INFO - build_time = custom_runner.build(algo, dataset)
2023-07-18 19:55:44,040 - annb.d25eedf2531c - INFO - File "/home/app/benchmark/algorithms/base_runner.py", line 7, in build
2023-07-18 19:55:44,040 - annb.d25eedf2531c - INFO - algo.fit(dataset)
2023-07-18 19:55:44,040 - annb.d25eedf2531c - INFO - File "/home/app/neurips23/filter/faiss/faiss.py", line 112, in fit
2023-07-18 19:55:44,040 - annb.d25eedf2531c - INFO - xb = ds.get_dataset()
2023-07-18 19:55:44,040 - annb.d25eedf2531c - INFO - File "/home/app/benchmark/datasets.py", line 217, in get_dataset
2023-07-18 19:55:44,040 - annb.d25eedf2531c - INFO - slice = next(self.get_dataset_iterator(bs=self.nb))
2023-07-18 19:55:44,040 - annb.d25eedf2531c - INFO - File "/home/app/benchmark/datasets.py", line 190, in get_dataset_iterator
2023-07-18 19:55:44,040 - annb.d25eedf2531c - INFO - x = xbin_mmap(filename, dtype=self.dtype, maxn=self.nb)
2023-07-18 19:55:44,040 - annb.d25eedf2531c - INFO - File "/home/app/benchmark/dataset_io.py", line 96, in xbin_mmap
2023-07-18 19:55:44,040 - annb.d25eedf2531c - INFO - n, d = map(int, np.fromfile(fname, dtype="uint32", count=2))
2023-07-18 19:55:44,040 - annb.d25eedf2531c - INFO - FileNotFoundError: [Errno 2] No such file or directory: 'data/yfcc100M/base.10M.u8bin.crop_nb_10000000'
2023-07-18 19:55:44,390 - annb.d25eedf2531c - ERROR - ['euclidean', {'indexkey': 'IVF16384,SQ8', 'binarysig': True, 'threads': 16}]
Trying to instantiate neurips23.filter.faiss.faiss.FAISS(['euclidean', {'indexkey': 'IVF16384,SQ8', 'binarysig': True, 'threads': 16}])
{'indexkey': 'IVF16384,SQ8', 'binarysig': True, 'threads': 16}
Running faiss on yfcc-10M
preparing binary signatures
writing to data/yfcc-10M.IVF16384,SQ8.binarysig
Traceback (most recent call last):
File "/home/app/run_algorithm.py", line 3, in <module>
run_from_cmdline()
File "/home/app/benchmark/runner.py", line 222, in run_from_cmdline
run(definition, args.dataset, args.count, args.runs, args.rebuild,
File "/home/app/benchmark/runner.py", line 69, in run
build_time = custom_runner.build(algo, dataset)
File "/home/app/benchmark/algorithms/base_runner.py", line 7, in build
algo.fit(dataset)
File "/home/app/neurips23/filter/faiss/faiss.py", line 112, in fit
xb = ds.get_dataset()
File "/home/app/benchmark/datasets.py", line 217, in get_dataset
slice = next(self.get_dataset_iterator(bs=self.nb))
File "/home/app/benchmark/datasets.py", line 190, in get_dataset_iterator
x = xbin_mmap(filename, dtype=self.dtype, maxn=self.nb)
File "/home/app/benchmark/dataset_io.py", line 96, in xbin_mmap
n, d = map(int, np.fromfile(fname, dtype="uint32", count=2))
FileNotFoundError: [Errno 2] No such file or directory: 'data/yfcc100M/base.10M.u8bin.crop_nb_10000000'
2023-07-18 19:55:44,390 - annb.d25eedf2531c - ERROR - Child process for container d25eedf2531creturned exit code 1 with message None
Hello,
I have a problem when I follow the instructions in the README file.
First I create a conda python3.10 environment and run pip install -r requirements_py3.10.txt.
Then I run python3 install.py --algorithm pqbuddy, and it creates a docker container.
After I prepare the dataset with python3 create_dataset.py --dataset deep-10M, I run python3 run.py --algorithm pqbuddy --dataset deep-10M --rebuild and the error occurs.
File "run_algorithm.py", line 1, in
from benchmark.runner import run_from_cmdline
File "/home/app/benchmark/runner.py", line 26, in
from neurips23.common import RUNNERS
File "/home/app/neurips23/common.py", line 7, in
from neurips23.streaming.run import StreamingRunner
File "/home/app/neurips23/streaming/run.py", line 39
match entry['operation']:
^
SyntaxError: invalid syntax
Could you please help out with the problem? Thanks.
I cannot reproduce it on h8 or e8 instances, but on f32v2 instances faiss will segfault with some parameter settings. E.g., set up everything to run msturing-1B and carry out
params="
nprobe=128,quantizer_efSearch=128
nprobe=64,quantizer_efSearch=512
nprobe=128,quantizer_efSearch=256
nprobe=128,quantizer_efSearch=512
nprobe=256,quantizer_efSearch=256
nprobe=256,quantizer_efSearch=512
"
python track1_baseline_faiss/baseline_faiss.py \
--dataset msturing-1B --indexfile data/msturing-1B.IVF1M_2level_PQ64x4fsr.faissindex \
--search --searchparams $params
results in
azureuser@test:~/big-ann-benchmarks$ bash test.sh
nb processors 32
model name : Intel(R) Xeon(R) Platinum 8168 CPU @ 2.70GHz
Dataset MSTuringANNS in dimension 100, with distance euclidean, search_type knn, size: Q 100000 B 1000000000
reading data/msturing-1B.IVF1M_2level_PQ64x4fsr.faissindex
imbalance_factor= 1.5638867719477003
index size on disk: 41360658380
current RSS: 44945760256
precomputed tables size: 0
Search threads: 32
Optimize for intersection @ 10
Running evaluation on 6 searchparams
parameters inter@ 10 time(ms/q) nb distances %quantization #runs
nprobe=128,quantizer_efSearch=128 test.sh: line 12: 8954 Killed python track1_baseline_faiss/baseline_faiss.py --dataset msturing-1B --indexfile data/msturing-1B.IVF1M_2level_PQ64x4fsr.faissindex --search --searchparams $params
Any thoughts Matthijs? (Once you are back from vacation)
$ pip install -r big-ann-benchmarks/requirements_py3.10.txt
Collecting ansicolors==1.1.8 (from -r big-ann-benchmarks/requirements_py3.10.txt (line 1))
Downloading ansicolors-1.1.8-py2.py3-none-any.whl (13 kB)
Collecting docker==6.1.2 (from -r big-ann-benchmarks/requirements_py3.10.txt (line 2))
Downloading docker-6.1.2-py3-none-any.whl (148 kB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 148.1/148.1 kB 3.6 MB/s eta 0:00:00a 0:00:01
Collecting h5py==3.8.0 (from -r big-ann-benchmarks/requirements_py3.10.txt (line 3))
Downloading h5py-3.8.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (4.6 MB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 4.6/4.6 MB 22.8 MB/s eta 0:00:0000:0100:01
Collecting matplotlib==3.3.4 (from -r big-ann-benchmarks/requirements_py3.10.txt (line 4))
Downloading matplotlib-3.3.4.tar.gz (37.9 MB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 37.9/37.9 MB 23.6 MB/s eta 0:00:0000:0100:01
Preparing metadata (setup.py) ... error
error: subprocess-exited-with-error
× python setup.py egg_info did not run successfully.
│ exit code: 1
╰─> [56 lines of output]
Traceback (most recent call last):
File "/opt/conda/lib/python3.10/site-packages/pkg_resources/__init__.py", line 3031, in _dep_map
return self.__dep_map
File "/opt/conda/lib/python3.10/site-packages/pkg_resources/__init__.py", line 2828, in __getattr__
raise AttributeError(attr)
AttributeError: _DistInfoDistribution__dep_map
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/opt/conda/lib/python3.10/site-packages/pkg_resources/__init__.py", line 3022, in _parsed_pkg_info
return self._pkg_info
File "/opt/conda/lib/python3.10/site-packages/pkg_resources/__init__.py", line 2828, in __getattr__
raise AttributeError(attr)
AttributeError: _pkg_info. Did you mean: 'egg_info'?
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "<string>", line 2, in <module>
File "<pip-setuptools-caller>", line 34, in <module>
File "/tmp/pip-install-ohdk213a/matplotlib_cd1d295c77724caeb6457c930691d7e2/setup.py", line 256, in <module>
setup( # Finally, pass this all along to distutils to do the heavy lifting.
File "/opt/conda/lib/python3.10/site-packages/setuptools/__init__.py", line 152, in setup
_install_setup_requires(attrs)
File "/opt/conda/lib/python3.10/site-packages/setuptools/__init__.py", line 147, in _install_setup_requires
dist.fetch_build_eggs(dist.setup_requires)
File "/opt/conda/lib/python3.10/site-packages/setuptools/dist.py", line 812, in fetch_build_eggs
resolved_dists = pkg_resources.working_set.resolve(
File "/opt/conda/lib/python3.10/site-packages/pkg_resources/__init__.py", line 785, in resolve
new_requirements = dist.requires(req.extras)[::-1]
File "/opt/conda/lib/python3.10/site-packages/pkg_resources/__init__.py", line 2749, in requires
dm = self._dep_map
File "/opt/conda/lib/python3.10/site-packages/pkg_resources/__init__.py", line 3033, in _dep_map
self.__dep_map = self._compute_dependencies()
File "/opt/conda/lib/python3.10/site-packages/pkg_resources/__init__.py", line 3042, in _compute_dependencies
for req in self._parsed_pkg_info.get_all('Requires-Dist') or []:
File "/opt/conda/lib/python3.10/site-packages/pkg_resources/__init__.py", line 3024, in _parsed_pkg_info
metadata = self.get_metadata(self.PKG_INFO)
File "/opt/conda/lib/python3.10/site-packages/pkg_resources/__init__.py", line 1412, in get_metadata
value = self._get(path)
File "/opt/conda/lib/python3.10/site-packages/pkg_resources/__init__.py", line 1616, in _get
with open(path, 'rb') as stream:
FileNotFoundError: [Errno 2] No such file or directory: '/opt/conda/lib/python3.10/site-packages/numpy-1.25.0.dist-info/METADATA'
Edit setup.cfg to change the build options; suppress output with --quiet.
BUILDING MATPLOTLIB
matplotlib: yes [3.3.4]
python: yes [3.10.12 | packaged by conda-forge | (main, Jun 23 2023,
22:40:32) [GCC 12.3.0]]
platform: yes [linux]
sample_data: yes [installing]
tests: no [skipping due to configuration]
macosx: no [Mac OS-X only]
[end of output]
note: This error originates from a subprocess, and is likely not a problem with pip.
error: metadata-generation-failed
× Encountered error while generating package metadata.
╰─> See above for output.
note: This is an issue with the package mentioned above, not pip.
hint: See above for details.
It feels a bit weird to see a lot of activity on this repo rather than trying to contribute to the original one https://github.com/erikbern/ann-benchmarks
Is the ambition to merge it back into the main repo? Or is this just a short-lived repo anyway?
I'm happy to donate my code to something more neutral (e.g. we can set up a neutral github.com organization rather than have the code under my username). Seems like it would be beneficial to not diverge too far.
(also felt a bit weird that no one told me about this – I found out about it randomly)
@maumueller wdyt?
[11/12] RUN pip install dist/diskannpy-0.5.0rc3-cp310-cp310-linux_x86_64.whl:
1.266 WARNING: Requirement 'dist/diskannpy-0.5.0rc3-cp310-cp310-linux_x86_64.whl' looks like a filename, but the file does not exist
1.299 Processing ./dist/diskannpy-0.5.0rc3-cp310-cp310-linux_x86_64.whl
1.307 ERROR: Could not install packages due to an OSError: [Errno 2] No such file or directory: '/home/app/DiskANN/dist/diskannpy-0.5.0rc3-cp310-cp310-linux_x86_64.whl'
1.307
ERROR: failed to solve: process "/bin/sh -c pip install dist/diskannpy-0.5.0rc3-cp310-cp310-linux_x86_64.whl" did not complete successfully: exit code: 1
It seems DiskANN doesn't have a dist/ directory.
Dear all,
<tl;dr> Please add your thoughts on the future of this benchmark!
Thank you very much for participating in our NeurIPS'21 competition. The competition will end with an event on Dec 8, and you can find the timeline for this event on https://big-ann-benchmarks.com/. We hope many of you will be able to participate!
The last part of the event will be an open discussion among the participants for future directions of this competition. As organizers we have already identified some points we would like to discuss and potentially include in a future version of the benchmark.
Filtered ANNS: Can you support ANNS queries which allow filters like date range, author, or some combination of attributes? This would look like a simple SQL + ANNS query.
Streaming ANNS: Can algorithms be robust to insertions and deletions? Here we have a strong baseline (fresh-diskann: https://arxiv.org/abs/2105.09613).
Out of distribution queries: this is already a problem with T2I and we can imagine various variations
Better vector compression: Most approaches use some variant of product quantization as vector compression, but can we get more accurate estimation, maybe at the price of more expensive decoding?
Please let us know what you think about these topics, and add your own!
Thanks!
Hello!
Thanks for providing the scripts for running baselines. The following one liner:
python -u track1_baseline_faiss/baseline_faiss.py --dataset bigann-100M \
--indexkey OPQ64_128,IVF1048576_HNSW32,PQ64x4fsr \
--maxtrain 100000000 \
--two_level_clustering \
--build \
--add_splits 30 \
--indexfile data/track1_baseline_faiss/deep-100M.IVF1M_2level_PQ64x4fsr.faissindex \
--quantizer_efConstruction 200 \
--quantizer_add_efSearch 80
produces output on F32s_v2 with 64G RAM:
args= Namespace(M0=-1, add_bs=100000, add_splits=30, autotune_max=[], autotune_range=[], basedir=None, build=True, buildthreads=-1, by_residual=-1, clustering_niter=-1, dataset='bigann-100M', indexfile='data/track1_baseline_faiss/deep-100M.IVF1M_2level_PQ64x4fsr.faissindex', indexkey='OPQ64_128,IVF1048576_HNSW32,PQ64x4fsr', inter=True, k=10, maxRAM=-1, maxtrain=100000000, min_test_duration=3.0, n_autotune=500, no_precomputed_tables=False, pairwise_quantization='', parallel_mode=-1, prepare=False, quantizer_add_efSearch=80, quantizer_efConstruction=200, query_bs=-1, radius=96237, search=False, searchparams=['autotune'], searchthreads=-1, stop_at_split=-1, train_on_gpu=False, two_level_clustering=True)
nb processors 32
model name : Intel(R) Xeon(R) Platinum 8272CL CPU @ 2.60GHz
Dataset BigANNDataset in dimension 128, with distance euclidean, search_type knn, size: Q 10000 B 100000000
build index, key= OPQ64_128,IVF1048576_HNSW32,PQ64x4fsr
Build-time number of threads: 32
metric type 1
Update add-time parameters
update quantizer efSearch= 16 -> 80
update quantizer efConstruction= 40 -> 200
getting first 100000000 dataset vectors for training
train, size (100000000, 128)
Forcing OPQ training PQ to PQ4
training vector transform
transform trainset
Killed
Can you please explain what could be wrong? Is the expectation to allocate 10% of data for training?
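Not an official answer, but a hedged back-of-the-envelope check suggests memory is the likely issue: --maxtrain 100000000 selects the full 100M base vectors as the training set, and once they are converted to float32 for training they alone approach the machine's 64 GB, so reducing --maxtrain (i.e. training on a subsample) seems necessary here.

```python
# Rough size of the training set implied by --maxtrain 100000000 on bigann-100M
# (128-dimensional vectors), assuming conversion to float32 for training.
n_train, dim, bytes_per_float32 = 100_000_000, 128, 4
train_bytes = n_train * dim * bytes_per_float32
print(f"{train_bytes / 2**30:.1f} GiB")   # ~47.7 GiB before any transform/index overhead
```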
When running with the latest Docker Desktop on mac (https://docs.docker.com/desktop/install/mac-install/), running run.py stops with a "KeyError" around line 302 of runner.py (there is no item in the dict with key "Error").
With an older version of docker (Docker version 20.10.11, build dea9396) the code runs as expected, but the logs show nothing (although the container is running, and the results folder is created with the new results).
@maumueller T3 algorithms will be tied to certain hardware. Instead of putting T3 algo definitions in the default algos.yaml, would it be better to put them into a separate one?
Could the building of the index (which has a 4-day constraint) start by downloading one of the available baseline indexes and then applying changes to it (possibly taking a couple of days)?
I am trying to get the bigann-1B dataset using the following command, but it fails with an error.
Is there an alternative source for this dataset?
(big-ann) ubuntu@ip-172-31-2-12:~/pgvector_testing/big-ann-benchmarks$ python create_dataset.py --dataset bigann-1B
downloading https://dl.fbaipublicfiles.com/billion-scale-ann-benchmarks/bigann/query.public.10K.u8bin -> data/bigann/query.public.10K.u8bin...
[0.24 s] downloaded 1.22 MiB / 1.22 MiB at 5.10 MiB/s
download finished in 0.24 s, total size 1280008 bytes
downloading https://dl.fbaipublicfiles.com/billion-scale-ann-benchmarks/bigann/GT.public.1B.ibin -> data/bigann/GT.public.1B.ibin...
[0.40 s] downloaded 7.63 MiB / 7.63 MiB at 19.18 MiB/s
download finished in 0.40 s, total size 8000008 bytes
downloading https://dl.fbaipublicfiles.com/billion-scale-ann-benchmarks/bigann/query.private.799253207.10K.u8bin -> data/bigann/query.private.799253207.10K.u8bin...
Traceback (most recent call last):
File "/home/ubuntu/pgvector_testing/big-ann-benchmarks/create_dataset.py", line 16, in
ds.prepare(True if args.skip_data else False)
File "/home/ubuntu/pgvector_testing/big-ann-benchmarks/benchmark/datasets.py", line 140, in prepare
download(self.private_qs_url, outfile)
File "/home/ubuntu/pgvector_testing/big-ann-benchmarks/benchmark/dataset_io.py", line 25, in download
inf = urlopen(src)
File "/usr/lib/python3.10/urllib/request.py", line 216, in urlopen
return opener.open(url, data, timeout)
File "/usr/lib/python3.10/urllib/request.py", line 525, in open
response = meth(req, response)
File "/usr/lib/python3.10/urllib/request.py", line 634, in http_response
response = self.parent.error(
File "/usr/lib/python3.10/urllib/request.py", line 563, in error
return self._call_chain(*args)
File "/usr/lib/python3.10/urllib/request.py", line 496, in _call_chain
result = func(*args)
File "/usr/lib/python3.10/urllib/request.py", line 643, in http_error_default
raise HTTPError(req.full_url, code, msg, hdrs, fp)
urllib.error.HTTPError: HTTP Error 403: Forbidden
I wonder if it is okay to use all methods (interfaces) exposed in the Dataset class when implementing the algorithms to be used in the benchmark. I am trying to access the file directly by using the get_dataset_fn method instead of the get_dataset_iterator method, and I wonder if this is an issue.
Also, there seems to be something wrong with the implementation of the get_dataset_fn method for small datasets. In get_dataset_fn, if there is an original (1-billion) file, the path of the original file is returned. When used in the get_dataset_iterator method, this seems reasonable because only a part of the original file is used via mmap. However, if get_dataset_fn is an externally exposed interface, it would be appropriate to give the path of the actual small file. Or, when using the get_dataset_fn method, if it is a small dataset but not a crop file, I am wondering if I should use only a part of the file.
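For what it's worth, a hedged sketch of how a caller can cope with get_dataset_fn pointing at the original (1-billion) file: memory-map it and slice only the first nb vectors, assuming the usual *bin layout of a uint32 count and uint32 dimension followed by the raw vectors (attribute names like ds.nb reflect my reading of the code).

```python
import numpy as np

def mmap_first_n(fname, nb, dtype=np.uint8):
    """Memory-map a *bin file (uint32 n, uint32 d header, then n*d values of
    `dtype`) and return a view of only the first `nb` vectors."""
    n, d = (int(x) for x in np.fromfile(fname, dtype=np.uint32, count=2))
    assert nb <= n
    data = np.memmap(fname, dtype=dtype, mode="r", offset=8, shape=(n, d))
    return data[:nb]

# e.g. work on the 10M crop even though the returned path is the 1B file:
# xb = mmap_first_n(ds.get_dataset_fn(), ds.nb)
```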
big-ann-benchmarks/benchmark/main.py, line 145 (commit 8180e0e)
When I look at the metadata file for the yfcc100M dataset, I can only make out the first two values, the number of nodes and the number of labels.
For the rest of the data format, I guess it is a sparse matrix format, but I can't tell for sure.
Please tell me where I can get more information.
Thanks!
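In case it helps while waiting for an answer: my guess (unverified) is that the .spmat files use the same CSR-on-disk layout as the sparse track's files, i.e. int64 nrow/ncol/nnz, then int64 indptr, int32 indices, float32 data; under that assumption the labels of point i are simply the column indices of row i.

```python
# Assuming the CSR reading helper sketched earlier for base_small.csr
# (read_csr_file) and the same on-disk layout for the .spmat metadata:
meta = read_csr_file("data/yfcc100M/base.metadata.10M.spmat")
labels_of_point_0 = meta.indices[meta.indptr[0]:meta.indptr[1]]   # label ids of point 0
print(meta.shape, labels_of_point_0[:10])
```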
Otherwise (without specifying the track), an error will be generated:
> python3 plot.py --dataset yfcc-10M
writing output to results/yfcc-10M.png
Traceback (most recent call last):
File "/home/nop/projects/nips23/big-ann-benchmarks/plot.py", line 161, in <module>
raise Exception('Nothing to plot')
Exception: Nothing to plot
After adding --neurips23track filter
everything works as expected
> python3 plot.py --dataset yfcc-10M --neurips23track filter
writing output to results/yfcc-10M.png
Computing knn metrics
0: Faiss(('IVF16384,SQ8', {'nprobe': 32, 'mt_threshold': 0.0001})) 0.847 3069.679
Computing knn metrics
1: Faiss(('IVF16384,SQ8', {'nprobe': 64, 'mt_threshold': 0.0001})) 0.901 2421.102
...
The correct command line was somewhat unclear from the description on the main page.
Python: 3.8.5
(venv) (base) dmitry@dmitrykan:/datadrive/big-ann-benchmarks$ pip install -r requirements.txt
Collecting ansicolors==1.1.8
Using cached ansicolors-1.1.8-py2.py3-none-any.whl (13 kB)
Collecting docker==2.6.1
Using cached docker-2.6.1-py2.py3-none-any.whl (117 kB)
Collecting h5py==2.10.0
Using cached h5py-2.10.0-cp38-cp38-manylinux1_x86_64.whl (2.9 MB)
Collecting matplotlib==2.1.0
Using cached matplotlib-2.1.0.tar.gz (35.7 MB)
ERROR: Command errored out with exit status 1:
command: /datadrive/big-ann-benchmarks/venv/bin/python -c 'import io, os, sys, setuptools, tokenize; sys.argv[0] = '"'"'/tmp/pip-install-glx_dtq7/matplotlib_c5b2d6b3ba9e41898b05b991281ae963/setup.py'"'"'; __file__='"'"'/tmp/pip-install-glx_dtq7/matplotlib_c5b2d6b3ba9e41898b05b991281ae963/setup.py'"'"';f = getattr(tokenize, '"'"'open'"'"', open)(__file__) if os.path.exists(__file__) else io.StringIO('"'"'from setuptools import setup; setup()'"'"');code = f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' egg_info --egg-base /tmp/pip-pip-egg-info-vl5anihm
cwd: /tmp/pip-install-glx_dtq7/matplotlib_c5b2d6b3ba9e41898b05b991281ae963/
Complete output (78 lines):
IMPORTANT WARNING:
pkg-config is not installed.
matplotlib may not be able to find some of its dependencies
============================================================================
Edit setup.cfg to change the build options
BUILDING MATPLOTLIB
matplotlib: yes [2.1.0]
python: yes [3.8.5 (default, Sep 4 2020, 07:30:14) [GCC
7.3.0]]
platform: yes [linux]
REQUIRED DEPENDENCIES AND EXTENSIONS
numpy: yes [not found. pip may install it below.]
six: yes [six was not found.pip will attempt to install
it after matplotlib.]
dateutil: yes [dateutil was not found. It is required for date
axis support. pip/easy_install may attempt to
install it after matplotlib.]
backports.functools_lru_cache: yes [Not required]
subprocess32: yes [Not required]
pytz: yes [pytz was not found. pip/easy_install may
attempt to install it after matplotlib.]
cycler: yes [cycler was not found. pip/easy_install may
attempt to install it after matplotlib.]
tornado: yes [tornado was not found. It is required for the
WebAgg backend. pip/easy_install may attempt to
install it after matplotlib.]
pyparsing: yes [pyparsing was not found. It is required for
mathtext support. pip/easy_install may attempt to
install it after matplotlib.]
libagg: yes [pkg-config information for 'libagg' could not
be found. Using local copy.]
freetype: no [The C/C++ header for freetype2 (ft2build.h)
could not be found. You may need to install the
development package.]
png: yes [version 1.6.37]
qhull: yes [pkg-config information for 'libqhull' could not
be found. Using local copy.]
OPTIONAL SUBPACKAGES
sample_data: yes [installing]
toolkits: yes [installing]
tests: no [skipping due to configuration]
toolkits_tests: no [skipping due to configuration]
OPTIONAL BACKEND EXTENSIONS
macosx: no [Mac OS-X only]
qt5agg: no [PySide2 not found; PyQt5 not found]
qt4agg: no [PySide not found; PyQt4 not found]
gtk3agg: no [Requires pygobject to be installed.]
Traceback (most recent call last):
File "<string>", line 1, in <module>
File "/tmp/pip-install-glx_dtq7/matplotlib_c5b2d6b3ba9e41898b05b991281ae963/setup.py", line 216, in <module>
pkg_help = pkg.install_help_msg()
File "/tmp/pip-install-glx_dtq7/matplotlib_c5b2d6b3ba9e41898b05b991281ae963/setupext.py", line 595, in install_help_msg
release = platform.linux_distribution()[0].lower()
AttributeError: module 'platform' has no attribute 'linux_distribution'
gtk3cairo: no [Requires cairocffi or pycairo to be installed.]
gtkagg: no [Requires pygtk]
tkagg: yes [installing; run-time loading from Python Tcl /
Tk]
wxagg: no [requires wxPython]
gtk: no [Requires pygtk]
agg: yes [installing]
cairo: no [cairocffi or pycairo not found]
windowing: no [Microsoft Windows only]
OPTIONAL LATEX DEPENDENCIES
dvipng: no
ghostscript: no
latex: no
pdftops: no
OPTIONAL PACKAGE DATA
dlls: no [skipping due to configuration]
============================================================================
----------------------------------------
WARNING: Discarding https://files.pythonhosted.org/packages/6c/90/cf10bb2020d2811da811a49601f6eafcda022c6ccd296fd05aba093dee96/matplotlib-2.1.0.tar.gz#sha256=4b5f16c9cefde553ea79975305dcaa67c8e13d927b6e55aa14b4a8d867e25387 (from https://pypi.org/simple/matplotlib/). Command errored out with exit status 1: python setup.py egg_info Check the logs for full command output.
ERROR: Could not find a version that satisfies the requirement matplotlib==2.1.0 (from versions: 0.86, 0.86.1, 0.86.2, 0.91.0, 0.91.1, 1.0.1, 1.1.0, 1.1.1, 1.2.0, 1.2.1, 1.3.0, 1.3.1, 1.4.0, 1.4.1rc1, 1.4.1, 1.4.2, 1.4.3, 1.5.0, 1.5.1, 1.5.2, 1.5.3, 2.0.0b1, 2.0.0b2, 2.0.0b3, 2.0.0b4, 2.0.0rc1, 2.0.0rc2, 2.0.0, 2.0.1, 2.0.2, 2.1.0rc1, 2.1.0, 2.1.1, 2.1.2, 2.2.0rc1, 2.2.0, 2.2.2, 2.2.3, 2.2.4, 2.2.5, 3.0.0rc2, 3.0.0, 3.0.1, 3.0.2, 3.0.3, 3.1.0rc1, 3.1.0rc2, 3.1.0, 3.1.1, 3.1.2, 3.1.3, 3.2.0rc1, 3.2.0rc3, 3.2.0, 3.2.1, 3.2.2, 3.3.0rc1, 3.3.0, 3.3.1, 3.3.2, 3.3.3, 3.3.4, 3.4.0rc1, 3.4.0rc2, 3.4.0rc3, 3.4.0, 3.4.1, 3.4.2, 3.4.3, 3.5.0b1)
ERROR: No matching distribution found for matplotlib==2.1.0
@alexklibisz Since FAISS uses D,I order to return range search results, in a recent commit I changed the range search result return order in the base ANN class to D,I.
Please consider updating the HttpANN class as well.
When downloading datasets, we met two errors; the error info is shown below:
A.
Initializing download: https://dl.fbaipublicfiles.com/billion-scale-ann-benchmarks/bigann/base.1B.u8bin
Unable to connect to server dl.fbaipublicfiles.com:80
B.
Initializing download: https://comp21storage.blob.core.windows.net/publiccontainer/comp21/MSFT-TURING-ANNS/base1b.fbin
HTTP/1.1 400 The account being accessed does not support http.
The current workflow to run algorithm X on dataset Y is something like this:
1. python install.py builds the docker container
2. python create_dataset.py sets up the datasets
3. python run.py --dataset Y --algorithm X mounts data/, results/, benchmark/ into the container for X
4. python plot.py / data_export.py / ... to evaluate the results
Given @harsha-simhadri's and @sourcesync's frustrations and some directions discussed in other meetings, I think we should relax step 3 a bit and allow more flexibility in the container setup. One direction could look like this (a rough entry-point sketch follows below):
1. python install.py builds the docker container; participants are expected to overwrite the entry point to point to their own implementation (file algorithms/X/Dockerfile)
2. python create_dataset.py sets up the datasets
3. The container's entry point is the participant's own run script (algorithms/X/run.{py,sh}). It is given the task, the dataset, where the results should be written, and some additional parameters. The container mounts data/, results/, and the config file that is used by the implementation (algorithms/X/config.yaml, maybe task specific), and results are written in a standard format (results/Y/X/run_identifier.hdf5)
4. python plot.py / data_export.py / ... to evaluate the results
We provide a default run script for inspiration, which would be pretty close to the current setup. Putting all the logic into the container could mean a lot of code duplication, but isolated containers will allow for much easier orchestration.
I can provide a proof-of-concept if this sounds promising.
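To make the proposal a bit more concrete, here is a hedged sketch of what such a per-algorithm entry point could look like; the argument names, paths, and the HDF5 layout are illustrative only, not a decided interface.

```python
# algorithms/X/run.py -- illustrative entry point for the proposed setup.
# The runner passes the task, the dataset name and the output path; the
# implementation reads its own config and writes results in a standard format.
import argparse
import h5py
import numpy as np
import yaml

def main():
    p = argparse.ArgumentParser()
    p.add_argument("--task", required=True)
    p.add_argument("--dataset", required=True)
    p.add_argument("--output", required=True)      # e.g. results/Y/X/run_identifier.hdf5
    p.add_argument("--config", default="algorithms/X/config.yaml")
    args = p.parse_args()

    with open(args.config) as f:
        params = yaml.safe_load(f)

    # ... build/load the index from data/ and run the queries here ...
    neighbors = np.zeros((10_000, 10), dtype=np.int32)   # placeholder results

    with h5py.File(args.output, "w") as f:
        f.attrs["algo"] = "X"
        f.attrs["dataset"] = args.dataset
        f.create_dataset("neighbors", data=neighbors)

if __name__ == "__main__":
    main()
```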
For the Azure Premium SSD used in the build machine for Task 2 (with 4 TB), have you been using Host Caching? If so, Read-Only (default) or Write-Read?
I tried to build the track-3 baseline index following:
python track3_baseline_faiss/gpu_baseline_faiss.py --dataset bigann-1B \
  --indexkey IVF1048576,SQ8 \
  --train_on_gpu \
  --build --quantizer_on_gpu_add --add_splits 30 \
  --search \
  --searchparams nprobe={1,4,16,64,256} \
  --parallel_mode 3 --quantizer_on_gpu_search
but it failed because my DRAM is 128 GB, less than the baseline's 768 GB.
Could someone provide a link to a prebuilt Track-3 baseline index? Thanks.
Hi, I wanted to know whether there are any plans for adding benchmarks for SCANN? I am not sure if benchmarks are available for SCANN on large datasets, so I was curious about this. Thanks!
@maumueller Alright, I tried the index strategy "OPQ32_128,IVF1048576_HNSW32,PQ32" on SSNPP and got the exception below. Note that I'm now defaulting to CPU on build_index for this dataset since the quantizer class doesn't support range search.
I will next try to set quantizer_on_gpu_add=False and train_on_gpu=False for build_index(). The default was True for both.
...
Training PQ slice 30/32
Clustering 65536 points in 4D to 256 clusters, redo 1 times, 25 iterations
Preprocessing in 0.00 s
Iteration 24 (0.36 s, search 0.32 s): objective=1.11718e+07 imbalance=1.174 nsplit=0
Training PQ slice 31/32
Clustering 65536 points in 4D to 256 clusters, redo 1 times, 25 iterations
Preprocessing in 0.00 s
Iteration 24 (0.36 s, search 0.32 s): objective=1.12101e+07 imbalance=1.185 nsplit=0
doing polysemous training for PQ
IndexIVFPQ::precompute_table: not precomputing table, it would be too big: 34359738368 bytes (max 2147483648)
Total train time 14384.034 s
============== SPLIT 0/1
Process Process-1:
Traceback (most recent call last):
File "/home/george/anaconda3/envs/bigann/lib/python3.8/multiprocessing/process.py", line 315, in _bootstrap
self.run()
File "/home/george/anaconda3/envs/bigann/lib/python3.8/multiprocessing/process.py", line 108, in run
self._target(*self._args, **self._kwargs)
File "/home/george/Projects/BigANN/harsha/big-ann-benchmarks/benchmark/main.py", line 45, in run_worker
run_no_docker(definition, args.dataset, args.count,
File "/home/george/Projects/BigANN/harsha/big-ann-benchmarks/benchmark/runner.py", line 268, in run_no_docker
run_from_cmdline(cmd)
File "/home/george/Projects/BigANN/harsha/big-ann-benchmarks/benchmark/runner.py", line 182, in run_from_cmdline
run(definition, args.dataset, args.count, args.runs, args.rebuild)
File "/home/george/Projects/BigANN/harsha/big-ann-benchmarks/benchmark/runner.py", line 76, in run
algo.fit(dataset)
File "/home/george/Projects/BigANN/harsha/big-ann-benchmarks/benchmark/algorithms/faiss_t3.py", line 274, in fit
index = build_index(buildthreads, by_residual, maxtrain,
File "/home/george/Projects/BigANN/harsha/big-ann-benchmarks/benchmark/algorithms/faiss_t3.py", line 184, in build_index
for xblock, assign in stage2:
File "/home/george/Projects/BigANN/harsha/big-ann-benchmarks/benchmark/algorithms/faiss_t3.py", line 46, in rate_limited_iter
res = res.get()
File "/home/george/anaconda3/envs/bigann/lib/python3.8/multiprocessing/pool.py", line 771, in get
raise self._value
File "/home/george/anaconda3/envs/bigann/lib/python3.8/multiprocessing/pool.py", line 125, in worker
result = (True, func(*args, **kwds))
File "/home/george/Projects/BigANN/harsha/big-ann-benchmarks/benchmark/algorithms/faiss_t3.py", line 39, in next_or_None
return next(l)
File "/home/george/Projects/BigANN/harsha/big-ann-benchmarks/benchmark/algorithms/faiss_t3.py", line 176, in produce_batches
_, assign = quantizer_gpu.search(xblock, 1)
File "/home/george/anaconda3/envs/bigann/lib/python3.8/site-packages/faiss/init.py", line 287, in replacement_search
assert d == self.d
AssertionError
I can't reproduce the results in the benchmark through https://github.com/harsha-simhadri/big-ann-benchmarks/tree/main/t1_t2. I used the same machine in Azure, Standard L8s_v2 (8 vcpus, 64 GiB memory), but the QPS for T2 is always around 100+. I checked the CPU usage during index search; it never goes higher than 100%, so search is very slow. Is there any configuration I should mind and deal with? P.S.: I used an SSD that is not the original one from the Standard L8s_v2 because of a space problem.
diskann-t2,DiskANN,bigann-1B,10,68.26486482533713,3897.3418,1000000.0,60385020.0,884569.5388762793,109.42,117092.392,0.97882
diskann-t2,DiskANN,bigann-1B,10,178.47422862340082,1835.1444,1000000.0,60385020.0,338340.2772812576,42.42,44780.5504,0.89605
diskann-t2,DiskANN,bigann-1B,10,126.13846714851216,2419.60864,1000000.0,60385020.0,478720.10311417724,61.0627,63752.9448,0.9426500000000001
diskann-t2,DiskANN,bigann-1B,10,96.7202182888592,3009.87794,1000000.0,60385020.0,624326.7547190337,80.204,83991.4693,0.96411
diskann-t2,DiskANN,bigann-1B,10,149.1239430899944,2126.48608,1000000.0,60385020.0,404931.7550807948,51.6712,54252.8475,0.9246700000000001
diskann-t2,DiskANN,bigann-1B,10,109.6912829487855,2714.58698,1000000.0,60385020.0,550499.7149882326,70.5957,72910.7472,0.9549200000000001
diskann-t2,DiskANN,bigann-1B,10,105.95468768221484,2772.53362,1000000.0,60385020.0,569913.6236530666,72.4778,76382.5452,0.95691
diskann-t2,DiskANN,bigann-1B,10,84.24591401229037,3306.53768,1000000.0,60385020.0,716770.9046540895,89.9183,96594.0784,0.9699899999999999
diskann-t2,DiskANN,bigann-1B,10,117.73705018597681,2566.1336200000005,1000000.0,60385020.0,512880.3541843128,65.7918,68474.3466,0.94913
diskann-t2,DiskANN,bigann-1B,10,101.80129233368652,2862.72176,1000000.0,60385020.0,593165.5543435407,75.4055,79424.7472,0.9599399999999999
I used an SSD which I created in Azure.
Couldn't access the slides of talks from track winners.
URLs like https://big-ann-benchmarks.com/templates/slides/* (e.g. https://big-ann-benchmarks.com/templates/slides/invited-talk-anshu.pptx) just report a 404 error.
With the current setup of using all threads available, the daemon thread (https://github.com/harsha-simhadri/big-ann-benchmarks/blob/main/benchmark/runner.py#L207-L212) seems to never get scheduled during querying, so there is no visible progress.
I am actually not sure how to fix that, since there is no way to give it some kind of higher priority. Would love to hear some thoughts!
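One hedged idea, not tested: move the progress monitor out of the worker into its own OS process (optionally pinned to a core that the benchmark's cpuset leaves free), so the kernel can still schedule it when every benchmark thread is busy; the function below is purely illustrative, not the repo's runner code.

```python
import multiprocessing
import os
import time

def monitor(tag, interval=10, spare_core=None):
    """Print a heartbeat from a separate process so it cannot be starved by
    the benchmark's own threads. `spare_core` is an assumed free CPU id."""
    if spare_core is not None:
        os.sched_setaffinity(0, {spare_core})   # Linux only
    while True:
        print(f"[monitor] {tag} still running...", flush=True)
        time.sleep(interval)

def start_monitor(tag):
    p = multiprocessing.Process(target=monitor, args=(tag,), daemon=True)
    p.start()
    return p   # call p.terminate() when the run finishes
```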
Track 1 & Track 2 can run successfully.
But I met a Track 3 docker evironment error because nvidia/cuda:11.0-devel-ubuntu18.04 is no longer available.
Install Status:
{'faiss_t3': 'fail'}
=> ERROR [internal] load metadata for docker.io/nvidia/cuda:11.0-devel-ubuntu18.04 1.6s
------
> [internal] load metadata for docker.io/nvidia/cuda:11.0-devel-ubuntu18.04:
------
Dockerfile:2
--------------------
1 |
2 | >>> FROM nvidia/cuda:11.0-devel-ubuntu18.04
3 |
4 | ENV PATH="/root/miniconda3/bin:${PATH}"
--------------------
ERROR: failed to solve: nvidia/cuda:11.0-devel-ubuntu18.04: docker.io/nvidia/cuda:11.0-devel-ubuntu18.04: not found
So I tried a similar nvidia docker image, nvidia/cuda:11.0.3-devel-ubuntu18.04, but it seems to cause a lot of package conflicts:
Examining conflict for libgcc-ng libgomp _openmp_mutex: 67%|██████▋ | 34/51 [02:32<00:21, 1.25s/it] failed
#0 358.3
#0 358.3 UnsatisfiableError: The following specifications were found to be incompatible with a past
#0 358.3 explicit spec that is not an explicit spec in this operation (setuptools):
#0 358.3
#0 358.3 - faiss-gpu -> numpy[version='>=1.11,<2'] -> python[version='>=3.10,<3.11.0a0|>=3.11,<3.12.0a0']
#0 358.3 - faiss-gpu -> python[version='>=2.7,<2.8.0a0|>=3.6,<3.7.0a0|>=3.7,<3.8.0a0|>=3.9,<3.10.0a0|>=3.8,<3.9.0a0|>=3.5,<3.6.0a0']
#0 358.3 - python=3.6.9 -> pip -> setuptools
#0 358.3 - python=3.6.9 -> pip -> wheel
#0 358.3
#0 358.3 The following specifications were found to be incompatible with each other:
#0 358.3
#0 358.3 Output in format: Requested package -> Available versions
#0 358.3
#0 358.3 Package ncurses conflicts for:
#0 358.3 pip -> python[version='>=3.11,<3.12.0a0'] -> ncurses[version='6.0.*|>=6.0,<7.0a0|>=6.1,<7.0a0|>=6.2,<7.0a0|>=6.3,<7.0a0|>=6.4,<7.0a0']
#0 358.3 cffi -> python[version='>=3.11,<3.12.0a0'] -> ncurses[version='6.0.*|>=6.0,<7.0a0|>=6.1,<7.0a0|>=6.2,<7.0a0|>=6.3,<7.0a0|>=6.4,<7.0a0']
#0 358.3 cryptography -> python[version='>=3.11,<3.12.0a0'] -> ncurses[version='6.0.*|>=6.0,<7.0a0|>=6.1,<7.0a0|>=6.2,<7.0a0|>=6.3,<7.0a0|>=6.4,<7.0a0']
#0 358.3 pyopenssl -> python[version='>=3.11,<3.12.0a0'] -> ncurses[version='6.0.*|>=6.0,<7.0a0|>=6.1,<7.0a0|>=6.2,<7.0a0|>=6.3,<7.0a0|>=6.4,<7.0a0']
#0 358.3 brotlipyThe following specifications were found to be incompatible with your system:
#0 358.3
#0 358.3 - feature:/linux-64::__glibc==2.27=0
#0 358.3 - brotlipy -> libgcc-ng[version='>=7.3.0'] -> __glibc[version='>=2.17']
#0 358.3 - bzip2 -> libgcc-ng[version='>=7.3.0'] -> __glibc[version='>=2.17']
#0 358.3 - cffi -> libgcc-ng[version='>=11.2.0'] -> __glibc[version='>=2.17']
#0 358.3 - conda-package-handling -> libgcc-ng[version='>=11.2.0'] -> __glibc[version='>=2.17']
#0 358.3 - cryptography -> libgcc-ng -> __glibc[version='>=2.17']
#0 358.3 - cudatoolkit=11.0 -> libgcc-ng[version='>=7.3.0'] -> __glibc[version='>=2.17']
#0 358.3 - faiss-gpu -> libgcc-ng[version='>=8.4.0'] -> __glibc[version='>=2.17']
#0 358.3 - libffi -> libgcc-ng[version='>=11.2.0'] -> __glibc[version='>=2.17']
#0 358.3 - libgcc-ng -> __glibc[version='>=2.17']
#0 358.3 - libstdcxx-ng -> __glibc[version='>=2.17']
#0 358.3 - libuuid -> libgcc-ng[version='>=11.2.0'] -> __glibc[version='>=2.17']
#0 358.3 - ncurses -> libgcc-ng[version='>=11.2.0'] -> __glibc[version='>=2.17']
#0 358.3 - openssl -> libgcc-ng[version='>=7.5.0'] -> __glibc[version='>=2.17']
#0 358.3 - pycosat -> libgcc-ng[version='>=11.2.0'] -> __glibc[version='>=2.17']
#0 358.3 - python=3.6.9 -> libgcc-ng[version='>=7.3.0'] -> __glibc[version='>=2.17']
#0 358.3 - readline -> libgcc-ng[version='>=11.2.0'] -> __glibc[version='>=2.17']
#0 358.3 - ruamel.yaml -> libgcc-ng[version='>=11.2.0'] -> __glibc[version='>=2.17']
#0 358.3 - ruamel.yaml.clib -> libgcc-ng[version='>=11.2.0'] -> __glibc[version='>=2.17']
#0 358.3 - sqlite -> libgcc-ng[version='>=11.2.0'] -> __glibc[version='>=2.17']
#0 358.3 - tk -> libgcc-ng[version='>=7.5.0'] -> __glibc[version='>=2.17']
#0 358.3 - xz -> libgcc-ng[version='>=11.2.0'] -> __glibc[version='>=2.17']
#0 358.3 - zlib -> libgcc-ng[version='>=11.2.0'] -> __glibc[version='>=2.17']
#0 358.3 - zstandard -> libgcc-ng[version='>=11.2.0'] -> __glibc[version='>=2.17']
#0 358.3
#0 358.3 Your installed version is: 2.27
#0 358.3
#0 358.3
------
Dockerfile:11
--------------------
10 |
11 | >>> RUN wget \
12 | >>> https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh \
13 | >>> && mkdir /root/.conda \
14 | >>> && bash Miniconda3-latest-Linux-x86_64.sh -b \
15 | >>> && rm -f Miniconda3-latest-Linux-x86_64.sh \
16 | >>> && conda --version \
17 | >>> && conda install -c pytorch python=3.6.9 faiss-gpu cudatoolkit=11.0
18 |
--------------------
ERROR: failed to solve: process "/bin/sh -c wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh && mkdir /root/.conda && bash Miniconda3-latest-Linux-x86_64.sh -b && rm -f Miniconda3-latest-Linux-x86_64.sh && conda --version && conda install -c pytorch python=3.6.9 faiss-gpu cudatoolkit=11.0" did not complete successfully: exit code: 1
I need to implement a custom dataset and its handling, and I have been thinking about the easiest way to approach it.
I've implemented something halfway through that kept me going and allowed me to plug in a custom dataset; in fact it is a dataset derived from BIGANN by reducing dimensionality using a neural network.
I will show the code of what I needed to change and am happy to discuss this further!
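To give a flavour of the change before posting the actual diff, here is a rough sketch of registering a derived dataset, assuming the DatasetCompetitionFormat base class and the DATASETS registry in benchmark/datasets.py (attribute names reflect my reading of the code and may need adjusting).

```python
import os
from benchmark.datasets import DATASETS, DatasetCompetitionFormat

class ReducedBigANN(DatasetCompetitionFormat):
    """BIGANN with dimensionality reduced by a neural network (illustrative)."""
    def __init__(self):
        self.nb = 10_000_000                  # number of base vectors
        self.nq = 10_000                      # number of queries
        self.d = 64                           # reduced dimensionality
        self.dtype = "float32"
        self.ds_fn = "base.reduced.fbin"      # files produced by my reduction step
        self.qs_fn = "query.reduced.fbin"
        self.gt_fn = "GT.reduced.ibin"
        self.base_url = None                  # nothing to download; files are local
        self.basedir = os.path.join("data", "bigann-reduced")

    def distance(self):
        return "euclidean"

# register so --dataset bigann-reduced-10M works in create_dataset.py / run.py
DATASETS["bigann-reduced-10M"] = ReducedBigANN
```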