harsha-simhadri / big-ann-benchmarks
Framework for evaluating ANNS algorithms on billion-scale datasets.
Home Page: https://big-ann-benchmarks.com
License: MIT License
If an algorithm primarily stores its structures on disk, but makes heavy use of the filesystem cache to speed up access to those structures, does it belong in T1 or T2?
Some concrete examples:
I personally think they should fall under T2, but I could see arguments for both. Anyway, it would be good to clarify this somewhere in the readme or website if possible.
Command:
python install.py --neurips23track ood --algorithm diskann
Output + error:
Building base image...
[+] Building 137.8s (13/13) FINISHED docker:default
=> [internal] load build definition from Dockerfile 0.0s
=> => transferring dockerfile: 556B 0.0s
=> [internal] load .dockerignore 0.0s
=> => transferring context: 66B 0.0s
=> [internal] load metadata for docker.io/library/ubuntu:jammy 0.4s
=> [1/8] FROM docker.io/library/ubuntu:jammy@sha256:ec050c32e4a6085b423d36ecd025c0d3ff00c38ab93a3d71a460ff1c44fa6d77 1.2s
=> => resolve docker.io/library/ubuntu:jammy@sha256:ec050c32e4a6085b423d36ecd025c0d3ff00c38ab93a3d71a460ff1c44fa6d77 0.0s
=> => sha256:ec050c32e4a6085b423d36ecd025c0d3ff00c38ab93a3d71a460ff1c44fa6d77 1.13kB / 1.13kB 0.0s
=> => sha256:56887c5194fddd8db7e36ced1c16b3569d89f74c801dc8a5adbf48236fb34564 424B / 424B 0.0s
=> => sha256:01f29b872827fa6f9aed0ea0b2ede53aea4ad9d66c7920e81a8db6d1fd9ab7f9 2.30kB / 2.30kB 0.0s
=> => sha256:b237fe92c4173e4dfb3ba82e76e5fed4b16186a6161e07af15814cb40eb9069d 29.54MB / 29.54MB 0.4s
=> => extracting sha256:b237fe92c4173e4dfb3ba82e76e5fed4b16186a6161e07af15814cb40eb9069d 0.7s
=> [internal] load build context 0.0s
=> => transferring context: 320B 0.0s
=> [2/8] RUN apt-get update && apt-get install -y python3-numpy python3-scipy python3-pip build-essential git axel wget 46.1s
=> [3/8] RUN wget https://aka.ms/downloadazcopy-v10-linux && mv downloadazcopy-v10-linux azcopy.tgz && tar xzf azcopy.tgz --transform 's!^[^/]\+\($\|/\)!azcopy_folder\1!' 1.0s
=> [4/8] RUN cp azcopy_folder/azcopy /usr/bin 0.4s
=> [5/8] RUN pip3 install -U pip 2.6s
=> [6/8] WORKDIR /home/app 0.0s
=> [7/8] COPY requirements_py3.10.txt run_algorithm.py ./ 0.0s
=> [8/8] RUN pip3 install -r requirements_py3.10.txt 81.0s
=> exporting to image 4.8s
=> => exporting layers 4.7s
=> => writing image sha256:ce6f63808ecd14af21e8afd0f5352768165e29ccccb422ec34b815db1691935f 0.0s
=> => naming to docker.io/library/neurips23 0.0s
Building algorithm images... with (1) processes
Building neurips23-ood-diskann...
docker build --rm -t neurips23-ood-diskann -f neurips23/ood/diskann/Dockerfile .
[+] Building 183.6s (14/15) docker:default
=> [internal] load build definition from Dockerfile 0.0s
=> => transferring dockerfile: 633B 0.0s
=> [internal] load .dockerignore 0.0s
=> => transferring context: 66B 0.0s
=> [internal] load metadata for docker.io/library/neurips23:latest 0.0s
=> [ 1/12] FROM docker.io/library/neurips23 0.1s
=> [ 2/12] RUN apt update 1.4s
=> [ 3/12] RUN apt install -y software-properties-common 14.8s
=> [ 4/12] RUN add-apt-repository -y ppa:git-core/ppa 4.2s
=> [ 5/12] RUN apt update 1.6s
=> [ 6/12] RUN DEBIAN_FRONTEND=noninteractive apt install -y git make cmake g++ libaio-dev libgoogle-perftools-dev libunwind-dev clang-format libboost-dev libboost-program-options-dev libmkl-full-dev libcpprest-dev python3.10 31.5s
=> [ 7/12] RUN git clone https://github.com/microsoft/DiskANN.git --branch 0.5.0.rc3 1.8s
=> [ 8/12] WORKDIR /home/app/DiskANN 0.0s
=> [ 9/12] RUN pip3 install virtualenv build 2.2s
=> [10/12] RUN python3 -m build 125.0s
=> ERROR [11/12] RUN pip install dist/diskannpy-0.5.0rc3-cp310-cp310-linux_x86_64.whl 0.8s
------
> [11/12] RUN pip install dist/diskannpy-0.5.0rc3-cp310-cp310-linux_x86_64.whl:
0.726 WARNING: Requirement 'dist/diskannpy-0.5.0rc3-cp310-cp310-linux_x86_64.whl' looks like a filename, but the file does not exist
0.744 Processing ./dist/diskannpy-0.5.0rc3-cp310-cp310-linux_x86_64.whl
0.749 ERROR: Could not install packages due to an OSError: [Errno 2] No such file or directory: '/home/app/DiskANN/dist/diskannpy-0.5.0rc3-cp310-cp310-linux_x86_64.whl'
0.749
------
Dockerfile:13
--------------------
11 | RUN pip3 install virtualenv build
12 | RUN python3 -m build
13 | >>> RUN pip install dist/diskannpy-0.5.0rc3-cp310-cp310-linux_x86_64.whl
14 | WORKDIR /home/app
15 |
--------------------
ERROR: failed to solve: process "/bin/sh -c pip install dist/diskannpy-0.5.0rc3-cp310-cp310-linux_x86_64.whl" did not complete successfully: exit code: 1
Install Status:
{'neurips23-ood-diskann': 'fail'}
System info:
For easy replication, I'm using an EC2 c5.2xlarge instance with image Deep Learning AMI GPU PyTorch 1.13.1 (Ubuntu 20.04) 20230818 because it comes with docker and conda preinstalled.
Specifics:
Thank you for providing such a great competition.
I would like to confirm the SSD storage for the T1 and T2 hardware. Which storage type is used for them: standard, premium, or ultra disk? Compared to a physical SSD, the standard and premium disks are very slow for building indexes. I haven't tried the ultra disk yet.
Hello, are the results out? If yes, can you please share the results publicly? Ideally, something like https://ann-benchmarks.com/ graphs would be great.
Thanks!
The 1 TB constraint for the local SSD index is considerably lower than the 1.92 TB available on the NVMe disk of the Azure Standard_L8s_v2 VMs. Could this constraint be increased?
There seems to be a mistake in t1_t2/README.md: the result of BBAnn (track 2) on text2image is incorrect.
As we discussed in PR #70, the best entry on the public query set has QPS 1540.622672933968 and recall 0.495423.
We would be grateful if you would update the entry in the results section.
I ran into problems when building the docker images in the neurips23 directory.
apt-get update:
Problem executing scripts APT::Update::Post-Invoke 'rm -f /var/cache/apt/archives/*.deb /var/cache/apt/archives/partial/*.deb /var/cache/apt/*.bin || true'
pip3 install ...:
RuntimeError: can't start new thread
But when I change the ubuntu base image from jammy to 20.04, the problems are fixed.
I think it might be because of the docker version (mine is 20.10.7). I don't see a specified version in this repo.
So I recommend the organizers share the docker version used on the evaluation machines.
I tried to reproduce the T2 baseline using the official indices and code in PR #17, but cannot get reasonable results compared to the official baseline (i.e. 2000 QPS, 0.957 Recall@10 for BIGANN-1B). So would you please release your code and configurations for the T2 baseline evaluation?
Here are my operations and results for BIGANN-1B:
pip install -U -r requirements_py38.txt to install python requirements
python install.py to build docker images
python run.py --dataset bigann-1B --algorithm diskann-t2
python data_export.py --output result.csv to export results
The final results are as follows:
algorithm,parameters,dataset,count,qps,distcomps,build,indexsize,queriessize,wspq,recall/ap
diskann-t2,DiskANN,bigann-1B,10,883.7374471075963,0.0,1000000.0,51774612.0,58585.96596698971,inf,0.0019
diskann-t2,DiskANN,bigann-1B,10,1307.73774493947,0.0,1000000.0,51774612.0,39590.97471977949,inf,0.00108
diskann-t2,DiskANN,bigann-1B,10,803.5946779065205,0.0,1000000.0,51774612.0,64428.764181067374,inf,0.00208
diskann-t2,DiskANN,bigann-1B,10,1118.5931944484782,0.0,1000000.0,51774612.0,46285.4702289937,inf,0.00133
diskann-t2,DiskANN,bigann-1B,10,1604.1311836371815,0.0,1000000.0,51774612.0,32275.796722938252,inf,0.00076
diskann-t2,DiskANN,bigann-1B,10,929.937028492744,0.0,1000000.0,51774612.0,55675.39565976534,inf,0.0017500000000000003
diskann-t2,DiskANN,bigann-1B,10,966.1165292100424,0.0,1000000.0,51774612.0,53590.44218230504,inf,0.0017399999999999998
diskann-t2,DiskANN,bigann-1B,10,990.0256363593018,0.0,1000000.0,51774612.0,52296.23365147877,inf,0.00166
diskann-t2,DiskANN,bigann-1B,10,1051.6221286578705,0.0,1000000.0,51774612.0,49233.09484375076,inf,0.00149
diskann-t2,DiskANN,bigann-1B,10,682.3607260592325,0.0,1000000.0,51774612.0,75875.72089473638,inf,0.00258
Using axel to speed up index downloads can basically only be run with -q (quiet) inside the docker container. It would be better if index downloading happened outside of the container.
Hi! Thanks for providing the scripts for evaluating results.
I found that when running python data_export.py --output res.csv, this line of code:
power_capture.detect_power_benchmarks(metrics, res)
exhausts the generator res, so the next line:
for i, (properties, run) in enumerate(res):
doesn't output anything to be written into res.csv.
I'm still studying the code, and not sure if this is a bug...
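For what it's worth, a small self-contained illustration of the suspected behaviour (the loader below is a stand-in, not the repo's actual function): a generator can only be consumed once, so whichever loop runs second sees nothing unless the results are materialized first.

```python
def load_results():                  # stand-in for the repo's results loader
    for i in range(3):
        yield ({"algo": f"algo{i}"}, f"run{i}")

res = load_results()
consumed = sum(1 for _ in res)       # e.g. detect_power_benchmarks iterating over res
print(consumed)                      # 3
print(list(res))                     # [] -- nothing left for the export loop

res = list(load_results())           # possible fix: materialize once, reuse the list
sum(1 for _ in res)                  # a first pass still sees all 3 runs
print(len(res))                      # 3 -- the export loop can iterate again
```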
There is some non-trivial setup needed for faiss_t3 to work with docker. I'll add this to a README in the t3/faiss_t3/ directory.
In general, docker won't be able to accommodate all the installation steps needed for various T3 submissions (host drivers and libraries, etc.). This was expected :-)
Hi there,
There is a default value for radius here. I am wondering whether this default value of 96237 will be used for the final evaluation in the benchmark?
Hi, I have a question about using this framework with a non-python ANN implementation.
It looks like this is mostly a fork of ann-benchmarks, so the only option for using it outside of Python is to hack together a client/server setup, as has been done for a few algos in ann-benchmarks. This obviously handicaps and complicates non-python implementations, as it introduces the costs of context switching, serialization, and data transfer among processes.
I asked and was told early on by project organizers that the big-ann challenge would support non-python implementations:
So I'm wondering if there has been progress here, or any idea of how it might work?
It seems like it wouldn't be terribly difficult to refactor the code so that the containers executed by runner.py can have any entrypoint, e.g., a program in another language. The interface between runner and algorithm would then simply be some standard file format for inputs and nearest neighbor results. If that sounds like a good idea I can try to implement it. Otherwise maybe we can use this ticket for discussing alternatives.
Thanks
-Alex
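To make that interface idea a bit more concrete, here is a rough sketch of what a language-agnostic, file-based contract could look like; the file names, mount points, and the result layout below are illustrative assumptions, not an agreed format (only the uint32 count/dimension header of the *bin files matches the existing datasets).

```python
import numpy as np

def read_fbin(path):
    """Read a *.fbin file: uint32 n, uint32 d header followed by n*d float32 values."""
    with open(path, "rb") as f:
        n, d = (int(x) for x in np.fromfile(f, dtype=np.uint32, count=2))
        return np.fromfile(f, dtype=np.float32, count=n * d).reshape(n, d)

def write_neighbors(path, ids):
    """Write nearest-neighbor ids as a uint32 shape header plus int32 payload (layout assumed)."""
    with open(path, "wb") as f:
        np.asarray(ids.shape, dtype=np.uint32).tofile(f)
        np.ascontiguousarray(ids, dtype=np.int32).tofile(f)

if __name__ == "__main__":
    queries = read_fbin("/data/queries.fbin")            # path the runner would mount (assumed)
    ids = np.zeros((len(queries), 10), dtype=np.int32)   # placeholder: the real algorithm goes here
    write_neighbors("/results/neighbors.ibin", ids)
```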
For the sparse dataset "base_small.csr" there is no documentation of how to handle this dataset format, for example how to read it into a matrix.
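In the meantime, here is a hedged reading sketch, assuming the file uses the CSR-on-disk layout I believe the sparse datasets in this repo follow (int64 nrow/ncol/nnz, then int64 indptr, int32 indices, float32 data); if the real layout differs, only the dtypes need adjusting.

```python
import numpy as np
from scipy.sparse import csr_matrix

def read_csr_file(fname):
    """Read a sparse matrix stored as: int64 [nrow, ncol, nnz], int64 indptr
    (nrow+1), int32 indices (nnz), float32 data (nnz). Layout assumed."""
    with open(fname, "rb") as f:
        nrow, ncol, nnz = np.fromfile(f, dtype=np.int64, count=3)
        indptr = np.fromfile(f, dtype=np.int64, count=nrow + 1)
        indices = np.fromfile(f, dtype=np.int32, count=nnz)
        data = np.fromfile(f, dtype=np.float32, count=nnz)
    return csr_matrix((data, indices, indptr), shape=(nrow, ncol))

# X = read_csr_file("data/sparse-small/base_small.csr")
# print(X.shape, X.nnz)
```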
When I use this command, there is an error:
python run.py --neurips23track ood --algorithm diskann --dataset random-xs
How do I fix this issue? Can anybody help me? Thanks.
Preparing datasets with 10000 random points and 1000 queries.
Computing groundtruth
2023-09-04 21:25:39,786 - annb - INFO - running only diskann
Traceback (most recent call last):
File "run.py", line 6, in <module>
main()
File "/home/cy/work_cy/big-ann-benchmarks/benchmark/main.py", line 236, in main
for image in docker_client.images.list():
File "/home/cy/.local/lib/python3.8/site-packages/docker/models/images.py", line 230, in list
resp = self.client.api.images(name=name, all=all, filters=filters)
File "/home/cy/.local/lib/python3.8/site-packages/docker/api/image.py", line 93, in images
res = self._result(self._get(self._url("/images/json"), params=params),
File "/home/cy/.local/lib/python3.8/site-packages/docker/utils/decorators.py", line 46, in inner
return f(self, *args, **kwargs)
File "/home/cy/.local/lib/python3.8/site-packages/docker/api/client.py", line 191, in _get
return self.get(url, **self._set_request_timeout(kwargs))
File "/home/cy/.local/lib/python3.8/site-packages/requests/sessions.py", line 602, in get
return self.request("GET", url, **kwargs)
File "/home/cy/.local/lib/python3.8/site-packages/requests/sessions.py", line 589, in request
resp = self.send(prep, **send_kwargs)
File "/home/cy/.local/lib/python3.8/site-packages/requests/sessions.py", line 703, in send
r = adapter.send(request, **kwargs)
File "/home/cy/.local/lib/python3.8/site-packages/requests/adapters.py", line 486, in send
resp = conn.urlopen(
File "/home/cy/.local/lib/python3.8/site-packages/urllib3/connectionpool.py", line 790, in urlopen
response = self._make_request(
File "/home/cy/.local/lib/python3.8/site-packages/urllib3/connectionpool.py", line 496, in _make_request
conn.request(
TypeError: request() got an unexpected keyword argument 'chunked'
When running python install.py --neurips23track ood --algorithm diskann or python install.py --neurips23track streaming --algorithm diskann, an issue is encountered where the installation of diskann fails because the file name for the compiled diskann is specified as dist/diskannpy-0.5.0rc3-cp310-cp310-linux_x86_64.whl when the file created by the build is dist/diskannpy-0.5.0rc2-cp310-cp310-linux_x86_64.whl.
This seems to stem from this commit in which the diskann version is upgraded to rc3, but the rc3 branch does not seem to compile an output with the correct name, causing the issue.
Fixed for me by changing the filename in the dockerfiles as described, but if the upgrade to rc3 matters it might therefore not be getting applied.
When running the cmd:
python3 run.py --dataset msturing-10M-clustered --algorithm diskann --neurips23track streaming --runbook_path neurips23/streaming/delete_runbook.yaml
This error continuously pops up after several iterations:
...
...
2023-08-10 01:08:16,415 - annb.31d58d2d344a - INFO - Step 54 took 3.0553574562072754s.
2023-08-10 01:08:16,621 - annb.31d58d2d344a - INFO - #active pts 4539934 #unprocessed deletes 1500000
2023-08-10 01:09:59,611 - annb.31d58d2d344a - ERROR - Container.wait for container 31d58d2d344a failed with exception
2023-08-10 01:09:59,611 - annb.31d58d2d344a - ERROR - Invoked with ['--dataset', 'msturing-10M-clustered', '--algorithm', 'diskann', '--module', 'neurips23.streaming.diskann.diskann-str', '--constructor', 'diskann', '--runs', '5', '--count', '10', '--neurips23track', 'streaming', '--runbook_path', 'neurips23/streaming/delete_runbook.yaml', '["euclidean", {"R": 64, "L": 50, "insert_threads": 16, "consolidate_threads": 16}]', '[{"Ls": 100, "T": 16}]']
Traceback (most recent call last):
File "/home/impanyu/big-ann-benchmarks/ann/lib/python3.10/site-packages/urllib3/response.py", line 710, in _error_catcher
yield
File "/home/impanyu/big-ann-benchmarks/ann/lib/python3.10/site-packages/urllib3/response.py", line 1077, in read_chunked
self._update_chunk_length()
File "/home/impanyu/big-ann-benchmarks/ann/lib/python3.10/site-packages/urllib3/response.py", line 1005, in _update_chunk_length
line = self._fp.fp.readline() # type: ignore[union-attr]
File "/usr/lib/python3.10/socket.py", line 705, in readinto
return self._sock.recv_into(b)
TimeoutError: timed out
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/home/impanyu/big-ann-benchmarks/ann/lib/python3.10/site-packages/requests/models.py", line 816, in generate
yield from self.raw.stream(chunk_size, decode_content=True)
File "/home/impanyu/big-ann-benchmarks/ann/lib/python3.10/site-packages/urllib3/response.py", line 937, in stream
yield from self.read_chunked(amt, decode_content=decode_content)
File "/home/impanyu/big-ann-benchmarks/ann/lib/python3.10/site-packages/urllib3/response.py", line 1065, in read_chunked
with self._error_catcher():
File "/usr/lib/python3.10/contextlib.py", line 153, in exit
self.gen.throw(typ, value, traceback)
File "/home/impanyu/big-ann-benchmarks/ann/lib/python3.10/site-packages/urllib3/response.py", line 715, in _error_catcher
raise ReadTimeoutError(self._pool, None, "Read timed out.") from e # type: ignore[arg-type]
urllib3.exceptions.ReadTimeoutError: UnixHTTPConnectionPool(host='localhost', port=None): Read timed out.
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/home/impanyu/big-ann-benchmarks/benchmark/runner.py", line 318, in run_docker
return_value = container.wait(timeout=timeout)
File "/home/impanyu/big-ann-benchmarks/ann/lib/python3.10/site-packages/docker/models/containers.py", line 514, in wait
return self.client.api.wait(self.id, **kwargs)
File "/home/impanyu/big-ann-benchmarks/ann/lib/python3.10/site-packages/docker/utils/decorators.py", line 19, in wrapped
return f(self, resource_id, *args, **kwargs)
File "/home/impanyu/big-ann-benchmarks/ann/lib/python3.10/site-packages/docker/api/container.py", line 1338, in wait
res = self._post(url, timeout=timeout, params=params)
File "/home/impanyu/big-ann-benchmarks/ann/lib/python3.10/site-packages/docker/utils/decorators.py", line 46, in inner
return f(self, *args, **kwargs)
File "/home/impanyu/big-ann-benchmarks/ann/lib/python3.10/site-packages/docker/api/client.py", line 233, in _post
return self.post(url, **self._set_request_timeout(kwargs))
File "/home/impanyu/big-ann-benchmarks/ann/lib/python3.10/site-packages/requests/sessions.py", line 637, in post
return self.request("POST", url, data=data, json=json, **kwargs)
File "/home/impanyu/big-ann-benchmarks/ann/lib/python3.10/site-packages/requests/sessions.py", line 589, in request
resp = self.send(prep, **send_kwargs)
File "/home/impanyu/big-ann-benchmarks/ann/lib/python3.10/site-packages/requests/sessions.py", line 747, in send
r.content
File "/home/impanyu/big-ann-benchmarks/ann/lib/python3.10/site-packages/requests/models.py", line 899, in content
self._content = b"".join(self.iter_content(CONTENT_CHUNK_SIZE)) or b""
File "/home/impanyu/big-ann-benchmarks/ann/lib/python3.10/site-packages/requests/models.py", line 822, in generate
raise ConnectionError(e)
requests.exceptions.ConnectionError: UnixHTTPConnectionPool(host='localhost', port=None): Read timed out.
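One hedged workaround idea (sketch only; names follow the traceback rather than the repo's actual runner code): instead of a single long container.wait(), poll in shorter intervals and retry on read timeouts while the container is still running.

```python
import requests

def wait_for_container(container, total_timeout, poll=60):
    """Wait for a docker-py container, tolerating read timeouts from the daemon."""
    waited = 0
    while waited < total_timeout:
        try:
            return container.wait(timeout=poll)          # blocks for at most `poll` seconds
        except (requests.exceptions.ConnectionError,
                requests.exceptions.ReadTimeout):
            container.reload()                           # refresh status from the daemon
            if container.status != "running":
                return container.wait(timeout=poll)      # finished between polls
            waited += poll                               # still working; keep waiting
    raise TimeoutError(f"container {container.short_id} exceeded {total_timeout}s")
```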
> python3 run.py --neurips23track filter --algorithm faiss --dataset yfcc-10M
downloading https://dl.fbaipublicfiles.com/billion-scale-ann-benchmarks/yfcc100M/query.public.100K.u8bin -> data/yfcc100M/query.public.100K.u8bin...
[2.55 s] downloaded 18.31 MiB / 18.31 MiB at 7.19 MiB/s
download finished in 2.55 s, total size 19200008 bytes
downloading https://dl.fbaipublicfiles.com/billion-scale-ann-benchmarks/yfcc100M/GT.public.ibin -> data/yfcc100M/GT.public.ibin...
[1.45 s] downloaded 7.63 MiB / 7.63 MiB at 5.28 MiB/s
download finished in 1.45 s, total size 8000008 bytes
file data/yfcc100M/ already exists
file data/yfcc100M/ already exists
downloading https://dl.fbaipublicfiles.com/billion-scale-ann-benchmarks/yfcc100M/base.metadata.10M.spmat -> data/yfcc100M/base.metadata.10M.spmat...
[94.03 s] downloaded 901.87 MiB / 901.87 MiB at 9.59 MiB/s
download finished in 94.03 s, total size 945683840 bytes
downloading https://dl.fbaipublicfiles.com/billion-scale-ann-benchmarks/yfcc100M/query.metadata.public.100K.spmat -> data/yfcc100M/query.metadata.public.100K.spmat...
[1.00 s] downloaded 1.82 MiB / 1.82 MiB at 1.82 MiB/s
download finished in 1.00 s, total size 1907024 bytes
2023-07-18 19:55:12,243 - annb - INFO - running only faiss
2023-07-18 19:55:12,319 - annb - INFO - Order: [Definition(algorithm='faiss', constructor='FAISS', module='neurips23.filter.faiss.faiss', docker_tag='neurips23-filter-faiss', docker_volumes=[], arguments=['euclidean', {'indexkey': 'IVF16384,SQ8', 'binarysig': True, 'threads': 16}], query_argument_groups=[[{'nprobe': 1, 'mt_threshold': 0.0003}], [{'nprobe': 4, 'mt_threshold': 0.0003}], [{'nprobe': 16, 'mt_threshold': 0.0003}], [{'nprobe': 32, 'mt_threshold': 0.0003}], [{'nprobe': 64, 'mt_threshold': 0.0003}], [{'nprobe': 96, 'mt_threshold': 0.0003}], [{'nprobe': 1, 'mt_threshold': 0.0001}], [{'nprobe': 4, 'mt_threshold': 0.0001}], [{'nprobe': 16, 'mt_threshold': 0.0001}], [{'nprobe': 32, 'mt_threshold': 0.0001}], [{'nprobe': 64, 'mt_threshold': 0.0001}], [{'nprobe': 96, 'mt_threshold': 0.0001}], [{'nprobe': 1, 'mt_threshold': 0.01}], [{'nprobe': 4, 'mt_threshold': 0.01}], [{'nprobe': 16, 'mt_threshold': 0.01}], [{'nprobe': 32, 'mt_threshold': 0.01}], [{'nprobe': 64, 'mt_threshold': 0.01}], [{'nprobe': 96, 'mt_threshold': 0.01}]], disabled=False)]
RW Namespace(dataset='yfcc-10M', count=10, definitions='algos-2021.yaml', algorithm='faiss', docker_tag=None, list_algorithms=False, force=False, rebuild=False, runs=5, timeout=43200, max_n_algorithms=-1, power_capture='', t3=False, nodocker=False, upload_index=False, download_index=False, blob_prefix=None, sas_string=None, private_query=False, neurips23track='filter', runbook_path='neurips23/streaming/simple_runbook.yaml')
Setting container wait timeout to 30 minutes
2023-07-18 19:55:12,762 - annb.d25eedf2531c - INFO - Created container d25eedf2531c: CPU limit 0-11, mem limit 25092139776, timeout 1800, command ['--dataset', 'yfcc-10M', '--algorithm', 'faiss', '--module', 'neurips23.filter.faiss.faiss', '--constructor', 'FAISS', '--runs', '5', '--count', '10', '--neurips23track', 'filter', '["euclidean", {"indexkey": "IVF16384,SQ8", "binarysig": true, "threads": 16}]', '[{"nprobe": 1, "mt_threshold": 0.0003}]', '[{"nprobe": 4, "mt_threshold": 0.0003}]', '[{"nprobe": 16, "mt_threshold": 0.0003}]', '[{"nprobe": 32, "mt_threshold": 0.0003}]', '[{"nprobe": 64, "mt_threshold": 0.0003}]', '[{"nprobe": 96, "mt_threshold": 0.0003}]', '[{"nprobe": 1, "mt_threshold": 0.0001}]', '[{"nprobe": 4, "mt_threshold": 0.0001}]', '[{"nprobe": 16, "mt_threshold": 0.0001}]', '[{"nprobe": 32, "mt_threshold": 0.0001}]', '[{"nprobe": 64, "mt_threshold": 0.0001}]', '[{"nprobe": 96, "mt_threshold": 0.0001}]', '[{"nprobe": 1, "mt_threshold": 0.01}]', '[{"nprobe": 4, "mt_threshold": 0.01}]', '[{"nprobe": 16, "mt_threshold": 0.01}]', '[{"nprobe": 32, "mt_threshold": 0.01}]', '[{"nprobe": 64, "mt_threshold": 0.01}]', '[{"nprobe": 96, "mt_threshold": 0.01}]']
2023-07-18 19:55:13,268 - annb.d25eedf2531c - INFO - ['euclidean', {'indexkey': 'IVF16384,SQ8', 'binarysig': True, 'threads': 16}]
2023-07-18 19:55:13,268 - annb.d25eedf2531c - INFO - Trying to instantiate neurips23.filter.faiss.faiss.FAISS(['euclidean', {'indexkey': 'IVF16384,SQ8', 'binarysig': True, 'threads': 16}])
2023-07-18 19:55:13,305 - annb.d25eedf2531c - INFO - {'indexkey': 'IVF16384,SQ8', 'binarysig': True, 'threads': 16}
2023-07-18 19:55:13,305 - annb.d25eedf2531c - INFO - Running faiss on yfcc-10M
2023-07-18 19:55:13,305 - annb.d25eedf2531c - INFO - preparing binary signatures
2023-07-18 19:55:40,382 - annb.d25eedf2531c - INFO - writing to data/yfcc-10M.IVF16384,SQ8.binarysig
2023-07-18 19:55:44,039 - annb.d25eedf2531c - INFO - Traceback (most recent call last):
2023-07-18 19:55:44,039 - annb.d25eedf2531c - INFO - File "/home/app/run_algorithm.py", line 3, in <module>
2023-07-18 19:55:44,039 - annb.d25eedf2531c - INFO - run_from_cmdline()
2023-07-18 19:55:44,039 - annb.d25eedf2531c - INFO - File "/home/app/benchmark/runner.py", line 222, in run_from_cmdline
2023-07-18 19:55:44,040 - annb.d25eedf2531c - INFO - run(definition, args.dataset, args.count, args.runs, args.rebuild,
2023-07-18 19:55:44,040 - annb.d25eedf2531c - INFO - File "/home/app/benchmark/runner.py", line 69, in run
2023-07-18 19:55:44,040 - annb.d25eedf2531c - INFO - build_time = custom_runner.build(algo, dataset)
2023-07-18 19:55:44,040 - annb.d25eedf2531c - INFO - File "/home/app/benchmark/algorithms/base_runner.py", line 7, in build
2023-07-18 19:55:44,040 - annb.d25eedf2531c - INFO - algo.fit(dataset)
2023-07-18 19:55:44,040 - annb.d25eedf2531c - INFO - File "/home/app/neurips23/filter/faiss/faiss.py", line 112, in fit
2023-07-18 19:55:44,040 - annb.d25eedf2531c - INFO - xb = ds.get_dataset()
2023-07-18 19:55:44,040 - annb.d25eedf2531c - INFO - File "/home/app/benchmark/datasets.py", line 217, in get_dataset
2023-07-18 19:55:44,040 - annb.d25eedf2531c - INFO - slice = next(self.get_dataset_iterator(bs=self.nb))
2023-07-18 19:55:44,040 - annb.d25eedf2531c - INFO - File "/home/app/benchmark/datasets.py", line 190, in get_dataset_iterator
2023-07-18 19:55:44,040 - annb.d25eedf2531c - INFO - x = xbin_mmap(filename, dtype=self.dtype, maxn=self.nb)
2023-07-18 19:55:44,040 - annb.d25eedf2531c - INFO - File "/home/app/benchmark/dataset_io.py", line 96, in xbin_mmap
2023-07-18 19:55:44,040 - annb.d25eedf2531c - INFO - n, d = map(int, np.fromfile(fname, dtype="uint32", count=2))
2023-07-18 19:55:44,040 - annb.d25eedf2531c - INFO - FileNotFoundError: [Errno 2] No such file or directory: 'data/yfcc100M/base.10M.u8bin.crop_nb_10000000'
2023-07-18 19:55:44,390 - annb.d25eedf2531c - ERROR - ['euclidean', {'indexkey': 'IVF16384,SQ8', 'binarysig': True, 'threads': 16}]
Trying to instantiate neurips23.filter.faiss.faiss.FAISS(['euclidean', {'indexkey': 'IVF16384,SQ8', 'binarysig': True, 'threads': 16}])
{'indexkey': 'IVF16384,SQ8', 'binarysig': True, 'threads': 16}
Running faiss on yfcc-10M
preparing binary signatures
writing to data/yfcc-10M.IVF16384,SQ8.binarysig
Traceback (most recent call last):
File "/home/app/run_algorithm.py", line 3, in <module>
run_from_cmdline()
File "/home/app/benchmark/runner.py", line 222, in run_from_cmdline
run(definition, args.dataset, args.count, args.runs, args.rebuild,
File "/home/app/benchmark/runner.py", line 69, in run
build_time = custom_runner.build(algo, dataset)
File "/home/app/benchmark/algorithms/base_runner.py", line 7, in build
algo.fit(dataset)
File "/home/app/neurips23/filter/faiss/faiss.py", line 112, in fit
xb = ds.get_dataset()
File "/home/app/benchmark/datasets.py", line 217, in get_dataset
slice = next(self.get_dataset_iterator(bs=self.nb))
File "/home/app/benchmark/datasets.py", line 190, in get_dataset_iterator
x = xbin_mmap(filename, dtype=self.dtype, maxn=self.nb)
File "/home/app/benchmark/dataset_io.py", line 96, in xbin_mmap
n, d = map(int, np.fromfile(fname, dtype="uint32", count=2))
FileNotFoundError: [Errno 2] No such file or directory: 'data/yfcc100M/base.10M.u8bin.crop_nb_10000000'
2023-07-18 19:55:44,390 - annb.d25eedf2531c - ERROR - Child process for container d25eedf2531creturned exit code 1 with message None
Hello,
I have a problem when I follow the instructions in the README file.
First I create a conda python3.10 environment and run pip install -r requirements_py3.10.txt.
Then I run python3 install.py --algorithm pqbuddy, and it creates a docker container.
After I prepare the dataset with python3 create_dataset.py --dataset deep-10M, I run python3 run.py --algorithm pqbuddy --dataset deep-10M --rebuild and the error occurs.
File "run_algorithm.py", line 1, in
from benchmark.runner import run_from_cmdline
File "/home/app/benchmark/runner.py", line 26, in
from neurips23.common import RUNNERS
File "/home/app/neurips23/common.py", line 7, in
from neurips23.streaming.run import StreamingRunner
File "/home/app/neurips23/streaming/run.py", line 39
match entry['operation']:
^
SyntaxError: invalid syntax
Could you please help out with the problem? Thanks.
I cannot reproduce it on h8 or e8 instances, but on f32v2 instances faiss will segfault with some parameter settings. E.g., set up everything to run msturing-1B and carry out
params="
nprobe=128,quantizer_efSearch=128
nprobe=64,quantizer_efSearch=512
nprobe=128,quantizer_efSearch=256
nprobe=128,quantizer_efSearch=512
nprobe=256,quantizer_efSearch=256
nprobe=256,quantizer_efSearch=512
"
python track1_baseline_faiss/baseline_faiss.py \
--dataset msturing-1B --indexfile data/msturing-1B.IVF1M_2level_PQ64x4fsr.faissindex \
--search --searchparams $params
results in
azureuser@test:~/big-ann-benchmarks$ bash test.sh
nb processors 32
model name : Intel(R) Xeon(R) Platinum 8168 CPU @ 2.70GHz
Dataset MSTuringANNS in dimension 100, with distance euclidean, search_type knn, size: Q 100000 B 1000000000
reading data/msturing-1B.IVF1M_2level_PQ64x4fsr.faissindex
imbalance_factor= 1.5638867719477003
index size on disk: 41360658380
current RSS: 44945760256
precomputed tables size: 0
Search threads: 32
Optimize for intersection @ 10
Running evaluation on 6 searchparams
parameters inter@ 10 time(ms/q) nb distances %quantization #runs
nprobe=128,quantizer_efSearch=128 test.sh: line 12: 8954 Killed python track1_baseline_faiss/baseline_faiss.py --dataset msturing-1B --indexfile data/msturing-1B.IVF1M_2level_PQ64x4fsr.faissindex --search --searchparams $params
Any thoughts Matthijs? (Once you are back from vacation)
$ pip install -r big-ann-benchmarks/requirements_py3.10.txt
Collecting ansicolors==1.1.8 (from -r big-ann-benchmarks/requirements_py3.10.txt (line 1))
Downloading ansicolors-1.1.8-py2.py3-none-any.whl (13 kB)
Collecting docker==6.1.2 (from -r big-ann-benchmarks/requirements_py3.10.txt (line 2))
Downloading docker-6.1.2-py3-none-any.whl (148 kB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 148.1/148.1 kB 3.6 MB/s eta 0:00:00a 0:00:01
Collecting h5py==3.8.0 (from -r big-ann-benchmarks/requirements_py3.10.txt (line 3))
Downloading h5py-3.8.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (4.6 MB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 4.6/4.6 MB 22.8 MB/s eta 0:00:0000:0100:01
Collecting matplotlib==3.3.4 (from -r big-ann-benchmarks/requirements_py3.10.txt (line 4))
Downloading matplotlib-3.3.4.tar.gz (37.9 MB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 37.9/37.9 MB 23.6 MB/s eta 0:00:0000:0100:01
Preparing metadata (setup.py) ... error
error: subprocess-exited-with-error
× python setup.py egg_info did not run successfully.
│ exit code: 1
╰─> [56 lines of output]
Traceback (most recent call last):
File "/opt/conda/lib/python3.10/site-packages/pkg_resources/__init__.py", line 3031, in _dep_map
return self.__dep_map
File "/opt/conda/lib/python3.10/site-packages/pkg_resources/__init__.py", line 2828, in __getattr__
raise AttributeError(attr)
AttributeError: _DistInfoDistribution__dep_map
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/opt/conda/lib/python3.10/site-packages/pkg_resources/__init__.py", line 3022, in _parsed_pkg_info
return self._pkg_info
File "/opt/conda/lib/python3.10/site-packages/pkg_resources/__init__.py", line 2828, in __getattr__
raise AttributeError(attr)
AttributeError: _pkg_info. Did you mean: 'egg_info'?
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "<string>", line 2, in <module>
File "<pip-setuptools-caller>", line 34, in <module>
File "/tmp/pip-install-ohdk213a/matplotlib_cd1d295c77724caeb6457c930691d7e2/setup.py", line 256, in <module>
setup( # Finally, pass this all along to distutils to do the heavy lifting.
File "/opt/conda/lib/python3.10/site-packages/setuptools/__init__.py", line 152, in setup
_install_setup_requires(attrs)
File "/opt/conda/lib/python3.10/site-packages/setuptools/__init__.py", line 147, in _install_setup_requires
dist.fetch_build_eggs(dist.setup_requires)
File "/opt/conda/lib/python3.10/site-packages/setuptools/dist.py", line 812, in fetch_build_eggs
resolved_dists = pkg_resources.working_set.resolve(
File "/opt/conda/lib/python3.10/site-packages/pkg_resources/__init__.py", line 785, in resolve
new_requirements = dist.requires(req.extras)[::-1]
File "/opt/conda/lib/python3.10/site-packages/pkg_resources/__init__.py", line 2749, in requires
dm = self._dep_map
File "/opt/conda/lib/python3.10/site-packages/pkg_resources/__init__.py", line 3033, in _dep_map
self.__dep_map = self._compute_dependencies()
File "/opt/conda/lib/python3.10/site-packages/pkg_resources/__init__.py", line 3042, in _compute_dependencies
for req in self._parsed_pkg_info.get_all('Requires-Dist') or []:
File "/opt/conda/lib/python3.10/site-packages/pkg_resources/__init__.py", line 3024, in _parsed_pkg_info
metadata = self.get_metadata(self.PKG_INFO)
File "/opt/conda/lib/python3.10/site-packages/pkg_resources/__init__.py", line 1412, in get_metadata
value = self._get(path)
File "/opt/conda/lib/python3.10/site-packages/pkg_resources/__init__.py", line 1616, in _get
with open(path, 'rb') as stream:
FileNotFoundError: [Errno 2] No such file or directory: '/opt/conda/lib/python3.10/site-packages/numpy-1.25.0.dist-info/METADATA'
Edit setup.cfg to change the build options; suppress output with --quiet.
BUILDING MATPLOTLIB
matplotlib: yes [3.3.4]
python: yes [3.10.12 | packaged by conda-forge | (main, Jun 23 2023,
22:40:32) [GCC 12.3.0]]
platform: yes [linux]
sample_data: yes [installing]
tests: no [skipping due to configuration]
macosx: no [Mac OS-X only]
[end of output]
note: This error originates from a subprocess, and is likely not a problem with pip.
error: metadata-generation-failed
× Encountered error while generating package metadata.
╰─> See above for output.
note: This is an issue with the package mentioned above, not pip.
hint: See above for details.
It feels a bit weird to see a lot of activity on this repo rather than trying to contribute to the original one https://github.com/erikbern/ann-benchmarks
Is the ambition to merge it back into the main repo? Or is this just a short-lived repo anyway?
I'm happy to donate my code to something more neutral (e.g. we can set up a neutral github.com organization rather than have the code under my username). Seems like it would be beneficial to not diverge too far.
(also felt a bit weird that no one told me about this – I found out about it randomly)
@maumueller wdyt?
[11/12] RUN pip install dist/diskannpy-0.5.0rc3-cp310-cp310-linux_x86_64.whl:
1.266 WARNING: Requirement 'dist/diskannpy-0.5.0rc3-cp310-cp310-linux_x86_64.whl' looks like a filename, but the file does not exist
1.299 Processing ./dist/diskannpy-0.5.0rc3-cp310-cp310-linux_x86_64.whl
1.307 ERROR: Could not install packages due to an OSError: [Errno 2] No such file or directory: '/home/app/DiskANN/dist/diskannpy-0.5.0rc3-cp310-cp310-linux_x86_64.whl'
1.307
ERROR: failed to solve: process "/bin/sh -c pip install dist/diskannpy-0.5.0rc3-cp310-cp310-linux_x86_64.whl" did not complete successfully: exit code: 1
It seems DiskANN doesn't have a dist/ directory.
Dear all,
<tl;dr> Please add your thoughts on the future of this benchmark!
Thank you very much for participating in our NeurIPS'21 competition. The competition will end with an event on Dec 8, and you can find the timeline for this event on https://big-ann-benchmarks.com/. We hope many of you will be able to participate!
The last part of the event will be an open discussion among the participants for future directions of this competition. As organizers we have already identified some points we would like to discuss and potentially include in a future version of the benchmark.
Filtered ANNS: Can you support ANNS queries which allow filters like date range, author, or some combination of attributes? This would look like a simple SQL + ANNS query.
Streaming ANNS: Can algorithms be robust to insertions and deletions? Here we have a strong baseline (fresh-diskann: https://arxiv.org/abs/2105.09613).
Out of distribution queries: this is already a problem with T2I and we can imagine various variations
Better vector compression: Most approaches use some variant of product quantization as vector compression, but can we get more accurate estimation, maybe at the price of more expensive decoding?
Please let us know what you think about these topics, and add your own!
Thanks!
Hello!
Thanks for providing the scripts for running baselines. The following one liner:
python -u track1_baseline_faiss/baseline_faiss.py --dataset bigann-100M \
--indexkey OPQ64_128,IVF1048576_HNSW32,PQ64x4fsr \
--maxtrain 100000000 \
--two_level_clustering \
--build \
--add_splits 30 \
--indexfile data/track1_baseline_faiss/deep-100M.IVF1M_2level_PQ64x4fsr.faissindex \
--quantizer_efConstruction 200 \
--quantizer_add_efSearch 80
produces output on F32s_v2 with 64G RAM:
args= Namespace(M0=-1, add_bs=100000, add_splits=30, autotune_max=[], autotune_range=[], basedir=None, build=True, buildthreads=-1, by_residual=-1, clustering_niter=-1, dataset='bigann-100M', indexfile='data/track1_baseline_faiss/deep-100M.IVF1M_2level_PQ64x4fsr.faissindex', indexkey='OPQ64_128,IVF1048576_HNSW32,PQ64x4fsr', inter=True, k=10, maxRAM=-1, maxtrain=100000000, min_test_duration=3.0, n_autotune=500, no_precomputed_tables=False, pairwise_quantization='', parallel_mode=-1, prepare=False, quantizer_add_efSearch=80, quantizer_efConstruction=200, query_bs=-1, radius=96237, search=False, searchparams=['autotune'], searchthreads=-1, stop_at_split=-1, train_on_gpu=False, two_level_clustering=True)
nb processors 32
model name : Intel(R) Xeon(R) Platinum 8272CL CPU @ 2.60GHz
Dataset BigANNDataset in dimension 128, with distance euclidean, search_type knn, size: Q 10000 B 100000000
build index, key= OPQ64_128,IVF1048576_HNSW32,PQ64x4fsr
Build-time number of threads: 32
metric type 1
Update add-time parameters
update quantizer efSearch= 16 -> 80
update quantizer efConstruction= 40 -> 200
getting first 100000000 dataset vectors for training
train, size (100000000, 128)
Forcing OPQ training PQ to PQ4
training vector transform
transform trainset
Killed
Can you please explain what could be wrong? Is the expectation to allocate 10% of data for training?
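Not an official answer, but a hedged back-of-the-envelope check suggests memory is the likely issue: --maxtrain 100000000 selects the full 100M base vectors as the training set, and once they are converted to float32 for training they alone approach the machine's 64 GB, so reducing --maxtrain (i.e. training on a subsample) seems necessary here.

```python
# Rough size of the training set implied by --maxtrain 100000000 on bigann-100M
# (128-dimensional vectors), assuming conversion to float32 for training.
n_train, dim, bytes_per_float32 = 100_000_000, 128, 4
train_bytes = n_train * dim * bytes_per_float32
print(f"{train_bytes / 2**30:.1f} GiB")   # ~47.7 GiB before any transform/index overhead
```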
When running with the latest Docker Desktop on mac (https://docs.docker.com/desktop/install/mac-install/), running run.py stops with a "KeyError" around line 302 of runner.py (there is no item in the dict with key "Error").
With an older version of docker (Docker version 20.10.11, build dea9396) the code runs as expected, but the logs show nothing (although the container is running, and the results folder is created with the new results).
@maumueller T3 algorithms will be tied to certain hardware. Instead of putting T3 algo definitions in the default algos.yaml, would it be better to put them into a separate one?
Could the building of the index (which has a 4-day constraint) start by downloading one of the available baseline indexes and then applying changes to it (possibly taking a couple of days)?
I am trying to get the bigann-1B dataset using the following command, but it fails with an error.
Is there an alternative source for this dataset?
(big-ann) ubuntu@ip-172-31-2-12:~/pgvector_testing/big-ann-benchmarks$ python create_dataset.py --dataset bigann-1B
downloading https://dl.fbaipublicfiles.com/billion-scale-ann-benchmarks/bigann/query.public.10K.u8bin -> data/bigann/query.public.10K.u8bin...
[0.24 s] downloaded 1.22 MiB / 1.22 MiB at 5.10 MiB/s
download finished in 0.24 s, total size 1280008 bytes
downloading https://dl.fbaipublicfiles.com/billion-scale-ann-benchmarks/bigann/GT.public.1B.ibin -> data/bigann/GT.public.1B.ibin...
[0.40 s] downloaded 7.63 MiB / 7.63 MiB at 19.18 MiB/s
download finished in 0.40 s, total size 8000008 bytes
downloading https://dl.fbaipublicfiles.com/billion-scale-ann-benchmarks/bigann/query.private.799253207.10K.u8bin -> data/bigann/query.private.799253207.10K.u8bin...
Traceback (most recent call last):
File "/home/ubuntu/pgvector_testing/big-ann-benchmarks/create_dataset.py", line 16, in
ds.prepare(True if args.skip_data else False)
File "/home/ubuntu/pgvector_testing/big-ann-benchmarks/benchmark/datasets.py", line 140, in prepare
download(self.private_qs_url, outfile)
File "/home/ubuntu/pgvector_testing/big-ann-benchmarks/benchmark/dataset_io.py", line 25, in download
inf = urlopen(src)
File "/usr/lib/python3.10/urllib/request.py", line 216, in urlopen
return opener.open(url, data, timeout)
File "/usr/lib/python3.10/urllib/request.py", line 525, in open
response = meth(req, response)
File "/usr/lib/python3.10/urllib/request.py", line 634, in http_response
response = self.parent.error(
File "/usr/lib/python3.10/urllib/request.py", line 563, in error
return self._call_chain(*args)
File "/usr/lib/python3.10/urllib/request.py", line 496, in _call_chain
result = func(*args)
File "/usr/lib/python3.10/urllib/request.py", line 643, in http_error_default
raise HTTPError(req.full_url, code, msg, hdrs, fp)
urllib.error.HTTPError: HTTP Error 403: Forbidden
I wonder if it is okay to use all methods (interfaces) exposed in the Dataset class when implementing the algorithms to be used in the benchmark. I am trying to access the file directly by using the get_dataset_fn method instead of the get_dataset_iterator method, and I wonder if this is an issue.
Also, there seems to be something wrong with the implementation of the get_dataset_fn method for small datasets. In get_dataset_fn, if there is an original (1-billion) file, the path of the original file is returned. When used in the get_dataset_iterator method, this seems reasonable because only a part of the original file is used via mmap. However, if get_dataset_fn is an externally exposed interface, it would be appropriate to give the path of the actual small file. Or, when using the get_dataset_fn method, if it is a small dataset but not a crop file, I am wondering if I should use only a part of the file.
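For what it's worth, a hedged sketch of how a caller can cope with get_dataset_fn pointing at the original (1-billion) file: memory-map it and slice only the first nb vectors, assuming the usual *bin layout of a uint32 count and uint32 dimension followed by the raw vectors (attribute names like ds.nb reflect my reading of the code).

```python
import numpy as np

def mmap_first_n(fname, nb, dtype=np.uint8):
    """Memory-map a *bin file (uint32 n, uint32 d header, then n*d values of
    `dtype`) and return a view of only the first `nb` vectors."""
    n, d = (int(x) for x in np.fromfile(fname, dtype=np.uint32, count=2))
    assert nb <= n
    data = np.memmap(fname, dtype=dtype, mode="r", offset=8, shape=(n, d))
    return data[:nb]

# e.g. work on the 10M crop even though the returned path is the 1B file:
# xb = mmap_first_n(ds.get_dataset_fn(), ds.nb)
```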
big-ann-benchmarks/benchmark/main.py, line 145 (commit 8180e0e)
When I look at the metadata file for the yfcc100M dataset, I can only make out the first two values, the number of nodes and the number of labels.
For the rest of the data format, I guess it is a sparse matrix format, but I can't tell for sure.
Please tell me where I can get more information.
Thanks!
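In case it helps while waiting for an answer: my guess (unverified) is that the .spmat files use the same CSR-on-disk layout as the sparse track's files, i.e. int64 nrow/ncol/nnz, then int64 indptr, int32 indices, float32 data; under that assumption the labels of point i are simply the column indices of row i.

```python
# Assuming the CSR reading helper sketched earlier for base_small.csr
# (read_csr_file) and the same on-disk layout for the .spmat metadata:
meta = read_csr_file("data/yfcc100M/base.metadata.10M.spmat")
labels_of_point_0 = meta.indices[meta.indptr[0]:meta.indptr[1]]   # label ids of point 0
print(meta.shape, labels_of_point_0[:10])
```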
Otherwise (without specifying the track), an error will be generated:
> python3 plot.py --dataset yfcc-10M
writing output to results/yfcc-10M.png
Traceback (most recent call last):
File "/home/nop/projects/nips23/big-ann-benchmarks/plot.py", line 161, in <module>
raise Exception('Nothing to plot')
Exception: Nothing to plot
After adding --neurips23track filter
everything works as expected
> python3 plot.py --dataset yfcc-10M --neurips23track filter
writing output to results/yfcc-10M.png
Computing knn metrics
0: Faiss(('IVF16384,SQ8', {'nprobe': 32, 'mt_threshold': 0.0001})) 0.847 3069.679
Computing knn metrics
1: Faiss(('IVF16384,SQ8', {'nprobe': 64, 'mt_threshold': 0.0001})) 0.901 2421.102
...
The correct command line was somewhat unclear from the description on the main page.
Python: 3.8.5
(venv) (base) dmitry@dmitrykan:/datadrive/big-ann-benchmarks$ pip install -r requirements.txt
Collecting ansicolors==1.1.8
Using cached ansicolors-1.1.8-py2.py3-none-any.whl (13 kB)
Collecting docker==2.6.1
Using cached docker-2.6.1-py2.py3-none-any.whl (117 kB)
Collecting h5py==2.10.0
Using cached h5py-2.10.0-cp38-cp38-manylinux1_x86_64.whl (2.9 MB)
Collecting matplotlib==2.1.0
Using cached matplotlib-2.1.0.tar.gz (35.7 MB)
ERROR: Command errored out with exit status 1:
command: /datadrive/big-ann-benchmarks/venv/bin/python -c 'import io, os, sys, setuptools, tokenize; sys.argv[0] = '"'"'/tmp/pip-install-glx_dtq7/matplotlib_c5b2d6b3ba9e41898b05b991281ae963/setup.py'"'"'; __file__='"'"'/tmp/pip-install-glx_dtq7/matplotlib_c5b2d6b3ba9e41898b05b991281ae963/setup.py'"'"';f = getattr(tokenize, '"'"'open'"'"', open)(__file__) if os.path.exists(__file__) else io.StringIO('"'"'from setuptools import setup; setup()'"'"');code = f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' egg_info --egg-base /tmp/pip-pip-egg-info-vl5anihm
cwd: /tmp/pip-install-glx_dtq7/matplotlib_c5b2d6b3ba9e41898b05b991281ae963/
Complete output (78 lines):
IMPORTANT WARNING:
pkg-config is not installed.
matplotlib may not be able to find some of its dependencies
============================================================================
Edit setup.cfg to change the build options
BUILDING MATPLOTLIB
matplotlib: yes [2.1.0]
python: yes [3.8.5 (default, Sep 4 2020, 07:30:14) [GCC
7.3.0]]
platform: yes [linux]
REQUIRED DEPENDENCIES AND EXTENSIONS
numpy: yes [not found. pip may install it below.]
six: yes [six was not found.pip will attempt to install
it after matplotlib.]
dateutil: yes [dateutil was not found. It is required for date
axis support. pip/easy_install may attempt to
install it after matplotlib.]
backports.functools_lru_cache: yes [Not required]
subprocess32: yes [Not required]
pytz: yes [pytz was not found. pip/easy_install may
attempt to install it after matplotlib.]
cycler: yes [cycler was not found. pip/easy_install may
attempt to install it after matplotlib.]
tornado: yes [tornado was not found. It is required for the
WebAgg backend. pip/easy_install may attempt to
install it after matplotlib.]
pyparsing: yes [pyparsing was not found. It is required for
mathtext support. pip/easy_install may attempt to
install it after matplotlib.]
libagg: yes [pkg-config information for 'libagg' could not
be found. Using local copy.]
freetype: no [The C/C++ header for freetype2 (ft2build.h)
could not be found. You may need to install the
development package.]
png: yes [version 1.6.37]
qhull: yes [pkg-config information for 'libqhull' could not
be found. Using local copy.]
OPTIONAL SUBPACKAGES
sample_data: yes [installing]
toolkits: yes [installing]
tests: no [skipping due to configuration]
toolkits_tests: no [skipping due to configuration]
OPTIONAL BACKEND EXTENSIONS
macosx: no [Mac OS-X only]
qt5agg: no [PySide2 not found; PyQt5 not found]
qt4agg: no [PySide not found; PyQt4 not found]
gtk3agg: no [Requires pygobject to be installed.]
Traceback (most recent call last):
File "<string>", line 1, in <module>
File "/tmp/pip-install-glx_dtq7/matplotlib_c5b2d6b3ba9e41898b05b991281ae963/setup.py", line 216, in <module>
pkg_help = pkg.install_help_msg()
File "/tmp/pip-install-glx_dtq7/matplotlib_c5b2d6b3ba9e41898b05b991281ae963/setupext.py", line 595, in install_help_msg
release = platform.linux_distribution()[0].lower()
AttributeError: module 'platform' has no attribute 'linux_distribution'
gtk3cairo: no [Requires cairocffi or pycairo to be installed.]
gtkagg: no [Requires pygtk]
tkagg: yes [installing; run-time loading from Python Tcl /
Tk]
wxagg: no [requires wxPython]
gtk: no [Requires pygtk]
agg: yes [installing]
cairo: no [cairocffi or pycairo not found]
windowing: no [Microsoft Windows only]
OPTIONAL LATEX DEPENDENCIES
dvipng: no
ghostscript: no
latex: no
pdftops: no
OPTIONAL PACKAGE DATA
dlls: no [skipping due to configuration]
============================================================================
----------------------------------------
WARNING: Discarding https://files.pythonhosted.org/packages/6c/90/cf10bb2020d2811da811a49601f6eafcda022c6ccd296fd05aba093dee96/matplotlib-2.1.0.tar.gz#sha256=4b5f16c9cefde553ea79975305dcaa67c8e13d927b6e55aa14b4a8d867e25387 (from https://pypi.org/simple/matplotlib/). Command errored out with exit status 1: python setup.py egg_info Check the logs for full command output.
ERROR: Could not find a version that satisfies the requirement matplotlib==2.1.0 (from versions: 0.86, 0.86.1, 0.86.2, 0.91.0, 0.91.1, 1.0.1, 1.1.0, 1.1.1, 1.2.0, 1.2.1, 1.3.0, 1.3.1, 1.4.0, 1.4.1rc1, 1.4.1, 1.4.2, 1.4.3, 1.5.0, 1.5.1, 1.5.2, 1.5.3, 2.0.0b1, 2.0.0b2, 2.0.0b3, 2.0.0b4, 2.0.0rc1, 2.0.0rc2, 2.0.0, 2.0.1, 2.0.2, 2.1.0rc1, 2.1.0, 2.1.1, 2.1.2, 2.2.0rc1, 2.2.0, 2.2.2, 2.2.3, 2.2.4, 2.2.5, 3.0.0rc2, 3.0.0, 3.0.1, 3.0.2, 3.0.3, 3.1.0rc1, 3.1.0rc2, 3.1.0, 3.1.1, 3.1.2, 3.1.3, 3.2.0rc1, 3.2.0rc3, 3.2.0, 3.2.1, 3.2.2, 3.3.0rc1, 3.3.0, 3.3.1, 3.3.2, 3.3.3, 3.3.4, 3.4.0rc1, 3.4.0rc2, 3.4.0rc3, 3.4.0, 3.4.1, 3.4.2, 3.4.3, 3.5.0b1)
ERROR: No matching distribution found for matplotlib==2.1.0
@alexklibisz Since FAISS uses D,I order to return range search results, in a recent commit I changed the range search result return order in the base ANN class to D,I.
Please consider updating the HttpANN class as well.
When downloading datasets, we met two errors; the error info is shown below:
A.
Initializing download: https://dl.fbaipublicfiles.com/billion-scale-ann-benchmarks/bigann/base.1B.u8bin
Unable to connect to server dl.fbaipublicfiles.com:80
B.
Initializing download: https://comp21storage.blob.core.windows.net/publiccontainer/comp21/MSFT-TURING-ANNS/base1b.fbin
HTTP/1.1 400 The account being accessed does not support http.
The current workflow to run algorithm X on dataset Y is something like this:
1. python install.py builds the docker container
2. python create_dataset.py sets up the datasets
3. python run.py --dataset Y --algorithm X mounts data/, results/, benchmark/ into the container for X
4. python plot.py / data_export.py / ... to evaluate the results
Given @harsha-simhadri's and @sourcesync's frustrations and some directions discussed in other meetings, I think we should relax step 3 a bit and allow more flexibility in the container setup. One direction could look like this (a rough entry-point sketch follows below):
1. python install.py builds the docker container; participants are expected to overwrite the entry point to point to their own implementation (file algorithms/X/Dockerfile)
2. python create_dataset.py sets up the datasets
3. The container's entry point is the participant's own run script (algorithms/X/run.{py,sh}). It is given the task, the dataset, where the results should be written, and some additional parameters. The container mounts data/, results/, and the config file that is used by the implementation (algorithms/X/config.yaml, maybe task specific), and results are written in a standard format (results/Y/X/run_identifier.hdf5)
4. python plot.py / data_export.py / ... to evaluate the results
We provide a default run script for inspiration, which would be pretty close to the current setup. Putting all the logic into the container could mean a lot of code duplication, but isolated containers will allow for much easier orchestration.
I can provide a proof-of-concept if this sounds promising.
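To make the proposal a bit more concrete, here is a hedged sketch of what such a per-algorithm entry point could look like; the argument names, paths, and the HDF5 layout are illustrative only, not a decided interface.

```python
# algorithms/X/run.py -- illustrative entry point for the proposed setup.
# The runner passes the task, the dataset name and the output path; the
# implementation reads its own config and writes results in a standard format.
import argparse
import h5py
import numpy as np
import yaml

def main():
    p = argparse.ArgumentParser()
    p.add_argument("--task", required=True)
    p.add_argument("--dataset", required=True)
    p.add_argument("--output", required=True)      # e.g. results/Y/X/run_identifier.hdf5
    p.add_argument("--config", default="algorithms/X/config.yaml")
    args = p.parse_args()

    with open(args.config) as f:
        params = yaml.safe_load(f)

    # ... build/load the index from data/ and run the queries here ...
    neighbors = np.zeros((10_000, 10), dtype=np.int32)   # placeholder results

    with h5py.File(args.output, "w") as f:
        f.attrs["algo"] = "X"
        f.attrs["dataset"] = args.dataset
        f.create_dataset("neighbors", data=neighbors)

if __name__ == "__main__":
    main()
```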
For the Azure Premium SSD used in the build machine for Task 2 (with 4 TB), have you been using Host Caching? If so, Read-Only (default) or Write-Read?
I tried to build the track-3 baseline index following:
python track3_baseline_faiss/gpu_baseline_faiss.py --dataset bigann-1B \
  --indexkey IVF1048576,SQ8 \
  --train_on_gpu \
  --build --quantizer_on_gpu_add --add_splits 30 \
  --search \
  --searchparams nprobe={1,4,16,64,256} \
  --parallel_mode 3 --quantizer_on_gpu_search
but it failed because my DRAM is 128 GB, less than the baseline's 768 GB.
Could someone provide a link to a prebuilt Track-3 baseline index? Thanks.
Hi, I wanted to know whether there are any plans for adding benchmarks for SCANN? I am not sure if benchmarks are available for SCANN on large datasets, so I was curious about this. Thanks!
@maumueller Alright, I tried the index strategy "OPQ32_128,IVF1048576_HNSW32,PQ32" on SSNPP and got the exception below. Note that I'm now defaulting to CPU on build_index for this dataset since the quantizer class doesn't support range search.
I will next try to set quantizer_on_gpu_add=False and train_on_gpu=False for build_index(). The default was True for both.
...
Training PQ slice 30/32
Clustering 65536 points in 4D to 256 clusters, redo 1 times, 25 iterations
Preprocessing in 0.00 s
Iteration 24 (0.36 s, search 0.32 s): objective=1.11718e+07 imbalance=1.174 nsplit=0
Training PQ slice 31/32
Clustering 65536 points in 4D to 256 clusters, redo 1 times, 25 iterations
Preprocessing in 0.00 s
Iteration 24 (0.36 s, search 0.32 s): objective=1.12101e+07 imbalance=1.185 nsplit=0
doing polysemous training for PQ
IndexIVFPQ::precompute_table: not precomputing table, it would be too big: 34359738368 bytes (max 2147483648)
Total train time 14384.034 s
============== SPLIT 0/1
Process Process-1:
Traceback (most recent call last):
File "/home/george/anaconda3/envs/bigann/lib/python3.8/multiprocessing/process.py", line 315, in _bootstrap
self.run()
File "/home/george/anaconda3/envs/bigann/lib/python3.8/multiprocessing/process.py", line 108, in run
self._target(*self._args, **self._kwargs)
File "/home/george/Projects/BigANN/harsha/big-ann-benchmarks/benchmark/main.py", line 45, in run_worker
run_no_docker(definition, args.dataset, args.count,
File "/home/george/Projects/BigANN/harsha/big-ann-benchmarks/benchmark/runner.py", line 268, in run_no_docker
run_from_cmdline(cmd)
File "/home/george/Projects/BigANN/harsha/big-ann-benchmarks/benchmark/runner.py", line 182, in run_from_cmdline
run(definition, args.dataset, args.count, args.runs, args.rebuild)
File "/home/george/Projects/BigANN/harsha/big-ann-benchmarks/benchmark/runner.py", line 76, in run
algo.fit(dataset)
File "/home/george/Projects/BigANN/harsha/big-ann-benchmarks/benchmark/algorithms/faiss_t3.py", line 274, in fit
index = build_index(buildthreads, by_residual, maxtrain,
File "/home/george/Projects/BigANN/harsha/big-ann-benchmarks/benchmark/algorithms/faiss_t3.py", line 184, in build_index
for xblock, assign in stage2:
File "/home/george/Projects/BigANN/harsha/big-ann-benchmarks/benchmark/algorithms/faiss_t3.py", line 46, in rate_limited_iter
res = res.get()
File "/home/george/anaconda3/envs/bigann/lib/python3.8/multiprocessing/pool.py", line 771, in get
raise self._value
File "/home/george/anaconda3/envs/bigann/lib/python3.8/multiprocessing/pool.py", line 125, in worker
result = (True, func(*args, **kwds))
File "/home/george/Projects/BigANN/harsha/big-ann-benchmarks/benchmark/algorithms/faiss_t3.py", line 39, in next_or_None
return next(l)
File "/home/george/Projects/BigANN/harsha/big-ann-benchmarks/benchmark/algorithms/faiss_t3.py", line 176, in produce_batches
_, assign = quantizer_gpu.search(xblock, 1)
File "/home/george/anaconda3/envs/bigann/lib/python3.8/site-packages/faiss/init.py", line 287, in replacement_search
assert d == self.d
AssertionError
I can't reproduce the results in the benchmark through https://github.com/harsha-simhadri/big-ann-benchmarks/tree/main/t1_t2. I used the same machine in Azure, Standard L8s_v2 (8 vcpus, 64 GiB memory), but the QPS for T2 is always around 100+. I checked the CPU usage during index search; it never goes higher than 100%, so search is very slow. Is there any configuration I should mind and deal with? P.S.: I used an SSD that is not the original one from the Standard L8s_v2 because of a space problem.
diskann-t2,DiskANN,bigann-1B,10,68.26486482533713,3897.3418,1000000.0,60385020.0,884569.5388762793,109.42,117092.392,0.97882
diskann-t2,DiskANN,bigann-1B,10,178.47422862340082,1835.1444,1000000.0,60385020.0,338340.2772812576,42.42,44780.5504,0.89605
diskann-t2,DiskANN,bigann-1B,10,126.13846714851216,2419.60864,1000000.0,60385020.0,478720.10311417724,61.0627,63752.9448,0.9426500000000001
diskann-t2,DiskANN,bigann-1B,10,96.7202182888592,3009.87794,1000000.0,60385020.0,624326.7547190337,80.204,83991.4693,0.96411
diskann-t2,DiskANN,bigann-1B,10,149.1239430899944,2126.48608,1000000.0,60385020.0,404931.7550807948,51.6712,54252.8475,0.9246700000000001
diskann-t2,DiskANN,bigann-1B,10,109.6912829487855,2714.58698,1000000.0,60385020.0,550499.7149882326,70.5957,72910.7472,0.9549200000000001
diskann-t2,DiskANN,bigann-1B,10,105.95468768221484,2772.53362,1000000.0,60385020.0,569913.6236530666,72.4778,76382.5452,0.95691
diskann-t2,DiskANN,bigann-1B,10,84.24591401229037,3306.53768,1000000.0,60385020.0,716770.9046540895,89.9183,96594.0784,0.9699899999999999
diskann-t2,DiskANN,bigann-1B,10,117.73705018597681,2566.1336200000005,1000000.0,60385020.0,512880.3541843128,65.7918,68474.3466,0.94913
diskann-t2,DiskANN,bigann-1B,10,101.80129233368652,2862.72176,1000000.0,60385020.0,593165.5543435407,75.4055,79424.7472,0.9599399999999999
I used an SSD which I created in Azure.
Couldn't access the slides of talks from track winners.
URLs like https://big-ann-benchmarks.com/templates/slides/* (e.g. https://big-ann-benchmarks.com/templates/slides/invited-talk-anshu.pptx) just report a 404 error.
With the current setup of using all threads available, the daemon thread (https://github.com/harsha-simhadri/big-ann-benchmarks/blob/main/benchmark/runner.py#L207-L212) seems to never get scheduled during querying, so there is no visible progress.
I am actually not sure how to fix that, since there is no way to give it some kind of higher priority. Would love to hear some thoughts!
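One hedged idea, not tested: move the progress monitor out of the worker into its own OS process (optionally pinned to a core that the benchmark's cpuset leaves free), so the kernel can still schedule it when every benchmark thread is busy; the function below is purely illustrative, not the repo's runner code.

```python
import multiprocessing
import os
import time

def monitor(tag, interval=10, spare_core=None):
    """Print a heartbeat from a separate process so it cannot be starved by
    the benchmark's own threads. `spare_core` is an assumed free CPU id."""
    if spare_core is not None:
        os.sched_setaffinity(0, {spare_core})   # Linux only
    while True:
        print(f"[monitor] {tag} still running...", flush=True)
        time.sleep(interval)

def start_monitor(tag):
    p = multiprocessing.Process(target=monitor, args=(tag,), daemon=True)
    p.start()
    return p   # call p.terminate() when the run finishes
```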
Track 1 & Track 2 can run successfully.
But I met a Track 3 docker evironment error because nvidia/cuda:11.0-devel-ubuntu18.04 is no longer available.
Install Status:
{'faiss_t3': 'fail'}
=> ERROR [internal] load metadata for docker.io/nvidia/cuda:11.0-devel-ubuntu18.04 1.6s
------
> [internal] load metadata for docker.io/nvidia/cuda:11.0-devel-ubuntu18.04:
------
Dockerfile:2
--------------------
1 |
2 | >>> FROM nvidia/cuda:11.0-devel-ubuntu18.04
3 |
4 | ENV PATH="/root/miniconda3/bin:${PATH}"
--------------------
ERROR: failed to solve: nvidia/cuda:11.0-devel-ubuntu18.04: docker.io/nvidia/cuda:11.0-devel-ubuntu18.04: not found
So I tried a similar nvidia docker image, nvidia/cuda:11.0.3-devel-ubuntu18.04, but it seems to cause a lot of package conflicts:
Examining conflict for libgcc-ng libgomp _openmp_mutex: 67%|██████▋ | 34/51 [02:32<00:21, 1.25s/it] failed
#0 358.3
#0 358.3 UnsatisfiableError: The following specifications were found to be incompatible with a past
#0 358.3 explicit spec that is not an explicit spec in this operation (setuptools):
#0 358.3
#0 358.3 - faiss-gpu -> numpy[version='>=1.11,<2'] -> python[version='>=3.10,<3.11.0a0|>=3.11,<3.12.0a0']
#0 358.3 - faiss-gpu -> python[version='>=2.7,<2.8.0a0|>=3.6,<3.7.0a0|>=3.7,<3.8.0a0|>=3.9,<3.10.0a0|>=3.8,<3.9.0a0|>=3.5,<3.6.0a0']
#0 358.3 - python=3.6.9 -> pip -> setuptools
#0 358.3 - python=3.6.9 -> pip -> wheel
#0 358.3
#0 358.3 The following specifications were found to be incompatible with each other:
#0 358.3
#0 358.3 Output in format: Requested package -> Available versions
#0 358.3
#0 358.3 Package ncurses conflicts for:
#0 358.3 pip -> python[version='>=3.11,<3.12.0a0'] -> ncurses[version='6.0.*|>=6.0,<7.0a0|>=6.1,<7.0a0|>=6.2,<7.0a0|>=6.3,<7.0a0|>=6.4,<7.0a0']
#0 358.3 cffi -> python[version='>=3.11,<3.12.0a0'] -> ncurses[version='6.0.*|>=6.0,<7.0a0|>=6.1,<7.0a0|>=6.2,<7.0a0|>=6.3,<7.0a0|>=6.4,<7.0a0']
#0 358.3 cryptography -> python[version='>=3.11,<3.12.0a0'] -> ncurses[version='6.0.*|>=6.0,<7.0a0|>=6.1,<7.0a0|>=6.2,<7.0a0|>=6.3,<7.0a0|>=6.4,<7.0a0']
#0 358.3 pyopenssl -> python[version='>=3.11,<3.12.0a0'] -> ncurses[version='6.0.*|>=6.0,<7.0a0|>=6.1,<7.0a0|>=6.2,<7.0a0|>=6.3,<7.0a0|>=6.4,<7.0a0']
#0 358.3 brotlipyThe following specifications were found to be incompatible with your system:
#0 358.3
#0 358.3 - feature:/linux-64::__glibc==2.27=0
#0 358.3 - brotlipy -> libgcc-ng[version='>=7.3.0'] -> __glibc[version='>=2.17']
#0 358.3 - bzip2 -> libgcc-ng[version='>=7.3.0'] -> __glibc[version='>=2.17']
#0 358.3 - cffi -> libgcc-ng[version='>=11.2.0'] -> __glibc[version='>=2.17']
#0 358.3 - conda-package-handling -> libgcc-ng[version='>=11.2.0'] -> __glibc[version='>=2.17']
#0 358.3 - cryptography -> libgcc-ng -> __glibc[version='>=2.17']
#0 358.3 - cudatoolkit=11.0 -> libgcc-ng[version='>=7.3.0'] -> __glibc[version='>=2.17']
#0 358.3 - faiss-gpu -> libgcc-ng[version='>=8.4.0'] -> __glibc[version='>=2.17']
#0 358.3 - libffi -> libgcc-ng[version='>=11.2.0'] -> __glibc[version='>=2.17']
#0 358.3 - libgcc-ng -> __glibc[version='>=2.17']
#0 358.3 - libstdcxx-ng -> __glibc[version='>=2.17']
#0 358.3 - libuuid -> libgcc-ng[version='>=11.2.0'] -> __glibc[version='>=2.17']
#0 358.3 - ncurses -> libgcc-ng[version='>=11.2.0'] -> __glibc[version='>=2.17']
#0 358.3 - openssl -> libgcc-ng[version='>=7.5.0'] -> __glibc[version='>=2.17']
#0 358.3 - pycosat -> libgcc-ng[version='>=11.2.0'] -> __glibc[version='>=2.17']
#0 358.3 - python=3.6.9 -> libgcc-ng[version='>=7.3.0'] -> __glibc[version='>=2.17']
#0 358.3 - readline -> libgcc-ng[version='>=11.2.0'] -> __glibc[version='>=2.17']
#0 358.3 - ruamel.yaml -> libgcc-ng[version='>=11.2.0'] -> __glibc[version='>=2.17']
#0 358.3 - ruamel.yaml.clib -> libgcc-ng[version='>=11.2.0'] -> __glibc[version='>=2.17']
#0 358.3 - sqlite -> libgcc-ng[version='>=11.2.0'] -> __glibc[version='>=2.17']
#0 358.3 - tk -> libgcc-ng[version='>=7.5.0'] -> __glibc[version='>=2.17']
#0 358.3 - xz -> libgcc-ng[version='>=11.2.0'] -> __glibc[version='>=2.17']
#0 358.3 - zlib -> libgcc-ng[version='>=11.2.0'] -> __glibc[version='>=2.17']
#0 358.3 - zstandard -> libgcc-ng[version='>=11.2.0'] -> __glibc[version='>=2.17']
#0 358.3
#0 358.3 Your installed version is: 2.27
#0 358.3
#0 358.3
------
Dockerfile:11
--------------------
10 |
11 | >>> RUN wget \
12 | >>> https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh \
13 | >>> && mkdir /root/.conda \
14 | >>> && bash Miniconda3-latest-Linux-x86_64.sh -b \
15 | >>> && rm -f Miniconda3-latest-Linux-x86_64.sh \
16 | >>> && conda --version \
17 | >>> && conda install -c pytorch python=3.6.9 faiss-gpu cudatoolkit=11.0
18 |
--------------------
ERROR: failed to solve: process "/bin/sh -c wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh && mkdir /root/.conda && bash Miniconda3-latest-Linux-x86_64.sh -b && rm -f Miniconda3-latest-Linux-x86_64.sh && conda --version && conda install -c pytorch python=3.6.9 faiss-gpu cudatoolkit=11.0" did not complete successfully: exit code: 1
I need to implement a custom dataset and its handling, and I have been thinking about the easiest way to approach it.
I've implemented something halfway through that kept me going and allowed me to plug in a custom dataset; in fact it is a dataset derived from BIGANN by reducing dimensionality using a neural network.
I will show the code of what I needed to change and am happy to discuss this further!
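To give a flavour of the change before posting the actual diff, here is a rough sketch of registering a derived dataset, assuming the DatasetCompetitionFormat base class and the DATASETS registry in benchmark/datasets.py (attribute names reflect my reading of the code and may need adjusting).

```python
import os
from benchmark.datasets import DATASETS, DatasetCompetitionFormat

class ReducedBigANN(DatasetCompetitionFormat):
    """BIGANN with dimensionality reduced by a neural network (illustrative)."""
    def __init__(self):
        self.nb = 10_000_000                  # number of base vectors
        self.nq = 10_000                      # number of queries
        self.d = 64                           # reduced dimensionality
        self.dtype = "float32"
        self.ds_fn = "base.reduced.fbin"      # files produced by my reduction step
        self.qs_fn = "query.reduced.fbin"
        self.gt_fn = "GT.reduced.ibin"
        self.base_url = None                  # nothing to download; files are local
        self.basedir = os.path.join("data", "bigann-reduced")

    def distance(self):
        return "euclidean"

# register so --dataset bigann-reduced-10M works in create_dataset.py / run.py
DATASETS["bigann-reduced-10M"] = ReducedBigANN
```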