criteo / autofaiss

Home Page: https://criteo.github.io/autofaiss/

License: Apache License 2.0

autofaiss's Introduction

AutoFaiss


Automatically create Faiss knn indices with optimal similarity-search parameters.

It selects the best indexing parameters to achieve the highest recall given memory and query-speed constraints.

Doc and posts and notebooks

Using faiss efficient indices, binary search, and heuristics, autofaiss makes it possible to automatically build, in about 3 hours, a large KNN index (200 million vectors, 1TB) with a low memory footprint (15 GB) and millisecond latency (10ms).

Get started by running this colab notebook, then check the full documentation.
Get some insights on the automatic index selection function with this colab notebook.

Then you can check our multimodal search example (using the OpenAI CLIP model).

Read the medium post to learn more about it!

Installation

To install, run pip install autofaiss

It's probably best to create a virtual env:

python -m venv .venv/autofaiss_env
source .venv/autofaiss_env/bin/activate
pip install -U pip
pip install autofaiss

Using autofaiss in python

If you want to use autofaiss directly from python, check the API documentation and the examples.

In particular, you can use autofaiss with in-memory or on-disk embedding collections:

Using in-memory numpy arrays

If you only have a few embeddings, you can use autofaiss with in-memory numpy arrays:

from autofaiss import build_index
import numpy as np

embeddings = np.float32(np.random.rand(100, 512))
index, index_infos = build_index(embeddings, save_on_disk=False)

query = np.float32(np.random.rand(1, 512))
_, I = index.search(query, 1)
print(I)

Using numpy arrays saved as .npy files

If you have many embeddings, it is preferable to save them on disk as .npy files first, then use autofaiss like this:

from autofaiss import build_index

build_index(embeddings="embeddings", index_path="my_index_folder/knn.index",
            index_infos_path="my_index_folder/index_infos.json", max_index_memory_usage="4G",
            current_memory_available="4G")

Memory-mapped indices

Faiss makes it possible to use memory-mapped indices. This is useful when you don't need fast search times (>50ms) and want to reduce the memory footprint to the minimum.

We provide the should_be_memory_mappable boolean in the build_index function to generate memory-mapped indices only. Note: only IVF indices can be memory-mapped in faiss, so the output index will be an IVF index.

To load an index in memory mapping mode, use the following code:

import faiss
index = faiss.read_index("my_index_folder/knn.index", faiss.IO_FLAG_MMAP | faiss.IO_FLAG_READ_ONLY)

You can have a look at the examples to see how to use it.

Technical note: you can create a direct map on IVF indices with index.make_direct_map() (or directly from the build_index function by passing the make_direct_map boolean). Doing so greatly speeds up the .reconstruct() method, which returns the value of one of your vectors given its rank. However, this mapping is stored in RAM, so we advise you to create your own direct map in a memory-mapped numpy array and then call .reconstruct_from_offset() with your custom direct_map.
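
As an illustration, here is a minimal sketch of enabling the direct map at build time and reconstructing a vector from its rank; the small random dataset and the index_key (which just forces an IVF index, since the direct map only applies to IVF) are assumptions for the example:

from autofaiss import build_index
import numpy as np

embeddings = np.float32(np.random.rand(20000, 64))
# Force an IVF index so that make_direct_map is relevant.
index, _ = build_index(
    embeddings,
    save_on_disk=False,
    index_key="IVF256,Flat",
    make_direct_map=True,
)

# With the direct map built, .reconstruct() is fast: it returns the stored
# (possibly quantized) vector for a given rank.
print(index.reconstruct(42))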

Using autofaiss with pyspark

Autofaiss allows you to build indices with Spark for the following two use cases:

  • To build a big index in a distributed way
  • To build one index per partition of a partitioned dataset of embeddings, in parallel and in a distributed way.

Prerequisites:

  1. Install pyspark: pip install pyspark.
  2. Prepare your embeddings files (partitioned or not).
  3. Create a Spark session before calling autofaiss, as in the sketch below. If no Spark session exists, a default session will be created with a minimum configuration.
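
For instance, here is a minimal sketch of creating the session yourself before building; the Spark settings are illustrative only:

from pyspark.sql import SparkSession
from autofaiss import build_index

# Creating the session first lets autofaiss reuse it instead of a default one.
spark = (
    SparkSession.builder
    .appName("autofaiss")
    .config("spark.executor.memory", "8g")  # illustrative value, tune for your cluster
    .getOrCreate()
)

build_index(
    embeddings="embeddings",
    index_path="my_index_folder/knn.index",
    index_infos_path="my_index_folder/index_infos.json",
    distributed="pyspark",
)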

Creating a big index in a distributed way

See distributed_autofaiss.md for a complete guide.

It is possible to generate an index that would require more memory than what's available. To do so, you can control the number of index splits that will compose your index with nb_indices_to_keep. For example, if nb_indices_to_keep is 10 and index_path is knn.index, the final index will be decomposed into 10 smaller indices:

  • knn.index01
  • knn.index02
  • knn.index03
  • ...
  • knn.index10

A concrete example shows how to produce N indices and how to use them.
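
For instance, here is a hedged sketch of querying such split indices; it assumes 512-dimensional embeddings and a metric where smaller distances are better (for metric_type="ip", keep the largest scores instead):

import glob
import heapq

import faiss
import numpy as np

# Load every index split produced with nb_indices_to_keep > 1.
shards = [faiss.read_index(path) for path in sorted(glob.glob("knn.index*"))]

query = np.float32(np.random.rand(1, 512))
k = 5

# Search each shard, then keep the global best k over all shards.
candidates = []
for shard in shards:
    distances, ids = shard.search(query, k)
    candidates.extend(zip(distances[0], ids[0]))

print(heapq.nsmallest(k, candidates, key=lambda c: c[0]))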

Creating partitioned indexes

Given a partitioned dataset of embeddings, it is possible to create one index per partition by calling the method build_partitioned_indexes.

See this example that shows how to create partitioned indexes.
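
For illustration only, a sketch of what such a call might look like; the parameter names partitions and output_root_dir are assumptions here, so check the API documentation for the exact signature:

from autofaiss import build_partitioned_indexes

# Hypothetical call: build one index per listed partition of parquet embeddings.
metrics_per_partition = build_partitioned_indexes(
    partitions=["embeddings/part-00000", "embeddings/part-00001"],  # assumed parameter name
    output_root_dir="partitioned_indexes",  # assumed parameter name
)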

Using the command line

Create embeddings

import os
import numpy as np
embeddings = np.random.rand(1000, 100)
os.mkdir("embeddings")
np.save("embeddings/part1.npy", embeddings)
os.mkdir("my_index_folder")

Generate a Knn index

autofaiss build_index --embeddings="embeddings" --index_path="my_index_folder/knn.index" --index_infos_path="my_index_folder/index_infos.json" --metric_type="ip"

Try the index

import faiss
import glob
import numpy as np

my_index = faiss.read_index(glob.glob("my_index_folder/*.index")[0])

query_vector = np.float32(np.random.rand(1, 100))
k = 5
distances, indices = my_index.search(query_vector, k)

print(list(zip(distances[0], indices[0])))

How are indices selected?

To better understand why indices are selected and what their characteristics are, check the index selection demo.

Command quick overview

Quick description of the autofaiss build_index command:

embeddings -> Source path of the embeddings in numpy.
index_path -> Destination path of the created index.
index_infos_path -> Destination path of the index infos.
save_on_disk -> Save the index on the disk.
metric_type -> Similarity distance for the queries.

index_key -> (optional) Describe the index to build.
index_param -> (optional) Describe the hyperparameters of the index.
current_memory_available -> (optional) Describe the amount of memory available on the machine.
use_gpu -> (optional) Whether to use GPU or not (not tested).

Command details

The autofaiss build_index command takes the following parameters:

Flag | Default | Description
--embeddings | required | Directory (or list of directories) containing your .npy embedding files. If there are several files, they are read in lexicographical order. This can be a local path or a path in another filesystem, e.g. hdfs://root/... or s3://...
--index_path | required | Destination path of the faiss index on the local machine.
--index_infos_path | required | Destination path of the faiss index infos on the local machine.
--save_on_disk | required | Save the index on disk.
--file_format | "npy" | File format of the files in embeddings. Can be either npy for numpy matrix files or parquet for parquet serialized tables.
--embedding_column_name | "embeddings" | Only necessary when file_format=parquet. In this case, this is the name of the column containing the embeddings (one vector per row).
--id_columns | None | Can only be used when file_format=parquet. In this case, these are the names of the columns containing the ids of the vectors, and separate files will be generated to map these ids to indices in the KNN index.
--ids_path | None | Only useful when id_columns is not None and file_format=parquet. This will be the path (in any filesystem) where the id -> vector-index mapping files will be stored in parquet format.
--metric_type | "ip" | (Optional) Similarity function used for the query: "ip" for inner product, "l2" for euclidean distance.
--max_index_memory_usage | "32GB" | (Optional) Maximum size in GB of the created index; this bound is strict.
--current_memory_available | "32GB" | (Optional) Memory available (in GB) on the machine creating the index; having more memory is a boost because it reduces swapping between RAM and disk.
--max_index_query_time_ms | 10 | (Optional) Bound on the query time for KNN search; this bound is approximate.
--min_nearest_neighbors_to_retrieve | 20 | (Optional) Minimum number of nearest neighbors to retrieve when querying the index. This parameter is used only during the index hyperparameter fine-tuning step; it is not taken into account when selecting the indexing algorithm. It has priority over the max_index_query_time_ms constraint.
--index_key | None | (Optional) If present, the Faiss index will be built using this description string in the index_factory; more details in the Faiss documentation.
--index_param | None | (Optional) If present, the Faiss index hyperparameters will be set using this description string; more details in the Faiss documentation.
--use_gpu | False | (Optional) Experimental. GPU training can be faster, but this feature is not tested so far.
--nb_cores | None | (Optional) The number of cores to use; by default, all cores are used.
--make_direct_map | False | (Optional) Create a direct map allowing reconstruction of embeddings. This is only needed for IVF indices. Note that it might increase RAM usage (approximately 8GB for 1 billion embeddings).
--should_be_memory_mappable | False | (Optional) Force the index to be selected among indices having an on-disk memory-mapping implementation.
--distributed | None | (Optional) If "pyspark", create the index using pyspark. Otherwise, the index is created on your local machine.
--temporary_indices_folder | "hdfs://root/tmp/distributed_autofaiss_indices" | (Optional) Folder for saving the temporary small indices; only used when distributed="pyspark".
--verbose | 20 | (Optional) Set the verbosity of logging output: DEBUG=10, INFO=20, WARN=30, ERROR=40, CRITICAL=50.
--nb_indices_to_keep | 1 | (Optional) Maximum number of indices to keep when distributed is "pyspark".

Install from source

First, create a virtual env and install dependencies:

python3 -m venv .env
source .env/bin/activate
make install

Run a specific test with: python -m pytest -x -s -v tests -k "test_get_optimal_hyperparameters"

autofaiss's People

Contributors

bamine, davnn, dependabot[bot], dobraczka, evaia, hitchhicker, josephcappadona, mbompr, nateagr, quentin-auge, rom1504, victor-paltz


autofaiss's Issues

Control verbosity of messages

Hi, thanks for this library, it really helps when working with faiss! One minor problem I have is that I would like to control the verbosity of the messages, since I use autofaiss in my own library. The simplest way to do that would probably be through the use of Python's logging module.

Is there anything planned in that regard?
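
For reference, a sketch of the kind of control in question, assuming the library logs under the "autofaiss" logger name:

import logging

# Silence autofaiss INFO messages, assuming it uses logging.getLogger("autofaiss").
logging.getLogger("autofaiss").setLevel(logging.WARNING)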

build_index is very slow

machine:

  • cpu-machine: Intel(R) Core(TM) i7-10700 CPU @ 2.90GHz
  • mem: 32G
  • cpu-cores: 16

code:

from autofaiss import build_index
import numpy as np

embeddings = np.float32(np.random.rand(1000000, 512))
index, index_infos = build_index(embeddings, save_on_disk=False)

log: [screenshot omitted]

Add func for load npz vectors

Hi! I have numpy matrices saved as .npz files. Unfortunately, autofaiss supports only .npy. Can you add that functionality?
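
As a workaround, a small sketch that converts an .npz archive (here a hypothetical embeddings.npz) to .npy files autofaiss can read:

import os
import numpy as np

os.makedirs("embeddings", exist_ok=True)
# Each array in the archive becomes one .npy file in the embeddings folder.
with np.load("embeddings.npz") as archive:
    for name in archive.files:
        np.save(os.path.join("embeddings", f"{name}.npy"), np.float32(archive[name]))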

Suspicious constant 1-recall score

I have trained 3 different indices, and every time my 1-recall@20 is exactly the same:

INFO:autofaiss: 1-recall@20: 0.802
INFO:autofaiss: 1-recall@40: 0.824

But there is some variation in the 20-recall and 40-recall scores.

Agreement to 3 digits of precision seems too much of a coincidence.

What do you think about it?

x8 vs x4fsr

INFO:autofaiss: Computing best hyperparameters for index faiss_titles.faiss 05/05/2022, 07:16:53                                                            
WARNING:autofaiss:The maximum nearest neighbors coverage is 10.65% for this index. It means that when requesting 20 nearest neighbors, the average number of retrieved neighbors will be 2. The program will try to find the best hyperparameters to reach 95% of this max coverage at least, and then will optimize the search time for this target. The index search speed could be higher than the requested max search speed.

What can we do to prevent this?

This happened with "OPQ768_768,IVF262144_HNSW32,PQ768x8" -> bad max coverage
With the index_key "OPQ768_768,IVF262144_HNSW32,PQ768x4fsr", everything was ok. The vectors were just a bit too compressed.

My d is 768.

Thank you

get_optimal_index_keys_v2 support faiss AutoTune

def get_optimal_index_keys_v2(
    nb_vectors: int,
    dim_vector: int,
    max_index_memory_usage: str,
    flat_threshold: int = 1000,
    quantization_threshold: int = 10000,
    force_pq: Optional[int] = None,
    make_direct_map: bool = False,
    should_be_memory_mappable: bool = False,
    ivf_flat_threshold: int = 1_000_000,
    use_gpu: bool = False,
) -> List[str]:
"""
Gives a list of interesting indices to try, *the one at the top is the most promising*
See: https://github.com/facebookresearch/faiss/wiki/Guidelines-to-choose-an-index for
detailed explanations.
"""
# Exception cases:

Windows parallelization

Hi! Thank you for the great project! Unfortunately I'm experiencing some issues, which could be caused by Windows (10 Pro) and I'm not sure how to solve them.

I installed autofaiss with conda into a new env with Python 3.6. First, I had problems with import:
ImportError: DLL load failed while importing _swigfaiss: The specified module could not be found.

I solved that by first installing openblas, numpy and faiss from conda-forge:
conda create --name faiss_env python=3.6
conda activate faiss_env
conda install conda-forge::blas=*=openblas
conda install -c conda-forge numpy
conda install -c conda-forge faiss
pip install autofaiss

Then I tried to run the example from the README, but I encountered an error in embedding_reader:

~\.conda\envs\faiss_env\lib\site-packages\embedding_reader\get_file_list.py in _get_file_list(path, file_format, sort_result)
     42     path = make_path_absolute(path)
     43     fs, path_in_fs = fsspec.core.url_to_fs(path)
---> 44     prefix = path[: path.index(path_in_fs)]
ValueError: substring not found

I found out that the problem is in the fsspec.core.url_to_fs method, namely in the private method _strip_protocol on line 402 in fsspec\core.py:
urlpath = fs._strip_protocol(url)
This line changes backward slashes to forward slashes and therefore the substring path_in_fs is not found in the string path.

Now comes the incomprehensible part: when I changed the call from the private method _strip_protocol to the general method strip_protocol (I only deleted the leading underscore), the ValueError disappeared and the function preserved backward slashes in the path... but then another error appeared:
RuntimeError: Error in __cdecl faiss::FileIOWriter::FileIOWriter(const char *) at D:\a\faiss-wheels\faiss-wheels\faiss\faiss\impl\io.cpp:98: Error: 'f' failed: could not open C:\Users\USER\AppData\Local\Temp\tmp2jqscc1t for writing: Permission denied

This seems to me like a problem with parallelization, and I don't know how to solve it. I suppose the fix for the ValueError was not the correct one and there is still some problem with the Windows implementation.

Can you give me some advice on how to find a solution to this?

Thanks!

get_optimal_index_keys_v2 returns an empty list

I am using autofaiss 2.14.0, and it works for some parts of the data I am working on, but not for others. I keep getting this error and I do not know where to look:

2022-04-21 17:46:40,649 [INFO]: There are 16325691 embeddings of dim 768
2022-04-21 17:46:40,653 [INFO]: >>> Finished "Reading total number of vectors and dimension" in 37.7308 secs
2022-04-21 17:46:40,653 [INFO]:         Compute estimated construction time of the index 04/21/2022, 17:46:40
2022-04-21 17:46:40,659 [INFO]:                 -> Train: 16.7 minutes
2022-04-21 17:46:40,659 [INFO]:                 -> Add: 2.3 minutes
2022-04-21 17:46:40,659 [INFO]:                 Total: 19.0 minutes
2022-04-21 17:46:40,659 [INFO]:         >>> Finished "Compute estimated construction time of the index" in 0.0057 secs
2022-04-21 17:46:40,659 [INFO]:         Checking that your have enough memory available to create the index 04/21/2022, 17:46:40
2022-04-21 17:46:40,802 [INFO]:         >>> Finished "Checking that your have enough memory available to create the index" in 0.1431 secs
2022-04-21 17:46:40,803 [INFO]: >>> Finished "Launching the whole pipeline" in 37.8808 secs
Traceback (most recent call last):
  File "process.py", line 26, in <module>
    chunks_to_precalculated_knn_(
  File "/home/x_ehsdo/.local/lib/python3.8/site-packages/retro_pytorch/retrieval.py", line 373, in chunks_to_precalculated_knn_
    index, embeddings = chunks_to_index_and_embed(
  File "/home/x_ehsdo/.local/lib/python3.8/site-packages/retro_pytorch/retrieval.py", line 334, in chunks_to_index_and_embed
    index = index_embeddings(
  File "/home/x_ehsdo/.local/lib/python3.8/site-packages/retro_pytorch/retrieval.py", line 288, in index_embeddings
    build_index(
  File "/home/x_ehsdo/.local/lib/python3.8/site-packages/autofaiss/external/quantize.py", line 224, in build_index
    necessary_mem, index_key_used = estimate_memory_required_for_index_creation(
  File "/home/x_ehsdo/.local/lib/python3.8/site-packages/autofaiss/external/build.py", line 46, in estimate_memory_required_for_index_creation
    index_key = get_optimal_index_keys_v2(
IndexError: list index out of range

Fix potential out of disk problem when producing N indices

When we produce N indices (with nb_indices_to_keep larger than 1), the optimize_and_measure_indices function downloads the N indices from remote storage in one shot (see here); if the machine running autofaiss has limited disk space, this fails with a No space left error.

Make ingestion pipeline require less disk space

Currently the flow is:

  • download a large amount of embeddings
  • convert to numpy
  • run autofaiss to produce an index

It works well but requires a large amount of disk space

It's possible to instead do download -> convert -> add for each part of the embedding collection (removing temporary files when moving on to the next part).
One way to do this could be to open-source the pyspark job doing this.
It could also be possible to implement this directly in python here.

A simple way to do this could also be to have better support of remote file systems directly in quantize.

Torch Tensor support?

I want to ask whether KNN search with torch tensors is supported. Many thanks!
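
A workaround sketch: convert the tensor to a float32 numpy array first, since autofaiss and faiss operate on numpy arrays (the shapes here are illustrative):

import numpy as np
import torch
from autofaiss import build_index

tensor = torch.rand(1000, 256)
# Move to CPU, detach from the autograd graph, and make a float32 numpy copy.
embeddings = tensor.detach().cpu().numpy().astype(np.float32)
index, index_infos = build_index(embeddings, save_on_disk=False)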

use merging strategy in non-pyspark mode as well

the strategy of creating a few small indices reduces the memory usage during adding and (if using the special merge-on-disk function) completely caps the memory used by autofaiss in general, making it possible to create arbitrarily big indices with a fixed amount of RAM

let's use that strategy not only for pyspark mode, but also for the normal mode
adding N indices to the normal mode should also be possible by reusing the code from distributed

decrease memory used by merging

Currently, merging in distributed mode requires storing the whole index in memory
Possible strategies:

  • improve faiss merge-into so as to avoid putting everything in memory
  • produce N indices instead of one and let the user search all of them at search time

add option to save keys from parquet embeddings into a new parquet collection

to avoid reading the embeddings parquet a second time, we could consider extracting, yielding and saving the keys from the parquet files in the read-embeddings function.
These keys could be saved either as parquet, or in some format convenient for fast random access (e.g. arrow or hdf5 for one-way lookup, leveldb for two-way).
That would probably be convenient, but let's keep this for another PR

(Another option is to do this in another utility that would read only the key column, to be seen what is best)

multi index ideas

  • building one index or a thousand indices from one embedding set has the same cost if doing one training and grouping at read time (this allows doing one index per strict category)
  • building N index-parts then merging may make it easier to parallelize reading and building. It could also postpone the memory cost to merge time, which might be beneficial (for example, it unlocks building in many memory-constrained executors then merging on one big machine afterwards, or maybe even merging with memory mapping to use no memory for merging)

some info at https://github.com/facebookresearch/faiss/tree/main/benchs/distributed_ondisk and https://github.com/facebookresearch/faiss/wiki/Indexing-1T-vectors and https://github.com/facebookresearch/faiss/blob/151e3d7be54aec844b6328dc3e7dd0b83fcfa5bc/faiss/invlists/OnDiskInvertedLists.cpp

Vector normalization while building index

Hi!
According to the docs, faiss doesn't natively support cosine similarity as a distance metric. The closest one is inner product, which additionally requires prenormalizing the embedding vectors. In the FAQ, the authors propose a way to do it manually with their function faiss.normalize_L2.
I have exactly the same case and would be glad if autofaiss had an optional flag that additionally prenormalizes vectors before building the index.
It seems to me that it's not so difficult: one should add faiss.normalize_L2 to each place where we iterate over embedding_reader. If so, I can make a PR.
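
For reference, a minimal sketch of the manual workaround, assuming float32 embeddings (faiss.normalize_L2 modifies the array in place, making inner product behave like cosine similarity):

import faiss
import numpy as np
from autofaiss import build_index

embeddings = np.float32(np.random.rand(1000, 512))
faiss.normalize_L2(embeddings)  # in-place L2 normalization, row by row
index, index_infos = build_index(embeddings, save_on_disk=False, metric_type="ip")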

GPU on A100

import numpy as np
from autofaiss import build_index

embeddings = np.float32(np.random.rand(700, 700))


build_index(
    embeddings=embeddings,  # type: ignore
    index_path="knn.index",
    index_infos_path="infos.json",
    should_be_memory_mappable=True,
    use_gpu=True,
)

On my A100, use_gpu=True breaks the flow.

add_with_ids is not implemented for Flat indexes

Hello, I'm encountering an issue using autofaiss with flat indexes.
build_index raises an error in distributed mode for flat indexes (in my case with ndarray embeddings; I did not test with parquet embeddings). This error could be related to facebookresearch/faiss#1212 (the method index.add_with_ids is not implemented for flat indexes).

import numpy as np
from autofaiss import build_index

build_index(
    embeddings=np.ones((100, 512)),
    distributed="pyspark",
    should_be_memory_mappable=True,
    index_path="hdfs://root/user/foo/knn.index",
    index_key="Flat",
    nb_cores=20,
    max_index_memory_usage="32G",
    current_memory_available="48G",
    ids_path="hdfs://root/user/foo/test_indexing_out/ids",
    temporary_indices_folder="hdfs://root/user/foo/indices/tmp/",
    nb_indices_to_keep=5,
    index_infos_path="hdfs://root/user/r.laby/test_indexing_out/index_infos.json",
)

raises

RuntimeError: Error in virtual void faiss::Index::add_with_ids(faiss::Index::idx_t, const float*, const idx_t*) at /project/faiss/faiss/Index.cpp:39: add_with_ids not implemented for this type of index

Is this expected? Or could it be fixed?
Thanks!

Add all parameters from doc to readme

embeddings_path: str
    Local path containing all preprocessed vectors and cached files.
    Files will be added if empty.
output_path: str
    Destination path of the quantized model on the local machine.
index_key: Optional(str)
    Optional string to give to the index factory in order to create the index.
    If None, an index is chosen based on a heuristic.
index_param: Optional(str)
    Optional string with hyperparameters to set to the index.
    If None, the hyperparameters are chosen based on a heuristic.
max_index_query_time_ms: float
    Bound on the query time for KNN search; this bound is approximate.
max_index_memory_usage: str
    Maximum size allowed for the index; this bound is strict.
current_memory_available: str
    Memory available on the machine creating the index; having more memory is a boost
    because it reduces swapping between RAM and disk.
use_gpu: bool
    Experimental. GPU training is faster, but not tested so far.
metric_type: str
    Similarity function used for the query:
        - "ip" for inner product
        - "l2" for euclidean distance

[Feature Request:] Add new features to a previously built index

Right now, there does not seem to be an easy way to take an already-built index and add more embeddings to it (from the same distribution). This is obviously already indirectly supported by / possible with autofaiss, because distributed training already does it, and it is also easily supported by the FAISS backbone. But I wonder if we can expose an easy interface to take a built index and add more embeddings from a new set (using all the bells and whistles provided by autofaiss/embedding-reader for reading embeddings from a numpy/parquet format). Perhaps an update_index interface?

Thanks!
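
In the meantime, a workaround sketch with plain faiss, assuming the built index supports .add() and the new vectors come from the same distribution (the paths and shapes are illustrative):

import faiss
import numpy as np

# Load an already-built index, append new vectors, and save it back.
index = faiss.read_index("my_index_folder/knn.index")
new_embeddings = np.float32(np.random.rand(1000, 512))  # must match the index dimension
index.add(new_embeddings)
faiss.write_index(index, "my_index_folder/knn.index")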

module 'faiss' has no attribute 'swigfaiss'

python 3.8.12
autofaiss                 2.13.2                   pypi_0    pypi
faiss-cpu                 1.7.2                    pypi_0    pypi
libfaiss                  1.7.2            h2bc3f7f_0_cpu    pytorch

First of all, thank you for the great project! I get the error: module 'faiss' has no attribute 'swigfaiss' when running the following command:

import autofaiss

autofaiss.build_index(
    "embeddings.npy",
    "autofaiss.index",
    "autofaiss.json",
    metric_type="ip",
    should_be_memory_mappable=True,
    make_direct_map=True)

The error appears when running it with make_direct_map=True.

Tested using conda 4.11.0 or mamba 0.15.3, with the pytorch or conda-forge channel.

fix estimation of training memory used by autofaiss

I just tried it, and the new estimation at https://github.com/criteo/autofaiss/pull/81/files doesn't fully capture the memory needed for training.

When training an index such as OPQ32_224,IVF131072_HNSW32,PQ32x8, faiss trains the index in 2 steps.
The first step indeed seems to use the memory assumed by the current estimation (for example, 21.5GB for 11M vectors of dimension 512), but the second step uses some more RAM.
I am not sure yet what these 2 steps are, but I'd guess something like a primary then a secondary index.

Let's figure it out, then add some more tests for this (these could be scheduled tests instead of tests that run on every commit)

build_index can't handle empty numpy files

Hello,

I'm currently running a workflow in argo which generates several embedding files in parallel, based on a database search.
If no data was found, the workflow returns an empty numpy file:

np.save(os.path.join(output, "features", filename), np.empty(0, np.float32))

Sadly, build_index is not capable of handling those files:

Using 4 omp threads (processes), consider increasing --nb_cores if you have more
Launching the whole pipeline 04/08/2022, 09:54:53
Reading total number of vectors and dimension 04/08/2022, 09:54:53

  0%|          | 0/16 [00:00<?, ?it/s]
 19%|█▉        | 3/16 [00:00<00:00, 29.92it/s]
 56%|█████▋    | 9/16 [00:00<00:00, 87.73it/s]
>>> Finished "Reading total number of vectors and dimension" in 0.1517 secs
>>> Finished "Launching the whole pipeline" in 0.1517 secs
Traceback (most recent call last):
  File "/usr/local/bin/autofaiss", line 8, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.8/site-packages/autofaiss/external/quantize.py", line 395, in main
    fire.Fire({"build_index": build_index, "tune_index": tune_index, "score_index": score_index})
  File "/usr/local/lib/python3.8/site-packages/fire/core.py", line 141, in Fire
    component_trace = _Fire(component, args, parsed_flag_args, context, name)
  File "/usr/local/lib/python3.8/site-packages/fire/core.py", line 466, in _Fire
    component, remaining_args = _CallAndUpdateTrace(
  File "/usr/local/lib/python3.8/site-packages/fire/core.py", line 681, in _CallAndUpdateTrace
    component = fn(*varargs, **kwargs)
  File "/usr/local/lib/python3.8/site-packages/autofaiss/external/quantize.py", line 143, in build_index
    nb_vectors, vec_dim = read_total_nb_vectors_and_dim(
  File "/usr/local/lib/python3.8/site-packages/autofaiss/readers/embeddings_iterators.py", line 258, in read_total_nb_vectors_and_dim
    for c in p.imap_unordered(file_to_line_count, file_paths):
  File "/usr/local/lib/python3.8/multiprocessing/pool.py", line 868, in next
    raise value
  File "/usr/local/lib/python3.8/multiprocessing/pool.py", line 125, in worker
    result = (True, func(*args, **kwds))
  File "/usr/local/lib/python3.8/site-packages/autofaiss/readers/embeddings_iterators.py", line 252, in file_to_line_count
    return matrix_reader.get_row_count()
  File "/usr/local/lib/python3.8/site-packages/autofaiss/readers/embeddings_iterators.py", line 101, in get_row_count
    return self.get_shape()[0]

It would be great if it could handle this, by just showing a warning in the logs or providing a flag to allow it.
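
A possible workaround sketch: drop zero-length files before calling build_index (assuming a flat embeddings folder of .npy files):

import glob
import os

import numpy as np

# Remove embedding files that contain no rows before building the index.
for path in glob.glob("embeddings/*.npy"):
    if np.load(path).shape[0] == 0:
        os.remove(path)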

Tests

  • check hnsw size > flat size

Misunderstanding of the estimated computing time

I am not sure whether I misunderstand something or there is an error, but when building my index, autofaiss prints Train: 16.7 minutes, yet the whole pipeline takes ~11 secs (Finished "Launching the whole pipeline" in 11.1440 secs)?

Using 16 omp threads (processes), consider increasing --nb_cores if you have more
Launching the whole pipeline 01/28/2022, 08:15:47
There are 4269 embeddings of dim 1024
	Compute estimated construction time of the index 01/28/2022, 08:15:47
		-> Train: 16.7 minutes
		-> Add: 0.0 seconds
		Total: 16.7 minutes
	>>> Finished "Compute estimated construction time of the index" in 0.0000 secs
	Checking that your have enough memory available to create the index 01/28/2022, 08:15:47
20.6MB of memory will be needed to build the index (more might be used if you have more)
	>>> Finished "Checking that your have enough memory available to create the index" in 0.0009 secs
	Selecting most promising index types given data characteristics 01/28/2022, 08:15:47
	>>> Finished "Selecting most promising index types given data characteristics" in 0.0000 secs
	Creating the index 01/28/2022, 08:15:47
		-> Instanciate the index HNSW15 01/28/2022, 08:15:47
		>>> Finished "-> Instanciate the index HNSW15" in 0.0036 secs
The index size will be approximately 17.2MB
The memory available for adding the vectors is 7.0GB(total available - used by the index)
Will be using at most 1GB of ram for adding
		-> Adding the vectors to the index 01/28/2022, 08:15:47
Using a batch size of 244140 (memory overhead 953.7MB)
100%|██████████| 1/1 [00:00<00:00, 74.53it/s]		>>> Finished "-> Adding the vectors to the index" in 0.1602 secs
	>>> Finished "Creating the index" in 0.1647 secs
	Computing best hyperparameters 01/28/2022, 08:15:47

	>>> Finished "Computing best hyperparameters" in 3.3091 secs
The best hyperparameters are: efSearch=21
	Compute fast metrics 01/28/2022, 08:15:50
2000
	>>> Finished "Compute fast metrics" in 7.6499 secs
	Saving the index on local disk 01/28/2022, 08:15:58
	>>> Finished "Saving the index on local disk" in 0.0091 secs
Recap:
{'99p_search_speed_ms': 30.39110283832997,
 'avg_search_speed_ms': 3.7983315605670214,
 'compression ratio': 0.9678652870286923,
 'index_key': 'HNSW15',
 'index_param': 'efSearch=21',
 'nb vectors': 4269,
 'reconstruction error %': 0.0,
 'size in bytes': 18066382,
 'vectors dimension': 1024}
>>> Finished "Launching the whole pipeline" in 11.1440 secs

Make embedding iterator faster on high latency file systems

s3 and hdfs are high-latency, high-bandwidth file systems
On these file systems, fetching files sequentially is slow
Today, our embedding iterator reads files sequentially

This could be made faster by reading files in parallel, or even parts of files in parallel, using pyarrow readers that include threads internally

Distributed training

Hi,
thanks to all maintainers of this project; it's a great tool to streamline the building and tuning of a Faiss index.

I have a quick dumb question about the training of an index in distributed mode. Am I correct that the training is done on the host, i.e. non-distributed, and that only the adding/optimizing part is distributed? After a quick look at the code and doc, I feel like that's the case, right? If so, would there be a possibility of training the index in a distributed fashion?

make autofaiss not use TemporaryDirectory

TemporaryDirectory is a local folder which may not have any room
the user should specify the temporary folder (in fact, we already have an option for this)

Make current available memory properly aggregate all the memory needs

  • the index's final size should be subtracted from the amount of memory that adding is allowed to use
  • the index's untrained size should be subtracted from the amount of memory that training is allowed to use

this would make it possible to have stronger guarantees about how much memory autofaiss would use
