Coder Social home page Coder Social logo

hora-search / hora Goto Github PK

View Code? Open in Web Editor NEW
2.6K 48.0 72.0 14.51 MB

🚀 efficient approximate nearest neighbor search algorithm collections library written in Rust 🦀 .

Home Page: http://horasearch.com/

License: Apache License 2.0

Rust 100.00%
search-engine rust approximate-nearest-neighbor-search artificial-intelligence recommender-system image-search vector-search algorithm data-structures simd

hora's Introduction

Hora

[Homepage] [Document] [Examples]

Hora Search Everywhere!

Hora is an approximate nearest neighbor search algorithm (wiki) library. We implement all code in Rust🦀 for reliability, high level abstraction and high speeds comparable to C++.

Hora, 「ほら」 in Japanese, sounds like [hōlə], and means Wow, You see! or Look at that!. The name is inspired by a famous Japanese song 「小さな恋のうた」.

Demos

👩 Face-Match [online demo], have a try!

🍷 Dream wine comments search [online demo], have a try!

Features

  • Performant ⚡️

    • SIMD-Accelerated (packed_simd)
    • Stable algorithm implementation
    • Multiple threads design
  • Supports Multiple Languages ☄️

    • Python
    • Javascript
    • Java
    • Go (WIP)
    • Ruby (WIP)
    • Swift (WIP)
    • R (WIP)
    • Julia (WIP)
    • Can also be used as a service
  • Supports Multiple Indexes 🚀

    • Hierarchical Navigable Small World Graph Index (HNSWIndex) (details)
    • Satellite System Graph (SSGIndex) (details)
    • Product Quantization Inverted File(PQIVFIndex) (details)
    • Random Projection Tree(RPTIndex) (LSH, WIP)
    • BruteForce (BruteForceIndex) (naive implementation with SIMD)
  • Portable 💼

    • Supports WebAssembly
    • Supports Windows, Linux and OS X
    • Supports IOS and Android (WIP)
    • Supports no_std (WIP, partial)
    • No heavy dependencies, such as BLAS
  • Reliability 🔒

    • Rust compiler secures all code
    • Memory managed by Rust for all language libraries such as Python's
    • Broad testing coverage
  • Supports Multiple Distances 🧮

    • Dot Product Distance
      • equation
    • Euclidean Distance
      • equation
    • Manhattan Distance
      • equation
    • Cosine Similarity
      • equation
  • Productive

    • Well documented
    • Elegant, simple and easy to learn API

Installation

Rust

in Cargo.toml

[dependencies]
hora = "0.1.1"

Python

$ pip install horapy

Javascript (WebAssembly)

$ npm i horajs

Building from source

$ git clone https://github.com/hora-search/hora
$ cargo build

Benchmarks

by aws t2.medium (CPU: Intel(R) Xeon(R) CPU E5-2686 v4 @ 2.30GHz) more information

Examples

Rust example [more info]

use hora::core::ann_index::ANNIndex;
use rand::{thread_rng, Rng};
use rand_distr::{Distribution, Normal};

pub fn demo() {
    let n = 1000;
    let dimension = 64;

    // make sample points
    let mut samples = Vec::with_capacity(n);
    let normal = Normal::new(0.0, 10.0).unwrap();
    for _i in 0..n {
        let mut sample = Vec::with_capacity(dimension);
        for _j in 0..dimension {
            sample.push(normal.sample(&mut rand::thread_rng()));
        }
        samples.push(sample);
    }

    // init index
    let mut index = hora::index::hnsw_idx::HNSWIndex::<f32, usize>::new(
        dimension,
        &hora::index::hnsw_params::HNSWParams::<f32>::default(),
    );
    for (i, sample) in samples.iter().enumerate().take(n) {
        // add point
        index.add(sample, i).unwrap();
    }
    index.build(hora::core::metrics::Metric::Euclidean).unwrap();

    let mut rng = thread_rng();
    let target: usize = rng.gen_range(0..n);
    // 523 has neighbors: [523, 762, 364, 268, 561, 231, 380, 817, 331, 246]
    println!(
        "{:?} has neighbors: {:?}",
        target,
        index.search(&samples[target], 10) // search for k nearest neighbors
    );
}

thank @vaaaaanquish for this complete pure Rust 🦀 image search example, For more information about this example, you can click Pure Rust な近似最近傍探索ライブラリ hora を用いた画像検索を実装する

Python example [more info]

import numpy as np
from horapy import HNSWIndex

dimension = 50
n = 1000

# init index instance
index = HNSWIndex(dimension, "usize")

samples = np.float32(np.random.rand(n, dimension))
for i in range(0, len(samples)):
    # add node
    index.add(np.float32(samples[i]), i)

index.build("euclidean")  # build index

target = np.random.randint(0, n)
# 410 in Hora ANNIndex <HNSWIndexUsize> (dimension: 50, dtype: usize, max_item: 1000000, n_neigh: 32, n_neigh0: 64, ef_build: 20, ef_search: 500, has_deletion: False)
# has neighbors: [410, 736, 65, 36, 631, 83, 111, 254, 990, 161]
print("{} in {} \nhas neighbors: {}".format(
    target, index, index.search(samples[target], 10)))  # search

JavaScript example [more info]

import * as horajs from "horajs";

const demo = () => {
    const dimension = 50;
    var bf_idx = horajs.BruteForceIndexUsize.new(dimension);
    // var hnsw_idx = horajs.HNSWIndexUsize.new(dimension, 1000000, 32, 64, 20, 500, 16, false);
    for (var i = 0; i < 1000; i++) {
        var feature = [];
        for (var j = 0; j < dimension; j++) {
            feature.push(Math.random());
        }
        bf_idx.add(feature, i); // add point
    }
    bf_idx.build("euclidean"); // build index
    var feature = [];
    for (var j = 0; j < dimension; j++) {
        feature.push(Math.random());
    }
    console.log("bf result", bf_idx.search(feature, 10)); //bf result Uint32Array(10) [704, 113, 358, 835, 408, 379, 117, 414, 808, 826]
}

(async () => {
    await horajs.default();
    await horajs.init_env();
    demo();
})();

Java example [more info]

public void demo() {
    final int dimension = 2;
    final float variance = 2.0f;
    Random fRandom = new Random();

    BruteForceIndex bruteforce_idx = new BruteForceIndex(dimension); // init index instance

    List<float[]> tmp = new ArrayList<>();
    for (int i = 0; i < 5; i++) {
        for (int p = 0; p < 10; p++) {
            float[] features = new float[dimension];
            for (int j = 0; j < dimension; j++) {
                features[j] = getGaussian(fRandom, (float) (i * 10), variance);
            }
            bruteforce_idx.add("bf", features, i * 10 + p); // add point
            tmp.add(features);
          }
    }
    bruteforce_idx.build("bf", "euclidean"); // build index

    int search_index = fRandom.nextInt(tmp.size());
    // nearest neighbor search
    int[] result = bruteforce_idx.search("bf", 10, tmp.get(search_index));
    // [main] INFO com.hora.app.ANNIndexTest  - demo bruteforce_idx[7, 8, 0, 5, 3, 9, 1, 6, 4, 2]
    log.info("demo bruteforce_idx" + Arrays.toString(result));
}

private static float getGaussian(Random fRandom, float aMean, float variance) {
    float r = (float) fRandom.nextGaussian();
    return aMean + r * variance;
}

Roadmap

  • Full test coverage
  • Implement EFANNA algorithm to achieve faster KNN graph building
  • Swift support and iOS/macOS deployment example
  • Support R
  • support mmap

Related Projects and Comparison

  • Faiss, Annoy, ScaNN:

    • Hora's implementation is strongly inspired by these libraries.
    • Faiss focuses more on the GPU scenerio, and Hora is lighter than Faiss (no heavy dependencies).
    • Hora expects to support more languages, and everything related to performance will be implemented by Rust🦀.
    • Annoy only supports the LSH (Random Projection) algorithm.
    • ScaNN and Faiss are less user-friendly, (e.g. lack of documentation).
    • Hora is ALL IN RUST 🦀.
  • Milvus, Vald, Jina AI

    • Milvus and Vald also support multiple languages, but serve as a service instead of a library
    • Milvus is built upon some libraries such as Faiss, while Hora is a library with all the algorithms implemented itself

Contribute

We appreciate your participation!

We are glad to have you participate, any contributions are welcome, including documentations and tests. You can create a Pull Request or Issue on GitHub, and we will review it as soon as possible.

We use GitHub issues for tracking suggestions and bugs.

Clone the repo

git clone https://github.com/hora-search/hora

Build

cargo build

Test

cargo test --lib

Try the changes

cd examples
cargo run

License

The entire repository is licensed under the Apache License.

hora's People

Contributors

btv avatar lesmiscore avatar mongkii avatar salamer avatar syndek avatar whiteworld avatar

Stargazers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

Watchers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

hora's Issues

panick when use `CosineSimilarity`

Hi.

With the latest hora, I get a panic when using hora::core::metrics::Metric::CosineSimilarity.

thread '<unnamed>' panicked at 'called `Option::unwrap()` on a `None` value', /usr/local/cargo/registry/src/github.com-1ecc6299db9ec823/hora-0.1.1/src/core/neighbor.rs:32:54

I think this is an error caused by partial_cmp.
partial_cmp expect it to be None when the given value cannot be ordered.

Saving Hora index to disk?

Hi,
I'm using Horapy, and everything works great. But how do I save an index to disk after building it, so that I can use it elsewhere or later?
Tried pickling the index but it gives an error that usize objects can't be pickled.

Possible to add sparse elements?

Thanks for the very nice library! I'm interested in using hora for doing nearest neighbor finding in single-cell genomics. The data of interest consist of very high dimensional points (D = 30,000), but for most points, most dimensions have value 0. Therefore, I'd like to avoid (it's not really feasible) to densify the elements before indexing them. Is there some way to provide a custom implementation of the relevant distance metrics for the indexed type such that I don't have to actually insert a dense representation of the points into the index?

Extensible index

Hello! Thanks for your wonderful library. Please tell me if supported extensible index. Can I adding elements to an already built index?

Node searches on arrays of objects

For the Node version, is it possible to run comparisons on arrays of objects?

For example, for using it with map data, the query would be a tuple of a map point coordinates: [latitude, longitude].

The material to compare against would be an array of objects similar to:

[ { objectId: 123, someOtherObjectData: string, coordinates: [latitude, longitude] }, { next object }, { next object} ... ]

If so, would it be possible to provide a few pointers or a basic example?

Thanks much!

Hey hello!! Project turismEC!!

I watch a great marketplace on my country.
I live in Ecuador and the country have good opportunities on the turism market.
I have a idea to connect the people, but I need some help.
I have basic skills on development, and some money to start a project . Thanks.
Psdt: is a great business

Return distance metric?

Other ANN libraries return the index array of the closest vectors and optionally the distance metric in a second array.
Is that possible or in future plans?
Thanks.

Including HGG

Hey, I just saw this on Reddit and I am very excited to try this out on my computer vision datasets.

I just created an ANN search data structure somewhat recently called HGG (https://github.com/rust-cv/hgg), and I would love to be able to add it to your collection of nearest neighbor searches (and especially over at https://github.com/hora-search/ann-benchmarks). It is based on HNSW, and I am currently using it for computer vision purposes. Let me know if you need some help integrating it.

Consider adding __version__ attribute to Python API.

version does is not available.

https://www.python.org/dev/peps/pep-0396/

import horapy

dir(horapy)
['BruteForceIndex', 'HNSWIndex', 'HoraANNIndex', 'HoraBruteForceIndexStr', 'HoraBruteForceIndexUsize', 'HoraHNSWIndexStr', 'HoraHNSWIndexUsize', 'HoraIVFPQIndexStr', 'HoraIVFPQIndexUsize', 'HoraPQIndexStr', 'HoraPQIndexUsize', 'HoraSSGIndexStr', 'HoraSSGIndexUsize', 'IVFPQIndex', 'PQIndex', 'SSGIndex', 'builtins', 'cached', 'doc', 'file', 'loader', 'name', 'package', 'path', 'spec', 'horapy', 'numpy']
horapy.version
Traceback (most recent call last):
File "", line 1, in
AttributeError: module 'horapy' has no attribute 'version'

A fun fact!

Hey guys,

This isn't an issue. You can close or delete it whenever you wish. I just wanted to share a fun fact.

I discovered "hora" at the top of a "most trending projects" lists and it attracted my attention as "hora" ("хора" in Cyrillic) means "people". I thought, is it possible that someone will name his trending lib after a Bulgarian word? Maybe there was a hidden meaning... I opened and scrolled the repo - what can I see - a list of faces/people. I was right. Someone named the library after a Bulgarian word... only to realise a few moments later that it's named after a Japanese word.

What a funny collision.

Have a great week!

Not able to install using pip

So, I tried installing using the pip install horapy, but getting this error:

ERROR: Command errored out with exit status 1:
     command: 'C:\Users\acer\anaconda3\python.exe' 'C:\Users\acer\anaconda3\lib\site-packages\pip\_vendor\pep517\_in_process.py' prepare_metadata_for_build_wheel 'C:\Users\acer\AppData\Local\Temp\tmp6bg4brka'
         cwd: C:\Users\acer\AppData\Local\Temp\pip-install-qsf3wt4q\horapy_417ccdb8395f4cf39fc2475178a42d0c
    Complete output (6 lines):

    Cargo, the Rust package manager, is not installed or is not on PATH.
    This package requires Rust and Cargo to compile extensions. Install it through
    the system's package manager or via https://rustup.rs/

    Checking for Rust toolchain....
    ----------------------------------------
WARNING: Discarding https://files.pythonhosted.org/packages/e2/4e/75206eb830280af8b6e0618bd76b4e3dc98b24f9cc9c3809af14b85c4f35/horapy-0.0.1.tar.gz#sha256=f1be68b8481b348e8a615328923f893678eac03215483abf5ff04304b8f952cc (from https://pypi.org/simple/horapy/). Command errored out with exit status 1: 'C:\Users\acer\anaconda3\python.exe' 'C:\Users\acer\anaconda3\lib\site-packages\pip\_vendor\pep517\_in_process.py' prepare_metadata_for_build_wheel 'C:\Users\acer\AppData\Local\Temp\tmp6bg4brka' Check the logs for full command output.
ERROR: Could not find a version that satisfies the requirement horapy
ERROR: No matching distribution found for horapy
```

Can hora's index be serialized to disk?

I have millions of vectors that need to be indexed, and the build speed is extremely slow.
Can I serialize index to disk and then deserialize it into an index?

BruteForceIndex load index from file error.

index = HNSWIndex(dimension, "usize") 


---> 16     index.load(ipath)
     17 
     18     # Search

C:\g\vr\lw\python\lib\site-packages\horapy\__init__.py in load(self, path)
     78             path: file path
     79         """
---> 80         self.ann_idx = self.ann_type.load(path)
     81 
     82     def dump(self, path):

PanicException: called `Result::unwrap()` on an `Err` value: Io(Error { kind: UnexpectedEof, message: "failed to fill whole buffer" })

Crash using cosine similarity when calling search

When indexing a small number of vectors I am getting this error when specifying cosine_similarity (euclidean works fine for instance):

thread 'hora_test' panicked at 'called `Option::unwrap()` on a `None` value', /Users/sam/.cargo/registry/src/github.com-1ecc6299db9ec823/hora-0.1.1/src/core/neighbor.rs:32:54
stack backtrace:
   0: rust_begin_unwind
             at /rustc/7737e0b5c4103216d6fd8cf941b7ab9bdbaace7c/library/std/src/panicking.rs:584:5
   1: core::panicking::panic_fmt
             at /rustc/7737e0b5c4103216d6fd8cf941b7ab9bdbaace7c/library/core/src/panicking.rs:143:14
   2: core::panicking::panic
             at /rustc/7737e0b5c4103216d6fd8cf941b7ab9bdbaace7c/library/core/src/panicking.rs:48:5
   3: core::option::Option<T>::unwrap
             at /rustc/7737e0b5c4103216d6fd8cf941b7ab9bdbaace7c/library/core/src/option.rs:752:21
   4: <hora::core::neighbor::Neighbor<E,T> as core::cmp::Ord>::cmp
             at /Users/sam/.cargo/registry/src/github.com-1ecc6299db9ec823/hora-0.1.1/src/core/neighbor.rs:32:9
   5: <hora::core::neighbor::Neighbor<E,T> as core::cmp::PartialOrd>::partial_cmp
             at /Users/sam/.cargo/registry/src/github.com-1ecc6299db9ec823/hora-0.1.1/src/core/neighbor.rs:38:14
   6: core::cmp::PartialOrd::le
             at /rustc/7737e0b5c4103216d6fd8cf941b7ab9bdbaace7c/library/core/src/cmp.rs:1129:19
   7: core::cmp::impls::<impl core::cmp::PartialOrd<&B> for &A>::le
             at /rustc/7737e0b5c4103216d6fd8cf941b7ab9bdbaace7c/library/core/src/cmp.rs:1505:13
   8: alloc::collections::binary_heap::BinaryHeap<T>::sift_up
             at /rustc/7737e0b5c4103216d6fd8cf941b7ab9bdbaace7c/library/alloc/src/collections/binary_heap.rs:562:16
   9: alloc::collections::binary_heap::BinaryHeap<T>::push
             at /rustc/7737e0b5c4103216d6fd8cf941b7ab9bdbaace7c/library/alloc/src/collections/binary_heap.rs:496:18
  10: hora::index::hnsw_idx::HNSWIndex<E,T>::search_layer::{{closure}}
             at /Users/sam/.cargo/registry/src/github.com-1ecc6299db9ec823/hora-0.1.1/src/index/hnsw_idx.rs:363:25
  11: <core::slice::iter::Iter<T> as core::iter::traits::iterator::Iterator>::for_each
             at /rustc/7737e0b5c4103216d6fd8cf941b7ab9bdbaace7c/library/core/src/slice/iter/macros.rs:211:21
  12: hora::index::hnsw_idx::HNSWIndex<E,T>::search_layer
             at /Users/sam/.cargo/registry/src/github.com-1ecc6299db9ec823/hora-0.1.1/src/index/hnsw_idx.rs:353:13
  13: hora::index::hnsw_idx::HNSWIndex<E,T>::search_knn
             at /Users/sam/.cargo/registry/src/github.com-1ecc6299db9ec823/hora-0.1.1/src/index/hnsw_idx.rs:433:25
  14: <hora::index::hnsw_idx::HNSWIndex<E,T> as hora::core::ann_index::ANNIndex<E,T>>::node_search_k
             at /Users/sam/.cargo/registry/src/github.com-1ecc6299db9ec823/hora-0.1.1/src/index/hnsw_idx.rs:615:55
  15: hora::core::ann_index::ANNIndex::search
             at /Users/sam/.cargo/registry/src/github.com-1ecc6299db9ec823/hora-0.1.1/src/core/ann_index.rs:93:9
  16: hora_c::hora_test
             at ./src/lib.rs:192:13
  17: hora_c::hora_test::{{closure}}
             at ./src/lib.rs:168:1
  18: core::ops::function::FnOnce::call_once
             at /rustc/7737e0b5c4103216d6fd8cf941b7ab9bdbaace7c/library/core/src/ops/function.rs:227:5
  19: core::ops::function::FnOnce::call_once
             at /rustc/7737e0b5c4103216d6fd8cf941b7ab9bdbaace7c/library/core/src/ops/function.rs:227:5
note: Some details are omitted, run with `RUST_BACKTRACE=full` for a verbose backtrace.


failures:
    hora_test

IVFPQIndex ?

I'm not sure if this is my problem, (possible).
I am testing the index types all work, (SSG had extensive index times > 24 hours so I aborted that).
IVFPQIndex

I am doing this with all index types, no issues except this case.

    index = IVFPQIndex(dimension, "usize")
    for i,d in tqdm(enumerate(hd[hashtype].tolist())):
        index.add(d,i)

FYI:
print(i,d)
15746524 [248, 225, 188, 223, 199, 174, 144, 146]

Error:

lib\site-packages\horapy\__init__.py in build(self, metrics)
     36 
     37     def build(self, metrics=""):
---> 38         self.ann_idx.build(metrics)
     39 
     40     def add(self, vs, idx=None):

PanicException: attempt to calculate the remainder with a divisor of zero

Possible alternative to 'add'

The current index add method adds a single vector. Is it possible to add 2D numpy array of vectors and use the array row position as the index.

# proposal for consideration
a = np.array([[1,2,3],[4,5,6]]
index.add(a)
# currently
for i in a:
   index.add(sample, i).unwrap();

Dot production is not acting as a distance

This line assumes, that dot function returns a simple dot production and negates it to make it work like a distance (smaller = closer)

dot(vec1, vec2).map(|x| -x)

But in reality, dot function is already negated in the SIMDOptmized inplementation:

Ok(-(a.iter().zip(b).map(|(p, q)| p * q).sum::<$type_id>()))

Which leads to double inverting and incorrect use of Dot Production

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.