u1234x1234 / pynanoflann Goto Github PK

32.0 3.0 8.0 73 KB

Unofficial python wrapper to the nanoflann k-d tree

License: BSD 2-Clause "Simplified" License

Python 57.02% C++ 42.98%

python kd-tree kdtree nearest-neighbor-search nearest-neighbors nanoflann pybind11

pynanoflann's Introduction

pynanoflann

Unofficial python wrapper to the nanoflann library [1] with sklearn interface and additional multithreaded capabilities.

nanoflann implementation of k-d tree provides one of the best performance for many k-nearest neighbour problems [2].

It is a good choice for exact k-NN, radius searches in low dimensional spaces.

Install

pip install git+https://github.com/u1234x1234/[email protected]

Basic Usage

import numpy as np
import pynanoflann

index = np.random.uniform(0, 100, (100, 4))
queries = np.random.uniform(0, 100, (10, 4))

nn = pynanoflann.KDTree(n_neighbors=5, metric='L1', radius=100)
nn.fit(index)

# Get k-nearest neighbors
distances, indices = nn.kneighbors(queries)

# Radius search
distances, indices = nn.radius_neighbors(queries)

Save, load

If you need to save the model, there are two options:

Save only the built index. In this case, data points are NOT stored in file. Very efficient, but inconvenient in some cases.

kdtree.fit(X)
kdtree.save_index('index.bin')

prebuilt_kdtree = pynanoflann.KDTree()
# Must use the same data on which the index was built.
prebuilt_kdtree.fit(X, 'index.bin')

Please refer to the detailed example

Pickle the whole model with data points. Less efficient, but convenient.

kdtree.fit(X)
with open('kdtree.pkl', 'wb') as out_file:
    pickle.dump(kdtree, out_file)
with open('kdtree.pkl', 'rb') as in_file:
    unpickled_kdtree = pickle.load(in_file)

Please refer to the detailed example

Multicore parallelization

Implemented on C++ side to reduce python's multiprocessing overhead.

Query parallelization: Example
Simultaneous indexing+querying parallelization: Example, Discussion

Performance

Generally it much faster than brute force or cython implementation of k-d tree in sklearn

To run benchmark:

python benchmark.py

Links

Blanco, Jose Luis and Rai, Pranjal Kumar, 2014, nanoflann: a C++ header-only fork of FLANN, a library for Nearest Neighbor ({NN}) with KD-trees.
Vermeulen, J.L., Hillebrand, A. and Geraerts, R., 2017. A comparative study of k‐nearest neighbour techniques in crowd simulation.
OpenFLANN Performance comparison between PCL, nanoflann, picoflann

pynanoflann's People

Contributors

Stargazers

Watchers

Forkers

sandy4321 ccinc eokeeffe stongeetienne hochshi akaszynski shreyanshdarshan dwastberg

pynanoflann's Issues

How to find the package 'nanoflann_ext' ?

Hello,

Thanks for this great tool.

I try to use your code, and I realize that there is no specific package named 'nanoflann_ext' in import nanoflann_ext.

Can I get some help with this? Thanks in advance

Serializing KDTrees

Hi @u1234x1234,

Thanks for your awesome wrapper. I would like to use it for radius search in 3D point clouds. In my implementation, I would like to be able to save a KDTree in a text file and then reload it afterwards.

I tried to use pickle, but nanoflann_ext.KDTree32 or nanoflann_ext.KDTree64 object are not supported.

Do you think there is a way to save these objects to files? Or enable pickle to serialize them?

Best,
Hugues

Multicore processing of query points

Hey,

Another feature I've in mind is if it is possible to implement parallellized multicore processing for queries. For instance, in Scipy cKDTree (https://docs.scipy.org/doc/scipy/reference/generated/scipy.spatial.cKDTree.query.html#scipy.spatial.cKDTree.query) you can define number of cores with n_jobs parameter.

What I mean is that we could set 'n_cores' parameter when executing the query:

import pynanoflann
import numpy as np

target = np.random.rand(10000, 3)
query = np.random.rand(2000, 3)

kd_tree = pynanoflann.KDTree(n_neighbors=1, metric='L2', leaf_size=20)
kd_tree.fit(target)
d, nn_idx = kd_tree.kneighbors(query, n_cores=10)

This would create a single kd-tree for the data in 'target' and then split the data in 'query' for multiple cores to do the queries.

Python implementation of this will introduce quite much overhead when working with relatively small data sets. I assume that this would be much faster when implemented in c++ side.

Best regards
Juho

Parallelized batch processing

Hey,

I'm wondering whether it could be possible to implement multi-core implementation for processing multiple batches of data simultaneously.

My current approach is:

import pynanoflann
import numpy as np

n_batches = 10
target = np.random.rand(n_batches, 10000, 3)
query = np.random.rand(n_batches, 2000, 3)

for i in range(n_batches):
    pts_target = target[i]
    pts_query = query[i]
    
    kd_tree = pynanoflann.KDTree(n_neighbors=1, metric='L2', leaf_size=20)
    kd_tree.fit(pts_target)
    d, nn_idx = kd_tree.kneighbors(pts_query)

Instead, I'd like to do something like that:

import pynanoflann
import numpy as np

n_batches = 10
target = np.random.rand(n_batches, 10000, 3)
query = np.random.rand(n_batches, 2000, 3)

kd_trees = pynanoflann.KDTree(n_neighbors=1, metric='L2', leaf_size=20)
kd_trees.fit(pts_target)
distances, nn_indexes = kd_trees.kneighbors(pts_query)

This would create a kd-tree for each batch in 'target'. Corresponding batches in 'query' then would be used to make nearest neighbor searches with corresponding kd-trees.

Currently, if I want to implement this kind of parallelized processing I have do it in python. For large data sets this is not a problem but for smaller data sets it will cause too much overhead. I assume that it would be much faster to implement in c++ side of the code.

Best regards
Juho

What does the radius parameter do?

First thanks for this great tool.
I'm a little confused by the radius parameter:
nn = pynanoflann.KDTree(n_neighbors=5, metric='L1', radius=100)

It seems like I can set this to a tiny value and still get correct results when I query.

how can i added quad_tree nearest method in benchmark?

Parallelized Radius Query

Hey! I really like the new multiprocessing capabilities of kneighbors, for one of my projects I need to use radius queries, would it be possible to get the same feature for radius_neighbors?

Your library is great, by the way, I have found it to be the fastest option for processing very large point clouds, so thank you!

Adding pynanoflann to pypi

Do you have any plans to add pynanoflann to pypi to make it easier to include in other projects? I've made some updates and written a GitHub workflow to build wheels of pynanoflann the I can merge if you are interested. Otherwise and can also upload my fork to pypi (possibly under a different name) if you're not interested.

Memory leak

Hey,

Thanks for the very good library, it is quite fast (fastest python implementation I've found so far). Unfortunately it seems that there is a memory leak somewhere.

If I run the following code memory usage keeps slowly increasing:

import pynanoflann
import numpy as np

for i in range(100):
    print(i)
    
    nn_search = pynanoflann.KDTree(n_neighbors=1, metric='L2', leaf_size=20)    
    nn_search.fit(np.random.rand(10000000,3))
    _, nn_idx = nn_search.kneighbors(np.random.rand(1000,3))
    
    del nn_search, nn_idx

I'm using version 0.0.1 of pynanoflann. The problem limits the usability of the library since I've to create new kdtree in each loop iteration and in the end I'll run out of memory.

Best regards
Juho

.AddPoint() function?

Hey there, wondering if there's a way to append elements without refitting every time the dataset.
Thanks!

 /**
     * Called during search to add an element matching the criteria.
     * @return true if the search should be continued, false if the results are
     * sufficient
     */
    bool addPoint(DistanceType dist, IndexType index)
    {
        printf(
            "addPoint() called: dist=%f index=%u\n", dist,
            static_cast<unsigned int>(index));

        if (dist < radius) m_indices_dists.emplace_back(index, dist);
        return true;
    }

    DistanceType worstDist() const { return radius; }
};

u1234x1234 / pynanoflann Goto Github PK

pynanoflann's Introduction

pynanoflann

Install

Basic Usage

Save, load

Multicore parallelization

Performance

Links

pynanoflann's People

Contributors

Stargazers

Watchers

Forkers

pynanoflann's Issues

Recommend Projects

Recommend Topics

Recommend Org