vincentfpgarcia / knn-cuda Goto Github PK

Fast k nearest neighbor search using GPU

License: Other

MATLAB 15.59% Cuda 72.49% Batchfile 1.98% Makefile 0.09% C 1.95% C++ 7.90%

knn-cuda's Introduction

Introduction

The k-nearest neighbor algorithm (k-NN) is a widely used machine learning algorithm used for both classification and regression. k-NN algorithms are used in many research and industrial domains such as 3-dimensional object rendering, content-based image retrieval, statistics (estimation of entropies and divergences), biology (gene classification), etc. The processing time of the kNN search still remains the bottleneck in many application domains, especially in high dimensional spaces. In this work, we try to address this processing time issue by performing the kNN search using the GPU.

Authors

Vincent Garcia
Éric Debreuve
Michel Barlaud

Description

Inputs values of the k-NN algorithm:

Set of reference points.
Set of query points.
Parameter k corresponding to the number of neighbors to search for.

For each query point, the k-NN algorithm locates the k closest points (k nearest neighbors) among the reference points set. The algorithm returns (1) the indexes (positions) of the k nearest points in the reference points set and (2) the k associated Euclidean distances.

We provide 3 CUDA implementations for this algorithm:

knn_cuda_global computes the k-NN using the GPU global memory for storing reference and query points, distances and indexes.
knn_cuda_texture computes the k-NN using the GPU texture memory for storing the reference points and the GPU global memory for storing other arrays. Using a texture usually speeds-up the computations compared to the first approach. However, due to a size constraint of the texture structures in CUDA, this implementation might not an option for some high dimensional problems.
knn_cublas computes the k-NN using a different technique involving the use of CUBLAS (CUDA implementation of BLAS). The computation of the distances are split into several sub-problems better suited to a GPU acceleration. As a consequence, on some specific problems, this implementation may lead to a much faster processing time compared to the first two approaches.

For more information about this project, please refer to the original papers listed in the bibliography section bellow.

Compilation and testing

We provide a test code with the purpose of verifying the correct compilation and execution of the code as well as evaluating the processing time. This code performs the following tasks:

Create a set of reference points randomly initialized.
Create a set of query points randomly initialized.
Compute the ground-truth k-NN using a non-optimized C implementation.
For each CUDA implementation listed above:
- Compare the output (indexes and distances) with the ground-truth.
- Measure the processing time.

One can change the number of points, the point dimension and the parameter k by editing the test.cpp file. However, with some parameters value, the provided code might generate errors. For instance, if the number of points is too big, the GPU memory might not be big enough to store the different arrays allocated by the program. Consequently, an allocation error should be displayed.

Before attempting to compile this code, one should verify the following points:

Computer used has a CUDA-enabled graphic card (c.f. NVIDIA website).
CUDA drivers and CUDA toolkit are installed.
C/C++ compiler is available.
Command nvcc installed.
Command make installed.

Compile the project using the Makefile:

Open a command line tool or a terminal.
Go to the code directory.
Compile the code using the following command: $ Make
Test the code using the following command: $ ./test

Reference compilation / execution

On August 1st, 2018, the code was successfully compiled and executed:

PARAMETERS
- Number reference points : 16384
- Number query points     : 4096
- Dimension of points     : 128
- Number of neighbors     : 16

Ground truth computation in progress...

TESTS
- knn_c             : PASSED in 90.46776 seconds (averaged over   2 iterations)
- knn_cuda_global   : PASSED in  0.04166 seconds (averaged over 100 iterations)
- knn_cuda_texture  : PASSED in  0.06296 seconds (averaged over 100 iterations)
- knn_cublas        : PASSED in  0.03112 seconds (averaged over 100 iterations)

Specifications of the platform used:

OS: Ubuntu 16.04.1
CPU: Intel Xeon E5-2630 @ 2.3 GHz, 24 cores
RAM: 8 x 4 GB DDR3
GPU: Nvidia Titan X 12GB

Data storage and representation

The storage and representation of a set of points used in this project is similar to the one found with Deep Neural Network projects. Let us consider the following set of 3 points of dimension 2:

p1 = (1, 2)
p2 = (3, 4)
p3 = (5, 6)

We represent these points as a 2D array A of size 2 x 3:

A = | 1 3 5 |
    | 2 4 6 |

Internally, this array is actually stored as a linear array (row-major order):

A = (1, 3, 5, 2, 4, 6)

Bibliography

V. Garcia and E. Debreuve and F. Nielsen and M. Barlaud. k-nearest neighbor search: fast GPU-based implementations and application to high-dimensional feature matching. In Proceedings of the IEEE International Conference on Image Processing (ICIP), Hong Kong, China, September 2010
V. Garcia and E. Debreuve and M. Barlaud. Fast k nearest neighbor search using GPU. In Proceedings of the CVPR Workshop on Computer Vision on GPU, Anchorage, Alaska, USA, June 2008.
Vincent Garcia Ph.D. Thesis: Suivi d'objets d'intérêt dans une séquence d'images : des points saillants aux mesures statistiques Université de Nice - Sophia Antipolis, Sophia Antipolis, France, December 2008

Important note

The provided code was written in the context of a Ph.D. thesis. The code was intended to provide a fast and generic solution to the challenging k-NN search problem. However, authors are well aware that for a very specific application (e.g. small number of points, small dimension, only one query point, etc.) there is often a better implementation either in terms of computation time or memory footprint. It is known that a GPGPU implementation should be adapted to the problem and to the specifications of the GPU used. This code will help one to identify if the GPGPU solution is viable for the given problem. Following the terms of the license below, one is allowed and should adapt the code.§

Legacy code

The original code written in 2008 is still available in the legacy folder.

Changes made in the 2018 update:

Code cleaning: The code is now simpler, cleaner and better documented.
Code simplification: Back in 2008, 4 implementations were available, but the differents implementations were performing different tasks. The 2018 implementations all compute the same thing: distances and indexes of the k-NN. One can now select more easily the most appropriate implementation to her/his needs.
No query points split: When computing the k-NN using an exhaustive algorithm, a distance matrix (containing the all distances between query and reference points) must be computed. This matrix can be enormous and may not fit into the GPU memory. In the 2008 code, the query points were eventually splitted into different subsets. These subsets were processed iteratively and the results were finally merged. This allowed to partially solve the memory problem as long as the reference points could fit into the GPU memory. However, this part of the code was quite long and complex. We recall that a CUDA implementation should be adapted to the problem dimension and to the hardware employed. We believe the split used in the 2008 version belongs to the code adaptation and should not be provided into a generic framework. Removing this split is consistent with the idea of making the code cleaner, simpler and more readable.
Testing: Thanks to the test code provided in the 2018 update, one can now easily verify that the code compiles smoothly and that the results are correct. By modifying the parameters used in the test.cpp file, one can quickly identify the most adapted implementation based on the available options and processing times.
Easy compilation: Using a Makefile makes the project compilation much simpler than before.
No Matlab bindings Back in 2008, Matlab was to go-to programming language in Academy. Nowadays, other options are available. Since we do not want to provide and maintain the bindings for Matlab, Python, etc., we decided to simply remove the initial Matlab bindings. If necessary, these bindings are still available in the legacy code.

knn-cuda's People

Contributors

Stargazers

Watchers

Forkers

wanmiles albiyt sjpatel21 jthorniley pengxuwei shengren fsun2p10 linkedinyou codeaudit lingerlman nobodyinamerica daxiongshu ikrishneel yuyu2172 henriquepassos areslp 6676401088 jackyspeed minhpvo saminigod zhencang alexzzhu jettan ajaycharan limin2021 alexandliutao cquxkj jby1993 bowen1212 git-nibird daijicheng drmateo dishengbin neilkeats kervinder dendisuhubdy tcxxxx takanokage xuzhouchuan hunglethanh9 rongzihao ttiwari1 nicqiang phylony lunwk dylanwusee fwzhuang tthhee yueyub munstar redheli rachitb11 turtlezhong yechaowang whu-lyh satoshirobatofujimoto xiaohulugo collector-m autodriventech clyhh busuidemeng 5l1v3r1 aliakseis jkbjh hitzht eralien y04149 jifeixu doomf1st wangyujiewj 1035326373 das-projects rancheng yolavegita ruanchaves wujiaxiu chenyu2016 liuyongcan ramyasriundavall shuvammitra1999 sadavarterohit qmpzqmpz zwl995 kacperkan laplacekorea xuyanwen2012 wangchenjun521 hunisong xuyuzhuang11 lurvelly edhendo ssz990220 ginobilinie sunjuil-ty panzhichen jdm365 sam-37 kitzhongzhengqi liangjin2007 cicly

knn-cuda's Issues

Clarifications on compute_distances() function

Hello Vincent, 
Thank you for your work!

Could you please clarify a few things about your compute_distances() function
that is used in the knn_cuda_global implementation? Could you please clarify
how you specify conditions cond0 and cond2? As far as I understand, they are
used for staying within the A matrix when looping over the sub-matrices, but I
don’t fully understand how they are determined. For B matrix, as I
understood, cond1= (begin_B + tx < query_width) is used to stay within the B
matrix in the X direction. However, you use both cond0 = (begin_A + tx <ref_width) and cond2 = (begin_A + ty < ref_width) to check if the index is
less than ref_width. Shouldn’t these conditions check that the x index is
less than ref_width and the other index is less than the dimension of points?

Thank you in advance!

Best,
Ekaterina

Got Wrong Answer?

Sorry, i have made some mistake and they have been solved. Thanks.

question for use it in python

Hi, Could you please provide some hints to import your function in python codes? Thanks a lot!

what is the 'ind' refer to?

Thanks for your codes sharing.
But I still have some question when I use this.
I don't know what the 'ind' refer to in the file 'KNN.CUBLAS.with.indexes/knn_cublas_with_indexes.cu'.
So I make some experiment, but all fail.
Such as, for this case:
(Number of reference points : 4096 Number of query points : 1 Dimension of points : 2 Number of neighbors to consider : 1 )
And the ind is only one dim.

Is it the nth pair of point? (means the point found is ref[ind[0] * 2, ind[0] * 2 + 1])
Or the position that fit to the query point? (means the point found is ref[ind[0] - 1, ind[0])

I would appriciate if you reply soon.

High-Dimensional High-K Performance

Hello,

Thank you very much for sharing your valuable work. The type of applications I work with require KNN at high-dimensional (d>1000)) & high-k (k>1500). I noticed that the C implementation performance is faster when d and k become large. For example,

PARAMETERS

Number reference points : 16384
Number query points : 5
Dimension of points : 238
Number of neighbors : 600
Ground truth computation in progress...
TESTS
knn_c : PASSED in 0.11800 seconds (averaged over 2 iterations)
knn_cuda_global : PASSED in 0.45576 seconds (averaged over 100 iterations)
knn_cuda_texture : PASSED in 0.46048 seconds (averaged over 100 iterations)
knn_cublas : FAILED

Please let me know if it is possible to modify the code to make it perform faster for such cases.
Hardware :
CPU: intel Xeon W-2145 (3.7GHz)
GPU: Nvidia Quadro GV100

Best regard,
Ali

knn_cublas_with_indexes.cu failed to work

environment:
cuda 7.5, win10, visual studio 2013

knn_cublas_with_indexes.cu failed to work.

I get the following errors:
MEMORY ALLOCATION ERROE: unspecfied launch time failure
Wished allocated memory:481024000

Distance is always 0

The output if printed or checked is not accurate. The test_dist and gt_dist are both 0 hence 0-0 is always less than the precision. hence its passes but the output is wrong.

Sub-matrix of A

I wonder why begin_A = BLOCK_DIM * blockIdx.y ?, does that mean their begin_A is identical when threads' blockIdx.y=0 even though blockIdx.x are not the same; I want to know how to create the Sub-matrices.
this model is often used, int row = blockIdx.y * blockDim.y + threadIdx.y;Can you give some tips？ thanks a lot!

Compatible to CUDA9.1?

is your package compatible to CUDA9.1, cuDNN7.1 on Ubuntu 16.04?

Tutorial needed.

Hi!
Can you provide some command line examples? I can not figure out how to use your code.
Thank you very much!

Selection sort implementation

Hello,

My college group and I, tried to modify this algorithm by using selection sort which takes first k elements. However the precision we get for indices is less than 99.9% and the test function fails. This has lead us to believe that using selection sort for a parallel knn algorithm isn't possible. So, our question is if you think that it is possible and if it is how would you implement it? We can send you code for our selection sort method if you need it.

Kind regards,
Amer

failed on cuda7 ubuntu14

ubgpu@ubgpu:~/github/kNN-CUDA/KNN.CUDA.with.indexes$ sudo nvcc -o knn_cuda_with_indexes.exe knn_cuda_with_indexes.cu -lcuda -D_CRT_SECURE_NO_DEPRECATE
knn_cuda_with_indexes.cu(344): error: argument of type "unsigned int *" is incompatible with parameter of type "size_t *"

knn_cuda_with_indexes.cu(344): error: argument of type "unsigned int *" is incompatible with parameter of type "size_t *"

knn_cuda_with_indexes.cu(349): error: more than one instance of overloaded function "min" matches the argument list:
function "min(int, int)"
function "min(int, unsigned int)"
argument types are: (int, unsigned long)

knn_cuda_with_indexes.cu(409): error: more than one instance of overloaded function "min" matches the argument list:
function "min(int, int)"
function "min(unsigned int, int)"
argument types are: (size_t, int)

4 errors detected in the compilation of "/tmp/tmpxft_00000ba3_00000000-9_knn_cuda_with_indexes.cpp1.ii".
ubgpu@ubgpu:~/github/kNN-CUDA/KNN.CUDA.with.indexes$

About KDTree

knn_cuda slower than C++ implementation

Hi,

I was trying to optimize some kNN query using cKDTree from Python scipy, which uses C++

import time
import numpy
from scipy.spatial import cKDTree

numpy.random.seed(0)
arr = numpy.random.rand(10000, 3).astype(numpy.float32)
ca  = numpy.arange(10000, dtype=numpy.uint32)

timings = []
for i in range(100):
    if i < 5: continue
    start = time.time()
    tree = cKDTree(arr)
    neighbors2 = tree.query(arr, 2)
    timings.append(time.time() - start)

print("Scipy:", numpy.mean(timings))
Scipy: 0.013833379745483399

However, if I use the same parameters in your test, it's slower for cuda.

PARAMETERS
- Number reference points : 10000
- Number query points     : 10000
- Dimension of points     : 3
- Number of neighbors     : 1

Ground truth computation in progress...

TESTS
- knn_c             : PASSED in  2.91740 seconds (averaged over   2 iterations)
- knn_cuda_global   : PASSED in  0.06941 seconds (averaged over 100 iterations)
- knn_cuda_texture  : PASSED in  0.05560 seconds (averaged over 100 iterations)
- knn_cublas        : ERROR: Unable to copy data from device to host

So the CUDA version is >5x slower. Also, I am unable to run knn_cublas.

(Question) Volta architecture possible improvements

Hi Vincent (et al.),

Thank you for your contribution to the literature. I will probably be basing my implementation of KNN on your work, for my research.

I just found the repo, and so I've not yet read your thesis.
I did read your note regarding GPGPU implementations being specific to the GPU(s) being used.

Do you think that the new Volta architecture offers any features that would merit a revised implementation?

Any advice is welcomed.

Thank you!

C implementation not fully optimized

Hello, first of all let me thank you, your work might be helpful for my investigation.

I've checked your code and found that CPU implementation could run faster by adding the -O3 flag to the makefile, which could lead to a more realistic comparison. I leave my results (running on Intel i7-8700 and GeForce 1080Ti, ubuntu 16.04, CUDA 10.0):

My Makefile:

.PHONY: all test

all: test test2

test:
nvcc -o test test.cpp knncuda.cu -lcuda -lcublas -lcudart -Wno-deprecated-gpu-targets

test2:
nvcc -O3 -o test2 test.cpp knncuda.cu -lcuda -lcublas -lcudart -Wno-deprecated-gpu-targets

clean:
rm test
rm test2

about reference points

This code is helpful for beginners learning cuda, why we can't increase the limit...