nearestneighbors.jl's Introduction

NearestNeighbors.jl


NearestNeighbors.jl is a package written in Julia to perform nearest neighbor searches.


Creating a tree

There are currently three types of trees available:

  • BruteTree: Not actually a tree. It linearly searches all points in a brute force fashion. Works with any Metric.
  • KDTree: In a kd-tree, points are recursively split into groups using hyper-planes. A KDTree therefore only works with axis-aligned metrics, which are: Euclidean, Chebyshev, Minkowski, and Cityblock.
  • BallTree: Points are recursively split into groups bounded by hyper-spheres. Works with any Metric.

These trees are created with the following syntax:

BruteTree(data, metric; leafsize, reorder)
KDTree(data, metric; leafsize, reorder)
BallTree(data, metric; leafsize, reorder)
  • data: The data, i.e., the points to build up the tree from. It can either be
    • a matrix of size nd × np with the points to insert in the tree, where nd is the dimensionality of the points and np is the number of points
    • a vector of vectors with fixed dimensionality, nd, which must be part of the type. Specifically, data should be a Vector{V}, where V is itself a subtype of an AbstractVector and such that eltype(V) and length(V) are defined. (For example, with 3D points, V = SVector{3, Float64} works because eltype(V) = Float64 and length(V) = 3 are defined in V.)
  • metric: The metric to use, defaults to Euclidean. This is one of the Metric types defined in the Distances.jl package. It is possible to define your own metrics by simply creating new types that are subtypes of Metric (see the sketch after the tree-creation examples below).
  • leafsize (keyword argument): Determines the number of points at which to stop splitting the tree further. There is a trade-off between the cost of traversing the tree and the cost of evaluating the metric against all the points in a leaf.
  • reorder (keyword argument): If true (the default), points that are close in distance are placed close in memory while building the tree, which helps with cache locality. In this case, a copy of the original data is made so that the original data is left unmodified. This can have a significant impact on performance.

All trees in NearestNeighbors.jl are static, which means that points cannot be added to or removed from an already created tree.

Here are a few examples of creating trees:

using NearestNeighbors
data = rand(3, 10^4)

# Create trees
kdtree = KDTree(data; leafsize = 10)
balltree = BallTree(data, Minkowski(3.5); reorder = false)
brutetree = BruteTree(data)
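
As an illustration of the last point about metrics, a user-defined metric can be this small (a sketch assuming a recent Distances.jl, where making the type callable is enough; capping Euclidean at a constant preserves the metric axioms):

using Distances

struct CappedEuclidean <: Metric end
(::CappedEuclidean)(a, b) = min(1.0, euclidean(a, b))

captree = BallTree(data, CappedEuclidean())  # BallTree accepts any Metric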

k Nearest Neighbor (kNN) searches

A kNN search finds the k nearest neighbors to given point(s). This is done with the method:

knn(tree, points, k, sortres = false, skip = always_false) -> idxs, dists
  • tree: The tree instance
  • points: A vector or matrix of points to find the k nearest neighbors to. If points is a vector of numbers, it represents a single point; if points is a matrix, the k nearest neighbors to each point (column) will be computed. points can also be a vector of other vectors, where each element in the outer vector is considered a point.
  • k: The number of nearest neighbors to find.
  • sortres (optional): Determines if the results should be sorted before returning. If true, the results are sorted in order of increasing distance to the point.
  • skip (optional): A predicate to determine if a given point should be skipped, for example if iterating over points and a point has already been visited.

It is generally better for performance to query once with a large number of points than to query multiple times with one point per query.

As a convenience, if you only want the closest nearest neighbor, you can call nn instead for a cleaner result:

nn(tree, points, skip = always_false) -> idxs, dists

Some examples:

using NearestNeighbors
data = rand(3, 10^4)
k = 3
point = rand(3)

kdtree = KDTree(data)
idxs, dists = knn(kdtree, point, k, true)

idxs
# 3-element Array{Int64,1}:
#  4683
#  6119
#  3278

dists
# 3-element Array{Float64,1}:
#  0.039032201026256215
#  0.04134193711411951
#  0.042974090446474184

# Multiple points
points = rand(3, 4);

idxs, dists = knn(kdtree, points, k, true);

idxs
# 4-element Array{Array{Int64,1},1}:
#  [3330, 4072, 2696]
#  [1825, 7799, 8358]
#  [3497, 2169, 3737]
#  [1845, 9796, 2908]

dists
# 4-element Array{Array{Float64,1},1}:
#  [0.0298932, 0.0327349, 0.0365979]
#  [0.0348751, 0.0498355, 0.0506802]
#  [0.0318547, 0.037291, 0.0421208]
#  [0.03321, 0.0360935, 0.0411951]

# Static vectors (requires the StaticArrays package)
using StaticArrays
v = @SVector[0.5, 0.3, 0.2];

idxs, dists = knn(kdtree, v, k, true);

idxs
# 3-element Array{Int64,1}:
#   842
#  3075
#  3046

dists
# 3-element Array{Float64,1}:
#  0.04178677766255837
#  0.04556078331418939
#  0.049967238112417205
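
For the single closest neighbor, nn returns the index and distance directly when given one point:

idx, dist = nn(kdtree, point)
# idx == 4683, dist ≈ 0.0390 (the first entries of the sorted knn results above)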

Range searches

A range search finds all neighbors within the range r of given point(s). This is done with the method:

inrange(tree, points, r, sortres = false) -> idxs

Note that for performance reasons the distances are not returned. The arguments to inrange are the same as for knn except that sortres just sorts the returned index vector.

An example:

using NearestNeighbors
data = rand(3, 10^4)
r = 0.05
point = rand(3)

balltree = BallTree(data)
idxs = inrange(balltree, point, r, true)

# 4-element Array{Int64,1}:
#  317
#  983
# 4577
# 8675

# If you are only interested in the number of neighbors, inrangecount
# counts them without allocating arrays for the indices:
neighborscount = inrangecount(balltree, point, r)

Using on-disk data sets

By default, the trees store a copy of the data provided during construction. This is problematic for data sets that are larger than available memory, where you may want to mmap the data or store it somewhere else, outside the tree.

DataFreeTree can be used to strip a constructed tree of its data field and re-link it with that data at a later stage. An example of using a large on-disk data set looks like this:

using Mmap
ndim = 2; ndata = 10_000_000_000
datafilename = "data.bin"  # hypothetical path; the file must be large enough to hold the data
data = Mmap.mmap(datafilename, Matrix{Float32}, (ndim, ndata))
data[:] = rand(Float32, ndim, ndata)  # create example data
dftree = DataFreeTree(KDTree, data)

dftree now only stores the indexing data structures. It can be passed around, saved and reloaded independently of data.

To perform look-ups, dftree is re-linked to the underlying data:

tree = injectdata(dftree, data)  # yields a KDTree
knn(tree, data[:,1], 3)  # perform operations as usual

Author

Kristoffer Carlsson - @KristofferC - [email protected]

nearestneighbors.jl's People

Contributors

ablaom, andyferris, bioturbonick, briochemc, c42f, carlolucibello, davnn, dkarrasch, eliascarv, femtocleaner[bot], fredrikekre, goretkin, ilyaorson, joshchristie, juliohm, kristofferc, okonsamuel, pablosanjose, schmrlng, shushman, staticfloat, tkelman, tlnagy, visr


nearestneighbors.jl's Issues

[Breaking] MethodError: no method matching length(::Type{Array{Float64,1}})

After my latest update I do:

using NearestNeighbors
a = [rand(3) for i in 1:100]
KDTree(a)

MethodError: no method matching length(::Type{Array{Float64,1}})
Closest candidates are:
  length(!Matched::SimpleVector) at essentials.jl:256
  length(!Matched::Base.MethodList) at reflection.jl:558
  length(!Matched::MethodTable) at reflection.jl:634
  ...
in NearestNeighbors.KDTree at NearestNeighbors\src\kd_tree.jl:34
in #KDTree#12 at NearestNeighbors\src\kd_tree.jl:36
in NearestNeighbors.TreeData at NearestNeighbors\src\tree_data.jl:14
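
A workaround sketch, consistent with the README's requirement that eltype and length be defined on the element type (assumes StaticArrays):

using NearestNeighbors, StaticArrays
a = [SVector{3}(rand(3)) for i in 1:100]
KDTree(a)  # works: length(SVector{3,Float64}) is known from the type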

Getting indices of nearest neighbor in one go

Hi Kristoffer

Let's say that I have a tree:

using NearestNeighbors, StaticArrays
data = [rand(SVector{3}) for i in 1:1000]
tree = KDTree(data, Euclidean())

I found that I am implementing an algorithm where I want the nearest neighbor of a point very frequently. Sometimes I also query for it multiple times. I was wondering whether there is some way I can get the indices of nearest neighbors from the tree structure.

Of course I know I can do:

nind = [knn(tree, data[j], 1, false, i -> i == j)[1][1] for j in 1:length(data)]

but is there some "faster way"?

Edit: the tree is not an argument to my high-level function but is created inside, so if there is a way to do what I want even during construction, it will still be super useful.

`sqrt` error with `knn` on `BallTree`

I'm trying to get nearest neighbours with Haversine distances. I use these two .asc (delimited) files:
current: https://www.dropbox.com/s/egfv2tjhbjwacl4/PresentDaySpeciesOverlap.asc?dl=0
future: https://www.dropbox.com/s/r8mirakim75utyd/Future2080SpeciesOverlap.asc?dl=0

It's hard to make the MWE really minimal:

import DelimitedFiles: readdlm, writedlm
import NearestNeighbors: BallTree, knn
import Distances: Haversine

current = replace!(readdlm("PresentDaySpeciesOverlap.asc", ' ', skipstart = 6), -9999=>NaN)
future = replace!(readdlm("Future2080SpeciesOverlap.asc", ' ', skipstart = 6), -9999=>NaN)

present = [[x[1], x[2]] for x in CartesianIndices(current) if current[x] == 1]
potential = [[x[1], x[2]] for x in CartesianIndices(current) if current[x] == 0 && future[x] == 1]

tree = BallTree(float(reduce(hcat, present)), Haversine(6371))
inds, dists = knn(tree, float(reduce(hcat, potential)), 1) #errors

The error message is

DomainError with -5.551115123125783e-17:
sqrt will only return a complex result if called with a complex argument. Try sqrt(Complex(x)).
throw_complex_domainerror(::Symbol, ::Float64) at math.jl:32
sqrt at math.jl:492 [inlined]
(::Haversine{Int64})(::StaticArrays.SArray{Tuple{2},Float64,1,2}, ::StaticArrays.SArray{Tuple{2},Float64,1,2}) at haversine.jl:33
add_points_knn! at generic.jl:24 [inlined]
knn_kernel!(::BallTree{StaticArrays.SArray{Tuple{2},Float64,1,2},2,Float64,Haversine{Int64}}, ::Int64, ::StaticArrays.SArray{Tuple{2},Float64,1,2}, ::Array{Int64,1}, ::Array{Float64,1}, ::typeof(NearestNeighbors.always_false)) at ball_tree.jl:161
knn_kernel!(::BallTree{StaticArrays.SArray{Tuple{2},Float64,1,2},2,Float64,Haversine{Int64}}, ::Int64, ::StaticArrays.SArray{Tuple{2},Float64,1,2}, ::Array{Int64,1}, ::Array{Float64,1}, ::typeof(NearestNeighbors.always_false)) at ball_tree.jl:173
knn_kernel!(::BallTree{StaticArrays.SArray{Tuple{2},Float64,1,2},2,Float64,Haversine{Int64}}, ::Int64, ::StaticArrays.SArray{Tuple{2},Float64,1,2}, ::Array{Int64,1}, ::Array{Float64,1}, ::typeof(NearestNeighbors.always_false)) at ball_tree.jl:178
knn_kernel!(::BallTree{StaticArrays.SArray{Tuple{2},Float64,1,2},2,Float64,Haversine{Int64}}, ::Int64, ::StaticArrays.SArray{Tuple{2},Float64,1,2}, ::Array{Int64,1}, ::Array{Float64,1}, ::typeof(NearestNeighbors.always_false)) at ball_tree.jl:178
knn_kernel!(::BallTree{StaticArrays.SArray{Tuple{2},Float64,1,2},2,Float64,Haversine{Int64}}, ::Int64, ::StaticArrays.SArray{Tuple{2},Float64,1,2}, ::Array{Int64,1}, ::Array{Float64,1}, ::typeof(NearestNeighbors.always_false)) at ball_tree.jl:180
knn_kernel!(::BallTree{StaticArrays.SArray{Tuple{2},Float64,1,2},2,Float64,Haversine{Int64}}, ::Int64, ::StaticArrays.SArray{Tuple{2},Float64,1,2}, ::Array{Int64,1}, ::Array{Float64,1}, ::typeof(NearestNeighbors.always_false)) at ball_tree.jl:178
knn_kernel!(::BallTree{StaticArrays.SArray{Tuple{2},Float64,1,2},2,Float64,Haversine{Int64}}, ::Int64, ::StaticArrays.SArray{Tuple{2},Float64,1,2}, ::Array{Int64,1}, ::Array{Float64,1}, ::typeof(NearestNeighbors.always_false)) at ball_tree.jl:178
knn_kernel!(::BallTree{StaticArrays.SArray{Tuple{2},Float64,1,2},2,Float64,Haversine{Int64}}, ::Int64, ::StaticArrays.SArray{Tuple{2},Float64,1,2}, ::Array{Int64,1}, ::Array{Float64,1}, ::typeof(NearestNeighbors.always_false)) at ball_tree.jl:178
knn_kernel!(::BallTree{StaticArrays.SArray{Tuple{2},Float64,1,2},2,Float64,Haversine{Int64}}, ::Int64, ::StaticArrays.SArray{Tuple{2},Float64,1,2}, ::Array{Int64,1}, ::Array{Float64,1}, ::typeof(NearestNeighbors.always_false)) at ball_tree.jl:178
knn_kernel!(::BallTree{StaticArrays.SArray{Tuple{2},Float64,1,2},2,Float64,Haversine{Int64}}, ::Int64, ::StaticArrays.SArray{Tuple{2},Float64,1,2}, ::Array{Int64,1}, ::Array{Float64,1}, ::typeof(NearestNeighbors.always_false)) at ball_tree.jl:180
knn_kernel!(::BallTree{StaticArrays.SArray{Tuple{2},Float64,1,2},2,Float64,Haversine{Int64}}, ::Int64, ::StaticArrays.SArray{Tuple{2},Float64,1,2}, ::Array{Int64,1}, ::Array{Float64,1}, ::typeof(NearestNeighbors.always_false)) at ball_tree.jl:180
knn_kernel!(::BallTree{StaticArrays.SArray{Tuple{2},Float64,1,2},2,Float64,Haversine{Int64}}, ::Int64, ::StaticArrays.SArray{Tuple{2},Float64,1,2}, ::Array{Int64,1}, ::Array{Float64,1}, ::typeof(NearestNeighbors.always_false)) at ball_tree.jl:178
_knn(::BallTree{StaticArrays.SArray{Tuple{2},Float64,1,2},2,Float64,Haversine{Int64}}, ::StaticArrays.SArray{Tuple{2},Float64,1,2}, ::Array{Int64,1}, ::Array{Float64,1}, ::Function) at ball_tree.jl:148
knn_point!(::BallTree{StaticArrays.SArray{Tuple{2},Float64,1,2},2,Float64,Haversine{Int64}}, ::StaticArrays.SArray{Tuple{2},Float64,1,2}, ::Bool, ::Array{Float64,1}, ::Array{Int64,1}, ::Function) at knn.jl:30
knn(::BallTree{StaticArrays.SArray{Tuple{2},Float64,1,2},2,Float64,Haversine{Int64}}, ::Array{StaticArrays.SArray{Tuple{2},Float64,1,2},1}, ::Int64, ::Bool, ::Function) at knn.jl:22
knn(::BallTree{StaticArrays.SArray{Tuple{2},Float64,1,2},2,Float64,Haversine{Int64}}, ::Array{Float64,2}, ::Int64, ::Bool, ::Function) at knn.jl:55
knn(::BallTree{StaticArrays.SArray{Tuple{2},Float64,1,2},2,Float64,Haversine{Int64}}, ::Array{Float64,2}, ::Int64) at knn.jl:48
getdists(::Array{Float64,2}, ::Array{Float64,2}) at dist.jl:14
top-level scope at dist.jl:23

Seems like some 0 distance is interpreted as a small (<eps) negative number.
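
The failure mode in isolation, together with the usual clamping workaround (a sketch; not necessarily the fix the package adopted):

x = -5.551115123125783e-17  # a "zero" produced by floating-point cancellation
# sqrt(x)                   # throws the DomainError shown above
sqrt(max(x, 0.0))           # clamping to zero yields the intended 0.0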

Fast nearest-neighbor searching for nonlinear signal processing

Hi Kristoffer here is the algorithm that I mentioned:

Merkwirth, Parlitz, Lauterborn - 2000 - Fast nearest-neighbor searching for nonlinear signal processing.pdf

A professor in my institute (U. Parlitz) has always been telling me that their algorithm was "very fast" but I never really compared it with anything... In any case, I now have the code that implements the algorithm:

https://gist.github.com/Datseris/5aee24f34cb187db1bb7eb85f65818d7

Problem is: it is many lines of C code, which I don't really understand :D If anyone can put them to use, I leave them here!

I also tag @andyferris and @JoshChristie since I mentioned this to them as well during Andy's talk.

Let me know if there is some code file missing; I had like 20 of them, but they told me the 2 main files are nn_prepare and nn_search.

BallTree only works with Distances.UnionMetrics

The BallTree implementation relies on the column-selecting evaluation methods in https://github.com/KristofferC/NearestNeighbors.jl/blob/master/src/evaluation.jl which only work for metrics that can be expressed as element-wise reductions (i.e. Distances.UnionMetrics). Does the underlying math of the BallTree truly apply to any Metric (or SemiMetric)? If that is the case, a fallback implementation relying on slice would be nice so that the weighted metrics and user-defined metrics can be used.

`inrange` returning indices ordered by distance

I was using the option sortres in inrange thinking that it was ordering the indices from smallest to largest distance. I then realized, by reading the source code, that the ordering is on the indices found.

Is there any efficient method for getting a list of indices ordered by distance? That would be useful.

Investigate Fixed Size Arrays / dim parameterization

It could be possible to get some speedups by using a vector of fixed size arrays instead of matrices to represent the data in the trees. The loop counts would then be known at compile time and it is possible that more efficient code could be generated.

Another way to know the loop counts at compile time is to parametrize the trees on the dimension of the points.

For small dimensions it should be possible for LLVM to, for example, unroll the loops if it thinks it is beneficial.

Both things are worth looking into.

Is it possible to use integers instead of floats?

I would like to use NearestNeighbors.jl to find the nearest string in Hamming distance from a given input string. Do I have to convert them to floats to use the NearestNeighbors library? I would prefer to avoid this because of precision issues and memory use. For example, I can't use ints with BallTree:

data = rand(1:4, (10, 10^4))
balltree = BallTree(data, Hamming())

gives the following error:

LoadError: MethodError: `convert` has no method matching convert(::Type{NearestNeighbors.BallTree{T<:AbstractFloat,M<:Distances.Metric}}, ::Array{Int64,2}, ::Distances.Hamming)
This may have arisen from a call to the constructor NearestNeighbors.BallTree{T<:AbstractFloat,M<:Distances.Metric}(...),
since type constructors fall back to convert methods.
Closest candidates are:
  NearestNeighbors.BallTree{T<:AbstractFloat,M<:Distances.Metric}(!Matched::Array{T<:AbstractFloat,2}, ::M<:Distances.Metric)
  call{T}(::Type{T}, ::Any)
  convert{T}(::Type{T}, !Matched::T)
  ...
while loading In[53], in expression starting on line 2

 in call at essentials.jl:57

Is there a specific reason that AbstractFloat is the parent type? Distances.jl can handle other types just fine:

evaluate(Hamming(), ["A","B","C"], ["A","B","D"])

I'm using Julia v0.4.1 and the master branches of both NearestNeighbors and Distances.

Not possible to save KDTree type to .jld2

Hey there,
I want to experiment with different tree types and want to save
the type I used in each example alongside the data.
I came upon this weird error that only shows up for KDTree.

I was also able to save all other custom types I could think of so
I'm not sure what the issue is.

julia> using NearestNeighbors, FileIO
julia> save("Foo.jld2", "Bar", NNTree)

julia> save("Foo.jld2", "Bar", BallTree)

julia> save("Foo.jld2", "Bar", DataFreeTree)

julia> save("Foo.jld2", "Bar", BruteTree)

julia> save("Foo.jld2", "Bar", KDTree)
Error encountered while saving "/home/jonas/Foo.jld2".
Fatal error:
ERROR: MethodError: Cannot `convert` an object of type Type{Minkowski} to an object of type DataType
Closest candidates are:
  convert(::Type{T}, ::T) where T at essentials.jl:154
Stacktrace:
 [1] setindex!(::Array{DataType,1}, ::Type, ::Int64) at ./array.jl:769
 [2] copyto!(::IndexLinear, ::Array{DataType,1}, ::IndexLinear, ::Array{Any,1}) at ./abstractarray.jl:731
 [3] copyto! at ./abstractarray.jl:723 [inlined]
 [4] Type at ./array.jl:497 [inlined]
 [5] h5convert!(::JLD2.IndirectPointer, ::Type{JLD2.Vlen{JLD2.OnDiskRepresentation{(0, 16),Tuple{String,Array{Any,1}},Tuple{JLD2.Vlen{String},JLD2.Vlen{JLD2.RelOffset}}}()}}, ::JLD2.JLDFile{JLD2.MmapIO}, ::Union, ::JLD2.JLDWriteSession{Dict{UInt64,JLD2.RelOffset}}) at /home/jonas/.julia/packages/JLD2/KjBIK/src/data.jl:964
 [6] write_data(::JLD2.MmapIO, ::JLD2.JLDFile{JLD2.MmapIO}, ::Type, ::Type{JLD2.Vlen{JLD2.OnDiskRepresentation{(0, 16),Tuple{String,Array{Any,1}},Tuple{JLD2.Vlen{String},JLD2.Vlen{JLD2.RelOffset}}}()}}, ::JLD2.HasReferences, ::JLD2.JLDWriteSession{Dict{UInt64,JLD2.RelOffset}}) at /home/jonas/.julia/packages/JLD2/KjBIK/src/dataio.jl:111
 [7] write_dataset(::JLD2.JLDFile{JLD2.MmapIO}, ::JLD2.WriteDataspace{0,Tuple{}}, ::JLD2.CommittedDatatype, ::Type{JLD2.Vlen{JLD2.OnDiskRepresentation{(0, 16),Tuple{String,Array{Any,1}},Tuple{JLD2.Vlen{String},JLD2.Vlen{JLD2.RelOffset}}}()}}, ::Type, ::JLD2.JLDWriteSession{Dict{UInt64,JLD2.RelOffset}}) at /home/jonas/.julia/packages/JLD2/KjBIK/src/datasets.jl:427
 [8] write_dataset(::JLD2.JLDFile{JLD2.MmapIO}, ::Type, ::JLD2.JLDWriteSession{Dict{UInt64,JLD2.RelOffset}}) at /home/jonas/.julia/packages/JLD2/KjBIK/src/datasets.jl:521
 [9] write_ref(::JLD2.JLDFile{JLD2.MmapIO}, ::Type, ::JLD2.JLDWriteSession{Dict{UInt64,JLD2.RelOffset}}) at /home/jonas/.julia/packages/JLD2/KjBIK/src/datasets.jl:536
 [10] h5convert! at /home/jonas/.julia/packages/JLD2/KjBIK/src/data.jl:658 [inlined]
 [11] macro expansion at /home/jonas/.julia/packages/JLD2/KjBIK/src/data.jl:491 [inlined]
 [12] h5convert!(::JLD2.IndirectPointer, ::JLD2.OnDiskRepresentation{(0, 16, 24),Tuple{Symbol,Any,Any},Tuple{JLD2.Vlen{String},JLD2.RelOffset,JLD2.RelOffset}}, ::JLD2.JLDFile{JLD2.MmapIO}, ::TypeVar, ::JLD2.JLDWriteSession{Dict{UInt64,JLD2.RelOffset}}) at /home/jonas/.julia/packages/JLD2/KjBIK/src/data.jl:491
 [13] write_data(::JLD2.MmapIO, ::JLD2.JLDFile{JLD2.MmapIO}, ::TypeVar, ::JLD2.OnDiskRepresentation{(0, 16, 24),Tuple{Symbol,Any,Any},Tuple{JLD2.Vlen{String},JLD2.RelOffset,JLD2.RelOffset}}, ::JLD2.HasReferences, ::JLD2.JLDWriteSession{Dict{UInt64,JLD2.RelOffset}}) at /home/jonas/.julia/packages/JLD2/KjBIK/src/dataio.jl:111
 [14] write_dataset(::JLD2.JLDFile{JLD2.MmapIO}, ::JLD2.WriteDataspace{0,Tuple{}}, ::JLD2.CommittedDatatype, ::JLD2.OnDiskRepresentation{(0, 16, 24),Tuple{Symbol,Any,Any},Tuple{JLD2.Vlen{String},JLD2.RelOffset,JLD2.RelOffset}}, ::TypeVar, ::JLD2.JLDWriteSession{Dict{UInt64,JLD2.RelOffset}}) at /home/jonas/.julia/packages/JLD2/KjBIK/src/datasets.jl:427
 [15] write_dataset at /home/jonas/.julia/packages/JLD2/KjBIK/src/datasets.jl:521 [inlined]
 [16] write_ref_mutable at /home/jonas/.julia/packages/JLD2/KjBIK/src/datasets.jl:526 [inlined]
 [17] write_ref at /home/jonas/.julia/packages/JLD2/KjBIK/src/datasets.jl:534 [inlined]
 [18] h5convert! at /home/jonas/.julia/packages/JLD2/KjBIK/src/data.jl:658 [inlined]
 [19] h5convert!(::JLD2.IndirectPointer, ::JLD2.OnDiskRepresentation{(0, 8),Tuple{TypeVar,Any},Tuple{JLD2.RelOffset,JLD2.RelOffset}}, ::JLD2.JLDFile{JLD2.MmapIO}, ::UnionAll, ::JLD2.JLDWriteSession{Dict{UInt64,JLD2.RelOffset}}) at /home/jonas/.julia/packages/JLD2/KjBIK/src/data.jl:986
 [20] write_data(::JLD2.MmapIO, ::JLD2.JLDFile{JLD2.MmapIO}, ::Type, ::JLD2.OnDiskRepresentation{(0, 8),Tuple{TypeVar,Any},Tuple{JLD2.RelOffset,JLD2.RelOffset}}, ::JLD2.HasReferences, ::JLD2.JLDWriteSession{Dict{UInt64,JLD2.RelOffset}}) at /home/jonas/.julia/packages/JLD2/KjBIK/src/dataio.jl:111
 [21] write_dataset(::JLD2.JLDFile{JLD2.MmapIO}, ::JLD2.WriteDataspace{0,Tuple{}}, ::JLD2.CommittedDatatype, ::JLD2.OnDiskRepresentation{(0, 8),Tuple{TypeVar,Any},Tuple{JLD2.RelOffset,JLD2.RelOffset}}, ::Type, ::JLD2.JLDWriteSession{Dict{UInt64,JLD2.RelOffset}}) at /home/jonas/.julia/packages/JLD2/KjBIK/src/datasets.jl:427
 [22] write_dataset(::JLD2.JLDFile{JLD2.MmapIO}, ::Type, ::JLD2.JLDWriteSession{Dict{UInt64,JLD2.RelOffset}}) at /home/jonas/.julia/packages/JLD2/KjBIK/src/datasets.jl:521
 [23] write_ref(::JLD2.JLDFile{JLD2.MmapIO}, ::Type, ::JLD2.JLDWriteSession{Dict{UInt64,JLD2.RelOffset}}) at /home/jonas/.julia/packages/JLD2/KjBIK/src/datasets.jl:536
 [24] h5convert! at /home/jonas/.julia/packages/JLD2/KjBIK/src/data.jl:658 [inlined]
 [25] h5convert!(::JLD2.IndirectPointer, ::JLD2.OnDiskRepresentation{(0, 8),Tuple{TypeVar,Any},Tuple{JLD2.RelOffset,JLD2.RelOffset}}, ::JLD2.JLDFile{JLD2.MmapIO}, ::UnionAll, ::JLD2.JLDWriteSession{Dict{UInt64,JLD2.RelOffset}}) at /home/jonas/.julia/packages/JLD2/KjBIK/src/data.jl:987
 [26] write_data(::JLD2.MmapIO, ::JLD2.JLDFile{JLD2.MmapIO}, ::Type, ::JLD2.OnDiskRepresentation{(0, 8),Tuple{TypeVar,Any},Tuple{JLD2.RelOffset,JLD2.RelOffset}}, ::JLD2.HasReferences, ::JLD2.JLDWriteSession{Dict{UInt64,JLD2.RelOffset}}) at /home/jonas/.julia/packages/JLD2/KjBIK/src/dataio.jl:111
 [27] write_dataset(::JLD2.JLDFile{JLD2.MmapIO}, ::JLD2.WriteDataspace{0,Tuple{}}, ::JLD2.CommittedDatatype, ::JLD2.OnDiskRepresentation{(0, 8),Tuple{TypeVar,Any},Tuple{JLD2.RelOffset,JLD2.RelOffset}}, ::Type, ::JLD2.JLDWriteSession{Dict{UInt64,JLD2.RelOffset}}) at /home/jonas/.julia/packages/JLD2/KjBIK/src/datasets.jl:427
 [28] write_dataset(::JLD2.JLDFile{JLD2.MmapIO}, ::Type, ::JLD2.JLDWriteSession{Dict{UInt64,JLD2.RelOffset}}) at /home/jonas/.julia/packages/JLD2/KjBIK/src/datasets.jl:521
 [29] write(::JLD2.Group{JLD2.JLDFile{JLD2.MmapIO}}, ::String, ::Type, ::JLD2.JLDWriteSession{Dict{UInt64,JLD2.RelOffset}}) at /home/jonas/.julia/packages/JLD2/KjBIK/src/groups.jl:119
 [30] write(::JLD2.JLDFile{JLD2.MmapIO}, ::String, ::UnionAll, ::JLD2.JLDWriteSession{Dict{UInt64,JLD2.RelOffset}}) at /home/jonas/.julia/packages/JLD2/KjBIK/src/JLD2.jl:327
 [31] #38 at /home/jonas/.julia/packages/JLD2/KjBIK/src/loadsave.jl:103 [inlined]
 [32] #jldopen#31(::Base.Iterators.Pairs{Union{},Union{},Tuple{},NamedTuple{(),Tuple{}}}, ::Function, ::getfield(JLD2, Symbol("##38#39")){String,UnionAll,Tuple{}}, ::String, ::Vararg{String,N} where N) at /home/jonas/.julia/packages/JLD2/KjBIK/src/loadsave.jl:4
 [33] jldopen(::Function, ::String, ::String) at /home/jonas/.julia/packages/JLD2/KjBIK/src/loadsave.jl:2
 [34] #save#37(::Base.Iterators.Pairs{Union{},Union{},Tuple{},NamedTuple{(),Tuple{}}}, ::Function, ::File{DataFormat{:JLD2}}, ::String, ::Type) at /home/jonas/.julia/packages/JLD2/KjBIK/src/loadsave.jl:101
 [35] save(::File{DataFormat{:JLD2}}, ::String, ::Type) at /home/jonas/.julia/packages/JLD2/KjBIK/src/loadsave.jl:98
 [36] top-level scope at none:0

Define function `nn`

Suppose I want only the nearest neighbour for a number of points; then I'd like knn to return a Vector{Int} instead of a Vector{Vector{Int}}. Maybe it is worthwhile defining a second search function nn that does this?

Interpret 1-dimensional array as 1 point

I have code that makes 3xN arrays as input to BallTree. However, when there is only one point, it produces a 3-member 1-dimensional array.

When this 1-dimensional array is input to the Tree constructors, I get the error MethodError: no method matching KDTree(::Array{Float64,1}).

I believe the package would be more robust if the tree constructors accepted 1-dimensional arrays and interpreted them as a single point.

Workaround: Use reshape to enforce 2-dimensionality

Add support for Julia 0.7

Opening this mainly for tracking purposes, so I know when I can update my packages.

using NearestNeighbors
data = rand(3, 10^4)

julia> kdtree = KDTree(data; leafsize = 10)
┌ Warning: `Array{T, 2}(m::Int, n::Int) where T` is deprecated, use `Array{T, 2}(undef, m, n)` instead.
│   caller = Type at <missing>:0 [inlined]
└ @ Core <missing>:0
┌ Warning: `reinterpret(::Type{T}, a::Array{S}, dims::NTuple{N, Int}) where {T, S, N}` is deprecated, use `reshape(reinterpret(T, vec(a)), dims)` instead.
│   caller = #KDTree#13(::Int64, ::Bool, ::Bool, ::Array{Float64,2}, ::Symbol, ::Type, ::Array{Float64,2}, ::Euclidean) at kd_tree.jl:79
└ @ NearestNeighbors kd_tree.jl:79
ERROR: MethodError: no method matching KDTree(::Base.ReinterpretArray{StaticArrays.SArray{Tuple{3},Float64,1,3},1,Float64,Array{Float64,1}}, ::Euclidean; leafsize=10, storedata=true, reorder=true, reorderbuffer=StaticArrays.SArray{Tuple{3},Float64,1,3}[], indicesfor=:data)
Closest candidates are:
  KDTree(::Array{V<:AbstractArray,1}, ::M<:Union{Chebyshev, Cityblock, Euclidean, Minkowski}; leafsize, storedata, reorder, reorderbuffer, indicesfor) where {V<:AbstractArray, M<:Union{Chebyshev, Cityblock, Euclidean, Minkowski}} at C:\Users\datseris\.julia\packages\NearestNeighbors\nEb3\src\kd_tree.jl:34
  KDTree(::Array{T<:AbstractFloat,2}, ::M<:Union{Chebyshev, Cityblock, Euclidean, Minkowski}; leafsize, storedata, reorder, reorderbuffer, indicesfor) where {T<:AbstractFloat, M<:Union{Chebyshev, Cityblock, Euclidean, Minkowski}} at C:\Users\datseris\.julia\packages\NearestNeighbors\nEb3\src\kd_tree.jl:77
Stacktrace:
 [1] #KDTree#13(::Int64, ::Bool, ::Bool, ::Array{Float64,2}, ::Symbol, ::Type, ::Array{Float64,2}, ::Euclidean) at C:\Users\datseris\.julia\packages\NearestNeighbors\nEb3\src\kd_tree.jl:85
 [2] Type at .\<missing>:0 [inlined] (repeats 2 times)
 [3] top-level scope

Modify `inrange` to accept hyper-rectangle?

As I understand, inrange currently accepts a radius r which is used to determine if points exist within the hypersphere with radius r. I need to find the points existing within the hyperrectangle ϵ, a vector of the lengths of each dimension of the hyperrectangle.

I wanted to get your thoughts (and any pointers on how to go about implementing this) before I wasted any time.
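
One workaround that needs no package changes (a sketch; it assumes rescaling the data is acceptable): dividing each coordinate by its box half-length turns the hyperrectangle query into a Chebyshev ball of radius 1.

using NearestNeighbors, Distances

data = rand(3, 10^4)
ϵ = [0.05, 0.10, 0.02]                 # half-lengths of the box per dimension
tree = KDTree(data ./ ϵ, Chebyshev())  # Chebyshev works with KDTree
point = rand(3)
idxs = inrange(tree, point ./ ϵ, 1.0)  # selects |x[i] - p[i]| <= ϵ[i] in every dimension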

In-place version of knn

Can I submit a PR to add a knn! function that fills the first arguments instead of allocating and returning? I have to call knn 10^6 times in my application; that would be a huge performance boost.

Also, do you have a better API than the following?

knn!(inds, dists, ...)

Maybe define a

struct KNNResult{T<:Number}
  inds::Vector{Int}
  dists::Vector{T}
end

and pass that instead of passing two arguments to fill?

knn!(result, ...) # in-place
r = knn(...) # ask users to unpack manually: r.inds, r.dists

RFH: 0.5-style comprehensions broke the way you're using testsets

You might be using @testset for in a more complicated way than most other people? I think there's a base Julia bug here, but I could use a hand in reducing the test case from here. See http://pkg.julialang.org/detail/NearestNeighbors.html

Test Summary: | Pass  Total
  inrange     |   24     24
ERROR: LoadError: LoadError: syntax: break or continue outside loop
 in include_from_node1(::String) at ./loading.jl:426 (repeats 2 times)
 in process_options(::Base.JLOptions) at ./client.jl:266
 in _start() at ./client.jl:322
while loading /home/vagrant/.julia/v0.5/NearestNeighbors/test/test_monkey.jl, in expression starting on line 6
while loading /home/vagrant/.julia/v0.5/NearestNeighbors/test/runtests.jl, in expression starting on line 34

I bisected the failure to Julia commit JuliaLang/julia@cd35df536bd, and I'm pretty sure this is an unintended consequence. Base apparently needs more tests of nested testsets with break/continue in them.

using KDTree as a type?

I've just seen this error for the first time. I used (some time ago...) KDTree in a type definition:

type pgraph
    x_screen::Array{Float64,1}
    y_screen::Array{Float64,1}
    x_last::Real
    y_last::Real
    kdtree::NearestNeighbors.KDTree{Float64,Distances.Euclidean}
end

which is later on instantiated with

p1 = pgraph(x, y, 0.0, 0.0, KDTree(zeros(2,2)))

and p1.kdtree is then given a value when applicable.

How to do that now?
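
One way to do this on current Julia (a sketch; PGraph is a hypothetical rename of the struct above): parameterize the struct on the tree type instead of spelling out KDTree's type parameters, which have changed between versions.

using NearestNeighbors

struct PGraph{T<:NearestNeighbors.NNTree}
    x_screen::Vector{Float64}
    y_screen::Vector{Float64}
    x_last::Float64
    y_last::Float64
    kdtree::T
end

p1 = PGraph(rand(10), rand(10), 0.0, 0.0, KDTree(zeros(2, 2)))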

Type stability in tree constructors

Given some data X, I want to create a tree (KDTree or BallTree):

X = [1. 2; 3 4]
@code_warntype KDTree(X)

Body::KDTree{_1,Euclidean,_2} where _2 where _1
75 1 ─ %1 = %new(Distances.Euclidean, 0.0)::Euclidean           │╻╷ Type
   │   %2 = $(Expr(:foreigncall, :(:jl_alloc_array_2d), Array{Float64,2}, svec(Any, Int64, Int64), :(:ccall), 3, Array{Float64,2}, 0, 0, 0, 0))::Array{Float64,2}
   │   %3 = invoke NearestNeighbors.:(#KDTree#17)(10::Int64, true::Bool, true::Bool, %2::Array{Float64,2}, _1::Type, _2::Array{Float64,2}, %1::Euclidean)::KDTree{_1,Euclidean,_2} where _2 where _1
   └──      return %3
@code_warntype BallTree(X)

Body::BallTree{_1,_2,_3,Euclidean} where _3 where _2 where _1
82 1 ─ %1 = %new(Distances.Euclidean, 0.0)::Euclidean           │╻╷ Type
   │   %2 = $(Expr(:foreigncall, :(:jl_alloc_array_2d), Array{Float64,2}, svec(Any, Int64, Int64), :(:ccall), 3, Array{Float64,2}, 0, 0, 0, 0))::Array{Float64,2}
   │   %3 = invoke NearestNeighbors.:(#BallTree#19)(10::Int64, true::Bool, true::Bool, %2::Array{Float64,2}, _1::Type, _2::Array{Float64,2}, %1::Euclidean)::BallTree{_1,_2,_3,Euclidean} where _3 where _2 where _1
   └──      return %3

Omitting results to points themselves

I would like to get the k nearest points that are not the query points themselves when using the same data for building the tree and querying it. Is it possible to do this with the skip function? The skip function seems to take integer arguments corresponding to tree indices, but it is unclear to me what this means.
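
For what it's worth, a minimal sketch of the skip semantics: the predicate receives the index of each candidate point in the data that built the tree, and returning true excludes that candidate.

using NearestNeighbors, StaticArrays
data = [rand(SVector{3}) for i in 1:1000]
tree = KDTree(data)
j = 7
# Nearest neighbor of data[j] that is not data[j] itself:
idxs, dists = knn(tree, data[j], 1, true, i -> i == j)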

Inf points causes corruption with BallTree

To implement an exclusive nearest neighbor algorithm, I 'removed' points from my points array by setting their values to Inf and rebuilding the BallTree from the edited array. I did this to avoid copying the array each time and to preserve indexing.

However, the results output by knn are incorrect and unstable. Once the tree is created, it will always produce the same wrong output. However, regenerating the tree on the same data and trying knn again can produce different wrong output. The indexes are often some absurdly large integer such as 874275200 with an Inf distance. This is despite the fact that there are definitely points with non-infinite values in the tree that should be chosen as nearest neighbors.

The same dataset produces correct and consistent results when using BruteTree and KDTree.

I'll see if I can provide a reproduction. The issue didn't start to occur until ~half the points were Inf.

Performance benchmarking

As discussed in JuliaGeometry/KDTrees.jl#20, nearest neighbor performance can depend rather strongly on the data distribution. For real data, there's various "interesting" cases which occur quite commonly, including at least:

  • Query points can systematically be outliers relative to the data in the tree. This reduces the efficiency of early bailout during tree traversal.
  • Data in the tree can have outliers. This can affect the current tree building heuristic by forcing a lot of long skinny rectangles.
  • Data in the tree can reside on subspaces of the higher dimensional space. It would be interesting to see how this affects query efficiency relative to the same data in lower dimensions.

Is there a plan/preference for where benchmark code should go, or what form it should take? (#3 obviously relevant).

Creating tree on views

I have a (large) 3-dimensional tensor and want to create a tree using SubArrays:

3×3×3 Array{Float64,3}:
[:, :, 1] =
 0.576401   0.152728  0.496296
 0.0284128  0.994683  0.640066
 0.0910659  0.257534  0.939821

[:, :, 2] =
 0.758803   0.0218801  0.192174
 0.591324   0.019357   0.802315
 0.0582035  0.895237   0.474521

[:, :, 3] =
 0.437771   0.333887  0.920422
 0.0445638  0.429712  0.19376 
 0.245626   0.243987  0.065141

But attempting to create a BallTree results in the below error.

julia> NearestNeighbors.BallTree([view(toy,:,2:3,i) for i in 1:size(toy,3)],NearestNeighbors.Euclidean())
ERROR: MethodError: no method matching length(::Type{SubArray{Float64,2,Array{Float64,3},Tuple{Base.Slice{Base.OneTo{Int64}},UnitRange{Int64},Int64},true}})
Closest candidates are:
  length(::Core.SimpleVector) at essentials.jl:561
  length(::Base.MethodList) at reflection.jl:801
  length(::Core.MethodTable) at reflection.jl:875
  ...
Stacktrace:
 [1] NearestNeighbors.TreeData(::Array{SubArray{Float64,2,Array{Float64,3},Tuple{Base.Slice{Base.OneTo{Int64}},UnitRange{Int64},Int64},true},1}, ::Int64) at [...]/src/tree_data.jl:14
 [2] #BallTree#18(::Int64, ::Bool, ::Bool, ::Array{SubArray{Float64,2,Array{Float64,3},Tuple{Base.Slice{Base.OneTo{Int64}},UnitRange{Int64},Int64},true},1}, ::Type, ::Array{SubArray{Float64,2,Array{Float64,3},Tuple{Base.Slice{Base.OneTo{Int64}},UnitRange{Int64},Int64},true},1}, ::Distances.Euclidean) at [...]/src/ball_tree.jl:39
 [3] NearestNeighbors.BallTree(::Array{SubArray{Float64,2,Array{Float64,3},Tuple{Base.Slice{Base.OneTo{Int64}},UnitRange{Int64},Int64},true},1}, ::Distances.Euclidean) at [...]/src/ball_tree.jl:37
 [4] top-level scope at none:0

While length is defined on SubArray it doesn't seem to be available for ::Type{SubArray[...]} and I'm not sure how and when this creeps up.
I would think creating trees on views would be desirable.

Autodiff support

Would it make sense to be able to autodiff through knns?

I tried using ForwardDiff

using NearestNeighbors
using ForwardDiff

data = rand(3, 10^3)
kdtree = KDTree(data)

ForwardDiff.gradient(x->sum(knn(kdtree, x, 4)[2]), [0.,0.,0.])

but due to the definition of distances

dist = Vector{get_T(eltype(V))}(undef, k)

and the fact that a dual-number distance is assigned to elements of that vector

dist_d = evaluate(tree.metric, tree.data[idx], point, do_end)
...
best_dists[1] = dist_d

I get the following error with the code above:

julia> ForwardDiff.gradient(x->sum(knn(kdtree, x, 4)[2]), [0.,0.,0.])
ERROR: MethodError: no method matching Float64(::ForwardDiff.Dual{ForwardDiff.Tag{var"#3#4",Float64},Float64,3})
Closest candidates are:
  Float64(::Real, ::RoundingMode) where T<:AbstractFloat at rounding.jl:200
  Float64(::T) where T<:Number at boot.jl:718
  Float64(::Int8) at float.jl:60
  ...
Stacktrace:
 [1] convert(::Type{Float64}, ::ForwardDiff.Dual{ForwardDiff.Tag{var"#3#4",Float64},Float64,3}) at .\number.jl:7
 [2] setindex!(::Array{Float64,1}, ::ForwardDiff.Dual{ForwardDiff.Tag{var"#3#4",Float64},Float64,3}, ::Int64) at .\array.jl:782
 [3] add_points_knn! at C:\Users\sebastian\.julia\packages\NearestNeighbors\pb8hw\src\tree_ops.jl:104 [inlined]
 [4] knn_kernel!(::KDTree{StaticArrays.SArray{Tuple{3},Float64,1,3},Euclidean,Float64}, ::Int64, ::Array{ForwardDiff.Dual{ForwardDiff.Tag{var"#3#4",Float64},Float64,3},1}, ::Array{Int64,1}, ::Array{Float64,1}, ::ForwardDiff.Dual{ForwardDiff.Tag{var"#3#4",Float64},Float64,3}, ::typeof(NearestNeighbors.always_false)) at C:\Users\sebastian\.julia\packages\NearestNeighbors\pb8hw\src\kd_tree.jl:174
 [5] knn_kernel!(::KDTree{StaticArrays.SArray{Tuple{3},Float64,1,3},Euclidean,Float64}, ::Int64, ::Array{ForwardDiff.Dual{ForwardDiff.Tag{var"#3#4",Float64},Float64,3},1}, ::Array{Int64,1}, ::Array{Float64,1}, ::ForwardDiff.Dual{ForwardDiff.Tag{var"#3#4",Float64},Float64,3}, ::typeof(NearestNeighbors.always_false)) at C:\Users\sebastian\.julia\packages\NearestNeighbors\pb8hw\src\kd_tree.jl:196 (repeats 7 times)
 [6] _knn(::KDTree{StaticArrays.SArray{Tuple{3},Float64,1,3},Euclidean,Float64}, ::Array{ForwardDiff.Dual{ForwardDiff.Tag{var"#3#4",Float64},Float64,3},1}, ::Array{Int64,1}, ::Array{Float64,1}, ::Function) at C:\Users\sebastian\.julia\packages\NearestNeighbors\pb8hw\src\kd_tree.jl:158
 [7] knn_point!(::KDTree{StaticArrays.SArray{Tuple{3},Float64,1,3},Euclidean,Float64}, ::Array{ForwardDiff.Dual{ForwardDiff.Tag{var"#3#4",Float64},Float64,3},1}, ::Bool, ::Array{Float64,1}, ::Array{Int64,1}, ::Function) at C:\Users\sebastian\.julia\packages\NearestNeighbors\pb8hw\src\knn.jl:31
 [8] knn at C:\Users\sebastian\.julia\packages\NearestNeighbors\pb8hw\src\knn.jl:44 [inlined]
 [9] knn(::KDTree{StaticArrays.SArray{Tuple{3},Float64,1,3},Euclidean,Float64}, ::Array{ForwardDiff.Dual{ForwardDiff.Tag{var"#3#4",Float64},Float64,3},1}, ::Int64) at C:\Users\sebastian\.julia\packages\NearestNeighbors\pb8hw\src\knn.jl:41
 [10] (::var"#3#4")(::Array{ForwardDiff.Dual{ForwardDiff.Tag{var"#3#4",Float64},Float64,3},1}) at .\REPL[7]:1
 [11] vector_mode_gradient(::var"#3#4", ::Array{Float64,1}, ::ForwardDiff.GradientConfig{ForwardDiff.Tag{var"#3#4",Float64},Float64,3,Array{ForwardDiff.Dual{ForwardDiff.Tag{var"#3#4",Float64},Float64,3},1}}) at C:\Users\sebastian\.julia\packages\ForwardDiff\Asf4O\src\apiutils.jl:37
 [12] gradient(::Function, ::Array{Float64,1}, ::ForwardDiff.GradientConfig{ForwardDiff.Tag{var"#3#4",Float64},Float64,3,Array{ForwardDiff.Dual{ForwardDiff.Tag{var"#3#4",Float64},Float64,3},1}}, ::Val{true}) at C:\Users\sebastian\.julia\packages\ForwardDiff\Asf4O\src\gradient.jl:17
 [13] gradient(::Function, ::Array{Float64,1}, ::ForwardDiff.GradientConfig{ForwardDiff.Tag{var"#3#4",Float64},Float64,3,Array{ForwardDiff.Dual{ForwardDiff.Tag{var"#3#4",Float64},Float64,3},1}}) at C:\Users\sebastian\.julia\packages\ForwardDiff\Asf4O\src\gradient.jl:15 (repeats 2 times)
 [14] top-level scope at REPL[7]:1

Weird behaviour with Periodic Euclidean metric

I'm somewhat confused by the following (non-working) example:

using Distances,NearestNeighbors
pred = Distances.PeriodicEuclidean([Inf,2.5])
l = [0.0 0.0; 0.0 2.5]
S = NearestNeighbors.BallTree(l,pred)
println(inrange(S,[0.0,0.0],1e-2,true))
println(pred(l[:,1],l[:,2]))

This code prints [1] and then 0.0. I'm surprised because I would have expected the first println() to print [1,2] instead. As far as I remember, this kind of code worked fine before; I'm unfortunately unable to isolate a commit that changed things.

Use Pairs for inrange?

Right now there are two return values from inrange: the indices of the points and the distances. Maybe it would be more convenient to return just one vector of eltype Pair{Int, T}, with the index and the distance both included in the pair.

error with points as vector of points?

Great package! So fast! 😃 Anyway, I thought this MWE (based on the readme example except for using points instead of point) would work (since "points can also be a vector of other vectors where each element in the outer vector is considered a point.") but it throws an error:

julia> using NearestNeighbors

julia> data = rand(3, 10^4);

julia> k = 3 ;

julia> kdtree = KDTree(data) ;

julia> points = [rand(3) for i in 1:10]
10-element Array{Array{Float64,1},1}:
 [0.68304, 0.983004, 0.220862]
 [0.626807, 0.61079, 0.968843]
 [0.192098, 0.157959, 0.300556]
 [0.300309, 0.502549, 0.653381]
 [0.160819, 0.653002, 0.9979]
 [0.809764, 0.0521815, 0.0290807]
 [0.20466, 0.592894, 0.656526]
 [0.370667, 0.0579244, 0.930079]
 [0.593999, 0.934687, 0.555133]
 [0.60504, 0.78062, 0.243998]

julia> idxs, dists = knn(kdtree, points, k, true)
ERROR: MethodError: no method matching length(::Type{Array{Float64,1}})
Closest candidates are:
  length(::Core.SimpleVector) at essentials.jl:561
  length(::Base.MethodList) at reflection.jl:801
  length(::Core.MethodTable) at reflection.jl:875
  ...
Stacktrace:
 [1] check_input(::KDTree{StaticArrays.SArray{Tuple{3},Float64,1,3},Euclidean,Float64}, ::Array{Array{Float64,1},1}) at /Users/benoitpasquier/.julia/packages/NearestNeighbors/N7lgR/src/NearestNeighbors.jl:29
 [2] knn(::KDTree{StaticArrays.SArray{Tuple{3},Float64,1,3},Euclidean,Float64}, ::Array{Array{Float64,1},1}, ::Int64, ::Bool, ::Function) at /Users/benoitpasquier/.julia/packages/NearestNeighbors/N7lgR/src/knn.jl:16 (repeats 2 times)
 [3] top-level scope at none:0

Maybe I'm doing this wrong?
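
A workaround sketch (assumes StaticArrays): convert the points so their length is part of the type, as the data layout section of the README requires:

using StaticArrays
spoints = [SVector{3}(p) for p in points]
idxs, dists = knn(kdtree, spoints, k, true)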

Allow different types for the index storage

Using NearestNeighbors.jl, I found that I often need just the positions of the points and am not interested in the actual indices, and/or I'm okay with reordering my whole dataset to suit the KDTree better (but the dataset itself being too large for me to be comfortable with the KDTree storing a copy of it).

I can get that by building a tree, reordering my data and assembling a new tree with the reordered data and indices collect(eachindex(xs)):

tree = KDTree(xs, storedata = false)
xs .= xs[tree.indices]
tree1 = KDTree(xs, tree.hyper_rec, collect(eachindex(xs)), tree.metric, tree.nodes, tree.tree_data, true)

That however still leaves me with a quite large (and kind of useless) array of indices.

So I altered the KDTree to accept other types for indices:

struct KDTree{V <: AbstractVector,M <: MinkowskiMetric,T,TIV} <: NNTree{V,M}
    data::Vector{V}
    hyper_rec::HyperRectangle{T}
    indices::TIV
    metric::M
    nodes::Vector{KDNode{T}}
    tree_data::TreeData
    reordered::Bool
end

For one, this allows me to replace the Vector{Int} with a UnitRange{Int}, which drastically cuts memory use.

Furthermore, I can use something like

struct IdentityArray end
Base.getindex(::IdentityArray, index) = index

tree3 = KDTree(xs, tree.hyper_rec, IdentityArray(), tree.metric, tree.nodes, tree.tree_data, true)

to get a good bit of extra performance as well (tested with 100M Vec3f0's):

@btime inrange(tree1, p, r) # ~ 38ms, tree with `Vector{Int}` for indices
@btime inrange(tree2, p, r) # ~ 34ms, tree with `eachindex(xs)` for indices
@btime inrange(tree3, p, r) # ~ 32ms, tree with `IdentityArray()` for indices

If desired, I can cook up a proper PR for this; I just wanted to test the waters on how much interest there is in a change like this.

BruteTree SemiMetric support

BruteTrees are a simple structure that can support ::Distances.SemiMetric. Is there any reason these are not allowed? I'm obviously referring to the cosine distance, which is a very popular choice in many applications.

A minimal example would be:

bt = BruteTree(rand(2,10), CosineDist())  # currently fails
idxs, dists = knn(bt, rand(2))

A (quickly found) reference regarding kd-trees.

Implement threading.

I started implementing some threading support using the threading branch in Julia master. The gist linked below implements threaded kd-tree creation. The code is not polished.

https://gist.github.com/KristofferC/15beed394361a325ff33

The speedup is about 2x using 4 cores.

include("thread_tree.jl")
data = rand(5, 3*10^6)
@time KDTree(data);
#  1.009976 seconds (295.92 k allocations: 36.561 MB)
@time NearestNeighbors.KDTree(data; reorder=false);
#  1.963014 seconds (300.02 k allocations: 36.622 MB, 0.42% gc time)

The technique I use is the following:
Run a single-threaded kd-tree builder until as many subtrees as threads have been created (currently only a power-of-2 number of threads is supported). Then use the data from the single-threaded creator and run a completely normal kd-tree builder in parallel on each subtree.

It would be cool to see how this scales with even more cores.

The next step is to look into threading the knn and range searches, as well as seeing if this could be implemented in master and cleanly fall back to single-threaded when threading is not enabled.

Not working with Adjoint type?

Imagine you have a dataset X; then it is natural to write something like:

KDTree(Matrix(X)')

This does not work right now. Instead, one has to write Matrix twice:

KDTree(Matrix(Matrix(X)'))

Can it be fixed?
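
A workaround sketch until Adjoint is accepted directly: materialize the transpose into a plain Matrix in one step, e.g. with permutedims:

using NearestNeighbors
X = rand(100, 3)                 # observations in rows
kdtree = KDTree(permutedims(X))  # a plain 3×100 Matrix{Float64}, single copy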

Approximate Nearest Neighbors algorithms

Hello!

I am guessing @KristofferC already knows this, but there is a "new" approach to the kNN problem called "approximate nearest neighbors". For many applications these approaches are just as good as the "exact" methods. JuliaDynamics is very interested in this!

From the discussions that led to #69, I was told that the "state of the art"/"fastest" algorithm (faster than the one proposed in #69) is the one described here: https://arxiv.org/abs/1603.09320, which uses "Hierarchical Navigable Small World" graphs (HNSW). So if I wanted to implement a faster kNN approach, I should do this one instead of the one in #69.

In addition, it is very interesting to share these benchmarks: https://github.com/erikbern/ann-benchmarks which are in Python but cover a lot of different algorithms. This gives a nice perspective.

What is the consensus on having the hnsw method in this repo? Do we agree on that (because of the "approximate")? I think it fits, but a new repo in JuliaDynamics can also be made if putting it here is a problem.

By the way, there is already some code for this method: (C++, header only https://github.com/nmslib/hnsw )

@andyferris I am tagging you as well because you were interested in the discussion in #69. @JonasIsensee was also mentioned in the proposal of this method.


p.s.: I intend to use some funding I got to get someone to contribute a faster kNN method.

Should there exist an empty KDTree?

On Julia 0.4 I found it useful to define an empty KDTree with 0 points (i.e. tree = KDTree(rand(3,0))).
Is there currently a way to do this with the new changes? Is this the right thing to be doing?

One-off EXCEPTION_ILLEGAL_INSTRUCTION in ball_tree.jl

This only happened once. It was in a long-running Julia session but the only things I was defining into Main were unrelated to your code. To be clear, I was using my fork on the #1nn branch, but the error didn't originate from the part I touched.

Possibly related: When I restarted Julia, the main thread couldn't communicate with the workers it spawned. When I restarted Julia again though, all seemed fine.

Any ideas?

Unreachable reached at 0000000042B5E2A9

Please submit a bug report with steps to reproduce this fault, and any error messages that follow (in their entirety). Thanks.
Exception: EXCEPTION_ILLEGAL_INSTRUCTION at 0x42b5e2a9 -- #BallTree#19 at C:\Users\nicho\.julia\packages\NearestNeighbors\9UciS\src\ball_tree.jl:90
in expression starting at C:\Users\nicho\SMLMAssociationAnalysis_NCB.jl\src\paper\run_optimize.jl:29
#BallTree#19 at C:\Users\nicho\.julia\packages\NearestNeighbors\9UciS\src\ball_tree.jl:90
Type at C:\Users\nicho\.julia\packages\NearestNeighbors\9UciS\src\ball_tree.jl:82
Type at C:\Users\nicho\.julia\packages\NearestNeighbors\9UciS\src\ball_tree.jl:82
findneighbors at C:\Users\nicho\SMLMAssociationAnalysis_NCB.jl\src\analysis\grouping.jl:79
findtemporalneighbors at C:\Users\nicho\SMLMAssociationAnalysis_NCB.jl\src\analysis\grouping.jl:56
groupby_localmax_temporallimit at C:\Users\nicho\SMLMAssociationAnalysis_NCB.jl\src\analysis\grouping.jl:18 [inlined]
getmolecules at C:\Users\nicho\SMLMAssociationAnalysis_NCB.jl\src\analysis\process.jl:28 [inlined]
_broadcast_getindex_evalf at .\broadcast.jl:625 [inlined]
_broadcast_getindex at .\broadcast.jl:598 [inlined]
getindex at .\broadcast.jl:558 [inlined]
macro expansion at .\broadcast.jl:888 [inlined]
macro expansion at .\simdloop.jl:77 [inlined]
copyto! at .\broadcast.jl:887 [inlined]
copyto! at .\broadcast.jl:842 [inlined]
copy at .\broadcast.jl:818 [inlined]
materialize at .\broadcast.jl:798
jl_apply_generic at /home/Administrator/buildbot/worker/package_win64/build/src/home/Administrator/buildbot/worker/package_win64/build/src\gf.c:2197
top-level scope at C:\Users\nicho\SMLMAssociationAnalysis_NCB.jl\src\paper\run_optimize.jl:60
jl_toplevel_eval_flex at /home/Administrator/buildbot/worker/package_win64/build/src/home/Administrator/buildbot/worker/package_win64/build/src\toplevel.c:809
jl_parse_eval_all at /home/Administrator/buildbot/worker/package_win64/build/src/home/Administrator/buildbot/worker/package_win64/build/src\ast.c:873
include_string at .\loading.jl:1064
#178 at C:\Users\nicho\.julia\packages\Atom\4EgxD\src\eval.jl:100

Running on GPU using CuArray

I'm new to Julia, and I'm trying it out because it seems to be the simplest way to program GPUs right now using CuArray. I need a fast kd-tree algorithm that runs on a GPU, among other things, and I found NearestNeighbors.jl. How much work would be involved in getting this code to work with CuArray instances rather than its current CPU-bound data structures?

Periodic Metric

I am thinking of using this package to implement a neighbor list for molecular simulation. But it would need to support a periodic distance, or in other words the Euclidean distance on a torus. I would of course implement this myself, but I thought I'd ask first whether this is even feasible, or is there a fundamental challenge that goes beyond just implementing a new metric?
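
For reference, Distances.jl ships PeriodicEuclidean for exactly this, so a sketch of a periodic neighbor list could look as follows (BruteTree is used since the PeriodicEuclidean issue above reports odd BallTree behaviour):

using NearestNeighbors, Distances

box = [1.0, 1.0, 1.0]                     # periodic lengths of the torus
tree = BruteTree(rand(3, 10^4), PeriodicEuclidean(box))
neighbors = inrange(tree, rand(3), 0.05)  # distances wrap around the boundaries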

Add skip predicate to `inrange`

Is this not possible to do, which is why it does not exist?
If so, sorry for the redundant issue!

Since it is common that inrange is used for points that also belong to the dataset that composes the tree, a skip predicate could be useful, but also more generally.

For now I use

point = dataset[n]
idxs = inrange(tree, point, ϵ)
deleteat!(idxs, findin(idxs, n))
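
On newer Julia versions, where findin no longer exists, the same workaround can be written as:

point = dataset[n]
idxs = inrange(tree, point, ϵ)
filter!(!isequal(n), idxs)  # drop the query point's own index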

User-defined metrics

Thanks for a great package.

I was able to do a knn search on a BallTree using my own distance by simply doing this:

import NearestNeighbors: Metric, evaluate, euclidean

type MyMetric <: Metric end

evaluate(d::MyMetric,a, b)=min(1.0,euclidean(a,b))
evaluate(d::MyMetric,a::AbstractMatrix, b::AbstractArray,col::Int, do_end::Bool=true)=
    evaluate(d, slice(a, :, col), b)

This is already awesome! What would be even better is if an official API was provided for doing this. This would guarantee that a quick hack like the above will not break at the next update and also that proper checks and optimizations are enforced.

Support for Neighbor Pairs Searching

The scipy KDTree package provides a method query_pairs (in this link) which returns all the pairs within a distance for the given tree. This package seems only to provide nearest neighbor searching for a given point by the inrange function, rather than for pairs.

Such a feature could also be implemented in this Julia package and should be helpful for users.
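
Until such a feature exists, a naive sketch of query_pairs built on inrange (each pair is reported once; fine for moderate result sizes):

using NearestNeighbors

function allpairs(tree, data, r)
    pairs = Tuple{Int,Int}[]
    for j in 1:size(data, 2)
        for i in inrange(tree, data[:, j], r)
            i < j && push!(pairs, (i, j))
        end
    end
    return pairs
end

data = rand(3, 10^3)
pairs = allpairs(KDTree(data), data, 0.05)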

More generic data

(In my understanding) the package currently assumes the n sample points constituting the data are given as a (d,n) array, where each sample point is a d-dimensional vector of some sort. It would be great if we could have access to a more generic data representation (n widgets of WidgetType). This would work particularly well with #23 and would have applications in tons of contexts. E.g. you have sentences and you define distance between sentences, US Senators and you define ideological distances between senators, pictures of cats, Julia packages, etc.

Dynamic trees

This package has excellent performance for queries on a static set of points. For streaming data, with points added and earliest points removed to limit storage size, can similar performance be achieved through an extension of the implementation?

Procopiuc, Agarwal, Arge, and Vitter, "Bkd-tree: A dynamic scalable kd-tree" (2003), seem to observe that this is possible, by avoiding memory I/O problems of earlier methods. Is there interest in testing their ideas with an implementation?

[Question] Using a skip function that requires data point index for multiple point knn

Hi there, I am using the syntax

knn(tree, points, k, true)

to get the NN of my points, which is a Vector{SVector}, because the docs indicate this is the most performant way to do it. points are all points that also exist inside the data that made the tree. I of course need to exclude these points from my found neighbors, which I get around by finding the k+1 neighbors and skipping the first (since I pass true for the sort argument).

I need to use a skip clause for neighbors, but unfortunately the skip clause does not depend on the points themselves, but on the indices.

Specifically, assume that points[1] has index n in the original data. When I am searching for neighbors of points[1], I want to be able to exclude neighbors that have indices n ± w, where w is a small integer, typically 1 or 2.

I know how to do this using the syntax knn(tree, point, k, true), i.e. doing one search at a time.

Do you have any idea on how to do this for multiple points?
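
One approach (a sketch; pointindex is a hypothetical vector mapping each query point to its index in the tree's data): fall back to per-point queries so the skip closure can capture the current index n.

w = 2
results = map(eachindex(points)) do m
    n = pointindex[m]  # hypothetical: the data index of points[m]
    knn(tree, points[m], k, true, i -> abs(i - n) <= w)
end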

Fixing `ReinterpretArray` breakage - first PR

The julialang PR #23750 merged last month changed the output of reinterpret into a lazy ReinterpretArray <: AbstractArray type. This has broken NearestNeighbors in a few places, where the output of reinterpret was expected to be of type Vector instead. In particular the example in the README.md

using NearestNeighbors
data = rand(3, 10^4)
kdtree = KDTree(data; leafsize = 10)

fails with a MethodError.

I have made a branch locally with a fix for this, but this being my very first PR I am posting here what I did in case I messed up.

Since ReinterpretArray obeys the AbstractArray interface, I just changed the signature of a number of functions expecting a ::Vector{T} into ::AbstractVector{T}. I also modified the BruteTree constructor to ensure conversion of generic AbstractArrays into a Vector. These changes make the README.md example work again, and they also make it possible to once more pass tests up to a point. Tests eventually fail in relation to changes in broadcast, I think, but those failures seem unrelated to ReinterpretArray. I will look into them later.

I would like some tips on how to proceed to make a PR. I assume I have to push my branch and then start the PR, but I have no permissions for this. What is the standard procedure?

Trees for non-Metrics?

For NLP it is common to want to use CosineDist,
which is a SemiMetric.

This is not going to be compatible with the BallTree, I think.

but it should be fine with the BruteTree.
