Coder Social home page Coder Social logo

bd tree about rann HOT 16 OPEN

jefferislab avatar jefferislab commented on July 25, 2024
bd tree

from rann.

Comments (16)

jefferis avatar jefferis commented on July 25, 2024 1

So I tried a little debugging and it seems that the problem is in the recursive function building the bd tree i.e. rbd_tree:

library(RANN)
nn2(query=matrix(c(1,1),ncol=2), data=matrix(rep(2, 4),ncol=2), k=1L, treetype="bd")
[... repeats many times]
    frame #52356: 0x00000001023e877c RANN.so`rbd_tree(pa=<unavailable>, pidx=<unavailable>, n=<unavailable>, dim=2, bsp=1, bnd_box=0x00007fff5fbfd030, splitter=<unavailable>, shrink=<unavailable>)(double**, int*, ANNorthRect const&, int, int, int&, double&, int&), ANNshrinkRule) at bd_tree.cpp:399 [opt]
    frame #52357: 0x00000001023e877c RANN.so`rbd_tree(pa=<unavailable>, pidx=<unavailable>, n=<unavailable>, dim=2, bsp=1, bnd_box=0x00007fff5fbfd0d0, splitter=<unavailable>, shrink=<unavailable>)(double**, int*, ANNorthRect const&, int, int, int&, double&, int&), ANNshrinkRule) at bd_tree.cpp:399 [opt]
    frame #52358: 0x00000001023e877c RANN.so`rbd_tree(pa=<unavailable>, pidx=<unavailable>, n=<unavailable>, dim=2, bsp=1, bnd_box=0x00007fff5fbfd170, splitter=<unavailable>, shrink=<unavailable>)(double**, int*, ANNorthRect const&, int, int, int&, double&, int&), ANNshrinkRule) at bd_tree.cpp:399 [opt]
    frame #52359: 0x00000001023e877c RANN.so`rbd_tree(pa=<unavailable>, pidx=<unavailable>, n=<unavailable>, dim=2, bsp=1, bnd_box=0x00007fff5fbfd210, splitter=<unavailable>, shrink=<unavailable>)(double**, int*, ANNorthRect const&, int, int, int&, double&, int&), ANNshrinkRule) at bd_tree.cpp:399 [opt]
    frame #52360: 0x00000001023e877c RANN.so`rbd_tree(pa=<unavailable>, pidx=<unavailable>, n=<unavailable>, dim=2, bsp=1, bnd_box=0x00007fff5fbfd2b0, splitter=<unavailable>, shrink=<unavailable>)(double**, int*, ANNorthRect const&, int, int, int&, double&, int&), ANNshrinkRule) at bd_tree.cpp:399 [opt]
    frame #52361: 0x00000001023e877c RANN.so`rbd_tree(pa=<unavailable>, pidx=<unavailable>, n=<unavailable>, dim=2, bsp=1, bnd_box=0x00007fff5fbfd350, splitter=<unavailable>, shrink=<unavailable>)(double**, int*, ANNorthRect const&, int, int, int&, double&, int&), ANNshrinkRule) at bd_tree.cpp:399 [opt]
    frame #52362: 0x00000001023e877c RANN.so`rbd_tree(pa=<unavailable>, pidx=<unavailable>, n=<unavailable>, dim=2, bsp=1, bnd_box=0x00007fff5fbfd3d0, splitter=<unavailable>, shrink=<unavailable>)(double**, int*, ANNorthRect const&, int, int, int&, double&, int&), ANNshrinkRule) at bd_tree.cpp:399 [opt]
    frame #52363: 0x00000001023e8548 RANN.so`ANNbd_tree::ANNbd_tree(this=0x00000001025426c0, pa=0x0000000102546a40, n=2, dd=2, bs=1, split=ANN_KD_SUGGEST, shrink=<unavailable>) at bd_tree.cpp:144 [opt]
    frame #52364: 0x00000001023e7917 RANN.so`::get_NN_2Set(data=<unavailable>, query=0x00000001034cd278, D=<unavailable>, ND=<unavailable>, NQ=<unavailable>, K=<unavailable>, EPS=0x0000000103534dc8, SEARCHTYPE=0x0000000103534d90, USEBDTREE=0x0000000103535110, SQRAD=0x00000001035350d8, nn_index=<unavailable>, distances=<unavailable>) at NN.cc:48 [opt]
    frame #52365: 0x0000000100114fe4 libR.dylib`do_dotCode(call=0x0000000103776610, op=<unavailable>, args=<unavailable>, env=<unavailable>) at dotcode.c:0 [opt]
    frame #52366: 0x0000000100146ba8 libR.dylib`bcEval(body=<unavailable>, rho=0x0000000103537498, useCache=<unavailable>) at eval.c:6781 [opt]
    frame #52367: 0x00000001001417db libR.dylib`Rf_eval(e=<unavailable>, rho=<unavailable>) at eval.c:624 [opt]
    frame #52368: 0x000000010015414c libR.dylib`R_execClosure(call=0x00000001073fe3e8, newrho=0x0000000103537498, sysparent=<unavailable>, rho=<unavailable>, arglist=<unavailable>, op=<unavailable>) at eval.c:0 [opt]
    frame #52369: 0x0000000100141d0c libR.dylib`Rf_eval(e=0x00000001073fe618, rho=0x0000000101851b98) at eval.c:747 [opt]
    frame #52370: 0x0000000100184bda libR.dylib`Rf_ReplIteration(rho=0x0000000101851b98, savestack=<unavailable>, browselevel=0, state=0x00007fff5fbfe700) at main.c:258 [opt]
    frame #52371: 0x00000001001861bf libR.dylib`run_Rmainloop [inlined] R_ReplConsole(rho=<unavailable>, savestack=0, browselevel=0) at main.c:308 [opt]
    frame #52372: 0x0000000100186156 libR.dylib`run_Rmainloop at main.c:1082 [opt]
    frame #52373: 0x0000000100000f5b R`main + 27
    frame #52374: 0x00007fffd1e62235 libdyld.dylib`start + 1
    frame #52375: 0x00007fffd1e62235 libdyld.dylib`start + 1
(lldb)

from rann.

jefferis avatar jefferis commented on July 25, 2024

Thanks for the bug report. There seems to be an issue with the ANN library's bd method when there are duplicate points in the target matrix. It's not really something I can debug in a hurry and I have previously considers inactivating the bd method for this reason.

from rann.

gdkrmr avatar gdkrmr commented on July 25, 2024

you could simply test for duplicated points if the user chooses "bd"

from rann.

jefferis avatar jefferis commented on July 25, 2024

This is exactly what is done by another ANN wrapper:

https://github.com/cran/yaImpute/blob/master/R/ann.R#L17

but the time spent for the check makes using the bd tree pointless. So that is why I hesitated the last time this bug was reported. My feeling is that if the functionality is either unsafe or uneconomically slow it should just be eliminated.

The only other alternative I see is to add an argument like bd.check.dups=T to give people the option to trade safety for speed if they are convinced there are no duplicates.

from rann.

krlmlr avatar krlmlr commented on July 25, 2024

bd tree should really be able to handle this. Can we request a patch upstream, or do it ourselves?

from rann.

jefferis avatar jefferis commented on July 25, 2024

You're right of course @krlmlr. I'm on holiday right now so won't be doing this myself, but if you want to try, I think you could contact the senior author of ANN for input. Best, Greg.

Sent from my iPhone

On 31 Aug 2016, at 12:36, Kirill Müller [email protected] wrote:

bd tree should really be able to handle this. Can we request a patch upstream, or do it ourselves?


You are receiving this because you commented.
Reply to this email directly, view it on GitHub, or mute the thread.

from rann.

twolodzko avatar twolodzko commented on July 25, 2024

@jefferis I wanted to ask if there was any progress on this? The bug seems to persist in the CRAN version. At minimum, there should be a safeguard for the code not to give segfaults.

from rann.

Jean-Romain avatar Jean-Romain commented on July 25, 2024

Hi, In my case I get a segfault with bd.

from rann.

jefferis avatar jefferis commented on July 25, 2024

To reiterate, the bug is in the upstream library and I don't have time to chase it down. There must be a comparison logic error somewhere that is revealed with two identical points.

I can either remove bd functionality altogether or provide a check for identical points that can be turned off at the user's discretion. But as a reminder, checking for identical points is so expensive that it makes the bd tree pointless.

Votes?

from rann.

twolodzko avatar twolodzko commented on July 25, 2024

@jefferis then maybe: remove, document it in help and e-mail the author of the algorithm?

from rann.

Jean-Romain avatar Jean-Romain commented on July 25, 2024

Well a segfault is a big problem! I think you could document this bug. I don't know how bd can speed up the search compared to kd (indeed it crashed) but if the gain is big, document the bug, otherwise remove the feature.

from rann.

Jean-Romain avatar Jean-Romain commented on July 25, 2024

I made few benchmark with 300k 3D points

RANN::nn2(X, X, k = k, treetype = "kd", searchtype = "radius", radius = 2)
  • Check for duplicates 22 ms
  • k = 10
    • kd: 1.0 s
    • bd: 1.6 s
  • k = 60
    • kd: 1.6 s
    • bd: 2.2 s

Maybe I miss something. Maybe 300k is not big enough to take advantage of bd. In any case checking for duplicates is ways faster than the search. I made other tests and kdout performed bd every time.

from rann.

jefferis avatar jefferis commented on July 25, 2024

from rann.

Jean-Romain avatar Jean-Romain commented on July 25, 2024

I check dupes with the data.table tools.

  dupes  = duplicated(X, by = c("x", "y", "z"))

x being a data.table. Then I removed the dupes to use bd. I did not estimate the tree build time. I only estimated the total search time (build + search). The ultimate goal being to speed up the search I tried to estimated how much time is lost in finding duplicates against how much time is gained using bd instead of kd in a given situation. For the last one the result was not the one expected.

from rann.

MattMyint avatar MattMyint commented on July 25, 2024

Hi,

Just as some extra info, this issue pops up in another function that uses nn2, but it seems in that case, there don't seem to be any duplicate entries in that matrix.

These are the steps I took

> sample_expr1 <- read.table("~/sample_expr_matrix1.txt", header = TRUE, stringsAsFactors = FALSE)
> any(duplicated(sample_expr1))
[1] FALSE
> dim(sample_expr1)
[1] 7615   15
> dim(unique(sample_expr1))
[1] 7615   15

Perhaps I'm searching for duplicates the wrong way? Either way, we've switched to using kd

from rann.

Jean-Romain avatar Jean-Romain commented on July 25, 2024

You are search for duplicated over the 15 columns. I don't have a clue of what your data contains but I guess over the 15 columns there are no duplicates i.e not line with twice the same 15 numbers. The question is: is there duplicates over the 2 or 3 columns used as coordinates.

from rann.

Related Issues (17)

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.