Comments (7)
I may have spoken too soon; it is possible numba is magically doing all the AVX craziness for me, and my bottlenecks are elsewhere. In the meantime, if you are going to explore the alternative approach (instead of local join), you might want to look at https://arxiv.org/abs/1804.03032, which does essentially that. I can't vouch for the results quoted in the paper, but I do believe you can get something in a pretty decent ballpark with that sort of approach.
from pynndescent.
Have you got a link to where it's failing?
The latest round has pynndescent not looking good, while prior rounds had it doing reasonably well. I'm not sure about the dates on all this, except that the most recent run was from June 10 of this year.
I have identified a lot of places for improvement and am working on it now. I think some of the major issues may have been related to the fact that it pulled directly from master (and may have gotten a slightly broken build), and to memory usage issues (which have since been fixed). I will try to get a pull request in for the various improvements soon.
I have recently put in a bit more work to a C++ version of nearest neighbor descent, for eventual use with uwot. I have, of course, shamelessly copied the code in pynndescent. The development of the `parallel_everything` branch there has been of great interest. I realize it's a work in progress, so I don't want to assume any of it is intended to be fully working, but I wanted to ask about some changes on that branch to help with my own development work, and this issue seems like a good place to do it.
What I'm most interested in is the high-memory setup. What is the purpose of the `in_graph` set in the high-memory code path? It looks like it keeps track of any indices that have ever been successfully added to `current_graph`. Indices not in the set are `unchecked_push`ed onto the heap, as opposed to the `push` of the low-memory path, which checks that the index isn't already in the heap via a linear scan of the index list. This seems to trade off the cost of multiple set lookups and potential insertions against the cost of the linear scan. In my C++ code, set lookups are definitely more expensive than an array scan, so the high-memory code path is always slower than the low-memory one, even with a large number of neighbors (like 150). Is there a particular scenario where this works out better?
I think there is an opportunity to re-use the `in_graph` set in `generate_graph_updates` (and hence create a `_high_memory` variant): if the check `q in in_graph[p] and p in in_graph[q]` is carried out there, you can skip the update and hence the distance calculation. As `in_graph` is only being read from at that point, it shouldn't ruin any parallelization. I think you would see speed improvements here with large `n_neighbors`.
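A sketch of what such a `_high_memory` update generator might look like (the signature and names are my assumptions; `dist` stands in for whatever metric is in use):

```python
def generate_graph_updates_high_memory(pairs, in_graph, data, dist):
    """For each candidate pair, skip the distance computation entirely
    when both endpoints already record each other; in_graph is only
    read here, so the loop stays safe to run in parallel."""
    updates = []
    for p, q in pairs:
        if q in in_graph[p] and p in in_graph[q]:
            continue  # pair already in both heaps: no update possible
        updates.append((p, q, dist(data[p], data[q])))
    return updates
```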
But why are only vertices that successfully make it into `current_graph` stored in `in_graph`? Is it just to prevent runaway memory usage? The use of a set in the current master (at least with single threading) is to cache pretty much every pair of candidates that is tried. Has this been abandoned for being too memory intensive? Storing every pair, and taking advantage of the fact that under those conditions you need store only one of `p, q` or `q, p` (e.g. where `q > p`) and so save on the number of insertions and lookups, has been pretty much the only way I've got a noticeable performance improvement over the low-memory path, using MNIST and 15 nearest neighbors as my benchmark (and using multiple threads narrows the performance gap even there).
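The canonical-orientation trick amounts to something like this (a sketch; the names are mine):

```python
def seen_before(tried, p, q):
    """Record candidate pairs in a single canonical orientation
    (smaller index first), halving the insertions and lookups
    compared with storing both (p, q) and (q, p)."""
    key = (p, q) if p < q else (q, p)
    if key in tried:
        return True
    tried.add(key)
    return False
```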
Also, I see that `rho` is no longer used. Is the thinking here that just controlling the general neighbor list size via `max_candidates` is effectively the same as sampling?
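If I understand the equivalence being suggested: keeping the `max_candidates` entries with the smallest random priorities is itself a uniform sample without replacement, so an explicit `rho` sampling rate becomes redundant. A toy sketch of that idea (names are mine, not the library's):

```python
import random

def bounded_candidates(neighbor_ids, max_candidates, rng=random.Random(42)):
    """Tag each candidate with a random priority and keep the
    max_candidates smallest; the bound itself performs the job a
    rho-style sampling rate used to do."""
    by_priority = sorted(neighbor_ids, key=lambda _: rng.random())
    return by_priority[:max_candidates]
```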
Thanks for all the questions. The `parallel_everything` branch is indeed a bit of a playground for a wide variety of experimental approaches; a lot of it started out as trying to take advantage of some newer numba features. You are correct that `in_graph` was a replacement for the previous `tried` set of pairs, with the goal of reducing excessive memory usage somewhat. I haven't done extensive benchmarking on this, but I do recall that the `high_memory` option was slightly faster (perhaps due to how numba processes things, which will be different to C++ in places). I am certainly not wedded to this, and may simply go with the low-memory option as the default once I have run some benchmarks to test the differences. Ultimately the paired `tried` set was proving too memory intensive for various cases that cropped up, so I really wanted to improve the memory use, or move away from it entirely if possible. As noted, this was all a bit experimental.
As for the lack of `rho` -- there were actually some (subtle) bugs in the original `new_build_candidates` that meant it wasn't doing quite what one might want it to be doing. It was close enough that it worked well enough anyway, but weeding out that issue made it possible to ignore `rho` and use the `max_candidates` bound to, as you note, effectively do the same job.
At this point I am getting pretty good parallelism (not ideal, but decent) for both the trees and the nn-descent. Ultimately a lot of the work is really going into the search phase (which is probably of less interest to you, as UMAP doesn't use it, except for the transform). This has actually gone surprisingly well, and the ann-benchmarks results I see now are quite strong (at least on my local machine). The one remaining big holdout/difference is the distance computations themselves. It seems that HNSW and friends use an AVX2 (or SSE) specific distance computation that takes advantage of 256-bit-wide operations for SIMD parallelism within the actual distance computation. I have so far not been able to coax numba to do the same. Since you are in straight C++ this is certainly something you can (and should) take advantage of. I would point you to https://github.com/nmslib/hnswlib/blob/master/hnswlib/space_l2.h as a reference for what they are doing. In my current experiments pynndescent actually outperforms hnswlib on some datasets when I turn the SIMD distance computation off, so this really is the remaining difference.
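For reference, the kernel in space_l2.h is structured roughly like the following (a pure-Python rendering of the control flow only; the real code performs each 8-lane step with single `_mm256_*` instructions):

```python
def l2_sqr_8lane(x, y):
    """Accumulate squared differences into 8 parallel partial sums
    (one per SIMD lane), then reduce them and handle the scalar tail,
    mirroring the structure of an AVX2 L2 kernel."""
    partial = [0.0] * 8
    i, n = 0, len(x)
    while i + 8 <= n:            # one AVX2 iteration: 8 floats at once
        for lane in range(8):
            d = x[i + lane] - y[i + lane]
            partial[lane] += d * d
        i += 8
    total = sum(partial)         # horizontal reduction of the lanes
    while i < n:                 # leftover elements, handled scalar
        d = x[i] - y[i]
        total += d * d
        i += 1
    return total
```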
Thank you for all these details, especially about the memory use. From the sound of things, I may have to abandon the set-based high memory approach entirely for now. In rnndescent, I have split the R-interfacing code from the main nearest neighbor routines, and I would like to eventually have the C++ code be a standalone project. So actually all the work going into pynndescent is of (eventual) interest to me, although improving uwot's nearest neighbor calculations is the highest priority.
Ironically, the reason there are those `USE_AVX` flags in the HNSW distance calculations is that I added them in a PR (although most of the credit for the work should go to the ever-industrious @LTLA) -- except the goal was to turn them off! R doesn't allow C++ code to be built with architecture-specific flags. The difference is indeed noticeable 😞 -- if I ever get the standalone C++ code up and running, I will make it a project to add these optimizations back into my own distance calculations.
In terms of parallelism, copying your approach has definitely been faster than anything I was trying. One thing I've noticed is that although the local join approach is faster than a straight neighbor-of-a-neighbor search, it makes writing to shared data extremely difficult, because it's effectively impossible to organize the graph updates into thread-safe windows. So there are always bottlenecks where you have to be single-threaded -- at least in C++/RcppParallel, where the cost of the thread-related code and of locking data structures for safe writes cancels out any savings. The use of a shared set-like structure compounds the issue. I may have to experiment with creating a mutex pool to reduce locking, but I am bound to do this badly. This is one of the times when giving up and moving to Python and numba looks very tempting 😄
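One way such a mutex pool could be set up (a sketch in Python's threading terms; a C++/RcppParallel version would use a fixed array of std::mutex the same way):

```python
import threading

class LockPool:
    """A small fixed pool of locks shared across all graph rows: lock
    (row % size) guards row, so memory stays bounded at 'size' mutexes
    while unrelated rows only occasionally contend for the same lock."""
    def __init__(self, size=64):
        self._locks = [threading.Lock() for _ in range(size)]
        self._size = size

    def lock_for(self, row):
        return self._locks[row % self._size]

pool = LockPool(size=8)
with pool.lock_for(42):   # guard updates to row 42's heap
    pass
```

The pool size trades contention against memory: more locks means fewer collisions between unrelated rows, at the cost of one mutex each.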
One final strategy I had for parallelism with a high-memory mode was abandoning the local join approach for neighbor-of-a-neighbor (while retaining the incremental search strategy). Although you do more (and more indirect) array reading, you are always writing to the same part of the graph, so you can be multi-threaded without locking, and you can get some of the advantages of a high-memory mode by storing a set of the currently seen indices in the neighbors of neighbors. To control memory usage this set can be made local to the current iteration, so that there are never more than `n_threads` of them stored at a time. I may return to this and see if there are any scenarios where it outperforms local join without excessive memory usage.
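A sketch of that layout (my own simplification): each worker owns a disjoint set of rows, reads the rest of the graph freely, and writes only to its own rows, with a per-row `seen` set that is discarded once the row is done:

```python
from concurrent.futures import ThreadPoolExecutor

def nn_of_nn_pass(graph, data, dist, n_threads=4):
    """One neighbor-of-a-neighbor pass: no locks are needed because each
    row is written by exactly one worker, and the transient 'seen' sets
    give the high-memory deduplication without persisting between passes
    (at most n_threads of them are alive at once)."""
    k = len(graph[0])

    def process_row(i):
        seen = set(graph[i]) | {i}   # local to this row, freed afterwards
        candidates = list(graph[i])
        for j in graph[i]:
            for c in graph[j]:       # neighbors of neighbors (read-only)
                if c not in seen:
                    seen.add(c)
                    candidates.append(c)
        candidates.sort(key=lambda c: dist(data[i], data[c]))
        return candidates[:k]        # only row i is (re)written

    with ThreadPoolExecutor(max_workers=n_threads) as pool:
        return list(pool.map(process_row, range(len(graph))))
```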