Comments (8)
I'm not sure quite what you mean in suggestion one -- can you explain a little further?
For option two there used to be an option that tried to detect passing in the training data as query data and just returned the neighbor graph -- we could instead initialize with the neighbor graph. Would that suffice, or did you have more complex plans in mind?
from pynndescent.
Let me explain myself. Currently I'm working with Large Margin Nearest Neighbor (LMNN); the algorithm needs to compute, a priori, the k nearest neighbors with the same class label of each input sample, as determined by Euclidean distance. I'm currently trying to speed up this step with NN-Descent.
What would be needed is a function that takes the training data and class labels as input and returns, for each sample in the training data, its k nearest neighbors with the same class label (avoiding computing the distance of each sample from itself, of course, i.e. Distance(x_i, x_j) == 0 where x_i == x_j).
Ah, I see what you want to do now. It isn't obvious how to integrate that into the code without very special-case handling. It would likely be easier to simply split the training (and test) data by label, build n_classes separate indices, and query the index corresponding to the label of each query point.
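To make the split-by-label idea concrete, here is a brute-force NumPy sketch (my own illustration, not a pynndescent API; with pynndescent you would build one `NNDescent` index per class instead of the O(n²) distance matrix used below):

```python
import numpy as np

def same_class_neighbors(data, labels, k):
    """Brute-force sketch: for each point, find its k nearest neighbors
    among the points with the same label, excluding the point itself.
    Assumes every class has at least k + 1 members.
    Returns an (n, k) array of indices into the full data set."""
    n = data.shape[0]
    result = np.full((n, k), -1, dtype=np.int64)
    for c in np.unique(labels):
        idx = np.flatnonzero(labels == c)       # original indices of this class
        sub = data[idx]
        d = np.linalg.norm(sub[:, None, :] - sub[None, :, :], axis=-1)
        np.fill_diagonal(d, np.inf)             # exclude each point from its own list
        order = np.argsort(d, axis=1)[:, :k]    # k nearest within the class
        result[idx] = idx[order]                # map back to original indices
    return result
```

Excluding self by index position (`fill_diagonal`) rather than by distance also sidesteps the duplicate-point issue discussed later in this thread.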
Well, yes, but doing that only fixes half of the problem: the function will still compute the distance of each sample from itself. I could look for the k + 1 nearest neighbors, but in some odd cases I have noticed that the distances are not always ordered as expected (the distance of sample x_i from x_i is not always first in the vector). This would imply extra work for the user.
I was also thinking of a function that could compute the k most representative data points of each class, i.e. the k samples that minimize the total distance to the other data points of the same class. It could be interesting.
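That "most representative points" idea is essentially per-class medoids. A brute-force sketch of what such a function might compute (my own code under that interpretation, not anything in pynndescent):

```python
import numpy as np

def class_medoids(data, labels, k):
    """Sketch: for each class, return the k points whose summed distance
    to the other points of the same class is smallest (k "medoids")."""
    medoids = {}
    for c in np.unique(labels):
        idx = np.flatnonzero(labels == c)
        sub = data[idx]
        d = np.linalg.norm(sub[:, None, :] - sub[None, :, :], axis=-1)
        cost = d.sum(axis=1)                  # total distance to classmates
        medoids[c] = idx[np.argsort(cost)[:k]]
    return medoids
```

For large classes one would replace the exact pairwise matrix with an approximate neighbor graph, but the definition above is the baseline to match.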
I would be happy to see a pull request to implement such a thing, but given that a user can work around these issues on their end I don't see this as a high enough priority feature for me to spend time implementing it myself.
Adding my 2 cents here since I had to handle a similar problem. Suppose you build a KNN index on a set of points `data` as follows:

```python
index_knn = NNDescent(data, **params)
```

The nearest neighbors of `data` can be accessed from the KNN index as follows:

```python
nn_indices = index_knn._neighbor_graph[0]
nn_distances = index_knn._neighbor_graph[1]
```

Each row of `nn_indices` and `nn_distances` has an entry corresponding to the point itself, which we want to remove. Ideally, we would expect that `nn_indices[i, 0] == i` and that `nn_distances[i, 0] == 0`. However, if the data has some overlapping points this may not be true. Hence, without making any assumptions, I wrote this little function that removes the "self neighbor" entries from `nn_indices` and `nn_distances`. It is numba-compiled to be fast.
```python
import numpy as np
from numba import njit, float64, int64
from numba.types import Tuple


@njit(Tuple((int64[:, :], float64[:, :]))(int64[:, :], float64[:, :]))
def remove_self_neighbors(index_neighbors_, distance_neighbors_):
    """
    Given the indices and distances of the k nearest neighbors of a list of
    query points, remove each point from its own neighbor list.

    :param index_neighbors_: numpy array of the indices of the `k` neighbors of
        each point. Has shape `(n, k)`, where `n` is the number of query points.
    :param distance_neighbors_: numpy array of the distances of the `k` neighbors
        of each point. Also has shape `(n, k)`.
    :return: (index_neighbors, distance_neighbors), where each has shape `(n, k - 1)`.
    """
    n, k = index_neighbors_.shape
    index_neighbors = np.zeros((n, k - 1), dtype=index_neighbors_.dtype)
    distance_neighbors = np.zeros((n, k - 1), dtype=distance_neighbors_.dtype)
    for i in range(n):
        j1 = j2 = 0
        while j1 < (k - 1) and j2 < k:
            if index_neighbors_[i, j2] != i:
                index_neighbors[i, j1] = index_neighbors_[i, j2]
                distance_neighbors[i, j1] = distance_neighbors_[i, j2]
                j1 += 1
            j2 += 1
    return index_neighbors, distance_neighbors
```
Query the `k + 1` nearest neighbors of the points in `data` and then call the above function as follows:

```python
nn_indices = index_knn._neighbor_graph[0]
nn_distances = index_knn._neighbor_graph[1]
nn_indices, nn_distances = remove_self_neighbors(nn_indices[:, :(k + 1)], nn_distances[:, :(k + 1)])
```
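For anyone who would rather avoid the numba dependency, here is a NumPy-only sketch of the same two-pointer idea (my own code, equivalent in behavior but not compiled, so slower on large inputs):

```python
import numpy as np

def remove_self_neighbors_np(nn_indices, nn_distances):
    """NumPy-only version: drop each point's own entry from its neighbor
    lists, keeping the first k - 1 non-self entries of each row."""
    n, k = nn_indices.shape
    out_idx = np.empty((n, k - 1), dtype=nn_indices.dtype)
    out_dist = np.empty((n, k - 1), dtype=nn_distances.dtype)
    for i in range(n):
        # Positions of non-self entries; if the self entry was pushed out
        # by a duplicate point, simply keep the first k - 1 neighbors.
        keep = np.flatnonzero(nn_indices[i] != i)[:k - 1]
        out_idx[i] = nn_indices[i, keep]
        out_dist[i] = nn_distances[i, keep]
    return out_idx, out_dist
```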
Hope this helps!
@lmcinnes Is it correct to assume that the `query` method of `NNDescent` returns the neighbors sorted by increasing distance to the query point? Thanks.
Technically you can make that assumption, but in some cases, like mine, there are overlapping points, and one of them can appear in first place instead of the query point itself. So if you have several overlapping points, neither the position nor even the presence of the queried point in the result vector can be assumed.
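A tiny NumPy illustration of why (brute-force distances, just to show the tie-breaking): with exact duplicates, the sort sees several distances of 0, and which one comes first is decided by tie-breaking, not by "self first".

```python
import numpy as np

# Points 0 and 1 are exact duplicates; point 2 is distinct.
data = np.array([[0.0, 0.0], [0.0, 0.0], [1.0, 0.0]])

# Full pairwise distance matrix.
dists = np.linalg.norm(data[:, None, :] - data[None, :, :], axis=-1)

# Sort each row's neighbors by distance; the tie at distance 0 is
# broken by index, so for point 1 its duplicate (index 0) sorts first.
order = np.argsort(dists, axis=1, kind="stable")
```

This is why removing the self entry by matching the index, as in `remove_self_neighbors` above, is safer than assuming it sits in column 0.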