
A possible improvement about pynndescent

lmcinnes commented on May 21, 2024
A possible improvement


Comments (8)

lmcinnes commented on May 21, 2024

I'm not sure quite what you mean in suggestion one -- can you explain a little further?

For option two there used to be an option that tried to detect passing in the training data as query data and just returned the neighbor graph -- we could instead initialize with the neighbor graph. Would that suffice, or did you have more complex plans in mind?
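
For context, the two access paths under discussion look roughly like this (a minimal sketch; the dataset and the parameter values are placeholders, not recommendations):

import numpy as np
from pynndescent import NNDescent

data = np.random.random((1000, 20)).astype(np.float32)
index = NNDescent(data, n_neighbors=15)

# Querying the index with its own training data re-runs the search and typically
# returns each point as its own nearest neighbor.
indices_q, distances_q = index.query(data, k=15)

# Reading the neighbor graph that NN-Descent already built during construction
# avoids that extra query entirely.
indices_g, distances_g = index.neighbor_graph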


TheCodeHere commented on May 21, 2024

Let me explain myself. Currently I'm working with Large Margin Nearest Neighbor (LMNN); the algorithm needs to compute, a priori, the k nearest neighbors with the same class label for each input sample, as determined by Euclidean distance. I'm trying to improve this step with NN-Descent.
What would be needed is a function that takes the training data and class labels as inputs and returns the k nearest neighbors with the same class label for each sample in the training data (avoiding computing the distance of each sample from itself, of course, e.g., Distance(x_i, x_j) == 0 where x_i == x_j).


lmcinnes commented on May 21, 2024

Ah, I see what you want to do now. It isn't obvious how to integrate that into the code in a way that wouldn't require very special-case handling. It would likely be easier to simply split the training (and test) data by label, build n_classes separate indices, and query the index corresponding to the label of the query point.
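
A minimal sketch of that per-class workaround, assuming NNDescent and its neighbor_graph attribute, might look like the following (the function name same_class_neighbors is illustrative, not part of the library, and it assumes every class has more than k samples):

import numpy as np
from pynndescent import NNDescent

def same_class_neighbors(X, y, k, **nn_params):
    """For each sample, find its k nearest neighbors among samples of the same class."""
    n = X.shape[0]
    nn_indices = np.empty((n, k), dtype=np.int64)
    nn_distances = np.empty((n, k), dtype=np.float64)
    for label in np.unique(y):
        members = np.where(y == label)[0]              # original indices of this class
        index = NNDescent(X[members], n_neighbors=k + 1, **nn_params)
        idx, dist = index.neighbor_graph               # each point appears as its own neighbor
        # Drop the self entry, assumed here to sit in column 0; see the discussion
        # below for why that assumption can fail when points are duplicated.
        nn_indices[members] = members[idx[:, 1:k + 1]]  # map back to original indices
        nn_distances[members] = dist[:, 1:k + 1]
    return nn_indices, nn_distances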


TheCodeHere commented on May 21, 2024

Well, yes, but doing that only fixes half of the problem: the function will still compute the distance of each sample from itself. I could look for the k+1 nearest neighbors instead, but in some odd cases I have noticed that the distances are not always sorted (the distance of sample x_i from x_i is not always first in the vector). This would imply extra work for the user.

Also, I was thinking of a function that could compute the k most representative data points of each class, i.e. the k samples that minimize the total distance to the other data points of the same class. It could be interesting.
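
One possible reading of the "most representative" idea, sketched on the assumption that a per-class neighbor-distance matrix (such as the one produced above) is available; the ranking criterion, total distance to the approximate same-class neighbors, is only an illustrative proxy:

import numpy as np

def most_representative_per_class(nn_distances, y, k):
    """For each class, pick the k points with the smallest total distance to their same-class neighbors."""
    representatives = {}
    centrality = nn_distances.sum(axis=1)      # smaller = more central within its class
    for label in np.unique(y):
        members = np.where(y == label)[0]
        order = np.argsort(centrality[members])
        representatives[label] = members[order[:k]]
    return representatives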


lmcinnes commented on May 21, 2024

I would be happy to see a pull request to implement such a thing, but given that a user can work around these issues on their end I don't see this as a high enough priority feature for me to spend time implementing it myself.


jayaram-r commented on May 21, 2024

Adding my 2 cents here since I had to handle a similar problem. Suppose you build a KNN index on a set of points, data, as follows:

from pynndescent import NNDescent

index_knn = NNDescent(data, **params)

The nearest neighbors of data can be accessed from the KNN index as follows:

nn_indices = index_knn._neighbor_graph[0]
nn_distances = index_knn._neighbor_graph[1]

Each row of nn_indices and nn_distances has an entry corresponding to the point itself, which we want to remove. Ideally, we would expect that nn_indices[i, 0] == i and that nn_distances[i, 0] == 0. However, if the data has some overlapping points this may not be true. Hence, without any assumptions, I wrote this little function that removes the "self neighbor" entries from nn_indices and nn_distances. It is numba-compiled to be fast.

import numpy as np
from numba import njit, float64, int64
from numba.types import Tuple

@njit(Tuple((int64[:, :], float64[:, :]))(int64[:, :], float64[:, :]))
def remove_self_neighbors(index_neighbors_, distance_neighbors_):
    """
    Given the index and distances of k nearest neighbors of a list of query points, remove points from their
    own neighbor list.
    :param index_neighbors_: numpy array of the index of `k` neighbors for a list of points. Has shape `(n, k)`,
                             where `n` is the number of query points.
    :param distance_neighbors_: numpy array of the distance of `k` neighbors for a list of points.
                                Also has shape `(n, k)`.
    :return: (index_neighbors, distance_neighbors), where each of them has shape `(n, k - 1)`.
    """
    n, k = index_neighbors_.shape
    index_neighbors = np.zeros((n, k - 1), dtype=index_neighbors_.dtype)
    distance_neighbors = np.zeros((n, k - 1), dtype=distance_neighbors_.dtype)
    for i in range(n):
        j1 = j2 = 0
        while j1 < (k - 1) and j2 < k:
            if index_neighbors_[i, j2] != i:
                index_neighbors[i, j1] = index_neighbors_[i, j2]
                distance_neighbors[i, j1] = distance_neighbors_[i, j2]
                j1 += 1

            j2 += 1

    return index_neighbors, distance_neighbors

Take the k + 1 nearest neighbors of the points in data (here, from the index's neighbor graph) and then call the above function as follows:

nn_indices = index_knn._neighbor_graph[0]
nn_distances = index_knn._neighbor_graph[1]
nn_indices, nn_distances = remove_self_neighbors(nn_indices[:, :(k + 1)], nn_distances[:, :(k + 1)])

Hope this helps!


jayaram-r commented on May 21, 2024

@lmcinnes Is it correct to assume that the query method of NNDescent returns the neighbors sorted by increasing distance to the query point? Thanks.


TheCodeHere commented on May 21, 2024

Technically you can make that assumption, but in some cases, like mine, there are points that overlap and appear in first place instead of the query point itself. So the position, or even the presence, of the query point in the result vector cannot be assumed if you have several overlapping points.

