
A possible improvement about pynndescent

lmcinnes commented on May 21, 2024
A possible improvement


Comments (8)

lmcinnes commented on May 21, 2024

I'm not sure quite what you mean in suggestion one -- can you explain a little further?

For option two there used to be an option that tried to detect passing in the training data as query data and just returned the neighbor graph -- we could instead initialize with the neighbor graph. Would that suffice, or did you have more complex plans in mind?
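
For context, the two access paths under discussion look roughly like this (a minimal sketch; the dataset and the parameter values are placeholders, not recommendations):

import numpy as np
from pynndescent import NNDescent

data = np.random.random((1000, 20)).astype(np.float32)
index = NNDescent(data, n_neighbors=15)

# Querying the index with its own training data re-runs the search and typically
# returns each point as its own nearest neighbor.
indices_q, distances_q = index.query(data, k=15)

# Reading the neighbor graph that NN-Descent already built during construction
# avoids that extra query entirely.
indices_g, distances_g = index.neighbor_graph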


TheCodeHere commented on May 21, 2024

Let me explain myself. Currently I'm working with Large Margin Nearest Neighbor (LMNN); the algorithm needs to compute, a priori, the k nearest neighbors with the same class label for each input sample, as determined by Euclidean distance. I'm trying to improve this step with NN-Descent.
What would be needed is a function that takes the training data and class labels as inputs and returns the k nearest neighbors with the same class label for each sample in the training data (avoiding computing the distance of each sample from itself, of course, e.g., Distance(x_i, x_j) == 0 where x_i == x_j).


lmcinnes commented on May 21, 2024

Ah, I see what you want to do now. It isn't obvious how to integrate that into the code in a way that wouldn't require very special-case handling. It would likely be easier to simply split the training (and test) data by label, build n_classes separate indices, and query the index corresponding to the label of the query point.
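
A minimal sketch of that per-class workaround, assuming NNDescent and its neighbor_graph attribute, might look like the following (the function name same_class_neighbors is illustrative, not part of the library, and it assumes every class has more than k samples):

import numpy as np
from pynndescent import NNDescent

def same_class_neighbors(X, y, k, **nn_params):
    """For each sample, find its k nearest neighbors among samples of the same class."""
    n = X.shape[0]
    nn_indices = np.empty((n, k), dtype=np.int64)
    nn_distances = np.empty((n, k), dtype=np.float64)
    for label in np.unique(y):
        members = np.where(y == label)[0]              # original indices of this class
        index = NNDescent(X[members], n_neighbors=k + 1, **nn_params)
        idx, dist = index.neighbor_graph               # each point appears as its own neighbor
        # Drop the self entry, assumed here to sit in column 0; see the discussion
        # below for why that assumption can fail when points are duplicated.
        nn_indices[members] = members[idx[:, 1:k + 1]]  # map back to original indices
        nn_distances[members] = dist[:, 1:k + 1]
    return nn_indices, nn_distances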


TheCodeHere commented on May 21, 2024

Well, yes, but doing that only fixes half of the problem: the function will still compute the distance of each sample from itself. I could look for the k+1 nearest neighbors instead, but in some odd cases I have noticed that the distances are not always sorted (the distance of sample x_i from x_i is not always first in the vector). This would imply extra work for the user.

Also, I was thinking of a function that could compute the k most representative data points of each class, i.e. the k samples that minimize the total distance to the other data points of the same class. It could be interesting.
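
One possible reading of the "most representative" idea, sketched on the assumption that a per-class neighbor-distance matrix (such as the one produced above) is available; the ranking criterion, total distance to the approximate same-class neighbors, is only an illustrative proxy:

import numpy as np

def most_representative_per_class(nn_distances, y, k):
    """For each class, pick the k points with the smallest total distance to their same-class neighbors."""
    representatives = {}
    centrality = nn_distances.sum(axis=1)      # smaller = more central within its class
    for label in np.unique(y):
        members = np.where(y == label)[0]
        order = np.argsort(centrality[members])
        representatives[label] = members[order[:k]]
    return representatives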


lmcinnes commented on May 21, 2024

I would be happy to see a pull request to implement such a thing, but given that a user can work around these issues on their end I don't see this as a high enough priority feature for me to spend time implementing it myself.


jayaram-r commented on May 21, 2024

Adding my 2 cents here since I had to handle a similar problem. Suppose you build a KNN index on a set of points, data, as follows:

from pynndescent import NNDescent

index_knn = NNDescent(data, **params)

The nearest neighbors of data can be accessed from the KNN index as follows:

nn_indices = index_knn._neighbor_graph[0]
nn_distances = index_knn._neighbor_graph[1]

Each row of nn_indices and nn_distances has an entry corresponding to the point itself, which we want to remove. Ideally, we would expect that nn_indices[i, 0] == i and that nn_distances[i, 0] == 0. However, if the data has some overlapping points this may not be true. Hence, without any assumptions, I wrote this little function that removes the "self neighbor" entries from nn_indices and nn_distances. It is numba-compiled to be fast.

import numpy as np
from numba import njit, float64, int64
from numba.types import Tuple

@njit(Tuple((int64[:, :], float64[:, :]))(int64[:, :], float64[:, :]))
def remove_self_neighbors(index_neighbors_, distance_neighbors_):
    """
    Given the index and distances of k nearest neighbors of a list of query points, remove points from their
    own neighbor list.
    :param index_neighbors_: numpy array of the index of `k` neighbors for a list of points. Has shape `(n, k)`,
                             where `n` is the number of query points.
    :param distance_neighbors_: numpy array of the distance of `k` neighbors for a list of points.
                                Also has shape `(n, k)`.
    :return: (index_neighbors, distance_neighbors), where each of them has shape `(n, k - 1)`.
    """
    n, k = index_neighbors_.shape
    index_neighbors = np.zeros((n, k - 1), dtype=index_neighbors_.dtype)
    distance_neighbors = np.zeros((n, k - 1), dtype=distance_neighbors_.dtype)
    for i in range(n):
        j1 = j2 = 0
        while j1 < (k - 1) and j2 < k:
            if index_neighbors_[i, j2] != i:
                index_neighbors[i, j1] = index_neighbors_[i, j2]
                distance_neighbors[i, j1] = distance_neighbors_[i, j2]
                j1 += 1

            j2 += 1

    return index_neighbors, distance_neighbors

Take the k + 1 nearest neighbors of the points in data (here, from the index's neighbor graph) and then call the above function as follows:

nn_indices = index_knn._neighbor_graph[0]
nn_distances = index_knn._neighbor_graph[1]
nn_indices, nn_distances = remove_self_neighbors(nn_indices[:, :(k + 1)], nn_distances[:, :(k + 1)])

Hope this helps!


jayaram-r commented on May 21, 2024

@lmcinnes Is it correct to assume that the query method of NNDescent returns the neighbors sorted by increasing distance to the query point? Thanks.


TheCodeHere commented on May 21, 2024

Technically you can make that assumption, but in some cases, like mine, there are points that overlap and appear in first place instead of the query point itself. So the position, or even the presence, of the query point in the result vector cannot be assumed if you have several overlapping points.

