Hi, do you have any example of ngt usage on C++ with hamming distance? <p dir=

I made an example for hamming distance. Since this uses <a href="https://github.com/ya

どういたしまして Dou itashimashite. Now I think <a href="https://github.com/

[Question] Hamming distance C++ example about ngt HOT 12 CLOSED

yahoojapan commented on May 22, 2024

[Question] Hamming distance C++ example

from ngt.

Comments (12)

masajiro commented on May 22, 2024

I made an example for hamming distance. Since this uses the data in this repository, run this in the top directory of this repository. The number of dimensions must be 16 for 128 bits. I hope this example makes sense.

#include        "NGT/Index.h"

int
main(int argc, char **argv)
{
  string        indexFile       = "index";
  string        objectFile      = "./data/sift-dataset-5k.tsv";
  string        queryFile       = "./data/sift-query-3.tsv";
  uint32_t      bitSize         = 128;

  NGT::Property property;
  property.dimension = bitSize / 8;
  property.graphType = NGT::Property::GraphType::GraphTypeANNG;
  property.objectType = NGT::ObjectSpace::ObjectType::Uint8;
  property.distanceType = NGT::Index::Property::DistanceType::DistanceTypeHamming;

  try {
    NGT::Index  index;
    index.createGraphAndTree(indexFile, property);
    index.open(indexFile);
    ifstream    is(objectFile);
    string line;
    while (getline(is, line)) {
      vector<string> tokens;
      NGT::Common::tokenize(line, tokens, " \t");      // split an object string into values by separators.                  
      vector<float> obj;
      for (vector<string>::iterator ti = tokens.begin(); ti != tokens.end(); ++ti) {
        obj.push_back(NGT::Common::strtol(*ti));
      }
      obj.resize(property.dimension);  // shrink vectors because original ones are too long for this sample.                 
      index.append(obj);
    }
    index.createIndex(16);
    index.saveIndex(indexFile);
  } catch (NGT::Exception &err) {
    cerr << "Error " << err.what() << endl;
    return 1;
  } catch (...) {
    cerr << "Error" << endl;
    return 1;
  }

  try {
    NGT::Index  index(indexFile);
    ifstream    is(queryFile);
    string line;
    while (getline(is, line)) {
      NGT::Object *query = 0;
      {
        vector<string> tokens;
        NGT::Common::tokenize(line, tokens, " \t");
        vector<float> obj;
        for (vector<string>::iterator ti = tokens.begin(); ti != tokens.end(); ++ti) {
          obj.push_back(NGT::Common::strtol(*ti));
        }
        obj.resize(property.dimension);
        cout << "Query: ";
        for (size_t i = 0; i < obj.size(); i++) {
          cout << obj[i] << " ";
        }
        cout << endl << endl;
        query = index.allocateObject(obj);
      }
      NGT::SearchContainer sc(*query);
      NGT::ObjectDistances objects;
      sc.setResults(&objects);
      sc.setSize(10);
      sc.setEpsilon(0.2);

      index.search(sc);
      cout << "Rank\tID\tDistance" << endl;
      for (size_t i = 0; i < objects.size(); i++) {
        cout << i + 1 << "\t" << objects[i].id << "\t" << objects[i].distance << " : ";
        NGT::ObjectSpace &objectSpace = index.getObjectSpace();
        uint8_t *object = static_cast<uint8_t*>(objectSpace.getObject(objects[i].id));
        for (size_t i = 0; i < objectSpace.getDimension(); ++i) {
          cout << static_cast<int>(object[i]) << " ";
        }
        cout << endl;
      }
      cout << endl;
      index.deleteObject(query);
    }
  } catch (NGT::Exception &err) {
    cerr << "Error " << err.what() << endl;
    return 1;
  } catch (...) {
    cerr << "Error" << endl;
    return 1;
  }
  return 0;
}

from ngt.

TieSKey commented on May 22, 2024

Fantastic. Arigatou gosaimasu Iwasaki-hakase.

So going by your example, the way we input data to the index (in chunks as floats, string, etc) doesn't affect the algorithm right? (which is perfectly logical but sometimes implementation details can introduce some oddities)
As part of my phd tesis I'm trying to interface this library with opencv and dlib. ~~Working over this example to implement overloads of allocateObject (and related methods) to take vector as params seems to be working correctly but it needs more testing.~~
Edit: just noticed that Uint8 is just a def for unsigned char so no need to touch base code.

from ngt.

masajiro commented on May 22, 2024

どういたしまして Dou itashimashite.

Now I think Index::append() is confusing when using hamming distance. One float value is forcedly set to an unsigned char variable. This means that a float value must be positive and smaller than 256.

the way we input data to the index (in chunks as floats, string, etc) doesn't affect the algorithm right?

I am not sure how you want to apply hamming distance to your floats and string because hamming distance is a bitwise operator.

from ngt.

TieSKey commented on May 22, 2024

I am not sure how you want to apply hamming distance to your floats and string because hamming distance is a bitwise operator.

Indeed, I was wondering if maybe some memory alignment or input data separation was dependent on the elements count when adding data to the index. So having N/4 4byte elements instead of N 1byte elements could cause problems or unexpected behaviors. Well, in your example the number of elements is originally the same and the extra bytes are discarded later.

Now I think Index::append() is confusing when using hamming distance. One float value is forcedly set to an unsigned char variable. This means that a float value must be positive and smaller than 256.

Oh, I overlooked this. An Uint8 overload would be a lot clearer, yes, and since the underlying function doing the memory allocation/copying is already templated, the required changes are minimal.

from ngt.

masajiro commented on May 22, 2024

Now I think Index::append() is confusing when using hamming distance. One float value is forcedly set to an unsigned char variable. This means that a float value must be positive and smaller than 256.

To solve the confusion, I updated Index::append() to be able to use with an uint8 argument. You can write as the example below.

vector<uint8_t> obj;
...
index.append(obj);

from ngt.

masajiro commented on May 22, 2024

Below is a sample for hamming distance with the latest version for your reference.

#include        "NGT/Index.h"

using namespace std;

int
main(int argc, char **argv)
{
  string        indexFile       = "index";
  string        objectFile      = "./data/sift-dataset-5k.tsv";
  string        queryFile       = "./data/sift-query-3.tsv";
  size_t        bitSize         = 128;
  // index construction                                                                                                              
  try {
    NGT::Property       property;                                                              
    property.dimension          = bitSize / 8;                                                      
    property.objectType         = NGT::ObjectSpace::ObjectType::Uint8;                                              
    property.distanceType       = NGT::Index::Property::DistanceType::DistanceTypeHamming;
    NGT::Index  index(property);
    ifstream    is(objectFile);
    string      line;
    while (getline(is, line)) {
      vector<string>    tokens;
      NGT::Common::tokenize(line, tokens, " \t");      // split an object string into values by separators.                          
      vector<uint8_t>   obj;                                                                                              
      for (vector<string>::iterator ti = tokens.begin(); ti != tokens.end(); ++ti) {
        obj.push_back(NGT::Common::strtol(*ti));
      }
      obj.resize(property.dimension);  // cut off unnecessary data in the file.                                                       
      index.append(obj);
    }
    index.createIndex(16);
    index.saveIndex(indexFile);
  } catch (NGT::Exception &err) {
    cerr << "Error " << err.what() << endl;
    return 1;
  } catch (...) {
    cerr << "Error" << endl;
    return 1;
  }

  // nearest neighbor search                                                                                                         
  try {
    NGT::Index          index(indexFile);
    NGT::Property       property;
    index.getProperty(property);
    ifstream            is(queryFile);
    string      line;
    while (getline(is, line)) {
      vector<uint8_t>   query;
      {
        vector<string>  tokens;
        NGT::Common::tokenize(line, tokens, " \t");
        for (vector<string>::iterator ti = tokens.begin(); ti != tokens.end(); ++ti) {
          query.push_back(NGT::Common::strtol(*ti));
        }
        query.resize(property.dimension);
        cout << "Query : ";
        for (size_t i = 0; i < query.size(); i++) {
          cout << std::bitset<8>(query[i]) << " ";                                                            
        }
      }

      NGT::SearchQuery          sc(query);
      NGT::ObjectDistances      objects;
      sc.setResults(&objects);
      sc.setSize(10);
      sc.setEpsilon(0.2);

      index.search(sc);
      cout << endl << "Rank\tID\tDistance" << std::showbase << endl;
      for (size_t i = 0; i < objects.size(); i++) {
        cout << i + 1 << "\t" << objects[i].id << "\t" << objects[i].distance << "\t: ";
        NGT::ObjectSpace &objectSpace = index.getObjectSpace();
        uint8_t *object = static_cast<uint8_t*>(objectSpace.getObject(objects[i].id));
        for (size_t i = 0; i < objectSpace.getDimension(); ++i) {
          cout << std::bitset<8>(object[i]) << " ";                                                                  
        }
        cout << endl;
      }
      cout << endl;
    }
  } catch (NGT::Exception &err) {
    cerr << "Error " << err.what() << endl;
    return 1;
  } catch (...) {
    cerr << "Error" << endl;
    return 1;
  }

  return 0;
}

from ngt.

TieSKey commented on May 22, 2024

Looks good, thx again!

Btw, ONNG takes a really long time to build, around 15-20 mins on my computer, that's expected. But what about panng? does it build any faster?

from ngt.

masajiro commented on May 22, 2024

PANNG can be built faster than ONNG with the recommended number of edges 100. But, first, you may want to reduce the number of edges of ONNG (ANNG) to build it faster, because 100 edges are too many for most datasets. The number of edges can be specified with NGT::Property::edgeSizeForCreation or -E of ngt create command.

from ngt.

TieSKey commented on May 22, 2024

Akemashite omedeto gosaimasu.

Running some more benchmarks, when I construct an ONNG index by setting:
property.graphType = NGT::Property::GraphType::GraphTypeONNG

it actually takes less time than if I construct an ANNG index with 10-20 edges and then run the optimizer over it, which if I understand correctly, constructs a PANNG:

 NGT::GraphOptimizer optimizer;
                optimizer.numOfOutgoingEdges = 4;
                optimizer.numOfIncomingEdges = 8;
                optimizer.execute(*index);

Is this correct? Am I misunderstanding something?

Thx as always for your assistance.

from ngt.

masajiro commented on May 22, 2024

あけましておめでとうございます。今年もよろしくお願い致します。
Akemashite omedetou gozaimasu. Kotoshimo yoroshiku onegai itashimasu.

I experimentally added the type ONNG (NGT::Property::GraphType::GraphTypeONNG) to the index construction in order to directly construct an ONNG index without constructing an ANNG index. Unfortunately, the pseudo ONNG in this way did not achieve the same performance as that of an authentic ONNG index that is constructed from an ANNG index. Thus, I don't recommend you to use an pseudo ONNG index as an authentic ONNG index. However, you might able to use the pseudo ONNG index as an ANNG index, when you construct ONNG. It means that, as you mentioned, first construct an pseudo ONNG, then optimize it to construct ONNG. I expect that the ONNG index optimized from a pseudo ONNG index, which is not a PANNG index, achieves almost the same performance as that of the ONNG index from ANNG.

PANNG can be created with this command.

The index constructed by using the following manner might be almost the same as PANNG. But, I am not so sure.

NGT::GraphOptimizer optimizer;
                optimizer.numOfOutgoingEdges = 4;
                optimizer.numOfIncomingEdges = INT32_MAX;
                optimizer.execute(*index);

from ngt.

TieSKey commented on May 22, 2024

Thanks as always for your assistance.

I'm finishing up an academic publication regarding ANN algorithms and their use for Augmented Reality and wanted to publicly thank you for your assistance if u are ok with that.

from ngt.

masajiro commented on May 22, 2024

I'm glad to help you!

from ngt.

[Question] Hamming distance C++ example about ngt HOT 12 CLOSED

Comments (12)

Related Issues (20)

Recommend Projects

React

Vue.js

Typescript

TensorFlow

Django

Laravel

D3

Recommend Topics

javascript

web

server

Machine learning

Visualization

Game

Recommend Org

Facebook

Microsoft

Google

Alibaba

D3

Tencent