Comments (12)
I made an example for hamming distance. Since this uses the data in this repository, run this in the top directory of this repository. The number of dimensions must be 16 for 128 bits. I hope this example makes sense.
#include "NGT/Index.h"
int
main(int argc, char **argv)
{
string indexFile = "index";
string objectFile = "./data/sift-dataset-5k.tsv";
string queryFile = "./data/sift-query-3.tsv";
uint32_t bitSize = 128;
NGT::Property property;
property.dimension = bitSize / 8;
property.graphType = NGT::Property::GraphType::GraphTypeANNG;
property.objectType = NGT::ObjectSpace::ObjectType::Uint8;
property.distanceType = NGT::Index::Property::DistanceType::DistanceTypeHamming;
try {
NGT::Index index;
index.createGraphAndTree(indexFile, property);
index.open(indexFile);
ifstream is(objectFile);
string line;
while (getline(is, line)) {
vector<string> tokens;
NGT::Common::tokenize(line, tokens, " \t"); // split an object string into values by separators.
vector<float> obj;
for (vector<string>::iterator ti = tokens.begin(); ti != tokens.end(); ++ti) {
obj.push_back(NGT::Common::strtol(*ti));
}
obj.resize(property.dimension); // shrink vectors because original ones are too long for this sample.
index.append(obj);
}
index.createIndex(16);
index.saveIndex(indexFile);
} catch (NGT::Exception &err) {
cerr << "Error " << err.what() << endl;
return 1;
} catch (...) {
cerr << "Error" << endl;
return 1;
}
try {
NGT::Index index(indexFile);
ifstream is(queryFile);
string line;
while (getline(is, line)) {
NGT::Object *query = 0;
{
vector<string> tokens;
NGT::Common::tokenize(line, tokens, " \t");
vector<float> obj;
for (vector<string>::iterator ti = tokens.begin(); ti != tokens.end(); ++ti) {
obj.push_back(NGT::Common::strtol(*ti));
}
obj.resize(property.dimension);
cout << "Query: ";
for (size_t i = 0; i < obj.size(); i++) {
cout << obj[i] << " ";
}
cout << endl << endl;
query = index.allocateObject(obj);
}
NGT::SearchContainer sc(*query);
NGT::ObjectDistances objects;
sc.setResults(&objects);
sc.setSize(10);
sc.setEpsilon(0.2);
index.search(sc);
cout << "Rank\tID\tDistance" << endl;
for (size_t i = 0; i < objects.size(); i++) {
cout << i + 1 << "\t" << objects[i].id << "\t" << objects[i].distance << " : ";
NGT::ObjectSpace &objectSpace = index.getObjectSpace();
uint8_t *object = static_cast<uint8_t*>(objectSpace.getObject(objects[i].id));
for (size_t i = 0; i < objectSpace.getDimension(); ++i) {
cout << static_cast<int>(object[i]) << " ";
}
cout << endl;
}
cout << endl;
index.deleteObject(query);
}
} catch (NGT::Exception &err) {
cerr << "Error " << err.what() << endl;
return 1;
} catch (...) {
cerr << "Error" << endl;
return 1;
}
return 0;
}
from ngt.
Fantastic. Arigatou gosaimasu Iwasaki-hakase.
So going by your example, the way we input data to the index (in chunks as floats, string, etc) doesn't affect the algorithm right? (which is perfectly logical but sometimes implementation details can introduce some oddities)
As part of my phd tesis I'm trying to interface this library with opencv and dlib. Working over this example to implement overloads of allocateObject (and related methods) to take vector as params seems to be working correctly but it needs more testing.
Edit: just noticed that Uint8 is just a def for unsigned char so no need to touch base code.
from ngt.
どういたしまして Dou itashimashite.
Now I think Index::append() is confusing when using hamming distance. One float value is forcedly set to an unsigned char variable. This means that a float value must be positive and smaller than 256.
the way we input data to the index (in chunks as floats, string, etc) doesn't affect the algorithm right?
I am not sure how you want to apply hamming distance to your floats and string because hamming distance is a bitwise operator.
from ngt.
I am not sure how you want to apply hamming distance to your floats and string because hamming distance is a bitwise operator.
Indeed, I was wondering if maybe some memory alignment or input data separation was dependent on the elements count when adding data to the index. So having N/4 4byte elements instead of N 1byte elements could cause problems or unexpected behaviors. Well, in your example the number of elements is originally the same and the extra bytes are discarded later.
Now I think Index::append() is confusing when using hamming distance. One float value is forcedly set to an unsigned char variable. This means that a float value must be positive and smaller than 256.
Oh, I overlooked this. An Uint8 overload would be a lot clearer, yes, and since the underlying function doing the memory allocation/copying is already templated, the required changes are minimal.
from ngt.
Now I think Index::append() is confusing when using hamming distance. One float value is forcedly set to an unsigned char variable. This means that a float value must be positive and smaller than 256.
To solve the confusion, I updated Index::append() to be able to use with an uint8 argument. You can write as the example below.
vector<uint8_t> obj;
...
index.append(obj);
from ngt.
Below is a sample for hamming distance with the latest version for your reference.
#include "NGT/Index.h"
using namespace std;
int
main(int argc, char **argv)
{
string indexFile = "index";
string objectFile = "./data/sift-dataset-5k.tsv";
string queryFile = "./data/sift-query-3.tsv";
size_t bitSize = 128;
// index construction
try {
NGT::Property property;
property.dimension = bitSize / 8;
property.objectType = NGT::ObjectSpace::ObjectType::Uint8;
property.distanceType = NGT::Index::Property::DistanceType::DistanceTypeHamming;
NGT::Index index(property);
ifstream is(objectFile);
string line;
while (getline(is, line)) {
vector<string> tokens;
NGT::Common::tokenize(line, tokens, " \t"); // split an object string into values by separators.
vector<uint8_t> obj;
for (vector<string>::iterator ti = tokens.begin(); ti != tokens.end(); ++ti) {
obj.push_back(NGT::Common::strtol(*ti));
}
obj.resize(property.dimension); // cut off unnecessary data in the file.
index.append(obj);
}
index.createIndex(16);
index.saveIndex(indexFile);
} catch (NGT::Exception &err) {
cerr << "Error " << err.what() << endl;
return 1;
} catch (...) {
cerr << "Error" << endl;
return 1;
}
// nearest neighbor search
try {
NGT::Index index(indexFile);
NGT::Property property;
index.getProperty(property);
ifstream is(queryFile);
string line;
while (getline(is, line)) {
vector<uint8_t> query;
{
vector<string> tokens;
NGT::Common::tokenize(line, tokens, " \t");
for (vector<string>::iterator ti = tokens.begin(); ti != tokens.end(); ++ti) {
query.push_back(NGT::Common::strtol(*ti));
}
query.resize(property.dimension);
cout << "Query : ";
for (size_t i = 0; i < query.size(); i++) {
cout << std::bitset<8>(query[i]) << " ";
}
}
NGT::SearchQuery sc(query);
NGT::ObjectDistances objects;
sc.setResults(&objects);
sc.setSize(10);
sc.setEpsilon(0.2);
index.search(sc);
cout << endl << "Rank\tID\tDistance" << std::showbase << endl;
for (size_t i = 0; i < objects.size(); i++) {
cout << i + 1 << "\t" << objects[i].id << "\t" << objects[i].distance << "\t: ";
NGT::ObjectSpace &objectSpace = index.getObjectSpace();
uint8_t *object = static_cast<uint8_t*>(objectSpace.getObject(objects[i].id));
for (size_t i = 0; i < objectSpace.getDimension(); ++i) {
cout << std::bitset<8>(object[i]) << " ";
}
cout << endl;
}
cout << endl;
}
} catch (NGT::Exception &err) {
cerr << "Error " << err.what() << endl;
return 1;
} catch (...) {
cerr << "Error" << endl;
return 1;
}
return 0;
}
from ngt.
Looks good, thx again!
Btw, ONNG takes a really long time to build, around 15-20 mins on my computer, that's expected. But what about panng? does it build any faster?
from ngt.
PANNG can be built faster than ONNG with the recommended number of edges 100. But, first, you may want to reduce the number of edges of ONNG (ANNG) to build it faster, because 100 edges are too many for most datasets. The number of edges can be specified with NGT::Property::edgeSizeForCreation or -E of ngt create command.
from ngt.
Akemashite omedeto gosaimasu.
Running some more benchmarks, when I construct an ONNG index by setting:
property.graphType = NGT::Property::GraphType::GraphTypeONNG
it actually takes less time than if I construct an ANNG index with 10-20 edges and then run the optimizer over it, which if I understand correctly, constructs a PANNG:
NGT::GraphOptimizer optimizer;
optimizer.numOfOutgoingEdges = 4;
optimizer.numOfIncomingEdges = 8;
optimizer.execute(*index);
Is this correct? Am I misunderstanding something?
Thx as always for your assistance.
from ngt.
あけましておめでとうございます。今年もよろしくお願い致します。
Akemashite omedetou gozaimasu. Kotoshimo yoroshiku onegai itashimasu.
I experimentally added the type ONNG (NGT::Property::GraphType::GraphTypeONNG
) to the index construction in order to directly construct an ONNG index without constructing an ANNG index. Unfortunately, the pseudo ONNG in this way did not achieve the same performance as that of an authentic ONNG index that is constructed from an ANNG index. Thus, I don't recommend you to use an pseudo ONNG index as an authentic ONNG index. However, you might able to use the pseudo ONNG index as an ANNG index, when you construct ONNG. It means that, as you mentioned, first construct an pseudo ONNG, then optimize it to construct ONNG. I expect that the ONNG index optimized from a pseudo ONNG index, which is not a PANNG index, achieves almost the same performance as that of the ONNG index from ANNG.
PANNG can be created with this command.
The index constructed by using the following manner might be almost the same as PANNG. But, I am not so sure.
NGT::GraphOptimizer optimizer;
optimizer.numOfOutgoingEdges = 4;
optimizer.numOfIncomingEdges = INT32_MAX;
optimizer.execute(*index);
from ngt.
Thanks as always for your assistance.
I'm finishing up an academic publication regarding ANN algorithms and their use for Augmented Reality and wanted to publicly thank you for your assistance if u are ok with that.
from ngt.
I'm glad to help you!
from ngt.
Related Issues (20)
- How to create a QBG with Capi ? HOT 7
- Specify num_threads for searching HOT 3
- Linking issue HOT 2
- Add more ngt_insert_index_as_TYPE methods to C API HOT 2
- Feature request: Command line output option that doesn't require intensive deserialization
- Add new QBG methods to C API HOT 7
- Is there any benchmark result for NGT QG/QBG? HOT 4
- file descriptor leak on `index.build_index` HOT 2
- Missing functions and types in the C API HOT 14
- Python bindings for QG/QBG HOT 3
- bugs HOT 3
- Fixed seeds for deterministic results HOT 1
- Building with -DNGT_QBG_DISABLED=ON still trying to link with LAPACK and BLAS HOT 2
- How to update NGT from older version to new one? HOT 2
- pip install in python==3.11 fails : could not find a version that satisfies HOT 4
- Running sample codes (e.g., qg-l2-float) HOT 6
- Remove vector causes error, only for "Normalised" distance types: Index.h:remove:1544: Not found the specified id HOT 7
- Quantization in qg-l2-float.cpp HOT 1
- colab import ngt dose not work HOT 1
- Paper and references for QG HOT 1
Recommend Projects
-
React
A declarative, efficient, and flexible JavaScript library for building user interfaces.
-
Vue.js
🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.
-
Typescript
TypeScript is a superset of JavaScript that compiles to clean JavaScript output.
-
TensorFlow
An Open Source Machine Learning Framework for Everyone
-
Django
The Web framework for perfectionists with deadlines.
-
Laravel
A PHP framework for web artisans
-
D3
Bring data to life with SVG, Canvas and HTML. 📊📈🎉
-
Recommend Topics
-
javascript
JavaScript (JS) is a lightweight interpreted programming language with first-class functions.
-
web
Some thing interesting about web. New door for the world.
-
server
A server is a program made to process requests and deliver data to clients.
-
Machine learning
Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.
-
Visualization
Some thing interesting about visualization, use data art
-
Game
Some thing interesting about game, make everyone happy.
Recommend Org
-
Facebook
We are working to build community through open source technology. NB: members must have two-factor auth.
-
Microsoft
Open source projects and samples from Microsoft.
-
Google
Google ❤️ Open Source for everyone.
-
Alibaba
Alibaba Open Source for everyone
-
D3
Data-Driven Documents codes.
-
Tencent
China tencent open source team.
from ngt.