jbellis / jvector
JVector: the most advanced embedded vector search engine
License: Apache License 2.0
Recall should be ~0.975 for in-memory and ~0.96 for from-disk (with PQ)
I think it's not necessary for Vamana single-level graphs
This is some Maven refactoring that will involve:
a core directory, which will hold the library classes and related unit tests
an example directory, which will hold the current Bench and Sift classes
Both subdirectories will need child poms.
Mostly this means merging the PQ branch of hnswrecall.
If withoutExactOrdering is specified, don't bother loading the vectors from disk or re-sorting at the end.
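A minimal sketch of the intended behavior, assuming withoutExactOrdering is a boolean search option; the reranking helpers below are placeholders, not the actual API:

// Sketch: skip the exact re-rank pass when the caller opted out of exact ordering
if (withoutExactOrdering) {
    return approximateResults; // scored with PQ distances only; no vectors read from disk
}
// otherwise, load the full vectors and re-sort the candidates by exact distance
return rerankExact(approximateResults, exactScoreFunction);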
We do not currently produce a javadoc jar. Quick attempt to add one surfaces many issues building javadocs. Resolve these and get the jar produced by packaging.
This will require moving to MemorySegments and FloatBuffers.
Then C code bound via jextract.
We can expand on it as we go.
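As a rough sketch of the MemorySegment direction (the foreign memory API only finalized in later JDKs, so method names differ slightly across preview versions; the layout and allocation choices here are assumptions):

import java.lang.foreign.Arena;
import java.lang.foreign.MemorySegment;
import java.lang.foreign.ValueLayout;

// Copy a float[] vector into off-heap memory so jextract-generated bindings
// can hand it to native code without going through JNI copies.
try (Arena arena = Arena.ofConfined()) {
    float[] vector = {0.1f, 0.2f, 0.3f};
    MemorySegment segment = arena.allocate(ValueLayout.JAVA_FLOAT, vector.length);
    MemorySegment.copy(vector, 0, segment, ValueLayout.JAVA_FLOAT, 0, vector.length);
    float first = segment.getAtIndex(ValueLayout.JAVA_FLOAT, 0); // reads 0.1f back
}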
$ ./mvnw exec:exec@bench
11:14:40,831 [INFO] Scanning for projects...
11:14:40,876 [INFO]
11:14:40,876 [INFO] ---------------------< com.github.jbellis:jvector >---------------------
11:14:40,877 [INFO] Building jvector 1.0-SNAPSHOT
11:14:40,877 [INFO] --------------------------------[ jar ]---------------------------------
11:14:40,907 [INFO]
11:14:40,907 [INFO] --- exec-maven-plugin:3.1.0:exec (bench) @ jvector ---
WARNING: Using incubator modules: jdk.incubator.vector
Heap space available is 34359738368
SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".
SLF4J: Defaulting to no-operation (NOP) logger implementation
SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further details.
Exception in thread "main" io.jhdf.exceptions.HdfException: Failed to open file '/home/jonathan/Projects/jvector/hdf5/nytimes-256-angular.hdf5' . Is it a HDF5 file?
at io.jhdf.HdfFile.<init>(HdfFile.java:228)
at com.github.jbellis.jvector.example.Bench.load(Bench.java:149)
at com.github.jbellis.jvector.example.Bench.gridSearch(Bench.java:245)
at com.github.jbellis.jvector.example.Bench.main(Bench.java:232)
Caused by: java.nio.file.NoSuchFileException: hdf5/nytimes-256-angular.hdf5
at java.base/sun.nio.fs.UnixException.translateToIOException(UnixException.java:92)
at java.base/sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:106)
at java.base/sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:111)
at java.base/sun.nio.fs.UnixFileSystemProvider.newFileChannel(UnixFileSystemProvider.java:224)
at java.base/java.nio.channels.FileChannel.open(FileChannel.java:308)
at java.base/java.nio.channels.FileChannel.open(FileChannel.java:367)
at io.jhdf.HdfFile.<init>(HdfFile.java:188)
... 3 more
11:14:41,055 [ERROR] Command execution failed.
I note that mvnw clean install works and tests pass.
This is a cosmetic thing, but always good to avoid confusion.
CachingGraphIndex.BFS_DISTANCE is a parameter used to populate GraphCache. Currently, we are actually performing DFS with a stack (the recursive call stack), rather than BFS with a queue.
Immediate possible "fixes":
Rename BFS_DISTANCE to DFS_DISTANCE (or more generically just DISTANCE) and keep the current DFS implementation
Perform an actual BFS, using a queue of (remaining_steps, node) pairs
Also, one other minor thing: the set of nodes that are distance = 0 from the entry node should correspond to precisely the entry node, so this check should probably be a strict < 0 check.
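If we do want a true BFS, a minimal sketch with a queue of (remaining_steps, node) pairs could look like the following (cacheNode and neighborsOf are placeholders, not the actual GraphCache API):

import java.util.ArrayDeque;
import java.util.HashSet;

// Sketch: cache nodes breadth-first out to DISTANCE hops from the entry node
var queue = new ArrayDeque<int[]>();   // each entry is {remainingSteps, node}
var visited = new HashSet<Integer>();
queue.add(new int[] {DISTANCE, entryNode});
visited.add(entryNode);
while (!queue.isEmpty()) {
    int[] entry = queue.poll();
    int remaining = entry[0];
    int node = entry[1];
    cacheNode(node);
    if (remaining <= 0) {
        continue; // distance 0 caches only the entry node itself
    }
    for (int neighbor : neighborsOf(node)) {
        if (visited.add(neighbor)) {
            queue.add(new int[] {remaining - 1, neighbor});
        }
    }
}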
$ ./mvnw exec:exec@bench
10:58:11,915 [INFO] Scanning for projects...
10:58:12,000 [INFO]
10:58:12,000 [INFO] ---------------------< com.github.jbellis:jvector >---------------------
10:58:12,002 [INFO] Building jvector 1.0-SNAPSHOT
10:58:12,002 [INFO] --------------------------------[ jar ]---------------------------------
10:58:12,069 [INFO]
10:58:12,070 [INFO] --- exec-maven-plugin:3.1.0:exec (bench) @ jvector ---
Error occurred during initialization of boot layer
java.lang.module.FindException: Module jdk.incubator.vector not found
10:58:12,220 [ERROR] Command execution failed.
$ ./mvnw -v
Apache Maven 3.6.3 (cecedd343002696d0abb50b32b541b8a6ba2883f)
Maven home: /home/jonathan/.m2/wrapper/dists/apache-maven-3.6.3-bin/1iopthnavndlasol9gbrbg6bf2/apache-maven-3.6.3
Java version: 20.0.1, vendor: Oracle Corporation, runtime: /home/jonathan/.jdks/openjdk-20.0.1
$ java -version
openjdk version "11.0.19" 2023-04-18
$ env |grep JAVA
JAVA_HOME=/home/jonathan/.jdks/openjdk-20.0.1/
I believe the problem is that pom.xml executes java
instead of using JAVA_HOME. (As the output above shows, mvnw itself is using JAVA_HOME correctly.)
Currently we log a warning when it's NOT enabled but nothing when it IS enabled (so running with it enabled is indistinguishable from running on an unsupported JDK, which is also silent).
In Lucene it looks like this:
INFO: Java vector incubator API enabled; uses preferredBitSize=256
Not sure if this is a big ask or a small one. If it's a lot of work we can skip it.
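For what it's worth, a minimal sketch of the check (only the SPECIES_PREFERRED call is real API; the logger and where the message should live are assumptions):

import jdk.incubator.vector.FloatVector;

// Log once at startup when the incubator module is present and actually in use
int preferredBits = FloatVector.SPECIES_PREFERRED.vectorBitSize();
logger.info("Java vector incubator API enabled; preferredBitSize={}", preferredBits);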
See commit history for where NBHM got added.
Will probably need to test on low dimension vectors (i.e. random data, not the grid search) so that vector comparison time doesn't swamp the Map in the measurement.
It may be useful to look at https://github.com/datastax/cassandra/tree/diskann, but this only added PQ to the existing HNSW structure. What we want to do here is implement full diskann, which means saving vectors as part of the graph structure, and fetching them as we perform BFS to minimize "seeks" (mmap paging-in) compared to reading them separately at the end of the search to re-rank in exact order.
Will also need to add an isApproximate() flag (or the opposite, isExact()) so that search knows when it needs to do the re-rank. I think the right place for this is NeighborSimilarity.
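Roughly what I have in mind, as a hedged sketch (NeighborSimilarity exists, but the method shapes below are assumptions):

// Sketch: let the scoring abstraction declare whether its scores are approximate,
// so search knows when a final exact re-rank pass is required.
interface NeighborSimilarity {
    float score(int node);

    /** true if score() returns exact distances; false for PQ/compressed approximations */
    default boolean isExact() {
        return true;
    }
}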
Currently we stop after max iterations or when < 1% of points change centroids.
Changed centroids is technically a bad measurement since the centroids themselves move. But maybe it's close enough?
Alternatively, the "correct" way is to check the residual distance from vectors to centroid.
We should find out
(0) How good does kmeans have to be to achieve the recall that we're looking for? Can we do fewer iterations and be effectively unchanged?
(1) What % of vectors are still changing centroids at that point?
(2) Is residual distance materially better in practice, given that it's more expensive to compute?
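For (2), a sketch of the residual-distance convergence check (field names and the epsilon threshold are assumptions):

// Sketch: declare convergence when the mean squared distance from each vector
// to its assigned centroid stops improving by more than EPSILON.
double totalResidual = 0;
for (int i = 0; i < vectors.length; i++) {
    totalResidual += squaredDistance(vectors[i], centroids[assignments[i]]);
}
double meanResidual = totalResidual / vectors.length;
boolean converged = (previousResidual - meanResidual) / previousResidual < EPSILON;
previousResidual = meanResidual;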
It sure looks to me like we retain no references to the DataSet after we're done with it, but on a 16GB heap I reliably enter GC hell by the 4th dataset. (In my current order, the 4th is Sift, which is a smaller set of data than Glove-200, which comes 3rd. So it's not just increasing data size.)
Means we'll need a new container class for the results, which makes me sad, but I think this will be important to expose to C* metrics
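Something like the following record would do it (the names are placeholders, not a committed API):

// Sketch: bundle the neighbors with the work done to find them, so callers
// (e.g. Cassandra metrics) can report visited-node counts per query.
record SearchResult(NodeScore[] nodes, int visitedCount) {
    record NodeScore(int node, float score) {}
}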
HNSW (concurrent5 branch + hnswrecall)
hdf5/nytimes-256-angular.hdf5: 289761 base and 9991 query vectors loaded, dimensions 256
HNSW M=16 ef=100: top 100/1 recall 0.7170, build 37.37s, query 11.68s. 209069030 nodes visited
hdf5/glove-100-angular.hdf5: 1183514 base and 10000 query vectors loaded, dimensions 100
HNSW M=16 ef=100: top 100/1 recall 0.6954, build 81.76s, query 8.32s. 241306820 nodes visited
hdf5/glove-200-angular.hdf5: 1183514 base and 10000 query vectors loaded, dimensions 200
HNSW M=16 ef=100: top 100/1 recall 0.6329, build 142.23s, query 12.33s. 251840840 nodes visited
JVector
hdf5/nytimes-256-angular.hdf5: 289761 base and 9991 query vectors loaded, dimensions 256
Index M=16 ef=100: top 100/1 recall 0.6393, build 45.96s, query 10.84s. 165252200 nodes visited
hdf5/glove-100-angular.hdf5: 1183514 base and 10000 query vectors loaded, dimensions 100
Index M=16 ef=100: top 100/1 recall 0.5328, build 133.32s, query 9.93s. 229974700 nodes visited
hdf5/glove-200-angular.hdf5: 1183514 base and 10000 query vectors loaded, dimensions 200
Index M=16 ef=100: top 100/1 recall 0.3820, build 210.34s, query 11.57s. 183856980 nodes visited
Something like @PooledFloat and @UnpooledFloat (or @UnsharedFloat / @UniqueFloat / @DistinctFloat; naming things is hard). Go through the RAVV implementations and PQ methods and annotate them appropriately.
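Whatever names win, the annotations themselves are trivial; a sketch with hypothetical names:

import java.lang.annotation.Documented;

/** Marks a method whose returned float[] is a shared/pooled buffer the caller must not retain. */
@Documented
@interface PooledFloat {}

/** Marks a method whose returned float[] is freshly allocated and safe to retain. */
@Documented
@interface UnpooledFloat {}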
Interested to see how much accuracy we lose if we do 4-bit PQ (two PQ entries per byte) instead of 8-bit.
Thinking of OSS C* here.
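The packing itself is straightforward; a sketch, assuming each subspace has at most 16 centroids so every code fits in 4 bits:

// Sketch: pack two 4-bit PQ codes per byte (each code must be in 0..15)
static byte[] pack4bit(int[] codes) {
    byte[] packed = new byte[(codes.length + 1) / 2];
    for (int i = 0; i < codes.length; i++) {
        int shift = (i & 1) == 0 ? 0 : 4;
        packed[i / 2] |= (codes[i] & 0xF) << shift;
    }
    return packed;
}

// Unpack the i'th code
static int unpack4bit(byte[] packed, int i) {
    int shift = (i & 1) == 0 ? 0 : 4;
    return (packed[i / 2] >> shift) & 0xF;
}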
FullyConnected and RandomlyConnected classes may be useful. VectorCache and CachedLevel will not, since we don't have levels anymore.
(RandomlyConnected is only used in C* VectorCacheTest but we could improve OnDiskHnswGraph by using that too, maybe?)
We should automate the release process once we're comfortable with the results. I'd prefer a workflow that runs when an appropriate tag is pushed, but I'm open to other options.
Right now Bench hardcodes dimensions at
var pqDims = ds.baseVectors.get(0).length / 2;
because with the datasets from ann-benchmarks (nytimes-256, glove-200) this gives virtually no recall loss at overquery=2.
However, OpenAI's embedding vectors are ridiculously overparameterized, so we should be able to get more aggressive there.
length / 4? length / 8? Let's find out.
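A hedged sketch of what the Bench change could look like, sweeping the divisor as part of the grid search (the surrounding Bench plumbing is assumed):

// Sketch: try progressively more aggressive compression ratios instead of hardcoding /2
int originalDims = ds.baseVectors.get(0).length;
for (int divisor : new int[] {2, 4, 8}) {
    int pqDims = originalDims / divisor;
    // build PQ at pqDims, run the query set, and record recall at overquery=2
}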
I remember seeing a 40% improvement from enabling SIMD; we're seeing less now. Did Lucene's hand-unrolled SIMD add value?
Fixing this:
// TODO implement other similarity functions efficiently
(I'm not sure it's possible to do the same for cosine.)
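For dot product, a Panama Vector API sketch of the kind of thing intended (illustrative, not the existing VectorUtil code; euclidean is analogous by accumulating squared differences, while cosine needs three accumulators, which is why it's less obvious the same trick pays off):

import jdk.incubator.vector.FloatVector;
import jdk.incubator.vector.VectorOperators;
import jdk.incubator.vector.VectorSpecies;

// Sketch: SIMD dot product over the widest preferred species, with a scalar tail
static float dotProduct(float[] a, float[] b) {
    VectorSpecies<Float> species = FloatVector.SPECIES_PREFERRED;
    var sum = FloatVector.zero(species);
    int i = 0;
    for (; i <= a.length - species.length(); i += species.length()) {
        var va = FloatVector.fromArray(species, a, i);
        var vb = FloatVector.fromArray(species, b, i);
        sum = va.fma(vb, sum);   // sum += a * b, lane-wise
    }
    float result = sum.reduceLanes(VectorOperators.ADD);
    for (; i < a.length; i++) {  // scalar tail
        result += a[i] * b[i];
    }
    return result;
}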
First off, great work!
It'd be very helpful if there were general documentation which helped map the theory and concepts to the class hierarchy or the main facades.
That could be augmented with more examples, and definitely an API index to browse.
Added failing TestPQ.testSaveLoad to illustrate
Even more important now that enforceMaxConnLimit relies on this too.
Need a quick description of running via the Maven exec plugin that @jkni put in. Will also add a stanza for SiftSmall.
This will make it easier to do PQ against a subset of vectors without reading them all into memory.
GraphIndex.getView returns a View interface that isn't AutoCloseable, but the OnDiskGraphIndex.OnDiskView implementation of View is AutoCloseable. This means that users of the GraphIndex.getView method can't do:
try (var view = graph.getView())
{
...
}
This is easy to work around but annoying.
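One possible fix, sketched here rather than decided: make View itself AutoCloseable with a no-op default close, so the try-with-resources above compiles against the interface:

public interface View extends AutoCloseable {
    // ... existing View methods ...

    /** in-memory views have nothing to release */
    @Override
    default void close() {}
}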
This better aligns with the artifact groupId.
Before 2b03f5f:
hdf5/nytimes-256-angular.hdf5: 289761 base and 9991 query vectors loaded, dimensions 256
PQ@128 build 32.82s,
PQ encode 14.51s,
Index M=8 ef=60 PQ=false: top 100/1 recall 0.5968, build 8.41s, query 3.61s. -1 nodes visited
Index M=8 ef=60 PQ=true: top 100/1 recall 0.5897, build 8.41s, query 9.32s. -1 nodes visited
After:
hdf5/nytimes-256-angular.hdf5: 289761 base and 9991 query vectors loaded, dimensions 256
PQ@128 build 20.01s,
PQ encode 1.65s,
Index M=8 ef=60 PQ=false: top 100/1 recall 0.5927, build 8.31s, query 3.65s. -1 nodes visited
Index M=8 ef=60 PQ=true: top 100/1 recall 0.0005, build 8.31s, query 10.49s. -1 nodes visited
So if you don't run "clean install" manually first, you will either get a ClassNotFoundException (moderately confusing and bad) or it will silently run the last version that you built (very, very bad).
Need to use the actual Lucene disk classes for this. Have sample code deep in old vsearch history; will dig it up. Using my concurrent5 Lucene branch is kind of cheating, but I think it's okay versus spending 32x as much time waiting for serial Lucene to build the graph; the actual graphs produced are nearly equivalent.
Goal is to build a Deep100M index on a machine with enough memory (64GB should be enough) and then serve it from disk on either a smaller VM or a container. I could be wrong, but I don't think we can keep it from using all the RAM as disk cache otherwise.