jbellis / jvector
JVector: the most advanced embedded vector search engine
License: Apache License 2.0
Recall should be ~0.975 for in-memory and ~0.96 for from-disk (with PQ)
I think it's not necessary for Vamana single-level graphs
This is some Maven refactoring that will involve:
a core directory, which will hold the library classes and related unit tests
an example directory, which will hold the current Bench and Sift classes
Both subdirectories will need child poms.
Mostly this means merging the PQ branch of hnswrecall.
If withoutExactOrdering is specified, don't bother loading the vectors from disk or re-sorting at the end.
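A minimal sketch of the intended behavior, assuming withoutExactOrdering is a boolean search option; the reranking helpers below are placeholders, not the actual API:

// Sketch: skip the exact re-rank pass when the caller opted out of exact ordering
if (withoutExactOrdering) {
    return approximateResults; // scored with PQ distances only; no vectors read from disk
}
// otherwise, load the full vectors and re-sort the candidates by exact distance
return rerankExact(approximateResults, exactScoreFunction);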
We do not currently produce a javadoc jar. Quick attempt to add one surfaces many issues building javadocs. Resolve these and get the jar produced by packaging.
This will require moving to MemorySegments and FloatBuffers.
Then C code bound via jextract.
We can expand on it as we go.
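As a rough sketch of the MemorySegment direction (the foreign memory API only finalized in later JDKs, so method names differ slightly across preview versions; the layout and allocation choices here are assumptions):

import java.lang.foreign.Arena;
import java.lang.foreign.MemorySegment;
import java.lang.foreign.ValueLayout;

// Copy a float[] vector into off-heap memory so jextract-generated bindings
// can hand it to native code without going through JNI copies.
try (Arena arena = Arena.ofConfined()) {
    float[] vector = {0.1f, 0.2f, 0.3f};
    MemorySegment segment = arena.allocate(ValueLayout.JAVA_FLOAT, vector.length);
    MemorySegment.copy(vector, 0, segment, ValueLayout.JAVA_FLOAT, 0, vector.length);
    float first = segment.getAtIndex(ValueLayout.JAVA_FLOAT, 0); // reads 0.1f back
}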
$ ./mvnw exec:exec@bench
11:14:40,831 [INFO] Scanning for projects...
11:14:40,876 [INFO]
11:14:40,876 [INFO] ---------------------< com.github.jbellis:jvector >---------------------
11:14:40,877 [INFO] Building jvector 1.0-SNAPSHOT
11:14:40,877 [INFO] --------------------------------[ jar ]---------------------------------
11:14:40,907 [INFO]
11:14:40,907 [INFO] --- exec-maven-plugin:3.1.0:exec (bench) @ jvector ---
WARNING: Using incubator modules: jdk.incubator.vector
Heap space available is 34359738368
SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".
SLF4J: Defaulting to no-operation (NOP) logger implementation
SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further details.
Exception in thread "main" io.jhdf.exceptions.HdfException: Failed to open file '/home/jonathan/Projects/jvector/hdf5/nytimes-256-angular.hdf5' . Is it a HDF5 file?
at io.jhdf.HdfFile.<init>(HdfFile.java:228)
at com.github.jbellis.jvector.example.Bench.load(Bench.java:149)
at com.github.jbellis.jvector.example.Bench.gridSearch(Bench.java:245)
at com.github.jbellis.jvector.example.Bench.main(Bench.java:232)
Caused by: java.nio.file.NoSuchFileException: hdf5/nytimes-256-angular.hdf5
at java.base/sun.nio.fs.UnixException.translateToIOException(UnixException.java:92)
at java.base/sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:106)
at java.base/sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:111)
at java.base/sun.nio.fs.UnixFileSystemProvider.newFileChannel(UnixFileSystemProvider.java:224)
at java.base/java.nio.channels.FileChannel.open(FileChannel.java:308)
at java.base/java.nio.channels.FileChannel.open(FileChannel.java:367)
at io.jhdf.HdfFile.<init>(HdfFile.java:188)
... 3 more
11:14:41,055 [ERROR] Command execution failed.
I note that mvnw clean install works and tests pass.
This is a cosmetic thing, but always good to avoid confusion.
CachingGraphIndex.BFS_DISTANCE is a parameter used to populate GraphCache. Currently, we are actually performing DFS with a stack (the recursive call stack), rather than BFS with a queue.
Immediate possible "fixes":
Rename BFS_DISTANCE to DFS_DISTANCE (or more generically just DISTANCE) and keep the current DFS implementation
Perform an actual BFS, using a queue of (remaining_steps, node) pairs
Also, one other minor thing: the set of nodes that are distance = 0 from the entry node should correspond to precisely the entry node, so this check should probably be a strict < 0 check.
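If we do want a true BFS, a minimal sketch with a queue of (remaining_steps, node) pairs could look like the following (cacheNode and neighborsOf are placeholders, not the actual GraphCache API):

import java.util.ArrayDeque;
import java.util.HashSet;

// Sketch: cache nodes breadth-first out to DISTANCE hops from the entry node
var queue = new ArrayDeque<int[]>();   // each entry is {remainingSteps, node}
var visited = new HashSet<Integer>();
queue.add(new int[] {DISTANCE, entryNode});
visited.add(entryNode);
while (!queue.isEmpty()) {
    int[] entry = queue.poll();
    int remaining = entry[0];
    int node = entry[1];
    cacheNode(node);
    if (remaining <= 0) {
        continue; // distance 0 caches only the entry node itself
    }
    for (int neighbor : neighborsOf(node)) {
        if (visited.add(neighbor)) {
            queue.add(new int[] {remaining - 1, neighbor});
        }
    }
}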
$ ./mvnw exec:exec@bench
10:58:11,915 [INFO] Scanning for projects...
10:58:12,000 [INFO]
10:58:12,000 [INFO] ---------------------< com.github.jbellis:jvector >---------------------
10:58:12,002 [INFO] Building jvector 1.0-SNAPSHOT
10:58:12,002 [INFO] --------------------------------[ jar ]---------------------------------
10:58:12,069 [INFO]
10:58:12,070 [INFO] --- exec-maven-plugin:3.1.0:exec (bench) @ jvector ---
Error occurred during initialization of boot layer
java.lang.module.FindException: Module jdk.incubator.vector not found
10:58:12,220 [ERROR] Command execution failed.
$ ./mvnw -v
Apache Maven 3.6.3 (cecedd343002696d0abb50b32b541b8a6ba2883f)
Maven home: /home/jonathan/.m2/wrapper/dists/apache-maven-3.6.3-bin/1iopthnavndlasol9gbrbg6bf2/apache-maven-3.6.3
Java version: 20.0.1, vendor: Oracle Corporation, runtime: /home/jonathan/.jdks/openjdk-20.0.1
$ java -version
openjdk version "11.0.19" 2023-04-18
$ env |grep JAVA
JAVA_HOME=/home/jonathan/.jdks/openjdk-20.0.1/
I believe the problem is that pom.xml executes java
instead of using JAVA_HOME. (As the output above shows, mvnw itself is using JAVA_HOME correctly.)
Currently we log a warning when it's NOT enabled but nothing when it IS enabled (so running with it enabled is indistinguishable from running on an unsupported JDK, which is also silent).
In Lucene it looks like this:
INFO: Java vector incubator API enabled; uses preferredBitSize=256
Not sure if this is a big ask or a small one. If it's a lot of work we can skip it.
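For what it's worth, a minimal sketch of the check (only the SPECIES_PREFERRED call is real API; the logger and where the message should live are assumptions):

import jdk.incubator.vector.FloatVector;

// Log once at startup when the incubator module is present and actually in use
int preferredBits = FloatVector.SPECIES_PREFERRED.vectorBitSize();
logger.info("Java vector incubator API enabled; preferredBitSize={}", preferredBits);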
See commit history for where NBHM got added.
Will probably need to test on low dimension vectors (i.e. random data, not the grid search) so that vector comparison time doesn't swamp the Map in the measurement.
It may be useful to look at https://github.com/datastax/cassandra/tree/diskann, but this only added PQ to the existing HNSW structure. What we want to do here is implement full diskann, which means saving vectors as part of the graph structure, and fetching them as we perform BFS to minimize "seeks" (mmap paging-in) compared to reading them separately at the end of the search to re-rank in exact order.
Will also need to add an isApproximate() flag (or the opposite, isExact()) so that search knows when it needs to do the re-rank. I think the right place for this is NeighborSimilarity.
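Roughly what I have in mind, as a hedged sketch (NeighborSimilarity exists, but the method shapes below are assumptions):

// Sketch: let the scoring abstraction declare whether its scores are approximate,
// so search knows when a final exact re-rank pass is required.
interface NeighborSimilarity {
    float score(int node);

    /** true if score() returns exact distances; false for PQ/compressed approximations */
    default boolean isExact() {
        return true;
    }
}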
Currently we stop after max iterations or when < 1% of points change centroids.
Changed centroids is technically a bad measurement since the centroids themselves move. But maybe it's close enough?
Alternatively, the "correct" way is to check the residual distance from vectors to centroid.
We should find out
(0) How good does kmeans have to be to achieve the recall that we're looking for? Can we do fewer iterations and be effectively unchanged?
(1) What % of vectors are still changing centroids at that point?
(2) Is residual distance materially better in practice, given that it's more expensive to compute?
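For (2), a sketch of the residual-distance convergence check (field names and the epsilon threshold are assumptions):

// Sketch: declare convergence when the mean squared distance from each vector
// to its assigned centroid stops improving by more than EPSILON.
double totalResidual = 0;
for (int i = 0; i < vectors.length; i++) {
    totalResidual += squaredDistance(vectors[i], centroids[assignments[i]]);
}
double meanResidual = totalResidual / vectors.length;
boolean converged = (previousResidual - meanResidual) / previousResidual < EPSILON;
previousResidual = meanResidual;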
It sure looks to me like we retain no references to the DataSet after we're done with it, but on a 16GB heap I reliably enter GC hell by the 4th dataset. (In my current order, the 4th is Sift, which is a smaller set of data than Glove-200, which comes 3rd. So it's not just increasing data size.)
Means we'll need a new container class for the results, which makes me sad, but I think this will be important to expose to C* metrics
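Something like the following record would do it (the names are placeholders, not a committed API):

// Sketch: bundle the neighbors with the work done to find them, so callers
// (e.g. Cassandra metrics) can report visited-node counts per query.
record SearchResult(NodeScore[] nodes, int visitedCount) {
    record NodeScore(int node, float score) {}
}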
HNSW (concurrent5 branch + hnswrecall)
hdf5/nytimes-256-angular.hdf5: 289761 base and 9991 query vectors loaded, dimensions 256
HNSW M=16 ef=100: top 100/1 recall 0.7170, build 37.37s, query 11.68s. 209069030 nodes visited
hdf5/glove-100-angular.hdf5: 1183514 base and 10000 query vectors loaded, dimensions 100
HNSW M=16 ef=100: top 100/1 recall 0.6954, build 81.76s, query 8.32s. 241306820 nodes visited
hdf5/glove-200-angular.hdf5: 1183514 base and 10000 query vectors loaded, dimensions 200
HNSW M=16 ef=100: top 100/1 recall 0.6329, build 142.23s, query 12.33s. 251840840 nodes visited
JVector
hdf5/nytimes-256-angular.hdf5: 289761 base and 9991 query vectors loaded, dimensions 256
Index M=16 ef=100: top 100/1 recall 0.6393, build 45.96s, query 10.84s. 165252200 nodes visited
hdf5/glove-100-angular.hdf5: 1183514 base and 10000 query vectors loaded, dimensions 100
Index M=16 ef=100: top 100/1 recall 0.5328, build 133.32s, query 9.93s. 229974700 nodes visited
hdf5/glove-200-angular.hdf5: 1183514 base and 10000 query vectors loaded, dimensions 200
Index M=16 ef=100: top 100/1 recall 0.3820, build 210.34s, query 11.57s. 183856980 nodes visited
Something like @PooledFloat and @UnpooledFloat (or @UnsharedFloat / @UniqueFloat / @DistinctFloat; naming things is hard). Go through the RAVV implementations and PQ methods and annotate them appropriately.
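Whatever names win, the annotations themselves are trivial; a sketch with hypothetical names:

import java.lang.annotation.Documented;

/** Marks a method whose returned float[] is a shared/pooled buffer the caller must not retain. */
@Documented
@interface PooledFloat {}

/** Marks a method whose returned float[] is freshly allocated and safe to retain. */
@Documented
@interface UnpooledFloat {}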
Interested to see how much accuracy we lose if we do 4-bit PQ (two PQ entries per byte) instead of 8-bit.
Thinking of OSS C* here.
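The packing itself is straightforward; a sketch, assuming each subspace has at most 16 centroids so every code fits in 4 bits:

// Sketch: pack two 4-bit PQ codes per byte (each code must be in 0..15)
static byte[] pack4bit(int[] codes) {
    byte[] packed = new byte[(codes.length + 1) / 2];
    for (int i = 0; i < codes.length; i++) {
        int shift = (i & 1) == 0 ? 0 : 4;
        packed[i / 2] |= (codes[i] & 0xF) << shift;
    }
    return packed;
}

// Unpack the i'th code
static int unpack4bit(byte[] packed, int i) {
    int shift = (i & 1) == 0 ? 0 : 4;
    return (packed[i / 2] >> shift) & 0xF;
}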
FullyConnected and RandomlyConnected classes may be useful. VectorCache and CachedLevel will not, since we don't have levels anymore.
(RandomlyConnected is only used in C* VectorCacheTest but we could improve OnDiskHnswGraph by using that too, maybe?)
We should automate the release process once we're comfortable with the results. I'd prefer a workflow that runs when an appropriate tag is pushed, but I'm open to other options.
Right now Bench hardcodes dimensions at
var pqDims = ds.baseVectors.get(0).length / 2;
because with the datasets from ann-benchmarks (nytimes-256, glove-200) this gives virtually no recall loss at overquery=2.
However, OpenAI's embedding vectors are ridiculously overparameterized, so we should be able to get more aggressive there.
length / 4? length / 8? Let's find out.
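A hedged sketch of what the Bench change could look like, sweeping the divisor as part of the grid search (the surrounding Bench plumbing is assumed):

// Sketch: try progressively more aggressive compression ratios instead of hardcoding /2
int originalDims = ds.baseVectors.get(0).length;
for (int divisor : new int[] {2, 4, 8}) {
    int pqDims = originalDims / divisor;
    // build PQ at pqDims, run the query set, and record recall at overquery=2
}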
I remember seeing a 40% improvement from enabling SIMD; we're seeing less now. Did Lucene's hand-unrolled SIMD add value?
Fixing this:
// TODO implement other similarity functions efficiently
(I'm not sure it's possible to do the same for cosine.)
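For dot product, a Panama Vector API sketch of the kind of thing intended (illustrative, not the existing VectorUtil code; euclidean is analogous by accumulating squared differences, while cosine needs three accumulators, which is why it's less obvious the same trick pays off):

import jdk.incubator.vector.FloatVector;
import jdk.incubator.vector.VectorOperators;
import jdk.incubator.vector.VectorSpecies;

// Sketch: SIMD dot product over the widest preferred species, with a scalar tail
static float dotProduct(float[] a, float[] b) {
    VectorSpecies<Float> species = FloatVector.SPECIES_PREFERRED;
    var sum = FloatVector.zero(species);
    int i = 0;
    for (; i <= a.length - species.length(); i += species.length()) {
        var va = FloatVector.fromArray(species, a, i);
        var vb = FloatVector.fromArray(species, b, i);
        sum = va.fma(vb, sum);   // sum += a * b, lane-wise
    }
    float result = sum.reduceLanes(VectorOperators.ADD);
    for (; i < a.length; i++) {  // scalar tail
        result += a[i] * b[i];
    }
    return result;
}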
First off, great work!
It'd be very helpful if there were general documentation which helped map the theory and concepts to the class hierarchy or the main facades.
That could be augmented with more examples, and definitely an API index to browse.
Added failing TestPQ.testSaveLoad to illustrate
Even more important now that enforceMaxConnLimit relies on this too.
Need a quick description of running via the Maven exec plugin that @jkni put in. Will also add a stanza for SiftSmall.
This will make it easier to do PQ against a subset of vectors without reading them all into memory.
GraphIndex.getView returns a View interface that isn't AutoCloseable, but the OnDiskGraphIndex.OnDiskView implementation of View is AutoCloseable. This means that users of the GraphIndex.getView method can't do:
try (var view = graph.getView())
{
...
}
This is easy to work around but annoying.
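One possible fix, sketched here rather than decided: make View itself AutoCloseable with a no-op default close, so the try-with-resources above compiles against the interface:

public interface View extends AutoCloseable {
    // ... existing View methods ...

    /** in-memory views have nothing to release */
    @Override
    default void close() {}
}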
This better aligns with the artifact groupId.
Before 2b03f5f:
hdf5/nytimes-256-angular.hdf5: 289761 base and 9991 query vectors loaded, dimensions 256
PQ@128 build 32.82s,
PQ encode 14.51s,
Index M=8 ef=60 PQ=false: top 100/1 recall 0.5968, build 8.41s, query 3.61s. -1 nodes visited
Index M=8 ef=60 PQ=true: top 100/1 recall 0.5897, build 8.41s, query 9.32s. -1 nodes visited
After:
hdf5/nytimes-256-angular.hdf5: 289761 base and 9991 query vectors loaded, dimensions 256
PQ@128 build 20.01s,
PQ encode 1.65s,
Index M=8 ef=60 PQ=false: top 100/1 recall 0.5927, build 8.31s, query 3.65s. -1 nodes visited
Index M=8 ef=60 PQ=true: top 100/1 recall 0.0005, build 8.31s, query 10.49s. -1 nodes visited
So if you don't run "clean install" manually first, you will either get a ClassNotFoundException (moderately confusing and bad) or it will silently run the last version that you built (very, very bad).
Need to use the actual Lucene disk classes for this. Have sample code deep in old vsearch history; will dig it up. Using my concurrent5 Lucene branch is kind of cheating, but I think it's okay versus spending 32x as much time waiting for serial Lucene to build the graph; the actual graphs produced are nearly equivalent.
Goal is to build a Deep100M index on a machine with enough memory (64GB should be enough) and then serve it from disk on either a smaller VM or a container. I could be wrong, but I don't think we can keep it from using all the RAM as disk cache otherwise.