Comments (12)
This has been merged (94a444f) and will be included in 0.7.0 (#508). Thanks @seancarroll for the suggestion!
from pgvector.
@ankane I'm not sure how you measure demand, but this is also something we're interested in for the same use case (image hash similarity search).
from pgvector.
@ankane this is something I am very interested in as well.
from pgvector.
The above query does similarity search for Hamming distance on 256 bit vectors (which is the hash size produced by PDQ). It gets the 10 nearest neighbors, but you could also add a WHERE
clause to filter by distance as well.
SELECT id FROM items
WHERE bit_count(image_hash # $1) <= 10 -- threshold distance D
ORDER BY bit_count(image_hash # $1) -- similarity search
LIMIT 10; -- N number of results
(replace $1
with the query hash)
from pgvector.
Binary text embedding models appear to have had a big uptick in capability recently. Index support for hamming distance search would be a major step
from pgvector.
Support for HNSW indexing is available in the bitvector
branch for anyone who wants to try it (in a non-production environment).
CREATE TABLE items (id bigserial PRIMARY KEY, embedding bit(3));
INSERT INTO items (embedding) VALUES (B'000'), (B'111');
CREATE INDEX ON items USING hnsw (embedding bit_hamming_ops);
SELECT * FROM items ORDER BY embedding <~> B'101' LIMIT 5;
from pgvector.
Hi @seancarroll, thanks for the suggestion. Added an initial version (for exact search) in the hamming-distance branch.
from pgvector.
It looks like it's currently possible to do exact search with bit strings (without pgvector), fwiw.
CREATE FUNCTION hamming_distance(a varbit, b varbit) RETURNS float8
LANGUAGE SQL IMMUTABLE STRICT PARALLEL SAFE
RETURN bit_count(a # b);
SELECT hamming_distance(B'1111111100000000', B'1111000000001111');
Indexing will take a bit more work (to index non-vector types, but that's already planned), but would like to see how much demand there is for this first.
from pgvector.
Hi @ankane thanks for the follow up. Yea I'm interested in the indexing use case. For the use case I was considering it would be indexing a binary vector but good to know indexing on non-vector types is planned.
from pgvector.
It might be worth trying exact search depending on the size of your data.
CREATE TABLE items (id bigserial PRIMARY KEY, image_hash bit(256));
INSERT INTO items (image_hash) SELECT i::bit(256) FROM generate_series(1, 1000000) i;
SELECT id FROM items ORDER BY bit_count(image_hash # 500000::bit(256)) LIMIT 10;
The SELECT
query completes in under 70 ms on my machine with 1 million hashes.
from pgvector.
Sorry I should have been clearer in the description of the use case. What you have certainly works great for an exact search but the target use case would be for a similarity search. One of the interesting features of PDQ hashes is that perceptually similar images produce similar outputs and their similarity can be determined via hamming distance.
So the use case would be indexing a set of images and then being able to do a similarity query to do something like give me N number of results within some threshold distance D from this vector
from pgvector.
You're absolutely right. I think an index might still be valuable but I'm no expert. Providing crude numbers on my mac 1m returns within 300 ms or so while 10mil takes roughly 4.5sec. Granted there are likely ways to get some perf boot. Anywho, I appreciate all the details you provided
from pgvector.
Related Issues (20)
- how to list existing databases in postgresql using python HOT 2
- HNSW Indexing and Filtering HOT 2
- A question about building index in background. HOT 1
- Installation instructions unclear HOT 1
- Large vector data type will cause performance decline? HOT 1
- A question regard table_open() in background worker when building index HOT 3
- jVector Implementation
- Type Error when working with Langchain (Missing Positional Argument: evalue) HOT 1
- pgvector still use row-based storage instead of columnar storage ? HOT 1
- Can't get the query planner to use HNSW index HOT 3
- 【search failed】 2000w、768dim, data search failed HOT 1
- ERROR: index row size 6160 exceeds btree version 4 maximum 2704 for index HOT 3
- Make difficulties HOT 2
- Table Insert Performance with HNSW Index HOT 3
- Comparison with high-precision data HOT 2
- Weight in the filters HOT 5
- can't make pgvector HOT 1
- src\bitvec.c(43): warning C4141: 'dllexport': used more than once HOT 6
- Porting indexes from pinecone to pgvector HOT 1
- Error when creating a halfvec_ip_ops index HOT 3
Recommend Projects
-
React
A declarative, efficient, and flexible JavaScript library for building user interfaces.
-
Vue.js
🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.
-
Typescript
TypeScript is a superset of JavaScript that compiles to clean JavaScript output.
-
TensorFlow
An Open Source Machine Learning Framework for Everyone
-
Django
The Web framework for perfectionists with deadlines.
-
Laravel
A PHP framework for web artisans
-
D3
Bring data to life with SVG, Canvas and HTML. 📊📈🎉
-
Recommend Topics
-
javascript
JavaScript (JS) is a lightweight interpreted programming language with first-class functions.
-
web
Some thing interesting about web. New door for the world.
-
server
A server is a program made to process requests and deliver data to clients.
-
Machine learning
Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.
-
Visualization
Some thing interesting about visualization, use data art
-
Game
Some thing interesting about game, make everyone happy.
Recommend Org
-
Facebook
We are working to build community through open source technology. NB: members must have two-factor auth.
-
Microsoft
Open source projects and samples from Microsoft.
-
Google
Google ❤️ Open Source for everyone.
-
Alibaba
Alibaba Open Source for everyone
-
D3
Data-Driven Documents codes.
-
Tencent
China tencent open source team.
from pgvector.