Would be interested to see pgvector support hamming distance. An example of an existin

This has been merged (<a class="commit-link" data-hovercard-type="commit" data-hoverca

<a class="user-mention notranslate" data-hovercard-type="user" data-hovercard-url="/us

Hi <a class="user-mention notranslate" data-hovercard-type="user" data-hovercard-url="

Support For Hamming Distance about pgvector HOT 12 CLOSED

seancarroll commented on June 14, 2024 6

Support For Hamming Distance

from pgvector.

Comments (12)

ankane commented on June 14, 2024 9

This has been merged (94a444f) and will be included in 0.7.0 (#508). Thanks @seancarroll for the suggestion!

from pgvector.

dannyseismic commented on June 14, 2024 5

@ankane I'm not sure how you measure demand, but this is also something we're interested in for the same use case (image hash similarity search).

from pgvector.

pshrest2 commented on June 14, 2024 4

@ankane this is something I am very interested in as well.

from pgvector.

ankane commented on June 14, 2024 1

The above query does similarity search for Hamming distance on 256 bit vectors (which is the hash size produced by PDQ). It gets the 10 nearest neighbors, but you could also add a WHERE clause to filter by distance as well.

SELECT id FROM items
    WHERE bit_count(image_hash # $1) <= 10  -- threshold distance D
    ORDER BY bit_count(image_hash # $1)     -- similarity search
    LIMIT 10;                               -- N number of results

(replace $1 with the query hash)

from pgvector.

a-mckinley commented on June 14, 2024 1

Binary text embedding models appear to have had a big uptick in capability recently. Index support for hamming distance search would be a major step

from pgvector.

ankane commented on June 14, 2024 1

Support for HNSW indexing is available in the bitvector branch for anyone who wants to try it (in a non-production environment).

CREATE TABLE items (id bigserial PRIMARY KEY, embedding bit(3));
INSERT INTO items (embedding) VALUES (B'000'), (B'111');
CREATE INDEX ON items USING hnsw (embedding bit_hamming_ops);
SELECT * FROM items ORDER BY embedding <~> B'101' LIMIT 5;

from pgvector.

ankane commented on June 14, 2024

Hi @seancarroll, thanks for the suggestion. Added an initial version (for exact search) in the hamming-distance branch.

from pgvector.

ankane commented on June 14, 2024

It looks like it's currently possible to do exact search with bit strings (without pgvector), fwiw.

CREATE FUNCTION hamming_distance(a varbit, b varbit) RETURNS float8
    LANGUAGE SQL IMMUTABLE STRICT PARALLEL SAFE
    RETURN bit_count(a # b);

SELECT hamming_distance(B'1111111100000000', B'1111000000001111');

Indexing will take a bit more work (to index non-vector types, but that's already planned), but would like to see how much demand there is for this first.

from pgvector.

seancarroll commented on June 14, 2024

Hi @ankane thanks for the follow up. Yea I'm interested in the indexing use case. For the use case I was considering it would be indexing a binary vector but good to know indexing on non-vector types is planned.

from pgvector.

ankane commented on June 14, 2024

It might be worth trying exact search depending on the size of your data.

CREATE TABLE items (id bigserial PRIMARY KEY, image_hash bit(256));
INSERT INTO items (image_hash) SELECT i::bit(256) FROM generate_series(1, 1000000) i;
SELECT id FROM items ORDER BY bit_count(image_hash # 500000::bit(256)) LIMIT 10;

The SELECT query completes in under 70 ms on my machine with 1 million hashes.

from pgvector.

seancarroll commented on June 14, 2024

Sorry I should have been clearer in the description of the use case. What you have certainly works great for an exact search but the target use case would be for a similarity search. One of the interesting features of PDQ hashes is that perceptually similar images produce similar outputs and their similarity can be determined via hamming distance.

So the use case would be indexing a set of images and then being able to do a similarity query to do something like give me N number of results within some threshold distance D from this vector

from pgvector.

seancarroll commented on June 14, 2024

You're absolutely right. I think an index might still be valuable but I'm no expert. Providing crude numbers on my mac 1m returns within 300 ms or so while 10mil takes roughly 4.5sec. Granted there are likely ways to get some perf boot. Anywho, I appreciate all the details you provided

from pgvector.

Support For Hamming Distance about pgvector HOT 12 CLOSED

Comments (12)

Related Issues (20)

Recommend Projects

React

Vue.js

Typescript

TensorFlow

Django

Laravel

D3

Recommend Topics

javascript

web

server

Machine learning

Visualization

Game

Recommend Org

Facebook

Microsoft

Google

Alibaba

D3

Tencent