Coder Social home page Coder Social logo

Comments (12)

ankane avatar ankane commented on June 14, 2024 9

This has been merged (94a444f) and will be included in 0.7.0 (#508). Thanks @seancarroll for the suggestion!

from pgvector.

dannyseismic avatar dannyseismic commented on June 14, 2024 5

@ankane I'm not sure how you measure demand, but this is also something we're interested in for the same use case (image hash similarity search).

from pgvector.

pshrest2 avatar pshrest2 commented on June 14, 2024 4

@ankane this is something I am very interested in as well.

from pgvector.

ankane avatar ankane commented on June 14, 2024 1

The above query does similarity search for Hamming distance on 256 bit vectors (which is the hash size produced by PDQ). It gets the 10 nearest neighbors, but you could also add a WHERE clause to filter by distance as well.

SELECT id FROM items
    WHERE bit_count(image_hash # $1) <= 10  -- threshold distance D
    ORDER BY bit_count(image_hash # $1)     -- similarity search
    LIMIT 10;                               -- N number of results

(replace $1 with the query hash)

from pgvector.

a-mckinley avatar a-mckinley commented on June 14, 2024 1

Binary text embedding models appear to have had a big uptick in capability recently. Index support for hamming distance search would be a major step

from pgvector.

ankane avatar ankane commented on June 14, 2024 1

Support for HNSW indexing is available in the bitvector branch for anyone who wants to try it (in a non-production environment).

CREATE TABLE items (id bigserial PRIMARY KEY, embedding bit(3));
INSERT INTO items (embedding) VALUES (B'000'), (B'111');
CREATE INDEX ON items USING hnsw (embedding bit_hamming_ops);
SELECT * FROM items ORDER BY embedding <~> B'101' LIMIT 5;

from pgvector.

ankane avatar ankane commented on June 14, 2024

Hi @seancarroll, thanks for the suggestion. Added an initial version (for exact search) in the hamming-distance branch.

from pgvector.

ankane avatar ankane commented on June 14, 2024

It looks like it's currently possible to do exact search with bit strings (without pgvector), fwiw.

CREATE FUNCTION hamming_distance(a varbit, b varbit) RETURNS float8
    LANGUAGE SQL IMMUTABLE STRICT PARALLEL SAFE
    RETURN bit_count(a # b);

SELECT hamming_distance(B'1111111100000000', B'1111000000001111');

Indexing will take a bit more work (to index non-vector types, but that's already planned), but would like to see how much demand there is for this first.

from pgvector.

seancarroll avatar seancarroll commented on June 14, 2024

Hi @ankane thanks for the follow up. Yea I'm interested in the indexing use case. For the use case I was considering it would be indexing a binary vector but good to know indexing on non-vector types is planned.

from pgvector.

ankane avatar ankane commented on June 14, 2024

It might be worth trying exact search depending on the size of your data.

CREATE TABLE items (id bigserial PRIMARY KEY, image_hash bit(256));
INSERT INTO items (image_hash) SELECT i::bit(256) FROM generate_series(1, 1000000) i;
SELECT id FROM items ORDER BY bit_count(image_hash # 500000::bit(256)) LIMIT 10;

The SELECT query completes in under 70 ms on my machine with 1 million hashes.

from pgvector.

seancarroll avatar seancarroll commented on June 14, 2024

Sorry I should have been clearer in the description of the use case. What you have certainly works great for an exact search but the target use case would be for a similarity search. One of the interesting features of PDQ hashes is that perceptually similar images produce similar outputs and their similarity can be determined via hamming distance.

So the use case would be indexing a set of images and then being able to do a similarity query to do something like give me N number of results within some threshold distance D from this vector

from pgvector.

seancarroll avatar seancarroll commented on June 14, 2024

You're absolutely right. I think an index might still be valuable but I'm no expert. Providing crude numbers on my mac 1m returns within 300 ms or so while 10mil takes roughly 4.5sec. Granted there are likely ways to get some perf boot. Anywho, I appreciate all the details you provided

from pgvector.

Related Issues (20)

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.