Coder Social home page Coder Social logo

pgvector-python's Issues

No operator matches the given name and argument types

When I try to retrieve/query embeddings from db:

from pgvector.psycopg import register_vector
import psycopg
conn = psycopg.connect(dbname='db')
conn.autocommit = True
register_vector(conn)
conn.execute('SELECT * FROM document ORDER BY embedding <=> %s LIMIT 1', (embeddings,))
rows = cur.fetchall()

I get:

UndefinedFunction: operator does not exist: vector <=> double precision[]
LINE 1: SELECT * FROM document ORDER BY embedding <=> $1 LIMIT 1
                                                  ^
HINT:  No operator matches the given name and argument types. You might need to add explicit type casts.

What am I doing wrong ?

lib version:
psycopg 3.1.8
also observed with lib:
psycopg2-binary 2.9.5

Question : can i use 1024 dimensions size vector ?

Hi, I tested your example using psycopg which is 3 dimensions and it works great. But when I tried to insert embedding from fastembed "intfloat/multilingual-e5-large" which is 1024 dimensions it failed
with message

psycopg.ProgrammingError: cannot adapt type '_GenericAlias' using placeholder '%s' (format: AUTO)

Or did I do something wrong ?

Thanks

Rename max_inner_product

Just wondering why the comparator is called max_inner_product. Shouldn't it be negative_inner_product? Seems like it could be confusing. Thanks for making the package!

HINT: No operator matches the given name and argument types. You might need to add explicit type casts.

I have implemented semantic search like so:

upsert

insert_vector_query = "INSERT INTO holidays_item_embeddings (packageId, packageName, url, packageDestination, embedding) VALUES (%s, %s, %s, %s, %s)" 
connection = psycopg2.connect(**db_params)
cursor = connection.cursor()
cursor.execute('CREATE EXTENSION IF NOT EXISTS vector')
register_vector(connection)
cursor.executemany(insert_vector_query, items_to_insert) 

search

import psycopg2
from pgvector.psycopg2 import register_vector

connection = psycopg2.connect(**db_params)
cursor = connection.cursor()
register_vector(connection)

select_query_2 = """
        SELECT * FROM holidays_item_embeddings ORDER BY embedding <-> %s LIMIT 20

I am getting the error: HINT: No operator matches the given name and argument types. You might need to add explicit type casts.
Could I please have some help, I have followed the instructions carefully of this library and am not able to get it working.

Psycopg2 error with example:

I'm using the example code with some modifications for database connection:

import os

import psycopg2
from dotenv import load_dotenv
from pgvector.psycopg2 import register_vector

load_dotenv(".env", override=True)
DBUSER = os.environ["DBUSER"]
DBPASS = os.environ["DBPASS"]
DBHOST = os.environ["DBHOST"]
DBNAME = os.environ["DBNAME"]
# Use SSL if not connecting to localhost
DBSSL = "disable"
if DBHOST != "localhost":
    DBSSL = "require"

conn = psycopg2.connect(database=DBNAME, user=DBUSER, password=DBPASS, host=DBHOST, sslmode=DBSSL)
conn.autocommit = True
cur = conn.cursor()
cur.execute("CREATE EXTENSION IF NOT EXISTS vector")
cur.execute("DROP TABLE IF EXISTS items")
cur.execute("CREATE TABLE items (id bigserial PRIMARY KEY, embedding vector(3));")
register_vector(conn)

cur.execute("CREATE INDEX ON items USING hnsw (embedding vector_l2_ops)")

cur.execute("INSERT INTO items (embedding) VALUES ('[1, 2, 3]'), ('[-1, 1, 3]'), ('[0, -1, -2]');")

embedding = [3, 1, 2]
cur.execute('SELECT * FROM items ORDER BY embedding <-> %s LIMIT 5', (embedding,))
rows = cur.fetchall()
print(rows)

cur.close()

When I run it, I get the following error:

Traceback (most recent call last):
  File "/workspace/examples/psycopg_items.py", line 30, in <module>
    cur.execute('SELECT * FROM items ORDER BY embedding <-> %s LIMIT 5', (embedding,))
psycopg2.errors.UndefinedFunction: operator does not exist: vector <-> integer[]
LINE 1: SELECT * FROM items ORDER BY embedding <-> ARRAY[3,1,2] LIMI...
                                               ^
HINT:  No operator matches the given name and argument types. You might need to add explicit type casts.

possible to use HNSW early?

hey all, just curious I am not sure at what stage the development for django integration of pgvector but is there a version that I can pip install to use the hnsw index early?

How do i pass in np.ndarray with ndim > 1

I'm working on a project based of face recognition where i have to store the face-embedding into postgres DB. the face-recog part returns a numpy nd-array with 120 dimensions. which is the cropped out portion of the face. When i try to store this, pgvector complains that ndim needs to be 1. I cannot find any relevent documentation regarding this. How do i change it? wny help would be appreciated

Ideas

Please create a new issue to discuss any ideas or share your own.

  • Merge SQLModel tests - sqlmodel branch (waiting for SQLAlchemy 2 support)

Add support for printing SQL of SQLAlchemy query

pgvector.sqlalchemy.Vector doesn't provide a process_literal_param method, so any SQLALchemy query containing literal values for such column will fail to be rendered to final SQL including the values when literal_binds are enabled. Please implement the method.

Repro steps:

import sqlalchemy as sa
import sqlalchemy.dialects.postgresql
import pgvector.sqlalchemy as pgv

engine = sa.create_engine("postgresql+psycopg2://postgres:password@localhost:5432/neighbors-2")
conn = engine.connect()
table = sa.Table("foo", sa.MetaData(), sa.Column("col1", pgv.Vector(4)))
query = sa.select(table.c.col1).where((table.c.col1.max_inner_product([0.5, 0.5, 0.5, 0.5]) * -1) >= 0.1)
kwargs = {"literal_binds": True}
sql = str(query.compile(dialect=sqlalchemy.dialects.postgresql.dialect()), compile_kwargs=kwargs)

raises

sqlalchemy.exc.CompileError: No literal value renderer is available for literal value "[0.5, 0.5, 0.5, 0.5]" with datatype VECTOR(4)

(The query itself is OK which can be verified by using kwargs = {} -- the compilation returns 'SELECT foo.col1 \nFROM foo \nWHERE (foo.col1 <#> %(col1_1)s) * %(param_1)s >= %(param_2)s')

Workaround:

import pgvector.utils

class Vector(pgv.Vector, sa.TypeDecorator):
    impl = pgv.Vector
    cache_ok = True

    def process_literal_param(self, value, dialect):
        return repr(pgvector.utils.to_db(value, self.dim))

and use this wrapper class in the table definition instead of the library-provided pgv.Vector. The SQL rendering then works:

str(query.compile(dialect=sqlalchemy.dialects.postgresql.dialect(), compile_kwargs={"literal_binds": True}))
"SELECT foo.col1 \nFROM foo \nWHERE (foo.col1 <#> '[0.5,0.5,0.5,0.5]') * -1 >= 0.1"

How insert embedding data use sqlalchemy session.execute

Hello,

I am currently working on a Python script that involves inserting embedding data into a database. The snippet of the code I’m using is as follows:

session.execute(text(f"INSERT INTO mytable (id, embeddings) VALUES (1, {embedding_data})"))

However, I’ve encountered an error when attempting to pass OpenAI’s embedding object to the embedding_data placeholder. I’d appreciate any guidance on whether I’m using the API correctly. Is it possible to insert an OpenAI embedding object directly, or do I need to perform some type of serialization or conversion before insertion?

Thank you in advance for your help and suggestions.

serialize PGVector for aws lambda

i have the following code

def get_vectorstore(connection_string):
# Initialize the embedding function
embedding_func_instance = MyEmbeddings()

# Initialize the vectorstore from the existing database
vectorstore = PGVector(connection_string=connection_string, embedding_function=embedding_func_instance)
return vectorstore

Unable to marshal response: Object of type PGVector is not JSON serializable

[peewee] using Model.get_or_create() doesn't work with pgvector

Hello!

I'm trying to use the peewee method .get_or_create() in place of .create() but I'm getting the following error:

{...}
  File "/myprj/venv/lib/python3.11/site-packages/pgvector/peewee/__init__.py", line 16, in db_value
    return to_db(value)
           ^^^^^^^^^^^^
  File "/myprj/venv/lib/python3.11/site-packages/pgvector/utils/__init__.py", line 37, in to_db
    return '[' + ','.join([str(float(v)) for v in value]) + ']'
                          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
TypeError: 'float' object is not iterable

Can you help to find out whether is this a bug or something I'm doing wrong?

Thanks!

Django migration to add VectorField unable to find vector.control

I'm not able to get the Django extension working in a project. I used the instructions on the README page, but I'm seeing the following error:

django.db.utils.OperationalError: could not open extension control file "/Library/PostgreSQL/14/share/postgresql/extension/vector.control": No such file or directory

Ideas:

  1. pgvector is not installing correctly for my postgres installation
  2. I'm not apply the first step to "Create the extension" on the README. (I'm interpreting this as "add a migration to the Django app's migrations directory.)
  3. do I need to add anything to settings.py?

Steps I ran:

ran pipenv install pgvector (pipenv b/c I'm using a environment with Piplock).

My migration file:

from django.db import migrations
import pgvector.django
from pgvector.django import VectorExtension

class Migration(migrations.Migration):

    dependencies = [
        ('backend_api', '0003_document_document_faiss_index'),  # a previous migration
    ]

    operations = [
        VectorExtension(),
        migrations.AddField(
            model_name='document',
            name='document_embeddings',
            field=pgvector.django.VectorField(dimensions=1536, null=True),
        ),
    ]

and my models.py file:

class Document(models.Model):
    ...
    document_embeddings = VectorField(dimensions=1536, null=True)
    ...

ran python manage.py migrate and got above error.

How to get distance by Psycopg 2

Hey, Thanks for this amazing project.

There is an example about how to get distance by django and SQLAlchemy but Psycopg2. Could you help about how to get the distance by Psycopg 2? Thanks in advance

SQLAlchemy order_by with ivfflat not returning any results

This is my adapted code based on the README:

from pgvector.sqlalchemy import Vector
from sqlalchemy import Index, create_engine, func, select, text
from sqlalchemy.orm import DeclarativeBase, Mapped, Session, mapped_column

from database import DATABASE_URI


# Define the models
class Base(DeclarativeBase):
    pass


class Item(Base):
    __tablename__ = "items"
    id: Mapped[int] = mapped_column(primary_key=True)
    embedding = mapped_column(Vector(3))


# Connect to the database
engine = create_engine(DATABASE_URI, echo=False)

# Create tables in database
Base.metadata.drop_all(engine)
Base.metadata.create_all(engine)

# Insert data and issue queries
with Session(engine) as session:
    session.execute(text("CREATE EXTENSION IF NOT EXISTS vector"))

    index = Index(
        "my_index",
        Item.embedding,
        postgresql_using="ivfflat",
        postgresql_with={"lists": 100},
        postgresql_ops={"embedding": "vector_l2_ops"},
    )

    index.create(engine)

    session.add_all(
        [
            Item(embedding=[1, 2, 3]),
            Item(embedding=[-1, 1, 3]),
            Item(embedding=[0, -1, -2]),
        ]
    )

    # Print all vectors
    all = session.scalars(select(Item))
    for item in all:
        print(item.embedding)

    # Find 2 closest vectors to [3, 1, 2]
    closest = session.scalars(select(Item).order_by(Item.embedding.l2_distance([3, 1, 2])).limit(2))
    print(list(closest))

When I use hnsw index, it works. However, with ivfflat, the closest variable is an empty list.

Question on (testing) cosine_similiarity with WHERE clauses

Hello! First off, thank you for this library and the terrific examples (I was especially appreciative of the clear-cut SQLModel starter code).

This is somewhat of a question, and somewhat just asking for confirmation that my interpretation is correct (as I was really scratching my head before starting to, maybe, make sense of things). I'm working on a SQModel-based codebase where I have approximately the following schema:

  • Document
    • id: UUID
  • Chunk
    • id: UUID
    • document_id == Document.id (foreign key relationship)
  • Embedding(s)
    • id: UUID
    • vector: Vector
    • chunk_id == Chunk.id (foreign key relationship)

(embeddings are stored separately from chunks since I want to experiment with different embedding models, and thus want to support multiple embedding tables)

I adapted this codebase's SQLModel tests to see if my querying strategy (which incorporated a WHERE clause) was working correctly. Things were fine for l2_distance and max_inner_product but my results were nondetermistic for cosine_distance -- sometimes they passed, sometimes they didn't.

I was able to make sense of this after learning that (at least in postgres) adding a WHERE clause can introduce some non-determinism in how fields with the same value are ordered. Since for cosine_distance, both [1,1,1] and [2,2,2] are 0 degrees away from the test case [1,1,1] (right?) I assume in my tests they were just flip-flopping in order.

Does this checkout to other folks? I'm not a very experienced database nor embeddings developer..

Average embeddings returns string not np array

Environment:
python: 3.11
pgvector: 0.1.8
sqlalchemy: 2.0.23

I am trying to take an average of some embeddings in my database, which seems to work but the return type appears to be a string representation of the vector and not a proper np array. Querying the vectors directly correctly returns a np array.

# Query for the average embedding
In [33]: statement = select(func.avg(models.DocumentChunkModel.embedding))

In [34]: print(statement)
SELECT avg(document_chunks.embedding) AS avg_1 
FROM document_chunks

In [35]: emb = await session.scalar(statement)

# Ouput type is a string
In [36]: type(emb)
Out[36]: str

In [37]: emb[:50]
Out[37]: '[-0.00082952296,-0.010899956,0.00761034,-0.0268747'

In [38]: 

In [38]: 

# Query for the embeddings directly
In [38]: statement = select(models.DocumentChunkModel.embedding)

In [39]: emb = await session.scalar(statement)

# Output is a proper np array
In [40]: type(emb)
Out[40]: numpy.ndarray

Advice for properly timing Django queryset evaluation

Hello, I was hoping to get some advice on how to best go about scoping Django querysets that hold vector data to ensure that the CosineDistance calculation is properly scoped to the relevant rows only. Here's how I'm currently querying the Django ORM:

Python:

from pgvector.django import CosineDistance

user_id = '...'
collection_id = '...'
distance = 0.8

scoped_queryset = (
    VSFItem.objects
    .filter(seg__assets__user__id=user_id)
    .filter(seg__assets__collection__id=collection_id)
)
embedding = VSFItem.objects.first().vector
similar_faces_qset = (
    scoped_queryset
    .alias(distance=CosineDistance("vector", embedding))
    .filter(distance__lt=distance)
)

Which, on calling print(similar_faces_qset.query), shows that the following SQL code has been generated:

SELECT "vsf_item"."seg_id",
       "vsf_item"."vector",
FROM   "vsf_item"
       INNER JOIN "segs"
               ON ( "vsf_item"."seg_id" =
                    "segs"."id" )
       INNER JOIN "assets"
               ON ( "segs"."asset_id" =
                    "assets"."id" )
WHERE  ( "assets"."user_id" =
         b679c79c - 0684 - 4f92 - 9a35 - 29824d62e569
         AND "assets"."collection_id" =
             e3faf9d4 - a7dc - 4710 - ad27 - 85619d34619c
         AND ( "vsf_item"."vector" <=> [-0.734, 0.1616 ... ]) < 0.8 ) 

The WHERE clause here, specifically the AND ( "vsf_item"."vector" <=> [-0.734, 0.1616 ... ]) < 0.8 ) suggests to me that the vector calculation might be happening across all rows of the "vsf_item" table. Is that accurate? If so, do you have any recommendations on how to best scope computation given the above queries? (I'm doing scoping similar in spirit to the above in different parts of my project, so was hoping for general guidelines re. best practices).

Thanks!

Adjacent issue that's worth mentioning: pgvector/pgvector#203 (comment)

Issue with asyncpg and numpy array

The following query is erroring with the numpy array.

        async with self.pool.acquire() as conn:
            async with conn.transaction():
                print(type(data["embedding"]), len(data["embedding"]))

                data["embedding"] = np.array(data["embedding"]) # tests accept this but not live?
                # data["embedding"] = str(np.array(data["embedding"]).tolist())
                data["metadata"] = safe_json_dumps(data["metadata"])
                await conn.execute(
                f"""
                INSERT INTO "{table}" (id, document_id, section_id, text, embedding, metadata) 
                VALUES ($1, $2, $3, $4, $5, $6) 
                ON CONFLICT (id) 
                DO UPDATE SET document_id = $2, section_id = $3, text = $4, embedding = $5, metadata = $6
                """,
                *(
                    data["id"],
                    data["document_id"],
                    data["section_id"],
                    data["text"],
                    data["embedding"],
                    data["metadata"],
                )
                )

Error Message

asyncpg.exceptions.DataError: invalid input for query argument $5: array([ 0.01184103, -0.01023413,  0.0018... (expected str, got ndarray)

I have converted to a string and it works but also bugs.

I have registered pgvector with

        pool = await asyncpg.create_pool(url)
        async with pool.acquire() as conn:
            await register_vector(conn)

Any suggestions?

[SQLModel] Select both columns and distance in the same query

Hi!
The README shows the following example for SQLModel:

from pgvector.sqlalchemy import Vector
from sqlalchemy import Column

class Item(SQLModel, table=True):
    embedding: List[float] = Field(sa_column=Column(Vector(3)))

# (...)

item = Item(embedding=[1, 2, 3])
session.add(item)
session.commit()

# (...)

# Get the nearest neighbors to a vector
session.exec(select(Item).order_by(Item.embedding.l2_distance([3, 1, 2])).limit(5))


# (...)

# Get the distance
session.exec(select(Item.embedding.l2_distance([3, 1, 2])))

I was wondering, how do you select the neighbors and the distance (for each neighbor) in the same query?
I tried

session.exec(select(Item, Item.embedding.l2_distance([3, 1, 2]).label("distance")).order_by(Item.embedding.l2_distance([3, 1, 2])).limit(5))

and it works, but distance is not included as an Item attribute. The output looks like (numbers invented):

[(Item(embedding=array([1, 2, 3], dtype=float32)), 2.335), (Item(embedding=array([3, 2, 1], dtype=float32)), 2.446)]

I would like to have, instead:

[Item(embedding=array([1, 2, 3], dtype=float32), distance=2.335), Item(embedding=array([3, 2, 1], dtype=float32), distance=2.446)]

I guess I have to write the Item model differently, but I'm not sure how to write the distance attribute, since it depends on another column (embedding) and on external data (the actual query embedding).

Thanks!

No support for vector deletion based on index?

Hi
Thanks for this awesome project.
I was just wondering if deleting a vector given index is still in works or will not be supported ?

Reason:
I want to be able to have CRUD functionality and not always delete the whole vector DB and reinitialize in case my data has come changes or is constantly updating.

Documentation

Hi, I am new to vector databases and trying to use this in a project, documentation for this is not in depth, can you please provide more detail on how to use it with python, may be give a simple but complete example where variables are not missing, like I was reading where the conn is coming from or where the mapped_column is coming from.

If you could provide an indepth example on how to create a collection and then insert vector data and some other data in the same table and how to fetch and do similarity search. It would be great. Thanks!

Can't register_vector(connection)

I have installed pgvector on Ubuntu 22.04 and checked that as follows:

$sudo apt install postgresql-15-pgvector
[sudo] password for :
Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
postgresql-15-pgvector is already the newest version (0.4.2-1.pgdg22.04+1).
0 to upgrade, 0 to newly install, 0 to remove and 8 not to upgrade.
I further check by running:
CREATE EXTENSION vector;
and I get:
SQL Error [42710]: ERROR: extension "vector" already exists

So it looks as though everything is installed correctly.

however when I use the code as per docs:

from pgvector.psycopg2 import register_vector
register_vector(connection)

it throws an exception and I get:
vector type not found in the database

Python version is 3.10.9 and the connection is setup as follows:

        connection = psycopg2.connect(
            host=hostname, user=username, password=password,
            dbname=database)
        cur = connection.cursor()
        connection.autocommit = True

Help appreciated.

Integration with Django Admin

When attempting to create database records in a Django Admin application, pgvector values fail the basic, naive admin form validation with:

The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()

Is there a common approach for working around this at the model (preferable) or admin form level? I've tried specifying a validator function via the model field's validators kwarg and the first round of "truthy" validation (i.e. the one referenced above) fails before my function is called.

Question: How to access distance values when querying with Django?

Hi there! Wanted to ask how one should access calculated diff values when ordering by L2Distance in Django?

from pgvector.django import L2Distance

Item.objects.order_by(L2Distance('embedding', [3, 1, 2]))[:5]

[3, 1, 2] vs. items[0].embedding?

Also, is it possible to get rows at a certain distance?

SELECT * FROM items WHERE embedding <-> '[3,1,2]' < 5;

Cannot adapt type 'ndarray' with placeholder %s

Hi team, I'm new to PGVector and I would like to do semantic search over a table in postgres. I have embeddings stored as vector already.

I do semantic search as follows:

def semantic_search(query):
    global model, conn

    embed = model.encode(query)
    results = conn.execute('SELECT id, Class, AssetTypeName, Name, Value, FormattedValue FROM assets ORDER BY embedding <=> %s LIMIT 5', (embed,))
    return results

Here, I get an error saying -

ProgrammingError: cannot adapt type 'ndarray' using placeholder '%s' (format: AUTO)

Please let me know what I have missed. Thanks.

sqlalchemy.exc.InvalidRequestError: Unknown PG numeric type: 24664

Thanks for this amazing library.

I'm using it with fastapi and sqlalchemy, and I'm able to insert vectors to the database. But when I need to query the table (for updating a row for example), I get the following error:

File "/usr/local/lib/python3.11/site-packages/sqlalchemy/dialects/postgresql/psycopg2.py", line 525, in result_processor
raise exc.InvalidRequestError(
sqlalchemy.exc.InvalidRequestError: Unknown PG numeric type: 24664

I had the issue with inserting vectors as well, but I fixed it by changing from:

def create_record(session: Session, record: RecordCreate):
    new_record = Record(**record.dict())
    session.add(new_record)
    session.commit()
    session.refresh(new_record)
    return new_record

to:

def create_record(session: Session, record: RecordCreate):
    new_record = Record(**record.dict())
    session.add(new_record)
    session.commit()
    # session.refresh(new_record)
    return True

I have another function to update a database record if it exists, and it fails with the error mentioned above:

def create_or_update_record(session: Session, record: RecordCreate):
    db_record = session.exec(
        select(Record).where(Record.id == record.id)
    ).first()
    # rest of the function

Error:

api_1 | File "/usr/local/lib/python3.11/site-packages/sentry_sdk/integrations/fastapi.py", line 84, in _sentry_call
api_1 | return old_call(*args, **kwargs)
api_1 | ^^^^^^^^^^^^^^^^^^^^^^^^^
api_1 | File "/usr/src/app/app/api/controllers/records.py", line 119, in create_or_update_record_endpoint
api_1 | return create_or_update_record(session, record)
api_1 | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
api_1 | File "/usr/src/app/app/api/controllers/records.py", line 53, in create_or_update_record
api_1 | db_record = session.exec(
api_1 | ^^^^^^^^^^^^^
api_1 | File "/usr/local/lib/python3.11/site-packages/sqlmodel/orm/session.py", line 60, in exec
api_1 | results = super().execute(
api_1 | ^^^^^^^^^^^^^^^^
rest of the error ...
api_1 | File "/usr/local/lib/python3.11/site-packages/sqlalchemy/sql/type_api.py", line 702, in _cached_result_processor
api_1 | d[coltype] = rp = d["impl"].result_processor(dialect, coltype)
api_1 | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
api_1 | File "/usr/local/lib/python3.11/site-packages/sqlalchemy/dialects/postgresql/psycopg2.py", line 525, in result_processor
api_1 | raise exc.InvalidRequestError(
api_1 | sqlalchemy.exc.InvalidRequestError: Unknown PG numeric type: 24664

Any idea how can I fix this?

Error when querying nearest vector

I am getting the same error when querying for similar vector. I have looked at the 50 as well as others linked there but still can't resolve. My encoding and insertions work correctly. I am using sentence transformer encoding and inserting by converting it into numpy.array using the following code

from sentence_transformers import SentenceTransformer
import numpy as np
from pgvector.psycopg import register_vector
import psycopg

model = SentenceTransformer('multi-qa-MiniLM-L6-cos-v1')
#read dataframe from csv and get the list to be encoded
string_list = list(df["NAME"][:10])

#Encode the string list
string_embeddings = model.encode(string_list)
conn = psycopg.connect(connection parameters)
conn.execute('CREATE EXTENSION IF NOT EXISTS vector')
register_vector(conn)

conn.execute('DROP TABLE IF EXISTS master_vec')
conn.execute('CREATE TABLE master_vec (master_id bigserial PRIMARY KEY, name text, embedding vector(384))')


for master_name, embedding in zip(string_list, string_embeddings):
    conn.execute('INSERT INTO master_vec (name, embedding) VALUES (%s, %s)', (name, np.array(embedding)))


#verify insertion
master_id = 1
neighbors = conn.execute('SELECT name FROM master_vec WHERE master_id != %(id)s ORDER BY embedding <=> (SELECT embedding FROM master_vec WHERE master_id = %(id)s) LIMIT 5', {'id': master_id}).fetchall()
for neighbor in neighbors:
    print(neighbor[0])

this works as expected and gives me the 10 records that are inserted
Now doing the semantic search

from sentence_transformers import SentenceTransformer
import numpy as np
from pgvector.psycopg import register_vector
import psycopg


model = SentenceTransformer('multi-qa-MiniLM-L6-cos-v1')
conn = psycopg.connect(connection parameters)
register_vector(conn)

target = 'target string'
target_embedding = model.encode(target)
target_embedding = np.array(target_embedding)
neighbors = conn.execute('SELECT name FROM master_vec ORDER BY embedding <-> %s LIMIT 3', target_embedding).fetchall()
for neighbor in neighbors:
    print(neighbor[0])

This fails with TypeError: query parameters should be a sequence or a mapping, got ndarray
I get the same TypeError irrespective of whether I convert the embeddings to a numpy.array or not. If I use a list then it gives me the error ProgrammingError: the query has 1 placeholders but 384 parameters were passed. These same errors are obtained even if cast embeddings as a vector in the query. What am I doing wrong here?

How to get vectors within a certain distance in SQLalchemy

Hello,how do i query for vectors within a certain l2 distance from a given vector in sqlalchemy?
currently only django version of this query is documented:
Item.objects.alias(distance=L2Distance('embedding', [3, 1, 2])).filter(distance__lt=5)

how do i do the same thing in SQLalchemy

Error when querying for the most similar vector

Hi @ankane , we are trying to use pgvector to perform similarity search

We have successfully loaded embeddings to a postgres table using pgvector, psycopg3, and fastembed by following the readme doc:

https://github.com/vemonet/concept-resolver/blob/main/src/pubdict_load.py#L157

The table has been properly populated, I checked it with this command:

similar = conn.execute("SELECT * FROM pubdictionaries_embeddings LIMIT 5").fetchall()

But if I try to retrieve the most similar (full code here: https://github.com/vemonet/concept-resolver/blob/main/src/pubdict_search.py):

similar = conn.execute('SELECT * FROM pubdictionaries_embeddings ORDER BY embedding <-> %s LIMIT 5', (embeddings,)).fetchall()

I am getting an error where pg pretends it does not know the type I am passing:

Traceback (most recent call last):
  File "/app/src/pubdict_search.py", line 22, in <module>
    similar = conn.execute('SELECT * FROM pubdictionaries_embeddings ORDER BY embedding <-> %s LIMIT 5', (embeddings,)).fetchall()
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/psycopg/connection.py", line 896, in execute
    raise ex.with_traceback(None)
psycopg.errors.UndefinedFunction: operator does not exist: vector <-> double precision[]
LINE 1: ...ROM pubdictionaries_embeddings ORDER BY embedding <-> $1 LIM...
                                                             ^
HINT:  No operator matches the given name and argument types. You might need to add explicit type casts.

This error seems to indicate that the vector extension is not properly enabled, which does not seems right since:

  1. It has been enabled inside the pg db by the load script
  2. I am even rerunning the register_vector(conn) in the search script just in case (I would expect it is not needed, but since pg was complaining about not knowing anymore what a vector is I thought that could help)
  3. I am using the exact same function to generate embeddings for load and search, which return a list of list of floats (pythonic list, not ndarray nor np.array), so if it works for load it should also work for search:
        return [
            embedding.tolist() for embedding in self.embedding_model.embed(labels)
        ]

We also tried to just pass 1 embedding when performing the search (the list of floats, instead of a list of list of floats):

similar = conn.execute('SELECT * FROM pubdictionaries_embeddings ORDER BY embedding <-> %s LIMIT 5', (embeddings,)).fetchall()

But we are getting the exact same error.

We also tried to convert our pythonic list of list of floats with np.array()

        return [
            np.array(embedding.tolist()) for embedding in self.embedding_model.embed(labels)
        ]

But we are getting another error:

Traceback (most recent call last):
  File "/app/src/pubdict_load.py", line 157, in <module>
    cursor.execute(
  File "/usr/local/lib/python3.11/site-packages/psycopg/cursor.py", line 732, in execute
    raise ex.with_traceback(None)
psycopg.ProgrammingError: cannot adapt type 'ndarray' using placeholder '%s' (format: AUTO)

There is this error on stackoverflow that is a bit similar: https://stackoverflow.com/questions/75904637/how-to-fix-postgres-error-operator-does-not-exist-using-pgvector
But the conclusion is "enable the vector extension", which we have done already

Any idea how this could be fixed? We carefully followed the provided docs, and are not postgres extension experts, so we are a bit blocked now

Thanks a lot!

Unable to run the extension migration

I'm on Postgres13, getting this:

django.db.utils.OperationalError: could not open extension control file "/usr/local/share/postgresql/extension/vector.control": No such file or directory

Setting HNSW ef_search variable

No matter what I do, ef_search defaults to 40, so I'm always limited to 40 results. Can't seem to find a way to pass that variable in this wrapper, works fine in sql directly with
SET hnsw.ef_search = N;
Any way to do it now? (I'm using django)

Searching similar vectors in SQLModel

I want to build a function for fetching similar vectors with the distance metrics. I have used the exact command written in the README.md but looks like it's not working for me.

Here's the code:

async def get_similar_vectors(session: AsyncSession, query_vector, k=10, distance = "cosine_distance"):
    if distance == "cosine_distance":
        query = select(db_models.Vector).order_by(
            db_models.Vector.embedding.cosine_distance(query_vector)).limit(k)
    elif distance == "l2_distance":
        query = select(db_models.Vector).order_by(
            db_models.Vector.embedding.l2_distance(query_vector)).limit(k)
    else:
        query = select(db_models.Vector).order_by(
            db_models.Vector.embedding.max_inner_product(query_vector)).limit(k)
    
    execute_query = await session.execute(query)
    similar_vectors = execute_query.all()
    results = [
            {"id": vector.id, "embedding": vector.embedding}
            for vector in similar_vectors
        ]
    return results

Here's the issue:

select(db_models.Vector).order_by(
            db_models.Vector.embedding.cosine_distance(query_vector)).limit(k)

Cannot access member "cosine_distance" for type "List[float]"
Member "cosine_distance" is unknown

This goes same for all the other distance metrics.

To give more clarity on the Vector model:

class Vector(SQLModel, table=True):
    id: Optional[UUID] = Field(default=None, primary_key=True)
    embedding: List[float] = Field(sa_column=Column(Vector(embedding_dim)))

query_vector type is List[float]

How to fix this issue? Or more specifically how to search for similar embeddings using SQLModel?

How to use SQLAlchemy's most resent PEP 484 Annotated Declarative Table style?

Hi there I'm wondering how one can utilize the PEP 484 Annotated Declarative Table Style of SQLAlchemy with pgvector

See: https://docs.sqlalchemy.org/en/20/orm/declarative_tables.html#using-annotated-declarative-table-type-annotated-forms-for-mapped-column).

This is a Table definition from a small flask app I'm working on, but since I'm fairly new to SQLAlchemy, I'm not sure if this is the correct way of doing it.

from typing import Iterable

from pgvector.sqlalchemy import Vector
from sqlalchemy.dialects.postgresql import JSONB
from sqlalchemy.orm import Mapped, mapped_column

from flask_app import db


class Product(db.Model):
    __tablename__ = "product"

    id: Mapped[str] = mapped_column(primary_key=True)
    data: Mapped[dict|list] = mapped_column(type_=JSONB)
    tokensize: Mapped[int]= mapped_column()
    checksum: Mapped[str] = mapped_column(db.String(40), index=True)
    embedding: Mapped[Iterable[float]] = db.mapped_column(Vector(1536))

How to rebuild indexes with Django?

Newbie questions here,

I am planning to use pgvector with Django. I am wondering how should I rebuild a IVFFLAT or HNSW index as I add more embeddings to the database?

Set thresholds

hi there !
It would be awesome if you could support threshold's with a param for all distance strategy's, along the line of the PGVECTOR extension.
pgvector/pgvector#16

Saw this in the readme.
Get items within a certain distance
session.scalars(select(Item).filter(Item.embedding.l2_distance([3, 1, 2]) < 5))

So please advise.

[Django] Using null=True raises error

If null=True is passed to a VectorField in Django, the field will be created successfully but both reads and writes to the field will fail.

pgvector/django/__init__.py", line 47, in validate
    super().validate(value.tolist(), model_instance)
    ^^^^^^^^^^^^^^^^^
AttributeError: 'NoneType' object has no attribute 'tolist'

Will look into writing a PR if I have time.

examples seem to not work

psycopg.errors.InsufficientPrivilege: permission denied to create extension "vector"
ok, commented out, installed it manually, that didn't just work either
note, if someone does this with sudo, could screw stuff up

raise psycopg.ProgrammingError('vector type not found in the database')

psycopg.ProgrammingError: vector type not found in the database
uhhh, yes it does

can you please test your examples.

why does this happen when i run pytest in the base of your folder

~/hax/externals/pgvector/pgvector-python (master) $
> pytest
=============================================== test session starts ===============================================
platform linux -- Python 3.10.9, pytest-7.4.4, pluggy-1.0.0
rootdir: /home/bion/hax/externals/pgvector/pgvector-python
configfile: pytest.ini
plugins: order-1.2.0, dotenv-0.5.2, ordering-0.6, envfiles-0.1.0, asyncio-0.23.3, xdist-3.2.0, anyio-3.6.2, hypothesis-6.70.2
asyncio: mode=auto
collected 0 items
INTERNALERROR> Traceback (most recent call last):
INTERNALERROR>   File "/home/bion/miniconda3/envs/py310/lib/python3.10/site-packages/_pytest/python.py", line 617, in _importtestmodule
INTERNALERROR>     mod = import_path(self.path, mode=importmode, root=self.config.rootpath)
INTERNALERROR>   File "/home/bion/miniconda3/envs/py310/lib/python3.10/site-packages/_pytest/pathlib.py", line 567, in import_path
INTERNALERROR>     importlib.import_module(module_name)
INTERNALERROR>   File "/home/bion/miniconda3/envs/py310/lib/python3.10/importlib/__init__.py", line 126, in import_module
INTERNALERROR>     return _bootstrap._gcd_import(name[level:], package, level)
INTERNALERROR>   File "<frozen importlib._bootstrap>", line 1050, in _gcd_import
INTERNALERROR>   File "<frozen importlib._bootstrap>", line 1027, in _find_and_load
INTERNALERROR>   File "<frozen importlib._bootstrap>", line 1006, in _find_and_load_unlocked
INTERNALERROR>   File "<frozen importlib._bootstrap>", line 688, in _load_unlocked
INTERNALERROR>   File "<frozen importlib._bootstrap_external>", line 883, in exec_module
INTERNALERROR>   File "<frozen importlib._bootstrap>", line 241, in _call_with_frames_removed
INTERNALERROR>   File "/home/bion/hax/externals/pgvector/pgvector-python/pgvector/asyncpg/__init__.py", line 1, in <module>
INTERNALERROR>     from ..utils import from_db, from_db_binary, to_db, to_db_binary
INTERNALERROR> ImportError: attempted relative import beyond top-level package
INTERNALERROR>
INTERNALERROR> The above exception was the direct cause of the following exception:
INTERNALERROR>
INTERNALERROR> Traceback (most recent call last):
INTERNALERROR>   File "/home/bion/miniconda3/envs/py310/lib/python3.10/site-packages/_pytest/main.py", line 271, in wrap_session
INTERNALERROR>     session.exitstatus = doit(config, session) or 0
INTERNALERROR>   File "/home/bion/miniconda3/envs/py310/lib/python3.10/site-packages/_pytest/main.py", line 324, in _main
INTERNALERROR>     config.hook.pytest_collection(session=session)
INTERNALERROR>   File "/home/bion/miniconda3/envs/py310/lib/python3.10/site-packages/pluggy/_hooks.py", line 265, in __call__
INTERNALERROR>     return self._hookexec(self.name, self.get_hookimpls(), kwargs, firstresult)
INTERNALERROR>   File "/home/bion/miniconda3/envs/py310/lib/python3.10/site-packages/pluggy/_manager.py", line 80, in _hookexec
INTERNALERROR>     return self._inner_hookexec(hook_name, methods, kwargs, firstresult)
INTERNALERROR>   File "/home/bion/miniconda3/envs/py310/lib/python3.10/site-packages/pluggy/_callers.py", line 60, in _multicall
INTERNALERROR>     return outcome.get_result()
INTERNALERROR>   File "/home/bion/miniconda3/envs/py310/lib/python3.10/site-packages/pluggy/_result.py", line 60, in get_result
INTERNALERROR>     raise ex[1].with_traceback(ex[2])
INTERNALERROR>   File "/home/bion/miniconda3/envs/py310/lib/python3.10/site-packages/pluggy/_callers.py", line 39, in _multicall
INTERNALERROR>     res = hook_impl.function(*args)
INTERNALERROR>   File "/home/bion/miniconda3/envs/py310/lib/python3.10/site-packages/_pytest/main.py", line 335, in pytest_collection
INTERNALERROR>     session.perform_collect()
INTERNALERROR>   File "/home/bion/miniconda3/envs/py310/lib/python3.10/site-packages/_pytest/main.py", line 675, in perform_collect
INTERNALERROR>     self.items.extend(self.genitems(node))
INTERNALERROR>   File "/home/bion/miniconda3/envs/py310/lib/python3.10/site-packages/_pytest/main.py", line 842, in genitems
INTERNALERROR>     rep = collect_one_node(node)
INTERNALERROR>   File "/home/bion/miniconda3/envs/py310/lib/python3.10/site-packages/_pytest/runner.py", line 546, in collect_one_node
INTERNALERROR>     ihook.pytest_collectstart(collector=collector)
INTERNALERROR>   File "/home/bion/miniconda3/envs/py310/lib/python3.10/site-packages/pluggy/_hooks.py", line 265, in __call__
INTERNALERROR>     return self._hookexec(self.name, self.get_hookimpls(), kwargs, firstresult)
INTERNALERROR>   File "/home/bion/miniconda3/envs/py310/lib/python3.10/site-packages/pluggy/_manager.py", line 80, in _hookexec
INTERNALERROR>     return self._inner_hookexec(hook_name, methods, kwargs, firstresult)
INTERNALERROR>   File "/home/bion/miniconda3/envs/py310/lib/python3.10/site-packages/pluggy/_callers.py", line 60, in _multicall
INTERNALERROR>     return outcome.get_result()
INTERNALERROR>   File "/home/bion/miniconda3/envs/py310/lib/python3.10/site-packages/pluggy/_result.py", line 60, in get_result
INTERNALERROR>     raise ex[1].with_traceback(ex[2])
INTERNALERROR>   File "/home/bion/miniconda3/envs/py310/lib/python3.10/site-packages/pluggy/_callers.py", line 39, in _multicall
INTERNALERROR>     res = hook_impl.function(*args)
INTERNALERROR>   File "/home/bion/miniconda3/envs/py310/lib/python3.10/site-packages/pytest_asyncio/plugin.py", line 626, in pytest_collectstart
INTERNALERROR>     pyobject = collector.obj
INTERNALERROR>   File "/home/bion/miniconda3/envs/py310/lib/python3.10/site-packages/_pytest/python.py", line 310, in obj
INTERNALERROR>     self._obj = obj = self._getobj()
INTERNALERROR>   File "/home/bion/miniconda3/envs/py310/lib/python3.10/site-packages/_pytest/python.py", line 528, in _getobj
INTERNALERROR>     return self._importtestmodule()
INTERNALERROR>   File "/home/bion/miniconda3/envs/py310/lib/python3.10/site-packages/_pytest/python.py", line 642, in _importtestmodule
INTERNALERROR>     raise self.CollectError(
INTERNALERROR> _pytest.nodes.Collector.CollectError: ImportError while importing test module '/home/bion/hax/externals/pgvector/pgvector-python/pgvector/asyncpg/__init__.py'.
INTERNALERROR> Hint: make sure your test modules/packages have valid Python names.
INTERNALERROR> Traceback:
INTERNALERROR> /home/bion/miniconda3/envs/py310/lib/python3.10/importlib/__init__.py:126: in import_module
INTERNALERROR>     return _bootstrap._gcd_import(name[level:], package, level)
INTERNALERROR> pgvector/asyncpg/__init__.py:1: in <module>
INTERNALERROR>     from ..utils import from_db, from_db_binary, to_db, to_db_binary
INTERNALERROR> E   ImportError: attempted relative import beyond top-level package

============================================== no tests ran in 0.02s ==============================================

if a special command is needed, where's make test?

[Question] Ideas on clustering vectors (without defining no. clusters)

Hey there,

I'm using pgvector with Django and it's doing a great job.

I have a large table of text paired with embeddings, and I want to automatically organize them into categories, and later label the categories with GPT by sampling the text. That means that I don't know the number of clusters in advance. Preferably a solution would allow me to define a maximum distance between each vector inside the cluster, so I can adjust how general they will be. Any ideas on the best way of doing this?

Since HNSW or Ivfflat indices are already a sort of cluster, maybe we could query them somehow? Just throwing ideas around here. Perhaps it is simply a bad idea to do the clustering inside the database, I honestly don't know.

What do you think?

Thanks in advance.

Ideas

Ideas

  • Add support for Django

Corrupted NP Array after many inserts into DB

Hi there

Firstly, thank you for PGVector. This library is really awesome.

I'm noticing some interesting behavior when benchmarking lots of vector inserts, I will occasionally get an error like:

sqlalchemy.exc.DataError: (psycopg2.errors.InvalidTextRepresentation) invalid input syntax for type vector: "0&37601622029971404"
LINE 1: ...e speaking a line', 'Line 23: This is meta', 220, '[0.693288...

or

sqlalchemy.exc.DataError: (psycopg2.errors.InvalidTextRepresentation) invalid input syntax for type vector: "0.15788865441>03905"
LINE 1: ...e speaking a line', 'Line 13: This is meta', 263, '[0.966757...

We are inserting via SQLAlchemy like so:

inserted_object = ObjectClass(randNum1=random.randrange(0, 100), randNum2=random.randrange(0, 100),  embedding =  np.random.rand(1280).tolist()  )

In a fairly tight loop so we can test PGVector for our schema.

We run Postgres / PGVector from a Docker container, and have tested on a few platforms (M2 Mac, X86_64 Ubuntu Linux)

Our base docker image:

FROM python:3.8-slim

RUN apt-get update && apt-get install -y gcc git libpq-dev

ARG GID_
ARG UID_

RUN addgroup --gid $GID_ appuser
RUN adduser --disabled-password --gecos '' --uid $UID_ --gid $GID_ appuser

# Ensure installation directory used by pip is available on $PATH
ENV PATH="/home/appuser/.local/bin:${PATH}"

WORKDIR /app

ARG GITHUB_USER
ARG GITHUB_ACCESS_TOKEN

USER appuser

COPY --chown=appuser:appuser ./requirements/requirements.txt .

RUN pip3 install -r requirements.txt

Our Requirements.txt which is auto-run on docker building:

alembic
pgvector
#psycopg[binary,pool]
psycopg2-binary==2.9.6
SQLAlchemy==2.*
weaviate-client==3.15.5

We've noticed that on our Ubuntu Linux test system we get intermittent failures like the one above.

Is this a known issue, or perhaps something more to do with how np.Array is being ingested into SQLAlchemy that can sometimes cause errors?

We haven't seen this on Apple Silicon, and only see the error intermittently when running on Ubuntu after a few hundred thousand inserts.

Thank you in advance.

Bulk insert of vectors

I am using psycopg2 and wanted to understand if this library supports bulk insertion of vectors in a table using COPY command.

Thanks.

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.