Comments (8)
When I run `create index on documents using hnsw (embedding vector_ip_ops);`
on Supabase, following this doc https://supabase.com/blog/increase-performance-pgvector-hnsw,
I also hit this issue: `hnsw graph no longer fits into maintenance_work_mem after 9866 tuples`
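In case it helps others hitting the same error: the usual workaround (per the pgvector README and that Supabase post) is to raise `maintenance_work_mem` for the session doing the build so the whole graph fits. The exact values below are illustrative, not a recommendation for every instance:

```sql
-- Illustrative session-level settings for the index build;
-- size maintenance_work_mem to your graph, not to these exact values.
SET maintenance_work_mem = '2GB';
SET max_parallel_maintenance_workers = 7;  -- optional, speeds up large builds
CREATE INDEX ON documents USING hnsw (embedding vector_ip_ops);
RESET maintenance_work_mem;
```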
from pgvector.
I consider massive updates/deletes a rare use case for pgvector.
For these workloads, it could be recommended to set ef_search to the number of tuples in the LIMIT clause plus some overhead, rather than the exact value (or to do this for all workloads, just to be safe). It's easy to estimate how many dead tuples to expect between vacuums for a known update/delete rate.
So setting ef_search to a fixed (large) value isn't necessary.
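As a sketch of that suggestion (the numbers here are made up for illustration): if queries use LIMIT 40 and roughly 10% of rows may be dead between vacuums, pad ef_search proportionally instead of pinning it to a large fixed value:

```sql
-- Hypothetical workload: LIMIT 40, ~10% dead tuples expected between vacuums.
-- Pad ef_search ~20% above the LIMIT so filtered-out dead tuples
-- don't shrink the result set.
SET hnsw.ef_search = 48;  -- instead of exactly 40, or a fixed large value
SELECT id FROM documents
ORDER BY embedding <#> '[0.1, 0.2, 0.3]'  -- inner-product distance; placeholder vector
LIMIT 40;
```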
from pgvector.
Hi @alanwli, I'm not sure there's a way to address this that doesn't affect search speed, but it may be worth the tradeoff. I've pushed a solution to the kill-prior-tuple branch, but need to think about it more. The two things that will affect search speed are:
- Marking tuples as deleted (since `scan->kill_prior_tuple` is the only way to get this information). We could limit this to a certain number per scan to minimize the impact.
- Needing to search more tuples. This wouldn't be fixed until the graph is repaired with vacuum.
from pgvector.
Hi @ankane

> but it may be worth the tradeoff.

As an end user, I think people would prefer accuracy and relevance over speed, as long as it doesn't get too slow.
Suppose Google search takes 0.3 seconds to display the results; then 0.6 seconds (2x the search time) is still okay.
But if it takes something like 2 seconds, it becomes noticeable and annoying.
from pgvector.
Here's a rough summary of the situation:
As elements get updated/deleted, either speed or recall has to give until the graph can be repaired (which currently happens on vacuum, as it's expensive). To guarantee results in every case, it's possible the entire graph would need to be visited (for instance, after deleting 99% of rows), which could be extremely slow (and likely cause SELECTs to tie up the database CPU).
With the current approach, speed will be consistent across queries, but recall will give.
The branch now gives up speed for recall by increasing ef_search up to 2x. This means that queries can slow down to a certain point, and then recall will give.
The impact either approach has on recall and speed will be different for each dataset.
Also, by default, tables are autovacuumed when 20% of rows (+ 50) change, which should keep the graph in a healthy state in many cases with either approach.
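If the default threshold lets too many dead tuples accumulate for a given workload, autovacuum can also be tuned per table so the graph gets repaired sooner; the 5% figure below is just an example, not a recommendation:

```sql
-- Example: vacuum this table after ~5% of rows change
-- instead of the default 20% (scale factor 0.2) + 50 rows.
ALTER TABLE documents SET (
    autovacuum_vacuum_scale_factor = 0.05,
    autovacuum_vacuum_threshold = 50
);
-- Or repair the graph immediately after a large update/delete:
VACUUM documents;
```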
from pgvector.
So, on the master branch:

ef_search = 10 -> 8 rows returned (2 are deleted or updated and get filtered out)

On the kill-prior-tuple branch:

ef_search = 10 -> 10 rows returned (8 from the initial search + 2 from the extended search, with ef_search extended up to 2x?)

Do I understand it correctly?
from pgvector.
Hi @ankane - thanks for looking into a solution for the problem. I agree with the discussion thus far that this is yet another speed vs. recall tradeoff, one that many other ANN implementations likely don't have to worry about, since they don't care about transactional semantics. Do you have a sense of how much overhead your `kill_prior_tuple` solution would incur?
> 1. Marking tuples as deleted (since `scan->kill_prior_tuple` is the only way to get this information).

The built-in btree AM does some opportunistic cleanup for non-HOT updates where the indexed value hasn't changed, and does this during index tuple insertion (since PostgreSQL 14).
Might you be able to do something similar? The underlying assumption of the optimization is: "we've just had a non-HOT update, there might be more non-HOT-updated tuples, and those other tuples may already be dead, so let's try to find some dead non-HOT-updated tuples to clean up".

Ref: `_bt_bottomupdel_pass`
from pgvector.