Comments (5)
What you're describing is independent of where you run PostgreSQL. The switch from EXTENDED to PLAIN storage does affect how PostgreSQL costs the sequential scan and plans parallel workers. Specifically, PostgreSQL only accounts for the relpages value of the table itself, not of the TOAST table (where vectors above the threshold, typically the 2KB default, are stored out-of-line). With PLAIN storage, because the vectors are stored in the main table pages, PostgreSQL sees a larger relpages value and will plan for more workers. While this post describes this in the context of parallel index builds, it also applies to sequential scans.
For the rest of your question, it would help to understand more about your experiment, e.g. how frequently you ran your queries, and the output of EXPLAIN ANALYZE to see whether the execution plan changed at all.
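For reference, a sketch of the kind of EXPLAIN ANALYZE invocation I mean (the table, column, and vector literal are placeholders, not your schema):

```sql
-- Hypothetical nearest-neighbour query; <#> is pgvector's negative inner product
EXPLAIN ANALYZE
SELECT id
FROM vectors
ORDER BY embedding <#> '[...]'::vector
LIMIT 100;
```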
Thanks @jkatz for your help!
It is a production system in its early stages, with a request rate of about 1 query per second. Every day we load about 50,000 more vectors and examine its performance. The graph above shows a time span of about one week, so over the drop toward the right we loaded about 250,000 more rows.
The EXPLAIN ANALYZE output is:
Limit (cost=222555.47..222567.14 rows=100 width=24) (actual time=530.305..530.491 rows=100 loops=1)
-> Gather Merge (cost=222555.47..283149.37 rows=519340 width=24) (actual time=530.303..530.473 rows=100 loops=1)
Workers Planned: 2
Workers Launched: 2
-> Sort (cost=221555.45..222204.63 rows=259670 width=24) (actual time=519.330..519.349 rows=78 loops=3)
Sort Key: (abs((embedding <#> '[...]'::vector))) DESC
Sort Method: top-N heapsort Memory: 37kB
Worker 0: Sort Method: top-N heapsort Memory: 38kB
Worker 1: Sort Method: top-N heapsort Memory: 38kB
-> Parallel Seq Scan on vectors v (cost=0.00..211631.05 rows=259670 width=24) (actual time=0.008..449.042 rows=207736 loops=3)
Planning Time: 0.345 ms
Execution Time: 530.576 ms
and here is the EXPLAIN ANALYZE after running:
SET max_parallel_workers_per_gather = 4;
Limit (cost=217152.52..217164.49 rows=100 width=24) (actual time=545.110..547.536 rows=100 loops=1)
-> Gather Merge (cost=217152.52..291815.21 rows=623568 width=24) (actual time=545.108..547.524 rows=100 loops=1)
Workers Planned: 4
Workers Launched: 4
-> Sort (cost=216152.46..216542.19 rows=155892 width=24) (actual time=519.393..519.407 rows=85 loops=5)
Sort Key: (abs((embedding <#> '[...]'::vector))) DESC
Sort Method: top-N heapsort Memory: 38kB
Worker 0: Sort Method: top-N heapsort Memory: 38kB
Worker 1: Sort Method: top-N heapsort Memory: 37kB
Worker 2: Sort Method: top-N heapsort Memory: 38kB
Worker 3: Sort Method: top-N heapsort Memory: 37kB
-> Parallel Seq Scan on vectors v (cost=0.00..210194.38 rows=155892 width=24) (actual time=0.009..445.478 rows=124713 loops=5)
Planning Time: 0.095 ms
Execution Time: 547.578 ms
As you can see, setting max_parallel_workers_per_gather to a higher value did not help. I also tried the same on my Mac M1, which has many CPUs.
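For completeness, these are the standard PostgreSQL settings that cap or influence the planned worker count (the per-gather cap is only one of them):

```sql
-- Worker-related GUCs worth checking
SHOW max_parallel_workers;             -- global pool shared by all queries
SHOW max_parallel_workers_per_gather;  -- per-query cap
SHOW min_parallel_table_scan_size;     -- planner threshold for adding workers
```

If max_parallel_workers itself is low, raising the per-gather setting alone has no effect.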
I should have added EXPLAIN (ANALYZE, BUFFERS) to see how much you're reading from memory vs. going to disk, which could impact this, as well as your total data set size. From my local tests using a data set of 1,000,000 768-dim vectors (using EXTENDED storage, but with min_parallel_table_scan_size set to 1 to maximize parallel workers), my execution time did continue to go down until I reached 6 parallel workers, but the improvement wasn't as dramatic as when I had to retrieve pages from disk.
The EXPLAIN with BUFFERS gives the following:
Limit (cost=230157.22..230168.89 rows=100 width=24) (actual time=648.258..648.449 rows=100 loops=1)
Buffers: shared hit=215569
-> Gather Merge (cost=230157.22..293014.38 rows=538738 width=24) (actual time=648.256..648.427 rows=100 loops=1)
Workers Planned: 2
Workers Launched: 2
Buffers: shared hit=215569
-> Sort (cost=229157.20..229830.62 rows=269369 width=24) (actual time=633.809..633.823 rows=81 loops=3)
Sort Key: ((embedding <#> '[...]'::vector))
Sort Method: top-N heapsort Memory: 38kB
Buffers: shared hit=215569
Worker 0: Sort Method: top-N heapsort Memory: 38kB
Worker 1: Sort Method: top-N heapsort Memory: 37kB
-> Parallel Seq Scan on vectors v (cost=0.00..218862.11 rows=269369 width=24) (actual time=0.017..542.565 rows=215495 loops=3)
Buffers: shared hit=215495
Planning:
Buffers: shared hit=100
Planning Time: 0.501 ms
Execution Time: 648.568 ms
(18 rows)
After setting:
SET max_parallel_workers_per_gather = 4;
Limit (cost=224692.36..224704.33 rows=100 width=24) (actual time=612.711..613.165 rows=100 loops=1)
Buffers: shared hit=215643
-> Gather Merge (cost=224692.36..302098.89 rows=646484 width=24) (actual time=612.709..613.147 rows=100 loops=1)
Workers Planned: 4
Workers Launched: 4
Buffers: shared hit=215643
-> Sort (cost=223692.30..224096.36 rows=161621 width=24) (actual time=578.866..578.881 rows=86 loops=5)
Sort Key: ((embedding <#> '[...]'::vector))
Sort Method: top-N heapsort Memory: 38kB
Buffers: shared hit=215643
Worker 0: Sort Method: top-N heapsort Memory: 38kB
Worker 1: Sort Method: top-N heapsort Memory: 38kB
Worker 2: Sort Method: top-N heapsort Memory: 38kB
Worker 3: Sort Method: top-N heapsort Memory: 38kB
-> Parallel Seq Scan on vectors v (cost=0.00..217515.26 rows=161621 width=24) (actual time=0.008..493.254 rows=129297 loops=5)
Buffers: shared hit=215495
Planning Time: 0.124 ms
Execution Time: 613.212 ms
(18 rows)
As we can see, the problem is not disk reads, since the buffers are all shared hits.
Does this EXPLAIN (BUFFERS) output make sense? It looks like my query is served entirely from RAM, so using 4 workers should make it much faster. I just want to ensure there is no sort of lock preventing the workers from running in parallel, or some similar issue.
Hi @david-gang, the EXPLAIN output looks consistent with the data being in memory. There are coordination costs with parallel workers, so at a certain point adding more won't help performance (which isn't specific to pgvector).