> The problem is that index builds are memory-intensive operations, and Postgres doesn’t have a great way to throttle them.
maintenance_work_mem begs to differ.
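It's session-settable, too. A minimal sketch, assuming pgvector and a hypothetical `items` table with an `embedding` column:

```sql
-- Cap this session's index-build memory, then build without blocking writes
-- (table and index names are hypothetical):
SET maintenance_work_mem = '2GB';
CREATE INDEX CONCURRENTLY items_embedding_idx
    ON items USING hnsw (embedding vector_l2_ops);
```

The caveat, per pgvector's own docs, is that the build slows down significantly once the graph no longer fits in that budget.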
> You rebuild the index periodically to fix this, but during the rebuild (which can take hours for large datasets), what do you do with new inserts? Queue them? Write to a separate unindexed table and merge later?
You use REINDEX CONCURRENTLY.
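Which keeps the table writable for the whole rebuild; a sketch (index name hypothetical):

```sql
-- Builds a replacement index side by side, then atomically swaps it in;
-- inserts keep flowing to the table (and the old index) in the meantime:
REINDEX INDEX CONCURRENTLY items_embedding_idx;
```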
> But updating an HNSW graph isn’t free—you’re traversing the graph to find the right place to insert the new node and updating connections.
How do you think a B+tree gets updated?
This entire post reads like the author didn’t read Postgres’ docs, and is now upset at the poor DX/UX.
some fair points on the specifics.
> maintenance_work_mem
sure, but the knob existing doesn't solve the operational challenge of safely allocating GBs of RAM on prod for hours-long index builds.
> REINDEX CONCURRENTLY
this is still not free: it takes longer, needs 2-3x the disk space, and still impacts performance.
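at least you can watch it grind along; Postgres ships a progress view for concurrent builds (a sketch using the stock pg_stat_progress_create_index view, available since Postgres 12):

```sql
-- Which phase is the (re)build in, and how far along is it?
SELECT index_relid::regclass AS index_name, command, phase,
       blocks_done, blocks_total
FROM pg_stat_progress_create_index;
```

but watching it doesn't make the extra disk or the I/O pressure go away.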
> HNSW vs B+tree
it's not that graph updates are uniquely expensive. vector workloads have different access patterns and per-operation costs than traditional OLTP, and pg wasn't originally designed for them.
my broader point: these features exist, but using them correctly requires significant Postgres expertise. my thesis isn't "Postgres lacks features"—it's "most teams underestimate the operational complexity." dedicated vector DBs handle this automatically, and are often going to be much cheaper than the dev time put into maintaining pgvector (esp. for a small team)
> maintenance_work_mem begs to differ.
HNSW indices are big. Suppose I have an HNSW index that fits in a few hundred gigabytes of memory, or perhaps a few terabytes. How do I reasonably rebuild this using maintenance_work_mem? Double the size of my database for a week? And what about the knock-on impact on performance for the rest of my database? Presumably I'm relying on that memory for shared_buffers and caching. This is the kind of workload being discussed here, not a toy 20GB index.
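As a rough back-of-envelope (assuming fp32 vectors; $c$ is per-vector graph overhead, which depends on M and pgvector's storage details):

$$\text{size} \approx N\,(4d + c)\ \text{bytes};\qquad N = 5\times10^{8},\ d = 1024 \;\Rightarrow\; 5\times10^{8}\times 4096\,\text{B} \approx 2\ \text{TB for the raw vectors alone.}$$

At that scale, "just set maintenance_work_mem" stops being a configuration question and becomes a capacity-planning one.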
> You use REINDEX CONCURRENTLY.
Even with a bunch of worker processes, how do I do this within a reasonable timeframe?
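To be concrete about what "a bunch of worker processes" means in practice (this assumes pgvector >= 0.6, which added parallel HNSW builds; names hypothetical):

```sql
SET max_parallel_maintenance_workers = 7;  -- leader process + 7 workers
SET maintenance_work_mem = '16GB';         -- budget for the in-memory graph
REINDEX INDEX CONCURRENTLY items_embedding_idx;
```

and even then, this trades wall-clock time for CPU and I/O pressure on the same box, which is exactly the knock-on effect I'm asking about.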
> How do you think a B+tree gets updated?
Sure, the computational complexity of insertion into an HNSW index is sublinear, but the constant factors are significant and do add up. That being said, I do find this the weakest of the author's arguments.
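Roughly, under the standard HNSW cost model (the constants are the story here, not the asymptotics):

$$C_{\text{insert}} \approx \Big(\underbrace{O(\log N)}_{\text{greedy hops, upper layers}} + \underbrace{\mathit{ef\_construction}}_{\text{candidates, layer 0}}\Big) \times \underbrace{O(d)}_{\text{per distance eval}}$$

With ef_construction = 64 and d = 1536, that's on the order of 10^5 floating-point ops per insert, before counting neighbor-list updates and random page reads.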
> maintenance_work_mem
That kills the indexing process; you can't run an HNSW build with a limited amount of memory.
> How do you think a B+tree gets updated?
In a B+tree, an insert touches O(log N) pages (the height of the tree). In an HNSW graph, you need to touch literally thousands of vectors once your graph gets big enough.
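Back-of-envelope, with illustrative parameters:

$$h = \lceil \log_f N \rceil;\qquad f \approx 400,\ N = 10^{9} \;\Rightarrow\; h = 4\ \text{pages touched per insert, mostly cached.}$$

An HNSW insert at ef_construction = 64, by contrast, evaluates hundreds of candidate vectors and rewrites the neighbor lists of up to 2M nodes, each of which can be a random page, with none of the B+tree's locality.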