Last summer we faced a conundrum at my company, Tiger Data, a Postgres cloud vendor whose main business is in time-series data. We were trying to grow our business toward emerging AI-centric workloads and wanted to provide a state-of-the-art hybrid search stack in Postgres. We'd already built pgvectorscale in house with the goal of scaling semantic search beyond pgvector's main-memory limitations. We just needed a scalable ranked keyword search solution too.
The problem: core Postgres doesn't provide this; the leading Postgres BM25 extension, ParadeDB, is guarded behind AGPL; developing our own extension appeared daunting. We'd need a small team of sharp engineers and 6-12 months, I figured. And we'd probably still fall short of the performance of a mature system like Parade/Tantivy.
Or would we? I'd been experimenting with AI-boosted development long enough by that point to realize that with the latest tools (Claude Code + Opus) and an experienced hand (I've been working in database systems internals for 25 years now), the old time estimates pretty much go out the window.
I told our CTO I thought I could solo the project in one quarter. This raised some eyebrows.
It did take a little more time than that (two quarters), and we got some real help from the community (amazing!) after open-sourcing the pre-release. But I'm thrilled/exhausted today to share that pg_textsearch v1.0 is freely available via open source (Postgres license), on Tiger Data cloud, and hopefully soon, a hyperscaler near you:
https://github.com/timescale/pg_textsearch
In the blog post accompanying the release, I give an overview of the architecture and present benchmark results using MS MARCO. To my surprise, we were not only able to match Parade/Tantivy's query performance but to exceed it substantially, measuring a 4.7x advantage in query throughput at scale:
https://www.tigerdata.com/blog/pg-textsearch-bm25-full-text-...
It's exciting (and, to be honest, a little unnerving) to see a field I've spent so much time toiling in change so quickly in ways that enable us to be more ambitious in our technical objectives. Technical moats are moats no longer.
The benchmark scripts and methodology are available in the GitHub repo. Happy to answer any questions in the thread.
Thanks,
TJ ([email protected])
The "term positions" caveat seems like a major limitation for human-oriented searches of logs or products or whatever. I don't see it mentioned in the "what's next" section; will you address it in a future release, or is it out of scope for your intended use cases?
P.S. been shipping it for a while https://www.freshports.org/databases/pg_textsearch/ :)
Nice work. pg_search has been on my radar for a while, having BM25 natively in Postgres instead of bolting on Elasticsearch is a huge DX win. Curious about the index build time on larger datasets though. I'm working with ~2M row tables and the bottleneck for most Postgres extensions I've tried isn't query speed, it's the initial indexing. Any benchmarks on that?
FWIW TJ is not your average vibe coder imo: https://www.linkedin.com/in/todd-j-green/
In September he burned through $3,000 in API credits, though I think that was before we finally bought Max plans for everyone who wanted one.
This is really cool. I've built things on PostgreSQL tsvector FTS in the past, which works well but doesn't have whole-index ranking algorithms, so it can't do BM25.
It's a bit surprising to me that this doesn't appear to have a mechanism to say "filter for just documents matching terms X and Y, then sort by BM25 relevance" - it looks like this extension currently handles just the BM25 ranking but not the FTS filtering. Are you planning to address that in the future?
I found this example in the README quite confusing:
SELECT * FROM documents
WHERE content <@> to_bm25query('search terms', 'docs_idx') < -5.0
ORDER BY content <@> 'search terms'
LIMIT 10;
That -5.0 is a magic number which, based on my understanding of BM25, is difficult to predict in advance, since the threshold you'd want to pick varies across datasets.

Postgres is really an amazing backbone for products. Love it.
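For what it's worth, one way to sidestep the magic number is to drop the WHERE threshold entirely and let LIMIT do the cutoff. A sketch, assuming the <@> operator behaves as in the README example above (it returns a negated BM25 score, so smaller values mean more relevant):

```sql
-- Threshold-free top-k: no dataset-dependent cutoff needed,
-- just rank everything and take the 10 best matches.
SELECT *,
       content <@> to_bm25query('search terms', 'docs_idx') AS score
FROM documents
ORDER BY content <@> to_bm25query('search terms', 'docs_idx')
LIMIT 10;
```

Whether the index can satisfy this as an efficient top-k scan (rather than scoring every row) is the part I'd want to verify against the docs.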
Can you explain this in more detail? Is this for RAG, i.e. combining vector search with keyword search?
My knowledge on that subject roughly begins and ends with this excellent article, so I'd love to hear how this relates to that.
https://www.anthropic.com/engineering/contextual-retrieval
Especially since what Anthropic describes here is a bit of a Rube Goldberg machine, which also involves preprocessing (contextual summarization) and a reranking model, so I was wondering if there are any "good enough" out-of-the-box solutions for it.
Input quality is almost always the actual bottleneck. Teams spend months tuning retrieval while feeding HTML boilerplate into their vector stores.
Please oh please let GCP add this to the supported managed Postgres extensions...
Very exciting! Congrats on the release, this will be a huge benefit to all folks building RAG/rerank systems on top of Postgres. Looking forward to testing it out myself.
I hope someone from Neon is watching this thread, definitely want to play with this asap
“Just use Postgres” greybeards right again. Looking forward to giving this a go soon
Impressive benchmarks. How does the BM25 implementation handle high-frequency updates (writes) while maintaining search latency? Usually, there's a trade-off between ingest speed and search performance in Postgres-based full-text search.
I've been doing some RAG prototypes with hybrid search using pg_textsearch plus pgvector and have been very pleased with the results. Happy to see a 1.0 release!
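In case it's useful to others prototyping the same thing, here's a rough sketch of what the hybrid combination can look like using reciprocal rank fusion. The table and column names are made up, the query vector is a placeholder, and it assumes pgvector's <=> distance operator plus the <@> score operator from the pg_textsearch README:

```sql
-- Hypothetical schema: documents(id, content text, embedding vector(768))
WITH vec AS (
    -- top 50 by vector similarity (pgvector)
    SELECT id, RANK() OVER (ORDER BY embedding <=> '[0.1, 0.2]'::vector) AS r
    FROM documents
    ORDER BY embedding <=> '[0.1, 0.2]'::vector
    LIMIT 50
),
txt AS (
    -- top 50 by BM25 score (pg_textsearch)
    SELECT id, RANK() OVER (
        ORDER BY content <@> to_bm25query('search terms', 'docs_idx')) AS r
    FROM documents
    ORDER BY content <@> to_bm25query('search terms', 'docs_idx')
    LIMIT 50
)
SELECT id,
       -- reciprocal rank fusion with the conventional k = 60
       COALESCE(1.0 / (60 + vec.r), 0) + COALESCE(1.0 / (60 + txt.r), 0) AS rrf
FROM vec FULL JOIN txt USING (id)
ORDER BY rrf DESC
LIMIT 10;
```

RRF is just one option; weighted score blending works too, but it requires normalizing the two score scales first, which is exactly the kind of per-dataset tuning RRF avoids.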
we have been using pg_textsearch in production for a few weeks now, and it's been fairly stable and super speedy. we used to use paradedb (aka pg_search -- it's quite annoying that the two are so similarly named), but paradedb was extremely unstable and led to serious data corruption a bunch of times. in fact, before switching to pg_textsearch, we had already switched over to plain trigram search because paradedb was tanking our db so often...
also shoutout to tj for being super responsive on github issues!
VERY excited about this, literally just looking to build hybrid search using Postgres FTS. When will this be available on Supabase?
On a tangent note it’s amazing how hard it is to have a good case-insensitive search in Postgres.
In SQL Server you just use a case-insensitive collation (which is the default), add an index (a plain non-clustered one), and call it a day.
In Postgres you need to go above and beyond just for that. It's like the Postgres guys went "nah dog, everybody just uses lowercase; you don't need to worry about people writing john doe as John Doe".
And don't get me started on storing datetimes with a timezone offset (e.g. "4/2/2007 7:23:57 PM -07:00"). In SQL Server you have datetimeoffset; in Postgres you fuck off :-)
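To be fair, Postgres does have a couple of workable options here, they're just not the default. A sketch of the two usual approaches (table names are made up):

```sql
-- Option 1: the citext extension (case-insensitive text type).
CREATE EXTENSION IF NOT EXISTS citext;
CREATE TABLE users (name citext);
CREATE INDEX ON users (name);
-- Plain equality is now case-insensitive:
--   SELECT * FROM users WHERE name = 'john doe';  -- matches 'John Doe'

-- Option 2: an ICU nondeterministic collation (Postgres 12+).
-- 'ks-level2' means comparisons ignore case but not accents.
CREATE COLLATION ci (provider = icu,
                     locale = 'und-u-ks-level2',
                     deterministic = false);
CREATE TABLE users2 (name text COLLATE ci);
-- Equality, unique constraints, and btree indexes respect the
-- collation; note that LIKE/pattern matching on nondeterministic
-- collations only became supported in recent Postgres versions.
```

Not one-line defaults like SQL Server's collations, agreed, but it's less painful than sprinkling lower() everywhere.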
When is this available on AWS in Aurora? Anyone from AWS here, add it pronto
Thank you!! Goodbye manticore if this works.
This is really cool to see! I've been using BM25+sqlite-vec for contextual search projects for a little while, it's a great performance addition.
> ParadeDB, is guarded behind AGPL
What a wonderful ad for ParadeDB, and clear signal that "TigerData" is a pernicious entity.
This looks like a great addition to the Postgres ecosystem. When adding specialized extensions like this, it's always worth keeping an eye on how they impact overall system performance, especially memory usage and lock contention as the dataset grows. For anyone testing this out, I'd recommend using an open-source tool like *pgmetrics* (https://pgmetrics.io) to get a baseline and then monitor how the new indexes and search workloads affect your underlying metrics. It’s zero-dependency and gives you a very deep look into the internals without much overhead.