
bob1029 · yesterday at 6:19 PM

I'm still stuck on whether or not vector search (regardless of vendor) is actually the right way to solve the kinds of problems that everyone seems to believe it's great at.

BM25 with query rewriting & expansion can do a lot of heavy lifting if you invest any time at all in configuring things to match your problem space. The article touches on FTS engines and hybrid approaches, but I would start there. Figure out where lexical techniques actually break down and then reach for the "semantic" technology. I'd argue that an LLM in front of a traditional lexical search engine (i.e., tool use) would generally be more powerful than a sloppy semantic vector space or a fine-tuning job. It would also be significantly easier to trace and shape retrieval behavior.
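To make that concrete, here's a minimal sketch of the query rewriting & expansion idea in Python, assuming the rank_bm25 package for the lexical side; expand_query is a hypothetical stand-in for an LLM call (or a synonym table) that rewrites the user's query into a few lexical variants:

```python
from collections import defaultdict
from rank_bm25 import BM25Okapi  # pip install rank-bm25

corpus = [
    "How to configure the HTTP client timeout",
    "Resetting a forgotten account password",
    "Tuning BM25 parameters for short documents",
]
bm25 = BM25Okapi([doc.lower().split() for doc in corpus])

def expand_query(query: str) -> list[str]:
    """Hypothetical rewrite step: an LLM (or synonym table) would return the
    original query plus a few lexical variants to broaden recall."""
    return [query, "change login credentials", "recover account access"]

def search(query: str, top_k: int = 3) -> list[str]:
    # Score every rewrite against the BM25 index, keep the best score per doc.
    best = defaultdict(float)
    for variant in expand_query(query):
        scores = bm25.get_scores(variant.lower().split())
        for idx, score in enumerate(scores):
            best[idx] = max(best[idx], float(score))
    ranked = sorted(best, key=best.get, reverse=True)[:top_k]
    return [corpus[i] for i in ranked]

print(search("I forgot my password"))
```

The same shape works with Lucene or any FTS engine behind the search call; the point is that the rewriting stays outside the index, so it's easy to trace which variant pulled in which document.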

Lucene is often all you need. They've recently added vector search capabilities if you think you really need some kind of hybrid abomination.


Replies

kgeist · yesterday at 8:46 PM

I'm currently building RAG for our product (using Lucene). What I've found is that embeddings alone don't help much. With hybrid search (BM25+HNSW) they gave me only about a +10% boost compared to BM25 alone (on average). In my evaluation datasets, the only case where they helped tremendously was when a user asks a question in French but the documents are all in English; retrieval went from 6% to 65% on some datasets.
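For reference, the score-fusion half of a hybrid setup like that can be as small as this; a sketch assuming you already have per-document BM25 scores and dense (embedding) similarities for the same candidate set, blended with min-max normalization and a tunable weight:

```python
import numpy as np

def minmax(x: np.ndarray) -> np.ndarray:
    """Scale scores to [0, 1] so the two signals are comparable."""
    span = x.max() - x.min()
    return (x - x.min()) / span if span > 0 else np.zeros_like(x)

def hybrid_scores(bm25_scores: np.ndarray,
                  dense_scores: np.ndarray,
                  alpha: float = 0.5) -> np.ndarray:
    """Weighted blend of lexical and vector scores.
    alpha=1.0 is pure BM25, alpha=0.0 is pure vector search."""
    return alpha * minmax(bm25_scores) + (1 - alpha) * minmax(dense_scores)

# Toy example: 4 candidate documents scored by both retrievers.
bm25 = np.array([12.3, 0.0, 7.1, 3.4])
dense = np.array([0.62, 0.81, 0.55, 0.40])
order = np.argsort(-hybrid_scores(bm25, dense, alpha=0.7))
print(order)  # document indices, best first
```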

I got a significant boost (from 65% on average to over 80%) by adding a proper reranker and query rewriting (generating three additional phrases to search for).

I think embeddings are overrated in that blog posts often make you believe they are the end of the story. What I've found is that they should instead be treated as a lightweight filtering/screening tool to quickly find a pool of candidates as a first stage, before you do the actual work (apply a reranker). If BM25 already works just as well as a pre-filtering tool, you don't even need embeddings (with all the indexing headaches).
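That two-stage shape (cheap candidate pool first, heavier reranker second) looks roughly like this; a sketch assuming the sentence-transformers package and one of its commonly used cross-encoder checkpoints, with the first stage left abstract since it could be BM25, an embedding index, or both:

```python
from sentence_transformers import CrossEncoder  # pip install sentence-transformers

# Stage 2: a cross-encoder scores (query, document) pairs jointly, which is
# slower than embedding cosine similarity but considerably more accurate.
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def rerank(query: str, candidates: list[str], top_k: int = 5) -> list[str]:
    """Re-order a small candidate pool (from BM25 or a vector index) by
    cross-encoder relevance and keep the best top_k."""
    scores = reranker.predict([(query, doc) for doc in candidates])
    ranked = sorted(zip(candidates, scores), key=lambda p: p[1], reverse=True)
    return [doc for doc, _ in ranked[:top_k]]

# The pool would come from your first-stage retriever (BM25, HNSW, or hybrid).
pool = ["doc about password reset", "doc about HTTP timeouts", "doc about SSO login"]
print(rerank("how do I recover my account", pool, top_k=2))
```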

mhuffman · yesterday at 7:20 PM

I like Lucene and have used it for many years, but sometimes a conceptually close match is what you want. Lucene and friends are fantastic at word matching, fuzzy searches, stem searches, phonetic searches, faceting and more, but have nothing for conceptually or semantically close searches (I understand they recently added document vector searches). Also, vector searches almost always return something, which is not ideal in a lot of cases. I like Reciprocal Rank Fusion myself, as it gives the best of both worlds. As a fun trick, I use DuckDB to do RRF with 5 million+ documents and get low double-digit ms response times even under load.
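For anyone unfamiliar, Reciprocal Rank Fusion merges ranked lists purely by rank position, so no score normalization is needed across retrievers; a minimal plain-Python sketch of the formula (k=60 is the conventional constant), without trying to reproduce the DuckDB SQL version:

```python
from collections import defaultdict

def rrf(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Reciprocal Rank Fusion: each list contributes 1 / (k + rank) per document,
    and documents are sorted by the summed score."""
    scores: dict[str, float] = defaultdict(float)
    for ranked_list in rankings:
        for rank, doc_id in enumerate(ranked_list, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Toy example: lexical and vector retrievers disagree on ordering.
lexical = ["d1", "d3", "d7", "d2"]
vector  = ["d3", "d9", "d1", "d5"]
print(rrf([lexical, vector]))  # d1 and d3 float to the top
```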