BM25 is an ancient algo developed in the 1970s. It’s basically a crappy statistical model and statisticians can do far better today. Search is strictly dominated by learning (that yes, can use search as an input). Not many folks realize that yet, and / or are incentivized to keep the old tech going as long as possible, but market pressures will change that.
While BM25 did emerge from earlier work in the 1970s and 1980s (specifically building on the probabilistic ranking principle), I'm curious about your perspective on a few things:
What specific modern statistical approaches are you seeing as superior replacements for BM25 in practical applications? I'm particularly interested in how they handle edge cases like rare terms and document length normalization that BM25 was explicitly designed to address.
While I agree learning-based approaches have shown impressive results, could you elaborate on what you mean by search being "strictly dominated" by learning methods? Are you referring to specific benchmarks or real-world applications?
Those are some really spicy opinions. It would seem that many search experts might not agree.
David Tippet (formerly opensearch and now at Github)
A great podcast with David Tippet and Nicolay Gerold entitled:
"BM25 is the workhorse of search; vectors are its visionary cousin"
I think there are also incentives to "sell new things". That's always been the case in search which has had a bazillion trends and "AI related things" as long as I've worked in it. We have massively VC funded vector search companies with armies of tech evangelists pushing a specific point of view right now.
Meanwhile, the amount of manual curation, basic, boring hand-curated taxonomies that actually drive things like "semantic search" at places like Google are simply staggering. Just nobody talks about them much at conferences because they're not very sexy.
Are those the same market pressures that made Google discard or repurpose a lot of working old search tech for new shiny ML-based search tech? The same tech that makes you add "+reddit" in every search so you can evade the adversarial SEO war?
PS: Ancient != bad. I don't know what weird technologist take worries about the age of an invention/discovery of a technique rather than its usefulness.