Thanks, yesterday I was thinking of adding BM25 to a little side project, so a well-timed plug!
Do you know of any pure Python wrapper projects for managing large numbers of text and PDF documents? I thought of using Solr or ElasticSearch, but that seems too heavyweight for what I am doing. I am considering using SQLite with pysqlite3 and PyPDF2, since SQLite's FTS5 full-text extension supports BM25 ranking. Sorry to be off-topic, but I imagine many people are looking at tools for building hybrid BM25 / vector store / LLM applications.
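For the SQLite route, a minimal sketch might look like the following. It assumes the Python build's SQLite was compiled with FTS5 (common, but not guaranteed), and it takes already-extracted text as input; pulling text out of PDFs (with PyPDF2 or similar) is left out. The table name `docs`, the columns `path` and `body`, and the helpers `build_index` and `search` are all made up for illustration.

```python
import sqlite3


def build_index(docs):
    """Load (path, body) pairs into an in-memory FTS5 table (hypothetical schema)."""
    conn = sqlite3.connect(":memory:")
    # path is stored but not tokenized; body is what gets searched.
    conn.execute("CREATE VIRTUAL TABLE docs USING fts5(path UNINDEXED, body)")
    conn.executemany("INSERT INTO docs (path, body) VALUES (?, ?)", docs)
    conn.commit()
    return conn


def search(conn, query, limit=5):
    """Return (path, score) rows, best match first."""
    # FTS5's bm25() gives numerically lower scores to better matches,
    # so an ascending sort puts the best hits at the top.
    rows = conn.execute(
        "SELECT path, bm25(docs) FROM docs "
        "WHERE docs MATCH ? ORDER BY bm25(docs) LIMIT ?",
        (query, limit),
    )
    return rows.fetchall()


if __name__ == "__main__":
    conn = build_index([
        ("a.txt", "sqlite full text search with bm25 ranking"),
        ("b.txt", "vector stores and embeddings"),
        ("c.txt", "notes on cooking pasta"),
    ])
    for path, score in search(conn, "bm25"):
        print(path, score)
```

For a real collection you would point `sqlite3.connect` at a file instead of `:memory:`, and the query string follows FTS5's own query syntax, so user input with punctuation may need quoting first.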
If we're shamelessly plugging passion projects, SearchArray is a pandas extension for full-text (BM25) search, for dorking around with things in Google Colab:
https://github.com/softwaredoug/searcharray
I'll also plug Xing Han Lu's BM25S, which is very popular and has similar goals:
https://github.com/xhluca/bm25s