A generic search strategy is very different from a targeted one. The task should probably determine the tool.
So I don't know the answer, but I was recently handed about 3 million surveys with 10 free-form writing fields each and tasked with surfacing the ones that might require action on the part of the company. I chose to use a couple of different small classifier models, manually strip out some common words based on obvious noise in the first 10k results, and then weight the model responses. It turned out to be almost flawless. I would NOT call this sort of thing "programming"; it's more just tweaking the black-box output of various tools until you have a set of results that looks good for your test cases (and your client ;)
All stitched together from small Hugging Face models running on a tiny server in Node.js, btw.
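For anyone curious what that kind of pipeline looks like, here's a minimal sketch in Node.js using Transformers.js. The library choice is my guess at "Hugging Face models in nodejs"; the noise-word list, model names, ensemble weights, and threshold are all made up for illustration, not the commenter's actual values:

```js
import { pipeline } from '@xenova/transformers';

// Hypothetical noise words, the kind you'd find by eyeballing the first 10k results
const NOISE = new Set(['please', 'thanks', 'n/a', 'none', 'survey']);

const stripNoise = (text) =>
  text.split(/\s+/).filter((w) => !NOISE.has(w.toLowerCase())).join(' ');

// Two small classifiers; these model names are placeholders
const sentiment = await pipeline(
  'text-classification',
  'Xenova/distilbert-base-uncased-finetuned-sst-2-english'
);
const zeroShot = await pipeline(
  'zero-shot-classification',
  'Xenova/mobilebert-uncased-mnli'
);

// Combine all free-form fields of one survey into a single actionability score
async function needsAction(fields) {
  const text = stripNoise(fields.join(' '));
  if (!text.trim()) return { score: 0, actionable: false };

  const [sent] = await sentiment(text);
  const zs = await zeroShot(text, ['requires action', 'no action needed']);

  // Hand-tuned weights, the "tweak until the test cases look good" part
  const negScore = sent.label === 'NEGATIVE' ? sent.score : 1 - sent.score;
  const actionScore = zs.scores[zs.labels.indexOf('requires action')];
  const score = 0.4 * negScore + 0.6 * actionScore;

  return { score, actionable: score > 0.7 };
}

console.log(
  await needsAction(['The product broke on day one and support never replied.'])
);
```

The weighting step is where most of the manual iteration would happen: rerun a labelled sample, nudge the weights and threshold, repeat.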
Nice, I also find small classifiers work best for things like this. Out of interest, how many, if any, of the 3 million were labelled?
Did you end up labelling any/more, or distilling from a generative model?