For the most part, I don’t do chatbots, except for a couple of RAG-based ones. It’s more behind-the-scenes stuff like image understanding, categorization, nuanced sentiment analysis, semantic alignment, etc.
I’ve created a framework that lets me test quality in an automated way across prompt changes and models, and I compare cost, speed, and quality.
The only thing out of all those that requires humans to judge the quality is RAG results.
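To give a rough idea of what an automated comparison like this can look like, here is a minimal sketch. It is not the actual framework: the `evaluate` helper, the two stand-in "models", the test cases, and the per-call prices are all hypothetical placeholders, and quality is reduced to exact-match accuracy on a labeled set.

```python
import time
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Result:
    name: str
    accuracy: float   # fraction of cases answered correctly
    seconds: float    # wall-clock time for the whole run
    cost_usd: float   # assumed flat price per call times number of calls


def evaluate(name: str,
             predict: Callable[[str], str],
             cases: List[Tuple[str, str]],
             cost_per_call: float) -> Result:
    """Run one prompt/model variant over labeled cases and record quality, speed, cost."""
    start = time.perf_counter()
    correct = sum(1 for text, expected in cases if predict(text) == expected)
    elapsed = time.perf_counter() - start
    return Result(name, correct / len(cases), elapsed, cost_per_call * len(cases))


# Hypothetical stand-ins for real model calls (a real setup would call an API here).
def keyword_model(text: str) -> str:
    return "negative" if ("not" in text or "bad" in text) else "positive"


def always_positive(text: str) -> str:
    return "positive"


# Tiny labeled set for a sentiment task; real suites would be much larger.
CASES = [
    ("great product", "positive"),
    ("not good at all", "negative"),
    ("bad service", "negative"),
    ("love it", "positive"),
]

results = [
    evaluate("keyword", keyword_model, CASES, cost_per_call=0.002),
    evaluate("baseline", always_positive, CASES, cost_per_call=0.001),
]

# Pick the variant with the best quality; cost and speed are there for tie-breaking.
best = max(results, key=lambda r: r.accuracy)

for r in results:
    print(f"{r.name}: accuracy={r.accuracy:.2f} time={r.seconds:.4f}s cost=${r.cost_usd:.3f}")
```

Exact-match scoring only works for tasks with a single correct label, like categorization or sentiment. Open-ended output such as RAG answers does not fit this kind of check, which is why that part still needs a human.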
So who is the winner using the framework you created?