This is just evaluation, not “benchmarking”. If you haven’t setup evaluation on something you’re putting into production then what are you even doing.
Stop prompt engineering, put down the crayons. Statistical model outputs need to be evaluated.
This went straight to prod, even earlier than I'd opted for. What do you mean?
What does that look like in your opinion, what do you use?