Hacker News

deepsquirrelnet yesterday at 9:18 PM

This is just evaluation, not “benchmarking”. If you haven’t set up evaluation on something you’re putting into production, then what are you even doing?

Stop prompt engineering, put down the crayons. Statistical model outputs need to be evaluated.


Replies

andy99 yesterday at 9:24 PM

What does that look like, in your opinion? What do you use?

lorey yesterday at 9:34 PM

This went straight to prod, even earlier than I'd opted for. What do you mean?
