Hacker News

jryan49 yesterday at 9:48 PM (2 replies)

Something I always wonder with each blog post comparing different types of prompt engineering: did they run it once, or multiple times? LLMs are not consistent on the same task. I imagine the authors realize this, of course, but I never get enough detail about the testing methodology.


Replies

CuriouslyC yesterday at 11:27 PM

I make a habit of doing a lot of duplicate runs when I benchmark for this reason. Joke's on me: in the time I spent doing 1 benchmark with real confidence intervals and getting no traction on my post, I could have done 10 shitty benchmarks, or 1 shitty benchmark and 9x more blogspam. Perverse incentives rule us all.

only-one1701 yesterday at 10:00 PM

This drives me absolutely crazy. Non-falsifiable and non-deterministic results. All of this stuff is (at best) anecdotes and vibes being presented as science and engineering.
