Hacker News

thatjoeoverthr yesterday at 10:37 PM

If you do want a numeric scale, ask for a binary (e.g. true / false) and read the log probs.
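A minimal sketch of that idea, assuming your API can return the top candidate tokens and their log probs for the first answer token as a plain dict. The function name `score_from_logprobs`, the dict shape, and the token spellings are illustrative assumptions, not any particular provider's API:

```python
import math

def score_from_logprobs(top_logprobs, yes="true", no="false"):
    """Turn the log probs of a binary answer into a score in [0, 1].

    top_logprobs: dict mapping candidate token -> log probability
    (hypothetical shape; adapt it to whatever your API returns).
    """
    p_yes = 0.0
    p_no = 0.0
    for token, logprob in top_logprobs.items():
        # Tokenizers often emit variants such as " True" or "true",
        # so normalize before comparing.
        label = token.strip().lower()
        if label == yes:
            p_yes += math.exp(logprob)
        elif label == no:
            p_no += math.exp(logprob)
    total = p_yes + p_no
    if total == 0.0:
        return None  # neither answer appeared among the candidates
    # Renormalize over the two answers so the score ignores other tokens.
    return p_yes / total

# Made-up log probs, for illustration only.
example = {"true": math.log(0.6), " True": math.log(0.1), "false": math.log(0.2)}
print(score_from_logprobs(example))
```

Renormalizing over just the two answers is one design choice; you could instead keep the raw probability of "true" if you want mass on other tokens to count against the score.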


Replies

kyeb yesterday at 10:48 PM

(disclaimer: I work at Falconer)

you would think so! but that's only optimal if the model already has all the information in recent context to make an optimally-informed decision.

in practice, this is a neat context engineering trick, where the different LLM calls in the "courtroom" have different context and can contribute independent bits of reasoning to the overall "case"

aryamanagraw yesterday at 11:34 PM

That's the thing with documentation; there are hardly any situations where a simple true/false works. Product decisions have many caveats and evolving behaviors coming from different people. At that point, a numerical grading format isn't something we even want — we want reasoning, not ratings.