If you do want a numeric scale, ask for a binary (e.g. true / false) and read the log probs.
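A minimal sketch of that trick, assuming the provider exposes per-token log probabilities for the sampled answer (the logprob values below are made up for illustration, not from a real call):

```python
import math

# Hypothetical top logprobs for the first token of a binary true/false
# judgment, as a provider might return them (illustrative values only).
top_logprobs = {"true": -0.12, "false": -2.18}

# Convert log-probabilities back to probabilities and renormalize over
# the two labels, turning a binary answer into a soft score in [0, 1].
p_true = math.exp(top_logprobs["true"])
p_false = math.exp(top_logprobs["false"])
score = p_true / (p_true + p_false)
print(round(score, 3))
```

The model only ever has to answer true or false, but the renormalized probability of "true" gives you the numeric scale for free.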
That's the thing with documentation: there are hardly any situations where a simple true/false works. Product decisions carry many caveats and evolving behaviors coming from different people. At that point, a numerical grading format isn't even what we want; we want reasoning, not ratings.
(disclaimer: I work at Falconer)
you would think so! but that's only optimal if the model already has all the information in recent context to make an optimally informed decision.
in practice, this is a neat context engineering trick, where the different LLM calls in the "courtroom" have different context and can contribute independent bits of reasoning to the overall "case"
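a rough sketch of what that pattern looks like, assuming any chat-completion function in place of the stubbed `llm` helper (all names here are hypothetical, not from the original post):

```python
# Hypothetical "courtroom" pattern: each role gets a different slice of
# context and argues independently; a final "judge" call sees only the
# arguments, not the raw contexts.

def llm(prompt: str) -> str:
    # Stub so the sketch runs without an API key; swap in a real
    # chat-completion call here.
    return f"[argument based on: {prompt[:40]}]"

def courtroom(question: str, contexts: list[str]) -> str:
    # One call per context slice: independent reasoning, no shared state,
    # so each call contributes its own bits of evidence to the "case".
    arguments = [
        llm(f"Context:\n{ctx}\n\nArgue for or against: {question}")
        for ctx in contexts
    ]
    # The judge aggregates the arguments rather than the raw contexts,
    # which keeps each call's context window small and focused.
    return llm("Question: " + question + "\nArguments:\n" + "\n".join(arguments))

verdict = courtroom(
    "Is this doc still accurate?",
    ["spec excerpt", "recent commit history", "support tickets"],
)
print(verdict)
```

the point is less the plumbing than the split: no single call needs the whole corpus in context, yet the final verdict is informed by all of it.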