If you do want a numeric scale, ask for a binary (e.g. true / false) and read the log probs.
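A minimal sketch of that trick, assuming the provider exposes per-token log probabilities for the sampled answer (the logprob values below are made up for illustration, not from a real call):

```python
import math

# Hypothetical top logprobs for the first token of a binary true/false
# judgment, as a provider might return them (illustrative values only).
top_logprobs = {"true": -0.12, "false": -2.18}

# Convert log-probabilities back to probabilities and renormalize over
# the two labels, turning a binary answer into a soft score in [0, 1].
p_true = math.exp(top_logprobs["true"])
p_false = math.exp(top_logprobs["false"])
score = p_true / (p_true + p_false)
print(round(score, 3))
```

The model only ever has to answer true or false, but the renormalized probability of "true" gives you the numeric scale for free.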
That's the thing with documentation: there are hardly any situations where a simple true/false works. Product decisions carry many caveats and evolving behaviors coming from different people. At that point, a numerical grading format isn't even what we want; we want reasoning, not ratings.
(disclaimer: I work at Falconer)
you would think so! but that's only optimal if the model already has all the information in recent context to make an optimally informed decision.
in practice, this is a neat context engineering trick, where the different LLM calls in the "courtroom" have different context and can contribute independent bits of reasoning to the overall "case"
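a rough sketch of what that pattern looks like, assuming any chat-completion function in place of the stubbed `llm` helper (all names here are hypothetical, not from the original post):

```python
# Hypothetical "courtroom" pattern: each role gets a different slice of
# context and argues independently; a final "judge" call sees only the
# arguments, not the raw contexts.

def llm(prompt: str) -> str:
    # Stub so the sketch runs without an API key; swap in a real
    # chat-completion call here.
    return f"[argument based on: {prompt[:40]}]"

def courtroom(question: str, contexts: list[str]) -> str:
    # One call per context slice: independent reasoning, no shared state,
    # so each call contributes its own bits of evidence to the "case".
    arguments = [
        llm(f"Context:\n{ctx}\n\nArgue for or against: {question}")
        for ctx in contexts
    ]
    # The judge aggregates the arguments rather than the raw contexts,
    # which keeps each call's context window small and focused.
    return llm("Question: " + question + "\nArguments:\n" + "\n".join(arguments))

verdict = courtroom(
    "Is this doc still accurate?",
    ["spec excerpt", "recent commit history", "support tickets"],
)
print(verdict)
```

the point is less the plumbing than the split: no single call needs the whole corpus in context, yet the final verdict is informed by all of it.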