We kept asking LLMs to rate things on 1-10 scales and getting inconsistent results. Turns out they're much better at arguing positions than assigning numbers, which makes sense given their training data. The courtroom structure (prosecution, defense, jury, judge) gave us adversarial checks we couldn't get from a single prompt. Curious if anyone has experimented with other domain-specific frameworks to scaffold LLM reasoning.
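In rough pseudocode, the general idea is something like the following (an illustrative sketch, not our production code; the role prompts and the `ask` helper are placeholders for whatever chat-completion call you already use):

    # Illustrative sketch of a courtroom-style adversarial review.
    # `ask(role_prompt, content)` is a placeholder: it should call your LLM
    # with the given role prompt plus content and return the reply as a string.
    from typing import Callable

    ROLE_PROMPTS = {
        "prosecution": "Argue that the proposed change is wrong or harmful. Cite specifics.",
        "defense": "Argue that the proposed change is correct and beneficial. Cite specifics.",
        "jury": "Weigh the prosecution and defense arguments. List which points survive scrutiny.",
        "judge": "Given the jury's findings, return a verdict (ACCEPT or REJECT) with reasons.",
    }

    def courtroom_review(change: str, ask: Callable[[str, str], str]) -> str:
        prosecution = ask(ROLE_PROMPTS["prosecution"], change)
        defense = ask(ROLE_PROMPTS["defense"], change)
        jury = ask(
            ROLE_PROMPTS["jury"],
            f"CHANGE:\n{change}\n\nPROSECUTION:\n{prosecution}\n\nDEFENSE:\n{defense}",
        )
        return ask(ROLE_PROMPTS["judge"], f"CHANGE:\n{change}\n\nJURY FINDINGS:\n{jury}")

The point is that each role only sees what it needs to see, and the judge never scores anything directly; it rules on arguments.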
The reasoning gains make sense, but I am wondering about the production economics. Running four distinct agent roles per update seems like a huge multiplier on latency and token spend. Does the claimed efficiency actually offset the aggregate cost of the adversarial steps? Hard to see how the margins work out if you are quadrupling inference for every document change.
If you do want a numeric scale, ask for a binary (e.g. true / false) and read the log probs.
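A minimal sketch of that, assuming the OpenAI Python SDK and a model that exposes logprobs (the model name, prompt, and renormalization over the two labels are all illustrative choices):

    import math
    from openai import OpenAI

    client = OpenAI()

    def soft_true_score(claim: str) -> float:
        """P('true') for a binary judgment, read from token logprobs
        instead of asking the model to pick a number on a 1-10 scale."""
        resp = client.chat.completions.create(
            model="gpt-4o-mini",  # illustrative choice
            messages=[
                {"role": "system", "content": "Answer with exactly one word: true or false."},
                {"role": "user", "content": claim},
            ],
            max_tokens=1,
            logprobs=True,
            top_logprobs=5,
        )
        top = resp.choices[0].logprobs.content[0].top_logprobs
        probs = {t.token.strip().lower(): math.exp(t.logprob) for t in top}
        p_true, p_false = probs.get("true", 0.0), probs.get("false", 0.0)
        # Renormalize over the two labels so scores are comparable across prompts.
        return p_true / (p_true + p_false) if (p_true + p_false) > 0 else 0.5

You end up with a smooth score in [0, 1] without the model ever having to commit to a number.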
Experimented very briefly with a mediation framework (as opposed to a litigation one), but it was pre-LLM and just a coding/learning experience: https://github.com/dvelton/hotseat-mediator
Cool write-up of your experiment, thanks for sharing. Would be interesting to see how results from one framework (mediation, whose goal is "resolution") differ from the other (litigation, whose goal is, basically, "truth/justice").