Hacker News

hamiltont · yesterday at 9:14 PM · 6 replies

Anecdotal tip on LLM-as-judge scoring: skip the 1-10 scale and use boolean criteria instead, then weight them manually, e.g.:

- Did it cite the 30-day return policy? Y/N
- Tone professional and empathetic? Y/N
- Offered clear next steps? Y/N

Then: 0.5 * accuracy + 0.3 * tone + 0.2 * next_steps

Why: it reduces the volatility of the judge's scores while still preserving the creativity (temperature) needed for good judgments.
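
A minimal sketch of what this could look like in practice, assuming the OpenAI Python SDK as the judge client; the model name, prompts, and criteria wiring below are placeholders, not the commenter's actual setup:

```python
# Weighted-boolean LLM-as-judge sketch. Assumes the OpenAI Python SDK;
# model, prompts, and criteria are illustrative placeholders.
from openai import OpenAI

client = OpenAI()

# criterion -> (yes/no question, manual weight); weights sum to 1.0
CRITERIA = {
    "accuracy":   ("Did it cite the 30-day return policy?", 0.5),
    "tone":       ("Is the tone professional and empathetic?", 0.3),
    "next_steps": ("Did it offer clear next steps?", 0.2),
}

def judge(response_text: str, question: str) -> bool:
    """Ask the judge model a single yes/no question about the response."""
    out = client.chat.completions.create(
        model="gpt-4o-mini",  # any judge model works here
        temperature=0,        # the judge itself stays deterministic
        messages=[{
            "role": "user",
            "content": (
                f"{question}\n\nResponse:\n{response_text}\n\n"
                "Answer with a single letter, Y or N."
            ),
        }],
    )
    return out.choices[0].message.content.strip().upper().startswith("Y")

def score(response_text: str) -> float:
    """Weighted sum of boolean criteria, in [0, 1]."""
    return sum(
        weight * judge(response_text, question)
        for question, weight in CRITERIA.values()
    )
```

One way to read the "Why" above: the judge runs at temperature 0 so the boolean checks stay stable, while the model being evaluated keeps whatever temperature it needs.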


Replies

pocketarc · yesterday at 9:21 PM

I use this approach for a ticket-based customer support agent. There's a set of boolean checks the LLM must pass before its response is allowed through. Some are hard fails; others, as you brought up, are just a weighted ding to the response's final score.

Failures are fed back to the LLM so it can regenerate, taking that feedback into account. People are much happier with it than I could have imagined, though it's definitely not cheap (but the cost difference is a very acceptable tradeoff).
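
A hedged sketch of what such a gate-then-regenerate loop could look like; the checks, names, and retry policy are illustrative assumptions, not the actual system described above:

```python
# Gate-then-regenerate sketch: hard failures force a retry with the
# failure names fed back; soft failures only ding the final score.
from dataclasses import dataclass
from typing import Callable

@dataclass
class Check:
    name: str
    passes: Callable[[str], bool]  # in practice, an LLM yes/no judgment
    hard: bool                     # hard fail = must regenerate
    weight: float = 0.0            # soft fail = weighted ding to score

# Toy stand-ins; real checks would call a judge model.
CHECKS = [
    Check("references_ticket", lambda r: "#" in r, hard=True),
    Check("professional_tone", lambda r: "!" not in r, hard=False, weight=0.3),
]

def respond(ticket: str,
            generate: Callable[[str, list[str]], str],
            max_attempts: int = 3) -> tuple[str, float]:
    """generate(ticket, feedback) is the LLM call; feedback lists the
    hard-failed checks from the previous attempt."""
    feedback: list[str] = []
    reply, score = "", 0.0
    for _ in range(max_attempts):
        reply = generate(ticket, feedback)
        hard_fails = [c.name for c in CHECKS if c.hard and not c.passes(reply)]
        score = 1.0 - sum(c.weight for c in CHECKS
                          if not c.hard and not c.passes(reply))
        if not hard_fails:
            break
        feedback = hard_fails  # regenerate with the failures as feedback
    return reply, score  # caller can escalate if hard fails persist
```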

tomjakubowski · today at 6:22 PM

Funny, this is exactly the move YouTube made with its human-as-judge video ratings, which were on a 1-5 star scale before the switch to thumbs up/thumbs down in 2010.

piskov · yesterday at 10:47 PM

How come accuracy has only 50% weight?

“You’re absolutely right! Nice catch how I absolutely fooled you”

lorey · yesterday at 9:27 PM

Yes, absolutely. This aligns with what we found: it seems necessary to be very explicit about the scoring criteria (at least for Opus 4.5).

Imustaskforhelp · yesterday at 9:22 PM

This actually seems like really good advice. I'm interested in how you might adapt this to things like programming-language benchmarks.

Would it be by having independent tests, checking whether each one passes (yes or no), and then weighting some of them (the more complicated tasks) more heavily than others? Or how exactly?
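
For what it's worth, a tiny sketch of one way to read that proposal; the task names, weights, and results below are made up for illustration:

```python
# Weighted pass/fail scoring over benchmark tasks: each task's test
# suite yields a boolean, and harder tasks carry more weight.
tasks = [
    {"name": "fizzbuzz",    "weight": 1.0, "passed": True},
    {"name": "json_parser", "weight": 3.0, "passed": False},
    {"name": "lru_cache",   "weight": 2.0, "passed": True},
]

total = sum(t["weight"] for t in tasks)
score = sum(t["weight"] for t in tasks if t["passed"]) / total
print(f"weighted pass rate: {score:.2f}")  # 0.50 with these toy numbers
```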

46493168 · yesterday at 9:33 PM

Isn’t this just rubrics?
