Hacker News

moffkalast · today at 1:19 PM

It's a benchmark and eval issue. Guessing gets them the right result sometimes, so the models rank better on error rate than they otherwise would. We need benchmarks that penalize being wrong WAY more than saying "I don't know".

Of course there's a secondary problem that the model may then overuse the "I don't know" option, but that's a matter of training it properly against that eval. A toy scoring rule along those lines is sketched below.
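Here's a minimal sketch of the kind of scoring rule that idea implies: wrong answers cost far more than abstaining. The exact weights (+1 / 0 / -4) and the function name `score_answer` are illustrative assumptions, not taken from any existing benchmark.

```python
def score_answer(prediction: str, gold: str) -> float:
    """Score one answer: abstaining is free, guessing wrong is expensive."""
    if prediction.strip().lower() == "i don't know":
        return 0.0          # abstention earns nothing but costs nothing
    if prediction.strip().lower() == gold.strip().lower():
        return 1.0          # correct answer earns full credit
    return -4.0             # confident wrong answer is heavily penalized


# With these weights, guessing only pays off when the model is right more than
# 80% of the time: expected value of a guess is p*1 + (1-p)*(-4), which is
# positive only for p > 0.8. Below that, "I don't know" scores better.
```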

You could also try thresholding the output based on perplexity to remove the parts the model is less sure about, but I don't think that's going to be super accurate.
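As a rough illustration of that idea, here's a sketch that scores each sentence of an answer by its perplexity under a causal LM and drops the high-perplexity ones. The model name, the threshold value, the sentence-splitting on periods, and the helper names (`sentence_perplexity`, `filter_unsure_parts`) are all assumptions for the sake of the example.

```python
import math

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "gpt2"        # placeholder model; swap in whatever you're evaluating
PPL_THRESHOLD = 40.0       # arbitrary cutoff; would need tuning per model/domain

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)
model.eval()


def sentence_perplexity(sentence: str) -> float:
    """Perplexity of a sentence under the model (lower = the model is 'surer')."""
    ids = tokenizer(sentence, return_tensors="pt").input_ids
    with torch.no_grad():
        # Passing labels=input_ids makes the model return mean cross-entropy
        # over the sequence; exp of that is the perplexity.
        loss = model(ids, labels=ids).loss
    return math.exp(loss.item())


def filter_unsure_parts(text: str) -> str:
    """Keep only the sentences whose perplexity is below the threshold."""
    sentences = [s.strip() for s in text.split(".") if s.strip()]
    kept = [s for s in sentences if sentence_perplexity(s) < PPL_THRESHOLD]
    return ". ".join(kept) + ("." if kept else "")


if __name__ == "__main__":
    answer = "Paris is the capital of France. The moon is made of green cheese."
    print(filter_unsure_parts(answer))
```

The obvious weakness (and probably why it wouldn't be "super accurate") is that perplexity measures how surprising the wording is, not whether the claim is true, so fluent hallucinations can still slip under the threshold.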


Replies

user_7832 · today at 1:51 PM

Yeah, I broadly agree with you. I've tried explicitly adding a prompt to "ask questions and clarify", and even fairly decent models like Gemini Pro (2.5 or 3) tend to ask questions just for the sake of it.

Which reminds me of another big issue with LLMs - they'll blindly do whatever you ask them to, without pushback. (Again, I miss 3.5/3.6-era Sonnet, which actually had half a spine. Fuck Anthropic for blindly chasing coding benchmarks at the cost of everything else.)

I've engaged in several "CMVs" (or "tell me why X is bad") with LLMs, and very often it's clear the model is just saying stuff for the sake of it, making terrible points for unjustifiable positions that collapse the moment I counter-argue even slightly rationally.