Hacker News

andai today at 1:35 PM (4 replies)

This implies that other benchmarks (for which every AI provider is optimizing?) are actively encouraging bullshitting?


Replies

WarmWash today at 2:46 PM

There is a tradeoff where as factual accuracy increases, creativity decreases, and the model becomes more "rigid" and less general. Unfortunately it seems that creativity is a good quality for reasoning and ultimately problem solving.

So we have a situation where models that can solve challenging problems also tend to hallucinate, but those hallucinations seem to be the breeding ground for the solutions that earned them their high "wow" factor intelligence.

wongarsu today at 2:15 PM

Yes. Most benchmarks just measure how many answers are correct. The best way to optimize for that is to confidently state something in the hope that it's correct. That is exactly how most LLMs behave, despite plenty of evidence that they do know whether they "know" something.

whimblepop today at 1:59 PM

Bullshitting is how LLMs work. It doesn't require active encouragement. All it takes is a machine without consciousness or physical access to the world and an actually-lived life. A training set that contains lots of confident answers and few to no refusals doesn't help either.

Zababa today at 2:23 PM

They are, especially multiple-choice questions. The same happens with human exams:

Let's say there are 100 questions, with 4 answers each. A good answer is worth 1 point. By just guessing you get an average of 25/100, way more than 0/100 by not replying.

If instead a wrong answer is worth -1 point, by just guessing you get on average 25 - 75 = -50/100 (about 25 right, 75 wrong), way worse than 0/100.
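
For anyone who wants to check the arithmetic, here is a minimal sketch of the expected score from pure guessing, assuming every question has the same number of choices, guesses are uniformly random, and a correct answer is always worth +1. The function name `expected_guess_score` and the `wrong_penalty` parameter are made up for illustration. The third call uses a 1/3 penalty, which is not from the example above; it is just the value at which guessing on 4-choice questions breaks even with leaving everything blank.

```python
from fractions import Fraction


def expected_guess_score(num_questions, num_choices, wrong_penalty):
    """Expected total score from guessing uniformly at random on every question.

    Assumes a correct answer scores +1 and a wrong answer scores -wrong_penalty.
    """
    p_correct = Fraction(1, num_choices)
    # Per-question expectation: gain from a lucky guess minus the expected penalty.
    per_question = p_correct * 1 - (1 - p_correct) * wrong_penalty
    return num_questions * per_question


print(expected_guess_score(100, 4, 0))               # 25  (no penalty: guessing beats abstaining)
print(expected_guess_score(100, 4, 1))               # -50 (penalty of 1: abstaining beats guessing)
print(expected_guess_score(100, 4, Fraction(1, 3)))  # 0   (hypothetical break-even penalty)
```

With no penalty, a test taker (or a model tuned against that score) is always better off answering than abstaining, which is the incentive being described here.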