Hacker News

Lerc yesterday at 6:33 PM

The "r's in strawberry" question presents a different level of task from what people imagine. It seems trivial to a naive observer because the answer is easily derivable from the question without extra knowledge.

A more accurate analogy for humans would be to imagine that every word had a colour. You are told that there is also a sequence of different colours that corresponds to the same colour as that word. You are even given a book showing every combination to memorise.

You learn the colours well enough that you can read and write coherently using them.

Then comes the question of how many chocolate-browns are in teal-with-a-hint-of-red. You know that teal-with-a-hint-of-red is a fruit and you know that the colour can also be constructed by crimson followed by Disney-blond. Now, do both of those contain chocolate-brown, or just one of them, and how many?

It requires exercising memory to do a task that is underrepresented in the training data, because humans simply do not have to do the task at all when the answer can be derived from the question representation. Humans also don't have the ability that the LLMs need, but the letter representation doesn't require that ability.
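
To make the analogy concrete, here is a minimal Python sketch. The vocabulary and token IDs are made up for illustration (they are not taken from any real tokenizer), and the lookup table stands in for what a model would have to have memorised about each token's spelling.

```python
# Toy vocabulary: hypothetical token IDs, not taken from any real tokenizer.
VOCAB = {
    "strawberry": 101,
    "straw": 102,
    "berry": 103,
}
# Reverse table: the "book" that maps each colour (ID) back to its spelling.
ID_TO_TEXT = {token_id: text for text, token_id in VOCAB.items()}


def count_in_text(word: str, letter: str) -> int:
    """What a human sees: the letters are right there in the question."""
    return word.count(letter)


def count_in_tokens(token_ids: list[int], letter: str) -> int:
    """What the model sees: opaque IDs, so the spelling of each one has to
    be recalled (here, looked up) before anything can be counted."""
    return sum(ID_TO_TEXT[token_id].count(letter) for token_id in token_ids)


print(count_in_text("strawberry", "r"))
print(count_in_tokens([101], "r"))
print(count_in_tokens([102, 103], "r"))
```

The first function needs nothing beyond the question itself. The second cannot do anything without `ID_TO_TEXT`, which is the part that has to come from memory, and it has to give the same answer whether the word arrives as one ID or as two.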


Replies

wahnfrieden yesterday at 7:19 PM

That’s what makes it a fair evaluation and something that requires improvement. We shouldn’t only evaluate agent skills by what is most commonly represented in training data. We expect performance from them in areas that existing training data may cover poorly. You don’t need to invent an absurdity to find these cases.
