
ilaksh | 10/11/2024

They did that for o1 and o1-preview. If you read the paper or do your own testing with that SOTA model, you will see that the paper's headline claim is nonsense: with the best models, the problems they point out are mostly marginal, on the order of one or two percentage points when changing numbers and the like.

They are taking the poor performance of undersized models and claiming it proves some fundamental limitation of large models, even though their own tests show that isn't true.


Replies

foobarqux | 10/11/2024

You choose to ignore Figure 8, which shows an 18% drop from simply adding an irrelevant detail.
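
For anyone who hasn't read the paper, the perturbation behind that figure is roughly this: append a clause that has no bearing on the arithmetic and see whether the answer changes. A minimal sketch; the problem and distractor below are invented for illustration, not the paper's actual items:

    # GSM-NoOp-style check: insert a clause that is irrelevant to the
    # arithmetic and compare the model's answers on both prompts.
    # (Invented example problem, not from the paper's benchmark.)
    base = ("Sam picks 44 apples on Friday and 58 apples on Saturday. "
            "How many apples does Sam have in total?")
    distractor = "Five of the apples were slightly smaller than average. "
    perturbed = base.replace("How many", distractor + "How many")
    # The correct answer is 102 either way; the Figure 8 result is that
    # accuracy drops when distractors like this are added.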

In the other test, the perturbations aren't particularly sophisticated: they modify the problem according to a template. As the parent comment said, this kind of test data is easy to generate (and easy for the model to pattern-match against), so maybe that is what they did. A sketch of what I mean follows.
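
To make the "easy to generate" point concrete, a GSM-Symbolic-style template is roughly this shape; the template, names, and numbers here are invented for illustration, not taken from the paper:

    import random

    # Hypothetical template-based perturbation: fix the problem structure,
    # randomize the surface details, and recompute the ground truth.
    template = ("{name} buys {n} boxes with {k} pencils each. "
                "How many pencils does {name} have?")
    name = random.choice(["Ava", "Ben", "Mia"])
    n, k = random.randint(2, 9), random.randint(3, 12)
    problem = template.format(name=name, n=n, k=k)
    answer = n * k  # ground truth falls out of the template for free
    # Anyone can mass-produce training data of exactly this shape, which is
    # why surviving these perturbations is weak evidence of reasoning.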

A better test of "reasoning" would be to isolate the concept or algorithm and generate novel instances that are completely textually different from existing problems, to see whether the model really is doing more than pattern matching. But we already know the answer, because these models can't do things like arbitrary-length multiplication.
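
That multiplication probe is trivial to run yourself, which is part of why it's convincing. A sketch, where ask_model is a placeholder for whatever model API you're testing against:

    import random

    # Probe sketch: sample random d-digit multiplications and compare the
    # model's reply to the exact product, scaling d until accuracy collapses.
    def probe(ask_model, digits, trials=20):
        correct = 0
        for _ in range(trials):
            a = random.randint(10**(digits - 1), 10**digits - 1)
            b = random.randint(10**(digits - 1), 10**digits - 1)
            reply = ask_model(f"What is {a} * {b}? Answer with only the number.")
            correct += reply.strip() == str(a * b)
        return correct / trials
    # A system actually executing the multiplication algorithm would be
    # flat across digit counts; pattern matchers fall off as digits grow.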
