Hacker News

dr_dshiv · 10/11/2024 · 6 replies · view on HN

It seems incredibly easy to generate an enormous amount of synthetic data for math. Is that happening? Does it work?
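To make the question concrete, here is a minimal sketch of what "generating synthetic math data" can mean in the simplest case: templated arithmetic problems paired with exact answers. The function name and templates are hypothetical illustrations, not any lab's actual pipeline, which would use far richer problem families.

```python
import random

def make_problem(rng: random.Random) -> tuple[str, int]:
    """Generate one synthetic arithmetic problem with its exact answer.

    Hypothetical templates for illustration; real pipelines cover many
    more problem types (algebra, word problems, proofs, etc.).
    """
    a, b = rng.randint(2, 99), rng.randint(2, 99)
    op = rng.choice(["+", "-", "*"])
    question = f"What is {a} {op} {b}?"
    answer = {"+": a + b, "-": a - b, "*": a * b}[op]
    return question, answer

# Scaling to an "enormous amount" is trivially cheap.
rng = random.Random(0)
dataset = [make_problem(rng) for _ in range(100_000)]
```

Because the ground-truth answer is computed alongside each question, every example is guaranteed correct, which is the property that makes this kind of data attractive for training.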


Replies

ilaksh · 10/11/2024

They did that for o1 and o1-preview. If you read the paper, or do your own testing with that SOTA model, you will see that the paper's conclusion doesn't hold up. With the best models, the problems it points out are mostly marginal, like a one or two percentage point drop when changing numbers, etc.

They are taking the poor performance of undersized models and claiming it proves some fundamental limitation of large models, even though their own tests show that isn't true.

MacsHeadroom · 10/11/2024

Yes, this is how o1 was trained. Math and programming, because they are verifiable.

This is also why o1 is not better at English: math skills transfer to general reasoning, but not so much to creative writing.
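The "verifiable" point can be sketched in a few lines: a math answer admits an exact automatic check, which yields a reward signal for training, while a creative-writing sample does not. This is a hedged illustration of the general idea, not any lab's actual verifier.

```python
def reward(model_answer: str, ground_truth: int) -> float:
    """Binary reward: 1.0 if the model's final answer matches exactly.

    A stand-in for the kind of verifier assumed in RL-style fine-tuning
    on math; the real machinery (answer extraction, equivalence checking)
    is more involved.
    """
    try:
        return 1.0 if int(model_answer.strip()) == ground_truth else 0.0
    except ValueError:
        return 0.0  # unparseable answer counts as wrong

# An essay has no comparable exact check, which is the asymmetry
# the comment points at.
```

For programming, the analogous verifier is running the generated code against unit tests.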

Davidzheng · 10/11/2024

In which distribution? School math, competition math, or unsolved problems? FWIW I think the first and third are probably easier to generate synthetically. It's harder to bound the difficulty, but I think the recent David Silver talk implies that doesn't matter much. Anyway, there's some work on this you can find online: they claim to improve GSM8K and MATH a bit, but not to saturate them. Idk how useful it is in practice.

bentice · 10/11/2024

Data is the wrong approach to developing reasoning. We don't want LLMs to simply memorize 3 x 3 = 9; we want them to understand that 3 + 3 + 3 = 9, therefore 3 x 3 = 9 (obviously a trivial example). If they have developed reasoning, very few examples should be needed.

The way I see it, reasoning is actually the model's ability to design and train smaller models that can learn from very few examples.

aithrowawaycomm · 10/11/2024

It's easy enough to generate an enormous amount of formal math problems, but utterly quixotic to generate an enormous amount of quantitative reasoning problems, which is the thing LLMs are lacking.

ninetyninenine · 10/11/2024

I don't think so. The data is biased toward being very general.