If we talk about a median well-educated human, o1 likely passes the bar. Quite a few tests of reasoning suggest that's the case. An example:
“Preprint out today that tests o1-preview's medical reasoning experiments against a baseline of 100s of clinicians.
In this case the title says it all:
Superhuman performance of a large language model on the reasoning tasks of a physician
Link: https://arxiv.org/abs/2412.10849”. — Adam Rodman, a co-author of the paper https://x.com/AdamRodmanMD/status/186902305691786464
---
Have you tried using o1 with a variety of problems?
The paper you linked claims on page 10 that machines have been performing comparably on the task since 2012, so I'm not sure exactly what the paper is supposed to show in this context.
Am I to conclude that we've had a comparably intelligent machine since 2012?
Given the similar performance between GPT-4 and o1 on this task, I wonder if GPT-3.5 is significantly better than a human, too.
Sorry if my thoughts are a bit scattered, but it feels like that benchmark shows how good statistical methods are in general, not that LLMs are better reasoners than humans.
You've probably read and understood more than me, so I'm happy for you to clarify.