Hacker News

jumploops · 10/11/2024

> Overall, while o1-preview and o1-mini exhibit significantly stronger results compared to current open models—potentially due to improved training data and post-training procedures—they still share similar limitations with the open models.

tl;dr: the best open model dropped from 89.7% on GSM8K (full) to 30% on Symbolic-NoOp, while o1-preview dropped from 94.9% to 77.4%.

I think all this paper shows is that LLMs need space to "think" outside of their inference layer (for the current architectures, at least).

It's similar to the "draw a room, but DO NOT put an elephant in the corner" prompts that people were using with image models.

This is something practitioners have been doing for a while (via CoT, ToT, etc.), and it's the whole rationale behind OpenAI's newly launched o1-series "model."
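
For illustration, here's a minimal sketch of the direct-vs-CoT contrast, using the OpenAI Python client (>= 1.0). The model name, prompts, and example question are placeholders I'm assuming for the demo, not anything from the paper or the comment above:

```python
# Minimal sketch: direct prompting vs. chain-of-thought (CoT) prompting.
# Assumes the `openai` package (>= 1.0) and an API key in OPENAI_API_KEY.
from openai import OpenAI

client = OpenAI()

# Placeholder question in the GSM8K style (not taken from the paper).
QUESTION = "A farmer has 17 sheep. All but 9 run away. How many are left?"

def ask(system_prompt: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model name
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": QUESTION},
        ],
    )
    return response.choices[0].message.content

# Direct: the model must answer in one shot, with no scratch space.
direct = ask("Answer with only the final number.")

# CoT: the extra output tokens act as working space outside any single
# forward pass -- the same idea the o1 series bakes into a hidden
# reasoning step before the visible answer.
cot = ask("Reason step by step, then give the final answer on its own line.")

print("Direct:", direct)
print("CoT:", cot)
```

The design point is just that the CoT variant spends more output tokens, and with current architectures that's the only way to give the model more compute per question.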

There's another post that says this paper proves LLMs can't be used to build "reliable agents" -- which doesn't appear to be true when you look at o1-preview's comparatively strong performance here.


Replies

data_maan · 10/12/2024

Can you link a paper showing that LLMs can build "reliable agents"?