“Smart enough” really depends on how many other people have encountered a problem close enough to yours and solved it somewhere on the open internet, IMO.
Most of the frontier models can, when prompted and tooled correctly, do a lot of “reasoning” tasks that amount to working out which widely known paradigm the user has described.
The more difficult and obscure the problems you give them, the sooner you notice them reward hacking: altering the criteria until they are no longer attempting to solve the problem. “Advisor”-style loops help hold this off at the cost of tokens, but there is still a fairly short limit at which they will essentially give up if they can’t find all of the necessary information. Sometimes the issue is actually worse if they find a small amount of information instead of nothing: they’ll extrapolate from that tiny piece of data and generate plausible-sounding hallucinations almost every time.
And god forbid your problem involves doing something a different way than the majority of people do it. Unless you can write a full spec on it, the models will repeatedly spiral back into adjusting everything about your problem until it matches one of the most popular approaches in their training data.
> how many other people have encountered a problem close enough to yours and solved it somewhere on the open internet
I'm 100% sure that all our web, cc, codex, or whatever sessions are used in training, RL, or both.
That makes the universe the models know about at least an order of magnitude bigger than the open internet.
This may have been the case one year ago, but with contemporary models such as Opus, I run into this less often.