Neither intelligence nor context is what really differentiates the most successful model for programming (Claude Opus 4.6) from slightly 'smarter' competitors (Codex 5.3, Gemini 3.1 Pro).
It's tool use and personality. If models stopped advancing today, we could still reach effective AGI with years of refining harnesses. There is still incredible untapped potential there.
I maintain a benchmark at https://gertlabs.com that pits models against each other in open-ended competitive games. It's harder to game the benchmark because there's no correct answer (at least none that any of the models has gotten remotely close to), and it requires anticipating other players' behavior.
One thing I've found is that Codex and Gemini models tend to perform best at one-shotting problems. But when given a harness and tools to iterate toward a solution, Anthropic models keep improving, while Codex and Gemini struggle to use tools they weren't trained on, or to take the initiative to pursue high-level objectives.
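To make the "harness and tools" setup concrete, here is a minimal sketch of the kind of loop I mean: the model proposes an action, the harness executes a tool, feeds the result back, and the cycle repeats until the objective is met. Everything here is hypothetical scaffolding (`propose_action` is a stub standing in for a real LLM API call, and the tool registry is a toy), not anyone's actual benchmark code.

```python
# Hypothetical sketch of an agentic harness loop. `propose_action`
# stands in for a real model call; the tool registry is a toy example.

def propose_action(objective, history):
    """Stub model: in a real harness this would be an LLM API call
    that sees the objective and the tool-result history."""
    value = history[-1] if history else 0
    if value >= objective:
        return ("done", value)
    return ("increment", value)

def run_tool(name, value):
    """Toy tool registry: a real harness would expose a shell,
    file editor, test runner, etc."""
    tools = {"increment": lambda v: v + 1}
    return tools[name](value)

def harness(objective, max_steps=20):
    """Drive the model-tool loop until it signals completion
    or the step budget runs out."""
    history = []
    for _ in range(max_steps):
        action, value = propose_action(objective, history)
        if action == "done":
            return value
        history.append(run_tool(action, value))
    return history[-1] if history else None

print(harness(5))  # → 5
```

The point of the distinction above is the feedback edge: a model that can only one-shot never sees `history`, while a model that uses the loop well gets to correct itself on every iteration.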
“If models stopped advancing today, we could still reach effective AGI with years of refining harnesses.”
Unless you’re a machine learning engineer with inside knowledge to share, our current models are not even close to AGI, and won’t get there.
My understanding (as just an engineer) is that LLMs continue to improve at a remarkable rate, but it’s clear that LLMs alone are not the answer for AGI.