Literally yesterday we had a post about GPT-5.2, which jumped 30% on ARC-AGI 2, hit 100% on AIME without tools, and posted a bunch of other impressive stats. A layman's reading (mine) of those numbers suggests the models continue to improve as fast as they always have. Then today we have people saying every iteration is further from AGI. What really perplexes me is how split-brain HN is on this topic.
One classic problem in all ML is ensuring the benchmark is representative and that the algorithm isn’t overfitting the benchmark.
This remains an open problem for LLMs: we don't have true AGI benchmarks, and LLMs frequently learn the benchmark problems without necessarily getting much better in the real world. Gemini 3 has been hailed precisely because it delivered large gains across the board that don't look like benchmark overfitting.
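For what it's worth, here is a toy sketch of that gap using ordinary scikit-learn, nothing LLM-specific. Everything in it (make_data, the shift parameter, the numbers) is made up purely for illustration: a model can look great on a benchmark drawn from the distribution it was tuned on while doing much worse on a shifted "real world" distribution.

```python
# Toy illustration: good benchmark score, weaker real-world score.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

def make_data(n, shift=0.0):
    # Two Gaussian classes; `shift` moves the class means toward each other,
    # mimicking a mismatch between benchmark data and deployment data.
    X0 = rng.normal(loc=0.0 + shift, scale=1.0, size=(n, 5))
    X1 = rng.normal(loc=1.0 - shift, scale=1.0, size=(n, 5))
    return np.vstack([X0, X1]), np.array([0] * n + [1] * n)

X_train, y_train = make_data(2000)
X_bench, y_bench = make_data(500)             # benchmark: same distribution as training
X_real, y_real = make_data(500, shift=0.4)    # "real world": shifted distribution

model = LogisticRegression().fit(X_train, y_train)
print("benchmark accuracy:", round(model.score(X_bench, y_bench), 3))
print("shifted 'real world' accuracy:", round(model.score(X_real, y_real), 3))
```

The benchmark number comes out high and the shifted number much lower, even though nobody cheated; the benchmark simply wasn't representative of where the model gets used.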
HN is not an entity with a single perspective, and there are plenty of people on here who have a financial stake in you believing their perspective on the matter.
Just because they're better at writing CS algorithms doesn't mean they're taking steps closer to anything resembling AGI.
HN is not a single person. Different people on HN have different opinions.
Goodhart's law: When a measure becomes a target, it ceases to be a good measure.
AI companies have a strong incentive to make the scores go up. They may employ humans to write training data that closely resembles the benchmarks, hacking the benchmark without directly training on the test set.
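The "not directly training on the test set" part is typically enforced with string-overlap decontamination checks, roughly like this simplified sketch (the exact n-gram lengths and procedures vary by lab; shares_long_ngram and the example strings here are made up):

```python
def ngrams(text, n=8):
    # Set of all length-n word sequences in the text.
    tokens = text.lower().split()
    return {" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def shares_long_ngram(train_doc, benchmark_items, n=8):
    # Flag a training document if it shares any long n-gram with a benchmark item.
    train_grams = ngrams(train_doc, n)
    return any(train_grams & ngrams(item, n) for item in benchmark_items)

benchmark = ["Find the number of ordered pairs of integers (a, b) such that a + b divides ab."]
verbatim_copy = "A forum post quoting: Find the number of ordered pairs of integers (a, b) such that a + b divides ab."
paraphrase = "How many integer pairs (a, b) have the property that a + b divides ab?"

print(shares_long_ngram(verbatim_copy, benchmark, n=6))  # True: caught
print(shares_long_ngram(paraphrase, benchmark, n=6))     # False: slips through
```

Note the paraphrase passes the filter. That's exactly why paying people to write benchmark-adjacent problems can lift scores while the lab can still truthfully say it decontaminated against the test set.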
Throwing your hard problems from work at an LLM is a better metric than any benchmark.