I suspect real AGI evals aren't going to be "IQ test"-like, which is how I'd categorize these benchmarks.
LLMs will probably continue to scale on such benchmarks, as they have been, without needing real ingenuity or intelligence.
Obviously I don't know the answer, but I think it's the same root problem that will keep neural networks from ever leading to intelligence. We're building and testing idiot savants.