I disagree with your assessment pretty strongly -- the models themselves hit a wall over a year ago once companies exhausted all existing training data. LLMs don't induce world models, and they aren't capable of real search and planning outside their training distributions. They, structurally, never will be.
Over the past year, I haven't noticed a change in what I trust a model to generate in response to a single prompt. The failure modes are unchanged. Yes, specific failures have improved as they have been documented and passed into model training data, but the way the models fail has not changed. They still fail for me nearly every single day. I'm a pretty heavy user -- 3-4 Claude Code processes running at a time, all day every day.
What has gotten better is the tooling around the model -- but there's no room for exponential growth there. At least, not without an exponential cost increase, which would make the whole thing untenable anyway.