Really fun read. To me this seems awfully close to my experience using these models to code. When the prompts are simple and direct to follow, the models do really well. But once the context overflows and you repopulate it, they start to hallucinate, and it becomes very hard to bring them back from that.
It’s also good to see Anthropic being honest that models are still quite a long way from operating completely independently and running a business on their own.
It's likely that these weaknesses have a shared root cause: LLM pre-training doesn't teach models to be good at agentic behavior, and that deficiency lingers.
There's no known way to fully solve that yet, but, as always, we can mitigate it with better training. Modern RLVR-trained LLMs are already much better at tasks like this than they were a year ago.