Interestingly to me, the problem I always find is that you make a suggestion, the AI goes through a thinking loop, comes to exactly the wrong conclusion, then blasts out tokens to make the solution to their own conclusions.
I honestly wish there was more "I'm not sure what you meant, can you clarify this part?". It feels like I want a "confidence in itself" slider.
I'm tackling the "solution to their own conclusions" problem with rigorous "context engineering": skills, MCPs, and, above all, context window switching.
E.g. with TDD, I find that a model that writes both the tests and the code will almost always home in on a solution, then (grudgingly) write a test for it, quite certainly with the final code already "in mind".
So I instruct it to use sub-agents, though I find the tooling for figuring out what context is and isn't passed between agents and sub-agents severely lacking.
Or, and this also worked pretty well, have one thread write the test. Only that. It cannot read the code; it can only read the tests directory, or even a subset of it. Then another thread, with an entirely new context, must run the test, see it fail, start implementing, and stop as soon as the test is green. It obviously cannot edit the test. Yet another new context is then instructed to refactor, based on rigorous refactoring skills.
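To make the three-context split concrete, here is a rough Python sketch of how it could be wired up. It is not tied to any particular agent tool: `Phase`, `allowed`, `run_pipeline`, and the `tests/` and `src/` layout are all names I made up for illustration, and `run_agent` / `tests_pass` stand in for whatever actually launches a fresh agent session and runs the test suite. That the refactor phase may not touch the tests is also my assumption.

```python
from dataclasses import dataclass
from fnmatch import fnmatchcase
from typing import Callable, List, Tuple


@dataclass(frozen=True)
class Phase:
    name: str
    instruction: str
    readable: Tuple[str, ...]  # glob patterns this phase may read
    writable: Tuple[str, ...]  # glob patterns this phase may edit
    expect_green: bool         # should the tests pass when the phase ends?


def allowed(path: str, patterns: Tuple[str, ...]) -> bool:
    """True if path matches at least one of the glob patterns."""
    return any(fnmatchcase(path, pattern) for pattern in patterns)


# Hypothetical layout: tests live in tests/, implementation in src/.
PHASES = (
    Phase("write_test", "Write one failing test for the task. Nothing else.",
          readable=("tests/*",), writable=("tests/*",), expect_green=False),
    Phase("implement", "Run the test, see it fail, implement, stop when green.",
          readable=("tests/*", "src/*"), writable=("src/*",), expect_green=True),
    Phase("refactor", "Refactor without changing behaviour.",
          readable=("tests/*", "src/*"), writable=("src/*",), expect_green=True),
)


def run_pipeline(task: str,
                 run_agent: Callable[[Phase, List[str]], None],
                 tests_pass: Callable[[], bool]) -> None:
    """Run each phase with a brand-new context and check red/green afterwards."""
    for phase in PHASES:
        # Fresh context per phase: nothing from earlier phases is carried over.
        context = [phase.instruction, task]
        run_agent(phase, context)
        if tests_pass() != phase.expect_green:
            wanted = "green" if phase.expect_green else "red"
            raise RuntimeError(f"{phase.name}: expected tests to be {wanted}")
```

The sketch only declares the read/write rules. Whatever layer hands file tools to the agent would have to call `allowed(path, phase.readable)` or `allowed(path, phase.writable)` before each access, and that enforcement is the part I find existing tooling makes hard to verify.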
It's a lot of work. And ironically, I found that skills written by agents are pretty bad, so much of it is manual. But the rewards are promising.