> In most cases, LLMs can get you 80-95% of the way, sometimes less, sometimes more.
That's my experience too, but in my case it's more like 60-95% of the way[1], at about 120-140% of the lines of code actually required. I wish there were a harness that let me mask which code it should and shouldn't change, because prompt-based refactors fail from the same over-eagerness.
[1] I try faster, smaller models first.
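Short of a harness that really masks code, one rough approximation is to check after the fact whether the model touched anything it was told to leave alone. The sketch below is only an illustration under stated assumptions: the `# lock-begin` / `# lock-end` markers and both function names are made up for this example and are not a feature of any existing tool.

```python
# Hypothetical marker comments; no existing harness is known to honor these.
LOCK_BEGIN = "# lock-begin"
LOCK_END = "# lock-end"


def locked_blocks(source: str) -> list[str]:
    """Return the text of every region wrapped in the lock markers."""
    blocks = []
    current = None  # lines of the region being read, or None when outside one
    for line in source.splitlines():
        stripped = line.strip()
        if stripped == LOCK_BEGIN:
            current = []
        elif stripped == LOCK_END and current is not None:
            blocks.append("\n".join(current))
            current = None
        elif current is not None:
            current.append(line)
    return blocks


def violated_locks(original: str, edited: str) -> list[str]:
    """Return locked regions of `original` that do not survive verbatim in `edited`."""
    surviving = locked_blocks(edited)
    return [block for block in locked_blocks(original) if block not in surviving]


original = "a = 1\n# lock-begin\nb = 2\n# lock-end\nc = 3\n"
edited = "a = 1\n# lock-begin\nb = 99\n# lock-end\nc = 3\n"
if violated_locks(original, edited):
    print("Rejecting the edit: a locked region was changed.")
```

This only detects a violation so the edit can be rejected or retried. It does not stop the model from attempting the change, which is what a real mask would do.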
We had the same issue until we created a review skill that we run after an LLM is done implementing a feature. We give it a list of things to check, based on problems we have observed previously (overly verbose code, for example), and ask it to report issues and suggest improvements. The developer can then give feedback and let the LLM fix the issues, or just address them manually. It’s still early, but I’ve been much happier with the results. It also makes human review much easier, since there’s a report covering what the change is about, why it was made, things to keep an eye on, etc. You can do this with whatever harness you already use and there’s nothing to buy; it's just a suggestion from someone trying to make the best use of this insane technology.
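To make that concrete, here is a minimal sketch of how such a review prompt could be assembled from a checklist. It is not our actual skill: the `Check` class, the `build_review_prompt` function, and the wording of the prompt are all assumptions made for illustration, and of the checklist entries only the verbosity one comes from the description above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Check:
    name: str
    instruction: str


# Illustrative checklist. "verbosity" is the example named above;
# "scope" is a made-up placeholder for whatever problems you have observed.
CHECKS = [
    Check("verbosity", "Flag code that is more verbose than the task needs."),
    Check("scope", "Flag changes to code the task did not ask to touch."),
]

# Report sections taken from the description above.
REPORT_SECTIONS = [
    "What the change is about",
    "Why it was made",
    "Things to keep an eye on",
]


def build_review_prompt(diff: str, checks: list[Check]) -> str:
    """Build a report-only review prompt from a checklist and a diff."""
    numbered = "\n".join(
        f"{i}. {check.name}: {check.instruction}"
        for i, check in enumerate(checks, start=1)
    )
    sections = "\n".join(f"- {section}" for section in REPORT_SECTIONS)
    return (
        "Review the change below. Do not modify any code yet.\n\n"
        f"Check for these known problems:\n{numbered}\n\n"
        f"Then write a report with these sections:\n{sections}\n"
        "- Issues found, each with a suggested improvement\n\n"
        f"Change under review:\n{diff}\n"
    )


print(build_review_prompt("diff --git a/app.py b/app.py", CHECKS))
```

Keeping the checklist as data means a newly observed problem becomes one more entry rather than a rewrite of the prompt, and asking for a report first leaves the choice of what to fix with the developer.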