In my experience, evaluating Opus 4.6 in an existing code base has gone like this.
You say "Do this thing".
- It does the thing (takes 15 min). Looks incredibly fast. I couldn't code that fast. It's inhuman. So far all the fantastical claims hold up.
But still. You ask "Did you do the thing?"
- It says oops, I forgot to do that sub-thing. (+5m)
- it fixes the sub-thing (+10m)
You ask, "Is the change well integrated with the system?"
- It says not really, let me rework this a bit. (+5m)
- It irons out the wrinkles (+10m)
You ask, "Does this follow best engineering practices? Is it good code, something we can be proud of?"
- It says not really, here are some improvements. (+5m)
- It implements the best practices (+15m)
You say to look carefully at the change set and see if it can spot any potential bugs or issues.
- It says oh, I've introduced a race condition at line 35 in file foo and a null-correctness bug at line 180 of file bar. Fixing. (+15m)
You ask if there's test coverage for these latest fixes?
- It says "i forgor" and adds them. (+15m)
Now the change set has shrunk a bit and superficially looks good. Still, you must read the code line by line, and an experienced eye will still find weird stuff happening in several of the functions: there are redundant operations, and resources aren't always freed. (+60m)
You ask why it's implemented in such a roundabout way and how it intends for the resources to be freed up?
- It says "you're absolutely right" and rewrites the functions. (+15m)
You ask if there's test coverage for these latest fixes?
- It says "i forgor" and adds them. (+15m)
Now the 15 minutes of amazingly fast AI code gen has ballooned into taking most of the afternoon.
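Adding up the timings quoted above makes the ballooning concrete. A minimal sketch in Python; the minutes are the ones from the list, while the step labels are just shorthand made up for illustration:

```python
# Minutes per step, copied from the walkthrough above.
# The labels are illustrative shorthand, not anything the model reported.
steps = [
    ("initial implementation", 15),
    ("admits forgotten sub-thing", 5),
    ("fixes sub-thing", 10),
    ("admits poor integration", 5),
    ("irons out wrinkles", 10),
    ("proposes improvements", 5),
    ("implements best practices", 15),
    ("finds and fixes bugs", 15),
    ("adds tests", 15),
    ("manual line-by-line review", 60),
    ("rewrites functions", 15),
    ("adds tests again", 15),
]

total = sum(minutes for _, minutes in steps)
print(f"{total} min = {total // 60} h {total % 60} min")  # 185 min = 3 h 5 min
```

Only the first 15 minutes are the headline "code gen"; the remaining 170 are follow-up and review.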
Telling Claude to be diligent, to not write bugs, or to write high-quality code flat out does not work. And even if such prompting can reduce the odds of omissions or lapses, you still always always always have to check the output. It cannot find all the bugs and mistakes on its own. If there are bugs in its training data, you can assume there will be bugs in its output.
(You can make it run through much of this Socratic checklist on its own, but this doesn't really save wall-clock time, and it doesn't remove the need for manual checking.)
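A rough sketch of what running the checklist automatically could look like. It assumes a hypothetical `ask` callable that sends one prompt to the model session and returns its reply; `ask`, `run_checklist`, and `CHECKLIST` are placeholder names for illustration, not part of any real Claude API:

```python
from typing import Callable, List, Tuple

# Follow-up questions paraphrased from the walkthrough above.
CHECKLIST = [
    "Did you do everything that was asked?",
    "Is the change well integrated with the system?",
    "Does this follow best engineering practices?",
    "Look carefully at the change set. Any potential bugs or issues?",
    "Is there test coverage for the latest fixes?",
]


def run_checklist(ask: Callable[[str], str], task: str) -> List[Tuple[str, str]]:
    """Send the task, then each follow-up, and collect (prompt, reply) pairs.

    `ask` is a hypothetical stand-in for whatever sends a prompt to the
    model and returns its reply as text.
    """
    transcript = [(task, ask(task))]
    for question in CHECKLIST:
        transcript.append((question, ask(question)))
    return transcript
```

Each follow-up is still a full round trip to the model, and the transcript and resulting diff still have to be read by a human, so the loop saves typing rather than time.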