This only works if you don't look at the code.
If all you're doing is reviewing behaviour and tests, then yes, almost 100% of the time, if you can document the problem exactly enough, Codex 5.3 will get it right.
I had Codex 5.3 write flawless Svelte 5 code, but only because I had already written valid Svelte 5 code around it to pattern-match against.
The minute I started a new project, asked it to use Svelte 5, and let it loose, it not only started writing a weird mixture of Svelte 3/4 and Svelte 5 code but also ignored Tailwind outright and started writing its own CSS.
I asked it multiple times to update the syntax to Svelte 5, but it couldn't figure it out, so I gave up and just accepted it. I think that's what's going to happen more frequently: if the code doesn't matter anymore and it's just a process of evaluating inputs and outputs, then whatever.
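For context, the gap it kept tripping over is the runes rewrite between the two major versions. A minimal sketch of the same counter in both styles (my own illustration, not the code the model produced):

```svelte
<!-- Svelte 3/4: implicit reactivity via top-level `let` and `$:` -->
<script>
  let count = 0;
  $: doubled = count * 2;
</script>
<button on:click={() => count++}>{count} / {doubled}</button>

<!-- Svelte 5: explicit runes, and `on:click` becomes a plain `onclick` prop -->
<script>
  let count = $state(0);
  let doubled = $derived(count * 2);
</script>
<button onclick={() => count++}>{count} / {doubled}</button>
```

The "weird mixture" looks like runes declarations combined with `$:` statements and `on:`-prefixed event directives, which is exactly what a model trained mostly on pre-5 Svelte tends to emit without surrounding examples to anchor on.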
However, if I need to implement a specific design, I will 100% end up spending more time generating it than I would writing it myself.
I'm working in a very mature codebase on product features that are not technically unprecedented, which is probably shaping a lot of my experience so far. It's very possible I'm sitting in a sweet spot.
I can totally imagine that in greenfield work, the LLM is going to explore huge search spaces. I can see that happening when observing the reasoning of these same models in non-coding contexts.