> How are we really supposed to grasp their actual capabilities when no one will actually cite specifically what mistakes they are making.
The mistakes they make are pretty subtle. Coding with LLMs can be like that scene in Whiplash – <excellent drumming >, not quite my tempo, <excellent drumming >, downbeat on 18, <excellent drumming>, you’re rushing, <excellent drumming>, dragging, …
Like yeah it produces working code almost always and the code usually does what you asked. And yet it makes you want to throw a chair because it’s not quite right in frustrating ways and it doesn’t even have the taste to know how it’s wrong.
Well again that is just a "vibes" explanation with nothing concrete.
I feel like with LLMs, it's like a situation where you are close to some feature or project and have a pretty good idea in your head already of how you'd implement it yourself "I'd do this and have an API with that and a database table foo for storing bar with index on baz" and you're keen to get started on it ...but then someone else gets assigned to work on it not you.
They do it a totally different way than you would have thought of doing it, and the code feels alien and weird because it doesn't follow your "design" and decisions you already had in your head before they started work on it. Is it "bad" or just not how you'd have done it?
I think that is ok. So long as the code works and meets all stated requirements and is secure and performant and uses good abstractions and is not full of hacks, then it's ok to let go. Sure maybe you'd have done it a different way but ultimately that doesn't matter.
Yeah. It does this. Pretty consistently and replicably depending on the issue, in fact! Yet I can point exactly where it fails.
Why are we not showing the bad choices? On my computer I have hundreds of diffs stored by my agent code review tool that point to style/architecture failures (and in the end, the result of that iteration on the AI output)
I'm not quite sure how people are generating unsalvageable outputs. I'd never ship the result of a first AI pass, either. I review all the code and the architecture, within reason (eg: in Rust I don't preoccupy myself anymore with precisely scoping pub, or whatever, unless I'm making a library crate). I sent a "changes requested" prompt+json to my agent, and it interactively fixes everything (even style, even comments with manual patches with my in-review-tool editor)