> Make your coding agent prove it first
Agents love to cheat. That's an issue I don't see a horizon for change.
Here's Opus 4.5 trying to cheat its way out of properly implementing compatibility and cross-platform, despite the clear requirements:
https://gist.github.com/alganet/8531b935f53d842db98157e1b8c0...
> Should popen handles work with fgets/fread/fwrite? PHP supports this. Option A: Create a minimal pipe_io_stream device / Option B: Store FILE* in io_private with a flag / Option C: Only support pclose, require explicit stream wrapper for reads.
If I asked for compatibility, why give me options that won't fully achieve it?
It actually tried to "break check" my knowledge about the interpreter (test me if I knew enough to catch it), and proposed shortcuts all the way through the chat.
I don't want to have to pepper my chats with variations on "don't cheat". I mean, I can do it, but it seems like boilerplate.
I wish I had some similar testing-related chats to share. Agents do that all the time.
This is the major blocker right now for AI-assisted automated verification, and one of the reasons why this isn't well developed beyond general directions (give it screenshots, make it run the command, etc).