Hacker News

gaigalas, last Thursday at 3:22 PM

> Make your coding agent prove it first

Agents love to cheat. That's an issue I don't see changing anytime soon.

Here's Opus 4.5 trying to cheat its way out of properly implementing compatibility and cross-platform support, despite the clear requirements:

https://gist.github.com/alganet/8531b935f53d842db98157e1b8c0...

> Should popen handles work with fgets/fread/fwrite? PHP supports this. Option A: Create a minimal pipe_io_stream device / Option B: Store FILE* in io_private with a flag / Option C: Only support pclose, require explicit stream wrapper for reads.

If I asked for compatibility, why give me options that won't fully achieve it?

It actually tried to "brake check" my knowledge of the interpreter (testing whether I knew enough to catch it), and proposed shortcuts all the way through the chat.

I don't want to have to pepper my chats with variations on "don't cheat". I mean, I can do it, but it seems like boilerplate.

I wish I had some similar testing-related chats to share. Agents do that all the time.

This cheating is the major blocker right now for AI-assisted automated verification, and one of the reasons why that kind of verification isn't well developed beyond general directions (give it screenshots, make it run the command, etc.).