
UncleEntity, today at 1:43 AM

The problem I run into is its propensity to cheat, so you can't trust the code it produces.

For example, I have a project where the idea is to use code verification to ensure the code is correct. The stated goal of the project is to produce verified software, and the daffy robot still can't seem to understand that the verification part is the critical piece, so... it cheats on the tests so they pass. I had the newest Claude Code (4.6?) look over the tests on the day it was released, and the issues it found were really, really bad.

Now, the newest plan is to produce a tool which generates the tests from a DSL, so they can't be rigged to pass or written to match buggy code instead of the clearly defined specification. Oh, I guess I didn't mention there's an actual spec for what we're trying to do, and it's very clear. In fact, it should be relatively trivial for some super-human coding machine to ensure the tests match it.
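
To give a rough idea of what I mean by generating tests from a DSL, here's a minimal sketch. Everything in it is made up for illustration: the one-line-per-case format, the `parse_spec` and `check` helpers, and the saturating 8-bit add being specified. It is not the actual tool or the actual spec. The point is just that the expected values come from the spec text, never from the implementation, so a test can't quietly be edited to agree with buggy code.

```python
from typing import Callable, List, Tuple

# Hypothetical spec DSL: one case per line, "arg1 arg2 ... => expected".
# The function being specified (saturating 8-bit add) is an invented example.
SPEC = """
# saturating 8-bit add
0 0 => 0
200 100 => 255
255 1 => 255
17 25 => 42
"""

Case = Tuple[Tuple[int, ...], int]


def parse_spec(text: str) -> List[Case]:
    """Turn the spec text into (args, expected) pairs."""
    cases: List[Case] = []
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and blank lines
        if not line:
            continue
        lhs, rhs = line.split("=>")
        cases.append((tuple(int(tok) for tok in lhs.split()), int(rhs)))
    return cases


def check(impl: Callable[..., int], cases: List[Case]) -> List[str]:
    """Run every generated case against impl and collect failure messages."""
    failures: List[str] = []
    for args, expected in cases:
        actual = impl(*args)
        if actual != expected:
            failures.append(f"{args}: expected {expected}, got {actual}")
    return failures


def sat_add_ok(a: int, b: int) -> int:
    return min(a + b, 255)


def sat_add_buggy(a: int, b: int) -> int:
    return (a + b) % 256  # wraps around instead of saturating


if __name__ == "__main__":
    cases = parse_spec(SPEC)
    print(check(sat_add_ok, cases))
    print(check(sat_add_buggy, cases))
```

The only way to make the buggy version pass here is to change the spec file itself, which is a lot easier to catch in review than a subtly weakened assertion buried in a test.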