logoalt Hacker News

edude03yesterday at 2:34 PM4 repliesview on HN

I have the same experience despite using claude every day. As an funny anecdote:

Someone I know wrote the code and the unit tests for a new feature with an agent. The code was subtly wrong, fine, it happens, but worse the 30 or so tests they added added 10 minutes to the test run time and they all essentially amounted to `expect(true).to.be(true)` because the LLM had worked around the code not working in the tests


Replies

monoosoyesterday at 3:25 PM

There was an article on HN last week (?) which described this exact behaviour in the newer models.

Older, less "capable", models would fail to accomplish a task. Newer models would cheat, and provide a worthless but apparently functional solution.

Hopefully someone with a larger context window than myself can recall the article in question.

show 1 reply
sReinwaldyesterday at 4:00 PM

From my experience: TDD helps here - write (or have AI write) tests first, review them as the spec, then let it implement.

But when I use Claude code, I also supervise it somewhat closely. I don't let it go wild, and if it starts to make changes to existing tests it better have a damn good reason or it gets the hose again.

The failure mode here is letting the AI manage both the implementation and the testing. May as well ask high schoolers to grade their own exams. Everyone got an A+, how surprising!

show 1 reply
jermaustin1yesterday at 3:32 PM

This happens with me every time I try to get claude to write tests. I've given up on it. Instead I will write the tests if I really care enough to have tests.

antonvsyesterday at 3:12 PM

> they all essentially amounted to `expect(true).to.be(true)` because the LLM had worked around the code not working in the tests

A very human solution

show 1 reply