slopinthebag • yesterday at 11:06 PM • 7 replies

I can't tell if this is sarcasm, but if not, you can't rely on the thing that produced invalid output to validate its own output. That is fundamentally insufficient, despite it potentially catching some errors.

Replies

creddit • yesterday at 11:08 PM

Damn. Guess I'll stop QAing my own work from now on.

iagooar • yesterday at 11:09 PM

What if "the thing" is a human, and another human is validating the output? Is that its own output (= that of a human) or not? Doesn't this apply to LLMs too, since you do not review the code within the same session that you used to generate it?

huslage • yesterday at 11:35 PM

I have had other LLMs QA the work of Claude Code and they find bugs. It's a good cycle, but the bugs almost never get fixed in one shot without causing chaos in the codebase or vast swaths of code being rewritten for no reason.

charcircuit • yesterday at 11:12 PM

Products don't have to be perfect. If they can be less buggy than before AI, you can't call that anything but a win.

latchkey • yesterday at 11:18 PM

> you can't rely on the thing that produced invalid output to validate its own output

I've been coding an app with the help of AI. At first it created some pretty awful unit tests and then over time, as more tests were created, it got better and better at creating tests. What I noticed was that AI would use the context from the tests to create valid output. When I'd find bugs it created, and have AI fix the bugs (with more tests), it would then do it the right way. So it actually was validating the invalid output because it could rely on other behaviors in the tests to find its own issues.

The project is now at the point that I've pretty much stopped writing the tests myself. I'm sure it isn't perfect, but it feels pretty comprehensive at 693 tests. Feel free to look at the code yourself [0].

[0] https://github.com/OrangeJuiceExtension/OrangeJuice/actions/...

CamperBob2 • yesterday at 11:21 PM

I can't tell if that is sarcasm. Of course you can use the same model to write tests. That's a different problem altogether, with a different series of prompts altogether!

When it comes to code review, though, it can be a good idea to pit multiple models against each other. I've relied on that trick from day 1.

Nition • yesterday at 11:16 PM

That's why you get Codex to do it. /s
