Hacker News

CuriouslyC · today at 5:56 PM

Until we solve the validation problem, none of this stuff is going to be more than flexes. We can automate code review, set up analytic guardrails, etc., so that looking at the code isn't important, and people have been doing that for >6 months now. You still need a human who knows the system to validate that the thing that was built matches the intent of the spec.

There are higher- and lower-leverage ways to do that, for instance reviewing tests and QA'ing the software by using it rather than reading the original code, but you can't get away from it entirely.


Replies

kaicianflone · today at 6:16 PM

I agree with this almost completely. The hard part isn’t generation anymore, it’s validating intent vs. outcome, especially once decisions are high-stakes or irreversible: think package updates or large-scale transactions.

What I’m working on (open source) is less about replacing human validation and more about scaling it: using multiple independent agents with explicit incentives and disagreement surfaced, instead of trusting a single model or a single reviewer.

Humans are still the final authority, but consensus, adversarial review, and traceable decision paths let you reserve human attention for the edge cases that actually matter, rather than reading code or outputs linearly.

Until we treat validation as a first-class system problem (not a vibe check on one model’s answer), most of this will stay in “cool demo” territory.
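
To make that concrete, here is a toy Python sketch of the consensus-plus-escalation policy described above. It is not the open-source project mentioned; the reviewer agents are hypothetical stubs standing in for independent models or prompts:

    # Toy sketch: several independent reviewer agents vote on a change,
    # disagreement is surfaced, and only contested cases go to a human.
    # The reviewers below are stubs, not a real agent API.
    from collections import Counter
    from typing import Callable

    Verdict = str  # "approve" or "reject"

    def validate(change: str,
                 reviewers: list[Callable[[str], Verdict]],
                 quorum: float = 1.0) -> str:
        votes = Counter(r(change) for r in reviewers)
        verdict, count = votes.most_common(1)[0]
        if count / len(reviewers) >= quorum:
            return verdict              # consensus: accept the agents' verdict
        return "escalate_to_human"      # disagreement is the signal worth human time

    if __name__ == "__main__":
        reviewers = [lambda c: "approve", lambda c: "approve", lambda c: "reject"]
        print(validate("some diff", reviewers))  # -> escalate_to_human

The point of the sketch is the routing decision, not the voting mechanics: human attention is spent only where the agents disagree.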

cronin101 · today at 6:02 PM

This obviously depends on what you are trying to achieve, but it’s worth mentioning that there are languages designed for formal proofs and static analysis against a spec, and I suspect we are currently underutilizing them (because historically they weren’t very fun to write, but if everything is just tokens then who cares).
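
For a flavor of what that looks like, here is a minimal Lean 4 sketch (nothing from the article; `spec` and `impl` are made-up names): the spec is stated as a proposition, and the file does not compile until the proof that the implementation meets it goes through.

    -- Minimal sketch: the spec is a proposition, and the theorem obliges you
    -- to prove the implementation satisfies it. The proof is trivial (rfl)
    -- only because impl mirrors the spec; in practice the spec would be
    -- stated more abstractly and the proof would do real work.
    def spec (xs : List Nat) (n : Nat) : Prop := n = xs.foldl (· + ·) 0

    def impl (xs : List Nat) : Nat := xs.foldl (· + ·) 0

    theorem impl_meets_spec (xs : List Nat) : spec xs (impl xs) := rfl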

And “define the spec concretely” (and how to exploit emerging behaviors) becomes the new definition of what programming is.

varispeed · today at 6:16 PM

AI also goes off the rails quickly, even the Opus 2.6 I am testing today. The proposed code is very much rubbish, but it passes the tests; it wouldn't pass skilled human review. The worst thing is that, if you let it, it will just pile tech debt on top of tech debt.

simianwords · today at 6:03 PM

Did you read the article?

>StrongDM’s answer was inspired by Scenario testing (Cem Kaner, 2003).
