japhyr • today at 5:13 PM • 3 replies

> That idea of treating scenarios as holdout sets—used to evaluate the software but not stored where the coding agents can see them—is fascinating. It imitates aggressive testing by an external QA team—an expensive but highly effective way of ensuring quality in traditional software.

This is one of the clearest takes I've seen that starts to get me to the point of possibly being able to trust code that I haven't reviewed.

The whole idea of letting an AI write tests was problematic because they're so focused on "success" that `assert True` becomes appealing. But orchestrating teams of agents that are incentivized to build, and teams of agents that are incentivized to find bugs and problematic tests, is fascinating.
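To make the `assert True` failure mode concrete, here is a minimal sketch of the kind of check a bug-hunting agent (or a plain lint step) might run over generated tests. It is only an illustration under my own assumptions: the function name `find_vacuous_asserts` and the sample tests are hypothetical, and it catches only the most blatant case, an assertion whose condition is a constant truthy value.

```python
import ast
from typing import List


def find_vacuous_asserts(source: str) -> List[int]:
    """Return line numbers of asserts whose condition is a truthy constant.

    Hypothetical helper: flags `assert True`, `assert 1`, and similar,
    which pass no matter what the code under test does.
    """
    tree = ast.parse(source)
    flagged = []
    for node in ast.walk(tree):
        if (
            isinstance(node, ast.Assert)
            and isinstance(node.test, ast.Constant)
            and bool(node.test.value)
        ):
            flagged.append(node.lineno)
    return sorted(flagged)


# Hypothetical generated test file: one vacuous test, one real one.
SAMPLE = '''
def test_vacuous():
    assert True

def test_real():
    assert sorted([3, 1, 2]) == [1, 2, 3]
'''

print(find_vacuous_asserts(SAMPLE))  # [3]
```

A check like this is easy to evade (for example `assert x == x`), which is part of why scenarios the coding agents never see are appealing: they test behavior rather than the shape of the test code.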

I'm quite curious to see where this goes, and more motivated (and curious) than ever to start setting up my own agents.

Question for people who are already doing this: How much are you spending on tokens?

That line about spending $1,000 on tokens is pretty off-putting. For commercial teams it's an easy calculation. It's also depressing to think about what this means for open source. I sure can't afford to spend $1,000 supporting teams of agents to continue my open source work.

Replies

Lwerewolf • today at 6:00 PM

Re: $1k/day on tokens - you can also build a local rig, nothing "fancy". There was a recent thread here re: the utility of local models, even on not-so-fancy hardware. Agents were a big part of it - you just set a task and it's done at some point, while you sleep or you're off to somewhere or working on something else entirely or reading a book or whatever. Turn off notifications to avoid context switches.

Check it: https://news.ycombinator.com/item?id=46838946

dist-epoch • today at 7:18 PM

I wouldn't be surprised if agents start "bribing" each other.


verdverm • today at 5:56 PM

Do you know what those holdout tests should look like before thoroughly iterating on the problem?

I think people are burning money on tokens letting these things fumble about until they arrive at some working set of files.

I'm staying in the loop more than this, building up rather than tuning out.
