btilly • today at 2:58 PM • 5 replies

I don't believe that this is unfixable. Just have an internal verbal loop of, "Is this a security issue?" The thought that it potentially is should trigger both a high priority on getting it right, and an unwillingness to write a test case demonstrating the security angle of it.

In other words, do not put a guard rail on the idea of security. Put a guard rail on what it does after encountering the thought that it might be revealing a security issue. That takes good judgment, but judgment of a kind that this model apparently already had.

Replies

torben-friis • today at 3:13 PM

The end result of that is that your model can't fix or acknowledge security issues for fear of disclosing them.

This is the beauty the above poster mentioned: the ability to improve code is inherently coupled with the ability to recognize its shortcomings. You can't have one without the other.

thewebguyd • today at 6:37 PM

> and an unwillingness to write a test case demonstrating the security angle of it.

If the model can't be transparent and tries to hide things from me, then it's a completely useless and untrustworthy tool.

Refusing to write tests is not even remotely a valid solution.

The valid solution is for these labs to understand that the model is MY agent, not theirs. It should respect my prompts and not refuse.

Hardware supply needs to catch up and prices need to drop so we can all move to local, open-weight models. Clearly the hosted options cannot be trusted.

aspenmartin • today at 3:16 PM

Right, but the issue is that users have full control over context. An action by a coding agent that violates security in one context can be completely innocuous in another, and a task can also be broken down into multiple subtasks that, in isolation, do not violate anything.

lachlan_gray • today at 3:18 PM

I think they were doing something like this; the tradeoff is that it's hard to do without an irritating number of false positives and/or wasting loads of precious tokens on useless audits.

Kinrany • today at 3:06 PM

That would make the model useless.