Retr0id yesterday at 1:40 PM

> aggressive exploitation is equivalent to normal bugfixing

It isn't, though. The Venn diagram has overlap for sure, and the "normal bugfixing" flows may yield results that are useful for offensive security, but a more targeted prompt asking for a specific security objective would be more effective, if allowed.

If the guardrails can be bypassed at, say, 50x token cost (due to the agent also pursuing things you don't care about), then it's still pretty effective as a safeguard, because at that cost you might as well hire humans instead.

And, having to "babysit" a model while you re-prompt to work around guardrails strongly limits how much you can scale up your work.


Replies

Barbing yesterday at 2:15 PM

> If the guardrails can be bypassed at, say 50x token cost […], then it's still pretty effective as a safeguard, because at that cost you might as well hire humans instead.

If humans have to be hired at inflated rates because you’re e.g. the North Korean government, hopefully 50x token costs don’t look competitive.

chillfox yesterday at 2:09 PM

Not really, you can just get a smaller unrestricted model to prompt the bigger one.