Retr0id • yesterday at 1:40 PM

> aggressive exploitation is equivalent to normal bugfixing

It isn't, though. The Venn diagram has overlap for sure, and the "normal bugfixing" flows may yield results that are useful for offensive security, but a more targeted prompt asking for a specific security objective would be more effective, if allowed.

If the guardrails can be bypassed at, say, 50x token cost (due to the agent also pursuing things you don't care about), then they're still pretty effective as a safeguard, because at that cost you might as well hire humans instead.

And having to "babysit" a model while you re-prompt to work around guardrails strongly limits how much you can scale up your work.

Replies

Barbing • yesterday at 2:15 PM

> If the guardrails can be bypassed at, say, 50x token cost […], then it's still pretty effective as a safeguard, because at that cost you might as well hire humans instead.

If humans have to be hired at inflated rates because you're, e.g., the North Korean government, hopefully 50x token costs don't look competitive.

chillfox • yesterday at 2:09 PM

Not really; you can just get a smaller unrestricted model to prompt the bigger one.
