Hacker News

Retr0id today at 12:00 PM

Something being possible doesn't mean it's easy. Transforming a problem from a forbidden shape into an allowed shape could well be harder than just solving the original problem.


Replies

roenxi today at 1:23 PM

I think the article just proved that aggressive exploitation is equivalent to normal bugfixing, so it seems like there are some large and important classes of transform that are easy.

It took me a minute of thinking to understand how this could even be considered a jailbreak; if Anthropic are going to turn out models that can't handle "find and develop regression test scripts for bugs in this program" as a prompt then it is going to take serious model crippling. To be able to prompt the model someone will need to already understand secure programming - the model itself won't be able to independently detect security problems without active guidance.

OutOfHere today at 2:41 PM

It could be easier when you use a less smart uncensored model to control the smarter but censored one.