Hacker News

HarHarVeryFunny today at 12:22 PM

Exactly - it effectively is a "jailbreak" since it accomplishes something the model's security filter was trying to prevent, and the ridiculous simplicity of it shows just how broken that type of security is.

I wonder if Dario is now regretting hyping up how dangerous the model is? How does he walk this back? Do the feds let him just put a band-aid on it?


Replies

bitexploder today at 1:06 PM

I also have a 100% success rate jailbreaking them by breaking the work down into small pieces and stripping all security-related language: smaller tasks, test engineering, and normal programming language. Fable found a few bugs in my harness for me before they pulled it. I was testing it vs ChatGPT, Gemini, and Opus. It was doing well at bug hunting.

MPSimmons today at 1:35 PM

I think it's a side effect of the Transformer architecture. The worldview where all input is equally trusted, and there's no concept of "the other", makes it hard to build effective guardrails where some input is trusted and other input is not.

an0malous today at 2:29 PM

Cheapest option is to gift an enormous golden statue of Trump for his ballroom.
