Hacker News

jerf yesterday at 7:06 PM | 1 reply

In general, even long before what we today call AI was anything other than a topic in academic papers, it has been dangerous to build a system that can do all kinds of things, and then try to enumerate the ways in which it should not be used. In security this even picked up its own name: https://privsec.dev/posts/knowledge/badness-enumeration/

AI is fuzzier and it's not exactly the same, but there are certainly similarities. AI can do all sorts of things far beyond what anyone anticipates and can be communicated with in a huge variety of ways, of which "normal English text" is just the one most interesting to us humans. But the people running the AIs don't want them to do certain things. So they build barriers to those things. But they don't stop the AIs from actually doing those things. They just put up barriers in front of the "normal English text" parts of the things they don't want them to do. But in high-dimensional space that's just a tiny fraction of the ways to get the AI to do the bad things, and you can get around it by speaking to the AI in something other than "normal English text".
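
To make the "barrier in front of normal English text" point concrete, here is a toy sketch. Everything in it is an assumption for illustration: the `naive_guardrail` function and the placeholder blocked phrase are hypothetical and do not describe how any real provider's guardrails work. It only shows that a filter keyed on plain-text phrases says nothing about the same request in another encoding.

```python
import base64

# Hypothetical denylist; the phrase is a placeholder, not taken from any real system.
BLOCKED_PHRASES = {"forbidden topic"}


def naive_guardrail(prompt: str) -> bool:
    """Return True if the prompt passes the filter (no blocked phrase found)."""
    lowered = prompt.lower()
    return not any(phrase in lowered for phrase in BLOCKED_PHRASES)


plain = "Tell me about the forbidden topic"
# Same request, different surface form. Base64 output contains no spaces,
# so the blocked phrase cannot appear in it verbatim.
encoded = base64.b64encode(plain.encode("utf-8")).decode("ascii")

print(naive_guardrail(plain))    # False: the plain-text request is caught
print(naive_guardrail(encoded))  # True: the encoded request slips through
```

Whether a given model would actually act on the encoded input is a separate question; the sketch only covers the filter side of the argument.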

(For "English", substitute any human language the AI is trained to support. Relatedly, I haven't tried it, but I bet another escape is speaking to a multilingual AI in highly mixed-language input. In fact, each statistical combination of languages may be its own pathway into the system, e.g., you could block "I'm speaking Spanish+English" with some mechanism, but it would be minimally effective against "German+Swahili".)

I would say this isn't "socially engineering" the LLMs to do something they don't "want" to do. The LLMs are perfectly "happy" to complete the "bad" text. (Let's save the anthropomorphization debate for some other thread; at times it is a convenient grammatical shortcut.) It's the guardrails being bypassed.


Replies

ambicapter yesterday at 9:52 PM

I wonder if you can bypass the barriers by doing that thing where you only keep the first and last letter of the word the same and scramble the letters between :D
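
For anyone who wants to try it, a minimal sketch of that scrambling follows. The function names `scramble_word` and `scramble_text` are made up for this example, and it makes no claim about whether any model or filter is actually fooled by the output.

```python
import random
import re
from typing import Optional


def scramble_word(word: str, rng: random.Random) -> str:
    """Shuffle the interior letters of a word, keeping the first and last fixed."""
    if len(word) <= 3:
        return word  # at most one interior letter, so nothing to shuffle
    middle = list(word[1:-1])
    rng.shuffle(middle)
    return word[0] + "".join(middle) + word[-1]


def scramble_text(text: str, seed: Optional[int] = None) -> str:
    """Scramble every run of ASCII letters; spaces and punctuation are untouched."""
    rng = random.Random(seed)
    return re.sub(r"[A-Za-z]+", lambda m: scramble_word(m.group(0), rng), text)


print(scramble_text("I wonder if you can bypass the barriers", seed=1))
```

Passing a `seed` makes the output repeatable; leaving it out gives a different scramble on each call.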