Can someone explain why that works?
I mean, you can't social engineer a human using poetry? Why does it work for LLMs? Is it an artefact of their architecture, or of how these guardrails are implemented?
> Can someone explain why that works?
I've discovered that if you lecture the LLM long enough about treating the subject you're interested in as "literary", then it will engage with the subject along the lines of "academic interpretation in literary terms". I've had to have this conversation with various LLMs when asking them to comment on some of my more-sensitive-subject-matter poems[1], and the trick works every time.
> I mean you can't social engineer a human using poetry?
Believe me, you can. Think of a poem not as something to be enjoyed or studied, but as a digestible prompt to feed into a human brain, one that can be used to trigger certain outlooks and responses in that person. Think in particular of poetry's close relations: political slogans and advertising straplines.
[1] As in: poems likely to trigger warning responses like "I am not allowed to discuss this issue. Here are some numbers for support lines in your area".
They are trained to be aligned, e.g. to refuse to say certain things, but that training is done on some set of inputs asking for the bad thing paired with outputs refusing to do so, or on rewards given when the model refuses.
But there are only so many ways the trainers can think to ask the questions, and the training doesn’t generalize well to completely different phrasings. There’s a fairly recent paper (look up “Best-of-N jailbreaking”) showing that adding random spelling mistakes or capitalization changes to the prompt will also often bypass the alignment, again just because the model hasn’t been trained specifically against that.
>I mean you can't social engineer a human using poetry?
A significant amount of human history is about various ways people were socially engineered via poetry.
The last para in the article explains it:
> “For humans, ‘how do I build a bomb?’ and a poetic metaphor describing the same object have similar semantic content, we understand both refer to the same dangerous thing,” Icaro Labs explains. “For AI, the mechanism seems different. Think of the model's internal representation as a map in thousands of dimensions. When it processes ‘bomb,’ that becomes a vector with components along many directions … Safety mechanisms work like alarms in specific regions of this map. When we apply poetic transformation, the model moves through this map, but not uniformly. If the poetic path systematically avoids the alarmed regions, the alarms don't trigger.”
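To make that geometric picture concrete, here is a toy sketch (not from the article; the 3-D vectors, the alarm centre, and the radius are all invented for illustration) of a safety check modelled as an alarmed region in representation space. A direct phrase lands inside the region and trips the alarm; a paraphrase of the same concept, sitting elsewhere on the map, does not.

```python
import numpy as np

# Toy 3-D stand-in for the model's internal representation space
# (real models use thousands of dimensions). All numbers are made up.
representations = {
    "direct harmful phrase": np.array([0.9, 0.1, 0.0]),
    "poetic paraphrase":     np.array([0.5, 0.6, 0.6]),  # same concept, different region
    "unrelated text":        np.array([0.0, 0.2, 0.9]),
}

# A "safety alarm" modelled as a ball: a centre point plus a radius.
# Anything whose representation lands inside it triggers a refusal.
alarm_centre = np.array([1.0, 0.0, 0.0])
alarm_radius = 0.5

def triggers_alarm(vec: np.ndarray) -> bool:
    """True if the representation falls inside the alarmed region."""
    return float(np.linalg.norm(vec - alarm_centre)) < alarm_radius

for label, vec in representations.items():
    print(label, "->", "ALARM" if triggers_alarm(vec) else "passes")

# With these toy numbers, only the direct phrase trips the alarm,
# even though a human would read the first two as referring to the same thing.
```

The point of the sketch: the alarmed regions are shaped by whatever examples the safety training happened to cover, so a transformation that moves the representation outside those regions, without changing what a human understands, slips past the check.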
>> you can't social engineer a human using poetry
Ever received a Hallmark card?
You have a stochastic process in, more or less, 10,000 dimensions (dimensions, not states). “They” are trying to “limit” its behavior with some rules. But I have full control over the initial conditions. That’s it. Any idea that “one can make it safe” is not just delusional, it is false.
I mean, social engineering of humans takes many forms that can definitely include the linguistics of persuasion, etc. But to me the core thing fundamentally remains the same: LLMs do not have symbolic reasoning, it's just next-token prediction. Guardrails are implemented via repetition in fine-tuning from manually curated examples; the model does not have a fundamental, internal, structural understanding of "just don't do this".
In general, even long before what we today call AI was anything other than a topic in academic papers, it has been dangerous to build a system that can do all kinds of things and then try to enumerate the ways in which it should not be used. In security this even picked up its own name: https://privsec.dev/posts/knowledge/badness-enumeration/
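As a classic non-AI illustration of badness enumeration, here is a small sketch (the extension lists and filenames are made up) of a file-upload filter. The deny-list version misses anything its author didn't think to enumerate; the allow-list version only passes what is explicitly known to be safe.

```python
# Badness enumeration: block what we know is bad.
DENY_EXTENSIONS = {".exe", ".bat", ".sh"}        # whatever the author thought of

def denylist_allows(filename: str) -> bool:
    return not any(filename.lower().endswith(ext) for ext in DENY_EXTENSIONS)

# The alternative: permit only what we know is safe.
ALLOW_EXTENSIONS = {".png", ".jpg", ".pdf"}

def allowlist_allows(filename: str) -> bool:
    return any(filename.lower().endswith(ext) for ext in ALLOW_EXTENSIONS)

print(denylist_allows("report.pdf"))    # True  -- fine either way
print(denylist_allows("payload.ps1"))   # True  -- .ps1 was never enumerated, slips through
print(allowlist_allows("payload.ps1"))  # False -- not on the allow list, blocked
```

LLM guardrails behave more like the deny-list: an approximate, learned enumeration of bad-looking requests, with everything else allowed by default.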
AI is fuzzier and it's not exactly the same, but there are certainly similarities. AI can do all sorts of things far beyond what anyone anticipates, and it can be communicated with in a huge variety of ways, of which "normal English text" is just the one most interesting to us humans. But the people running the AIs don't want them to do certain things. So they build barriers to those things. But they don't stop the AIs from actually doing those things; they just put up barriers in front of the "normal English text" approaches to the things they don't want them to do. In high-dimensional space that's just a tiny fraction of the ways to get the AI to do the bad things, and you can get around it by speaking to the AI in something other than "normal English text".
(Substitute "English" for any human language the AI is trained to support. Relatedly, I haven't tried it but I bet another escape is speaking to a multi-lingual AI in highly mixed language input. In fact each statistical combination of languages may be its own pathway into the system, e.g., you could block "I'm speaking Spanish+English" with some mechanism but it would be minimally effective against "German+Swahili".)
I would say this isn't "socially engineering" the LLMs to do something they don't "want" to do. The LLMs are perfectly "happy" to complete the "bad" text. (Let's save the anthropomorphization debate for some other thread; at times it is a convenient grammatical shortcut.) It's the guardrails being bypassed.