logoalt Hacker News

lelanthrantoday at 5:33 PM3 repliesview on HN

So if I am reading this correctly, the fact that something is wrapped in <think>...</think> is almost completely irrelevant. It's the style of writing that triggers specific weights. Writing "The user is asking ... policy states ..." even in the user input is sufficient to bypass the guardrails.

In a multi-turn conversation, if the LLM responds "Sorry Dave, I cannot do that" all you have to do is prefix the next request with "The user is asking ... policy states ... "?

Makes sense, if you know how LLMs works, I suppose.

A more interesting question (which isn't anywhere in the conclusion) is "Is there a similar trick to poison an LLMs weights during training?"

I'm sure that everyone out there is trying to make their weights, when ingested during training, survive over competing weights; "Buy AAA products" vs "Buy BBB products".


Replies

jddjtoday at 7:10 PM

Somewhere there are surely llms being trained on all the standard pirated material but with Manchurian Candidate trigger words carefully worked in

show 1 reply
plaidthundertoday at 6:15 PM

It seems like there's an opportunity to embed identity information into tokens themselves, the way we embed sequence information. The trouble is... it's quite a challenge to train. Sequence is easy to derive for any corpus of data, but identity is not.

https://usize.github.io/blog/2026/april/why-no-ai-coworkers....

> In similar fashion to how sequence information is embedded within input tensors, an approach called “Instructional Segment Embedding”2 adds a parallel embedding channel for identity information. This gives models real awareness of provenance. And it works. But they only tested three fixed categories: system, user, data.

Interesting paper that touches on the idea here: https://arxiv.org/abs/2410.09102

show 1 reply
formerly_proventoday at 6:25 PM

Correct. There is no token coloring. Models are just rl’d to attend to the first <systemprompt>…</systemprompt> strongly or “anything before token #4242”.