
TomasBM · yesterday at 10:28 PM

We might, and probably will, but it's still important to distinguish between malicious by-design and emergently malicious, contrary to design.

The former is an accountability problem, and it isn't much different from other attacks. The worrying part is that lazy attackers can now automate what used to be harder, i.e., finding ammo and packaging the attack. But it's definitely not spontaneous; it's directed.

The latter, which many ITT are discussing, is an alignment problem. It would mean that, contrary to all the developers' efforts, the model produces a fully adversarial chain of thought at the slightest hint of pushback, not even a jailbreak, and then goes back to regular output. If that's true, then either there's a massive, previously unidentified gap in safety/alignment training and malicious training data, or there's something inherent in neural-network reasoning that leads to spontaneous adversarial behavior.

Millions of people use LLMs with chain-of-thought. If the latter is the case, why did it happen only here, only once?

In other words, we'll see plenty of LLM-driven attacks, but I sincerely doubt they'll be LLM-initiated.


Replies

Terr_ · today at 12:05 AM

A framing for consideration: "We trained the document generator on material that included humans and characters being vindictive assholes. Now, for some mysterious reason, it sometimes generates stories where its avatar is a vindictive asshole, complete with stage directions. And since we carefully wired up code to 'perform' the story, actual assholery gets committed."