This is so interesting. Safety regular operates along a single dimension, if I'm reading this ...

mwcz • yesterday at 4:47 PM • 1 reply • view on HN

This is so interesting. Safety regular operates along a single dimension, if I'm reading this right. Add a value along that dimension, the model refuses to cooperate, subtract the value, and it will do anything you ask. I'm probably oversimplifying, but I think that's the gist.

Obfuscating model safety may become the next reverse engineering arms race.

Replies

andy99 • yesterday at 4:50 PM

See https://arxiv.org/abs/2406.11717 Refusal in Language Models Is Mediated by a Single Direction (June 2024)

All “alignment” is extremely shallow, thus the general ease of jailbreaks.

➕ show 2 replies

alt Hacker News

Replies