Hacker News

p-e-w, last Sunday at 5:14 PM

The alignment has certainly become stronger though. Llama 3.1 is trivial to decensor with abliteration and Heretic's optimizer will rapidly converge to parameters that completely stomp out refusals, while for gpt-oss and Qwen3, most parameter configurations barely have an effect and it takes much longer to reach something that even slightly lowers the refusal rate.
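The core step of abliteration can be sketched as projecting an estimated "refusal direction" out of a model's weight matrices, so the layer can no longer write along that direction. The following is a toy numpy sketch, not Heretic's actual code; the `refusal_dir` vector (typically estimated as the difference of mean activations between refused and accepted prompts) is assumed given:

```python
import numpy as np

def abliterate(W: np.ndarray, refusal_dir: np.ndarray) -> np.ndarray:
    """Remove the refusal direction from the output space of W,
    so that W @ x has zero component along that direction."""
    r = refusal_dir / np.linalg.norm(refusal_dir)  # unit vector
    # Subtract the rank-1 projection of W onto the refusal direction.
    return W - np.outer(r, r) @ W

rng = np.random.default_rng(0)
W = rng.standard_normal((8, 8))      # stand-in for a layer's weights
r = rng.standard_normal(8)           # stand-in refusal direction
W_abl = abliterate(W, r)

x = rng.standard_normal(8)
r_hat = r / np.linalg.norm(r)
# The abliterated layer's output has no component along the refusal
# direction (up to floating-point error):
print(abs(r_hat @ (W_abl @ x)))
```

What an optimizer like Heretic's then searches over is *which* layers to modify and how strongly, which is presumably where gpt-oss and Qwen3 give it so little traction.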


Replies

shikon7, last Sunday at 5:30 PM

It seems to me that thinking models are harder to decensor, as they are trained to reason about whether to accept your request.
