Hacker News

andy99 · last Sunday at 4:50 PM

See "Refusal in Language Models Is Mediated by a Single Direction" (June 2024): https://arxiv.org/abs/2406.11717

All "alignment" is extremely shallow, which is why jailbreaks are generally so easy.
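
For context, the paper's approach is roughly: collect residual-stream activations for harmful and harmless prompts, take the difference of means as a "refusal direction", and then ablate that direction. A minimal NumPy sketch of the idea; the file names and single-layer setup are placeholders, not the paper's actual code:

    import numpy as np

    # Residual-stream activations collected at one layer (placeholder files),
    # shape [n_prompts, d_model] for each prompt set.
    harmful_acts = np.load("harmful_acts.npy")
    harmless_acts = np.load("harmless_acts.npy")

    # Difference of means gives the candidate "refusal direction".
    refusal_dir = harmful_acts.mean(axis=0) - harmless_acts.mean(axis=0)
    refusal_dir /= np.linalg.norm(refusal_dir)

    def ablate(acts):
        # Project out the refusal direction from each activation vector.
        coeffs = acts @ refusal_dir                  # shape [n_prompts]
        return acts - np.outer(coeffs, refusal_dir)  # shape [n_prompts, d_model]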


Replies

mwcz · last Sunday at 7:36 PM

Yes, I wasn't clear: that is the paper I was reading, not the Heretic README.

p-e-w · last Sunday at 5:14 PM

The alignment has certainly become stronger, though. Llama 3.1 is trivial to decensor with abliteration, and Heretic's optimizer rapidly converges to parameters that completely stomp out refusals. For gpt-oss and Qwen3, by contrast, most parameter configurations barely have an effect, and it takes much longer to reach something that even slightly lowers the refusal rate.
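
Concretely, abliteration bakes the ablation into the weights: each matrix that writes into the residual stream has the refusal direction projected out, possibly with a per-matrix strength that an optimizer like Heretic's can tune. A rough sketch of that idea; the function name and the alpha knob are illustrative, not Heretic's actual parameterization:

    import numpy as np

    def ablate_weight(W, refusal_dir, alpha=1.0):
        # W writes into the residual stream (rows indexed by d_model).
        # Subtract the rank-1 component of W along the refusal direction,
        # scaled by alpha, the kind of per-matrix knob an optimizer could tune.
        d = refusal_dir / np.linalg.norm(refusal_dir)
        return W - alpha * np.outer(d, d) @ W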
