Hacker News

Refusal in Language Models Is Mediated by a Single Direction

53 points by fagnerbrack, today at 1:15 PM | 19 comments

Comments

hleszek, today at 6:11 PM

For open-weights models, censorship removal is now a "solved" problem. If you wait a few days after a new model release, someone will have used Heretic ( https://github.com/p-e-w/heretic ) to produce a version with the censorship removed, so in a way the only remaining use for censorship is avoiding lawsuits, not reducing improper usage.
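The "abliteration" approach behind tools like Heretic follows directly from the paper's finding: if refusal is mediated by a single direction in activation space, that direction can be estimated and projected out of the weights. A minimal numpy sketch of the idea (the function names and the difference-of-means estimate are illustrative assumptions, not Heretic's actual code):

```python
import numpy as np

def refusal_direction(harmful_acts, harmless_acts):
    """Estimate the refusal direction as the normalized difference of
    mean residual-stream activations on harmful vs. harmless prompts."""
    r = harmful_acts.mean(axis=0) - harmless_acts.mean(axis=0)
    return r / np.linalg.norm(r)

def ablate_direction(W, r):
    """Orthogonalize a weight matrix that writes into the residual
    stream against direction r:  W' = W - r r^T W.
    Afterward no input can produce output along r."""
    r = r / np.linalg.norm(r)
    return W - np.outer(r, r) @ W

# Toy demonstration with random data standing in for real activations.
rng = np.random.default_rng(0)
harmful = rng.standard_normal((32, 16)) + 2.0   # fake "harmful" activations
harmless = rng.standard_normal((32, 16))        # fake "harmless" activations
r = refusal_direction(harmful, harmless)

W = rng.standard_normal((16, 16))
W_ablated = ablate_direction(W, r)

x = rng.standard_normal(16)
print(abs(r @ (W_ablated @ x)))  # component along r is ~0 after ablation
```

Applied to every matrix that writes into the residual stream, this is a one-shot weight edit with no fine-tuning, which is why the turnaround after a model release is measured in days.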

akersten, today at 2:44 PM

This paper is from 2024, which is ancient history in this field. It's no longer true: models are now trained to resist abliteration by spreading the refusal encoding across many directions.

See https://arxiv.org/abs/2505.19056

beaker52, today at 5:02 PM

I have had LLMs refuse several of my requests. I still got my answers, but at least they tried.
