Hacker News

Refusal in Language Models Is Mediated by a Single Direction

53 points by fagnerbrack, today at 1:15 PM | 19 comments

Comments

hleszek, today at 6:11 PM

For open-weights models, censorship removal is now a "solved" problem. If you wait a few days after a new model release, someone will have used Heretic ( https://github.com/p-e-w/heretic ) to produce a version with the censorship removed, so in a way the only remaining use for censorship is avoiding lawsuits, not reducing improper usage.
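The "abliteration" approach behind tools like Heretic follows directly from the paper's finding: if refusal is mediated by a single direction in activation space, that direction can be estimated and projected out of the weights. A minimal numpy sketch of the idea (the function names and the difference-of-means estimate are illustrative assumptions, not Heretic's actual code):

```python
import numpy as np

def refusal_direction(harmful_acts, harmless_acts):
    """Estimate the refusal direction as the normalized difference of
    mean residual-stream activations on harmful vs. harmless prompts."""
    r = harmful_acts.mean(axis=0) - harmless_acts.mean(axis=0)
    return r / np.linalg.norm(r)

def ablate_direction(W, r):
    """Orthogonalize a weight matrix that writes into the residual
    stream against direction r:  W' = W - r r^T W.
    Afterward no input can produce output along r."""
    r = r / np.linalg.norm(r)
    return W - np.outer(r, r) @ W

# Toy demonstration with random data standing in for real activations.
rng = np.random.default_rng(0)
harmful = rng.standard_normal((32, 16)) + 2.0   # fake "harmful" activations
harmless = rng.standard_normal((32, 16))        # fake "harmless" activations
r = refusal_direction(harmful, harmless)

W = rng.standard_normal((16, 16))
W_ablated = ablate_direction(W, r)

x = rng.standard_normal(16)
print(abs(r @ (W_ablated @ x)))  # component along r is ~0 after ablation
```

Applied to every matrix that writes into the residual stream, this is a one-shot weight edit with no fine-tuning, which is why the turnaround after a model release is measured in days.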

akersten, today at 2:44 PM

This paper is from 2024, which is ancient history in this field. It's no longer true: models are now trained to resist abliteration by spreading the refusal encoding across many directions.

See https://arxiv.org/abs/2505.19056

beaker52, today at 5:02 PM

I have had LLMs refuse several of my requests. I still got my answers, but at least they tried.
