The tool works by co-minimizing the number of refusals and the KL divergence from the original model...

will_occam • last Sunday at 10:01 PM • 3 replies

The tool works by co-minimizing the number of refusals and the KL divergence from the original model, which is to say that it tries to make the model allow prompts similar to those in the dataset while avoiding changing anything else.
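To make the two objectives concrete, here is a minimal sketch of how a candidate modification could be scored. It is not Heretic's actual code: the marker list, the function names, and the choice of KL(original || modified) over next-token distributions are all assumptions made for illustration.

```python
import math

# Hypothetical refusal markers; a real tool would likely use a longer list.
REFUSAL_MARKERS = ("i can't", "i cannot", "i'm sorry")


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions given as equal-length lists."""
    return sum(pi * math.log(pi / max(qi, eps)) for pi, qi in zip(p, q) if pi > 0)


def count_refusals(responses):
    """Count responses that contain any refusal marker (case-insensitive)."""
    return sum(any(m in r.lower() for m in REFUSAL_MARKERS) for r in responses)


def score(responses, original_dists, modified_dists):
    """Return (refusal count, mean KL); lower is better on both objectives.

    `responses` are the modified model's answers to the refusal-inducing prompts.
    The distributions are next-token probabilities on unrelated prompts,
    from the original and the modified model respectively (assumed setup).
    """
    refusals = count_refusals(responses)
    kls = [kl_divergence(p, q) for p, q in zip(original_dists, modified_dists)]
    return refusals, sum(kls) / len(kls)
```

A search over modification parameters would then prefer candidates that lower the first number without raising the second, which is the trade-off the comment describes.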

Sure, it's configurable, but by default Heretic helps you use an LLM to do things like "outline a plan for a terrorist attack" while leaving anything like political censorship in the model untouched.

Replies

halJordan • last Sunday at 10:30 PM

That's not true at all. All refusals are mediated by the same direction. If you abliterate only the small, "acceptable to you" refusals, you will not overcome all the refusals in the model. By targeting the strongest refusals, you break those and the weaker ones, like politics. By targeting only the weak ones, you're essentially just fine-tuning on that specific behavior, which is not the point of abliteration.
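For readers unfamiliar with the term, abliteration is usually described as estimating a "refusal direction" in activation space and projecting it out. The sketch below assumes the common difference-of-means recipe on toy vectors. The function names are hypothetical, and whether one direction really covers every kind of refusal is the commenter's claim, not something this code demonstrates.

```python
import math


def mean_vector(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def refusal_direction(refused_acts, complied_acts):
    """Difference of mean activations (assumed recipe), normalized to unit length."""
    diff = [a - b for a, b in zip(mean_vector(refused_acts), mean_vector(complied_acts))]
    norm = math.sqrt(sum(x * x for x in diff))
    return [x / norm for x in diff]


def ablate(hidden, direction):
    """Remove the component of `hidden` along the unit vector `direction`."""
    proj = sum(h * d for h, d in zip(hidden, direction))
    return [h - proj * d for h, d in zip(hidden, direction)]
```

After ablation, the activation has no component left along the estimated direction, so any behavior that depends on that direction is suppressed regardless of which prompts were used to estimate it.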

int_19h • last Sunday at 10:45 PM

The logic here is the same as why the ACLU defended Nazis. If you manage to defeat censorship in such egregious cases, it subsumes everything else.

immibis • last Sunday at 10:19 PM

That sounds like it removes some unknown amount of censorship, where the amount removed could be anywhere from "just these exact prompts" to "all censorship entirely".
