The directional-ablation approach in Heretic is clever: by identifying “refusal directions” in the residual stream and ablating them, it shifts the model’s trade-off frontier. In rare-event screening terms, it effectively changes the geometry of the detection threshold rather than just trying to get better data. It resonates with how improving a test’s accuracy in low-prevalence settings often fails unless you also address the threshold and the base rate.
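For anyone who wants the idea in concrete form, here is a minimal sketch of directional ablation on toy vectors. It is not Heretic's actual code: it assumes the common difference-of-means recipe for estimating a direction, and the names `refusal_direction`, `ablate`, and the `weight` parameter are hypothetical, chosen only for illustration.

```python
import math


def mean_vector(rows):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def refusal_direction(harmful_acts, harmless_acts):
    """Unit vector along the difference of mean activations.

    Assumption: the direction is estimated as mean(harmful) - mean(harmless).
    """
    mu_harmful = mean_vector(harmful_acts)
    mu_harmless = mean_vector(harmless_acts)
    diff = [a - b for a, b in zip(mu_harmful, mu_harmless)]
    norm = math.sqrt(sum(d * d for d in diff))
    return [d / norm for d in diff]


def ablate(x, direction, weight=1.0):
    """Subtract `weight` times the component of x along a unit `direction`.

    With weight=1.0 the result is orthogonal to `direction`.
    """
    dot = sum(a * b for a, b in zip(x, direction))
    return [a - weight * dot * d for a, d in zip(x, direction)]


# Toy "activations": the two groups differ only along the first axis.
harmful = [[2.0, 1.0, 0.0], [4.0, 1.0, 0.0]]
harmless = [[0.0, 1.0, 0.0], [0.0, 1.0, 0.0]]

r = refusal_direction(harmful, harmless)
x = [5.0, 2.0, -1.0]
x_ablated = ablate(x, r)
```

In this toy setup the estimated direction is the first axis, so ablation zeroes that coordinate and leaves the rest of the vector untouched. That is the sense in which the geometry changes while everything orthogonal to the direction is preserved.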
The paper is great. It really shows how alignment is entirely surface-level and not deeply ingrained in the models. Really interesting work.