Hacker News

Vera_Wilde, today at 6:45 AM

The directional-ablation approach in Heretic is clever: by identifying residual “refusal directions” and ablating them, they shift the model's trade-off frontier. In rare-event screening terms, they're effectively changing the geometry of the detection threshold rather than just trying to get better data. That resonates with how improving a test's accuracy in low-prevalence settings often fails unless you also address the threshold and the base rate.
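
For anyone unfamiliar with the idea, "ablating a direction" is usually described as projecting an activation vector onto the subspace orthogonal to that direction. Here's a minimal sketch of that projection in plain Python. It is not Heretic's actual code: the function name `ablate_direction` and the toy vectors are made up for illustration, and a real implementation would operate on model weights or activations with a tensor library.

```python
import math


def ablate_direction(activation, direction):
    """Return `activation` with its component along `direction` removed.

    This is an orthogonal projection: x' = x - (x . r_hat) * r_hat,
    where r_hat is `direction` scaled to unit length.
    """
    norm = math.sqrt(sum(d * d for d in direction))
    if norm == 0:
        raise ValueError("direction must be non-zero")
    unit = [d / norm for d in direction]
    # Scalar projection of the activation onto the unit direction.
    coeff = sum(a * u for a, u in zip(activation, unit))
    # Subtract that component, leaving everything orthogonal to it untouched.
    return [a - coeff * u for a, u in zip(activation, unit)]


# Hypothetical toy values, chosen only to make the arithmetic easy to follow.
activation = [2.0, 1.0, -1.0]
refusal_direction = [1.0, 0.0, 0.0]
ablated = ablate_direction(activation, refusal_direction)
```

After the call, `ablated` has no component left along `refusal_direction`, while the other coordinates are unchanged. That is the sense in which the geometry changes rather than the data.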


Replies

xmcqdpt2, today at 2:11 PM

The paper is great. It really shows how alignment is entirely surface-level and not deeply ingrained in the models. Really interesting work.