Vera_Wilde • today at 6:45 AM • 1 reply

The directional-ablation approach in Heretic is clever: by identifying residual “refusal directions” and ablating them, they shift the model’s trade-off frontier. In rare-event screening terms, they’re effectively changing the geometry of the detection threshold rather than just trying to get better data. It resonates with how improving a test’s accuracy in low-prevalence settings often fails unless you also address the threshold and the base rate.
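
For readers unfamiliar with the term, here is a minimal sketch of what "ablating a direction" means geometrically: subtract from an activation vector its projection onto a given direction, leaving everything orthogonal untouched. This is not Heretic's code. The function name `ablate_direction`, the 3-dimensional vectors, and the axis-aligned "refusal direction" are all hypothetical, chosen only to make the arithmetic easy to follow.

```python
import math


def ablate_direction(activation, direction):
    """Return `activation` with its component along `direction` removed.

    Hypothetical helper for illustration only; real implementations
    operate on model weights or high-dimensional activations.
    """
    norm = math.sqrt(sum(d * d for d in direction))
    if norm == 0.0:
        raise ValueError("direction must be non-zero")
    unit = [d / norm for d in direction]
    # Scalar projection of the activation onto the unit direction.
    coeff = sum(a * u for a, u in zip(activation, unit))
    # Subtract the projected component; the orthogonal part is unchanged.
    return [a - coeff * u for a, u in zip(activation, unit)]


# Toy values (assumed, not taken from any model).
activation = [2.0, 1.0, -1.0]
refusal_direction = [1.0, 0.0, 0.0]
ablated = ablate_direction(activation, refusal_direction)
print(ablated)
```

After ablation the result has no component along the chosen direction, which is the sense in which the geometry changes rather than the data.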

Replies

xmcqdpt2 • today at 2:11 PM

The paper is great. It really shows how alignment is entirely surface-level and not actually deeply ingrained in the models. Really interesting work.
