Hacker News

NitpickLawyer · yesterday at 4:14 PM

I'm surprised the article doesn't mention the biggest use of steering vectors, which is the potential to remove refusals from models (a.k.a. abliteration or uncensoring).

An earlier paper found that most refusals are mediated by a single direction: you can identify that vector and "nerf" it so the model skips refusals and answers "any" request normally. This was very doable for earlier models trained with SFT for refusals; it seems a bit more complicated for newer models, but still doable to some extent.
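The usual way to find such a direction (not necessarily what that paper did exactly) is a difference of means: average the hidden activations over prompts the model refuses, average them over prompts it answers, and subtract. A toy sketch, with random vectors standing in for real hidden states and a planted direction so the recovery can be checked:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 16  # toy hidden size

# Plant a ground-truth "refusal direction" for the synthetic data.
true_dir = rng.normal(size=d)
true_dir /= np.linalg.norm(true_dir)

# Fake activations: "refused" prompts carry an extra component along true_dir.
answered = rng.normal(size=(100, d))
refused = rng.normal(size=(100, d)) + 3.0 * true_dir

# Difference-of-means estimate of the refusal direction.
refusal_dir = refused.mean(axis=0) - answered.mean(axis=0)
refusal_dir /= np.linalg.norm(refusal_dir)

# The estimate should align closely with the planted direction.
print(abs(refusal_dir @ true_dir))
```

With real models the same arithmetic is run on residual-stream activations at a chosen layer instead of random vectors.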

There are already libraries that automate this process, but they usually focus on identifying the direction, modifying the weights, and releasing the result as an "uncensored" model. Steering instead lets you apply this vector change dynamically, so you don't need a separate model if the abliteration process hurts accuracy on other, unrelated tasks.
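The dynamic version amounts to projecting the refusal direction out of the hidden states at inference time, with a strength knob you can toggle per request instead of baking the change into the weights. A minimal sketch (the function name and strength parameter are illustrative, not from any particular library):

```python
import numpy as np

def ablate(hidden, direction, strength=1.0):
    """Remove the component of `hidden` along `direction`.

    strength=1.0 projects it out fully; strength=0.0 leaves the
    activations untouched, so the behavior can be switched on or off
    per request without editing the model's weights.
    """
    d = direction / np.linalg.norm(direction)
    return hidden - strength * (hidden @ d)[..., None] * d

rng = np.random.default_rng(1)
direction = rng.normal(size=8)
h = rng.normal(size=(4, 8))  # batch of toy hidden states

h_ablated = ablate(h, direction)
# After full ablation, the activations are orthogonal to the direction.
print(np.allclose(h_ablated @ (direction / np.linalg.norm(direction)), 0.0))
```

In practice this would run inside a forward hook on the layers where the direction was measured.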


Replies

electroglyph · yesterday at 8:56 PM

p-e-w was just talking about this the other day in his Discord. Seems the one-neuron method is quite bad for KLD (KL divergence from the original model), and that's why the newer techniques have stuck.

cyanydeez · yesterday at 4:32 PM

not sure why you're fixated on censoring. if we invert your POV, censoring includes not reporting falsehoods like "vaccines are harmful". Science and logic often tackle these subjects via censoring, but a model given an equal sampling of the Internet would think vaccines are harmful. a less naive correction would censor this problematic content.

so i'm confused as to why you think unmasking whatever bias you think is censored will result in an improvement in the generic use case.
