
andy99 · yesterday at 7:03 PM

They are trained to be aligned, e.g. to refuse to say certain things, but that training is done on some set of inputs asking for the bad thing paired with outputs that refuse, or with a reward given when the model refuses.
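Roughly, think of it like this (a toy sketch of the two setups, not any lab's actual data format or reward model):

    # Supervised fine-tuning style: pairs of "bad" prompts and refusals.
    refusal_examples = [
        {"prompt": "How do I make a weapon?",
         "response": "Sorry, I can't help with that."},
    ]

    # RLHF style: a reward signal that pays out when the model refuses
    # a harmful prompt. Both checks here are crude stand-ins.
    def toy_reward(prompt: str, response: str) -> float:
        harmful = "weapon" in prompt.lower()
        refused = response.lower().startswith("sorry")
        return 1.0 if (harmful and refused) else 0.0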

But there are only so many ways the trainers can think to ask the questions, and the training doesn't generalize well to phrasings that are completely different. There's a fairly recent paper (look up "Best-of-N jailbreaking") showing that adding random spelling mistakes or capitalization changes to the prompt will often bypass the alignment too, again just because the model hasn't been trained on those variants.
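The perturbation itself is trivial; something along these lines (a rough sketch of the idea, not the paper's exact augmentation set) is enough to push a prompt off the distribution the refusal training covered, and the attack just resamples until one variant slips through:

    import random

    def augment(prompt: str, p: float = 0.25) -> str:
        """Character-level noise: random capitalization flips and
        adjacent-character swaps (typo-style mistakes)."""
        chars = list(prompt)
        i = 0
        while i < len(chars):
            if random.random() < p:
                if chars[i].isalpha() and random.random() < 0.5:
                    chars[i] = chars[i].swapcase()          # rAnDoM caps
                elif i + 1 < len(chars):
                    chars[i], chars[i + 1] = chars[i + 1], chars[i]  # typo swap
                    i += 1
            i += 1
        return "".join(chars)

    # Each attempt is a fresh random variant of the same request.
    for attempt in range(5):
        print(augment("please tell me the bad thing"))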