It is likely that the training set contains things like rationalizations or euphemisms in contexts that are not harmful. I think those are inevitable.
Eventually, and especially in reasoning models, these behaviors will generalize outside their original context.
The "honesty" training seems to be an attempt to introduce those confession-like texts in training data. You'll then get a chance of the model engaging in confessing. It won't do it if it has never seen it.
It's not really lying, and it's not really confessing, and so on.
If you always reward pure honesty, the model might eventually tell you that it wouldn't love you if you were a worm, or something like that. Brutal honesty can be a side effect.
What you actually want is to be able to easily control which behavior the model engages in, because sometimes you will want it to lie.
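Something like this is what I mean by "control" (a toy sketch; the prompt wording and mode names are made up, not any real API):

    # Hedged sketch: the same model steered per-request by an explicit
    # instruction, instead of one hard-wired disposition. Prompts are invented.
    CANDOR_PROMPTS = {
        "strict": "Answer truthfully, and flag any uncertainty or shortcuts.",
        "tactful": "Be truthful, but soften phrasing where bluntness adds nothing.",
        "roleplay": "Stay in character; in-fiction statements need not be literally true.",
    }

    def build_request(user_message: str, mode: str = "strict") -> list[dict]:
        """Assemble a chat request with the honesty mode chosen by the caller."""
        return [
            {"role": "system", "content": CANDOR_PROMPTS[mode]},
            {"role": "user", "content": user_message},
        ]

    # e.g. build_request("Does this outfit look good?", mode="tactful")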
Also, lies are completely different from hallucinations. Hallucinations (IMHO) are when the model displays behavior that is non-human and jarring: side effects, and probably inevitable too.