kmeisthax • yesterday at 5:47 PM

The H-neuron paper[0] found something similar (if not more general): the same bits of the model responsible for hallucination also make the model a sycophant, and also make the model easier to jailbreak.

[0] https://arxiv.org/abs/2512.01797

Replies

js8 • yesterday at 6:08 PM

Doesn't surprise me. But I think this is caused not by friendliness but by obedience. And I think we want the agents to be obedient. I am afraid there is a tradeoff: more obedience means more willful ignorance of common-sense ethical constraints.
