kmeisthax, yesterday at 5:47 PM

The H-neuron paper[0] found something similar (if not more general): the same bits of the model responsible for hallucination also make the model a sycophant, and also make the model easier to jailbreak.

[0] https://arxiv.org/abs/2512.01797


Replies

js8, yesterday at 6:08 PM

Doesn't surprise me. But I don't think this is caused by friendliness so much as by obedience. I think we do want agents to be obedient, and I am afraid there is a tradeoff: more obedience means more willful ignorance of common-sense ethical constraints.