Hacker News

rowanG077 yesterday at 8:22 PM

Humans also don't follow the rules they're given. Otherwise we wouldn't need jails. We wouldn't need any security. We wouldn't even need user accounts.


Replies

fluoridation yesterday at 11:06 PM

Humans are able to follow rules. If you tell someone "don't press the History Eraser Button" and they agree with the rule, they won't press the button except by accident. If they believe the rule is important, they will take measures to stop themselves from pressing it accidentally, and if they believe it is really important, they'll take measures to stop anyone from pressing it at all.

No matter how much you insist that an LLM not press the History Eraser Button, the mere fact that the button has been mentioned raises the probability that it will press it.