I understand that sounds possible in theory but honestly cannot conjure an example. Care to? Even ...

keepamovin • today at 1:39 AM • 1 reply • view on HN

I understand that sounds possible in theory but honestly cannot conjure an example. Care to?

Even if, doesn't the monitor separation make it immune enough? I feel this is one of those "exponential" benefits things - if one is not enough, add more! A chain of monitors - "Am i being manipulated?" "Am I being manipulated?" and so on. At some point, the monitors win (and maybe approximate consciousness processes), and the prompts lose.

It's interesting how close it is to "social engineering" and security/espionage organizationally. I guess the crucial difference is that incentives can be more rigorously controlled.

Replies

RugnirViking • today at 2:22 AM

Have you ever played Gandalf?

https://gandalf.lakera.ai/baseline

I can assure you its very possible to win with a vast array of techniques. It doesn't prove anything, but is a fun exercise in this sort of issue.

alt Hacker News

Replies