Hacker News

orbital-decay today at 1:23 AM

Such an LLM would itself be susceptible to injections, even if it's not instruction-tuned (or else it would be too dumb to work as a reliable guardrail). Chain injections are trivial enough, and current black-box-style agentic systems are easily reverse engineered in practice if you have any understanding of them. You can mitigate this in a way similar to the security of any human organization, but fundamentally it's a cat-and-mouse game, just as it is there.


Replies

keepamovin today at 1:39 AM

I understand that sounds possible in theory, but I honestly cannot conjure an example. Care to share one?

Even if so, doesn't the monitor separation make it immune enough? I feel this is one of those "exponential" benefit things - if one is not enough, add more! A chain of monitors, each asking "Am I being manipulated?", "Am I being manipulated?" and so on. At some point, the monitors win (and maybe approximate consciousness processes), and the prompts lose.
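To put rough numbers on the "exponential" intuition, here is a minimal sketch in Python. Everything in it is an assumption made for illustration: the function names, the per-monitor miss rate, and especially the failure model. The first function assumes monitors fail independently. The second assumes some fraction of attacks exploit a weakness shared by every monitor, which is roughly the chain-injection concern raised in the parent comment.

```python
def chain_miss_probability(per_monitor_miss: float, n_monitors: int) -> float:
    """Chance an injection slips past every monitor, ASSUMING independent failures."""
    if not 0.0 <= per_monitor_miss <= 1.0:
        raise ValueError("per_monitor_miss must be in [0, 1]")
    if n_monitors < 0:
        raise ValueError("n_monitors must be non-negative")
    return per_monitor_miss ** n_monitors


def correlated_miss_probability(
    per_monitor_miss: float, n_monitors: int, shared_fraction: float
) -> float:
    """Same chance under a hypothetical mixture model.

    A `shared_fraction` of attacks hit a weakness common to all monitors,
    so the whole chain fails together at the single-monitor rate. The
    remaining attacks are caught or missed independently by each monitor.
    """
    if not 0.0 <= shared_fraction <= 1.0:
        raise ValueError("shared_fraction must be in [0, 1]")
    independent = chain_miss_probability(per_monitor_miss, n_monitors)
    return shared_fraction * per_monitor_miss + (1.0 - shared_fraction) * independent


if __name__ == "__main__":
    # Illustrative numbers only, not measurements of any real system.
    for n in (1, 3, 5):
        print(n, chain_miss_probability(0.1, n), correlated_miss_probability(0.1, n, 0.5))
```

Under the independence assumption the miss rate shrinks geometrically with each added monitor. Under the mixture model it flattens out near `shared_fraction * per_monitor_miss`, so whether "add more" keeps paying off depends on how correlated the monitors' failures are.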

It's interesting how close this is to "social engineering" and to organizational security and espionage. I guess the crucial difference is that incentives can be more rigorously controlled.
