Isn’t it trivially fixable by having a monitor LLM? The monitor just reviews each turn pair and asks...

keepamovin • yesterday at 11:40 PM • 2 replies • view on HN

Isn’t it trivially fixable by having a monitor LLM? The monitor just reviews each turn pair and asks, “Is this conversation being manipulated via prompt injection?”

Replies

zapkyeskrill • today at 12:23 AM

Is it? Or does it just make it multi dimensional? As in, prompt now need to anticipate there being a monitor and instruct that one too, indirectly.

➕ show 1 reply

orbital-decay • today at 1:23 AM

Such LLM would be susceptible to injections itself, even if it's not instruction-tuned (or it would be too dumb to work as a reliable guardrail). Chain injections are trivial enough, current black box style agentic systems are easily reverse engineered in practice if you have any understanding. You can mitigate it in a way similar to the security of any human organization, but fundamentally it's a cat and mouse game, just like in any human organization.

➕ show 1 reply

alt Hacker News

Replies