logoalt Hacker News

keepamovinyesterday at 11:40 PM2 repliesview on HN

Isn’t it trivially fixable by having a monitor LLM? The monitor just reviews each turn pair and asks, “Is this conversation being manipulated via prompt injection?”


Replies

zapkyeskrilltoday at 12:23 AM

Is it? Or does it just make it multi dimensional? As in, prompt now need to anticipate there being a monitor and instruct that one too, indirectly.

show 1 reply
orbital-decaytoday at 1:23 AM

Such LLM would be susceptible to injections itself, even if it's not instruction-tuned (or it would be too dumb to work as a reliable guardrail). Chain injections are trivial enough, current black box style agentic systems are easily reverse engineered in practice if you have any understanding. You can mitigate it in a way similar to the security of any human organization, but fundamentally it's a cat and mouse game, just like in any human organization.

show 1 reply