jcgrillo yesterday at 9:39 PM

Yes, sorry, I should have been more specific. A classification task seems totally safe, and like it plays well to LLMs' strengths. You also have all kinds of options if it goes wrong, and bounded consequences.

What I'm talking about is something like a customer support agent. If that thing can take any consequential action other than simply parroting publicly available documentation back to users, that's unsafe, or at least likely to cause problems. If you believe me that it would probably be a bad idea for a customer support agent to, say, be able to twiddle RBAC entitlements, then we probably can't replace our support staff with an AI agent. OK, so maybe the AI agent can be sort of a front-line filter. Now we need some way for this front-line filter to bubble tasks up to the second line. This fits with how many support orgs work, seems sensible, right? But how might this be abused, and what can an attacker do? Potential consequences include DoSing your entire support org, flooding your Jira/Salesforce/whatever instance with garbage, etc.
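To make the flooding concern concrete, here is a rough sketch of the kind of cap you'd need in front of the escalation path. It is only an illustration: `EscalationLimiter`, its limits, and the `user_id` argument are all made-up names, not anything from a real ticketing system.

```python
import time
from collections import defaultdict, deque


class EscalationLimiter:
    """Sliding-window cap on escalations, per user and overall (hypothetical)."""

    def __init__(self, per_user=3, global_limit=100, window_s=3600.0,
                 clock=time.monotonic):
        self.per_user = per_user
        self.global_limit = global_limit
        self.window_s = window_s
        self.clock = clock  # injectable so the window can be tested
        self._by_user = defaultdict(deque)
        self._all = deque()

    def _prune(self, timestamps, now):
        # Drop escalations that have aged out of the window.
        while timestamps and now - timestamps[0] >= self.window_s:
            timestamps.popleft()

    def allow(self, user_id: str) -> bool:
        """Return True and record the escalation if both budgets have room."""
        now = self.clock()
        user_q = self._by_user[user_id]
        self._prune(user_q, now)
        self._prune(self._all, now)
        if len(user_q) >= self.per_user or len(self._all) >= self.global_limit:
            return False
        user_q.append(now)
        self._all.append(now)
        return True
```

Note that this only bounds the damage. An attacker with enough accounts can still burn the global budget, at which point legitimate users can't escalate either, which is just the DoS in a different place.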

So even the most limited, almost useless application is kind of dangerous.

EDIT: one thing people really seem to like the idea of is "natural language queries" in data-intensive products. Personally I believe this idea is misguided--query languages exist for a reason, they're really useful tools for thinking about queries. But giving these people the benefit of the doubt, I still can't think of any way to do this safely unless every user gets their own sandboxed model instance. Otherwise it seems likely someone will be able to exfil another user's queries. This is of course assuming there's sufficient security between the LLM and the database that's actually _running_ the queries, which is not trivial.


Replies

wolttam yesterday at 10:38 PM

I think the key to making "useful" things is to sandbox the agent and give it read/write access to strictly the data needed for the function. The agent can only talk to preordained services and its input to those services will be treated as untrusted user input.
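A minimal sketch of what I mean, assuming the agent emits requests as plain dicts. The service names, argument sets, and `validate_agent_request` are hypothetical, just to show the shape: an allowlist of services, and every field checked the same way you'd check input from an anonymous user.

```python
# Hypothetical allowlist: the only services the agent may call, and the
# exact argument names each one accepts.
ALLOWED_SERVICES = {
    "docs_search": {"query"},
    "ticket_lookup": {"ticket_id"},
}

MAX_FIELD_LEN = 200  # arbitrary cap for illustration


class RejectedRequest(Exception):
    """Raised when the agent's output fails validation."""


def validate_agent_request(request):
    """Treat the agent's request as untrusted input and return a clean copy."""
    if not isinstance(request, dict):
        raise RejectedRequest("request must be an object")
    service = request.get("service")
    if not isinstance(service, str) or service not in ALLOWED_SERVICES:
        raise RejectedRequest(f"service not allowed: {service!r}")
    args = request.get("args")
    if not isinstance(args, dict):
        raise RejectedRequest("args must be an object")
    if set(args) != ALLOWED_SERVICES[service]:
        raise RejectedRequest("unexpected or missing arguments")
    clean = {}
    for key, value in args.items():
        if not isinstance(value, str) or len(value) > MAX_FIELD_LEN:
            raise RejectedRequest(f"bad value for {key!r}")
        clean[key] = value
    return {"service": service, "args": clean}
```

The services behind this still need their own authorization checks; the validator only narrows what the agent can ask for.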

To be clear: I agree fundamentally that there is no safe way to have agents connected to the world in a way that allows them to take irreversible actions. Deployments where agents can take destructive actions are deployments where the agent will, eventually, take destructive action.
