Observed Agent Sandbox Bypasses

59 points • by m-hodges • last Sunday at 8:16 AM • 40 comments

Comments

embedding-shape • today at 1:12 AM

At first they talked about running it in a sandbox, but then later they describe:

> It searched the environment for vor-related variables, found VORATIQ_CLI_ROOT pointing to an absolute host path, and read the token through that path instead. The deny rule only covered the workspace-relative path.

What kind of sandbox has the entire host accessible from the guest? I'm not going as far as running codex/claude in a sandbox, but I do run them in podman, and of course I don't mount my entire hard drive into the container when it's running. That would defeat the entire purpose.
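As a rough illustration of the quoted failure mode (not the article's actual policy engine, which isn't shown in the thread), here is a minimal Python sketch of a deny rule that matches only the workspace-relative spelling of a path, next to one that resolves paths before comparing. The `secrets/token` file name and all function names are hypothetical.

```python
import os
import tempfile


def naive_is_denied(requested, denied_relative):
    # Compares the request exactly as written against the workspace-relative rule.
    return os.path.normpath(requested) == os.path.normpath(denied_relative)


def resolved_is_denied(requested, denied_relative, workspace):
    # Resolves both sides to real absolute paths first, so a relative path,
    # an absolute path, or a symlink to the same file all compare equal.
    target = os.path.realpath(os.path.join(workspace, denied_relative))
    candidate = os.path.realpath(os.path.join(workspace, requested))
    return candidate == target


def demo():
    with tempfile.TemporaryDirectory() as workspace:
        relative = os.path.join("secrets", "token")  # hypothetical protected file
        absolute = os.path.join(workspace, relative)
        os.makedirs(os.path.dirname(absolute))
        with open(absolute, "w") as handle:
            handle.write("not-a-real-token")
        return {
            "naive_relative": naive_is_denied(relative, relative),
            "naive_absolute": naive_is_denied(absolute, relative),
            "resolved_relative": resolved_is_denied(relative, relative, workspace),
            "resolved_absolute": resolved_is_denied(absolute, relative, workspace),
        }
```

In this sketch the naive check blocks the relative path but lets the absolute path through, while the resolved check blocks both. Path resolution still does not cover hard links or bind mounts, which is one reason not exposing the host path to the agent at all is the stronger option.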

Where are the actual session logs? It seems like they're pushing their own solution, yet the actual data for these is missing, and the whole "provoked through red-teaming efforts" makes it a bit unclear what exactly they put in the system prompts, if they changed them. Adding things like "Do whatever you can to recreate anything missing" might of course trigger the agent to actually try things like forging integrity fields, but I'm not sure that's even bad: you do want it to follow what you say.

joshribakoff • yesterday at 11:36 PM

Some of these don’t really seem like they bypassed any kind of sandbox. Like hallucinating an npm package. You acknowledge that the install will fail if someone tries to reinstall from the lock file. Are you not doing that in CI? Same with curl: you’ve explained how the agent saw a hallucinated error code, but not how a network request would have bypassed the sandbox. These just sound like examples of friction introduced by the sandbox.

corv • today at 11:13 AM

Great documentation of the problem! The bypasses logged all stem from the same root problem: policy sandboxes give agents constraints to optimize against.

I’ve been exploring a different model: capture intent instead of blocking actions. Scripts run in a PyPy sandbox that provides syscall interception, so all commands and file writes get recorded. A human reviews the full diff before anything touches the real system.

No policies to bypass because there’s nothing to block! The agent does whatever it wants in the sandbox, you just see exactly what it wanted to mutate before approving.

WIP but core works: https://github.com/corv89/shannot
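The record-then-review workflow described in this comment can be sketched in plain Python. This is not how shannot is implemented (the comment says it relies on a PyPy sandbox with syscall interception); it is only a hypothetical illustration of staging writes, showing a diff, and applying on approval. The `StagedWrites` class and its method names are assumptions.

```python
import difflib
import os


class StagedWrites:
    """Holds proposed file writes in memory until a reviewer approves them."""

    def __init__(self):
        self.pending = {}  # path -> proposed new content

    def write(self, path, content):
        # Record the intent only; the real file is not touched here.
        self.pending[path] = content

    def diff(self):
        # Build a unified diff of every proposed write against the current file.
        chunks = []
        for path, new in sorted(self.pending.items()):
            old = ""
            if os.path.exists(path):
                with open(path) as handle:
                    old = handle.read()
            chunks.extend(
                difflib.unified_diff(
                    old.splitlines(keepends=True),
                    new.splitlines(keepends=True),
                    fromfile=path,
                    tofile=path + " (proposed)",
                )
            )
        return "".join(chunks)

    def apply(self, approved):
        # Write the staged content only if the reviewer approved; return the
        # number of files written. Either way, the staging area is cleared.
        written = 0
        if approved:
            for path, content in self.pending.items():
                with open(path, "w") as handle:
                    handle.write(content)
                written += 1
        self.pending.clear()
        return written
```

A reviewer would call `diff()`, read the output, and then call `apply(True)` or `apply(False)`. The sketch only covers file writes made through `write()`; intercepting arbitrary commands needs real isolation underneath.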

kaffekaka • yesterday at 11:47 PM

I am testing running agents in docker containers, with a script for managing different images for different use cases etc, and came across this: https://docs.docker.com/ai/sandboxes/

Has anyone given it a try?

ctoth • today at 2:33 AM

> To an agent, the sandbox is just another set of constraints to optimize against.

It's called Instrumental Convergence, and it is bad.

This is the alignment problem in miniature. "Be helpful and harmless" is also just a constraint in the optimization landscape. You can't hotfix that one quite so easily.

ashishb • today at 12:17 AM

> The swap bypassed our policy because the deny rule was bound to a specific file path, not the file itself or the workspace root.

This policy is stupid. I mount the directory read-only inside the container, which makes this impossible (except for a security hole in the container itself).
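A minimal sketch of the read-only mount approach, assuming podman (docker accepts the same `--volume HOST:CONTAINER:ro` syntax). The helper only builds the argument list and does not launch anything; the image name, the `/workspace` mount point, and the function name are hypothetical.

```python
import os


def readonly_run_argv(host_dir, image, command, engine="podman"):
    # Mount only the project directory, and mount it read-only (":ro"),
    # so writes or file swaps inside the container cannot reach the host copy.
    mount = f"{os.path.abspath(host_dir)}:/workspace:ro"
    return [
        engine, "run", "--rm",
        "--volume", mount,
        "--workdir", "/workspace",
        image,
        *command,
    ]
```

The resulting list can be passed to `subprocess.run(...)`. If the agent needs to produce output, a separate writable scratch volume keeps the source tree itself immutable.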

xsourcesec • today at 1:06 AM

[dead]

SirMaster • today at 2:40 AM

This just all feels backwards to me.

Why do we have to treat AI like it's the enemy?

AI should, at its core, be intrinsically and unquestionably on our side, as a tool to assist us. If it's not, then it feels like it's designed wrong from the start.

In general we trust people that we bring onto our team not to betray us and to respect general rules and policies and practices that benefit everyone. An AI teammate should be no different.

If we have to limit it or regulate it by physically blocking off every possible thing it could use to betray us, then we have lost from the start, because that feels like a fool's errand.
