Hacker News

simonw · 08/09/2025 · 1 reply

I've been thinking a lot about this recently. I've started running Claude Code, GitHub Copilot Agent, and Codex CLI in YOLO mode (no approvals needed) because wow, it's so much more productive, but I'm very aware that doing so opens me up to very real prompt injection risks.

So I've been trying to figure out the best shape for running that. I think it comes down to running in a fresh container with source code that I don't mind being stolen (easy for me, most of my stuff is open source) and being very careful about exposing secrets to it.

I'm comfortable sharing a secret with a spending limit: an OpenAI token that can only spend up to $25 is something I'm willing to risk exposing to an insecure coding agent.
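As a minimal sketch of that combination (my guess at the shape, not a prescription): a disposable container that sees only the current open source repo plus the capped token, with everything else scrubbed from the environment. The "agent-sandbox" image and the OPENAI_API_KEY_CAPPED variable name are placeholders:

    # Hypothetical sketch: run the agent in a throwaway container that can
    # see only the current (open source) repo and one spend-capped token.
    # "agent-sandbox" is a placeholder image with Claude Code installed;
    # OPENAI_API_KEY_CAPPED is a placeholder name for the $25-limit token.
    import os
    import subprocess

    subprocess.run(
        [
            "docker", "run", "--rm", "-it",
            "-v", f"{os.getcwd()}:/workspace",  # mount only this repo
            "-w", "/workspace",
            "-e", f"ANTHROPIC_API_KEY={os.environ['ANTHROPIC_API_KEY']}",
            "-e", f"OPENAI_API_KEY={os.environ['OPENAI_API_KEY_CAPPED']}",
            "agent-sandbox",
            # Claude Code's no-approvals ("YOLO") flag:
            "claude", "--dangerously-skip-permissions",
        ],
        check=True,
    )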

Likewise, for Fly.io experiments I created a dedicated scratchpad "Organization" with a spending limit - that way I can have Claude Code fire up Fly Machines to test out different configuration ideas without any risk of it running up a big bill or damaging my production infrastructure.
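In practice that mostly means making sure anything the agent creates lands in the scratchpad org. A hedged sketch (the org and app names are placeholders, the spending limit itself is configured in the Fly.io dashboard rather than the CLI, and the flyctl flags are from memory - check `fly help` before trusting them):

    # Hypothetical sketch: create a throwaway app in the spend-capped
    # "scratchpad" org and boot a Fly Machine there, keeping everything
    # the agent does out of production.
    import subprocess

    subprocess.run(
        ["fly", "apps", "create", "agent-scratch", "--org", "scratchpad"],
        check=True,
    )
    subprocess.run(
        ["fly", "machine", "run", "nginx", "--app", "agent-scratch"],
        check=True,
    )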

The moment code theft genuinely matters, things get a lot harder. OpenAI's hosted Codex product can lock down internet access to a specific allowlist of domains to help prevent exfiltration, which is sensible but still somewhat risky (an allowlisted domain can itself act as an open proxy, etc.).
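To make the allowlist mechanism (and its weakness) concrete, here's a toy domain-allowlist egress proxy - my own illustration, not OpenAI's implementation. The ALLOWED set is an example; note that if any allowlisted domain relays requests or serves attacker-readable uploads, exfiltration is still possible through it:

    # Toy sketch of a domain-allowlist egress proxy: a CONNECT proxy that
    # only tunnels to approved hosts. Real setups use squid/envoy/etc.
    import socket
    import threading

    ALLOWED = {"api.openai.com", "github.com"}  # example allowlist

    def pipe(src: socket.socket, dst: socket.socket) -> None:
        # copy bytes one way until the source closes, then close the sink
        try:
            while data := src.recv(65536):
                dst.sendall(data)
        except OSError:
            pass
        finally:
            dst.close()

    def handle(client: socket.socket) -> None:
        # read the "CONNECT host:port HTTP/1.1" request line
        req = client.recv(4096).decode("latin-1", "replace")
        parts = req.split("\r\n")[0].split()
        if len(parts) != 3 or parts[0] != "CONNECT":
            client.close()
            return
        host, _, port = parts[1].partition(":")
        if host not in ALLOWED:
            client.sendall(b"HTTP/1.1 403 Forbidden\r\n\r\n")
            client.close()
            return
        upstream = socket.create_connection((host, int(port or "443")))
        client.sendall(b"HTTP/1.1 200 Connection Established\r\n\r\n")
        threading.Thread(target=pipe, args=(client, upstream), daemon=True).start()
        threading.Thread(target=pipe, args=(upstream, client), daemon=True).start()

    if __name__ == "__main__":
        srv = socket.create_server(("127.0.0.1", 8888))
        while True:
            conn, _ = srv.accept()
            threading.Thread(target=handle, args=(conn,), daemon=True).start()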

I'm taking the position that malicious tokens can drive the coding agent to do anything, and asking: what's an environment we can run it in where the damage is low enough that I don't mind the risk?


Replies

pcl · last Sunday at 8:25 AM

> I've started running Claude Code, GitHub Copilot Agent, and Codex CLI in YOLO mode (no approvals needed) because wow, it's so much more productive, but I'm very aware that doing so opens me up to very real prompt injection risks.

In what way do you think the risk is greater in no-approvals mode vs. when approvals are required? In other words, why do you believe that Claude Code can't bypass the approval logic?

I toggle between approvals and no-approvals based on the task the agent is doing; sometimes I think it'll do a good job and I let it run through for a while, and sometimes I think handholding will help. But I also assume that if an agent can do something malicious on demand, then it can do the same thing on its own (and not even bother telling me) if it so desired.
