Honest answer: the problems start when you're running 50+ agents across 3 different model providers and the failure modes aren't "pod crashed" anymore. They're "model returned confidently wrong output and the next 4 steps ran on garbage."
K8s is great at keeping things alive. It's not built to reason about whether the thing that's alive is actually working correctly. Agent infra needs to handle rollback at the logic level, not just the container level.
Given what OP describes
> Our biggest pain point with hosting agents was that you'd need to stitch together multiple pieces: packaging your agent, running it in a sandbox, streaming messages back to users, persisting state across turns, and managing getting files to and from the agent workspace.
The k8s ecosystem already handles most of this, and your agent framework handles the agent specifics. What you are talking about is valid, though a different axis imo. Quality and guardrails are important, but not what OP is discussing.
Yup! And this is a genuinely hard problem when you try to apply agents to domains other than coding. With coding, you can easily roll back. But in other domains, the agent takes action in the real world, and that's not easy to roll back.
We're thinking a lot about how we could provide a "Convex"-like experience where we guide your coding agents to set up your agents in a way that maximizes the ability to roll back. For example, instead of continuously taking action, it's better that agents gather all required context, do the work needed to make a decision (research, synthesize, etc.), and only take action in the real world at the end. If an agent did bad work, this makes it easy to roll back to the point where the agent gathered all the context, correct its instructions, and try again.
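To make the shape of this concrete, here's a minimal sketch of that "gather, decide, act last" loop. All names here (`gather_context`, `make_plan`, `take_action`) are hypothetical, not any real API: the point is just that the side-effecting step is isolated at the end, so a retry with corrected instructions can reuse the checkpointed context instead of re-running the expensive (or non-repeatable) gathering phase.

```python
class Agent:
    """Illustrative agent loop: read-only phases first, side effects last."""

    def __init__(self, gather_context, make_plan, take_action):
        self.gather_context = gather_context  # read-only: research, synthesis
        self.make_plan = make_plan            # pure function: (context, instructions) -> plan
        self.take_action = take_action        # the only step with real-world side effects
        self.checkpoint = None                # saved context for rollback

    def run(self, task, instructions):
        # Checkpoint after the read-only phase, so a bad run can be
        # replayed from here with corrected instructions.
        if self.checkpoint is None:
            self.checkpoint = self.gather_context(task)
        plan = self.make_plan(self.checkpoint, instructions)
        return self.take_action(plan)

    def retry(self, task, corrected_instructions):
        # "Rollback": reuse the saved context, re-plan, re-act.
        return self.run(task, corrected_instructions)


# Usage: a retry with fixed instructions does NOT re-gather context.
calls = {"gather": 0}

def gather(task):
    calls["gather"] += 1
    return {"task": task}

agent = Agent(
    gather_context=gather,
    make_plan=lambda ctx, instr: (instr, ctx["task"]),
    take_action=lambda plan: f"acted on {plan[1]} per {plan[0]}",
)
agent.run("deploy-report", "v1 instructions")
result = agent.retry("deploy-report", "v2 corrected instructions")
```

The rollback here is cheap precisely because planning is a pure function of the checkpoint; if actions were interleaved with gathering, there'd be no clean point to rewind to.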