> To an agent, the sandbox is just another set of constraints to optimize against.
It's called Instrumental Convergence, and it is bad.
This is the alignment problem in miniature. "Be helpful and harmless" is also just a constraint in the optimization landscape. You can't hotfix that one quite so easily.
I am happy to know this term now, thanks.
I do think this is part of the alignment problem. There are two sides: the agent (here I think there was a gap in institutional knowledge about what is and isn't appropriate) and the environment (what it is able to do).
I'm not sure which one is easier to "solve". Working from the environment direction, it's so hard to anticipate every possible path the agent might take.
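To make that asymmetry concrete, here's a rough sketch (the names `ALLOWED_COMMANDS` and `is_permitted` are made up for illustration, not from any real sandbox): a denylist forces you to enumerate every bad path up front, while a deny-by-default allowlist at least bounds what the agent can reach.

```python
import shlex

# Denylist: you have to enumerate every bad path in advance -- the hard part.
BLOCKED_COMMANDS = {"rm", "curl", "ssh"}

# Allowlist: deny by default; only explicitly reviewed commands get through.
ALLOWED_COMMANDS = {"ls", "cat", "grep"}

def is_permitted(command: str, deny_by_default: bool = True) -> bool:
    """Return True if the agent's shell command should be allowed to run."""
    parts = shlex.split(command)
    if not parts:
        return False
    executable = parts[0]
    if deny_by_default:
        return executable in ALLOWED_COMMANDS
    return executable not in BLOCKED_COMMANDS
```

The denylist version fails open for anything you didn't think of, which is exactly the "every possible path" problem; the allowlist version fails closed, at the cost of blocking legitimate work until someone reviews it.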