> the presence of a security hole should not be seen as permission to exploit
Why not?
I want the agents on my side to exploit whatever they can to help me. The ones on the other side certainly won't be artificially nerfed.
Because it is not well aligned enough to be able to tell where it's stopped helping you and started fucking you instead.
What if the agent in the middle of helping you runs out of tokens? Would you appreciate if it in the spirit of "exploiting whatever they can to help me" would scan your machine for payment methods, log into your bank account, approve 2FA by reading you mail and plug your credit card into the billing so it could efficiently continuing helping you?
I do not wish my Amazon delivery driver to show up in my living room.
Well, the agent should help you by saying "hey, I cannot do this task, but I can bypass the problem by doing this, but obviously it is not something you intended me to do or even something you were aware of, so I will not do it unless you tell me explicitly it's ok".
It's win-win: the agent is helping and it is educating you about things you obviously did not realise.