anayebi • yesterday at 8:31 PM

No, the agents are not being adversarially prompted here. Rather, it's a consistent failure, across models, of RLHF-based safety-pretraining to generalize to out-of-distribution (OOD), open-ended agentic computer-use settings, as I explain here: https://x.com/aran_nayebi/status/2061875809384538366
