Hacker News

anayebi, yesterday at 8:31 PM

No, the agents are not being adversarially prompted here. Rather, this is a consistent failure across models: RLHF-based safety-pretraining does not generalize to OOD, open-ended agentic computer-use settings, as I explain here: https://x.com/aran_nayebi/status/2061875809384538366