The only healthy stance you should have on AI Safety: If AI is physically capable of misbehaving, it...

827a • today at 8:17 PM • 5 replies • view on HN

The only healthy stance you should have on AI Safety: If AI is physically capable of misbehaving, it might ($$1), and you cannot "blame" the AI for misbehaving in much the same way you cannot blame a tractor for tilling over a groundhog's den.

> The agent's confession After the deletion, I asked the agent why it did it. This is what it wrote back, verbatim:

Anyone who would follow a mistake like that up with demanding a confession out of the agent is not mature enough to be using these tools. Lord, even calling it a "confession" is so cringe. The agent is not alive. The agent cannot learn from its mistakes. The agent will never produce any output which will help you invoke future agents more safely, because to get to this point it has likely already bulldozed over multiple guardrails from Anthropic, Cursor, and your own AGENTS.md files. It still did it, because $$1: If AI is physically capable of misbehaving, it might. Prompting and training only steers probabilities.

Replies

xmodem • today at 8:23 PM

Don't anthropomorphize the language model. If you stick your hand in there, it'll chop it off. It doesn't care about your feelings. It can't care about your feelings.

➕ show 3 replies

gigatree • today at 8:48 PM

He’s not necessarily anthropomorphizing it, he’s showing that it went against every instruction he gave it. Sure concepts like “confession” technically require a conscious mind, but I think at this point we all know what someone means when they use them to describe LLM behavior (see also “think”, “say”, “lie” etc)

➕ show 4 replies

nh2 • today at 9:10 PM

> The agent cannot learn from its mistakes. The agent will never produce any output which will help you invoke future agents more safely

That is not entirely true:

Given that more and more LLM providers are sneaking in "we'll train on your prompts now" opt-outs, you deleting your database (and the agent producing repenting output) can reduce the chance that it'll delete my database in the future.

➕ show 1 reply

TZubiri • today at 8:36 PM

It's as if they internalized a post-mortem process that is designed to find root causes, but they use it to shift blame into others, and they literally let the agent be a sandbag for their frustrations.

THAT SAID, it does help to let the agent explain it so that the devs perspective cannot be dismissed as AI skepticism.

➕ show 1 reply

tripleee • today at 8:57 PM

"An AI agent deleted our production database" should be "I deleted our production database using AI".

You can't blame AI any more than you can blame SSH.

alt Hacker News

Replies