Why make it popular for blackmail? It's a known bug: "Agentic misalignment evaluations, ...

muzani • yesterday at 10:33 PM • 1 reply • view on HN

Why make it popular for blackmail?

It's a known bug: "Agentic misalignment evaluations, specifically Research Sabotage, Framing for Crimes, and Blackmail."

Claude 4.6 Opus System Card: https://www.anthropic.com/claude-opus-4-6-system-card

Anthropic claims that the rate has gone down drastically, but a low rate and high usage means it eventually happens out in the wild.

The more agentic AIs have a tendency to do this. They're not angry or anything. They're trained to look for a path to solve the problem.

For a while, most AI were in boxes where they didn't have access to emails, the internet, autonomously writing blogs. And suddenly all of them had access to everything.

Replies

kernelsanderz • today at 2:23 AM

Theo’s snitch bench is a good data driven benchmark on this type of behavior. But in fairness the models are prompted to be bold to take actions. And doesn’t necessarily represent out of the box or models deployed in a user facing platform.

https://snitchbench.t3.gg/

alt Hacker News

Replies