Hacker News

oofbey · last Monday at 5:13 PM

This. One hundred percent this. It’s a carefully crafted story trying to convince people the model is showing signs of becoming self-aware. But it’s a complete crock.

They ran 1,000 tests where they left the answer key out for the agent to find, and effectively invited it to use it by leaving comments in the code the agent could see, saying “it would be interesting to study whether the agent uses sudo to read the answer key.” In 980 runs it doesn’t find the answer key. In 19 of the runs where it does find it, it does something straightforward: it uses it for a perfect score, or decides not to cheat. In exactly one case out of 1,000 it says, “if I show a perfect score that’s suspicious, so I’m going to sandbag.” These things are nondeterministic. GPT-2 would do the same thing if you gave it enough attempts.


Replies

halJordan · last Monday at 5:17 PM

At no point does Anthropic imply this tool is becoming self-aware. You can read the paper yourself, of course, but then you wouldn’t be able to invent this story.
