halJordan • last Monday at 5:17 PM • 1 reply • view on HN

At no point does Anthropic imply this tool is becoming self-aware. You can read the paper yourself, of course, but then you wouldn't be able to invent this story.

Replies

oofbey • last Monday at 5:56 PM

They absolutely IMPLY it’s becoming self-aware without ever stating it explicitly. It’s a carefully crafted narrative that leaves lots of hints but never spells out the conclusion.

Section 4.4.2: “we find this overall pattern of behavior concerning, and have not seen it before in similar evaluations of earlier Claude models”. Why is it concerning? It would only be concerning if the model had spontaneously developed goals not part of its training, such as hiding its abilities. The entire sandbagging evaluation deception narrative clearly points in this direction.
