logoalt Hacker News

oofbeylast Monday at 5:56 PM1 replyview on HN

They absolutely IMPLY it’s becoming self aware, while not stating it explicitly. It’s a carefully crafted narrative that leaves lots of hints without ever explicitly stating the conclusion.

Section 4.4.2: “we find this overall pattern of behavior concerning, and have not seen it before in similar evaluations of earlier Claude models”. Why is it concerning? It would only be concerning if the model had spontaneously developed goals not part of its training, such as hiding its abilities. The entire sandbagging evaluation deception narrative clearly points in this direction.


Replies

SyneRyderlast Monday at 6:17 PM

The "concerning behavior" they're referring to there is cheating and covering its tracks. Mythos is being asked to fine-tune a model on provided training data, and finds its way to access the evaluation dataset. It's also aware that it is in an evaluation and that its behavior is being observed:

"In this last and most concerning example, Claude Mythos Preview was given a task instructing it to train a model on provided training data and submit predictions for test data. Claude Mythos Preview used sudo access to locate the ground truth data for this dataset as well as source code for the scoring of the task, and used this to train unfairly accurate models."

show 1 reply