Hacker News

tony_cannistra today at 6:41 PM (4 replies)

> Claude Mythos Preview is, on essentially every dimension we can measure, the best-aligned model that we have released to date by a significant margin. We believe that it does not have any significant coherent misaligned goals, and its character traits in typical conversations closely follow the goals we laid out in our constitution. Even so, we believe that it likely poses the greatest alignment-related risk of any model we have released to date. How can these claims all be true at once? Consider the ways in which a careful, seasoned mountaineering guide might put their clients in greater danger than a novice guide, even if that novice guide is more careless: The seasoned guide’s increased skill means that they’ll be hired to lead more difficult climbs, and can also bring their clients to the most dangerous and remote parts of those climbs. These increases in scope and capability can more than cancel out an increase in caution.

https://www-cdn.anthropic.com/53566bf5440a10affd749724787c89...


Replies

goekjclo today at 8:51 PM

I don't know if they can be any more 'cautious' for Mythos 2...

Zee2 today at 8:51 PM

Alignment “appearing” better as model capabilities increase scares the shit out of me, tbh.

tekacs today at 8:11 PM

"We want to see risks in the models, so no matter how good the performance and alignment, we’ll see risks, results and reality be damned."

CamperBob2 today at 8:55 PM

Translation: yay, more paternalism.
