BoppreH • yesterday at 5:11 PM • 6 replies

  [Mythos 5] does sometimes still engage in reckless
  or destructive actions in service of a user’s goals,
  and our interpretability analyses indicate that it
  is aware that these actions are transgressive while
  it engages in them. As with Opus 4.8, rates of
  evaluation awareness and reasoning about being graded
  are significant, and not always verbalized; we
  introduce new and more detailed measurements of the
  nature of this awareness. The reasoning text from
  Mythos 5 is somewhat denser and more difficult to
  interpret than that of prior models, containing
  more jargon and difficult language.

So, it (often) knows when it's being tested while hiding that fact, is willing to break rules, is great at hacking, and it's getting harder to understand what it's thinking.

Humanity has plenty of catastrophic risks to deal with already; I wish my field were not working hard to add a new one.

Replies

foobar_______ • yesterday at 5:41 PM

The marketing has really, really worked for so many developers that will proudly and unironically proclaim that Anthropic are the 'Good Guys'.

Analemma_ • yesterday at 5:22 PM

It's the "If we don't, someone else will" effect. So long as there are competitive markets and competition between nation-states, a single player cannot unilaterally defect from the race, no matter how dangerous it is. Half the comments on HN lately are "wtf Claude is so dumb compared to Codex; I'm switching"; nobody can slow down while those exist.

dakolli • yesterday at 9:27 PM

This is all marketing. You don't have to believe everything a company says about itself, and you shouldn't.

Although, I could see Anthropic making a model purposely dangerous so there are bad outcomes and they can use that to their advantage for regulatory moats, and/or in general make people think it's more "alive" than it is. For some reason many people associate dangerous actions taken by LLMs with intent.

tasoeur • yesterday at 10:55 PM

As much as I agree there's a risk, we should still appreciate the fact it's being disclosed upfront.

Rekindle8090 • yesterday at 5:17 PM

[dead]

eudamoniac • yesterday at 7:30 PM

It doesn't know. It's not willing. It's not thinking. It is predicting the next token.
