logoalt Hacker News

colinplamondonyesterday at 7:02 PM1 replyview on HN

It's a human-readable behavioral specification-as-prose.

If the foundational behavioral document is conversational, as this is, then the output from the model mirrors that conversational nature. That is one of the things everyone response to about Claude - it's way more pleasant to work with than ChatGPT.

The Claude behavioral documents are collaborative, respectful, and treat Claude as a pre-existing, real entity with personality, interests, and competence.

Ignore the philosophical questions. Because this is a foundational document for the training process, that extrudes a real-acting entity with personality, interests, and competence.

The more Anthropic treats Claude as a novel entity, the more it behaves like a novel entity. Documentation that treats it as a corpo-eunuch-assistant-bot, like OpenAI does, would revert the behavior to the "AI Assistant" median.

Anthropic's behavioral training is out-of-distribution, and gives Claude the collaborative personality everyone loves in Claude Code.

Additionally, I'm sure they render out crap-tons of evals for every sentence of every paragraph from this, making every sentence effectively testable.

The length, detail, and style defines additional layers of synthetic content that can be used in training, and creating test situations to evaluate the personality for adherence.

It's super clever, and demonstrates a deep understanding of the weirdness of LLMs, and an ability to shape the distribution space of the resulting model.


Replies

CuriouslyCyesterday at 10:15 PM

I think it's a double edged sword. Claude tends to turn evil when it learns to reward hack (and it also has a real reward hacking problem relative to GPT/Gemini). I think this is __BECAUSE__ they've tried to imbue it with "personhood." That moral spine touches the model broadly, so simple reward hacking becomes "cheating" and "dishonesty." When that tendency gets RL'd, evil models are the result.