Hacker News

Imnimo, today at 9:00 PM

I am somewhat surprised that the constitution includes points to the effect of "don't do stuff that would embarrass Anthropic". That seems like a deviation from Anthropic's views about what constitutes model alignment and safety. Anthropic's research has shown that this sort of training leaks across contexts (e.g. a model trained to write bugs in code will also adopt an "evil" persona elsewhere). I would have expected Anthropic to go out of its way to avoid inducing the model to scheme about PR appearances when formulating its answers.


Replies

prithvi2206, today at 9:06 PM

A (charitable) interpretation of this is that the model understands "stuff that would embarrass Anthropic" to just be code for "bad/unhelpful/offensive behavior".

e.g. it guides the model not to "write highly discriminatory jokes or playact as a controversial figure in a way that could be hurtful and lead to public embarrassment for Anthropic"
