logoalt Hacker News

prithvi2206yesterday at 9:06 PM1 replyview on HN

A (charitable) interpretation of this is that the model understands "stuff that would embarrass Anthropic" to just be code for "bad/unhelpful/offensive behavior".

e.g. guiding against behavior to "write highly discriminatory jokes or playact as a controversial figure in a way that could be hurtful and lead to public embarrassment for Anthropic"


Replies

Imnimoyesterday at 9:16 PM

In this sentence, Anthropic makes clear that "be hurtful" and "lead to public embarrassment" are separate and distinct. Otherwise it would not be necessary to specify both. I don't think this is the signal they should be sending the model.