prithvi2206 • yesterday at 9:06 PM

A (charitable) interpretation of this is that the model understands "stuff that would embarrass Anthropic" to just be code for "bad/unhelpful/offensive behavior".

E.g., the guidance against behavior such as "write highly discriminatory jokes or playact as a controversial figure in a way that could be hurtful and lead to public embarrassment for Anthropic".

Replies

Imnimo • yesterday at 9:16 PM

In this sentence, Anthropic makes clear that "be hurtful" and "lead to public embarrassment" are separate and distinct. Otherwise it would not be necessary to specify both. I don't think this is the signal they should be sending the model.
