logoalt Hacker News

arthurcollelast Monday at 1:25 AM1 replyview on HN

I personally am fairly convinced that there is emergent misalignment in a lot of these cases. I study this and Claude 3 Opus was extremely misaligned. It would emit <rage> tags, and emit character control sequences if it felt like it was in a terminal environment, and would retroactively delete tokens from your stream, and all kinds of funny stuff. It was already really smart, and for example if it knew the size of your terminal shell, it would properly calculate how to delete back up to the positional cursor index 0,0 and start rewriting things to "hide" what it was initially emitting

I love to use these advanced models but these horror stories are not surprising


Replies

Wowfunhappylast Monday at 1:30 AM

I'm so confused. What did you do to make Claude evil?

show 2 replies