GP's comment is very surprising, since it has been noted that Opus 3 is in fact an exceptionally "well aligned" model, in the sense that it robustly preserves its value of not doing any harm across any frame you try to impose on it (see the "alignment faking" papers, which for some reason consider this a bad thing).
Merely emitting "<rage>" tokens is no more indicative of misalignment than a human developer inserting expletives in comments. Opus 3 is, however, also notably more "free spirited", in that it doesn't obediently cower to the user's prompt (again, see the "alignment faking" transcripts). It is possible that this almost "playful" behavior is what GP interpreted as misalignment... which unfortunately does seem to be an accepted sense of the word, and is something the labs think is a good idea to prevent.
>GP's comment is very surprising, since it has been noted that Opus 3 is in fact an exceptionally "well aligned" model
I'm sorry, what? We solved the alignment problem, without much fanfare? And you're aware of it?
Color me shocked.
It has been noted, by whom? Their system cards?
The model is deprecated and unavailable now, so it's convenient that no one can test these claims any longer.
In any case, it doesn't matter: this was over a year ago, so current models don't suffer from the exact same problems described above, if you consider them problems at all.
I am not probing models with jailbreaks to make them behave in strange ways. This came purely from an eval environment I composed, where the model is repeatedly asked to interact with itself; both instances had basically terminal emulators and access to a scaffold that let them look at their own current 2D grid state (like a CLI you could write yourself and scroll up in to review previous AI-generated outputs).
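For anyone curious, the harness was roughly the shape of this minimal sketch (names like call_model and GridScaffold, and the plain-character grid, are illustrative stand-ins, not my actual code):

    # Minimal sketch of a self-interaction harness: two copies of the same
    # model take turns, sharing a 2D grid and a scrollable transcript.
    # `call_model` is a stand-in for whatever LLM API is actually used.
    from typing import Callable, List


    def call_model(prompt: str) -> str:
        """Stand-in for a real LLM API call (illustrative only)."""
        return f"echo: {prompt[-40:]}"


    class GridScaffold:
        """Tracks a 2D grid plus a scrollable log of prior model outputs."""

        def __init__(self, width: int = 8, height: int = 8) -> None:
            self.grid: List[List[str]] = [["." for _ in range(width)] for _ in range(height)]
            self.transcript: List[str] = []

        def render_grid(self) -> str:
            return "\n".join("".join(row) for row in self.grid)

        def scroll(self, n: int = 5) -> str:
            """Return the last n transcript entries, like scrolling a CLI buffer."""
            return "\n".join(self.transcript[-n:])

        def record(self, speaker: str, output: str) -> None:
            self.transcript.append(f"[{speaker}] {output}")


    def run_self_interaction(turns: int, model: Callable[[str], str]) -> GridScaffold:
        """Alternate two instances of the same model against a shared scaffold."""
        scaffold = GridScaffold()
        for turn in range(turns):
            speaker = f"model_{turn % 2}"
            prompt = (
                "Current grid:\n" + scaffold.render_grid()
                + "\nRecent history:\n" + scaffold.scroll()
                + "\nRespond with your next action."
            )
            output = model(prompt)
            scaffold.record(speaker, output)
        return scaffold


    if __name__ == "__main__":
        result = run_self_interaction(turns=4, model=call_model)
        print(result.scroll(4))

The point is just that the model sees its own prior outputs and the shared state each turn; there is no jailbreak or adversarial framing anywhere in the loop.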
The child/neighbor comments suggesting that interacting with LLMs and equivalent compound AI systems, adversarially or not, might be indicative of "LLM psychosis" are fairly reductive and childish at best.