You can also make Opus answer "Pick a random color (one word):" and watch it pick the same one or two colors most of the time. However, this is a poor example of the point you're trying to make, since a lack of diversity in token prediction can have many different root causes that are hard to separate. This paper, for example, attributes most of it to PPO discarding a lot of valid token trajectories during RLHF, rather than to some inevitable emergent effect. [1] [2]
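For anyone who wants to see this firsthand, here's a rough sketch of the experiment using the anthropic Python SDK; the model id and sample count are just placeholders I picked, not anything from the thread. It sends the prompt repeatedly at temperature 1 and tallies the answers:

    # Rough sketch: sample the same prompt many times and tally the answers.
    # Assumes the `anthropic` Python SDK and ANTHROPIC_API_KEY in the environment;
    # the model id and n=50 are illustrative choices, not from the thread.
    from collections import Counter
    import anthropic

    client = anthropic.Anthropic()

    def sample_answers(model, prompt, n=50):
        counts = Counter()
        for _ in range(n):
            msg = client.messages.create(
                model=model,
                max_tokens=5,
                temperature=1,
                messages=[{"role": "user", "content": prompt}],
            )
            counts[msg.content[0].text.strip().lower()] += 1
        return counts

    print(sample_answers("claude-3-opus-20240229", "Pick a random color (one word):"))

A genuinely diverse sampler would spread its mass over many colors; the claim here is that one or two answers dominate the tally.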
> only the largest Anthropic model is breaking from stochastic outputs there
Most models, even small ones, exhibit a lack of output diversity where they clearly shouldn't. [3] In particular, Sonnet 3.5 behaves far more deterministically than Opus 3 at temperature 1, despite being smaller. This phenomenon also makes most current LLMs very poor at creative writing, even when they are finetuned for it (like Opus in particular), as they tend to repeat the same few predictions over and over and easily fall into stereotypes, which can range from the same words and idioms (well known as "claudeisms" in Claude's case) to the same sentence structures, the same literary devices, and the same few character archetypes.
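To put the Sonnet 3.5 vs Opus 3 comparison in concrete terms, the sample_answers helper from the sketch above can be pointed at both models and the spread of answers compared (the model ids are the public API names, again an assumption on my part):

    # Usage example building on sample_answers() above; model ids are assumptions.
    prompt = "Pick a random color (one word):"
    for model in ("claude-3-opus-20240229", "claude-3-5-sonnet-20240620"):
        counts = sample_answers(model, prompt)
        print(model, "distinct answers:", len(counts), dict(counts))

Fewer distinct answers (and a more skewed tally) from Sonnet 3.5 at temperature 1 would be consistent with the behavior described above.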
[1] https://arxiv.org/abs/2406.05587
[2] https://news.ycombinator.com/item?id=40702617 (HN discussion, although not very productive, as commenters treat it as a political question while the paper is arguing about training algorithms)
As I said, if you understand why, you'll be well prepared for the next generations of models.
Try out the query and watch, with open eyes, what's actually happening and what the answer is grounded in.
It's not the same as prompts like "pick a random number", where the determinism comes from a lack of diversity in the training data; and as I said, this particular query is not deterministic in any other model out there.
Also, keep in mind that Opus was trained with RLAIF, not RLHF.