Relatedly, I think it's worth noting that Anthropic models have consistently been top-scoring in BullshitBench[0], in a league of their own, really.
Not affiliated with the bench in any way, but I think it surfaces important differences in behavior between models from different labs.
TLDR: The benchmark measures whether a model pushes back on nonsensical requests and questions, as opposed to going along with them and hallucinating a nonsensical answer.
[0]: https://petergpt.github.io/bullshit-benchmark/viewer/index.v...
> I found my interactions with Fable to be extremely impressive; it made other models, including GPT 5.5 and Opus 4.8, feel small and dumb.
> Anthropic models have consistently been top-scoring in BullshitBench[0]
*eyeroll* I find that Anthropic models feel big and dumber.
https://www.endorlabs.com/research/ai-code-security-benchmar... puts Fable 5th, which seems about right to me.
I'm interested in code utility and correctness, even if the majority of AI use is not focused on that.
TBH this is the main thing that made me start trusting Claude enough to actually find it useful, and I'm surprised other models haven't caught up. I assumed they had and I just wasn't aware because I'm not using them in the same way.