I find the same. Someone posted this benchmark here: https://petergpt.github.io/bullshit-benchmark/viewer/index.v...
It measures whether models push back on bullshit prompts or just go along with them, and the top performers are all Claude models.