I find the same. Someone posted this benchmark here:

mcintyre1994 • today at 8:00 AM • 0 replies • view on HN

It measures whether models push back on bullshit prompts or just go along with them, and the Claude models take all the top spots.
