Could this be Goodhart's Law in action? AI tools like to showcase benchmarks in bar graphs to show how well they perform compared to other models.
Maybe the benchmark Qs/As snuck into training sets accidentally. Is it still Goodhart's Law if it's unintentional?
Daniel Lemire has blogged about being impressed with how well the LLM answers his CS problem questions. I was impressed too. Not sure where the line of competence lies.