In particular, the fact that humans sometimes don't do this, taking the bait with extraneous distractions, is almost always a fairly shallow psychological thing rather than an actual cognitive deficit, e.g. OP hypothetically assuming the question had a typo and trying to read the examiner's mind.

In education, the gotchas really can be unfair if the (human) student has been conditioned to bark answers but the teacher changes things drastically on an exam. I don't think that's an accurate characterization of this study, and even if it were, that would be a problem with shallow LLM training, not mean-spirited evaluation. But I suspect that "barking answers according to surface characteristics" is as far as transformers can go. It certainly is possible that we just need to train transformers better... but there have been some theoretical results suggesting otherwise. [E.g. transformer LLMs with chain-of-thought are pretty good at O(n) problems but struggle with O(n^2), even when the O(n^2) task is an obvious combination of two O(n) tasks they can already do.]
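To make that bracketed point concrete, here is a toy sketch (my own illustration, not an example from the papers) of an O(n^2) task that is nothing more than two O(n) subtasks composed: counting, for each element of a list, how many elements equal it. A system that can do the O(n) scan and can iterate over the list "should" be able to do the composition.

```python
# Toy illustration of "an O(n^2) task that is an obvious composition of
# two O(n) subtasks" (hypothetical example, not from the cited results).

def count_equal(xs, target):
    """O(n) subtask: count how many elements of xs equal target."""
    return sum(1 for x in xs if x == target)

def per_element_counts(xs):
    """O(n^2) composition: run the O(n) subtask once per element.

    Each piece is trivial on its own; the difficulty the theoretical
    results point at is sustaining a reasoning chain that grows
    quadratically with the input.
    """
    return [count_equal(xs, x) for x in xs]

if __name__ == "__main__":
    data = [3, 1, 3, 2, 3, 1]
    print(per_element_counts(data))  # [3, 2, 3, 1, 3, 2]
```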
That leads to a serious annoyance I have with discussing LLMs: humans' capacity for boredom, cynicism, distraction, or laziness being used to excuse away what seem to be deep-rooted limitations in LLMs. It simultaneously misunderstands what a human is and what a machine is. ("Sometimes humans also refuse to work" would be a bad excuse from an auto dealer.)
Psychology is cognitive. It doesn't seem principled to discard that distinction at all.