It always feels to me like these types of tests are somewhat intentionally ignorant of how LLM cognition differs from human cognition. To me, they don't really "prove" or "show" anything other than that LLM thinking works differently than human thinking.
I'm always curious if these tests have comprehensive prompts that inform the model about what's going on properly, or if they're designed to "trick" the LLM in a very human-cognition-centric flavor of "trick".
Does the test instruction prompt tell it that it should be interpreting the image very, very literally, and that it should attempt to discard all previous knowledge of the subject before making its assessment of the question, etc.? Does it tell the model that some inputs may be designed to "trick" its reasoning, and to watch out for that specifically?
More specifically, what is a successful outcome here to you? Simply returning the answer "5" with no other info, or some back-and-forth, or anything else in the output context? What is your idea of the LLM's internal world-model in this case? Do you want it to successfully infer that you are being deceitful? Should it respond directly to the deceit? Should it take the deceit in "good faith" and operate as if that's the new reality? Something in between? To me, all of this is very unclear in terms of LLM prompting. It feels like there's tons of very human-like subtext involved: you're trying to show that LLMs can't handle subtext/deceit, and then generalizing that to claim LLMs have low cognitive abilities in general. That doesn't seem like particularly useful or productive analysis to me, so I'm curious what the goal of these "tests" is for the people who write/perform/post them?
I thought adversarial testing like this was a routine part of software engineering. He's checking to see how flexible it is. Maybe prompting would help, but it would be cool if it was more flexible.
This is the first time I hear the term LLM cognition and I am horrified.
LLMs don't have cognition. LLMs are statistical inference machines which predict an output given some input. There are no mental processes, no sensory information, and certainly no knowledge involved, only statistical reasoning, inference, interpolation, and prediction. Comparing the human mind to an LLM is like comparing a rubber tire to a calf muscle, or a hydraulic system to the gravitational force. They belong in different categories and cannot be responsibly compared.
When I see these tests, I presume they are made to demonstrate the limitations of this technology. It is both relevant and important that consumers know they are not dealing with magic and are not being sold a lie (in a healthy economy a consumer protection agency would ideally do that for us; but here we are).
The marketing of these products is intentionally ignorant of how LLM cognition differs from human cognition.
Let's not claim that the deceptive ones are the people who've spotted the ways in which that marketing is untrue...