I won't take a strong stance on whether or not LLMs actually do reasoning, but I will say that this decrease in performance is similar to what I see in college freshmen (I'm currently teaching a calculus course in which almost half of the students took AP calc in high school). They perform well on simple questions. Requiring students to chain multiple steps together, even simple steps, results in decreased accuracy and higher variance (I have no data on whether this decrease is linear or not, whereas the paper assumes the decrease should be linear in the number of steps). We see similar results when adding unrelated statements to a problem: many students are trained to use all of the given information in solving a problem, because if you leave out something the instructor gave you, you probably forgot to do something important.
So while I don't take a stance on whether what an LLM does should be considered reasoning, I do think that SOTA LLMs like GPT-4o perform about as well as high school graduates in America with average intelligence. In other words, average Americans exhibit similar limitations in their reasoning to good LLMs. Which on the one hand is a little disappointing to me in terms of the human performance but is kind of good news for LLMs: they aren't doing graduate-level research, but they are already capable of helping a large portion of the population.
> I do think that SOTA LLMs like GPT-4o perform about as well as high school graduates in America with average intelligence.
This might be true in a strict sense, but I think it's really, really important to consider the uses of LLMs vs. a high school graduate. LLMs are confidently wrong (and confidently correct) in exactly the same measure, and in many ways they are presented to users as unimpeachable.
If I ask an average person to do a moderately complex logic problem, my human brain discounts their answer because I've been socialized to believe that humans are bad at logic. I will take any answer I'm given with what is usually an appropriate amount of skepticism.
LLMs, on the other hand, are on the computer: an interface I've been socialized to believe is always correct on matters of math and logic. That's what it is, a logic machine. Second-guessing the computer on matters of logic and arithmetic almost always results in me realizing my puny human mind has done something wrong.
To me, this directly contradicts your conclusion: LLMs are mostly only capable of misleading large portions of the population.
> I do think that SOTA LLMs like GPT-4o perform about as well as high school graduates in America with average intelligence.
Is this because the questions used in high school exams in the US are too simple, or because they follow patterns too similar to the training data? I tried really simple but novel questions that required true understanding of the underlying math concepts, and the results were consistently bad. I also tried questions at the level of high school entrance exams in China, and the results were equally bad. It was quite clear that the LLM didn't understand math. It could match some patterns, but such pattern matching would be useful only to already skilled students.
> I won't take a strong stance on whether or not LLMs actually do reasoning,
I don't understand why people are still confused about this. When these models fundamentally have a randomness parameter to make them appear as if they are actually thinking, instead of deterministically outputting information, it should be clear that there is no reasoning going on.
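To be concrete about what that parameter is: sampling temperature just rescales the model's next-token distribution before a token is drawn, and setting it to zero collapses to deterministic argmax decoding. A minimal sketch, with made-up logits and a toy three-word vocabulary rather than any real model's output:

```python
import numpy as np

# Toy sketch of temperature sampling over an invented next-token distribution.
# Real models produce logits over tens of thousands of tokens; three will do here.
def sample_next_token(logits, temperature, rng):
    logits = np.asarray(logits, dtype=float)
    if temperature == 0.0:
        return int(np.argmax(logits))            # greedy: fully deterministic
    scaled = logits / temperature
    scaled -= scaled.max()                       # for numerical stability
    probs = np.exp(scaled) / np.exp(scaled).sum()
    return int(rng.choice(len(probs), p=probs))  # stochastic draw

vocab = ["4", "5", "fish"]
logits = [3.2, 1.1, -2.0]                        # invented scores for "2 + 2 = ?"
rng = np.random.default_rng(0)
print(vocab[sample_next_token(logits, 0.0, rng)])  # always "4"
print(vocab[sample_next_token(logits, 1.0, rng)])  # usually "4", occasionally not
```

The randomness is a knob applied at decoding time: with temperature at 0, this sketch returns the same token for the same logits every time.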
Not to disparage the American school system (my country's is worse), but it's very much easy mode. I know that not everyone is suited to academic excellence, but it's definitely easier to learn when young. I do believe too much hand-holding actively harms learning.
> In other words, average Americans exhibit similar limitations in their reasoning to good LLMs.
It's not even clear this is a good example of "reasoning". You can progress all the way through multi-variable calculus with just decent pattern-matching, variable-substitution, and rote memorization of sufficient lists of rules. I imagine for "reasoning" ability to apply you need to be able to detect incoherency and reject an approach—and incoherency detection seems to be a big missing ingredient right now (...which many humans lack, too!).
On the other side—any such ability would cripple a chatbot's ability to answer questions about the real world as our world is characterized (via description with informal language) by incoherent and contradictory concepts that can only be resolved through good-faith interpretation of the questioner. A large mark of intelligence (in the colloquial sense, not the IQ sense) is the ability to navigate both worlds.
I think it's an absurd question in some sense. LLMs maximize the conditional probability of the next word being correct. Suppose they get to the point where they do that with 100% accuracy. How can you tell the difference between that and "Reasoning"? You can't. So then the question of whether they are "Reasoning" or not is religious, not quantitative.
Are college students more likely to get it wrong when you change the numbers from the example problem (as reported here for LLMs)?
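For context, the perturbation being referenced amounts to re-instantiating the same word-problem template with fresh names and numbers and checking whether accuracy holds up. A toy sketch of the idea; the template, names, and ranges here are invented for illustration, not taken from the paper:

```python
import random

# Toy illustration of swapping the numbers in a fixed word-problem template,
# in the spirit of the perturbation described for LLMs.
TEMPLATE = ("{name} picks {n1} apples on Monday and {n2} apples on Tuesday, "
            "then gives away {n3} apples. How many apples does {name} have left?")

def make_variant(rng):
    name = rng.choice(["Ava", "Ben", "Chen"])
    n1, n2 = rng.randint(5, 40), rng.randint(5, 40)
    n3 = rng.randint(1, n1 + n2)       # keep the answer non-negative
    question = TEMPLATE.format(name=name, n1=n1, n2=n2, n3=n3)
    answer = n1 + n2 - n3              # same reasoning, new surface numbers
    return question, answer

rng = random.Random(42)
for _ in range(3):
    question, answer = make_variant(rng)
    print(question, "->", answer)
```

A solver that actually carries out the arithmetic should be indifferent to which variant it sees; a pattern-matcher keyed to the original surface numbers may not be.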
>So while I don't take a stance on whether what an LLM does should be considered reasoning
>I do think that SOTA LLMs like GPT-4o perform about as well as high school graduates in America with average intelligence
This is taking a stance.
If your experience is coming from teaching college freshmen, then that's a sample that's significantly above average among high school graduates. I think only about half of all high school graduates go on to further their education, and that includes community colleges.
And I agree with your assessment: while it's true that in a long conversation ChatGPT veers off and doesn't keep a coherent line of thought, it is not noticeably worse than the average conversation I have with people.
> Which on the one hand is a little disappointing to me in terms of the human performance but is kind of good news for LLMs
Here's the recurring reminder that we build tools (calculators, cranes, etc.) to outperform the strong, not the weak.
This. It's like when I hear interviews with PhDs talking about AI and they say something like "AI will be smarter than humans," and I think: really? Where have you been all this time? Do you smart people ever leave your labs and go see the real world? LLMs are already smarter than the huge majority of humans on this planet. What are you talking about?
> They perform well on simple questions. Requiring students to chain multiple steps together, even simple steps, results in decreased accuracy and higher variance
You mean when you give lessons and homework problems of the form (A) -> (B), but then on test day you give them completely different problems, like "Given D, which of (A, B, C) is required to produce it?" Yeah, students don't do so well when you test them on different material than what they studied. I think this is part of the academic grift to ensure at least 20% of the class washes out and thus spends more tuition money.
When an LLM gets things right, it does so thanks to the sheer mass of information ingested during training: it can use probabilities to extract a right answer from deep within the model.
Humans, on the other hand, have developed a more elaborate scheme to process, or reason about, data without having to read through a billion math problems and Stack Overflow answers. We listen to some explanations, watch a YT video, do a few exercises, and we're ready to go.
The fact that we may get similar grades (in, e.g., high school math) is just a coincidence of where both "species" (AI and human) are right now at succeeding. But if we look closer at failure, we'll see that we fail very differently. AI failure right now looks, to us humans, very nonsensical.