This paper, among other things, shows that LLMs have dramatically worse performance on basic algebra questions when you add in irrelevant information. The examples are things like "John picked 43 kiwis on Monday, 24 kiwis on Tuesday. On Wednesday, 5 of the kiwis he picked were smaller than usual. Altogether, on Monday, Tuesday, and Wednesday, John picked 87 kiwis. How many kiwis did John pick on Wednesday?" In this question, the remark about some of the kiwis on Wednesday being small is irrelevant, but adding things like this reduces performance on a popular benchmark from 95% to 77% for GPT-4o, for example.
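For concreteness, here is the arithmetic the question intends, as a minimal Python sketch (the numbers are the ones in the example above); note that the remark about the five smaller kiwis never enters the computation:

```python
# Intended computation for the kiwi example: the "5 smaller than usual"
# remark is a distractor and does not change the count.
monday, tuesday, total = 43, 24, 87
wednesday = total - (monday + tuesday)
print(wednesday)  # 20
```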
I don't find this very impressive. Forget LLMs for a second. Let's say _you_ read a question of that kind with some bit of irrelevant information. There are two possibilities you have to consider: the question may as well have excluded the irrelevant information, or the question was miswritten and the irrelevant information was meant to be relevant. The latter is a perfectly live possibility, and I don't think it's a dramatic failure to assume that this is correct. I have to confess that when I read some people's LLM gotcha questions, where they take some popular logic puzzle and invert things, I think I would get them "wrong" too. And not wrong because I don't understand the question, but wrong because with no context I'd just assume the inversion was a typo.
Real discourse has tons of irrelevant information for all sorts of reasons.
There are some contexts, academic or professional, where questions are posed carefully and specifically, but these are narrow contexts.
A useful general purpose assistant needs to be able to find what's relevant among what's irrelevant.
Excellence at solving especially well-specified math problems can make for a useful domain assistant (no small win!), but it is not the same thing.
That said, if you've got a hundred billion dollars betting on your AI project achieving AGI, you benefit a lot by conflating those contexts. In that case, grinding on formal SAT, LSAT, GRE, etc. problems amounts to tuning for microbenchmarks rather than for real-world use cases.
Filtering out irrelevant info is taught in grade school and is a tested skill on the SAT, for example.
Basically any kind of model (not just LLMs/ML) has to distill out irrelevant info.
The point is having an answer that you can defend logically and that most people would agree with.
If the model said “I’m not sure if this portion is a typo”, I guarantee you the model creators would take the RLHF in a different direction, because that response is reasonable and defensible. In your specific question, though, I personally think there is a single objective answer; to be fair, that isn’t always the case for misleading/irrelevant prompts. Judging by how they respond, the models are being fooled.
I say this as an RLHF’er who sees, and at times is told to write, similar questions.
At the end of the day, this is how the model creators want their models to predict language, and anyone using them is along for the ride.
I think this is valid though. Transformer models don't explicitly do logic but implicitly "vibe" out the answer from the input sequence (using the attention mechanism) and learnt knowledge; they're predicting text sequences, after all. So adding more irrelevant context to the input would quite likely influence the output.
I could see attention possibly being able to overcome this, but if not, that would be a pretty big gotcha for reliability in real-world scenarios where, as others have said, it's not immediately clear what is relevant info. These models would be a lot less useful if a human had to decide which information to feed them, since the output would then depend on human judgement. I understand that's where we're at right now and that they are quite useful already, but the valuations hint at investors expecting more, imo.
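As a toy illustration of that intuition (made-up scores, a single query, nothing like a real multi-head transformer), appending distractor tokens forces the softmax to renormalize and pulls attention mass away from the relevant tokens:

```python
import numpy as np

def attention_weights(scores: np.ndarray) -> np.ndarray:
    """Softmax over raw query-key scores for a single query."""
    e = np.exp(scores - scores.max())
    return e / e.sum()

# Scores of one query against the tokens of a short, clean problem.
clean = np.array([2.0, 1.5, 0.3])
print(attention_weights(clean))             # ~[0.56, 0.34, 0.10]

# Same query with mildly-relevant-looking distractor tokens appended:
# the weight on the genuinely relevant token drops from ~0.56 to ~0.37.
with_distractors = np.array([2.0, 1.5, 0.3, 0.8, 0.7, 0.9])
print(attention_weights(with_distractors))
```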
I think it’s an important result because filtering signal from noise is just as important as, if not more important than, forming conclusions from the signal.
That's not even the problem I encounter. They literally crap out on stupidly simple tasks. Recent ones:
1. Bing was gaslighting me into 9.11 being greater than 9.9
2. ChatGPT said that 7x7/7+7/7+7/7 was 24.
3. When expanding (x+1)^2 the output was 2x^2+2.
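For the record, all three are easy to verify; here is a quick sympy check of the correct results (9.9 is greater than 9.11, the expression equals 9, and the expansion is x^2 + 2x + 1):

```python
from sympy import Rational, expand, symbols

assert 9.9 > 9.11                                               # not the other way around
assert Rational(7 * 7, 7) + Rational(7, 7) + Rational(7, 7) == 9  # not 24
x = symbols("x")
assert expand((x + 1) ** 2) == x**2 + 2 * x + 1                  # not 2x^2 + 2
```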
Regardless of any level of interpretation or irrelevant information, if it can't deterministically get correctness and the semantics of the operations in question right, then it's fucking useless.
What's worse is that in an educational context it is actively harmful.
> LLMs have dramatically worse performance on basic algebra questions when you add in irrelevant information
"Attention is all you need" /
(It is part of the general problem solving process to evaluate what is relevant and what is not.)
Consider that answering exam-style direct questions, with only the precise context that matters, is a very niche task among all the possible contexts in which an intelligence is asked to understand something.
I agree it wasn’t that convincing; moreover, the variation wasn’t that dramatic for the large SOTA models.
Why would they write a paper about the inherent reasoning capabilities of “large” language models and then, in the abstract, cherry-pick a number from a tiny 1B-parameter model?
I agree that it's not particularly surprising that trying to trick an LLM with irrelevant text makes it perform worse.
I don't see this as a material limitation of LLMs, but rather something that can be addressed at the application level by stripping out irrelevant information.
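As a hedged sketch of what that application-level pass might look like: `call_llm` here is a hypothetical placeholder for whatever completion API is in use, not a real library function.

```python
def call_llm(prompt: str) -> str:
    """Placeholder: wire this up to your model/provider of choice."""
    raise NotImplementedError

def answer_with_relevance_filter(question: str) -> str:
    # Pass 1: have the model (or a cheaper classifier) drop sentences
    # that are not needed to compute the answer.
    filtered = call_llm(
        "Rewrite the following question, deleting every sentence that is "
        "not needed to answer it. Do not solve it.\n\n" + question
    )
    # Pass 2: answer the stripped-down question.
    return call_llm("Solve this problem and show your work:\n\n" + filtered)
```

Whether a pre-pass like this actually recovers the lost accuracy is an empirical question; the sketch only illustrates where in the stack such a fix could live.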
Interestingly, I use deliberately irrelevant remarks to encourage more "creative" or random outputs from LLMs. In that approach I'm not seeking an exact or precise response to the prompt, but something more open-ended.
The problem here is that throwing in little gotchas like that is a tactic used by math and physics educators to ensure that students actually understand the topic by reasoning through new problems, rather than mindlessly turning the crank from learning the "surface structure" of earlier problem sets. The argument here is that the LLM is not reasoning, it's mindlessly turning a crank.
I don't think this exact question would be out of place on a 6th grade math test. I distinctly remember being taught this skill in "word problems," learning to identify information that actually pertains to the question rather than being distracted by red herrings the teacher threw in.