I love how people are transitioning from "LLMs can't reason" to "LLMs can't reliably reason".
Frontier models went from not being able to count the number of 'r's in "strawberry" to winning gold at the IMO in under two years [0], and yet people keep repeating the same clichés, such as "LLMs can't reason" or "they're just next-token predictors".
At this point, I think it can only be explained by ignorance, bad faith, or fear of becoming irrelevant.
Well, I was hedging a bit because I try not to overstate the case, but I'm just as happy to say it plainly: LLMs can't reason. Because that's not what they're built to do. They predict what text is likely to appear next.
But even if they can appear to reason, if it's not reliable, it doesn't matter. You wouldn't trust a tax advisor who makes things up one time in ten, or even one in a hundred. If you're going to replace humans, reliability and reproducibility are what matter most.