I don't disagree; however, I'm optimistic because most of the current reasoning "ability" of LLMs comes from accidental reasoning embedded in language patterns.
For example, GPT-4o completes the prompt "The mouse has a unique digestive system compared to other rodents, however the sparrow" as
"exhibits a highly specialized digestive system adapted for rapid processing of food, particularly seeds and insects, through structures like the crop and gizzard, which are not found in rodents."
Claude 3.5 completes it as
"has a completely different digestive anatomy as a bird. Birds like sparrows have adaptations for flight, including a lightweight skeletal system and a specialized digestive tract. Unlike mice, sparrows have a crop for storing food, a gizzard for grinding it, and generally shorter intestines to reduce weight. They also lack teeth, instead using their beak to manipulate food."
What appears to be a thoughtful contrast is merely a language pattern. Similarly, given a prompt like "Assume -B, A->B. Under what circumstances is B true?", the model simply follows the gradient of its training distribution and returns output that is likely correct. Prompts like "what is 2+2" fail only when nobody has bothered to write about that particular calculation, so the arithmetic was never in the training data.
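To make concrete what a correct answer to that logic prompt actually requires, here is a minimal brute-force truth-table check (plain Python with names of my own choosing, not anything a model runs internally):

```python
from itertools import product

# Enumerate all truth assignments for A and B and keep only those
# consistent with the stated premises: -B and A->B.
consistent = [
    (A, B)
    for A, B in product([False, True], repeat=2)
    if (not B) and ((not A) or B)   # premise 1: not B; premise 2: A implies B
]

print(consistent)                     # [(False, False)]: the only consistent world
print(any(B for _, B in consistent))  # False: B is true under no circumstances
# Modus tollens falls out as a side effect: A must be false as well.
```

The point stands: a model can emit the right verbal answer by pattern-matching similar textbook passages without ever performing a check like this.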
However, the way that multi-modal LLMs handle images is inspiring: they effectively convert from the visual domain into the sequential token domain. The same could be done for symbolic systems and the like.
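As a rough illustration of that idea (an entirely hypothetical interface, not an existing library), a symbolic expression could be flattened into the same kind of discrete token sequence an image encoder produces for a language model to attend over:

```python
from dataclasses import dataclass
from typing import Union

# Hypothetical sketch: serialize a symbolic expression tree into a linear
# token stream, analogous to a vision encoder turning an image into patch tokens.

@dataclass
class Op:
    name: str                        # e.g. "not", "implies", "and"
    args: tuple["Expr", ...]

Expr = Union[str, Op]                # a leaf is just a symbol name like "A"

def to_tokens(expr: Expr) -> list[str]:
    """Prefix-order serialization of an expression into discrete tokens."""
    if isinstance(expr, str):
        return [f"<sym:{expr}>"]
    tokens = [f"<op:{expr.name}>", "<(>"]
    for arg in expr.args:
        tokens += to_tokens(arg)
    return tokens + ["<)>"]

# The premises "Assume -B, A->B" rendered as two token sequences
premises = [Op("not", ("B",)), Op("implies", ("A", "B"))]
for p in premises:
    print(to_tokens(p))
# ['<op:not>', '<(>', '<sym:B>', '<)>']
# ['<op:implies>', '<(>', '<sym:A>', '<sym:B>', '<)>']
```

The interesting question is whether such a symbolic "encoder" could be trained jointly with the language model the way vision encoders are, rather than bolted on as a post-hoc tool call.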