Most models are trained on multi-paradigm code, so they get fixated on procedural language design. Concepts like the stack, backtracking, etc. violate the logic they've absorbed, and they end up burning tokens correcting themselves.
This won't show up in a smaller benchmark, because the clutching at straws tends to happen nearer the edge of the context window: the point where you can get the model to give up on the obvious approaches that don't work and actually explore the problem space you've given it.
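To make concrete what kind of control flow is meant by "backtracking" here, a minimal sketch in Python (the 4-queens problem and all names are my own illustration, not from the thread). The "choose, recurse, undo on failure" loop has no straight-line procedural reading, which is the sort of thing the comment above says trips up models:

```python
def solve(queens, n=4):
    """Place n queens column by column, backtracking on conflict.

    `queens[c]` is the row of the queen in column c. Yields each
    complete, conflict-free placement.
    """
    col = len(queens)
    if col == n:
        yield list(queens)          # a complete solution; copy it out
        return
    for row in range(n):
        # Reject same-row and same-diagonal conflicts with earlier queens.
        if all(row != r and abs(row - r) != col - c
               for c, r in enumerate(queens)):
            queens.append(row)      # choice point: commit tentatively
            yield from solve(queens, n)
            queens.pop()            # undo the choice and retry (backtrack)

print(next(solve([])))              # → [1, 3, 0, 2]
```

A Prolog engine performs this same search implicitly; written out procedurally, the failure-driven undo step is what makes the control flow non-linear.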
I haven’t tried the extremes. Context rot suggests it’ll likely degrade there anyway.
What I’m investigating is whether more compact languages work better for querying data.
What makes you think it’s going to clutch at straws more? What makes you think it won’t do better with a more compact, localized representation?