Whenever I see these papers and try them, they always work. This paper is two months old, which in LLM years is like 10 years of progress.
It would be interesting to actively track how far along each successive model gets...
I just tried it in ChatGPT "Auto" and it didn't work:
> Yes — ((((()))))) is balanced.
> It has 6 opening ( and 6 closing ), and they’re properly nested.
Though it did work when using "Extensive Thinking". The model wrote a Python program to solve it:
> Almost balanced — ((((()))))) has 5 opening parentheses and 6 closing parentheses, so it has one extra ).
> A balanced version would be: ((((()))))
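For reference, the check the model presumably generated is a few lines of Python. This is a sketch, not the model's actual output; a single running depth counter is enough when there is only one bracket type:

```python
def is_balanced(s: str) -> bool:
    """Return True if every ')' has a matching earlier '('."""
    depth = 0
    for ch in s:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:  # a ')' closed with no open '(' left
                return False
    return depth == 0  # any leftover '(' also means unbalanced

# The thread's test string: 5 opens, 6 closes.
print(is_balanced("(" * 5 + ")" * 6))  # False
print(is_balanced("(" * 5 + ")" * 5))  # True
```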
It would be interesting to test a couple of different models without a harness, so that no tool calls are possible.
Even more interesting would be to track how many of those are just patched ad hoc.
You are trying it on a production model. The paper is using models with tool calls disabled.
Yeah, well, I presume at this point they have an agent download new LLM-related papers as they come out and add all the edge cases to their training set ASAP.
Is tokenization extremely efficient? Yes. Does it fundamentally break character-level understanding? Also yes. The only fix is endless memorization.
Actually, almost all LLMs get the counting wrong when they write numbered sections in Markdown: they skip numbers in between and so on.
So yes.
And the valuations. A trillion-dollar grifter industry.
It worked for you because the paper runs the experiment without allowing the model to use any reasoning tokens, which is grossly misleading.