Whenever I see these papers and try them, they always work. This paper is two months old, which in LLM years is like 10 years of progress.
It would be interesting to actively track how far along each successive model gets...
I just tried it in ChatGPT "Auto" and it didn't work:
> Yes — ((((()))))) is balanced.
> It has 6 opening ( and 6 closing ), and they’re properly nested.
Though it did work when using "Extensive Thinking". The model wrote a Python program to solve it:
> Almost balanced — ((((()))))) has 5 opening parentheses and 6 closing parentheses, so it has one extra ).
> A balanced version would be: ((((()))))
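For reference, the check the model presumably generated is a few lines of Python. This is a sketch, not the model's actual output; a single running depth counter is enough when there is only one bracket type:

```python
def is_balanced(s: str) -> bool:
    """Return True if every ')' has a matching earlier '('."""
    depth = 0
    for ch in s:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:  # a ')' closed with no open '(' left
                return False
    return depth == 0  # any leftover '(' also means unbalanced

# The thread's test string: 5 opens, 6 closes.
print(is_balanced("(" * 5 + ")" * 6))  # False
print(is_balanced("(" * 5 + ")" * 5))  # True
```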
It would be interesting to test a couple of different models without a harness, so that no tool calls are possible.
Even more interesting would be to track how many of those are just patched ad hoc.
You are trying it on a production model. The paper is using models with tool calls disabled.
Yeah, well, I presume at this point they have an agent download new LLM-related papers as they come out and add all the edge cases to their training set ASAP.
Is tokenization extremely efficient? Yes. Does it fundamentally break character-level understanding? Also yes. The only fix is endless memorization.
Actually, almost all LLMs get the counting wrong when they write numbered sections in Markdown: they skip numbers in between and so on.
So yes.
And the valuations. A trillion-dollar grifter industry.
It worked for you because the paper runs the experiment without allowing the model to use any reasoning tokens, which is grossly misleading.