Hacker News

kenjackson today at 5:41 PM

Whenever I see these papers and try them, they always work. This paper is two months old, which in LLM years is like 10 years of progress.

It would be interesting to actively track how far along each successive model gets...


Replies

simianwords today at 7:58 PM

It worked for you because the paper does the experiment without allowing the model to use any reasoning tokens - something that is grossly misleading.

revachol today at 6:19 PM

I just tried it in ChatGPT "Auto" and it didn't work:

> Yes — ((((()))))) is balanced.

> It has 6 opening ( and 6 closing ), and they’re properly nested.

Though it did work when using "Extensive Thinking". The model wrote a Python program to solve this.

> Almost balanced — ((((()))))) has 5 opening parentheses and 6 closing parentheses, so it has one extra ).

> A balanced version would be: ((((()))))
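The actual program the model wrote isn't shown in the thread, but a balance check of this kind is typically a one-pass depth counter, sketched here as a guess:

```python
def is_balanced(s: str) -> bool:
    # Track nesting depth; a negative depth means a ')' appeared
    # before any matching '(' was opened.
    depth = 0
    for ch in s:
        if ch == '(':
            depth += 1
        elif ch == ')':
            depth -= 1
            if depth < 0:
                return False
    # Balanced only if every '(' was eventually closed.
    return depth == 0

print(is_balanced("((((())))))"))  # the string from the thread: False
print(is_balanced("((((()))))"))   # the "balanced version": True
```

Running this on the string from the quote confirms the "Extensive Thinking" answer: 5 opening vs. 6 closing parentheses, so one extra `)`.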

It would be interesting to test a few different models in a harness where no tool calls are possible.

coldtea today at 5:47 PM

Even more interesting to track how many of those are just ad-hoc patched.

azakai today at 6:29 PM

You are trying it on a production model. The paper is using models with tool calls disabled.

moffkalast today at 5:50 PM

Yeah, well, I presume at this point they have an agent download new LLM-related papers as they come out and add all the edge cases to their training set ASAP.

Is tokenization extremely efficient? Yes. Does it fundamentally break character-level understanding? Also yes. The only fix is endless memorization.
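To illustrate why tokenization breaks character-level tasks, here is a toy greedy tokenizer (purely illustrative; `merges` is a made-up vocabulary, not any real model's) that segments the thread's string into multi-character tokens, so "count the parentheses" is no longer a matter of counting tokens:

```python
# Hypothetical BPE-like vocabulary: frequent character runs become
# single tokens, so the model never "sees" individual characters.
MERGES = ["((((", ")))))", "((", "))", "(", ")"]

def toy_tokenize(s: str) -> list[str]:
    # Greedy longest-match segmentation, loosely mimicking how a
    # BPE tokenizer would chunk the input.
    tokens = []
    i = 0
    while i < len(s):
        for piece in sorted(MERGES, key=len, reverse=True):
            if s.startswith(piece, i):
                tokens.append(piece)
                i += len(piece)
                break
    return tokens

tokens = toy_tokenize("((((())))))")
print(tokens)       # ['((((', '(', ')))))', ')']
print(len(tokens))  # 4 tokens for an 11-character string
```

An 11-character string arrives as 4 opaque tokens, so the per-character count has to be memorized or reconstructed rather than read off directly, which is the "endless memorization" point above.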

wg0 today at 5:55 PM

Actually, almost all LLMs get the counting wrong when they write numbered sections in Markdown - they skip numbers in between and so on.

So yes.

And the valuations. Trillion dollar grifter industry.