nsingh2 • yesterday at 11:20 PM • 3 replies

Oh, this seems bad, and it's fairly easy to reproduce using Codex CLI. Give it a puzzle prompt that it has to reason about and solve, and occasionally it will seemingly short-circuit: it thinks for exactly 516 tokens and returns the wrong result. When it ends up using 6000-8000 thinking tokens, it returns the correct result.

Maybe some issue with adaptive thinking? Another point for local models, I guess: you don't have to worry about silent server-side changes.

Edit: To follow up, it seems to happen quite often. Out of 10 runs of the exact same prompt, 4 had this 516-thinking-token issue, and every one of those returned the wrong solution. So nearly half the time, 5.5 xhigh could be short-circuiting and degrading performance. Granted, the sample size is small.
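To put a rough number on "the sample size is small", here is a minimal sketch that computes a 95% Wilson score interval for the 4-out-of-10 figure above. The function name `wilson_interval` and the choice of the Wilson method are my own assumptions for illustration, not something the original runs used.

```python
import math

def wilson_interval(successes, trials, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 is about 95%)."""
    p = successes / trials
    denom = 1 + z**2 / trials
    center = (p + z**2 / (2 * trials)) / denom
    half_width = z * math.sqrt(p * (1 - p) / trials + z**2 / (4 * trials**2)) / denom
    return center - half_width, center + half_width

# 4 short-circuited runs out of 10, as reported in the edit above
low, high = wilson_interval(4, 10)
print(f"short-circuit rate: {low:.1%} to {high:.1%}")
```

With only 10 runs the interval spans roughly 17% to 69%, so the true rate could plausibly be anywhere from "occasional" to "most of the time". More runs would narrow it.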

Replies

postalcoder • today at 3:12 AM

You still have to worry about misconfigured local models. Even the professionals get it wrong, which is why local model performance is uneven across providers.


dannyw • today at 2:22 AM

I wonder if testing at different times and days would show patterns? For example, whether the short-circuiting happens more often during workday peak hours.
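One way to check for a time-of-day pattern is to log each run's timestamp and thinking-token count, then group by hour. This is only a sketch: the record format, the function name `short_circuit_rate_by_hour`, and the sample data are all made up for illustration, and it assumes the short circuit always shows up as exactly 516 thinking tokens, as in the parent comment.

```python
from collections import defaultdict
from datetime import datetime

SHORT_CIRCUIT_TOKENS = 516  # the count reported in the parent comment

def short_circuit_rate_by_hour(runs):
    """runs: iterable of (iso_timestamp, thinking_tokens) pairs.

    Returns {hour_of_day: fraction of runs in that hour that short-circuited}.
    """
    totals = defaultdict(int)
    short = defaultdict(int)
    for timestamp, tokens in runs:
        hour = datetime.fromisoformat(timestamp).hour
        totals[hour] += 1
        if tokens == SHORT_CIRCUIT_TOKENS:
            short[hour] += 1
    return {hour: short[hour] / totals[hour] for hour in sorted(totals)}

# Made-up records, purely to show the shape of the input
sample_runs = [
    ("2025-01-06T10:05:00", 516),
    ("2025-01-06T10:40:00", 7200),
    ("2025-01-06T22:15:00", 6400),
    ("2025-01-06T22:50:00", 7900),
]
rates = short_circuit_rate_by_hour(sample_runs)
print(rates)
```

You would need a lot more than a handful of runs per hour before reading anything into the differences.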
