
XCSme, yesterday at 11:06 PM

I made my own benchmarks with very basic questions, and Claude 4.6 actually scores worse than the free Stepfun 3.5 version: https://aibenchy.com

It is smart, but it fails at basic instruction following sometimes.

I remember this being a Claude thing for quite a while: I kept trying to make it output just JSON (without structured output), and it kept adding quotes or new lines.
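
For anyone hitting the same thing without structured output, here is a rough sketch of a tolerant parser, assuming the reply contains one valid JSON value somewhere inside the extra quotes, code fences, or chatter. The name `extract_json` is made up for illustration; only the standard `json` module is used.

```python
import json


def extract_json(reply: str):
    """Best-effort: pull one JSON value out of a chatty model reply.

    Hypothetical helper, not part of any library.
    """
    text = reply.strip()  # drops leading/trailing new lines and spaces

    try:
        value = json.loads(text)
    except json.JSONDecodeError:
        pass  # not clean JSON, fall through to the scan below
    else:
        # If the whole reply was a quoted JSON string, unwrap it once more.
        return extract_json(value) if isinstance(value, str) else value

    # Scan for the first "{" or "[" that starts a parseable value.
    # raw_decode stops at the end of that value and ignores what follows.
    decoder = json.JSONDecoder()
    for start, ch in enumerate(text):
        if ch in "{[":
            try:
                value, _end = decoder.raw_decode(text, start)
            except json.JSONDecodeError:
                continue
            return value

    raise ValueError("no JSON object or array found in reply")
```

This only works around the formatting problem on the client side; it does not make the model follow the instruction, and a benchmark that tests instruction following should still count such replies as format failures.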


Replies

XCSme, today at 1:21 AM

After looking more into it, Claude DOES give the correct answer, just not in the format it was asked for. It always adds more info at the end, even when asked to give just the answer...
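
One way to separate the two failure modes in a benchmark is to grade in two passes: a strict comparison, then a lenient one that accepts the expected answer followed by extra text. A minimal sketch, assuming short text answers that appear at the start of the reply; `grade` and its labels are hypothetical and not taken from aibenchy.com.

```python
import re


def grade(reply: str, expected: str) -> str:
    """Classify a reply as exact, right answer in the wrong format, or wrong.

    Hypothetical grader for illustration only.
    """
    # Strict pass: the reply must be the expected answer and nothing else,
    # so even a trailing new line counts as a format miss.
    if reply == expected:
        return "exact"

    # Lenient pass: the reply starts with the expected answer, followed by
    # the end of the text, whitespace, or sentence punctuation. The lookahead
    # keeps "42" or "4.5" from matching an expected answer of "4".
    pattern = re.escape(expected) + r"(?=$|\s|[.,;:!?](?:\s|$))"
    if re.match(pattern, reply.strip()):
        return "right_answer_wrong_format"

    return "wrong"
```

Reporting the two labels separately shows whether a model loses points on knowledge or on instruction following. The prefix check is a heuristic and will miss replies that put the explanation before the answer.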
