jampekka, today at 8:56 AM

(Three out of) four experiments is anecdotal for sure, but the result meshes with more established instruction-following benchmarks (although DeepSeek V4 pro does not top these): https://artificialanalysis.ai/evaluations/ifbench

I found the writing clear and quite even-handed. The lead is a bit salesy, but leads typically are. Knee-jerk dismissals based on vibes that something is LLM-generated are quite low-effort.


Replies

zozbot234, today at 9:04 AM

It's picking strange tasks that don't really play to GPT-Pro's strengths (that model is roughly comparable to Mythos, intended for very hard reasoning and research-level problems) and then completely ignoring quite a few cases where GPT-Pro actually got some things more correct than DeepSeek did. The auto-AI ranking is just not reliable for this stuff.