I think people only care about how Claude or Codex does.

ipunchghosts • today at 12:43 PM • 3 replies

Replies

GPT-5.4 and Opus 4.7, specifically, agree with each other on 65% of the claims (95% CI 62–68%). I.e., on at least 35% of the claims, at least one of the two models is wrong under this 4-bucket rubric.
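
For anyone who wants to sanity-check those numbers, here is a rough sketch of the arithmetic. The thread gives neither the sample size nor the interval method, so both are assumptions here: the code assumes a plain normal-approximation (Wald) interval and back-solves for the number of claims that would give a half-width of about 3 points. The function names are made up for illustration.

```python
import math

Z_95 = 1.96  # two-sided 95% critical value of the standard normal


def wald_interval(p_hat, n, z=Z_95):
    """Normal-approximation (Wald) interval for a proportion.

    Assumption: the quoted CI was computed this way; the thread does not say.
    """
    half = z * math.sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - half, p_hat + half


def implied_sample_size(p_hat, half_width, z=Z_95):
    """Smallest n whose Wald half-width is at most `half_width`."""
    return math.ceil(p_hat * (1 - p_hat) * (z / half_width) ** 2)


agreement = 0.65   # quoted agreement rate
half_width = 0.03  # half of the quoted 62-68% range

# Hypothetical sample size, inferred rather than reported.
n_est = implied_sample_size(agreement, half_width)
lo, hi = wald_interval(agreement, n_est)

# Disagreement is the complement, so its interval is the mirror image.
disagree_lo, disagree_hi = 1 - hi, 1 - lo

print(f"implied n: {n_est}")
print(f"agreement CI: {lo:.3f} to {hi:.3f}")
print(f"disagreement CI: {disagree_lo:.3f} to {disagree_hi:.3f}")
```

Under those assumptions the quoted interval corresponds to a sample on the order of a thousand claims, and the disagreement rate runs from about 32% to 38%. The reported bounds are rounded, so the implied sample size is only a ballpark figure.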


spprashant • today at 12:48 PM

Looks like they land at the average number of 67% disagreement.

airstrike • today at 12:44 PM

I agree, but the market is pricing way beyond that.
