... according to grok-4-1-fast-non-reasoning, which was the judge. On 4 tasks in total, the score was 38 to 33, so obviously huge conclusions can be drawn.
> We ran 4 fresh text tasks, generated on the fly for this matchup so neither model could prepare in advance, and had grok-4-1-fast-non-reasoning score each one. DeepSeek: DeepSeek V4 Pro scored 38.0 to OpenAI: GPT-5.5 Pro's 33.0.
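For a rough sense of how little 4 tasks can show, here is a minimal sketch. It assumes the totals are plain sums over equally weighted tasks (the article only gives totals, not per-task scores), and it uses a two-sided sign test as an illustrative yardstick, which is my choice and not the article's method. The function names are made up for this example.

```python
from math import comb


def per_task_mean(total: float, n_tasks: int) -> float:
    """Average score per task, assuming every task is weighted equally."""
    return total / n_tasks


def min_two_sided_sign_test_p(n_tasks: int) -> float:
    """Smallest two-sided sign-test p-value possible with n_tasks paired
    comparisons, i.e. the case where one model wins every task with no ties."""
    # Under the null (each model equally likely to win a task), the chance
    # that one given model wins all n tasks is comb(n, n) / 2**n.
    one_sided = comb(n_tasks, n_tasks) / 2 ** n_tasks
    # Double it for the two-sided test, capped at 1.
    return min(1.0, 2 * one_sided)


n = 4
print(per_task_mean(38.0, n))        # 9.5
print(per_task_mean(33.0, n))        # 8.25
print(min_two_sided_sign_test_p(n))  # 0.125
```

So even a clean 4-0 sweep could not get below p = 0.125 on this test, and the reported gap works out to about 1.25 points per task from a single judge.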
Pretty small sample size here, but it's hard to avoid the conclusion that DeepSeek and friends will start to put some serious downward pressure on frontier lab token pricing.
Hopefully this dynamic continues long enough to make local/private inference the leading solution for coding.
The OP uses tons of typical AI turns of phrase, and Pangram classified it as AI with high confidence.
So it doesn't surprise me at all that the methodology is weak, too.
grok-4-1-fast was retired about a month ago.
Requests to grok-4-1-fast-non-reasoning now silently route to grok-4.3 (a 5x more expensive model), with reasoning set to "none".
https://docs.x.ai/developers/migration/may-15-retirement
TFA was published today, which implies grok-4.3 was used.