It’s four poorly constructed, arbitrary experiments that say very little about the competency of either model.
The article reads like thin, auto-generated AI clickbait for nerd sniping or shilling a model.
Consider the lead:
> DeepSeek V4 Pro wins this head-to-head by being more exact where it matters: following instructions, matching schemas, and solving edge cases cleanly. GPT-5.5 Pro is still strong, but it gave away points with avoidable deviations.
“where it matters”, “cleanly”, “is still strong”: vague phrasing instead of just saying that DeepSeek yielded more concise results in 3 out of 4 tests.
1 star.
(Three out of) four experiments is anecdotal for sure, but the result meshes with more established instruction-following benchmarks (although DeepSeek V4 Pro does not top these): https://artificialanalysis.ai/evaluations/ifbench
I found the writing clear and quite even-handed. The lead is a bit salesy, but leads typically are. Knee-jerk dismissals based on vibes that something is LLM-generated are quite low-effort.