If I had to make a guess, I'd say this has much, much less to do with the architecture and far more to do with the data and training pipeline. Many have speculated that gpt-oss has adopted a Phi-like synthetic-only dataset and focused mostly on gaming metrics, and I've found the evidence so far to be sufficiently compelling.
Yes. I tried asking gpt-oss to ask me a riddle. The response was absurd. It came up with a nonsensical question, then told me the answer, which was a four-letter “word” that wasn’t actually a real word.
“What is the word that starts with S, ends with E, and contains A? → SAEA”
Then when I told it that’s not a word and that it had already given me the answer, no fun, it said
“I do not have access to confirm that word.”
This is exactly why the strongest model is going to lose out to weaker models if the latter have more data.
For example, I was using the DeepSeek web UI and getting decent, on-point answers, but it simply doesn't have the latest data.
So while DeepSeek R1 might be a better model than Grok 3 or even Grok 4, not having access to "Twitter data" basically puts it behind.
The same goes for OpenAI: if OpenAI has access to fresh data from GitHub, it can help with bug fixes that Claude and Gemini 2.5 Pro can't.
A model can be smarter, but if it doesn't have the data to base its inference on, it's useless.
That would be interesting. I've been a bit sceptical of the entire strategy from the beginning. If gpt-oss were actually as good as o3-mini, and in some cases o4-mini, outside of benchmarks, that would undermine OpenAI's API offering for GPT-5 nano and maybe mini too.
Edit: found this analysis, it's on the HN front page right now:
> this thing is clearly trained via RL to think and solve tasks for specific reasoning benchmarks. nothing else.
https://x.com/jxmnop/status/1953899426075816164