By now, I subscribe to the "you're just training them wrong" camp.
Pre-training a base model on text datasets teaches it a lot, but it doesn't teach it to be good at agentic or long-horizon tasks.
That's why there's a capability gap there: a gap companies have to close "in post" with things like RLVR (reinforcement learning with verifiable rewards).
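
For anyone who hasn't seen RLVR spelled out, the core loop is: sample outputs from the model, check each one with a programmatic verifier (no human labels), and reinforce whatever passed. Here's a toy sketch of that loop. The arithmetic task, the tabular "policy", and the REINFORCE-style update are all stand-ins I made up for illustration; real RLVR runs the same loop over an LLM with algorithms like PPO or GRPO.

```python
import math
import random

# Toy task: answer "2 + 3". The "policy" is just a softmax over candidate strings.
candidates = ["4", "5", "6", "23"]
logits = {c: 0.0 for c in candidates}

def softmax(logits):
    m = max(logits.values())
    exps = {c: math.exp(v - m) for c, v in logits.items()}
    z = sum(exps.values())
    return {c: e / z for c, e in exps.items()}

def sample(probs):
    # Draw one candidate according to the policy's probabilities.
    r, acc = random.random(), 0.0
    for c, p in probs.items():
        acc += p
        if r <= acc:
            return c
    return candidates[-1]

def verify(answer):
    # Verifiable reward: check correctness programmatically, 1.0 or 0.0.
    return 1.0 if answer == str(2 + 3) else 0.0

lr = 0.5
for step in range(200):
    probs = softmax(logits)
    answer = sample(probs)
    reward = verify(answer)
    # REINFORCE-style update: push up the log-prob of the sampled answer,
    # scaled by its advantage over the policy's expected reward.
    baseline = sum(p * verify(c) for c, p in probs.items())
    advantage = reward - baseline
    for c in candidates:
        grad = (1.0 if c == answer else 0.0) - probs[c]
        logits[c] += lr * advantage * grad

print(softmax(logits))  # probability mass should concentrate on "5"
```

The appeal for agentic and long-horizon work is exactly that `verify` step: if you can check the end state of a rollout (tests pass, task completed), you can train on thousands of trajectories without labeling every intermediate step.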