It's not a tired argument, and not just a semantic one: it's a foundational characteristic of LLMs.
> A token-predictor could still be trained to predict the tokens “I’m not sure what you mean because of points x, y, and z; could you elaborate?”
This is entirely true, and the key insight is right there in your sentence, but you don't seem to grasp it: “could still be trained”. You can train an LLM to do whatever you want, but you have to train it specifically for that!
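To make “train it specifically for that” concrete, here is a minimal sketch of what such a training example might look like in the common chat-messages format. The field names and contents are purely illustrative, not taken from any particular framework or dataset:

```python
# A hypothetical supervised fine-tuning example whose target completion *is*
# the clarifying question. Field names follow the common chat-messages
# convention and are illustrative only.
sft_example = {
    "messages": [
        # Deliberately ambiguous user request.
        {"role": "user", "content": "Make it faster."},
        # Desired behaviour as the training target: ask for clarification
        # instead of guessing.
        {"role": "assistant", "content": (
            "I'm not sure what you mean because of points x, y, and z; "
            "could you elaborate?"
        )},
    ]
}
# During fine-tuning, the loss is typically computed over the assistant tokens,
# so the model is explicitly rewarded for producing the clarifying question.
```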
In the early days of LLMs we witnessed an impressive phenomenon where LLMs exhibited emergent capabilities (I'm particularly thinking of LLMs being few-shot learners on material that wasn't in their training corpus). And these emergent capabilities legitimately raised the question of “how intelligent these things really are”.
But for the past three years, the key lesson is that this kind of emergent effect is too small to be useful, and the focus has shifted to creating purpose-built datasets (with tons of “artificial data”) to train the model to explicitly do the things we want it to do. And it works pretty well, as models' capabilities have kept improving at a fast pace (and in particular, I don't see why we couldn't overcome the problem highlighted by this paper with more synthetic data specifically designed for multi-turn conversation; a sketch of what that data might look like follows below). But their progress is now strictly limited by their makers' own intelligence. You cannot just scrape the web, throw compute at the problem, and expect emergent intelligence to occur anymore. It's more “simulated intelligence” than “artificial intelligence”, really.
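As a rough sketch of the kind of purpose-built multi-turn dataset described above: programmatically assembling conversations in which an ambiguous request is followed by a clarifying question and a resolution. The templates, file name, and overall pipeline here are hypothetical; real pipelines usually have a stronger LLM draft and filter these turns rather than using fixed strings:

```python
import json
import random

# Hypothetical building blocks for synthetic multi-turn conversations.
AMBIGUOUS_REQUESTS = ["Make it faster.", "Fix the bug.", "Clean this up."]
CLARIFICATIONS = [
    "Could you clarify which part you mean and what constraint matters most?",
    "I'm not sure what you're referring to; can you give more detail?",
]
FOLLOW_UPS = [
    "I mean the database query on the orders page.",
    "The function that parses the config file.",
]

def make_example() -> dict:
    """Assemble one synthetic conversation: ambiguity -> clarification -> resolution."""
    return {
        "messages": [
            {"role": "user", "content": random.choice(AMBIGUOUS_REQUESTS)},
            {"role": "assistant", "content": random.choice(CLARIFICATIONS)},
            {"role": "user", "content": random.choice(FOLLOW_UPS)},
            {"role": "assistant", "content": "Thanks, that helps. Here's a concrete plan: ..."},
        ]
    }

# Write a small JSONL file of such conversations for fine-tuning.
with open("synthetic_multiturn.jsonl", "w") as f:
    for _ in range(1000):
        f.write(json.dumps(make_example()) + "\n")
```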
It's definitely a tired and semantic one because, as he said, it brings no insight and isn't even good at the analogy level. I can't have a conversation with Dracula, and Dracula can't make decisions that affect the real world, so LLMs already break key aspects and assumptions of the 'Document Simulator'.
Pre-trained LLMs will ask clarifying questions just fine. So I think this is just another consequence of post-training recipes.