You could have continual learning on text and still be stuck in the same "remixing baseline human communications" trap. It's a nasty one, very hard to avoid, possibly even structurally unavoidable.
As for the "just put a vision LLM in a robot body" suggestion: People are trying this (e.g. Physical Intelligence) and it looks like it's extraordinarily hard! The results so far suggest that bolting perception and embodiment onto a language-model core doesn't produce any kind of causal understanding. The architecture behind the integration of sensory streams, persistent object representations, and modeling time and causality is critically important... and that's where world models come in.